

Active data acquisition

This method involves actively collecting the material the collection should be composed of and that should be preserved. If publicly available material is acquired, the publishers may, but do not necessarily need to, know that their work has been entered into an archive. Furthermore, a partnership with a publisher can be actively initiated after their work has been rated worth integrating into the archive. This would be beneficial in order to compile specific metadata to be enclosed with the collection item. Yet, contact with the publisher is not essential when actively collecting documents.

Basically, two different types of active data acquisition can be distinguished: manual collection and bulk collection.

Manual collection requires human personnel to collect every file and enter it into the archive. The scope of the resulting collection is very narrow compared to the potential of the Internet information space. Yet, this method produces consistent collection items and allows a good follow-up of site evolution.

Contrary to the manual approach is the strategy of bulk collection, i.e. collecting open-access material as automatically as possible, with the most popular source being the Internet. In this way, a bulk of data is gathered that is widely distributed and highly representative of the Internet information space. Following this process, a comprehensive, navigable archive is built.

When using bulk collection, the acquisition of the material is conducted by so-called web-crawlers, such as those used by current Internet search engines. Starting from a number of sites, they move to other sites following the links they find. Due to the highly interlinked structure of documents on the Internet, these robots are able to harvest a considerable portion of the web autonomously. Yet, sites which are not part of the initial set and are not linked to from any other site will not be collected. This part of the Internet, known as the deep Web, remains out of reach. Thus, any solution will never be complete but only far-reaching.
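To illustrate the harvesting process, the following Python sketch shows a minimal breadth-first crawler that starts from a set of seed URLs and follows the links and in-line references it finds. It is only a schematic outline; all names, limits and error handling are illustrative assumptions rather than part of any particular archiving system.

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin
from collections import deque

class LinkExtractor(HTMLParser):
    # Collects hyperlinks and in-line image references from a fetched page.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in (("a", "href"), ("img", "src")) and value:
                self.links.append(urljoin(self.base_url, value))

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # queue of discovered but not yet fetched URLs
    archive = {}                  # URL -> retrieved content
    while frontier and len(archive) < max_pages:
        url = frontier.popleft()
        if url in archive:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                page = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue              # unreachable documents are simply skipped
        archive[url] = page       # store the snapshot
        extractor = LinkExtractor(url)
        extractor.feed(page)
        frontier.extend(extractor.links)   # newly found references join the queue
    return archive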

Furthermore, trying to capture as large a portion of the open-access material as possible makes it hard to capture it frequently. This is mainly due to the large amounts of data to be collected, which can exceed 1 terabyte for a single snapshot. Thus, a single snapshot may take up to several months. At the same time, data on the Internet is characteristically highly volatile [Ger00]. Monitoring each and every file and retrieving it whenever it has changed is technically not realisable due to restricted bandwidth. Therefore, it is inevitable that intermediate versions of documents are missed. In fact, most will be lost altogether.

Yet, missing a certain percentage of files is not necessarily a problem. When collecting documents from the web, the primary goal is not so much to save the pure facts wrapped up in all the open-access data to be found. The motivation is rather to convey an overall impression of the look and feel this material offers, its place in the large, interlinked network, and how it evolves. Taking this one step further, it calls into question the continuous downloading of each new or changed file that can be found. Actually, it should suffice to make a snapshot of the Internet at certain intervals, say, every half a year. To sum up, two different strategies can be identified when collecting open-access material from the Internet: files can be continuously downloaded whenever they are found to have changed or a new one is added, or sweeps over the whole scope can be performed at a given frequency, as sketched below.
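The two strategies can be summarised in a small, purely schematic Python sketch; the checksum comparison and the half-year interval are illustrative assumptions following the discussion above.

from datetime import timedelta

def needs_download(previous_checksum, current_checksum):
    # Continuous strategy: re-download a file whenever it is found to have
    # changed or a new one is added (a new file has no previous checksum yet).
    return previous_checksum is None or previous_checksum != current_checksum

def sweep_due(last_sweep, now, interval=timedelta(days=182)):
    # Sweep strategy: re-collect the whole scope at a fixed frequency,
    # e.g. every half a year as suggested above.
    return (now - last_sweep) >= interval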

Neither of these two methods, manual collection or bulk collection, seems fully satisfactory. However, they are not mutually exclusive; rather, they may complement each other [ML01]. A combination could benefit from the advantages of both strategies. Automatic harvesting is used to obtain broad coverage. It establishes a basis for letting the archive convey a picture of what browsing the Internet was like at the time the data was entered into the archive. Yet, sites with a daily or weekly update will be traced rather poorly, and pages not reachable by the crawlers will be missed entirely. They might thus be monitored by a separate, more frequent crawling process, allowing, for example, daily downloads for fast-changing sites. Furthermore, happenings of special interest can be watched more closely. Occasions calling for special monitoring might be political events such as elections, topics arousing emotional and widespread public debate, large events, or art festivals. In any such situation, sites relating to the issue in question are manually selected and receive a special focus. Downloading these sites more frequently and caring for their consistency in the collection will produce a comprehensive coverage of such events.
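A combined strategy could, for example, be expressed as a simple per-site crawl schedule. The site categories and interval values below are illustrative assumptions, not figures from the projects cited here.

from datetime import timedelta

# Hypothetical crawl intervals for the three kinds of coverage discussed above.
CRAWL_INTERVALS = {
    "broad_sweep": timedelta(days=182),     # automatic harvesting of the whole scope
    "fast_changing": timedelta(days=1),     # separately monitored, e.g. daily newspapers
    "special_event": timedelta(hours=6),    # manually selected sites around special events
}

def crawl_interval(category):
    # Unknown sites fall back to the broad, half-yearly sweep.
    return CRAWL_INTERVALS.get(category, CRAWL_INTERVALS["broad_sweep"])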

The terrorist attack in the United States on September 11th, 2001, was such an occasion. The Internet Archive, in cooperation with the Library of Congress, was quick to react. Almost immediately a project2 was started with the goal to identify, archive and annotate relevant web-sites with content related to the assault. "Thereby, scientists want to counter the paradox that content on the Internet is fugitive on the one hand, and on the other a huge collection of documents of the times." [Net01]

Automatically collecting open-access material following an active strategy entails having to care for the consistency of each collection item, i.e. integrating the original document into the technical environment of the archive, which can raise various problems for preservation and access. If, for instance, the type of the digital material cannot be handled by the system in this respect, relief can only come in two ways. Either the data has to be converted to another format, which entails violating the document's authenticity, or access provision and means of preservation must be adapted so that the new data type can be processed. Such operational problems can be expected to come up again and again, since the Internet is a very dynamic information space, making it impossible to anticipate all possible situations. The more automatically the data is retrieved, the more likely these inconsistencies are to remain undetected, resulting in defective data in the archive.

When applying an automatic procedure using web-crawlers to retrieve the documents, the question of authenticity is raised. This stems from the way files are collected in the first place. Whenever web-crawlers find a referenced object, such as an in-line image or a new web-page linked to in an acquired file, they put the reference in a queue. This material is not acquired immediately, but only after a certain time span has passed, in order not to overload the web-servers. As more and more references are found, the queue grows; in fact, it can have millions of entries waiting to be harvested. Due to bandwidth limitations, it takes a long time to acquire the masses of data involved. For example, obtaining in one sweep the material available in a national web-space the size of Sweden or Austria takes several months. For these reasons, there can be a considerable lapse of time between the discovery of a file and its actual retrieval.
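The per-server delay mentioned above can be sketched as follows; the delay value and the function names are illustrative assumptions.

import time
from urllib.parse import urlparse

POLITENESS_DELAY = 2.0      # seconds to wait between requests to the same host
_last_request = {}          # host -> time of the most recent request

def wait_before_fetch(url):
    # Delay the retrieval of a queued reference so that the web-server
    # hosting it is not overloaded by rapid successive requests.
    host = urlparse(url).netloc
    elapsed = time.monotonic() - _last_request.get(host, 0.0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)
    _last_request[host] = time.monotonic()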

Consider a site with very fluctuating content, such as a newspaper. As a matter of fact, the files could be downloaded such that the archive holds an article several weeks older than the title page. Taking this further, it could indeed happen that an in-line picture is registered in the archive as belonging to a page, although it was actually part of a previous version of that very page. Therefore, technical limitations not only have serious implications for the consistency of the archive, but also for the authenticity of the documents. These problems are hard to do away with. In the case of in-line images, a solution would be a prioritised download, retrieving the picture as soon as possible after the page, as sketched below.
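Such a prioritised download could be realised with a priority queue in which the in-line resources of a page are fetched before newly discovered pages; the priority values and names below are illustrative assumptions.

import heapq
import itertools

EMBEDDED, PAGE = 0, 1          # lower value = fetched earlier
_order = itertools.count()     # tie-breaker keeps insertion order stable

def enqueue(frontier, url, priority):
    heapq.heappush(frontier, (priority, next(_order), url))

def next_url(frontier):
    priority, _, url = heapq.heappop(frontier)
    return url

# After archiving a page, its in-line images are pushed with EMBEDDED priority
# and its outgoing links with PAGE priority, so that a picture is retrieved as
# close in time as possible to the page that references it.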

Furthermore, controlling and filtering the content is more difficult when using automatic means for data acquisition. There also exist web-sites that must be deemed morally reprehensible or that are even prohibited by law. This includes, for instance, web-sites with Nazi propaganda or child pornography. These sites cannot be detected automatically; hence, they will be included in the archive along with all the other documents. Thereby, the archive may contain offensive material without its operators being aware of it. This constitutes a severe legal problem and has to be handled with great sensitivity considering future users of the archive.



Footnotes

2. ... terrorist attack in the United States on September 11th, 2001:
http://september11.archive.org
