

Combine crawl

For the second test run in June 2001 we used the Combine-crawler [24]. The tool was initially designed for indexing purposes by the University of Lund, Sweden, within the DESIRE-project [25] funded by the European Commission. The Kulturarw3-project at the Swedish National Library (Kungliga Biblioteket) later adapted the indexer so that it could be used for web archiving. Because the source code is publicly available, the improvements were incorporated directly into it. Since these changes were implemented about five years ago, the tool has been applied successfully and has already created a repository of considerable size.

Most of the robot is written in Perl 5, except for some small modules that are written in C++. The Berkeley DB [26] database is used for the internal queues. The tool is designed to be distributable, so that it can scale to large tasks. Since it is assembled from relatively small building blocks, it is a flexible tool whose modules can be modified even while the system is running.

Before the run, we visited members of the Kulturarw3-project in Stockholm. We were in close contact with Allan Arvidson, the project leader, and benefited from his experience.

Since the robot was originally designed as an indexer, not all functionality desirable for an archiving system has been included so far. One drawback is that in-line images are not harvested immediately together with the file that references them. This raises concerns about the authenticity of downloaded web pages (cf. Section 2.2.2). (Yet this problem also occurs with the Nedlib-crawler, which was specifically designed for archiving purposes.)

To get to know the new tool, we again performed small-scale test runs. The different character of the robot compared to the Nedlib-crawler is obvious: Combine is much more flexible, allowing intervention while the system is running. At the same time, it demands greater effort to set up. For instance, cron jobs [27] have to be installed that regularly feed URLs extracted from downloaded files back into the system.
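The feedback step described above can be pictured with a minimal sketch. It is not Combine's actual code (which is written in Perl and uses Berkeley DB queues); the class `Frontier`, its method `feed`, and the restriction to hosts ending in `.at` are assumptions made purely for illustration. A periodic job would call `feed` with the URLs extracted since its last invocation.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit


class Frontier:
    """Hypothetical in-memory stand-in for the crawler's internal queue."""

    def __init__(self, allowed_suffix=".at"):
        self.queue = deque()              # URLs waiting to be fetched
        self.seen = set()                 # every URL ever enqueued
        self.allowed_suffix = allowed_suffix  # assumed crawl scope

    def feed(self, extracted_urls):
        """Enqueue new, in-scope URLs and return how many were added."""
        added = 0
        for raw in extracted_urls:
            # Drop the fragment so page#a and page#b count as one URL.
            url, _fragment = urldefrag(raw.strip())
            host = urlsplit(url).hostname or ""
            if not host.endswith(self.allowed_suffix):
                continue                  # outside the crawl scope
            if url in self.seen:
                continue                  # already queued or fetched
            self.seen.add(url)
            self.queue.append(url)
            added += 1
        return added
```

Running the step from a scheduler rather than inside the fetcher keeps harvesting and queue maintenance separate, which fits the description of Combine as a set of small, independently modifiable building blocks.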

The storage concept used (cf. Section 2.2.2) produces a well-structured repository. Yet it cannot be realised properly if constraints in storage space make it impossible to hold a complete run on hard disk. For our attempt to sweep the Austrian web space, the available hardware did not have enough capacity. Therefore, the acquired data had to be transferred to final storage on tape again and again, before all files belonging to a server had actually been retrieved. Consequently, the ordered structure cannot be maintained, since files from the same server end up on different tapes.
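The effect of a disk buffer that is too small can be illustrated with a toy simulation. It does not model the storage concept of Section 2.2.2; the function `tapes_per_host`, the host names, and the sizes are hypothetical, and the only assumption is that the buffer is written to a fresh tape whenever it would overflow.

```python
from collections import defaultdict


def tapes_per_host(downloads, disk_capacity):
    """Map each host to the set of tape numbers holding its files.

    downloads: iterable of (host, size) pairs in arrival order.
    disk_capacity: amount of data the disk buffer holds before it is
    flushed to a new tape (same unit as size).
    """
    placement = defaultdict(set)
    tape, used = 0, 0
    for host, size in downloads:
        if used > 0 and used + size > disk_capacity:
            tape += 1                 # buffer full: flush, start a new tape
            used = 0
        placement[host].add(tape)
        used += size
    return dict(placement)


# Files from two servers arrive interleaved, as in a breadth-first crawl.
arrivals = [("a.example.at", 4), ("b.example.at", 4),
            ("a.example.at", 4), ("b.example.at", 4)]
small_buffer = tapes_per_host(arrivals, disk_capacity=8)
large_buffer = tapes_per_host(arrivals, disk_capacity=16)
```

With the small buffer both servers are spread over two tapes, whereas the large buffer keeps each server on a single tape, which is the situation the storage concept presupposes.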

The second run was launched on June 4th, 2001. It created a repository holding 115 gigabytes of data, acquired at a rate of about 7 gigabytes per day. This includes more than 2.8 million pages from about 45,000 sites. Due to insufficient hardware the run had to be stopped early, on June 21st. Because of a malfunctioning hard disk some data was lost. Otherwise, a snapshot of the Austrian web space could have been realised. Nevertheless, about one tenth of a complete sweep was performed, as estimated by analogy with Sweden's web space.
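The figures above can be cross-checked with a few lines of arithmetic. The sketch assumes that the run lasted the whole period from June 4th to June 21st without interruption, that gigabytes are decimal (1 GB = 1,000,000 KB), and that "one tenth" is taken literally; the extrapolated size of a complete sweep is therefore only a rough indication and is not a figure stated in the report. The function name `run_statistics` is illustrative.

```python
from datetime import date


def run_statistics():
    """Derive rough per-day and per-page figures from the reported totals."""
    days = (date(2001, 6, 21) - date(2001, 6, 4)).days   # 17 days
    total_gb = 115            # reported repository size
    pages = 2.8e6             # reported lower bound on pages
    sites = 45_000            # reported approximate number of sites
    return {
        "days": days,
        "gb_per_day": total_gb / days,
        "kb_per_page": total_gb * 1e6 / pages,
        "pages_per_site": pages / sites,
        # Extrapolation under the "one tenth" estimate (assumption).
        "full_sweep_gb": total_gb * 10,
    }


stats = run_statistics()
```

The resulting daily rate of a little under 7 gigabytes agrees with the rate quoted in the text.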



Footnotes

[24] http://www.lub.lu.se/combine/
[25] Development of a European Service for Information on Research and Education; http://www.desire.org
[26] http://www.sleepycat.com
[27] Cron is a background process, a so-called 'daemon', that executes programs at regular intervals (e.g., every minute, day, week, or month). Which programs are to be run at what times is defined in a table, the 'crontab'.


Andreas Aschenbrenner