

Nedlib crawl

We started experimental data collection using the Nedlib-crawler. The Finnish National Library commissioned the Finnish Center for Scientific Computing (CSC) to develop this tool. Based on the specifications written jointly by the Nedlib partners (cf. Section 3.5), the crawler was built from scratch, since adapting existing indexing crawlers to accommodate archiving features was deemed too difficult [Hak01].

Being freely available in the public domain23, the tool is under continuous refinement. It is written entirely in the C programming language and relies on a MySQL relational database as a supporting component. As a sophisticated feature, the tool monitors the web-servers it visits so as not to overload them with repeated requests. Although the crawler is composed of several modules, it is constructed in a rather inflexible fashion.
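
To illustrate the kind of server monitoring meant here, the following minimal C sketch keeps a per-host timestamp and refuses a request if the same host was contacted too recently. The data structure, the delay value and the function name are our own assumptions for illustration and are not taken from the Nedlib-crawler sources.

    /* Sketch of per-host politeness checking; all names and constants
     * are illustrative assumptions, not the Nedlib-crawler's own code. */
    #include <string.h>
    #include <time.h>

    #define MAX_HOSTS 1024
    #define MIN_DELAY 30            /* seconds between requests to one host */

    struct host_entry {
        char   name[256];
        time_t last_request;
    };

    static struct host_entry hosts[MAX_HOSTS];
    static int host_count = 0;

    /* Return 1 if the host may be contacted now, 0 if it was hit too recently. */
    int may_request(const char *host)
    {
        time_t now = time(NULL);
        for (int i = 0; i < host_count; i++) {
            if (strcmp(hosts[i].name, host) == 0) {
                if (now - hosts[i].last_request < MIN_DELAY)
                    return 0;                   /* back off, server contacted recently */
                hosts[i].last_request = now;
                return 1;
            }
        }
        if (host_count < MAX_HOSTS) {           /* first contact with this host */
            strncpy(hosts[host_count].name, host, sizeof hosts[0].name - 1);
            hosts[host_count].last_request = now;
            host_count++;
        }
        return 1;
    }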

After performing some initial small-scale tests, we quickly learned that the storage format used for archiving (cf. Section 2.3.3) is impractical. The files are simply put into a single directory without any structure whatsoever. For the sake of a well-organised collection that facilitates resource retrieval as well as long-term administration, including the application of preservation strategies, a sound storage hierarchy should be aimed at.
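
As an illustration of what such a hierarchy could look like, the following C sketch derives a nested directory path from the host name and a hash of the URL instead of writing every file into one flat directory. The path layout and function names are hypothetical and only sketch the idea.

    /* Sketch of a structured storage path; layout and names are our own
     * assumptions, not part of the Nedlib-crawler. */
    #include <stdio.h>

    /* Simple string hash (djb2), used only to spread files over subdirectories. */
    static unsigned long hash_url(const char *url)
    {
        unsigned long h = 5381;
        while (*url)
            h = ((h << 5) + h) + (unsigned char)*url++;
        return h;
    }

    /* Build e.g. "archive/www.example.at/3f/7a/0012af3c.dat" into out. */
    void storage_path(const char *host, const char *url, char *out, size_t len)
    {
        unsigned long h = hash_url(url);
        snprintf(out, len, "archive/%s/%02lx/%02lx/%08lx.dat",
                 host, (h >> 8) & 0xff, h & 0xff, h);
    }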

We also feel that the way in-line pictures are handled is not comprehensive enough for an archiving robot. The identification of pictures is very limited: a picture is only recognised if its file name carries one of a few extensions such files commonly have. Basing the recognition on the MIME-type provided by the web-server yields a more comprehensive solution.
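
A minimal C sketch of such a MIME-type based test, assuming the Content-Type header returned by the server is available as a string; the helper name is ours, not the crawler's.

    /* Sketch of MIME-type based picture recognition. */
    #include <string.h>

    /* Return 1 if the server declared the resource to be a picture. */
    int is_picture(const char *content_type)
    {
        return content_type != NULL &&
               strncmp(content_type, "image/", 6) == 0;
    }

    /* By contrast, an extension-based test such as
     *   strstr(url, ".gif") || strstr(url, ".jpg")
     * misses pictures served without a known suffix. */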

Pictures are downloaded even if they reside on "foreign" hosts. However, they are not prioritised for download, but simply put into the queue with all the other documents. This causes problems as far as the authenticity of a web-page is concerned: the text and the corresponding in-line picture may not belong together, since the latter may have been downloaded several days, or even weeks, later (cf. Section 2.2.2).
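
One conceivable remedy, sketched below in C under our own assumptions about the queue interface, is a two-level queue in which the in-line pictures of a freshly parsed page are always handed out before ordinary URLs, so that text and pictures are archived close together in time.

    /* Sketch of a two-level download queue; structure and names are
     * illustrative assumptions, not the Nedlib-crawler's interface. */
    #include <stdlib.h>
    #include <string.h>

    struct url_node {
        char            *url;
        struct url_node *next;
    };

    static struct url_node *inline_queue = NULL;   /* pictures of current pages */
    static struct url_node *normal_queue = NULL;   /* everything else           */

    static void push(struct url_node **q, const char *url)
    {
        struct url_node *n = malloc(sizeof *n);
        n->url = strdup(url);
        n->next = *q;
        *q = n;
    }

    void enqueue_url(const char *url, int is_inline_picture)
    {
        push(is_inline_picture ? &inline_queue : &normal_queue, url);
    }

    /* The Scheduler would hand out in-line pictures before ordinary URLs. */
    char *next_url(void)
    {
        struct url_node **q = inline_queue ? &inline_queue : &normal_queue;
        if (*q == NULL)
            return NULL;
        struct url_node *n = *q;
        char *url = n->url;
        *q = n->next;
        free(n);
        return url;
    }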

After we had implemented these changes and slightly extended the logging mechanisms, we performed further test runs. It turned out that the system configuration was unstable, due to the combination of a software RAID system with the XFS file-system. For this reason, we were forced to control the hard-disks individually and switch between them manually. This, however, requires pausing the crawler every time storage on one disk is depleted.
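
A simple free-space check of the kind this manual switching requires could look as follows; the mount point and the threshold are illustrative assumptions, not our actual configuration.

    /* Sketch: query the free space on the current archive disk and signal
     * when it is time to pause the crawler and switch disks by hand. */
    #include <sys/statvfs.h>

    #define MIN_FREE_BYTES (512ULL * 1024 * 1024)   /* pause below 512 MB */

    /* Return 1 if there is still enough room on the given mount point. */
    int disk_has_room(const char *mount_point)
    {
        struct statvfs st;
        if (statvfs(mount_point, &st) != 0)
            return 0;                                /* treat errors as full */
        unsigned long long free_bytes =
            (unsigned long long)st.f_bavail * st.f_frsize;
        return free_bytes > MIN_FREE_BYTES;
    }

    /* e.g.:  if (!disk_has_room("/archive1")) pause the crawler and switch disks */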

Finally, we started the actual crawl, a snapshot of the Austrian web-space, on May 7th, 2001. The Harvester processes, which are in charge of downloading files from the URLs they receive from the Scheduler module, tended to die again and again. Since they had to be restarted manually, this strained performance significantly. We therefore installed a process that took over this job by regularly restarting all Harvester modules. Furthermore, small bugs discovered during the crawl were fixed, such as the parsing of downloaded files, which produced core dumps on specific URL definitions.
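
The restart process is, in essence, a small watchdog. The following C sketch shows the idea under the assumption of a single Harvester binary at a hypothetical path; the real modules and their invocation differ.

    /* Minimal watchdog sketch: fork a Harvester, wait for it to die,
     * restart it.  Binary path and arguments are illustrative assumptions. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {
                /* Child: replace ourselves with the Harvester binary. */
                execl("./harvester", "harvester", (char *)NULL);
                _exit(127);                       /* exec failed */
            }
            if (pid < 0) {
                perror("fork");
                sleep(10);
                continue;
            }
            int status;
            waitpid(pid, &status, 0);             /* block until it dies */
            fprintf(stderr, "harvester exited, restarting\n");
            sleep(5);                             /* avoid a tight restart loop */
        }
    }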

More severe, however, proved to be the fact that the Nedlib-crawler checks only after a download whether a file has been acquired before; the file is simply not entered into the archive if a previous version already exists. Thus, a single file may be downloaded multiple times, which not only degrades the performance of the crawler but also strains the data traffic of the web-servers. After receiving complaints from several service providers stating that their data traffic was blocked by the crawler, we had to abort this attempt after ten days, on May 16th, 2001.
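
The obvious remedy is to test for duplicates before issuing the request. The following C sketch shows the idea with a plain in-memory list standing in for the crawler's URL store; all names are our own assumptions.

    /* Sketch: check for duplicates before the download instead of after it. */
    #include <stdlib.h>
    #include <string.h>

    static char  **seen_urls  = NULL;
    static size_t  seen_count = 0;

    static int already_seen(const char *url)
    {
        for (size_t i = 0; i < seen_count; i++)
            if (strcmp(seen_urls[i], url) == 0)
                return 1;
        return 0;
    }

    static void mark_seen(const char *url)
    {
        seen_urls = realloc(seen_urls, (seen_count + 1) * sizeof *seen_urls);
        seen_urls[seen_count++] = strdup(url);
    }

    /* Returns 1 if the URL should be fetched; the duplicate test happens
     * before any bytes are transferred, sparing the remote server. */
    int should_download(const char *url)
    {
        if (already_seen(url))
            return 0;
        mark_seen(url);
        return 1;
    }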

In this first test run about 666,000 unique URLs were harvested from 1,210 different sites. All in all, 8.3 gigabytes of data were stored at a rate of about 1 gigabyte per day. Our experience shows that the Nedlib-crawler is, in principle, constructed such that the requirements of the AOLA-project are met, yet the preliminary version we worked with proved not to be stable enough. We reported our modifications to the developer of the crawler for incorporation into the tool. A new version has been released recently, but could no longer be incorporated into our experiments.



Footnotes

... available in the public domain23
http://www.csc.fi/sovellus/nedlib/

