

System setup

The AOLA project's system runs on a Linux operating system. Storage comprises three 80-gigabyte hard disks. Once the acquired material is ready for long-term storage, it is transferred to tape using a six-fold tape drive.

Initially we planned to install a software RAID system [21], a storage management layer that can combine several physical disks into one larger virtual device. Such a configuration can improve performance and at the same time makes handling more convenient. We intended to use it as a large buffer of nearly 240 gigabytes before transferring the data to final tape storage. However, in combination with the XFS file system [22] we used, this setup turned out to overwrite data at random. We were therefore forced to abandon the software RAID.

However, controlling the hard disks individually means switching between them manually, which requires pausing the crawler every time one disk fills up. Because of this limitation, we organized crawl operations as follows. One hard disk is used solely for the crawler's program files and as a buffer. Data is downloaded to one of the other two hard disks as long as space is available. As soon as that disk is full, downloading switches to the remaining disk. While data is downloaded to this fresh disk, the material on the filled disk is processed: first, statistics are compiled, and then the documents are compressed, using the buffer as temporary storage. After the collection items have been written to tape, the data on the corresponding disk can be erased. These cycles repeat, alternating between the two hard disks. Both the Nedlib crawler and Combine were configured so that this scheme of operation could be applied.
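The alternation between the two data disks can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the scripts used in the project: the names `Disk` and `simulate_crawl`, the batch sizes, and the assumption that processing of a filled disk always finishes before the other disk fills up are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """One of the two data disks (hypothetical model, sizes in gigabytes)."""
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


def simulate_crawl(batches_gb, capacity_gb=80.0):
    """Return a log of disk switches for a sequence of downloaded batches.

    Assumption: a filled disk is fully processed (statistics, compression,
    tape, erase) before the other disk runs out of space.
    """
    disks = [Disk("disk1", capacity_gb), Disk("disk2", capacity_gb)]
    active = 0  # index of the disk currently receiving downloads
    log = []
    for size in batches_gb:
        if size > capacity_gb:
            raise ValueError("a single batch must fit on one disk")
        if size > disks[active].free_gb():
            # The active disk is full: pause, switch to the other disk.
            full = disks[active]
            active = 1 - active
            log.append(f"switch to {disks[active].name}")
            # Filled disk: compile statistics, compress, write to tape, erase.
            log.append(f"archive {full.name} ({full.used_gb:.0f} GB)")
            full.used_gb = 0.0
        disks[active].used_gb += size
    return log


if __name__ == "__main__":
    for event in simulate_crawl([30, 30, 30, 30, 30]):
        print(event)
```

In this model the third disk, which holds the crawler's program files and the compression buffer, is left out because it never takes part in the rotation.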

In general, we try to rely on publicly available software, since independence from commercial providers is important in such a long-term project. This also opens up the possibility of close cooperation with other projects in this field. Accordingly, the crawlers and other tools used to take the snapshots of the Austrian web space are freely available. Furthermore, we made sure that the source code of the programs we use is available, which we consider essential for a project for which no off-the-shelf solution exists and which still requires research effort.



Footnotes

[21] RAID system: short for Redundant Array of Inexpensive Disks
[22] XFS file system: high-performance file system, very scalable due to 64-bit addressing; by Silicon Graphics, Inc. (SGI)

Andreas Aschenbrenner