

Adapting the Nedlib-crawler

In order to integrate the tool into our system configuration, we changed the location where harvested files are stored. Originally, a new directory named after the current date was created each day to serve as the root of the storage hierarchy. This directory was automatically placed where the tool itself is installed, at the same level as the directories for the binary files, the sources, and other components.

Since our system configuration required the program files of the Nedlib-crawler to reside on a different hard disk than the archive, we had to adapt the code. We therefore introduced an additional parameter that specifies where the downloaded material should be stored.
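The effect of the new parameter can be sketched in Python as follows; the name archive_root is chosen here merely for illustration, and the crawler itself is not implemented in this form:

    import os
    from datetime import date

    def daily_storage_root(archive_root):
        # Create the date-named directory below the configured archive
        # location, e.g. /archive/2001-05-07, instead of below the
        # directory where the crawler itself is installed.
        day_dir = os.path.join(archive_root, date.today().isoformat())
        os.makedirs(day_dir, exist_ok=True)
        return day_dir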

However, the basic storage concept remained the same. Below the location specified by the new parameter are the directories for each day of harvesting, and below those the directories carrying a running index, as described in Section 2.3.3. After 2,000 files have been collected, this running number is increased and the retrieved material is put into a new directory, so that no single directory becomes too big. As a consequence, the actual size of a directory with a given running number can only be predicted roughly, which in turn makes packing the files and subsequently transferring them to tape more difficult.

To remedy this shortcoming we rebuilt the archiving module, so that a new directory is started once the current one has reached a certain size in bytes, rather than after a fixed number of files has been put therein. When the files are moved to tape archives, their size, and thus the number of directories per tape, is easier to determine.
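The rotation by size can be sketched as follows; the threshold of 512 MB is an assumption for illustration only:

    import os

    class DirectoryRotator:
        # Start a new directory with the next running number once the
        # current one would exceed max_bytes, instead of after a fixed
        # count of 2,000 files.
        def __init__(self, day_dir, max_bytes=512 * 1024 * 1024):
            self.day_dir = day_dir
            self.max_bytes = max_bytes   # threshold chosen here for illustration
            self.index = 0               # running number of the current directory
            self.current_size = 0        # bytes already stored in it

        def target_dir(self, next_file_size):
            if self.current_size + next_file_size > self.max_bytes:
                self.index += 1
                self.current_size = 0
            self.current_size += next_file_size
            path = os.path.join(self.day_dir, str(self.index))
            os.makedirs(path, exist_ok=True)
            return path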

In-line images are to be handled with priority and must be downloaded even if the web-server they reside on is not within the allowed scope. This is implemented by identifying pictures based on a list of file extensions such files could conceivably have. However, these extensions are written directly into the code and comprise only .gif, .jpeg, and .tiff. Since the most important extension, .jpg, was missing, we extended the list slightly. A more flexible method should nevertheless be adopted for this purpose in the future.
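The extension-based check amounts to a test like the following sketch, in which the hard-coded list is replaced by a configurable one:

    # Extensions kept in a configurable set instead of being hard-coded;
    # the original code knew only .gif, .jpeg and .tiff, we added .jpg.
    IMAGE_EXTENSIONS = {".gif", ".jpeg", ".jpg", ".tiff"}

    def is_inline_image(url):
        # In-line images are downloaded with priority even if their
        # web-server lies outside the allowed scope.
        path = url.split("?", 1)[0].lower()   # ignore query parameters
        return any(path.endswith(ext) for ext in IMAGE_EXTENSIONS)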

Furthermore, we found the decision-making process for whether an URL is allowed to be harvested inconsistent. In principle, three tables should make it possible to define this quite granularly. Yet entries only had the status "allowed" or "disallowed", and they were structured in a way that made their handling unintuitive. Even worse, some specifications could not be expressed at all.

The three tables intended for this purpose are called 'domains', 'hosts', and 'restrictions'. The 'domains'-table has the broadest scope and takes the national domain, .at in the case of Austria. Additionally, second-level domains that are not located under the national code but are nevertheless of interest to the country can be entered here; we defined, amongst others, austria.cz.

As its name indicates, the 'hosts'-table addresses specific web-servers. It lists servers that are registered under a foreign domain but still belong to the national web-space; www.artmagazine.cc, for example, is part of the Austrian web-space. Alternatively, hosts that would be allowed according to the 'domains'-table, yet must not be harvested for whatever reason, can be entered here. An additional column, taking only the values 'Y' and 'N', distinguishes the allowed hosts from those that must not be taken.

Lastly, the 'restrictions'-table defines on a very granular level whether files on a specific host may be acquired. It is intended for files of national interest that reside on foreign web-servers. www.embassyworld.com, for example, holds a file /embassy/austria.htm that we decided to include, and www.lonelyplanet.com hosts a special about Austria at the path /destinations/europe/austria/.
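With the examples above, the intended contents of the three tables can be illustrated roughly as follows; the structure shown is our own sketch, not the actual schema of the crawler's database:

    # Illustrative contents of the three tables, using the examples
    # above; the actual schema of the Nedlib-crawler differs.
    domains = {"at", "austria.cz"}

    hosts = {
        "www.artmagazine.cc": "Y",   # foreign domain, but part of the Austrian web-space
        # a host marked "N" would be blocked although 'domains' allows it
    }

    restrictions = {
        "www.embassyworld.com": ["/embassy/austria.htm"],
        "www.lonelyplanet.com": ["/destinations/europe/austria/"],
    }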

According to the old decision process of the Nedlib-crawler for whether or not a file may be taken (as depicted in Figure 4.1), it is impossible to register something in the 'restrictions'-table and at the same time expect files under the national domain registered in the 'domains'-table to be downloaded. In that case, the question "Is there any allowing rule?" is answered with "Yes", and subsequently all files that are registered under the national domain but not explicitly entered in the 'restrictions'-table are discarded. Since that is certainly not our intention, we reconstructed the decision tree (as shown in Figure 4.1), taking into account that besides "allowed" and "not allowed", an entry may also simply be left unspecified.
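The reconstructed decision can be sketched as follows; the function is merely an illustration of the three-valued logic (allowed, disallowed, unspecified), not the actual code of the crawler:

    def url_allowed(host, path, domains, hosts, restrictions):
        # An explicit per-host rule takes precedence.
        if host in hosts:
            return hosts[host] == "Y"
        # Per-path rules for foreign hosts of national interest.
        if host in restrictions:
            return any(path.startswith(p) for p in restrictions[host])
        # Entries that are simply not specified no longer block the URL;
        # the domain scope (.at, austria.cz, ...) decides instead.
        return any(host == d or host.endswith("." + d) for d in domains)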


Figure 4.1: Accepting an URL in the Nedlib-crawler

Further, we extended the logging mechanisms of the Nedlib-crawler in order to gain more control over its activities. We also added a disk-space supervision facility, which guarantees that the robot is automatically paused after a certain amount of data has been acquired; at the same time an e-mail is sent informing the operator that the available space is exhausted. Additionally, we implemented a function that writes an index file to allow efficient retrieval of the archived documents.
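The disk-space supervision can be sketched as follows; the limit, the sender address, and the local mail server are assumptions for illustration:

    import shutil
    import smtplib
    from email.message import EmailMessage

    MAX_ARCHIVE_BYTES = 50 * 1024 * 1024 * 1024   # assumed limit for illustration

    def check_disk_space(archive_root, bytes_written, operator_address):
        # Pause the crawl and notify the operator once the configured
        # amount of data has been acquired or the disk is nearly full.
        free = shutil.disk_usage(archive_root).free
        if bytes_written >= MAX_ARCHIVE_BYTES or free < 1024 * 1024 * 1024:
            msg = EmailMessage()
            msg["Subject"] = "Crawler paused: archive space exhausted"
            msg["From"] = "crawler@localhost"          # assumed sender address
            msg["To"] = operator_address
            msg.set_content(f"{bytes_written} bytes archived, {free} bytes free.")
            with smtplib.SMTP("localhost") as smtp:    # assumes a local mail server
                smtp.send_message(msg)
            return True    # the caller pauses the robot
        return False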

During the actual run attempting a sweep of the whole Austrian web-space, we became aware of links within web-pages that the Nedlib-crawler was not able to parse, namely URLs carrying parameters, such as www.lion.cc?name=AOLA. We corrected the parsing so that it can handle this type of reference. Yet new link types will certainly appear and other file formats will be introduced, so the parser must be constructed in a very flexible way to make continuous adaptation possible and keep the tool up to date.
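The kind of reference that has to be recognised can be illustrated with the following sketch; it is not the crawler's actual parser:

    import re

    # The pattern also accepts URLs that carry parameters, such as
    # www.lion.cc?name=AOLA.
    HREF = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

    def extract_links(html):
        return HREF.findall(html)

    # extract_links('<a href="http://www.lion.cc?name=AOLA">AOLA</a>')
    # yields ['http://www.lion.cc?name=AOLA']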

