

Running the Combine-crawler

Owing to the flexible design concept the Combine-crawler is based on, its three modules - Cabin, Harvester, and Harvesting Database (cf. Figure 4.2) - are started independently. Usually, multiple instances of harvesters and parsers are launched, which can even be distributed across several hosts in a network. All of these processes can be stopped and restarted at any time, which makes the system very robust.

Communication between the server daemons and clients is realised via files as well as network connections. Each of the components can be modified or even replaced as long as it adheres to the defined protocols.

General configuration is done in the file etc/combine.conf, which offers a wide range of settings. The scope of the collection is delimited by the files etc/config_allow and etc/config_exclude; each line therein defines, by means of a regular expression, a host or path to be allowed or excluded, respectively. For example, the national domain of Austria is admitted by the line "HOST: .at$" in etc/config_allow. A path that should not be collected, such as "^www.host.at/infinite/recursion/infinite/", is analogously specified in etc/config_exclude. Similarly, file extensions typically used by pictures can be listed in the file etc/config_pics.
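
For illustration, the two scope files containing only the entries quoted above would thus read as follows:

    etc/config_allow:
        HOST: .at$

    etc/config_exclude:
        ^www.host.at/infinite/recursion/infinite/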

Filtering newly found references against the defined scope and registering the URLs to be harvested is done by an additional utility that is not executed automatically; instead, a cron job should be installed for this task. Other supervising functionality should be implemented in the same way, for instance ensuring that the crawler is paused when the storage space is exhausted (a sketch of such a check is given below).
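
As an illustration of such a supervision task, the following Perl sketch checks the free disk space and, if it falls below a limit, signals the crawl jobs to pause. The directory, the threshold, and the flag file are merely assumed for the example and are not part of the Combine distribution:

    #!/usr/bin/perl -w
    use strict;

    # Hedged sketch of a cron-driven supervision check; the threshold, the
    # storage location, and the pause signal are placeholders only.
    my $min_free_kb = 500_000;          # assumed lower limit: about 500 MB
    my $archive_dir = '/data/crawl';    # hypothetical storage location

    # Read the free space on the archive partition from the output of df(1).
    my @fields  = split ' ', (`df -k $archive_dir`)[1];
    my $free_kb = $fields[3];

    if ($free_kb < $min_free_kb) {
        # Placeholder action: create a flag file telling the crawl jobs to pause.
        system('touch', '/tmp/combine.paused');
    }

Installed as a cron job, for instance with the crontab line "*/10 * * * * /opt/combine/bin/supervise.pl" (the path again being hypothetical), such a check would run every ten minutes.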

The database used for controlling the URLs to be downloaded caused problems. One queue grows without bound, as entries are not deleted after they have been processed, and the Perl interface to the database does not appear robust enough to handle such large amounts of data. Consequently, the scheduler becomes rather slow, and it is necessary to dump the queue and restart the module from time to time.
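
The following sketch indicates how such a periodic clean-up could look. It assumes, purely for illustration, that the queue is a Berkeley DB hash file accessed through the Perl module DB_File and that processed entries are marked by the value "done"; the actual storage format used by Combine may well differ:

    use strict;
    use DB_File;

    # Copy only the unprocessed entries into a fresh queue file.
    my (%old, %new);
    tie %old, 'DB_File', 'queue.db'     or die "cannot open queue.db: $!";
    tie %new, 'DB_File', 'queue.new.db' or die "cannot open queue.new.db: $!";

    while (my ($url, $state) = each %old) {
        $new{$url} = $state unless $state eq 'done';   # keep only pending URLs
    }

    untie %old;
    untie %new;
    # After stopping the scheduler, queue.new.db would replace queue.db.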

The logging facilities are fairly comprehensive and support two mechanisms: log entries can either be written to a local file, or sent to a daemon that gathers the messages of all distributed components at a common location. Since a dedicated object encapsulates the communication with the logging mechanisms, logging can easily be extended at any point.
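
A minimal sketch of such an object, written here only to illustrate the principle and not taken from the Combine sources, could look as follows:

    package Log;   # hypothetical name, not the class actually used by Combine
    use strict;
    use IO::Socket::INET;

    # The constructor decides whether messages go to a local file or to a
    # central logging daemon, depending on the arguments it receives.
    sub new {
        my ($class, %arg) = @_;
        my $self = { host => $arg{host}, port => $arg{port}, file => $arg{file} };
        if ($self->{host}) {
            $self->{sock} = IO::Socket::INET->new(
                PeerAddr => $self->{host}, PeerPort => $self->{port},
                Proto    => 'tcp') or die "cannot reach log daemon: $!";
        } else {
            open $self->{fh}, '>>', $self->{file} or die "cannot open log file: $!";
        }
        return bless $self, $class;
    }

    # Prepend a timestamp and send the message to whichever channel is open.
    sub msg {
        my ($self, $text) = @_;
        my $line = localtime() . " $text\n";
        $self->{sock} ? print {$self->{sock}} $line : print {$self->{fh}} $line;
    }

    1;

A component would then simply call $log->msg('harvester started'), regardless of whether the messages end up in a local file or at the central daemon.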

Generally speaking, the Combine-crawler demands a certain amount of care due to its very modular and open design. However, the enhanced control and flexibility it offers in return make up for this additional effort.

