next up previous contents
Next: Combine crawl Up: Nedlib crawl Previous: Adapting the Nedlib-crawler   Contents


Running the Nedlib-crawler

Before starting the Nedlib-crawler, the environment variable NEDLIB_ROOT has to be set to the installation directory of the robot. Nevertheless, it is still necessary to change into the binaries directory at $NEDLIB_ROOT/bin/ before launching, since the variable is not used consistently throughout the tool. Once the scope of the crawl has been delimited as described in the previous Section 4.4.1, the crawler can be started with a single command. A running crawl can be paused or stopped via a telnet connection, or simply by executing the corresponding utility function.
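The start-up steps above can be sketched as follows. This is only an illustration of the sequence (set the variable, change into the binaries directory, then launch); the example installation path and the commented-out binary name are assumptions, not taken from the Nedlib documentation.

```python
import os
import os.path

# Example installation directory (an assumption for illustration).
NEDLIB_ROOT = "/opt/nedlib"
os.environ["NEDLIB_ROOT"] = NEDLIB_ROOT

# Because the variable is not used consistently, the crawler must be
# launched from the binaries directory itself:
bin_dir = os.path.join(NEDLIB_ROOT, "bin")
# os.chdir(bin_dir)
# subprocess.run(["./nedlib"])  # hypothetical binary name, not confirmed

print(bin_dir)  # → /opt/nedlib/bin
```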

The system design depends to a great extent on the database. Three main uses of the database can be discerned: (1) configuration, (2) long-term storage of indexing data and metadata, and (3) temporary storage of information needed for the operations of the robot.

Besides the three tables - 'domains', 'hosts', and 'restrictions' - needed to delimit the scope of a crawl, the robot is controlled by definitions in the table 'config'. This configuration table contains three columns and a single row. The field in the column 'robotrules' defines whether or not the Nedlib-tool obeys robot exclusion rules. The column 'ftpsupport' offers another option: support for the FTP protocol is switched on by setting the corresponding field to 'Y'. Lastly, 'maxdepth' takes the maximum number of directory levels to be followed down the storage hierarchy of any host; this limit prevents infinite recursion.
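The shape of this single-row configuration table can be illustrated as follows. SQLite is used here purely for demonstration; Nedlib's actual DBMS, column types, and default values are not specified in this section and may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE config (
        robotrules TEXT,    -- 'Y'/'N': obey robot exclusion rules?
        ftpsupport TEXT,    -- 'Y'/'N': enable FTP support
        maxdepth   INTEGER  -- max. directory depth followed per host
    )
""")
# The table holds exactly one row; the values are example settings.
con.execute("INSERT INTO config VALUES ('Y', 'N', 8)")

robotrules, ftpsupport, maxdepth = con.execute(
    "SELECT robotrules, ftpsupport, maxdepth FROM config").fetchone()
print(robotrules, ftpsupport, maxdepth)  # → Y N 8
```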

Furthermore, metadata of the archived documents is stored in the table 'documents'; the table 'urls' contains the URLs of collected documents, 'knownurls' holds the MD5 checksum of each URL, 'logtable' collects log messages, and 'timespace' records information about the harvesting rounds. In addition, 13 other tables temporarily hold information that is necessary for the operations of the robot.
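The purpose of the 'knownurls' table can be sketched with a few lines of Python: storing a fixed-length MD5 checksum per URL allows a fast membership test before a URL is fetched again. The function names and the use of an in-memory set are illustrative assumptions; Nedlib keeps these checksums in its database.

```python
import hashlib

def url_md5(url: str) -> str:
    # MD5 digest of the URL string, as stored in 'knownurls'.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

knownurls = set()  # stands in for the database table

def seen_before(url: str) -> bool:
    # Check-and-record: True if the URL was already encountered.
    digest = url_md5(url)
    if digest in knownurls:
        return True
    knownurls.add(digest)
    return False

print(seen_before("http://example.org/"))  # → False (first encounter)
print(seen_before("http://example.org/"))  # → True  (duplicate)
```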

Due to the rather monolithic and intricate design of the Nedlib-crawler, low-level control is limited. Yet, when everything runs cleanly, the robot is very convenient to work with.


Andreas Aschenbrenner