

The Parser

Having received a downloaded web page from the Harvester, the Parser scans it for any indication of an interaction interface; in the case of HTML pages, this is the 'form' tag. An interaction form typically consists of three main components. First, (1) the URL the query will be directed at is defined. A further component is (2) the type of transmission used for the query. This information is important mainly for the Harvester. Above all, an interaction form is described by (3) a number of fields the user can modify in order to formulate a question. The query string is composed of the values given to those fields.
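As an illustration of how the three components combine into a request, the following minimal Python sketch assembles a query from a target URL, a transmission type and a set of field values. The function name build_request, the example URL and the field q are hypothetical and not part of the system described here; the sketch also assumes that the transmission type is either GET or POST and that the target URL carries no query string of its own.

```python
from urllib.parse import urlencode, urljoin


def build_request(page_url, action, method, values):
    """Return (method, url, body) for submitting a form.

    page_url: URL of the page the form was found on
    action:   component (1), the URL the query is directed at (may be relative)
    method:   component (2), the type of transmission ("get" or "post")
    values:   component (3), a mapping of field names to their values
    """
    target = urljoin(page_url, action)   # resolve a relative target URL
    query = urlencode(values)            # e.g. "loc=Vienna"
    if method.upper() == "GET":
        # GET: the query string travels as part of the URL
        return "GET", f"{target}?{query}", None
    # POST: the query string travels in the request body
    return "POST", target, query


request = build_request(
    "http://example.org/search/index.html", "find.cgi", "get", {"loc": "Vienna"}
)
```

Here the field loc with the value Vienna ends up in the query string of the request that would be handed to the Harvester.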

The third component goes far deeper than the first two. At the top level, the parsed file contains a number of interaction forms, possibly zero in the case of a static document. Each of these forms in turn consists of a number of fields, necessarily at least one. Each field has a name, is of a certain type and can have a preset value, e.g. a field called loc denoting a city such as Vienna.
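The hierarchy of file, forms and fields can be sketched with Python's standard html.parser module. This is only one possible way to extract the structure, not the implementation used in this work; the class names FormField, Form and FormExtractor and the function parse_forms are hypothetical. The sketch assumes that only named input, select and textarea elements count as fields and that forms are not nested.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class FormField:
    name: str
    type: str
    value: str = ""      # preset value, if any


@dataclass
class Form:
    action: str          # URL the query is directed at
    method: str          # type of transmission
    fields: list = field(default_factory=list)


class FormExtractor(HTMLParser):
    """Collects all forms of a page together with their named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None   # form currently being read, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            # "get" is the default transmission type in HTML
            method = (a.get("method") or "get").lower()
            self._current = Form(a.get("action") or "", method)
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = a.get("name")
            if name:   # unnamed controls contribute nothing to the query
                if tag == "input":
                    ftype = (a.get("type") or "text").lower()
                else:
                    ftype = tag
                self._current.fields.append(
                    FormField(name, ftype, a.get("value") or "")
                )

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def parse_forms(html):
    parser = FormExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


page = (
    '<html><body><form action="/search" method="POST">'
    '<input type="text" name="loc" value="Vienna">'
    '<input type="submit" value="Go">'
    "</form></body></html>"
)
forms = parse_forms(page)
```

For a static document without any form, parse_forms returns an empty list, which corresponds to the case of zero interaction forms.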

Having extracted the three characteristic components, the Parser has assembled enough information to uniquely identify previously saved requests in the Database, the next module.
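How such an identification might be realised is not specified at this point. One conceivable approach, shown here purely as an assumption, is to derive a lookup key from the three components. The function form_signature is hypothetical, as is the choice to ignore the order of the fields and to hash the result with SHA-256.

```python
import hashlib


def form_signature(action, method, field_names):
    """Derive a stable key from target URL, transmission type and field names.

    Assumption: two forms are treated as the same request type if they share
    the target URL, the transmission type and the set of field names.
    """
    canonical = "\n".join([action, method.lower(), *sorted(field_names)])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key = form_signature("http://example.org/search/find.cgi", "GET", ["loc", "date"])
```

A key of this kind could serve as an index into the saved requests, so that a form encountered again is recognised regardless of the order in which its fields appear on the page.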

Furthermore, the structure of the dialog between user and server is now clear and, consequently, so is the structure of a query. So far, however, no steps have been taken to enable an automatic understanding of newly encountered forms. Additional information has to be extracted to allow an interpretation of what the fields are actually for. The information required depends primarily on how this interpretation process is approached, which is discussed in a rather theoretical manner in the Categoriser module (cf. Section 5.2.4). A practical suggestion for a method and the information it requires is presented in Section 5.3.2.


Andreas Aschenbrenner