

Outline of the task

When acquiring dynamic web-pages, an interaction first has to be identified. Typically, an interaction consists of the user posing a query to a web-server, which answers with a corresponding dynamic web-page. A query is a valid combination of values that the user can specify in an interaction form. Hence, the task is to generate these values.

First of all, the scope of the retrieval has to be determined by setting up a policy. On the one hand, one could attempt to download all web-pages resulting from possible queries. This poses a serious technical problem: as the database cannot be viewed directly, one can never be sure whether the gathered data is complete, let alone how the data is to be extracted. It also obviously involves burdening the web-server that hosts the database with repeated requests. On the other hand, the actual intention of building an archive is to give future generations an impression of how the Internet looked in our days. Therefore, the emphasis is on the way the information is presented rather than on the data as such. Consequently, acquiring a few expressive samples of how the trail of navigation continues after the dialog between the user and the server is a sufficient approach, though still challenging at this time.
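The policy of taking only a few samples instead of every possible query can be illustrated with a short sketch. It is not taken from the prototype; the function name `sample_queries`, the field names, and the use of reservoir sampling are assumptions made purely for illustration. The sketch draws a fixed number of value combinations from the full query space without enumerating that space in memory.

```python
import itertools
import random


def sample_queries(field_values, k, seed=0):
    """Return at most k distinct value combinations (hypothetical helper).

    field_values maps a field name to the list of values it may take.
    """
    names = sorted(field_values)
    # Lazily enumerate the full query space (Cartesian product of all values).
    combos = itertools.product(*(field_values[name] for name in names))
    rng = random.Random(seed)
    sample = []
    # Reservoir sampling: keeps k combinations without storing all of them.
    for i, combo in enumerate(combos):
        if i < k:
            sample.append(combo)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = combo
    return [dict(zip(names, combo)) for combo in sample]


if __name__ == "__main__":
    # Illustrative field names and values, not taken from a real site.
    space = {"lang": ["de", "en"], "year": ["1999", "2000", "2001"]}
    for query in sample_queries(space, k=3):
        print(query)
```

The number of samples `k` stands in for the policy decision: it bounds the load placed on the server regardless of how large the query space is.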

Not only is the amount of data that makes up the Internet continuously growing [Tel01], but the percentage of dynamic sites can also be expected to increase. For this reason, it is indispensable to make the generation of a request as automatic as possible. Anything that requires an operator to give the values for the input fields of the interaction explicitly and one by one can only be considered a simple tool. While it accelerates the work, it does not reduce the amount of work, which can ultimately only be handled with massive manpower.

The objective, therefore, is a means to identify a web-page with an interface for interaction, extract the dialog fields, and automatically fill them with appropriate values. Subsequently, the dynamic request can be sent to the server and its answer can be obtained.
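The steps named above (identify a form, extract its fields, fill them, build the request) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the prototype's implementation: the class `FormExtractor`, the function `build_get_url`, and the example page are hypothetical, only `input` and `select` elements are handled, and the request is built as a GET URL but not actually sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collects forms and the candidate values of their fields (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._form = None     # form currently being parsed
        self._select = None   # name of the select element currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._form = {
                "action": a.get("action") or "",
                "method": (a.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._form)
        elif self._form is None:
            return  # ignore elements outside of any form
        elif tag == "input" and a.get("name"):
            # Free-text inputs only offer their default value, if any.
            self._form["fields"][a["name"]] = [a.get("value") or ""]
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self._form["fields"][self._select] = []
        elif tag == "option" and self._select is not None:
            if a.get("value") is not None:
                self._form["fields"][self._select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._form = None
            self._select = None
        elif tag == "select":
            self._select = None


def build_get_url(page_url, form, chosen=None):
    """Fill each field with a chosen value or its first candidate value."""
    chosen = chosen or {}
    params = {}
    for name, candidates in form["fields"].items():
        if name in chosen:
            params[name] = chosen[name]
        else:
            params[name] = candidates[0] if candidates else ""
    return urljoin(page_url, form["action"]) + "?" + urlencode(params)


if __name__ == "__main__":
    # Illustrative page, not taken from a real site.
    page = (
        '<form action="/search" method="get">'
        '<input type="text" name="q">'
        '<select name="lang"><option value="de">German</option>'
        '<option value="en">English</option></select></form>'
    )
    extractor = FormExtractor()
    extractor.feed(page)
    print(build_get_url("http://example.org/index.html",
                        extractor.forms[0], {"q": "archive"}))
```

Selection lists expose their admissible values directly, whereas free-text fields do not; choosing appropriate values for the latter is the part that a sketch like this leaves open.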

In the following, this task is structured by breaking it down into its components. Afterwards, experiences with the algorithms used in a prototype developed in the course of this thesis are presented.


Andreas Aschenbrenner