

Further Improvements

The individual steps of query generation can be performed multiple times. For categorisation, a k-nearest neighbour approach with k greater than 1 can be chosen in order to try out several interpretations. Value selection can likewise be repeated. The Referee can then assess which of the extracted answers is best, so that errors caused by minor misinterpretations are reduced. Care must be taken, of course, that the web server the requests are aimed at is not harmed by these actions. Sending hundreds of variations of a query will very likely raise the quality of the result, but at the same time the service is not only loaded with requests; permanent junk data may also be left behind.
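The idea of trying several interpretations can be illustrated with a minimal sketch. It assumes, purely for illustration, that a form's key-vector is reduced to a set of category labels and that similarity is measured with the Jaccard index; the pattern names, the labels and the function `k_nearest_patterns` are hypothetical and not taken from the system described here. The parameter `k` also bounds how many query variations are generated, which keeps the load on the target server small.

```python
from typing import Dict, FrozenSet, List, Tuple


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two category sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def k_nearest_patterns(
    guess: FrozenSet[str],
    patterns: Dict[str, FrozenSet[str]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k stored patterns most similar to the guessed categories.

    Each returned pattern is one candidate interpretation of the form.
    """
    scored = [(name, jaccard(guess, cats)) for name, cats in patterns.items()]
    # Highest similarity first; ties are broken by name for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical stored patterns (assumed data, for illustration only).
patterns = {
    "address_form": frozenset({"name", "street", "house_number", "city"}),
    "login_form": frozenset({"user", "password"}),
    "order_form": frozenset({"name", "street", "city", "product", "quantity"}),
}

# Initial, possibly imperfect categorisation of a new form.
guess = frozenset({"name", "street", "city"})

# With k = 2, at most two interpretations (and hence two queries) are tried.
candidates = k_nearest_patterns(guess, patterns, k=2)
```

Each candidate would then be passed on to value selection, and the answers returned by the server handed to the Referee for comparison.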

Up to now, the correctness of the Database has never been called into question. All categories for the various fields of the user dialogs were taken to be correct. As new data is added to the Database automatically, misclassifications have to be expected, however good the method used. This must be taken into account when categorising new interaction forms; in addition, the information in the Database should be revised repeatedly as the available data grows. Revision yields better results not only for generating the query for the revised form itself, but for any generation process, be it categorisation or the selection of values, that uses this form as a pattern.

Further approaches for extending Categorisation should be considered. The basic categorisation of isolated fields can be improved beyond the constant, empirical likelihood values used now. This can be achieved by making this step a separate learning step that categorises a field solely on the basis of the information extracted by the Parser. The interpretation of a whole form might thereby be considerably facilitated.

Finding a method for creating new form patterns will be a major task to come. The nearest neighbour method could be extended such that a field is assigned to a category even if this category is not part of the nearest neighbour's key-vector, provided the initial likelihood values suggest so. Introducing a threshold for this feature should be enough. First, of course, it must be verified that enough variations of key-vectors, i.e. different types of forms, exist for this additional sophistication to be necessary and justifiable.
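The threshold rule can be sketched as follows. This is only one possible reading of the proposal: the function `assign_category`, the representation of likelihoods as a dictionary, and the threshold value 0.9 are assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def assign_category(
    field_likelihoods: Dict[str, float],
    neighbour_categories: Set[str],
    threshold: float = 0.9,  # assumed value; would have to be tuned
) -> Optional[str]:
    """Pick a category for one field.

    field_likelihoods: initial likelihood of each category for this field.
    neighbour_categories: categories in the nearest neighbour's key-vector.
    """
    if not field_likelihoods:
        return None

    # Category with the highest initial likelihood, regardless of the pattern.
    best_overall = max(field_likelihoods, key=field_likelihoods.get)

    # Override the pattern only if the evidence is strong enough.
    if (
        best_overall not in neighbour_categories
        and field_likelihoods[best_overall] >= threshold
    ):
        return best_overall

    # Otherwise stay within the categories offered by the nearest neighbour.
    allowed = {
        cat: p for cat, p in field_likelihoods.items()
        if cat in neighbour_categories
    }
    if not allowed:
        return None
    return max(allowed, key=allowed.get)


neighbour = {"name", "street"}

# Strong evidence for a category outside the pattern: a new pattern emerges.
strong = assign_category({"name": 0.3, "email": 0.95}, neighbour)

# Weak evidence: the field stays within the nearest neighbour's categories.
weak = assign_category({"name": 0.3, "email": 0.6}, neighbour)
```

A high threshold keeps the behaviour close to the plain nearest neighbour method, while a lower one lets new form patterns appear more readily, at the price of more misclassifications.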

A wholly different representation should also be considered for the categorisation process in place of the nearest neighbour approach. If the guidelines for categorisation that are to be learned are made explicit, they can be expressed as rules stating which categories tend to occur together. For example, one could assume that a field into which the user is supposed to write a house number frequently occurs together with a field for the street name and another for the person's name. The number of fields might also play a role. A method handling these properties and other heuristics could be realised using Inductive Logic Programming [NCdW97], which induces logic theories from examples and background knowledge.
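To make the notion of such rules concrete, the following sketch derives simple "if category A occurs, category B usually occurs too" rules by counting co-occurrences. It is deliberately not Inductive Logic Programming, only a much simpler stand-in that shows the kind of regularity a learned theory might capture; the function name, the confidence threshold and the example forms are assumptions.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable, Set, Tuple


def cooccurrence_rules(
    forms: Iterable[Set[str]],
    min_confidence: float = 0.8,  # assumed cut-off
) -> Dict[Tuple[str, str], float]:
    """Return rules (a, b) -> confidence, read as 'forms with a also have b'."""
    single: Counter = Counter()  # number of forms containing each category
    pair: Counter = Counter()    # number of forms containing both a and b

    for categories in forms:
        cats = set(categories)
        single.update(cats)
        # Ordered pairs, so that a -> b and b -> a are counted separately.
        pair.update(permutations(sorted(cats), 2))

    rules = {}
    for (a, b), together in pair.items():
        confidence = together / single[a]
        if confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


# Hypothetical categorised forms (assumed data, for illustration only).
forms = [
    {"house_number", "street", "name"},
    {"house_number", "street", "name", "city"},
    {"name", "email"},
]

rules = cooccurrence_rules(forms)
```

In this toy data every form with a house number also has a street field, so the rule from `house_number` to `street` is kept, whereas the reverse direction from `name` to `house_number` holds in only two of three forms and is dropped.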

Recent initiatives to create metadata for the Internet could considerably facilitate the task of automatically categorising interactive fields. Though it is perhaps too optimistic to expect element sets for interactive forms that specify the sought individual categories exactly, it already helps to know the context of the user dialog. For instance, if a specific site is known to belong to an on-line shop, a dialog can be expected to be a product ordering form rather than one for on-line voting on governmental petitions. However, this requires developments such as RDF (Resource Description Framework)29 using XML (Extensible Markup Language)30 to become widely accepted and prevalent.


Footnotes

... RDF (Resource Description Framework)29
http://www.w3.org/RDF/
... XML (Extensible Markup Language)30
http://www.w3.org/XML/

Andreas Aschenbrenner