

Interpreting Fields Separately

To get a first impression of what the fields are actually for, the information extracted by the Parser is used. Thus, for each field we have its name, type, possible values, and perhaps a default value. If the name of a field is the same as the label of a category, this is a strong indication that the field belongs to that category. Since instructions on what the user should write in a field are sometimes given in the field itself, the same holds to some extent for a default value. Most of the time, however, a field's purpose is described just before or after it, as in "Enter your name here:". After all, the user has to understand the meaning of a field himself. For this reason, a portion of the text surrounding a field has to be provided by the Parser as additional information.

To complete the picture, a number of catchwords were identified that point to a specific category. As a tangible example, categories together with their catchwords, which can be defined using regular expressions, are listed in Table 5.3. The name, value, and surrounding text of a field are searched for these words. Based on the findings, a default probability for every category is assigned to each field, in isolation from the context of other fields. Consequently, we obtain for each field a list of categories and their likelihoods.
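The catchword search described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the patterns in `CATCHWORDS` are hypothetical and do not reproduce Table 5.3, and the field is assumed to arrive from the Parser as a dictionary with the keys `name`, `value`, `text_before` and `text_after`, which are names chosen here for the example.

```python
import re

# Hypothetical catchword patterns; the actual entries of Table 5.3
# are not reproduced here.
CATCHWORDS = {
    "name": re.compile(r"\b(name|surname)\b", re.IGNORECASE),
    "email": re.compile(r"\be-?mail\b", re.IGNORECASE),
    "phone": re.compile(r"\b(phone|tel)\b", re.IGNORECASE),
}

# Assumed layout of the information the Parser provides for one field.
SOURCES = ("name", "value", "text_before", "text_after")


def find_categories(field):
    """Return a dict mapping each matched category to the list of
    field attributes in which one of its catchwords was found."""
    hits = {}
    for source in SOURCES:
        text = field.get(source) or ""
        for category, pattern in CATCHWORDS.items():
            if pattern.search(text):
                hits.setdefault(category, []).append(source)
    return hits


example = {
    "name": "email",
    "value": "",
    "text_before": "Your e-mail address:",
    "text_after": "",
}
print(find_categories(example))  # {'email': ['name', 'text_before']}
```

Recording where a catchword was found, rather than only whether it was found, keeps the information needed to weight the different sources when probabilities are assigned.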

Table 5.2 indicates some conceivable values for an initial probability and the situations in which they are used. For example, if one of the catchwords described above appears in the field's name, a probability of 0.7 is given for the corresponding category. To go a bit deeper, consider a rather problematic situation: if there are two fields and the word "e-mail" between them is the only indicator for a category, we certainly cannot assign either field a hundred percent likelihood of belonging to that category. Position matters as well: if the text before a field contains the word "name" followed by a colon, it is more probable that the field is of the category "name" than if the word occurs after the field.
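A minimal sketch of how such initial probabilities might be assigned is given below. Only the value 0.7 for a catchword in the field name is taken from the text; all other weights, the colon bonus, and the choice of taking the strongest single indicator are assumptions made for illustration and are not the values of Table 5.2. The dictionary keys `name`, `value`, `text_before` and `text_after` are likewise hypothetical.

```python
import re

# Only the 0.7 for a catchword in the field name comes from the text.
# The remaining weights are illustrative placeholders.
WEIGHTS = {
    "name": 0.7,
    "value": 0.5,
    "text_before": 0.4,
    "text_after": 0.2,
}
COLON_BONUS = 0.2  # assumed bonus for "catchword:" directly before a field


def initial_probability(field, pattern):
    """Return an initial probability that `field` belongs to the
    category described by `pattern`, ignoring all other fields."""
    best = 0.0
    for source, weight in WEIGHTS.items():
        text = field.get(source) or ""
        match = pattern.search(text)
        if match is None:
            continue
        # A catchword followed by a colon in the preceding text is a
        # stronger hint than the same word anywhere else around the field.
        followed_by_colon = text[match.end():].lstrip().startswith(":")
        if source == "text_before" and followed_by_colon:
            weight += COLON_BONUS
        best = max(best, weight)  # assumption: keep the strongest indicator
    return best


name_pattern = re.compile(r"\bname\b", re.IGNORECASE)
label_before = {"name": "f1", "text_before": "Name:"}
label_after = {"name": "f1", "text_after": "name"}
print(initial_probability(label_before, name_pattern)
      > initial_probability(label_after, name_pattern))  # True
```

Because no single weight reaches 1.0, an ambiguous indicator such as a lone "e-mail" between two fields never yields certainty for either of them.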


Andreas Aschenbrenner