Next: Categorisation Up: Comparison of Forms Previous: Comparison of Forms   Contents


Comparison-Algorithm

Every field can be assigned to one of n manually defined categories. Every form can therefore be represented as a vector of n dimensions, where each dimension corresponds to one specific category. This key-vector has a value of 1.0 in the i-th dimension if the form contains a field of category i, and 0.0 otherwise.
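The construction of a key-vector can be sketched in a few lines of Python. This is only an illustration: the category names in `CATEGORIES` and the function name `key_vector` are hypothetical, and the sketch assumes that several fields of the same category still yield a value of 1.0.

```python
# Hypothetical category list; the real categories are defined manually.
CATEGORIES = ["name", "email", "password", "search"]


def key_vector(field_categories, categories=CATEGORIES):
    """Build the key-vector of a form.

    field_categories: the category of every field found in the form.
    Returns a list with 1.0 in dimension i if the form contains a field
    of category i, and 0.0 otherwise (assumption: repeated categories
    are not counted more than once).
    """
    present = set(field_categories)
    return [1.0 if category in present else 0.0 for category in categories]


# A form with one "name" field and one "password" field.
login_form = ["name", "password"]
print(key_vector(login_form))  # [1.0, 0.0, 1.0, 0.0]
```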

Since the categories are predefined, some fields cannot be put into any of them, either because they are not recognised or because they simply do not fit. A special undefined category is introduced for those fields. Hidden fields that cannot be assigned to a category are also put into this undefined category.
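The fallback to the undefined category might look as follows. The lookup table `KNOWN_FIELDS`, the field names in it, and the function `categorise` are purely hypothetical stand-ins for whatever categorisation rules are actually used; the sketch only shows that every unrecognised field ends up in one shared category.

```python
UNDEFINED = "undefined"

# Hypothetical mapping from field names to manually defined categories.
KNOWN_FIELDS = {
    "user": "name",
    "login": "name",
    "mail": "email",
    "pass": "password",
    "q": "search",
}


def categorise(field_name, known=KNOWN_FIELDS):
    """Return the category of a field, or UNDEFINED if none matches."""
    return known.get(field_name.lower(), UNDEFINED)


fields = ["user", "pass", "sessiontoken"]
print([categorise(f) for f in fields])  # ['name', 'password', 'undefined']
```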

Once the forms have been transformed into vectors, two forms can be compared by comparing their respective key-vectors. The undefined category is not contained in this vector. The normalised dot product of the two key-vectors (the cosine of the angle between them) is used as the index of distance between two forms:

\[
\hat{D}_{AB} = \frac{\vec{a} \cdot \vec{b}}{\vert\vec{a}\vert \, \vert\vec{b}\vert}
\]
where $\vec{a}$ is the key-vector of form A and $\vec{b}$ is the key-vector of form B.
The closer the index of distance $\hat{D}_{AB}$ is to 1.0, the more similar the two forms are. Conversely, if the index is 0.0, the forms have no category in common. Since the key-vectors contain no negative entries, the index always lies between 0.0 and 1.0.
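A minimal sketch of this comparison in Python is given below. The function name `distance_index` and the example vectors are illustrative. The text does not say what happens when a form has no categorised field at all (a key-vector of length zero); returning 0.0 in that case is an assumption made here to avoid a division by zero.

```python
import math


def distance_index(a, b):
    """Normalised dot product of two key-vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("key-vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a form without any categorised field matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


form_a = [1.0, 0.0, 1.0, 0.0]
form_b = [1.0, 1.0, 1.0, 0.0]
form_c = [0.0, 0.0, 0.0, 1.0]

print(distance_index(form_a, form_a))  # close to 1.0: identical forms
print(distance_index(form_a, form_b))  # about 0.816: two shared categories
print(distance_index(form_a, form_c))  # 0.0: no category in common
```

Because of floating-point rounding, the index of a form compared with itself may differ from 1.0 in the last digit, so comparisons against 1.0 are better done with a tolerance such as `math.isclose`.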


Andreas Aschenbrenner