Department of Software Technology
Vienna University of Technology


The SOMLib Digital Library - Text Representation

Overview

To automatically organize documents by content, a numerical representation of the various documents in a collection needs to be created. This representation is obtained by full-text indexing the documents.

Template Vector

A list of all words occurring in the document collection, called the template vector, is created.
To obtain a better representation of content, words can be reduced to rough stems by removing the most common suffixes, such as '-ed', '-ing' or a plural trailing '-s'. While these primitive stemming rules may produce erroneous results for some words, this usually does not affect the content representation: a falsely removed trailing '-s' (i.e. one that is not a plural marker) is removed at every occurrence of the respective word, so all occurrences are still mapped onto a single word stem.
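
As an illustration, a primitive suffix-stripping stemmer of this kind might look as follows. This is only a Python sketch; the suffix list and the minimum stem length of three characters are illustrative assumptions, not the exact rules used in SOMLib.

def rough_stem(word):
    """Reduce a word to a rough stem by stripping common suffixes.

    The suffix list and the minimum stem length of 3 characters are
    illustrative assumptions, not the exact SOMLib rules.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(rough_stem("indexing"), rough_stem("indexed"))  # index index
print(rough_stem("corpus"))  # corpu: the falsely stripped '-s' is stripped consistently
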
Please note that during this vector creation procedure no manually constructed stop-word list is used, allowing for automatic processing of diverse document collections.

As the template vector resulting from the wordlist creation process commonly contains a very large number of words (around 100,000, depending on the document collection), this list should be pruned to a subset relevant for content-based classification of the documents.
This can be achieved by removing all words that appear in a very large fraction (say, more than 90%) of the documents, as these words do not contribute to content separation. These are mostly words commonly referred to as 'stop words', such as articles and pronouns ('the', 'is', 'are', 'he', 'she'), but also very collection-specific words such as 'computer' in a collection of documents covering only computer-related topics. Generally, however, the number of words removed by this criterion is rather small (about 10-50).
Furthermore, we can remove all words that appear in only very few documents, as they would provide only a very fine-grained content separation between the respective documents, which is not useful when we want to organize documents by their overall subject matter. By removing words appearing in fewer than, say, 3 documents, we also get rid of most spelling errors. This step commonly shrinks the word list substantially, resulting in lists of about 5,000 - 15,000 words.
The resulting word list is the pruned template vector for the respective document collection.
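
A minimal sketch of this pruning step, using the example thresholds given above (more than 90% of all documents as upper bound, fewer than 3 documents as lower bound; the function name is ours):

from collections import Counter

def prune_template_vector(documents, max_df_ratio=0.9, min_df=3):
    """Remove words that occur in too many or too few documents.

    documents is a list of token lists, one per document; the 90% upper
    bound and the 3-document lower bound are the thresholds named above.
    """
    n_docs = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # count each word at most once per document
    return sorted(word for word, count in df.items()
                  if min_df <= count <= max_df_ratio * n_docs)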

Individual Vectors

In the second step, a vector description following the template vector of the respective collection is created for each document. Basically, every document is described by the words occurring in it. In the simplest form of document representation, this can be a binary indication of whether or not a word is present, leading to a corresponding vector entry of 0 or 1. This type of representation is usually referred to as binary representation.
A more sophisticated representation can be obtained by counting the number of occurrences of each word of the template vector in every document, yielding a word-histogram representation for each document. This representation is also referred to as term frequency (tf) representation.
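
Both representations can be derived directly from the pruned template vector; the following sketch illustrates this (the function names are ours, not part of the SOMLib system):

from collections import Counter

def binary_vector(template, tokens):
    """1 if the template word occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if word in present else 0 for word in template]

def tf_vector(template, tokens):
    """Number of occurrences of each template word in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in template]
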
A variety of different measures can be used to determine the importance of every word in each document based on the term frequencies. The most common choice for doing this is the so-called term frequency times inverse document frequency (tf x idf) representation, where the importance of each word is judged with respect to the number of documents it appears in. A word is considered the more important the more often it occurs in a single document and the less often it occurs in the collection as a whole. Very frequent words that appear in almost all documents (but not in enough of them to be removed during the vector pruning process) thus receive a very low weight.
In its simplest form, the tf x idf weight of a word in a given document could thus be obtained by dividing the term frequency of the word in that document by its document frequency, i.e. the number of documents the word appears in. However, most systems today use an enhanced version of this simple tf x idf scheme, weighting each word by multiplying its term frequency with the natural logarithm of N / df, where N is the number of documents in the collection and df is the document frequency, i.e. the number of documents the respective word appears in. The weight of word i in document j is thus tf(i,j) x ln(N / df(i)).
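
A minimal sketch of this weighting, assuming the document frequencies have already been counted during template vector creation:

import math
from collections import Counter

def tfxidf_vector(template, tokens, doc_freq, n_docs):
    """Weight each template word by tf x ln(N / df).

    doc_freq maps each template word to the number of documents it
    appears in; n_docs is the total number of documents N.
    """
    counts = Counter(tokens)
    return [counts[word] * math.log(n_docs / doc_freq[word])
            for word in template]
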
Last, but not least, the feature vectors may be normalized to unit length to compensate for the different lengths of the documents in a collection.
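
For the Euclidean norm, this normalization might look as follows:

import math

def normalize(vector):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector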

Experimental Document Collections and Vector Representations

For our experiments we use a number of document archives of varying sizes. These include, amongst others, the following (some of the document collections are available on-line for download).

Comments: rauber@ifs.tuwien.ac.at