The agency publishes its news articles in several languages, including Russian, English, French, German, and Arabic, making it an ideal setting for multilingual experiments with the SOMLib system.
For our experiments we use the subset of articles from the period between March 1st and March 14th, 2001 in Russian (Latin encoding), English, French, and German.
Not all articles are available in all languages, so the collection forms a non-parallel corpus. For the two-week period, the most extensive set is the Russian one (1387 articles), followed by the English (973 documents), French (712 documents), and German (485 documents) collections.
All HTML tags were removed from the articles to obtain text-only representations suitable for content analysis by the SOMLib system.
To be used for map training, a vector-space representation of the individual documents is created by full-text indexing.
For each document collection, a list of all words appearing in the respective collection is extracted, applying some basic word stemming techniques.
Words that do not contribute to content description are removed from these lists.
Instead of defining language- or content-specific stop word lists, we discard terms that appear in more than 10% or in fewer than 0.7% of the documents.
We thus end up with vector dimensionalities of 2869, 2163, 2271, and 1979 for the Russian, English, French, and German sets, respectively.
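The document-frequency thresholds above can be sketched as follows. This is a minimal illustration, not the SOMLib indexer itself; the function name and the assumption that documents arrive as pre-stemmed token lists are ours.

```python
from collections import Counter

def build_vocabulary(documents, min_df=0.007, max_df=0.10):
    """Keep only terms whose document frequency lies between the thresholds.

    documents: list of token lists (already stemmed).
    Terms appearing in more than max_df or fewer than min_df of the
    documents are treated as stop words and discarded.
    """
    n_docs = len(documents)
    df = Counter()
    for tokens in documents:
        for term in set(tokens):   # count each term once per document
            df[term] += 1
    return sorted(t for t, c in df.items()
                  if min_df <= c / n_docs <= max_df)
```

Filtering by document frequency rather than by fixed stop word lists has the advantage of working uniformly across all four languages without manual per-language effort.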
The individual documents are then represented by feature vectors using a tf x idf (term frequency times inverse document frequency) weighting scheme as described by Salton. This scheme assigns high values to terms that are important for describing and discriminating between the documents.
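A minimal sketch of the tf x idf computation, assuming raw term counts for tf and log(N/df) for idf (Salton's scheme has several variants; this is one common form, chosen here for illustration):

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, vocabulary, doc_freq, n_docs):
    """tf x idf feature vector for one document.

    tf  = raw count of the term in the document,
    idf = log(N / df): terms concentrated in few documents get high
    weights, while terms occurring everywhere are downweighted to zero.
    """
    counts = Counter(doc_tokens)
    return [counts[t] * math.log(n_docs / doc_freq[t])
            for t in vocabulary]
```

A term that appears in every document has idf = log(1) = 0 and thus contributes nothing to discrimination, which is exactly the behavior the weighting scheme is designed to achieve.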
The listing below provides the template vector, i.e. the list of words used for document representation, the list of removed "stop words", and the feature vectors used for training the maps for each of the four languages used in the experiments.
Trained Growing Hierarchical Self-Organizing Maps with Labels for Individual Languages:
With document collections of this size, single, flat SOMs no longer offer a convenient interface, as the resulting maps would become too large.
We thus use our new Growing Hierarchical Self-Organizing Map (GHSOM) model to create a hierarchical representation of the document archive.
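The core of each layer in the GHSOM is a standard self-organizing map. The following sketch shows only that underlying SOM training loop; the growing and hierarchical logic of the GHSOM (adding units when quantization error is high, spawning child maps) is omitted, and all parameter values are illustrative assumptions.

```python
import numpy as np

def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal SOM training loop (the update inside one GHSOM layer).

    data: (n_samples, dim) array of tf x idf document vectors.
    Returns the (rows, cols, dim) grid of unit weight vectors.
    """
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))
    # (rows, cols, 2) array of unit coordinates on the map grid
    grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                 indexing="ij")).astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            lr = lr0 * (1 - step / n_steps)            # decaying learning rate
            sigma = sigma0 * (1 - step / n_steps) + 0.1  # shrinking radius
            # best-matching unit: closest weight vector to the input
            d = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbourhood around the BMU on the grid
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2)
                       / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
            step += 1
    return weights
```

After training, similar documents map to nearby units, which is the property the GHSOM exploits when it decides where to grow additional units or deeper maps.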
Based on the document description as outlined above, we trained a growing hierarchical self-organizing map for each language set separately to represent the contents of the document archive. Using the labelSOM method, characteristic keywords were automatically extracted from the trained maps, describing the various topical clusters.
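The idea behind labelSOM-style keyword extraction can be sketched for a single map unit as follows: prefer terms that have a high value in the unit's weight vector and a low quantization error, i.e. terms on which the documents mapped to that unit agree. The scoring formula below is an assumed combination for illustration only; the original labelSOM method uses separate thresholds on quantization error and feature value.

```python
import numpy as np

def label_unit(mapped_vectors, vocabulary, weight_vector, n_labels=3):
    """Select descriptive keywords for one map unit (illustrative sketch).

    mapped_vectors: (n_docs, dim) tf x idf vectors mapped to this unit.
    weight_vector:  (dim,) weight vector of the unit.
    """
    # per-term quantization error: disagreement between unit and documents
    qe = np.abs(mapped_vectors - weight_vector).sum(axis=0)
    # favour high-weight, low-error terms (assumed scoring, for illustration)
    score = weight_vector / (1.0 + qe)
    top = np.argsort(score)[::-1][:n_labels]
    return [vocabulary[i] for i in top]
```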
The quality of the labels describing the topics of the various clusters can be improved significantly by, on the one hand, using phrases rather than single keywords as descriptors, and, on the other hand, by reflecting the importance of a label visually. These ideas are implemented in the following representation of the maps, using the KEA keyphrase extraction tool and grey shading to indicate the weights of the different labels. The maps are identical to the ones provided above.
Multilingual View and Access to the Collection
In order to allow browsing and explorative access to a multilingual document collection - as opposed to Cross-Language Information Retrieval (CLIR) - we need to provide a uniform interface to the collection in the user's target language that still provides topologically sorted access to the data.
For the given experimental setting we thus automatically translated all documents into English using AltaVista's Babelfish as our translation engine.
(No translation of the Russian articles could be obtained due to the different encoding requirements of the Cyrillic alphabet. We therefore restrict our further experiments to the English, French, and German collections.)
The translated article sets were indexed together, and the resulting feature vectors were used to train a single GHSOM.
This single GHSOM now incorporates a topically organized view of a multilingual collection.