Department of Software Technology
Vienna University of Technology

The SOMLib Digital Library - Experiments - Russian Information Agency Nowosti (RIAN)

Overview

For the experiments described below we use a collection of news articles from the Russian Information Agency Nowosti (RIAN).

On this page:

Data
Text Representation
Growing Hierarchical Self-Organizing Maps with Labels for Individual Languages
Multilingual Topic Hierarchy using Automatic Translation

Data

The agency provides its news articles in several languages, including Russian, English, French, German, and Arabic, which makes it an ideal setting for multilingual experiments using the SOMLib system. For our experiments we use the subset of articles published between March 1st and March 14th, 2001, in Russian (Latin encoding), English, French, and German. Not all articles are available in all languages, so the collection forms a non-parallel corpus. For the two-week period, the most extensive set is the Russian one (1387 articles), followed by the English (973 documents), French (712 documents), and German (485 documents) sets. All HTML tags were removed from the articles to obtain text-only representations suitable for content analysis by the SOMLib system.
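The tag removal step can be pictured with a minimal sketch like the following. It is not the tool used for the experiments, which is not named here; the class `TextExtractor`, the function `strip_html`, and the sample markup are hypothetical, and only the Python standard library is assumed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(markup):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical sample article, for illustration only.
sample = "<html><body><h1>Moscow</h1><p>Talks &amp; <b>trade</b> resumed.</p></body></html>"
print(strip_html(sample))
```

Joining the text nodes with a space keeps words from adjacent elements apart, at the cost of occasionally splitting a word that is only partly wrapped in an inline tag.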

Text Representation

To be used for map training, a vector-space representation of the individual documents is created by full-text indexing. For each document collection, a list of all words appearing in the respective collection is extracted while applying some basic word stemming techniques. Words that do not contribute to describing the content are removed from these lists. Instead of defining language- or content-specific stop word lists, we discard terms that appear in more than 10% or in less than 0.7% of the documents. We thus end up with vector dimensionalities of 2869, 2163, 2271, and 1979 for the Russian, English, French, and German sets, respectively.
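The frequency-based term selection can be sketched as follows. This is an illustration under assumptions, not the SOMLib indexer: the function name `select_index_terms` is hypothetical, the thresholds are read as shares of the documents in a collection, and the toy corpus uses much coarser thresholds than the 10% and 0.7% above because it has only five documents.

```python
from collections import Counter


def select_index_terms(documents, min_df=0.007, max_df=0.10):
    """Split the vocabulary into kept terms and terms removed as too frequent or too rare.

    documents: list of token lists, one per document (assumed to be stemmed already).
    min_df, max_df: lower and upper document-frequency shares.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    kept, removed_high, removed_low = [], [], []
    for term in sorted(doc_freq):
        share = doc_freq[term] / n_docs
        if share > max_df:
            removed_high.append(term)  # appears in too many documents
        elif share < min_df:
            removed_low.append(term)   # appears in too few documents
        else:
            kept.append(term)
    return kept, removed_high, removed_low


# Toy corpus, for illustration only.
docs = [
    ["duma", "budget", "vote"],
    ["duma", "space", "mir"],
    ["duma", "budget", "tax"],
    ["duma", "mir", "station"],
    ["duma", "chechnya"],
]
kept, removed_high, removed_low = select_index_terms(docs, min_df=0.3, max_df=0.7)
print(kept)          # terms forming the template vector
print(removed_high)  # terms removed as too frequent
print(removed_low)   # terms removed as too rare
```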
The individual documents are then represented by feature vectors using a tf x idf (term frequency times inverse document frequency) weighting scheme as described by Salton. This weighting scheme assigns high values to terms that occur frequently within a document but in only few documents of the collection, i.e. terms that are important for describing a document and discriminating it from others.
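A minimal sketch of such a weighting is given below. The exact tf x idf variant used for the experiments is not stated here, so the common form tf * log(N / df) is an assumption, as are the function name `tfidf_vectors` and the toy data.

```python
import math
from collections import Counter


def tfidf_vectors(documents, template):
    """Build one tf x idf weighted vector per document over the given template terms.

    Assumes weight = tf * log(N / df), with N the number of documents and
    df the number of documents containing the term.
    """
    n_docs = len(documents)
    template_set = set(template)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens) & template_set)

    idf = {
        term: math.log(n_docs / doc_freq[term]) if doc_freq[term] else 0.0
        for term in template
    }

    vectors = []
    for tokens in documents:
        tf = Counter(tokens)
        vectors.append([tf[term] * idf[term] for term in template])
    return vectors


# Toy data, for illustration only.
template = ["budget", "mir", "tax"]
docs = [
    ["budget", "budget", "mir"],
    ["budget", "tax"],
    ["budget", "station"],
    ["tax"],
]
vectors = tfidf_vectors(docs, template)
print(vectors[0])
```

In the first toy document, "mir" occurs once and "budget" twice, yet "mir" receives the higher weight because it appears in only one of the four documents.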
For each of the four languages used in the experiments, the listing below provides the template vector, i.e. the list of words used for document representation, the list of removed "stop words", and the feature vectors used for training the maps.

Russian:
- Template Vector: List of the 2869 words, i.e. index terms, used for representing the content of the documents
  plain text (106 KB), gnu-zipped (34 KB)
- Removed Words: List of words removed because they appear either too frequently, i.e. in more than 10% of the documents, or too rarely, i.e. in less than 0.7%, indicated by a High or Low tag in the file.
  plain text (456 KB), gnu-zipped (156 KB)
- Input Vectors: List of vectors used for training the SOMs, i.e. a list of 1387 vectors of dimensionality 2869, weighted by the tf x idf weighting scheme.
  gnu-zipped (298 KB); normalized to unit length, gnu-zipped (477 KB)
English:
- Template Vector: List of the 2163 words, i.e. index terms, used for representing the content of the documents
  plain text (74 KB), gnu-zipped (25 KB)
- Removed Words: List of words removed because they appear either too frequently, i.e. in more than 10% of the documents, or too rarely, i.e. in less than 0.7%, indicated by a High or Low tag in the file.
  plain text (126 KB), gnu-zipped (46 KB)
- Input Vectors: List of vectors used for training the SOMs, i.e. a list of 973 vectors of dimensionality 2163, weighted by the tf x idf weighting scheme.
  gnu-zipped (170 KB); normalized to unit length, gnu-zipped (284 KB)
French:
- Template Vector: List of the 2271 words, i.e. index terms, used for representing the content of the documents
  plain text (76 KB), gnu-zipped (24 KB)
- Removed Words: List of words removed because they appear either too frequently, i.e. in more than 10% of the documents, or too rarely, i.e. in less than 0.7%, indicated by a High or Low tag in the file.
  plain text (117 KB), gnu-zipped (40 KB)
- Input Vectors: List of vectors used for training the SOMs, i.e. a list of 712 vectors of dimensionality 2271, weighted by the tf x idf weighting scheme.
  gnu-zipped (125 KB); normalized to unit length, gnu-zipped (206 KB)
German:
- Template Vector: List of the 1979 words, i.e. index terms, used for representing the content of the documents
  plain text (62 KB), gnu-zipped (20 KB)
- Removed Words: List of words removed because they appear either too frequently, i.e. in more than 10% of the documents, or too rarely, i.e. in less than 0.7%, indicated by a High or Low tag in the file.
  plain text (136 KB), gnu-zipped (45 KB)
- Input Vectors: List of vectors used for training the SOMs, i.e. a list of 485 vectors of dimensionality 1979, weighted by the tf x idf weighting scheme.
  gnu-zipped (71 KB); normalized to unit length, gnu-zipped (119 KB)
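
The input vectors listed above are offered both as raw tf x idf vectors and normalized to unit length. A minimal sketch of that normalization, assuming the Euclidean norm is meant (the function name `normalize_to_unit_length` is hypothetical):

```python
import math


def normalize_to_unit_length(vector):
    """Scale a vector so that its Euclidean length is 1; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Toy vector, for illustration only.
unit = normalize_to_unit_length([3.0, 0.0, 4.0])
print(unit)
```

After normalization, long and short articles with the same relative term distribution map to the same vector, so document length no longer influences the distances computed during map training.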

Trained Growing Hierarchical Self-Organizing Maps with Labels for Individual Languages:

With document collections of this size, a single flat SOM no longer offers a convenient interface, as the resulting maps would become too large. We thus use our new Growing Hierarchical Self-Organizing Map (GHSOM) model to create a hierarchical representation of the document archive. Based on the document description outlined above, we trained a separate GHSOM for each language set to represent the contents of the document archive. Using the LabelSOM method, characteristic keywords describing the various topical clusters were automatically extracted from the trained maps.
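
As a rough illustration of the idea behind a hierarchical map, and not of the GHSOM training procedure itself, which is not detailed on this page, the sketch below assigns each document vector to its best-matching unit and flags units whose mean quantization error is still large relative to the error of the whole collection. Such units would be candidates for a finer map on the next layer. The function names, the threshold `tau`, and the toy data are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def best_matching_unit(vector, units):
    """Index of the unit whose weight vector is closest to the input vector."""
    return min(range(len(units)), key=lambda i: distance(vector, units[i]))


def units_to_expand(vectors, units, tau=0.5):
    """Return indices of units whose mean quantization error exceeds tau times
    the mean error of the whole collection around its mean vector."""
    center = mean_vector(vectors)
    mqe_all = sum(distance(v, center) for v in vectors) / len(vectors)

    # Group the input vectors by their best-matching unit.
    mapped = {i: [] for i in range(len(units))}
    for v in vectors:
        mapped[best_matching_unit(v, units)].append(v)

    flagged = []
    for i, members in mapped.items():
        if not members:
            continue  # units without mapped inputs have no error to measure
        mqe_unit = sum(distance(v, units[i]) for v in members) / len(members)
        if mqe_unit > tau * mqe_all:
            flagged.append(i)
    return flagged


# Toy data: unit 0 represents its two inputs tightly, unit 1 only loosely.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 8.0]]
units = [[0.0, 0.5], [10.0, 4.0]]
print(units_to_expand(vectors, units, tau=0.5))
```

A smaller `tau` flags more units and therefore yields a deeper hierarchy, while a larger value keeps the representation flatter.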

The quality of the labels describing the topics of the various clusters can be significantly improved by using phrases rather than keywords as descriptors and by having the importance of a label reflected visually. These ideas are implemented in the following representation of the maps, which uses the KEA keyphrase extraction tool and grey shadings for the weights of the different labels. The maps are identical to the ones provided above.

English Collection with KEA labels

Multilingual View and Access to the Collection

In order to allow browsing and explorative access to a multilingual document collection, as opposed to Cross-Language Information Retrieval (CLIR), we need to provide a uniform interface to the collection in the user's target language that still provides topologically sorted access to the data. For the given experimental setting we thus automatically translated all documents into English using AltaVista's Babelfish as our translation engine. (No translation of the Russian articles could be obtained due to different encoding requirements for the Cyrillic alphabet. We thus restrict our further experiments to the English, French, and German collections.) The translated article sets were parsed together, and the resulting feature vectors were used to train a single GHSOM. This single GHSOM provides a topically organized view of the multilingual collection.

Feature Vectors (2170 documents):
- Template Vector: List of the 1959 words, i.e. index terms, used for representing the content of the documents
  plain text (74 KB), gnu-zipped (26 KB)
- Removed Words: List of words removed because they appear either too frequently, i.e. in more than 10% of the documents, or too rarely, i.e. in less than 0.7%, indicated by a High or Low tag in the file.
  plain text (224 KB), gnu-zipped (85 KB)
- Input Vectors: List of vectors used for training the SOMs, i.e. a list of 2170 vectors of dimensionality 1959, weighted by the tf x idf weighting scheme.
  gnu-zipped (373 KB); normalized to unit length, gnu-zipped (617 KB)
GHSOM Topic Hierarchy:
- Translated English Collection GHSOM with conventional LabelSOM labels
- Translated English Collection with KEA labels

Comments: rauber@ifs.tuwien.ac.at