Department of Software Technology
Vienna University of Technology


The SOMLib Digital Library - Experiments - IFS Abstracts Collection

Overview

Here we present experimental results for the SOMLib Digital Library System on the ifs abstracts collection, a very small collection of abstracts of scientific publications from our department. The collection is mainly used to test and demonstrate the features of the SOMLib library on a simple, easy-to-understand document set.

On this page:
  1. Data
  2. Text Representation
  3. Trained Self-Organizing Maps
  4. SOMs labeled with the LabelSOM method

Data

The ifs abstracts collection consists of 50 abstracts stored as ASCII text files totaling 102 KB; not all of them were used in every experiment. The 50 abstracts were randomly sampled from the department's list of publications as it was available when the document set was created.

Text Representation

Parsing these files results in a pruned template vector of about 400 to 500 words, depending on the sophistication of the word stemming and the degree of pruning applied to the full template vector. We experimented with a variety of representations of the importance of individual words. The most appropriate one in terms of content representation has proven to be the tf x idf (term frequency times inverse document frequency) representation, with the vectors usually normalized to unit length before SOM training.
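The tf x idf weighting with unit-length normalization described above can be sketched as follows. This is a minimal illustration, not the SOMLib implementation: the exact idf formula is not given on this page, so the common log(N / df) variant is assumed, and the function name tfidf_vectors and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into unit-length tf x idf vectors.

    docs is a list of token lists. Assumption: idf = log(N / df).
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[term] * math.log(n_docs / df[term]) for term in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        # Normalize to unit length; an all-zero vector is left unchanged.
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vocab, vectors


# Toy documents (hypothetical, already stemmed and pruned).
docs = [
    ["som", "cluster", "document"],
    ["som", "gait", "pattern", "gait"],
    ["software", "reuse", "process"],
]
vocab, vectors = tfidf_vectors(docs)
```

Terms that occur in every document receive a weight of zero under this variant, which is one reason such terms are often pruned from the template vector beforehand.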

Trained Self-Organizing Maps

A 7 x 7 SOM is trained with the scientific abstracts data. It is intended to provide a content-based clustering of the documents, similar to the organization of documents in a conventional library. The units are labeled with the names of the document vectors, which consist of the first 3 letters of the author's name followed by the short name of the conference or workshop at which the paper was published. Without additional knowledge of either the conferences or the authors, this representation is hard to interpret, although we might draw some conclusions about the cluster structure by taking the authors' names as indicators. Because the data collection is small, however, the resulting SOM can still be interpreted rather intuitively by reading the abstracts mapped onto the units of the map. The abstracts can be accessed by clicking on the article names on the map.
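To make the training step more concrete, the following is a minimal sketch of a standard online SOM on a 7 x 7 grid. It is not the SOMLib code: the linear decay of learning rate and neighborhood radius, the Gaussian neighborhood, all parameter values, the function names and the document names in the usage part are assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], x)),
    )


def train_som(data, rows=7, cols=7, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns a dict {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        # Present the inputs in a fresh random order each epoch.
        for x in rng.sample(data, len(data)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5    # shrinking neighborhood
            bmu = best_matching_unit(weights, x)
            for unit, w in weights.items():
                d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Toy input vectors with hypothetical document names.
data = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
names = ["aaa_conf1", "bbb_conf1", "ccc_conf2", "ddd_conf2"]
som = train_som(data)

# Label each unit with the names of the documents mapped onto it.
unit_docs = {}
for name, x in zip(names, data):
    unit_docs.setdefault(best_matching_unit(som, x), []).append(name)
```

The final loop mirrors the unit labeling described above: each document name is attached to the unit whose weight vector is closest to the document's vector.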

Labeled Self-Organizing Maps

The trained SOMs are labeled automatically using the LabelSOM method. The labels can then be used to identify clusters within the map by locating regions that are labeled with identical keywords. Having a set of 10 labels automatically assigned to the individual nodes in the figure gives us a somewhat clearer picture of the underlying text archive and allows us to understand the reasons for a certain cluster assignment as well as to identify overlapping topics and areas of interest within the document collection.

For example, in the upper left corner we find a group of nodes sharing labels like skeletal plans, clinical, guideline, patient, health, which deal with the development and representation of skeletal plans for medical applications. Another homogeneous cluster can be found in the upper right corner; it is identified by labels like gait, pattern, malfunction and deals with the analysis of human gait patterns to identify malfunctions and to support diagnosis and therapy. A set of nodes in the lower left corner of the map is identified by a group of labels containing, among others, software, process, reuse and comprises papers dealing with software process models and software reuse. This is followed by a large cluster to the right labeled with cluster, intuitive, document, archive, text, input, containing papers on cluster visualization and its application in the context of document archives. Further clusters can be identified in the center of the map, covering plan validation, quality analysis, neural networks, etc.
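The details of LabelSOM are not given on this page. As a rough sketch of how keyword labels could be selected for a single node, the following assumes that good labels are terms on which the documents mapped onto the node agree closely with the node's weight vector, and whose weight is not negligible. The function name label_unit, the threshold min_weight and the toy values are hypothetical.

```python
def label_unit(weight, mapped_vectors, vocab, n_labels=10, min_weight=0.05):
    """Pick up to n_labels terms describing one SOM node.

    weight:          the node's weight vector
    mapped_vectors:  document vectors mapped onto this node
    vocab:           term for each vector component
    """
    candidates = []
    for k, term in enumerate(vocab):
        # Skip terms that are essentially absent from this node.
        if weight[k] < min_weight:
            continue
        # Per-term quantization error over the mapped documents.
        err = sum((weight[k] - x[k]) ** 2 for x in mapped_vectors)
        candidates.append((err, term))
    # Terms with the smallest error come first.
    candidates.sort()
    return [term for _, term in candidates[:n_labels]]


# Toy node with two mapped documents (hypothetical values).
vocab = ["gait", "pattern", "software", "reuse"]
weight = [0.7, 0.6, 0.0, 0.3]
mapped = [[0.7, 0.65, 0.0, 0.0], [0.72, 0.55, 0.0, 0.6]]
labels = label_unit(weight, mapped, vocab, n_labels=2)
```

In this toy case the terms gait and pattern are selected: both have a clearly non-zero weight and nearly identical values in the two mapped documents, whereas reuse varies strongly between them and software is filtered out by the weight threshold.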

Comments: rauber@ifs.tuwien.ac.at