Department of Software Technology
Vienna University of Technology


The SOMLib Digital Library - Experiments - Time Magazine Article Collection

Overview

Here we present the experimental results for the SOMLib Digital Library System using the Time Magazine article collection, a set of 420 articles from the 1960s covering news from politics to social gossip. For a quick glance at the experimental results, take a look at this 10 x 15 map of the articles, which comes with a detailed description and an interactive map to explore. Step-by-step experimental results are presented in more detail below.

On this page:
  1. Data
  2. Text Representation
  3. Trained Self-Organizing Maps
  4. Integration of Distributed Self-Organizing Maps
  5. SOMs labeled with the LabelSOM method
  6. Labeled GHSOM Hierarchies
  7. libViewer Representation

1. Data

The Time Magazine article collection consists of 420 articles from TIME Magazine from the 1960s, or a total of 1,550 KB of text.

2. Text Representation

Parsing these files results in a pruned template vector of about 6000 words, depending on the sophistication of the word stemming and the degree of pruning of the full template vector.
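The pruning step can be illustrated with a small sketch. It assumes a simple lowercase tokenizer, no stemming, raw term counts without any weighting, and hypothetical document-frequency thresholds (`min_df`, `max_df_ratio`); the actual SOMLib parser and its settings are not described on this page.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())

def build_template_vector(docs, min_df=2, max_df_ratio=0.9):
    """Return the sorted terms kept after document-frequency pruning.

    min_df and max_df_ratio are illustrative thresholds, not SOMLib settings.
    """
    df = Counter()
    for text in docs:
        # Count each term once per document (document frequency).
        df.update(set(tokenize(text)))
    max_df = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= max_df)

def term_frequency_vector(text, template):
    """Represent one document as raw term counts over the template vector."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in template]

# Toy stand-ins for the article files.
docs = [
    "Khrushchev spoke in Moscow",
    "Moscow party for Khrushchev",
    "Saigon crackdown on monks",
    "Monks protest in Saigon",
]
template = build_template_vector(docs)
vectors = [term_frequency_vector(d, template) for d in docs]
```

Raising `min_df` or lowering `max_df_ratio` prunes more aggressively and shortens the template vector, which is the trade-off the paragraph above refers to.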

3. Trained Self-Organizing Maps

A 10 x 15 SOM is trained to cluster the various news articles by topic on the map. The clustering can be verified by reading the news articles located on identical or neighboring units in the map provided below. For example, all articles mapped onto the units in the lower left corner of the map deal with problems in South Vietnam, with some units representing articles on the Vietnam War and other units covering the government crackdown on Buddhist monks. As another example, consider articles T024, T096, T242 and T461, which are mapped onto one single unit in the first row of the map and which all deal with the relationship between India and Pakistan and the Kashmir conflict.
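For readers unfamiliar with the training procedure, the following is a minimal sketch of a standard online SOM with a Gaussian neighborhood, written with the standard library only. The learning-rate and neighborhood schedules, the iteration count and the toy data are assumptions made for illustration; they are not the settings used for the experiments reported here.

```python
import math
import random

def bmu(weights, x):
    """Return (row, col) of the unit whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows, cols, iterations=2000, lr0=0.5, seed=0):
    """Train a rows x cols map; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for t in range(iterations):
        decay = 1.0 - t / iterations
        lr = lr0 * decay                 # learning rate shrinks to zero
        sigma = sigma0 * decay + 0.5     # neighborhood radius shrinks to 0.5
        x = rng.choice(data)
        br, bc = bmu(weights, x)
        for r in range(rows):
            for c in range(cols):
                # Units close to the winner on the grid are adapted more strongly.
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                w = weights[r][c]
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights

# Toy data: two clearly separated "topics" in a 3-term feature space.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
som = train_som(docs, rows=3, cols=4)
units = [bmu(som, d) for d in docs]
```

Articles are then listed on the unit returned by `bmu`, so that documents with similar term vectors end up on identical or neighboring units.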

4. Integration of Distributed Self-Organizing Maps

For a second set of experiments, the Time Magazine article collection was split into 6 independent subsets of articles to simulate the successive release of various editions. Each subset was then parsed separately and used to train a single map. We again find a similar topical organization of the documents on the individual maps. Consider, for example, the first of the 6 SOMs trained with the subsets of the TIME Magazine articles. On unit (0/0) we find article T042, entitled The View from Lenin Hills, dealing with a discussion between Nikita Khrushchev and Soviet artists at the Lenin Hill Reception Palace. It lies next to article T018 - Who's in Charge Here? on unit (1/0), about the failure of Khrushchev's virgin land plan for agriculture, and T032 - Party Time on unit (0/1), on the New Year's Eve party at the Kremlin. In the opposite corner of the map, on unit (5/9), we find article T065 - Tea and Harmony, which deals with the problems of the reintegration of Kolwezi into the Congo as discussed at a meeting between officials. It lies next to three articles on unit (4/9) (T021, T048, T058, entitled The India-Rubber Man; Round 3; and Tshombe's Twilight), which provide more detailed information on the background of the Congo troubles. Other groups of documents found on this map deal, for example, with the war in Vietnam or the relations between India, Pakistan and China. We leave it to the reader to explore the other topical sections found in this and the remaining library maps.
Next, the weight vector structures of these independent maps were merged to create a uniform weight vector structure, allowing the integration of the maps by training one single SOM using the weight vectors of the individual maps as input. The integrating SOM, instead of representing the document vectors on its units, lists the units of the 6 individual SOMs, which in turn represent the corresponding articles. Again, we find the topology-preserving mapping capabilities of the SOM, as in the previous example, which was trained directly using the document description vectors. The units describing the documents T042, T032 and T018 on the Soviet Union, previously located in the upper left corner of the first map, are now mapped onto units (8/14) and (9/14) of the integrating map. Taking a look at the articles mapped onto these units, we find that they all cover topics related to the Soviet Union. For unit (9/14) these are articles T229 - Russia: A Senior Citizen; T542 - Russia: Better Things for Better Living through Chemistry; and T539 - Russia: Something for the Soil. Mapped onto unit (8/14) we find 8 units from 4 different maps representing a total of 9 documents, all dealing with the Soviet Union, surrounded by further units that represent units, and thus articles, on this topic. Other document clusters identified on the individual maps can be found as well, e.g. the cluster on the Vietnam war in the lower right corner of the map.
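The merging step can be sketched as follows. Because each subset was parsed separately, each map has its own template vector; the sketch assumes that the uniform structure is the union of all templates and that terms unknown to a map receive weight 0.0. Both are assumptions made for illustration, and all function names and data are hypothetical.

```python
def merge_templates(templates):
    """Union of all per-map vocabularies, in sorted order."""
    return sorted(set().union(*templates))

def expand_weights(weights, template, merged):
    """Re-express one unit's weight vector in the merged feature space.

    Terms unknown to the original map get weight 0.0 (an assumption).
    """
    by_term = dict(zip(template, weights))
    return [by_term.get(term, 0.0) for term in merged]

def integration_inputs(maps):
    """maps: list of (template, {unit: weight_vector}) pairs, one per SOM.

    Returns the merged template and a list of ((map_index, unit), vector)
    records that can serve as training input for the integrating SOM.
    """
    merged = merge_templates([template for template, _ in maps])
    records = []
    for index, (template, units) in enumerate(maps):
        for unit, weights in sorted(units.items()):
            records.append(((index, unit), expand_weights(weights, template, merged)))
    return merged, records

# Two tiny hypothetical maps with partly overlapping vocabularies.
map_a = (["khrushchev", "moscow"], {(0, 0): [0.9, 0.7], (0, 1): [0.1, 0.2]})
map_b = (["moscow", "saigon"], {(0, 0): [0.8, 0.0]})
merged, records = integration_inputs([map_a, map_b])
```

Each record keeps the identity of the originating map and unit, which is what allows the integrating SOM to list units of the individual maps rather than documents.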

5. Labeled Self-Organizing Maps

The trained SOMs are labeled automatically using the LabelSOM method. The labels can then be used to identify clusters within the map by identifying regions that are labeled with identical keywords. Having a set of 10 labels automatically assigned to the single nodes in the figure leaves us with a somewhat clearer picture of the underlying text archive. It allows us to understand the reasons for a certain cluster assignment and to identify overlapping topics and areas of interest within the document collection.
Below we present the results for the distributed document collection labeled with the LabelSOM method. For the first map we find unit (0/0) and neighboring units, located in the cluster of articles on the Soviet Union, labeled with, amongst others, khrushchev and nikita. The rest of the labels give more detailed information on the documents on the respective units, e.g. boss, land, committee on unit (1/0). For the second cluster discussed before, we find in the lower right corner unit (4/9) labeled kolwezi, katanga, tshombe, elisabethville, naming the locations and key players of the 3 articles on this unit.
For the integrating map representing all documents in the collection based on the distributed maps, the labels are of similar quality. In the lower right corner we find a group of units sharing labels like nikita, khrushchev, moscow, russia, clearly characterizing the Soviet Union article cluster identified before. Each unit has more detailed labels like farm, chemical on unit (9/14) or peking, chinese on unit (7/14) dealing with Russian-Chinese relationships. This marks the overlap of two clusters of documents, namely a Soviet Union cluster and a cluster with articles on China-related matters located right next to it. This arrangement of articles is typical for the topical mapping of the SOM. Again, we can use the labels shared by neighboring units to identify and separate clusters in the map.
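The idea behind the labeling can be sketched in a few lines. The sketch follows the general description of LabelSOM, selecting for each unit those terms whose values deviate least between the unit's weight vector and the vectors mapped onto it, while skipping terms that are nearly absent on the unit. The function name, the `min_weight` threshold and the toy values are assumptions for illustration, not the parameters used for the maps above.

```python
def label_unit(weight, mapped_vectors, terms, n_labels=10, min_weight=0.1):
    """Pick the terms that the vectors mapped onto a unit agree on best."""
    candidates = []
    for k, term in enumerate(terms):
        if weight[k] < min_weight:
            continue  # skip terms that are (nearly) absent on this unit
        # Accumulated per-term deviation between the unit and its inputs.
        error = sum(abs(weight[k] - x[k]) for x in mapped_vectors)
        candidates.append((error, term))
    candidates.sort()
    return [term for _, term in candidates[:n_labels]]

# Hypothetical unit with two mapped documents.
terms = ["khrushchev", "nikita", "land", "saigon"]
weight = [0.9, 0.8, 0.4, 0.0]
mapped = [[1.0, 0.8, 0.0, 0.0], [0.8, 0.8, 0.8, 0.0]]
labels = label_unit(weight, mapped, terms, n_labels=2)
```

In this toy case `land` is dropped because the two documents disagree on it, and `saigon` is dropped because it does not occur on the unit at all.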

6. Labeled GHSOM Hierarchies

Using the Growing Hierarchical SOM (GHSOM), the hierarchical structure of topics in the document archive can be detected. Below we present some results using the GHSOM to cluster the TIME Magazine article collection. The two experiments differ in the setting of the two parameters guiding the training process of the GHSOM, t1 and t2, which define the absolute granularity of data representation on the one hand and the depth/width trade-off on the other.
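This page gives no formulas for the two parameters, so the following sketch only illustrates the two kinds of decision they control, following the usual GHSOM formulation: whether a map keeps growing in width, and which units are expanded into a deeper layer. The role-based names `growth_threshold` and `expansion_threshold` and all numbers are assumptions chosen for illustration; they are not tied here to t1 or t2.

```python
def map_should_grow(unit_errors, parent_error, growth_threshold):
    """Keep adding rows or columns while the map's mean error is too large.

    parent_error is the quantization error of the unit this map refines.
    """
    mean_error = sum(unit_errors) / len(unit_errors)
    return mean_error >= growth_threshold * parent_error

def units_to_expand(unit_errors, root_error, expansion_threshold):
    """Indices of units that still represent their data too coarsely.

    root_error is the error of a single unit representing the whole
    collection; each returned unit would receive a child map.
    """
    limit = expansion_threshold * root_error
    return [i for i, e in enumerate(unit_errors) if e > limit]

# Hypothetical quantization errors of the four units of a 2 x 2 map.
errors = [4.0, 1.0, 0.5, 0.5]
grow_wide = map_should_grow(errors, parent_error=10.0, growth_threshold=0.1)
stop_wide = map_should_grow(errors, parent_error=10.0, growth_threshold=0.2)
deep_coarse = units_to_expand(errors, root_error=10.0, expansion_threshold=0.3)
deep_fine = units_to_expand(errors, root_error=10.0, expansion_threshold=0.04)
```

A small growth threshold yields large, flat maps, whereas a small expansion threshold pushes more units into further layers and thus yields a finer overall representation.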

7. libViewer Representation of the TIME Magazine Article Collection

Below we present some screenshots of the libViewer representation of the TIME Magazine article collection. As mentioned above, hardly any metadata is provided with the TIME Magazine article collection as such, apart from the filenames and the corresponding file sizes. Some articles list a date, but not all of them have this attribute. The same applies to a kind of geographical description, as some of the articles, but by no means all of them, start with a country or region name. Unfortunately, no other information concerning the section or the author is available. We thus decided to extract or ``create'' some additional metadata attributes that could easily be assumed available for a collection like this, as follows:

While additional attributes such as section information could be hand-crafted, we refrained from doing so, as the assignments would be more or less arbitrary and hard to justify apart from the topical classification performed by the SOM. Furthermore, this set of attributes suffices to create a rather nice-looking and intuitively interpretable representation of the TIME Magazine article collection. Based on this limited number of attributes, a mapping was designed to provide a graphical representation of the articles.


Table: Mapping of TIME Magazine metadata onto libViewer metaphors

  Metadata attribute   libViewer metaphor
  title                text on spine; link to file
  author               text on spine
  size                 spine width
  date                 position within shelf; dust level
  region               logo
  usage                well-thumbed flag
  description
  cluster              color
  topic                section location


The textual metaphors title, author and description are assigned as usual, allowing the user to readily compare the libViewer representation with the previous representation of the TIME Magazine article collection SOMs by comparing the document numbers. The size of the articles is again mapped onto the spine width to make articles of different length easily distinguishable. The same default mapping is performed for the well-thumbed attribute. The artificially created date attribute is mapped onto both the position within each shelf and the dust level, with older articles being pushed to the back of the shelf and having more dust settled on them. The region attribute extracted from the articles is used to set a country flag logo on the spine or, for regions, a textual logo describing the region, such as mid. East for the Middle East or SE Asia for South East Asia. The assignment of articles to the corresponding shelves derives from the SOMLib classification as presented in the preceding chapters, with the labels determined by the LabelSOM method being depicted as shelf labels. To allow convenient comparison, we furthermore added the shelf location to the set of labels, again starting with shelf id (0/0) in the upper left corner, making the shelf numbering identical to the numbering used for identifying the units in the SOM so far. Last but not least, we colored the documents according to their cluster membership. Based on the cluster identification presented with the labeling, we assigned all articles that are part of the same cluster an identical color, where the actual assignment of colors to clusters was chosen arbitrarily.
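The mapping described above can be pictured as a small function from an article's metadata to the visual attributes of its book spine. The sketch below is illustrative only: the `BookSpine` class, the field names, the pixel formula for the spine width and the linear age scale are all assumptions and do not reflect the libViewer implementation.

```python
from dataclasses import dataclass

@dataclass
class BookSpine:
    title: str    # text on spine
    width: int    # spine width, derived from the article size
    depth: float  # 0.0 = front of the shelf, 1.0 = pushed to the back
    dust: float   # 0.0 = clean, 1.0 = very dusty
    logo: str     # country flag or textual region logo
    color: str    # cluster color
    shelf: tuple  # shelf id, identical to the SOM unit

def to_spine(article, oldest, newest, cluster_colors):
    """Map one article's metadata onto libViewer-style visual attributes."""
    span = max(newest - oldest, 1)
    age = (newest - article["date"]) / span  # 0.0 = newest, 1.0 = oldest
    return BookSpine(
        title=article["title"],
        width=10 + 2 * article["size_kb"],   # hypothetical scaling
        depth=age,                           # older articles sit further back
        dust=age,                            # and carry more dust
        logo=article["region"],
        color=cluster_colors[article["cluster"]],
        shelf=article["unit"],
    )

# Hypothetical metadata record; only the article id and unit echo the text.
article = {"title": "T529", "size_kb": 2, "date": 1963,
           "region": "UK", "cluster": "britain", "unit": (9, 11)}
spine = to_spine(article, oldest=1960, newest=1963,
                 cluster_colors={"britain": "red"})
```

Keeping the mapping in one function makes it easy to swap a metaphor, for example mapping the date onto color instead of dust, and to compare the resulting visualizations.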

As can easily be seen from this discussion, different mappings are possible and might prove even better suited for a given text collection or for a specific type of usage. We will thus use the mapping defined above for the initial presentation of the TIME Magazine article collection, followed by an analysis of how changing specific mappings influences both the resulting visualization and the information gained from the differing representations.

[Figures: libViewer representation of the TIME Magazine article collection, distant view]

The figures depict the lower part of the TIME Magazine article collection SOM from the preceding chapters using the libViewer representation metaphors. Using this representation we obtain a good overview of the various topical sections in the library, as clusters of documents on identical topics are assigned the same color. For example, we find the documents on the war in Vietnam, located in the lower left corner of the library, colored yellow, and we can immediately see the amount of library space they occupy, as well as the fact that they are located next to a cluster of green documents on Africa. That these documents cover the war in Vietnam or African matters, respectively, can be told from the labels (south, viet, saigon etc.) on the bookshelves in that area.

If we move further to the right, we find the clusters of documents on Vietnam and Africa continuing up to the shelves in columns 4 and 6 respectively, where a small cluster of pink documents on Tunisia and Algeria marks the shift to the section of blue documents covering Middle East topics to the right. Please note that the actual color carries no meaning other than that documents with the same color cover the same topic. This allows users to decide whether a whole area of the bookshelf is of interest to them after scanning the labels of one of the shelves or after looking at one of the documents in that section. A somewhat different approach was chosen for the very small clusters of topics, each of which occupies only a single shelf. Coloring these documents following the same principle would result in an overload of colors. We thus decided to color grey all documents that are not part of any cluster larger than their own single shelf. Although this initially gives the impression of one large section of coherent grey documents, the metaphor turned out to be learned easily and posed no major problems to users, who merely took it for a kind of section on ``other topics''.
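The coloring rule described in this paragraph can be written down compactly. In the sketch below, the function name, the palette and the assignment of shelves to clusters are hypothetical; only the rule itself (one arbitrary color per multi-shelf cluster, grey for clusters confined to a single shelf) is taken from the text.

```python
from collections import defaultdict
from itertools import cycle

def color_clusters(shelf_clusters, palette):
    """shelf_clusters maps shelf id -> cluster name; returns shelf id -> color."""
    shelves_per_cluster = defaultdict(list)
    for shelf, cluster in shelf_clusters.items():
        shelves_per_cluster[cluster].append(shelf)
    colors = cycle(palette)
    shelf_colors = {}
    for cluster in sorted(shelves_per_cluster):
        shelves = shelves_per_cluster[cluster]
        # Clusters confined to a single shelf all share grey; larger
        # clusters take the next palette color (the choice is arbitrary).
        color = next(colors) if len(shelves) > 1 else "grey"
        for shelf in shelves:
            shelf_colors[shelf] = color
    return shelf_colors

# Illustrative shelf-to-cluster assignment, not the actual map contents.
shelf_clusters = {
    (0, 13): "vietnam", (0, 14): "vietnam",
    (9, 13): "mideast", (9, 14): "mideast",
    (3, 10): "austria",
}
shelf_colors = color_clusters(shelf_clusters, palette=["blue", "yellow"])
```

Because the palette is only consumed by clusters spanning more than one shelf, the number of distinct colors stays small even when the map contains many single-shelf topics.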

[Figure: libViewer representation of the TIME Magazine article collection, distant view]

Still further to the right we reach the lower right corner of the SOMLib library map, where the rest of the section of Middle-East documents is located, next to the red documents on the Profumo-Keeler scandal and British politics in general.

A few additional pieces of information can be noticed in the distant representation of the TIME Magazine article collection. One is the differing position of documents within the library, which is used as an indicator of their age, with older documents being pushed further to the back. For example, we can easily see that on shelf (0/13) in the lower left corner we have two newer and two older documents on the political situation in Vietnam, or that on the lower right corner unit (9/14) all documents date from the same period. Another feature visible even from the distant view is the size of the articles, with some documents having narrower spines than others, as for example on shelf (2/14). Again, information on any selected document is depicted in the status bar of the system.

[Figures: libViewer representation of the TIME Magazine article collection, close-up view]

The additional metaphors only become visible as we zoom into the library. Starting again with the units on the lower left corner of the map we now find more labels immediately available from the shelf. Furthermore, we find the flag or region indicator depicted as a logo on the spine of every document, together with the dust metaphor being a somewhat stronger indicator of a document's age than merely the shelf position. We also now have the textual information on the spine available, listing the ``general topic'' TIME Magazine 1960's plus the actual article title in the form of its filename.

[Figures: libViewer representation of the TIME Magazine article collection, close-up view]

The remaining figures depict different library sections of the map visited while walking along the shelves, such as moving up from the Vietnam section to the African section. If we instead move to the right, we arrive at a shelf containing more documents from the Vietnam cluster, whereas moving further up and to the right we eventually reach the area of the SOMLib library where the Austrian documents are located (shelf (3/10)).

[Figures: libViewer representation of the TIME Magazine article collection, close-up view]

Walking along the lower edge of the library shelves, we find the pink documents on Tunisia next to the blue section containing Middle East related documents, above which we find several shelves each representing a different topic and thus not being assigned any specific color to set them apart.

Continuing along the lowest row of shelves we arrive at the bottom right corner of the map, where we find still more documents on the Middle East, with, for example, older documents located on the lower shelf (9/14), whereas the two documents on the shelf above (9/13) are newer ones. Looking at the shelves three rows up, we find the start of a new topical section, namely red books from the British cluster with the dominant Profumo-Keeler scandal. All documents in this cluster have the British Union Jack flag assigned as country logo. On shelf (9/11) we further find a rather short, new document (T529) next to two older and longer documents (T342, T354).



Comments: rauber@ifs.tuwien.ac.at