
Application Scenario

As a digital library testbed we use the 1990 edition of the CIA Worldfactbook, a text collection describing 245 countries, regions and islands of the world by their geographical, economic and political characteristics. To serve as input for training a SOM, each document is described by a feature vector based on a word histogram of the complete text corpus, weighted by a simple $tf \times idf$ weighting scheme [7], i.e. term frequency times inverse document frequency. The indexing process identified 1056 unique terms, excluding terms that appear in fewer than 15 or more than 220 documents, since these do not contribute to content differentiation. This leaves us with an input data set consisting of 245 feature vectors with 1056 elements each.
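The indexing step can be illustrated with a minimal sketch. It assumes the common $tf \cdot \log(N/df)$ form of the weighting, which the text does not spell out, and the function name `tfidf_vectors`, its parameters and the toy documents are hypothetical. For the collection described above the thresholds would be `min_df=15` and `max_df=220`.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df, max_df):
    """docs: list of token lists. Returns (terms, vectors)."""
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Keep only terms whose document frequency lies in [min_df, max_df].
    terms = sorted(t for t, n in df.items() if min_df <= n <= max_df)
    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in terms])
    return terms, vectors

# Toy corpus (hypothetical); "economy" occurs in every document and is dropped.
docs = [
    "coast island fishing economy".split(),
    "coast desert oil economy".split(),
    "mountain island fishing economy".split(),
    "mountain desert oil economy oil".split(),
]
terms, vectors = tfidf_vectors(docs, min_df=2, max_df=3)
```

Each document ends up as one vector with one element per retained term, which is the form needed as SOM input.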

To create a number of first-level SOMs, the whole set of 245 documents was randomly split into 5 sets of 50 documents each, i.e. 5 documents are represented twice, in different sets. Next, we independently trained 5 maps consisting of $7 \times 7$ nodes, one on each of these sets. Due to space limitations we can only present two of the 5 first-level maps in Figure 1. The remaining 3 maps, however, look quite similar in terms of their overall organization. Each field in the maps represents a node labeled with the names of the countries for which it is the best-matching representative, i.e. the winner. Units that were not the winner for any country appear as empty fields.
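The arithmetic of the split is $5 \times 50 = 250 = 245 + 5$, so exactly 5 documents must occur in two sets. One way to produce such a split is sketched below. The procedure itself is an assumption, since the text only states that the split was random, and `split_with_overlap` is a hypothetical helper name.

```python
import random
from collections import Counter

def split_with_overlap(doc_ids, n_parts, part_size, seed=0):
    """Randomly split doc_ids into n_parts sets of part_size documents each.

    Every document appears at least once; the surplus slots are filled with
    documents that already belong to a different set.
    """
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    # Deal the shuffled documents out evenly (sizes differ by at most one).
    parts = [ids[i::n_parts] for i in range(n_parts)]
    duplicated = set()
    for part in parts:
        if len(part) > part_size:
            raise ValueError("part_size too small to hold every document")
        members = set(part)
        # Candidates: documents from other sets that were not duplicated yet.
        candidates = [d for d in ids if d not in members and d not in duplicated]
        extras = rng.sample(candidates, part_size - len(part))
        part.extend(extras)
        duplicated.update(extras)
    return parts

parts = split_with_overlap(range(245), n_parts=5, part_size=50)
counts = Counter(d for part in parts for d in part)
twice = sorted(d for d, c in counts.items() if c == 2)
```

Here `twice` holds the identifiers of the documents that occur in two different sets.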

**Figure 1:** First-level maps: 2 first-level maps out of a set of 5 SOMs consisting of $7 \times 7$ nodes each

Each of these 5 maps is in itself a topologically ordered mapping of the corresponding documents, which means that countries considered similar to each other in terms of the facts given in their CIA Worldfactbook descriptions are located on the same node or close to each other. In the lower left corner of the first map, for example, we find a number of nodes representing South American countries, followed to the right by an area of European and developed countries. A cluster of Asian and African countries is situated above the South American cluster. Another interesting cluster, in the upper middle of the first map, is formed by the Arctic and Antarctic oceans, followed by the Antarctic continent and a number of islands. Another European cluster can be found in the upper left part of the map on the right-hand side of Figure 1. Similar clusters may be found in all the other maps, e.g. clusters of Eastern European countries or of countries of the Arab world. Note, however, that the clustering provided by the mapping does not necessarily represent a geographical structuring of regions. Rather, the countries are organized on the map according to their overall similarity based on the descriptions in the CIA Worldfactbook.

In a second step, these 5 maps are integrated into one single SOM consisting of $7 \times 7$ nodes to represent the whole document collection. Thus we obtain a mapping of all nodes of the 5 lower-level maps onto the nodes of the higher-level SOM. The idea behind this approach is that nodes in the various lower-level SOMs that represent similar documents (e.g. the nodes representing oceans, which are distributed across 3 lower-level maps in our experimental setup) should be mapped onto one node in the higher-level SOM, i.e. we should expect one cluster for every region described in the document collection. The corresponding higher-level map is given in Figure 2. Note that the main clusters are clearly visible in the map representation due to the accumulation of country descriptions on the cluster center nodes, with the nodes of the lower-level maps being mapped at a higher level of abstraction. Thus we find a cluster of African countries in the upper left part of the map, with its center on the second node of the first row, followed by a node representing South American countries in the middle of the first row. In the upper right corner of the map we find a node representing Western European and developed countries, followed below by a node representing the former communist hemisphere. Keep in mind that the documents used in these examples were taken from the 1990 edition of the CIA Worldfactbook, prior to the 'fall' of the communist hemisphere.
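The two-level procedure can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: it assumes that the higher-level SOM is trained on the weight vectors of the lower-level nodes, and it uses a basic online SOM with a Gaussian neighbourhood and linearly decaying parameters. All function names, parameter values and the toy data (3 maps of $3 \times 3$ nodes on 3-dimensional vectors instead of 5 maps of $7 \times 7$ nodes) are hypothetical.

```python
import math
import random

def bmu(weights, x):
    """Index of the unit whose weight vector is closest to x (the winner)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_som(data, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns the weight vectors in row-major order."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = 1.0 - t / n_iter              # decays linearly towards 0
        lr = lr0 * frac                      # learning rate
        sigma = max(sigma0 * frac, 0.5)      # neighbourhood radius
        x = rng.choice(data)
        wr, wc = divmod(bmu(weights, x), cols)
        for i, w in enumerate(weights):
            r, c = divmod(i, cols)
            d2 = (r - wr) ** 2 + (c - wc) ** 2
            h = math.exp(-d2 / (2 * sigma ** 2))   # Gaussian neighbourhood
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])
    return weights

# Toy stand-in for the document vectors: two separated groups per subset.
rng = random.Random(1)

def group(center, n):
    return [[c + rng.uniform(-0.05, 0.05) for c in center] for _ in range(n)]

subsets = [group([0.1, 0.1, 0.9], 10) + group([0.9, 0.8, 0.1], 10)
           for _ in range(3)]
first_level = [train_som(s, 3, 3, seed=i) for i, s in enumerate(subsets)]

# Assumption: the weight vectors of all first-level units form the training
# set of the second-level map.
units = [w for som in first_level for w in som]
second_level = train_som(units, 3, 3, seed=99)

# Each first-level unit is assigned to its winner on the second-level map.
unit_to_top = [[bmu(second_level, w) for w in som] for som in first_level]
```

In this sketch `unit_to_top[m][u]` gives the second-level node onto which unit `u` of first-level map `m` is mapped, which corresponds to the node-to-node mapping described above.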

**Figure 2:** Second-level map: $7 \times 7$ SOM integrating the 5 SOMs of Figure 1

We further find several clusters of islands as well as a single node representing the oceans mentioned before, situated in the lower right part of the map. Note that for each of the countries that were present in the testbed twice (Austria, Comoros, Iceland, Japan and Mozambique), both copies are now mapped onto the same node. Generally, nodes in the various maps that represent highly similar or identical information are now mapped onto the same area in the higher-level representation of the text collection. Thus we find the higher-level map to form an orderly mapping of all the input data used for training the individual lower-level maps, with these maps now being integrated at a higher level of abstraction.
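How a document that occurs in two lower-level maps can end up on one higher-level node may be illustrated by a two-stage lookup: the document is first assigned to its winner on a lower-level map, and that winner's weight vector is then assigned to its own winner on the higher-level map. The sketch below assumes this composition. The weight vectors are hand-picked, hypothetical values and not trained maps, and `top_level_node` is a hypothetical helper.

```python
def nearest(weights, x):
    """Index of the weight vector closest to x (squared Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def top_level_node(doc_vec, lower_som, higher_som):
    """Map a document to a higher-level node via its lower-level winner."""
    unit = nearest(lower_som, doc_vec)
    return nearest(higher_som, lower_som[unit])

# Hypothetical, hand-picked weight vectors (two units per map, 2-D inputs).
som_a = [[0.0, 0.0], [1.0, 1.0]]
som_b = [[0.9, 1.1], [0.1, -0.1]]
top = [[0.0, 0.0], [1.0, 1.0]]

doc = [0.95, 0.9]                       # the same document, present in both sets
node_a = top_level_node(doc, som_a, top)
node_b = top_level_node(doc, som_b, top)
```

Although the winners on `som_a` and `som_b` sit at different positions and have slightly different weight vectors, both lookups arrive at the same higher-level node in this toy setting.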


Andreas RAUBER
1998-11-02