
   
A map of the world

Based on the document representation outlined above, we first trained a $10 \times 10$ self-organizing map to represent the contents of the document archive. Figure 3 gives a graphical representation of the training result. To make the rows of units easier to identify in the graphical representation, we separated them by horizontal lines. Each unit is marked either with the names of one or more countries (or regions) or with a dot. The name of a country appears on a unit if that unit serves as the winner for that country (or, more precisely, for the input vector representing that country). A dot appears if the unit is never selected as the winner for any document.
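As a rough illustration of the procedure just described, the following Python sketch trains a small self-organizing map and then labels each unit with the documents it wins, or with a dot if it wins none. It is not the implementation used for the experiments reported here: the function names (`train_som`, `best_matching_unit`, `label_map`), the linearly decaying learning rate, the Gaussian neighborhood function, and the toy four-dimensional vectors are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for (r, c), w in weights.items():
        d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if d < best_dist:
            best, best_dist = (r, c), d
    return best


def train_som(data, rows=10, cols=10, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map on a list of equal-length vectors.

    The decay schedules below are assumptions chosen for simplicity.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighborhood
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    # Move the unit towards x, scaled by its grid distance
                    # to the winner.
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


def label_map(weights, documents, rows=10, cols=10):
    """Mark each unit with the names of the documents it wins, or '.'."""
    labels = {(r, c): [] for r in range(rows) for c in range(cols)}
    for name, x in documents.items():
        labels[best_matching_unit(weights, x)].append(name)
    return [[", ".join(labels[(r, c)]) or "." for c in range(cols)]
            for r in range(rows)]


# Toy stand-ins for document vectors (hypothetical names and values).
documents = {
    "island_a": [0.9, 0.1, 0.0, 0.1],
    "island_b": [0.8, 0.2, 0.1, 0.0],
    "europe_a": [0.1, 0.9, 0.2, 0.1],
    "europe_b": [0.2, 0.8, 0.1, 0.2],
    "africa_a": [0.0, 0.1, 0.9, 0.8],
    "africa_b": [0.1, 0.0, 0.8, 0.9],
}
weights = train_som(list(documents.values()))
for row in label_map(weights, documents):
    print(" | ".join(row))
```

With only six toy documents on 100 units, most cells print as dots; the point of the sketch is the winner-based labeling, not the quality of the resulting layout.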


  
Figure 3: $10 \times 10$ map of the world

Figure 3 shows that the self-organizing map was quite successful in arranging the input data according to their mutual similarity. In general, countries belonging to the same geographical region are rather similar with respect to the categories described in the CIA World Factbook, and these geographical regions can be found in the two-dimensional map display as well. To ease the interpretation of the training result, we have marked several regions manually. For example, the area on the left-hand side of the map is allocated to documents describing various islands; the CIA World Factbook contains a large number of island descriptions. It is interesting to note that the descriptions of the oceans can be found in a map region neighboring the area of islands.

In the lower center of the map we find the European countries. The cluster representing these countries is further decomposed into a cluster of small countries, e.g. San Marino and Liechtenstein, a cluster of Western European countries, and finally a cluster of Eastern European countries. The latter cluster is represented by a single unit in the last row of the output space. The neighbors of this unit represent other countries that are usually regarded as belonging to the Communist hemisphere, e.g. Cuba, North Korea, Albania, and the Soviet Union. At this point it is important to recall that our document archive is the 1990 edition of the CIA World Factbook. Thus, the descriptions refer to a time before the ``fall'' of the Communist hemisphere.

Other clusters of interest are the region containing Latin American countries (lower right of the map), the cluster containing Arab countries (middle right of the map), and the cluster of African countries (upper right of the map).

Overall, the representation of the document space is highly successful in that similar documents are located close to one another, which makes it easy to find one's way around this document space. The drawback, however, is that every document is represented on the very same map. Since the self-organizing map represents a very high-dimensional data space (959 index terms) within a two-dimensional display, it is only natural that some information gets lost in the mapping process. As a consequence, it is rather difficult to identify the various clusters. Imagine Figure 3 without the dashed lines indicating cluster boundaries: without them, it is only possible to identify, say, the African countries when prior information about the document collection is available.
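One common way to put a number on part of this information loss is the mean quantization error, i.e. the average distance between each input vector and the weight vector of its winning unit. The text itself reports no such figure; the sketch below only shows how the measure could be computed, and the function name `quantization_error` and the two-dimensional toy values are assumptions made for the example.

```python
import math


def quantization_error(units, vectors):
    """Mean Euclidean distance from each vector to its closest unit."""
    total = 0.0
    for x in vectors:
        total += min(
            math.sqrt(sum((wi - xi) ** 2 for wi, xi in zip(w, x)))
            for w in units
        )
    return total / len(vectors)


# Toy example (hypothetical values): two units, three input vectors.
units = [[0.0, 0.0], [1.0, 1.0]]
vectors = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(quantization_error(units, vectors))
```

A value of zero would mean that every input coincides with a unit's weight vector; larger values indicate that the units summarize the inputs more coarsely. The measure says nothing about how well neighborhood relations are preserved on the grid, which is a separate aspect of the loss.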


Andreas RAUBER
1998-09-10