
Introduction

Today's information age is characterized by the constant, massive production and dissemination of written information. Powerful tools for exploring, searching, and organizing this mass of information are needed. Exploration in particular has received only limited attention. Current information retrieval technology still relies on systems that retrieve documents based on the similarity between keyword-based document and query representations.
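To make the notion of keyword-based retrieval concrete, the following is a minimal sketch of ranking documents by their similarity to a query. It assumes a plain bag-of-words representation and cosine similarity; both are illustrative choices, and the function names `keyword_vector`, `cosine_similarity`, and `retrieve` are hypothetical rather than taken from any particular system.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Represent a text as a bag of lowercase keyword counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse keyword-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def retrieve(query, documents):
    """Return (score, document) pairs, most similar document first."""
    q = keyword_vector(query)
    scored = [(cosine_similarity(q, keyword_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Such a system answers a specific query well, but it offers no overview of the collection as a whole, which is the gap that exploration tools are meant to fill.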

The map metaphor for displaying the contents of a document library in a two-dimensional map has gained some interest [3,4,5,6]. Maps are used to visualize the similarity between documents in terms of distances within the two-dimensional map display. Hence, similar documents may be found in neighboring regions of the map display.

The map metaphor, however, has its limitations in that each document is represented within a single two-dimensional map. Since the documents are described in a very high-dimensional feature space, the two-dimensional map representation is necessarily imprecise. In much the same way as an atlas shows the world on different pages, where each page contains a map of some portion of the world at some specific resolution, we suggest using a kind of atlas for document space representation. A page of this atlas shows a portion of the library at some resolution while omitting other parts of the library. As long as general maps that provide an overview of the whole library are available, the user can find her way through the library by choosing the maps that provide the most detailed view of the area of particular interest.

A comparison with traditional document archives reveals that these archives are usually organized into hierarchies according to the subject matter of the various documents. This observation has stimulated research in the direction of using hierarchical clustering techniques based on statistical cluster analysis. The specific strengths and weaknesses of these approaches are well explored [9,10].

In this paper we argue in favor of establishing a hierarchical organization of the document space based on an unsupervised neural network. More precisely, we show the effects of using the hierarchical feature map [8] for text archive organization. The distinguishing feature of this model is its layered architecture, where each layer consists of a number of independent self-organizing maps [1]. The training process results in a hierarchical arrangement of the document collection: self-organizing maps from higher layers of the hierarchy represent the overall organizational principles of the document archive, while maps from lower layers provide fine-grained distinctions between individual documents. Such an organization thus comes close to what we would usually expect from conventional libraries. An important benefit of the unsupervised training process is that the library organization is derived solely from the document representation. No semantic labeling, such as labels of subject matter, is necessary.
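The layered arrangement described above can be pictured with a small data-structure sketch. It assumes that each unit of a higher-layer map points to one map in the next layer, and that a document is placed by following the best-matching unit (smallest squared Euclidean distance) from the top layer downwards. The class names `Unit` and `SomMap` and the function `locate` are hypothetical, the weight vectors are taken as given, and training is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    weights: List[float]               # weight vector of this map unit
    child: Optional["SomMap"] = None   # assumed: one lower-layer map per unit


@dataclass
class SomMap:
    units: List[Unit]

    def best_matching_unit(self, x):
        """Index of the unit whose weight vector is closest to x."""
        def sq_dist(unit):
            return sum((w - v) ** 2 for w, v in zip(unit.weights, x))
        return min(range(len(self.units)), key=lambda i: sq_dist(self.units[i]))


def locate(root, x):
    """Path of best-matching unit indices from the top layer to the lowest."""
    path = []
    current = root
    while current is not None:
        i = current.best_matching_unit(x)
        path.append(i)
        current = current.units[i].child  # descend into that unit's map, if any
    return path
```

In this picture, the first index of the returned path selects a coarse region of the library on the top-layer map, and each further index refines the position on a more detailed map, much like turning from an overview page of an atlas to a detail page.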

The remainder of this paper is organized as follows. In Section 2 we give a brief description of the architectures and the training rules of the neural networks used in this study. Section 3 is dedicated to a description of the text documents that constitute our experimental document library. Sections 4 and 5 provide the experimental results from document classification. The former describes the results from using the self-organizing map, i.e. library organization according to the map metaphor. The latter gives results from using the hierarchical feature map, i.e. library organization according to the atlas metaphor. Finally, in Section 6 we give some conclusions.


Andreas RAUBER
1998-09-10