next_group up previous


Automatically Detecting and Organizing Documents into Topic Hierarchies: A Neural Network Based Approach to Bookshelf Creation and Arrangement

Andreas Rauber, Michael Dittenbach, Dieter Merkl

Department of Software Technology, Vienna University of Technology
Favoritenstr. 9 - 11 / 188, A-1040 Wien, Austria
{andi, mbach, dieter}@ifs.tuwien.ac.at

Abstract:

With the increasing amount of information available in electronic document collections, methods for organizing these collections to allow topic-oriented browsing and orientation gain importance. The SOMLib Digital Library System provides such an organization based on the self-organizing map, a popular neural network model. In this paper, we present the GHSOM, which, based on the same concepts, allows an automatic hierarchical decomposition and organization of documents, which very intuitively reflects the organization typically found in (manually organized) conventional libraries. We present a case study based on a 3-month article collection from an Austrian daily newspaper.


Introduction

With the increasing amount of textual information available in digital libraries, methods for accessing and providing orientation in these vast information repositories have gained significant importance. While research in information retrieval provides us with sophisticated methods for selecting relevant subsets of collections, only few systems address the need for interactive navigation in topically organized libraries. Still, this way of organization is the most common and thus most natural one to users of conventional libraries.

With the SOMLib Digital Library [5] we developed a system focusing on this aspect of information access by automatically organizing documents by topic on a two-dimensional map using a popular neural network model, namely the self-organizing map (SOM) [3]. Documents covering similar topics are found in neighboring regions on the resulting SOM, which very much resembles the conventional bookshelf metaphor of document organization. While this approach allows easy orientation in an unknown text collection and facilitates the integration of distributed repositories [6], the resulting map-based metaphor poses some constraints on the size of the document collection that can feasibly be displayed on a single two-dimensional map. The more topics are present, the larger the according map has to be. Furthermore, the size of the map has to be determined in advance, which turns out to be impractical for unknown document collections.

What we would like to have is a hierarchical representation of the various topics in a document repository similar to conventional library organization of several floors, sections and shelves. Approaches like the Hierarchical Feature Map [4] provide such a hierarchical organization. Albeit, the architecture of the resulting system has to be determined in advance, i.e. it has to be specified how many layers we would like to obtain, as well as the size of the maps on each layer. This turns out to be difficult for document collections, where the precise nature and topic distribution of a collection is unknown. Furthermore, some topics are present more prominently than others, thus requiring more map space for representation. Various approaches address the problem of having to define the size of a SOM in advance, such as the Growing Grid [2], where new units are added to map areas where the data, i.e. the topics, are not represented at a satisfying degree of granularity. However, since all of these methods rely on the creation of a single flat map, they tend to produce huge maps limiting their usability for the display of and navigation in document libraries.

To overcome these limitations we developed the Growing Hierarchical SOM ( GHSOM), which automatically creates a hierarchical organization of a set of documents. This allows the network architecture to determine the topical structure of the given document repository during the training process, creating a hierarchy of self-organizing maps, each of which provides a topologically sorted representation of a topical subset. Starting from a rather small high-level SOM, which provides a coarse overview of the various topics present in the collection, subsequent layers are added where necessary to display a finer subdivision of topics. Each map in turn grows in size until it represents its topic to a sufficient degree of granularity. Since usually not all topics are present equally strong in a collection, this leads to an unbalanced hierarchy, assigning more `map-space' to topics that are more prominent in a given collection. This allows the user to approach and intuitively browse a document collection in a way similar to conventional libraries.


Creating Topic Hierarchies of Shelves

Text documents can be thought of as forming topical clusters in the high-dimensional feature space spanned by the individual words in the documents. The SOM is an unsupervised neural network model that provides a topology-preserving mapping from such a high-dimensional input space onto a 2-dimensional output space. A trained SOM thus represents a topical ordering of the documents, i.e. documents on similar topics are located close to each other on the map. This is comparable to what one can expect from a conventional library, where we also find books on similar topics located next to each other. Thus, the SOM offers by its very architecture an ideal way for the organization of document repositories.

Contrary to standard SOM training, the GHSOM starts by initially training a small $ 2 \times 2$ SOM. With only 4 units to represent the wealth of topics in a collection, the so-called quantization error of a map (QE), a measure for the quality of a SOM, is rather high. Thus, additional units are added to the SOM to allow a finer representation of the various topics, until they are represented at a desired level of granularity. This is indicated by a threshold parameter $\tau_1$ as a fraction of the initial QE. Next, a separate SOM is added for each unit of this first level SOM, which is used to represent the documents mapped onto the respective unit of the higher level map in more detail. Again, this map grows until it has a QE smaller than the fraction $\tau_1$ of the unit's quantization error (qe) it derives from. The training process continues this way until the absolute granularity of data representation provided by a unit's qe exceeds a given threshold $\tau_2$, provided again as a fraction of the initial QE. (A more detailed analysis of the architecture and training process of this network model can be found in [1].) We thus end up with a hierarchy of individual maps, each representing a specific topic at a given level of granularity, allowing straight-forward browsing of and orientation in a document repository.


Walking through a Newspaper Archive


Figure 1: Topic hierarchy on the Balkan War
\begin{figure}
\begin{center}
\vspace{-5mm}
\epsfxsize =100mm %60
\epsffile {pathges_mag3.eps}\end{center}\vspace{-5mm}
\end{figure}

In the experiments presented hereafter we use an article collection of the Austrian daily newspaper Der Standard covering the second quarter of 1999. The 11.627 articles are represented by automatically extracted 1104-dimensional feature vectors of word histograms weighted by a $tf \times idf$ weighting scheme.

Since it is impossible to present the complete topic hierarchy of three months of news articles, we will concentrate on a few sample topical sections. The top-level map of the GHSOM, which evolved to the size of $3 \times 4$ units during training by adding two rows and one column, is depicted on the left in Figure 1. Here, we find an overview of the main subject areas such as the war on the Balkan on unit (1/1) (we refer to a unit located in row $x$ and column $y$ as unit (x/y), starting with (1/1) in the upper left corner), automatically labelled with Kosovo and NATO. Austrian internal politics (1/2), economy (3/1), and the European Union (2/3) are other examples of main subjects. Cultural topics are located on three units in the lower left corner, representing sports, fashion and theater, respectively.

If we are interested in the Balkan War subject, we discover a more detailed map in the next layer (Fig. 1 middle), where every unit represents a specific sub-topic of this subject. The top-row sections concentrate on the general political and social situation, while unit (3/1) deals with the situation of refugees. Slobodan Milosevic is the main topic of the sections mapped onto the units (2/3), (3/2) and (3/3) in the bottom right corner of the map, whereas unit (2/2) covers political negotiations to resolve the crisis and Russia's role in the Balkan War.

One layer down the hierarchy we reach the actual `book-shelves', which, for the previously mentioned section on negotiations, represents in the top left corner unit (1/1) articles about Russia's inclusion into the so-called G-8 meetings on a safe corridor for expelled people, followed by reports on the consequences of the G-8 resolutions and comments by the German Foreign Minister Fischer on unit (2/2).

Similar topic organization can be found in the other areas of the GHSOM hierarchy with, e.g., the section on the European Union being composed of subjects such as the relationship between the EU and the USA on imports of US beef or penal taxes, or Austria-specific EU topics such as transit regulations or the agriculture subsidies within the Agenda 2000 pact.



Conclusions


The SOMLib Digital Library System has shown to be a very intuitive system by providing a topical organization of documents similar to conventional libraries and including a representation based on metaphor graphics. With the GHSOM, a dynamically growing hierarchical extension to the basic SOM model, we can now provide a hierarchical organization of topics, where each topic is assigned the required map space according to its prominence in a document collection. The resulting organization allows for an even more intuitive navigation and orientation within an unknown document collection. It allows users to find out, which topics are present, and where to find specific documents, putting the emphasis on browsing rather than mere retrieval, while retrieval functionality obviously can be integrated the same way as for the standard SOMLib system.

Bibliography


1
M. Dittenbach, D. Merkl, and A. Rauber.
Using growing hierarchical self-organizing maps for document classification.
In Proc. European Symp. on Artificial Neural Networks (ESANN00), Bruges, Belgium, 2000.

2
B. Fritzke.
Growing grid - a self-organizing network with constant neighborhood range and adaption strength.
Neural Processing Letters, 2, No. 5:1 - 5, 1995.

3
T. Kohonen.
Self-Organizing Maps.
Springer Verlag, Berlin, Germany, 1995.

4
R. Miikkulainen.
Script recognition with hierarchical feature maps.
Connection Science, 2:83 - 101, 1990.

5
A. Rauber and D. Merkl.
The SOMLib Digital Library System.
In Proc. Europ. Conf. on Research and Advanced Technology for Digital Libraries (ECDL99), Paris, France, 1999. LNCS, Springer Verlag.

6
A. Rauber and D. Merkl.
Providing topically sorted access to subsequently released newspaper editions or: How to build your private digital library.
In Proc. 11th Int'l Conf. on Database and Expert Systems Applications (DEXA00), Greenwich, UK, 2000.

.
Andreas Rauber, Michael Dittenbach, Dieter Merkl: Automatically Detecting and Organizing Documents into Topic Hierarchies: A Neural-Network Based Approach to Bookshelf Creation and Arrangement.
In: Proceedings of the 4. European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2000), Sept. 18. - 20. 2000, Lisboa, Portugal, Lecture Notes in Computer Science LNCS 1923, pp.348-351, Springer, Germany.

About this document ...

Automatically Detecting and Organizing Documents into Topic Hierarchies: A Neural Network Based Approach to Bookshelf Creation and Arrangement

This document was generated using the LaTeX2HTML translator Version 99.1 release (March 30, 1999)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 rau_ecdl00.tex

The translation was initiated by Andreas RAUBER on 2000-10-04


next_group up previous
Andreas RAUBER
2000-10-04