Andreas Rauber and Dieter Merkl
Institut für Softwaretechnik, Technische Universität Wien
Resselgasse 3/188, A-1040 Wien, Austria
andi, www.ifs.tuwien.ac.at/
dieter">www.ifs.tuwien.ac.at/
andi, www.ifs.tuwien.ac.at/
dieter
Today's information age may be characterized by constant massive production and dissemination of written information. Powerful tools for exploring, searching, and organizing this mass of information are needed. Particularly the aspect of exploration has found only limited attention. While the importance of exploration of document archives has been widely agreed on, current information retrieval technology still relies on systems that retrieve documents based on the similarity between keyword-based document and query representations.
However, the map metaphor for displaying the contents of a document library in a two-dimensional map has gained some interest [1,2,5,6,7]. Maps are used to visualize the similarity between documents in terms of distances within a two-dimensional map display. Hence, similar documents may be found in neighboring regions of the map display, which is comparable to the situation encountered in conventional libraries, where books are organized in topical sections. Finding one book on a desired topic using any of the available information retrieval (IR) techniques then leaves you with several others on the same topic nearby. Apart from interactive retrieval of documents within one topical cluster, a map representation also allows users to browse an unfamiliar library in order to get an overview of the documents found in the repository. Given an enhanced visualization it not only provides information on which topics are covered in a specific collection, but also to which extent they are covered, i.e. how much information is available on a certain topic. Thus, the map serves as a kind of intuitive index to the document repository similar to road signs or overview maps of conventional libraries.
A way for automatically organizing a set of documents on such a map display is provided by the self-organizing map (SOM) [4], a popular unsupervised neural network. It provides a topology preserving mapping from a high-dimensional document space to a two-dimensional output space. On the SOM output space the documents are organized according to their topical cluster structure, i.e. documents on similar topics are grouped together, following the map metaphor of library representation. In spite of these mapping capabilities of the SOM, usage has been limited to some extent by the fact, that the interpretation of the resulting mapping was difficult since it required manual analysis of the trained SOM network. This is because the standard SOM representation provides no justification for a certain mapping and does not reveal the contents or boundaries of the clusters that are present in the map.
To overcome this limitation several enhanced cluster visualization techniques like the U-Matrix [17], the Cluster Connection Technique [9], Adaptive Coordinates [8] were developed to make the cluster structure within the map explicit. But still, none of these methods provide information on the features learned by the map. What we would like to have, is a method to automatically extract that knowledge from a trained map, i.e. to make the cluster structure and the characteristics of each cluster explicit. The LabelSOM method presented in this paper automatically extracts this information from a trained SOM by labeling each unit with those features that best describe the data, i.e. the documents mapped onto the respective unit. It thus provides the user with the concepts of the documents represented by the map in an intuitive and readable way. We demonstrate this approach using a real world document archive consisting of an article collection from the TIME Magazine.
The remainder of the paper is organized as follows: In Section 2 we give an introduction to the self-organizing map architecture and learning algorithm. Following a brief presentation of the TIME Magazine article collection used for the experiments presented in Section 3, we present a standard SOM library map representing the articles on the map display in Section 4. We next identify some of the problems related to the interpretation of such a standard map representation and describe related work addressing them in Section 5. The LabelSOM method providing an automatic labeling of the units of a self-organizing map is presented in Section 6, followed by an example showing the results of labeling the previously discussed map of TIME Magazine articles in Section 7. We finally present some conclusions on the LabelSOM method and the applicability of SOMs for digital library organization in Section 8.
The self-organizing map (SOM) as proposed in [3] and
described thoroughly in [4] is one of the most
distinguished unsupervised artificial neural network models.
It basically provides a form of cluster analysis by producing a mapping of high-dimensional input data
onto a usually 2-dimensional output space while preserving the topological relationships between the input data items as faithfully as possible.
This model consists of a set of units, which are arranged in some topology where the most common choice is a 2-dimensional grid.
Each of the units i is assigned a weight vector mi of the same dimension as the input data,
.
are filled with random values.
During each learning step, the unit c with the highest activity level, the winner c with respect to a randomly selected input pattern x is adapted in a way that it will exhibit an even higher activity level at future presentations of that specific input pattern. Commonly, the activity level of a unit is based on the Euclidean distance between the input pattern and that unit's weight vector, i.e. the unit showing the smallest Euclidean distance between it's weight vector and the presented input vector is selected as the winner. Hence, the selection of the winner c may be written as given in Expression (1).
Adaptation takes place at each learning iteration and is performed as
a gradual reduction of the difference between the respective
components of the input vector and the weight vector.
The amount of adaptation is guided by a learning-rate
that is gradually decreasing in the course of time.
This decreasing nature of adaptation strength ensures large adaptation
steps in the beginning of the learning process where the weight
vectors have to be tuned from their random initialization towards the
actual requirements of the input space.
The ever smaller adaptation steps towards the end
of the learning process enable a fine-tuned input space representation.
As an extension to standard competitive learning, units in a time-varying and gradually decreasing neighborhood around the winner are adapted, too. Pragmatically speaking, during the learning steps of the self-organizing map a set of units around the winner is tuned towards the currently presented input pattern enabling a spatial arrangement of the input patterns such that alike inputs are mapped onto regions close to each other in the grid of output units. Thus, the training process of the self-organizing map results in a topological ordering of the input patterns.
The neighborhood of units around the winner may be described implicitly by means of a neighborhood-kernel hci taking into account the distance--in terms of the output space--between unit i under consideration and unit c, the winner of the current learning iteration. This neighborhood-kernel assigns scalars in the range of [0, 1] that are used to determine the amount of adaptation ensuring that nearby units are adapted more strongly than units further away from the winner. A Gaussian may be used to define the neighborhood-kernel as given in Expression (2) where || rc - ri || denotes the distance between units c and i within the output space, with ri representing the two-dimensional vector pointing to the location of unit i within the grid.
It is common practice that in the beginning of the learning process the
neighborhood-kernel is selected large enough to cover a wide area of the
output space.
The spatial width of the neighborhood-kernel is reduced gradually
during the learning process such that towards the end of the learning
process just the winner itself is adapted.
Such a reduction is done by means of the time-varying parameter
in Expression (2).
This strategy enables the formation of large clusters in the beginning
and fine-grained input discrimination towards the end of the learning
process.
In combining these principles of self-organizing map training, we may write the learning rule as given in Expression (3). Please note that we make use of a discrete time notation with t denoting the current learning iteration.
In the experiments presented hereafter we use the TIME Magazine
article collection available at
http://www.ifs.tuwien.ac.at/ifs/research/ir as a reference document archive.
The collection comprises 420 documents from the TIME Magazine of the early 1960's.
The documents can be thought of as forming topical clusters in the high-dimensional feature space spanned by the words that the documents are made up of.
The goal is to map and identify those clusters on the 2-dimensional map display.
Thus, we use full-text indexing to represent the various documents according to
the vector space model of information retrieval.
The indexing process identified 5923 content terms, i.e. terms used
for document representation, by omitting words that
appear in more than 90% or less than 1% of the documents.
The terms are roughly stemmed and weighted according to a
,
i.e. term frequency
inverse document frequency, weighting scheme [16], which assigns high values to terms that are considered important in describing the contents of a document.
Following the feature extraction process we end up with 420 vectors describing the documents in the 5923-dimensional document space, which are further used for
neural network training.
Based on the document description as outlined above, we trained
a
self-organizing map to represent the contents of the
document archive.
Hence, in this particular setup we have 150 neural processing units to
represent the contents of the document collection.
Figure 1 gives a graphical representation of the training
result.
Each unit is shown as a rectangular area in the figure containing the
document numbers of documents mapped onto that particular unit.
Due to space considerations we cannot show the map in full detail in this paper.
We therefore have chosen several units for detailed discussion.
The document numbers for these units are shown on the sides of the
figure.
We find, that the SOM has succeeded in creating a topology preserving mapping of the document collection, i.e. we find documents on similar topics located on the same or neighboring units. For example, all articles mapped onto the units in the lower left corner of the map deal with problems in South Vietnam, with some units representing articles on the Vietnam War and other units covering the government crackdown on buddhist monks. As another example, consider articles T024, T096, T242, T461, which are located onto one single unit in the first row of the map, and which all deal with the relationship between India and Pakistan and the Kashmir conflict.
Several further topical clusters can be identified on the map. This capability of organizing free text documents according to their contents, has been demonstrated repeatedly on a number of different document collections using both the self-organizing map as well as variations of its basic architecture for the representation of digital libraries [2,10,11,12,13,14]. The spatial arrangement of documents on the map provides a very intuitive interface to the text archive similar to manually classified and sorted libraries which also have books on similar topics located next to each other in the shelves.
The SOM representation of the library suffers somewhat from the fact, that no information on the contents is provided and thus the topical arrangement of the documents cannot be exploited without thorough investigation of the map. There is neither a way to tell from the standard map representation, (1) which clusters are present in the map and where their boundaries are, nor (2) can one tell the characteristics of certain regions within the map. Thus, we cannot learn from the map, which clusters of documents on which topics are located in which place on the map. This renders the map somewhat useless unless one starts to read all articles to determine, which topics are represented by which set of units, which is a rather tiresome process for big maps and huge document collections.
To tackle this problem of map interpretation, several techniques were developed to facilitate intuitive cluster visualization to allow the detection of clusters and cluster boundaries. The U-Matrix [17] tries to visualize clusters in the map by creating a kind of 3-dimensional display of the map using the distances between neighboring units as height indicator. Coloring the units similar to conventional map coloring using light colors for high and dark colors for low distances to neighboring units, cluster boundaries become visible as light ridges between darker valleys of units belonging to the same cluster. Similar to this method, Cluster Connections [9] allow the interactive analysis of the map based on the distances between neighboring units. Thresholds are interactively adjusted to create a net of connected and disconnected units. A different approach is taken in the Adaptive Coordinates method [8], where the movement of the SOM's weight vectors in the high-dimensional input space is mirrored in a two-dimensional output space. This leads to groups of units being located close to each other on the output space making the clusters intuitively visible.
However, while these methods may provide some aid in identifying clusters in the map, no description of the clusters can be derived. Thus, we still have to analyze all clusters manually to be able to tell their characteristics, i.e. to identify the topics of the mapped documents in the case of a self-organizing map representing document collections. Once the clusters are identified, current ways of SOM representation mostly rely on the manual selection of keywords to describe certain areas in the map, similar to the approach used in describing the trained SOM in the previous section. Only when some kind of preclassified information about the various articles is available, an approach for automatic assignment of labels for the clusters identified may be chosen, based on the knowledge about the data. Such a-priori knowledge has been used for example in the WEBSOM project [2]. There, a SOM was trained with articles from various Usenet Newsgroups. The regions in the resulting SOM where then labeled with the newsgroups that the majority of articles on a unit came from.
With the LabelSOM [15] method presented thereafter we automatically select those attributes from each weight vector that best describes the input data mapped onto the respective unit, allowing the identification and characterization of clusters in the self-organizing map.
The goal of the LabelSOM method is to select a set of attributes to serve as labels for the units of a trained self-organizing map.
The selection of attributes in the LabelSOM method is motivated by the observation, that the weight vectors of the units serve as prototypes for the input signals mapped onto them.
Thus we want to select those attributes of a unit's weight vector, that are shared by all input signals mapped onto the respective unit to a similar degree.
In order to be able to identify these attributes, we calculate the quantization error vector qi of every unit's weight vector mi by calculating for each vector attribute the cumulative distance between the units' weight vectors and the input vectors mapped onto the units.
More formally, for all input signals xj mapped onto unit i, i.e.
,
we calculate the n-dimensional quantization error vector qi as
In the resulting quantization error vector, those attributes that are shared by a large number of input signals can be identified as the vector attributes having a low quantization error. Thus we can select those attributes that exhibit the lowest quantization error as labels for the respective unit.
This selection of attributes may suffice in many standard data mining applications, where the shared absence of an attribute in a set of input data is equally important and descriptive as the detection of the shared presence of certain attributes.
In the text mining domain we are faced with a somewhat different situation.
Here, we typically find a very large number of attributes, i.e. terms, that are not present in a group of documents mapped onto a specific unit.
These attributes thus usually exhibit a quantization error of close to or actually 0, perfectly qualifying for being selected as labels for the unit in question.
Since we are usually not interested in attributes describing what a group of texts is not about, we have to use a second criterion to only select those attributes that actually describe the documents mapped onto the very unit in a positive way, i.e. list only those attributes that are strongly present in all documents.
This is done by using the weight vector values of the units as a second criterion, eliminating all attributes as potential labels that exhibit weigth vector values below a certain threshold
.
Thus we end up with selecting those attributes with the lowest quantization error but a weight vector value above
as labels for a unit.
Using the threshold
as well as the maximum number of attributes being selected as labels, we can determine the granularity of the resulting concept descriptions.
The lower we set the threshold and the more attributes we allow to be selected as labels, the more details on the units are being revealed.
The result of applying this LabelSOM method to the map depicted in Figure 1 is given in Figure 2, for which a set of up to 15 labels was selected for every unit.
Again, due to space considerations, we can only list the labels for a subset of all units.
Based on these automatically selected labels we can now clearly identify the corresponding topic clusters.
For example, the cluster of articles on Vietnam discussed above can be clearly identified in the lower left corner by labels such as viet, cong, saigon, vietnamese, crusage, dinh, thuc, diem, monk, buddha, barricade, blame, catholic, religious, buddhist or, for a neighboring node in the same cluster, viet, saigon, catholic, religious, priest, blame, monk, diem, buddhist, quang. The unit representing documents on the india-pakistan problem mentioned in Section 4, for example, is labeled with india, negotiation, settlement, delhi, nehru, india, round, pakistan, whereas a unit representing two articles on Austrian politics (T182, T318) is labeled austrian, people, conservative, socialist, coalition, argument, austria, ministry. Thus, simply by considering the automatically created labels for every unit we can determine, articles on which topic can be found in which area of the map.
Furthermore, the results from the automatic assignment of labels to the units can be used to easily identify topical clusters in the map as groups of units having labels in common.
For example, in the lower left corner of the map we can identify a number of units sharing the labels viet, saigon, buddhist etc., clearly separating the cluster of articles covering Vietnam from neighboring topical clusters e.g. on the Middle East in the lower right corner of the map (with units sharing labels like egypt, iraq, algeria, syria, nasser, hussein etc.) or on Africa in the middle-left area of the map (south, africa, white, black).
Finally, we are easily able to identify topical clusters that are represented by only single units that do not share any labels with neighboring units.
Depending on the number of labels to be created for the map and the threshold
,
clusters at differing levels of granularity can be obtained.
The more labels we choose to create and the lower we set the threshold, the larger are the clusters we obtain while a lower number of labels and higher setting of
leads to the identification of more detailed subclusters.
The self-organizing map offers itself as an intuitive interface to a document archive, allowing interactive browsing and exploration by organizing documents into topical clusters on the map display. However, the interpretation of the resulting map is not as intuitive as it might be, since the cluster structure as well as the characteristics can not be derived from the resulting map display. To aid in the interpretation of trained self-organizing maps, methods for enhanced cluster visualization were developed supporting the user in identifying clusters of input data in a SOM. But still no information on the characteristics of these clusters are extracted, thus requiring manual labeling. In this paper we presented the LabelSOM method which automatically assigns a set of labels to the units of a self-organizing map. It is based on the observation, that the weight vectors of a trained SOM serve as prototypes for the input data mapped onto the respective units. Thus, based on the weight vectors and the input vectors mapped onto the respective units, a set of attributes shared by the input vectors mapped onto a unit can be selected to serve as labels. These labels give the characteristics of the documents mapped onto the respective unit describing the concepts learned by the map. The resulting map can actually be read and thus serves as a guide to the document repository, providing an instant and intuitive overview of the contents of a collection.
This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)
Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 rau_dexa99.tex.
The translation was initiated by Andreas RAUBER on 1999-09-07