A Neural Network Approach

The self-organizing map (SOM) is an unsupervised neural network providing a mapping from a high-dimensional input space to a usually two-dimensional output space while preserving topological relations as faithfully as possible. Input signals $x \in \Re^{n}$ are presented in random order to the map, which consists of a grid of units, each with an n-dimensional weight vector. An activation function based on some metric (e.g. the Euclidean distance) is used to determine the winning unit (the 'winner'). The weight vector of the winner as well as the weight vectors of the neighboring units are then adapted, scaled by a learning rate, to represent the presented input signal more closely.
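
To make the update rule concrete, the following is a minimal sketch of one SOM training iteration in Python, assuming a rectangular grid of units and a Gaussian neighborhood function; the function and parameter names are illustrative and not taken from the SOMLib implementation.

    import numpy as np

    def train_step(weights, x, learning_rate, sigma):
        """weights: (rows, cols, n) array of unit weight vectors;
        x: one n-dimensional input signal."""
        rows, cols, _ = weights.shape
        # Find the winner: the unit whose weight vector is closest
        # to the input signal in Euclidean distance.
        dists = np.linalg.norm(weights - x, axis=2)
        wr, wc = np.unravel_index(np.argmin(dists), (rows, cols))
        # Move the winner and its grid neighbors toward the input,
        # scaled by a Gaussian neighborhood centered on the winner.
        r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
        grid_dist2 = (r - wr) ** 2 + (c - wc) ** 2
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))
        weights += learning_rate * h[:, :, None] * (x - weights)
        return weights

In practice both the learning rate and the neighborhood width sigma are decreased over the course of training, so that the map first unfolds globally and then fine-tunes locally.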

Basically, the entries to be included in the library system are represented as feature vectors, which are created by parsing the texts and processing the resulting word histograms into a compact and effective representation of the texts, using a term frequency / inverse document frequency ($tf \times idf$) weighting (sketched below). These feature vectors are used as input vectors to train a standard self-organizing map, whereby the architectures used to represent individual libraries do not have to be identical: various modifications of the basic architecture, especially hierarchical SOMs, can be used to suit the differing needs of users concerning size, training time and map representation. This approach has proven successful in a number of applications on free-form documents, with much attention being paid to the design of intuitive user interfaces allowing easy browsing of large collections of documents and simple retrieval of similar pieces of text.

However, so far these systems require all processing, be it feature-vector extraction or map training, to take place centrally. When building larger libraries that include the documents of several independent repositories (e.g. university libraries, electronic archives of journals), limits are reached, both in the availability of all texts for training and in the necessary computational power. We thus need a method to combine different libraries that are based on differing feature vectors, each made up of the vocabulary of the documents used to train the respective library. By combining these feature vectors, the resulting vectors grow toward a more universal vocabulary as the number of included libraries increases.

To keep training times low, however, we use the weight vectors of the SOMLib library maps, rather than all original document vectors, as input vectors to the higher-order library, resulting in a hierarchy of maps. Still, this is not a strict hierarchy in the sense of the hierarchical SOM model, where each node in a higher-level map represents exactly one map at the lower level of the hierarchy. Rather, we train higher-level maps bottom-up, obtaining a topographic ordering of all the weight vectors used in the various lower-order maps. Thus one node in the higher-order map can reference several nodes in different lower-order maps. Furthermore, the maps need not be arranged in a strictly hierarchical order, allowing mutual referencing (see the merging sketch below).
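
The feature-vector construction described above can be sketched as follows. This assumes plain whitespace tokenization and the common $\log(N/df)$ form of the inverse document frequency; a production system would additionally apply stemming and stop-word removal.

    import math
    from collections import Counter

    def tfidf_vectors(documents):
        """documents: list of raw text strings.
        Returns (vocabulary, list of tf x idf weight vectors)."""
        # Word histograms, one per document.
        histograms = [Counter(doc.lower().split()) for doc in documents]
        vocabulary = sorted(set().union(*histograms))
        n_docs = len(documents)
        # Document frequency: in how many texts each term occurs.
        df = {t: sum(1 for h in histograms if t in h) for t in vocabulary}
        vectors = []
        for h in histograms:
            # Term frequency weighted by inverse document frequency;
            # terms occurring in every document receive weight zero.
            vectors.append([h[t] * math.log(n_docs / df[t]) for t in vocabulary])
        return vocabulary, vectors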
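
The merging of library maps into a higher-order map might then look like the following sketch: the vocabularies of two trained maps are united, each map's unit weight vectors are re-embedded in the merged term space, and the pooled weight vectors serve as training input for the higher-order map. The zero weight for terms unseen by a library and the two-library restriction are simplifying assumptions; train_step refers to the SOM sketch above.

    import numpy as np

    def merge_libraries(vocab_a, units_a, vocab_b, units_b):
        """vocab_*: term lists of two library maps;
        units_*: their unit weight vectors, one list per map."""
        merged = sorted(set(vocab_a) | set(vocab_b))
        index = {t: i for i, t in enumerate(merged)}

        def embed(vocab, units):
            # Re-embed weight vectors in the merged term space;
            # terms unseen by this library keep weight zero (assumption).
            out = np.zeros((len(units), len(merged)))
            for k, u in enumerate(units):
                for t, w in zip(vocab, u):
                    out[k, index[t]] = w
            return out

        # Pool the re-embedded weight vectors of both lower-level maps.
        return merged, np.vstack([embed(vocab_a, units_a),
                                  embed(vocab_b, units_b)])

    # Usage: train the higher-order map on the pooled weight vectors
    # instead of the original document vectors.
    # merged, inputs = merge_libraries(va, ua, vb, ub)
    # for x in np.random.permutation(inputs):
    #     weights = train_step(weights, x, learning_rate=0.1, sigma=2.0)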

