Back to: List of Publications
Philipp Tomsich, Andreas Rauber and Dieter Merkl
Institute of Software Technology
Vienna University of Technology
Favoritenstraße 9-11/188
A-1040 Wien, Austria
phil,andi,dieter@ifs.tuwien.ac.at
In this paper we present the par SOM, a software-based parallel implementation of the self-organizing map, which is particularly optimized for the analysis of high-dimensional input data. This model scales well in a parallel execution environment, and, by coping with memory latencies, a better than linear speed-up can be achieved using a simple, asymmetric model of parallelization. We demonstrate the benefits of the proposed implementation in the field of text classification, which due to the high dimensionalities of the data spaces encountered, forms a prominent application domain for high-performance computing.
The self-organizing map (SOM) [4] is a prominent unsupervised neural network model providing a topology-preserving mapping from a high-dimensional input space onto a two-dimensional map space. This capability allows an intuitive analysis and exploration of unknown data spaces, with its applications ranging from the analysis of financial data, medical data, to time series data [4,2,11]. Another rather prominent application arena is text classification, where documents in a collection are clustered according to their content. However, in many applications the dimensionalities of the feature spaces to be analyzed as well as the amount of data encountered are too large to allow a fast, interactive training of the neural network. Still, interactive training and mapping are highly desirable, as they open a huge field of applications in the context of data analysis in Data Warehouses or in the field of Digital Libraries, to name just two examples. In both cases, interactive response times are a must.
Existing hardware implementations [4,3,5] have not gained wide-spread acceptance because of the limitations they put both on the dimensionality of the data to be analyzed or the size of the map. On the other hand, high-performance computing provides the technology necessary for the desired speedup.
In this paper we present the par SOM, a software-based parallel implementation of the SOM, using a simple asymmetric model of parallelization. Apart from the obvious speedup gained by having the training process executed in parallel for the various neural processing units in a SOM we show, that the use of cache effects allows a better-than-linear speedup for the resulting implementation. We have implemented and evaluated the par SOM on a number of platforms, ranging from off-the-shelf dual-processor Celeron systems, the 16-processor SGI Power Challenge, to the 64-processor SGI Cray Origin 2000 using an application scenario from the digital library domain as a test-bed for our performance analysis.
The remainder of the paper is organized as follows. Section 2.1 provides an introduction to the architecture and the training algorithm of a self-organizing map, with a serial implementation discussed in Section 2.2. The actual algorithm implemented in the par SOM system is presented in Section 2.3. Our application scenario from the field of Digital Libraries is introduced in Section 3, followed by an analysis of experimental results in Section 4. Final conclusions and an outlook are presented in Section 5.
The self-organizing map [4] basically provides a form
of cluster analysis by producing a mapping of high-dimensional input
data
onto a usually 2-dimensional output space while
preserving the topological relationships between the input data items
as faithfully as possible. This model consists of a set of units,
which are arranged in some topology where the most common choice is a
two-dimensional grid. Each of the units
is assigned a weight
vector
of the same dimension as the input data,
. In the initial setup of the model prior to training, the
weight vectors are filled with random values.
During each learning step, the unit
with the highest activity
level, i.e. the winner c with respect to a randomly selected
input pattern
, is adapted in a way that it will exhibit an even
higher activity level at future presentations of that specific input
pattern. Commonly, the activity level of a unit is based on the
Euclidean distance between the input pattern and that unit's weight
vector, i.e. the unit showing the lowest Euclidean distance between
it's weight vector and the distance between the input pattern and that
unit's weight vector, i.e. the unit showing the lowest Euclidean
distance between it's weight vector and the presented input vector is
selected as the winner. Hence, the selection of the winner
may be
written as given in Expression (1).
Adaptation takes place at each learning iteration and is performed as
a gradual reduction of the difference between the respective
components of the input vector and the weight vector. The amount of
adaptation is guided by a learning-rate
that is gradually
decreasing in the course of time. This decreasing nature of
adaptation strength ensures large adaptation steps in the beginning of
the learning process where the weight vectors have to be tuned from
their random initialization towards the actual requirements of the
input space. The ever smaller adaptation steps towards the end of the
learning process enable a fine-tuned input space representation.
As an extension to standard competitive learning, units in a
time-varying and gradually decreasing neighborhood around the winner
are adapted, too. Pragmatically speaking, during the learning steps
of the self-organizing map a set of units around the winner is tuned
towards the currently presented input pattern enabling a spatial
arrangement of the input patterns such that alike inputs are mapped
onto regions close to each other in the grid of output units. Thus,
the training process of the self-organizing map results in a
topological ordering of the input patterns. According to [9]
we may thus refer to the self-organizing map as a neural network model
performing a spatially smooth version of
-means clustering.
The neighborhood of units around the winner may be described
implicitly by means of a neighborhood-kernel
taking into
account the distance--in terms of the output space--between unit
under consideration and unit
, the winner of the current learning
iteration. This neighborhood-kernel assigns scalars in the range of
[0, 1] that are used to determine the amount of adaptation ensuring
that nearby units are adapted more strongly than units further away
from the winner. A Gaussian may be used to define the
neighborhood-kernel as given in Expression (2) where
denotes the distance between units
and
within the output space, with
representing the two-dimensional
vector pointing to the location of unit
within the grid.
It is common practice that in the beginning of the learning process
the neighborhood-kernel is selected large enough to cover a wide area
of the output space. The spatial width of the neighborhood-kernel is
reduced gradually during the learning process such that towards the
end of the learning process just the winner itself is adapted. Such a
reduction is done by means of the time-varying parameter
in
Expression (2). This strategy enables the formation of
large clusters in the beginning and fine-grained input discrimination
towards the end of the learning process.
In combining these principles of self-organizing map training, we may
write the learning rule as given in Expression (3).
Please note that we make use of a discrete time notation with
denoting the current learning iteration. The other parts of this
expression are
representing the time-varying learning-rate,
representing the time-varying neighborhood-kernel,
representing the currently presented input pattern, and
denoting the weight vector assigned to unit
.
A simple graphical representation of a self-organizing map's
architecture and its learning process is provided in Figure
1. In this figure the output space consists of a square
of 36 units, depicted as circles. One input vector
is randomly
chosen and mapped onto the grid of output units. In the second step
of the learning process, the winner
showing the highest activation
is selected. Consider the winner being the unit depicted as the black
unit in the figure. The weight vector of the winner,
, is
now moved towards the current input vector. This movement is
symbolized in the input space in Figure 1. As a
consequence of the adaptation, unit
will produce an even higher
activation with respect to input pattern
at the next learning
iteration,
, because the unit's weight vector,
, is
now nearer to the input pattern
in terms of the input space.
Apart from the winner, adaptation is performed with neighboring units,
too. Units that are subject to adaptation are depicted as shaded
units in the figure. The shading of the various units corresponds to
the amount of adaptation and thus, to the spatial width of the
neighborhood-kernel. Generally, units in close vicinity of the winner
are adapted more strongly and consequently, they are depicted with a
darker shade in the figure.
A serial implementation of the SOM iterates through the training algorithm. This involves scanning all units to find the winner and then modifying all units to reflect the change. As the map will usually be implemented as an array of weight-vectors, this array is traversed twice. From this, it should be clear that a major bottleneck in a software-based implementation of a self-organizing map consists in the memory bandwidth. During every iteration of the training algorithm every unit within the map is touched twice. First, during the search for the best matching unit within the map and then again, when the weights are adjusted. Due to the fact, that we are using high-dimensional inputs with multiple thousands of inputs, it is not uncommon that a rather small map requires in excess of 5 MBs. A medium-sized map may easily occupy tens of megabytes. This is beyond the capacity of most cache memories and therefore incurs large overheads in its memory accesses. The cache thrashing can be lessened a little by using a technique known as loop inversion. Instead of traversing the unit-array from its beginning to its end when traversing it for the second time (i.e., when adjusting the weights), the array is traversed backwards to exploit the cache.
Another major bottleneck results from the huge number of floating point operations required. Training a 10x15 map of 4000 features in 20000 iterations requires in excess of 72 billion arithmetic floating point instructions and 6 million floating point comparisons. In order to provide high performance in such applications, multiple processors can be used in parallel. This becomes a viable alternative, as the distance calculation and the adaption of the weights can be executed independently for each unit. These independent operations constitute the major part of the runtime, calling for a parallel implementation.
Two different strategies for the parallelization of SOMs are available:
For our current implementation, we use partitioning, with one master controls multiple slaves. As depicted in Figure 2 three major synchronization points occur in this implementation. The data transfers are very small, as only the offset of the input vector (within a replicated file) is broadcast; the data packets for computing the minima consist of a Cartesian coordinate and a floating point number signifying the error at that coordinate. The approach chosen in our work is based on a communication-oriented computation and synchronization model. It uses a number of information and condition broadcasts to synchronize its operations. In addition, we view a multi-dimensional self-organizing map as a collection of units with associated coordinates. This allows us to distribute the units freely between the threads.
![]() |
Any subsets of the units can be easily represented in a linear array. By segmenting the SOM into multiple regions, we can effectively distribute the memory usage across multiple processing nodes. Since training the SOM requires the entire map segment to be operated on, the distribution of the map across multiple processing nodes with independent memories (e.g., on ccNUMA architectures) reduces the pressure on the memory hierarchy and makes it possible to fit the map segments entirely into L2 caches on the respective nodes.
A master thread selects an input vector and broadcasts it to all its slave threads. All threads search for their local minima, i.e. best-matching units concurrently. These best-matches are sent back to the master, which determines a global match. All slaves are notified of the global match and they modify their units relative to the location of it independently. The data packets exchanged at the synchronization points are very small, consisting of a spatial coordinate and an error value at most. This algorithm should scale very well to any number of processors in theory, as the search for the local minima and the training process account for almost the entire execution time of the program. Remember, that both finding the local minima and training involves operations on arrays of large vectors of floating point numbers. For this reason, the additional work of the master thread is negligible and does not affect the overall performance. However, for small maps and very fast processors, scalability problems may arise due to the synchronization overheads.
Information retrieval in digital libraries is often based on similarity search [1], such that a user queries the document repository by example. A typical implementation for this is based on the use of self-organizing maps (sometimes referred to as Kohonen maps these are a type of unsupervised-learning neural networks, that provide the capability to the high-dimensional feature vectors associated with documents into a low-dimensional Cartesian space. Once this mapping in concluded, it can be used to determine a clustering of documents. Similar documents should belong to the same or neighboring clusters. This provides a unique opportunity to visualize the clustered documents.
The quality of these clusters depends largely on the use of appropriate learning functions and distance-based feedback. Unfortunately these functions need to be applied in an iterative manner for multiple thousands of iterations. This is leads to execution times for every training run that fall far outside of what can be considered appropriate for interactive operation. This implies that documents can not be clustered on the fly (e.g., for web-searches or for searches using iterative refinement). The rapid growth of digital content outpaces the increase of the computing power of traditional single-processor systems. However, the most important upcoming applications of digital libraries require the just this interactive response:
For the following experiments we use a collection of 420 articles from
the TIME Magazine from the 1960's as a sample document
repository. The articles are parsed to create a word histogram
representation of the documents using the classical
,
i.e. term frequency times inverse document frequency weighting scheme
according to the vector space model of information
retrieval [10]. We thus obtain 4012-dimensional input vectors,
which were further used for training SOMs of
units. The
memory required to store this map exceeds 4.5MBs.
The resulting SOM is depicted in Figure 3. As a closer look at the resulting mapping reveals, the SOM succeeded in producing a topology preserving mapping from the 4012-dimensional input space onto the 2-dimensional map space, clustering the articles by topic. This can be easily demonstrated using the labels automatically extracted from the trained SOM by applying a variation of the labelSOM technique [6].
For example, we find a number of articles covering the war in Vietnam in the lower left corner of the map. Another rather large cluster in the lower right corner of the map is devoted to articles on the situation in the Middle East, covering Nasser's attempts to create an arab union. Due to space considerations, we cannot present the complete map in full detail. However, the map is available at http://www.ifs.tuwien.ac.at/ifs/research/projects/parsom/ for interactive exploration. For a closer analysis of the application of SOMs in the field of information retrieval, refer to [7,8].
The resulting representation of documents due to its topical organization offers itself to interactive exploration. However, to allow its application in an information retrieval or data mining setting, providing competitive response times is crucial, calling for a parallel processing of the SOM training process.
Table 1 summarizes the execution time for training the map using par SOM. We used POSIX threads, a shared address space and user-level spinlocks in our implementation. The test were conducted on a number of multi-processor machines with dissimilar memory architectures:
The dual-processor Celeron system is a commodity personal computer. The small cache sizes of the processors require large amounts of memory transfers. Due to the limited memory bandwidth, not even a linear speedup is achievable in this configuration. The only apparent cause for the bad scalability can be attributed to bus contention. On this system, inverting the training loop does not cause a significant performance gain.
An entirely different performance characteristic was obtained on our principal development platform. On the SGI PowerChallenge, which uses a shared memory architecture, a 22-fold increase in performance could be achieved with only 12 processors. Although this may seem paradox, it should be expected. The performance increases very quickly for the first few threads, until the amount of cache memory exceeds the size of the map (compare Figure 4). After that, the performance increase continues in a linear fashion. We provided time measurements up to 12 processors only, as the machine was not available for dedicated testing. The scalability on this system remains unaffected by the available memory bandwidth: the entire SOM is distributed through the cache memories of the various processor, effectively avoiding bus contention. Interestingly, loop inversion leads to a large increase in performance for the uniprocessor version on this machine. This optimization resulted in a 22% increase in performance.
![]() |
The performance gains on the SGI Origin were less impressive than those on the PowerChallenge. Although the current implementation scales to 8 processors and achieves a 5.5-fold performance increase, it still offers room for improvement. This is due to our dependence on POSIX threads and how these are implemented in IRIX 6.5. Instead of allocating a separate heap for each thread, a shared heap is used, which has its affinity hard-wired to the processor of the main thread. Understandably, performance degrades in such an environment, if many threads are used. Nonetheless, the speedup between the single-processor and dual-processor implementation shows again how the availability of sufficient cache memory to hold the entire SOM has a significant performance impact. We are currently working on a MPI-based implementation to create an optimal version for this architecture.
Our experiments with parallel, software-based implementations of self-organizing maps for Digital Libraries have shown that interactive response times--even for high dimensional feature spaces--can be achieved using supercomputing resources. The used algorithms do not only scale well, but offer an additional performance advantage due to cache effects. Parallelism effectively becomes a method to overcome the memory latencies in self-organizing neural networks.
We are currently adding MPI support. Dynamic load-balancing becomes a necessity, if non-dedicated computing resources are used, as the background load may severely affect the performance of the system. In such environments, a dynamic redistribution of the map between nodes becomes necessary to provide peak performance. Support for SIMD extensions to common microprocessors and vector computers will be also be added in future versions.
This document was generated using the LaTeX2HTML translator Version 99.1 release (March 30, 1999)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 tom_hpcn2000.tex
The translation was initiated by Andreas RAUBER on 2000-08-04