next_group up previous


Back to: List of Publications

par SOM: Using parallelism to overcome memory latency in self-organizing neural networks

Philipp Tomsich, Andreas Rauber and Dieter Merkl

Institute of Software Technology
Vienna University of Technology
Favoritenstraße 9-11/188
A-1040 Wien, Austria
phil,andi,dieter@ifs.tuwien.ac.at

Abstract:

The self-organizing map is a prominent unsupervised neural network model which lends itself to the analysis of high-dimensional input data. However, the high execution times required to train the map put a limit to its application in many high-performance data analysis application domains, where either very large datasets are encountered and/or interactive response times are required.

In this paper we present the par SOM, a software-based parallel implementation of the self-organizing map, which is particularly optimized for the analysis of high-dimensional input data. This model scales well in a parallel execution environment, and, by coping with memory latencies, a better than linear speed-up can be achieved using a simple, asymmetric model of parallelization. We demonstrate the benefits of the proposed implementation in the field of text classification, which due to the high dimensionalities of the data spaces encountered, forms a prominent application domain for high-performance computing.

Introduction

The self-organizing map (SOM) [4] is a prominent unsupervised neural network model providing a topology-preserving mapping from a high-dimensional input space onto a two-dimensional map space. This capability allows an intuitive analysis and exploration of unknown data spaces, with its applications ranging from the analysis of financial data, medical data, to time series data [4,2,11]. Another rather prominent application arena is text classification, where documents in a collection are clustered according to their content. However, in many applications the dimensionalities of the feature spaces to be analyzed as well as the amount of data encountered are too large to allow a fast, interactive training of the neural network. Still, interactive training and mapping are highly desirable, as they open a huge field of applications in the context of data analysis in Data Warehouses or in the field of Digital Libraries, to name just two examples. In both cases, interactive response times are a must.

Existing hardware implementations [4,3,5] have not gained wide-spread acceptance because of the limitations they put both on the dimensionality of the data to be analyzed or the size of the map. On the other hand, high-performance computing provides the technology necessary for the desired speedup.

In this paper we present the par SOM, a software-based parallel implementation of the SOM, using a simple asymmetric model of parallelization. Apart from the obvious speedup gained by having the training process executed in parallel for the various neural processing units in a SOM we show, that the use of cache effects allows a better-than-linear speedup for the resulting implementation. We have implemented and evaluated the par SOM on a number of platforms, ranging from off-the-shelf dual-processor Celeron systems, the 16-processor SGI Power Challenge, to the 64-processor SGI Cray Origin 2000 using an application scenario from the digital library domain as a test-bed for our performance analysis.

The remainder of the paper is organized as follows. Section 2.1 provides an introduction to the architecture and the training algorithm of a self-organizing map, with a serial implementation discussed in Section 2.2. The actual algorithm implemented in the par SOM system is presented in Section 2.3. Our application scenario from the field of Digital Libraries is introduced in Section 3, followed by an analysis of experimental results in Section 4. Final conclusions and an outlook are presented in Section 5.


Self-organizing maps


Architecture and training

The self-organizing map [4] basically provides a form of cluster analysis by producing a mapping of high-dimensional input data $x, x \in \Re^n$ onto a usually 2-dimensional output space while preserving the topological relationships between the input data items as faithfully as possible. This model consists of a set of units, which are arranged in some topology where the most common choice is a two-dimensional grid. Each of the units $i$ is assigned a weight vector $m_{i}$ of the same dimension as the input data, $m_{i} \in
\Re^{n}$. In the initial setup of the model prior to training, the weight vectors are filled with random values.

During each learning step, the unit $c$ with the highest activity level, i.e. the winner c with respect to a randomly selected input pattern $x$, is adapted in a way that it will exhibit an even higher activity level at future presentations of that specific input pattern. Commonly, the activity level of a unit is based on the Euclidean distance between the input pattern and that unit's weight vector, i.e. the unit showing the lowest Euclidean distance between it's weight vector and the distance between the input pattern and that unit's weight vector, i.e. the unit showing the lowest Euclidean distance between it's weight vector and the presented input vector is selected as the winner. Hence, the selection of the winner $c$ may be written as given in Expression (1).


\begin{displaymath}
c: \vert\vert x - m_{c} \vert\vert = \min_{i}\{\vert\vert x - m_{i} \vert\vert \}
\end{displaymath} (1)

Adaptation takes place at each learning iteration and is performed as a gradual reduction of the difference between the respective components of the input vector and the weight vector. The amount of adaptation is guided by a learning-rate $\alpha$ that is gradually decreasing in the course of time. This decreasing nature of adaptation strength ensures large adaptation steps in the beginning of the learning process where the weight vectors have to be tuned from their random initialization towards the actual requirements of the input space. The ever smaller adaptation steps towards the end of the learning process enable a fine-tuned input space representation.

As an extension to standard competitive learning, units in a time-varying and gradually decreasing neighborhood around the winner are adapted, too. Pragmatically speaking, during the learning steps of the self-organizing map a set of units around the winner is tuned towards the currently presented input pattern enabling a spatial arrangement of the input patterns such that alike inputs are mapped onto regions close to each other in the grid of output units. Thus, the training process of the self-organizing map results in a topological ordering of the input patterns. According to [9] we may thus refer to the self-organizing map as a neural network model performing a spatially smooth version of $k$-means clustering.

The neighborhood of units around the winner may be described implicitly by means of a neighborhood-kernel $h_{ci}$ taking into account the distance--in terms of the output space--between unit $i$ under consideration and unit $c$, the winner of the current learning iteration. This neighborhood-kernel assigns scalars in the range of [0, 1] that are used to determine the amount of adaptation ensuring that nearby units are adapted more strongly than units further away from the winner. A Gaussian may be used to define the neighborhood-kernel as given in Expression (2) where $\vert\vert
r_{c} - r_{i} \vert\vert$ denotes the distance between units $c$ and $i$ within the output space, with $r_{i}$ representing the two-dimensional vector pointing to the location of unit $i$ within the grid.


\begin{displaymath}
h_{ci}(t) = e^{- \frac{ \vert\vert r_{c} - r_{i} \vert\vert }{2 \cdot \delta (t)^{2}}}
\end{displaymath} (2)

It is common practice that in the beginning of the learning process the neighborhood-kernel is selected large enough to cover a wide area of the output space. The spatial width of the neighborhood-kernel is reduced gradually during the learning process such that towards the end of the learning process just the winner itself is adapted. Such a reduction is done by means of the time-varying parameter $\delta$ in Expression (2). This strategy enables the formation of large clusters in the beginning and fine-grained input discrimination towards the end of the learning process.

In combining these principles of self-organizing map training, we may write the learning rule as given in Expression (3). Please note that we make use of a discrete time notation with $t$ denoting the current learning iteration. The other parts of this expression are $\alpha$ representing the time-varying learning-rate, $h_{ci}$ representing the time-varying neighborhood-kernel, $x$ representing the currently presented input pattern, and $m_{i}$ denoting the weight vector assigned to unit $i$.


\begin{displaymath}
m_{i}(t+1)=m_{i}(t)+\alpha(t) \cdot h_{ci}(t) \cdot [x(t)-m_{i}(t)]
\end{displaymath} (3)

Figure 1: Architecture and training process of a self-organizing map
\begin{figure}
\begin{center}
\epsfig {file=somarch.eps, width=40mm}\end{center}\vspace*{-10mm}
\end{figure}

A simple graphical representation of a self-organizing map's architecture and its learning process is provided in Figure 1. In this figure the output space consists of a square of 36 units, depicted as circles. One input vector $x(t)$ is randomly chosen and mapped onto the grid of output units. In the second step of the learning process, the winner $c$ showing the highest activation is selected. Consider the winner being the unit depicted as the black unit in the figure. The weight vector of the winner, $m_{c}(t)$, is now moved towards the current input vector. This movement is symbolized in the input space in Figure 1. As a consequence of the adaptation, unit $c$ will produce an even higher activation with respect to input pattern $x$ at the next learning iteration, $t+1$, because the unit's weight vector, $m_{c}(t+1)$, is now nearer to the input pattern $x$ in terms of the input space. Apart from the winner, adaptation is performed with neighboring units, too. Units that are subject to adaptation are depicted as shaded units in the figure. The shading of the various units corresponds to the amount of adaptation and thus, to the spatial width of the neighborhood-kernel. Generally, units in close vicinity of the winner are adapted more strongly and consequently, they are depicted with a darker shade in the figure.


Implementing SOMs in software

A serial implementation of the SOM iterates through the training algorithm. This involves scanning all units to find the winner and then modifying all units to reflect the change. As the map will usually be implemented as an array of weight-vectors, this array is traversed twice. From this, it should be clear that a major bottleneck in a software-based implementation of a self-organizing map consists in the memory bandwidth. During every iteration of the training algorithm every unit within the map is touched twice. First, during the search for the best matching unit within the map and then again, when the weights are adjusted. Due to the fact, that we are using high-dimensional inputs with multiple thousands of inputs, it is not uncommon that a rather small map requires in excess of 5 MBs. A medium-sized map may easily occupy tens of megabytes. This is beyond the capacity of most cache memories and therefore incurs large overheads in its memory accesses. The cache thrashing can be lessened a little by using a technique known as loop inversion. Instead of traversing the unit-array from its beginning to its end when traversing it for the second time (i.e., when adjusting the weights), the array is traversed backwards to exploit the cache.

Another major bottleneck results from the huge number of floating point operations required. Training a 10x15 map of 4000 features in 20000 iterations requires in excess of 72 billion arithmetic floating point instructions and 6 million floating point comparisons. In order to provide high performance in such applications, multiple processors can be used in parallel. This becomes a viable alternative, as the distance calculation and the adaption of the weights can be executed independently for each unit. These independent operations constitute the major part of the runtime, calling for a parallel implementation.


The par SOM implementation

Two different strategies for the parallelization of SOMs are available:

For our current implementation, we use partitioning, with one master controls multiple slaves. As depicted in Figure 2 three major synchronization points occur in this implementation. The data transfers are very small, as only the offset of the input vector (within a replicated file) is broadcast; the data packets for computing the minima consist of a Cartesian coordinate and a floating point number signifying the error at that coordinate. The approach chosen in our work is based on a communication-oriented computation and synchronization model. It uses a number of information and condition broadcasts to synchronize its operations. In addition, we view a multi-dimensional self-organizing map as a collection of units with associated coordinates. This allows us to distribute the units freely between the threads.

Figure: A flowchart of the computation using an asymmetric model of parallelization. Three major synchronization points occur in this implementation.
\begin{figure}
\begin{center}
\hspace{0mm}
\epsfig {file=parsom-algorithm.eps, width=.6\linewidth}\end{center}\vspace*{-10mm}
\end{figure}

Any subsets of the units can be easily represented in a linear array. By segmenting the SOM into multiple regions, we can effectively distribute the memory usage across multiple processing nodes. Since training the SOM requires the entire map segment to be operated on, the distribution of the map across multiple processing nodes with independent memories (e.g., on ccNUMA architectures) reduces the pressure on the memory hierarchy and makes it possible to fit the map segments entirely into L2 caches on the respective nodes.

A master thread selects an input vector and broadcasts it to all its slave threads. All threads search for their local minima, i.e. best-matching units concurrently. These best-matches are sent back to the master, which determines a global match. All slaves are notified of the global match and they modify their units relative to the location of it independently. The data packets exchanged at the synchronization points are very small, consisting of a spatial coordinate and an error value at most. This algorithm should scale very well to any number of processors in theory, as the search for the local minima and the training process account for almost the entire execution time of the program. Remember, that both finding the local minima and training involves operations on arrays of large vectors of floating point numbers. For this reason, the additional work of the master thread is negligible and does not affect the overall performance. However, for small maps and very fast processors, scalability problems may arise due to the synchronization overheads.


SOMs and Digital Libraries

Information retrieval in digital libraries is often based on similarity search [1], such that a user queries the document repository by example. A typical implementation for this is based on the use of self-organizing maps (sometimes referred to as Kohonen maps these are a type of unsupervised-learning neural networks, that provide the capability to the high-dimensional feature vectors associated with documents into a low-dimensional Cartesian space. Once this mapping in concluded, it can be used to determine a clustering of documents. Similar documents should belong to the same or neighboring clusters. This provides a unique opportunity to visualize the clustered documents.

The quality of these clusters depends largely on the use of appropriate learning functions and distance-based feedback. Unfortunately these functions need to be applied in an iterative manner for multiple thousands of iterations. This is leads to execution times for every training run that fall far outside of what can be considered appropriate for interactive operation. This implies that documents can not be clustered on the fly (e.g., for web-searches or for searches using iterative refinement). The rapid growth of digital content outpaces the increase of the computing power of traditional single-processor systems. However, the most important upcoming applications of digital libraries require the just this interactive response:

For the following experiments we use a collection of 420 articles from the TIME Magazine from the 1960's as a sample document repository. The articles are parsed to create a word histogram representation of the documents using the classical $tf \times idf$, i.e. term frequency times inverse document frequency weighting scheme according to the vector space model of information retrieval [10]. We thus obtain 4012-dimensional input vectors, which were further used for training SOMs of $10 \times 15$ units. The memory required to store this map exceeds 4.5MBs.

Figure 3: A map trained with 420 articles from the TIME Magazine.
\begin{figure}
\begin{center}
\epsfig {file=map.eps,height=90mm}\end{center}\vspace*{-10mm}
\end{figure}

The resulting SOM is depicted in Figure 3. As a closer look at the resulting mapping reveals, the SOM succeeded in producing a topology preserving mapping from the 4012-dimensional input space onto the 2-dimensional map space, clustering the articles by topic. This can be easily demonstrated using the labels automatically extracted from the trained SOM by applying a variation of the labelSOM technique [6].

For example, we find a number of articles covering the war in Vietnam in the lower left corner of the map. Another rather large cluster in the lower right corner of the map is devoted to articles on the situation in the Middle East, covering Nasser's attempts to create an arab union. Due to space considerations, we cannot present the complete map in full detail. However, the map is available at http://www.ifs.tuwien.ac.at/ifs/research/projects/parsom/ for interactive exploration. For a closer analysis of the application of SOMs in the field of information retrieval, refer to [7,8].

The resulting representation of documents due to its topical organization offers itself to interactive exploration. However, to allow its application in an information retrieval or data mining setting, providing competitive response times is crucial, calling for a parallel processing of the SOM training process.


Analysis of Experimental Results

Table 1 summarizes the execution time for training the map using par SOM. We used POSIX threads, a shared address space and user-level spinlocks in our implementation. The test were conducted on a number of multi-processor machines with dissimilar memory architectures:


Table 1: Execution times and speedups on multiprocessor systems
  SGI PowerChallenge SGI Origin 2000 Dual Celeron PC
  time   time   time  
threads elapsed relative speedup elapsed relative speedup elapsed relative speedup
1  1555.591  1.0000  1.0000  461.940  1.0000  1.0000  1521.900  1.0000  1.0000
2 533.265 0.3428 2.9171 209.790 0.4541 2.2021 1037.850 0.6819 1.4664
| | | | | | | |-|-|-| 3 231.964 0.1491 6.7069 150.223 0.3252 3.0750      
4 180.035 0.1157 8.6430 117.100 0.2534 3.9448      
5 128.025 0.0886 11.3122 114.930 0.2488 4.0178      
6 117.228 0.0753 13.2802 102.134 0.2211 4.5216      
7 96.632 0.0621 16.1030 90.817 0.1966 5.0854      
8 91.481 0.0588 17.0068 83.240 0.1802 5.5493      
| | | | |-|-|-| 9 83.333 0.0535 18.6915            
10 80.993 0.0520 19.2307            
11 76.701 0.0493 20.2839            
12 70.390 0.0452 22.1238            
|-|-|-|-|                  


The dual-processor Celeron system is a commodity personal computer. The small cache sizes of the processors require large amounts of memory transfers. Due to the limited memory bandwidth, not even a linear speedup is achievable in this configuration. The only apparent cause for the bad scalability can be attributed to bus contention. On this system, inverting the training loop does not cause a significant performance gain.

An entirely different performance characteristic was obtained on our principal development platform. On the SGI PowerChallenge, which uses a shared memory architecture, a 22-fold increase in performance could be achieved with only 12 processors. Although this may seem paradox, it should be expected. The performance increases very quickly for the first few threads, until the amount of cache memory exceeds the size of the map (compare Figure 4). After that, the performance increase continues in a linear fashion. We provided time measurements up to 12 processors only, as the machine was not available for dedicated testing. The scalability on this system remains unaffected by the available memory bandwidth: the entire SOM is distributed through the cache memories of the various processor, effectively avoiding bus contention. Interestingly, loop inversion leads to a large increase in performance for the uniprocessor version on this machine. This optimization resulted in a 22% increase in performance.

Figure 4: Performance on a SGI PowerChallenge: A shared-memory multiprocessor system with a high-enough memory bandwidth can lead to a better than linear speedup.
\begin{figure}
\begin{center}
\epsfig {file=speedup.eps, height=50mm}\end{center}\vspace*{-10mm}
\end{figure}

The performance gains on the SGI Origin were less impressive than those on the PowerChallenge. Although the current implementation scales to 8 processors and achieves a 5.5-fold performance increase, it still offers room for improvement. This is due to our dependence on POSIX threads and how these are implemented in IRIX 6.5. Instead of allocating a separate heap for each thread, a shared heap is used, which has its affinity hard-wired to the processor of the main thread. Understandably, performance degrades in such an environment, if many threads are used. Nonetheless, the speedup between the single-processor and dual-processor implementation shows again how the availability of sufficient cache memory to hold the entire SOM has a significant performance impact. We are currently working on a MPI-based implementation to create an optimal version for this architecture.


Conclusions and Future Work

Our experiments with parallel, software-based implementations of self-organizing maps for Digital Libraries have shown that interactive response times--even for high dimensional feature spaces--can be achieved using supercomputing resources. The used algorithms do not only scale well, but offer an additional performance advantage due to cache effects. Parallelism effectively becomes a method to overcome the memory latencies in self-organizing neural networks.

We are currently adding MPI support. Dynamic load-balancing becomes a necessity, if non-dedicated computing resources are used, as the background load may severely affect the performance of the system. In such environments, a dynamic redistribution of the map between nodes becomes necessary to provide peak performance. Support for SIMD extensions to common microprocessors and vector computers will be also be added in future versions.

Bibliography

1
P. Adam, H. Essafi, M.-P. Gayrad, and M. Pic.
High-performance programming support for multimedia document database management.
In High Performance Computing and Networking.

2
G. DeBoeck and T. Kohonen.
Visual Explorations in Finance.
Springer Verlag, Berlin, Germany, 1998.

3
D. Hammerstrom.
A VLSI architecture for high-performance, low-cost, on-chip learning.
In International Joint Conference on Neural Networks.

4
T. Kohonen.
Self-Organizing Maps.
Springer Verlag, Berlin, Germany, 1995.

5
J. Liu and M. Brooke.
A fully parallel learning neural network chip for real-time control.
In International Joint Conference on Neural Networks, 1999.

6
A. Rauber.
LabelSOM: On the labeling of self-organizing maps.
In Proc. International Joint Conference on Neural Networks, Washington, DC, 1999.

7
A. Rauber and D. Merkl.
Using self-organizing maps to organize document collections and to characterize subject matters.

8
A. Rauber and D. Merkl.
Finding structure in text archives.
In Proc. European Symp. on Artificial Neural Networks (ESANN98), Bruges, Belgium, 1998.

9
B. D. Ripley.
Pattern Recognition and Neural Networks.
Cambridge University Press, Cambridge, UK, 1996.

10
G. Salton.
Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer.
Addison-Wesley, Reading, MA, 1989.

11
O. Simula, P. Vasara, J. Vesanto, and R. R. Helminen.
The self-organizing map in industry analysis.
In L.C. Jain and V.R. Vemuri, editors, Industrial Applications of Neural Networks, Washington, DC., 1999. CRC Press.

About this document ...

par SOM: Using parallelism to overcome memory latency in self-organizing neural networks

This document was generated using the LaTeX2HTML translator Version 99.1 release (March 30, 1999)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 tom_hpcn2000.tex

The translation was initiated by Andreas RAUBER on 2000-08-04


next_group up previous
Andreas RAUBER
2000-08-04