Department of Software Technology
Vienna University of Technology


The Growing Hierarchical Self-Organizing Map (GHSOM) - Short Guide

1. General

The following sections provide a short overview of how to use our implementation of the GHSOM. The source code, as well as a compiled version, can be obtained via the download page of the SOMLib Digital Library project at http://www.ifs.tuwien.ac.at/~andi/somlib. In order to compile the source code, unpack the gzipped tar archive, change into the respective directory, and simply do a

./configure
make

This should leave you with an executable program called ghsom that you can either use from your current directory or put in any directory on your executable search path. (If it does not... well... then it's time for troubleshooting :)

2. Input Files

The data to be clustered using the GHSOM is represented by 2 input files in plain-text ASCII format. The fields enclosed in angle brackets have to be substituted by the actual values. You can use any of the demo files provided on the web or build your own data files following this format:

Inputvector file
The numbers of features and input vectors are integer values; the individual vector elements may be integer or real values >=0.

$TYPE inputvec
$XDIM <# of input vectors>
$YDIM 1
$VECDIM <# of features>
<feat 1> <feat 2> ... <name of input vector 1>
<feat 1> <feat 2> ... <name of input vector 2>
            .
            .
            .
The first line, TYPE, contains just a free-form text tag (a single word) allowing you to label the vector file.
XDIM lists the number of input vectors you have in your file.
YDIM for an input vector file always has to be set to 1. (If you want to know why, drop me an e-mail. It has its reasons.)
VECDIM gives the dimensionality of your feature space.
Following the general header, you have 1 line for each vector, listing the values of the attributes as space-separated real numbers, followed by a label for the vector, which again can be any text/id. The data block thus has XDIM rows of VECDIM+1 columns each.
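
As a rough illustration of this layout, the following Python sketch builds the text of a small input vector file. The function name format_input_vectors and the sample data are made up for this example and are not part of the GHSOM distribution; requiring single-word labels is an assumption made here to keep the space-separated lines unambiguous.

```python
def format_input_vectors(vectors, tag="inputvec"):
    """Return the text of an input vector file.

    `vectors` is a list of (label, features) pairs. All feature lists
    must have the same length and contain only values >= 0.
    """
    if not vectors:
        raise ValueError("at least one input vector is required")
    vec_dim = len(vectors[0][1])
    lines = [
        f"$TYPE {tag}",
        f"$XDIM {len(vectors)}",  # number of input vectors
        "$YDIM 1",                # always 1 for an input vector file
        f"$VECDIM {vec_dim}",     # number of features
    ]
    for label, features in vectors:
        if len(features) != vec_dim:
            raise ValueError(f"vector {label!r} has the wrong dimensionality")
        if any(value < 0 for value in features):
            raise ValueError(f"vector {label!r} contains a negative value")
        # Assumption of this sketch: labels are single words.
        if len(label.split()) != 1:
            raise ValueError(f"label {label!r} must be a single word")
        values = " ".join(str(value) for value in features)
        lines.append(f"{values} {label}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data: two vectors with three features each.
sample = [("doc1", [0.5, 0, 2]), ("doc2", [1, 1.5, 0])]
text = format_input_vectors(sample)
print(text, end="")
```

Writing the returned text to a file (for example with open(path, "w")) gives you something you can point the training at; whether your particular data needs further preprocessing is a separate question.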

Template vector file
The template vector file basically describes the dimensions of your feature space, and may be used to assign labels to the clusters by selecting the dimension identifiers that are most characteristic for a given cluster. The file format is basically as follows:

$TYPE template
$XDIM 7
$YDIM <# of input vectors>
$VECDIM <# of features>
0 <name of feature 1> <df> <tf> <min_tf> <max_tf> <mean_tf>
1 <name of feature 2> <df> <tf> <min_tf> <max_tf> <mean_tf>
2 <name of feature 3> <df> <tf> <min_tf> <max_tf> <mean_tf>

:
:
:
<# of features - 1>  <name of feature n> <df> <tf> <min_tf> <max_tf> <mean_tf>
TYPE again is a free-text identifier characterizing the file.
XDIM always is set to 7, as the template vector file contains 7 columns of information.
YDIM usually is set to the number of vectors you want to analyze. This may be used in text analysis to calculate weighted values such as tfxidf.
VECDIM again gives the dimensionality of your feature space.
Following this header you have lines describing each dimension, with the columns representing, respectively, the number of the dimension (a counter starting with 0), the name of the dimension (a single word), the document frequency of the attribute, the term frequency of the attribute in your data collection, as well as the min, max and average term frequencies.
This information has been optimized for text data analysis. If you want to use the GHSOM for different purposes, you can safely set all the frequency values to 1 - they are not used during the training process.
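
For non-text data, a template vector file with all frequency values set to 1 can be generated mechanically. The Python sketch below shows one way to do that; the function name format_template_vectors and the feature names are invented for this example.

```python
def format_template_vectors(feature_names, num_vectors, tag="template"):
    """Return the text of a template vector file.

    All five frequency columns are set to 1, which is sufficient when
    the file is not used for text analysis.
    """
    lines = [
        f"$TYPE {tag}",
        "$XDIM 7",                        # always 7 columns of information
        f"$YDIM {num_vectors}",           # number of input vectors
        f"$VECDIM {len(feature_names)}",  # number of features
    ]
    for index, name in enumerate(feature_names):  # counter starts with 0
        if len(name.split()) != 1:
            raise ValueError(f"feature name {name!r} must be a single word")
        lines.append(f"{index} {name} 1 1 1 1 1")
    return "\n".join(lines) + "\n"


# Hypothetical feature names for a data set with 2 input vectors.
template_text = format_template_vectors(["legs", "wings", "feathers"], 2)
print(template_text, end="")
```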

3. Usage

All parameters for the GHSOM training have to be specified in a property file, which is described in more detail below. In order to train a map, you simply call ghsom with that property file as input, i.e.

ghsom <property file>


The property file is a simple plain-text file consisting of several property-value pairs like this:

property1=value1
property2=value2
property3=value3
...

ATTENTION: no whitespace is allowed between the property or the value and the equal sign. Furthermore, no trailing whitespace should be present after the value.
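
Since stray whitespace is easy to overlook, it may help to check a property file before training. The following Python sketch tests exactly the two rules stated above; the function name find_whitespace_problems is made up for this example, and skipping blank lines is an assumption of the sketch, not documented ghsom behavior.

```python
def find_whitespace_problems(text):
    """Return (line_number, message) pairs for lines that break the rules."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are simply skipped here
        if "=" not in line:
            problems.append((number, "missing '='"))
            continue
        name, value = line.split("=", 1)
        if name[-1:].isspace():
            problems.append((number, "whitespace before '='"))
        if value[:1].isspace():
            problems.append((number, "whitespace after '='"))
        if value[-1:].isspace():
            problems.append((number, "trailing whitespace after the value"))
    return problems


# Hypothetical property file content with three deliberate mistakes.
sample_properties = (
    "TAU_1=0.25\n"
    "TAU_2 =0.1\n"
    "HTML_PREFIX= demo\n"
    "savePath=output \n"
)
for number, message in find_whitespace_problems(sample_properties):
    print(f"line {number}: {message}")
```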

If you don't provide one or more of the following properties, a default value will be set for them.

Each property is listed below with its type, its range, and a description.

EXPAND_CYCLES
Type: int
Range: >=1
Description: # of cycles after which the map is checked for possible expansion; 1 cycle means as many pattern presentations as there are input vectors.
Example: 100 input vectors, 10 cycles = 1000 times a randomly chosen pattern is presented to the SOM for learning.

TAU_1
Type: real
Range: [0-1]
Description: percentage of remaining error that has to be explained by each map, a kind of stopping criterion for horizontal growth. The smaller this value, the larger each map will grow, and the flatter the hierarchy will be.
A good starting point may be a value of about 0.25.

TAU_2
Type: real
Range: [0-1]
Description: final degree of granularity represented by the maps in the lowest layer. The smaller, the more detailed the data representation will be, and thus the bigger the overall GHSOM structure.
An appropriate value for testing may be 0.1 or less; if you set this property to 1, only one single SOM in the first layer will be trained.

INITIAL_LEARNRATE
Type: real
Range: [0-1]
Description: determines how strongly the winner and its neighboring units are initially adapted; decreases over time.
Good starting point: 0.8

NR
Type: real
Range: <=0
Description: NEW in GHSOM-1.6: height of Gaussian Neighborhood at map border.
Defines the "height" of the Gaussian Neighborhood bubble at the unit which is max-x (width) or max-y (height) away from the winner (used to calculate sigma of the Gaussian kernel) - replaces the initial neighborhood radius of version 1.5.

HTML_PREFIX
Type: string
Range: -
Description: prefix for the output files. All files will be labeled that way, followed by an underscore and subsequent numbering.

DATAFILE_EXTENSION
Type: string
Range: may be empty
Description: suffix for the reference of the data files in the HTML tables.
We usually name the vectors in the input vector file to link to the actual files but omit the extension to get "better looking" maps; if you do so, you have to provide the extension to get the correct links to the document files. For browsing, the document files are always expected in a subdirectory files of the directory where the HTML files are located.

randomSeed
Type: int
Range: any
Description: initial seed value for the random number generator to enable repeatable training runs.

inputFile
Type: string
Range: -
Description: path (relative to the current directory you are in, or absolute) + name of the input vector file (vectors/test.in)

descriptionFile
Type: string
Range: -
Description: path (relative to the current directory you are in, or absolute) + name of the template vector file (vectors/test.tv)

savePath
Type: string
Range: -
Description: directory where the output files are written (without trailing slash). Note: make sure that this directory exists, and that you have write permissions on it! :) (output)

normInputVectors
Type: string
Range: NONE | LENGTH | INTERVAL
Description: if and how the input vectors are normalized; NONE=raw input data will be used; LENGTH=vectors are normalized to length 1; INTERVAL=vector elements are transformed into the interval [0-1].

INITIAL_X_SIZE
Type: int
Range: >=1
Description: initial size of new maps in x-direction. For any growing map you will want to set this to 2, from which the map will start to grow. However, you can set it to any desired size right away.

INITIAL_Y_SIZE
Type: int
Range: >=1
Description: initial size of new maps in y-direction. For any growing map you will want to set this to 2, from which the map will start to grow. However, you can set it to any desired size right away. If you set this value to 1, you will create a 1-dimensional SOM that grows only linearly, resulting, if expanded hierarchically, in a tree-like representation of your data.

LABELS_NUM
Type: int
Range: >=0
Description: max # of labels per unit; 0 = no labels.
The labelSOM method is used to select those features that are most characteristic of the respective unit to describe it.

LABELS_ONLY
Type: bool
Range: true | false
Description: if 'true', only the labels will be shown on nodes which have been expanded into the next layer, along with a link labeled "down". Setting this property to 'false' is only useful for testing small data sets to see which data is mapped onto the corresponding map in the next layer.

LABELS_THRESHOLD
Type: real
Range: [0-1]
Description: features which are most important are used as labels; a value of 0.8 means that only the features with values in the top 20% of all are printed as labels; the lower this value, the more labels will be shown (limited by LABELS_NUM).
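
To give an idea of what the LENGTH and INTERVAL settings of normInputVectors mean, here is a small Python sketch. It is only an illustration and not taken from the ghsom source: it assumes that "length" means the Euclidean norm and that INTERVAL rescales each feature by its minimum and maximum over all input vectors. The function names are made up for this example.

```python
import math


def normalize_length(vector):
    """Scale a vector to Euclidean length 1 (assumed meaning of LENGTH)."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0:
        return list(vector)  # an all-zero vector cannot be scaled
    return [value / norm for value in vector]


def normalize_interval(vectors):
    """Rescale each feature to [0, 1] over all vectors (assumed meaning of INTERVAL)."""
    columns = list(zip(*vectors))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    result = []
    for vector in vectors:
        row = []
        for value, low, high in zip(vector, lows, highs):
            # A constant feature carries no information; map it to 0.
            row.append(0.0 if high == low else (value - low) / (high - low))
        result.append(row)
    return result


# Hypothetical sample data.
unit_vector = normalize_length([3, 4])
scaled = normalize_interval([[0, 10], [5, 20], [10, 30]])
print(unit_vector)
print(scaled)
```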



Comments: rauber@ifs.tuwien.ac.at