Department of Software Technology
Vienna University of Technology

The Growing Hierarchical Self-Organizing Map (GHSOM) - Short Guide

1. General
2. Input Files
3. Usage

1. General

The following sections provide a short overview of how to use our implementation of the GHSOM. The source code, as well as a compiled version, can be obtained via the download page of the SOMLib Digital Library project at http://www.ifs.tuwien.ac.at/~andi/somlib. To compile the source code, unpack the gzipped tar archive, change into the resulting directory, and run

./configure
make

This should leave you with an executable program called ghsom that you can either run from your current directory or put in any directory on your executable search path. (If it does not... well... then it's time for trouble-shooting :)

2. Input Files

The data to be clustered using the GHSOM is represented by 2 input files in plain-text ASCII format. The fields enclosed in angle brackets have to be replaced by the actual values. You can use any of the demo files provided on the web or build your own data files following this format:

Inputvector file
The number of features and the number of input vectors are integer values. The individual vector elements may be integer or real values >=0.

$TYPE inputvec
$XDIM <# of input vectors>
$YDIM 1
$VECDIM <# of features>
<feat 1> <feat 2> ... <name of input vector 1>
<feat 1> <feat 2> ... <name of input vector 2>
            .
            .
            .

The first line, TYPE, contains just a free-form text tag (a single word) allowing you to label the vector file.
XDIM gives the number of input vectors you have in your file.
YDIM always has to be set to 1 for an input vector file. (If you want to know why, drop me an e-mail. It has its reasons.)
VECDIM gives the dimensionality of your feature space.
Following the general header, you have 1 line for each vector, listing the values of the attributes as space-separated real numbers, followed by a label for the vector, which again can be any text/id. The matrix thus is VECDIM+1 x XDIM.
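
The format described above can be produced by a short script. The following is only a minimal sketch in Python: the function name format_input_vectors and the sample data are hypothetical and not part of the GHSOM distribution, and the check that labels contain no white space is an assumption made so that the label remains a single trailing field.

```python
def format_input_vectors(vectors, labels):
    """Return the text of an input vector file in the format shown above.

    vectors: non-empty list of equal-length lists of numbers >= 0
    labels:  one label per vector (assumed here to contain no white space)
    """
    if not vectors or len(vectors) != len(labels):
        raise ValueError("need at least one vector and exactly one label per vector")
    vecdim = len(vectors[0])
    lines = [
        "$TYPE inputvec",
        f"$XDIM {len(vectors)}",  # number of input vectors
        "$YDIM 1",                # always 1 for an input vector file
        f"$VECDIM {vecdim}",      # number of features
    ]
    for vec, label in zip(vectors, labels):
        if len(vec) != vecdim:
            raise ValueError(f"vector {label!r} does not have {vecdim} features")
        if any(v < 0 for v in vec):
            raise ValueError(f"vector {label!r} contains a negative value")
        if label.split() != [label]:
            raise ValueError(f"label {label!r} must be a single word")
        # feature values first, label last, separated by single spaces
        lines.append(" ".join(str(v) for v in vec) + " " + label)
    return "\n".join(lines) + "\n"


# Hypothetical sample data: two vectors with two features each.
sample = format_input_vectors([[1, 0.5], [0, 2]], ["doc1", "doc2"])
```

Writing the returned text to a file, for example vectors/test.in, gives a file that follows the header and row layout described above.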

Template vector file
The template vector file describes the dimensions of your feature space. It may be used to assign labels to the clusters by selecting the dimension identifiers that are most characteristic of a given cluster. The file format is as follows:

$TYPE template
$XDIM 7
$YDIM  <# of input vectors>
$VECDIM <# of features>
0 <name of feature 1> <df> <tf> <min_tf> <max_tf> <mean_tf>
1 <name of feature 2> <df> <tf> <min_tf> <max_tf> <mean_tf>
2 <name of feature 3> <df> <tf> <min_tf> <max_tf> <mean_tf>

:
:
:
<# of features - 1>  <name of feature n> <df> <tf> <min_tf> <max_tf> <mean_tf>

TYPE again is a free-text identifier characterizing the file.
XDIM is always set to 7, as the template vector file contains 7 columns of information.
YDIM usually is set to the number of vectors you want to analyze. This may be used in text analysis to calculate weighted values such as tf x idf.
VECDIM again gives the dimensionality of your feature space.
Following this header you have one line describing each dimension, with the columns representing, respectively, the number of the dimension (a counter starting with 0), the name of the dimension (a single word), the document frequency of the attribute, the term frequency of the attribute in your data collection, as well as the min, max and average term frequencies.
This information has been optimized for text data analysis. If you want to use the GHSOM for different purposes, you can safely set all the frequency values to 1 - they are not used during the training process.
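
For data that is not text, a template vector file with all frequency values set to 1 can be generated mechanically. The following Python sketch illustrates this. The function name format_template_vectors and the feature names are hypothetical, and the sketch only mirrors the layout shown above.

```python
def format_template_vectors(feature_names, n_vectors):
    """Return the text of a template vector file with all frequencies set to 1.

    feature_names: one single-word name per dimension of the feature space
    n_vectors:     number of input vectors to be analyzed (written as $YDIM)
    """
    lines = [
        "$TYPE template",
        "$XDIM 7",                        # the file always has 7 columns
        f"$YDIM {n_vectors}",
        f"$VECDIM {len(feature_names)}",
    ]
    for index, name in enumerate(feature_names):  # counter starts with 0
        if name.split() != [name]:
            raise ValueError(f"feature name {name!r} must be a single word")
        # columns: number, name, df, tf, min_tf, max_tf, mean_tf
        lines.append(f"{index} {name} 1 1 1 1 1")
    return "\n".join(lines) + "\n"


# Hypothetical example: a 2-dimensional feature space and 100 input vectors.
sample = format_template_vectors(["height", "width"], 100)
```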

3. Usage

All parameters for the GHSOM training have to be specified in a property file, which is described in more detail below. In order to train a map, you simply call ghsom with that property file as input, i.e.

ghsom <property-file>


The property file is a simple plain-text file consisting of several property-value pairs like this:
property1=value1
property2=value2
property3=value3
...


ATTENTION: no white space is allowed between the property name, the equals sign, and the value. Furthermore, no trailing white space should be present after the value.
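
Because of these white-space rules, it can be convenient to generate the property file from a script instead of typing it by hand. The following Python sketch is one way to do so. The function name format_properties is hypothetical, the property names are taken from the table in this section, and the values are only the starting points and example paths mentioned there.

```python
def format_properties(properties):
    """Return property-file text with one name=value pair per line.

    Leading and trailing white space is stripped from names and values so
    that nothing surrounds the equals sign and no value has trailing blanks.
    """
    lines = []
    for name, value in properties.items():
        name = str(name).strip()
        value = str(value).strip()
        if not name or "=" in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid property name: {name!r}")
        lines.append(f"{name}={value}")
    return "\n".join(lines) + "\n"


# Illustrative settings only; adjust them to your own data.
example = format_properties({
    "EXPAND_CYCLES": 10,
    "TAU_1": 0.25,
    "TAU_2": 0.1,
    "INITIAL_LEARNRATE": 0.8,
    "inputFile": "vectors/test.in",
    "descriptionFile": "vectors/test.tv",
    "savePath": "output",
    "normInputVectors": "LENGTH",
})
```

Properties left out of the dictionary are simply not written to the file.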

If you omit one or more of the following properties, a default value will be used for each of them.



Property	Type	Range	Description
EXPAND_CYCLES	int	>=1	number of cycles after which the map is checked for possible expansion; 1 cycle means as many pattern presentations as there are input vectors. Example: with 100 input vectors and 10 cycles, a randomly chosen pattern is presented to the SOM for learning 1000 times
TAU_1	real	[0-1]	fraction of the remaining error that has to be explained by each map, i.e. a kind of stopping criterion for horizontal growth. The smaller this value, the larger each map will grow and the flatter the hierarchy will be. A good starting point may be a value of about 0.25
TAU_2	real	[0-1]	final degree of granularity represented by the maps in the lowest layer. The smaller this value, the more detailed the data representation will be, and thus the bigger the overall GHSOM structure. An appropriate value for testing may be 0.1 or less; if you set this property to 1, only one single SOM in the first layer will be trained
INITIAL_LEARNRATE	real	[0-1]	determines how strongly the winner and its neighboring units are initially adapted; decreases over time. Good starting point: 0.8
NR	real	<=0	NEW in GHSOM-1.6: height of the Gaussian neighborhood at the map border. Defines the "height" of the Gaussian neighborhood bubble at the unit which is max-x (width) or max-y (height) away from the winner (used to calculate sigma of the Gaussian kernel); replaces the initial neighborhood radius of version 1.5
HTML_PREFIX	string	-	prefix for the output files. All files will be named with this prefix, followed by an underscore and subsequent numbering
DATAFILE_EXTENSION	string	may be empty	suffix for the reference to the data files in the HTML tables; we usually name the vectors in the input vector file so that they link to the actual files, but omit the extension to get "better looking" maps; if you do so, you have to provide the extension to get the correct links to the document files; for browsing, the document files are always expected in a subdirectory "files" of the directory where the HTML files are located
randomSeed	int	any	initial seed value for the random number generator to enable repeatable training runs
inputFile	string	-	path (relative to the current directory or absolute) + name of the input vector file (vectors/test.in)
descriptionFile	string	-	path (relative to the current directory or absolute) + name of the template vector file (vectors/test.tv)
savePath	string	-	directory where the output files are written, without trailing slash (output). Note: make sure that this directory exists, and that you have write permissions on it! :)
normInputVectors	string	NONE | LENGTH | INTERVAL	if and how the input vectors are normalized; NONE=raw input data will be used; LENGTH=vectors are normalized to length 1; INTERVAL=vector elements are transformed into the interval [0-1]
INITIAL_X_SIZE	int	>=1	initial size of new maps in x-direction. For any growing map you will want to set this to 2, from which the map will start to grow. However, you can set it to any desired size right away.
INITIAL_Y_SIZE	int	>=1	initial size of new maps in y-direction. For any growing map you will want to set this to 2, from which the map will start to grow. However, you can set it to any desired size right away. If you set this value to 1, you will create a 1-dimensional SOM that grows only linearly, resulting, if expanded hierarchically, in a tree-like representation of your data.
LABELS_NUM	int	>=0	max number of labels per unit; 0 = no labels. The labelSOM method is used to select those features that are most characteristic of the respective unit to describe it.
LABELS_ONLY	bool	true | false	if 'true', only the labels will be shown on nodes which have been expanded into the next layer, along with a link labeled "down". Setting this property to 'false' is only useful for testing small data sets, to see which data is mapped onto the corresponding map in the next layer.
LABELS_THRESHOLD	real	[0-1]	the most important features are used as labels; a value of 0.8 means that only the features with values in the top 20% of all are printed as labels; the lower this value, the more labels will be shown (limited by LABELS_NUM)
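
To get a feeling for the two normalization modes of normInputVectors, the following Python sketch reproduces them outside of ghsom. It is an illustration under stated assumptions, not the code used by the program: LENGTH is assumed to mean scaling each vector to Euclidean length 1, and INTERVAL is assumed to mean a per-feature min-max scaling into [0, 1]. The function names are hypothetical.

```python
import math


def normalize_length(vectors):
    """Scale each vector to Euclidean length 1 (assumed meaning of LENGTH)."""
    result = []
    for vec in vectors:
        norm = math.sqrt(sum(v * v for v in vec))
        # an all-zero vector cannot be scaled and is returned unchanged
        result.append([v / norm for v in vec] if norm > 0 else list(vec))
    return result


def normalize_interval(vectors):
    """Min-max scale each feature into [0, 1] (assumed meaning of INTERVAL)."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            # a constant feature carries no information and is mapped to 0.0
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(vec, lows, highs)
        ]
        for vec in vectors
    ]
```

With these assumptions, the vector (3, 4) becomes approximately (0.6, 0.8) under LENGTH, while under INTERVAL the smallest value of each feature maps to 0 and the largest to 1.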
  











Comments: rauber@ifs.tuwien.ac.at
