Vienna University of Technology
Institute of Software Technology and Interactive Systems
Data Mining with the Java SOMToolbox

SOMLib Input files: Quick guide

For a more detailed description of all file formats, see the complete specification.

The data to be clustered using the GHSOM is described by two input files in plain-text ASCII format: an input vector file and a template vector file. The fields enclosed in angle brackets are placeholders that have to be substituted by the actual values. You can use any of the demo files provided on the web or build your own data files following this format:

Input vector file

The number of features and the number of input vectors are integer values. The individual vector elements may be integer or real values >= 0.

    $TYPE inputvec
    $XDIM <# of input vectors>
    $YDIM 1
    $VECDIM <# of features>
    
    <feat 1> <feat 2> ... <name of input vector 1>
    <feat 1> <feat 2> ... <name of input vector 2>
                .
                .
                .
    

TYPE, on the first line, contains just a free-form text tag (a single word) allowing you to label the vector file.
XDIM lists the number of input vectors you have in your file.
YDIM always has to be set to 1 for an input vector file. (If you want to know why, drop me an e-mail. There are reasons for it.)
VECDIM gives the dimensionality of your feature space.

Following the general header, you have one line for each vector, listing the values of the attributes as space-separated real numbers, followed by a label for the vector, which again can be any text/id. The data thus forms a matrix with XDIM rows and VECDIM+1 columns.
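As a minimal sketch of the layout described above, the following Python function builds the text of an input vector file from a list of labelled vectors. The function name `format_input_vectors` and the sample data are illustrative assumptions and are not part of the SOMToolbox.

```python
def format_input_vectors(vectors, labels, type_tag="inputvec"):
    """Return the text of an input vector file (hypothetical helper).

    vectors: list of equal-length lists of numbers >= 0
    labels:  one label per vector
    """
    if len(vectors) != len(labels):
        raise ValueError("need exactly one label per vector")
    vecdim = len(vectors[0]) if vectors else 0
    lines = [
        f"$TYPE {type_tag}",
        f"$XDIM {len(vectors)}",  # number of input vectors
        "$YDIM 1",                # always 1 for an input vector file
        f"$VECDIM {vecdim}",      # number of features
    ]
    for vec, label in zip(vectors, labels):
        if len(vec) != vecdim:
            raise ValueError(f"vector {label!r} does not have {vecdim} features")
        if any(value < 0 for value in vec):
            raise ValueError(f"vector {label!r} contains a negative value")
        # One line per vector: space-separated values, then the label.
        lines.append(" ".join(str(value) for value in vec) + " " + label)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two made-up vectors with three features each.
    print(format_input_vectors([[1, 0, 2.5], [0, 3, 1]], ["doc_a", "doc_b"]), end="")
```

Writing the returned string to a plain-text file should give a file in the format shown above, with XDIM and VECDIM derived from the data rather than typed by hand.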

Template vector file

The template vector file describes the dimensions of your feature space, and may be used to assign labels to the clusters by selecting the dimension identifiers that are most characteristic for a given cluster. The file format is as follows:

    $TYPE template
    $XDIM 7
    $YDIM  <# of input vectors>
    $VECDIM <# of features>
    0 <name of feature 1> <df> <tf> <min_tf> <max_tf> <mean_tf>
    1 <name of feature 2> <df> <tf> <min_tf> <max_tf> <mean_tf>
    2 <name of feature 3> <df> <tf> <min_tf> <max_tf> <mean_tf>
    
    :
    :
    :
    <# of features - 1>  <name of feature n> <df> <tf> <min_tf> <max_tf> <mean_tf>
    

TYPE again is a free-text identifier characterizing the file.
XDIM is always set to 7, as the template vector file contains 7 columns of information.
YDIM is usually set to the number of vectors you want to analyze. This value may be used in text analysis to calculate weighted values such as tf x idf.
VECDIM again gives the dimensionality of your feature space.
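To illustrate how the YDIM value can enter such a weighting, here is a minimal sketch of one common tf x idf variant, tf * ln(N / df), where N is the number of vectors (YDIM) and df is the document frequency of the feature. The exact weighting scheme used with any particular data set is not stated here, so both the formula variant and the function name `tfidf_weight` are assumptions.

```python
import math


def tfidf_weight(tf, df, n_vectors):
    """Weight a term frequency by inverse document frequency (assumed variant).

    tf:        term frequency of the feature in one vector
    df:        number of vectors in which the feature occurs
    n_vectors: total number of vectors (the YDIM value of the template file)
    """
    if df <= 0:
        return 0.0  # a feature that occurs nowhere carries no weight
    return tf * math.log(n_vectors / df)
```

A feature that occurs in every vector (df equal to N) gets weight 0 under this variant, while rare features are weighted up.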

Following this header you have VECDIM lines, one for each dimension, with the columns representing, respectively, the number of the dimension (a counter starting with 0), the name of the dimension (a single word), the document frequency of the attribute, the term frequency of the attribute in your data collection, as well as the min, max and average term frequencies.
This information has been optimized for text data analysis. If you want to use the GHSOM for different purposes, you can safely set all the frequency values to 1, since they are not used during the training process.
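For non-text data, a template vector file with all frequency values set to 1 can be generated mechanically. The following is a minimal sketch; the function name `format_template` and the sample feature names are illustrative assumptions and are not part of the SOMToolbox.

```python
def format_template(feature_names, n_vectors, type_tag="template"):
    """Return the text of a template vector file with placeholder frequencies."""
    lines = [
        f"$TYPE {type_tag}",
        "$XDIM 7",                         # the file always has 7 columns
        f"$YDIM {n_vectors}",              # number of input vectors
        f"$VECDIM {len(feature_names)}",   # number of features
    ]
    for index, name in enumerate(feature_names):
        if len(name.split()) != 1:
            raise ValueError(f"feature name {name!r} must be a single word")
        # Columns: counter, name, df, tf, min_tf, max_tf, mean_tf (all set to 1).
        lines.append(f"{index} {name} 1 1 1 1 1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three made-up feature names for a data set of 150 vectors.
    print(format_template(["length", "width", "height"], n_vectors=150), end="")
```

The counter in the first column starts at 0, so the last line carries the index VECDIM - 1.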