Department of Software Technology
Vienna University of Technology


LabelSOM II: Labeling a SOMLib Digital Library with KEA

Step-by-Step Guide

1. Motivation

This step-by-step guide describes how to label maps created by the SOMLib Digital Library System with key phrases rather than simple keywords. It uses the KEA keyphrase extraction tool, rather than the labelSOM method, to extract phrases from the document clusters. Links to the downloadable software are provided, along with short descriptions of the available options. The rest of the document is organized as follows:

  1. Software and Prerequisites: An overview of where to get the software and how to install it
  2. Training a KEA Keyphrase Model: How to train your own KEA keyphrase model, with an explanation of the different options
  3. Extracting Keyphrases from Text Documents: Extracting keyphrases from text documents using different models and stopword lists, and the necessary options
  4. Labelling the Map: Parsing the unit files and generating the HTML output, with explanations of the different options

2. Software and Prerequisites

This section describes where to get the various modules and how to install them.

LabelSOM II

You can install it, for example, in /usr/local/labelsomII/. Just make sure that the installation directory is included in the $PATH variable (set, for example, in your .bashrc).

JDK 1.2.2

Some KEA modules require the Java Development Kit, which you can get at

http://java.sun.com/products/jdk/1.2/

KEA 2.0

You can find the software package plus a detailed description at

http://www.nzdl.org/Kea/index.html

You should install the modules in

/usr/local/KEA-2.0/KEA-2.0/

Please include this path in the $CLASSPATH environment variable. Otherwise you have to pass -cp KEAINSTALLPATH in each KEA call (KEA is written entirely in Java).
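A minimal sketch of how the two variables could be set, for example in your .bashrc. It assumes the example install locations used in this guide; the variable names KEA_HOME and LABELSOM_HOME are introduced here for illustration only and are not required by either package.

```shell
# Assumed install locations, taken from the examples in this guide; adjust as needed.
KEA_HOME=/usr/local/KEA-2.0/KEA-2.0
LABELSOM_HOME=/usr/local/labelsomII

# Append the KEA directory to CLASSPATH (no leading ':' if CLASSPATH was empty).
CLASSPATH="${CLASSPATH:+$CLASSPATH:}$KEA_HOME"
# Append the LabelSOM II directory to PATH so its programmes can be called by name.
PATH="$PATH:$LABELSOM_HOME"
export CLASSPATH PATH
```

With these settings the KEA classes should be found by java without the -cp option.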

Other Prerequisites

You should have a trained GHSOM as described in the SOMLib step-by-step guide. For our example we use the TIME60 collection, so after the training process you should end up with a directory structure like this:

[user@Kenny time60]# ls -l

  total 7752

  drwxr-xr-x 2 user user 32768 Aug 22 2001 files_cleaned

  lrwxrwxrwx 1 user user 16    Oct 23 2001 files -> /www/time/files/

  drwxr-xr-x 3 user user 12288 Jan  7 2002 output

  drwxrwxrwx 1 user user 17    Oct 23 2001 parser

  drwxr-xr-x 2 user user 4096  Aug 22 2001 props

  drwxrwxrwx 1 user user 18    Oct 23 2001 vectors

3. Building a keyphrase extraction model

To extract keyphrases for new documents, you first need to build a KEA keyphrase extraction model from a set of documents (preferably from the same domain) for which you have author-assigned keyphrases. To this end you have to go through the following steps:

  1. Create a directory, called, for example, "training_documents", containing the documents that you want to use for training the keyphrase extractor.
  2. Rename the document files in that directory so that they end with the suffix ".txt".
  3. Delete the author-assigned keyphrases from those documents and put them into separate ".key" files. For example, if your document file is called doc1.txt, move the keyphrases into a new file called "doc1.key". It is important that you put each keyphrase on a separate line in the .key file!
  4. Build the keyphrase extraction model by running KEAModelBuilder with the desired options (detailed information is given in the KEA documentation) (local copy).
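Before building the model it can help to verify steps 2 and 3. The following is a small sketch, not part of KEA: the function name check_training_dir is hypothetical, and it only checks that every ".txt" document has a ".key" file with the same base name.

```shell
# List every .txt document in a training directory that has no matching .key file.
# Prints nothing when every document has its keyphrase file.
check_training_dir() {
    dir=$1
    for doc in "$dir"/*.txt; do
        [ -e "$doc" ] || continue      # the directory contains no .txt files
        key="${doc%.txt}.key"          # doc1.txt -> doc1.key
        [ -f "$key" ] || echo "missing: $key"
    done
}
```

It could be called as check_training_dir training_documents before running KEAModelBuilder.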

For our example we can build the KEA model with the training documents that come with the KEA package. So we change into the KEA directory and build the model defaultmodel with the default values:

  [user@Kenny /]# cd /usr/local/KEA-2.0/KEA-2.0/

  [user@Kenny KEA-2.0]# java KEAModelBuilder -l CSTR_abstracts_train -m defaultmodel

   Building model with options: -l CSTR_abstracts_train -m defaultmodel -e default -x 3 -y 1 -o 2

  [user@Kenny KEA-2.0]#

4. Extracting Keyphrases for the Labeling Process

To extract keyphrases for our collection, we have to put the documents (for example the cleaned files) in a directory. The files have to end with the suffix .txt. So we create a subdirectory key in our working directory time60, copy the collection into this directory and rename the files.

  [user@Kenny time60]# mkdir key

  [user@Kenny time60]# cp files/* key -R

  [user@Kenny time60]# cd key/

  [user@Kenny key]# for file in *; do

  > mv -f "$file" "$file.txt"

  > done

  [user@Kenny key]#

NOTE: Whatever your original files are named, the keyphrase extraction and the later processing need files named in the following format:
original_filename.txt
NOTE II: You will get better results when extracting keyphrases from pure text rather than from HTML files or similar formats, so you should clean them up first.

Before we can extract the keyphrases from our documents we have to fix one shortcoming of KEA. KEA works with a fixed (hardcoded) English stopword list, which will give you bad results when you are working with a collection in a different language. So we need to recompile the specific module with the collection-specific stopwords found in the parsing process. (Note that you should adapt the script if you haven't set the CLASSPATH variable and/or KEA is installed in a different directory. You will find the script generate_stopwords in the directory where you have installed the LabelSOM II binaries.)

  [user@Kenny time60]# cd vectors/

  [user@Kenny vectors]# generate_stopwords time.removed.txt

  Programme: generate_stopwords 1.0

  extracting the stopwords from a giving file which has been generated through

  somlib_parser_script

  ...pressuming KEA is installed on /usr/local/KEA-2.0/KEA-2.0/

  Starting ....

  greping stopwords from time.removed.txt ...

  generating code

  Compiling Code

  Done ... you will find the defined stopwords in the file

  generate_stopwords.removed in this directory

  On your next run of the KEA_ModelBuilder or KEA_KeyphraseExtractor these words

  will be assumed as stopwords

  [user@Kenny vectors]# cd ..

We now have clean text files ready for keyphrase extraction; we just have to start the KEAKeyphraseExtractor with our previously built KEA model "defaultmodel". Don't forget the -a option, otherwise the further processing will not work. More detailed information and an explanation of all options are given in the KEA documentation. (local copy)

  [user@Kenny time60]# java KEAKeyphraseExtractor -m /usr/local/KEA-2.0/KEA-2.0/defaultmodel -l key -a -n 15

  Extracting keyphrases with options: -l key -m /usr/local/KEA-2.0/KEA-2.0/defaultmodel
  -e default -n 15 -a

  Avg. number of correct keyphrases: 0 +/- 0

  Based on 0 documents

  [user@Kenny time60]#

This will create a ".key" file for each document in the directory. Each file will contain fifteen (option -n) extracted keyphrases for the corresponding document. If they are no longer needed, we can remove the ".txt" files from the directory to save disk space.
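The cleanup can be done cautiously. The sketch below removes a ".txt" file only if a non-empty ".key" file with the same base name exists; the function name remove_extracted_txt is hypothetical, and the naming scheme (document.txt gives document.key) is assumed from the description above.

```shell
# Remove each .txt file in a directory once a non-empty .key file exists for it.
# Files without keyphrases are kept and reported.
remove_extracted_txt() {
    dir=$1
    for doc in "$dir"/*.txt; do
        [ -e "$doc" ] || continue      # the directory contains no .txt files
        key="${doc%.txt}.key"
        if [ -s "$key" ]; then
            rm -f "$doc"
        else
            echo "keeping $doc (no keyphrases found in $key)"
        fi
    done
}
```

In our example it would be called from the time60 directory as remove_extracted_txt key.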

5. Labeling Process

The labelling process is divided into two parts. First we parse the unit files generated by the GHSOM, calculate the labels and write a new unit file. In a second step, HTML output is generated from the new unit file. The two programmes are parse_unitfile and unit2html. Both use the same configuration file, which we will explain first.

Configuration file (download here)

  · NUMBEROFPHRASES 15
    This option affects both programmes and defines how many labels are calculated for each cluster.
  · OUTPUTPATH output/
    This is the relative path where parse_unitfile and unit2html find the ".unit" files and where all output is written.
  · KEYPATH output/key/
    This is the path, relative to the current directory, where the ".key" files are expected to be.
  · KEAFILEEXTENSION .key
    Keep this option unchanged as long as you do not change the suffix of the ".key" files.
  · SOURCEPATH files/
    The path, relative to the current directory, where the original source files can be found.
  · TMPKEAPATH output/tmp_kea/
    A temporary directory, relative to the current directory.
  · KEYFILESLINKPATH key/
    The path to the ".key" files, relative to the output directory.
  · SOURCEFILESLINKPATH files/
    The path to the original source files, relative to the output directory.
  · KEYLINKS_OUTPUT 1
    This option affects the HTML output only. It determines whether the hyperlinks to the ".key" files are generated or not (0 = FALSE, everything else = TRUE, default = true).
  · LABELSOMI_OUTPUT 0
    This option affects the HTML output only. If set to TRUE and the labels of LabelSOM I are available, they are written as well (default = true).
  · FILELINKS_OUTPUT 1
    This option affects the HTML output only. If set to TRUE, the links to the mapped files are generated (default = true).

NOTE: Make sure that all paths end with a slash (/).
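The trailing-slash rule is easy to overlook, so a quick check of the configuration file may be useful. The sketch below is not part of LabelSOM II: the function name check_config_paths is hypothetical, and it assumes one "NAME value" pair per line, as in the options listed above.

```shell
# Print every option whose name ends in PATH and whose value lacks a trailing slash.
# Returns a non-zero exit status if at least one such option is found.
check_config_paths() {
    awk '$1 ~ /PATH$/ && $2 !~ /\/$/ { print $1 " " $2; bad = 1 }
         END { exit (bad ? 1 : 0) }' "$1"
}
```

For the configuration file used in this guide the call would be check_config_paths labels.config; no output means all path options end with a slash.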

Before we can start we have to set up some directories for the HTML output:

  [user@Kenny time60]# cd output

  [user@Kenny output]# ln -s ../key key

  [user@Kenny output]# ln -s ../files files

  [user@Kenny output]# mkdir tmp_kea

Parsing the unit files

  parse_unitfile
  Usage: parse_unitfile processfile config_file

The programme takes two input parameters: the first is a process file listing all the files to be parsed, the second is the configuration file. In our example we have a trained GHSOM named time60 with 38 maps (unit files) in our output directory.

[user@Kenny output]# ls *.unit

  time60_10_2_2_2.unit time60_2_2_0_0.unit time60_33_4_1_1.unit

  time60_1_1_0_0.unit time60_22_3_0_1.unit time60_34_4_2_1.unit

  time60_11_2_0_3.unit time60_23_3_1_1.unit time60_35_4_1_2.unit

  time60_12_2_1_3.unit time60_24_3_2_1.unit time60_36_4_2_2.unit

  time60_13_2_2_3.unit time60_25_3_0_2.unit time60_37_5_1_2.unit

  time60_14_3_1_0.unit time60_26_3_2_2.unit time60_38_5_3_1.unit

  time60_15_3_1_0.unit time60_27_3_1_3.unit time60_4_2_2_0.unit

  time60_16_3_0_1.unit time60_28_3_2_3.unit time60_5_2_0_1.unit

  time60_17_3_1_1.unit time60_29_3_2_0.unit time60_6_2_1_1.unit

  time60_18_3_0_3.unit time60_30_3_0_2.unit time60_7_2_2_1.unit

  time60_19_3_0_0.unit time60_31_4_1_0.unit time60_8_2_0_2.unit

  time60_20_3_1_0.unit time60_3_2_1_0.unit time60_9_2_1_2.unit

  time60_21_3_2_0.unit time60_32_4_2_0.unit

Due to the hierarchical structure of a GHSOM we need to process the files in the right order, from the bottom (lowest layer) to the top. We have written a script which generates a proper process file. The usage of the script is generate_processfile <name_of_the_map>, and the output is written to the file process.parse_unitfile.

  [user@Kenny output]# generate_processfile time60

  ...written process.parse_unitfile !

  [user@Kenny output]# more process.parse_unitfile

  time60_38_5_3_1.unit

  time60_37_5_1_2.unit

  time60_36_4_2_2.unit

  time60_35_4_1_2.unit

  time60_34_4_2_1.unit

  time60_33_4_1_1.unit

  ....

  time60_6_2_1_1.unit

  time60_5_2_0_1.unit

  time60_4_2_2_0.unit

  time60_3_2_1_0.unit

  time60_2_2_0_0.unit

  time60_1_1_0_0.unit
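If the script is not at hand, the order shown above can be reproduced with standard tools. This is only a sketch and not the actual generate_processfile script: it assumes that the second underscore-separated field of each file name is the map number, that sorting by this number in descending order gives the required bottom-to-top order (as it does in the listing above), and that the map name itself contains no underscore. The function name write_processfile is hypothetical.

```shell
# Write the unit files of one GHSOM to process.parse_unitfile, highest map
# number first. Must be run in the directory that contains the .unit files.
write_processfile() {
    map=$1
    # Field 2 (separator '_') is the map number; sort it numerically, descending.
    printf '%s\n' "${map}"_*.unit | sort -t_ -k2,2nr > process.parse_unitfile
}
```

In our example it would be called in the output directory as write_processfile time60.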

So we can start the parsing process.

  [user@Kenny time60]# parse_unitfile output/process.parse_unitfile labels.config

  Processing output/time60_38_5_3_1.unit and writing output/time60_38_5_3_1.unit.labelunit
  ...

  Processing output/time60_37_5_1_2.unit and writing output/time60_37_5_1_2.unit.labelunit
  ...

  Processing output/time60_36_4_2_2.unit and writing output/time60_36_4_2_2.unit.labelunit
  ...

  ...

  Processing output/time60_2_2_0_0.unit and writing output/time60_2_2_0_0.unit.labelunit
  ...

  Processing output/time60_1_1_0_0.unit and writing output/time60_1_1_0_0.unit.labelunit
  ...

  [user@Kenny time60]#

The programme parse_unitfile parses each file and writes a new one with the additional suffix ".labelunit", so the old files are preserved. Unfortunately the ".labelunit" files, like the ".unit" files, are not very comfortable to read and interpret, so we wrote a second application which generates HTML output that is easy to read and browse.

Generating HTML-Output

  unit2html
  Usage: unit2html processfile config_file

Again there are two arguments, a process file and the configuration file. We use the same configuration file as above. To generate a process file we can do the following:

  [user@Kenny time60]# cd output

  [user@Kenny output]# ls *.labelunit > process.unit2html

  [user@Kenny output]# more process.unit2html

  time60_10_2_2_2.unit.labelunit

  time60_1_1_0_0.unit.labelunit

  ... and so on !

Now we can create the HTML-page with

  [user@Kenny time60]# unit2html output/process.unit2html labels.config

  Processing output/time60_10_2_2_2.unit.labelunit ....and writing output/time60_10_2_2_2.unit.labelunit.html
  !

  Processing output/time60_1_1_0_0.unit.labelunit ....and writing output/time60_1_1_0_0.unit.labelunit.html
  !

  ....

  Processing output/time60_8_2_0_2.unit.labelunit ....and writing output/time60_8_2_0_2.unit.labelunit.html
  !

  Processing output/time60_9_2_1_2.unit.labelunit ....and writing output/time60_9_2_1_2.unit.labelunit.html
  !

  [user@Kenny time60]#

We can browse the top layer of the map with the following command:

  [user@Kenny time60]# netscape output/time60_1_1_0_0.unit.labelunit.html &

Comments: michael.majdic@gmx.at, rauber@ifs.tuwien.ac.at