MUSCLE Network of Excellence

Text Analysis Tools


Within the MUSCLE Network of Excellence on multimedia understanding, data mining, and machine learning, researchers have developed a range of tools for text analysis, text annotation, natural language processing, text classification, and semantic indexing. This deliverable of WP4 presents the final inventory of the text analysis tools developed:
 
 

Part-of-Speech Tagger, Spatial Query Extractor

Bilkent University, Ugur Gudukbay, Ozgur Ulusoy

BilVideo is a video database management system. The first version, BilVideo v1.0, supports complex spatio-temporal object queries by an SQL-like textual query language, by sketches or by simple English sentences (NLP interface). BilVideo v2.0 is currently under development. It is designed to be an MPEG-7 compliant video database management system.

The visual query interface, relevant publications and user manuals are available online (see URL below). When completed, BilVideo v2.0 will also be made accessible through the web site. The system as a whole is composed of several components and is therefore not publicly available.

BilVideo can extract from natural language queries the following spatial relations:

  • topological relations that describe order in 2D space (disjoint, touch, inside, contain, overlap, cover, coveredby)
  • directional relations that describe the neighborhood of objects (directions: north, south, east, west, northeast, northwest, southeast, southwest; neighborhood: left, right, below, above)
  • 3D relations that describe object positions in 3D space (infrontof, strictlyinfrontof, behind, strictlybehind, touchfrombehind, touchedfrombehind, samelevel)

For example, given a video search query such as "Retrieve segments where James Kelly is to the right of his assistant", the system extracts the spatial relation right(JamesKelly, assistant), which can then be sent to a further query-processing engine.
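
To make the extraction step concrete, here is a toy, pattern-based Python sketch of how such a directional relation could be pulled out of a query. It is purely illustrative: BilVideo's actual NLP interface covers the full relation inventory above, resolves references such as "his assistant", and is not based on code like this.

    import re

    # Toy pattern for one directional relation ("left"/"right" of).
    # Illustrative only; not BilVideo's real implementation.
    PATTERN = re.compile(
        r"where\s+(?P<subj>.+?)\s+is to the\s+(?P<dir>left|right)\s+of\s+(?P<obj>.+)",
        re.IGNORECASE,
    )

    def extract_spatial_relation(query):
        m = PATTERN.search(query)
        if m is None:
            return None
        # "James Kelly" -> "JamesKelly", matching the relation syntax above
        norm = lambda s: "".join(w.capitalize() for w in s.split())
        return "%s(%s, %s)" % (m.group("dir").lower(),
                               norm(m.group("subj")), norm(m.group("obj")))

    print(extract_spatial_relation(
        "Retrieve segments where James Kelly is to the right of his assistant"))
    # -> right(JamesKelly, HisAssistant)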

The tool can be tried out on the Web through a Web client. A demo video illustrating query processing is available, together with example queries and tutorials.

Website: http://pcvideo.cs.bilkent.edu.tr/

Short description and references: http://www.cs.bilkent.edu.tr/~bilmdg/bilvideo

Downloadable tools: http://pcvideo.cs.bilkent.edu.tr/querying.html

Presentation in MUSCLE BSCW

 
 

Updatable Probabilistic Latent Semantic Indexing

AUTH, Constantine Kotropoulos

Probabilistic latent semantic indexing (PLSI) is a semantic space reduction method that folds documents and the concepts that appear in them into a semantic space of lower dimension, which can then be used to index and classify new documents. Building a reduced semantic space is time consuming, of order O(N^3). AUTH has implemented a new method for updating PLSI when new documents arrive. The method incrementally adds the words of each new document to the term-document matrix and derives updating equations for the probability of terms given the class (i.e. latent) variables, as well as those of documents given the latent variables. This fast updating is useful, for example, in a web crawler, where the term-document matrix must be refreshed very often.
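
For context, PLSI models the document-word joint probability through latent classes z as P(d,w) = \sum_z P(z) P(d|z) P(w|z), fitted by EM. A minimal sketch of the standard "folding-in" step for a single new document q, holding the term probabilities fixed, is shown below; AUTH's method goes further and also revises the term probabilities P(w|z) when new words enter the term-document matrix, for which the exact equations are given in their paper.

    % Standard PLSI folding-in for a new document q (P(w|z) kept fixed):
    \text{E-step:}\qquad
      P(z \mid q, w) \;=\;
      \frac{P(z)\, P(q \mid z)\, P(w \mid z)}
           {\sum_{z'} P(z')\, P(q \mid z')\, P(w \mid z')}

    \text{M-step:}\qquad
      P(q \mid z) \;\propto\; \sum_{w} n(q, w)\, P(z \mid q, w)

Here n(q, w) counts occurrences of word w in q. Iterating these two steps for the new document alone avoids rebuilding the whole semantic space.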

Website: http://www.aiia.csd.auth.gr/EN/

Presentation in MUSCLE BSCW

 

Natural Language Processing Tools, OWL version of WordNet

CEA LIST, Olivier Mesnard

The CEA has a suite of natural language processing tools for the following languages: English, French, Italian, Spanish, German, Chinese, and Arabic. Alpha versions exist for Hungarian, Japanese, and Russian. They perform the following functions:

  • language identification and text encoding identification
  • UNICODE translation of codesets
  • tokenization, dividing input stream into individual words
  • morphological analysis (recognizing conjugated word forms and providing their normalized dictionary-entry forms)
  • part-of-speech tagging (choosing the grammatical function of each word in a text)
  • entity recognition (identifying people, organizations, place names, products, money, time)
  • dependency extraction (recognizing subject-verb-object relations, and modifier relations)

These functions transform raw text into symbolic knowledge that can be used to describe, index and access textual information, such as the text associated with image captions or raw descriptions.
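
As a deliberately naive illustration of how a few of these stages chain together on a caption, here is a self-contained Python sketch; every heuristic in it is a toy placeholder, and nothing here reflects the CEA tools' actual implementation or API.

    import re

    def tokenize(text):
        # Tokenization: split the input stream into words and punctuation.
        return re.findall(r"\w+|[^\w\s]", text)

    def pos_tag(tokens):
        # Part-of-speech tagging via toy heuristics (real taggers use
        # dictionaries plus statistical or rule-based disambiguation).
        determiners = {"the", "a", "an"}
        tagged = []
        for tok in tokens:
            if tok.lower() in determiners:
                tagged.append((tok, "DET"))
            elif tok.istitle():
                tagged.append((tok, "PROPN"))  # crude proper-noun guess
            elif tok.isalpha():
                tagged.append((tok, "NOUN"))   # default open-class guess
            else:
                tagged.append((tok, "PUNCT"))
        return tagged

    def recognize_entities(tagged):
        # Entity recognition: group runs of consecutive proper nouns.
        entities, current = [], []
        for tok, tag in tagged + [("", "EOS")]:  # sentinel flushes last run
            if tag == "PROPN":
                current.append(tok)
            elif current:
                entities.append(" ".join(current))
                current = []
        return entities

    caption = "The Eiffel Tower at night, photographed by Marc Riboud."
    print(recognize_entities(pos_tag(tokenize(caption))))
    # -> ['Eiffel Tower', 'Marc Riboud']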

The CEA has also developed an OWL ontology version of the WordNet lexical hierarchy. A reduced version of this ontology restricted to all the picturable objects in WordNet (30 Mbytes) is available from the CEA LIST. 

Contact: Adrian.popescu@cea.fr

Website of commercial version of these tools: http://www.new-phenix.com

Presentation in MUSCLE BSCW

 
 

SOMLib Java Package

TU-WIEN - IFS, Andreas Rauber

TU Vienna - IFS has developed software for analyzing text documents and organizing them on a Self-Organizing Map (SOM), a representation of reduced semantic dimension that brings similar documents or objects closer together on a two- or three-dimensional plane (a minimal sketch of SOM training follows the list below). The SOMLib Java Package is a collection of Java programs that can be used to create SOMLib library systems for organizing text collections. The package includes:

  • Feature Extraction
  • Feature space pruning
  • Feature vector creation
  • Feature vector normalization
  • SOM training
  • SOM Labeling
  • libViewer template generation. 
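
As mentioned above, here is a minimal sketch of the SOM training step itself, written in Python with NumPy for brevity (SOMLib itself is Java); the input is assumed to be a NumPy array of already-extracted feature vectors, and all hyperparameter values are illustrative.

    import numpy as np

    # Minimal SOM training sketch; NOT the SOMLib implementation.
    # data: array of shape (n_documents, n_features).
    def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
        rng = np.random.default_rng(seed)
        h, w = grid
        weights = rng.random((h, w, data.shape[1]))
        coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                      indexing="ij"), axis=-1)
        n_steps, step = epochs * len(data), 0
        for _ in range(epochs):
            for x in rng.permutation(data):
                # decay learning rate and neighborhood radius over time
                t = step / n_steps
                lr = lr0 * (1 - t)
                sigma = sigma0 * (1 - t) + 0.5
                # best-matching unit: closest weight vector on the grid
                dists = np.linalg.norm(weights - x, axis=2)
                bmu = np.unravel_index(dists.argmin(), (h, w))
                # Gaussian neighborhood pulls nearby units toward the sample
                grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
                influence = np.exp(-grid_dist2 / (2 * sigma ** 2))
                weights += lr * influence[..., None] * (x - weights)
                step += 1
        return weights

Each sample pulls its best-matching unit, and with Gaussian-decayed strength that unit's grid neighbors, toward itself; this is what gradually places similar documents close together on the map.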

Website: http://www.ifs.tuwien.ac.at/~andi/somlib/download/index.html

Quick Reference: http://www.ifs.tuwien.ac.at/~andi/somlib/download/java_package/

 
 

Semi-automated Corpus Annotator (CNRS LLACAN)

CNRS, Fathi Debili

Automatic versus Interactive Analysis of Arabic Corpora

The tools presented here make it possible to interactively annotate large corpora of Arabic texts. The annotations cover splitting the text into words, lemmatization, vowellization, tagging, segmentation into nominal and verbal chains, and the construction of dependency relations.
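
Purely as an illustration of what these annotation layers amount to per token, a hypothetical data model might look like the Python sketch below; the field names and example values are ours, not CNRS LLACAN's actual schema.

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical per-token record for the annotation layers above.
    @dataclass
    class TokenAnnotation:
        surface: str                       # word form as split from the text
        lemma: Optional[str] = None        # lemmatization
        vowellized: Optional[str] = None   # fully vowelled form
        pos: Optional[str] = None          # morphosyntactic tag
        chunk: Optional[str] = None        # nominal/verbal chain label
        head: Optional[int] = None         # index of the governing token
        deprel: Optional[str] = None       # type of the dependency relation

    # Example: the verb "كتب" (kataba, "he wrote"), vowellized "كَتَبَ"
    ann = TokenAnnotation(surface="كتب", lemma="كتب", vowellized="كَتَبَ",
                          pos="VERB", chunk="VC")
    print(ann)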

Within this process, interactive annotation serves automatic parsing by supplying it with large volumes of annotated text from which rules can be learned and evaluated. Conversely, automatic parsing speeds up interactive annotation; the resulting gain is measured against the cost of fully manual processing.

The language tools of CNRS LLACAN relate to the automatic processing of the Arabic language. Based on a dictionary of forms, they enable morphological analysis, POS tagging, phrase chunking and dependency analysis of Modern Standard Arabic, with variable levels of coverage and performance.

The need to produce large training corpora, together with difficulties specific to the Arabic language, led to the development of interactive analysis tools.

These tools currently run under MS Windows. An intranet version is being developed.

These tools have also been used to prepare a tagged corpus of about 250,000 words (available from ELDA).

Presentation in MUSCLE BSCW

 
 

UTIA Text Classifier

UTIA, Jana Novovicova

Text categorization (also known as text classification) is the task of automatically sorting a set of documents into predefined classes based on their contents. Document classification is needed in many applications, including e-mail filtering, mail routing, spam filtering, news monitoring, selective dissemination of information to information consumers, and automated indexing of scientific articles. The Prague-based team at UTIA has produced a text classification method that uses Oscillating Search for feature selection; unlike traditional approaches, it evaluates groups of features rather than individual features, and it improved classification accuracy in their experiments.
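
For readers unfamiliar with the search procedure, the following is a generic Python sketch of Oscillating Search in the style of Somol and Pudil; it is not UTIA's implementation. The criterion function always scores a whole candidate feature group (e.g. cross-validated classifier accuracy on those features), which is exactly the group-wise evaluation highlighted above.

    # Generic Oscillating Search sketch; criterion(subset) -> float to maximize.
    def greedy_add(subset, pool, criterion, k):
        subset = set(subset)
        for _ in range(k):
            candidates = [f for f in pool if f not in subset]
            subset.add(max(candidates, key=lambda f: criterion(subset | {f})))
        return subset

    def greedy_remove(subset, criterion, k):
        subset = set(subset)
        for _ in range(k):
            subset.remove(max(subset, key=lambda f: criterion(subset - {f})))
        return subset

    def oscillating_search(pool, criterion, d, delta_max=3):
        """Return a feature subset of size d with high criterion value."""
        best = greedy_add(set(), pool, criterion, d)   # initial subset
        best_score = criterion(best)
        delta = 1
        while delta <= delta_max:
            # down-swing: drop delta features, then restore subset size
            down = greedy_add(greedy_remove(best, criterion, delta),
                              pool, criterion, delta)
            # up-swing: add delta features, then shrink back to size d
            up = greedy_remove(greedy_add(best, pool, criterion, delta),
                               criterion, delta)
            improved = False
            for cand in (down, up):
                score = criterion(cand)
                if score > best_score:
                    best, best_score, improved = cand, score, True
            # oscillation amplitude grows only while nothing improves
            delta = 1 if improved else delta + 1
        return best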

Paper describing the work: http://staff.utia.cas.cz/novovic/files/CIARP06_NSP.pdf

 

ECUE Spam Concept Drift Datasets

NUID / UCD, Sarah Jane Delany

The ECUE Spam Concept Drift Datasets each consist of more than 10,000 emails collected over a period of approximately 2 years. Each is a collection of spam and legitimate email received by an individual. The following files are included in each dataset:
  • SpamTraining.txt = all spam emails used as initial training data in the concept drift experiments performed using this dataset.
  • NonspamTraining.txt = all legitimate (non-spam) emails used as initial training data in the concept drift experiments performed using this dataset.
  • TestMMM99.txt = all emails used as test data in the concept drift experiments using this dataset, where MMM represents the month and 99 the two-digit year in which the emails were originally received.
Website (download): http://www.comp.dit.ie/sjdelany/Dataset.htm

Papers describing the work:
https://www.cs.tcd.ie/publications/tech-reports/reports.06/TCD-CS-2006-05.pdf  (ECAI 2006),
https://www.cs.tcd.ie/publications/tech-reports/reports.05/TCD-CS-2005-19.pdf (FLAIRS 2006)

TechTC - Repository of Text Categorization Datasets

Technion-ML, Shaul Markovitch

While numerous works have studied text categorization (TC), good test collections are far less abundant. The TechTC-300 Test Collection contains 300 labeled datasets whose categorization difficulty (as measured by baseline SVM accuracy) is uniformly distributed between 0.6 and 1.0. Each dataset consists of a pair of ODP categories with an average of 150-200 documents (depending on the specific test collection), and defines a binary classification task that consists in telling the two categories apart. The average document size after filtering is slightly over 11 kilobytes. HTML documents were converted into plain text, organized into datasets, and rendered in a simple XML-like format.

The data is available in two formats:
  1. Plain text. In plain text form, each dataset consists of a pair of files corresponding to the two categories comprising the dataset. Each file contains all the documents of one category in ASCII text format, as produced by the HTML-to-text conversion.
  2. Preprocessed feature vectors. In this format, the texts were only tokenized and digitized; they underwent no other preprocessing whatsoever.

The following test collections are currently available:
  • TechTC-300 - a collection of 300 datasets whose categorization difficulty (as measured by baseline SVM accuracy) is uniformly distributed between 0.6 and 1.0.
  • TechTC-100 - a collection of 100 datasets whose categorization difficulty (as measured by baseline SVM accuracy) is uniformly distributed between 0.6 and 0.92.
    Note: TechTC-100 is a subset of TechTC-300.
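
To illustrate the difficulty measure, here is a minimal sketch of computing baseline SVM accuracy for one category pair with scikit-learn; this is our own approximation, not the authors' original experimental setup, and the document-loading step is assumed to be done elsewhere.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def baseline_svm_accuracy(docs_a, docs_b):
        """Mean cross-validated accuracy of a linear SVM separating
        two lists of documents (one per ODP category)."""
        texts = docs_a + docs_b
        labels = [0] * len(docs_a) + [1] * len(docs_b)
        X = CountVectorizer().fit_transform(texts)  # bag-of-words features
        return cross_val_score(LinearSVC(), X, labels, cv=5).mean()

A pair of categories on which this score is near 1.0 is an easy dataset; scores near 0.6 mark the hard end of the TechTC-300 difficulty range.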
Website (download): http://techtc.cs.technion.ac.il/

Papers describing the work: http://www.muscle-noe.org/images/DocumentPDF/MP_504_Gabrilovich-Markovitch-aaai2006.pdf
 