Hierarchical Structuring, Analysis, and Exploration of High-Dimensional Feature Spaces

Introduction: With the much-proclaimed increase in the production and availability of data, algorithms that analyze it, automatically structure it, and extract knowledge from it are gaining importance. This book introduces new means for transforming these data spaces into information spaces by putting the data into context. The analysis and visualization methods presented assist the analyst in exploring a data space, understanding its characteristics and the similarities in the data, and creating a mental model of the data under investigation. The devised approach, an extension to a popular neural network, acts both as an analysis tool and as a direct interface to the data, making it a valuable instrument for many application domains. Two of these domains, namely text and audio mining -- one more traditional, the other rather new -- are presented here.

There is, on the one hand, the challenge of integrating several aspects of text mining within a single framework. Texts are analyzed and structured according to their content, offering the basis for subject cataloging, topic-based access, multi-document summarization, and other applications, in mono- as well as multi-lingual settings. Yet not only the content but also the structure of texts needs to be analyzed, revealing genre information and providing information on their communicative function, i.e. how a certain piece of information is expressed. Combining these two characteristics of texts, i.e. the propositional and the communicative function, provides the basis for meeting the information needs of users working with textual information.

On the other hand, there is more than textual information to be explored. This book addresses a particularly challenging and new domain, the analysis of audio data, specifically music.
Determining the similarity of two pieces of music requires the extraction of new sets of features from the audio signal, incorporating psychoacoustic modeling, as well as a diversified view of what constitutes similarity in music for different listeners. To meet this goal, a new feature set is devised and evaluated with respect to its suitability as the basis for organizing a music collection. The resulting system creates explorable and comprehensible organizations of text and music collections by combining a range of features with a novel neural network model for mapping data spaces, with application potential going far beyond the two areas covered here in detail.

When it comes to the analysis of large volumes of data, a variety of data mining techniques are used to extract meaningful information, to test hypotheses, or to organize data according to some pre-defined principles. One particular role is taken by clustering algorithms, which are often employed as an initial step in the knowledge discovery process, when little is known about the data and their characteristics. Clustering makes it possible, among other things, to analyze the structure of data, develop hypotheses, or detect outliers that might require special treatment. Numerous approaches for performing cluster analysis are known, each exhibiting specific characteristics that render it suitable for specific tasks. For a concise overview of the main algorithms see [42]. One famous approach to performing cluster analysis is based on the self-organizing map (SOM) [54], a neural network model that provides a topology-preserving mapping from high-dimensional input spaces onto a usually two-dimensional output space.
Originally motivated by the localized regions of activity in the human cortex [52], where similar regions react to similar stimuli, similar regions of the input space are mapped as faithfully as possible onto neighboring regions of the map space during an unsupervised training process. By virtue of this topology-preserving mapping, the SOM, in addition to clustering data onto map units, also offers a convenient exploratory interface. Thanks to several algorithmic shortcuts that can be taken, the SOM has been shown to scale up to large amounts of data [48], and has experienced wide-spread use in a huge range of disciplines and applications [46,71]. Among the other benefits of the SOM are its capability to deal faithfully with very high-dimensional input spaces, its robustness against noisy data, and its limited sensitivity to parameter choices, making it a very stable model. It is mainly its characteristic of offering a map space of the data that is key to the work presented here, i.e. the analysis of text and music collections. A SOM-inspired neural network is used both as a tool for analyzing the data and as the basis for a corresponding interface to the data collections.

The remainder of this book consists of three main topical sections, which are (1) the growing hierarchical self-organizing map (GHSOM), a novel neural network model used for analyzing data; (2) its application and evaluation within the text analysis domain, both for content-based and for the less commonly pursued genre-oriented structuring of document collections; and (3) its application to a rather new domain, i.e. the automatic genre-oriented organization of music data, where, contrary to the text mining domain, the extraction of relevant features constitutes a new research challenge, and for which a novel set of features is introduced.

In spite of the wide-spread use of the SOM, especially in the text mining domain, some shortcomings have to be noted.
It is these shortcomings, which are related to the static architecture of the SOM model, that are addressed in Chapter 2 of this book. First of all, the size of the map space used to represent the data has to be known in advance. Secondly, any hierarchical structure of the kind commonly present in real-world data collections is lost during the mapping process onto the two-dimensional map space. To counter these architectural deficiencies of the SOM, a novel neural network model, the growing hierarchical self-organizing map (GHSOM) [87], is presented in Section 2.1. This model adapts its architecture during an unsupervised training process to reflect the hierarchical structure present in the data. Additionally, the individual maps at each layer in the evolving hierarchy grow in size according to the requirements of the data. Although several techniques for visualizing the SOM have been proposed and are in wide-spread use, shortcomings with respect to faithfully reflecting the cluster structure on the map have to be noted. In order to improve the visualization of the cluster structure in SOMs, a novel visualization method based on smoothed data histograms (SDH) [75], which estimates the probability density of data on the map, is introduced in Section 2.2.

Chapter 3 presents the application of the SOM and GHSOM technologies in a text mining environment, where the SOM is being successfully employed by a large community. Section 3.1, based on [86,87], discusses the application of the GHSOM to this domain to detect hierarchical topical structure in text repositories. These experiments are based on a collection of news articles from the Austrian daily newspaper "Der Standard", comparing two different hierarchical structures that evolve during the unsupervised training process.
With this type of content-based text analysis being a stronghold domain for the standard SOM model, the focus of this section lies on analyzing the model's capability to automatically and faithfully reflect the hierarchical structure of the data, and its robustness against parameter variations in a real-world setting. In particular, the GHSOM's capability to maintain the individual maps' orientation according to the global topology evolving on the first layer is demonstrated. This characteristic overcomes a strong deficiency of many hierarchical clustering algorithms, where the relationships between data points are lost if a cluster is subdivided at some layer of the hierarchy.

Section 3.2 takes a look at the system's capabilities of handling multi-lingual document collections, as presented in [81], a topic that has seen increased interest recently. More specifically, the system's robustness against noise introduced by a standard machine translation tool is tested by applying it to a non-parallel corpus of English, French, and German news articles. While most approaches for cross-language information retrieval (CLIR) aim at obtaining a high-quality translation representation of either queries or documents, the setting discussed in this section studies to what extent crude translation techniques provide sufficient quality for subsequent topic-based clustering.

Section 3.3, finally, explores another dimension of text analysis that has, until recently, received only partial attention from the information retrieval community, i.e. structural and genre analysis. In contrast to content-based analysis of document collections, only a very limited amount of work has been performed regarding the application of machine learning techniques to automatic genre detection and respective organization. Yet the question of how a given piece of information is represented is accepted as being equally important as the topic-wise relevance of information as such.
The application of the SOM to this domain marks a novel step, offering new possibilities for the integration of content and structural analysis of text documents within a single framework. To achieve this, structural surface-level features are extracted from texts. Documents are subsequently grouped according to stylistic or genre-type similarity [86,88]. This organization is integrated with the content-based structuring using a metaphor-graphical interface originally developed for the SOMLIB system.

Last, but not least, Chapter 4 discusses the application of the SOM and GHSOM to a new application domain that has gained significant interest recently, i.e. the analysis and organization of music collections. Similar to content-based organization of texts, collections of music files in, e.g., mp3 format are organized automatically according to their musical style. However, contrary to the textual domain with its well-explored feature spaces, the extraction of features capturing musical genre information is still a challenging, new area of research. Following a general introduction to this domain in Section 4.1 and a review of the state of the art in Section 4.2, a novel set of features extracted from the audio signal is introduced that allows genre-based organization of music by modeling rhythm patterns. The individual steps of this feature extraction process and the resulting SOMeJB system [74,82,89,90] are described in detail in Section 4.3, followed by a review of an alternative set of features employed in genre-oriented audio analysis in Section 4.4. Results of using both the GHSOM and the SOM with an advanced visualization based on SDH, presented in Section 4.5, demonstrate the capabilities of the presented approach. A short summary concludes this book by revisiting the main points in Chapter 5.

-------------------------
A. Rauber: Hierarchical Structuring, Analysis, and Exploration of High-Dimensional Feature Spaces. OCG-Books 173, 90pp, Austrian Computer Society, 2003.