Department of Software Technology
Vienna University of Technology


The SOMeJB Music Digital Library - Prototype 1

Principles

The principles of the system can be described as follows. We use XMMS, a popular open source media player, to extract frequency spectra from music data such as MP3 or Wave files. As these frequency spectra are not provided at evenly spaced time intervals, we use Lagrange interpolation to obtain equi-distant snapshots. This is followed by a Fast Fourier Transformation (FFT) across the segments for a selected set of frequency bands to obtain Fourier coefficients modeling the dynamics. These feature vectors are used to train a SOM of music segments. Segments with similar sound characteristics are located next to each other on the map, while very differently sounding segments are found in sections far apart on the map. Quite frequently we also find that pieces of music show two very distinct sound characteristics in their verse and chorus parts, and thus have their segments spread across different parts of the map. In a second step, a unique feature vector is created for each piece of music based on the cluster distribution of its segments. These vectors are again fed into a SOM to create a map of the music.

flow diagram of SOMeJB prototype 1 system

SOMeJB Prototype 1: SOM-enhanced JukeBox System: feature extraction, conversion, 1st- and 2nd-level SOM training.

Feature Extraction

Music comes in a variety of file formats such as MP3, WAV, AU, etc., all of which basically store the sound information in the form of pulse code modulation (PCM), typically using a sampling rate of 44.1 kHz. The analog sound signal is thus represented by 44,100 16-bit integer values per second, which media players interpret to reproduce the sound signal. In order to compute similarity scores between musical tunes, a feature vector representation of the various pieces of music needs to be created, which can then be analyzed by the SOM.
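As a minimal sketch of the PCM representation described above, the following Python snippet reads raw 16-bit samples from a WAV file using only the standard library; the function name and the little-endian signed-integer decoding are illustrative assumptions, not part of the original system.

```python
import wave
import struct

def read_pcm_samples(path, n_frames=5):
    """Read the first few 16-bit PCM samples from a WAV file.

    Returns the sampling rate (typically 44100 Hz for CD-quality audio),
    the sample width in bytes (2 bytes -> 16-bit integers), and the
    decoded integer samples.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        width = w.getsampwidth()
        frames = w.readframes(n_frames)
    # Each 16-bit sample is decoded as a signed little-endian integer.
    samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
    return rate, width, samples
```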

Starting from any popular music file format, most media players, such as the open source X Multimedia System (XMMS), are capable of splitting this data stream into several frequency bands. Using XMMS, the signal is split into 256 frequency bands, with approximately one sample value every 20 to 25 ms per band. Since not all frequency bands are necessary for evaluating sound similarity, and in order to reduce the amount of data to be processed, a subset of 17 frequency bands (i.e. every 15th frequency band) is selected for further analysis, covering the whole available spectrum. In order to capture musical variations of a tune, the music stream is split into sections of 5 seconds length, which are further treated as the individual musical entities to be analyzed. While basically all 5-second segments could be used for further analysis, or even overlapping segments might be chosen, experimental results have shown that appropriate clustering results can be obtained by the SOM using only a subset of all available segments. In particular, segments at the beginning and at the end of a piece of music can be eliminated to ignore fade-in and fade-out effects. Furthermore, our results show that choosing every second to third segment, i.e. a 5-second interval every 10 to 15 seconds, provides sufficient quality of data analysis.
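The segmentation and subsampling steps above can be sketched as follows. This is an illustrative reconstruction, not the original XMMS-based code: the sampling rate of roughly 45 spectrum snapshots per second (one every ~22 ms) and the parameter names are assumptions.

```python
import numpy as np

def select_segments(spectra, samples_per_sec=45, seg_len_s=5,
                    skip_edge=2, take_every=2):
    """Split a band-amplitude stream into 5-second segments and subsample.

    spectra: array of shape (time_steps, n_bands), one row per spectrum
    snapshot (roughly every 20-25 ms) for the selected frequency bands.
    Drops `skip_edge` segments at the start and end (fade-in/fade-out)
    and then keeps every `take_every`-th remaining segment.
    """
    seg_len = seg_len_s * samples_per_sec          # snapshots per segment
    n_seg = spectra.shape[0] // seg_len
    # Reshape the stream into complete 5-second segments.
    segments = spectra[:n_seg * seg_len].reshape(n_seg, seg_len, -1)
    # Skip intro/outro segments, then take every 2nd (or 3rd) segment.
    return segments[skip_edge:n_seg - skip_edge:take_every]
```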

The intervals between the frequency snapshots provided by the player vary with the system load and thus cannot be guaranteed to occur at specified time intervals. We therefore have a set of amplitude/timestamp values about every 20 to 25 ms in each of the 17 selected frequency bands. In order to obtain equi-distant data points, a Lagrange interpolation is performed on these values.
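A minimal sketch of such a resampling step, evaluating local Lagrange polynomials over the nearest support points to produce values on an equidistant time grid; the polynomial order and the windowing strategy are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def lagrange_resample(t, y, t_new, order=3):
    """Resample irregularly spaced samples (t, y) at times t_new.

    For each target time, a Lagrange polynomial of the given order is
    fitted through the nearest order+1 support points and evaluated.
    """
    k = order + 1                                  # points per polynomial
    y_new = np.empty(len(t_new), dtype=float)
    for j, tj in enumerate(t_new):
        # Pick a window of k support points around the target time.
        i = np.searchsorted(t, tj)
        lo = max(0, min(i - k // 2, len(t) - k))
        ts, ys = t[lo:lo + k], y[lo:lo + k]
        # Classic Lagrange basis evaluation at tj.
        val = 0.0
        for m in range(k):
            basis = np.prod([(tj - ts[n]) / (ts[m] - ts[n])
                             for n in range(k) if n != m])
            val += ys[m] * basis
        y_new[j] = val
    return y_new
```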

As a result of this transformation we now have equi-distant data samples in each frequency band. The resulting function can be approximated by a linear combination of sine and cosine waves with different frequencies. We can thus obtain a closed representation for each frequency band by performing a Fast Fourier Transformation (FFT), resulting in a set of 256 coefficients for the respective sine and cosine parts. Combining the 256 coefficients for the 17 frequency bands results in a 4352-dimensional vector representing a 5-second segment of music. These feature vectors are further used for training a Self-Organizing Map.
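The assembly of the 4352-dimensional segment vector can be sketched as below. The choice of keeping the first 128 complex coefficients per band (real/cosine and imaginary/sine parts, i.e. 256 values per band, times 17 bands) and the zero-padding of each band's series to 256 points are assumptions made for this sketch.

```python
import numpy as np

def segment_feature_vector(segment, n_coef=128):
    """Build the FFT-based feature vector for one 5-second segment.

    segment: array of shape (time_steps, 17), equidistant samples per
    frequency band. Returns a 17 * 2 * n_coef = 4352-dimensional vector
    of cosine (real) and sine (imaginary) Fourier coefficients.
    """
    feats = []
    for band in segment.T:                         # iterate over the 17 bands
        # Zero-pad to 2 * n_coef points so every band yields n_coef
        # low-frequency coefficients regardless of segment length.
        spec = np.fft.rfft(band, n=2 * n_coef)[:n_coef]
        feats.append(spec.real)                    # cosine parts
        feats.append(spec.imag)                    # sine parts
    return np.concatenate(feats)
```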

Self-Organizing Map

The self-organizing map (SOM, Kohonen map) is one of the most prominent artificial neural network models adhering to the unsupervised learning paradigm. The model consists of a number of neural processing elements, i.e. units. Each unit i is assigned an n-dimensional weight vector mi. It is important to note that the weight vectors have the same dimensionality as the input patterns.

The training process of self-organizing maps may be described in terms of input pattern presentation and weight vector adaptation. Each training iteration t starts with the random selection of one input pattern x(t). This input pattern is presented to the self-organizing map and each unit determines its activation. Usually, the Euclidean distance between weight vector and input pattern is used to calculate a unit's activation. The unit with the lowest activation is referred to as the winner, c, of the training iteration, i.e.  $\vert\vert x(t)-m_c(t)\vert\vert = \min_i{\vert\vert x(t)-m_i(t)\vert\vert}$. Finally, the weight vector of the winner as well as the weight vectors of selected units in the vicinity of the winner are adapted. This adaptation is implemented as a gradual reduction of the component-wise difference between input pattern and weight vector, i.e.  $m_i(t+1) = m_i(t) + \alpha(t) \cdot h_{ci}(t) \cdot [x(t)-m_i(t)]$. Geometrically speaking, the weight vectors of the adapted units are moved a bit towards the input pattern. The amount of weight vector movement is guided by a so-called learning rate, alpha, which decreases over time. The number of units affected by adaptation is determined by a so-called neighborhood function, hci; this number also decreases over time. As a consequence of this movement, the Euclidean distance between the adapted weight vectors and the input pattern decreases, i.e. the weight vectors become more similar to the input pattern, so that the respective unit is more likely to win at future presentations of this input pattern. Adapting not only the winner but also a number of units in its neighborhood leads to a spatial clustering of similar input patterns in neighboring parts of the self-organizing map. Thus, similarities between input patterns that are present in the n-dimensional input space are mirrored within the two-dimensional output space of the self-organizing map.
The training process of the self-organizing map thus describes a topology-preserving mapping from a high-dimensional input space onto a two-dimensional output space, where patterns that are similar in terms of the input space are mapped to geographically close locations in the output space.
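The training procedure above can be sketched in a few lines of numpy. This is a generic minimal SOM, not the original SOMLib code: the Gaussian neighborhood function and the linear decay schedules for the learning rate and the neighborhood radius are common illustrative choices.

```python
import numpy as np

def train_som(data, rows=7, cols=7, n_iter=1000, seed=0):
    """Minimal SOM training loop implementing
    m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * [x(t) - m_i(t)]."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.normal(size=(rows * cols, dim))
    # Grid coordinates of each unit, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]                    # random x(t)
        # Winner c: unit with the smallest Euclidean distance to x(t).
        c = np.argmin(np.linalg.norm(weights - x, axis=1))
        alpha = 0.5 * (1.0 - t / n_iter)                     # decreasing rate
        sigma = max(0.5, (rows / 2) * (1.0 - t / n_iter))    # shrinking radius
        # Gaussian neighborhood h_ci over grid distance to the winner.
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        # Move weight vectors a bit towards the input pattern.
        weights += alpha * h[:, None] * (x - weights)
    return weights.reshape(rows, cols, dim)
```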

Consider Figure 1 for a graphical representation of self-organizing maps. The map consists of a square arrangement of 7x7 neural processing elements, i.e. units, shown as circles on the left hand side of the figure. The black circle indicates the unit that was selected as the winner for the presentation of input pattern x(t). The weight vector of the winner, mc(t), is moved towards the input pattern and thus, mc(t+1) is nearer to x(t) than was mc(t). Similar, yet less strong, adaptation is performed with a number of units in the vicinity of the winner. These units are marked as shaded circles in Figure 1. The degree of shading corresponds to the strength of adaptation. Thus, the weight vectors of units shown with a darker shading are moved closer to x than units shown with a lighter shading.


  
Figure 1: Architecture of a 7x7 self-organizing map

As a result of the training process, similar input data are mapped onto neighboring regions of the map. In the case of our music collection, segments and pieces of music with similar sound characteristics, as indicated by their feature vector representations, are grouped accordingly.

2-stage Clustering Procedure

Step 1: Clustering Segments of Music

The feature vectors representing music segments can be thought of as data points in a 4352-dimensional space, with similar pieces of music, i.e. segments exhibiting similar frequency spectra and thus similar FFT coefficients, being located close to each other. Using the SOM to cluster these feature vectors, we may expect similar music segments to be located close to each other in the resulting map display. On the resulting segment SOM the various segments are scattered across the map according to their mutual similarity. This allows, for example, pieces of music touching on different musical genres to be located in two or more different clusters, whereas rather homogeneous pieces of music are usually located within one rather confined cluster on the map. While this already provides a very intuitive interface to a musical collection, a second clustering may be built on top of the segment map to obtain a grouping of pieces of music according to their overall characteristics.

Step 2: Creating an Organization based on Segment Distribution

To obtain such a clustering, we use the mapping positions of the segments of a piece of music. We create a feature vector representation for each piece of music using the locations of its segments as descriptive attributes. Given a SOM with units organized in x rows and y columns, we create an x * y dimensional feature vector, where the attributes correspond to the units of the segment SOM. Each attribute represents the number of segments of a particular piece of music mapped onto the respective unit of the SOM. Consider a piece of music A consisting of 7 segments, three of which are mapped onto unit (0/0) in the upper left corner of a 3x3 map, two segments onto unit (2/1), and one segment onto each of the units (1/1) and (2/2), as depicted in the Figure below. The attributes of the resulting 9-dimensional feature vector of the song are set to the corresponding values (3 0 0 0 1 2 0 0 1). Subsequent normalization to unit length compensates for differences in the lengths of songs.
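A minimal sketch of this second-level vector construction, using the example above. Note that the segment coordinates are passed here as (row, column) pairs, an implementation choice for this sketch; the attribute index of a unit in row r and column c of an x-row, y-column map is taken to be r * y + c.

```python
import numpy as np

def song_vector(segment_units, rows=3, cols=3):
    """Build the segment-distribution feature vector of one song.

    segment_units: one (row, column) winner unit per segment of the song.
    Counts segments per unit, then normalizes to unit length so that
    songs of different lengths remain comparable.
    """
    v = np.zeros(rows * cols)
    for r, c in segment_units:
        v[r * cols + c] += 1.0
    return v / np.linalg.norm(v)
```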

Representation of segments distributed across 3x3 SOM and according 9-dim feature vector

Creating 2nd level feature vectors based on segment distribution

Instead of directly using the number of segments mapped onto a specific unit as the attribute of the newly created input vector for a given piece of music, we may improve the data representation by incorporating information about the similarity of a given segment to the weight vector of the unit it is mapped onto. As the weight vector serves as a cluster prototype, we can use the mapping distance between a segment's feature vector and the unit's weight vector to give higher weights to segments that are very similar to the cluster centroid, and lower weights to segments that are not mapped as well onto this unit. Furthermore, we may distribute the contribution of a segment mapped onto a specific unit also across units in its neighborhood, utilizing the topology-preserving characteristics of the SOM. This allows for a more stable representation of the segment distribution across the segment map. A Gaussian centered at the winner can thus be used to model the contribution of each segment's location to the neighboring units, and thus to the attributes of the feature vector for the music SOM, as indicated by the shadings in the figure above.
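The Gaussian-weighted variant can be sketched as follows. The Gaussian width sigma and the 1/(1+d) down-weighting of poorly matching segments are illustrative choices for this sketch, not the parameters of the original system.

```python
import numpy as np

def weighted_song_vector(winners, distances, rows=3, cols=3, sigma=0.75):
    """Gaussian-weighted segment-distribution vector of one song.

    winners: one (row, column) winner unit per segment.
    distances: mapping distance of each segment's feature vector to its
    winner's weight vector; well-matching segments (small distance)
    contribute more. Each segment adds a Gaussian bump centred at its
    winner, spreading its contribution over neighboring units.
    """
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    v = np.zeros(rows * cols)
    for (r, c), d in zip(winners, distances):
        d2 = np.sum((grid - np.array([r, c], float)) ** 2, axis=1)
        v += np.exp(-d2 / (2 * sigma ** 2)) / (1.0 + d)
    return v / np.linalg.norm(v)                   # normalise for song length
```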

We thus create a feature vector for each particular piece of music based on the distribution of its segments on the segment SOM. Training a second SOM using these feature vectors we obtain a clustering where each piece of music is mapped onto one single location on the resulting map, with similar pieces of music being mapped close to each other.
