Europeana Sounds Music Collection Genre Detection Challenge

Take part and help identify genres in musical heritage.

The Europeana Sounds project has been working for the last three years to get the collections of sound archives around Europe online. All the material related to music can be found in its dedicated thematic collection: Europeana Music. So far over 250,000 pictures, texts and sound files can be found there.

For a user of Europeana Music, it is useful to be able to search for particular music genres (e.g. free jazz, Irish folk, baroque) to find what they are looking for in this vast amount of material. However, this information is not always available in the data. Currently, only about a fifth of the Europeana Music Collection has been labelled with a unified genre description. Even in those cases the genre classification is often very general, because it has been applied at the collection level rather than to the individual piece.

To improve the quality of the genre information, we are organising a genre detection challenge on the 1st of October in Vienna.

We have teamed up with the organisers of a large hackathon that is part of the Vienna Waves festival, an annual festival focusing on cutting-edge music that combines club nights with lectures, keynote talks, discussions and room for experiments. Up to 100 participants are welcome in the amazing "Werkstaetten- und Kulturhaus" (WUK) to work on this challenge or one of the many others proposed there.

The Europeana API provides programmatic access to over 35,000 music recordings that are available through the Europeana Music Collection. With this challenge we are looking for methods that automatically process the Europeana Music Collection to apply suitable genre descriptions at the item level. For this day all the openly licensed sound files from the Europeana database will be made available beforehand to the participants to make sure they can get working on it right from the start of the day.

Prize

For the best solution of the day, a prize of 500 euros (in vouchers) is available, but we don't want the work to end after this day. If the prototype developed during the hackday has potential, the Europeana Foundation will be able to work with you (and pay you) to develop it further into a working product and feed the generated data into the Europeana crowdsourcing API.

We think the next step would be to verify the results generated by the algorithm using the crowdsourcing tool that is also being developed as part of the Europeana Sounds project. So instead of asking ‘what genre do you think this is?’, we can ask the Europeana user ‘we think this is polka, do you agree?’. This makes it much easier for the user to participate in the crowdsourcing activities and thereby improve the data.

Questions?

  • For organizational questions refer to Joris Pekel
  • For technical questions refer to Alexander Schindler

Required modules

To run this notebook and its example code the following modules are required:

  • pip install numpy
  • pip install pandas
  • pip install scikit-learn (imported in code as sklearn)
In [1]:
import numpy  as np
import pandas as pd

pd.set_option('display.max_colwidth', 30)
In [2]:
DATA_PATH = "G:/eu_sounds_challenge/"

Text Processing

Load Metadata

In [3]:
# read csv-data (separated by semicolons)
metadata = pd.read_csv("%s/metadata.csv" % DATA_PATH, sep=";")

# convert nan-values to empty strings
metadata = metadata.fillna("")

# use the "id" field as index
metadata = metadata.set_index("id")

metadata.head()
Out[3]:
contributor country created creator dataProvider date description format identifier language medium provider publisher relation spatial subject title type year
id
/09301/02DABC7F5850EABCA553BCEEEEAE2B6543A1CA67 Schorr, A. [lyricist], Bur... france Bibliothèque Medem - Maiso... Voice solo Publisher number : Columbi... yi Judaica Europeana Columbia Ai dai dera dai tchastushkes United States [publication... Vos toig es aich vos darft... sound recording, Chanson c...
/09301/03DE04818BECB920A21EA63EA9D5FD1C49CF74A0 Burstein, Peisach [singer] france Bibliothèque Medem - Maiso... Full orchestra ; Voice solo Publisher number : Columbi... yi Judaica Europeana Columbia Die poilische chassene New-York [recording], Unit... Pitia Diritia sound recording, Chanson c...
/09301/0701AE796F7B504E67B8CEB031C1061E113BE3F4 Kremer, Isa [singer] france Bibliothèque Medem - Maiso... Full orchestra ; Voice sol... Publisher number : Polydor... yi Judaica Europeana Polydor Oi Abram Paris [publication] Sitz ich mir in schtibele Chanson d'amour, soun...
/09301/0A3E8D891310E10CEC0B6D7790ADE1470E6541B8 Potock, Bernard [conductor... france Bibliothèque Medem - Maiso... Full orchestra ; Voice sol... Publisher number : Elesdis... yi Judaica Europeana Elesdisc Leon Speiser Ich will nicht sein kein r... Paris [publication] A briew fun a dienst Chanson humouristique, sou...
/09301/0BF6DFA79787E76872BA33061C84A79588318953 Ellstein, Abraham [instrum... france Bibliothèque Medem - Maiso... Chorus ; Piano ; Voice solo Publisher number : Metro 1... yi Judaica Europeana Metro A nign'dl Taiere malke sound recording, Chanson p...

Analyze Metadata

Search for typical terms in title

Many terms that appear frequently in the title field can be associated with musical genres. As always when working with Europeana data, be aware that many descriptive fields are multilingual, so the same genre may appear under different spellings.

In [4]:
# define a list of typical terms
classical_music_terms = ["opera", "oper", "operette", "violin concert", "quartett"]

# search all title fields (lower-case) of the metadata

# 1. use title fields to search
search_index = metadata.title
# 2. convert to lower case
search_index = search_index.str.lower()
# 3. get boolean map of indexes which contain the search-strings
search_index = search_index.str.contains('|'.join(classical_music_terms))

# the same can be written as a single line
search_result = metadata[metadata.title.str.lower().str.contains('|'.join(classical_music_terms))]

# print results
print("Number of classical music tracks found: %d" % search_result.shape[0])

search_result.head()
Number of classical music tracks found: 107
Out[4]:
contributor country created creator dataProvider date description format identifier language medium provider publisher relation spatial subject title type year
id
/2059201/data_sounds_28043 Sakss, Pauls, 1878-1966 latvia Offenbach, Jacques, 1819-1880 National Library of Latvia [19--] Shellac record, audio/mpeg... 3336, 10034167 audio disc Europeana Sounds Latvija : Bellaccord Electro Operas -- Excerpts Mīlas nakts : barkarolla n...
/2059201/data_sounds_50752 Frinbergs, Artūrs, 1916-1984 latvia Puccini, Giacomo, 1858-1924 National Library of Latvia [19--] Shellac record, audio/mpeg... B/20894 audio disc Europeana Sounds [Latvija] : Рижский завод Operas -- Excerpts Kavaradosi ārija no operas...
/2059201/data_sounds_50753 Frinbergs, Artūrs, 1916-1984 latvia Чайковский, Петр Ильич, 18... National Library of Latvia [19--] Shellac record, audio/mpeg... B/19958 audio disc Europeana Sounds [Latvija] : Рижский завод Operas -- Excerpts Hermaņa ārija no operas &q...
/2059201/data_sounds_50815 Ašmane, Marta latvia Händel, Georg Friedrich, 1... National Library of Latvia [19--] Shellac record, audio/mpeg... 205, 5126950 audio disc Europeana Sounds [B.v.] : Метрополь Рекордъ Operas -- Excerpts Ārija iz operas "Rina...
/2059201/data_sounds_51040 Zommers, Roberts, 1899-1984 latvia Bizet, Georges, 1838-1875 National Library of Latvia 1934 Shellac record, audio/mpeg... 10032551 audio disc Europeana Sounds Rīga : Skaņu ateljē Da Capo Operas -- Excerpts Ziedu ārija no Ž. Bizē ope... 1934
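
The title search above is a plain substring match, so a short term such as "oper" also matches inside unrelated words. A minimal sketch of a stricter variant, assuming word boundaries are wanted; the toy titles are hypothetical and not taken from the challenge data:

```python
import re
import pandas as pd

# Hypothetical toy titles, not taken from the challenge data.
titles = pd.Series([
    "Arie aus der Oper Carmen",
    "Cooperation march",
    "Operette in drei Akten",
    "Polka for accordion",
])

terms = ["opera", "oper", "operette"]

# Escape each term and require a word boundary on both sides, so that
# "oper" matches the word "Oper" but not the inside of "Cooperation".
# The non-capturing group (?:...) keeps pandas from warning about groups.
pattern = r"\b(?:%s)\b" % "|".join(re.escape(t) for t in terms)

# case=False replaces the explicit .str.lower() step
mask = titles.str.contains(pattern, case=False, regex=True)

print(titles[mask].tolist())
```

Strict boundaries also drop inflected forms such as the Latvian "operas" seen in the results above, so with this variant such forms would have to be added to the term list explicitly. Which trade-off is better depends on the languages in the collection.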

Search for commonly known artists

As with the title-based search, the creator and contributor fields can be searched for well-known names. The example below searches the creator field.

In [5]:
# define a list of typical terms
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]

# search all creator fields (lower-case) of the metadata
search_result = metadata[metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))]

# print results
print("Number of classical music tracks found: %d" % search_result.shape[0])

search_result.head()
Number of classical music tracks found: 162
Out[5]:
contributor country created creator dataProvider date description format identifier language medium provider publisher relation spatial subject title type year
id
/2059201/data_sounds_11079 Vētra, Mariss, 1901-1965, ... latvia Beethoven, Ludwig van, 177... National Library of Latvia [19--] Shellac record, audio/mpeg... 3795, 32182557 audio disc Europeana Sounds [Latvija] : Bellaccord Ele... Songs (High voice) with or... Tā debess izteic
/2059201/data_sounds_28043 Sakss, Pauls, 1878-1966 latvia Offenbach, Jacques, 1819-1880 National Library of Latvia [19--] Shellac record, audio/mpeg... 3336, 10034167 audio disc Europeana Sounds Latvija : Bellaccord Electro Operas -- Excerpts Mīlas nakts : barkarolla n...
/2059201/data_sounds_48216 Vējš, Teodors, 1902-1965, ... latvia Bizet, Georges, 1838-1875 National Library of Latvia [19--] Shellac record, audio/mpeg... 3315, 10032551 audio disc Europeana Sounds [Latvija] : Bellaccord Ele... Operas -- Excerpts Nadira ārija no op. "...
/2059201/data_sounds_51040 Zommers, Roberts, 1899-1984 latvia Bizet, Georges, 1838-1875 National Library of Latvia 1934 Shellac record, audio/mpeg... 10032551 audio disc Europeana Sounds Rīga : Skaņu ateljē Da Capo Operas -- Excerpts Ziedu ārija no Ž. Bizē ope... 1934
/2059201/data_sounds_53431 Priednieks-Kavara, Artūrs,... latvia Beethoven, Ludwig van, 177... National Library of Latvia [19--] Shellac record, audio/mpeg... E.K.1083, 32182557 audio disc Europeana Sounds [England] : His Master&apo... Songs (High voice) with or... Tā debess izteic tā mūžīgā...

Combined search on various fields

In [6]:
# search terms
classical_music_terms    = ["opera", "oper", "operette", "violin concert", "quartett"]
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]

# there are various ways to approach this task
# here we demonstrate it in a more structured way

# 1. all title terms
title_boolean_map = metadata.title.str.lower().str.contains('|'.join(classical_music_terms))
# 2. all creator terms
composer_boolean_map = metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))
# 3. logically combine results
combined_boolean_map = title_boolean_map | composer_boolean_map

# search all fields using the combined maps
search_result = metadata[combined_boolean_map]

# print results
print("Number of classical music tracks found: %d" % search_result.shape[0])

search_result.head()
Number of classical music tracks found: 257
Out[6]:
contributor country created creator dataProvider date description format identifier language medium provider publisher relation spatial subject title type year
id
/2059201/data_sounds_11079 Vētra, Mariss, 1901-1965, ... latvia Beethoven, Ludwig van, 177... National Library of Latvia [19--] Shellac record, audio/mpeg... 3795, 32182557 audio disc Europeana Sounds [Latvija] : Bellaccord Ele... Songs (High voice) with or... Tā debess izteic
/2059201/data_sounds_28043 Sakss, Pauls, 1878-1966 latvia Offenbach, Jacques, 1819-1880 National Library of Latvia [19--] Shellac record, audio/mpeg... 3336, 10034167 audio disc Europeana Sounds Latvija : Bellaccord Electro Operas -- Excerpts Mīlas nakts : barkarolla n...
/2059201/data_sounds_48216 Vējš, Teodors, 1902-1965, ... latvia Bizet, Georges, 1838-1875 National Library of Latvia [19--] Shellac record, audio/mpeg... 3315, 10032551 audio disc Europeana Sounds [Latvija] : Bellaccord Ele... Operas -- Excerpts Nadira ārija no op. "...
/2059201/data_sounds_50752 Frinbergs, Artūrs, 1916-1984 latvia Puccini, Giacomo, 1858-1924 National Library of Latvia [19--] Shellac record, audio/mpeg... B/20894 audio disc Europeana Sounds [Latvija] : Рижский завод Operas -- Excerpts Kavaradosi ārija no operas...
/2059201/data_sounds_50753 Frinbergs, Artūrs, 1916-1984 latvia Чайковский, Петр Ильич, 18... National Library of Latvia [19--] Shellac record, audio/mpeg... B/19958 audio disc Europeana Sounds [Latvija] : Рижский завод Operas -- Excerpts Hermaņa ārija no operas &q...

Add a "genre" column to the metadata to store results

In [7]:
# add the column
metadata["genre"] = ""

# set calculated "genre" labels accordingly

# define a list of typical terms
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]

# the same boolean map which was previously used to display the result, 
# can be used to set the labels for the calculated results
search_result_index = metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))

# set the labels accordingly
# (.loc assigns in place and avoids pandas' chained-assignment pitfall)
metadata.loc[search_result_index, "genre"] = "classical"

# show results
print("number of classical tracks: %d" % metadata[metadata.genre == "classical"].shape[0])

metadata[metadata.genre == "classical"].head()
number of classical tracks: 162
Out[7]:
contributor country created creator dataProvider date description format identifier language medium provider publisher relation spatial subject title type year genre
id
/2059201/data_sounds_11079 Vētra, Mariss, 1901-1965, ... latvia Beethoven, Ludwig van, 177... National Library of Latvia [19--] Shellac record, audio/mpeg... 3795, 32182557 audio disc Europeana Sounds [Latvija] : Bellaccord Ele... Songs (High voice) with or... Tā debess izteic classical
/2059201/data_sounds_28043 Sakss, Pauls, 1878-1966 latvia Offenbach, Jacques, 1819-1880 National Library of Latvia [19--] Shellac record, audio/mpeg... 3336, 10034167 audio disc Europeana Sounds Latvija : Bellaccord Electro Operas -- Excerpts Mīlas nakts : barkarolla n... classical
/2059201/data_sounds_48216 Vējš, Teodors, 1902-1965, ... latvia Bizet, Georges, 1838-1875 National Library of Latvia [19--] Shellac record, audio/mpeg... 3315, 10032551 audio disc Europeana Sounds [Latvija] : Bellaccord Ele... Operas -- Excerpts Nadira ārija no op. "... classical
/2059201/data_sounds_51040 Zommers, Roberts, 1899-1984 latvia Bizet, Georges, 1838-1875 National Library of Latvia 1934 Shellac record, audio/mpeg... 10032551 audio disc Europeana Sounds Rīga : Skaņu ateljē Da Capo Operas -- Excerpts Ziedu ārija no Ž. Bizē ope... 1934 classical
/2059201/data_sounds_53431 Priednieks-Kavara, Artūrs,... latvia Beethoven, Ludwig van, 177... National Library of Latvia [19--] Shellac record, audio/mpeg... E.K.1083, 32182557 audio disc Europeana Sounds [England] : His Master&apo... Songs (High voice) with or... Tā debess izteic tā mūžīgā... classical
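
The same labelling step can be repeated for several genres. The following is a rough sketch, assuming a small hypothetical rule table (`rules`) and toy records (`toy`) in place of the real metadata; it also reports which share of the records received a label:

```python
import pandas as pd

# Hypothetical toy records standing in for the metadata table.
toy = pd.DataFrame(
    {"creator": ["Bizet, Georges", "Strauss, Johann", "", "Mozart, W. A."],
     "title":   ["Ziedu arija", "Annen-Polka", "Interview", "Quartett"]},
    index=["a", "b", "c", "d"],
)
toy["genre"] = ""

# Hypothetical rules: genre label -> (column to search, search terms).
rules = {
    "classical": ("creator", ["mozart", "bizet"]),
    "polka":     ("title",   ["polka"]),
}

for label, (column, terms) in rules.items():
    mask = toy[column].str.lower().str.contains("|".join(terms))
    # only label rows that are still unlabelled, so earlier rules win
    toy.loc[mask & (toy["genre"] == ""), "genre"] = label

# share of records that received any label
coverage = (toy["genre"] != "").mean()

print(toy["genre"].tolist())
print("coverage: %.2f" % coverage)
```

Letting earlier rules win is only one possible policy; records matched by several rules could equally be flagged for manual review.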

Audio Feature Processing

Load and pre-process audio Features

In [8]:
# loading audio features. Pandas can read zipped csv-files. Manual unzipping is not required.
features_mfcc = pd.read_csv("%s/mfcc.csv.gz" % DATA_PATH, sep=";", header=None)

# set index
features_mfcc = features_mfcc.set_index(0)
features_mfcc.index.name = "id"

features_mfcc.head()
Out[8]:
1 2 3 4 5 6 7 8 9 10 ... 56 57 58 59 60 61 62 63 64 65
id
/2023601/oai_eu_dismarc_CHARM_DISC01SIDE02METSEE29 0.045898 -0.006166 0.034595 -0.012495 -0.314453 -0.469206 -0.724443 -1.274571 -1.821123 -2.888282 ... 8.849905 7.798388 7.017814 6.247522 5.712278 5.170348 4.825831 4.455029 4.153709 3.743235
/2023601/oai_eu_dismarc_CHARM_DISC01SIDE02METSEEE54 0.061028 0.012468 0.046163 -0.012373 -0.331475 -0.418479 -0.776257 -1.194863 -2.746311 -4.085886 ... 5.668731 4.740953 3.946593 3.308268 2.723058 2.213726 1.778679 1.388869 1.086135 0.831883
/2023601/oai_eu_dismarc_CHARM_DISC01SIDE02METSEG2031 0.000829 0.000408 0.000185 0.000143 -0.822003 -6.464063 -9.709557 -11.803703 -16.232235 -17.719405 ... 9.148643 8.243159 7.083017 5.999343 5.124123 4.545999 4.188426 3.986916 3.901077 3.899819
/2023601/oai_eu_dismarc_CHARM_DISC01SIDE02METSEG3184 0.009110 -0.000004 -0.000575 0.000027 -0.584520 -0.249008 -4.699470 -9.617082 -13.266026 -14.578073 ... 6.537391 5.389040 4.170882 3.118195 2.448338 2.084395 1.958324 2.004001 2.230389 2.541304
/2023601/oai_eu_dismarc_CHARM_DISC01SIDE02METSEG3687 0.008827 -0.000392 -0.033676 -1.289619 -2.237123 -1.635143 -0.407569 -4.454713 -9.577355 -12.580112 ... 3.439976 3.326632 3.043295 2.753692 2.407103 2.163529 1.988904 1.868926 1.786403 1.736148

5 rows × 65 columns

Pre-process features

In [9]:
from sklearn.preprocessing import StandardScaler

# create and fit the scaler
scaler = StandardScaler()
scaler.fit(features_mfcc)

# sklearn returns a numpy array, thus we have to re-create the pandas DataFrame and its index
features_mfcc = pd.DataFrame(scaler.transform(features_mfcc), index=features_mfcc.index)
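
To illustrate what the scaling step does, here is a small sketch on synthetic data (the random columns below are an assumption and merely stand in for the MFCC table). After standardisation every column has zero mean and unit variance, so no single feature dominates the classifier because of its scale:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the MFCC table: 3 columns on very different scales.
rng = np.random.RandomState(0)
raw = pd.DataFrame(
    rng.normal(loc=[0.0, -5.0, 8.0], scale=[0.1, 4.0, 2.0], size=(200, 3)),
    index=["track_%d" % i for i in range(200)],
)

# fit_transform returns a numpy array, so index and columns are restored by hand
scaled = pd.DataFrame(StandardScaler().fit_transform(raw),
                      index=raw.index, columns=raw.columns)

# StandardScaler uses the population standard deviation (ddof=0)
print(scaled.mean().abs().max())
print(scaled.std(ddof=0).tolist())
```

Note that the notebook fits the scaler on all tracks before splitting into training and test data. For a stricter evaluation one would fit it on the training rows only.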

Get the subset of features for which genre labels are already known

The labels are obtained with the text processing approach introduced above.

First: create the subset using text processing

In [10]:
def search_for_tracks(query_terms, columns):
    
    # build one regular expression that matches any of the query terms;
    # joining a single-element list simply yields that term
    pattern = '|'.join(query_terms)
    
    # one boolean mask per searched column
    boolean_masks = [metadata[column].str.lower().str.contains(pattern) for column in columns]
    
    # keep every row that matches in at least one of the columns
    return metadata[np.any(boolean_masks, axis=0)]
In [11]:
# define the classes for the classifier
classes = {}

classes["polka"]       = search_for_tracks(["polka"],     ["title", "description", "subject"])
classes["opera"]       = search_for_tracks(["oper"],      ["title", "description", "subject"])
classes["spoken_word"] = search_for_tracks(["interview"], ["title", "description", "subject"])
classes["choral"]      = search_for_tracks(["choral"],    ["title", "description", "subject"])

Second: use the indexes of the search result to get the corresponding feature vectors

In [12]:
features_for_classification = []
labels_for_classification   = []

for class_name in classes.keys():
    
    # metadata and features share the same ids, thus we can use the search result to index the feature-set.
    # Because features are available for fewer tracks than metadata, we need to use the intersection of the
    # search result with the feature set. Otherwise, we would receive a lot of nan-valued rows.
    # (.loc replaces the removed .ix indexer for label-based selection)
    features = features_mfcc.loc[classes[class_name].index.intersection(features_mfcc.index)]
    
    features_for_classification.append(features)
    labels_for_classification.extend( [class_name for _ in range(features.shape[0])] )

# combine features into a feature-space
features_for_classification = pd.concat(features_for_classification, axis=0)
labels_for_classification   = np.asarray(labels_for_classification)

print("%d feature vectors for the %d classes available" % (features_for_classification.shape[0], len(classes.keys())))
302 feature vectors for the 4 classes available
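
Because the four class queries run independently, a track whose metadata matches more than one query ends up in the feature set several times with different labels, and the classes can differ a lot in size. The sketch below shows one way to check both; the `samples` table is hypothetical and stands in for the ids in features_for_classification.index paired with labels_for_classification:

```python
import pandas as pd

# Hypothetical (id, label) pairs; in the notebook these would come from
# features_for_classification.index and labels_for_classification.
samples = pd.DataFrame({
    "id":    ["t1", "t2", "t3", "t4", "t2", "t5"],
    "label": ["opera", "polka", "opera", "spoken_word", "opera", "opera"],
})

# number of tracks per class
class_counts = samples["label"].value_counts()

# ids that were matched by more than one class query
ambiguous_ids = samples.loc[samples["id"].duplicated(keep=False), "id"].unique().tolist()

print(class_counts.to_dict())
print(ambiguous_ids)
```

Knowing the size of the largest class also gives a baseline: a classifier that always predicts that class already reaches its share as accuracy.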

Third: train a classifier

In [13]:
# first a little pre-processing is required to convert the textual labels into integer class ids
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(labels_for_classification)
Out[13]:
LabelEncoder()
In [14]:
# to test if classifier works accurately, we split the data into a training-set and a test-set
from sklearn.model_selection import StratifiedShuffleSplit

# create the splitter
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8)

# split the data
splits = sss.split(features_for_classification, labels_for_classification)

# this is a trick to get the splits without using a for-loop
(train_index, test_index) = list(splits)[0]
In [15]:
# create the classifier - for this example we use a Support Vector Machine
from sklearn.svm import LinearSVC

classifier = LinearSVC()

# train the classifier
# train_index and test_index hold integer positions, so .iloc is used (the .ix indexer was removed from pandas)
classifier.fit(features_for_classification.iloc[train_index], encoder.transform(labels_for_classification[train_index]))

# test the classifier
predictions = classifier.predict(features_for_classification.iloc[test_index])

pd.DataFrame({"expected":  labels_for_classification[test_index], 
              "predicted": encoder.inverse_transform(predictions)},
             index=features_for_classification.iloc[test_index].index)
Out[15]:
expected predicted
id
/2059201/data_sounds_54536 polka opera
/2059201/data_sounds_59504 opera opera
/2059201/data_sounds_59581 polka opera
/2051921/data_euscreenXL_5350 spoken_word spoken_word
/2051921/data_euscreenXL_5995 spoken_word polka
/2059201/data_sounds_55008 choral polka
/2059201/data_sounds_55067 polka polka
/2059201/data_sounds_49583 opera opera
/2059201/data_sounds_55054 polka polka
/2059201/data_sounds_64978 opera opera
/2059201/data_sounds_55278 opera opera
/2059201/data_sounds_60753 opera polka
/2051921/data_euscreenXL_5403 spoken_word spoken_word
/2059201/data_sounds_61595 opera opera
/2059201/data_sounds_28043 opera opera
/2059201/data_sounds_62830 opera opera
/2059201/data_sounds_29 polka polka
/2059201/data_sounds_53177 polka polka
/2059201/data_sounds_63748 polka opera
/2059201/data_sounds_63619 polka polka
/2051921/data_euscreenXL_5393 spoken_word spoken_word
/2059201/data_sounds_56117 polka polka
/2059207/data_sounds_T202_3 opera opera
/2051921/data_euscreenXL_5013 spoken_word spoken_word
/2059201/data_sounds_51040 opera opera
/2059201/data_sounds_62822 polka polka
/2059201/data_sounds_57876 opera opera
/2051921/data_euscreenXL_5181 spoken_word spoken_word
/2059201/data_sounds_65006 polka polka
/2059201/data_sounds_61899 polka polka
... ... ...
/2059213/data_sounds_8761 polka polka
/2059201/data_sounds_64034 polka polka
/2059201/data_sounds_53337 polka polka
/2059201/data_sounds_63597 polka opera
/2059201/data_sounds_58784 opera opera
/2059201/data_sounds_54885 opera opera
/2059201/data_sounds_49584 opera opera
/2059201/data_sounds_60472 opera opera
/2059201/data_sounds_70 polka polka
/2059201/data_sounds_59787 opera opera
/2059201/data_sounds_64374 polka polka
/2059201/data_sounds_61135 polka opera
/2059201/data_sounds_63019 opera spoken_word
/2059201/data_sounds_56106 opera opera
/2059201/data_sounds_60096 opera opera
/2051921/data_euscreenXL_6220 spoken_word spoken_word
/2051921/data_euscreenXL_6108 spoken_word spoken_word
/2059201/data_sounds_53411 opera opera
/2059201/data_sounds_53391 opera polka
/2059201/data_sounds_50815 opera polka
/2059201/data_sounds_59782 opera opera
/2059201/data_sounds_63193 polka polka
/2051921/data_euscreenXL_5402 spoken_word spoken_word
/2059207/data_sounds_T284_5 polka polka
/2059201/data_sounds_54887 opera opera
/2059201/data_sounds_65086 polka polka
/2059201/data_sounds_48220 polka opera
/2059201/data_sounds_60520 polka polka
/2059201/data_sounds_49594 opera opera
/2059201/data_sounds_54881 opera opera

61 rows × 2 columns
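
A long list of expected and predicted labels is hard to read. A confusion table summarises which classes get mixed up. The sketch below uses only the first six rows of the result table above, so the counts are illustrative and not the full result:

```python
import pandas as pd

# First six rows of the result table above (expected, predicted).
expected  = ["polka", "opera", "polka", "spoken_word", "spoken_word", "choral"]
predicted = ["opera", "opera", "opera", "spoken_word", "polka", "polka"]

# rows: expected label, columns: predicted label
confusion = pd.crosstab(pd.Series(expected, name="expected"),
                        pd.Series(predicted, name="predicted"))

print(confusion)
```

Applied to the complete result DataFrame, the same call would take its "expected" and "predicted" columns as arguments.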

In [16]:
# estimate the accuracy of the classifier
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions
print("Classifier accuracy: %.2f%%" % (accuracy_score(encoder.transform(labels_for_classification[test_index]), predictions) * 100.0))
Classifier accuracy: 78.69%
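
The accuracy above comes from a single random 80/20 split with 61 test tracks, so it will vary from run to run. A rough sketch of a more stable estimate, assuming stratified 5-fold cross-validation; the data is synthetic (generated with make_classification) and only stands in for the MFCC features and genre labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic 4-class data standing in for the MFCC features (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=4, random_state=0)

# Scaling inside the pipeline is re-fitted on the training part of each fold.
model = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))

# Five stratified folds keep the class proportions in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("accuracy per fold:", scores.round(3))
print("mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

Reporting the mean together with the spread across folds makes it easier to judge whether a change to the features or the classifier is a real improvement.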