Take part and help identify genres in musical heritage.
The Europeana Sounds project has spent the last three years bringing the collections of sound archives across Europe online. All the material related to music can be found in its dedicated thematic collection, Europeana Music, which so far holds over 250,000 pictures, texts and sound files.
For a user of Europeana Music, being able to search for particular music genres (e.g. free jazz, Irish folk, baroque) is essential for finding what they are looking for in this vast amount of material. However, this information is not always available in the data. Currently, only about a fifth of the Europeana Music Collection has been labelled with a unified genre description, and even then the genre classification is often very general, because it has been applied at the collection level rather than for each specific piece.
To improve the quality of the genre information, we are organising a genre detection challenge on the 1st of October in Vienna.
We have teamed up with the organisers of a large hackathon that is part of the Vienna Waves festival, an annual festival focusing on cutting-edge music that combines club nights with lectures, keynote talks, discussions and room for experiments. Up to 100 participants are welcome in the amazing "Werkstaetten- und Kulturhaus" (WUK) to work on this challenge or one of the many others proposed there.
The Europeana API provides programmatic access to over 35,000 music recordings that are available through the Europeana Music Collection. With this challenge we are looking for methods that automatically process the Europeana Music Collection and apply suitable genre descriptions at the item level. All the openly licensed sound files from the Europeana database will be made available to participants beforehand, so that they can get to work right from the start of the day.
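For those who want to explore the data first, the collection can be queried through the Europeana Search API. The sketch below is a minimal example, assuming the v2 search.json endpoint and a registered API key (YOUR_API_KEY is a placeholder); the TYPE:SOUND filter restricts results to audio items.
import requests
# query the Europeana Search API for sound recordings
# (YOUR_API_KEY is a placeholder - register with Europeana to obtain a key)
params = {
    "wskey": "YOUR_API_KEY",
    "query": "mozart",        # free-text query
    "qf": "TYPE:SOUND",       # restrict the results to sound recordings
    "rows": 10,               # number of results to return
}
response = requests.get("https://www.europeana.eu/api/v2/search.json", params=params)
for item in response.json().get("items", []):
    print(item.get("title"), item.get("id"))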
Prize
For the best solution of the day, a prize of 500 euros (in vouchers) is available - but we don't want the work to end there. If the prototype developed during the hack day has potential, the Europeana Foundation will be able to work with you (and pay you) to develop it further into a working product and feed the generated data into the Europeana crowdsourcing API.
We think the next step would be to verify the results generated by the algorithm using the crowdsourcing tool that is also being developed as part of the Europeana Sounds project. So instead of asking "what genre do you think this is?", we can ask the Europeana user "we think this is polka, do you agree?". This makes it much easier for the user to participate in the crowdsourcing activities and thereby improve the data.
To run this notebook and its example code, the following modules are required:
import numpy as np
import pandas as pd
# limit the displayed column width to keep the tables readable
pd.set_option('display.max_colwidth', 30)
# adjust this path to wherever the challenge data is stored locally
DATA_PATH = "G:/eu_sounds_challenge/"
# read csv-data (separated by semicolons)
metadata = pd.read_csv("%s/metadata.csv" % DATA_PATH, sep=";")
# convert nan-values to empty strings
metadata = metadata.fillna("")
# use the "id" field as index
metadata = metadata.set_index("id")
metadata.head()
There are a lot of terms that frequently occur in title tags and can be associated with musical genres. As always when working with Europeana data, be aware that many descriptive data fields are multilingual.
# define a list of typical terms
classical_music_terms = ["opera", "oper", "operette", "violin concert", "quartett"]
# search all title fields (lower-case) of the metadata
# 1. use title fields to search
search_index = metadata.title
# 2. convert to lower case
search_index = search_index.str.lower()
# 3. get a boolean map of the rows whose titles contain the search strings
search_index = search_index.str.contains('|'.join(classical_music_terms))
# 4. use the boolean map to select the matching rows
search_result = metadata[search_index]
# the same can be written as a single line
search_result = metadata[metadata.title.str.lower().str.contains('|'.join(classical_music_terms))]
# print results
print "Number of classical music tracks found: %d" % search_result.shape[0]
search_result.head()
Similar to the title-based search, the creator and contributor fields can also be searched for commonly known terms (a sketch for the contributor field follows after this example).
# define a list of typical terms
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]
# search all creator fields of the metadata
search_result = metadata[metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))]
# print results
print "Number of classical music tracks found: %d" % search_result.shape[0]
search_result.head()
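The contributor field can be searched in exactly the same way. The minimal sketch below assumes the metadata CSV also provides a contributor column, which the text above suggests but the example does not show.
# search the contributor field for the same composer names
# (assumes the metadata CSV provides a "contributor" column)
contributor_result = metadata[metadata.contributor.str.lower().str.contains('|'.join(classical_composer_names))]
print("Number of matches in the contributor field: %d" % contributor_result.shape[0])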
# search terms
classical_music_terms = ["opera", "oper", "operette", "violin concert", "quartett"]
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]
# there are various ways to approach this task
# here we demonstrate it in a more structured way
# 1. all title terms
title_boolean_map = metadata.title.str.lower().str.contains('|'.join(classical_music_terms))
# 2. all creator terms
composer_boolean_map = metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))
# 3. logically combine results
combined_boolean_map = title_boolean_map | composer_boolean_map
# search all fields using the combined maps
search_result = metadata[combined_boolean_map]
# print results
print "Number of classical music tracks found: %d" % search_result.shape[0]
search_result.head()
# add the column
metadata["genre"] = ""
# set calculated "genre" labels accordingly
# define a list of typical terms
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]
# the same boolean map which was previously used to display the result,
# can be used to set the labels for the calculated results
search_result_index = metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))
# set the labels via .loc (avoids pandas chained-assignment warnings)
metadata.loc[search_result_index, "genre"] = "classical"
# show results
print "number of classical tracks: %d" % metadata[metadata.genre == "classical"].shape[0]
metadata[metadata.genre == "classical"].head()
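The same pattern extends to any number of genres. The sketch below is only illustrative - the term lists are hand-picked assumptions, not a curated taxonomy:
# map genre labels to hand-picked search terms (illustrative, not exhaustive)
genre_terms = {
    "classical": ["opera", "oper", "operette", "violin concert", "quartett"],
    "folk": ["polka", "mazurka", "waltz"],
}
for genre, terms in genre_terms.items():
    matches = metadata.title.str.lower().str.contains('|'.join(terms))
    metadata.loc[matches, "genre"] = genre
# count the labels assigned so far (unlabelled rows remain "")
print(metadata.genre.value_counts().head())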
# load the audio features; pandas can read gzipped csv-files directly, so manual unzipping is not required
features_mfcc = pd.read_csv("%s/mfcc.csv.gz" % DATA_PATH, sep=";", header=None)
# set index
features_mfcc = features_mfcc.set_index(0)
features_mfcc.index.name = "id"
features_mfcc.head()
from sklearn.preprocessing import StandardScaler
# create and fit the scaler
scaler = StandardScaler()
scaler.fit(features_mfcc)
# sklearn returns a numpy array, thus we have to re-create the pandas DataFrame and its index
features_mfcc = pd.DataFrame(scaler.transform(features_mfcc), index=features_mfcc.index)
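A quick sanity check: after standardisation, each feature column should have a mean close to zero and a standard deviation close to one.
# verify the effect of the StandardScaler on the first few columns
print(features_mfcc.mean().round(2).head())
print(features_mfcc.std().round(2).head())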
We now build a labelled training set using the previously introduced text-processing approach.
First: create the subsets using text processing.
def search_for_tracks(query_terms, columns):
    """Return all metadata rows where any given column contains any of the query terms."""
    boolean_masks = []
    for column in columns:
        # a regex alternation over the terms works for one term or many
        boolean_masks.append(metadata[column].str.lower().str.contains('|'.join(query_terms)))
    # a row matches if any of the per-column masks is True
    return metadata[np.any(boolean_masks, axis=0)]
# define the classes for the classifier
classes = {}
classes["polka"] = search_for_tracks(["polka"], ["title", "description", "subject"])
classes["opera"] = search_for_tracks(["oper"], ["title", "description", "subject"])
classes["spoken_word"] = search_for_tracks(["interview"], ["title", "description", "subject"])
classes["choral"] = search_for_tracks(["choral"], ["title", "description", "subject"])
Second: use the indexes of the search result to get the corresponding feature vectors
features_for_classification = []
labels_for_classification = []
for class_name in classes.keys():
    # metadata and features share the same index, so the search result can be used
    # to index the feature set; since there are fewer feature rows than metadata rows,
    # intersect the two indexes first, otherwise we would get many nan-valued rows
    features = features_mfcc.loc[classes[class_name].index.intersection(features_mfcc.index)]
    features_for_classification.append(features)
    labels_for_classification.extend([class_name for _ in range(features.shape[0])])
# combine features into a feature-space
features_for_classification = pd.concat(features_for_classification, axis=0)
labels_for_classification = np.asarray(labels_for_classification)
print "%d feature vectors for the %d classes available" % (features_for_classification.shape[0], len(classes.keys()))
Third: train a classifier
# first a little pre-processing is required to convert the textual labels into a decimal representation
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(labels_for_classification)
# to test whether the classifier works accurately, we split the data into a training set and a test set
from sklearn.model_selection import StratifiedShuffleSplit
# create the splitter
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8)
# split the data; sss.split returns a generator, so take the first (and only) split
train_index, test_index = next(sss.split(features_for_classification, labels_for_classification))
# create the classifier - for this example we use a Support Vector Machine
from sklearn.svm import LinearSVC
classifier = LinearSVC()
# train the classifier; the split yields positional indices, hence .iloc
classifier.fit(features_for_classification.iloc[train_index], encoder.transform(labels_for_classification[train_index]))
# test the classifier
predictions = classifier.predict(features_for_classification.iloc[test_index])
pd.DataFrame({"expected": labels_for_classification[test_index],
              "predicted": encoder.inverse_transform(predictions)},
             index=features_for_classification.iloc[test_index].index)
# estimate the accuracy of the classifier
from sklearn.metrics import accuracy_score
print("Classifier accuracy: %.2f%%" % (accuracy_score(encoder.transform(labels_for_classification[test_index]), predictions) * 100.0))