Take part and help identify genres in musical heritage.
The Europeana Sounds project has spent the last three years bringing the collections of sound archives across Europe online. All the material related to music can be found in its dedicated thematic collection, Europeana Music, which so far holds over 250,000 pictures, texts and sound files.
For a user of Europeana Music, being able to search for particular music genres (e.g. free jazz, Irish folk, baroque) is essential for finding what they are looking for in this vast amount of material. However, this information is not always available in the data. Currently, only about a fifth of the Europeana Music Collection has been labelled with a unified genre description, and even then the genre classification is often very general, because it has been applied at the collection level rather than for each specific piece.
To improve the quality of the genre information, we are organising a genre detection challenge on the 1st of October in Vienna.
We have teamed up with the organisers of a large hackathon that is part of the Vienna Waves festival, an annual festival focusing on cutting-edge music that combines club nights with lectures, keynote talks, discussions and room for experiments. Up to 100 participants are welcome in the amazing "Werkstaetten- und Kulturhaus" (WUK) to work on this challenge or one of the many others proposed there.
The Europeana API provides programmatic access to over 35,000 music recordings that are available through the Europeana Music Collection. With this challenge we are looking for methods that automatically process the Europeana Music Collection and apply suitable genre descriptions at the item level. All the openly licensed sound files from the Europeana database will be made available to participants beforehand, so that they can get to work right from the start of the day.
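For those who want to explore the data first, the collection can be queried through the Europeana Search API. The sketch below is a minimal example, assuming the v2 search.json endpoint and a registered API key (YOUR_API_KEY is a placeholder); the TYPE:SOUND filter restricts results to audio items.
import requests
# query the Europeana Search API for sound recordings
# (YOUR_API_KEY is a placeholder - register with Europeana to obtain a key)
params = {
    "wskey": "YOUR_API_KEY",
    "query": "mozart",        # free-text query
    "qf": "TYPE:SOUND",       # restrict the results to sound recordings
    "rows": 10,               # number of results to return
}
response = requests.get("https://www.europeana.eu/api/v2/search.json", params=params)
for item in response.json().get("items", []):
    print(item.get("title"), item.get("id"))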
Prize
For the best solution of the day, a prize of 500 euros (in vouchers) is available - but we don't want the work to end there. If the prototype developed during the hack day has potential, the Europeana Foundation will be able to work with you (and pay you) to develop it further into a working product and feed the generated data into the Europeana crowdsourcing API.
We think the next step would be to verify the results generated by the algorithm using the crowdsourcing tool that is also being developed as part of the Europeana Sounds project. So instead of asking "what genre do you think this is?", we can ask the Europeana user "we think this is polka, do you agree?". This makes it much easier for the user to participate in the crowdsourcing activities and thereby improve the data.
To run this notebook and its example code, the following modules are required:
import numpy as np
import pandas as pd
# limit the displayed column width to keep the tables readable
pd.set_option('display.max_colwidth', 30)
# adjust this path to wherever the challenge data is stored locally
DATA_PATH = "G:/eu_sounds_challenge/"
# read csv-data (separated by semicolons)
metadata = pd.read_csv("%s/metadata.csv" % DATA_PATH, sep=";")
# convert nan-values to empty strings
metadata = metadata.fillna("")
# use the "id" field as index
metadata = metadata.set_index("id")
metadata.head()
There are a lot of terms that frequently occur in title tags and can be associated with musical genres. As always when working with Europeana data, be aware that many descriptive data fields are multilingual.
# define a list of typical terms
classical_music_terms = ["opera", "oper", "operette", "violin concert", "quartett"]
# search all title fields (lower-case) of the metadata
# 1. use title fields to search
search_index = metadata.title
# 2. convert to lower case
search_index = search_index.str.lower()
# 3. get a boolean map of the rows whose titles contain the search strings
search_index = search_index.str.contains('|'.join(classical_music_terms))
# 4. use the boolean map to select the matching rows
search_result = metadata[search_index]
# the same can be written as a single line
search_result = metadata[metadata.title.str.lower().str.contains('|'.join(classical_music_terms))]
# print results
print "Number of classical music tracks found: %d" % search_result.shape[0]
search_result.head()
Similar to the title-based search, the creator and contributor fields can also be searched for commonly known terms (a sketch for the contributor field follows after this example).
# define a list of typical terms
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]
# search all creator fields of the metadata
search_result = metadata[metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))]
# print results
print "Number of classical music tracks found: %d" % search_result.shape[0]
search_result.head()
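The contributor field can be searched in exactly the same way. The minimal sketch below assumes the metadata CSV also provides a contributor column, which the text above suggests but the example does not show.
# search the contributor field for the same composer names
# (assumes the metadata CSV provides a "contributor" column)
contributor_result = metadata[metadata.contributor.str.lower().str.contains('|'.join(classical_composer_names))]
print("Number of matches in the contributor field: %d" % contributor_result.shape[0])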
# search terms
classical_music_terms = ["opera", "oper", "operette", "violin concert", "quartett"]
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]
# there are various ways to approach this task
# here we demonstrate it in a more structured way
# 1. all title terms
title_boolean_map = metadata.title.str.lower().str.contains('|'.join(classical_music_terms))
# 2. all creator terms
composer_boolean_map = metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))
# 3. logically combine results
combined_boolean_map = title_boolean_map | composer_boolean_map
# search all fields using the combined maps
search_result = metadata[combined_boolean_map]
# print results
print "Number of classical music tracks found: %d" % search_result.shape[0]
search_result.head()
# add the column
metadata["genre"] = ""
# set calculated "genre" labels accordingly
# define a list of typical terms
classical_composer_names = ["mozart", "beethoven", "offenbach", "bizet"]
# the same boolean map which was previously used to display the result,
# can be used to set the labels for the calculated results
search_result_index = metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))
# set the labels via .loc (avoids pandas chained-assignment warnings)
metadata.loc[search_result_index, "genre"] = "classical"
# show results
print "number of classical tracks: %d" % metadata[metadata.genre == "classical"].shape[0]
metadata[metadata.genre == "classical"].head()
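The same pattern extends to any number of genres. The sketch below is only illustrative - the term lists are hand-picked assumptions, not a curated taxonomy:
# map genre labels to hand-picked search terms (illustrative, not exhaustive)
genre_terms = {
    "classical": ["opera", "oper", "operette", "violin concert", "quartett"],
    "folk": ["polka", "mazurka", "waltz"],
}
for genre, terms in genre_terms.items():
    matches = metadata.title.str.lower().str.contains('|'.join(terms))
    metadata.loc[matches, "genre"] = genre
# count the labels assigned so far (unlabelled rows remain "")
print(metadata.genre.value_counts().head())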
# load the audio features; pandas can read gzipped csv-files directly, so manual unzipping is not required
features_mfcc = pd.read_csv("%s/mfcc.csv.gz" % DATA_PATH, sep=";", header=None)
# set index
features_mfcc = features_mfcc.set_index(0)
features_mfcc.index.name = "id"
features_mfcc.head()
from sklearn.preprocessing import StandardScaler
# create and fit the scaler
scaler = StandardScaler()
scaler.fit(features_mfcc)
# sklearn returns a numpy array, thus we have to re-create the pandas DataFrame and its index
features_mfcc = pd.DataFrame(scaler.transform(features_mfcc), index=features_mfcc.index)
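A quick sanity check: after standardisation, each feature column should have a mean close to zero and a standard deviation close to one.
# verify the effect of the StandardScaler on the first few columns
print(features_mfcc.mean().round(2).head())
print(features_mfcc.std().round(2).head())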
We now build a labelled training set using the previously introduced text-processing approach.
First: create the subsets using text processing.
def search_for_tracks(query_terms, columns):
    """Return all metadata rows where any given column contains any of the query terms."""
    boolean_masks = []
    for column in columns:
        # a regex alternation over the terms works for one term or many
        boolean_masks.append(metadata[column].str.lower().str.contains('|'.join(query_terms)))
    # a row matches if any of the per-column masks is True
    return metadata[np.any(boolean_masks, axis=0)]
# define the classes for the classifier
classes = {}
classes["polka"] = search_for_tracks(["polka"], ["title", "description", "subject"])
classes["opera"] = search_for_tracks(["oper"], ["title", "description", "subject"])
classes["spoken_word"] = search_for_tracks(["interview"], ["title", "description", "subject"])
classes["choral"] = search_for_tracks(["choral"], ["title", "description", "subject"])
Second: use the indexes of the search result to get the corresponding feature vectors
features_for_classification = []
labels_for_classification = []
for class_name in classes.keys():
    # metadata and features share the same index, so the search result can be used
    # to index the feature set; since there are fewer feature rows than metadata rows,
    # intersect the two indexes first, otherwise we would get many nan-valued rows
    features = features_mfcc.loc[classes[class_name].index.intersection(features_mfcc.index)]
    features_for_classification.append(features)
    labels_for_classification.extend([class_name for _ in range(features.shape[0])])
# combine features into a feature-space
features_for_classification = pd.concat(features_for_classification, axis=0)
labels_for_classification = np.asarray(labels_for_classification)
print "%d feature vectors for the %d classes available" % (features_for_classification.shape[0], len(classes.keys()))
Third: train a classifier
# first a little pre-processing is required to convert the textual labels into a decimal representation
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(labels_for_classification)
# to test whether the classifier works accurately, we split the data into a training set and a test set
from sklearn.model_selection import StratifiedShuffleSplit
# create the splitter
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8)
# split the data; sss.split returns a generator, so take the first (and only) split
train_index, test_index = next(sss.split(features_for_classification, labels_for_classification))
# create the classifier - for this example we use a Support Vector Machine
from sklearn.svm import LinearSVC
classifier = LinearSVC()
# train the classifier; the split yields positional indices, hence .iloc
classifier.fit(features_for_classification.iloc[train_index], encoder.transform(labels_for_classification[train_index]))
# test the classifier
predictions = classifier.predict(features_for_classification.iloc[test_index])
pd.DataFrame({"expected": labels_for_classification[test_index],
              "predicted": encoder.inverse_transform(predictions)},
             index=features_for_classification.iloc[test_index].index)
# estimate the accuracy of the classifier
from sklearn.metrics import accuracy_score
print("Classifier accuracy: %.2f%%" % (accuracy_score(encoder.transform(labels_for_classification[test_index]), predictions) * 100.0))