{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Europeana Sounds Music Collection Genre Detection Challenge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Take part and help identify genres in musical heritage.**\n", "\n", "The Europeana Sounds project has been working for the last three years to get the collections of sound archives around Europe online. All the material related to music can be found in its dedicated thematic collection: Europeana Music. So far, over 250,000 pictures, texts and sound files can be found there.\n", "\n", "For a user of Europeana Music, it is useful to be able to search for particular music genres (e.g. free jazz, Irish folk, baroque) to find what they are looking for in this vast amount of material. However, this information is not always available in the data. Currently, only about a fifth of the Europeana Music Collection has been labelled with a unified genre description. Even in those cases, the genre classification is often very general, because it has been applied at the collection level instead of being optimised for a specific piece.\n", "\n", "To improve the quality of the genre information, we are organising a genre detection challenge on the 1st of October in Vienna.\n", "\n", "We have teamed up with the organisers of a large hackathon that is part of the Vienna Waves festival, an annual festival focussing on cutting-edge music that combines club nights with lectures, keynote talks, discussions and room for experiments. Up to 100 participants are welcome in the amazing \"Werkstaetten- und Kulturhaus\" (WUK) to work on this challenge or one of the many others proposed there.\n", "\n", "The Europeana API provides programmatic access to over 35,000 music recordings that are available through the Europeana Music Collection. 
With this challenge we are looking for methods that automatically process the Europeana Music Collection to apply suitable genre descriptions at the item level. All openly licensed sound files from the Europeana database will be made available to the participants beforehand, so that they can start working right at the beginning of the day.\n", "\n", "**Prize**\n", "\n", "For the best solution of the day, a prize of 500 euros (in vouchers) is available, but we don't want the work to end after this day. If the prototype developed during the hackday has potential, the Europeana Foundation will be able to work with you (and pay you) to further develop it into a working product and feed the generated data into the Europeana crowdsourcing API.\n", "\n", "We think the next step would be to verify the results generated by the algorithm using the crowdsourcing tool that is also being developed as part of the Europeana Sounds project. So instead of asking ‘what genre do you think this is?’, we can ask the Europeana user ‘we think this is polka, do you agree?’. This makes it much easier for the user to participate in the crowdsourcing activities and thereby improve the data. 
\n", "\n", "**Questions?**\n", "\n", "* For organizational questions, refer to Joris Pekel\n", "* For technical questions, refer to Alexander Schindler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Required modules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run this notebook and its example code, the following modules are required:\n", "\n", "* pip install numpy\n", "* pip install pandas\n", "* pip install scikit-learn (imported in code as sklearn)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:22.664000", "start_time": "2016-09-30T12:34:18.165000" }, "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "pd.set_option('display.max_colwidth', 30)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:22.667000", "start_time": "2016-09-30T12:34:22.665000" }, "collapsed": true }, "outputs": [], "source": [ "DATA_PATH = \"G:/eu_sounds_challenge/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Metadata" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:23.140000", "start_time": "2016-09-30T12:34:22.669000" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " contributor country created \\\n", "id \n", "/09301/02DABC7F5850EABCA553... Schorr, A. [lyricist], Bur... france \n", "/09301/03DE04818BECB920A21E... Burstein, Peisach [singer] france \n", "/09301/0701AE796F7B504E67B8... Kremer, Isa [singer] france \n", "/09301/0A3E8D891310E10CEC0B... Potock, Bernard [conductor... france \n", "/09301/0BF6DFA79787E76872BA... Ellstein, Abraham [instrum... france \n", "\n", " creator dataProvider date \\\n", "id \n", "/09301/02DABC7F5850EABCA553... Bibliothèque Medem - Maiso... \n", "/09301/03DE04818BECB920A21E... Bibliothèque Medem - Maiso... \n", "/09301/0701AE796F7B504E67B8... Bibliothèque Medem - Maiso... \n", "/09301/0A3E8D891310E10CEC0B... Bibliothèque Medem - Maiso... \n", "/09301/0BF6DFA79787E76872BA... Bibliothèque Medem - Maiso... \n", "\n", " description format \\\n", "id \n", "/09301/02DABC7F5850EABCA553... Voice solo \n", "/09301/03DE04818BECB920A21E... Full orchestra ; Voice solo \n", "/09301/0701AE796F7B504E67B8... Full orchestra ; Voice sol... \n", "/09301/0A3E8D891310E10CEC0B... Full orchestra ; Voice sol... \n", "/09301/0BF6DFA79787E76872BA... Chorus ; Piano ; Voice solo \n", "\n", " identifier language medium \\\n", "id \n", "/09301/02DABC7F5850EABCA553... Publisher number : Columbi... yi \n", "/09301/03DE04818BECB920A21E... Publisher number : Columbi... yi \n", "/09301/0701AE796F7B504E67B8... Publisher number : Polydor... yi \n", "/09301/0A3E8D891310E10CEC0B... Publisher number : Elesdis... yi \n", "/09301/0BF6DFA79787E76872BA... Publisher number : Metro 1... yi \n", "\n", " provider publisher \\\n", "id \n", "/09301/02DABC7F5850EABCA553... Judaica Europeana Columbia \n", "/09301/03DE04818BECB920A21E... Judaica Europeana Columbia \n", "/09301/0701AE796F7B504E67B8... Judaica Europeana Polydor \n", "/09301/0A3E8D891310E10CEC0B... Judaica Europeana Elesdisc Leon Speiser \n", "/09301/0BF6DFA79787E76872BA... 
Judaica Europeana Metro \n", "\n", " relation \\\n", "id \n", "/09301/02DABC7F5850EABCA553... Ai dai dera dai tchastushkes \n", "/09301/03DE04818BECB920A21E... Die poilische chassene \n", "/09301/0701AE796F7B504E67B8... Oi Abram \n", "/09301/0A3E8D891310E10CEC0B... Ich will nicht sein kein r... \n", "/09301/0BF6DFA79787E76872BA... A nign'dl \n", "\n", " spatial subject \\\n", "id \n", "/09301/02DABC7F5850EABCA553... United States [publication... \n", "/09301/03DE04818BECB920A21E... New-York [recording], Unit... \n", "/09301/0701AE796F7B504E67B8... Paris [publication] \n", "/09301/0A3E8D891310E10CEC0B... Paris [publication] \n", "/09301/0BF6DFA79787E76872BA... \n", "\n", " title \\\n", "id \n", "/09301/02DABC7F5850EABCA553... Vos toig es aich vos darft... \n", "/09301/03DE04818BECB920A21E... Pitia Diritia \n", "/09301/0701AE796F7B504E67B8... Sitz ich mir in schtibele \n", "/09301/0A3E8D891310E10CEC0B... A briew fun a dienst \n", "/09301/0BF6DFA79787E76872BA... Taiere malke \n", "\n", " type year \n", "id \n", "/09301/02DABC7F5850EABCA553... sound recording, Chanson c... \n", "/09301/03DE04818BECB920A21E... sound recording, Chanson c... \n", "/09301/0701AE796F7B504E67B8... Chanson d'amour, soun... \n", "/09301/0A3E8D891310E10CEC0B... Chanson humouristique, sou... \n", "/09301/0BF6DFA79787E76872BA... sound recording, Chanson p... 
" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read csv-data (separated by semicolons)\n", "metadata = pd.read_csv(\"%s/metadata.csv\" % DATA_PATH, sep=\";\")\n", "\n", "# convert nan-values to empty strings\n", "metadata = metadata.fillna(\"\")\n", "\n", "# use the \"id\" field as index\n", "metadata = metadata.set_index(\"id\")\n", "\n", "metadata.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze Metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Search for typical terms in title\n", "\n", "Many terms that frequently appear in title fields can be associated with musical genres. As always when working with Europeana data, be aware that many descriptive fields are multilingual." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:23.315000", "start_time": "2016-09-30T12:34:23.142000" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of classical music tracks found: 107\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " contributor country created \\\n", "id \n", "/2059201/data_sounds_28043 Sakss, Pauls, 1878-1966 latvia \n", "/2059201/data_sounds_50752 Frinbergs, Artūrs, 1916-1984 latvia \n", "/2059201/data_sounds_50753 Frinbergs, Artūrs, 1916-1984 latvia \n", "/2059201/data_sounds_50815 Ašmane, Marta latvia \n", "/2059201/data_sounds_51040 Zommers, Roberts, 1899-1984 latvia \n", "\n", " creator \\\n", "id \n", "/2059201/data_sounds_28043 Offenbach, Jacques, 1819-1880 \n", "/2059201/data_sounds_50752 Puccini, Giacomo, 1858-1924 \n", "/2059201/data_sounds_50753 Чайковский, Петр Ильич, 18... \n", "/2059201/data_sounds_50815 Händel, Georg Friedrich, 1... \n", "/2059201/data_sounds_51040 Bizet, Georges, 1838-1875 \n", "\n", " dataProvider date description \\\n", "id \n", "/2059201/data_sounds_28043 National Library of Latvia [19--] \n", "/2059201/data_sounds_50752 National Library of Latvia [19--] \n", "/2059201/data_sounds_50753 National Library of Latvia [19--] \n", "/2059201/data_sounds_50815 National Library of Latvia [19--] \n", "/2059201/data_sounds_51040 National Library of Latvia 1934 \n", "\n", " format identifier \\\n", "id \n", "/2059201/data_sounds_28043 Shellac record, audio/mpeg... 3336, 10034167 \n", "/2059201/data_sounds_50752 Shellac record, audio/mpeg... B/20894 \n", "/2059201/data_sounds_50753 Shellac record, audio/mpeg... B/19958 \n", "/2059201/data_sounds_50815 Shellac record, audio/mpeg... 205, 5126950 \n", "/2059201/data_sounds_51040 Shellac record, audio/mpeg... 
10032551 \n", "\n", " language medium provider \\\n", "id \n", "/2059201/data_sounds_28043 audio disc Europeana Sounds \n", "/2059201/data_sounds_50752 audio disc Europeana Sounds \n", "/2059201/data_sounds_50753 audio disc Europeana Sounds \n", "/2059201/data_sounds_50815 audio disc Europeana Sounds \n", "/2059201/data_sounds_51040 audio disc Europeana Sounds \n", "\n", " publisher relation spatial \\\n", "id \n", "/2059201/data_sounds_28043 Latvija : Bellaccord Electro \n", "/2059201/data_sounds_50752 [Latvija] : Рижский завод \n", "/2059201/data_sounds_50753 [Latvija] : Рижский завод \n", "/2059201/data_sounds_50815 [B.v.] : Метрополь Рекордъ \n", "/2059201/data_sounds_51040 Rīga : Skaņu ateljē Da Capo \n", "\n", " subject title \\\n", "id \n", "/2059201/data_sounds_28043 Operas -- Excerpts Mīlas nakts : barkarolla n... \n", "/2059201/data_sounds_50752 Operas -- Excerpts Kavaradosi ārija no operas... \n", "/2059201/data_sounds_50753 Operas -- Excerpts Hermaņa ārija no operas &q... \n", "/2059201/data_sounds_50815 Operas -- Excerpts Ārija iz operas "Rina... \n", "/2059201/data_sounds_51040 Operas -- Excerpts Ziedu ārija no Ž. Bizē ope... \n", "\n", " type year \n", "id \n", "/2059201/data_sounds_28043 \n", "/2059201/data_sounds_50752 \n", "/2059201/data_sounds_50753 \n", "/2059201/data_sounds_50815 \n", "/2059201/data_sounds_51040 1934 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# define a list of typical terms\n", "classical_music_terms = [\"opera\", \"oper\", \"operette\", \"violin concert\", \"quartett\"]\n", "\n", "# search all title fields (lower-case) of the metadata\n", "\n", "# 1. use title fields to search\n", "search_index = metadata.title\n", "# 2. convert to lower case\n", "search_index = search_index.str.lower()\n", "# 3. 
get a boolean mask of the rows whose title contains any of the search terms\n", "search_index = search_index.str.contains('|'.join(classical_music_terms))\n", "\n", "# the same can be written as a single line\n", "search_result = metadata[metadata.title.str.lower().str.contains('|'.join(classical_music_terms))]\n", "\n", "# print results (parenthesised form works in Python 2 and 3)\n", "print(\"Number of classical music tracks found: %d\" % search_result.shape[0])\n", "\n", "search_result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Search for commonly known artists\n", "\n", "Similar to the title-based search, the ***creator*** and ***contributor*** fields can also be searched for commonly known names." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:23.413000", "start_time": "2016-09-30T12:34:23.316000" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of classical music tracks found: 162\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " contributor country created \\\n", "id \n", "/2059201/data_sounds_11079 Vētra, Mariss, 1901-1965, ... latvia \n", "/2059201/data_sounds_28043 Sakss, Pauls, 1878-1966 latvia \n", "/2059201/data_sounds_48216 Vējš, Teodors, 1902-1965, ... latvia \n", "/2059201/data_sounds_51040 Zommers, Roberts, 1899-1984 latvia \n", "/2059201/data_sounds_53431 Priednieks-Kavara, Artūrs,... latvia \n", "\n", " creator \\\n", "id \n", "/2059201/data_sounds_11079 Beethoven, Ludwig van, 177... \n", "/2059201/data_sounds_28043 Offenbach, Jacques, 1819-1880 \n", "/2059201/data_sounds_48216 Bizet, Georges, 1838-1875 \n", "/2059201/data_sounds_51040 Bizet, Georges, 1838-1875 \n", "/2059201/data_sounds_53431 Beethoven, Ludwig van, 177... \n", "\n", " dataProvider date description \\\n", "id \n", "/2059201/data_sounds_11079 National Library of Latvia [19--] \n", "/2059201/data_sounds_28043 National Library of Latvia [19--] \n", "/2059201/data_sounds_48216 National Library of Latvia [19--] \n", "/2059201/data_sounds_51040 National Library of Latvia 1934 \n", "/2059201/data_sounds_53431 National Library of Latvia [19--] \n", "\n", " format identifier \\\n", "id \n", "/2059201/data_sounds_11079 Shellac record, audio/mpeg... 3795, 32182557 \n", "/2059201/data_sounds_28043 Shellac record, audio/mpeg... 3336, 10034167 \n", "/2059201/data_sounds_48216 Shellac record, audio/mpeg... 3315, 10032551 \n", "/2059201/data_sounds_51040 Shellac record, audio/mpeg... 10032551 \n", "/2059201/data_sounds_53431 Shellac record, audio/mpeg... 
E.K.1083, 32182557 \n", "\n", " language medium provider \\\n", "id \n", "/2059201/data_sounds_11079 audio disc Europeana Sounds \n", "/2059201/data_sounds_28043 audio disc Europeana Sounds \n", "/2059201/data_sounds_48216 audio disc Europeana Sounds \n", "/2059201/data_sounds_51040 audio disc Europeana Sounds \n", "/2059201/data_sounds_53431 audio disc Europeana Sounds \n", "\n", " publisher relation spatial \\\n", "id \n", "/2059201/data_sounds_11079 [Latvija] : Bellaccord Ele... \n", "/2059201/data_sounds_28043 Latvija : Bellaccord Electro \n", "/2059201/data_sounds_48216 [Latvija] : Bellaccord Ele... \n", "/2059201/data_sounds_51040 Rīga : Skaņu ateljē Da Capo \n", "/2059201/data_sounds_53431 [England] : His Master&apo... \n", "\n", " subject \\\n", "id \n", "/2059201/data_sounds_11079 Songs (High voice) with or... \n", "/2059201/data_sounds_28043 Operas -- Excerpts \n", "/2059201/data_sounds_48216 Operas -- Excerpts \n", "/2059201/data_sounds_51040 Operas -- Excerpts \n", "/2059201/data_sounds_53431 Songs (High voice) with or... \n", "\n", " title type year \n", "id \n", "/2059201/data_sounds_11079 Tā debess izteic \n", "/2059201/data_sounds_28043 Mīlas nakts : barkarolla n... \n", "/2059201/data_sounds_48216 Nadira ārija no op. "... \n", "/2059201/data_sounds_51040 Ziedu ārija no Ž. Bizē ope... 1934 \n", "/2059201/data_sounds_53431 Tā debess izteic tā mūžīgā... 
" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# define a list of commonly known composer names\n", "classical_composer_names = [\"mozart\", \"beethoven\", \"offenbach\", \"bizet\"]\n", "\n", "# search all creator fields (lower-case) of the metadata\n", "search_result = metadata[metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))]\n", "\n", "# print results (parenthesised form works in Python 2 and 3)\n", "print(\"Number of classical music tracks found: %d\" % search_result.shape[0])\n", "\n", "search_result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combined search on various fields" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:23.579000", "start_time": "2016-09-30T12:34:23.414000" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of classical music tracks found: 257\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " contributor country created \\\n", "id \n", "/2059201/data_sounds_11079 Vētra, Mariss, 1901-1965, ... latvia \n", "/2059201/data_sounds_28043 Sakss, Pauls, 1878-1966 latvia \n", "/2059201/data_sounds_48216 Vējš, Teodors, 1902-1965, ... latvia \n", "/2059201/data_sounds_50752 Frinbergs, Artūrs, 1916-1984 latvia \n", "/2059201/data_sounds_50753 Frinbergs, Artūrs, 1916-1984 latvia \n", "\n", " creator \\\n", "id \n", "/2059201/data_sounds_11079 Beethoven, Ludwig van, 177... \n", "/2059201/data_sounds_28043 Offenbach, Jacques, 1819-1880 \n", "/2059201/data_sounds_48216 Bizet, Georges, 1838-1875 \n", "/2059201/data_sounds_50752 Puccini, Giacomo, 1858-1924 \n", "/2059201/data_sounds_50753 Чайковский, Петр Ильич, 18... \n", "\n", " dataProvider date description \\\n", "id \n", "/2059201/data_sounds_11079 National Library of Latvia [19--] \n", "/2059201/data_sounds_28043 National Library of Latvia [19--] \n", "/2059201/data_sounds_48216 National Library of Latvia [19--] \n", "/2059201/data_sounds_50752 National Library of Latvia [19--] \n", "/2059201/data_sounds_50753 National Library of Latvia [19--] \n", "\n", " format identifier \\\n", "id \n", "/2059201/data_sounds_11079 Shellac record, audio/mpeg... 3795, 32182557 \n", "/2059201/data_sounds_28043 Shellac record, audio/mpeg... 3336, 10034167 \n", "/2059201/data_sounds_48216 Shellac record, audio/mpeg... 3315, 10032551 \n", "/2059201/data_sounds_50752 Shellac record, audio/mpeg... B/20894 \n", "/2059201/data_sounds_50753 Shellac record, audio/mpeg... 
B/19958 \n", "\n", " language medium provider \\\n", "id \n", "/2059201/data_sounds_11079 audio disc Europeana Sounds \n", "/2059201/data_sounds_28043 audio disc Europeana Sounds \n", "/2059201/data_sounds_48216 audio disc Europeana Sounds \n", "/2059201/data_sounds_50752 audio disc Europeana Sounds \n", "/2059201/data_sounds_50753 audio disc Europeana Sounds \n", "\n", " publisher relation spatial \\\n", "id \n", "/2059201/data_sounds_11079 [Latvija] : Bellaccord Ele... \n", "/2059201/data_sounds_28043 Latvija : Bellaccord Electro \n", "/2059201/data_sounds_48216 [Latvija] : Bellaccord Ele... \n", "/2059201/data_sounds_50752 [Latvija] : Рижский завод \n", "/2059201/data_sounds_50753 [Latvija] : Рижский завод \n", "\n", " subject \\\n", "id \n", "/2059201/data_sounds_11079 Songs (High voice) with or... \n", "/2059201/data_sounds_28043 Operas -- Excerpts \n", "/2059201/data_sounds_48216 Operas -- Excerpts \n", "/2059201/data_sounds_50752 Operas -- Excerpts \n", "/2059201/data_sounds_50753 Operas -- Excerpts \n", "\n", " title type year \n", "id \n", "/2059201/data_sounds_11079 Tā debess izteic \n", "/2059201/data_sounds_28043 Mīlas nakts : barkarolla n... \n", "/2059201/data_sounds_48216 Nadira ārija no op. "... \n", "/2059201/data_sounds_50752 Kavaradosi ārija no operas... \n", "/2059201/data_sounds_50753 Hermaņa ārija no operas &q... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# search terms\n", "classical_music_terms = [\"opera\", \"oper\", \"operette\", \"violin concert\", \"quartett\"]\n", "classical_composer_names = [\"mozart\", \"beethoven\", \"offenbach\", \"bizet\"]\n", "\n", "# there are various ways to approach this task\n", "# here we demonstrate it in a more structured way\n", "\n", "# 1. all title terms\n", "title_boolean_map = metadata.title.str.lower().str.contains('|'.join(classical_music_terms))\n", "# 2. 
all creator terms\n", "composer_boolean_map = metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))\n", "# 3. logically combine results (element-wise OR: a match in either field counts)\n", "combined_boolean_map = title_boolean_map | composer_boolean_map\n", "\n", "# select the rows flagged by the combined map\n", "search_result = metadata[combined_boolean_map]\n", "\n", "# print results (parenthesised form works in Python 2 and 3)\n", "print(\"Number of classical music tracks found: %d\" % search_result.shape[0])\n", "\n", "search_result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add a \"genre\" column to the metadata to store results" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:23.700000", "start_time": "2016-09-30T12:34:23.581000" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of classical tracks: 162\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " contributor country created \\\n", "id \n", "/2059201/data_sounds_11079 Vētra, Mariss, 1901-1965, ... latvia \n", "/2059201/data_sounds_28043 Sakss, Pauls, 1878-1966 latvia \n", "/2059201/data_sounds_48216 Vējš, Teodors, 1902-1965, ... latvia \n", "/2059201/data_sounds_51040 Zommers, Roberts, 1899-1984 latvia \n", "/2059201/data_sounds_53431 Priednieks-Kavara, Artūrs,... latvia \n", "\n", " creator \\\n", "id \n", "/2059201/data_sounds_11079 Beethoven, Ludwig van, 177... \n", "/2059201/data_sounds_28043 Offenbach, Jacques, 1819-1880 \n", "/2059201/data_sounds_48216 Bizet, Georges, 1838-1875 \n", "/2059201/data_sounds_51040 Bizet, Georges, 1838-1875 \n", "/2059201/data_sounds_53431 Beethoven, Ludwig van, 177... \n", "\n", " dataProvider date description \\\n", "id \n", "/2059201/data_sounds_11079 National Library of Latvia [19--] \n", "/2059201/data_sounds_28043 National Library of Latvia [19--] \n", "/2059201/data_sounds_48216 National Library of Latvia [19--] \n", "/2059201/data_sounds_51040 National Library of Latvia 1934 \n", "/2059201/data_sounds_53431 National Library of Latvia [19--] \n", "\n", " format identifier \\\n", "id \n", "/2059201/data_sounds_11079 Shellac record, audio/mpeg... 3795, 32182557 \n", "/2059201/data_sounds_28043 Shellac record, audio/mpeg... 3336, 10034167 \n", "/2059201/data_sounds_48216 Shellac record, audio/mpeg... 3315, 10032551 \n", "/2059201/data_sounds_51040 Shellac record, audio/mpeg... 10032551 \n", "/2059201/data_sounds_53431 Shellac record, audio/mpeg... 
E.K.1083, 32182557 \n", "\n", " language medium provider \\\n", "id \n", "/2059201/data_sounds_11079 audio disc Europeana Sounds \n", "/2059201/data_sounds_28043 audio disc Europeana Sounds \n", "/2059201/data_sounds_48216 audio disc Europeana Sounds \n", "/2059201/data_sounds_51040 audio disc Europeana Sounds \n", "/2059201/data_sounds_53431 audio disc Europeana Sounds \n", "\n", " publisher relation spatial \\\n", "id \n", "/2059201/data_sounds_11079 [Latvija] : Bellaccord Ele... \n", "/2059201/data_sounds_28043 Latvija : Bellaccord Electro \n", "/2059201/data_sounds_48216 [Latvija] : Bellaccord Ele... \n", "/2059201/data_sounds_51040 Rīga : Skaņu ateljē Da Capo \n", "/2059201/data_sounds_53431 [England] : His Master&apo... \n", "\n", " subject \\\n", "id \n", "/2059201/data_sounds_11079 Songs (High voice) with or... \n", "/2059201/data_sounds_28043 Operas -- Excerpts \n", "/2059201/data_sounds_48216 Operas -- Excerpts \n", "/2059201/data_sounds_51040 Operas -- Excerpts \n", "/2059201/data_sounds_53431 Songs (High voice) with or... \n", "\n", " title type year \\\n", "id \n", "/2059201/data_sounds_11079 Tā debess izteic \n", "/2059201/data_sounds_28043 Mīlas nakts : barkarolla n... \n", "/2059201/data_sounds_48216 Nadira ārija no op. "... \n", "/2059201/data_sounds_51040 Ziedu ārija no Ž. Bizē ope... 1934 \n", "/2059201/data_sounds_53431 Tā debess izteic tā mūžīgā... 
\n", "\n", " genre \n", "id \n", "/2059201/data_sounds_11079 classical \n", "/2059201/data_sounds_28043 classical \n", "/2059201/data_sounds_48216 classical \n", "/2059201/data_sounds_51040 classical \n", "/2059201/data_sounds_53431 classical " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# add the column\n", "metadata[\"genre\"] = \"\"\n", "\n", "# set calculated \"genre\" labels accordingly\n", "\n", "# define a list of composer names that are typical for classical music\n", "classical_composer_names = [\"mozart\", \"beethoven\", \"offenbach\", \"bizet\"]\n", "\n", "# the same boolean map which was previously used to display the result\n", "# can be used to set the labels for the calculated results\n", "search_result_index = metadata.creator.str.lower().str.contains('|'.join(classical_composer_names))\n", "\n", "# set the labels accordingly (using .loc avoids chained assignment on a copy)\n", "metadata.loc[search_result_index, \"genre\"] = \"classical\"\n", "\n", "# show results\n", "print \"number of classical tracks: %d\" % metadata[metadata.genre == \"classical\"].shape[0]\n", "\n", "metadata[metadata.genre == \"classical\"].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Audio Feature Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and pre-process audio features" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:24.798000", "start_time": "2016-09-30T12:34:23.702000" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 1 2 3 4 \\\n", "id \n", "/2023601/oai_eu_dismarc_CHA... 0.045898 -0.006166 0.034595 -0.012495 \n", "/2023601/oai_eu_dismarc_CHA... 0.061028 0.012468 0.046163 -0.012373 \n", "/2023601/oai_eu_dismarc_CHA... 0.000829 0.000408 0.000185 0.000143 \n", "/2023601/oai_eu_dismarc_CHA... 0.009110 -0.000004 -0.000575 0.000027 \n", "/2023601/oai_eu_dismarc_CHA... 0.008827 -0.000392 -0.033676 -1.289619 \n", "\n", " 5 6 7 8 \\\n", "id \n", "/2023601/oai_eu_dismarc_CHA... -0.314453 -0.469206 -0.724443 -1.274571 \n", "/2023601/oai_eu_dismarc_CHA... -0.331475 -0.418479 -0.776257 -1.194863 \n", "/2023601/oai_eu_dismarc_CHA... -0.822003 -6.464063 -9.709557 -11.803703 \n", "/2023601/oai_eu_dismarc_CHA... -0.584520 -0.249008 -4.699470 -9.617082 \n", "/2023601/oai_eu_dismarc_CHA... -2.237123 -1.635143 -0.407569 -4.454713 \n", "\n", " 9 10 ... 56 \\\n", "id ... \n", "/2023601/oai_eu_dismarc_CHA... -1.821123 -2.888282 ... 8.849905 \n", "/2023601/oai_eu_dismarc_CHA... -2.746311 -4.085886 ... 5.668731 \n", "/2023601/oai_eu_dismarc_CHA... -16.232235 -17.719405 ... 9.148643 \n", "/2023601/oai_eu_dismarc_CHA... -13.266026 -14.578073 ... 6.537391 \n", "/2023601/oai_eu_dismarc_CHA... -9.577355 -12.580112 ... 3.439976 \n", "\n", " 57 58 59 60 \\\n", "id \n", "/2023601/oai_eu_dismarc_CHA... 7.798388 7.017814 6.247522 5.712278 \n", "/2023601/oai_eu_dismarc_CHA... 4.740953 3.946593 3.308268 2.723058 \n", "/2023601/oai_eu_dismarc_CHA... 8.243159 7.083017 5.999343 5.124123 \n", "/2023601/oai_eu_dismarc_CHA... 5.389040 4.170882 3.118195 2.448338 \n", "/2023601/oai_eu_dismarc_CHA... 3.326632 3.043295 2.753692 2.407103 \n", "\n", " 61 62 63 64 \\\n", "id \n", "/2023601/oai_eu_dismarc_CHA... 5.170348 4.825831 4.455029 4.153709 \n", "/2023601/oai_eu_dismarc_CHA... 2.213726 1.778679 1.388869 1.086135 \n", "/2023601/oai_eu_dismarc_CHA... 4.545999 4.188426 3.986916 3.901077 \n", "/2023601/oai_eu_dismarc_CHA... 2.084395 1.958324 2.004001 2.230389 \n", "/2023601/oai_eu_dismarc_CHA... 
2.163529 1.988904 1.868926 1.786403 \n", "\n", " 65 \n", "id \n", "/2023601/oai_eu_dismarc_CHA... 3.743235 \n", "/2023601/oai_eu_dismarc_CHA... 0.831883 \n", "/2023601/oai_eu_dismarc_CHA... 3.899819 \n", "/2023601/oai_eu_dismarc_CHA... 2.541304 \n", "/2023601/oai_eu_dismarc_CHA... 1.736148 \n", "\n", "[5 rows x 65 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load the audio features; pandas reads gzip-compressed csv-files directly, so manual unzipping is not required\n", "features_mfcc = pd.read_csv(\"%s/mfcc.csv.gz\" % DATA_PATH, sep=\";\", header=None)\n", "\n", "# use the first column (the track id) as index\n", "features_mfcc = features_mfcc.set_index(0)\n", "features_mfcc.index.name = \"id\"\n", "\n", "features_mfcc.head()" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2016-09-29T17:00:57.606000", "start_time": "2016-09-29T17:00:57.556000" } }, "source": [ "### Pre-process features" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:28.127000", "start_time": "2016-09-30T12:34:24.800000" }, "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "# create and fit the scaler (standardizes each feature to zero mean and unit variance)\n", "scaler = StandardScaler()\n", "scaler.fit(features_mfcc)\n", "\n", "# sklearn returns a numpy array, thus we have to re-create the pandas DataFrame and its index\n", "features_mfcc = pd.DataFrame(scaler.transform(features_mfcc), index=features_mfcc.index)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2016-09-29T17:22:25.362000", "start_time": "2016-09-29T17:22:25.335000" } }, "source": [ "### Get the subset of features for which we already know genre labels\n", "\n", "The labels are derived with the previously introduced text processing approach." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**First: create the subset using text processing**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:28.134000", "start_time": "2016-09-30T12:34:28.129000" }, "collapsed": false }, "outputs": [], "source": [ "def search_for_tracks(query_terms, columns):\n", "    \n", "    # build one regular expression that matches any of the query terms\n", "    pattern = '|'.join(query_terms)\n", "    \n", "    # one boolean mask per searched metadata column\n", "    boolean_masks = []\n", "    \n", "    for column in columns:\n", "        boolean_masks.append( metadata[column].str.lower().str.contains(pattern) )\n", "    \n", "    # keep every track that matches in at least one column\n", "    return metadata[np.any(boolean_masks, axis=0)]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:28.826000", "start_time": "2016-09-30T12:34:28.136000" }, "collapsed": false }, "outputs": [], "source": [ "# define the classes for the classifier\n", "classes = {}\n", "\n", "classes[\"polka\"] = search_for_tracks([\"polka\"], [\"title\", \"description\", \"subject\"])\n", "classes[\"opera\"] = search_for_tracks([\"oper\"], [\"title\", \"description\", \"subject\"])\n", "classes[\"spoken_word\"] = search_for_tracks([\"interview\"], [\"title\", \"description\", \"subject\"])\n", "classes[\"choral\"] = search_for_tracks([\"choral\"], [\"title\", \"description\", \"subject\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Second: use the indexes of the search result to get the corresponding feature vectors**" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:28.859000", "start_time": "2016-09-30T12:34:28.828000" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "302 feature vectors for the 4 classes available\n" ] } ], "source": [ "features_for_classification = []\n", "labels_for_classification = []\n", "\n", "for 
class_name in classes.keys():\n", "    \n", "    # metadata and features share the same index, thus we can use the search result to index the feature-set.\n", "    # Because there are fewer feature vectors than metadata records, we need to use the intersection of the\n", "    # search result with the feature set. Otherwise, we would receive a lot of nan-valued rows.\n", "    features = features_mfcc.loc[classes[class_name].index.intersection(features_mfcc.index)]\n", "    \n", "    features_for_classification.append(features)\n", "    labels_for_classification.extend( [class_name for _ in range(features.shape[0])] )\n", "\n", "# combine features into a feature-space\n", "features_for_classification = pd.concat(features_for_classification, axis=0)\n", "labels_for_classification = np.asarray(labels_for_classification)\n", "\n", "print \"%d feature vectors for the %d classes available\" % (features_for_classification.shape[0], len(classes.keys()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Third: train a classifier**" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:28.865000", "start_time": "2016-09-30T12:34:28.861000" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LabelEncoder()" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# first a little pre-processing is required to convert the textual labels into an integer representation\n", "from sklearn.preprocessing import LabelEncoder\n", "\n", "encoder = LabelEncoder()\n", "encoder.fit(labels_for_classification)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:28.904000", "start_time": "2016-09-30T12:34:28.867000" }, "collapsed": true }, "outputs": [], "source": [ "# to test whether the classifier works accurately, we split the data into a training-set and a test-set\n", "from sklearn.model_selection import StratifiedShuffleSplit\n", "\n", "# create the 
splitter\n", "sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8)\n", "\n", "# split the data\n", "splits = sss.split(features_for_classification, labels_for_classification)\n", "\n", "# this is a trick to get the splits without using a for-loop\n", "(train_index, test_index) = list(splits)[0]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:29.083000", "start_time": "2016-09-30T12:34:28.905000" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " expected predicted\n", "id \n", "/2059201/data_sounds_54536 polka opera\n", "/2059201/data_sounds_59504 opera opera\n", "/2059201/data_sounds_59581 polka opera\n", "/2051921/data_euscreenXL_5350 spoken_word spoken_word\n", "/2051921/data_euscreenXL_5995 spoken_word polka\n", "/2059201/data_sounds_55008 choral polka\n", "/2059201/data_sounds_55067 polka polka\n", "/2059201/data_sounds_49583 opera opera\n", "/2059201/data_sounds_55054 polka polka\n", "/2059201/data_sounds_64978 opera opera\n", "/2059201/data_sounds_55278 opera opera\n", "/2059201/data_sounds_60753 opera polka\n", "/2051921/data_euscreenXL_5403 spoken_word spoken_word\n", "/2059201/data_sounds_61595 opera opera\n", "/2059201/data_sounds_28043 opera opera\n", "/2059201/data_sounds_62830 opera opera\n", "/2059201/data_sounds_29 polka polka\n", "/2059201/data_sounds_53177 polka polka\n", "/2059201/data_sounds_63748 polka opera\n", "/2059201/data_sounds_63619 polka polka\n", "/2051921/data_euscreenXL_5393 spoken_word spoken_word\n", "/2059201/data_sounds_56117 polka polka\n", "/2059207/data_sounds_T202_3 opera opera\n", "/2051921/data_euscreenXL_5013 spoken_word spoken_word\n", "/2059201/data_sounds_51040 opera opera\n", "/2059201/data_sounds_62822 polka polka\n", "/2059201/data_sounds_57876 opera opera\n", "/2051921/data_euscreenXL_5181 spoken_word spoken_word\n", "/2059201/data_sounds_65006 polka polka\n", "/2059201/data_sounds_61899 polka polka\n", "... ... 
...\n", "/2059213/data_sounds_8761 polka polka\n", "/2059201/data_sounds_64034 polka polka\n", "/2059201/data_sounds_53337 polka polka\n", "/2059201/data_sounds_63597 polka opera\n", "/2059201/data_sounds_58784 opera opera\n", "/2059201/data_sounds_54885 opera opera\n", "/2059201/data_sounds_49584 opera opera\n", "/2059201/data_sounds_60472 opera opera\n", "/2059201/data_sounds_70 polka polka\n", "/2059201/data_sounds_59787 opera opera\n", "/2059201/data_sounds_64374 polka polka\n", "/2059201/data_sounds_61135 polka opera\n", "/2059201/data_sounds_63019 opera spoken_word\n", "/2059201/data_sounds_56106 opera opera\n", "/2059201/data_sounds_60096 opera opera\n", "/2051921/data_euscreenXL_6220 spoken_word spoken_word\n", "/2051921/data_euscreenXL_6108 spoken_word spoken_word\n", "/2059201/data_sounds_53411 opera opera\n", "/2059201/data_sounds_53391 opera polka\n", "/2059201/data_sounds_50815 opera polka\n", "/2059201/data_sounds_59782 opera opera\n", "/2059201/data_sounds_63193 polka polka\n", "/2051921/data_euscreenXL_5402 spoken_word spoken_word\n", "/2059207/data_sounds_T284_5 polka polka\n", "/2059201/data_sounds_54887 opera opera\n", "/2059201/data_sounds_65086 polka polka\n", "/2059201/data_sounds_48220 polka opera\n", "/2059201/data_sounds_60520 polka polka\n", "/2059201/data_sounds_49594 opera opera\n", "/2059201/data_sounds_54881 opera opera\n", "\n", "[61 rows x 2 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create the classifier - for this example we use a linear Support Vector Machine\n", "from sklearn.svm import LinearSVC\n", "\n", "classifier = LinearSVC()\n", "\n", "# train the classifier; the split returns integer positions, so rows are selected with .iloc\n", "classifier.fit(features_for_classification.iloc[train_index], encoder.transform(labels_for_classification[train_index]))\n", "\n", "# test the classifier\n", "predictions = classifier.predict(features_for_classification.iloc[test_index])\n", "\n", "pd.DataFrame({\"expected\": 
labels_for_classification[test_index], \n", "              \"predicted\": encoder.inverse_transform(predictions)},\n", "             index=features_for_classification.iloc[test_index].index)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2016-09-30T12:34:29.089000", "start_time": "2016-09-30T12:34:29.085000" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Classifier accuracy: 78.69%\n" ] } ], "source": [ "# estimate the accuracy of the classifier on the held-out test-set\n", "from sklearn.metrics import accuracy_score\n", "\n", "# accuracy_score expects the true labels first, followed by the predictions\n", "print \"Classifier accuracy: %.2f%%\" % (accuracy_score(encoder.transform(labels_for_classification[test_index]), predictions) * 100.0)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" }, "toc": { "toc_cell": false, "toc_number_sections": true, "toc_threshold": 6, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }