Machine Learning on Sound. ... how hard can it be? Audio Information Seminar Thursday, June 8, 2006 Kaare Brandt Petersen. Agenda. Motivation The reason it might be hard: - From data and information - Features The good news: - Computer power and machine learning - Examples Conclusions.
Machine Learning on Sound ... how hard can it be?
Audio Information Seminar
Thursday, June 8, 2006
Kaare Brandt Petersen

Agenda
• Motivation
• The reason it might be hard:
- From data to information
- Features
• The good news:
- Computer power and machine learning
- Examples
• Conclusions

Motivation
• What can we do with audio information?
• News archive: Find the grumpy voice in a TV broadcast from a busy street in the Middle East. Search in news archives
• Music: 6 billion friends. Navigating the world landscape of music

Data
• Sound as perceived by humans and by computers

By computers (sample values):
-0.00076293945313 0.00231933593750 -0.00714111328125 0.00772094726563
0.00076293945313 -0.00772094726563 -0.00900268554688 -0.00527954101563
-0.00076293945313 -0.00231933593750 -0.00714111328125 0.00024414062500
0.01312255859375 0.00650024414063 -0.01052856445313 -0.01089477539063
-0.00305175781250 -0.01052856445313 -0.01089477539063 -0.00305175781250

By humans (12 Monkeys, movie from 1995; dialogue and sound events):
[ Beeps ]
[ Male voice - indoor ]
- "There's the television"
- "It's all right there"
[ Steps ]
- "All right there!"
[ Music - violins ]
- "Look. Listen. Kneel. Pray"
- "Commercials!"

Data
• Is the data-to-information translation really necessary?
Ways to query an archive:
1) Query by signal processing [ humans learn how computers think ], e.g. ZCR < 198
2) Query by information [ computers learn how humans think ], e.g. "happy jazz"
3) Query by example [ various approaches ]

Bridging the gap: From data to information
• Going from 5 million real numbers to "Opera"
• Constructing sound features the right way
[Diagram labels: Data, Meaning, Context, Information]

Features
• Many short-time features
- Zero crossing rate
- Spectral flatness
- Spectral bandwidth
- Spectral centroids
- Spectral rolloff
- Spectral flux
- Energy
- ...
- Mel Frequency Cepstral Coefficients (MFCC) [Foote97, Rabiner93]
- Real Cepstral Coefficients (RCC)
- Linear Prediction Coefficients (LPC)
- Wavelets
- Gamma-tone filterbanks
- Sone / Bark
- Chroma features
- ...
[Figure: 12 Monkeys sound clip; panels: Waveform, Spec, ZCR, Sp-Flatness, Sp-Bandwidth, Sp-Centroid, Chroma, MFCC 1, MFCC 2-7]

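Two of the short-time features listed above, zero crossing rate and spectral centroid, can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the slides: the frame length, hop size, Hann window, and the synthetic two-tone clip are all assumptions, and ZCR is expressed here as a fraction of sample pairs rather than as a raw count.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def spectral_centroid(frames, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of each frame."""
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    return (mag @ freqs) / (mag.sum(axis=1) + 1e-12)

# Synthetic stand-in for a sound clip (assumption): 1 s at 440 Hz, then 1 s at 2000 Hz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = np.concatenate([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 2000 * t)])

frames = frame_signal(clip)
zcr = zero_crossing_rate(frames)                  # one value per frame
centroid = spectral_centroid(frames, sample_rate) # one value per frame, in Hz
```

Both features rise when the clip switches to the higher tone, so each frame is reduced from 1024 samples to a couple of numbers that still track the change in the sound.
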
Features
• Aggregating short-time features
- Audio clip = data cloud
• Distribution of values
- Basic statistics [Wold96]
- Histograms and vector quantization [Foote97]
- Gaussian Mixture Models [Auc02]
- K-means clustering [Logan01]
- Anchors by Neural Networks [Beren03]
• Temporal modelling
- SVD of e.g. spectrogram [Gu04]
- AR-coefficients [Meng05]

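The simplest aggregation above, basic statistics, treats a clip as a cloud of frame-level feature vectors and summarizes it by per-feature mean and variance. The sketch below assumes the features arrive as an array of shape (n_frames, n_features); the random "clips" are placeholders, not real audio features.

```python
import numpy as np

def basic_statistics(frame_features):
    """Summarize a (n_frames, n_features) data cloud by per-feature mean and variance."""
    frame_features = np.asarray(frame_features, dtype=float)
    return np.concatenate([frame_features.mean(axis=0), frame_features.var(axis=0)])

# Placeholder data clouds (assumption): two clips with different lengths and statistics.
rng = np.random.default_rng(0)
clip_a = rng.normal(loc=0.0, scale=1.0, size=(500, 6))
clip_b = rng.normal(loc=3.0, scale=0.5, size=(200, 6))

vec_a = basic_statistics(clip_a)  # fixed-length vector: 6 means followed by 6 variances
vec_b = basic_statistics(clip_b)
```

Clips of different durations map to vectors of the same length, which is what makes them comparable by a classifier or a distance measure. The price is that all temporal ordering of the frames is discarded.
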
Features
• What we are trying to do: From data to information
Data -> Low-level features -> High-level features -> Information
- Data:
-0.00076293945313 0.00231933593750 -0.00714111328125 0.00772094726563
0.00076293945313 -0.00772094726563 -0.00900268554688 -0.00527954101563
-0.00076293945313 -0.00231933593750 -0.00714111328125 0.00024414062500
0.01312255859375 0.00650024414063 -0.01052856445313 -0.01089477539063
- Low-level features: ZCR, Spectral, MFCC, Chroma, Sone/Bark, RCC, LPC, ...
- High-level features: Basic stats, GMM, Kmeans, Anchors, AR coeff, SVD, HMM, ...
- Information: "Rough", "Deep", "Sparky", "Broad", "Melancholic", "Majestic", "Jazz", "Rock", ...

Features
• Music similarity example
- "Shape of my heart", Backstreet Boys, 2000
- "That's the way it is", Celine Dion, 2000
- "Cantaloop", Us3, 1993
"The limitations observed in this paper (...) suggests that the usual route to timbre similarity may not be the optimal one" [Auc04]

The bad news
• Sound data is far from the information
• Not all features are useful
• It is not obvious what the information labels should be

The good news
• Computer power
• Signal processing
- Strong development in signal processing and machine learning in general
- Large amounts of data
- Increased interest in sound and music processing

Example: Genre estimation
• Genre estimation by temporal integration
Peter Ahrendt, Anders Meng [Meng05]
• Processing: Sound -> MFCC -> AR

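The last step of the chain, MFCC -> AR, replaces a sequence of frame-level features with the coefficients of an autoregressive model fitted to that sequence, so the clip-level vector keeps some temporal structure. The sketch below is a simplified illustration and not the method of [Meng05]: it fits an independent least-squares AR model to each feature dimension, the model order is an arbitrary choice, and the "MFCC" input is synthetic.

```python
import numpy as np

def ar_coefficients(series, order=3):
    """Least-squares fit of x[t] ~ sum_k a[k] * x[t-k-1] for one mean-removed series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    # Each row holds the `order` values preceding the target sample, most recent first.
    lagged = np.column_stack([x[order - k - 1 : len(x) - k - 1] for k in range(order)])
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return coeffs

def ar_features(frame_features, order=3):
    """Clip-level vector: per-dimension means followed by per-dimension AR coefficients."""
    feats = np.asarray(frame_features, dtype=float)
    per_dim = [ar_coefficients(feats[:, d], order) for d in range(feats.shape[1])]
    return np.concatenate([feats.mean(axis=0), np.concatenate(per_dim)])

# Synthetic stand-in for an MFCC sequence (assumption): each dimension is AR(1) with a = 0.8.
rng = np.random.default_rng(0)
n_frames, n_dims = 2000, 4
fake_mfcc = np.zeros((n_frames, n_dims))
noise = rng.normal(size=(n_frames, n_dims))
for t in range(1, n_frames):
    fake_mfcc[t] = 0.8 * fake_mfcc[t - 1] + noise[t]

clip_vector = ar_features(fake_mfcc, order=2)              # 4 means + 4 * 2 AR coefficients
first_order = ar_coefficients(fake_mfcc[:, 0], order=1)    # should recover roughly 0.8
```

Unlike the mean and variance alone, the AR coefficients change if the frames are shuffled, which is the point of temporal integration.
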
Example: Genre estimation
• Genre estimation by temporal integration + kernel methods
Jeronimo Arenas-Garcia, Tue Lehn-Schiøler, Kaare Brandt Petersen [ArGa06]
• Processing: Sound -> MFCC -> AR -> KOPLS
Btw: A data harvesting tool coming up - ISMIR 2006

Example: Source separation
• Spectrogram modelling with sparse NTF2D
Morten Mørup, Mikkel Schmidt [Mørup06]
- W = time-frequency patterns
- H = time, amplitude, pitch
[Figure: Original (mixed); Separated sources: (Harp), (Flute)]

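The idea of factorizing a spectrogram into patterns W and activations H can be illustrated with plain non-negative matrix factorization. This is a deliberately simpler, swapped-in technique, not the sparse NTF2D of [Mørup06]: there is no sparsity penalty, no convolution over time or pitch, and no multi-channel tensor. The multiplicative updates are the standard ones for the Euclidean cost, and the toy "spectrogram" is synthetic.

```python
import numpy as np

def nmf(V, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V ~ W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_time)) + eps
    errors = []
    for _ in range(n_iter):
        # Updates keep W and H non-negative and reduce the Euclidean reconstruction error.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Toy magnitude spectrogram (assumption): two sources with disjoint frequency patterns.
rng = np.random.default_rng(1)
true_W = np.zeros((20, 2))
true_W[2:5, 0] = 1.0     # spectral pattern of the first source
true_W[10:14, 1] = 1.0   # spectral pattern of the second source
true_H = rng.random((2, 50))
V = true_W @ true_H

W, H, errors = nmf(V, n_components=2)
source_0 = np.outer(W[:, 0], H[0])  # estimated spectrogram of one component
source_1 = np.outer(W[:, 1], H[1])  # estimated spectrogram of the other component
```

Each component's outer product is an estimate of one source's spectrogram, and the two estimates add up to the model of the mixture.
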
Example: CNN
• Translating a CNN news broadcast
Kasper Jørgensen, Lasse Mølgaard, Lars Kai Hansen [Jorg06]
• Music or Speech? Sound -> MFCC, STE, SpF, ZCR -> mean/var
• Speaker change detection: Sound -> MFCC -> VQ
• Speech recognition: Sphinx 4 (Carnegie Mellon)

Conclusions
It is hard:
• Sound data is far from the information
• Good features are hard to find
but machine learning is catching up:
• Examples: Genre, Source separation, CNN-translation

References
[Wold96] Wold, E.; Blum, T.; Keislar, D. & Wheaton, J., "Content-based Classification, Search, and Retrieval of Audio", IEEE Multimedia, 1996, 3, 27-36
[Foote97] Foote, J., "Content-based retrieval of music and audio", Multimedia Storage and Archiving Systems II, Proc. of SPIE, 1997, 3229, 138-147
[Logan01] Logan and Salomon, "A music similarity function based on signal analysis", ICME 2001
[Beren03] Berenzweig, Ellis and Lawrence, "Anchorspace for classification and similarity measurement of music", ICME 2003
[Rabiner93] Rabiner, L. & Juang, B.H., "Fundamentals of Speech Recognition", Prentice-Hall, 1993
[Gu04] Gu, Lu, Cai and Zhang, "Dominant Feature vector based audio similarity measure", Proceedings of the Pacific Rim Conference on Multimedia, PCM, 2004
[Tza02] Tzanetakis and Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 2002, 10, 293-302
[Auc02] Aucouturier and Pachet, "Music Similarity Measures: What's the use?", ISMIR 2002
[Meng05] Anders Meng, Peter Ahrendt and Jan Larsen, "Improving Music Genre Classification by Short-Time Feature Integration", ICASSP, 2005
[Auc04] Aucouturier and Pachet, "Improving Timbre Similarity: How high is the sky?", JNRSAS, 2004
[Mørup06] "Sparse Non-negative Tensor Factor Double Deconvolution (SNTF2D) for multi channel time-frequency analysis", submitted to JMLR 2006
[ArGa06] "Reduced Kernel Orthonormal Partial Least Squares", submitted for NIPS 2006
[Jorg06] Kasper Jørgensen, Lasse Mølgaard, Lars Kai Hansen, "Unsupervised speaker change detection for broadcast news segmentation", EUSIPCO 2006