
Modeling Music with Words: a multi-class naïve Bayes approach


Presentation Transcript


  1. Modeling Music with Words: a multi-class naïve Bayes approach. Douglas Turnbull, Luke Barrington, Gert Lanckriet. Computer Audition Laboratory, UC San Diego. ISMIR 2006, October 11, 2006. Image from vintageguitars.org.uk

  2. People use words to describe music
  • How would one describe “I’m a Believer” by The Monkees?
  • We might use words related to:
  • Genre: ‘Pop’, ‘Rock’, ‘60’s’
  • Instrumentation: ‘tambourine’, ‘male vocals’, ‘electric piano’
  • Adjectives: ‘catchy’, ‘happy’, ‘energetic’
  • Usage: ‘getting ready to go out’
  • Related Sounds: ‘The Beatles’, ‘The Turtles’, ‘Lovin’ Spoonful’
  • We learn to associate certain words with the music we hear.
  Image: www.twang-tone.de/45kicks.html

  3. Modeling music and words
  • Our goal is to design a statistical system that learns a relationship between music and words.
  • Given such a system, we can perform:
  • Annotation: given the audio content of a song, ‘annotate’ the song with semantically meaningful words. (song → words)
  • Retrieval: given a text-based query, ‘retrieve’ relevant songs based on their audio content. (words → songs)
  Image from: http://www.lacoctelera.com/

  4. Modeling images and words
  [Figure: image annotation and retrieval examples from [CV05]; retrieval query string ‘jet’]
  • Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF+02, …].
  • This application has benefited from and inspired recent developments in machine learning.
  • How can MIR benefit from and inspire new developments in machine learning?
  *Images from [CV05], www.oldies.com

  5. Related work
  • Modeling music and words is at the heart of MIR research:
  • jointly modeling semantic labels and audio content
  • genre, emotion, style, and usage classification
  • music similarity analysis
  • Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05].
  • Others have looked at jointly modeling words and sound effects.
  • Most focus on non-parametric models (kNN) [SAR-Sla02, AudioClas-CK04].
  Images from www.sixtiescity.com

  6. Representing music and words
  • Consider a vocabulary and a heterogeneous data set of song-caption pairs:
  • Vocabulary: a predefined set of words
  • Song: a set of audio feature vectors (X = {x1, …, xT})
  • Caption: a binary document vector (y)
  • Example: “I’m a Believer” by The Monkees is a happy pop song that features tambourine.
  • Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}:
  • X = set of MFCC vectors extracted from the audio track
  • y = [1, 0, 1, 0, 1, 0] (a sketch of this encoding follows below)
  Image from www.bluesforpeace.com
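The binary caption encoding is simple enough to state in a few lines. A minimal sketch in Python, using the slide's toy vocabulary; the helper name caption_to_vector is illustrative, not from the original system:

```python
# Toy vocabulary and caption from the slide's example.
vocabulary = ["pop", "jazz", "tambourine", "saxophone", "happy", "sad"]

def caption_to_vector(caption_words, vocabulary):
    """Binary document vector y: y[i] = 1 iff word i appears in the caption."""
    caption = set(caption_words)
    return [1 if word in caption else 0 for word in vocabulary]

y = caption_to_vector({"happy", "pop", "tambourine"}, vocabulary)
print(y)  # [1, 0, 1, 0, 1, 0], matching the slide's example
```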

  7. Overview of our system: Representation
  [System diagram, Data and Features stages: Training Data and Vocabulary feed Caption Document Vectors (y) and Audio-Feature Extraction (X)]

  8. Probabilistic model for music and words
  • Consider a vocabulary and a set of song-caption pairs:
  • Vocabulary: a predefined set of words
  • Song: a set of audio feature vectors (X = {x1, …, xT})
  • Caption: a binary document vector (y)
  • For the i-th word in our vocabulary, we estimate a ‘word’ distribution P(x|i):
  • a probability distribution over the audio feature vector space
  • modeled with a Gaussian Mixture Model (GMM)
  • GMM estimated using Expectation Maximization (EM)
  • Key idea: the training data for each ‘word’ distribution is the set of all feature vectors from all songs that are labeled with that word.
  • Multiple Instance Learning: includes some irrelevant feature vectors
  • Weakly Labeled Data: excludes some relevant feature vectors
  • Our probabilistic model is a set of ‘word’ distributions (GMMs). (A training sketch follows below.)
  Image from www.freewebs.com
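As a concrete illustration of the key idea, here is a minimal training sketch using scikit-learn's EM-based GaussianMixture as a stand-in for the authors' GMM estimation; the variable names (songs, Y) and the number of mixture components are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_gmms(songs, Y, n_components=8):
    """Fit one GMM P(x|i) per vocabulary word.

    songs: list of (T_s, D) arrays of per-song feature vectors.
    Y:     (n_songs, n_words) binary caption matrix.

    Training data for word i is the pooled set of feature vectors from
    every song whose caption contains word i (the multiple-instance setup).
    """
    gmms = []
    for i in range(Y.shape[1]):
        # Stack the frames of all songs labeled with word i.
        X_i = np.vstack([X for X, labeled in zip(songs, Y[:, i]) if labeled])
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(X_i)  # EM parameter estimation
        gmms.append(gmm)
    return gmms
```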

  9. Overview of our system: Modeling
  [System diagram, Data, Features, and Modeling stages: the representation pipeline above feeds Parameter Estimation (EM Algorithm), producing a Parametric Model (set of GMMs)]

  10. Overview of our system: Annotation
  [System diagram: the trained set of GMMs is used for inference; a novel song is annotated with a caption]

  11. Inference: Annotation
  Given ‘word’ distributions P(x|i) and a query song (x1, …, xT), we annotate with the word i* that maximizes the posterior:

  i* = argmax_i P(i | x1, …, xT) = argmax_i P(x1, …, xT | i) P(i)

  Naïve Bayes assumption: we assume any two feature vectors xs and xt are conditionally independent, given i:

  P(x1, …, xT | i) = P(x1 | i) × P(x2 | i) × … × P(xT | i)

  Assuming a uniform prior P(i) and taking a log transform, we have

  i* = argmax_i [ log P(x1 | i) + … + log P(xT | i) ]

  Using this equation, we annotate the query song with the top N words (a sketch of this rule follows below).
  www.cascadeblues.org
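A minimal sketch of this annotation rule, assuming the per-word GMMs from the training sketch above; scikit-learn's score_samples returns the per-frame log-likelihoods log P(xt|i):

```python
import numpy as np

def annotate(X, gmms, vocabulary, n_words=10):
    """Annotate a song X (T x D array of feature vectors) with N words.

    With a uniform prior, each word's score is the summed per-frame
    log-likelihood under that word's GMM (the naive Bayes sum above).
    """
    scores = np.array([gmm.score_samples(X).sum() for gmm in gmms])
    top = np.argsort(scores)[::-1][:n_words]  # highest-scoring words first
    return [vocabulary[i] for i in top]
```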

  12. Overview of our system: Annotation
  [System diagram: as in slide 10, a novel song passes through inference to produce a caption]

  13. Overview of our system: Retrieval
  [System diagram: the same pipeline with the retrieval path added; a text query passes through inference to rank songs]

  14. Inference: Retrieval
  • We would like to rank test songs by the likelihood P(x1, …, xT | q) given a query word q.
  • Problem: this results in almost the same ranking for all query words.
  • There are two reasons:
  • Length Bias
  • Longer songs have proportionately lower likelihood, resulting from the sum of additional log terms.
  • This results from the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00].
  Image from www.rockakademie-owl.de

  15. Inference: Retrieval
  • We would like to rank test songs by the likelihood P(x1, …, xT | q) given a query word q.
  • Problem: this results in almost the same ranking for all query words.
  • There are two reasons:
  • Length Bias
  • Song Bias
  • Many conditional word distributions P(x|q) are similar to the generic song distribution P(x).
  • High-probability (i.e., generic) songs under P(x) often have high probability under P(x|q).
  • Solution: rank by the posterior P(q | x1, …, xT) instead.
  • Normalize P(x1, …, xT | q) by P(x1, …, xT); a sketch follows below.
  Image from www.rockakademie-owl.de
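A minimal sketch of the normalized ranking, assuming a background GMM fit on all training frames as a stand-in for the generic song distribution P(x); with a uniform word prior, log P(q | x1, …, xT) equals log P(x1, …, xT | q) minus log P(x1, …, xT) up to a constant:

```python
import numpy as np

def retrieve(test_songs, word_gmm, background):
    """Rank songs for one query word by the normalized log-likelihood.

    test_songs: list of (T_s, D) feature arrays.
    word_gmm:   fitted GMM for the query word q.
    background: GMM fit on all training frames, approximating P(x).
    """
    scores = []
    for X in test_songs:
        log_p_given_q = word_gmm.score_samples(X).sum()   # log P(X|q)
        log_p = background.score_samples(X).sum()          # log P(X)
        scores.append(log_p_given_q - log_p)               # cancels both biases
    return np.argsort(scores)[::-1]  # indices of best-matching songs first
```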

  16. Overview of our system
  [System diagram: the complete pipeline, covering data, features, modeling, and inference for both annotation and retrieval]

  17. Overview of our system: Evaluation
  [System diagram: the complete pipeline plus an Evaluation stage applied to the annotation and retrieval outputs]

  18. Experimental Setup
  • Data: 2131 song-review pairs
  • Audio: popular Western music from the last 60 years
  • DMFCC feature vectors [MB03]
  • Each feature vector summarizes 3/4 of a second of audio content.
  • Each song is represented by between 320 and 1920 feature vectors.
  • Text: song reviews from the AMG Allmusic database
  • We create a vocabulary of 317 ‘musically relevant’ unigrams and bigrams.
  • A review is a natural-language document written by a musical expert.
  • Each review is converted into a binary document vector.
  • 80% training set: used for parameter estimation
  • 20% testing set: used for model evaluation
  Image from www.chrisbarber.net

  19. Experimental Setup
  • Tasks:
  • Annotation: annotate each test song with 10 words
  • Retrieval: rank-order all test songs given a query word
  • Metrics: we adopt evaluation metrics developed for image annotation and retrieval [CV05].
  • Annotation: mean per-word precision and recall
  • Retrieval: mean average precision; mean area under the ROC curve (see the sketch below)
  Image from www.chrisbarber.net
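A sketch of the two retrieval metrics, computed per query word with scikit-learn and averaged over the vocabulary; the argument names and the skipping of degenerate words are assumptions, not details from the paper:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(Y_true, scores):
    """Mean average precision and mean AROC over all query words.

    Y_true: (n_songs, n_words) binary relevance matrix from the captions.
    scores: (n_songs, n_words) per-word ranking scores from the model.
    """
    aps, arocs = [], []
    for i in range(Y_true.shape[1]):
        # Skip words with all-positive or all-negative test songs,
        # for which the metrics are undefined.
        if 0 < Y_true[:, i].sum() < len(Y_true):
            aps.append(average_precision_score(Y_true[:, i], scores[:, i]))
            arocs.append(roc_auc_score(Y_true[:, i], scores[:, i]))
    return np.mean(aps), np.mean(arocs)
```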

  20. Quantitative Results
  [Table: annotation results (mean per-word recall and precision) and retrieval results (mean average precision, AROC)]
  • Our model performs significantly better than random for all metrics (one-sided paired t-test with α = 0.1).
  • Recall and precision are bounded by a value less than 1.
  • AROC is perhaps the most intuitive metric.
  Image from sesentas.ururock.com

  21. Discussion
  1. Music is inherently subjective.
  • Different people will use different words to describe the same song.
  2. We are learning and evaluating using a very noisy text corpus.
  • Reviewers do not make explicit decisions about the relationships between individual words when reviewing a song.
  • “This song does not rock.”
  • Mining the web may not suffice.
  • Solution: manually label data (e.g., MoodLogic, Pandora).
  Image from www.16-bits.com.ar

  22. Discussion
  3. Our system performs much better when we annotate and retrieve sound effects.
  • BBC sound effects library
  • More objective task
  • Cleaner text corpus
  • Area under the ROC = 0.80 (compare with 0.61 for music)
  4. The best results for content-based image annotation and retrieval are comparable to our sound-effect results.
  Image from www.16-bits.com.ar

  23. “Talking about music is like dancing about architecture” (origins unknown)
  Please send your questions and comments to Douglas Turnbull: dturnbul@cs.ucsd.edu
  Image from vintageguitars.org.uk

  24.–25. References
