
Modeling Music with Words: a multi-class naïve Bayes approach


Presentation Transcript


  1. Modeling Music with Words: a multi-class naïve Bayes approach. Douglas Turnbull, Luke Barrington, Gert Lanckriet. Computer Audition Laboratory, UC San Diego. ISMIR 2006, October 11, 2006. Image from vintageguitars.org.uk

  2. People use words to describe music
  • How would one describe “I’m a Believer” by The Monkees?
  • We might use words related to:
  • Genre: ‘Pop’, ‘Rock’, ‘60’s’
  • Instrumentation: ‘tambourine’, ‘male vocals’, ‘electric piano’
  • Adjectives: ‘catchy’, ‘happy’, ‘energetic’
  • Usage: ‘getting ready to go out’
  • Related Sounds: ‘The Beatles’, ‘The Turtles’, ‘Lovin’ Spoonful’
  • We learn to associate certain words with the music we hear.
  Image: www.twang-tone.de/45kicks.html

  3. Modeling music and words
  • Our goal is to design a statistical system that learns a relationship between music and words.
  • Given such a system, we can perform:
  • Annotation: given the audio content of a song, ‘annotate’ the song with semantically meaningful words. (song → words)
  • Retrieval: given a text-based query, ‘retrieve’ relevant songs based on their audio content. (words → songs)
  Image from: http://www.lacoctelera.com/

  4. Modeling images and words
  [Figure: image annotation and retrieval examples from [CV05]; retrieval query string ‘jet’]
  • Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF+02, …].
  • This application has benefited from and inspired recent developments in machine learning.
  • How can MIR benefit from and inspire new developments in machine learning?
  *Images from [CV05], www.oldies.com

  5. Related work
  • Modeling music and words is at the heart of MIR research:
  • jointly modeling semantic labels and audio content
  • genre, emotion, style, and usage classification
  • music similarity analysis
  • Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05].
  • Others have looked at jointly modeling words and sound effects.
  • Most focus on non-parametric models (kNN) [SAR-Sla02, AudioClas-CK04].
  Images from www.sixtiescity.com

  6. Representing music and words
  • Consider a vocabulary and a heterogeneous data set of song-caption pairs:
  • Vocabulary: a predefined set of words
  • Song: a set of audio feature vectors (X = {x1, …, xT})
  • Caption: a binary document vector (y)
  • Example: “I’m a Believer” by The Monkees is a happy pop song that features tambourine.
  • Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}:
  • X = set of MFCC vectors extracted from the audio track
  • y = [1, 0, 1, 0, 1, 0] (a sketch of this encoding follows below)
  Image from www.bluesforpeace.com
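The binary caption encoding is simple enough to state in a few lines. A minimal sketch in Python, using the slide's toy vocabulary; the helper name caption_to_vector is illustrative, not from the original system:

```python
# Toy vocabulary and caption from the slide's example.
vocabulary = ["pop", "jazz", "tambourine", "saxophone", "happy", "sad"]

def caption_to_vector(caption_words, vocabulary):
    """Binary document vector y: y[i] = 1 iff word i appears in the caption."""
    caption = set(caption_words)
    return [1 if word in caption else 0 for word in vocabulary]

y = caption_to_vector({"happy", "pop", "tambourine"}, vocabulary)
print(y)  # [1, 0, 1, 0, 1, 0], matching the slide's example
```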

  7. Overview of our system: Representation
  [System diagram, Data and Features stages: Training Data and Vocabulary feed Caption Document Vectors (y) and Audio-Feature Extraction (X)]

  8. Probabilistic model for music and words
  • Consider a vocabulary and a set of song-caption pairs:
  • Vocabulary: a predefined set of words
  • Song: a set of audio feature vectors (X = {x1, …, xT})
  • Caption: a binary document vector (y)
  • For the i-th word in our vocabulary, we estimate a ‘word’ distribution P(x|i):
  • a probability distribution over the audio feature vector space
  • modeled with a Gaussian Mixture Model (GMM)
  • GMM estimated using Expectation Maximization (EM)
  • Key idea: the training data for each ‘word’ distribution is the set of all feature vectors from all songs that are labeled with that word.
  • Multiple Instance Learning: includes some irrelevant feature vectors
  • Weakly Labeled Data: excludes some relevant feature vectors
  • Our probabilistic model is a set of ‘word’ distributions (GMMs). (A training sketch follows below.)
  Image from www.freewebs.com
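As a concrete illustration of the key idea, here is a minimal training sketch using scikit-learn's EM-based GaussianMixture as a stand-in for the authors' GMM estimation; the variable names (songs, Y) and the number of mixture components are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_gmms(songs, Y, n_components=8):
    """Fit one GMM P(x|i) per vocabulary word.

    songs: list of (T_s, D) arrays of per-song feature vectors.
    Y:     (n_songs, n_words) binary caption matrix.

    Training data for word i is the pooled set of feature vectors from
    every song whose caption contains word i (the multiple-instance setup).
    """
    gmms = []
    for i in range(Y.shape[1]):
        # Stack the frames of all songs labeled with word i.
        X_i = np.vstack([X for X, labeled in zip(songs, Y[:, i]) if labeled])
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(X_i)  # EM parameter estimation
        gmms.append(gmm)
    return gmms
```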

  9. Overview of our system: Modeling
  [System diagram, Data, Features, and Modeling stages: the representation pipeline above feeds Parameter Estimation (EM Algorithm), producing a Parametric Model (set of GMMs)]

  10. Overview of our system: Annotation
  [System diagram: the trained set of GMMs is used for inference; a novel song is annotated with a caption]

  11. Inference: Annotation
  Given ‘word’ distributions P(x|i) and a query song (x1, …, xT), we annotate with the word i* that maximizes the posterior:

  i* = argmax_i P(i | x1, …, xT) = argmax_i P(x1, …, xT | i) P(i)

  Naïve Bayes assumption: we assume any two feature vectors xs and xt are conditionally independent, given i:

  P(x1, …, xT | i) = P(x1 | i) × P(x2 | i) × … × P(xT | i)

  Assuming a uniform prior P(i) and taking a log transform, we have

  i* = argmax_i [ log P(x1 | i) + … + log P(xT | i) ]

  Using this equation, we annotate the query song with the top N words (a sketch of this rule follows below).
  www.cascadeblues.org
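A minimal sketch of this annotation rule, assuming the per-word GMMs from the training sketch above; scikit-learn's score_samples returns the per-frame log-likelihoods log P(xt|i):

```python
import numpy as np

def annotate(X, gmms, vocabulary, n_words=10):
    """Annotate a song X (T x D array of feature vectors) with N words.

    With a uniform prior, each word's score is the summed per-frame
    log-likelihood under that word's GMM (the naive Bayes sum above).
    """
    scores = np.array([gmm.score_samples(X).sum() for gmm in gmms])
    top = np.argsort(scores)[::-1][:n_words]  # highest-scoring words first
    return [vocabulary[i] for i in top]
```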

  12. Overview of our system: Annotation
  [System diagram: as in slide 10, a novel song passes through inference to produce a caption]

  13. Overview of our system: Retrieval
  [System diagram: the same pipeline with the retrieval path added; a text query passes through inference to rank songs]

  14. Inference: Retrieval
  • We would like to rank test songs by the likelihood P(x1, …, xT | q) given a query word q.
  • Problem: this results in almost the same ranking for all query words.
  • There are two reasons:
  • Length Bias
  • Longer songs have proportionately lower likelihood, resulting from the sum of additional log terms.
  • This results from the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00].
  Image from www.rockakademie-owl.de

  15. Inference: Retrieval
  • We would like to rank test songs by the likelihood P(x1, …, xT | q) given a query word q.
  • Problem: this results in almost the same ranking for all query words.
  • There are two reasons:
  • Length Bias
  • Song Bias
  • Many conditional word distributions P(x|q) are similar to the generic song distribution P(x).
  • High-probability (i.e., generic) songs under P(x) often have high probability under P(x|q).
  • Solution: rank by the posterior P(q | x1, …, xT) instead.
  • Normalize P(x1, …, xT | q) by P(x1, …, xT); a sketch follows below.
  Image from www.rockakademie-owl.de
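A minimal sketch of the normalized ranking, assuming a background GMM fit on all training frames as a stand-in for the generic song distribution P(x); with a uniform word prior, log P(q | x1, …, xT) equals log P(x1, …, xT | q) minus log P(x1, …, xT) up to a constant:

```python
import numpy as np

def retrieve(test_songs, word_gmm, background):
    """Rank songs for one query word by the normalized log-likelihood.

    test_songs: list of (T_s, D) feature arrays.
    word_gmm:   fitted GMM for the query word q.
    background: GMM fit on all training frames, approximating P(x).
    """
    scores = []
    for X in test_songs:
        log_p_given_q = word_gmm.score_samples(X).sum()   # log P(X|q)
        log_p = background.score_samples(X).sum()          # log P(X)
        scores.append(log_p_given_q - log_p)               # cancels both biases
    return np.argsort(scores)[::-1]  # indices of best-matching songs first
```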

  16. Overview of our system
  [System diagram: the complete pipeline, covering data, features, modeling, and inference for both annotation and retrieval]

  17. Overview of our system: Evaluation
  [System diagram: the complete pipeline plus an Evaluation stage applied to the annotation and retrieval outputs]

  18. Experimental Setup
  • Data: 2131 song-review pairs
  • Audio: popular Western music from the last 60 years
  • DMFCC feature vectors [MB03]
  • Each feature vector summarizes 3/4 of a second of audio content.
  • Each song is represented by between 320 and 1920 feature vectors.
  • Text: song reviews from the AMG Allmusic database
  • We create a vocabulary of 317 ‘musically relevant’ unigrams and bigrams.
  • A review is a natural-language document written by a musical expert.
  • Each review is converted into a binary document vector.
  • 80% training set: used for parameter estimation
  • 20% testing set: used for model evaluation
  Image from www.chrisbarber.net

  19. Experimental Setup
  • Tasks:
  • Annotation: annotate each test song with 10 words
  • Retrieval: rank-order all test songs given a query word
  • Metrics: we adopt evaluation metrics developed for image annotation and retrieval [CV05].
  • Annotation: mean per-word precision and recall
  • Retrieval: mean average precision; mean area under the ROC curve (see the sketch below)
  Image from www.chrisbarber.net
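A sketch of the two retrieval metrics, computed per query word with scikit-learn and averaged over the vocabulary; the argument names and the skipping of degenerate words are assumptions, not details from the paper:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(Y_true, scores):
    """Mean average precision and mean AROC over all query words.

    Y_true: (n_songs, n_words) binary relevance matrix from the captions.
    scores: (n_songs, n_words) per-word ranking scores from the model.
    """
    aps, arocs = [], []
    for i in range(Y_true.shape[1]):
        # Skip words with all-positive or all-negative test songs,
        # for which the metrics are undefined.
        if 0 < Y_true[:, i].sum() < len(Y_true):
            aps.append(average_precision_score(Y_true[:, i], scores[:, i]))
            arocs.append(roc_auc_score(Y_true[:, i], scores[:, i]))
    return np.mean(aps), np.mean(arocs)
```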

  20. Quantitative Results
  [Table: annotation results (mean per-word recall and precision) and retrieval results (mean average precision, AROC)]
  • Our model performs significantly better than random for all metrics (one-sided paired t-test with α = 0.1).
  • Recall and precision are bounded by a value less than 1.
  • AROC is perhaps the most intuitive metric.
  Image from sesentas.ururock.com

  21. Discussion
  1. Music is inherently subjective.
  • Different people will use different words to describe the same song.
  2. We are learning and evaluating using a very noisy text corpus.
  • Reviewers do not make explicit decisions about the relationships between individual words when reviewing a song.
  • “This song does not rock.”
  • Mining the web may not suffice.
  • Solution: manually label data (e.g., MoodLogic, Pandora).
  Image from www.16-bits.com.ar

  22. Discussion
  3. Our system performs much better when we annotate and retrieve sound effects.
  • BBC sound effects library
  • More objective task
  • Cleaner text corpus
  • Area under the ROC = 0.80 (compare with 0.61 for music)
  4. The best results for content-based image annotation and retrieval are comparable to our sound-effect results.
  Image from www.16-bits.com.ar

  23. “Talking about music is like dancing about architecture” (origins unknown)
  Please send your questions and comments to Douglas Turnbull: dturnbul@cs.ucsd.edu
  Image from vintageguitars.org.uk

  24.–25. References
