Modeling Music with Words: a multi-class naïve Bayes approach. Douglas Turnbull, Luke Barrington, Gert Lanckriet. Computer Audition Laboratory, UC San Diego. ISMIR 2006, October 11, 2006. Image from vintageguitars.org.uk
People use words to describe music • How would one describe “I’m a Believer” by The Monkees? • We might use words related to: • Genre: ‘Pop’, ‘Rock’, ‘60’s’ • Instrumentation: ‘tambourine’, ‘male vocals’, ‘electric piano’ • Adjectives: ‘catchy’, ‘happy’, ‘energetic’ • Usage: ‘getting ready to go out’ • Related Sounds: ‘The Beatles’, ‘The Turtles’, ‘Lovin’ Spoonful’ • We learn to associate certain words with the music we hear. Image: www.twang-tone.de/45kicks.html
Modeling music and words • Our goal is to design a statistical system that learns a relationship between music and words. • Given such a system, we can: • Annotation: given the audio content of a song, we can ‘annotate’ the song with semantically meaningful words (song → words). • Retrieval: given a text-based query, we can ‘retrieve’ relevant songs based on their audio content (words → songs). Image from: http://www.lacoctelera.com/
Modeling images and words • Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF+02, …]. • This application has benefited from and inspired recent developments in machine learning. • How can MIR benefit from and inspire new developments in machine learning? [Figure: example image annotation results and retrieval results for the query string ‘jet’.] *Images from [CV05], www.oldies.com
Related work: • Modeling music and words is at the heart of MIR research: • jointly modeling semantic labels and audio content • genre, emotion, style, and usage classification • music similarity analysis • Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05]. • Others have looked at jointly modeling words and sound effects. • Most focus on non-parametric models (kNN) [SAR-Sla02, AudioClas-CK04] Images from www.sixtiescity.com
Representing music and words • Consider a vocabulary and a heterogeneous data set of song-caption pairs: • Vocabulary - predefined set of words • Song - set of audio feature vectors (X = {x1, …, xT}) • Caption - binary document vector (y) • Example: “I’m a Believer” by The Monkees is a happy pop song that features tambourine. • Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}: • X = set of MFCC vectors extracted from the audio track • y = [1, 0, 1, 0, 1, 0] Image from www.bluesforpeace.com
Overview of our system: Representation • [Block diagram: training data (vocabulary and song-caption pairs) is converted into caption document vectors (y) by text-feature extraction and into audio features by audio-feature extraction (X).]
Probabilistic model for music and words • Consider a vocabulary and a set of song-caption pairs: • Vocabulary - predefined set of words • Song - set of audio feature vectors (X = {x1, …, xT}) • Caption - binary document vector (y) • For the i-th word in our vocabulary, we estimate a ‘word’ distribution P(x|i): • a probability distribution over the audio feature vector space • modeled with a Gaussian Mixture Model (GMM) • GMM parameters estimated using Expectation Maximization (EM) • Key idea: the training data for each ‘word’ distribution is the set of all feature vectors from all songs that are labeled with that word. • Multiple Instance Learning: includes some irrelevant feature vectors • Weakly Labeled Data: excludes some relevant feature vectors • Our probabilistic model is a set of ‘word’ distributions (GMMs). Image from www.freewebs.com
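As a rough illustration of this training step (not the authors' implementation), the sketch below pools the feature vectors of every training song whose caption contains a given word and fits one GMM per word with EM. The scikit-learn dependency and the `songs` data structure are assumptions made for the example.

```python
# Minimal sketch: one GMM per vocabulary word, estimated with EM on the pooled
# feature vectors of all songs labeled with that word.
# Assumes scikit-learn; `songs` is a hypothetical list of (X, y) pairs where
# X is a (frames x dims) feature matrix and y is the binary caption vector.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_models(songs, vocab_size, n_components=8):
    """Return a dict mapping word index -> GaussianMixture fit with EM."""
    word_models = {}
    for i in range(vocab_size):
        pooled = [X for X, y in songs if y[i] == 1]   # songs labeled with word i
        if not pooled:
            continue                                  # word absent from training captions
        data = np.vstack(pooled)                      # (total frames, feature dim)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', max_iter=100)
        gmm.fit(data)                                 # EM parameter estimation
        word_models[i] = gmm
    return word_models
```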
Overview of our system: Modeling • [Block diagram: the caption document vectors (y) and audio features (X) feed parameter estimation with the EM algorithm, yielding the parametric model: a set of GMMs.]
Overview of our system: Annotation • [Block diagram: a novel song passes through audio-feature extraction and inference against the set of word GMMs to produce a caption (annotation).]
Inference: Annotation • Given ‘word’ distributions P(x|i) and a query song (x1, …, xT), we annotate with the word i* = argmax_i P(i | x1, …, xT). • Naïve Bayes assumption: we assume that xs and xt are conditionally independent given the word i, so P(x1, …, xT | i) = ∏_t P(xt | i). • Assuming a uniform prior over words and taking a log transform, we have i* = argmax_i Σ_t log P(xt | i). • Using this equation, we annotate the query song with the top N words. www.cascadeblues.org
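A minimal sketch of this annotation rule, reusing the hypothetical `word_models` dictionary from the training sketch above (one scikit-learn GaussianMixture per word):

```python
import numpy as np

def annotate(X, word_models, top_n=10):
    """Score each word by the sum of per-frame log-likelihoods of the song's
    feature matrix X (frames x dims): the naive Bayes score under a uniform prior."""
    scores = {}
    for i, gmm in word_models.items():
        scores[i] = np.sum(gmm.score_samples(X))   # sum_t log P(x_t | word i)
    # Annotate with the top-N highest-scoring word indices.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```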
Overview of our system: Retrieval • [Block diagram: a text query is compared against the set of word GMMs via inference to rank songs (retrieval), alongside the annotation path for novel songs.]
Inference: Retrieval • We would like to rank test songs by the likelihood P(x1, …, xT | q) of their audio features given a query word q. • Problem: this results in almost the same ranking for all query words. • There are two reasons: • Length Bias • Longer songs have proportionately lower log-likelihood, resulting from the sum of additional log terms. • This results from the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00]. Image from www.rockakademie-owl.de
Inference: Retrieval • We would like to rank test songs by the likelihood P(x1, …, xT | q) given a query word q. • Problem: this results in almost the same ranking for all query words. • There are two reasons: • Length Bias • Song Bias • Many conditional word distributions P(x|q) are similar to the generic song distribution P(x). • High-probability (i.e., generic) songs under P(x) often have high probability under P(x|q). • Solution: rank by the posterior P(q | x1, …, xT) instead; see the sketch after this slide. • Equivalently, normalize P(x1, …, xT | q) by P(x1, …, xT). Image from www.rockakademie-owl.de
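A minimal sketch of the normalized ranking, under the same assumptions as the earlier sketches; `generic_model` is a hypothetical GMM for P(x) trained on feature vectors pooled from all songs. Averaging the per-frame log-likelihood also removes the length bias:

```python
import numpy as np

def retrieve(test_songs, query_word, word_models, generic_model):
    """Rank test songs for one query word by the average per-frame
    log P(x|q) - log P(x), i.e. by the (log) normalized likelihood."""
    scores = []
    for X in test_songs:                              # X is a (frames x dims) matrix
        word_ll = np.mean(word_models[query_word].score_samples(X))
        song_ll = np.mean(generic_model.score_samples(X))
        scores.append(word_ll - song_ll)
    return np.argsort(scores)[::-1]                   # most relevant first
```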
Overview of our system • [Recap of the full pipeline block diagram: data, features, modeling, and inference for annotation and retrieval.]
Overview of our system: Evaluation • [Block diagram: the full pipeline (data, features, modeling, inference) with an evaluation stage applied to the annotation and retrieval outputs on novel songs.]
Experimental Setup • Data: 2131 song-review pairs • Audio: popular western music from the last 60 years • dMFCC feature vectors [MB03] • Each feature vector summarizes 3/4 of a second of audio content • Each song is represented by between 320 and 1920 feature vectors • Text: song reviews from the AMG Allmusic database • We create a vocabulary of 317 ‘musically relevant‘ unigrams and bigrams • A review is a natural language document written by a music expert • Each review is converted into a binary document vector • 80% Training Set: used for parameter estimation • 20% Testing Set: used for model evaluation Image from www.chrisbarber.net
Experimental Setup • Tasks: • Annotation: annotate each test song with 10 words • Retrieval: rank order all test songs given a query word • Metrics: We adopt evaluation metrics developed for image annotation and retrieval [CV05]. • Annotation: • mean per-word precision and recall • Retrieval: • mean average precision • mean area under the ROC curve Image from www.chrisbarber.net
Quantitative Results • [Results table: annotation recall and precision; retrieval mean average precision (maPrec) and area under the ROC curve (AROC).] • Our model performs significantly better than random on all metrics (one-sided paired t-test with α = 0.1). • Recall and precision are bounded above by a value less than 1. • AROC is perhaps the most intuitive metric. Image from sesentas.ururock.com
Discussion 1. Music is inherently subjective. • Different people will use different words to describe the same song. 2. We are learning and evaluating using a very noisy text corpus. • Reviewers do not make explicit decisions about the relationships between individual words when reviewing a song (“This song does not rock.”). • Mining the web may not suffice. • Solution: manually label data (e.g., MoodLogic, Pandora). Image from www.16-bits.com.ar
Discussion 3. Our system performs much better when we annotate & retrieve sound effects • BBC sound effect library • More objective task • Cleaner text corpus • Area under the ROC = 0.80 (compare with 0.61 for music) 4. Best results for content-based image annotation and retrieval are comparable to our sound effect results. Image from www.16-bits.com.ar
“Talking about music is like dancing about architecture”- origins unknown Please send your questions and comments to Douglas Turnbull - dturnbul@cs.ucsd.edu Image from vintageguitars.org.uk
Related work: • Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05]: • uses web documents associated with artists, not songs • focuses on vocabulary selection • learns a binary classifier for each word • a word is ‘grounded’ if the classifier can separate the audio data. • Produces some tools for a “query-by-description” system, but • no quantitative results on a complete system • How do we combine the outputs of the binary classifiers? • The approach would be sensitive to ‘weakly labeled’ data. • Others have looked at jointly modeling words and sound effects. • Most focus on non-parametric models (kNN) [AudioClas-CK04, Sla02]
Text-Feature Extraction Let our vocabulary V be a set of unigram and bigram tokens. For each song review, we: • parse the review string into a set of tokens • apply a custom stemming algorithm to the tokens • create a binary document vector d in {0,1}^|V|, where di is • 1 if the ith token is present in the review • 0 otherwise Example: Vocab = {blues, guitar, jazz, blues;guitar, banjo, lick} Review = “This is a great blues song filled with sweet guitar licks.” Document Vector = [1,1,0,0,0,1] Discussion: Latent Semantic Analysis (LSA) offers an alternative that captures notions of synonymy and polysemy.
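The sketch below reproduces the document-vector example above. The regular-expression tokenizer and the crude de-pluralizing "stemmer" are stand-ins for the custom pipeline described on the slide:

```python
import re

VOCAB = ['blues', 'guitar', 'jazz', 'blues;guitar', 'banjo', 'lick']

def document_vector(review, vocab=VOCAB):
    """Binary document vector over unigram and bigram vocabulary terms."""
    raw = re.findall(r"[a-z']+", review.lower())
    # Crude stand-in for the custom stemmer: keep raw and de-pluralized forms.
    unigrams = set(raw) | {t.rstrip('s') for t in raw}
    bigrams = {f"{a};{b}" for a, b in zip(raw, raw[1:])}
    present = unigrams | bigrams
    return [1 if term in present else 0 for term in vocab]

print(document_vector("This is a great blues song filled with sweet guitar licks."))
# -> [1, 1, 0, 0, 0, 1]
```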
Audio-Feature Extraction Each song is a time series of samples representing 1-12 minutes of high-fidelity audio content. • CD audio: if we consider a 5-minute song sampled at 44,100 samples per second, our song lives in a 13.2-million-dimensional space. • We generally consider downsampled, single-channel audio signals. We reduce the dimension of our song by • extracting a d-dimensional feature vector for each ¾-second window of audio • applying a linear transform to each d-dimensional feature vector and retaining a d'-dimensional feature vector. • The linear transform is found using principal component analysis (PCA). The resulting representation is a matrix with d' rows and a varying number of columns. • Returning to the example above, we extract about 60 features for each ¾-sec window, reduce the dimensionality to 12 features, and output a feature matrix in R^(12 x 400).
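A minimal sketch of this PCA step on dummy data (in practice the transform would be fit on features pooled over the whole training set rather than a single song; scikit-learn is an assumption here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
raw_features = rng.normal(size=(400, 60))   # ~60-D vector per 3/4-second window (dummy data)

pca = PCA(n_components=12)                  # linear transform found by PCA
reduced = pca.fit_transform(raw_features)   # (400, 12)
song_matrix = reduced.T                     # d' rows, one column per window
print(song_matrix.shape)                    # -> (12, 400)
```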
Audio-Feature Extraction We consider two perceptually motivated feature sets that have shown superior performance on the task of music classification by genre [McKinney & Breebaart ’03]: • Dynamic Mel-frequency cepstral coefficients (dMFCC) • Auditory filterbank temporal envelopes (AFTE) • Not discussed today
Audio-Feature Extraction Dynamic MFCC (dMFCC) features: 1) For each short-time window (23 msec), extract MFCCs [Logan ’00]: • Find the spectrum using the Discrete Fourier Transform (DFT) • Early stages of the auditory system perform analysis in the frequency domain. • Calculate the log spectrum (known as the cepstrum) • Perceptual loudness has been found to be related to the log(magnitude) of a signal. • Apply Mel-scaling • Mapping between true frequency and perceived frequency • Separate frequency components into 40 bins • Apply the discrete cosine transform (DCT) • Reduces dimensionality - efficient coding 2) For each time series of 64 13-D MFCC vectors, compute the power spectrum and integrate the power within frequency bands: • DC (0 Hz) - average of each feature • 1-2 Hz - rate of musical beats • 3-15 Hz - speech syllabic rates • 20-43 Hz - range used to detect perceptual ‘roughness’ The result is a 52-D vector for each ¾-second window of audio content.
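A rough sketch of the dMFCC idea under stated assumptions (librosa for the MFCCs; a hop of 256 samples at 22,050 Hz so that 64 MFCC frames span roughly ¾ s and the modulation spectrum reaches about 43 Hz). It illustrates the computation, not the exact feature extractor used in the paper:

```python
import numpy as np
import librosa

BANDS = [(0.0, 0.5), (1.0, 2.0), (3.0, 15.0), (20.0, 43.0)]   # modulation bands in Hz

def dmfcc(y, sr=22050, n_mfcc=13, block=64, hop_length=256):
    """Return one 52-D dynamic-MFCC vector per block of 64 MFCC frames."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    frame_rate = sr / hop_length                          # MFCC frames per second (~86 Hz)
    freqs = np.fft.rfftfreq(block, d=1.0 / frame_rate)    # modulation frequencies
    vectors = []
    for start in range(0, mfcc.shape[1] - block + 1, block):
        power = np.abs(np.fft.rfft(mfcc[:, start:start + block], axis=1)) ** 2
        feats = [power[:, (freqs >= lo) & (freqs <= hi)].sum(axis=1)  # 13 values per band
                 for lo, hi in BANDS]
        vectors.append(np.concatenate(feats))             # 4 bands x 13 coeffs = 52-D
    return np.array(vectors)
```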
Audio-Feature Extraction Auditory Filterbank Temporal Envelope (AFTE) features: • Each half-overlapping, ¾-second window of audio is passed through a bank of 18 gammatone filters [Patterson et al. 1988]. • The gammatone filterbank output is an array of filtered waves that simulate the motion of the basilar membrane in the cochlea as a function of time. • Center frequencies are spaced logarithmically from 26 to 11025 Hz. • The windowed FFT spectrum of each gammatone filter is summarized by summing the energy in 4 bands: 0 Hz (DC), 3-15 Hz, 20-150 Hz, 150-1000 Hz. The result is a 72-D vector for each ¾-second window of audio content.
Inference: Retrieval • To correct for length and song bias, we normalize the likelihood P(x1, …, xT | q) by P(x1, …, xT). • Since P(q) is constant across songs, this normalization can be interpreted as ranking songs by the posterior P(q | x1, …, xT) ∝ P(x1, …, xT | q) / P(x1, …, xT). • Intuition: normalizing by P(x1, …, xT) allows each song to place emphasis (i.e., weight) on the words that increase the likelihood of its audio features {x1, …, xT}. Image from www.rockakademie-owl.de
Intellectual Motivation • Novel Computer Audition Techniques • Sound subdomains - sound effects, animal vocalizations, speech • Audition problem - monitoring, identification, characterization • Musical Knowledge Discovery • Finding semantically meaningful words that we use to describe music • Learning compact representations of an audio track • Models of Human Audition • Low-level feature extraction and high-level modeling • Introducing results from machine learning, computer vision, and natural language processing to audition researchers • Improving existing commercial applications.
Related Research This work has been inspired by research on image annotation and music classification by genre. Content-based Image Annotation • Segment an image into blocks or ‘blobs’ • Extract image features from each segment • Model the joint probability of words and image features Recent work on this problem: • Object Recognition as Machine Translation [Duygulu, Barnard, de Freitas, Forsyth ‘02] • Correspondence-Latent Dirichlet Allocation [Blei & Jordan ‘03] • Supervised M-ary Model [Carneiro and Vasconcelos ‘05]
Related Research Music classification by genre: Given a novel song, is the song a rock, rap, reggae, classical, country, blues, or disco song? • Research focus has been on audio feature extraction 1) Feature Extraction - a low-dimensional representation of audio information. • Short-time feature design - extracted over ~25 msec of audio • Fourier-based: Timbral, Pitch and Rhythm Features [Tzanetakis & Cook ‘02] • Wavelet-based audio features [Li, Ogihara, and Li ‘03] • Models based on human perception - loudness, roughness, etc. [McKinney & Breebaart ‘03] • Feature Integration - merging short-time feature vectors over a medium-time (~1 sec) window • Simple Statistics - Mean, Variance, Skewness, Kurtosis [Tzanetakis & Cook ‘02] • Filterbank Transform on the time series of feature vectors [McKinney & Breebaart ’03] • Autocorrelation and Linear Predictive Coding [Meng, Ahrendt, and Larsen ‘05] • Dimensionality reduction using Principal Component Analysis (PCA)
Related Research Music classification by genre 2) Supervised Learning: use labeled feature vectors (x,y) to train a model. The model can then be used to predict labels (ŷ) for an unlabeled song (x). • Labels: ‘rock’, ‘country’, ‘jazz’, ‘blues’, ’classical’, … • Models in practice include SVMs, KNNs, GMMs, LDA, etc. • Warning: The concept of genre is ill-defined since it is a subjective concept. Since authors make varying assumptions about genre (number of genres, names of genres, hierarchical vs. flat taxonomy), it is hard to directly compare classification results.
Query-by-text: a novel approach Music Information Retrieval (MIR) research involves the retrieval, classification, and management of music. [Goto & Hirata ‘04] Retrieval methods have focused on • Query by humming • Query by fragment • Query by similarity • Query using collaborative filtering Our approach can be described as “query by text”. Another line of MIR research that uses a heterogeneous dataset of text reviews and audio content is [Whitman & Ellis ‘04]. They focus on finding • semantically meaningful words • unbiased sentences that describe audio content.
Four Parameter Estimation Techniques Direct Model for word w: • Merge all of the ¾-second feature vectors from all the songs that have w in the associated song reviews. • Learn a GMM from this set of feature vectors using EM. Naïve Average Model for word w: • Estimate a ‘song-level’ GMM for each song that has w in the associated song review. • Merge the set of GMM components and rescale the component priors to form the word-level GMM. [Diagram legend: feature vectors for a song → song-level GMM (EM for GMM) → word-level GMM; width represents the number of vectors or mixture components.]
Four Parameter Estimation Techniques Center Model for word w: • Estimate song-level GMMs. • Estimate the word-level model by running EM on the component centers from the song-level GMMs. Mixture Hierarchy Model for word w: • Estimate song-level GMMs. • Estimate the ‘word-level’ model using the means and covariances from the song mixtures. • Use ‘mixture hierarchies EM’ to soft-cluster the mixture components [Vasconcelos ’01]. [Diagram: song-level GMM components → word-level GMM via EM on component centers or via mixture-hierarchy EM.]
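As a rough illustration of the naïve average technique from the previous slide (and how it differs from the direct model), the sketch below fits one song-level GMM per song and merges the components with rescaled priors; scipy is used to score the merged mixture. This is an assumption-laden example, not the authors' code:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def naive_average_model(song_feature_matrices, n_components=4):
    """Merge song-level GMM components into one word-level mixture."""
    weights, means, covs = [], [], []
    for X in song_feature_matrices:                       # one (frames x dims) matrix per song
        gmm = GaussianMixture(n_components=n_components, covariance_type='full').fit(X)
        weights.append(gmm.weights_ / len(song_feature_matrices))   # rescale component priors
        means.append(gmm.means_)
        covs.append(gmm.covariances_)
    return np.concatenate(weights), np.vstack(means), np.concatenate(covs)

def word_log_likelihood(x, weights, means, covs):
    """log P(x | word) of a single feature vector x under the merged mixture."""
    comp = [np.log(w) + multivariate_normal.logpdf(x, m, c)
            for w, m, c in zip(weights, means, covs)]
    return logsumexp(comp)
```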
Evaluation Evaluating the performance of our system is difficult because music is inherently subjective. Annotation: Ask two people to review the same song, and you will find that the reviews may have little in common. [Whitman & Ellis ‘04] Did the reviewer “forget” to use a relevant word in a human review? Retrieval: How do we measure “song similarity”? A jazz purist may think that “Oops I did it again” by Britney Spears and “Hanging Tough” by NKOTB are similar when, in fact, they are not.
Annotation Evaluation Annotation: mean per-word precision and recall • Annotate each test song with N words. • For each word w in our vocabulary compute: • |wH| = # of human annotations with word w • |wA| = # of automatic annotations with word w • |wC| = # of correct automatic annotations • Calculate the average over all words in V: • Recall = |wC| / |wH| • Precision = |wC| / |wA|
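A minimal sketch of these annotation metrics; `human` and `auto` are hypothetical lists of word-index sets (the human annotation and the top-N automatic annotation for each test song):

```python
def mean_per_word_precision_recall(human, auto, vocab_size):
    """Mean per-word precision and recall over the vocabulary."""
    precisions, recalls = [], []
    for w in range(vocab_size):
        wH = sum(1 for h in human if w in h)                            # human annotations with w
        wA = sum(1 for a in auto if w in a)                             # automatic annotations with w
        wC = sum(1 for h, a in zip(human, auto) if w in h and w in a)   # correct automatic annotations
        if wH:
            recalls.append(wC / wH)
        if wA:
            precisions.append(wC / wA)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```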
Retrieval Evaluation Retrieval: mean per-word area under the ROC curve (mAROC) and mean per-word average precision (mAP) • Using each word as a query q, rank the songs in the test set. • Calculate: • Area under the ROC curve: the ROC curve plots the true positive rate as a function of the false positive rate. • Average precision: record the precision each time we correctly identify a song that matches the query, then average these precisions. • Calculate the average over all words in V.
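A minimal sketch of the per-word retrieval metrics on dummy data, assuming scikit-learn; averaging these numbers over every query word in the vocabulary gives mAROC and mAP:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
relevant = rng.integers(0, 2, size=100)                 # dummy: 1 if a song matches the query word
scores = relevant + rng.normal(scale=1.0, size=100)     # dummy, loosely correlated system scores

print("AROC:", roc_auc_score(relevant, scores))
print("Average precision:", average_precision_score(relevant, scores))
```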
Discussion Representing Music with Words: • Our reviews represent a “noisy” version of our ideal human annotations: • valid words are missing • erroneous words appear (e.g., ‘this song does not rock’) • What is a good annotation? • commercial databases: MoodLogic, Pandora • psychological experiments Vocabulary: • Discover semantically meaningful words [Whitman & Ellis ‘04][Barnard ‘06] • We may be able to “artificially” enlarge our vocabulary using synonyms and antonyms (e.g., WordNet). Dataset: we extract features from a dataset of compressed MP3 files. This is like extracting image features from JPEG files. Are we introducing potential artifacts from the compression algorithm that the models could exploit?
Discussion Modeling/Inference • For existing GMM-based models, how do we best estimate the parameters? • Specific: four parameter estimation techniques for estimating each word model • General: numerous heuristics for learning a GMM with EM • There are other ways to model audio and text features: • Hidden Markov Models (HMMs) - by using GMMs, we ignore longer-term temporal information. One idea is to use HMMs to model the trajectories of acoustic features over time. • Audio segmentation - we used a “block-based” music decomposition, but employing automatic segmentation may prove useful. • Latent Semantic Analysis - an alternative text representation that is useful for uncovering synonymy and polysemy.