380 likes | 394 Views
Towards Musical Query-by-Semantic-Description using the CAL500 Dataset. Douglas Turnbull Computer Audition Lab UC San Diego Work with Luke Barrington, David Torres, and Gert Lanckriet SIGIR June 25, 2007. How do we find music?. Query-by-Metadata - artist, song, album, year
E N D
TowardsMusical Query-by-Semantic-Descriptionusing the CAL500 Dataset Douglas Turnbull Computer Audition Lab UC San Diego Work with Luke Barrington, David Torres, and Gert Lanckriet SIGIR June 25, 2007
How do we find music? • Query-by-Metadata - artist, song, album, year • We must know what we want • Query-by-(Humming, Tapping, Beatboxing) • Requires talent • Query-by-Song-Similarity • We must possess ‘acoustically’ similar songs • Query-by-Semantic-Description • Google seems to work pretty well for text • Semantic Image Labeling is a hot topic in Computer Vision • Can it work for music?
Semantic Music Annotation and Retrieval Frank Sinatra ‘Fly Me to the Moon’ ‘Jazz’ ‘Male Vocals’ ‘Sad’ ‘Mellow’ ‘Slow Tempo’ Annotation Retrieval Our goal is build a system that can • Annotate a song with meaningful words • Retrieve songs given a text-based query Plan: Learn a probabilistic model that captures a relationship between the audio content of a song and words that describe the song. We consider this as a supervised multi-class, multi-label problem.
System Overview Data Features Modeling Evaluation Parametric Model: Set of GMMs Training Data T T Parameter Estimation: EM Algorithm Annotation Annotation Vectors (y) Audio-Feature Extraction (X) Novel Song Evaluation Inference Music Review (annotation) (retrieval) Text Query Vocabulary
System Overview Data Training Data T T Annotation
The CAL500 data set The Computer Audition Lab 500-song (CAL500) data set • 500 ‘Western Popular’ songs • 174-word vocabulary • genre, emotion, usage, instrumentation, rhythm, pitch, vocal characteristics • 3 or more annotations per song • 55 paid undergrads annotate music for 120 hours Other Techniques • Text-mining of web documents • ‘Human Computation’ Games - (e.g., Listen Game )
System Overview Data Features Training Data T T Annotation Document Vectors (y) Audio-Feature Extraction (X) Vocabulary
Semantic Representation We choose a vocabulary of ‘musically relevant’ words Each annotation is converted to a real-valued vector. • Each element represents the ‘semantic association’ between a word and the song. Example: Frank Sinatra’s ‘Fly Me to the Moon” Vocab = {funk, jazz, guitar, female vocals, sad, passionate } Annotation Vector = [0/4 , 3/4, 4/4 , 0/4 , 2/4, 1/4]
Acoustic Representation Each songs is represented as a bag-of-feature-vectors • Pass a short time window over the audio signal • Extract a feature vector for each short-time audio segment Specifically, we calculate Delta MFCC feature vectors • Mel-frequency Cepstral Coefficients (MFCC) represent the shape of a short-term (23 msec) spectrum • Popular for both representing speech, music, and sound effects • Instantaneous derivatives (deltas) encode short-time temporal info • 10,000 39-dimensional vector per minute
System Overview Data Features Modeling Parametric Model: Set of GMMs Training Data T T Parameter Estimation: EM Algorithm Annotation Document Vectors (y) Audio-Feature Extraction (X) Vocabulary
Statistical Model We adapt the Supervised Multi-class Labeling (SML) model • Set of probability distributions over the audio feature space • One Gaussian Mixture Model (GMM) per word - p(x|w) • Estimate parameters for GMM using the set of training songs that are positively associated with the word Notes: • Developed for image annotation by Carneiro and Vasconcelos • Modified for real-value semantic weights rather than binary class labels • Extended formulation to handle multi-word queries
Gaussian Mixture Model (GMM) A GMM is often used to model arbitrary probability distributions over high dimensional spaces: • A GMM is a weighted combo of R Gaussian distributions • r is the r-th mixing weight • r is the r-th mean • r is the r-th a covariance matrix • These parameters are usually estimated using the Expectation Maximization (EM) algorithm.
Step 1 - Song GMMs Bag of MFCC vectors GMM + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + To model each song: Segment audio signal Extract short-time feature vectors Estimate the GMM distribution using a ‘standard’ EM
Word GMMs - p(x|w) romantic Efficient Hierarchical Estimation Semantic Class Model p(x|w) For each word w, we learn a word model p(x|w) • Identify all songs associated with w • i.e., all ‘romantic’ songs • Estimate song-level GMMs • Use Weigthed Mixture Hierarchies EM to estimate p(x|w) • Soft clustering of Gaussian components from song GMMs
System Overview Data Features Modeling Parametric Model: Set of GMMs Training Data T T Parameter Estimation: EM Algorithm Annotation Annotation Vectors (y) Audio-Feature Extraction (X) Novel Song Inference Music Review (annotation) Vocabulary
Annotation Given a set of word-GMMs and a novel song X = {x1, …, xT}, we calculate the likelihood of the song under each word GMM: Assuming Uniform word prior P(w) Feature vectors are conditionally independent given a word (Naïve Bayes) These likelihoods can be interpreted as a semantic multinomial distribution over the vocabulary of words words. Annotation involves picking the peaks of the semantic multinomial.
Annotation Semantic Multinomial for “Give it Away” by the Red Hot Chili Peppers
Annotation: Automatic Music Reviews Dr. Dre (feat. Snoop Dogg) - Nuthin' but a 'G' thang This is adancepoppy, hip-hop song that is arousing and exciting. It features drum machine, backing vocals, male vocal, a nice acoustic guitar solo, and rapping, strong vocals. It is a song that is very danceable and with a heavy beat that you might like listen to while at a party. Frank Sinatra - Fly me to the moon This is a jazzy,singer / songwriter song that is calming and sad. It features acoustic guitar,piano, saxophone, a nice male vocal solo, and emotional, high-pitched vocals. It is a song with a light beat and a slow tempo that you might like listen to while hanging with friends.
System Overview Data Features Modeling Parametric Model: Set of GMMs Training Data T T Parameter Estimation: EM Algorithm Annotation Annotation Vectors (y) Audio-Feature Extraction (X) Novel Song Inference Music Review (annotation) (retrieval) Text Query Vocabulary
Retrieval • Annotate each songs in corpus with a semantic multinomial p • p = {P(w1|X), …, P(wV|X)}. • Given a text-based query, construct as query multinomialq • qw = 1/|w| , if word w appears in the query string • qw = 0, otherwise • We rank order all songs by the Kullback-Leibler (KL) divergence between the query multinomial and all semantic multinomals The compact semantic multinomial representation of a song allow us to quickly rank order songs.
Retrieval The top 3 results for - “pop, female vocals, tender” 0.33 1. Shakira - The One 0.02 2. Alicia Keys - Fallin’ 0.02 3. Evanescence - My Immortal 0.02
System Overview Data Features Modeling Evaluation Parametric Model: Set of GMMs Training Data T T Parameter Estimation: EM Algorithm Annotation Annotation Vectors (y) Audio-Feature Extraction (X) Novel Song Evaluation Inference Music Review (annotation) (retrieval) Text Query Vocabulary
Quantifying Annotation Our system annotates the CAL500 songs with 10 words from our vocabulary of 174 words.
Quantifying Retrieval We rank order song according to songs once for each query.
Demos • CAL Music Search Engine - a content-based semantic music search engine. • Listen Game - a ‘game with a purpose’ to collect semantic annotations of music.
What’s on tap… Large-scale system • Web-based, large-scale collection of reliable human annotations => Multiplayer, online game - Listen Game - ISMIR 07 • Prune and extend vocabulary (automatically) - ISMIR 07 • Novel Applications - Music Search Engine / Radio Player Personalized search • Model homogeneous groups / individuals rather than population • Personalized Audio Search • Adjust to affective state of the user Novel query paradigms • Query-by-semantic-example - ICASSP 07 • Heterogeneous queries
“Talking about music is like dancing about architecture”- origins unknown cosmal.ucsd.edu/cal
System Overview Data Features Modeling Evaluation Parametric Model: Set of GMMs Training Data T T Parameter Estimation: EM Algorithm Annotation Annotation Vectors (y) Audio-Feature Extraction (X) Novel Song Evaluation Inference Music Review (annotation) (retrieval) Text Query Vocabulary
Acoustic Representation Calculating Delta MFCC feature vectors • Calculate a time-series for 13 MFCCs • Append 1st and 2nd instantaneous derivatives • 5,200 39-dimensional feature vectors per minute of audio content • Denoted by X = {x1,…, xT} where T depends on the length of the song Short-Time Fourier Transform Time Series of MFCCs Reconstructed based on MFCCs (log frequency)
Three Approaches to Parameter Estimation For each word w, we want to estimate the parameters of a GMM p(x|w). • Direct Estimation • Take the union of sets of feature vectors for each song that are semantically associated with w. • Use the ‘standard’ Expectation Maximization(EM) for estimating the parameters for a GMM Direct Estimation is computationally difficult and empirically converges to poor local optima.
Three Approaches to Parameter Estimation For each word w, we want to estimate the parameters of a GMM. 2. Model Averaging • For each song associated with w, estimate a ‘song GMM’ using the standard EM algorithm - p(x|s) • Concatenate mixture components and renormalize mixture weights Model averaging produces a distribution with a variable number of mixture component. As the training set size grows, evaluating this distribution become prohibitively expensive.
Mixture Hierarchices Density Estimation For each word w, we want to estimate the parameters of a GMM. Mixture Hierarchies EM • For each song associated with w, estimate a ‘song GMM’ using the standard EM algorithm. • Learn a ‘mixture of mixture components’ using the Mixture Hierarchies EM algorithm [Vasconcelos01] Notes: • Computationally efficient for both parameter estimation and inference. • Each song is re-represented as a ‘smoothed’ estimate of bag-of-feature vectors. • Combining song models abstracts the semantic of a common word.
Quantifying Annotation Our system annotates the Cal-500 songs with 10 words from our vocabulary of 173 words. • ‘Population Annotation’ Ground Truth Metric: ‘Word’ Precision & Recall Consider word w, Precision = # songs correctly annotated with w # songs annotated with w Recall = # songs correctly annotated with w # songs that should have been annotated w Mean Word Recall and Word Precision are the averages over all words in our vocabulary.
Quantifying Annotation Our system annotates the Cal-500 songs with 10 words from our vocabulary of 174 words. Compared with a human, our model is • worse on objective categories - instrumentation, genre • about the same on subjective categories - emotion, usage
Quantifying Retrieval For each 1-, 2-, & 3-word query for which there is at least 5 songs in the ground truth, we rank order test set songs according KL divergence between the query multinomial and the semantic multinomial Metric: Area under the ROC Curve (AROC) • An ROC curve is a plot of the true positive rate as a function of the false positive rate as we move down this ranked list of songs. • Integrating the curve gives us a scalar between 0 and 1 where 0.5 is the expected value when randomly guessing. Mean AROC is the average AROC over a large number of queries.
A biased view of Music Classification 2000-03: Music classification (by genre, emotion, instrumentation) becomes a popular MIR task • Undergrad Thesis on Genre Classification with G. Tzanetakis 2003-04: MIR community starts to criticize music classification problems • ill-posed problem due to subjectivity • not an end in itself • performance ‘glass ceiling’ 2004-06: Focus turns to Music Similarity research • Recommendation • Playlist generation 2006-07: We view Music Annotation as a supervised multi-class labeling problem • Like classification but with large, less-restrictive vocabulary
Modeling Semantic Classes Given a set of word models p(x|w) over a vocabulary of words, Annotation: Given a novel song, we pick words by comparing the likelihood of the audio features under each word model. Retrieval: Given a text query, we pick songs that are likely under the word models associated with the words in the query.