Identifying Words that are Musically Meaningful. David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet. Computer Audition Lab, UC San Diego. ISMIR, September 25, 2007. Introduction. Our Goal: Create a content-based music search engine for natural language queries.
Identifying Words that are Musically Meaningful David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet Computer Audition Lab UC San Diego ISMIR September 25, 2007
Introduction
Our Goal: Create a content-based music search engine for natural language queries.
• CAL Music Search Engine [SIGIR07]
Problem: how do we pick a vocabulary of musically meaningful words?
• A word is musically meaningful when its presence corresponds to a pattern in the audio content
Solution: find words that are correlated with a set of acoustic signals
Two-View Representation
Consider a set of annotated songs. Each song is represented by:
• Annotation vector in a Semantic Space
• Audio feature vector(s) in an Acoustic Space
[Figure: three example songs (Riverdance by Bill Whelan, Mustang Sally by The Commitments, Hot Pants by James Brown) shown in a 2D Semantic Space with axes ‘funky’ and ‘Ireland’ and in a 2D Acoustic Space with axes x and y]
Semantic Representation
Vocabulary of words:
• CAL500: 174 phrases from a human survey
  • Instrumentation, genre, emotion, usages, vocal characteristics
• LastFM: ~15,000 tags from social music site
• Web Mining: 100,000+ words mined from text documents
Annotation Vector, denoted s
• Each element represents the ‘semantic association’ between a word and the song.
• Dimension (DS) = size of vocabulary
• Example: Frank Sinatra’s ‘Fly Me to the Moon’
  • Vocabulary = {funk, jazz, guitar, female vocals, sad, passionate}
  • Annotation (si) = [0/4, 3/4, 4/4, 0/4, 2/4, 1/4]
Data is represented by an N x DS matrix S whose rows are the annotation vectors:
$$S = \begin{bmatrix} s_1 \\ \vdots \\ s_i \\ \vdots \\ s_N \end{bmatrix}$$
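The Frank Sinatra example can be reproduced with a few lines of Python. This is only a sketch: it assumes that each element of the annotation vector is the fraction of listeners who applied the word, and the four listener responses below are made up so that the result matches the slide.

```python
# Vocabulary taken from the slide's example.
vocabulary = ["funk", "jazz", "guitar", "female vocals", "sad", "passionate"]

# Hypothetical survey responses: the set of words each of 4 listeners chose.
listener_labels = [
    {"jazz", "guitar", "sad"},
    {"jazz", "guitar", "sad", "passionate"},
    {"jazz", "guitar"},
    {"guitar"},
]

def annotation_vector(vocabulary, listener_labels):
    """Fraction of listeners who applied each vocabulary word to the song."""
    n = len(listener_labels)
    return [sum(word in labels for labels in listener_labels) / n
            for word in vocabulary]

s_i = annotation_vector(vocabulary, listener_labels)
print(s_i)
```

Stacking one such vector per song, row by row, gives the N x DS matrix S.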
Acoustic Representation
Each song is represented by an audio feature vector a that is automatically extracted from the audio content.
Data is represented by an N x DA matrix A whose rows are the audio feature vectors:
$$A = \begin{bmatrix} a_1 \\ \vdots \\ a_i \\ \vdots \\ a_N \end{bmatrix}$$
[Figure: Mustang Sally by The Commitments shown in the 2D Semantic Space (axes ‘funky’, ‘Ireland’) and in the 2D Acoustic Space (axes x, y)]
Canonical Correlation Analysis (CCA)
CCA is a technique for exploring dependencies between two related spaces.
• Generalization of PCA to multiple spaces
• Constrained optimization problem
• Find weight vectors ws and wa:
  • 1-D projection of data in the semantic space: Sws
  • 1-D projection of data in the acoustic space: Awa
• Maximize the correlation of the projections
• Constrain ws and wa to prevent infinite correlation
$$\max_{w_a,\, w_s} \; (S w_s)^T (A w_a) \quad \text{subject to} \quad (S w_s)^T (S w_s) = 1, \quad (A w_a)^T (A w_a) = 1$$
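For readers who want to try the optimization above, here is a minimal NumPy sketch of ordinary (non-sparse) CCA. It is not the authors' code: it assumes the columns of S and A are mean-centered first, adds a small ridge term `reg` for numerical stability, and solves the problem by whitening both views and taking the top singular vectors. The function name and the toy data are hypothetical.

```python
import numpy as np

def first_canonical_pair(S, A, reg=1e-8):
    """Return (ws, wa, corr) maximizing (S ws)^T (A wa) under unit-norm projections."""
    S = S - S.mean(axis=0)                      # assumption: center each column
    A = A - A.mean(axis=0)
    Css = S.T @ S + reg * np.eye(S.shape[1])    # small ridge keeps Cholesky stable
    Caa = A.T @ A + reg * np.eye(A.shape[1])
    Csa = S.T @ A
    Ls = np.linalg.cholesky(Css)                # Css = Ls Ls^T
    La = np.linalg.cholesky(Caa)                # Caa = La La^T
    M = np.linalg.solve(Ls, Csa)                # Ls^{-1} Csa
    M = np.linalg.solve(La, M.T).T              # Ls^{-1} Csa La^{-T}
    U, sv, Vt = np.linalg.svd(M)
    ws = np.linalg.solve(Ls.T, U[:, 0])         # undo the whitening
    wa = np.linalg.solve(La.T, Vt[0])
    return ws, wa, sv[0]

# Toy data: word 0 shares a latent signal with audio dimension 0; word 1 is noise.
rng = np.random.default_rng(0)
N = 200
latent = rng.standard_normal(N)
S = np.column_stack([latent + 0.1 * rng.standard_normal(N), rng.standard_normal(N)])
A = np.column_stack([latent + 0.1 * rng.standard_normal(N), rng.standard_normal(N)])

ws, wa, corr = first_canonical_pair(S, A)
print("ws =", ws, "wa =", wa, "correlation =", corr)
```

On this toy data the weight on the noise word is small but generally not exactly zero, which is the kind of dense solution that motivates a sparse variant of CCA.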
CCA Visualization
[Figure: worked example with four songs (a, b, c, d) plotted in the semantic space (axes ‘funky’, ‘Ireland’) and in the audio feature space (axes x, y). The data matrices S and A are projected onto ws and wa, giving (Sws)T(Awa) = 4. The resulting ws is a sparse solution.]
What Sparsity means…
In the previous example,
• ws,‘funky’ ≠ 0 → ‘funky’ is correlated with audio signals → a musically meaningful word
• ws,‘Ireland’ = 0 → ‘Ireland’ is not correlated → no linear relationship with the acoustic representation
In practice, ws is dense even if most words are uncorrelated
• ‘dense’ means many non-zero values
• due to random variability in the data
Key Idea: reformulate CCA to produce a sparse solution.
Introducing Sparse CCA [ICML07]
Plan: penalize the objective function for each non-zero semantic dimension
• Pick a penalty function f(ws)
  • Penalizes each non-zero dimension
• Take 1: Cardinality of ws: $f(w_s) = \|w_s\|_0$
  • Combinatorial problem, NP-hard
• Take 2: L1 relaxation: $f(w_s) = \|w_s\|_1$
  • Non-convex, not very tight approximation
• Take 3: SDP (semidefinite programming) relaxation
  • Prohibitively expensive for large problems
• Solution: $f(w_s) = \sum_i \log |w_{s,i}|$
  • Non-convex problem, but
  • Can be solved efficiently with a DC (difference of convex functions) program
  • Tight approximation
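To get a feel for why a log-based penalty can track cardinality more tightly than the L1 norm, the short sketch below compares the three penalties on two toy vectors. The slide's penalty is the sum of log |ws,i|; the smoothing constant `eps` and the normalisation used here are assumptions added only so that the value is finite at zero and directly comparable with a count of non-zeros.

```python
import math

def cardinality(w):
    """Number of non-zero entries (the L0 'norm')."""
    return sum(1 for x in w if x != 0.0)

def l1_norm(w):
    """Sum of absolute values."""
    return sum(abs(x) for x in w)

def log_penalty(w, eps=1e-6):
    """Smoothed, normalised log penalty (assumed form, not taken from the slides).

    A zero entry contributes 0; a clearly non-zero entry contributes roughly 1,
    so the total approaches the cardinality as eps shrinks.
    """
    return sum(math.log(1 + abs(x) / eps) for x in w) / math.log(1 + 1 / eps)

w_sparse = [0.5, 0.0, -2.0, 0.0]        # 2 non-zero weights
w_small = [0.01, 0.01, 0.01, 0.01]      # 4 tiny but non-zero weights

for name, w in [("sparse", w_sparse), ("small", w_small)]:
    print(name, cardinality(w), l1_norm(w), log_penalty(w))
```

The L1 norm barely notices the four tiny weights in the second vector, while the log penalty still charges a substantial amount for each of them, which is closer to what the cardinality does.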
Introducing Sparse CCA [ICML07]
Plan: penalize the objective function for each non-zero semantic dimension
• Pick a penalty function f(ws)
  • Penalizes each non-zero dimension
  • $f(w_s) = \sum_i \log |w_{s,i}|$
• Use a tuning parameter (written λ here) to control the importance of sparsity
  • Increasing λ → smaller set of ‘musically relevant’ words
$$\max_{w_a,\, w_s} \; (S w_s)^T (A w_a) - \lambda f(w_s) \quad \text{subject to} \quad (S w_s)^T (S w_s) = 1, \quad (A w_a)^T (A w_a) = 1$$
Experimental Setup
CAL500 Data Set [SIGIR07]
• 500 songs by 500 artists
• Semantic Representation
  • 173 words
  • genre, instrumentation, usages, emotions, vocals, etc.
  • Annotation vector is the average from 3+ listeners
  • Word Agreement Score: measures how consistently listeners apply a word to songs
• Acoustic Representation
  • Bag of Dynamic MFCC Vectors [McKinney03]
  • 52-D vector of spectral modulation intensities
  • 160 vectors per minute of audio content
  • Duplicate the annotation vector for each Dynamic MFCC vector
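The last step of the setup, duplicating a song's annotation vector once per Dynamic MFCC vector so that S and A have matching rows, might look like the following NumPy sketch. The sizes are toy values chosen for readability (CAL500 uses 173 words and 52-D features), and the random features are placeholders rather than real audio descriptors.

```python
import numpy as np

# Hypothetical toy data: 2 songs, a 3-word vocabulary, 4-D audio features.
annotations = np.array([[0.0, 0.75, 1.0],
                        [1.0, 0.0, 0.25]])
frames_per_song = [3, 2]                       # feature vectors extracted per song

rng = np.random.default_rng(0)
features = [rng.standard_normal((n, 4)) for n in frames_per_song]

# Stack all feature vectors into A, and repeat each song's annotation vector
# once per feature vector so that row k of S describes row k of A.
A = np.vstack(features)
S = np.repeat(annotations, frames_per_song, axis=0)

print(S.shape, A.shape)
```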
Experiment 1: Qualitative Results
Words with high acoustic correlation:
hip-hop, arousing, sad, drum machine, heavy beat, at a party, rapping
Words with no acoustic correlation:
classic rock, normal, constant energy, going to sleep, falsetto
Experiment 2: Vocabulary Pruning
AMG2131 Text Corpus [ISMIR06]
• AMG Allmusic song reviews for most of the CAL500 songs
• 315-word vocabulary
• Annotation vector based on the presence or absence of a word in the review
• Noisier word-song relationships than CAL500
Experimental Design:
• Merge vocabularies: 173 + 315 = 488 words
• Prune noisy words as we increase the amount of sparsity in CCA
Hypothesis:
• AMG words will be pruned before CAL500 words
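As a rough illustration of the presence-or-absence annotation and the vocabulary merge, consider the sketch below. The review text, the word lists, and the simple tokenizer are all hypothetical, and only single-word vocabulary entries are handled.

```python
import re

def review_annotation(review_text, vocabulary):
    """1.0 if a vocabulary word occurs in the review, else 0.0."""
    tokens = set(re.findall(r"[a-z']+", review_text.lower()))
    return [1.0 if word in tokens else 0.0 for word in vocabulary]

# Hypothetical stand-ins for the two vocabularies and one song's data.
survey_vocabulary = ["funk", "jazz"]
survey_vector = [0.0, 0.75]                     # averaged listener annotations
review_vocabulary = ["funky", "horns", "mellow"]
review = "A funky, upbeat classic with soulful horns."

review_vector = review_annotation(review, review_vocabulary)

# Merging vocabularies amounts to concatenating the two annotation vectors.
merged_vocabulary = survey_vocabulary + review_vocabulary
merged_vector = survey_vector + review_vector
print(dict(zip(merged_vocabulary, merged_vector)))
```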
Experiment 2: Vocabulary Pruning
Experimental Design:
• Merge vocabularies: 488 words
• Prune noisy words as we increase the amount of sparsity in CCA
Result: As Sparse CCA becomes more aggressive, more AMG words are pruned.
Experiment 3: Vocabulary Selection
Experimental Design:
• Rank words by
  • how aggressive Sparse CCA is before the word gets pruned.
  • how consistently humans use a word across the CAL500 corpus.
• As we decrease vocabulary size, calculate the average AROC
Result: Sparse CCA does predict words that have better AROC
[Figure: average AROC (y-axis, 0.68 to 0.76) versus vocabulary size (x-axis: 173, 120, 70, 20)]
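The evaluation loop in this design can be sketched as follows. The word names and per-word AROC values are invented purely to show the mechanics and are not results from the study; `ranking` stands for either ordering of the words (by Sparse CCA pruning order or by human agreement), best word first.

```python
def average_aroc_curve(word_aroc, ranking, sizes):
    """Average AROC over the top-k ranked words, for each vocabulary size k."""
    return {k: sum(word_aroc[w] for w in ranking[:k]) / k for k in sizes}

# Hypothetical per-word AROC values and a hypothetical ranking.
word_aroc = {"word_a": 0.80, "word_b": 0.70, "word_c": 0.60, "word_d": 0.50}
ranking = ["word_a", "word_b", "word_c", "word_d"]

curve = average_aroc_curve(word_aroc, ranking, sizes=[4, 2, 1])
print(curve)
```

A ranking that is good at identifying easy-to-model words should produce a curve that rises as the vocabulary shrinks.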
Recap
Constructing a ‘meaningful vocabulary’ is the first step in building a content-based, natural-language search engine for music.
Given a semantic representation and an acoustic representation, Sparse CCA can be used to find ‘musically meaningful’ words.
• i.e., semantic dimensions linearly correlated with audio features
Automatically pruning words is important when using noisy sources of semantic information
• e.g., LastFM tags or Web documents
Future Work
Theory: moving beyond linear correlation with kernel methods
Application: Sparse CCA can be used to find ‘musically meaningful’ audio features by imposing sparsity in the acoustic space
Practice: handling large, noisy, semantically annotated music corpora
Experiment 3: Vocabulary Selection
Our content-based music search engine rank-orders songs given a text-based query [SIGIR07]
• Area under the ROC curve (AROC) measures the quality of each ranking
  • 0.5 is random, 1.0 is perfect
  • 0.68 is the average AROC for all 1-word queries
Can Sparse CCA pick words that will have higher AROC?
• Idea: words with high correlation have more signal in the audio representation and will be easier to model.
• How does it compare with picking words that humans consistently use to label songs?
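For reference, AROC for a single query can be computed directly from its definition as the probability that a relevant song is ranked above an irrelevant one. The sketch below uses that pairwise formulation with made-up scores; it is not the evaluation code used in the study.

```python
def aroc(scores, relevant):
    """Area under the ROC curve via pairwise comparisons (ties count as half)."""
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant song")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical relevance scores for four songs given a one-word query.
relevant = [True, True, False, False]
perfect = aroc([0.9, 0.8, 0.3, 0.1], relevant)   # every relevant song ranked first
mixed = aroc([0.9, 0.3, 0.8, 0.1], relevant)     # one irrelevant song outranks a relevant one
tied = aroc([0.5, 0.5, 0.5, 0.5], relevant)      # uninformative scores
print(perfect, mixed, tied)
```

The three toy cases correspond to a perfect ranking, a partly wrong ranking, and a ranking that carries no information.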