Semantic Similarity for Music Retrieval
Luke Barrington, Doug Turnbull, David Torres & Gert Lanckriet
Electrical & Computer Engineering, University of California, San Diego
lbarrington@ucsd.edu | http://cosmal.ucsd.edu/cal/

Audio & Text Features
Our models are trained on the CAL500 dataset, a heterogeneous collection of song/caption pairs:
- 500 popular western songs and a 146-word vocabulary; each track has been annotated by at least 3 humans.
- Audio content is represented as a bag of feature vectors: MFCC features plus their 1st and 2nd time deltas, about 10,000 feature vectors per minute of audio.
- Annotations are represented as a bag of words: a binary document vector of length 146.

Acoustic Models
Each song, s, is represented as a probability distribution, P(a|s), over the audio feature space, approximated by a Gaussian mixture model (GMM). A bag of features extracted from the song's audio content and the expectation-maximization (EM) algorithm are used to train the song-level GMMs.
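As a rough sketch of this step, the code below extracts an MFCC-plus-delta bag of features and fits a song-level GMM with standard EM. It is not the poster's implementation: it assumes the librosa and scikit-learn libraries, and the function names (song_features, song_gmm) and settings (13 MFCCs, 8 diagonal-covariance components) are illustrative choices rather than the configuration used for CAL500.

```python
# Sketch: one song as a GMM over MFCC + delta features (assumed libraries:
# librosa for audio features, scikit-learn for EM-trained GMMs).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def song_features(path, n_mfcc=13):
    """Bag of feature vectors: MFCCs plus their 1st and 2nd time deltas."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T   # one row per analysis frame

def song_gmm(path, n_components=8):
    """Song-level acoustic model P(a|s): a GMM fit to the song's features by EM."""
    X = song_features(path)
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=0).fit(X)
```

A fitted model's score_samples method then gives the per-frame log-likelihoods log P(x|s) used in the later stages.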
Semantic Models
Each word, w, is represented as a probability distribution, P(a|w), over the same audio feature space. The training data for a word-level GMM is the set of all song-level GMMs from songs labeled with word w; these song-level GMMs are combined into a word-level GMM using the mixture-hierarchies EM algorithm. The semantic model - the set of word-level GMMs - is the basis for song similarity.
[Figure: "Romantic" song-level GMMs combined by mixture-hierarchies EM into a "Romantic" word-level GMM.]

Sounds → Semantics: Semantic Multinomials
Using the learned word-level GMMs, P(a|wi), we compute the posterior probability of word wi given a song's features X = {x1, ..., xM}: P(wi|X) = P(X|wi) P(wi) / P(X). Assuming the xm are conditionally independent given wi, P(X|wi) = P(x1|wi) ··· P(xM|wi), and the song prior is estimated by summing over all words: P(X) = Σj P(X|wj) P(wj). Normalizing the posteriors over all words, we represent each song as a semantic multinomial distribution over the vocabulary.
This semantic understanding of audio signals enables retrieval of songs that, while acoustically different, are semantically similar to a query song. Given a query with a high-pitched, wailing electric guitar solo, a system based on acoustics alone might retrieve songs with screechy violins or a screaming female singer. Our system retrieves songs with semantically similar content: acoustic, classical, or distorted electric guitars.

Semantic Similarity
We represent every song as a semantic distribution: a point in a semantic space. A natural similarity measure in this space is the Kullback-Leibler (KL) divergence; given a query song, we retrieve the database songs that minimize the KL divergence with the query.
[Figure: the query and database songs as points in the semantic space; songs at a small KL divergence from the query are similar, distant ones are not.]
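To make the last two panels concrete, here is a minimal sketch of turning a song's bag of features into a semantic multinomial and ranking database songs by KL divergence against a query. It is a simplified outline, not the poster's code: the word-level models are assumed to be fitted scikit-learn GaussianMixture objects (one per vocabulary word), the word prior P(wi) is taken as uniform, and the names semantic_multinomial, kl_divergence, and rank_by_kl are hypothetical.

```python
# Sketch: semantic multinomial + KL-divergence retrieval.  Word models are
# assumed to be fitted sklearn GaussianMixture objects; the uniform word
# prior is a simplifying assumption.
import numpy as np

def semantic_multinomial(X, word_gmms):
    """P(wi|song), assuming frames x1..xM are conditionally independent given wi."""
    # Sum of per-frame log-likelihoods = log P(x1..xM | wi) under independence.
    log_lik = np.array([g.score_samples(X).sum() for g in word_gmms])
    log_post = log_lik - np.log(len(word_gmms))   # plus log of a uniform prior P(wi)
    log_post -= log_post.max()                    # for numerical stability
    post = np.exp(log_post)
    return post / post.sum()                      # normalize over the vocabulary

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two semantic multinomials."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def rank_by_kl(query_sm, database_sms):
    """Indices of database songs, ordered by increasing KL divergence from the query."""
    return np.argsort([kl_divergence(query_sm, sm) for sm in database_sms])
```

The retrieval functions apply unchanged when the word-level GMMs are trained with the mixture-hierarchies EM algorithm described above; only the way word_gmms is obtained differs.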