Issues in Text Similarity and Categorization Jordan Smith – MUMT 611 – 27 March 2008
Outline • Why text? • Text categorization: • Some sample problems • Comparison to MIR • Document indexing • Detailed example
Why text? • 28.9% of MIR queries refer to lyric fragments (Bainbridge et al. 2003) • Easy to collect! (Knees et al. 2005, Geleijnse & Korst 2006) • Accurate ground truth (Logan et al. 2004) • Information about mood, “content”
Why text? Potential applications: • Genre, mood categorization (Maxwell 2007) • Similarity searches (Mahadero et al. 2005) • Hit-song prediction (Dhanaraj & Logan 2005) • Musical document retrieval (Google) • Accompany query-by-humming (Suzuki et al. 2007, Fujihara et al. 2006)
Some text categorization problems • Indexing • Document organization • Filtering • Web content hierarchy • Language identification etc.
What is text categorization? “Text categorization may be defined as the task of assigning a Boolean value to each pair <dj, ci> ∈ D × C, where D is a domain of documents and C = {c1, …, c|C|} is a set of pre-defined categories.” (Sebastiani 2002)
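As a minimal sketch of this definition, the decisions can be held in a table with one Boolean per document-category pair. The documents, categories, and keyword rule below are hypothetical placeholders, not taken from Sebastiani (2002); a real system would learn the decision function from training data.

```python
from itertools import product

# Hypothetical toy domain D of documents and set C of categories.
documents = {
    "d1": "love me tender love me true",
    "d2": "highway to hell",
}
categories = ["love", "travel"]

# Hypothetical keyword rule standing in for a trained classifier.
KEYWORDS = {"love": {"love", "tender"}, "travel": {"highway", "road"}}


def decide(text, category):
    """Return True if the document is filed under the category."""
    return bool(set(text.split()) & KEYWORDS[category])


# One Boolean value for each pair <dj, ci> in D x C.
decisions = {
    (d, c): decide(text, c)
    for (d, text), c in product(documents.items(), categories)
}
```

The table has |D| · |C| entries, so a document may belong to several categories or to none.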
Text vs. music Text categorization: • extract features • train classifiers • evaluate classifier Music classification: • extract features • train classifiers • evaluate classifier (Slide labels: “Not the same” and “Same”)
Text feature extraction • Convert each document dj into a vector dj = <w1j, w2j, …, w|T|j> where T is the set of terms {t1, t2, … t|T|}. • Different indexing systems: • Definition of set of terms • Computation of weights
Indexing techniques • “Set of words” indexing • Terms: every word that occurs in the corpus • Weights: binary
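A minimal sketch of set-of-words indexing, assuming whitespace tokenization and lowercasing (both are simplifications chosen here, not prescribed by the slides). The two-line corpus is a hypothetical example.

```python
def build_terms(corpus):
    """Sorted list of every distinct word in the corpus (the term set T)."""
    return sorted({word for doc in corpus for word in doc.lower().split()})


def binary_vector(doc, terms):
    """Weight w_kj is 1 if term t_k occurs in document d_j, else 0."""
    present = set(doc.lower().split())
    return [1 if term in present else 0 for term in terms]


# Hypothetical two-document corpus.
corpus = ["Yesterday all my troubles", "All you need is love"]
terms = build_terms(corpus)
vectors = [binary_vector(doc, terms) for doc in corpus]
```

Every document vector has length |T|, so the representation is sparse: most entries are 0 once the corpus grows.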
Indexing techniques • “Bag of words” indexing • Terms: every word that occurs in the corpus • Weights: tf-idf • term frequency / inverse document frequency: tf-idf(tk, dj) = #(tk, dj) · log( |Tr| / #Tr(tk) ) • #(tk, dj): frequency of term tk in document dj • #Tr(tk): number of documents that tk occurs in • |Tr|: number of documents in the training set (Another callout on the slide reads “Normalization”)
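The tf-idf weight above can be sketched as follows. This assumes whitespace tokenization, lowercasing, and the natural logarithm (the slide does not fix the base); the corpus is a hypothetical example and no length normalization is applied.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Return the term list and one tf-idf vector per document."""
    docs = [doc.lower().split() for doc in corpus]
    terms = sorted({w for doc in docs for w in doc})
    # #Tr(tk): number of documents in which term tk occurs.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    n_docs = len(docs)  # |Tr|
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # #(tk, dj) for every term in this document
        vectors.append(
            [counts[t] * math.log(n_docs / doc_freq[t]) for t in terms]
        )
    return terms, vectors


# Hypothetical three-document corpus.
corpus = ["love love me do", "all you need is love", "let it be"]
terms, vectors = tfidf_vectors(corpus)
```

A term that occurs in every document gets weight 0, since log(|Tr| / |Tr|) = 0; rare terms that repeat within a document get the largest weights.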
Indexing techniques • Phrase indexing • Terms: all word sequences that occur in the corpus • Weights: binary, tf-idf
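One simple way to generate phrase terms is to enumerate word n-grams. The sketch below assumes whitespace tokenization and caps the phrase length at a hypothetical `max_n`; the slide's "all word sequences" would correspond to leaving that cap unbounded.

```python
def word_ngrams(text, max_n=2):
    """List every run of 1..max_n consecutive words in the text."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


phrases = word_ngrams("all you need is love", max_n=2)
```

The resulting phrases can then be weighted exactly like single words, with binary or tf-idf weights.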
Indexing techniques • “The Darmstadt Indexing Approach” • Terms: properties of the words, documents, categories • Weights: various
Feature reduction techniques • Remove function words (the, for, in, etc.) • Remove words that are least frequent: • in each document • in the corpus Remainder: low and mid-range frequency words
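A minimal sketch of two of these reductions: dropping function words and dropping terms that occur in too few documents of the corpus. The stop list and the `min_doc_freq` threshold are hypothetical values chosen for illustration, and the per-document frequency filter is not shown.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list of function words.
STOP_WORDS = {"the", "for", "in", "a", "of", "to", "and"}


def reduce_terms(corpus, min_doc_freq=2):
    """Keep terms that are not function words and occur in enough documents."""
    docs = [set(doc.lower().split()) for doc in corpus]
    doc_freq = Counter(w for doc in docs for w in doc)
    return sorted(
        t for t, df in doc_freq.items()
        if t not in STOP_WORDS and df >= min_doc_freq
    )


# Hypothetical three-document corpus.
corpus = [
    "the girl in the song",
    "a song for the girl",
    "dancing in the street",
]
kept = reduce_terms(corpus)
```

Here only "girl" and "song" survive: "the" and "in" are removed as function words, and the remaining words each occur in a single document.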
Feature reduction techniques (slide image not preserved; source: Sebastiani 2002)
Feature reduction techniques Latent Semantic Analysis (LSA): • Search: • Demographic shifts in the U.S. with economic impact. • Result: • The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West. Sebastiani 2002
A word on speech • “Expert” feature reduction: • Rhymingness • Iambicness of meter
Example: Hit song prediction Dhanaraj, R. and B. Logan. 2005. Automatic Prediction of Hit Songs. International Conference on Music Information Retrieval, London UK. 488–91. Goal: • Measure some unknown, global, intrinsic property Features: • Acoustic: Mel-frequency cepstral coefficients (MFCCs) • Lyric: Probabilistic Latent Semantic Analysis (PLSA) Classifiers: • Support vector machines • Boosting classifiers Corpus: • 1700 #1 hits from 1956 to 2004
Example: Hit song prediction • Results of PLSA: the best features act as contraindications, i.e. they signal that a song is unlikely to be a hit
Example: Genre classification Logan, B., A. Kositsky and P. Moreno. 2004. Semantic Analysis of Song Lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1-7.
References
Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London UK. 488–91.
Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1–7.
Mahadero, J., Á. Martínez, and P. Cano. 2005. Natural language processing of lyrics. Proceedings of the 13th Annual ACM International Conference on Multimedia. 475–8.
Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features. M.Sc. Thesis. University of Edinburgh.
Sebastiani, F. 1999. Machine learning in automated text categorization. Technical report, Consiglio Nazionale delle Ricerche. Pisa, Italy. 1–59.