Issues in Text Similarity and Categorization

Jordan Smith – MUMT 611 – 27 March 2008

Outline • Why text? • Text categorization: • Some sample problems • Comparison to MIR • Document indexing • Detailed example

Why text? • 28.9% of MIR queries refer to lyric fragments (Bainbridge et al. 2003) • Easy to collect! (Knees et al. 2005, Geleijnse & Korst 2006) • Accurate ground truth (Logan et al. 2004) • Information about mood, “content”

Why text? Potential applications: • Genre, mood categorization (Maxwell 2007) • Similarity searches (Mahadero et al. 2005) • Hit-song prediction (Dhanaraj & Logan 2005) • Musical document retrieval (Google) • Accompany query-by-humming (Suzuki et al. 2007, Fujihara et al. 2006)

Some text categorization problems • Indexing • Document organization • Filtering • Web content hierarchy • Language identification etc.

What is text categorization? “Text categorization may be defined as the task of assigning a Boolean value to each pair <dj, ci> ∈ D × C, where D is a domain of documents and C = {c1, …, c|C|} is a set of pre-defined categories.” (Sebastiani 2002)
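The definition above can be sketched as a table of Boolean decisions, one per (document, category) pair. This is only an illustration: the toy documents, the category names, and the keyword rule in `assign` are hypothetical stand-ins for a trained classifier and do not come from Sebastiani (2002).

```python
from itertools import product

# Hypothetical toy data: D is a set of documents, C a set of categories.
documents = {"d1": "love me tender love me true", "d2": "highway to hell"}
categories = ["love", "rock"]

# Hypothetical keyword rules standing in for a trained classifier.
keywords = {"love": {"love", "tender"}, "rock": {"highway", "hell"}}


def assign(doc_text, category):
    """Return True if the document should be filed under the category."""
    return bool(set(doc_text.split()) & keywords[category])


# One Boolean value for every pair <dj, ci> in D x C.
decisions = {
    (doc_id, cat): assign(text, cat)
    for (doc_id, text), cat in product(documents.items(), categories)
}
```

Because every pair gets its own decision, a document may belong to several categories or to none.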

Text vs. music • Text categorization: • extract features • train classifiers • evaluate classifier • Music classification: • extract features • train classifiers • evaluate classifier • Not the same: feature extraction • Same: training and evaluating classifiers

Text feature extraction • Convert each document dj into a vector dj = <w1j, w2j, …, w|T|j> where T is the set of terms {t1, t2, … t|T|}. • Different indexing systems: • Definition of set of terms • Computation of weights

Indexing techniques • “Set of words” indexing • Terms: every word that occurs in the corpus • Weights: binary
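A minimal sketch of "set of words" indexing, assuming lowercase whitespace tokenization (the slides do not specify a tokenizer). The function names `build_vocabulary` and `binary_vector` and the two-line corpus are hypothetical.

```python
def build_vocabulary(corpus):
    """Collect every distinct word in the corpus, in sorted order."""
    return sorted({word for doc in corpus for word in doc.lower().split()})


def binary_vector(doc, vocabulary):
    """Weight is 1 if the term occurs in the document, otherwise 0."""
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


corpus = ["love me do", "all you need is love"]
vocabulary = build_vocabulary(corpus)
vectors = [binary_vector(doc, vocabulary) for doc in corpus]
```

Each document becomes a vector of length |T|, so documents of different lengths are directly comparable.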

Indexing techniques • “Bag of words” indexing • Terms: every word that occurs in the corpus • Weights: tf-idf (term frequency / inverse document frequency): tf-idf(tk, dj) = #(tk, dj) · log( |Tr| / #Tr(tk) ) • #(tk, dj): frequency of term tk in document dj • #Tr(tk): number of documents that tk occurs in • Normalization
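The tf-idf weight can be computed directly from word counts. The sketch below assumes whitespace tokenization and the natural logarithm (the slide does not state the base), treats the whole toy corpus as the training set Tr, and leaves out the normalization step. The function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(corpus):
    """Return (vocabulary, vectors) with tf-idf(tk, dj) = #(tk, dj) * log(|Tr| / #Tr(tk))."""
    docs = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({word for doc in docs for word in doc})
    # #Tr(tk): number of documents in which term tk occurs at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)  # |Tr|
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # #(tk, dj) for every term in this document
        vectors.append(
            [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors


corpus = ["love love me do", "all you need is love"]
vocabulary, vectors = tf_idf_vectors(corpus)
```

In this toy corpus "love" occurs in every document, so its weight is zero everywhere: a term that appears in all documents cannot help tell them apart.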

Indexing techniques • Phrase indexing • Terms: all word sequences that occur in the corpus • Weights: binary, tf-idf
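Phrase indexing can be sketched by treating word n-grams as terms. The slide speaks of all word sequences; capping the length at `max_n` is an assumption made here to keep the term set small, and both function names are hypothetical.

```python
def ngrams(words, n):
    """Return all contiguous word sequences of length n."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_terms(corpus, max_n=2):
    """Collect every word sequence of length 1..max_n found in the corpus."""
    terms = set()
    for doc in corpus:
        words = doc.lower().split()
        for n in range(1, max_n + 1):
            terms.update(ngrams(words, n))
    return sorted(terms)


terms = phrase_terms(["love me do"])
```

The resulting terms can then be weighted with binary or tf-idf weights exactly as single words are.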

Indexing techniques • “The Darmstadt Indexing Approach” • Terms: properties of the words, documents, categories • Weights: various

Feature reduction techniques • Remove function words (the, for, in, etc.) • Remove words that are least frequent: • in each document • in the corpus • Remainder: low and mid-range frequency words
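A rough sketch of the two filters above, assuming a small hand-picked function-word list and a corpus-level document-frequency threshold. `FUNCTION_WORDS`, `reduce_terms`, and `min_doc_count` are hypothetical names, and the per-document frequency filter is not covered.

```python
from collections import Counter

# Small illustrative list; a real system would use a fuller stop list.
FUNCTION_WORDS = {"the", "for", "in", "a", "of", "to", "and"}


def reduce_terms(corpus, min_doc_count=2):
    """Drop function words and terms that occur in too few documents."""
    docs = [doc.lower().split() for doc in corpus]
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return sorted(
        term
        for term, count in doc_freq.items()
        if term not in FUNCTION_WORDS and count >= min_doc_count
    )


corpus = ["the love in the air", "love for the road", "the road to nowhere"]
kept = reduce_terms(corpus)
```

Here eight distinct words shrink to two, which shows how sharply even simple filters reduce the dimensionality |T|.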

Feature reduction techniques • (Figure: Sebastiani 2002)

Feature reduction techniques • Latent Semantic Analysis (LSA): • Search: “Demographic shifts in the U.S. with economic impact.” • Result: “The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West.” (Sebastiani 2002)
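The example above works because LSA compares documents in a reduced space rather than by shared words. As a sketch of the standard formulation (the symbols X, U, Σ, V, and k are assumptions and do not appear on the slide): let X be the |T| × |D| term-document matrix, keep only the k largest singular values of its singular value decomposition, project a document vector d_j and a query vector q into the k-dimensional space, and compare them by cosine similarity.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{d}_j = \Sigma_k^{-1} U_k^{\top} d_j, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\mathrm{sim}(q, d_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Terms that tend to co-occur are merged into the same dimensions, so a query and a document can score as similar without sharing any word.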

A word on speech • “Expert” feature reduction: • Rhymingness • Iambicness of meter

Example: Hit song prediction (Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London UK. 488–91.) • Goal: measure some unknown, global, intrinsic property • Features: • Acoustic: Mel-frequency cepstral coefficients (MFCCs) • Lyric: probabilistic latent semantic analysis (PLSA) • Classifiers: • Support vector machines • Boosting classifiers • Corpus: 1700 #1 hits from 1956 to 2004

Example: Hit song prediction • Results of PLSA: the best features are contraindications, i.e. features indicating that a song is not a hit

Example: Genre classification (Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1–7.)

References
• Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London UK. 488–91.
• Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1–7.
• Mahadero, J., Á. Martínez, and P. Cano. 2005. Natural language processing of lyrics. Proceedings of the 13th Annual ACM International Conference on Multimedia. 475–8.
• Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features. M.Sc. Thesis. University of Edinburgh.
• Sebastiani, F. 1999. Machine learning in automated text categorization. Technical report, Consiglio Nazionale delle Ricerche. Pisa, Italy. 1–59.

Query-by-asking
