
Issues in Text Similarity and Categorization




  1. Issues in Text Similarity and Categorization Jordan Smith – MUMT 611 – 27 March 2008

  2. Outline • Why text? • Text categorization: • Some sample problems • Comparison to MIR • Document indexing • Detailed example

  3. Why text? • 28.9% of MIR queries refer to lyric fragments (Bainbridge et al. 2003) • Easy to collect! (Knees et al. 2005, Geleijnse & Korst 2006) • Accurate ground truth (Logan et al. 2004) • Information about mood, “content”

  4. Why text? Potential applications: • Genre, mood categorization (Maxwell 2007) • Similarity searches (Mahedero et al. 2005) • Hit-song prediction (Dhanaraj & Logan 2005) • Musical document retrieval (Google) • Accompany query-by-humming (Suzuki et al. 2007, Fujihara et al. 2006)

  5. Some text categorization problems • Indexing • Document organization • Filtering • Web content hierarchy • Language identification etc.

  6. What is text categorization? "Text categorization may be defined as the task of assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of pre-defined categories." (Sebastiani 2002)
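To make the definition concrete, here is a minimal Python sketch. Everything in it is hypothetical: the two documents, the two categories, and the keyword-matching rule standing in for a trained classifier.

```python
# Toy illustration of the definition: assign a Boolean value to every
# pair <d_j, c_i> in D x C. The keyword rule below is a hypothetical
# stand-in for a trained classifier.

def categorize(document: str, category: str) -> bool:
    """Stand-in decision rule: True if the category name occurs in the text."""
    return category in document.lower()

D = ["Finance news: stocks fell sharply on Wall Street today",
     "Music review: the band released an album of love songs"]
C = ["finance", "music"]

for d in D:
    for c in C:
        print(f"<{d[:20]!r}, {c}> -> {categorize(d, c)}")
```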

  7. Text vs. music Text categorization: • extract features • train classifiers • evaluate classifiers Music classification: • extract features • train classifiers • evaluate classifiers The pipeline steps are the same; what is not the same is the content of each step.

  8. Text feature extraction • Convert each document $d_j$ into a vector $\vec{d}_j = \langle w_{1j}, w_{2j}, \ldots, w_{|T|j} \rangle$, where $T = \{t_1, t_2, \ldots, t_{|T|}\}$ is the set of terms. • Indexing systems differ in: • how the set of terms is defined • how the weights are computed

  9. Indexing techniques • “Set of words” indexing • Terms: every word that occurs in the corpus • Weights: binary
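A minimal sketch of set-of-words indexing, under an assumed two-document corpus and whitespace tokenization; it also instantiates the vector $\vec{d}_j$ from slide 8:

```python
# "Set of words" indexing on a hypothetical corpus: T is every word that
# occurs anywhere, and each document becomes a binary vector over T.

corpus = ["we all live in a yellow submarine",
          "yellow river flows to the sea"]

# T, in a fixed order so that vector positions line up across documents
terms = sorted({w for doc in corpus for w in doc.split()})

def binary_vector(doc):
    words = set(doc.split())
    return [1 if t in words else 0 for t in terms]

for doc in corpus:
    print(binary_vector(doc))
```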

  10. Indexing techniques • "Bag of words" indexing • Terms: every word that occurs in the corpus • Weights: tf-idf (term frequency times inverse document frequency): $\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}$ where $\#(t_k, d_j)$ is the frequency of term $t_k$ in document $d_j$, $|Tr|$ is the size of the training corpus, and $\#_{Tr}(t_k)$ is the number of documents that $t_k$ occurs in; the weights are then normalized so that document length does not dominate.
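The formula transcribes directly into code. The two-document training corpus and the whitespace tokenizer below are hypothetical:

```python
import math
from collections import Counter

# Direct transcription of the slide's formula:
#   tfidf(t_k, d_j) = #(t_k, d_j) * log(|Tr| / #_Tr(t_k))
# Tr, the training corpus, is hypothetical.

Tr = ["we all live in a yellow submarine",
      "yellow river flows to the sea"]

docs = [doc.split() for doc in Tr]
doc_freq = Counter(t for d in docs for t in set(d))  # #_Tr(t_k)

def tfidf(term, doc):
    tf = doc.count(term)  # #(t_k, d_j)
    return tf * math.log(len(Tr) / doc_freq[term])

print(tfidf("yellow", docs[0]))     # 0.0: occurs in every document
print(tfidf("submarine", docs[0]))  # ~0.69: rarer term, higher weight
```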

  11. Indexing techniques • Phrase indexing • Terms: all word sequences that occur in the corpus • Weights: binary, tf-idf
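A sketch of phrase indexing with bigrams as the word sequences, on a hypothetical corpus:

```python
# Phrase indexing with bigrams as the word sequences; corpus hypothetical.

corpus = ["the quick brown fox", "the quick red fox"]

def ngrams(doc, n=2):
    words = doc.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Terms: every bigram that occurs anywhere in the corpus
terms = sorted({g for doc in corpus for g in ngrams(doc)})

# Binary weights over the phrase terms, as on the slide
for doc in corpus:
    present = set(ngrams(doc))
    print([1 if t in present else 0 for t in terms])
```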

  12. Indexing techniques • “The Darmstadt Indexing Approach” • Terms: properties of the words, documents, categories • Weights: various

  13. Feature reduction techniques • Remove function words (the, for, in, etc.) • Remove the words that are least frequent: • in each document • in the corpus • What remains: low- and mid-frequency content words
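Both reductions in one hypothetical sketch; the stop list, the corpus, and the minimum corpus frequency are arbitrary choices:

```python
from collections import Counter

# Both reductions from the slide: drop function words, then drop the
# rarest words in the corpus. Stop list, corpus, and threshold are
# hypothetical choices.

STOP_WORDS = {"the", "for", "in", "a", "and", "to"}
MIN_CORPUS_FREQ = 2

corpus = ["the cat sat in the hat",
          "a cat and a dog sat for hours",
          "the dog ran to the park"]

counts = Counter(w for doc in corpus for w in doc.split())

kept = sorted(w for w, c in counts.items()
              if w not in STOP_WORDS and c >= MIN_CORPUS_FREQ)
print(kept)  # ['cat', 'dog', 'sat']: the mid-frequency content words
```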

  14. Feature reduction techniques [slide reproduces a table of term-selection functions from Sebastiani 2002; the table itself is not in the transcript]

  15. Feature reduction techniques Latent Semantic Analysis (LSA): • Search: "Demographic shifts in the U.S. with economic impact." • Result: "The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West." Note that the query and the retrieved passage share almost no terms; LSA matches them through latent associations between terms. (Sebastiani 2002)
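A minimal sketch of the mechanism behind this example: factor the term-document matrix with an SVD and keep only the top $k$ singular values. The tiny matrix and the choice $k = 2$ are hypothetical:

```python
import numpy as np

# LSA sketch: factor a (terms x documents) weight matrix with SVD and
# keep the top k singular values. The matrix and k are hypothetical.

A = np.array([[1.0, 1.0, 0.0],   # rows = terms
              [1.0, 0.0, 0.0],   # columns = documents
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# Cells that were zero typically become nonzero: terms pick up weight in
# documents they never occurred in, via latent term associations.
print(np.round(A_k, 2))
```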

  16. A word on speech • “Expert” feature reduction: • Rhymingness • Iambicness of meter
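Neither feature is standard; as a purely hypothetical illustration, a crude rhymingness score might compare the spelled endings of consecutive lines (a real system would work from phonetic transcriptions, not spelling):

```python
# Hypothetical "rhymingness" heuristic: the fraction of consecutive line
# pairs whose final words share a spelled suffix. A real system would
# compare phonetic transcriptions, not spelling.

def rhymingness(lyrics, suffix_len=2):
    lines = [l.strip() for l in lyrics.splitlines() if l.strip()]
    if len(lines) < 2:
        return 0.0
    endings = [l.split()[-1].lower() for l in lines]
    hits = sum(a[-suffix_len:] == b[-suffix_len:]
               for a, b in zip(endings, endings[1:]))
    return hits / (len(endings) - 1)

print(rhymingness("I saw a cat\nwearing a hat\nsat on a mat\nbeside a rat"))
# 1.0: every consecutive pair of lines ends in "-at"
```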

  17. Example: Hit song prediction Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London, UK. 488–91. Goal: • Measure some unknown, global, intrinsic property Features: • Acoustic: Mel-frequency cepstral coefficients (MFCCs) • Lyric: Probabilistic latent semantic analysis (PLSA) Classifiers: • Support vector machines • Boosting classifiers Corpus: • 1700 #1 hits from 1956 to 2004
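A sketch of the classification stage only, assuming scikit-learn is available; the feature vectors (which could stand for a song's PLSA topic weights or MFCC statistics) and the labels are made up:

```python
from sklearn.svm import SVC

# Classification stage only, with made-up feature vectors. Each row could
# stand for a song's PLSA topic weights or MFCC statistics; 1 = hit.

X = [[0.9, 0.1, 0.0],
     [0.8, 0.2, 0.1],
     [0.1, 0.7, 0.9],
     [0.0, 0.8, 0.8]]
y = [1, 1, 0, 0]

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[0.85, 0.15, 0.05]]))  # [1]: predicted hit
```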

  18. Example: Hit song prediction • Results of PLSA: the best-performing features are contraindicators; they are more useful for predicting that a song will not be a hit than that it will.

  19. Example: Genre classification Logan, B., A. Kositsky and P. Moreno. 2004. Semantic Analysis of Song Lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1-7.

  20. References
Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. Proceedings of the International Conference on Music Information Retrieval, London, UK. 488–91.
Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of the IEEE International Conference on Multimedia and Expo. 1–7.
Mahedero, J. P. G., Á. Martínez, and P. Cano. 2005. Natural language processing of lyrics. Proceedings of the 13th Annual ACM International Conference on Multimedia. 475–8.
Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features. M.Sc. thesis, University of Edinburgh.
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34 (1): 1–47.

  21. Query-by-asking
