
Text Classification

Eric Doi, Harvey Mudd College, November 20th, 2008


Presentation Transcript


  1. Text Classification Eric Doi Harvey Mudd College November 20th, 2008

  2. Kinds of Classification • Language • Hello. My name is Eric. • Hola. Mi nombre es Eric. • こんにちは。私の名前はエリックである. • 你好。我叫Eric。

  3. Kinds of Classification • Type • “approaches based on n-grams obtain generalization by concatenating”* • To: eric_k_doi@hmc.edu Subject: McCain and Obama use it too You have received this message because you opted in to receives Sund Design pecial offers via email. Login to your member account to edit your email subscription . Click here to unsubscribe. • ACAAGATGCCATTGTCCCCCGGCCTCCTG *(Bengio)

  4. Difficulties • Dictionary? Generalization? • Over 500,000 words in the English language (and over one million if counting scientific words) • Typos/OCR errors • Loan words • We practice ballet at the café. • Nous pratiquons le ballet au café.

  5. Approaches • Unique letter combinations (language: string) • English: “ery” • French: “eux” • Gaelic: “mh” • Italian: “cchi” (Dunning, “Statistical Identification of Language”)

  6. Approaches • “Unique” letter combinations (language: string) • English: “ery” • French: “milieux” • Gaelic: “farmhand” • Italian: “zucchini” (Dunning, “Statistical Identification of Language”)

  7. Approaches • “Unique” letter combinations (language: string) • English: “ery” • French: “milieux” • Gaelic: “farmhand” • Italian: “zucchini” • Requires hand-coding; what about other languages (6000+)? (Dunning, “Statistical Identification of Language”)

  8. Approaches • Try to minimize: • Hand-coded knowledge • Training data • Input data (isolating phrases?) • Dunning, “Statistical Identification of Language.” 1994. • Bengio, “A Neural Probabilistic Language Model.” 2003.

  9. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: • Char-level trigrams:

  10. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: (Professor, Keller) • Char-level trigrams:

  11. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: (Keller, is) • Char-level trigrams:

  12. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: (is, not) • Char-level trigrams:

  13. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: (not, a) • Char-level trigrams:

  14. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: (a, goth) • Char-level trigrams:

  15. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: • Char-level trigrams:

  16. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: • Char-level trigrams: (P, r, o)

  17. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: • Char-level trigrams: (r, o, f)

  18. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: • Char-level trigrams: (o, f, e)

  19. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: • Char-level trigrams: (f, e, s)

  20. Statistical Approach: N-Grams • N-grams are sequences of n elements Professor Keller is not a goth. • Word-level bigrams: • Char-level trigrams:
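The sliding-window idea from the slides above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the helper name `ngrams` and the choice to strip the final period before splitting into words are assumptions made here.

```python
def ngrams(items, n):
    """Return every contiguous window of length n over a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


sentence = "Professor Keller is not a goth."

# Word-level bigrams: drop the trailing period (an assumption), split on spaces.
words = sentence.rstrip(".").split()
word_bigrams = ngrams(words, 2)

# Character-level trigrams: slide a window of 3 over the raw characters.
char_trigrams = ngrams(sentence, 3)

print(word_bigrams)
print(char_trigrams[:4])
```

The same function covers both cases because a Python string is itself a sequence of characters.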

  21. Statistical Approach: N-Grams • Mined from 1,024,908,267,229 words • Sample 4-grams (with counts): • “serve as the infrastructure”: 500 • “serve as the initial”: 5331 • “serve as the injector”: 56

  22. Statistical Approach: N-Grams • Informs some notion of probability • Normalize frequencies • P(serve as the initial) > P(serve as the injector) • Classification: • P(English | serve as the initial) > P(Spanish | serve as the initial) • P(Spam | serve as the injector) < P(!Spam | serve as the injector)
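As a rough sketch of "normalize frequencies" and of the classification comparison, the snippet below turns the three sample counts from the previous slide into relative frequencies and applies Bayes' rule to two classes. Normalizing over only three 4-grams is an assumption made for illustration (a real model would normalize over every observed continuation), and the likelihood values passed to the hypothetical `posterior` helper are made-up numbers, not measurements.

```python
# Counts copied from the sample 4-grams on the previous slide.
counts = {
    ("serve", "as", "the", "infrastructure"): 500,
    ("serve", "as", "the", "initial"): 5331,
    ("serve", "as", "the", "injector"): 56,
}

# Assumption: normalize over these three entries only, for illustration.
total = sum(counts.values())
prob = {gram: c / total for gram, c in counts.items()}


def posterior(likelihood, prior):
    """Bayes' rule: P(class | text) from P(text | class) and P(class)."""
    joint = {c: likelihood[c] * prior[c] for c in likelihood}
    z = sum(joint.values())
    return {c: j / z for c, j in joint.items()}


# Made-up likelihoods for the phrase under each language model, uniform prior.
post = posterior({"English": 1e-6, "Spanish": 1e-9},
                 {"English": 0.5, "Spanish": 0.5})

print(prob)
print(post)
```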

  23. Statistical Approach: N-Grams • But what about P(serve as the ink)? • = 0? • P(serve as the ink) = P(vxvw aooa *%^$) = 0? • How about P(sevre as the initial)?

  24. Statistical Approach: N-Grams • How do we smooth out sparse data? • Additive smoothing • Interpolation • Good-Turing estimate • Backoff • Witten-Bell smoothing • Absolute discounting • Kneser-Ney smoothing (MacCartney)

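  25. Statistical Approach: N-Grams • Additive smoothing: add a small constant to every count so that unseen n-grams get a nonzero probability • Interpolation: always consider smaller n-grams as well, e.g. (serve as the), (serve) • Backoff: fall back to smaller n-grams only if necessary, i.e. when the larger n-gram was never seen (MacCartney)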
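The three techniques on this slide can be compared on a toy bigram model. The sketch below is an illustration under stated assumptions: the ten-word training text, the constants `delta`, `lam`, and `alpha`, and all function names are invented here. The backoff variant is deliberately simplified and unnormalized, so it returns scores rather than true probabilities; properly normalized schemes such as Katz backoff are more involved.

```python
from collections import Counter

# Toy training text (an assumption for illustration).
tokens = "serve as the initial step and serve as the injector".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))  # vocabulary size


def p_unigram(w):
    """Relative frequency of a single word."""
    return unigrams[w] / len(tokens)


def p_bigram(w, prev):
    """Unsmoothed P(w | prev); zero whenever the pair was never seen."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0


def p_additive(w, prev, delta=1.0):
    """Additive smoothing: add delta to every bigram count."""
    return (bigrams[(prev, w)] + delta) / (unigrams[prev] + delta * V)


def p_interpolated(w, prev, lam=0.7):
    """Interpolation: always mix the bigram and unigram estimates."""
    return lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w)


def p_backoff(w, prev, alpha=0.4):
    """Simplified backoff: bigram estimate if seen, else a scaled unigram score."""
    if bigrams[(prev, w)] > 0:
        return p_bigram(w, prev)
    return alpha * p_unigram(w)


# "the step" never occurs in the training text, "the initial" does.
for f in (p_bigram, p_additive, p_interpolated, p_backoff):
    print(f.__name__, f("step", "the"), f("initial", "the"))
```

Only the unsmoothed estimate assigns zero to the unseen pair; each of the other three gives it a small positive value while still ranking the seen pair higher.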

  26. Statistical Approach: Results • Dunning: Compared parallel translated texts in English and Spanish • 20 char input, 50K training: 92% accurate • 500 char input, 50K training: 99.9% • Modified for comparing DNA sequences of humans, E. coli, and yeast
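To make the setup concrete, here is a toy character-trigram language identifier. It is a simplified sketch, not Dunning's method: it scores smoothed trigram relative frequencies rather than the conditional probabilities of a Markov model, it trains on the two greeting sentences from slide 2 instead of 50K of text, and `vocab_size` is an assumed bound on the number of distinct trigrams. It will not reproduce the accuracies above.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(text, n=3):
    """A 'model' here is just a table of lowercase character n-gram counts."""
    return Counter(char_ngrams(text.lower(), n))


def log_score(text, model, n=3, delta=0.5, vocab_size=10_000):
    """Sum of log additive-smoothed n-gram frequencies (vocab_size is assumed)."""
    total = sum(model.values())
    return sum(math.log((model[g] + delta) / (total + delta * vocab_size))
               for g in char_ngrams(text.lower(), n))


def classify(text, models):
    """Pick the language whose model gives the text the highest score."""
    return max(models, key=lambda lang: log_score(text, models[lang]))


# Tiny training texts taken from slide 2; real use needs far more data.
models = {
    "English": train("Hello. My name is Eric."),
    "Spanish": train("Hola. Mi nombre es Eric."),
}

print(classify("my name is", models))
print(classify("mi nombre", models))
```

Scores are summed in log space so that long inputs do not underflow to zero.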

  27. Neural Network Approach Bengio et al., “A Neural Probabilistic Language Model.” 2003: • N-gram models (with smoothing) do handle sparse data well • However, there are problems: • Narrow consideration of context (~1–2 words) • Does not consider semantic/grammatical similarity: • “A cat is walking in the bedroom” • “A dog was running in a room”

  28. Neural Network Approach • The general idea: 1. Associate with each word in the vocabulary (e.g. size 17,000) a feature vector (30–100 features) 2. Express the joint probability function of word sequences in terms of feature vectors 3. Learn simultaneously the word feature vectors and the parameters of the probability function
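The three steps can be sketched as a forward pass in plain Python. This is a heavily reduced illustration, not the architecture from the paper as published: the nine-word vocabulary, the layer sizes, and the names `C`, `H`, `U` are assumptions chosen here, bias terms and any direct input-to-output connections are left out, and the weights are random because training (step 3) is not shown.

```python
import math
import random

random.seed(0)

vocab = ["a", "cat", "dog", "is", "walking", "running", "in", "the", "room"]
word_to_id = {w: i for i, w in enumerate(vocab)}
# Vocabulary size, features per word, context length, hidden units (all toy sizes).
V, m, n_context, h = len(vocab), 4, 2, 8


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


C = rand_matrix(V, m)              # step 1: one feature vector per word
H = rand_matrix(h, n_context * m)  # hidden-layer weights
U = rand_matrix(V, h)              # output weights, one row per candidate word


def matvec(M, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]


def next_word_probs(context):
    """P(next word | context), computed from concatenated feature vectors."""
    x = [f for w in context for f in C[word_to_id[w]]]
    hidden = [math.tanh(a) for a in matvec(H, x)]
    scores = matvec(U, hidden)
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {w: e / z for w, e in zip(vocab, exps)}


def sequence_log_prob(words):
    """Step 2: log P(sequence) as a sum of log conditionals.

    The first n_context words are treated as given context.
    """
    return sum(math.log(next_word_probs(words[i - n_context:i])[words[i]])
               for i in range(n_context, len(words)))


probs = next_word_probs(["a", "cat"])
print(probs)
print(sequence_log_prob(["a", "cat", "is", "walking"]))
```

In training, the gradient of the sequence log-probability would update `C`, `H`, and `U` together, which is what lets words used in similar contexts, such as "cat" and "dog", end up with similar feature vectors.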

  29. References • Dunning, “Statistical Identification of Language.” 1994. • Bengio, “A Neural Probabilistic Language Model.” 2003. • MacCartney, “NLP Lunch Tutorial: Smoothing.” 2005.
