

807 - TEXT ANALYTICS. Massimo Poesio. Lecture 2: Text Classification.





Presentation Transcript

  1. 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text Classification

  2. TEXT CLASSIFICATION From: "" <takworlld@hotmail.com> Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY ! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW ! ================================================= Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm =================================================


  4. HOW DOCUMENT TAGS ARE PRODUCED • Traditionally: by HAND by EXPERTS (curators) • E.g., Yahoo!, Looksmart, about.com, ODP, Medline • Also: Bridgeman, UK Data Archive • very accurate when the job is done by experts • consistent when the problem size and the team are small • difficult and expensive to scale • Now: • AUTOMATICALLY • By CROWDS

  5. CLASSIFICATION AND CATEGORIZATION • Given: • A description of an instance, x ∈ X, where X is the instance language or instance space. • Issue: how to represent text documents. • A fixed set of categories: C = {c1, c2, …, cn} • Determine: • The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C. • We want to know how to build categorization functions (“classifiers”).

  6. AUTOMATIC TEXT CLASSIFICATION • Supervised • Training set • Test set • Clustering: next week

  7. Text Categorization: attributes • Representations of text are very high dimensional (one feature for each word). • High-bias algorithms that prevent overfitting in high-dimensional space are best. • For most text categorization tasks, there are many irrelevant and many relevant features. • Methods that combine evidence from many or all features (e.g. naive Bayes, kNN, neural-nets) tend to work better than ones that try to isolate just a few relevant features (standard decision-tree or rule induction)* *Although one can compensate by using many rules

  8. Bayesian Methods • Learning and classification methods based on probability theory. • Bayes theorem plays a critical role in probabilistic learning and classification. • Build a generative model that approximates how data is produced • Uses prior probability of each category given no information about an item. • Categorization produces a posterior probability distribution over the possible categories given a description of an item.

  9. Bayes’ Rule
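The formula on this slide did not survive transcription. For reference, the standard statement of Bayes' rule, written with the hypothesis h and data D that the following slides use, is:

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
```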

  10. Maximum a posteriori Hypothesis
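The formula for this slide is missing from the transcript. A standard way to write the maximum a posteriori hypothesis, assuming a hypothesis space H (a symbol not named in the transcript), is:

```latex
h_{MAP} = \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\, P(h)
```

The last step drops P(D) because it is the same for every h and so does not affect the argmax.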

  11. Maximum likelihood Hypothesis If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:
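The expression that followed the colon is missing from the transcript. Under a uniform prior over an assumed hypothesis space H, the standard maximum likelihood hypothesis is:

```latex
h_{ML} = \arg\max_{h \in H} P(D \mid h)
```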

  12. Naive Bayes Classifiers Task: Classify a new instance based on a tuple of attribute values

  13. Naïve Bayes Classifier: Assumptions • P(cj) • Can be estimated from the frequency of classes in the training examples. • P(x1,x2,…,xn|cj) • O(|X|^n · |C|) parameters • Could only be estimated if a very, very large number of training examples was available. • Conditional Independence Assumption: Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).
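Written out, the conditional independence assumption and the decision rule it leads to look as follows. This is the standard naive Bayes formulation in the slide's own symbols (x_i, c_j, C), not a transcription of the lost slide image:

```latex
P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j)
\qquad\Longrightarrow\qquad
c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{n} P(x_i \mid c_j)
```

With one table of size |X| per attribute and class, the number of parameters falls from O(|X|^n · |C|) to O(n · |X| · |C|).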

  14. The Naïve Bayes Classifier [Figure: class node Flu with feature nodes X1 to X5: runny nose, sinus, cough, fever, muscle-ache] • Conditional Independence Assumption: features are independent of each other given the class: P(X1,…,X5 | C) = P(X1|C) · P(X2|C) · … · P(X5|C)

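  15. Learning the Model [Figure: class node C with feature nodes X1 to X6] • Common practice: maximum likelihood • simply use the frequencies in the data: P(cj) = N(C=cj) / N and P(xi|cj) = N(Xi=xi, C=cj) / N(C=cj), where N(·) counts matching training cases and N is the total number of training cases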

  16. Problem with Max Likelihood [Figure: class node Flu with feature nodes X1 to X5: runny nose, sinus, cough, fever, muscle-ache] • What if we have seen no training cases where patient had no flu and muscle aches? • Zero probabilities cannot be conditioned away, no matter the other evidence!

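  17. Smoothing to Avoid Overfitting • Add-one smoothing: P(xi|cj) = (N(Xi=xi, C=cj) + 1) / (N(C=cj) + k), where N(·) counts matching training cases and k is the # of values of Xi • Somewhat more subtle version: P(xi,k|cj) = (N(Xi=xi,k, C=cj) + m · pi,k) / (N(C=cj) + m), where pi,k is the overall fraction in data where Xi=xi,k and m is the extent of “smoothing”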
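A small numeric sketch of the zero-probability problem and its fix. The counts are hypothetical and the function name `conditional_prob` is illustrative; `alpha` is a pseudo-count, so `alpha=0` gives the plain frequency estimate and `alpha=1` gives add-one smoothing.

```python
def conditional_prob(count, class_total, num_values, alpha=1.0):
    """Estimate P(Xi = x | c) from counts.

    count       -- training cases of class c with Xi = x
    class_total -- training cases of class c
    num_values  -- number of distinct values Xi can take
    alpha       -- pseudo-count; 0 gives the maximum likelihood estimate
    """
    return (count + alpha) / (class_total + alpha * num_values)


# Hypothetical counts: 40 "no flu" patients, none of whom reported muscle aches.
mle = conditional_prob(0, 40, 2, alpha=0.0)       # 0.0: vetoes the class outright
smoothed = conditional_prob(0, 40, 2, alpha=1.0)  # 1/42: small but non-zero
```

The smoothed estimates still sum to one over the values of Xi, because the denominator grows by `alpha` once per possible value.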

  18. Using Naive Bayes Classifiers to Classify Text: Basic method • Attributes are text positions, values are words. • Naive Bayes assumption is clearly violated. • Example? • Still too many possibilities • Assume that classification is independent of the positions of the words • Use same parameters for each position

  19. Text Classification Algorithms: Learning • From training corpus, extract Vocabulary • Calculate required P(cj) and P(xk | cj) terms • For each cj in C do • docsj ← subset of documents for which the target class is cj • Textj ← single document containing all docsj • for each word xk in Vocabulary • nk ← number of occurrences of xk in Textj
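The learning loop can be sketched in Python as below. The transcript stops before the final estimation step, so the add-one smoothed estimate (n_k + 1) / (n + |Vocabulary|) used here is an assumption, as are the function name `train_naive_bayes` and the toy corpus.

```python
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (class_label, list_of_tokens) pairs."""
    vocabulary = {tok for _, toks in labeled_docs for tok in toks}
    docs_per_class = Counter(label for label, _ in labeled_docs)
    word_counts = defaultdict(Counter)           # class -> word counts over Text_j
    for label, toks in labeled_docs:
        word_counts[label].update(toks)

    priors, cond = {}, {}
    for c, n_docs in docs_per_class.items():
        priors[c] = n_docs / len(labeled_docs)   # P(c_j) from class frequencies
        n = sum(word_counts[c].values())         # total tokens in Text_j
        # Assumed add-one smoothing: P(x_k | c_j) = (n_k + 1) / (n + |Vocabulary|)
        cond[c] = {w: (word_counts[c][w] + 1) / (n + len(vocabulary))
                   for w in vocabulary}
    return vocabulary, priors, cond


# Toy corpus, purely illustrative.
toy = [("spam", ["buy", "real", "estate", "now"]),
       ("spam", ["buy", "now"]),
       ("ham", ["lecture", "notes", "now"])]
vocabulary, priors, cond = train_naive_bayes(toy)
```

Every class gets a probability for every vocabulary word, including words it never saw, which is what keeps a single unseen word from zeroing out a class at classification time.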

  20. Text Classification Algorithms: Classifying • positions ← all word positions in current document which contain tokens found in Vocabulary • Return cNB, where cNB = argmax over cj in C of P(cj) · ∏ over i in positions of P(xi | cj)
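A minimal sketch of the classification step, assuming the model is given as plain dictionaries (`priors`, `cond`) with hand-set, hypothetical probabilities. It scores in log space, which leaves the argmax unchanged.

```python
import math


def classify(tokens, vocabulary, priors, cond):
    """Return the class maximizing log P(c) + sum of log P(x_i | c)."""
    positions = [t for t in tokens if t in vocabulary]   # skip unknown tokens
    scores = {c: math.log(priors[c]) + sum(math.log(cond[c][t]) for t in positions)
              for c in priors}
    return max(scores, key=scores.get)


# Hypothetical hand-set model over a three-word vocabulary.
vocabulary = {"buy", "now", "lecture"}
priors = {"spam": 0.5, "ham": 0.5}
cond = {"spam": {"buy": 0.6, "now": 0.3, "lecture": 0.1},
        "ham": {"buy": 0.1, "now": 0.3, "lecture": 0.6}}
label = classify(["buy", "now", "please"], vocabulary, priors, cond)
```

Here "please" is outside the vocabulary and is ignored, and "now" is equally likely under both classes, so "buy" decides the outcome.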

  21. Underflow Prevention • Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. • Class with highest final un-normalized log probability score is still the most probable.
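The effect is easy to reproduce. The sketch below uses 1000 hypothetical per-word probabilities of 0.01 each: the running product collapses to zero in double precision, while the sum of logs stays finite and comparable across classes.

```python
import math

probs = [0.01] * 1000            # hypothetical per-word probabilities

product = 1.0
for p in probs:
    product *= p                 # underflows to exactly 0.0 in float64

log_score = sum(math.log(p) for p in probs)   # about -4605.17, still usable
```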

  22. Naïve Bayes Posterior Probabilities • Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate. • However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not. • Output probabilities are generally very close to 0 or 1.

  23. Feature Selection • Text collections have a large number of features • 10,000 – 1,000,000 unique words – and more • Make using a particular classifier feasible • Some classifiers can’t deal with 100,000s of features • Reduce training time • Training time for some methods is quadratic or worse in the number of features (e.g., logistic regression) • Improve generalization • Eliminate noise features • Avoid overfitting

  24. Recap: Feature Reduction • Standard ways of reducing feature space for text • Stemming • Laugh, laughs, laughing, laughed -> laugh • Stop word removal • E.g., eliminate all prepositions • Conversion to lower case • Tokenization • Break on all special characters: fire-fighter -> fire, fighter
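These four steps can be chained in a few lines. The sketch below is illustrative only: the stop-word list is a tiny hand-picked sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "with"}  # illustrative only
SUFFIXES = ("ing", "ed", "s")    # crude stand-in for a real stemmer


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def reduce_features(text):
    # Lower-case, then break on every run of non-alphanumeric characters.
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]


terms = reduce_features("The fire-fighter laughed, laughing at laughs")
# terms == ["fire", "fighter", "laugh", "laugh", "laugh"]
```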

  25. Feature Selection • Yang and Pedersen 1997 • Comparison of different selection criteria • DF – document frequency • IG – information gain • MI – mutual information • CHI – chi square • Common strategy • Compute statistic for each term • Keep n terms with highest value of this statistic
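The common strategy is independent of which statistic is used. A minimal sketch, using document frequency as the statistic because it needs no class labels; the function names and the toy documents are illustrative.

```python
from collections import Counter


def document_frequency(docs):
    """docs: list of token lists. Count the documents each term occurs in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))       # set(): count each term once per document
    return df


def select_top_n(scores, n):
    """Keep the n terms with the highest value of the statistic."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


docs = [["car", "engine", "oil"], ["car", "dealer"], ["circuit", "voltage", "car"]]
kept = select_top_n(document_frequency(docs), 2)
# kept == ["car", "circuit"]
```

Ties are broken alphabetically here so the result is deterministic; any other tie-break would serve equally well.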

  26. Feature selection via Mutual Information • We might not want to use all words, but just reliable, good discriminators • In training set, choose k words which best discriminate the categories. • One way is in terms of Mutual Information: • For each word w and each category c

  27. (Pointwise) Mutual Information
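The formula on this slide is missing from the transcript. The standard definition of pointwise mutual information between a word w and a category c is:

```latex
I(w, c) = \log \frac{P(w, c)}{P(w)\, P(c)}
```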

  28. Feature selection via MI (contd.) • For each category we build a list of k most discriminating terms. • For example (on 20 Newsgroups): • sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, … • rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, … • Greedy: does not account for correlations between terms • In general feature selection is necessary for binomial NB, but not for multinomial NB

  29. Information Gain
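The formula on this slide is missing from the transcript. The definition below is the one used by Yang and Pedersen (1997), whom the deck cites; the symbols t (term), c_i (category) and m (number of categories) are assumptions about the lost slide's notation.

```latex
G(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i)
       + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t)
       + P(\bar{t}) \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t})
```

Unlike pointwise mutual information, this takes into account both the presence (t) and the absence (t-bar) of the term.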

  30. Chi-Square • X^2 = N(AD-BC)^2 / ((A+B)(A+C)(B+D)(C+D)) • Use either maximum or average X^2 • Value for complete independence?
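A direct transcription of the formula into Python. The meaning of A, B, C and D is not preserved in the transcript; the sketch assumes the usual convention from Yang and Pedersen, spelled out in the docstring.

```python
def chi_square(a, b, c, d):
    """Chi-square for one term/category pair from a 2x2 contingency table.

    a -- documents in the category that contain the term
    b -- documents outside the category that contain the term
    c -- documents in the category that lack the term
    d -- documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0          # degenerate table: an entire row or column is empty
    return n * (a * d - b * c) ** 2 / denominator


independent = chi_square(10, 10, 10, 10)   # 0.0
dependent = chi_square(10, 0, 0, 10)       # 20.0
```

When term and category are independent, AD equals BC and the statistic is zero; when the term occurs in exactly the documents of the category, it reaches N.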

  31. Document Frequency • Number of documents a term occurs in • Is sometimes used for eliminating both very frequent and very infrequent terms • How is document frequency measure different from the other 3 measures?

  32. Yang&Pedersen: Experiments • Two classification methods • kNN (k nearest neighbors; more later) • Linear Least Squares Fit • Regression method • Collections • Reuters-22173 • 92 categories • 16,000 unique terms • OHSUMED: subset of MEDLINE • 14,000 categories • 72,000 unique terms • Ltc term weighting

  33. Yang&Pedersen: Experiments • Choose feature set size • Preprocess collection, discarding non-selected features / words • Apply term weighting -> feature vector for each document • Train classifier on training set • Evaluate classifier on test set

  34. Discussion • You can eliminate 90% of features for IG, DF, and CHI without decreasing performance. • In fact, performance increases with fewer features for IG, DF, and CHI. • Mutual information is very sensitive to small counts. • IG does best with smallest number of features. • Document frequency is close to optimal. By far the simplest feature selection method. • Similar results for LLSF (regression).

  35. Results Why is selecting common terms a good strategy?

  36. IG, DF, CHI Are Correlated.

  37. Information Gain vs Mutual Information • Information gain is similar to MI for random variables • Independence? • In contrast, pointwise MI ignores non-occurrence of terms • E.g., for complete dependence, you get: • P(AB)/ (P(A)P(B)) = 1/P(A) – larger for rare terms than for frequent terms • Yang&Pedersen: Pointwise MI favors rare terms
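The rare-term bias can be checked numerically. The sketch below evaluates pointwise MI under complete dependence for a rare and a frequent term; the probabilities are hypothetical.

```python
import math


def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information: log of P(AB) / (P(A) P(B))."""
    return math.log(p_ab / (p_a * p_b))


# Complete dependence: the term occurs exactly when the category does,
# so P(AB) = P(A) = P(B) and the ratio collapses to 1 / P(A).
rare = pmi(0.001, 0.001, 0.001)      # log(1000)
frequent = pmi(0.1, 0.1, 0.1)        # log(10)
```

Both terms are perfect predictors of their category, yet the rare one scores three times higher on the log scale, which is the bias Yang and Pedersen report.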

  38. Feature Selection: Other Considerations • Generic vs Class-Specific • Completely generic (class-independent) • Separate feature set for each class • Mixed (a la Yang&Pedersen) • Maintainability over time • Is aggressive feature selection good or bad for robustness over time? • Ideal: Optimal features selected as part of training

  39. Yang&Pedersen: Limitations • Don’t look at class specific feature selection • Don’t look at methods that can’t handle high-dimensional spaces • Evaluate category ranking (as opposed to classification accuracy)

  40. Feature Selection: Other Methods • Stepwise term selection • Forward • Backward • Expensive: need to do n^2 iterations of training • Term clustering • Dimension reduction: PCA / SVD

  41. FEATURES IN SPAM ASSASSIN • SpamAssassin is a spam filtering program written in Perl. It grew out of filter.plx, written by Mark Jeftovic in 1997, and was entirely rewritten by Justin Mason in 2001 when it went on SourceForge; it is now maintained by the Apache Software Foundation • It uses Naïve Bayes and a variety of other methods • It uses features coming both from content and from other properties of the email message

  42. SpamAssassin Features (score, then rule description)
      100    From: address is in the user's black-list
      4.0    Sender is on www.habeas.com Habeas Infringer List
      3.994  Invalid Date: header (timezone does not exist)
      3.970  Written in an undesired language
      3.910  Listed in Razor2, see http://razor.sf.net/
      3.801  Subject is full of 8-bit characters
      3.472  Claims compliance with Senate Bill 1618
      3.437  exists:X-Precedence-Ref
      3.371  Reverses Aging
      3.350  Claims you can be removed from the list
      3.284  'Hidden' assets
      3.283  Claims to honor removal requests
      3.261  Contains "Stop Snoring"
      3.251  Received: contains a name with a faked IP-address
      3.250  Received via a relay in list.dsbl.org
      3.200  Character set indicates a foreign language

  43. SpamAssassin Features (score, then rule description)
      3.198  Forged eudoramail.com 'Received:' header found
      3.193  Free Investment
      3.180  Received via SBLed relay, see http://www.spamhaus.org/sbl/
      3.140  Character set doesn't exist
      3.123  Dig up Dirt on Friends
      3.090  No MX records for the From: domain
      3.072  X-Mailer contains malformed Outlook Express version
      3.044  Stock Disclaimer Statement
      3.009  Apparently, NOT Multi Level Marketing
      3.005  Bulk email software fingerprint (jpfree) found in headers
      2.991  exists:Complain-To
      2.975  Bulk email software fingerprint (VC_IPA) found in headers
      2.968  Invalid Date: year begins with zero
      2.932  Mentions Spam law "H.R. 3113"
      2.900  Received forged, contains fake AOL relays
      2.879  Asks for credit card details

  44. Measuring Performance • Classification accuracy: What % of messages were classified correctly? • Is this what we care about? • Which system do you prefer?

  45. Measuring Performance • Precision = good messages kept / all messages kept • Recall = good messages kept / all good messages • Trade off precision vs. recall by setting threshold • Measure the curve on annotated dev data (or test data) • Choose a threshold where user is comfortable
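Both ratios follow directly from two sets of message ids. A minimal sketch with hypothetical data; `kept` is what the filter let through and `good` is the gold-standard set of legitimate messages.

```python
def precision_recall(kept, good):
    """kept: set of message ids the filter kept; good: set of truly good ids."""
    good_kept = len(kept & good)
    precision = good_kept / len(kept) if kept else 0.0
    recall = good_kept / len(good) if good else 0.0
    return precision, recall


kept = {1, 2, 3, 4}          # hypothetical filter output
good = {1, 2, 3, 5, 6}       # hypothetical gold labels
precision, recall = precision_recall(kept, good)   # 3/4 and 3/5
```

Returning 0.0 for an empty denominator is a convention chosen for this sketch; some toolkits report the value as undefined instead.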

  46. Measuring Performance • F-measure = 1 / (average(1/precision, 1/recall)) [Figure: precision vs. recall curve, with the annotations that follow] • high threshold: all we keep is good, but we don’t keep much: OK for search engines (maybe) • low threshold: keep all the good stuff, but a lot of the bad too: OK for spam filtering and legal search • point where precision=recall (sometimes reported) • would prefer to be here! (Source: 600.465 - Intro to NLP - J. Eisner)
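The F-measure as defined on the slide is the harmonic mean of precision and recall. A short sketch with illustrative inputs:

```python
def f_measure(precision, recall):
    """Harmonic mean: 1 / average(1/precision, 1/recall)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / ((1.0 / precision + 1.0 / recall) / 2.0)


balanced = f_measure(0.5, 0.5)     # 0.5
lopsided = f_measure(0.9, 0.1)     # 0.18, well below the arithmetic mean of 0.5
```

The harmonic mean is pulled toward the smaller of the two values, so a system cannot score well by maximizing precision or recall alone.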

  47. Micro- vs. Macro-Averaging • If we have more than one class, how do we combine multiple performance measures into one quantity? • Macroaveraging: Compute performance for each class, then average. • Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

  48. Micro- vs. Macro-Averaging: Example [Tables: contingency tables for Class 1 and Class 2, and the pooled micro-average table] • Macroaveraged precision: (0.5 + 0.9)/2 = 0.7 • Microaveraged precision: 100/120 = .83 • Why this difference?
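The two averages can be reproduced in a few lines. The original contingency tables are not in the transcript, so the true-positive and false-positive counts below are assumptions chosen only to match the slide's 0.5, 0.9 and 100/120.

```python
def macro_micro_precision(tables):
    """tables: list of (true_positives, false_positives), one pair per class."""
    per_class = [tp / (tp + fp) for tp, fp in tables]
    macro = sum(per_class) / len(per_class)          # average of per-class scores
    total_tp = sum(tp for tp, _ in tables)
    total_fp = sum(fp for _, fp in tables)
    micro = total_tp / (total_tp + total_fp)         # score of the pooled table
    return macro, micro


# Assumed counts, chosen only to reproduce the slide's 0.5, 0.9 and 100/120.
macro, micro = macro_micro_precision([(10, 10), (90, 10)])
```

Macroaveraging weights every class equally, while microaveraging weights every decision equally, so the class with more positive decisions (Class 2 here) dominates the micro score.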

  49. Confusion matrix • Function of classifier, topics and test docs. • For a perfect classifier, all off-diagonal entries should be zero. • For a perfect classifier, if there are n docs in category j then entry (j,j) should be n. • Straightforward when there is 1 category per document. • Can be extended to n categories per document.
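A minimal sketch of building a confusion matrix for the one-category-per-document case. The labels are hypothetical, and rows are taken to be true categories and columns assigned categories, which is one common convention rather than something the slide specifies.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Entry [i][j] counts docs of category i that were assigned category j."""
    index = {c: k for k, c in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


categories = ["spam", "ham"]
truth = ["spam", "spam", "ham", "ham", "ham"]
guess = ["spam", "ham", "ham", "ham", "spam"]
matrix = confusion_matrix(truth, guess, categories)   # [[1, 1], [1, 2]]
```

Feeding the true labels in as predictions yields a diagonal matrix whose entry (j,j) is the number of documents in category j, matching the perfect-classifier case described on the slide.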
