
Statistical Methods for Text Mining



  1. Statistical Methods for Text Mining. David Madigan, Rutgers University & DIMACS, www.stat.rutgers.edu/~madigan; David D. Lewis, www.daviddlewis.com. Joint work with Alex Genkin, Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin

  2. Statistical Analysis of Text. Statistical text analysis has a long history in literary analysis and in solving disputed-authorship problems. The first (?) was Thomas C. Mendenhall in 1887.

  3. Mendenhall. Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later President of Worcester Polytechnic Institute. [Photo: Mendenhall Glacier, Juneau, Alaska]

  4. [Figure] χ² = 127.2, df = 12

  5. Used naïve Bayes with Poisson and negative binomial models. Out-of-sample predictive performance.

  6. Today • Statistical methods are routinely used for textual analyses of all kinds • Machine translation, part-of-speech tagging, information extraction, question answering, text categorization, etc. • Not reported in the statistical literature (no statisticians?)

  7. Outline • Part-of-Speech Tagging, Entity Recognition • Text categorization • Logistic regression and friends • The richness of Bayesian regularization • Sparseness-inducing priors • Word-specific priors: stop words, IDF, domain knowledge, etc. • Polytomous logistic regression

  8. Part-of-Speech Tagging • Assign grammatical tags to words • Basic task in the analysis of natural language data • Phrase identification, entity extraction, etc. • Ambiguity: “tag” could be a noun or a verb • “a tag is a part-of-speech label” – context resolves the ambiguity

  9. The Penn Treebank POS Tag Set

  10. POS Tagging Process [figure credit: Berlin Chen]

  11. POS Tagging Algorithms • Rule-based taggers: large numbers of hand-crafted rules • Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM [diagram: tag1 → tag2 → tag3, each emitting word1, word2, word3], with clever tricks for reducing the number of parameters (aka priors); see the decoding sketch below
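A minimal sketch of HMM decoding with the Viterbi algorithm, assuming the transition, emission, and initial log-probabilities have already been estimated from a tagged corpus. All names here are illustrative, not from the talk:

```python
import math

def viterbi(words, tags, log_trans, log_emit, log_init):
    """Most probable tag sequence under a first-order HMM.

    log_trans[(s, t)]: log p(tag t | previous tag s)
    log_emit[(t, w)]:  log p(word w | tag t)
    log_init[t]:       log p(tag t at sentence start)
    """
    NEG = -math.inf
    # best[i][t]: best log-prob of any tagging of words[:i+1] ending in t
    best = [{t: log_init.get(t, NEG) + log_emit.get((t, words[0]), NEG)
             for t in tags}]
    back = [{}]  # back[i][t]: predecessor tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: best[i - 1][s] +
                       log_trans.get((s, t), NEG))
            best[i][t] = (best[i - 1][prev] +
                          log_trans.get((prev, t), NEG) +
                          log_emit.get((t, words[i]), NEG))
            back[i][t] = prev
    # trace back from the best final tag
    t = max(tags, key=lambda s: best[-1][s])
    seq = [t]
    for i in range(len(words) - 1, 0, -1):
        t = back[i][t]
        seq.append(t)
    return list(reversed(seq))
```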

  12. Some details: Charniak et al., 1993, achieved 95% accuracy on the Brown Corpus with p(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears), and, for unknown words, (number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total), plus a modification that uses word suffixes.
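The first estimator is plain relative-frequency counting. A sketch, assuming the tagged corpus is given as a list of (word, tag) pairs (the function name is illustrative):

```python
from collections import Counter

def tag_given_word_probs(tagged_corpus):
    """Estimate p(tag | word) = count(word, tag) / count(word)."""
    pair_counts = Counter(tagged_corpus)                # (word, tag) counts
    word_counts = Counter(w for w, _ in tagged_corpus)  # word counts
    return {(w, t): c / word_counts[w]
            for (w, t), c in pair_counts.items()}

# e.g. comparing probs[("tag", "NN")] with probs[("tag", "VB")]
# resolves the noun/verb ambiguity for "tag" by corpus frequency
```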

  13. Recent Developments • Toutanova et al., 2003, use a dependency network and a richer feature set • Log-linear model for t_i | t_{-i}, w • The model included, for example, features for whether the word contains a number, uppercase characters, a hyphen, etc. • Regularization of the estimation process is critical • 96.6% accuracy on the Penn corpus

  14. Named-Entity Classification • “Mrs. Frank” is a person • “Steptoe and Johnson” is a company • “Honduras” is a location • etc. • Bikel et al. (1998) from BBN: “Nymble,” a statistical approach using HMMs

  15. [diagram: name classes nc1 → nc2 → nc3, each emitting word1, word2, word3] • “Name classes”: Not-A-Name, Person, Location, etc. • Smoothing for sparse training data + word features • Training = 100,000 words from WSJ • Accuracy = 93% • 450,000 words → same accuracy

  16. Training / development / test split

  17. Text Categorization • Automatic assignment of documents to a manually defined set of categories • Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading • The dominant technology is supervised machine learning: manually classify some documents, then learn a classification rule from them (possibly with manual intervention)

  18. Terminology, etc. • Binary versus multi-class • Single-label versus multi-label • Document representation via “bag of words”: the w_i’s might be 0/1, counts, or weights (e.g., tf-idf, LSI) • Phrases, syntactic information, synonyms, NLP, etc.? • Stopwords, stemming

  19. Test Collections • Reuters-21578 • 9603 training, 3299 test, 90 categories, ~multi-label • New Reuters – 800,000 documents • Medline – 11,000,000 documents; MeSH headings • TREC conferences and collections • Newsgroups, WebKB

  20. Reuters Evaluation • For a binary classifier, tabulate counts in a 2×2 table: d = documents truly in the category and predicted in it, b = truly in but predicted out, c = truly out but predicted in. Then recall = d/(b+d) (“sensitivity”) and precision = d/(c+d) (“predictive value positive”) • With multiple binary classifiers, either average per-class scores (macro) or pool the counts first (micro). Example: classifier 1 makes one positive prediction and it is correct (p = 1.0, r = 1.0); classifier 2 makes two positive predictions, one correct and one not (p = 0.5, r = 1.0). Macro-averaged precision = (1.0 + 0.5)/2 = 0.75; micro-averaged precision = 2/3 • F1 measure: the harmonic mean of precision and recall
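The averaging arithmetic in code; a small sketch that reproduces the example’s numbers (the counts and names are illustrative):

```python
def precision(tp, fp):
    return tp / (tp + fp)

# per-class confusion counts from the example above:
# classifier 1: one true positive; classifier 2: one true positive,
# one false positive
per_class = [dict(tp=1, fp=0), dict(tp=1, fp=1)]

# macro-averaging: compute precision per class, then average
macro_p = sum(precision(c["tp"], c["fp"]) for c in per_class) / len(per_class)

# micro-averaging: pool the counts across classes, then compute once
tp = sum(c["tp"] for c in per_class)
fp = sum(c["fp"] for c in per_class)
micro_p = precision(tp, fp)

print(macro_p)  # 0.75      = (1.0 + 0.5) / 2
print(micro_p)  # 0.666...  = 2 / 3
```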

  21. Reuters Results

  22. Naïve Bayes • Naïve Bayes for document classification dates back to the early 1960s • The NB model assumes features are conditionally independent given the class • Estimation is simple and scales well • Empirical performance is usually not bad • High bias, low variance (Friedman, 1997; Domingos & Pazzani, 1997). A toy sketch follows below.
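A minimal Bernoulli naïve Bayes sketch with Laplace smoothing, to make the conditional-independence assumption concrete. This is a generic textbook variant, not necessarily the exact model used in the talk; all names are illustrative:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of sets of words; labels: parallel list of class names."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    vocab = set().union(*docs)
    counts = {c: Counter() for c in classes}  # doc frequency of each word
    totals = Counter(labels)                  # docs per class
    for d, y in zip(docs, labels):
        counts[y].update(d)
    # Laplace-smoothed p(word present | class)
    cond = {(c, w): (counts[c][w] + 1) / (totals[c] + 2)
            for c in classes for w in vocab}
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    def log_posterior(c):
        # independence: each word contributes its own factor
        s = math.log(prior[c])
        for w in vocab:
            p = cond[(c, w)]
            s += math.log(p if w in doc else 1 - p)
        return s
    return max(prior, key=log_posterior)
```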

  23. Poisson NB • A natural extension of the binary model to word frequencies • ML-equivalent to the multinomial model with Poisson-distributed document length • Bayesian equivalence requires constraints on the conjugate priors (Poisson NB has 2p hyper-parameters per class; multinomial-Poisson has p+2). The likelihood is sketched below.
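In LaTeX, a sketch of the Poisson NB class-conditional likelihood and the equivalence noted above, for a document with count vector $x = (x_1, \dots, x_p)$ (the notation is mine, not from the slides):

```latex
% each word count x_j is Poisson with class-specific rate \mu_{jk},
% conditionally independent given class k:
p(x \mid y = k) = \prod_{j=1}^{p} \frac{e^{-\mu_{jk}} \, \mu_{jk}^{x_j}}{x_j!}
% conditioning on length n = \sum_j x_j gives a multinomial with
% \theta_{jk} = \mu_{jk} / \sum_{j'} \mu_{j'k} and
% n \sim \mathrm{Poisson}\bigl(\sum_j \mu_{jk}\bigr),
% which is the stated ML equivalence.
```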

  24. Poisson NB on Reuters [figure]: over-dispersion relative to the Poisson model. A different story for the FAA dataset.

  25. AdaBoost.MH • Multiclass-multilabel • At each iteration, learns a simple score-producing classifier on weighted training data and then updates the weights • The final decision averages over the classifiers [diagram: data with initial weights → score from simple classifier → revised weights]. A sketch of the boosting loop follows below.
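A sketch of the boosting loop for the plain binary case (labels in {-1, +1}); AdaBoost.MH extends this by keeping one weight per (example, label) pair. This is a generic AdaBoost outline under my own naming, not the BoosTexter implementation:

```python
import math

def adaboost(X, y, stumps, rounds):
    """X, y: training data; stumps: pool of candidate weak classifiers,
    each mapping an example to -1 or +1; returns the voted ensemble."""
    n = len(X)
    w = [1.0 / n] * n                                # initial weights
    ensemble = []
    for _ in range(rounds):
        # weak learner: stump with the lowest weighted training error
        def weighted_error(h):
            return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        h = min(stumps, key=weighted_error)
        err = max(weighted_error(h), 1e-12)          # avoid log(0)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)      # score for this round
        ensemble.append((alpha, h))
        # revise weights: mistakes up-weighted, correct cases down-weighted
        w = [wi * math.exp(-alpha * yi * h(xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    # final decision: sign of the score-weighted average of the classifiers
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```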

  26. AdaBoost.MH (Schapire and Singer, 2000)

  27. AdaBoost.MH’s weak learner is a stump (two words!)

  28. AdaBoost.MH Comments • Software implementation: BoosTexter • Some theoretical support in terms of bounds on generalization error • 3 days of CPU time for Reuters with 10,000 boosting iterations

  29. Document Representation • Documents are usually represented as a “bag of words”: the x_i’s might be 0/1, counts, or weights (e.g., tf-idf, LSI) • Many text-processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.

  30. Classifier Representation • For instance, a linear classifier: score(x) = Σ_j β_j x_j • the x_j’s are derived from the text of the document • y_i indicates whether to put document i in the category • the β_j are parameters chosen to give good classification effectiveness

  31. Logistic Regression Model • Linear model for the log odds of category membership: log[ p(y=1|x) / (1 − p(y=1|x)) ] = Σ_j β_j x_j • Equivalent to the conditional probability model p(y=1|x) = exp(Σ_j β_j x_j) / (1 + exp(Σ_j β_j x_j))

  32. Logistic Regression as a Linear Classifier • If the estimated probability of category membership is greater than p, assign the document to the category • Choose p to optimize the expected value of your effectiveness measure (may need a different form of test) • Can change the measure without changing the model

  33. Maximum Likelihood Training • Choose parameters (the β_j’s) that maximize the probability (likelihood) of the class labels (the y_i’s) given the documents (the x_i’s) • Maximizing the (log-)likelihood can be viewed as minimizing a loss function; the objective is written out below
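The standard objective in LaTeX (my notation, consistent with the model on slide 31):

```latex
% with p_i = \Pr(y_i = 1 \mid x_i) = \frac{e^{\beta^\top x_i}}{1 + e^{\beta^\top x_i}},
% the log-likelihood of labels y_i \in \{0, 1\} is
\ell(\beta) = \sum_{i=1}^{n} \bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \bigr]
% maximizing \ell(\beta) is minimizing the negative log-likelihood,
% i.e. the familiar log-loss.
```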

  34. [Figure from Hastie, Friedman & Tibshirani]

  35. Shrinkage Methods • Subset selection is a discrete process: individual variables are either in or out. A combinatorial nightmare. • It can also have high variance: a different dataset from the same source can yield a totally different model • Shrinkage methods allow a variable to be partly included in the model; that is, the variable is included but with a shrunken coefficient • An elegant way to tackle over-fitting

  36. Ridge Regression • Minimize Σ_i (y_i − β·x_i)² subject to Σ_j β_j² ≤ s • Equivalently: minimize Σ_i (y_i − β·x_i)² + λ Σ_j β_j² • This leads to: β̂ = (XᵀX + λI)⁻¹ Xᵀy • Choose λ by cross-validation • Works even when XᵀX is singular
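A numpy sketch of the closed-form solution above (for the linear-regression case the formula describes; the function and variable names are mine):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """beta_hat = (X'X + lam*I)^{-1} X'y.  The lam*I term makes the
    system solvable even when X'X is singular, e.g. when there are
    more features than documents."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam would typically be chosen by cross-validation over a grid,
# e.g. 10.0 ** np.arange(-3, 4)
```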

  37. Ridge Regression = Bayesian MAP Regression • Suppose we believe each β_j is a small value near 0 • Encode this belief as separate Gaussian probability distributions over the values of β_j • Choosing the maximum a posteriori value of β gives the same result as ridge logistic regression; the correspondence is sketched below
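The correspondence in LaTeX (my notation): with independent β_j ~ N(0, τ²) priors,

```latex
% log posterior = log likelihood + log prior (up to a constant):
\log p(\beta \mid \text{data})
  = \ell(\beta) - \sum_{j} \frac{\beta_j^2}{2\tau^2} + \text{const}
% so maximizing it is minimizing -\ell(\beta) + \lambda \sum_j \beta_j^2
% with \lambda = 1 / (2\tau^2): exactly the ridge-penalized objective.
```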

  38. Least Absolute Shrinkage & Selection Operator (LASSO) • Minimize Σ_i (y_i − β·x_i)² subject to Σ_j |β_j| ≤ s • A quadratic programming algorithm is needed to solve for the parameter estimates • More generally, penalize Σ_j |β_j|^q: q=0 gives variable selection, q=1 the lasso, q=2 ridge • Learn q?

  39. Ridge & LASSO – Theory • Lasso estimates are consistent • But the lasso does not have the “oracle property”; that is, it does not deliver the correct model with probability tending to 1 • Fan & Li’s SCAD penalty function does have the oracle property

  40. LARS (Least Angle Regression) • New geometrical insights into the lasso and “stagewise” regression • Leads to a highly efficient lasso algorithm for linear regression

  41. LARS • Start with all coefficients β_j = 0 • Find the predictor x_j most correlated with y • Increase β_j in the direction of the sign of its correlation with y, taking residuals r = y − ŷ along the way; stop when some other predictor x_k has as much correlation with r as x_j has • Increase (β_j, β_k) in their joint least-squares direction until some other predictor x_m has as much correlation with the residual r • Continue until all predictors are in the model; a usage sketch follows below
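A short sketch using scikit-learn's implementation of the LARS path (assuming scikit-learn is available; the toy data are mine):

```python
import numpy as np
from sklearn.linear_model import lars_path

# toy data: y depends on the first two of ten predictors
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(100)

# method="lasso" gives the lasso variant of the LARS path
alphas, active, coefs = lars_path(X, y, method="lasso")

print(active)        # order in which predictors entered the model
print(coefs[:, -1])  # coefficients at the unpenalized end of the path
# coefs[:, k] is the coefficient vector at step k: predictors enter
# one at a time, in order of correlation with the current residual,
# exactly as in the steps listed above
```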

  42. Zhang & Oles Results • Reuters-21578 collection • Ridge logistic regression plus feature selection

  43. Bayes! • MAP logistic regression with a Gaussian prior gives state-of-the-art text classification effectiveness • But the Bayesian framework is more flexible than SVMs for combining knowledge with data: • Feature selection • Stopwords, IDF • Domain knowledge • Number of classes • (and kernels)

  44. Data Sets • ModApte subset of Reuters-21578: 90 categories; 9603 training docs; 18978 features • Reuters RCV1-v2: 103 categories; 23149 training docs; 47152 features • OHSUMED heart disease categories: 77 categories; 83944 training docs; 122076 features • Cosine-normalized TF×IDF weights

  45. Dense vs. Sparse Models (Macroaveraged F1, Preliminary)
