Text Classification with Limited Labeled Data Andrew McCallum mccallum@cs.cmu.edu Just Research (formerly JPRC) Center for Automated Learning and Discovery, Carnegie Mellon University Joint work with Kamal Nigam, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng, Larry Wasserman, Kristie Seymore, and Jason Rennie
The Task: Document Classification
(also “Document Categorization”, “Routing” or “Tagging”)
Automatically placing documents in their correct categories.
Categories: Crops, Botany, Evolution, Magnetism, Relativity, Irrigation
Training Data (word vectors, one per category): “water grating ditch farm tractor...”, “corn wheat silo farm grow...”, “corn tulips splicing grow...”, “selection mutation Darwin Galapagos DNA...”, ...
Testing Data: “grow corn tractor…” → (Crops)
A Probabilistic Approach to Document Classification
Pick the most probable class, given the evidence:
• c: a class (like “Crops”)
• d: a document (like “grow corn tractor...”)
• w_{d_i}: the ith word in d (like “corn”)
Apply Bayes rule; “Naïve Bayes” then makes two assumptions: (1) one mixture component per class, and (2) word independence.
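The equations on this slide were rendered as graphics and did not survive extraction; a standard reconstruction of the Bayes-rule and naïve Bayes formulas the slide refers to (with c a class, d a document, and w_{d_i} the ith word of d):

```latex
\Pr(c \mid d) \;=\; \frac{\Pr(c)\,\Pr(d \mid c)}{\Pr(d)}
\quad\Longrightarrow\quad
\Pr(c \mid d) \;\propto\; \Pr(c)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c)
```

The proportionality holds because Pr(d) is constant across classes, so classification picks argmax over c of the right-hand side.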
Parameter Estimation in Naïve Bayes
Maximum a posteriori estimate of Pr(w|c), with a Dirichlet prior (AKA “Laplace smoothing”), where N(w,d) is the number of times word w occurs in document d.
Two ways to improve this method:
(A) Make less restrictive assumptions about the model.
(B) Get better estimates of the model parameters, i.e. Pr(w|c).
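The smoothed estimate itself was an image on the original slide; the standard MAP estimate with a Dirichlet prior (Laplace smoothing), consistent with the slide's definition of N(w,d), is:

```latex
\hat{\Pr}(w \mid c) \;=\;
\frac{1 + \sum_{d \in c} N(w,d)}
     {|V| + \sum_{w' \in V} \sum_{d \in c} N(w',d)}
```

where |V| is the vocabulary size; the added counts keep every word probability nonzero.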
The Rest of the Talk Two Methods for Improving Parameter Estimation when Labeled Data is Sparse (1) Borrow data from related classes in a hierarchy (2) Use unlabeled data.
Improving Document Classification by Shrinkage in a Hierarchy Andrew McCallum Roni Rosenfeld Tom Mitchell Andrew Ng (Berkeley) Larry Wasserman (CMU Statistics)
The Idea: “Shrinkage” / “Deleted Interpolation”
We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity.
Hierarchy: Science → {Agriculture, Biology, Physics}; leaf categories: Crops, Botany, Evolution, Magnetism, Relativity, Irrigation
Training Data (word vectors, one per leaf): “water grating ditch farm tractor...”, “corn wheat silo farm grow...”, “corn tulips splicing grow...”, “selection mutation Darwin Galapagos DNA...”, ...
Testing Data: “corn grow tractor…” → (Crops)
“Shrinkage” / “Deleted Interpolation”
[James and Stein, 1961] / [Jelinek and Mercer, 1980]
Hierarchy: (Uniform) → Science → {Agriculture, Biology, Physics} → leaves {Crops, Irrigation, Botany, Evolution, Magnetism, Relativity}
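The interpolation formula was a graphic on this slide; the standard shrinkage estimate it depicts, for a leaf class c whose path to the root is c^{(1)} (the leaf itself) through c^{(k)} (the root), topped by a uniform pseudo-root, is:

```latex
\hat{\Pr}(w \mid c) \;=\;
\lambda_0 \,\frac{1}{|V|}
\;+\; \sum_{j=1}^{k} \lambda_j \,\hat{\Pr}\!\left(w \mid c^{(j)}\right),
\qquad \sum_{j=0}^{k} \lambda_j = 1
```

Ancestors pool training data from more classes, so their estimates are more reliable but less specific; the λ's set the tradeoff.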
Learning Mixture Weights
Learn the λ's via EM, performing the E-step with leave-one-out cross-validation.
• E-step: Use the current λ's to estimate the degree to which each node on the path (e.g. Uniform, Science, Agriculture, Crops) was likely to have generated the words in held-out documents (“corn wheat silo farm grow...”).
• M-step: Use those estimates to recalculate new values for the λ's.
Learning Mixture Weights (continued): the E-step and M-step update equations.
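The update equations on this slide were images; the standard deleted-interpolation EM updates (computed over held-out word occurrences w in class c's documents, as described on the previous slide) are:

```latex
\text{E-step:}\quad
\beta_j(w) \;=\;
\frac{\lambda_j \,\hat{\Pr}_j(w \mid c)}
     {\sum_{m} \lambda_m \,\hat{\Pr}_m(w \mid c)}
\qquad\qquad
\text{M-step:}\quad
\lambda_j \;\leftarrow\;
\frac{\sum_{w} \beta_j(w)}{\sum_{m}\sum_{w} \beta_m(w)}
```

Here \hat{\Pr}_j(w \mid c) is the estimate at the jth node on the path from leaf c to the uniform root, and β_j(w) is the responsibility of node j for the held-out occurrence.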
Newsgroups Data Set
(Subset of Ken Lang's 20 Newsgroups set)
Hierarchy: computers → {mac, ibm, X, graphics, windows}; religion → {atheism, christian, misc}; sport → {baseball, hockey}; politics → {guns, mideast, misc}; motor → {auto, motorcycle}
• 15 classes, 15k documents, 1.7 million words, 52k vocabulary
Industry Sector Data Set
(www.marketguide.com)
Top-level sectors (11 in all, shown in part): transportation, utilities, consumer, energy, services, ...
Example leaf classes: water, electric, gas, coal, integrated, air, misc, appliance, film, furniture, communication, railroad, water, trucking, oil&gas
• 71 classes, 6.5k documents, 1.2 million words, 30k vocabulary
Yahoo Science Data Set
(www.yahoo.com/Science)
Top-level categories (30 in all, shown in part): agriculture, biology, physics, CS, space, ...
Example subcategories: dairy, botany, cell, AI, courses, crops, craft, magnetism, HCI, missions, agronomy, evolution, forestry, relativity
• 264 classes, 14k documents, 3 million words, 76k vocabulary
Related Work • Shrinkage in Statistics: • [Stein 1955], [James & Stein 1961] • Deleted Interpolation in Language Modeling: • [Jelinek & Mercer 1980], [Seymore & Rosenfeld 1997] • Bayesian Hierarchical Modeling for n-grams • [MacKay & Peto 1994] • Class hierarchies for text classification • [Koller & Sahami 1997] • Using EM to set mixture weights in a hierarchical clustering model for unsupervised learning • [Hofmann & Puzicha 1998]
Future Work
• Learning hierarchies that aid classification.
• Using more complex generative models:
• Capturing word dependencies.
• Clustering words in each ancestor.
Shrinkage Conclusions • Shrinkage in a hierarchy of classes can dramatically improve classification accuracy. • Shrinkage helps especially when training data is sparse. In models more complex than naïve Bayes, it should be even more helpful. • [The hierarchy can be pruned for exponential reduction in computation necessary for classification; only minimal loss in accuracy.]
The Rest of the Talk Two Methods for Improving Parameter Estimation when Labeled Data is Sparse (1) Borrow data from related classes in a hierarchy. (2) Use unlabeled data.
Text Classification with Labeled and Unlabeled Documents Kamal Nigam Andrew McCallum Sebastian Thrun Tom Mitchell
The Scenario
• Training data with class labels: Web pages the user says are interesting; Web pages the user says are uninteresting.
• Data available at training time, but without class labels: Web pages the user hasn't seen or said anything about.
Can we use the unlabeled documents to increase accuracy?
Using the Unlabeled Data
1. Build a classification model using the limited labeled data.
2. Use the model to estimate the labels of the unlabeled documents.
3. Use all documents to build a new classification model, which is often more accurate because it is trained using more data.
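A minimal numpy sketch of this labeled-plus-unlabeled loop (not the talk's actual code; the function and variable names here are illustrative), using multinomial naïve Bayes with Laplace smoothing and soft EM labels for the unlabeled documents:

```python
import numpy as np

def em_naive_bayes(X_l, y_l, X_u, n_classes, n_iter=10, alpha=1.0):
    """Semi-supervised multinomial naive Bayes.
    X_l, X_u: (docs x vocab) word-count matrices; y_l: labels for X_l.
    Labeled docs keep fixed one-hot responsibilities; unlabeled docs
    start uniform and are re-estimated each EM iteration."""
    n_vocab = X_l.shape[1]
    R_l = np.eye(n_classes)[y_l]                       # fixed labeled responsibilities
    R_u = np.full((X_u.shape[0], n_classes), 1.0 / n_classes)
    for _ in range(n_iter):
        # M-step: new priors and word probabilities from expected counts
        R = np.vstack([R_l, R_u])
        X = np.vstack([X_l, X_u])
        prior = R.sum(axis=0) / R.shape[0]
        word_counts = R.T @ X                          # (classes x vocab)
        theta = (word_counts + alpha) / (
            word_counts.sum(axis=1, keepdims=True) + alpha * n_vocab)
        # E-step: re-estimate class posteriors for the unlabeled docs
        log_post = np.log(prior) + X_u @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        R_u = np.exp(log_post)
        R_u /= R_u.sum(axis=1, keepdims=True)
    return prior, theta, R_u

# Tiny example: class 0 docs favor word 0, class 1 docs favor word 1.
X_l = np.array([[5, 0, 1], [4, 1, 0], [0, 5, 1], [1, 4, 0]], dtype=float)
y_l = np.array([0, 0, 1, 1])
X_u = np.array([[3, 0, 0], [0, 3, 1]], dtype=float)
prior, theta, R_u = em_naive_bayes(X_l, y_l, X_u, n_classes=2)
```

The loop is exactly steps 1-3 above, repeated: the initial M-step is dominated by the labeled data, and each subsequent iteration folds the unlabeled documents' soft labels back into the parameter estimates.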
An Example
Labeled Data:
• Baseball: “The new hitter struck out...”, “Struck out in last inning...”, “Homerun in the first inning...”, “Pete Rose is not as good an athlete as Tara Lipinski...”
• Ice Skating: “Fell on the ice...”, “Perfect triple jump...”, “Katarina Witt's gold medal performance...”, “New ice skates...”, “Practice at the ice rink every day...”
Unlabeled Data:
• “Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal.”
• “Tara Lipinski bought a new house for her parents.”
Before EM: Pr(Lipinski | Baseball) = 0.003, Pr(Lipinski) = 0.001
After EM: Pr(Lipinski | Ice Skating) = 0.02, Pr(Lipinski) = 0.01
Filling in Missing Labels with EM
[Dempster et al ‘77], [Ghahramani & Jordan ‘95], [McLachlan & Krishnan ‘97]
Expectation Maximization is a class of iterative algorithms for maximum likelihood estimation with incomplete data.
• E-step: Use current estimates of model parameters to “guess” the values of the missing labels.
• M-step: Use current “guesses” for the missing labels to calculate new estimates of the model parameters.
• Repeat E- and M-steps until convergence.
EM finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.
EM for Text Classification
• Expectation-step: estimate the class labels of the unlabeled documents.
• Maximization-step: compute new parameters using those estimates.
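Both steps appeared as equations rendered to images; standard reconstructions, term-by-term consistent with the naïve Bayes and Laplace-smoothing formulas earlier in the talk:

```latex
\text{E-step:}\quad
\Pr(c \mid d) \;=\;
\frac{\Pr(c)\,\prod_i \Pr(w_{d_i} \mid c)}
     {\sum_{c'} \Pr(c')\,\prod_i \Pr(w_{d_i} \mid c')}
\qquad\qquad
\text{M-step:}\quad
\hat{\Pr}(w \mid c) \;=\;
\frac{1 + \sum_{d} \Pr(c \mid d)\, N(w,d)}
     {|V| + \sum_{w'} \sum_{d} \Pr(c \mid d)\, N(w',d)}
```

For labeled documents, Pr(c | d) is fixed to 1 for the known class and 0 otherwise; for unlabeled documents, it is the soft posterior from the E-step.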
WebKB Data Set
Classes: student, faculty, course, project
• 4 classes, 4,199 documents from CS academic departments
Word Vector Evolution with EM
• Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog
• Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec
• Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript
(D is a digit)
EM as Clustering (figure; X = unlabeled)
20 Newsgroups Data Set … sci.med sci.crypt sci.space alt.atheism sci.electronics comp.graphics talk.politics.misc comp.windows.x rec.sport.hockey talk.politics.guns talk.religion.misc rec.sport.baseball talk.politics.mideast comp.sys.mac.hardware comp.os.ms-windows.misc comp.sys.ibm.pc.hardware • 20 class labels, 20,000 documents • 62k unique words
Newsgroups Classification Accuracy, varying # labeled documents
Newsgroups Classification Accuracy, varying # unlabeled documents
WebKB Classification Accuracy, varying weight of unlabeled data
WebKB Classification Accuracy, varying # labeled documents and selecting unlabeled weight by CV
Reuters 21578 Data Set
Classes include: earn, interest, ship, acq, crude, grain (with subtopics wheat, corn), ...
• 135 class labels, 12,902 documents
Reuters 21578 Precision-Recall Breakeven, varying # mixture components for the negative class
Related Work • Using EM to reduce the need for training examples: • [Miller & Uyar 1997], [Shahshahani & Landgrebe 1994] • Using EM to fill in missing values • [Ghahramani & Jordan 1995] • AutoClass - unsupervised EM with Naïve Bayes: • [Cheeseman et al. 1988] • Co-Training • [Blum & Mitchell COLT’98] • Relevance Feedback for Information Retrieval • [Salton & Buckley 1990]
Unlabeled Data Conclusions & Future Work • Combining labeled and unlabeled data with EM can greatly reduce the need for labeled training data. • Exercise caution: EM can sometimes hurt. • Weight the unlabeled data. • Choose parametric model carefully. • Vary EM likelihood surface for different tasks. • Use similar techniques for other text tasks: e.g. Information Extraction.
Populating a hierarchy
• Naïve Bayes
• Simple, robust document classification.
• Many principled enhancements (e.g. shrinkage).
• Requires a lot of labeled training data.
• Keyword matching
• Requires no labeled training data.
• Human effort to select keywords (accuracy/coverage tradeoff).
• Brittle; breaks easily.