Improving Text Classification by Shrinkage in a Hierarchy of Classes

Andrew McCallum, Just Research & CMU
Roni Rosenfeld, CMU
Tom Mitchell, CMU
Andrew Y. Ng, MIT AI Lab
The Task: Document Classification
(also called "Document Categorization", "Routing", or "Tagging")

Automatically placing documents in their correct categories.

[Figure: a flat set of categories (Crops, Botany, Evolution, Magnetism, Relativity, Irrigation), each with its own training documents, e.g. "corn wheat silo farm grow..." for Crops, "water grating ditch farm tractor..." for Irrigation, "selection mutation Darwin Galapagos DNA..." for Evolution. A test document, "grow corn tractor...", must be assigned to its correct category (Crops).]
The Idea: "Shrinkage" / "Deleted Interpolation"

We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors.

[Figure: the same categories, now arranged in a hierarchy: Science at the root, with Agriculture, Biology, and Physics below it; Crops and Irrigation under Agriculture, Botany and Evolution under Biology, Magnetism and Relativity under Physics. The test document "corn grow tractor..." is again classified into Crops.]
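For intuition, a tiny worked example (the counts and weights below are made up for illustration, not taken from the paper). Suppose the Crops training data contains 100 word occurrences, none of which is "tractor", while its parent Agriculture, pooling the data of Crops and Irrigation, contains 1200 occurrences, 6 of them "tractor":

$$\Pr_{\mathrm{ML}}(\text{tractor} \mid \text{Crops}) = \tfrac{0}{100} = 0, \qquad \Pr_{\mathrm{ML}}(\text{tractor} \mid \text{Agriculture}) = \tfrac{6}{1200} = 0.005$$

An interpolated estimate such as $0.6 \cdot 0 + 0.3 \cdot 0.005 + 0.1 \cdot \Pr_{\mathrm{ML}}(\text{tractor} \mid \text{Science}) \geq 0.0015$ no longer assigns zero probability to an on-topic word, which is exactly what the test document "corn grow tractor..." needs.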
A Probabilistic Approach to Document Classification

Naïve Bayes:

$$\Pr(c_j \mid d) \propto \Pr(c_j) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)$$

where $c_j$ is a class, $d$ is a document, and $w_{d_i}$ is the $i$-th word of document $d$.

Maximum a posteriori estimate of $\Pr(w \mid c)$, with a Dirichlet prior, $\alpha = 1$ (a.k.a. Laplace smoothing):

$$\Pr(w \mid c_j) = \frac{1 + \sum_{d \in c_j} N(w, d)}{|V| + \sum_{w' \in V} \sum_{d \in c_j} N(w', d)}$$

where $N(w, d)$ is the number of times word $w$ occurs in document $d$ and $V$ is the vocabulary.
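A minimal sketch of this classifier in Python, assuming documents are given as lists of word tokens (the function and variable names are illustrative, not from the paper):

```python
import math
from collections import Counter

def train_naive_bayes(docs_by_class, vocab):
    """MAP estimates of Pr(w|c) with a Dirichlet prior, alpha = 1 (Laplace smoothing)."""
    priors, word_probs = {}, {}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total_docs
        counts = Counter(w for d in docs for w in d)   # sum over d in c of N(w, d)
        denom = len(vocab) + sum(counts.values())      # |V| + total word count in class c
        word_probs[c] = {w: (1 + counts[w]) / denom for w in vocab}
    return priors, word_probs

def classify(doc, priors, word_probs):
    """Pick argmax_c of log Pr(c) + sum_i log Pr(w_i | c); assumes doc's words are in vocab."""
    def log_score(c):
        return math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
    return max(priors, key=log_score)
```

Working in log space avoids the numerical underflow that multiplying thousands of small word probabilities would otherwise cause.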
"Shrinkage" / "Deleted Interpolation"
[James and Stein, 1961] / [Jelinek and Mercer, 1980]

The shrinkage-based estimate at a leaf interpolates, with weights λ, between the estimates along the path from the leaf to the root, plus a uniform distribution:

$$\hat{\Pr}(w \mid c_j) = \sum_{a=1}^{k_j} \lambda_j^a \Pr{}^a(w \mid c_j) \;+\; \lambda_j^{k_j+1} \frac{1}{|V|}, \qquad \sum_a \lambda_j^a = 1$$

where $\Pr^1(w \mid c_j)$ is the estimate at leaf $c_j$ itself, $\Pr^2, \ldots, \Pr^{k_j}$ are the estimates at its successive ancestors up to the root, and the last term treats the uniform distribution as a pseudo-ancestor above the root.

[Figure: the class hierarchy with a Uniform node above Science; below Science are Agriculture (Crops, Irrigation), Biology (Botany, Evolution), and Physics (Magnetism, Relativity).]
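A short sketch of this interpolation in Python, assuming the λ's have already been learned and each leaf knows its path to the root (the data-structure and function names are assumptions for illustration):

```python
def shrinkage_estimate(w, leaf, ml_prob, path_to_root, lambdas, vocab_size):
    """Interpolate the ML word estimates along the leaf-to-root path, plus a
    uniform term; lambdas[leaf] holds one weight per node plus one for uniform."""
    nodes = path_to_root[leaf]        # [leaf, parent, ..., root]
    lams = lambdas[leaf]              # len(nodes) + 1 weights, summing to 1
    p = sum(lam * ml_prob[node][w] for lam, node in zip(lams, nodes))
    return p + lams[-1] / vocab_size  # the uniform pseudo-ancestor

# Hypothetical usage: ml_prob maps node -> {word: ML estimate}, and
# path_to_root["crops"] = ["crops", "agriculture", "science"].
```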
Learning Mixture Weights

Learn the λ's via EM, performing the E-step with leave-one-out cross-validation.

E-step: use the current λ's to estimate the degree to which each node (e.g., Crops, Agriculture, Science, Uniform) was likely to have generated the words in held-out documents.

M-step: use those estimates to recalculate new values for the λ's.
Learning Mixture Weights

E-step: for each word occurrence $w$ held out from class $c_j$'s training data, estimate the probability that the $a$-th node on the path generated it (using leave-one-out estimates $\Pr^a$):

$$\beta_j^a(w) = \frac{\lambda_j^a \Pr{}^a(w \mid c_j)}{\sum_b \lambda_j^b \Pr{}^b(w \mid c_j)}$$

M-step: normalize the accumulated expectations into new weights:

$$\lambda_j^a = \frac{\sum_w \beta_j^a(w)}{\sum_b \sum_w \beta_j^b(w)}$$
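A compact sketch of this EM loop in Python; it assumes a callable loo_prob(node, w) returning the leave-one-out ML estimate of w at a node (that callable, the flat list of held-out word occurrences, and all names here are illustrative assumptions):

```python
def learn_lambdas(held_out_words, nodes, loo_prob, vocab_size, n_iters=50):
    """EM for the mixture weights of one leaf class.
    held_out_words: word occurrences left out of this leaf's training data.
    nodes: [leaf, parent, ..., root]; a uniform component is appended implicitly."""
    k = len(nodes) + 1                 # path nodes + the uniform component
    lams = [1.0 / k] * k               # start with uniform weights
    for _ in range(n_iters):
        totals = [0.0] * k
        for w in held_out_words:
            # E-step: responsibility of each node (and uniform) for this word
            scores = [lam * loo_prob(node, w) for lam, node in zip(lams, nodes)]
            scores.append(lams[-1] / vocab_size)
            z = sum(scores)            # > 0 thanks to the uniform component
            for a in range(k):
                totals[a] += scores[a] / z
        # M-step: expected counts, renormalized, become the new weights
        lams = [t / len(held_out_words) for t in totals]
    return lams
```

In practice one would iterate until the weights converge rather than for a fixed n_iters; the fixed loop keeps the sketch short.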
Newsgroups Data Set
(Subset of Ken Lang's 20 Newsgroups set)

[Figure: a hand-built hierarchy with top-level nodes computers (mac, ibm, X, graphics, windows), religion (atheism, christian, misc), sport (baseball, hockey), politics (misc, guns, mideast), and motor (auto, motorcycle).]

• 15 classes, 15k documents, 1.7 million words, 52k vocabulary
Newsgroups Hierarchy Mixture Weights

[Figure: the learned mixture weights at each node of the newsgroups hierarchy, compared for two training-set sizes.]

• 235 training documents (15/class)
• 7497 training documents (~500/class)
Industry Sector Data Set
(www.marketguide.com)

[Figure: a hierarchy with 11 top-level sectors, including transportation, utilities, consumer, energy, and services; leaves include water, electric, gas, coal, integrated, air, misc, appliance, film, furniture, communication, railroad, trucking, and oil&gas.]

• 71 classes, 6.5k documents, 1.2 million words, 30k vocabulary
Yahoo Science Data Set
(www.yahoo.com/Science)

[Figure: a hierarchy with 30 top-level topics, including agriculture (dairy, crops, agronomy), biology (botany, cell, evolution, forestry), physics (magnetism, relativity), CS (AI, HCI, courses), and space (craft, missions).]

• 264 classes, 14k documents, 3 million words, 76k vocabulary
Related Work • Shrinkage in Statistics: • [Stein 1955], [James & Stein 1961] • Deleted Interpolation in Language Modeling: • [Jelinek & Mercer 1980], [Seymore & Rosenfeld 1997] • Bayesian Hierarchical Modeling for n-grams • [MacKay & Peto 1994] • Class hierarchies for text classification • [Koller & Sahami 1997] • Using EM to set mixture weights in a hierarchical clustering model for unsupervised learning • [Hofmann & Puzicha 1998]
Conclusions
• Shrinkage in a hierarchy of classes can dramatically improve classification accuracy (by as much as 29% in our experiments).
• Shrinkage helps especially when training data is sparse. In models more complex than naïve Bayes, it should be even more helpful.
• The hierarchy can be pruned for an exponential reduction in the computation needed for classification, with only a minimal loss of accuracy.
Future Work
• Learning hierarchies that aid classification.
• Using more complex generative models:
  • Capturing word dependencies
  • Clustering words in each ancestor