Learn how to improve parameter estimation and classification accuracy in text classification using the Expectation Maximization (EM) algorithm for the Naive Bayes Classifier.
Text Classification from Labeled and Unlabeled Documents using EM
by Nigam, McCallum, Thrun, Mitchell
Machine Learning, 2000
Ablai | Kristie | Cindy | Shayan
Problem Statement
• Text classification is a fundamental problem in NLP: assigning tags or categories to text according to its content.
• Broad application area:
• Spam detection
• Sentiment analysis
• Topic labeling, and so on
Text Classification
Example: spam filtering
Naive Bayes Classifier:
Labels in NLP
Labeled data:
• Labels are expensive
• Labeling is slow
• Humans are error-prone
Unlabeled data:
• Usually free
• Available in large amounts
• Can be categorized by domain
Overview
M-step: train a NB classifier on labeled data
E-step: label the unlabeled data
M-step: train a new NB classifier on the data
E-step: relabel the data
Repeat until convergence
Contributions: λ-EM, significant performance improvements
Mixture Model
Data is modelled in terms of a mixture of components, where each component has a simple parametric form (such as multinomial or Gaussian).
Each cluster (component) is a generative model.
Generative Model
• Why called a 'generative' model?
• We assume there are underlying models that may have generated the given data
• Each cluster is parameterized by a disjoint subset of θ
• But we don't know these parameters of the underlying model
• So we estimate the parameters for all component clusters
Naive Bayes Classifier: a generative classifier
With training data, a certain probability distribution is assumed
• multinomial distributions → Naive Bayes classifier (a mixture of multinomials)
• The distribution's required parameters are calculated to be used in the classifier:
• plug parameters into Bayes' rule (later)
Assumptions
1. Data is produced by a mixture model
2. 1-to-1 correspondence between mixture components of the mixture model and classes of the classification problem
3. Naive Bayes assumption of word independence → reduces the number of parameters
Model Parameters
• Parameters of a mixture component:
• word probabilities (probability of a word given a component): P(x|c)
• mixture weight, i.e. class prior probabilities (probability of selecting a component): P(c)
Training the Classifier
Learning = estimating the parameters P(X|Y), P(Y)
We want to find the parameter values that are most probable given the training data.
How: use ratios of counts from the labeled training data + smoothing (worked example follows, with a code sketch after it)
Example
Label an email that contains the text: "Free $$$ bonus"
Training data:
Example adapted from "A practical explanation of a Naive Bayes classifier", Bruno Stecanella
Example
Goal: calculate whether the email has a higher probability of being spam or not spam, using Bayes' rule
Example
Make the naive Bayes assumption (word independence)
Example
In order to estimate the parameters, take ratios of counts:
P(free | spam) = (# times "free" appears in spam emails) / (total word count of spam emails)
P(free | not spam) = (# times "free" appears in not-spam emails) / (total word count of not-spam emails)
Example
Problem: a word that never appears in a class gets probability 0, which zeroes out the whole product.
Need to apply Laplace smoothing!
Example
Augment the numerator and denominator of each ratio with a "pseudo-count":
P(free | spam) = (# times "free" appears in spam emails + 1) / (total word count of spam emails + |V|)
P(free | not spam) = (# times "free" appears in not-spam emails + 1) / (total word count of not-spam emails + |V|)
where |V| is the vocabulary size.
Example
P(spam | free $$$ bonus) ∝ P(free | spam) * P($$$ | spam) * P(bonus | spam) * P(spam)
P(not spam | free $$$ bonus) ∝ P(free | not spam) * P($$$ | not spam) * P(bonus | not spam) * P(not spam)
P(spam | free $$$ bonus) > P(not spam | free $$$ bonus) → label as spam
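To make the arithmetic above concrete, here is a minimal Python sketch of the same smoothed naive Bayes scoring. The tiny training set and the names `train` and `score` are hypothetical (the slide's actual training-data table is not reproduced here); only the Laplace smoothing and the product of probabilities follow the formulas above.

```python
from collections import Counter

# Hypothetical training data (stand-in for the slide's table).
train = [
    ("win a free bonus now", "spam"),
    ("free $$$ offer inside", "spam"),
    ("meeting notes attached", "not spam"),
    ("lunch at noon tomorrow", "not spam"),
]

# Count words per class and count documents per class.
word_counts = {"spam": Counter(), "not spam": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def p_word_given_class(word, label):
    # Laplace smoothing: add a pseudo-count of 1 per vocabulary word.
    return (word_counts[label][word] + 1) / (sum(word_counts[label].values()) + len(vocab))

def score(text, label):
    # P(label) * prod_w P(w | label), under the naive independence assumption.
    p = class_counts[label] / sum(class_counts.values())
    for word in text.split():
        p *= p_word_given_class(word, label)
    return p

email = "free $$$ bonus"
predicted = max(("spam", "not spam"), key=lambda c: score(email, c))
print(predicted, score(email, "spam"), score(email, "not spam"))
```

In practice the products are computed as sums of log-probabilities to avoid numerical underflow on longer documents.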
Using the Classifier
Calculate the probability that a particular mixture component generated the document using Bayes' rule, with the estimated parameters.
Label: the class with the highest posterior probability of generating the document.
Naive Bayes was shown to do a good job at text classification, but can do better...
(Next: applying EM to naive Bayes to improve parameter estimation and classification accuracy)
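For reference, the classification rule described here is the standard multinomial naive Bayes posterior. Written out (my notation, with N(w_t, d_i) the count of word w_t in document d_i and θ̂ the estimated parameters):

```latex
P(c_j \mid d_i; \hat\theta)
  = \frac{P(c_j \mid \hat\theta)\,\prod_{t} P(w_t \mid c_j; \hat\theta)^{N(w_t, d_i)}}
         {\sum_{k} P(c_k \mid \hat\theta)\,\prod_{t} P(w_t \mid c_k; \hat\theta)^{N(w_t, d_i)}}
```

The predicted label is the arg max over classes c_j of this posterior.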
Preface to Expectation Maximization: Revisit K-Means
1. Randomly initialize centers.
2. Assign each point to one cluster based on distance.
3. Recompute each center based on the average of the points inside it.
Image: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
Preface to Expectation Maximization: Revisit K-Means
Iteration:
• Membership: fix centers, assign each point to one class.
• Readjust centers: fix point memberships, recompute each center.
What if we want to estimate a probability for how likely the point belongs to each class?
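As a reference point before moving to soft assignments, here is a compact k-means sketch in Python with NumPy; the function name, convergence test, and empty-cluster handling are my own choices, not from the slides.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain k-means: hard membership step, then center readjustment."""
    rng = np.random.default_rng(seed)
    # Randomly initialize centers by picking k distinct points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Membership: fix centers, assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Readjust: fix memberships, recompute each center as the mean of its points.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```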
Hard Clustering vs. Soft Clustering
Hard Clustering
• Every object is assigned to exactly one cluster i
• A_i = 0 or 1
• ∑_i A_i = 1 (summed over all clusters i)
Soft Clustering
• 0 ≤ A_i ≤ 1
• ∑_i A_i = 1 (summed over all clusters i)
Q: How do you do this soft clustering?
Mixture Models
• Each cluster is a generative model
• model: Gaussian or multinomial
• Parameters of the model are unknown -- to be estimated
How to estimate? Expectation Maximization!
Expectation Maximization: Basic Example
Assume we use a 1-D Gaussian model. Assume we know how many clusters, k.
If we know the true assignments, we can compute each cluster's mean and variance directly from its points.
Images from Victor Lavrenko
Expectation Maximization: Basic Example
If we don't know the true assignments BUT know the Gaussian parameters, we can guess how likely it is that each point belongs to each cluster:
posterior P(c | x) ∝ likelihood P(x | c) × prior P(c)
Images from Victor Lavrenko
Expectation Maximization: Basic Example
Issue: what if we don't know those Gaussian parameters either?
We need the Gaussian parameters to calculate the cluster posterior probabilities, but we need the cluster posterior probabilities to estimate the Gaussian parameters: a chicken-and-egg problem that EM resolves by alternating between the two.
Images from Victor Lavrenko
Expectation Maximization: Basic Example
• Initialization: randomly initialize k Gaussians (assume Gaussian)
• Each has its own mean and variance
• E-step: fix model parameters; "soft" assign points to clusters
• Each point has a probability of belonging to each cluster
• M-step: fix membership probabilities; adjust the parameters that maximize the expected likelihood
Expectation Maximization: Basic Example
Recalculate means and variances for each cluster.
Images from Victor Lavrenko
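Putting the steps together, here is a sketch of EM for a 1-D Gaussian mixture in Python with NumPy; the initialization scheme and the fixed iteration count are my own simplifications.

```python
import numpy as np

def em_1d_gmm(x, k, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture: soft E-step, then parameter M-step."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)   # random initialization
    variances = np.full(k, x.var())
    priors = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: responsibility of each cluster for each point,
        # proportional to likelihood * prior (normalized -> posterior).
        lik = np.exp(-(x[:, None] - means) ** 2 / (2 * variances)) \
              / np.sqrt(2 * np.pi * variances)                      # shape (n, k)
        resp = lik * priors
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, and mixture weights
        # from the soft counts.
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        priors = nk / len(x)
    return means, variances, priors
```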
Break
When we come back: wrapping up basic EM
Why pick EM?
• Unlabeled data -- we want to discover clusters
• Assume each cluster has an underlying model
• To estimate its parameters, we need an iterative method
• If we are interested in which model belongs to which class label...
(Illustration: example clusters labeled "Happy Volunteer", "Samaritan", "Cheat", "Scam (curse words)")
EM and Text
Spam? Not spam?
• Words in a document
• Word count
• Multinomial distribution
(Example messages: "Free $$$!!", "Free while supplies last!")
Given this class label, how likely will you generate this bag of words?
Naive Bayes and EM
• Initialization: Naive Bayes - estimate classifier parameters from labeled data
• Loop:
• Assign probabilistically-weighted class labels to each unlabeled document using EM
• Estimate new classifier parameters using BOTH labeled and unlabeled data
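A sketch of that loop around a multinomial naive Bayes classifier is below. It assumes dense bag-of-words count matrices `X_lab` and `X_unlab` (documents × vocabulary) and integer labels `y_lab`; the variable names, the Laplace smoothing, and the fixed iteration count are my own choices, and the λ weighting from the next slides is omitted here.

```python
import numpy as np

def nb_fit(word_counts_per_class, docs_per_class, vocab_size):
    """M-step: Laplace-smoothed multinomial naive Bayes parameters."""
    log_prior = np.log((docs_per_class + 1) /
                       (docs_per_class.sum() + len(docs_per_class)))
    log_word = np.log((word_counts_per_class + 1) /
                      (word_counts_per_class.sum(axis=1, keepdims=True) + vocab_size))
    return log_prior, log_word

def nb_posterior(X, log_prior, log_word):
    """E-step: P(class | document) for each row of the count matrix X."""
    log_joint = X @ log_word.T + log_prior             # (n_docs, n_classes)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unlab, n_classes, n_iters=10):
    V = X_lab.shape[1]
    onehot = np.eye(n_classes)[y_lab]                  # fixed hard labels
    # Initialization: train on labeled data only.
    log_prior, log_word = nb_fit(onehot.T @ X_lab, onehot.sum(axis=0), V)
    for _ in range(n_iters):
        # E-step: probabilistically label the unlabeled documents.
        post_u = nb_posterior(X_unlab, log_prior, log_word)
        # M-step: re-estimate parameters from labeled + soft-labeled data.
        wc = onehot.T @ X_lab + post_u.T @ X_unlab
        dc = onehot.sum(axis=0) + post_u.sum(axis=0)
        log_prior, log_word = nb_fit(wc, dc, V)
    return log_prior, log_word
```

In practice one would also monitor the log-likelihood for convergence rather than running a fixed number of iterations.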
Limitations of Basic EM
Let's look at the assumptions...
1. All data are generated by the mixture model
• Generated data uses the same parametric model used in classification
2. 1-to-1 correspondence between mixture components and classes
Unlabeled data helps when there is very limited labeled data... but what if there is a lot of labeled data?
Augmented EM - Part 1
1. Assumption: all data are generated by the mixture model
When enough labeled data is already provided, the unlabeled data overwhelms and badly skews the estimates.
→ Introduce a parameter 0 ≤ λ ≤ 1 to decrease the unlabeled documents' contribution
(The objective being maximized combines a labeled-data term, a prior term, and a λ-weighted unlabeled-data term.)
Augmented EM - Part 1
→ By weighting unlabeled documents by λ, the word counts of unlabeled documents are weighted down by a factor of λ
λ is selected by cross-validation
→ When setting 0 < λ < 1, classification improves
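Concretely, this corresponds to discounting the expected word counts coming from unlabeled documents in the M-step. A reconstruction of the λ-weighted, Laplace-smoothed word-probability estimate (my notation; D^l and D^u are the labeled and unlabeled sets, N(w_t, d_i) the count of word w_t in document d_i):

```latex
\hat{\theta}_{w_t \mid c_j} \;=\;
  \frac{1 \;+\; \sum_{d_i \in D^l} N(w_t, d_i)\, P(c_j \mid d_i)
          \;+\; \lambda \sum_{d_i \in D^u} N(w_t, d_i)\, P(c_j \mid d_i)}
       {|V| \;+\; \sum_{s=1}^{|V|} \Big( \sum_{d_i \in D^l} N(w_s, d_i)\, P(c_j \mid d_i)
          \;+\; \lambda \sum_{d_i \in D^u} N(w_s, d_i)\, P(c_j \mid d_i) \Big)}
```

For labeled documents P(c_j | d_i) is 1 for the given label and 0 otherwise, so λ = 0 recovers plain supervised naive Bayes and λ = 1 recovers basic EM.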
Augmented EM - Part 2
2. Assumption: 1-to-1 correspondence between mixture components and classes
→ Use a many-to-one correspondence instead
Ex: one class may be comprised of several different sub-topics.
Machine Learning → neural networks (e.g. ReLU, activation), Bayesian, regression (e.g. ANOVA, F-statistic), ...
One multinomial distribution may not be enough!
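Roughly, with a many-to-one correspondence the E-step softly assigns each document among all mixture components as before, and at classification time a class's posterior is the sum over that class's components (my notation, reconstructing the rule the slide implies):

```latex
P(c_j \mid d_i) \;=\; \sum_{m \,\in\, \mathrm{components}(c_j)} P(m \mid d_i)
```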
Experiments
A discussion of the practical results of this approach
Empirical Validation of the Proposed System
• Validation of all their claims:
• Unlabeled data and overall efficacy
• Weighting
• Multiple mixture components
Datasets
• Task: text classification
• We need datasets!
Datasets: UseNet (20 Newsgroups) - General Information
• Available at: http://qwone.com/~jason/20Newsgroups/
• 20 different newsgroups (the labels)
• 20,017 articles
• No considerable class imbalance (important)
Datasets: UseNet - In this work
• 62,258 unique words in the vocabulary
• Test set: 4,000 articles from the latest portion (20%) of the timeline
• The task is usually predicting future classes, not past ones
• The training set is composed of:
• 10,000 articles randomly selected from the rest, used as unlabeled data
• 6,000 documents used as labeled examples
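For illustration only, an approximation of this split using scikit-learn (not the authors' code; random sampling rather than the paper's chronological 80/20 split, and scikit-learn's default preprocessing rather than theirs) might look like:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# 20 Newsgroups as bag-of-words counts.
data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
X = CountVectorizer().fit_transform(data.data)
y = np.array(data.target)

# Rough stand-in for the split described above.
rng = np.random.default_rng(0)
idx = rng.permutation(X.shape[0])
test_idx = idx[:4000]
unlabeled_idx = idx[4000:14000]
labeled_idx = idx[14000:20000]
```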
Datasets: WebKB - General Information
• Available at http://www.cs.cmu.edu/~webkb/
• 8,145 web pages from CS departments
• Categories:
• student, faculty, staff, course, project, department, other
Datasets: WebKB - In this work
• Only the four main categories (the ones with more data) are used: 4,199 pages
• Numbers are converted to either a time token or a phone-number token
• No stemming or stoplist was applied
• They showed that it actually hurts performance
• The vocabulary is limited to the 300 most informative words
• This vocabulary size was selected empirically
• Testing uses the leave-one-university-out approach
• Test set: one complete CS department's data
• Unlabeled set: 2,500 pages randomly selected from the rest
• Labeled train set: same setup as before
Datasets: Reuters - General Information
• Available at http://www.daviddlewis.com/resources/testcollections/reuters21578/
• 12,902 articles
• 90 categories from the Reuters newswire