This paper presents a practical technique for partially supervised classification of text documents using the naive Bayes classifier and the Expectation-Maximization algorithm.
Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li • Presented by: Rick Knowles, 7 April 2005
Agenda • Problem Statement • Related Work • Theoretical Foundations • Proposed Technique • Evaluation • Conclusions
Problem Statement: Common Approach • Text categorization: automated assignment of text documents to pre-defined classes • Common Approach: Supervised Learning • Manually label a set of documents with pre-defined classes • Use a learning algorithm to build a classifier • [Figure: documents labeled positive (+) and negative (_)]
Problem Statement: Common Approach (cont.) • Problem: bottleneck associated with the large number of labeled training documents needed to build the classifier • Nigam et al. have shown that using a large dose of unlabeled data can help • [Figure: labeled positive (+) and negative (_) documents mixed with unlabeled (.) documents]
A different approach: Partially supervised classification • Two-class problem: positive and unlabeled • Key feature is that there are no labeled negative documents • Can be posed as a constrained optimization problem • Develop a function that correctly classifies all positive docs and minimizes the number of mixed docs classified as positive; such a function will have an expected error rate of no more than ε • Example: Finding matching (i.e., positive) documents in a large collection such as the Web • Matching documents are positive • All others are negative • [Figure: positive (+) documents among unlabeled (.) documents]
Related Work • Text Classification techniques • Naïve Bayesian • K-nearest neighbor • Support vector machines • Each requires labeled data for all classes • Problem similar to traditional information retrieval • Rank orders documents according to their similarities to the query document • Does not perform document classification
Theoretical Foundations • Some discussion regarding the theoretical foundations, focused primarily on • Minimization of the probability of error • Expected recall and precision of functions
$$\Pr[f(X) \neq Y] = \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\Pr[Y=1] \quad (1)$$
• Painful, painful… but it did show that you can build accurate classifiers with high probability when sufficient documents in P (the positive document set) and M (the unlabeled set) are available.
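The identity in (1) can be recovered in a few lines by splitting the error event into false positives and false negatives. The derivation below is a sketch that uses only the law of total probability, with f, X and Y as on the slide:

```latex
\begin{align*}
\Pr[f(X) \neq Y] &= \Pr[f(X)=1,\, Y=0] + \Pr[f(X)=0,\, Y=1] \\
\Pr[f(X)=1,\, Y=0] &= \Pr[f(X)=1] - \Pr[f(X)=1,\, Y=1] \\
\Pr[f(X)=1,\, Y=1] &= \Pr[Y=1] - \Pr[f(X)=0,\, Y=1] \\
\Rightarrow\ \Pr[f(X) \neq Y] &= \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0,\, Y=1] \\
&= \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]
\end{align*}
```

Since Pr[Y=1] is a constant of the problem, holding recall fixed (so that Pr[f(X)=0 | Y=1] is fixed) leaves Pr[f(X)=1] as the only term to minimize, which matches the "classify all positives correctly, then minimize the mixed docs classified as positive" formulation.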
Theoretical Foundations (cont.) • Two serious practical drawbacks to the theoretical method • Constrained optimization problem may not be easy to solve for the function class in which we are interested • Not easy to choose a desired recall level that will give a good classifier using the function class we are using
Proposed Technique • Theory be darned! • Paper introduces a practical technique based on the naïve Bayes classifier and the Expectation-Maximization (EM) algorithm • After introducing a general technique, the authors offer an enhancement using spies
Proposed Technique: Terms • D is the set of training documents • V = <w1, w2, …, w|V|> is the set of all words considered for classification • w_{di,k} is the word in position k in document di • N(wt, di) is the number of times wt occurs in di • C = {c1, c2} is the set of predefined classes • P is the set of positive documents • M is the mixed set of (unlabeled) documents • S is the set of spy documents • Posterior probability Pr[cj | di] ∈ {0,1} depends on the class label of the document
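Proposed Technique: naïve Bayesian classifier (NB-C)
$$\Pr[c_j] = \frac{\sum_{i=1}^{|D|} \Pr[c_j \mid d_i]}{|D|} \quad (2)$$
$$\Pr[w_t \mid c_j] = \frac{1 + \sum_{i=1}^{|D|} \Pr[c_j \mid d_i]\, N(w_t, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Pr[c_j \mid d_i]\, N(w_s, d_i)} \quad (3)$$
and, assuming the words are independent given the class,
$$\Pr[c_j \mid d_i] = \frac{\Pr[c_j] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_j]}{\sum_{r=1}^{|C|} \Pr[c_r] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_r]} \quad (4)$$
The class with the highest Pr[cj|di] is assigned as the class of the doc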
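To make equations (2) to (4) concrete, here is a minimal pure-Python sketch. It assumes documents are already tokenized into lists of words and that class posteriors are supplied as one list per document. The function names `train_nb` and `posterior` are illustrative, not from the paper, and the log-space arithmetic is an implementation choice to avoid underflow on long documents.

```python
import math
from collections import Counter

def train_nb(docs, post, vocab):
    """Estimate Pr[c_j] (eq. 2) and Pr[w_t | c_j] (eq. 3).

    docs: list of token lists; post[i][j] = Pr[c_j | d_i]; vocab: set of words.
    """
    n_classes = len(post[0])
    priors = [sum(p[j] for p in post) / len(docs) for j in range(n_classes)]
    word_probs = []
    for j in range(n_classes):
        weighted = Counter()  # sum_i Pr[c_j | d_i] * N(w, d_i) for each word w
        for tokens, p in zip(docs, post):
            for w, n in Counter(tokens).items():
                if w in vocab:
                    weighted[w] += p[j] * n
        total = sum(weighted.values())
        # Laplace smoothing: the "1 +" and "|V| +" terms of eq. 3.
        word_probs.append(
            {w: (1 + weighted[w]) / (len(vocab) + total) for w in vocab}
        )
    return priors, word_probs

def posterior(tokens, priors, word_probs):
    """Pr[c_j | d_i] by eq. 4, computed in log space (assumption: words
    outside the vocabulary are simply skipped)."""
    logs = []
    for prior, wp in zip(priors, word_probs):
        if prior == 0.0:
            logs.append(float("-inf"))
            continue
        score = math.log(prior)
        for w in tokens:
            if w in wp:
                score += math.log(wp[w])
        logs.append(score)
    top = max(logs)
    exps = [math.exp(s - top) for s in logs]
    norm = sum(exps)
    return [e / norm for e in exps]
```

With hard labels, `post` contains only 0s and 1s and this is an ordinary multinomial naïve Bayes; passing fractional posteriors is what lets EM reuse the same code.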
Proposed Technique: EM algorithm • Popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data • Two steps • Expectation: fills in the missing data • Maximization: parameters are estimated • Rinse and repeat • Using an NB-C, (4) is the E step, and (2) and (3) are the M step • Probability of a class now takes a value in [0,1] instead of {0,1}
Proposed Technique: EM algorithm (cont.) • All positive documents have the class value c1 • Need to determine the class value of each doc in the mixed set • EM can help assign a probabilistic class label to each document dj in the mixed set • Pr[c1|dj] and Pr[c2|dj] • After a number of iterations, all the probabilities will converge
Proposed Technique: Step 1 - Reinitialization (I-EM)
• Reinitialization
  • Build an initial NB-C using the document sets M and P
    • For each dj in P, Pr[c1|dj] = 1 and Pr[c2|dj] = 0
    • For each dj in M, Pr[c1|dj] = 0 and Pr[c2|dj] = 1
  • Loop while classifier parameters change
    • For each document dj ∈ M
      • Compute Pr[c1|dj] using the current NB-C
      • Pr[c2|dj] = 1 - Pr[c1|dj]
    • Update Pr[wt|c1] and Pr[c1] given the probabilistically assigned class for dj (Pr[c1|dj]) and P (a new NB-C is being built in the process)
• Works well on easy datasets
• Problem is that our initialization is strongly biased towards positive documents
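The I-EM loop above can be sketched as follows. This is a minimal two-class version under stated assumptions: documents are lists of tokens, the names `fit`, `prob_c1` and `i_em` are made up for illustration, and convergence is detected from the change in the posteriors (with hypothetical `max_iter` and `tol` parameters) rather than from the classifier parameters themselves.

```python
import math
from collections import Counter

def fit(docs, p_pos, vocab):
    """M step: eqs. (2) and (3) for c1 (index 0) and c2 (index 1).
    p_pos[i] is the current Pr[c1 | d_i]."""
    priors, word_probs = [], []
    for j in (0, 1):
        weights = [p if j == 0 else 1.0 - p for p in p_pos]
        priors.append(sum(weights) / len(docs))
        counts = Counter()
        for tokens, wgt in zip(docs, weights):
            for w in tokens:
                counts[w] += wgt
        total = sum(counts[w] for w in vocab)
        word_probs.append(
            {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
        )
    return priors, word_probs

def prob_c1(tokens, priors, word_probs):
    """E step: Pr[c1 | d] by eq. (4), in log space to avoid underflow."""
    logs = [
        math.log(max(pr, 1e-300))
        + sum(math.log(wp[w]) for w in tokens if w in wp)
        for pr, wp in zip(priors, word_probs)
    ]
    top = max(logs)
    e1, e2 = (math.exp(x - top) for x in logs)
    return e1 / (e1 + e2)

def i_em(P, M, max_iter=50, tol=1e-6):
    """Return Pr[c1 | d] for every d in M, plus the last fitted classifier."""
    docs = P + M
    vocab = {w for d in docs for w in d}
    p_m = [0.0] * len(M)  # every mixed document starts in c2
    for _ in range(max_iter):
        # Documents in P keep Pr[c1 | d] = 1 throughout.
        priors, word_probs = fit(docs, [1.0] * len(P) + p_m, vocab)
        new_p_m = [prob_c1(d, priors, word_probs) for d in M]
        change = max((abs(a - b) for a, b in zip(new_p_m, p_m)), default=0.0)
        p_m = new_p_m
        if change < tol:
            break
    return p_m, (priors, word_probs)
```

Because every mixed document starts as a full member of c2, any positives hidden in M initially pull the c2 word statistics toward the positive class, which is one way to see the bias noted on the slide.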
Proposed Technique: Step 1 - Spies • Problem is that our initialization is strongly biased towards positive documents • Need to identify some very likely negative documents from the mixed set • We do this by taking “spy” documents from the positive set P and putting them in the mixed set M • (10% of P was used) • A threshold t is set and those documents with a probabilistic label less than t are identified as negative • 15% was the threshold used • [Figure: spies moved from the positive set (c1) into the mixed set (c2); the mixed set is then split into likely negative (c2) and unlabeled documents]
Proposed Technique: Step 1 - Spies (cont.)
• N (most likely negative docs) = U (unlabeled docs) = ∅
• S (spies) = sample(P, s%)
• MS = M ∪ S
• P = P − S
• Assign every document di in P the class c1
• Assign every document dj in MS the class c2
• Run I-EM(MS, P)
• Classify each document dj in MS
• Determine the probability threshold t using S
• For each document dj in M
  • If its probability Pr[c1|dj] < t
    • N = N ∪ {dj}
  • Else U = U ∪ {dj}
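The spy step above can be sketched in a few lines once the classifier is abstracted away. Everything named here is hypothetical: `run_i_em` stands for any function that takes (positives, mixed-plus-spies) and returns Pr[c1 | d] for each document in the second argument, in order. The slides do not say how t is derived from S, so this sketch assumes t is set so that roughly a `noise` fraction of the spies fall below it.

```python
import random

def spy_step(P, M, run_i_em, spy_frac=0.10, noise=0.15, seed=0):
    """Split M into likely negatives N and remaining unlabeled docs U."""
    rng = random.Random(seed)
    n_spies = max(1, round(spy_frac * len(P)))
    spy_idx = set(rng.sample(range(len(P)), n_spies))
    S = [d for i, d in enumerate(P) if i in spy_idx]          # spies
    P_rest = [d for i, d in enumerate(P) if i not in spy_idx]  # P - S
    MS = M + S                                                 # M ∪ S

    # Pr[c1 | d] for every document in MS; spies occupy the tail of the list.
    probs = run_i_em(P_rest, MS)

    # Assumed threshold rule: the spy posterior at the `noise` quantile.
    spy_probs = sorted(probs[len(M):])
    t = spy_probs[int(noise * len(spy_probs))]

    # zip stops at the end of M, so spies are never placed in N or U.
    N = [d for d, p in zip(M, probs) if p < t]
    U = [d for d, p in zip(M, probs) if p >= t]
    return N, U, S, t
```

The idea is that spies are known positives that were treated exactly like mixed documents, so their posteriors indicate how positives hidden in M are likely to score; documents scoring below nearly all spies are the safest candidates for N.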
Proposed Technique: Step 2 - Building the final classifier • Using P, N and U as developed in the previous step • Put all the spy documents S back in P • Assign Pr[c1 | di] = 1 for all documents in P • Assign Pr[c2 | di] = 1 for all documents in N. This will change with each iteration of EM • Each doc dk in U is not assigned a label initially. At the end of the first iteration, it will have a probabilistic label Pr[c1 | dk] • Run EM using the document sets P, N and U until it converges • When EM stops, the final classifier has been produced • This two-step technique is called S-EM (Spy EM)
Proposed Technique: Selecting a classifier • EM converges only to a local maximum, so the final classifier may not cleanly separate the positive and negative documents • Likely if there are many local clusters • If so, from the set of classifiers developed over each iteration, select the one with the least probability of error • Refer to (1)
$$\Pr[f(X) \neq Y] = \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\Pr[Y=1]$$
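As a sanity check on the selection criterion, the sketch below evaluates the right-hand side of (1) on a small, fully labeled toy example and compares it with the directly counted error rate. This only illustrates the identity. In the partially supervised setting the true labels of M are unknown, and the slides do not cover how the paper estimates the terms from P and M. All function names are hypothetical.

```python
def error_rate(labels, preds):
    """Directly counted Pr[f(X) != Y] on fully labeled data."""
    return sum(y != f for y, f in zip(labels, preds)) / len(labels)

def error_via_identity(labels, preds):
    """Right-hand side of (1), each term estimated from the same data.
    Assumes labels and predictions are 0/1 and at least one label is 1."""
    n = len(labels)
    pr_f1 = sum(preds) / n                      # Pr[f(X) = 1]
    pr_y1 = sum(labels) / n                     # Pr[Y = 1]
    preds_on_pos = [f for y, f in zip(labels, preds) if y == 1]
    pr_f0_given_y1 = preds_on_pos.count(0) / len(preds_on_pos)
    return pr_f1 - pr_y1 + 2 * pr_f0_given_y1 * pr_y1

def select_classifier(labels, candidate_preds):
    """Index of the candidate (one prediction list per EM iteration)
    with the smallest error according to (1)."""
    errors = [error_via_identity(labels, preds) for preds in candidate_preds]
    return min(range(len(errors)), key=errors.__getitem__)
```

For example, with labels `[1, 1, 1, 0, 0, 0, 0, 0]` and predictions `[1, 1, 0, 1, 0, 0, 0, 0]`, both functions give an error of one quarter (one missed positive plus one false positive out of eight documents).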
Evaluation: Measurements • Breakeven Point • The point where p - r = 0 (i.e., p = r), where p is precision and r is recall • Only evaluates sorting order of class probabilities of documents • Not appropriate • F score • F = 2pr / (p+r) • Measures performance on a particular class • Reflects average effect of both precision and recall • Only when both p and r are large will F be large • Accuracy
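The F score on this slide is straightforward to compute from 0/1 labels and predictions. The helper names below are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this sketch rather than something the slides specify.

```python
def f_score(p, r):
    """F = 2pr / (p + r); defined as 0.0 when p + r is zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def precision_recall_f(labels, preds, positive=1):
    """Precision, recall and F score for the `positive` class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, f in pairs if y == positive and f == positive)
    fp = sum(1 for y, f in pairs if y != positive and f == positive)
    fn = sum(1 for y, f in pairs if y == positive and f != positive)
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return p, r, f_score(p, r)
```

A lopsided pair such as p = 0.9, r = 0.1 gives an F of about 0.18, far below the arithmetic mean of 0.5, which illustrates why F is large only when both precision and recall are large.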
Evaluation: Results • 2 large document corpora • 20NG • Removed Usenet headers and subject lines • WebKB • HTML tags removed • 8 iterations
Evaluation: Results (cont.) • Also varied the % of positive documents both in P (%a) and in M (%b)
Conclusions • This paper studied the problem of classification with only partial information: one class and a set of mixed documents • Technique • Naïve Bayes classifier • Expectation-Maximization algorithm • Reinitialized using the positive documents and the most likely negative documents to compensate for bias • Use an estimate of classification error to select a good classifier • Extremely accurate results