PARTIALLY SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS authors: B. Liu, W.S. Lee, P.S. Yu, X. Li presented by Rafal Ladysz
WHAT IT IS ABOUT • the paper shows: • document classification with • one class of positively labeled documents • accompanied by a set of unlabeled, mixed documents • the above enables building accurate classifiers • using the EM algorithm based on NB classification • strengthening EM with so-called “spy documents” • experimental results for illustration • we will browse through the paper and • emphasize/refresh some of its theoretical aspects • try to understand the methods described • look at the results obtained and interpret them
AGENDA (informally) • problem described • document classification • PSC – general assumptions • PSC – some theory • Bayes basics • EM in general • I-EM algorithm • introducing spies • S-EM algorithm • selecting a classifier • experimental data • results and conclusions • references
KEY PROBLEM – a big picture • no labeled negative training data (text documents) • only a (small) set of relevant (positive) documents • necessity to classify unlabeled text documents • importance: • finding relevant text on the web • or in digital libraries
DOCUMENT CLASSIFICATION – some techniques used • kNN (k Nearest Neighbors) • Linear Least Squares Fit • SVM (Support Vector Machines) • Naive Bayes: utilized here
PARTIALLY SUPERVISED CLASSIFICATION (PSC) – theoretical foundations • a fixed distribution D over the space X × Y, where Y = {0, 1} • X, Y: the sets of possible documents and classes (positive and negative), respectively • an “example” is a labeled document • two sets of documents: • labeled as positive: P of size n1, drawn from D_X|Y=1 • unlabeled: M of size n2, drawn independently from D_X • remark: there might be some relevant documents in M (but we don’t know about their existence!)
PSC cont. • Pr_D[A]: probability of an event A ⊆ X × Y for an example chosen randomly according to D • T: a finite sample, a subset of our dataset • Pr_T[A]: probability of A for an example chosen randomly from T ⊆ X × Y • learning algorithm: deals with F, a class of functions f: X → {0, 1}, and selects a function f from F to be used by the classifier • probability of error: Pr[f(X) ≠ Y] = Pr[(f(X) = 1) ∧ (Y = 0)] + Pr[(f(X) = 0) ∧ (Y = 1)] • the sum of the “false positive” and “false negative” cases
PSC: approximations (1) • after transforming the expression for the probability of error (derivation below): Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • notice: Pr[Y = 1] = const (no change of criteria) • approximation 1: keeping Pr[f(X) = 0 | Y = 1] small, the learning error ≈ Pr[f(X) = 1] − Pr[Y = 1] = Pr[f(X) = 1] − const, so minimizing the error ≈ minimizing Pr[f(X) = 1]
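The decomposition above follows from splitting the error into false positives and false negatives; a short derivation (standard probability algebra, reconstructed here rather than quoted from the paper):

\begin{align*}
\Pr[f(X) \neq Y] &= \Pr[f(X)=1, Y=0] + \Pr[f(X)=0, Y=1] \\
&= \Pr[f(X)=1] - \Pr[f(X)=1, Y=1] + \Pr[f(X)=0, Y=1] \\
&= \Pr[f(X)=1] - \big(\Pr[Y=1] - \Pr[f(X)=0, Y=1]\big) + \Pr[f(X)=0, Y=1] \\
&= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]
\end{align*}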
PSC: approximations (2) • error: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • approximation 2: keeping Pr[f(X) = 0 | Y = 1] small AND minimizing Pr[f(X) = 1] ≈ minimizing Pr_M[f(X) = 1] (assumption: most documents in M are irrelevant) AND keeping Pr_P[f(X) = 1] ≥ r, where r is the recall: (relevant retrieved) / (all relevant) • this holds for large enough sets P (positive) and M (unlabeled)
CONSTRAINED OPTIMIZATION • simply summarizing what has just been said: good learning results are achievable if: • the learning algorithm minimizes the number of unlabeled examples labeled as positive • the constraint that the fraction of errors on the positive examples is ≤ 1 − recall (declared upfront) is satisfied
COMPLEXITY FUNCTION (CF) • VC-dim: a complexity measure of F (the class of functions) • meaning: the cardinality of the largest sample set T ⊆ X such that |F|_T| = 2^|T| (i.e. F shatters T) • thus the larger such T, the richer the class F; conversely, the higher the VC-dim, the more functions in F • Naive Bayes: VC-dim ≤ 2m + 1, where m is the cardinality of the classifier’s vocabulary
CF – two cases • no noise: ∃ f_t ∈ F such that ∀ (X, Y) ~ D: Y = f_t(X) (a “perfect” function) • it can be shown that selecting f^ ∈ F which minimizes Σ_i=1..n2 f(X_i) over M AND has total recall on the set of positives (P) results in a function with small expected error • noise: Y may or may not equal f_t(X) • F may or may not contain the target function f_t • labels are noisy • specifying the target expected recall is required
CF in noise – modus operandi • the learning algorithm tries to output f^ ∈ F such that: • E[recall(f^)] ≥ r (that’s why a target recall is required) • E[precision(f^)] ≈ the best available among f ∈ F with recall(f) ≥ r • how the algorithm achieves that: • it selects a set of positive examples from D_X|Y=1 and unlabeled examples from D_X • it searches for a function f which minimizes Σ_i=1..n2 f(Z_i) over the unlabeled examples • under the constraint: the fraction of errors on the positives ≤ 1 − r
PROBABILITY vs. LIKELIHOOD • in the Webster dictionary: apparently synonyms • from the probabilistic point of view: • {s_i}: mutually exclusive (and exhaustive) states of nature • assuming the prior probabilities P(s_i) are known • observing experimental outcomes {o_j} gives more information • suppose that for outcome o_j under state s_i the quantity P(o_j|s_i) is known • it is the likelihood of the outcome o_j given the state s_i • Bayes’ theorem combines the prior probabilities with the likelihoods • and determines the posterior probability of each s_i • likelihood: the probability of the observed experimental outcome
NAIVE BAYES in general • formally, Bayes’ theorem can be written P(S_i|O_j) = P(O_j|S_i) P(S_i) / Σ_k=1..n P(O_j|S_k) P(S_k) and is sometimes called the Inverse Probability Law • NB model assumptions: • words randomly selected from the lexicon, with replacement • words’ independence (words as components of a feature vector) • even though simplistic, it works pretty well • NB together with EM will be employed here
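To make the prior/likelihood/posterior distinction concrete, here is a tiny illustrative Python example; the two states and all the numbers are made up for illustration, they are not from the paper:

# Two hypothetical states of nature s1, s2 with known priors, and the likelihoods
# of one observed outcome o under each state; Bayes' rule yields the posteriors.
prior = {"s1": 0.3, "s2": 0.7}
likelihood = {"s1": 0.8, "s2": 0.2}                                  # P(o | s_i)
evidence = sum(prior[s] * likelihood[s] for s in prior)              # P(o)
posterior = {s: prior[s] * likelihood[s] / evidence for s in prior}  # P(s_i | o)
print(posterior)                                 # {'s1': 0.631..., 's2': 0.368...}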
NB-BASED TEXT CLASSIFICATION – formalism • D: training set of documents, each document an ordered list of words w_t • V = <w1, w2, ..., w|V|>: the vocabulary used • w_di,k is the word ∈ V at position k of document d_i • C = {c1, c2, ..., c|C|}: predefined classes, here only c1, c2 • Pr[c_j|d_i]: the posterior probability needed • total probability: Pr[c_j] = Σ_i Pr[c_j|d_i] / |D| (indeed: Pr[d_i] = 1/|D|) • in the NB model: the class with the highest Pr[c_j|d_i] is assigned to the document
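The equations the slides refer to as (3)-(5) have, in the usual Laplace-smoothed multinomial NB formulation the paper builds on (reconstructed here, not copied verbatim), the following form, where N(w_t, d_i) counts the occurrences of word w_t in document d_i:

\Pr[w_t \mid c_j] = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\,\Pr[c_j \mid d_i]}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N(w_s, d_i)\,\Pr[c_j \mid d_i]} \qquad (3)

\Pr[c_j] = \frac{\sum_{i=1}^{|D|} \Pr[c_j \mid d_i]}{|D|} \qquad (4)

\Pr[c_j \mid d_i] = \frac{\Pr[c_j] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_j]}{\sum_{r=1}^{|C|} \Pr[c_r] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_r]} \qquad (5)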
ITERATIVE EXPECTATION-MAXIMIZATION ALGORITHM (I-EM): the concept • a general method for maximum-likelihood estimation of an underlying distribution’s parameters when the data is incomplete • two main applications of the EM algorithm: • when the data has missing values due to problems with the observation process • when optimizing the likelihood function directly is: • analytically hard • but the likelihood function can be simplified by assuming values for additional, hidden parameters
I-EM – mathematically • θ^(i+1) = argmax_θ Σ_z P(Z = z | x, θ^(i)) log L(x, Z = z | θ), where: x is the observed data, Z represents all hidden (unknown, missing) data, and θ stands for all (sought-after) parameters • problem: determine the parameters θ on the basis of the observed data only, i.e. without knowledge of the complete data set • solution: exploit the observed data and iteratively estimate the hidden data and θ
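Written as the usual two steps (a standard restatement of the update above, not quoted from the paper):

\text{E-step:}\quad Q(\theta \mid \theta^{(i)}) = \sum_{z} P(Z = z \mid x, \theta^{(i)})\,\log L(x, Z = z \mid \theta)

\text{M-step:}\quad \theta^{(i+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(i)})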
I-EM properties • simple but computationally demanding • convergence behavior: • no guarantee of a global optimum • the initial point θ^(0) determines whether the global optimum is reachable (or the algorithm gets stuck in a local optimum) • stable: the likelihood function increases in every iteration until a (local if not global) optimum is reached • maximum-likelihood estimates are fixed points of EM
I-EM ALGORITHM – why and how • for the classification (the main objective) the posterior probability Pr[c_j|d_i] is needed • the probabilities converge during the iterations • EM: an iterative algorithm for maximum-likelihood estimation from incomplete data (it interpolates) • two steps: 1. expectation: filling in the missing data 2. maximization: estimating the parameters; then the next iteration is launched
I-EM: symbols used • D: training set of documents • each document: an ordered list of words • w_di,k: the k-th word of the i-th document • each w_di,k ∈ V = {w1, w2, ..., w|V|} (the vocabulary) • vocabulary: all words occurring in the documents to be classified • C = {c1, c2}: predefined classes (only 2)
I-EM – application • initial labeling: • d_i ∈ P → c1, i.e. Pr[c1|d_i] = 1, Pr[c2|d_i] = 0 • d_j ∈ M → c2, i.e. Pr[c1|d_j] = 0, Pr[c2|d_j] = 1 • an NB-C is created, then applied to the dataset M: • computing the posterior probabilities Pr[c1|d_j] in M (eq. 5) • assigning the newly computed probabilistic label to d_j • Pr[c1|d_i] = 1 for d_i ∈ P is not affected during the process • in each iteration: • Pr[c1|d_j] is revised, then • a new NB-C is built based on the new Pr[c1|d_j] for M and on P • iterating continues until convergence occurs
I-EM pseudocode I-EM(M, P) 1. build an initial NB classifier NB-C using the M and P sets 2. loop while the NB-C parameters keep changing (i.e. as long as convergence is still taking place) 3. for each document dj ∈ M 4. compute Pr[c1|dj] using the current NB-C (eq. 5) // Pr[c2|dj] = 1 - Pr[c1|dj]: c1 and c2 are mutually exclusive and exhaustive // if Pr[c1|dj] > Pr[c2|dj] then dj is classified as c1 5. update Pr[wt|c1] and Pr[c1] (eq. 3, 4) // given the probabilistically assigned classes for dj (Pr[c1|dj]) and the set P, // a new NB-C is built during this processing
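A minimal, self-contained Python sketch of this loop (my own illustration, not the authors' code). It assumes the documents have already been turned into word-count vectors (e.g. with scikit-learn's CountVectorizer), and a fixed iteration count stands in for the convergence test:

import numpy as np

def nb_train(X, post_c1):
    """Eq. (3)/(4)-style smoothed estimates from soft labels post_c1 = Pr[c1|d]."""
    resp = np.vstack([post_c1, 1.0 - post_c1])              # 2 x n responsibilities
    log_prior = np.log(resp.sum(axis=1) / X.shape[0])       # Pr[c_j]
    counts = resp @ X                                       # 2 x |V| expected word counts
    log_cond = np.log((1.0 + counts) /
                      (X.shape[1] + counts.sum(axis=1, keepdims=True)))
    return log_prior, log_cond

def nb_posterior(X, log_prior, log_cond):
    """Eq. (5)-style posterior Pr[c1|d] for each document (row of X)."""
    joint = log_prior[:, None] + log_cond @ X.T             # 2 x n log joints
    joint -= joint.max(axis=0)                              # numerical stability
    p = np.exp(joint)
    return (p / p.sum(axis=0))[0]                           # Pr[c1|d]

def i_em(P, M, n_iter=20):
    """I-EM(M, P): P fixed as c1, M initialized as c2, labels of M revised each round."""
    X = np.vstack([P, M])
    post = np.concatenate([np.ones(len(P)), np.zeros(len(M))])
    for _ in range(n_iter):
        log_prior, log_cond = nb_train(X, post)                 # rebuild NB-C
        post[len(P):] = nb_posterior(M, log_prior, log_cond)    # revise Pr[c1|dj] for M
    return log_prior, log_cond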
I-EM – benefits and limitations • the EM algorithm helps assign probabilistic class labels to each d_j in the mixed set of documents: Pr[c1|d_j] and Pr[c2|d_j] • all the above probabilities converge over the iterations • the final result is sensitive to the initial conditions assumed • conclusion: • good handling of “easy” data (positives/negatives easily separable) • room for improvement on “hard” data • source of the limitation: the initialization is strongly biased towards the positive data (documents) • solution: • balanced initialization (positive/negative) • find reliable negative documents to initialize c2 in EM
I-EM: extension • I-EM helps identify the (most likely) negatives in M • issue: how to get data (documents) that are as reliable as possible for doing so • idea: use “spy” documents from P inside M • approach: • select s ≈ 10% of the documents from P; denote them S • add the set S to the set M • the spies in S behave the way the unknown positive documents in M do • enabling inference within M • I-EM is still in use • but instead of M it operates on M ∪ S
SPIES – determining the threshold • set of spy documents S = {s1, s2, ..., sk} • Pr[c1|s_i]: the probabilistic label assigned to each spy s_i • in the noiseless case: t = min{Pr[c1|s_i]}, i = 1, 2, ..., k • equivalent to retrieving all spy documents • in a more realistic scenario noise and outliers exist • the minimum probability might be unreliable, because e.g.: for an outlier s_i in S the posterior Pr[c1|s_i] might be << Pr[c1|d_j] for many d_j ∈ M • setting t: • sort the s_i in S according to Pr[c1|s_i] • set a noise level l (e.g. 15%) so that l% of the spies have probability < t • thus, the Step-1 objective is: • identifying a set of reliable negative documents from the unlabeled set • the unlabeled set is treated as negative data (docs)
SPY DOCUMENTS and the Step-1 algorithm • the threshold t is used for decision making: • if Pr[c1|dj] < t for dj ∈ M: dj is put into LN (likely negative) • if Pr[c1|dj] ≥ t for dj ∈ M: dj stays in U (unlabeled) • this is the Step-1 algorithm for identifying the most likely negatives LN within the unlabeled set (a sketch follows below)
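A hedged Python sketch of Step-1 (again my own illustration), reusing the i_em and nb_posterior helpers from the I-EM sketch above; spy_ratio and noise_level play the roles of s and l:

import numpy as np

def step1(P, M, spy_ratio=0.10, noise_level=0.15, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    k = max(1, int(spy_ratio * len(P)))
    spy_idx = rng.choice(len(P), size=k, replace=False)
    S = P[spy_idx]                                        # the spy documents
    P_rest = np.delete(P, spy_idx, axis=0)
    log_prior, log_cond = i_em(P_rest, np.vstack([M, S])) # I-EM on (P \ S) and (M + S)
    spy_post = nb_posterior(S, log_prior, log_cond)       # Pr[c1|s_i] for each spy
    t = np.sort(spy_post)[int(noise_level * k)]           # l% of the spies fall below t
    m_post = nb_posterior(M, log_prior, log_cond)
    LN, U = M[m_post < t], M[m_post >= t]                 # likely negatives vs. still unlabeled
    return LN, U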
STEP-1 effect (diagram) • before: M = P ∪ N, with no clue which documents are positive and which are negative; spies from P are added to M • after: with the help of the spies, most positives in M end up in the unlabeled set U, while most negatives end up in LN (likely negative); the purity of LN is higher than that of M
STEP-2: building and selecting the final classifier • EM is still in use, but now with P, LN and U • the algorithm proceeds as follows: • put all spies S back into P (where they were before) • d_i ∈ P: c1 (i.e. Pr[c1|d_i] = 1); fixed throughout the iterations • d_j ∈ LN: c2 (i.e. Pr[c2|d_j] = 1); allowed to change through EM • d_k ∈ U: initially assigned no label (it gets one after EM(1)) • run EM using P, LN and U until it converges • the final classifier is produced when EM stops • all of this constitutes S-EM (spy EM); a sketch of the re-initialization follows below
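A sketch of the Step-2 re-initialization in the same style (reusing nb_train and nb_posterior from the I-EM sketch; a fixed iteration count again replaces the convergence test). EM(1) is built from P and LN only; U receives probabilistic labels from EM(2) onward:

import numpy as np

def step2_em(P, LN, U, n_iter=20):
    # EM(1): classifier from P (c1, fixed) and LN (c2) only
    params = nb_train(np.vstack([P, LN]),
                      np.concatenate([np.ones(len(P)), np.zeros(len(LN))]))
    classifiers = [params]
    X = np.vstack([P, LN, U])
    post = np.concatenate([np.ones(len(P)), np.zeros(len(LN) + len(U))])
    for _ in range(n_iter):
        # E-step: revise Pr[c1|d] for LN and U; documents in P stay fixed at 1
        post[len(P):] = nb_posterior(np.vstack([LN, U]), *params)
        params = nb_train(X, post)                        # M-step: new NB-C
        classifiers.append(params)
    return classifiers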
STEP-2: comments • the probabilities of the documents in U and LN are allowed to change • from EM(2) onward the set U participates in EM with its documents assigned the probabilistic labels Pr[c1|d_k] • experimenting with a = 5%, 10% or 20% gave similar results; why? • for the parameter a (the % used for creating LN): when it is within a range of approximately 5%-20%, even if too many positives end up in LN, EM slowly corrects this by moving them back to the positives
STEP-1 AND STEP-2 SUMMARY • Step 1: identifying a set of reliable negative documents from the unlabeled set; the unlabeled set is treated as negative data. • Step 2: building and selecting a classifier; it consists of two sub-steps: a) building a set of classifiers by iteratively applying a classification algorithm; the EM algorithm is used again b) selecting a good classifier from the set of classifiers constructed above; this sub-step may be called "catching a good classifier"
SELECTING A CLASSIFIER • as said, EM is prone to the local-maxima trap • if a local maximum separates the two classes well: no problem (or the problem is solved) • otherwise (i.e. when the positives and the negatives each consist of many clusters) the data may not be separable • remedy: stop iterating EM at some point • but at what point?
SELECTING A CLASSIFIER continued • eq. (2) can be helpful: error probability Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0|Y = 1] Pr[Y = 1] • it can be shown that knowing the component Pr_M[Y = c1] allows us to estimate the error • method: estimate the change Δ_i of the error probability between iterations i and i+1 • Δ_i can be computed (formula in Section 4.5 of the paper) • if Δ_i > 0 for the first time, then the i-th classifier produced is the last one to keep (no need to proceed beyond i); a sketch of this stopping rule follows below
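A minimal Python sketch of that stopping rule; estimate_error stands in for the error estimate the paper derives in Section 4.5 (not reproduced here), and classifiers is the list produced by successive EM iterations (e.g. by step2_em above):

def select_classifier(classifiers, estimate_error):
    """Return the last classifier before the estimated error starts to increase."""
    best = classifiers[0]
    prev_err = estimate_error(best)
    for clf in classifiers[1:]:
        err = estimate_error(clf)
        if err - prev_err > 0:        # Delta_i > 0 for the first time
            return best               # keep the i-th classifier, discard the rest
        best, prev_err = clf, err
    return best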
EXPERIMENTAL DATA described • two large document corpora • 30 datasets created • e.g. 20 newsgroups subdivided into 4 groups • all headers removed • e.g. WebKB (CS depts.) subdivided into 7 categories • objective: • recovering positive documents placed into mixed sets • no need to separate test set (from training set) • unlabeled mixed set serves as the test set
DATA description cont. • for each experiment: • the full positive set is divided into two subsets: P and R • P: the positive set used in the algorithm, containing a% of the full positive set • R: the set of remaining positive documents; b% of R is put into the mixed set M, which is treated as negative data (not all of R goes into M) • belief: in reality M is large and has only a small proportion of positive documents • the parameters a and b have been varied to cover different scenarios
EXPERIMENTAL RESULTS • techniques compared: • NB-C: applied directly to P (c1) and M (c2) to build a classifier, which is then used to classify the data in the set M • I-EM: applies the EM algorithm to P and M until convergence (no spies yet); the final classifier is then applied to M to identify its positives • S-EM: spies are used to re-initialize I-EM and build the final classifier; the threshold t is used
RESULTS cont. • Table 1: 30 results for different parameters a, b • Table 2: summary of averages for other a, b settings • F-score F = 2pr/(p+r), where p and r are precision and recall, respectively • S-EM dramatically outperforms NB and I-EM in F • accuracy (of a classifier) A = c/(c+i), where c and i are the numbers of correct and incorrect decisions, respectively • S-EM outperforms NB and I-EM in A as well • comment: the datasets are skewed (positives are only a small fraction), thus A is not a reliable measure of a classifier’s performance
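For reference, the two measures in a few lines of Python (illustrative helpers, not from the paper):

def f_score(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0        # precision
    r = tp / (tp + fn) if tp + fn else 0.0        # recall
    return 2 * p * r / (p + r) if p + r else 0.0  # F = 2pr / (p + r)

def accuracy(c, i):
    return c / (c + i)                            # A = c / (c + i)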
RESULTS cont. • Table 3: F-score and accuracy A • the results in this table show the great effect of the re-initialization with spies: S-EM outperforms I-EM-best • re-initialization is not, however, the only factor of improvement: S-EM also outperforms S-EM4 • conclusions: both Step-1 (re-initializing) and Step-2 (selecting the best model) are needed!
REFERENCES other than in the paper
• http://www.cs.uic.edu/~liub/LPU/LPU-download.html
• http://www.ant.uni-bremen.de/teaching/sem/ws02_03/slides/em_mud.pdf
• http://www.mcs.vuw.ac.nz/~vignaux/docs/Adams_NLJ.html
• http://plato.stanford.edu/entries/bayes-theorem/
• http://www.math.uiuc.edu/~hildebr/361/cargoat1sol.pdf
• http://jimvb.home.mindspring.com/monthall.htm
• http://www2.sjsu.edu/faculty/watkins/mhall.htm
• http://www.aei-potsdam.mpg.de/~mpoessel/Mathe/3door.html
• http://ccrma-www.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html