Learning to Classify Documents with Only a Small Positive Training Set
Xiao-Li Li
Institute for Infocomm Research, Singapore
Nanyang Technological University, Singapore
Joint work with Bing Liu (University of Illinois at Chicago) and See-Kiong Ng (Institute for Infocomm Research)
Outline • 1. Introduction to the problem • 2. The proposed technique LPLP • 3. Empirical evaluation • 4. Conclusions
1. Introduction • Traditional Supervised Learning • Given a set of labeled training documents of n classes, the system uses this set to build a classifier. • The classifier is then used to classify new documents into the n classes. • It typically requires a large number of labeled examples, and labeling can be an expensive and tedious process.
Positive-Unlabeled (PU) Learning • One way to reduce the amount of labeled training data is to develop classification algorithms that can learn from a set P of labeled positive examples augmented with a set U of unlabeled examples. • Then, build a classifier using P and U to classify the data in U as well as future test data. We call this the PU learning problem.
PU Learning • Positive documents: a set P of documents all belonging to the class of interest, and • Unlabeled (or mixed) set: a set U of unlabeled documents containing both documents from the positive class and documents not from it (negative documents). • Goal: build a classifier to classify the documents in U and future (test) data.
An illustration of typical PU learning: P contains machine learning papers (e.g., from ECML); U contains papers from AAAI. A classifier built from P and U aims to automatically find the hidden positives (machine learning papers) in U.
Applications of the problem • Given the ECML proceedings, find all machine learning papers in the AAAI, IJCAI and KDD proceedings. • Given one's bookmarks, identify documents from Web sources that are of interest to him/her. • A company that has a database of its existing customers wants to find potential new customers from a larger database of people.
Related works • Theoretical study: Denis (1998), Muggleton (2001) and Liu et al. (2002) show that this problem is learnable. • One-class SVM: Schölkopf et al. (1999) and others proposed one-class SVMs. • S-EM: Liu, Lee, Yu and Li (ICML 2002) proposed a method (called S-EM) to solve the problem based on a spy technique, naïve Bayesian classification (NB) and the EM algorithm.
Related works • PEBL: Yu et al. (KDD 2002) proposed an SVM-based technique to classify Web pages given positive and unlabeled pages. • NBP: Denis's group also built an NBP system. • Roc-SVM: Li and Liu (IJCAI 2003) proposed a Rocchio- and SVM-based method.
Can we use the current techniques in some real applications?
A real-life business intelligence application: searching for information on related products. A company that sells computer printers may want to do a product comparison among the various printers currently on the market (e.g., printer pages from Amazon and CNET).
Current techniques cannot work well here. Why?
The Assumption (1) of current techniques • There is a sufficiently large set of positive training examples. • However, in practice, obtaining a large number of positive examples can be rather difficult in many real applications.
When assumption (1) does not hold, the small positive set may not even adequately represent the whole positive class.
Illustration: PU learning with a small positive training set (two candidate decision boundaries, H1 and H2).
The Assumption (2) of current techniques • The positive set P and the hidden positive examples in the unlabeled set U are generated from the same distribution. • In practice this may not hold (e.g., printer pages from Amazon vs. CNET): different Web sites present similar products in different styles and have different focuses.
2. The proposed technique: Ideas • Although their distributions may differ, the positive documents in P and the hidden positives in U should still be similar in some underlying feature dimensions (or subspaces), as they belong to the same class. • E.g., printer pages from Amazon and CNET share representative word features such as "printer", "inkjet", "laser", "ppm", etc.
The proposed technique: Ideas (cont.) • If we can find such a set of representative word features (RW) from the positive set P and the unlabeled set U, we can use them to extract the other hidden positive documents from U that share these features. • Method: LPLP (Learning from Probabilistically Labeled Positive examples).
The proposed technique: LPLP • 1. Select the set of representative word features RW from the given positive set P. • 2. Extract the likely positive documents from U and probabilistically label them based on the set RW. • 3. Employ the EM algorithm to build an accurate classifier that identifies the hidden positive examples in U.
Step 1: Selecting a set of representative word features from P • The scoring function s() is based on the TF-IDF method. • It gives high scores to words that occur frequently in the positive set P but not in the whole corpus, since U contains many other unrelated documents (see the sketch below).
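As a rough illustration of Step 1, the sketch below scores each word by a simple TF-IDF-style measure (frequent in P, rare over the whole corpus P ∪ U) and keeps the top k. The function name, the toy data, and the exact form of the score are our assumptions; the paper's actual scoring function s() may differ in detail.

```python
# A minimal sketch of Step 1 under simple assumptions: a TF-IDF-style score
# favors words frequent in P but rare across P U U. Names and data are illustrative.
import math
from collections import Counter

def select_representative_words(P, U, k=10):
    """P, U: lists of tokenized documents (lists of words). Returns the top-k words."""
    corpus = P + U
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))   # document frequency over P and U
    tf_pos = Counter(w for doc in P for w in doc)         # term frequency within P only
    total_pos = sum(tf_pos.values())
    scores = {w: (tf / total_pos) * math.log(n_docs / df[w]) for w, tf in tf_pos.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

# Toy example: printer pages as P, mixed pages as U
P = [["inkjet", "printer", "ppm"], ["laser", "printer", "toner"]]
U = [["printer", "ppm", "price"], ["camera", "lens"], ["notebook", "cpu"]]
print(select_representative_words(P, U, k=4))
```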
Illustration: representative word features selected from P, e.g., "printer", "inkjet", "laser", "ppm", which also appear in printer pages from Amazon and CNET.
Step 2: Identifying LP from U and probabilistically labeling the documents in LP • rd: a representative document consisting of all the representative word features. • Compare each document di in U with rd using cosine similarity, which produces a set LP of probabilistically labeled documents with Pr(di|+) > 0. • The hidden positive examples in LP are assigned high probabilities, while the negative examples in LP are assigned very low probabilities (see the sketch below).
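The sketch below illustrates Step 2 under the assumption that the cosine similarity between each document in U and rd (restricted to the representative words) is used directly as its soft positive label; all names are illustrative.

```python
# A minimal sketch of Step 2: compare every document in U with the representative
# document rd (a bag of the representative words RW) via cosine similarity and use
# the similarity as a soft positive label. Names and details are illustrative.
import math
from collections import Counter

def cosine(v1, v2):
    dot = sum(v1[w] * v2.get(w, 0) for w in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def probabilistically_label(U, RW):
    rd = Counter(RW)                          # rd consists of all representative features
    LP, RU = [], []                           # likely positives / remaining unlabeled
    for doc in U:
        vec = Counter(w for w in doc if w in rd)
        sim = cosine(vec, rd)
        if sim > 0:
            LP.append((doc, sim))             # higher similarity -> more likely positive
        else:
            RU.append(doc)
    return LP, RU

RW = ["printer", "inkjet", "laser", "ppm"]
U = [["printer", "ppm", "price"], ["camera", "lens"]]
LP, RU = probabilistically_label(U, RW)
print(len(LP), len(RU))   # 1 likely positive, 1 remaining unlabeled
```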
Illustration: the representative document rd (printer, inkjet, laser, ppm) is compared against each document in U; high-probability documents form LP, and the remaining low-probability documents form RU.
The Naïve Bayesian method
Classifier parameters:
(1) \Pr(c_j) = \frac{\sum_{i=1}^{|D|} \Pr(c_j|d_i)}{|D|}
(2) \Pr(w_t|c_j) = \frac{\lambda + \sum_{i=1}^{|D|} N(w_t, d_i)\,\Pr(c_j|d_i)}{\lambda|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\,\Pr(c_j|d_i)}
Classifier:
(3) \Pr(c_j|d_i) = \frac{\Pr(c_j) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k}|c_j)}{\sum_{r=1}^{|C|} \Pr(c_r) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k}|c_r)}
where N(w_t, d_i) is the number of times word w_t occurs in document d_i and \lambda is the smoothing factor.
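A compact sketch of a multinomial naïve Bayes classifier that accepts probabilistic (soft) labels Pr(c_j|d_i), in the spirit of equations (1)-(3) above. The smoothing constant lam and all function names are our assumptions, not the paper's code.

```python
# A minimal sketch of multinomial naive Bayes with probabilistic (soft) labels,
# following equations (1)-(3). lam is a Lidstone smoothing constant; names are ours.
import math
from collections import defaultdict

def train_nb(docs, soft_labels, vocab, lam=1.0):
    """docs: tokenized documents; soft_labels: one dict {class: Pr(c|d)} per document."""
    classes = {c for lbl in soft_labels for c in lbl}
    # equation (1): class priors are the averaged soft labels
    prior = {c: sum(lbl.get(c, 0.0) for lbl in soft_labels) / len(docs) for c in classes}
    # expected word counts per class, weighted by Pr(c|d)
    counts = {c: defaultdict(float) for c in classes}
    for doc, lbl in zip(docs, soft_labels):
        for c, p in lbl.items():
            for w in doc:
                counts[c][w] += p
    # equation (2): smoothed conditional word probabilities
    cond = {}
    for c in classes:
        total = sum(counts[c].values())
        cond[c] = {w: (lam + counts[c][w]) / (lam * len(vocab) + total) for w in vocab}
    return prior, cond

def classify(doc, prior, cond):
    """Equation (3): posterior Pr(c|d) for a tokenized document."""
    log_post = {}
    for c in prior:
        lp = math.log(max(prior[c], 1e-12))
        for w in doc:
            if w in cond[c]:              # ignore out-of-vocabulary words
                lp += math.log(cond[c][w])
        log_post[c] = lp
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return {c: math.exp(v - m) / z for c, v in log_post.items()}
```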
Step 3: EM algorithm • Re-initialize the EM algorithm by treating the probabilistically labeled set LP (with or without P) as the positive documents. • LP has a distribution similar to that of the other hidden positive documents in U. • The remaining unlabeled set RU is also much purer than U when used as the negative set (see the sketch below).
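The sketch below shows one way the EM re-initialization could be wired together, reusing train_nb and classify from the naïve Bayes sketch above; LP carries its cosine similarities as soft positive labels, and the number of iterations is an arbitrary choice of ours.

```python
# A minimal sketch of Step 3: EM re-initialized with LP as (soft) positives and RU
# as negatives, reusing train_nb and classify from the naive Bayes sketch above.
def run_em(LP, RU, vocab, iters=10):
    docs = [d for d, _ in LP] + RU
    # initial soft labels: LP is probabilistically positive, RU starts as negative
    labels = [{"+": s, "-": 1.0 - s} for _, s in LP] + [{"+": 0.0, "-": 1.0} for _ in RU]
    for _ in range(iters):
        prior, cond = train_nb(docs, labels, vocab)         # M-step: re-estimate parameters
        labels = [classify(d, prior, cond) for d in docs]   # E-step: re-assign soft labels
    return prior, cond   # final NB classifier used to find hidden positives in U
```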
Build a final classifier • Negative set: RU. • Positive set (two options): Option 1: use LP only; Option 2: combine P and LP.
3. Empirical Evaluation • Datasets: Table 1 lists the Web page collections, the number of Web pages, and their classes.
Experiment setting • We experimented with different numbers of (randomly selected) positive documents in P, i.e. |P| = 5, 15, or 25, and allpos (all positives). • We conducted a comprehensive set of experiments using all possible P and U combinations. That is, we selected every entry in Table 1 as the positive set P and used each of the other 4 Web sites as the unlabeled set U.
Performance of LPLP with different numbers of positive documents
LP + P or LP only? • If only a small number of positive documents (|P| = 5, 15 or 25) is available, we found that combining LP and P to construct the positive set for the classifier is better than using LP only. • If a large number of positive documents is available, then using LP only is better.
The number of representative features • In general, 5-25 representative words suffice. • Including less representative word features beyond the top 25 introduces unnecessary noise in identifying the likely positive documents in U.
Performance of LPLP, Roc-SVM and PEBL (using either P or LP) when all positive documents are used. In the LP variants, PEBL and Roc-SVM use the likely positive documents LP, built by requiring each document d from U to contain at least 5 (out of 10) selected representative words (a sketch of this filter follows).
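A one-function sketch of this hard filter; the function name and the tokenized-document representation are our assumptions.

```python
# A minimal sketch of the hard filter used to build LP for PEBL and Roc-SVM:
# keep a document from U only if it contains at least 5 of the 10 selected
# representative words. Names and representation are illustrative.
def filter_likely_positives(U, RW, min_hits=5):
    return [doc for doc in U if len(set(doc) & set(RW)) >= min_hits]
```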
Comparative results when the number of positive documents is small
Conclusions • In many real-world classification applications, the number of positive examples available for learning is often fairly limited. • We proposed LPLP, an effective technique for document classification that learns from positive and unlabeled examples even when the positive set is small.
Conclusions (cont.) • The likely positive documents LP can be used to boost the performance of classification techniques for PU learning problems. • The LPLP algorithm benefits the most because it can handle probabilistic labels and is thus better equipped than the SVM-based approaches to take advantage of the probabilistic LP set.
Thank you for your attention!