Learning from Imbalanced, Only Positive and Unlabeled Data Yetian Chen 04-29-2009
Outline
• Introduction and problem statement: 2008 UC San Diego Data Mining Competition
• Task 1: Supervised Learning from Imbalanced Data Sets (over-sampling and under-sampling)
• Task 2: Semi-Supervised Learning from Only Positive and Unlabeled Data (two-step strategy)
Statement of Problems: 2008 UC San Diego Data Mining Competition
• Task 1: Standard Binary Classification. A binary classification task that involves 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, but there are roughly ten times as many negative examples as positive. The test set, however, is evenly distributed between positive and negative examples.
• Task 2: Positive-Only Semi-Supervised Task. Also a binary classification task, but most of the training examples are unlabeled. In fact, only a few of the positive examples have labels. There are both positive and negative unlabeled examples, but there are several times as many negative training examples as positive. This class distribution is reflected in the test sets.
Task 1: Learning from Imbalanced Data
• Class imbalance is prevalent in many applications: fraud/intrusion detection, risk management, text classification, medical diagnosis/monitoring, etc.
• Standard classifiers tend to be overwhelmed by the large classes and to ignore the small ones: they produce high predictive accuracy on the majority class but poor predictive accuracy on the minority class.
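To make the last point concrete, here is a small worked example. The counts are hypothetical and only mimic the roughly 10:1 ratio described for Task 1; the "classifier" is the trivial rule that always predicts the majority class.

```python
# Hypothetical counts approximating a 10:1 negative-to-positive ratio.
n_neg, n_pos = 10_000, 1_000

# A trivial classifier that always predicts "negative".
true_neg = n_neg  # every negative example is classified correctly
true_pos = 0      # every positive example is missed

# Accuracy measured on data with the same 10:1 class distribution.
accuracy = (true_neg + true_pos) / (n_neg + n_pos)
# Recall on the minority (positive) class.
minority_recall = true_pos / n_pos
# On an evenly split test set the rule is right on the negative half only.
balanced_test_accuracy = 0.5 * 1.0 + 0.5 * 0.0

print(f"accuracy at 10:1 imbalance: {accuracy:.3f}")
print(f"minority recall:            {minority_recall:.3f}")
print(f"accuracy on a 50/50 test:   {balanced_test_accuracy:.3f}")
```

The high accuracy on the imbalanced data hides the fact that no positive example is ever found, which matters here because the Task 1 test set is evenly split.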
Solutions to Class Imbalance Problem
• At the data level (re-sampling)
  • Over-sampling: increases the number of minority instances by duplicating them or generating synthetic ones
  • Under-sampling: extracts a smaller set of majority instances while preserving all the minority instances
• At the algorithmic level
  • Cost-sensitive learning: adjusts the costs of the various classes so as to counter the class imbalance
  • ……
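As an illustration of the cost-sensitive idea, the sketch below computes per-class weights that are inversely proportional to class frequency. This is one common heuristic, not necessarily what any particular cost-sensitive method uses; the function name `balanced_class_weights` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_total / (n_classes * n_class).

    Rare classes get large weights, so every class contributes the same
    total weight to a weighted loss.
    """
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}


# Hypothetical toy labels with a 10:1 imbalance.
labels = ["neg"] * 10 + ["pos"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, misclassifying one positive example costs as much as misclassifying ten negative ones, which counters the imbalance without changing the data.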
Over-sampling
• SMOTE (Synthetic Minority Over-sampling Technique): the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining it to any/all of its k nearest minority class neighbors.
• Over-sampling by duplicating the minority examples
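Both over-sampling variants can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions (Euclidean distance, brute-force neighbor search, a fixed number of synthetic points per sample), not the reference SMOTE implementation; the function names and the toy data are hypothetical.

```python
import math
import random


def smote(minority, per_sample=1, k=5, seed=0):
    """Generate `per_sample` synthetic points for every minority point.

    Each synthetic point lies on the line segment between a minority point
    and one of its k nearest minority-class neighbours (Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i, x in enumerate(minority):
        # Indices of the k nearest minority neighbours of x, excluding x itself.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours = others[:k]
        for _ in range(per_sample):
            nn = minority[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment from x to nn
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic


def duplicate_oversample(minority, n_extra, seed=0):
    """Plain over-sampling: draw n_extra copies of minority points with replacement."""
    rng = random.Random(seed)
    return [rng.choice(minority) for _ in range(n_extra)]


# Hypothetical toy data: four minority points on the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote(minority, per_sample=2, k=2)
copies = duplicate_oversample(minority, n_extra=4)
print(len(synthetic), "synthetic points;", len(copies), "duplicated points")
```

Duplication only repeats existing points, whereas SMOTE fills in the region between neighboring minority points, which is why the two can behave differently with the same classifier.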
Under-sampling
• Randomly select a subset of the majority class whose size is roughly equal to the size of the minority class.
• After re-sampling, apply standard classifiers to the rebalanced data sets and compare the accuracies: Decision Tree, Naïve Bayes, Neural Network (one hidden layer)
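Random under-sampling as described above can be sketched as follows. The function name `random_undersample` and the toy data are assumptions for illustration; the majority subset is drawn without replacement.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Keep all minority points and a random majority subset of the same size."""
    rng = random.Random(seed)
    n_keep = min(len(majority), len(minority))
    kept_majority = rng.sample(majority, n_keep)  # sampling without replacement
    return kept_majority, list(minority)


# Hypothetical toy data with a 10:1 imbalance.
majority = [(float(i), 0.0) for i in range(100)]
minority = [(float(i), 1.0) for i in range(10)]

kept_majority, kept_minority = random_undersample(majority, minority)
print(len(kept_majority), "majority and", len(kept_minority), "minority examples")
```

The rebalanced pair of lists can then be passed to any standard classifier.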
Results for Task 1
For the Neural Network classifiers, I experimented with different numbers of hidden units (5, 11, 15, 20); 11 gives the best accuracies.
Conclusion for Task 1
• For Naïve Bayes classifiers, re-sampling does not improve the accuracy significantly.
• For Decision Tree classifiers, random under-sampling and over-sampling with SMOTE significantly improve the accuracy.
• For Neural Network classifiers, all three re-sampling techniques significantly improve the accuracy.
• The Neural Network classifier combined with SMOTE over-sampling gives the best accuracy among all classifiers and re-sampling techniques tested.
Task 2: Learning from Only Positive and Unlabeled Data
• Positive examples: one has a set P of examples of a class, and
• Unlabeled set: one also has a set U of unlabeled (or mixed) examples, containing instances of that class as well as instances not of it (negative examples).
• Build a classifier: build a classifier to classify the examples in U and/or future (test) data.
• Key feature of the problem: no labeled negative training data.
• We call this problem PU-learning.
Examples in Real Life
• Specialized molecular biology databases: a database defines a set of positive examples (genes/proteins related to a certain disease or function). There is no information about examples that should not be included, and it is unnatural to build such a set.
• Learning a user’s preference for web pages:
  • The user’s bookmarks can be considered positive examples
  • All the remaining web pages are unlabeled examples
• Direct marketing: a company’s current list of customers serves as positive examples
• Text classification: labeling is labor intensive
Are Unlabeled Examples Helpful?
[Figure: scatter plot of positive (+) and unlabeled (u) examples with the two candidate boundaries x1 < 0 and x2 > 0]
• The target function is known to be either x1 < 0 or x2 > 0
• Which one is it?
• “Not learnable” with only positive examples. However, the addition of unlabeled examples makes it learnable.
Two-step strategy
• Step 1: Identify a set of reliable negative examples from the unlabeled set.
  • S-EM [Liu et al., 2002] uses a Spy technique
  • PEBL [Yu et al., 2002] uses a 1-DNF technique
  • Roc-SVM [Li & Liu, 2003] uses the Rocchio algorithm
  • …
• Step 2: Build a sequence of classifiers by iteratively applying a classification algorithm, and then select a good classifier.
  • S-EM uses the Expectation Maximization (EM) algorithm, with an error-based classifier selection mechanism
  • PEBL uses SVM and returns the classifier at convergence, i.e., no classifier selection
  • Roc-SVM uses SVM with a heuristic method for selecting the final classifier
[Figure: the two-step strategy]
• Step 1: split the unlabeled set U into the reliable negatives (RN) and the remainder Q = U - RN; P is the positive set.
• Step 2: either use P, RN and Q to build the final classifier iteratively, or use only P and RN to build a classifier.
Step 1: The Spy technique
• Sample a certain % of the positive examples and put them into the unlabeled set to act as “spies”.
• Run a classification algorithm assuming all unlabeled examples are negative.
• The “spies” reveal how the actual positive examples hidden in the unlabeled set behave.
• Use the Expectation-Maximization (EM) algorithm to assign each unlabeled example a probabilistic class label.
• We can then extract reliable negative examples from the unlabeled set more accurately.
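The spy idea can be sketched as below. This is a simplified illustration, not S-EM itself: it fits a single Gaussian naive Bayes model instead of running EM, and it uses the lowest spy posterior as the threshold. The function names, the 15% spy fraction, and the synthetic 2-D data are all assumptions made for the example.

```python
import math
import random


def fit_gaussian_nb(pos, neg):
    """Fit Gaussian naive Bayes: per-class prior, feature means and variances."""
    model = {}
    total = len(pos) + len(neg)
    for label, rows in (("pos", pos), ("neg", neg)):
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        variances = [
            sum((v - m) ** 2 for v in c) / len(c) + 1e-6  # floor avoids zero variance
            for c, m in zip(cols, means)
        ]
        model[label] = (len(rows) / total, means, variances)
    return model


def prob_positive(model, x):
    """Posterior P(pos | x) under the fitted model."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        ll = math.log(prior)
        for v, m, var in zip(x, means, variances):
            ll += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_joint[label] = ll
    top = max(log_joint.values())  # subtract the max for numerical stability
    scaled = {k: math.exp(v - top) for k, v in log_joint.items()}
    return scaled["pos"] / (scaled["pos"] + scaled["neg"])


def spy_reliable_negatives(P, U, spy_fraction=0.15, seed=0):
    """Return the unlabeled examples scored below every spy."""
    rng = random.Random(seed)
    idx = list(range(len(P)))
    rng.shuffle(idx)
    n_spies = max(1, int(spy_fraction * len(P)))
    spies = [P[i] for i in idx[:n_spies]]
    remaining_P = [P[i] for i in idx[n_spies:]]
    # Train with U plus the spies treated as the negative class.
    model = fit_gaussian_nb(remaining_P, U + spies)
    # Spies are known positives, so anything scored lower is likely negative.
    threshold = min(prob_positive(model, s) for s in spies)
    return [u for u in U if prob_positive(model, u) < threshold]


# Hypothetical data: positives near (2, 2), negatives near (-2, -2).
data_rng = random.Random(1)
P = [(data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)) for _ in range(40)]
hidden_pos = [(data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)) for _ in range(20)]
negatives = [(data_rng.gauss(-2, 0.5), data_rng.gauss(-2, 0.5)) for _ in range(60)]
U = hidden_pos + negatives

reliable_negatives = spy_reliable_negatives(P, U)
print(len(reliable_negatives), "reliable negatives out of", len(U), "unlabeled")
```

Using the minimum spy score is the strictest choice; allowing a small share of spies to fall below the threshold would make the selection more tolerant of noisy positives.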
Step 2: Building the final classifier
• Use a Naïve Bayes classifier as the final classifier
• Use P as the positive class and RN (the reliable negative examples) as the negative class
Results and Conclusion for Task 2
• Baseline: use P as the positive class and U as the negative class, with SMOTE over-sampling of P so that the size of P is roughly the same as U: F1 score = 0.545
• The two-step algorithm gives F1 score = 0.651
• The highest score is F1 = 0.721
• Learning from only positive and unlabeled data is feasible with the two-step strategy.
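For reference, the F1 score is the harmonic mean of precision and recall on the positive class. The sketch below computes it from confusion-matrix counts; the counts are hypothetical and unrelated to the scores reported above.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
example_f1 = f1_score(tp=8, fp=2, fn=2)
print(f"F1 = {example_f1:.3f}")
```

Because F1 ignores true negatives, it is a more informative measure than accuracy when negatives heavily outnumber positives, as they do in Task 2.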
Future Work
• For Task 1, we can try cost-sensitive methods
• For Task 2, try other variants of the two-step strategy
  • Step 1: 1-DNF, Rocchio algorithm
  • Step 2: SVM
References
• B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179–188, 2003.
• B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), Sydney, Australia, July 8-12, 2002.
• W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington, DC, USA, August 21-24, 2003.
• G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. ICPR 2008, pages 1-4.
• N. V. Chawla, N. Japkowicz, and A. Kolcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1):1-6, 2004.
• N. V. Chawla et al. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.