Quantification and Semi-Supervised Classification Methods for Handling Changes in Class Distribution
KDD-09, Paris, France
Gary M. Weiss
Jack Chongjie Xue†
Department of Computer and Information Science
†Also with the Office of Institutional Research
Fordham University, USA
Important Research Problem
• Distributions may change after the model is induced
• Our research problem/scenario:
  • Class distribution changes but the "concept" does not
  • Let x represent an example and y its label. We assume:
    • P(y|x) is constant (i.e., the concept does not change)
    • P(y) changes, which means that P(x) must change (see the derivation below)
• Assume unlabeled data are available from the new class distribution (training and separate test)
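A brief justification of the last assumption (not spelled out on the slide): marginalizing over x shows that if P(y|x) stays fixed while P(y) changes, then P(x) must change as well.

```latex
% Concept is fixed: P(y \mid x) does not change.
% Marginalize over x:
P(y) \;=\; \sum_{x} P(y \mid x)\, P(x).
% If P(y \mid x) is held constant but P(y) differs between the old and
% new distributions, the only remaining factor, P(x), must differ too:
P_{\mathrm{new}}(y) \neq P_{\mathrm{old}}(y)
\;\;\Longrightarrow\;\;
P_{\mathrm{new}}(x) \neq P_{\mathrm{old}}(x).
```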
Research Questions and Goals
• Two research questions:
  • How can we maximize classifier performance when the class distribution changes but is unknown?
  • How can we utilize unlabeled data from the changed class distribution to accomplish this?
• Our goals:
  • Outperform naïve methods that ignore these changes
  • Approach the performance of the "oracle" method, which trains on labeled data from the new distribution
When Class Distribution Changes
Technical Approaches
• Quantification [Forman KDD 06 & DMKD 08]
  • Task of estimating a class distribution (CD)
  • Much easier than classification
  • Adjust the model to compensate for the CD change [Elkan 01, Weiss & Provost 03] (sketch below)
  • New examples are not used directly in training
  • We call these class distribution estimation (CDE) methods
• Semi-Supervised Learning (SSL)
  • Exploits unlabeled data, which are used for training
• Other approaches discussed later
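A minimal sketch of the prior-adjustment idea cited above: a classifier's posterior estimates can be rescaled to reflect a new class prior without retraining. This is an illustration, not the paper's exact procedure; `adjust_posteriors` and its arguments are hypothetical names.

```python
import numpy as np

def adjust_posteriors(p_old, old_prior, new_prior):
    """Rescale P_old(y=1|x) to reflect a new class prior (binary case).

    p_old     : array of posterior estimates under the training prior
    old_prior : P_old(y=1) in the training data
    new_prior : (estimated) P_new(y=1) in the new data
    """
    # Reweight each class's posterior by the ratio of new to old prior,
    # then renormalize so the two class posteriors sum to one.
    pos = p_old * (new_prior / old_prior)
    neg = (1.0 - p_old) * ((1.0 - new_prior) / (1.0 - old_prior))
    return pos / (pos + neg)

# Example: a score of 0.5 under a 50/50 training prior becomes 0.9
# if the new distribution is believed to be 90% positive.
print(adjust_posteriors(np.array([0.5]), old_prior=0.5, new_prior=0.9))
```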
CDE Methods
• CDE-Oracle (upper bound)
  • Determines the new CD by peeking at the class labels, then adjusts the model; the CDE upper bound
• CDE-Iterate-n (see the sketch after this slide)
  • Iterative, because each adjustment underestimates the change in class distribution
  1. Build model M on the original training data (using the last NEWCD estimate)
  2. Label the new distribution to estimate NEWCD
  3. Adjust M using the NEWCD estimate; output M
  4. Increment n; loop to step 1
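A schematic rendering of the CDE-Iterate loop above, assuming binary 0/1 labels and a scikit-learn decision tree in place of WEKA's J48. The posterior-rescaling step stands in for however the paper adjusts the model, so treat this as a sketch of the control flow only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def _shift_prior(p, old_prior, new_prior):
    # Elkan-style rescaling of posteriors from the training prior to an
    # estimated new prior, followed by renormalization.
    pos = p * (new_prior / old_prior)
    neg = (1.0 - p) * ((1.0 - new_prior) / (1.0 - old_prior))
    return pos / (pos + neg)

def cde_iterate(X_train, y_train, X_new, n_iters=2):
    """Schematic CDE-Iterate-n: estimate NEWCD from the model's own
    predictions on the unlabeled new data, adjust, and repeat."""
    old_prior = float(np.mean(y_train))   # original (training) positive rate
    new_prior = old_prior                 # initial NEWCD estimate
    model = DecisionTreeClassifier().fit(X_train, y_train)
    for _ in range(n_iters):
        # Label the new-distribution data, adjusted for the current estimate.
        p = model.predict_proba(X_new)[:, 1]
        p_adj = _shift_prior(p, old_prior, new_prior)
        # The adjusted positive-prediction rate becomes the next NEWCD estimate.
        new_prior = float(np.mean(p_adj >= 0.5))
    return model, new_prior
```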
CDE Methods
• CDE-AC
  • Based on Adjusted Count quantification
  • See [Forman KDD 06 and DMKD 08] for details
  • Adjusted positive rate: pr* = (pr - fpr) / (tpr - fpr) (sketch below)
    • pr is calculated from the predicted class labels
    • fpr and tpr are obtained via cross-validation on the labeled training set
  • Essentially compensates for the fact that pr will underestimate changes to the class distribution
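The Adjusted Count correction from the formula above, as a small helper. Clipping the result to [0, 1] is a common practical safeguard and an assumption here, not something stated on the slide.

```python
def adjusted_count(pr, tpr, fpr):
    """Adjusted Count quantification: pr* = (pr - fpr) / (tpr - fpr).

    pr  : observed positive-prediction rate on the new, unlabeled data
    tpr : true positive rate from cross-validation on the training set
    fpr : false positive rate from the same cross-validation
    """
    pr_star = (pr - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, pr_star))   # clip to a valid proportion

# Example: the classifier predicts 30% positive, with tpr=0.80 and fpr=0.10;
# the adjusted estimate of the true positive rate is about 0.286.
print(adjusted_count(0.30, tpr=0.80, fpr=0.10))
```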
SSL Methods
• SSL-Naïve
  • Build a model from the labeled training data
  • Label the unlabeled data from the new distribution
  • Build a new model from the predicted labels of the new-distribution data
  • Note: does not directly use the original training data
• SSL-Self-Train (sketch below)
  • Similar to SSL-Naïve, but the original training data are used, together with the new-distribution examples that have the most confident predictions (above the median)
  • Iterates until all examples are merged or the maximum number of iterations (4) is reached
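A compact sketch of the SSL-Self-Train loop described above, assuming numpy arrays and a scikit-learn classifier. The above-median confidence cutoff and the cap of 4 iterations follow the slide; everything else is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ssl_self_train(X_train, y_train, X_new, max_iters=4):
    """Self-training: fold the most confidently labeled new-distribution
    examples (above-median confidence) back into the training set."""
    X_lab, y_lab = np.asarray(X_train), np.asarray(y_train)
    remaining = np.asarray(X_new)
    model = DecisionTreeClassifier().fit(X_lab, y_lab)
    for _ in range(max_iters):
        if len(remaining) == 0:
            break
        proba = model.predict_proba(remaining)
        conf = proba.max(axis=1)                      # prediction confidence
        keep = conf >= np.median(conf)                # above-median examples
        pseudo = model.classes_[proba.argmax(axis=1)][keep]
        X_lab = np.vstack([X_lab, remaining[keep]])   # merge pseudo-labeled data
        y_lab = np.concatenate([y_lab, pseudo])
        remaining = remaining[~keep]                  # defer the rest
        model = DecisionTreeClassifier().fit(X_lab, y_lab)
    return model
```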
Hybrid Method
• Combination of SSL-Self-Train and CDE-Iterate (a rough sketch follows)
• Can be viewed as SSL-Self-Train, but at each iteration the model is adjusted to compensate for the difference between the CD of the merged training data and the CD of the new data it is applied to
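One plausible way to splice the two pieces together in code: the self-training loop from the previous sketch, with an Elkan-style prior correction applied before each labeling step. This is a guess at the structure implied by the slide, not the authors' implementation; it assumes binary 0/1 labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def hybrid_self_train(X_train, y_train, X_new, max_iters=4):
    """Self-training with a prior-shift correction at each iteration."""
    X_lab, y_lab = np.asarray(X_train), np.asarray(y_train)
    remaining = np.asarray(X_new)
    model = DecisionTreeClassifier().fit(X_lab, y_lab)
    for _ in range(max_iters):
        if len(remaining) == 0:
            break
        p = model.predict_proba(remaining)[:, 1]
        # Rescale posteriors from the merged-training prior toward the
        # positive rate currently estimated on the new data.
        train_prior = float(np.mean(y_lab))
        est_prior = min(max(float(np.mean(p >= 0.5)), 0.01), 0.99)  # avoid degenerate priors
        pos = p * (est_prior / train_prior)
        neg = (1.0 - p) * ((1.0 - est_prior) / (1.0 - train_prior))
        p_adj = pos / (pos + neg)
        conf = np.maximum(p_adj, 1.0 - p_adj)         # confidence after adjustment
        keep = conf >= np.median(conf)
        X_lab = np.vstack([X_lab, remaining[keep]])
        y_lab = np.concatenate([y_lab, (p_adj[keep] >= 0.5).astype(int)])
        remaining = remaining[~keep]
        model = DecisionTreeClassifier().fit(X_lab, y_lab)
    return model
```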
Experiment Methodology
• Use 5 relatively large UCI data sets
• Partition the data to form "original" and "new" distributions (sampling sketch below)
  • Original distribution made to be 50% positive
  • New distribution varied from 1% to 99% positive
• Results averaged over 10 random runs
• Use WEKA's J48 (similar to C4.5) for the experiments
• Track accuracy and F-measure
  • F-measure places more emphasis on the minority class
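A small helper showing the kind of controlled partitioning the methodology describes, i.e., drawing a sample with a chosen positive rate. The function name and interface are illustrative; it assumes 0/1 labels and enough examples of each class.

```python
import numpy as np

def sample_with_positive_rate(X, y, n, pos_rate, seed=None):
    """Draw n examples with a specified fraction of positives (labels 0/1)."""
    rng = np.random.default_rng(seed)
    n_pos = int(round(n * pos_rate))
    pos_idx = rng.choice(np.flatnonzero(y == 1), size=n_pos, replace=False)
    neg_idx = rng.choice(np.flatnonzero(y == 0), size=n - n_pos, replace=False)
    idx = rng.permutation(np.concatenate([pos_idx, neg_idx]))
    return X[idx], y[idx]

# Example: a "new" distribution that is 10% positive.
# X_new, y_new = sample_with_positive_rate(X, y, n=2000, pos_rate=0.10)
```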
Results (chart slides)
• Results: Accuracy (Adult Data Set)
• Results: Accuracy (SSL-Naïve)
• Results: Accuracy (SSL-Self-Train)
• Results: Accuracy (CDE-Iterate-1)
• Results: Accuracy (CDE-Iterate-2)
• Results: Accuracy (Hybrid)
• Results: Accuracy (CDE-AC)
• Results: Average Accuracy (99 positive rates)
• Results: F-Measure (Adult Data Set)
• Results: F-Measure (99 positive rates)
Why Do the Oracle Methods Perform Poorly?
• Oracle method:
  • The oracle trains only on the new distribution
  • The new distribution is often very unbalanced
  • F-measure should do best with balanced data (definition below)
    • Weiss and Provost (2003) show that a balanced distribution is best for AUC
• CDE-Oracle method:
  • CDE-Iterate underestimates the change in class distribution
  • This may help F-measure, since it better balances the importance of the minority class
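For reference (the standard definition, not restated on the slide), the F-measure combines precision and recall, both computed with respect to the positive (often minority) class:

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}
           {\mathrm{precision} + \mathrm{recall}}.
```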
Conclusion
• Can substantially improve performance by not ignoring changes to the class distribution
• Can exploit unlabeled data from the new distribution, even if only to estimate NEWCD
• Quantification methods can be very helpful and much better than semi-supervised learning alone
Future Work
• The problem is reduced with well-calibrated probability models (Zadrozny & Elkan '01)
  • Decision trees do not produce these
  • Evaluate methods that produce good probability estimates
• In our problem setting, p(x) changes
  • Try methods that measure this change and compensate for it (e.g., by weighting the x's)
• Experiment with initial distributions that are not 1:1
  • Especially highly skewed distributions (e.g., diseases)
• Other issues: data streams / real-time updates
References
[Forman 06] G. Forman. Quantifying trends accurately despite classifier error and class imbalance. KDD-06, 157-166.
[Forman 08] G. Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164-206.
[Weiss & Provost 03] G. Weiss & F. Provost. Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315-354.
[Zadrozny & Elkan 01] B. Zadrozny & C. Elkan. Obtaining calibrated probability estimates from decision trees and naïve Bayesian classifiers. ICML-01, 609-616.