
Quantification and Semi-Supervised Classification Methods for Handling Changes in Class Distribution. KDD-09, Paris, France. Gary M. Weiss and Jack Chongjie Xue†, Department of Computer and Information Science, Fordham University, USA. †Also with the Office of Institutional Research.



  1. Quantification and Semi-Supervised Classification Methods for Handling Changes in Class Distribution. KDD-09, Paris, France. Gary M. Weiss and Jack Chongjie Xue†, Department of Computer and Information Science, Fordham University, USA. †Also with the Office of Institutional Research.

  2. Important Research Problem
     • Distributions may change after model is induced
     • Our research problem/scenario:
       • Class distribution changes but “concept” does not
       • Let x represent an example and y its label. We assume:
         • P(y|x) is constant (i.e., concept does not change)
         • P(y) changes (which means that P(x) must change)
       • Assume unlabeled data available from new class distribution (training and separate test)
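The parenthetical claim on slide 2 follows from marginalizing over x. A short derivation is sketched below; the primed symbols for the new distribution and the sum over a discrete x are notational assumptions, not taken from the slides.

```latex
\begin{align*}
P(y)  &= \sum_{x} P(y \mid x)\, P(x), \\
P'(y) &= \sum_{x} P(y \mid x)\, P'(x), \\
P'(y) - P(y) &= \sum_{x} P(y \mid x)\,\bigl(P'(x) - P(x)\bigr).
\end{align*}
```

If P'(x) = P(x) for every x, the right-hand side is zero, so a change in P(y) with P(y|x) held fixed requires a change in P(x).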

  3. Research Questions and Goals
     • Two research questions:
       • How can we maximize classifier performance when class distribution changes but is unknown?
       • How can we utilize unlabeled data from the changed class distribution to accomplish this?
     • Our goals:
       • Outperform naïve methods that ignore these changes
       • Approach performance of “oracle” method, which trains on labeled data from new distribution

  4. When Class Distribution Changes

  5. Technical Approaches
     • Quantification [Forman KDD 06 & DMKD 08]
       • Task of estimating a class distribution (CD)
       • Much easier than classification
       • Adjust model to compensate for CD change [Elkan 01, Weiss & Provost 03]
       • New examples not used directly in training
       • We call these class distribution estimation (CDE) methods
     • Semi-Supervised Learning (SSL)
       • Exploits unlabeled data, which are used for training
     • Other approaches discussed later

  6. CDE Methods
     • CDE-Oracle (upper bound)
       • Determines new CD by peeking at class labels, then adjusts model; CDE upper bound
     • CDE-Iterate-n
       • Iterative algorithm because changes to class distribution will be underestimated
       • Step 1: Builds model M on original training data (using last NEWCD)
       • Step 2: Labels new distribution to estimate NEWCD
       • Step 3: Adjusts M using NEWCD estimate; outputs M
       • Step 4: Increments n; loops to step 1
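The CDE-Iterate-n loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is assumed to output a probability P(pos|x) for each unlabeled example, the model is not rebuilt between iterations, and the "adjust" step is assumed to be the standard prior-shift re-weighting of that probability. The function names `adjust_posterior` and `cde_iterate` and the clamping bounds are hypothetical.

```python
def adjust_posterior(p, old_rate, new_rate):
    """Re-weight P(pos|x), estimated under old_rate, for a new positive rate.

    Assumption: standard prior-shift correction; the slides do not spell out
    the exact adjustment used.
    """
    pos = p * new_rate / old_rate
    neg = (1.0 - p) * (1.0 - new_rate) / (1.0 - old_rate)
    return pos / (pos + neg)


def cde_iterate(scores, train_rate, n):
    """Return the NEWCD estimate (positive rate) after n iterations.

    scores: P(pos|x) from the model built on the original training data,
            one value per unlabeled example from the new distribution.
    train_rate: positive rate of the original training data.
    """
    new_rate = train_rate  # before any estimate, assume no change
    for _ in range(n):
        # Adjust the model's outputs using the latest NEWCD estimate.
        adjusted = [adjust_posterior(p, train_rate, new_rate) for p in scores]
        # Label the new data and re-estimate NEWCD from the predicted labels.
        predicted_pos = sum(1 for p in adjusted if p >= 0.5)
        # Clamp (hypothetical bounds) so the next adjustment stays well defined.
        new_rate = min(max(predicted_pos / len(scores), 0.01), 0.99)
    return new_rate
```

With scores that lean positive, each pass moves the estimate further from the training rate, which matches the slide's motivation that a single pass underestimates the change.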

  7. CDE Methods
     • CDE-AC
       • Based on Adjusted Count quantification
       • See [Forman KDD 06 and DMKD 08] for details
       • Adjusted Positive Rate pr* = (pr - fpr) / (tpr - fpr)
         • pr is calculated from the predicted class labels
         • fpr and tpr obtained via cross-validation of labeled training set
       • Essentially compensates for fact that pr will underestimate changes to class distribution
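The adjusted positive rate from slide 7 is a one-line computation. The sketch below is illustrative only: the function name `adjusted_count` is hypothetical, and clipping the result to [0, 1] is an added safeguard that the slides do not mention.

```python
def adjusted_count(pr, tpr, fpr):
    """Adjusted positive rate pr* = (pr - fpr) / (tpr - fpr).

    pr:  fraction of the new data that the classifier labels positive.
    tpr: true positive rate, e.g. from cross-validation on labeled data.
    fpr: false positive rate, obtained the same way.
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the adjustment to be defined")
    estimate = (pr - fpr) / (tpr - fpr)
    # Assumption: clip to a valid proportion, since noise can push it outside.
    return min(1.0, max(0.0, estimate))
```

For example, with tpr = 0.8 and fpr = 0.1, a new distribution that is truly 90% positive yields an observed pr of 0.9 × 0.8 + 0.1 × 0.1 = 0.73. The raw count therefore understates the shift, and the adjustment recovers 0.9.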

  8. SSL Methods
     • SSL-Naïve
       • Build model from labeled training data
       • Label unlabeled data from new distribution
       • Build new model from predicted labels of new distribution
       • Note: Does not directly use original training data
     • SSL-Self-Train
       • Similar to SSL-Naïve, but uses the original training data plus the examples from the new distribution with the most confident predictions (above median)
       • Iterates until all examples merged or max iterations (4)
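The SSL-Self-Train loop can be sketched as below. This is a toy illustration, not the authors' setup: a 1-D nearest-centroid learner stands in for the decision tree, confidence is taken to be the margin between centroid distances, and examples at or above the median confidence are merged (ties included, so the pool always shrinks). The names `fit`, `predict`, and `self_train` are hypothetical.

```python
from statistics import mean, median


def fit(examples):
    """Toy learner (assumption): returns (negative centroid, positive centroid)."""
    neg = mean(x for x, y in examples if y == 0)
    pos = mean(x for x, y in examples if y == 1)
    return neg, pos


def predict(model, x):
    """Return (label, confidence); confidence is the distance margin."""
    neg, pos = model
    d_neg, d_pos = abs(x - neg), abs(x - pos)
    return (1 if d_pos < d_neg else 0), abs(d_neg - d_pos)


def self_train(labeled, unlabeled, max_iterations=4):
    """Merge confidently labeled new examples into the training set, then refit."""
    merged, pool = list(labeled), list(unlabeled)
    model = fit(merged)
    for _ in range(max_iterations):
        if not pool:
            break  # all examples from the new distribution have been merged
        scored = [(x, *predict(model, x)) for x in pool]
        cutoff = median(conf for _, _, conf in scored)
        # Keep the more confident half with its predicted labels.
        merged += [(x, label) for x, label, conf in scored if conf >= cutoff]
        pool = [x for x, _, conf in scored if conf < cutoff]
        model = fit(merged)
    return model, merged, pool
```

A small usage example: `self_train([(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)], [2.0, 8.0, 7.0, 6.5, 3.0, 9.5])` merges the six unlabeled points over three iterations and leaves the pool empty.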

  9. Hybrid Method
     • Combination of SSL-Self-Train and CDE-Iterate
     • Can be viewed as SSL-Self-Train, but at each iteration the model is adjusted to compensate for the difference between the CD of the merged training data and the CD estimated by applying the model to the new data

  10. Experiment Methodology
     • Use 5 relatively large UCI data sets
     • Partition data to form “original” and “new” distributions
       • Original distribution made to be 50% positive
       • New distribution varied from 1% to 99% positive
     • Results averaged over 10 random runs
     • Use WEKA’s J48 for experiments (like C4.5)
     • Track accuracy and F-measure
       • F-measure places more emphasis on minority class
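Forming a partition with a chosen positive rate, as the methodology describes, might look like the sketch below. The slides give neither the sample sizes nor the sampling scheme, so drawing without replacement, the `seed` parameter, and the function name `sample_with_positive_rate` are all assumptions.

```python
import random


def sample_with_positive_rate(positives, negatives, size, pos_rate, seed=0):
    """Draw `size` examples so that a fraction `pos_rate` of them is positive."""
    n_pos = round(size * pos_rate)
    n_neg = size - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough examples of one class for this rate and size")
    rng = random.Random(seed)  # seeded so that a run can be repeated
    sample = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(sample)
    return sample
```

Calling this once with `pos_rate=0.5` for the "original" data and then with each rate from 0.01 to 0.99 for the "new" data mirrors the setup on the slide. Repeating with different seeds corresponds to the averaged random runs.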

  11. Results: Accuracy (Adult Data Set)

  12. Results: Accuracy (SSL-Naive)

  13. Results: Accuracy (SSL-Self-Train)

  14. Results: Accuracy (CDE-Iterate-1)

  15. Results: Accuracy (CDE-Iterate-2)

  16. Results: Accuracy (Hybrid)

  17. Results: Accuracy (CDE-AC)

  18. Results: Average Accuracy (99 pos rates)

  19. Results: F-Measure (Adult Data Set)

  20. Results: F-Measure (99 pos rates)

  21. Why Do Oracle Methods Perform Poorly?
     • Oracle method:
       • Oracle trains only on new distribution
       • New distribution often very unbalanced
       • F-measure should do best with balanced data
       • Weiss and Provost (2003) show balanced best for AUC
     • CDE-Oracle method:
       • CDE-Iterate underestimates change in class distribution
       • May be helpful for F-measure since it will better balance importance of minority class

  22. Conclusion
     • Can substantially improve performance by not ignoring changes to class distribution
     • Can exploit unlabeled data from new distribution, even if only to estimate NEWCD
     • Quantification methods can be very helpful and much better than semi-supervised learning alone

  23. Future Work
     • Problem reduced with well-calibrated probability models (Zadrozny & Elkan ’01)
       • Decision trees do not produce these
       • Evaluate methods that produce good estimates
     • In our problem setting p(x) changes
       • Try methods that measure this change and compensate for it (e.g., via weighting the x’s)
     • Experiment with initial distribution not 1:1
       • Especially highly skewed distributions (e.g., diseases)
     • Other issues: data streams/real-time update

  24. References
     [Forman 06] G. Forman, Quantifying trends accurately despite classifier error and class imbalance, KDD-06, 157-166.
     [Forman 08] G. Forman, Quantifying counts and costs via classification, Data Mining and Knowledge Discovery, 17(2), 164-206.
     [Weiss & Provost 03] G. Weiss & F. Provost, Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction, Journal of Artificial Intelligence Research, 19:315-354.
     [Zadrozny & Elkan 01] B. Zadrozny & C. Elkan, Obtaining calibrated probability estimates from decision trees and naïve Bayesian classifiers, ICML-01, 609-616.
