
Reflections


Presentation Transcript


  1. Reflections Robert Holte University of Alberta holte@cs.ualberta.ca

  2. “unbalanced” vs. “imbalanced” • Google: Searched the web for imbalanced. … about 53,800. • Searched the web for unbalanced. … about 465,000. Shouldn’t we favour the minority class ???

  3. Is “FP” meaningful ? • Elkan: individual examples have costs, so the number of misclassified positive examples is irrelevant • Moreover, if the testing distribution can differ from the training distribution, the FP count measured on training may have no relation to FP later. BUT…
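
A back-of-the-envelope sketch in Python (purely hypothetical numbers) of the second point: a raw count of misclassified positives measured at training time scales with how many positives the evaluation happens to contain, so it need not carry over once the test distribution shifts. Only the prevalence effect is shown here; within-class drift can also change the error rate itself.

    # Hypothetical numbers: a raw error count does not transfer across distributions.
    miss_rate_pos = 0.10         # fraction of positive examples this classifier misses

    train_positives = 200        # positives present in the training-time evaluation
    deploy_positives = 20_000    # positives actually encountered after deployment

    print(miss_rate_pos * train_positives)    # ~20 misclassified positives measured
    print(miss_rate_pos * deploy_positives)   # ~2,000 misclassified positives in practice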

  4. Babies and Bathwater… • Not every situation involves example-specific costs and drifting within-class distributions • ROC curves are far better than accuracy • and ROC curves are better than AUC or any scalar measure • and cost curves are even better ??
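
A minimal scikit-learn sketch of the comparison on this slide (the dataset, model, and settings are illustrative assumptions, not from the talk): accuracy fixes one threshold and returns one number, roc_curve keeps the whole ranking trade-off, and AUC collapses that curve back to a scalar.

    # Minimal sketch: accuracy vs. ROC curve vs. AUC on an imbalanced toy problem.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]

    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))  # one threshold, one number
    fpr, tpr, thresholds = roc_curve(y_te, scores)               # the full trade-off
    print("AUC:", roc_auc_score(y_te, scores))                   # the curve re-collapsed to a scalar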

  5. And the question remains… How to select the examples for training which give the best classifier for your circumstances ? (Foster’s budgeted learning problem)

  6. Within-class imbalance • Elkan: subpopulations in test distribution not evenly represented in training • Other presenters: subpopulations in training are not equal size

  7. In Defense of studies of C4.5 and undersampling • Foster’s opening example (“budgeted learning”) is very common. • Undersampling is a common technique (SAS manual) • Different algorithms react differently to undersampling • C4.5’s reaction is not necessarily intuitive • Foster: appropriate sampling method depends on performance measure
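
A minimal sketch of the undersampling setup being defended here, with scikit-learn's DecisionTreeClassifier standing in for C4.5 (the 1:1 target ratio and the synthetic data are assumptions for illustration):

    # Minimal sketch: undersample the majority class before training a tree learner.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=10000, weights=[0.98, 0.02], random_state=0)

    rng = np.random.default_rng(0)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=len(minority), replace=False)  # downsample majority to 1:1
    idx = np.concatenate([minority, keep])

    clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])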

  8. Endless Tweaking ? • Definitely a danger • Overtuning • Plethora of results/methods • But exploratory research is valid once a clear need is established • Some papers have presented specific hypotheses that can now be tested • 1-class SVM outperforms 2-class SVM when…
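
One way the final hypothesis could be put to the test, sketched with scikit-learn (the data, kernel settings, and the choice to fit the one-class SVM on the majority class are all assumptions): train a 1-class SVM and a 2-class SVM on the same imbalanced sample and compare how they rank a held-out set.

    # Minimal sketch: one-class SVM fit on the majority class vs. an ordinary two-class SVM.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import OneClassSVM, SVC
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    one_class = OneClassSVM(gamma="scale", nu=0.05).fit(X_tr[y_tr == 0])
    two_class = SVC(gamma="scale").fit(X_tr, y_tr)

    # Higher score should mean "more likely minority" for both, so negate the
    # one-class decision function (it is large for inliers, i.e. the majority class).
    auc_one = roc_auc_score(y_te, -one_class.decision_function(X_te))
    auc_two = roc_auc_score(y_te, two_class.decision_function(X_te))
    print(auc_one, auc_two)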

  9. Size matters • Having a small number of examples is a different problem than having an imbalance • Both cause problems • We should be careful to separate them in our experiments

  10. No problem ? • Foster: problem diminishes when datasets get large • Are some learning algorithms insensitive ? • Generative models ? • SVMs ? (it seems not after today) • Active learning, progressive sampling

  11. More problems ? • Imbalance detrimental to feature selection • Imbalance detrimental to clustering

  12. ELKAN: Bogosity about learning with unbalanced data • The goal is yes/no classification. • No: ranking, or probability estimation • Often, P(c=minority|x) < 0.5 for all examples x • Decision trees and C4.5 are well-suited • No: model each class separately, then use Bayes’ rule • P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ] • No: avoid small disjuncts • With naïve Bayes: P(x|c) = ∏ P(xi | c) • Under/over-sampling are appropriate • No: do cost-based example-specific sampling, then bagging • ROC curves and AUC are important
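
A minimal sketch of the "model each class separately, then use Bayes' rule" alternative, using naive Bayes likelihoods over binary features (the toy data and helper names are illustrative only):

    # Minimal sketch: per-class naive Bayes likelihoods combined with Bayes' rule.
    # P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ],  with P(x|c) = prod_i P(xi|c)
    import numpy as np

    def fit_class(X):
        """Per-feature Bernoulli parameters for one class (Laplace-smoothed)."""
        return (X.sum(axis=0) + 1.0) / (len(X) + 2.0)

    def likelihood(x, theta):
        """P(x|c) = prod_i P(xi|c) under the naive independence assumption."""
        return np.prod(np.where(x == 1, theta, 1.0 - theta))

    # Tiny illustrative binary dataset: rows are examples, columns are features.
    X_pos = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1]])             # minority class
    X_neg = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 0, 0]])  # majority class

    theta_pos, theta_neg = fit_class(X_pos), fit_class(X_neg)
    prior_pos = len(X_pos) / (len(X_pos) + len(X_neg))

    x = np.array([1, 1, 0])
    num = likelihood(x, theta_pos) * prior_pos
    den = num + likelihood(x, theta_neg) * (1.0 - prior_pos)
    print("P(c=pos | x) =", num / den)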
