This presentation contrasts "imbalanced" and "unbalanced" data in machine learning, asks whether false-positive counts are meaningful, and considers how to select the training examples that give the best classifier for the circumstances at hand. It revisits studies of C4.5 and undersampling, the danger of endless tweaking, the distinction between small datasets and imbalanced ones, and the effect of imbalance on feature selection and clustering, and closes with a list of purported "bogosities" about learning with unbalanced data, touching on decision trees, Bayes' rule, and the role of ROC curves and AUC in model evaluation.
Reflections • Robert Holte • University of Alberta • holte@cs.ualberta.ca
“unbalanced” vs. “imbalanced” • Google: searched the web for imbalanced … about 53,800 hits. • Searched the web for unbalanced … about 465,000 hits. • Shouldn’t we favour the minority class?
Is “FP” meaningful? • Elkan: individual examples have costs, so the number of misclassified positive examples is irrelevant. • Moreover, if the testing distribution can differ from the training distribution, the FP rate measured on training data may bear no relation to the FP rate seen later. BUT…
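A minimal sketch of the first point, with hypothetical numbers: when each example carries its own misclassification cost, a raw false-positive count and the total cost can tell very different stories.

```python
# Hypothetical example-specific costs: FP count vs. total misclassification cost.
# Each tuple: (true_label, predicted_label, cost_if_misclassified)
examples = [
    (1, 0, 50.0),   # missed positive worth 50
    (1, 1, 500.0),  # correctly caught positive worth 500
    (0, 1, 2.0),    # cheap false alarm
    (0, 1, 2.0),    # another cheap false alarm
    (0, 0, 10.0),   # correctly rejected negative
]

fp_count = sum(1 for y, yhat, _ in examples if y == 0 and yhat == 1)
total_cost = sum(c for y, yhat, c in examples if y != yhat)

print(f"false positives: {fp_count}")    # 2
print(f"total cost:      {total_cost}")  # 54.0 -- dominated by the one missed positive
```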
Babies and Bathwater… • Not every situation involves example-specific costs and drifting within-class distributions • ROC curves are far better than accuracy • and ROC curves are better than AUC or any scalar measure • and cost curves are even better ??
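As a concrete illustration of the point about ROC curves versus accuracy, here is a minimal sketch using scikit-learn (an assumed dependency); the synthetic 95/5 dataset and logistic-regression learner are illustrative choices, not from the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced problem (illustrative).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Accuracy at the default 0.5 threshold can look fine even when minority recall is poor.
print("accuracy:", accuracy_score(y_te, (scores > 0.5).astype(int)))

# The ROC curve shows the full trade-off across thresholds; AUC compresses it to one number.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```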
And the question remains… How to select the training examples that give the best classifier for your circumstances? (Foster’s budgeted learning problem)
Within-class imbalance • Elkan: subpopulations in test distribution not evenly represented in training • Other presenters: subpopulations in training are not equal size
In Defense of studies of C4.5 and undersampling • Foster’s opening example (“budgeted learning”) is very common. • Undersampling is a common technique (SAS manual) • Different algorithms react differently to undersampling • C4.5’s reaction is not necessarily intuitive • Foster: appropriate sampling method depends on performance measure
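The sketch below illustrates the kind of setup such studies use: random undersampling of the majority class before training a decision tree. scikit-learn's DecisionTreeClassifier stands in for C4.5 here, and the dataset and 1:1 sampling ratio are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 97/3 imbalanced problem (illustrative).
X, y = make_classification(n_samples=10000, weights=[0.97, 0.03], random_state=0)

rng = np.random.default_rng(0)
pos = np.flatnonzero(y == 1)                               # minority class
neg = np.flatnonzero(y == 0)                               # majority class
neg_kept = rng.choice(neg, size=len(pos), replace=False)   # undersample to a 1:1 ratio
idx = np.concatenate([pos, neg_kept])

# DecisionTreeClassifier used as a stand-in for C4.5.
tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
# How the induced tree changes with the sampling ratio is exactly the kind of
# behaviour the C4.5/undersampling studies examine.
```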
Endless Tweaking ? • Definitely a danger • Overtuning • Plethora of results/methods • But exploratory research is valid once a clear need is established • Some papers have presented specific hypotheses that can now be tested • 1-class SVM outperforms 2-class SVM when…
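A hypothesis like the last bullet can be tested directly. The sketch below shows one way, assuming scikit-learn: fit a one-class SVM on minority examples only and compare it with a standard two-class SVM on the same held-out data; the dataset, kernel, and nu value are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM, SVC

# Synthetic 95/5 imbalanced problem (illustrative).
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Two-class SVM, scored by its decision-function margin.
svc = SVC(kernel="rbf").fit(X_tr, y_tr)
print("2-class SVM AUC:", roc_auc_score(y_te, svc.decision_function(X_te)))

# One-class SVM trained on the minority class alone.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(X_tr[y_tr == 1])
print("1-class SVM AUC:", roc_auc_score(y_te, ocsvm.decision_function(X_te)))
```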
Size matters • Having a small number of examples is a different problem than having an imbalance • Both cause problems • We should be careful to separate them in our experiments
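One way to keep the two factors separate in an experiment, sketched below with scikit-learn's make_classification (an illustrative choice): vary total sample size and the minority fraction on an independent grid, so any effect can be attributed to size, to imbalance, or to their interaction.

```python
from sklearn.datasets import make_classification

for n_samples in (500, 5000, 50000):           # absolute dataset size
    for minority_frac in (0.5, 0.1, 0.02):     # degree of imbalance
        X, y = make_classification(
            n_samples=n_samples,
            weights=[1 - minority_frac, minority_frac],
            random_state=0,
        )
        # Train and evaluate the learner of interest at each grid point.
        print(n_samples, minority_frac, int(y.sum()), "minority examples")
```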
No problem ? • Foster: problem diminishes when datasets get large • Are some learning algorithms insensitive ? • Generative models ? • SVMs ? (it seems not after today) • Active learning, progressive sampling
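A minimal sketch of progressive sampling in this spirit, assuming scikit-learn: train on growing subsets of an imbalanced dataset and track minority-class recall, to see whether the problem really diminishes as the data grows. The learner, metric, and schedule are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 98/2 imbalanced problem (illustrative).
X, y = make_classification(n_samples=60000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Progressive sampling schedule: train on growing prefixes of the (shuffled) training set.
for n in (500, 2000, 8000, 32000):
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    rec = recall_score(y_te, clf.predict(X_te))
    print(f"n={n:6d}  minority recall={rec:.3f}")
```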
More problems ? • Imbalance detrimental to feature selection • Imbalance detrimental to clustering
ELKAN: Bogosity about learning with unbalanced data
• The goal is yes/no classification.
  • No: ranking, or probability estimation
  • Often, P(c=minority|x) < 0.5 for all examples x
• Decision trees and C4.5 are well-suited
  • No: model each class separately, then use Bayes’ rule
    • P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ]
  • No: avoid small disjuncts
    • With naïve Bayes: P(x|c) = ∏i P(xi|c)
• Under/over-sampling are appropriate
  • No: do cost-based example-specific sampling, then bagging
• ROC curves and AUC are important
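To make the "model each class separately, then use Bayes' rule" alternative concrete, here is a minimal hand-rolled Gaussian naive Bayes sketch in which the class prior P(c) enters explicitly through the posterior formula above; the synthetic data and the Gaussian-per-feature assumption are illustrative choices, not Elkan's recipe.

```python
import numpy as np
from sklearn.datasets import make_classification

# Synthetic 95/5 imbalanced problem (illustrative).
X, y = make_classification(n_samples=5000, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)

def class_density(Xc):
    """Naive-Bayes class-conditional density P(x|c) = prod_i P(x_i|c), Gaussian per feature."""
    mu, sigma = Xc.mean(axis=0), Xc.std(axis=0) + 1e-9
    def p(x):
        return np.prod(np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))
    return p

# Model each class separately...
p_pos, p_neg = class_density(X[y == 1]), class_density(X[y == 0])
prior_pos = y.mean()

# ...then combine with Bayes' rule: P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|~c)P(~c)]
def posterior(x):
    num = p_pos(x) * prior_pos
    return num / (num + p_neg(x) * (1 - prior_pos))

print("P(c=minority | x0) =", posterior(X[0]))
```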