Reflections
Robert Holte, University of Alberta
holte@cs.ualberta.ca
“unbalanced” vs. “imbalanced” • Google: searching the web for “imbalanced” finds about 53,800 results; searching for “unbalanced” finds about 465,000. • Shouldn’t we favour the minority class ???
Is “FP” meaningful ? • Elkan: individual examples have costs, so the number of misclassified positive examples is, by itself, not the quantity that matters • Moreover, if the test distribution can differ from the training distribution, the FP count measured on training data may bear no relation to the FP count at deployment. BUT…
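The contrast is easy to see in a small worked example. The sketch below uses hypothetical labels and per-example costs to compare the aggregate FP/FN counts with an example-specific cost total; it only illustrates Elkan's point and is not a method from the talk.

```python
# A minimal sketch (hypothetical labels and per-example costs) of Elkan's
# point: when individual examples carry their own misclassification costs,
# the aggregate FP/FN counts can be a poor summary of how bad a classifier is.
# Labels: 1 = positive (minority class), 0 = negative (majority class).

y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [0, 1, 1, 0, 0, 0, 1, 0]

# Hypothetical example-specific misclassification costs.
costs = [50.0, 5.0, 1.0, 1.0, 8.0, 1.0, 0.5, 1.0]

fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
total_cost = sum(c for t, p, c in zip(y_true, y_pred, costs) if t != p)

print(f"FP = {fp}, FN = {fn}, example-specific total cost = {total_cost}")
# Here FP = 2 and FN = 1, but the single missed positive (cost 50.0)
# accounts for almost all of the total cost of 51.5.
```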
Babies and Bathwater… • Not every situation involves example-specific costs and drifting within-class distributions • ROC curves are far better than accuracy • and ROC curves are better than AUC or any scalar measure • and cost curves are even better ??
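To make the “curve vs. scalar” distinction concrete, here is a minimal sketch (hypothetical scores and labels, standard library only) that traces an ROC curve by sweeping the decision threshold and then collapses it to AUC with the trapezoid rule; the slide's argument is that the curve keeps the operating-point information the scalar throws away. Cost curves (Drummond & Holte) plot expected cost against operating conditions directly and are not reproduced here.

```python
# A minimal sketch (hypothetical scores and labels) of tracing an ROC curve
# and collapsing it to the scalar AUC. The full curve shows the TPR achieved
# at every FPR; AUC alone cannot tell us where one classifier beats another.

def roc_points(scores, labels):
    """Sweep the decision threshold from high to low and record (FPR, TPR)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:          # assumes distinct scores (no tie handling)
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 1, 0, 0, 0, 0, 0]                   # 3 positives, 5 negatives
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]  # hypothetical classifier scores

curve = roc_points(scores, labels)
print("ROC points:", curve)
print("AUC =", round(auc(curve), 3))   # 0.867 for this example
```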
And the question remains… How do we select the training examples that give the best classifier for our circumstances ? (Foster’s budgeted learning problem)
Within-class imbalance • Elkan: subpopulations in the test distribution are not evenly represented in the training data • Other presenters: subpopulations within the training data are of unequal size
In Defense of studies of C4.5 and undersampling • Foster’s opening example (“budgeted learning”) is very common. • Undersampling is a common technique (SAS manual) • Different algorithms react differently to undersampling • C4.5’s reaction is not necessarily intuitive • Foster: the appropriate sampling method depends on the performance measure
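For reference, the sketch below shows the kind of majority-class undersampling such studies examine: keep every minority-class example and a random subset of the majority class. The function name, ratio parameter, and data are illustrative only; the downstream learner (e.g. C4.5) is outside the scope of the sketch.

```python
# A minimal sketch (standard library only) of random undersampling of the
# majority class. Assumes a binary problem in which the two classes differ
# in size; all names and parameters here are illustrative.
import random

def undersample(examples, labels, ratio=1.0, seed=0):
    """Keep all minority-class examples and a random subset of the majority
    class so that majority:minority is at most ratio:1."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    minority, majority = sorted(by_class, key=lambda c: len(by_class[c]))
    n_keep = min(len(by_class[majority]),
                 int(ratio * len(by_class[minority])))
    kept_majority = rng.sample(by_class[majority], n_keep)
    xs = by_class[minority] + kept_majority
    ys = [minority] * len(by_class[minority]) + [majority] * n_keep
    return xs, ys

# Hypothetical data: 2 positives and 8 negatives become a balanced 2 vs. 2 set.
X = list(range(10))
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(undersample(X, y))
```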
Endless Tweaking ? • Definitely a danger: overtuning, and a plethora of results/methods • But exploratory research is valid once a clear need is established • Some papers have presented specific hypotheses that can now be tested, e.g. “1-class SVM outperforms 2-class SVM when…”
Size matters • Having a small number of examples is a different problem from having an imbalance • Both cause problems • We should be careful to separate them in our experiments
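One way to keep the two factors apart is an experimental grid that varies total training-set size and class ratio independently, so that “few examples” and “skewed classes” are never confounded. The sketch below is only an illustration with a hypothetical synthetic data source (two univariate Gaussians), not an experimental design from the talk.

```python
# A minimal sketch of a size-by-imbalance experimental grid: the absolute
# training-set size and the class ratio are varied on separate axes, so their
# effects can be measured independently. The data generator is hypothetical.
import random

def make_dataset(n_total, pos_fraction, seed=0):
    """Two univariate Gaussian classes, with n_total examples overall and a
    pos_fraction share of positives (at least one positive is kept)."""
    rng = random.Random(seed)
    n_pos = max(1, round(n_total * pos_fraction))
    n_neg = n_total - n_pos
    pos = [(rng.gauss(+1.0, 1.0), 1) for _ in range(n_pos)]
    neg = [(rng.gauss(-1.0, 1.0), 0) for _ in range(n_neg)]
    return pos + neg

for n_total in (100, 1000, 10000):          # size axis
    for pos_fraction in (0.5, 0.1, 0.01):   # imbalance axis
        data = make_dataset(n_total, pos_fraction)
        n_pos = sum(label for _, label in data)
        print(f"size={n_total:5d}  pos_fraction={pos_fraction:4.2f}  positives={n_pos}")
        # ...train and evaluate a classifier on `data` here...
```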
No problem ? • Foster: the problem diminishes as datasets get large • Are some learning algorithms insensitive ? • Generative models ? • SVMs ? (it seems not, after today) • Active learning, progressive sampling
More problems ? • Imbalance detrimental to feature selection • Imbalance detrimental to clustering
ELKAN: Bogosity about learning with unbalanced data
• The goal is yes/no classification.
• No: ranking, or probability estimation.
• Often, P(c=minority|x) < 0.5 for all examples x.
• Decision trees and C4.5 are well-suited.
• No: model each class separately, then use Bayes’ rule:
  P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ]
• No: avoid small disjuncts.
• With naïve Bayes: P(x|c) = ∏i P(xi | c)
• Under/over-sampling are appropriate.
• No: do cost-based example-specific sampling, then bagging.
• ROC curves and AUC are important.
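As an illustration of the “model each class separately, then use Bayes’ rule” alternative, here is a minimal naive Bayes sketch over binary features; the data, the Laplace smoothing constant, and the function names are hypothetical, and the class priors P(c) come straight from the (imbalanced) training frequencies.

```python
# A minimal sketch of modelling each class separately and combining with
# Bayes' rule, here as naive Bayes over binary features:
#   P(c | x) = P(x | c) P(c) / sum_c' P(x | c') P(c')
#   P(x | c) = prod_i P(x_i | c)      (naive independence assumption)
# Data and smoothing constant are hypothetical.

def train_naive_bayes(X, y, alpha=1.0):
    """Return class priors P(c) and per-feature estimates P(x_i = 1 | c),
    with Laplace smoothing alpha."""
    classes = sorted(set(y))
    n_features = len(X[0])
    priors, likelihoods = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        likelihoods[c] = [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]
    return priors, likelihoods

def posterior(x, priors, likelihoods):
    """Combine the per-class models with Bayes' rule."""
    joint = {}
    for c in priors:
        p = priors[c]
        for xi, theta in zip(x, likelihoods[c]):
            p *= theta if xi == 1 else (1.0 - theta)
        joint[c] = p
    z = sum(joint.values())
    return {c: joint[c] / z for c in joint}

# Hypothetical imbalanced training set: 2 positives, 6 negatives.
X = [[1, 1], [1, 0], [0, 0], [0, 1], [0, 0], [0, 0], [1, 0], [0, 0]]
y = [1, 1, 0, 0, 0, 0, 0, 0]
priors, likelihoods = train_naive_bayes(X, y)
print(priors)                                   # {0: 0.75, 1: 0.25}
print(posterior([1, 1], priors, likelihoods))   # posterior favours class 1 here
```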