This presentation contrasts "imbalanced" and "unbalanced" data in machine learning, asks whether false-positive counts are meaningful, and considers how to select the training examples that give the best classifier for the circumstances at hand. It revisits studies of C4.5 and undersampling, the danger of endless tweaking, the distinction between small datasets and imbalanced ones, and the effect of imbalance on feature selection and clustering, and closes with a list of purported "bogosities" about learning with unbalanced data, touching on decision trees, Bayes' rule, and the role of ROC curves and AUC in model evaluation.
Reflections • Robert Holte • University of Alberta • holte@cs.ualberta.ca
“unbalanced” vs. “imbalanced” • Google: searched the web for imbalanced … about 53,800 hits. • Searched the web for unbalanced … about 465,000 hits. • Shouldn’t we favour the minority class?
Is “FP” meaningful? • Elkan: individual examples have costs, so the number of misclassified positive examples is irrelevant. • Moreover, if the testing distribution can differ from the training distribution, the FP rate measured on training data may bear no relation to the FP rate seen later. BUT…
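A minimal sketch of the first point, with hypothetical numbers: when each example carries its own misclassification cost, a raw false-positive count and the total cost can tell very different stories.

```python
# Hypothetical example-specific costs: FP count vs. total misclassification cost.
# Each tuple: (true_label, predicted_label, cost_if_misclassified)
examples = [
    (1, 0, 50.0),   # missed positive worth 50
    (1, 1, 500.0),  # correctly caught positive worth 500
    (0, 1, 2.0),    # cheap false alarm
    (0, 1, 2.0),    # another cheap false alarm
    (0, 0, 10.0),   # correctly rejected negative
]

fp_count = sum(1 for y, yhat, _ in examples if y == 0 and yhat == 1)
total_cost = sum(c for y, yhat, c in examples if y != yhat)

print(f"false positives: {fp_count}")    # 2
print(f"total cost:      {total_cost}")  # 54.0 -- dominated by the one missed positive
```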
Babies and Bathwater… • Not every situation involves example-specific costs and drifting within-class distributions • ROC curves are far better than accuracy • and ROC curves are better than AUC or any scalar measure • and cost curves are even better ??
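As a concrete illustration of the point about ROC curves versus accuracy, here is a minimal sketch using scikit-learn (an assumed dependency); the synthetic 95/5 dataset and logistic-regression learner are illustrative choices, not from the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced problem (illustrative).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Accuracy at the default 0.5 threshold can look fine even when minority recall is poor.
print("accuracy:", accuracy_score(y_te, (scores > 0.5).astype(int)))

# The ROC curve shows the full trade-off across thresholds; AUC compresses it to one number.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```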
And the question remains… How to select the training examples that give the best classifier for your circumstances? (Foster’s budgeted learning problem)
Within-class imbalance • Elkan: subpopulations in test distribution not evenly represented in training • Other presenters: subpopulations in training are not equal size
In Defense of studies of C4.5 and undersampling • Foster’s opening example (“budgeted learning”) is very common. • Undersampling is a common technique (SAS manual) • Different algorithms react differently to undersampling • C4.5’s reaction is not necessarily intuitive • Foster: appropriate sampling method depends on performance measure
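The sketch below illustrates the kind of setup such studies use: random undersampling of the majority class before training a decision tree. scikit-learn's DecisionTreeClassifier stands in for C4.5 here, and the dataset and 1:1 sampling ratio are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 97/3 imbalanced problem (illustrative).
X, y = make_classification(n_samples=10000, weights=[0.97, 0.03], random_state=0)

rng = np.random.default_rng(0)
pos = np.flatnonzero(y == 1)                               # minority class
neg = np.flatnonzero(y == 0)                               # majority class
neg_kept = rng.choice(neg, size=len(pos), replace=False)   # undersample to a 1:1 ratio
idx = np.concatenate([pos, neg_kept])

# DecisionTreeClassifier used as a stand-in for C4.5.
tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
# How the induced tree changes with the sampling ratio is exactly the kind of
# behaviour the C4.5/undersampling studies examine.
```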
Endless Tweaking ? • Definitely a danger • Overtuning • Plethora of results/methods • But exploratory research is valid once a clear need is established • Some papers have presented specific hypotheses that can now be tested • 1-class SVM outperforms 2-class SVM when…
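A hypothesis like the last bullet can be tested directly. The sketch below shows one way, assuming scikit-learn: fit a one-class SVM on minority examples only and compare it with a standard two-class SVM on the same held-out data; the dataset, kernel, and nu value are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM, SVC

# Synthetic 95/5 imbalanced problem (illustrative).
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Two-class SVM, scored by its decision-function margin.
svc = SVC(kernel="rbf").fit(X_tr, y_tr)
print("2-class SVM AUC:", roc_auc_score(y_te, svc.decision_function(X_te)))

# One-class SVM trained on the minority class alone.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(X_tr[y_tr == 1])
print("1-class SVM AUC:", roc_auc_score(y_te, ocsvm.decision_function(X_te)))
```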
Size matters • Having a small number of examples is a different problem than having an imbalance • Both cause problems • We should be careful to separate them in our experiments
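One way to keep the two factors separate in an experiment, sketched below with scikit-learn's make_classification (an illustrative choice): vary total sample size and the minority fraction on an independent grid, so any effect can be attributed to size, to imbalance, or to their interaction.

```python
from sklearn.datasets import make_classification

for n_samples in (500, 5000, 50000):           # absolute dataset size
    for minority_frac in (0.5, 0.1, 0.02):     # degree of imbalance
        X, y = make_classification(
            n_samples=n_samples,
            weights=[1 - minority_frac, minority_frac],
            random_state=0,
        )
        # Train and evaluate the learner of interest at each grid point.
        print(n_samples, minority_frac, int(y.sum()), "minority examples")
```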
No problem ? • Foster: problem diminishes when datasets get large • Are some learning algorithms insensitive ? • Generative models ? • SVMs ? (it seems not after today) • Active learning, progressive sampling
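A minimal sketch of progressive sampling in this spirit, assuming scikit-learn: train on growing subsets of an imbalanced dataset and track minority-class recall, to see whether the problem really diminishes as the data grows. The learner, metric, and schedule are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 98/2 imbalanced problem (illustrative).
X, y = make_classification(n_samples=60000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Progressive sampling schedule: train on growing prefixes of the (shuffled) training set.
for n in (500, 2000, 8000, 32000):
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    rec = recall_score(y_te, clf.predict(X_te))
    print(f"n={n:6d}  minority recall={rec:.3f}")
```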
More problems ? • Imbalance detrimental to feature selection • Imbalance detrimental to clustering
ELKAN: Bogosity about learning with unbalanced data
• The goal is yes/no classification.
  • No: ranking, or probability estimation
  • Often, P(c=minority|x) < 0.5 for all examples x
• Decision trees and C4.5 are well-suited
  • No: model each class separately, then use Bayes’ rule
    • P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ]
  • No: avoid small disjuncts
    • With naïve Bayes: P(x|c) = ∏i P(xi|c)
• Under/over-sampling are appropriate
  • No: do cost-based example-specific sampling, then bagging
• ROC curves and AUC are important
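To make the "model each class separately, then use Bayes' rule" alternative concrete, here is a minimal hand-rolled Gaussian naive Bayes sketch in which the class prior P(c) enters explicitly through the posterior formula above; the synthetic data and the Gaussian-per-feature assumption are illustrative choices, not Elkan's recipe.

```python
import numpy as np
from sklearn.datasets import make_classification

# Synthetic 95/5 imbalanced problem (illustrative).
X, y = make_classification(n_samples=5000, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)

def class_density(Xc):
    """Naive-Bayes class-conditional density P(x|c) = prod_i P(x_i|c), Gaussian per feature."""
    mu, sigma = Xc.mean(axis=0), Xc.std(axis=0) + 1e-9
    def p(x):
        return np.prod(np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))
    return p

# Model each class separately...
p_pos, p_neg = class_density(X[y == 1]), class_density(X[y == 0])
prior_pos = y.mean()

# ...then combine with Bayes' rule: P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|~c)P(~c)]
def posterior(x):
    num = p_pos(x) * prior_pos
    return num / (num + p_neg(x) * (1 - prior_pos))

print("P(c=minority | x0) =", posterior(X[0]))
```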