Gary Weiss, Kate McCarthy, Bibi Zabar Fordham University

Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?

Background • Highly skewed data is common • Typically more interested in correctly classifying the minority class examples • Without special measures, classifier will rarely predict the minority class • A common approach: balance the data • Imposes non-uniform misclassification costs* • Altering the training set class distribution from 1:1 to 2:1 essentially applies a cost ratio of 2:1 * C. Elkan. The foundations of cost-sensitive learning. IJCAI 2001.
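The link between class distribution and cost ratio can be sketched in a few lines. This is only an illustration of the 1:1 to 2:1 statement above, assuming a two-class problem with ratios written as minority:majority; the function name `implied_cost_ratio` and its arguments are hypothetical and do not come from the slides.

```python
from fractions import Fraction

def implied_cost_ratio(n_min_before, n_maj_before, n_min_after, n_maj_after):
    """Cost ratio (false negative : false positive) implied by resampling.

    Assumption for illustration: multiplying the minority:majority ratio of
    the training set by k acts like making an error on a minority example
    k times as costly as an error on a majority example.
    """
    before = Fraction(n_min_before, n_maj_before)
    after = Fraction(n_min_after, n_maj_after)
    return after / before

# Moving from a 1:1 distribution (100 vs. 100) to 2:1 (200 vs. 100)
ratio = implied_cost_ratio(100, 100, 200, 100)  # Fraction(2, 1)
```

The same 2:1 effect could be reached by halving the majority class (100 vs. 50) instead of doubling the minority class, which is the choice between down-sampling and up-sampling.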

Two Competing Approaches • Cost-sensitive learning algorithm • The algorithm itself handles cost-sensitivity • Does not throw away any data • Sampling • Down-sample the majority class • Discards potentially useful data • Up-sample the minority class • Increases amount of training data • Replicated examples may lead to overfitting
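A minimal sketch of the two sampling approaches, assuming random sampling to a balanced 1:1 distribution and a majority class at least as large as the minority class. The helper names `down_sample` and `up_sample` are hypothetical; the slides do not specify how the sampling was implemented.

```python
import random

def down_sample(majority, minority, rng):
    """Randomly discard majority examples until both classes are the same size."""
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    return kept, list(minority)

def up_sample(majority, minority, rng):
    """Replicate randomly chosen minority examples until both classes are the same size."""
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]  # exact duplicates
    return list(majority), list(minority) + extra

rng = random.Random(0)
majority = list(range(90))       # 90 majority examples
minority = list(range(90, 100))  # 10 minority examples

down_maj, down_min = down_sample(majority, minority, rng)
up_maj, up_min = up_sample(majority, minority, rng)
```

Down-sampling leaves 20 training examples here and discards 80, while up-sampling yields 180 examples of which only 10 minority examples are distinct. That is the trade-off listed above: lost data on one side, replicated examples and possible overfitting on the other.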

The Question ? • Which method is best? • cost-sensitive learning algorithm • up-sampling • down-sampling • Most prior work compares sampling methods

Experiments • We assume that cost information is known • Since cost info is not really provided, we evaluate a variety of cost ratios and report all results • Classifier performance is evaluated using total cost • Used cost-sensitive C5.0 • Evaluated scenarios where C_FN ≥ C_FP • All results are based on averages over 10 runs • For cost-sensitive learning, cost info passed in • For sampling approaches • Altered the training data to “impose” the specified misclassification cost
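The total cost metric can be sketched as follows. This assumes the minority class is labeled 1 (so a missed minority example is a false negative), that correct predictions cost nothing, and that labels are given as plain lists. The function name `total_cost` is hypothetical and the numbers are made up for illustration, not taken from the experiments.

```python
def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Sum of misclassification costs over a test set.

    cost_fn: cost of predicting majority (0) for a minority (1) example
    cost_fp: cost of predicting minority (1) for a majority (0) example
    """
    pairs = list(zip(y_true, y_pred))
    n_fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n_fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return n_fn * cost_fn + n_fp * cost_fp

y_true = [1, 1, 0, 0, 0]
y_pred = [0, 1, 1, 0, 0]   # one false negative, one false positive

cost_at_1_to_1 = total_cost(y_true, y_pred, cost_fn=1, cost_fp=1)    # 2
cost_at_10_to_1 = total_cost(y_true, y_pred, cost_fn=10, cost_fp=1)  # 11
```

The same predictions score very differently as the cost ratio changes, which is why results are reported across a range of ratios rather than at a single one.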

Fourteen Data Sets

Results: Letter-a Data Set 4% minority 20,000 examples

Weather Data Set 40% minority 5,597 examples

Coding Data Set 50% minority 20,000 examples

Blackjack Data Set 36% minority 15,000 examples

Contraceptive Data Set 23% minority 1,473 examples

Results: 1st/2nd/3rd Place Finishes

Comparison of 3 Methods

Discussion • Results vary widely based on the data set • No method consistently outperforms both of the other two, or even one of them • Are there any patterns based on the properties of the data sets?

Discussion II: Patterns • For the four smallest data sets (size < 209) • Up-sampling does by far the best • Down-sampling does poorly since it discards data • For the eight largest data sets (size > 10,000) • Cost-sensitive learning does best • Beats up-sampling on average by 5.5% • Beats down-sampling on average by 5.7% • No clear pattern based on the degree of class imbalance

Discussion III • Why might cost-sensitive learning algorithm perform best for large data sets? • Perhaps this method requires accurate probability estimates in order to perform well • This requires many examples per classification “rule”
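One way to illustrate this hypothesis: a learner that minimizes expected cost predicts the minority class whenever the expected cost of a false negative exceeds that of a false positive, so its decisions depend directly on the estimated class probability. The sketch below assumes zero cost for correct predictions and uses a plain frequency estimate at a rule or leaf. The names are hypothetical, and the slides do not state that C5.0 works this way internally.

```python
def leaf_estimate(n_minority, n_total):
    """Frequency-based estimate of P(minority) from the examples covered by one rule."""
    return n_minority / n_total

def min_cost_prediction(p_minority, cost_fn, cost_fp):
    """Return 1 (minority) if predicting majority has the higher expected cost."""
    expected_cost_predict_0 = p_minority * cost_fn        # risk of a false negative
    expected_cost_predict_1 = (1 - p_minority) * cost_fp  # risk of a false positive
    return 1 if expected_cost_predict_0 > expected_cost_predict_1 else 0

# With a 10:1 cost ratio the decision flips near p = 1/11 (about 0.09).
small_rule = leaf_estimate(1, 5)     # 0.2, but based on only 5 examples
large_rule = leaf_estimate(10, 200)  # 0.05, based on 200 examples

pred_small = min_cost_prediction(small_rule, cost_fn=10, cost_fp=1)
pred_large = min_cost_prediction(large_rule, cost_fn=10, cost_fp=1)
```

A rule covering 5 examples can only produce estimates in steps of 0.2, so a single minority example moves it across the decision point. A rule covering 200 examples resolves probabilities near the threshold far more finely, which is consistent with the suggestion that cost-sensitive learning benefits from larger data sets.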

Conclusion • No consistent winner between cost-sensitive learning and sampling methods • Substantial differences for specific data sets • Cost-sensitive learning may be best for large data sets • Up-sampling appears best for small data sets

Follow-up Questions • Why isn’t cost-sensitive learning the best? • Can we identify problems with cost-sensitive learners? • Can we improve cost-sensitive learners? • Are we better off not using a cost-sensitive learner and using sampling instead?

Future Work • There are areas for future work • Use additional cost-sensitive learners • Use larger data sets (then cost-sensitive best?) • Include more sophisticated sampling schemes • Don’t assume known costs (ROC analysis) • I believe more comprehensive studies are needed and are underway
