
Class Imbalance vs. Cost-Sensitive Learning


Presentation Transcript


  1. Class Imbalance vs. Cost-Sensitive Learning Presenter: Hui Li, University of Ottawa

  2. Contents • Introduction • Making Classifiers Balanced • Making Classifiers Cost-Sensitive • MetaCost • Stratification

  3. Class Imbalance vs. Asymmetric Misclassification Costs • Most learning algorithms assume that the data sets are balanced and that all errors have the same cost • This is seldom true • In database marketing, the cost of mailing to a non-respondent is very small, but the cost of not mailing to someone who would respond is the entire profit lost • Both class imbalance and the cost of misclassification should be considered

  4. Class Imbalance vs. Asymmetric Misclassification Costs • Class imbalance: one class occurs much more often than the other • Asymmetric misclassification costs: the cost of misclassifying an example from one class is much larger than the cost of misclassifying an example from the other class • One way to correct for imbalance: train a cost-sensitive classifier with the misclassification cost of the minority class set greater than that of the majority class • One way to make an algorithm cost-sensitive: intentionally imbalance the training set

  5. Making Classifiers Balanced • Baseline methods • Random over-sampling • Random under-sampling • Under-sampling methods • Tomek links • Condensed Nearest Neighbor rule (CNN) • One-sided selection • CNN + Tomek links • Neighborhood Cleaning Rule • Over-sampling methods • SMOTE • Combinations of over-sampling with under-sampling • SMOTE + Tomek links • SMOTE + ENN
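
The sampling methods listed above are all available in the open-source imbalanced-learn package; the sketch below shows roughly how they would be applied to a hypothetical imbalanced data set (the 9:1 class ratio and random seeds are illustrative, not from the slides).

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import (
    RandomUnderSampler, TomekLinks, CondensedNearestNeighbour,
    OneSidedSelection, NeighbourhoodCleaningRule,
)
from imblearn.combine import SMOTETomek, SMOTEENN

# Hypothetical imbalanced data set (roughly 9:1 class ratio).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

samplers = {
    "random over-sampling": RandomOverSampler(random_state=0),
    "random under-sampling": RandomUnderSampler(random_state=0),
    "Tomek links": TomekLinks(),
    "CNN": CondensedNearestNeighbour(random_state=0),
    "one-sided selection": OneSidedSelection(random_state=0),
    "NCL": NeighbourhoodCleaningRule(),
    "SMOTE": SMOTE(random_state=0),
    "SMOTE + Tomek links": SMOTETomek(random_state=0),
    "SMOTE + ENN": SMOTEENN(random_state=0),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)  # rebalanced training set
    print(name, dict(zip(*np.unique(y_res, return_counts=True))))
```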

  6. Making Classifiers Cost-Sensitive • Substantial work has gone into making individual algorithms cost-sensitive [4] • A better solution would be a procedure that converts a broad variety of classifiers into cost-sensitive ones • Stratification: change the frequency of classes in the training data in proportion to their cost • Shortcomings • It distorts the distribution of examples • If done by under-sampling, it reduces the data available for learning • If done by over-sampling, it increases learning time • MetaCost: a general method for making classifiers cost-sensitive
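
As a concrete illustration of stratification, the sketch below rebalances a two-class training set in proportion to a cost ratio, either by discarding majority-class examples or by replicating minority-class examples. The function and parameter names are our own, not from the slides.

```python
import numpy as np
from sklearn.utils import resample

def stratify(X, y, cost_ratio, minority=1, method="under", random_state=0):
    """Change class frequencies in proportion to misclassification cost.
    X, y: NumPy arrays. cost_ratio: how many times more costly it is to
    misclassify the minority class than the majority class (illustrative)."""
    X_min, y_min = X[y == minority], y[y == minority]
    X_maj, y_maj = X[y != minority], y[y != minority]
    if method == "under":
        # Shrink the majority class: loses training data.
        n = max(1, int(len(X_maj) / cost_ratio))
        X_maj, y_maj = resample(X_maj, y_maj, n_samples=n,
                                replace=False, random_state=random_state)
    else:
        # Replicate the minority class: increases learning time.
        n = int(len(X_min) * cost_ratio)
        X_min, y_min = resample(X_min, y_min, n_samples=n,
                                replace=True, random_state=random_state)
    return np.vstack([X_maj, X_min]), np.concatenate([y_maj, y_min])
```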

  7. MetaCost • Wraps a cost-minimizing "meta-learning" stage around the classifier • Treats the underlying classifier as a black box, requiring no knowledge of its functioning and no change to it • Applicable to any number of classes and to arbitrary cost matrices • Always produces large cost reductions compared to the cost-blind classifier

  8. MetaCost • Conditional risk R(i|x) is the expected cost of predicting that x belongs to class i • R(i|x) = ∑j P(j|x) C(i, j) • The Bayes optimal prediction is guaranteed to achieve the lowest possible overall cost • The goal of the MetaCost procedure is to relabel the training examples with their "optimal" classes • Therefore, we need a way to estimate the class probabilities P(j|x) • Learn multiple classifiers and, for each example, use each class's fraction of the total vote as an estimate of its probability given the example • Reason: most modern learners are highly unstable, in that applying them to slightly different training sets tends to produce very different models and correspondingly different predictions for the same examples, while the overall accuracy remains broadly unchanged; accuracy can be much improved by learning several models in this way and then combining their predictions, for example by voting
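
A minimal sketch of the conditional-risk computation and the resulting Bayes optimal prediction; the cost matrix and probabilities below are made up for illustration.

```python
import numpy as np

def bayes_optimal_class(class_probs, cost_matrix):
    """R(i|x) = sum_j P(j|x) * C(i, j); pick the class i with minimum risk.
    class_probs[j] = estimated P(j|x); cost_matrix[i, j] = C(i, j), the cost
    of predicting class i when the true class is j."""
    risks = cost_matrix @ class_probs   # vector of R(i|x) over all i
    return int(np.argmin(risks)), risks

# Illustrative two-class case (costs and probabilities are made up):
C = np.array([[0.0, 1000.0],   # predict 0: costs 1000 if the true class is 1
              [100.0,   0.0]]) # predict 1: costs 100 if the true class is 0
p = np.array([0.8, 0.2])       # P(0|x) = 0.8, P(1|x) = 0.2
print(bayes_optimal_class(p, C))
# -> class 1, since R(1|x) = 80 < R(0|x) = 200, even though class 0 is more probable
```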

  9. MetaCost procedure • Form multiple bootstrap replicates of the training set • Learn a classifier on each replicate • Estimate each class's probability for each example by the fraction of votes it receives from the ensemble • Use the conditional risk equation to relabel each training example with the estimated optimal class • Re-learn the classifier on the relabeled training set
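
A simplified sketch of this procedure in Python, using scikit-learn's clone to copy an arbitrary base learner. It follows the variant discussed in the lesion studies below (0-1 votes, all models used for every example); the function and parameter names are our own, and classes are assumed to be encoded as integers 0..n_classes-1.

```python
import numpy as np
from sklearn.base import clone

def metacost(base_learner, X, y, cost_matrix, n_resamples=50, random_state=0):
    """Simplified MetaCost sketch: bootstrap, vote, relabel, re-learn.
    X, y are NumPy arrays; cost_matrix[i, j] = C(i, j)."""
    rng = np.random.RandomState(random_state)
    n, n_classes = len(y), cost_matrix.shape[0]
    votes = np.zeros((n, n_classes))

    # Steps 1-3: bootstrap replicates, one classifier each, 0-1 vote counting.
    for _ in range(n_resamples):
        idx = rng.randint(0, n, n)                  # bootstrap replicate
        model = clone(base_learner).fit(X[idx], y[idx])
        votes[np.arange(n), model.predict(X)] += 1  # this model's vote per example

    probs = votes / n_resamples                     # estimated P(j|x)

    # Step 4: relabel each example with its estimated Bayes optimal class.
    risks = probs @ cost_matrix.T                   # R(i|x) = sum_j P(j|x) C(i, j)
    y_relabeled = risks.argmin(axis=1)

    # Step 5: re-learn the classifier on the relabeled training set.
    return clone(base_learner).fit(X, y_relabeled)
```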

  10. Evaluation of MetaCost • Does MetaCost reduce cost compared to the error-based classifier and to stratification? • 27 databases from the UCI repository: 15 multiclass and 12 two-class databases • C4.5 decision tree learner with the C4.5Rules post-processor (C4.5R) • Randomly select 2/3 of the examples in each database for training, using the remaining 1/3 to measure the cost of the resulting predictions • Results are the average of 20 such runs
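
A sketch of this evaluation protocol, assuming NumPy arrays and a scikit-learn-style classifier as a stand-in for C4.5/C4.5Rules; the helper name and seed handling are our own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def average_cost(learner_factory, X, y, cost_matrix, n_runs=20, seed=0):
    """Average test-set cost over repeated 2/3 train / 1/3 test splits.
    `learner_factory` is a hypothetical callable returning a fresh classifier,
    e.g. lambda: DecisionTreeClassifier(), standing in for C4.5/C4.5Rules."""
    costs = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1/3, random_state=seed + run)
        pred = learner_factory().fit(X_tr, y_tr).predict(X_te)
        costs.append(cost_matrix[pred, y_te].mean())  # C(predicted, true)
    return float(np.mean(costs))
```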

  11. Multiclass Problems • Experiments were conducted with two different cost models • Fixed interval model • Each C(i, i) was chosen randomly from a uniform distribution over the interval [0, 1000] • Each C(i, j), i ≠ j, was chosen randomly from the fixed interval [0, 10000] • Different costs were generated for each of the 20 runs conducted on each database • Class probability-dependent model • Same C(i, i) as in model 1 • Each C(i, j), i ≠ j, was chosen with uniform probability from the interval [0, 2000 P(i)/P(j)], where P(i) and P(j) are the probabilities of occurrence of classes i and j in the training set • This means that the highest costs are for misclassifying a rare class as a frequent one, as in the database marketing domain
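
A sketch of how such random cost matrices might be generated; the convention C[i, j] = cost of predicting class i when the true class is j follows the risk equation on slide 8, and the function names are ours.

```python
import numpy as np

def fixed_interval_costs(n_classes, rng):
    """Cost model 1: diagonal from U[0, 1000], off-diagonal from U[0, 10000]."""
    C = rng.uniform(0, 10000, size=(n_classes, n_classes))
    C[np.diag_indices(n_classes)] = rng.uniform(0, 1000, size=n_classes)
    return C

def probability_dependent_costs(y, n_classes, rng):
    """Cost model 2: C(i, j) ~ U[0, 2000 * P(i)/P(j)] for i != j, so
    misclassifying a rare class as a frequent one is most expensive.
    y: integer class labels 0..n_classes-1 from the training set."""
    p = np.bincount(y, minlength=n_classes) / len(y)   # class priors P(.)
    C = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            C[i, j] = (rng.uniform(0, 1000) if i == j
                       else rng.uniform(0, 2000 * p[i] / p[j]))
    return C

# Usage: draw fresh costs for each run, e.g. rng = np.random.RandomState(run)
```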

  12. Multiclass Problems • In the fixed interval case • Neither form of stratification is very effective in reducing costs • MetaCost reduces costs compared to C4.5R and under-sampling in all but one database • MetaCost reduces costs compared to over-sampling in all but three • In the probability-dependent case • Both under-sampling and over-sampling reduce cost compared to C4.5R in 12 of the 15 databases • MetaCost achieves lower costs than C4.5R and both forms of stratification in all 15 databases • The average cost reduction obtained by MetaCost compared to C4.5R is approximately twice as large as that obtained by under-sampling, and five times that obtained by over-sampling • Conclusion: MetaCost is the cost-reduction method of choice for multiclass problems

  13. Multiclass Problems

  14. Two-class Problems • Cost model • Let 1 be the minority class and 2 the majority class • C(1, 1) = C(2, 2) = 0 • C(1, 2) = 1000 • C(2, 1) = 1000r, where r was set alternately to 2, 5, and 10 • Results • Over-sampling is not very effective in reducing cost for any of the cost ratios • Under-sampling is effective for r = 5 and r = 10, but not for r = 2 • MetaCost reduces costs compared to C4.5R, under-sampling and over-sampling on almost all databases, for all cost ratios • Conclusion: MetaCost is the cost-reduction method of choice for two-class problems
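
The corresponding cost matrix in code, using 0-based indices (class 0 stands for the slides' minority class 1, class 1 for the majority class 2):

```python
import numpy as np

def two_class_costs(r):
    """C[i, j] = cost of predicting class i when the true class is j.
    Class 0 = minority, class 1 = majority; misclassifying a minority
    example costs r times more than the reverse error."""
    return np.array([[0.0,        1000.0],   # predict minority
                     [1000.0 * r,    0.0]])  # predict majority

cost_matrices = {r: two_class_costs(r) for r in (2, 5, 10)}
```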

  15. Two-class Problems

  16. Lesion Studies of MetaCost 1. Q: How sensitive are the results to the number of resamples used? • E: using 20 and 10 resamples instead of 50 • R: • Cost increases as the number of resamples decreases, but only gradually • There is no significant difference between the costs obtained with m = 50 and m = 20 • With m = 10, MetaCost still reduces costs compared to C4.5R and both forms of stratification in almost all datasets

  17. Lesion Studies of MetaCost 2. Q: Would it be enough to simply use the class probabilities produced by a single run of the error-based classifier on the full training set? • E: relabeling the training examples using the class probabilities produced by a single run of C4.5R on all the data (labeled "C4 Probs") • R: • This produces worse results than MetaCost and under-sampling in almost all datasets, for all cost ratios • It still outperforms over-sampling and C4.5R

  18. Lesion Studies of MetaCost 3. Q: How well would MetaCost do if the class probabilities produced by C4.5R were ignored, and the probability of a class was estimated simply as the fraction of models that predicted it? • E: ignoring the class probabilities produced by C4.5R (labeled "0-1 votes") • R: • This increases cost in a majority of the datasets, but the relative differences are generally minor • MetaCost still outperforms the other methods in a large majority of the datasets, for all cost ratios

  19. Lesion Studies of MetaCost 4. Q: Would MetaCost perform better if all models were used in relabeling an example, irrespective of whether the example was used to learn them or not? • E: using all models in relabeling an example (labeled "all Ms") • R: • This decreases cost for r = 10 but increases it for r = 5 and r = 2 • In all three cases the relative differences are generally minor, and the performance vs. C4.5R and stratification is generally similar

  20. Lesion Studies of MetaCost

  21. Problem with MetaCost • MetaCost increases learning time compared to the error-based classifier • Reason: MetaCost increases time by a fixed factor, which is approximately the number of re-samples • Solutions • Parallelize the multiple runs of the error-based classifier • Use re-samples that are smaller than the original training set

  22. Stratification • Baseline method: C4.5 combined with under-sampling or over-sampling • Performance analysis technique: cost curves

  23. Cost Curve • The expected cost of a classifier is represented explicitly • Easy to understand • Allows the experimenter to immediately see the range of costs • Allows the experimenter to see where a particular classifier is best and, quantitatively, how much better it is than other classifiers

  24. Cost Curve • X-axis: probability cost function for positive examples, PCF(+) = w+ / (w+ + w-) • Y-axis: expected cost normalized with respect to the cost incurred when every example is incorrectly classified, NE[C] = ((1 - TP) w+ + FP w-) / (w+ + w-) • Note: w+ = p(+) C(-|+) and w- = p(-) C(+|-) • p(a): probability of a given example being in class a • C(a|b): cost incurred if an example in class b is misclassified as being in class a • Interpretation: the expected cost of a classifier across all possible choices of misclassification costs and class distributions
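
For a single classifier with fixed true positive rate TP and false positive rate FP, the definitions above reduce to a straight line over PCF(+); a small sketch (the rates below are made up):

```python
import numpy as np

def cost_curve_line(tp_rate, fp_rate, n_points=101):
    """Cost-curve representation of a single classifier (Drummond & Holte).
    x: PCF(+) = w+ / (w+ + w-), with w+ = p(+)C(-|+) and w- = p(-)C(+|-).
    y: NE[C] = (1 - TP) * PCF(+) + FP * (1 - PCF(+)),
    i.e. a straight line from (0, FP) to (1, 1 - TP)."""
    pcf = np.linspace(0.0, 1.0, n_points)
    nec = (1.0 - tp_rate) * pcf + fp_rate * (1.0 - pcf)
    return pcf, nec

# Illustrative classifier with TP = 0.8, FP = 0.1:
pcf, nec = cost_curve_line(0.8, 0.1)
```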

  25. Comparing the Sampling Schemes • Data set: Sonar, 208 instances (111 mines, 97 rocks), 60 features • Bold dashed curve: performance of C4.5 using under-sampling • Bold continuous curve: over-sampling

  26. Comparing the Sampling Schemes • Data set: Japanese credit, 690 instances (307 positive, 383 negative), 15 features • Bold dashed curve: performance of C4.5 using under-sampling • Bold continuous curve: over-sampling

  27. Comparing the Sampling Schemes • Data set: breast cancer, 286 instances (201 non-recurrences, 85 recurrences), 9 features • Bold dashed curve: performance of C4.5 using under-sampling • Bold continuous curve: over-sampling

  28. Comparing the Sampling Schemes • Data set: sleep, 840 instances (100 of class 1, 140 of class 2), 15 features • Bold dashed curve: performance of C4.5 using under-sampling • Bold continuous curve: over-sampling

  29. Comparing the Sampling Schemes • Under-sampling produces a cost curve that is reasonably cost-sensitive • Over-sampling produces a cost curve that is much less sensitive: performance varies little from that at the data set's original frequency • The under-sampling scheme outperforms the over-sampling scheme

  30. Investigating Over-sampling Curves • Over-sampling prunes less and thus produces specialization, narrowing the region surrounding instances of the more common class as their number is increased; it therefore generalizes less than under-sampling • For data sets where there was appreciable pruning at the original frequency, over-sampling produced some overall cost sensitivity • Disabling the stopping criterion removes the small additional sensitivity shown at the ends of the curves

  31. Turn off Pruning

  32. Disable stopping criterion

  33. Investigating Under-sampling Curves • Disabling pruning and the early stopping criterion makes no real change to under-sampling • The curve still maintains roughly the same shape, not becoming as straight as the one produced by over-sampling

  34. Disable different features of C4.5

  35. Turn off Pruning

  36. Comparing Weighting and Sampling • Two weighting schemes • Up-weighting, analogous to over-sampling, increases the weight of one of the classes while keeping the weight of the other class at one • Down-weighting, analogous to under-sampling, decreases the weight of one of the classes while keeping the weight of the other class at one • If we represent misclassification costs and class frequencies by means of internal weights within C4.5, disabling these features does make a difference • The curve for up-weighting is very close to that for over-sampling • The curve for down-weighting is close to, but sometimes better than, that for under-sampling • Turning off pruning and then the stopping criterion produces a curve that is very straight
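
A minimal sketch of representing class importance through internal weights rather than sampling, using scikit-learn's DecisionTreeClassifier as a rough stand-in for C4.5; the function and parameter names are our own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for C4.5

def weighted_fit(X, y, target_class, factor, random_state=0):
    """Weight one class while keeping the other class's weight at one.
    factor > 1 on the minority class plays the role of up-weighting
    (analogous to over-sampling); factor < 1 on the majority class plays
    the role of down-weighting (analogous to under-sampling)."""
    w = np.ones(len(y))
    w[y == target_class] = factor
    clf = DecisionTreeClassifier(random_state=random_state)
    return clf.fit(X, y, sample_weight=w)
```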

  37. Comparing Weighting and Sampling • If the performance curves of under-sampling and down-weighting are similar, and disabling these internal factors makes down-weighting similar to up-weighting, why does disabling them not have the same effect when under-sampling? • Explanation • Much of the cost sensitivity when under-sampling comes from the actual removal of instances • When we turned off many of the factors when down-weighting, the branch was still grown and the region still labeled • When the instances are removed from the training set, this cannot happen

  38. Comparing Weighting and Sampling

  39. Disabling Factors when Down-weighting

  40. Improving Over-sampling • As over-sampling tends to disable pruning and other factors, perhaps we should increase their influence • Experiment • Given an over-sampling ratio, the stopping criterion is set to 2 times this ratio and the pruning confidence factor to 0.25 divided by the ratio • Changing the default factors for over-sampling of the Sonar data set • One class is over-sampled by 2.5 times • The stopping criterion is then 5 • The pruning confidence factor is 0.1 • Result • Increasing the factors in proportion to the number of duplicates in the training set does indeed have the desired effect
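
A small sketch of the parameter adjustment described above, assuming C4.5's usual defaults of m = 2 (stopping criterion) and c = 0.25 (pruning confidence); the function name is ours, and the adjusted values would be passed to whatever C4.5/J48 implementation is in use.

```python
def adjusted_c45_params(oversampling_ratio, m_default=2, c_default=0.25):
    """Scale C4.5's stopping criterion (-m) and pruning confidence (-c)
    in proportion to the amount of over-sampling."""
    m = m_default * oversampling_ratio   # stopping criterion: 2 * ratio
    c = c_default / oversampling_ratio   # pruning confidence: 0.25 / ratio
    return m, c

# Sonar example from the slide: one class over-sampled by 2.5 times.
print(adjusted_c45_params(2.5))  # -> (5.0, 0.1)
```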

  41. Improving Over-sampling

  42. Conclusion • MetaCost is applicable to any number of classes and to arbitrary cost matrices • MetaCost always produces large cost reductions compared to the cost-blind classifier • Using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison • Under-sampling produces a reasonable sensitivity to changes in misclassification costs and class distribution • Over-sampling shows little sensitivity: there is often little difference in performance when misclassification costs are changed • Over-sampling can be made cost-sensitive if the pruning and early stopping parameters are set in proportion to the amount of over-sampling that is done • The extra computational cost of using over-sampling is unwarranted, as the performance achieved is, at best, the same as under-sampling

  43. References • [1] Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 1999), pp. 155-164. • [2] Drummond, C., and Holte, R. C. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Data Sets II (2003). • [3] Drummond, C., and Holte, R. C. Explicitly representing expected cost: An alternative to ROC representation. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000), pp. 198-207. • [4] Turney, P. Cost-sensitive learning bibliography. Online bibliography, Institute for Information Technology, National Research Council of Canada, Ottawa, Canada, 1997.
