
Behavior of Methods for Balancing ML Training Data

This study evaluates 10 methods for dealing with class imbalance in machine learning training data. Experimental results on 13 UCI datasets are presented.


Presentation Transcript


  1. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data • Advisor: Dr. Hsu • Presenter: Chien-Shing Chen • Authors: Gustavo E. A. P. A. Batista, Ronaldo C. Prati, Maria Carolina Monard • SIGKDD, 2004

  2. Outline • Motivation • Objective • Introduction • k-NN • 10 Methods • Experimental Results • Conclusions • Personal Opinion

  3. Motivation • Class imbalance in training data causes significant losses of performance in standard classifiers

  4. Objective • A broad experimental evaluation of 10 methods for dealing with the class imbalance problem • Experiments on 13 UCI data sets

  5. Introduction • There is a large imbalance between the majority class and the minority class • The classes also present some degree of overlapping • A classifier may incorrectly classify many cases from the minority class because the nearest neighbors of these cases are examples belonging to the majority class

  6. Introduction • Accuracy alone is a poor measure under class imbalance • For instance, it is straightforward to create a classifier having an accuracy of 99% in a domain where the majority class proportion corresponds to 99% of the examples, by simply forecasting every new example as belonging to the majority class

  7. Introduction

  8. Introduction • The area under the ROC curve (AUC) • represents the expected performance as a single scalar • it is equivalent to the Wilcoxon rank-sum test
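The equivalence mentioned on this slide can be made concrete: AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, which is exactly the normalized Mann-Whitney U (Wilcoxon rank-sum) statistic. The sketch below is not from the slides or the paper; the helper name auc_from_ranks is illustrative.

```python
# AUC via the Wilcoxon rank-sum / Mann-Whitney U statistic (illustrative sketch).
import numpy as np
from scipy.stats import rankdata

def auc_from_ranks(scores, labels):
    """AUC computed from the rank sum of the positive-class scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    ranks = rankdata(scores)                 # average ranks handle ties
    rank_sum_pos = ranks[labels == 1].sum()  # Wilcoxon rank sum of positives
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

# Example: scores from some classifier, label 1 = minority (positive) class
print(auc_from_ranks([0.9, 0.8, 0.4, 0.3, 0.2], [1, 0, 1, 0, 0]))  # 0.8333...
```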

  9. Methods • Implement k-NN algorithm

  10. Methodology • Implement the k-NN algorithm • Use the Heterogeneous Value Difference Metric (HVDM) distance function • Euclidean distance for quantitative attributes • VDM distance for qualitative attributes
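A simplified sketch of that heterogeneous distance follows. It is an assumption-laden illustration, not the authors' implementation: the normalization (dividing numeric differences by four standard deviations) follows the usual HVDM convention, the VDM exponent is fixed at 2, and unseen nominal values are not handled.

```python
# Simplified heterogeneous distance: Euclidean terms for numeric attributes,
# VDM terms for nominal attributes (illustrative sketch).
import numpy as np
from collections import defaultdict

def vdm_tables(X_nom, y):
    """P(class | attribute value) for each nominal attribute column."""
    classes = np.unique(y)
    tables = []
    for j in range(X_nom.shape[1]):
        counts = defaultdict(lambda: np.zeros(len(classes)))
        for value, label in zip(X_nom[:, j], y):
            counts[value][np.searchsorted(classes, label)] += 1
        tables.append({v: c / c.sum() for v, c in counts.items()})
    return tables

def hvdm(a_num, b_num, a_nom, b_nom, num_std, tables):
    """Numeric part: normalized squared differences; nominal part: VDM terms."""
    d2 = np.sum(((a_num - b_num) / (4 * num_std)) ** 2)
    for j, table in enumerate(tables):
        p_a, p_b = table[a_nom[j]], table[b_nom[j]]
        d2 += np.sum((p_a - p_b) ** 2)
    return np.sqrt(d2)
```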

  11. Method (1) • Random over-sampling • Random replication of minority class examples

  12. Method (2) • Random under-sampling • Random elimination of majority class examples
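Methods (1) and (2) are simple enough to state in a few lines of code. The following sketch is not from the paper; the helper random_resample is hypothetical and assumes a binary problem where the minority class is the smaller one.

```python
# Random over-sampling replicates minority examples; random under-sampling
# discards majority examples, until both classes have the same size.
import numpy as np

def random_resample(X, y, minority, oversample=True, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    if oversample:   # Method (1): replicate minority examples (with replacement)
        extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
        keep = np.concatenate([maj_idx, min_idx, extra])
    else:            # Method (2): randomly eliminate majority examples
        keep = np.concatenate([rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx])
    return X[keep], y[keep]
```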

  13. Method (3) • Tomek links • Given two examples Ei and Ej belonging to different classes, let d(Ei, Ej) be the distance between Ei and Ej • The pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej) • If two examples form a Tomek link, then either one of them is noise or both are borderline • As an under-sampling method, only majority class examples are eliminated • As a data cleaning method, examples of both classes are eliminated
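The definition above says that the two examples are mutual nearest neighbors with different class labels. A minimal sketch of the detection step, assuming Euclidean distance over numeric attributes only (not the HVDM used in the paper):

```python
# Tomek link detection: mutual nearest neighbors from opposite classes.
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek links."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)          # nearest neighbor of each example
    links = []
    for i, j in enumerate(nn):
        # mutual nearest neighbors with different class labels
        if nn[j] == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links

# Under-sampling: drop only the majority-class member of each link.
# Data cleaning: drop both members.
```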

  14. Method (4) • Condensed Nearest Neighbor Rule (CNN) • Finds a consistent subset of examples, eliminating the examples from the majority class that are distant from the decision border • A subset Ê ⊆ E is consistent with E if, using a 1-NN, Ê correctly classifies the examples in E • Algorithm to create a subset Ê from E as an under-sampling method: 1. Randomly draw one majority class example and all examples from the minority class, and put these examples in Ê 2. Use a 1-NN over the examples in Ê to classify the examples in E 3. Every misclassified example from E is moved to Ê
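A sketch of that three-step procedure, assuming a single pass over E and Euclidean distance (both simplifications of what the paper may do); the resulting Ê is the under-sampled training set:

```python
# CNN consistent-subset construction used as an under-sampling method.
import numpy as np

def cnn_subset(X, y, minority, seed=0):
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y != minority)
    # Step 1: all minority examples plus one random majority example
    subset = list(np.flatnonzero(y == minority)) + [rng.choice(maj_idx)]
    # Steps 2-3: classify E with 1-NN over the subset; move misclassified examples in
    for i in range(len(y)):
        if i in subset:
            continue
        d = np.linalg.norm(X[subset] - X[i], axis=1)
        if y[subset][int(np.argmin(d))] != y[i]:
            subset.append(i)
    return X[subset], y[subset]
```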

  15. Method (5) • One-sided selection (OSS) • An under-sampling method resulting from the application of Tomek links followed by the application of CNN • Removes noisy and borderline majority class examples

  16. Method (6) • CNN + Tomek links • Similar to OSS, but the method to find the consistent subset (CNN) is applied before the Tomek links • As finding Tomek links is computationally demanding, applying CNN first reduces the data set and makes the procedure computationally cheaper

  17. Method (7) • Neighborhood Cleaning Rule (NCL) • Uses Wilson's Edited Nearest Neighbor (ENN) Rule to remove majority class examples • ENN removes examples whose class label differs from the class given by the majority of their three nearest neighbors
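A sketch of the ENN filter underlying NCL, assuming k = 3 and Euclidean distance; it returns a mask of examples to keep, and NCL would apply such a rule to majority class examples only:

```python
# Wilson's ENN rule: drop an example when most of its k nearest neighbors disagree.
import numpy as np

def enn_keep_mask(X, y, k=3):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    keep = np.ones(len(y), dtype=bool)
    for i in range(len(y)):
        neighbors = np.argsort(dist[i])[:k]
        agreeing = (y[neighbors] == y[i]).sum()
        keep[i] = agreeing > k // 2    # kept only if its neighbors mostly agree
    return keep
```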

  18. Method (8) • Synthetic Minority Over-sampling Technique (Smote) • Its main idea is to form new minority class examples by interpolating between minority class examples that lie close together • This causes the decision boundaries for the minority class to spread further into the majority class space
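The interpolation idea can be sketched as follows. Assumptions not stated on the slide: k = 5 minority nearest neighbors, numeric attributes only, and Euclidean distance; each synthetic example lies on the segment between a minority example and one of its minority neighbors.

```python
# SMOTE-style interpolation between nearby minority class examples.
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority examples from X_min."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]     # k nearest minority neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                             # pick a minority example
        j = neighbors[i, rng.integers(min(k, len(X_min) - 1))]   # and one of its neighbors
        gap = rng.random()                                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```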

  19. Method (9) • Smote + Tomek links • Smote over-sampling followed by Tomek link identification on the over-sampled training set as a data cleaning step, removing examples from both classes

  20. Method (10) • Smote + ENN • ENN removes more examples than the Tomek links do, so it is expected to provide a more in-depth data cleaning

  21. Experimental Evaluation • C4.5 symbolic learning algorithm to induce decision trees • 13 UCI data sets
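A rough sketch of this kind of evaluation loop, with two explicit substitutions: scikit-learn's DecisionTreeClassifier (CART) stands in for C4.5, which scikit-learn does not provide, and AUC is estimated with 10-fold cross-validation (the fold count is an assumption, not taken from the slide). The default tree is grown unpruned, which matches the observation on slide 25 that unpruned trees did better.

```python
# Decision-tree AUC evaluation on a (possibly rebalanced) training set.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def evaluate_auc(X, y, folds=10):
    tree = DecisionTreeClassifier()   # unpruned by default
    scores = cross_val_score(tree, X, y, cv=folds, scoring="roc_auc")
    return scores.mean(), scores.std()
```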

  22. Experimental Evaluation

  23. Experimental Evaluation

  24. Experimental Evaluation

  25. Experimental Evaluation • Unpruned decision trees obtained better results

  26. Experimental Evaluation

  27. Experimental Evaluation

  28. Experimental Evaluation

  29. Experimental Evaluation

  30. Experimental Evaluation

  31. Experimental Evaluation

  32. Experimental Evaluation

  33. Experimental Evaluation

  34. Conclusion • Class imbalance can significantly degrade the performance of standard classifiers • 10 balancing methods were compared • Smote + Tomek or Smote + ENN might be applied to data sets with a small number of positive instances • For data sets with a large number of positive examples, the Random over-sampling method, which is less expensive than the other methods, would produce meaningful results • Results are compared using ROC curves (AUC)

  35. Conclusions • Drawback • The analysis could go deeper and the results should be interpreted carefully • Application • What alternative methodologies are there? • Future Work • The methods are easy to implement
