This study evaluates 10 methods for dealing with class imbalance in machine learning training data. Experimental results on 13 UCI datasets are presented.
A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data • Advisor: Dr. Hsu • Presenter: Chien-Shing Chen • Authors: Gustavo E. A. P. A. Batista, Ronaldo C. Prati, Maria Carolina Monard • SIGKDD, 2004
Outline • Motivation • Objective • Introduction • k-NN • 10 Methods • Experimental Results • Conclusions • Personal Opinion
Motivation • Class imbalance causes significant losses of performance in standard classifiers
Objective • A broad experimental evaluation of 10 methods for dealing with the class imbalance problem • Experiments on 13 UCI data sets
Introduction • There is a large imbalance between the majority class and the minority class • The classes present some degree of overlapping • Standard classifiers may incorrectly classify many cases from the minority class, because the nearest neighbors of these cases are examples belonging to the majority class
Introduction • Accuracy is a misleading measure under class imbalance • For instance, it is straightforward to create a classifier having an accuracy of 99% in a domain where the majority class proportion corresponds to 99% of the examples, by simply forecasting every new example as belonging to the majority class
Introduction • Area under the ROC curve (AUC) • Represents the expected performance as a single scalar • It is equivalent to the Wilcoxon test of ranks
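As a quick illustration of the equivalence with the Wilcoxon rank statistic, here is a minimal sketch (not from the slides) that computes AUC as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one; the function name and the tie-handling choice are assumptions.

```python
# Minimal sketch: AUC from the Wilcoxon/Mann-Whitney rank-sum formulation.
# The function name and tie handling (ties count as half a win) are assumptions.

def auc_from_scores(scores_pos, scores_neg):
    """AUC = P(score of a random positive > score of a random negative)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Example: classifier scores for positive (minority) and negative examples
print(auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]))  # -> ~0.917
```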
Methods • Implement k-NN algorithm
Methodology • Implement the k-NN algorithm • Use the Heterogeneous Value Difference Metric (HVDM) distance function • Euclidean distance for quantitative attributes • VDM distance for qualitative attributes
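The slides do not show the distance function itself; the sketch below illustrates an HVDM-style metric under stated assumptions (a normalized-difference term for quantitative attributes, a VDM term for qualitative ones, and a 4·sigma normalization). Names such as `vdm_tables` and `hvdm` are illustrative, not taken from the paper.

```python
import math
from collections import defaultdict

# Simplified HVDM-style distance: normalized difference for numeric attributes,
# VDM (Value Difference Metric) for categorical ones. The 4*sigma normalization
# and the squared-term combination follow common HVDM descriptions and are
# assumptions here.

def vdm_tables(X, y, cat_idx):
    """P(class | value) counts for each categorical attribute."""
    tables = {a: defaultdict(lambda: defaultdict(int)) for a in cat_idx}
    counts = {a: defaultdict(int) for a in cat_idx}
    for xi, yi in zip(X, y):
        for a in cat_idx:
            tables[a][xi[a]][yi] += 1
            counts[a][xi[a]] += 1
    return tables, counts

def hvdm(x1, x2, num_idx, cat_idx, stds, tables, counts, classes):
    d = 0.0
    for a in num_idx:                       # quantitative attributes
        d += ((x1[a] - x2[a]) / (4 * stds[a])) ** 2
    for a in cat_idx:                       # qualitative attributes (VDM)
        for c in classes:
            p1 = tables[a][x1[a]][c] / max(counts[a][x1[a]], 1)
            p2 = tables[a][x2[a]][c] / max(counts[a][x2[a]], 1)
            d += (p1 - p2) ** 2
    return math.sqrt(d)
```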
Method (1) • Random over-sampling • Random replication of minority class examples
Method (2) • Random under-sampling • Random elimination of majority class examples
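A minimal sketch of both random sampling methods (Methods 1 and 2), assuming the training set is a list of (features, label) pairs; the function names are illustrative.

```python
import random

# Random over-sampling: replicate minority class examples until balance.
# Random under-sampling: discard majority class examples until balance.
# The (features, label) data representation is an assumption.

def random_over_sample(data, minority_label, rng=random):
    minority = [e for e in data if e[1] == minority_label]
    majority = [e for e in data if e[1] != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_under_sample(data, minority_label, rng=random):
    minority = [e for e in data if e[1] == minority_label]
    majority = [e for e in data if e[1] != minority_label]
    return minority + rng.sample(majority, len(minority))
```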
Method (3) • Tomek links • Given two examples Ei and Ej belonging to different classes, let d(Ei,Ej) be the distance between them • The pair (Ei,Ej) is called a Tomek link if there is no example El such that d(Ei,El) < d(Ei,Ej) or d(Ej,El) < d(Ei,Ej) • If two examples form a Tomek link, then either one of them is noise or both are borderline • As an under-sampling method, only the majority class example of each link is eliminated • As a data cleaning method, examples of both classes are eliminated
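A sketch of the Tomek link detection step described above, assuming a distance function `dist(a, b)` (for example the HVDM sketch earlier) and data given as (features, label) pairs; the brute-force search is for clarity, not efficiency.

```python
# Find all Tomek link pairs (indices) in `data`, a list of (features, label)
# pairs. `dist` is an assumed, externally provided distance function.

def tomek_links(data, dist):
    links = []
    for i, (xi, yi) in enumerate(data):
        for j, (xj, yj) in enumerate(data):
            if j <= i or yi == yj:          # different classes, each pair once
                continue
            d_ij = dist(xi, xj)
            # (i, j) is a Tomek link if no other example is closer to either one
            if not any(dist(xi, xl) < d_ij or dist(xj, xl) < d_ij
                       for l, (xl, _) in enumerate(data) if l not in (i, j)):
                links.append((i, j))
    return links
```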
Method (4) • Condensed Nearest Neighbor Rule (CNN) • Finds a consistent subset of examples, eliminating the majority class examples that are distant from the decision border • A subset Ê ⊆ E is consistent with E if, using a 1-NN, Ê correctly classifies the examples in E • Algorithm to create the subset Ê from E as an under-sampling method: 1. Randomly draw one majority class example and all examples from the minority class, and put them in Ê 2. Use a 1-NN over the examples in Ê to classify the examples in E 3. Every misclassified example from E is moved to Ê
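A sketch of the consistent-subset construction in steps 1–3 above, under the same (features, label) data assumption; `nn_1` is an assumed 1-NN helper, and the single pass mirrors the description on this slide.

```python
import random

# Build the consistent subset: all minority examples plus one random majority
# example, then move every example that 1-NN misclassifies into the subset.

def nn_1(x, subset, dist):
    """Label of the nearest neighbor of x within `subset`."""
    return min(subset, key=lambda e: dist(x, e[0]))[1]

def condensed_subset(data, minority_label, dist, rng=random):
    majority = [e for e in data if e[1] != minority_label]
    subset = [e for e in data if e[1] == minority_label]
    subset.append(rng.choice(majority))                 # step 1
    for e in data:                                      # steps 2-3: one pass over E
        if e not in subset and nn_1(e[0], subset, dist) != e[1]:
            subset.append(e)                            # misclassified -> moved to subset
    return subset
```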
Method (5) • One-sided selection (OSS) • An under-sampling method resulting from the application of Tomek links followed by the application of CNN • Removes noisy and borderline majority class examples
Method (6) • CNN + Tomek links • Similar to OSS, but the method to find the consistent subset is applied before the Tomek links • As finding Tomek links is computationally demanding, applying CNN first makes the procedure computationally cheaper
Method (7) • Neighborhood Cleaning Rule (NCL) • Uses Wilson's Edited Nearest Neighbor (ENN) Rule to remove majority class examples • ENN removes any example whose class label differs from the class of the majority of its three nearest neighbors
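A minimal sketch of the ENN filter that NCL builds on, assuming k = 3 neighbors and an externally provided distance function; restricting the removals to majority class examples, as NCL does, would be a small variation of this.

```python
from collections import Counter

# Wilson's Edited Nearest Neighbor rule (k = 3): drop examples whose label
# disagrees with the majority label of their k nearest neighbors.
# `dist` is an assumed distance function (e.g. the HVDM sketch above).

def enn_filter(data, dist, k=3):
    kept = []
    for i, (xi, yi) in enumerate(data):
        neighbors = sorted(
            (e for j, e in enumerate(data) if j != i),
            key=lambda e: dist(xi, e[0]))[:k]
        majority_label = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
        if majority_label == yi:
            kept.append((xi, yi))
    return kept
```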
Method (8) • Synthetic Minority Over-sampling Technique (Smote) • Its main idea is to form new minority class examples by interpolating between minority class examples that lie close together • This causes the decision boundaries for the minority class to spread further into the majority class space
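A minimal sketch of Smote's interpolation step for numeric features; the value k = 5, the amount of over-sampling, and the function name are assumptions.

```python
import random

# Smote interpolation sketch: for each minority example, pick one of its k
# nearest minority neighbors and create a synthetic point on the line segment
# between them. Numeric features only; squared Euclidean distance is assumed.

def smote(minority_X, k=5, n_synthetic_per_example=1, rng=random):
    synthetic = []
    for i, x in enumerate(minority_X):
        neighbors = sorted(
            (z for j, z in enumerate(minority_X) if j != i),
            key=lambda z: sum((a - b) ** 2 for a, b in zip(x, z)))[:k]
        for _ in range(n_synthetic_per_example):
            nb = rng.choice(neighbors)
            gap = rng.random()               # interpolation factor in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Example: two-feature minority examples
print(smote([(1.0, 2.0), (2.0, 3.0), (1.5, 2.5)], k=2))
```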
Method (9) • Smote + Tomek links • After over-sampling with Smote, Tomek links are identified and removed as a data cleaning step, eliminating examples from both classes
Method (10) • Smote + ENN • ENN removes more examples than the Tomek links do, so it is expected to provide a more in-depth data cleaning
Experimental Evaluation • C4.5 symbolic learning algorithm to induce decision trees • 13 UCI data sets
Experimental Evaluation • Unpruned decision trees obtained better results
Conclusion • Class imbalance can significantly degrade classifier performance • 10 balancing methods were compared • Smote + Tomek or Smote + ENN might be applied to data sets with a small number of positive instances • With a large number of positive examples, the Random over-sampling method, which is less expensive than the other methods, would also produce meaningful results • Results were evaluated using the area under the ROC curve (AUC)
Personal Opinion • Drawback: go deeper! Be careful! • Application: what alternative methodologies are there? • Future Work: easy to implement