This study evaluates 10 methods for dealing with class imbalance in machine learning training data. Experimental results on 13 UCI datasets are presented.
A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data • Advisor: Dr. Hsu • Presenter: Chien-Shing Chen • Authors: Gustavo E. A. P. A. Batista, Ronaldo C. Prati, Maria Carolina Monard • SIGKDD, 2004
Outline • Motivation • Objective • Introduction • k-NN • 10 Methods • Experimental Results • Conclusions • Personal Opinion
Motivation • Class imbalance causes significant losses of performance in standard classifiers
Objective • A broad experimental evaluation of 10 methods for dealing with the class imbalance problem • Experiments on 13 UCI data sets
Introduction • There is a large imbalance between the majority class and the minority class • The classes present some degree of overlapping • Standard classifiers may incorrectly classify many cases from the minority class, because the nearest neighbors of these cases are examples belonging to the majority class
Introduction • Accuracy is a misleading measure under class imbalance • For instance, it is straightforward to create a classifier having an accuracy of 99% in a domain where the majority class proportion corresponds to 99% of the examples, by simply forecasting every new example as belonging to the majority class
Introduction • Area under the ROC curve (AUC) • Represents the expected performance as a single scalar • It is equivalent to the Wilcoxon test of ranks
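As a quick illustration of the equivalence with the Wilcoxon rank statistic, here is a minimal sketch (not from the slides) that computes AUC as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one; the function name and the tie-handling choice are assumptions.

```python
# Minimal sketch: AUC from the Wilcoxon/Mann-Whitney rank-sum formulation.
# The function name and tie handling (ties count as half a win) are assumptions.

def auc_from_scores(scores_pos, scores_neg):
    """AUC = P(score of a random positive > score of a random negative)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Example: classifier scores for positive (minority) and negative examples
print(auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]))  # -> ~0.917
```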
Methods • Implement k-NN algorithm
Methodology • Implement the k-NN algorithm • Use the Heterogeneous Value Difference Metric (HVDM) distance function • Euclidean distance for quantitative attributes • VDM distance for qualitative attributes
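The slides do not show the distance function itself; the sketch below illustrates an HVDM-style metric under stated assumptions (a normalized-difference term for quantitative attributes, a VDM term for qualitative ones, and a 4·sigma normalization). Names such as `vdm_tables` and `hvdm` are illustrative, not taken from the paper.

```python
import math
from collections import defaultdict

# Simplified HVDM-style distance: normalized difference for numeric attributes,
# VDM (Value Difference Metric) for categorical ones. The 4*sigma normalization
# and the squared-term combination follow common HVDM descriptions and are
# assumptions here.

def vdm_tables(X, y, cat_idx):
    """P(class | value) counts for each categorical attribute."""
    tables = {a: defaultdict(lambda: defaultdict(int)) for a in cat_idx}
    counts = {a: defaultdict(int) for a in cat_idx}
    for xi, yi in zip(X, y):
        for a in cat_idx:
            tables[a][xi[a]][yi] += 1
            counts[a][xi[a]] += 1
    return tables, counts

def hvdm(x1, x2, num_idx, cat_idx, stds, tables, counts, classes):
    d = 0.0
    for a in num_idx:                       # quantitative attributes
        d += ((x1[a] - x2[a]) / (4 * stds[a])) ** 2
    for a in cat_idx:                       # qualitative attributes (VDM)
        for c in classes:
            p1 = tables[a][x1[a]][c] / max(counts[a][x1[a]], 1)
            p2 = tables[a][x2[a]][c] / max(counts[a][x2[a]], 1)
            d += (p1 - p2) ** 2
    return math.sqrt(d)
```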
Method (1) • Random over-sampling • Random replication of minority class examples
Method (2) • Random under-sampling • Random elimination of majority class examples
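A minimal sketch of both random sampling methods (Methods 1 and 2), assuming the training set is a list of (features, label) pairs; the function names are illustrative.

```python
import random

# Random over-sampling: replicate minority class examples until balance.
# Random under-sampling: discard majority class examples until balance.
# The (features, label) data representation is an assumption.

def random_over_sample(data, minority_label, rng=random):
    minority = [e for e in data if e[1] == minority_label]
    majority = [e for e in data if e[1] != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_under_sample(data, minority_label, rng=random):
    minority = [e for e in data if e[1] == minority_label]
    majority = [e for e in data if e[1] != minority_label]
    return minority + rng.sample(majority, len(minority))
```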
Method (3) • Tomek links • Given two examples Ei and Ej belonging to different classes, let d(Ei,Ej) be the distance between them • The pair (Ei,Ej) is called a Tomek link if there is no example El such that d(Ei,El) < d(Ei,Ej) or d(Ej,El) < d(Ei,Ej) • If two examples form a Tomek link, then either one of them is noise or both are borderline • As an under-sampling method, only the majority class example of each link is eliminated • As a data cleaning method, examples of both classes are eliminated
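A sketch of the Tomek link detection step described above, assuming a distance function `dist(a, b)` (for example the HVDM sketch earlier) and data given as (features, label) pairs; the brute-force search is for clarity, not efficiency.

```python
# Find all Tomek link pairs (indices) in `data`, a list of (features, label)
# pairs. `dist` is an assumed, externally provided distance function.

def tomek_links(data, dist):
    links = []
    for i, (xi, yi) in enumerate(data):
        for j, (xj, yj) in enumerate(data):
            if j <= i or yi == yj:          # different classes, each pair once
                continue
            d_ij = dist(xi, xj)
            # (i, j) is a Tomek link if no other example is closer to either one
            if not any(dist(xi, xl) < d_ij or dist(xj, xl) < d_ij
                       for l, (xl, _) in enumerate(data) if l not in (i, j)):
                links.append((i, j))
    return links
```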
Method (4) • Condensed Nearest Neighbor Rule (CNN) • Finds a consistent subset of examples, eliminating the majority class examples that are distant from the decision border • A subset Ê ⊆ E is consistent with E if, using a 1-NN, Ê correctly classifies the examples in E • Algorithm to create the subset Ê from E as an under-sampling method: 1. Randomly draw one majority class example and all examples from the minority class, and put them in Ê 2. Use a 1-NN over the examples in Ê to classify the examples in E 3. Every misclassified example from E is moved to Ê
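A sketch of the consistent-subset construction in steps 1–3 above, under the same (features, label) data assumption; `nn_1` is an assumed 1-NN helper, and the single pass mirrors the description on this slide.

```python
import random

# Build the consistent subset: all minority examples plus one random majority
# example, then move every example that 1-NN misclassifies into the subset.

def nn_1(x, subset, dist):
    """Label of the nearest neighbor of x within `subset`."""
    return min(subset, key=lambda e: dist(x, e[0]))[1]

def condensed_subset(data, minority_label, dist, rng=random):
    majority = [e for e in data if e[1] != minority_label]
    subset = [e for e in data if e[1] == minority_label]
    subset.append(rng.choice(majority))                 # step 1
    for e in data:                                      # steps 2-3: one pass over E
        if e not in subset and nn_1(e[0], subset, dist) != e[1]:
            subset.append(e)                            # misclassified -> moved to subset
    return subset
```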
Method (5) • One-sided selection (OSS) • An under-sampling method resulting from the application of Tomek links followed by the application of CNN • Removes noisy and borderline majority class examples
Method (6) • CNN + Tomek links • Similar to OSS, but the method to find the consistent subset is applied before the Tomek links • As finding Tomek links is computationally demanding, applying CNN first makes the procedure computationally cheaper
Method (7) • Neighborhood Cleaning Rule (NCL) • Uses Wilson's Edited Nearest Neighbor (ENN) Rule to remove majority class examples • ENN removes any example whose class label differs from the class of the majority of its three nearest neighbors
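A minimal sketch of the ENN filter that NCL builds on, assuming k = 3 neighbors and an externally provided distance function; restricting the removals to majority class examples, as NCL does, would be a small variation of this.

```python
from collections import Counter

# Wilson's Edited Nearest Neighbor rule (k = 3): drop examples whose label
# disagrees with the majority label of their k nearest neighbors.
# `dist` is an assumed distance function (e.g. the HVDM sketch above).

def enn_filter(data, dist, k=3):
    kept = []
    for i, (xi, yi) in enumerate(data):
        neighbors = sorted(
            (e for j, e in enumerate(data) if j != i),
            key=lambda e: dist(xi, e[0]))[:k]
        majority_label = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
        if majority_label == yi:
            kept.append((xi, yi))
    return kept
```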
Method (8) • Synthetic Minority Over-sampling Technique (Smote) • Its main idea is to form new minority class examples by interpolating between minority class examples that lie close together • This causes the decision boundaries for the minority class to spread further into the majority class space
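A minimal sketch of Smote's interpolation step for numeric features; the value k = 5, the amount of over-sampling, and the function name are assumptions.

```python
import random

# Smote interpolation sketch: for each minority example, pick one of its k
# nearest minority neighbors and create a synthetic point on the line segment
# between them. Numeric features only; squared Euclidean distance is assumed.

def smote(minority_X, k=5, n_synthetic_per_example=1, rng=random):
    synthetic = []
    for i, x in enumerate(minority_X):
        neighbors = sorted(
            (z for j, z in enumerate(minority_X) if j != i),
            key=lambda z: sum((a - b) ** 2 for a, b in zip(x, z)))[:k]
        for _ in range(n_synthetic_per_example):
            nb = rng.choice(neighbors)
            gap = rng.random()               # interpolation factor in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Example: two-feature minority examples
print(smote([(1.0, 2.0), (2.0, 3.0), (1.5, 2.5)], k=2))
```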
Method (9) • Smote + Tomek links • After over-sampling with Smote, Tomek links are identified and removed as a data cleaning step, eliminating examples from both classes
Method (10) • Smote + ENN • ENN removes more examples than the Tomek links do, so it is expected to provide a more in-depth data cleaning
Experimental Evaluation • C4.5 symbolic learning algorithm to induce decision trees • 13 UCI data sets
Experimental Evaluation • Unpruned decision trees obtained better results
Conclusion • Class imbalance can significantly degrade classifier performance • 10 balancing methods were compared • Smote + Tomek or Smote + ENN might be applied to data sets with a small number of positive instances • With a large number of positive examples, the Random over-sampling method, which is less expensive than the other methods, would also produce meaningful results • Results were evaluated using the area under the ROC curve (AUC)
Personal Opinion • Drawback: go deeper! Be careful! • Application: what alternative methodologies are there? • Future Work: easy to implement