Imbalanced Data Set Learning with Synthetic Examples Benjamin X. Wang and Nathalie Japkowicz
The Class Imbalance Problem I • A data set is said to be balanced if it contains approximately as many positive examples of the concept as negative ones. • Many domains do not yield balanced data sets. • Examples: • Helicopter Gearbox Fault Monitoring • Discrimination between Earthquakes and Nuclear Explosions • Document Filtering • Detection of Oil Spills • Detection of Fraudulent Telephone Calls
The Class Imbalance Problem II • The problem with class imbalances is that standard learners are often biased towards the majority class. • That is because these classifiers attempt to reduce global quantities such as the error rate, without taking the data distribution into consideration. • As a result, examples from the majority class are well classified, whereas examples from the minority class tend to be misclassified.
Some Generalities • The standard accuracy/error rate measures are not an appropriate way to evaluate a learning system on a class imbalance problem. ROC analysis is typically used instead (see the sketch below). • There is a parallel between research on class imbalances and cost-sensitive learning. • There are four main ways to deal with class imbalances: re-sampling, re-weighting, adjusting the probabilistic estimate, and one-class learning.
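To make the accuracy-versus-ROC point concrete, here is a minimal sketch, assuming scikit-learn (the slides name no toolkit); the toy data set and the logistic-regression classifier are illustrative stand-ins, not part of the original study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Build a toy problem with a 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy can look high even when the minority class is largely missed;
# the ROC AUC of the predicted scores exposes ranking quality instead.
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("ROC AUC :", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```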
Advantage of Resampling • Re-sampling provides a simple way of biasing the generalization process. • It can do so by: • Generating synthetic samples that embody the desired bias • Controlling the amount and placement of the new samples • Note: this type of control can also be achieved by smoothing the classifier’s probabilistic estimate (e.g., Zadrozny & Elkan, 2001), but that control cannot be as localized as the one achieved with re-sampling techniques.
SMOTE: A State-of-the-Art Resampling Approach • SMOTE stands for Synthetic Minority Oversampling Technique. • It is a technique designed by Chawla, Bowyer, Hall, & Kegelmeyer in 2002. • It combines informed oversampling of the minority class with random undersampling of the majority class. • Among re-sampling and probabilistic-estimate-modification techniques, SMOTE currently yields the best results (Chawla, 2003).
SMOTE’s Informed Oversampling Procedure I • For each minority sample: • Find its k nearest minority neighbours • Randomly select j of these neighbours (j depends on the amount of oversampling desired) • Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbours (a sketch follows below)
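A minimal sketch of this generation step, assuming NumPy and scikit-learn’s NearestNeighbors; the function name, the default k, and the per-sample generation count are illustrative choices, not details fixed by the slides.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, k=5, n_new_per_sample=2, seed=None):
    """For each minority sample x, interpolate toward randomly chosen
    minority neighbours: x_new = x + u * (neighbour - x), u ~ U(0, 1)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, x in enumerate(X_min):
        picks = rng.choice(idx[i, 1:], size=n_new_per_sample)  # the "j" selected neighbours
        for p in picks:
            u = rng.random()
            synthetic.append(x + u * (X_min[p] - x))
    return np.vstack(synthetic)

X_min = np.random.default_rng(0).normal(size=(30, 4))  # toy minority class
print(smote_oversample(X_min).shape)                   # (60, 4)
```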
SMOTE’s Informed vs. Random Oversampling • Random oversampling (with replacement) of the minority class makes the decision region for the minority class very specific. • In a decision tree, the replicated samples would cause new splits and often lead to overfitting. • SMOTE’s informed oversampling, in contrast, generalizes the decision region for the minority class. • As a result, larger and less specific regions are learned, paying attention to minority class samples without causing overfitting.
SMOTE’s Informed Oversampling Procedure II [Figure: minority samples with synthetic samples interpolated between them, and a majority sample sitting nearby] … But what if there is a majority sample nearby?
SMOTE’s Shortcomings • Overgeneralization • SMOTE’s procedure is inherently dangerous since it blindly generalizes the minority area without regard to the majority class. • This strategy is particularly problematic in the case of highly skewed class distributions since, in such cases, the minority class is very sparse with respect to the majority class, thus resulting in a greater chance of class mixture. • Lack of Flexibility • The number of synthetic samples generated by SMOTE is fixed in advance, thus not allowing for any flexibility in the re-balancing rate.
SMOTE’s Tendency for Overgeneralization [Figure: synthetic minority samples generated across a region occupied by majority samples — overgeneralization!]
Our Proposed Solution • In order to avoid overgeneralization, we propose to use three techniques: • Testing for data sparsity • Clustering the minority class • 2-class (rather than 1-class) sample generation • In order to avoid SMOTE’s lack of flexibility, we propose one technique: • Multiple Trials/Feedback • We call our approach: Adaptive Synthetic Minority Oversampling Method (ASMO)
ASMO’s Strategy I • Overgeneralization Avoidance I: Testing for data sparsity: • For each minority sample m, if m’s g nearest neighbours are all majority samples, then the data set is sparse and ASMO should be used; otherwise, SMOTE can be used. (As a default, we used g = 20; see the sketch below.) • Overgeneralization Avoidance II: Clustering: • We use k-means or another such clustering system on the minority class (for now, this step is done, but in a non-standard way)
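A hedged sketch of the sparsity test, assuming NumPy and scikit-learn; reading the rule as "some minority sample whose g nearest neighbours are all majority" is my interpretation of the slide, and the helper name is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dataset_is_sparse(X, y, minority_label=1, g=20):
    """True if some minority sample m has g nearest neighbours (in the
    whole data set) that are all majority samples -- use ASMO then."""
    nn = NearestNeighbors(n_neighbors=g + 1).fit(X)   # +1: m itself comes back first
    X_min = X[y == minority_label]
    _, idx = nn.kneighbors(X_min)
    for row in idx:
        if np.all(y[row[1:]] != minority_label):      # skip m itself
            return True                               # minority class is sparse here
    return False
```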
ASMO’s Strategy II • Overgeneralization Avoidance III: Synthetic sample generation using two classes: • Rather than using the k nearest neighbours within the minority class to generate new samples, we use the k nearest neighbours from the opposite (majority) class (a sketch follows below).
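A minimal sketch of the two-class generation step, again assuming NumPy and scikit-learn. Capping the interpolation fraction at 0.5 so that new points stay on the minority side of the segment is my assumption; the slides do not specify where on the line the samples are placed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def asmo_generate(X_min, X_maj, k=5, n_new_per_sample=2, seed=None):
    """Generate synthetic minority samples along lines joining each
    minority sample to its k nearest *majority* neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k).fit(X_maj)   # neighbours come from the majority class
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, x in enumerate(X_min):
        for p in rng.choice(idx[i], size=n_new_per_sample):
            u = rng.random() * 0.5                    # assumption: stay closer to the minority sample
            synthetic.append(x + u * (X_maj[p] - x))
    return np.vstack(synthetic)
```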
ASMO’s Overgeneralization Avoidance: Overview [Figure: the minority class is clustered, and synthetic samples are generated between minority samples and their nearest majority-class neighbours]
ASMO’s Strategy III • Flexibility Enhancement through Multiple Trials and Feedback: • For each cluster Ci, iterate through different rates of majority undersampling and synthetic minority generation; keep the best combination as subset Si (see the sketch below). • Merge the Si’s into a single training set S. • Train the final classifier on S.
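A hedged sketch of the multiple-trials/feedback loop, assuming scikit-learn. The candidate rate grids, the AUC-based selection criterion, and the `rebalance` helper (which would undersample the majority class and generate synthetic minority samples at the given rates) are all illustrative stand-ins for details the slides leave open.

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def best_subset_for_cluster(X_c, y_c, under_rates, over_rates, rebalance):
    """Try every (undersampling, oversampling) rate pair on one cluster's
    data and keep the rebalanced subset S_i that scores best."""
    best_auc, best_subset = -1.0, None
    for u in under_rates:                      # majority undersampling rate
        for o in over_rates:                   # synthetic minority generation rate
            X_s, y_s = rebalance(X_c, y_c, under=u, over=o)
            scores = cross_val_predict(SVC(probability=True), X_s, y_s,
                                       method="predict_proba")[:, 1]
            auc = roc_auc_score(y_s, scores)
            if auc > best_auc:
                best_auc, best_subset = auc, (X_s, y_s)
    return best_subset

# The per-cluster subsets S_i would then be concatenated into the single
# training set S on which the final classifier is trained.
```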
Discussion of our Technique I • Assumption we made/Justification: • The problem is decomposable, i.e., optimizing each subset will yield an optimal merged set. • As long as the base classifier we use does some kind of local learning (not just global optimization), this assumption should hold. • Question/Answer: • Why did we use different oversampling and undersampling rates? • It was previously shown that optimal sampling rates are problem-dependent, and thus are best set adaptively (Weiss & Provost, 2003; Estabrooks & Japkowicz, 2001)
Experiment Setup I • We tested our system on three different data sets: • Lupus (thanks to James Malley of NIH) • Minority class: 2.8% • Dataset Size: 3839 • Abalone-5 (UCI) • Minority class: 2.75% • Dataset Size: 4177 • Connect-4 (UCI) • Minority class: 9.5% • Dataset Size: 11,258
Experiment Setup II • ASMO was compared to two other techniques: • SMOTE • O-D [the combination of random over- and down (under)-sampling; O-D was shown to outperform both random oversampling and random undersampling in preliminary experiments]. • The base classifier in all experiments is an SVM; k-NN was used in the synthetic generation process to identify the samples’ nearest neighbours (within the minority class, or between the minority and majority classes). • The results are reported in the form of ROC curves on 10-fold cross-validation experiments (a sketch of the protocol follows below).
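A minimal sketch of this evaluation protocol, assuming scikit-learn: an SVM base classifier scored by ROC AUC under 10-fold cross-validation. The synthetic data set stands in for one of the (rebalanced) training sets; it is not one of the three domains above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for a rebalanced training set such as ASMO's merged set S.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

# SVC's decision_function is enough for the roc_auc scorer; cv=10 matches the slides.
aucs = cross_val_score(SVC(), X, y, cv=10, scoring="roc_auc")
print(f"mean AUC = {aucs.mean():.3f} +/- {aucs.std():.3f}")
```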
Discussion of the Results • On every domain, ASMO slightly outperforms both O-D and SMOTE. In the ROC areas where ASMO does not outperform the other two systems, its performance equals theirs. • ASMO’s effect seems to be one of smoothing SMOTE’s ROC curve. • ASMO’s performance is comparatively better in the two domains where the class imbalance is greater (Lupus, Abalone-5). We expect its relative advantage to increase as the imbalance grows even more.
Summary • We presented a few modifications to the state-of-the-art re-sampling system, SMOTE. • These modifications had two goals: • To correct SMOTE’s tendency to overgeneralize • To make SMOTE more flexible • We observed slightly improved performance on three domains. However, that improvement came at the expense of greater time consumption.
Future Work [This was a very preliminary study!] • To clean up the system (e.g., to use a standard clustering method) • To test the system more rigorously (to test for significance; to use TANGO, a test used in the medical domain) • To test our system on highly imbalanced data sets, to see whether our design indeed helps address this particular issue • To modify the data generation process so as to test biases other than the one proposed by SMOTE