110 likes | 266 Views
Investigating the Effect of Sampling Methods for Imbalanced Data Distributions. Advisor : Dr. Hsu Presenter : Ai-Chen Liao Authors : Show-Jane Yen, Yue-Shi Lee, Cheng-Han Lin and Jia-Ching Ying. 2006 . ICSMC . Page(s) : 4163 - 4168. Outline. Motivation
E N D
Investigating the Effect of Sampling Methods for Imbalanced Data Distributions Advisor : Dr. Hsu Presenter : Ai-Chen Liao Authors : Show-Jane Yen, Yue-Shi Lee, Cheng-Han Lin and Jia-Ching Ying 2006 . ICSMC . Page(s) : 4163 - 4168
Outline • Motivation • Objective • Method • Strategies for handling imbalanced data • Cluster-based under-sampling approach • Experimental Result • Conclusion • Comments
Motivation • Classification is an important and well-known technique in the field of machine learning, and the training data will significantly influence the classification accuracy. • The classification techniques usually assume that the training samples are uniformly-distributed between different classes. • The training data in real-world applications often are imbalanced class distribution. ex. Fraud detection, risk management, medical research…,etc.
Objective • We propose a cluster-based sampling approach for selecting the representative data as training data to improve the classification accuracy. • We investigate the effect of under-sampling methods in the imbalanced class distribution problem.
Dataset : 共 1100 筆資料 Cluster 1 MA=500 MI=10 MA : 共 1000 筆資料 Cluster 2 MA=300 MI=50 Cluster 3 MA=200 MI=40 MI : 共 100 筆資料 Method ─ Cluster-based under-sampling approach • The main idea is that there are different clusters in a dataset, and each cluster seems to have distinct characteristics.
Cluster 1 MA=500 MI=10 Cluster 2 MA=300 MI=50 Cluster 3 MA=200 MI=40 Method ─ Cluster-based under-sampling approach Assume that the ratio of SizeMA TO SizeMI in the training data is set to be 1:1
Conclusion • We propose cluster-based under-sampling approach to solve the imbalanced class distribution problem by using backpropagation neural network. • SBC not only has high classification accuracy on predicting the minority class samples but also has fast execution time.
Comments • Advantage • A novel approach • Drawback • Setting necessary parameters • Application • Handling imbalanced data
Method ─ Strategies for handling imbalanced data • 修正學習演算法來處理imbalanced data • cost-sensitive learning • 將資料進行事前的處理 • Multi-classifier committee • Resamplingupsizing the minority class (oversampling) • downsizing the majority class (undersampling) MA=48 samples MI = 2 samples MA’s size:MI’s size=1:1 48/2=24 Voting ex.SMOTE