GMDH-based feature ranking and selection for improved classification of medical data
Advisor: Dr. Hsu
Presenter: Yu-San Hsieh
Author: R.E. Abdel-Aal
2005. BI.456-468
Outline
• Motivation
• Objective
• Method
• Material
• Results
• Conclusions
Motivation
• Accuracy is very important for classifiers used in medical applications.
Objective
• Improve classification performance on medical data.
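Method
• First stage – ranked feature
• GMDH algorithm
1. Representation
[Figure: GMDH network. The inputs x1, x2, x3, x4 are combined in pairs into m(m-1)/2 candidate outputs z1 … z_{m(m-1)/2}, with squared errors r_1^2, r_2^2, …, r_{m(m-1)/2}^2, leading to the output y; r_min is tracked at each iteration.]
2. Selection and stopping
[Figure: squared error (r_min) plotted against iteration.]
An increasing r_min means the model is becoming too complex:
1. It is overfitting the estimation data.
2. It is performing poorly on the new selection data.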
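The pair count and the stopping rule on this slide can be sketched in a few lines. This is only an illustration: the function names `candidate_pairs` and `stopping_iteration` are hypothetical, the r_min values are made up, and the fitting of each candidate model is left out entirely.

```python
from itertools import combinations


def candidate_pairs(features):
    """Return all unordered pairs of inputs; m inputs give m*(m-1)/2 pairs."""
    return list(combinations(features, 2))


def stopping_iteration(r_min_history):
    """Return the index of the last iteration before r_min starts to increase.

    r_min_history[i] is the smallest squared error on the selection data
    at iteration i (hypothetical values, supplied by the caller).
    """
    for i in range(1, len(r_min_history)):
        if r_min_history[i] > r_min_history[i - 1]:
            return i - 1  # r_min went up: the previous iteration was the best
    return len(r_min_history) - 1  # r_min never increased


pairs = candidate_pairs(["x1", "x2", "x3", "x4"])
print(len(pairs))  # 4 * 3 / 2 = 6 candidate pairs

r_min = [0.40, 0.25, 0.18, 0.21, 0.30]  # made-up values for illustration
print(stopping_iteration(r_min))  # 2: r_min rises from iteration 3 onward
```

Stopping at the first increase of r_min is the simplest reading of the slide; a real implementation may use a more tolerant criterion.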
Method
• First stage – ranked feature
• AIM abductive network
1. Representation
2. Selection and stopping
Avoid overfitting by using CPM (complexity penalty multiplier) control:
1. CPM > 1: simpler models that are less accurate but generalize better.
2. CPM < 1: more complex models that may overfit the training data and degrade actual prediction performance.
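To see why CPM steers model complexity, it may help to look at a penalized error criterion. The sketch below assumes a predicted squared error of the form PSE = FSE + CPM * (2K/N) * sigma_p^2, which is how the AIM criterion is commonly described; the slide itself does not state this formula, and every number below is made up for illustration.

```python
def predicted_squared_error(fse, k, n, sigma_p2, cpm=1.0):
    """Penalized error: fitting error plus a complexity penalty.

    fse      : fitting squared error on the training data
    k        : number of model coefficients
    n        : number of training samples
    sigma_p2 : prior estimate of the model error variance
    cpm      : complexity penalty multiplier
    (Assumed form of the criterion; not taken from the slide.)
    """
    return fse + cpm * (2.0 * k / n) * sigma_p2


def preferred_model(cpm, n=100, sigma_p2=0.25):
    """Pick between two hypothetical models for a given CPM value."""
    models = {
        "simple": {"fse": 0.10, "k": 3},    # fewer coefficients, worse fit
        "complex": {"fse": 0.08, "k": 12},  # more coefficients, better fit
    }
    return min(
        models,
        key=lambda name: predicted_squared_error(
            models[name]["fse"], models[name]["k"], n, sigma_p2, cpm
        ),
    )


print(preferred_model(cpm=0.2))  # complex
print(preferred_model(cpm=2.0))  # simple
```

With a small CPM the penalty term is weak and the better-fitting complex model wins; with a large CPM the penalty dominates and the simpler model is preferred, which matches the two cases listed on the slide.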
Method
• Second stage – selected feature
• A compact m-feature subset can be obtained by taking the first m features from the top of the ranking list. Ex: with ranking list {2,6,7,8,1,5,3,4,9}, the selected 6-feature subset is {2,6,7,8,1,5}.
• The optimum subset of features is determined by repeatedly forming subsets of k features, starting from the top of the ranking list. Ex: with ranking list {2,6,7,8,1,5,3,4,9}, the best subset is selected from among {2,6,7,8,1,5}, {6,7,8,1,5,3}, …
• As k is varied, performance on an evaluation dataset first improves and then starts to deteriorate as the model overfits the training data.
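The second stage can be sketched as follows. This is a minimal illustration that assumes the candidate subsets are the growing prefixes of the ranking list (k = 1, 2, …); the slide's own example lists six-feature windows, which is a different reading. The names `top_m`, `best_prefix`, and `toy_error` are hypothetical, and `toy_error` is a made-up stand-in for training a classifier and measuring its error on the evaluation data.

```python
def top_m(ranking, m):
    """Take the first m features from the top of the ranking list."""
    return ranking[:m]


def best_prefix(ranking, evaluate):
    """Try the top-k prefix for every k and keep the one with the lowest error.

    evaluate(subset) must return an error on the evaluation dataset
    (lower is better); it is supplied by the caller.
    """
    best_k, best_error = 1, evaluate(ranking[:1])
    for k in range(2, len(ranking) + 1):
        error = evaluate(ranking[:k])
        if error < best_error:  # strict: ties keep the smaller subset
            best_k, best_error = k, error
    return ranking[:best_k], best_error


def toy_error(subset):
    """Made-up error curve: improves up to 6 features, then deteriorates."""
    return 0.03 + 0.01 * abs(len(subset) - 6)


ranking = [2, 6, 7, 8, 1, 5, 3, 4, 9]
print(top_m(ranking, 6))                   # [2, 6, 7, 8, 1, 5]
print(best_prefix(ranking, toy_error)[0])  # [2, 6, 7, 8, 1, 5]
```

In practice `evaluate` would wrap model training and scoring, so each call can be expensive; scanning only prefixes keeps the number of calls equal to the number of features.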
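Material
• Two standard medical diagnosis datasets from the UCI Machine Learning Repository were used for this study:
• Wisconsin breast cancer dataset
• Cleveland heart disease dataset
• Each dataset is split 70% / 30% (presumably training / evaluation; the slide shows only the percentages).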
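A 70% / 30% split like the one shown here can be sketched as below. The slide does not say how the split was made, so the random shuffle, the fixed seed, and the name `split_70_30` are all assumptions for illustration.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle the records and split them into 70% and 30% parts.

    The shuffle and the seed are assumptions; the slide gives only
    the two percentages.
    """
    shuffled = list(records)               # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


first_part, second_part = split_70_30(range(10))
print(len(first_part), len(second_part))  # 7 3
```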
Results
• The breast cancer data
• Ranking for the feature set: {2,6,7,8,1,5,3,4,9}
[Table: features ranked vs. features selected; the values 7, 5, 9 appear on the slide.]
Results
[Figure: comparison with rough set data analysis of the dataset; annotations: overfitting, 3%.]
Results
[Figure annotations: standard error ↓, AUC ↑, 3%.]
Results
• The heart disease data
• Ranking for the feature set: {13,12,9,3,2,10,8,4,5,11,1,7,6}
[Table: features ranked vs. features selected.]
Results
[Figure annotations: overfitting, 6%, 3%.]
Results
[Figure annotations: AUC ↑.]
• Requires less than half the number of input features.
• Models using the reduced feature set will be more efficient.
Conclusions
• Improved implementation and performance of classifiers for medical screening and diagnosis.
• Feature reduction is particularly useful with high-dimensional data characterized by a large number of features and relatively few training examples.
My opinion
• Advantage: Preprocess
• Disadvantage:
• Apply: Clustering, Association Rule…