Feature Selection and Bioinformatics Applications Isabelle Guyon
Part I INTRODUCTION
Objectives (diagram: input x → predictor f(x) → output y)
• Reduce the number of features as much as possible without significantly degrading prediction performance.
• Possibly improve prediction performance.
• Gain insight.
Applications (chart: application domains plotted by number of training examples, 10 to 10^5, against number of inputs, 10 to 10^5: OCR/HWR, Machine Vision, Market Analysis, Text Categorization, System diagnosis, High Energy Physics, Genomics, Proteomics, Bioinformatics)
This talk:
• Simple is beautiful, but some (moderate) sophistication is needed.
• “Classical statistics” is pessimistic: it advocates the simplest methods to overcome the curse of dimensionality.
• Modern statistical methods from soft computing and machine learning provide the necessary additional sophistication while still defeating the curse of dimensionality.
Part II PROBLEM STATEMENT
Correlation Analysis
Expression values {x_ik} and labels {y_k}, k = 1…num_patients. For each feature (gene), compute a correlation score from the class-conditional means m−, m+ and standard deviations s−, s+, e.g. (m+ − m−)/(s+ + s−); keep the top 25 positively and the top 25 negatively correlated features (genes).
38 training ex. (27 ALL, 11 AML); 34 test ex. (20 ALL, 14 AML). Golub et al., Science, Vol. 286, 15 Oct. 1999.
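A minimal sketch of such a per-feature correlation ranking, assuming the (m+ − m−)/(s+ + s−) score above and binary ±1 labels; the data shapes and the `golub_score` helper name are illustrative, not taken from the original study.

```python
import numpy as np

def golub_score(X, y):
    """Per-feature score (m+ - m-) / (s+ + s-).

    X: (num_patients, num_features) expression matrix {x_ik}
    y: (num_patients,) binary labels {y_k} in {-1, +1}
    """
    pos, neg = X[y == +1], X[y == -1]
    m_plus, m_minus = pos.mean(axis=0), neg.mean(axis=0)
    s_plus, s_minus = pos.std(axis=0), neg.std(axis=0)
    return (m_plus - m_minus) / (s_plus + s_minus + 1e-12)  # avoid division by zero

# Illustrative data: rank features and keep the 25 most positively / negatively correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 1000))            # e.g. 38 training examples, 1000 genes
y = np.where(rng.random(38) < 0.7, 1, -1)  # e.g. a roughly 27 vs 11 class split
scores = golub_score(X, y)
top_pos = np.argsort(scores)[-25:][::-1]   # top 25 positively correlated genes
top_neg = np.argsort(scores)[:25]          # top 25 negatively correlated genes
```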
Yes, but ... (figure: class-conditional distributions characterized by the means m−, m+ and standard deviations s−, s+)
I.I.D. Features (scatter-plot figure)
I.I.D. Features (scatter-plot figure; class means m−, m+ marked)
Smaller Win (scatter-plot figure)
Bigger Win (scatter-plot figure)
Explanation: F1 is the peak of interest; F2 is the best local estimate of the baseline.
Two “Useless” Features (scatter-plot figure). Axis projections do not help in finding good features.
Higher-dimensional problem: even two-dimensional projections may not help in finding good features.
Part III ALGORITHMS
Main Goal (diagram: input x → predictor f(x) → output)
Main goal: rank subsets of useful features.
Sub-goals:
- Eliminate useless features (distracters).
- Rank useful features.
- Eliminate redundant features.
Filters and Wrappers
• Main goal: rank subsets of useful features.
• Danger of overfitting: greedy search often works better.
(Diagrams. Filter: all features → filter → feature subset → predictor. Wrapper: all features → multiple feature subsets → predictor.)
Nested Subset Methods
Nested subset methods perform a greedy search:
- At each step, add or remove a single feature to best improve (or least degrade) the cost function.
- Backward elimination: start with all features, progressively remove (never add). Example: RFE (Guyon, Weston, et al., 2002).
- Forward selection: start with an empty set, progressively add (never remove). Example: Gram-Schmidt orthogonalization (Stoppiglia et al., 2003; Rivals and Personnaz, 2003).
Backward elimination: RFE
Improve (or least degrade) the cost function J:
• Exact or approximate difference calculation ΔJ = J(feat+1) − J(feat).
• RFE with a linear predictor f(x) = w·x + b: eliminate the feature with the smallest w_i² (Guyon, Weston, et al., 2002).
• Zero norm / multiplicative updates (MU): rescale the inputs by |w_i| at each iteration (Weston, Elisseeff, et al., 2003).
• Non-linear RFE and non-linear MU: estimate (ΔJ)_i ≈ αᵀ H(i) α.
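A minimal sketch of the linear RFE recipe above, eliminating one feature per iteration (the one with the smallest w_i²); the use of scikit-learn's LinearSVC as the linear predictor and the `rfe_linear` helper name are assumptions for illustration, not the original implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe_linear(X, y, n_keep=7):
    """Recursive feature elimination with a linear predictor f(x) = w.x + b.

    X: (n_samples, n_features), y: binary labels.
    Returns the kept feature indices and the elimination order (worst first).
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        clf = LinearSVC(C=1.0, dual=False).fit(X[:, remaining], y)
        w = clf.coef_.ravel()
        worst = int(np.argmin(w ** 2))        # feature with smallest w_i^2
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated
```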
Forward selection: Gram-Schmidt
Feature ranking in the context of others:
• Vanilla (linear) GS: at every iteration, project onto the null space of the features already selected; select the feature most correlated with the target.
• Relief (Kira and Rendell, 1992).
• GS-Relief combination (Guyon, 2003).
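A minimal sketch of the vanilla linear Gram-Schmidt procedure above, assuming centered features and a numeric target; the `gram_schmidt_forward` name and the stopping criterion (a fixed number of selected features) are illustrative assumptions.

```python
import numpy as np

def gram_schmidt_forward(X, y, n_select=7):
    """Forward selection: repeatedly pick the feature most correlated with the
    target, then project the remaining features and the target onto the null
    space of the selected feature."""
    X = X - X.mean(axis=0)          # center features
    r = y.astype(float) - y.mean()  # residual target
    selected = []
    for _ in range(n_select):
        norms = np.linalg.norm(X, axis=0) * np.linalg.norm(r) + 1e-12
        corr = np.abs(X.T @ r) / norms   # correlation with the residual target
        corr[selected] = -np.inf         # never reselect a feature
        best = int(np.argmax(corr))
        selected.append(best)
        # Project onto the null space of the selected feature (Gram-Schmidt step).
        u = X[:, best] / (np.linalg.norm(X[:, best]) + 1e-12)
        X = X - np.outer(u, u @ X)
        r = r - u * (u @ r)
    return selected
```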
Part IV EXPERIMENTS
Mass Spectrometry Experiments
In collaboration with Biospect Inc., 2003. Data from Cancer Research, Adam et al., 2002.
- EVMS prostate cancer data (TOF mass spectra): 326 samples (167 cancer, 159 control).
- Preprocessing including restriction to m/z 200–10000 and baseline removal.
- Data split into 3 equal parts; 3 experiments, each training on 2/3 and testing on the remaining 1/3.
- Forty-four methods tried.
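A hedged sketch of this evaluation protocol (3 equal parts, train on 2/3, test on 1/3) using scikit-learn's KFold; the linear-SVM classifier and accuracy metric here are placeholders, not the forty-four methods compared in the study.

```python
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def three_way_experiments(X, y):
    """Run 3 experiments: each trains on 2/3 of the data and tests on the held-out 1/3."""
    scores = []
    splitter = KFold(n_splits=3, shuffle=True, random_state=0)
    for train_idx, test_idx in splitter.split(X):
        clf = LinearSVC(dual=False).fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return scores
```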
Method Comparison, 100 Features: non-linear multivariate > linear multivariate > linear univariate.
Method Comparison, 7 Features: non-linear multivariate > linear multivariate > linear univariate.
Part V CONCLUSION
Experimental Results
In spite of the risk of overfitting:
• Subset selection methods can outperform single-feature ranking by correlation with the target.
• Non-linear feature selection can outperform linear feature selection.
… both in prediction performance and in the number of features.
Which method works best?
See the results of the NIPS 2003 competition (presentation on December 19th).
See also the JMLR special issue: www.jmlr.org/papers/special/feature.html (I. Guyon and A. Elisseeff, editors, March 2003).
Workshop website: www.clopinet.com/isabelle/Projects/NIPS2003
Acknowledgements: Masoud Nikravesh