NIPS 2001 Workshop on Feature/Variable Selection
Isabelle Guyon
BIOwulf Technologies
Schedule
• 7:30-8:00 a.m. Welcome and introduction to the problem of feature/variable selection - Isabelle Guyon
• 8:00-8:20 a.m. Dimensionality Reduction via Sparse Support Vector Machines - Jinbo Bi, Kristin P. Bennett, Mark Embrechts, and Curt Breneman
• 8:20-8:40 a.m. Feature selection for non-linear SVMs using a gradient descent algorithm - Olivier Chapelle and Jason Weston
• 8:40-9:00 a.m. When Rather Than Whether: Developmental Variable Selection - Melissa Dominguez
• 9:00-9:20 a.m. Pause, free discussions.
• 9:20-9:40 a.m. How to recycle your SVM code to do feature selection - Andre Elisseeff and Jason Weston
• 9:40-10:00 a.m. Lasso-type estimators for variable selection - Yves Grandvalet and Stéphane Canu
• 10:00-10:30 a.m. Discussion. What are the various statements of the variable selection problem?
• 4:00-4:20 p.m. Using DRCA to see the effects of variable combinations on classifiers - Ofer Melnik
• 4:20-4:40 p.m. Feature selection in the setting of many irrelevant features - Andrew Y. Ng and Michael I. Jordan
• 4:40-5:00 p.m. Relevant coding and information bottlenecks: A principled approach to multivariate feature selection - Naftali Tishby
• 5:00-5:20 p.m. Learning discriminative feature transforms may be an easier problem than feature selection - Kari Torkkola
• 5:20-5:30 p.m. Pause.
• 5:30-6:30 p.m. Discussion. Organization of a future workshop with benchmark.
• 6:30-7:00 p.m. Impromptu talks.
Outline
• Relevance to the “concept”
• Usefulness to the predictor
Vocabulary
• Variable vs. feature
Relevance to the concept
[Diagram: a system or “concept” producing an output.]
Objectives
1. Eliminate distracters.
2. Rank (combinations of) relevant variables.
A big search problem
• Definition of a distracter: a knob that, if tweaked, causes no change in the input/output relationship for any position of all the other knobs.
• “Exhaustive search”: check all knob positions.
• Testing one knob at a time does not work if the output is not controlled by one variable alone.
• For continuous variables: experimental design is needed.
• Greedy “query” strategies.
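The difference between exhaustive search and one-knob-at-a-time testing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `and_system` is a hypothetical three-knob system (not from the workshop), and the function names are made up for this sketch.

```python
from itertools import product

def is_distracter(system, knob, positions):
    """Exhaustive search: True if moving `knob` never changes the output,
    whatever the positions of all the other knobs."""
    others = [p for i, p in enumerate(positions) if i != knob]
    for rest in product(*others):
        outputs = set()
        for value in positions[knob]:
            setting = list(rest)
            setting.insert(knob, value)  # put the tested knob back in place
            outputs.add(system(*setting))
        if len(outputs) > 1:
            return False
    return True

def looks_inactive_from(system, knob, positions, baseline):
    """One knob at a time: move only `knob` away from a fixed baseline."""
    outputs = set()
    for value in positions[knob]:
        setting = list(baseline)
        setting[knob] = value
        outputs.add(system(*setting))
    return len(outputs) == 1

def and_system(a, b, c):
    # Hypothetical system: the output needs a and b together; c is ignored.
    return a & b

positions = [(0, 1), (0, 1), (0, 1)]
exhaustive = [is_distracter(and_system, k, positions) for k in range(3)]
one_at_a_time = [looks_inactive_from(and_system, k, positions, (0, 0, 0))
                 for k in range(3)]
print(exhaustive)     # only the third knob is a distracter
print(one_at_a_time)  # from the all-zero baseline every knob looks inactive
```

The exhaustive check visits every combination of the other knobs, so its cost grows exponentially with the number of knobs, which is why the problem is a big search problem.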
More difficulties
• Noisy or bad data (imprecise knobs, reading errors, systematic errors).
• Lack of data: optimum experimental design cannot be performed.
• Probabilistic definition of a distracter: P(distracter) = fraction of times that, everything else being equal, a change in the position of the knob does not result in a change in output.
• Continuous case: need to measure the state-space areas in which a knob has little or no effect.
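One possible reading of the probabilistic definition, for discrete knobs, is to compare pairs of observations that agree on every other knob but differ on the tested one, and count how often the outputs agree. The sketch below assumes that reading; the `records` data set is invented for illustration and contains one deliberate reading error.

```python
from collections import defaultdict
from itertools import combinations

def p_distracter(records, knob):
    """Fraction of comparable pairs (same other knobs, different position
    of `knob`) whose outputs are equal. Returns None if no pair exists."""
    groups = defaultdict(list)
    for settings, output in records:
        rest = settings[:knob] + settings[knob + 1:]
        groups[rest].append((settings[knob], output))
    same = total = 0
    for observations in groups.values():
        for (v1, y1), (v2, y2) in combinations(observations, 2):
            if v1 != v2:
                total += 1
                same += (y1 == y2)
    return same / total if total else None

# Hypothetical readings of a system whose true output equals the first knob;
# the second knob is a distracter. One reading is wrong on purpose.
records = [
    ((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1),
    ((0, 0), 0), ((0, 1), 1),  # reading error: should be 0
    ((1, 0), 1), ((1, 1), 1),
]
print(p_distracter(records, 0))  # low: the first knob usually changes the output
print(p_distracter(records, 1))  # high: the second knob rarely changes it
```

With noise the estimate for a true distracter falls below 1, so in practice a threshold has to be chosen; with too little data there may be no comparable pairs at all.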
Yet harder
[Diagram: a system with controllable variables, uncontrollable variables, unobservable variables, and an output.]
Tiny example
$y = [x_1 + 2(x_2 - 1)]\,\theta(x_1)\,\theta(x_2)$
[Figure: grid of y values (0 to 4) over x1 and x2, and a diagram of the three knobs x1, x2, x3.]
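The tiny example can be tabulated directly. The sketch below assumes that the symbol multiplying the bracket is a step function equal to 0 for non-positive arguments and 1 otherwise, and that each knob takes the positions 0, 1, 2; both are assumptions made here, not statements from the slide.

```python
from itertools import product

def theta(v):
    # Assumed step function: 0 for v <= 0, 1 for v > 0.
    return 1 if v > 0 else 0

def y(x1, x2, x3):
    # x3 is accepted but never used: it plays the role of a distracter.
    return (x1 + 2 * (x2 - 1)) * theta(x1) * theta(x2)

levels = (0, 1, 2)  # assumed knob positions
grid = {(x1, x2): y(x1, x2, 0) for x1, x2 in product(levels, repeat=2)}

# Print the grid with x2 decreasing down the rows and x1 increasing along them.
for x2 in reversed(levels):
    print("x2=%d:" % x2, [grid[(x1, x2)] for x1 in levels])

x3_is_distracter = all(
    y(x1, x2, x3) == y(x1, x2, 0)
    for x1, x2, x3 in product(levels, repeat=3)
)
print("x3 is a distracter:", x3_is_distracter)
```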
Theory and practice
[Figure: the grid of y values and knob diagram, with three plots of y (0 to 4) against x1, x2, and x3 (each 0 to 2). Curve labels: x2 = 0, 1, 2 (any x3); x1 = 0, 1, 2 (any x3); and the (x1, x2) pairs from 0,0 to 2,2.]
Use of a predictor
• If the system is observed only through given examples of inputs/outputs, or if it is expensive to get a lot of data points: build a predictor.
• Define a criterion of relevance, e.g. $\int \left(f(x_1, x_2, x_3) - f(x_1, x_2)\right)^2 \, dP(x_3 \mid x_1, x_2)$, and approximate it using empirical data.
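A minimal sketch of such an empirical approximation follows. It makes several assumptions that are not in the slide: the predictor is a lookup table of average outputs, the same samples are used to fit and to evaluate it, the squared difference is averaged over all samples (so it is also averaged over the retained variables), and the data come from a hypothetical system y = x1 + 2 x2 in which x3 is ignored.

```python
from collections import defaultdict
from itertools import product

def fit_table(samples, keep):
    """Predictor: average output of the samples that share the same values
    of the variables whose indices are listed in `keep`."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in samples:
        key = tuple(x[i] for i in keep)
        sums[key][0] += y
        sums[key][1] += 1
    table = {key: s / n for key, (s, n) in sums.items()}
    return lambda x: table[tuple(x[i] for i in keep)]

def relevance(samples, full, reduced):
    """Empirical mean of (f_full(x) - f_reduced(x))^2 over the samples."""
    f_full = fit_table(samples, full)
    f_reduced = fit_table(samples, reduced)
    return sum((f_full(x) - f_reduced(x)) ** 2 for x, _ in samples) / len(samples)

# Hypothetical data: every combination of three binary inputs.
samples = [((x1, x2, x3), x1 + 2 * x2)
           for x1, x2, x3 in product((0, 1), repeat=3)]

rel_x3 = relevance(samples, full=(0, 1, 2), reduced=(0, 1))  # drop x3
rel_x2 = relevance(samples, full=(0, 1, 2), reduced=(0, 2))  # drop x2
print(rel_x3, rel_x2)
```

Dropping x3 leaves the predictions unchanged, so its criterion is zero; dropping x2 changes them, so its criterion is positive.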
Relevance to the concept: weak and strong relevance
• Kohavi et al.: classification problem.
• $x_i$ is strongly relevant if its removal yields a deterioration of the performance of the Bayes Optimum Classifier.
• $x_i$ is weakly relevant if it is not strongly relevant and there exists a subset of variables $S$ such that the performance on $S \cup \{x_i\}$ is better than the performance on $S$.
• Features that are neither strongly nor weakly relevant are irrelevant.
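These definitions can be turned into a brute-force check over subsets. In the sketch below, `perf` is a hypothetical stand-in for the performance of the Bayes optimum classifier on a subset of variables (in practice that quantity is unknown and must be estimated); the feature names and performance values are invented for illustration.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items`, from the empty set to the full set."""
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def relevance_class(feature, all_features, perf):
    full = frozenset(all_features)
    # Strong relevance: removing the feature from the full set hurts.
    if perf(full - {feature}) < perf(full):
        return "strong"
    # Weak relevance: adding the feature helps for at least one subset S.
    others = [f for f in all_features if f != feature]
    if any(perf(s | {feature}) > perf(s) for s in subsets(others)):
        return "weak"
    return "irrelevant"

def perf(subset):
    # Hypothetical performance: "a" and "b" carry the same information,
    # "d" carries independent information, "c" carries none.
    return 0.5 + 0.25 * bool({"a", "b"} & subset) + 0.25 * ("d" in subset)

features = ["a", "b", "c", "d"]
classes = {f: relevance_class(f, features, perf) for f in features}
print(classes)
```

The redundant pair comes out as weakly relevant: removing either one alone costs nothing, yet each helps when the other is absent.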
Usefulness to the predictor
• New objective: make good predictions.
• Find a subset of variables that minimizes an estimate of the generalization error E.
• Find a subset of size n or less that minimizes E.
• Find a subset of minimum size for which E ≤ E_all_var + ε.
• Model selection problem: cross-validation (CV), performance bounds, etc.
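The third objective can be sketched with cross-validation as the estimate of E. Everything specific here is an assumption chosen to keep the example short: a nearest-centroid classifier, 3 folds assigned by index, a tolerance `eps`, and a small synthetic data set in which feature 0 tracks the label and feature 1 does not.

```python
from itertools import combinations

def nearest_centroid_error(train, test, subset):
    """Error rate of a nearest-centroid classifier that sees only `subset`."""
    centroids = {}
    for label in sorted({y for _, y in train}):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in subset]
    errors = 0
    for x, y in test:
        predicted = min(
            centroids,
            key=lambda lab: sum((x[j] - c) ** 2
                                for j, c in zip(subset, centroids[lab])),
        )
        errors += predicted != y
    return errors / len(test)

def cv_error(data, subset, k=3):
    """k-fold cross-validation estimate of the error E for one subset."""
    fold_errors = []
    for fold in range(k):
        test = [d for i, d in enumerate(data) if i % k == fold]
        train = [d for i, d in enumerate(data) if i % k != fold]
        fold_errors.append(nearest_centroid_error(train, test, subset))
    return sum(fold_errors) / k

def smallest_good_subset(data, n_features, eps, k=3):
    """Smallest subset whose estimated error is within eps of the error
    obtained with all variables (exhaustive search, smallest sizes first)."""
    e_all = cv_error(data, tuple(range(n_features)), k)
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            if cv_error(data, subset, k) <= e_all + eps:
                return subset, e_all

# Synthetic data: feature 0 = label plus small jitter, feature 1 unrelated.
data = []
for i in range(12):
    label = i % 2
    data.append(((label + 0.1 * (i % 3), (i // 2) % 3), label))

chosen, e_all = smallest_good_subset(data, n_features=2, eps=0.05)
print(chosen, e_all, cv_error(data, chosen))
```

The search is exhaustive and therefore only practical for a handful of variables; the estimate of E is itself noisy, which is where the model selection problem comes in.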
Relevance to the concept vs. usefulness to the predictor
• A relevant variable may not contribute to getting a better predictor (e.g. the case of redundant variables).
• Conversely, a variable that helps improve the performance of a predictor may be irrelevant (e.g. a bias value).
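The bias case can be illustrated with a one-dimensional least-squares fit. The data below are hypothetical (y = x + 5), and the two fitting functions are a minimal sketch written for this example: a constant input carries no information about the concept, yet adding it as a bias term removes all of the error.

```python
def fit_through_origin(xs, ys):
    """Least-squares line y = w * x (no bias input)."""
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: w * x

def fit_with_bias(xs, ys):
    """Least-squares line y = w * x + b (constant bias input added)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return lambda x: w * x + b

def mse(f, xs, ys):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0, 1, 2, 3]
ys = [5, 6, 7, 8]  # hypothetical data following y = x + 5

error_without_bias = mse(fit_through_origin(xs, ys), xs, ys)
error_with_bias = mse(fit_with_bias(xs, ys), xs, ys)
print(error_without_bias, error_with_bias)
```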
Algorithms
• Filters vs. wrappers.
• Exhaustive search.
• Backward elimination vs. forward selection.
• Other greedy search methods (e.g. best first, beam search, compound operators).
• Organization of results.
• Overfitting problems.
[Figure: gene accession numbers (H81558, R88740, H06524, T62947, U19969, M59040, ...) arranged along a scale from 10 to 60.]
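Forward selection and backward elimination can be sketched as two greedy loops around any error estimate. The `toy_error` function below is hypothetical and chosen so that the two searches disagree: features "a" and "b" are only useful together, while "c" is slightly useful alone.

```python
def forward_selection(features, error):
    """Start empty; repeatedly add the feature that lowers the error most."""
    selected, best = [], error(())
    while len(selected) < len(features):
        candidates = [f for f in features if f not in selected]
        e, f = min((error(tuple(selected + [f])), f) for f in candidates)
        if e >= best:  # no candidate improves the error: stop
            break
        selected.append(f)
        best = e
    return selected, best

def backward_elimination(features, error):
    """Start with everything; repeatedly drop the least harmful feature."""
    selected = list(features)
    best = error(tuple(selected))
    while len(selected) > 1:
        e, f = min((error(tuple(g for g in selected if g != f)), f)
                   for f in selected)
        if e > best:  # every removal hurts: stop
            break
        selected.remove(f)
        best = e
    return selected, best

def toy_error(subset):
    # Hypothetical error estimate for a subset of features.
    if "a" in subset and "b" in subset:
        return 0.0
    if "c" in subset:
        return 0.4
    return 0.5

features = ["a", "b", "c"]
forward = forward_selection(features, toy_error)
backward = backward_elimination(features, toy_error)
print(forward)   # gets stuck with "c"
print(backward)  # keeps the useful pair "a", "b"
```

Forward selection never sees the joint effect of "a" and "b" because neither helps on its own, whereas backward elimination starts from the full set and keeps the pair. Neither greedy search is guaranteed to find the best subset.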
Validation
• Classical statistics (compare with random data).
• Machine learning (predict accuracy with test data).
• Validation with other data (e.g. medical literature).
[Figure: estimated falsely significant genes plotted against genes called significant, both axes running up to 6000.]
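One way to "compare with random data" is to shuffle the class labels and count how many features still pass the significance threshold. The sketch below is an assumption-laden illustration, not the procedure behind the figure: the score (absolute difference of class means), the threshold, the number of permutations, and the synthetic data are all choices made here.

```python
import random

def mean_diff_scores(rows, labels):
    """Per-feature score: absolute difference between the two class means."""
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores

def estimate_false_positives(rows, labels, threshold, n_perm=200, seed=0):
    """Return (features called significant, average number of features
    called significant after the labels are randomly permuted)."""
    rng = random.Random(seed)
    called = sum(s >= threshold for s in mean_diff_scores(rows, labels))
    shuffled = list(labels)
    total = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += sum(s >= threshold for s in mean_diff_scores(rows, shuffled))
    return called, total / n_perm

# Synthetic data: feature 0 follows the label, the other nine are pure noise.
data_rng = random.Random(1)
labels = [i % 2 for i in range(20)]
rows = [[3 * y + data_rng.uniform(-1, 1)]
        + [data_rng.uniform(-1, 1) for _ in range(9)]
        for y in labels]

called, estimated_false = estimate_false_positives(rows, labels, threshold=0.5)
print(called, estimated_false)
```

Sweeping the threshold and plotting the second number against the first would give a curve of the same kind as the one in the figure.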
Epilogue
• Do not confuse relevance to the concept and usefulness to the predictor.
• Do not confuse correlation and causality.
• Q1: What are good statements of the variable/feature selection problem?
• Q2: What are good benchmarks?