
Feature Selection with Mutual Information and Resampling

Presentation Transcript


  1. Feature Selection with Mutual Information and Resampling M. Verleysen Université catholique de Louvain (Belgium) Machine Learning Group http://www.dice.ucl.ac.be/mlg/ Joint work with D. François, F. Rossi and V. Wertz.

  2. High-dimensional data: Spectrophotometry • To predict sugar concentration in an orange juice sample from light absorption spectra • 115 samples in dimension 512 • Even a linear model would lead to overfitting!

  3. Material resistance classification • Goal: to classify materials into “valid”, “non-valid” and “don’t know”

  4. Material resistance: feature extraction • Extraction of whatever features you would imagine …

  5. Why reduce dimensionality? • Theoretically not useful: • More information means an easier task • Models can ignore irrelevant features (e.g. set weights to zero) • Models can adjust metrics • « In theory, practice and theory are the same. But in practice, they're not » • Lots of inputs means lots of parameters and a high-dimensional input space → curse of dimensionality and risk of overfitting!

  6. Reduced set of variables • Initial variables: x1, x2, x3, …, xN • Reduced set: • selection: x2, x7, x23, …, xN-4, or • projection: y1, y2, y3, …, yM (where yi = f(wi, x)) • Advantages • selection: interpretability, easy algorithms • projection: potentially more powerful

  7. Feature selection • Initial variables: x1, x2, x3, …, xN • Reduced set (selection): x2, x7, x23, …, xN-4 • Based on sound statistical criteria • Makes interpretation easy: • x7, x23 are the variables to take into account • set {x7, x23} is as good as set {x2, x44, x47} to serve as input to the model

  8. Feature selection • Two ingredients are needed: • Key Element 1: Subset relevance assessment • Measuring how well a subset of features fits the problem • Key Element 2: Subset search policy • Avoiding having to try all possible subsets

  9. Feature selection • Two ingredients are needed: • Key Element 1: Subset relevance assessment • Measuring how well a subset of features fits the problem • Key Element 2: Subset search policy • Avoiding having to try all possible subsets

  10. Optimal subset search • Which subset is most relevant? • Candidates range from the empty set [ ] to the full set [X1 X2 X3 X4] • NP problem: exponential in the number of features (2^N subsets)

  11. Option 1: Best subset is … subset of best features • Which subset is most relevant? • Hypothesis: the best subset is the set of the K individually most relevant features • Naive search (ranking); a minimal sketch follows below
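A minimal sketch of this naive ranking in Python, assuming a data matrix X (samples × features), a target y, and scikit-learn's k-NN based MI estimate; the names and the value of K are illustrative:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression  # k-NN based MI estimate

def rank_features(X, y, K=5, k_neighbors=6):
    """Naive search: score each feature individually by its estimated
    mutual information with y and keep the K best-scoring ones."""
    scores = mutual_info_regression(X, y, n_neighbors=k_neighbors)
    return np.argsort(scores)[::-1][:K]  # indices of the K highest-MI features
```

Note that each feature is scored on its own, which is exactly what the next slide criticizes.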

  12. Ranking is usually not optimal • With very correlated features, ranking will obviously select redundant (nearly identical) features!

  13. Option 2: Best subset is … approximate solution to NP problem • Which subset is most relevant? • Hypothesis: the best subset can be constructed iteratively • Iterative heuristics (e.g. greedy forward search; see the sketch below)
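A sketch of the greedy forward variant of such an iterative heuristic, assuming a subset-MI estimator mi(X_subset, y) such as the Kraskov estimator sketched later; the naive stopping rule shown here is the one questioned on the following slides:

```python
import numpy as np

def forward_selection(X, y, mi, max_features=None):
    """Greedy forward search: at each step, add the feature that gives the
    largest estimated mutual information of the enlarged subset with y."""
    n_features = X.shape[1]
    selected, remaining = [], list(range(n_features))
    best_score = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        scores = {j: mi(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:  # naive stop: MI no longer increases
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```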

  14. About the relevance criterion • The relevance criterion must deal with subsets of variables!

  15. Feature selection • Two ingredients are needed: • Key Element 1: Subset relevance assessment • Measuring how well a subset of features fits the problem • Key Element 2: Subset search policy • Avoiding having to try all possible subsets

  16. Mutual information • Mutual information is: • bounded below by 0 • not bounded above by 1 • bounded above by the (unknown) entropies of X and Y
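For reference, the usual definition behind these bounds (standard formula, written here in LaTeX; it does not appear explicitly in the transcript):

```latex
I(X;Y) \;=\; \iint p_{X,Y}(x,y)\,\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\,\mathrm{d}x\,\mathrm{d}y
       \;=\; H(Y) - H(Y \mid X) \;\ge\; 0 .
```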

  17. Mutual information is difficult to estimate • probability density functions are not known • integrals cannot be computed exactly • X can be high-dimensional

  18. Estimation in HD • Traditional MI estimators: • histograms • kernels (Parzen windows) → NOT appropriate for high dimension • Kraskov's estimator (k-NN counts) → still not very appropriate, but works better … • Principle: when data are close in the X space, are the corresponding Y close too?

  19. Kraskov's estimator (k-NN counts) • Principle: count the number of neighbors in X versus the number of neighbors in Y [figure: neighborhoods plotted in the X and Y spaces]

  20. Kraskov's estimator (k-NN counts) • Principle: count the number of neighbors in X versus the number of neighbors in Y • Nearest neighbors in X and Y coincide: high mutual information • Nearest neighbors in X and Y do not coincide: low mutual information
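A compact sketch of Kraskov's first estimator built on this counting principle; it is an illustration written for this transcript (using SciPy's KD-tree and digamma function), not the authors' own implementation:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kraskov_mi(x, y, k=6):
    """Kraskov-style estimate of I(X;Y) from k-NN counts (estimator #1)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    # distance (max-norm) to the k-th neighbour in the joint (X, Y) space
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = dist[:, -1]
    # count, for each point, the neighbours falling strictly inside eps
    # in the X space and in the Y space (the point itself is subtracted)
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    nx = np.asarray(tree_x.query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True)) - 1
    ny = np.asarray(tree_y.query_ball_point(y, eps - 1e-12, p=np.inf, return_length=True)) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

For instance, kraskov_mi(X[:, [2, 7]], y) estimates the MI between a two-feature subset and the output, so the same function can serve as the subset criterion in the forward search sketched earlier.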

  21. MI estimation • Mutual information estimators require the tuning of a parameter: • number of bins in histograms • kernel variance in Parzen windows • k in the k-NN based estimator (Kraskov) • Unfortunately, the MI estimator is not very robust to this parameter …

  22. Robustness of the MI estimator • [figure omitted: estimates on 100 samples]

  23. Sensitivity to stopping criterion • Forward search: stop when MI does not increase anymore • In theory: is it valid?

  24. Sensitivity to stopping criterion • Forward search: stop when MI does not increase anymore • In theory: is it valid? • Answer: NO, because adding a feature can never decrease the true mutual information: I((XS, Xj); Y) ≥ I(XS; Y)

  25. Sensitivity to stopping criterion • Forward search: stop when MI does not increase anymore • In theory: NOT OK! • In practice: ???

  26. In summary • Two problems: • The number k of neighbors in the k-NN estimator • When to stop?

  27. Number of neighbors? • How to select k (the number of neighbors)?

  28. Number of neighbors? • How to select k (the number of neighbors)? • Idea: compare the (distributions of) the MI between Y and • a relevant feature X • a non-relevant one Xp.

  29. The best value for k • The optimal value of k is the one that best separates the two distributions (e.g. with a Student-like test)

  30. How to obtain these distributions? • Distribution of MI(Y, X): • use non-overlapping subsets X[i] • compute I(X[i], Y) • Distribution of MI(Y, Xp): eliminate the relation between X and Y • How? Permute X → Xp • use non-overlapping subsets Xp[j] • compute I(Xp[j], Y) • (a sketch of this procedure follows below)
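A sketch of this resampling procedure and of the resulting choice of k, assuming the kraskov_mi estimator sketched above; the number of blocks, the range of k values and the Student-like statistic are illustrative choices (each block must of course contain more than k points):

```python
import numpy as np
from scipy import stats

def mi_distribution(x, y, k, rng, n_blocks=5, permute=False):
    """Estimate MI on non-overlapping blocks of the data; if permute=True,
    x is shuffled first to break its relation with y (giving Xp)."""
    if permute:
        x = rng.permutation(x)
    order = rng.permutation(len(x))
    return np.array([kraskov_mi(x[block], y[block], k=k)
                     for block in np.array_split(order, n_blocks)])

def choose_k(x, y, k_values=(2, 4, 6, 8), seed=0):
    """Pick the k whose two MI distributions (relevant vs. permuted feature)
    are best separated, here measured by a Student-like t statistic."""
    rng = np.random.default_rng(seed)
    separation = {}
    for k in k_values:
        mi_rel = mi_distribution(x, y, k, rng, permute=False)
        mi_perm = mi_distribution(x, y, k, rng, permute=True)
        separation[k] = stats.ttest_ind(mi_rel, mi_perm, equal_var=False).statistic
    return max(separation, key=separation.get)
```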

  31. The stopping criterion • Observed difficulty: the estimated MI grows with the size of the feature set, even if the added feature Xk satisfies MI(Xk, Y) = 0 • Avoid comparing MI estimates on subsets of different sizes! • Instead, compare the MI of the enlarged subset with the values obtained when the candidate Xk is replaced by a permuted copy

  32. The stopping criterion • Accept a candidate feature only if its estimated MI exceeds the 95% percentile of the permutation distribution (see the sketch below)
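A sketch of the resulting permutation test used as stopping criterion, again assuming the kraskov_mi estimator above; the number of permutations and the 95% level are illustrative defaults:

```python
import numpy as np

def accept_candidate(X_sel, x_new, y, k=6, n_perm=100, alpha=0.05, seed=0):
    """Accept the candidate feature x_new only if the MI of the enlarged subset
    exceeds the (1 - alpha) percentile of the permutation distribution obtained
    by shuffling x_new, i.e. by breaking its relation with y.
    X_sel is the (n_samples, n_selected) matrix of already selected features."""
    rng = np.random.default_rng(seed)
    mi_obs = kraskov_mi(np.column_stack([X_sel, x_new]), y, k=k)
    mi_null = [kraskov_mi(np.column_stack([X_sel, rng.permutation(x_new)]), y, k=k)
               for _ in range(n_perm)]
    return mi_obs > np.quantile(mi_null, 1.0 - alpha)
```

Plugged into the forward search sketched earlier, this test replaces the naive "stop when the MI estimate no longer increases" rule.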

  33. The stopping criterion • 100 datasets (Monte Carlo simulations)

  34. "Housing" benchmark • Dataset origin: StatLib library, Carnegie Mellon Univ. • Concerns housing values in suburbs of Boston • Attributes: • CRIM per capita crime rate by town • ZN proportion of residential land zoned for lots over 25,000 sq.ft. • INDUS proportion of non-retail business acres per town • CHAS Charles River dummy variable (= 1 if tract bounds river; otherw. 0) • NOX nitric oxides concentration (parts per 10 million) • RM average number of rooms per dwelling • AGE proportion of owner-occupied units built prior to 1940 • DIS weighted distances to five Boston employment centres • RAD index of accessibility to radial highways • TAX full-value property-tax rate per $10,000 • PTRATIO pupil-teacher ratio by town • B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town • LSTAT % lower status of the population • MEDV Median value of owner-occupied homes in $1000's

  35. The stopping criterion • Housing dataset

  36. The stopping criterion • Housing dataset RBFN performances on test set: - all features: RMSE = 18.97 - 2 features (max MI): RMSE = 19.39 - Selected features: RMSE = 9.48

  37. The stopping criterion • Spectral analysis (Nitrogen dataset) • 141 IR spectra, 1050 wavelengths • 105 spectra for training, 36 for test • Functional preprocessing (B-splines)
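One possible functional preprocessing of this kind, written as an illustration (not necessarily the authors' exact B-spline scheme): fit each spectrum with a least-squares cubic B-spline and use the spline coefficients as the reduced functional representation.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

def bspline_coefficients(spectra, n_basis=30, degree=3):
    """Project each spectrum (one row = 1050 wavelengths) onto a cubic
    B-spline basis; the rows of the result are the coefficient vectors."""
    n_wavelengths = spectra.shape[1]
    x = np.linspace(0.0, 1.0, n_wavelengths)
    # clamped knot vector giving exactly n_basis basis functions
    interior = np.linspace(0.0, 1.0, n_basis - degree + 1)[1:-1]
    knots = np.r_[[0.0] * (degree + 1), interior, [1.0] * (degree + 1)]
    return np.vstack([make_lsq_spline(x, s, knots, k=degree).c for s in spectra])
```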

  38. The stopping criterion • Spectral analysis (Nitrogen dataset) RBFN performances on test set: - all features: RMSE = 3.12 - 6 features (max MI): RMSE = 0.78 - Selected features: RMSE = 0.66

  39. The stopping criterion • Delve-Census dataset • 104 features used • 22784 data points • 14540 for test • 8 x 124 for training (to study variability)

  40. The stopping criterion • Delve-Census dataset

  41. The stopping criterion • Delve-Census dataset • RMSE on test set

  42. Conclusion • Selection of variables by mutual information may improve learning performances and increase interpretability … • … if used in an adequate way! • Reference: • D. François, F. Rossi, V. Wertz and M. Verleysen, Resampling methods for parameter-free and robust feature selection with mutual information, Neurocomputing, Volume 70, Issues 7-9, March 2007, Pages 1276-1288 • Thanks to my co-authors for (most of …) the work!
