
On the Stability of Feature Selection in the Presence of Correlations

Explore the stability of feature selection methods in the presence of correlations, and what it means for trust in feature choices, reproducibility, and investment decisions. Investigate how small perturbations of the data change the selected feature subset. Learn how domain knowledge about feature correlations can be used to correct stability estimates.

mharris




Presentation Transcript


  1. On the Stability of Feature Selection in the Presence of Correlations. Konstantinos Sechidis, Konstantinos Papangelou, Sarah Nogueira, James Weatherall and Gavin Brown. Slides at: tinyurl.com/ecml2019stability

  2. Your Data Science Pipeline: Predictions, Conclusions, Investments. How much do you trust your feature choices? How reproducible is your research result?

  3. How much do you trust your choices? Your Data Science Pipeline: x1, x2, x3, x4, x5, x6, x7, …, etc, …, x499, x500 -> Feature Selection (any method) -> x1, x3, x5, x6, x493

  4. How much do you trust your choices? Your Data Science Pipeline: x1, x2, x3, x4, x5, x6, x7, …, etc, …, x499, x500 -> Feature Selection -> Cross-validate -> x1, x3, x5, x6, x493

  5. How much do you trust your choices? Your Data Science Pipeline: x1, x2, x3, x4, x5, x6, x7, …, etc, …, x499, x500 -> Feature Selection -> Cross-validate -> x1, x3, x5, x6, x493 (alternatives shown: x491, x2). Drop a random 1% of examples: it will not make a difference …or will it? This is “stability”.
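
The perturb-and-reselect experiment on this slide can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: `select_top_k` is a hypothetical stand-in for "any method" (it ranks features by absolute correlation with the target), and the synthetic data, `k`, and `drop_frac` are invented for the example. Each row of the returned matrix is one binary selection vector.

```python
import numpy as np

def select_top_k(X, y, k):
    """Hypothetical selector: indices of the k features most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(scores)[-k:]

def selection_matrix(X, y, k=5, M=50, drop_frac=0.01, seed=0):
    """Repeat M times: drop a random fraction of examples, then select features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_keep = n - max(1, int(round(drop_frac * n)))
    Z = np.zeros((M, d), dtype=int)
    for m in range(M):
        keep = rng.choice(n, size=n_keep, replace=False)  # perturbed sample
        Z[m, select_top_k(X[keep], y[keep], k)] = 1       # binary selection vector
    return Z

# Synthetic data (assumption): only features 0 and 3 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200)
Z = selection_matrix(X, y, k=5, M=50)
print(Z.shape)            # one row per repetition, one column per feature
print(Z.mean(axis=0))     # how often each feature was selected
```

Columns whose selection frequency is neither 0 nor 1 are the features that come and go as the data is perturbed.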

  6. Stability of Feature Selection: “the change in the selected feature subset caused by tiny changes in the training data”. The selected subset [ x1, x3, x5, x6, x493 ] is coded as a “selection vector” over z1 z2 z3 z4 z5 z6 z7 … z493 …: [ 1 0 1 0 1 1 0 0 … 1 … ]. References: ECML 2016, tinyurl.com/ecml2016stability (a few differences ECML -> JMLR); JMLR 2018, Nogueira et al, “On The Stability of Feature Selection”.

  7. Stability: set intersection (i.e. number of features in common). Repeat M times: perturb the data, then select features. E.g. Kalousis 2005, Kuncheva 2007, Lustgarden 2008, etc. Selection vectors s1, s2, s3, s4, …, sM over z1 z2 z3 z4 z5 z6 z7 … z493 …:
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 1 1 1 0 0 … 1 … ]
     [ 1 0 0 0 0 0 0 0 … 1 … ]
     [ 1 0 1 0 1 1 1 1 … 1 … ]
     [ 0 1 0 1 0 0 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 0 0 1 0 0 … 1 … ]

  8. Stability: set intersection (i.e. number of features in common). Repeat M times: perturb the data, then select features. E.g. Kalousis 2005, Kuncheva 2007, Lustgarden 2008, etc. Selection vectors s1, s2, s3, s4, …, sM over z1 z2 z3 z4 z5 z6 z7 … z493 …:
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 1 1 1 0 0 … 1 … ]
     [ 1 0 0 0 0 0 0 0 … 0 … ]
     [ 1 0 1 0 1 1 1 1 … 1 … ]
     [ 0 1 0 1 0 0 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 0 0 1 0 0 … 1 … ]

  9. Stability (Nogueira et al, “On The Stability of Feature Selection”, Journal of Machine Learning Research 2018). Repeat M times: perturb the data, then select features. Quantities labelled in the formula: the probability of selecting feature f, the average number of features selected, and the total number of features. Selection vectors s1, s2, s3, s4, …, sM over z1 z2 z3 z4 z5 z6 z7 … z493 …:
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 1 1 1 0 0 … 1 … ]
     [ 1 0 1 0 0 0 0 0 … 1 … ]
     [ 1 0 1 0 1 1 1 1 … 1 … ]
     [ 0 1 0 1 0 0 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 0 0 1 0 0 … 1 … ]
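
The formula itself did not survive in the transcript. For reference, the estimator published in the cited JMLR 2018 paper has the form below; the symbol names are a notational choice made here. The frequency with which feature f is selected across the M runs is written p̂_f, the average number of selected features k̄, and the total number of features d.

```latex
\hat{\Phi}(\mathcal{Z})
  = 1 - \frac{\dfrac{1}{d}\sum_{f=1}^{d} s_f^{2}}
             {\dfrac{\bar{k}}{d}\left(1 - \dfrac{\bar{k}}{d}\right)},
\qquad
s_f^{2} = \frac{M}{M-1}\,\hat{p}_f\,\bigl(1 - \hat{p}_f\bigr)
```

The numerator is the average sample variance of the selection indicators; the denominator is the variance each indicator would have if k̄ features were picked uniformly at random.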

  10. Stability (Nogueira et al, “On The Stability of Feature Selection”, Journal of Machine Learning Research 2018). Constant processes give stability ONE. Stability = 1.0 … assuming large sample size M … usually M=50 is sufficient. Selection vectors over z1 z2 z3 z4 z5 z6 z7 … z493 …:
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]

  11. Stability (Nogueira et al, “On The Stability of Feature Selection”, Journal of Machine Learning Research 2018). Random processes give stability ZERO. Stability = 0.0 … assuming large sample size M … usually M=50 is sufficient. Selection vectors over z1 z2 z3 z4 z5 z6 z7 … z493 …:
     [ 0 0 1 0 1 1 0 0 … 1 … ]
     [ 1 0 0 1 0 0 0 1 … 0 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 0 0 1 0 0 0 0 … 1 … ]
     [ 1 0 1 0 0 1 0 1 … 0 … ]
     [ 1 0 0 0 0 0 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 0 … ]
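
The two limiting cases (constant selection gives one, random selection gives roughly zero) can be checked numerically. The sketch below implements the estimator from the cited JMLR 2018 paper as one minus the average per-feature sample variance divided by its value under uniformly random selection; the function name `stability` and the toy matrices (d = 500, k = 5, M = 50) are assumptions made for the illustration.

```python
import numpy as np

def stability(Z):
    """Stability of an (M x d) binary selection matrix Z, one row per run.

    Undefined (division by zero) if no features or all features are
    selected in every run.
    """
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p_hat = Z.mean(axis=0)                   # frequency of selecting each feature
    k_bar = Z.sum(axis=1).mean()             # average number of features selected
    s2 = M / (M - 1) * p_hat * (1 - p_hat)   # unbiased per-feature variance
    return 1.0 - s2.mean() / ((k_bar / d) * (1 - k_bar / d))

M, d, k = 50, 500, 5

# Constant process: the same five features in every run.
constant_Z = np.zeros((M, d), dtype=int)
constant_Z[:, [0, 2, 4, 5, 492]] = 1

# Random process: k features drawn uniformly at random in every run.
rng = np.random.default_rng(0)
random_Z = np.zeros((M, d), dtype=int)
for row in random_Z:
    row[rng.choice(d, size=k, replace=False)] = 1

print(stability(constant_Z))  # exactly 1.0
print(stability(random_Z))    # close to 0, up to sampling noise
```

The random case is only zero in expectation, so a finite M gives a small value on either side of zero.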

  12. Stability (Nogueira et al, “On The Stability of Feature Selection”, Journal of Machine Learning Research 2018). It’s alternating… Stability = 0.26. Not very stable? But what if I tell you that feature x1 is highly correlated with x2, and that feature x3 is measuring the same thing as x4? Selection vectors over z1 z2 z3 z4 z5 z6 z7 … z493 …:
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 1 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 1 1 1 0 0 … 1 … ]
     [ 1 0 1 0 1 1 0 0 … 1 … ]
     [ 0 1 0 1 1 1 0 0 … 1 … ]

  13. On the Stability of Feature Selection in the Presence of Correlations
     • This paper proves: any stability measure will systematically underestimate stability in the presence of correlations, giving an overly pessimistic picture.
     • This paper provides: a correction that incorporates domain knowledge on feature equivalences and/or correlations, giving the effective stability.

  14. On the Stability of Feature Selection in the Presence of Correlations

  15. On the Stability of Feature Selection in the Presence of Correlations. Intersection vs effective intersection of two selection vectors over z1 z2 z3 z4 z5 z6 z7 … z493 …:
     s1 = [ 1 0 0 0 1 1 0 0 … 1 … ]
     s2 = [ 0 1 0 0 1 1 0 0 … 1 … ]
     Domain knowledge (“features x1 and x2 are the same thing”) is coded as a binary matrix, with an entry set where two features are to be treated as the same; a partial correlation can be encoded instead. (The formulas on this slide were lost in extraction.)
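
One reading consistent with this slide is that the plain intersection is the inner product of two selection vectors, and the effective intersection is the same product weighted by the domain-knowledge matrix. The sketch below illustrates that reading using the first eight entries of the slide's two vectors; the matrix name `C`, the function name, and the truncation to eight features are assumptions, and this is not the paper's full effective-stability estimator.

```python
import numpy as np

def effective_intersection(s_i, s_j, C):
    """Generalised inner product s_i^T C s_j (assumed form of the effective intersection)."""
    return float(np.asarray(s_i) @ np.asarray(C) @ np.asarray(s_j))

# First eight entries of the two selection vectors on the slide.
s1 = np.array([1, 0, 0, 0, 1, 1, 0, 0])
s2 = np.array([0, 1, 0, 0, 1, 1, 0, 0])

# Domain knowledge: x1 and x2 are the same thing (symmetric entry set to 1).
# A value between 0 and 1 could stand for a partial correlation.
C = np.eye(8)
C[0, 1] = C[1, 0] = 1.0

plain = int(s1 @ s2)                           # features literally in common
effective = effective_intersection(s1, s2, C)  # also credits x1 vs x2
print(plain, effective)                        # 2 3.0
```

With the identity matrix in place of `C`, the effective intersection reduces to the plain one.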

  16. On the Stability of Feature Selection in the Presence of Correlations. Proposed by LASSO: [ 1 0 1 0 1 1 0 0 … 1 … ]. Proposed by Mut Info: [ 0 0 1 1 0 1 0 0 … 0 … ]. [Plot: accuracy (3-nn) against stability, both axes from 0.0 to 1.0, with points for LASSO and MIM.] Accuracy/Stability is a trade-off. See “Explicit Control of Feature Relevance and Selection Stability Through Pareto Optimality”, Victor Hamer, Pierre Dupont, IAL workshop, ECML 2019.

  17. On the Stability of Feature Selection in the Presence of Correlations. [Two plots of accuracy (3-nn) for LASSO and MIM, axes from 0.0 to 1.0: one against stability, one against effective stability.]

  18. On the Stability of Feature Selection in the Presence of Correlations. Accuracy vs Stability: Pareto-optimality. [Plots: stability and effective stability.] Effective stability identifies a solution as more stable than expected.

  19. On the Stability of Feature Selection in the Presence of Correlations. Accuracy vs Stability: Pareto-optimality. Effective stability alters the ‘optimal’ choice of feature set in 7/10 datasets.
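
Pareto-optimality here means keeping every candidate that no other candidate beats on both accuracy and stability at once. The sketch below shows how swapping stability for effective stability can change which candidates survive. All names and numbers are hypothetical and chosen only to make the effect visible; they are not results from the paper.

```python
def pareto_front(points):
    """Names of points not dominated by another point (higher is better on both axes)."""
    front = []
    for name, acc, stab in points:
        dominated = any(
            a >= acc and s >= stab and (a > acc or s > stab)
            for _, a, s in points
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical candidates: (name, accuracy, stability).
by_stability = [("A", 0.90, 0.30), ("B", 0.85, 0.60), ("C", 0.80, 0.55), ("D", 0.70, 0.90)]

# Same accuracies; candidate C looks more stable once correlated
# features are credited (hypothetical effective stability values).
by_effective = [("A", 0.90, 0.30), ("B", 0.85, 0.60), ("C", 0.80, 0.75), ("D", 0.70, 0.90)]

print(pareto_front(by_stability))   # ['A', 'B', 'D']
print(pareto_front(by_effective))   # ['A', 'B', 'C', 'D']
```

In the first list C is dominated by B; in the second it is not, so the set of defensible choices changes.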

  20. On the Stability of Feature Selection in the Presence of Correlations. Empirical Study: Stability of Biomarker Selection. Efficacy of gefitinib vs chemotherapy for lung cancer.

  21. On the Stability of Feature Selection in the Presence of Correlations. All EGFR gene mutations (known to play a role in NSCLC). Measure within-group stability to see what’s happening… Changes our view of the “best” algorithm to invest in.

  22. Conclusions. A simple closed-form estimator for the effective stability, incorporating domain knowledge on feature correlations and equivalences. Empirically demonstrated on biomarker identification tasks; allows measurement of trust in data science pipelines.

  23. Your Data Science Pipeline: Predictions, Conclusions, Investments. How much do you trust your data science pipeline? How reproducible / defendable are your decisions?

  24. On the Stability of Feature Selection in the Presence of Correlations. Konstantinos Sechidis, Konstantinos Papangelou, Sarah Nogueira, James Weatherall and Gavin Brown. A simple closed-form estimator for the effective stability, incorporating domain knowledge on feature correlations and equivalences. Empirically demonstrated on biomarker identification tasks; allows measurement of trust in data science pipelines.
