Explore the stability of feature selection methods in the presence of correlations, and what it means for trust in feature choices, reproducibility, and investments. Investigate how small data perturbations change the selected feature subset. Learn how domain knowledge about feature correlations can correct the stability estimate, so that swapping one correlated feature for another is not counted as instability. Find insights on the reliability of feature choices in machine learning research.
Slides at: tinyurl.com/ecml2019stability On the Stability of Feature Selection in the Presence of Correlations Konstantinos Sechidis, Konstantinos Papangelou, Sarah Nogueira, James Weatherall and Gavin Brown
Your Data Science Pipeline Predictions Conclusions Investments How much do you trust your feature choices? How reproducible is your research result?
How much do you trust your choices? x1, x2, x3, x4, x5, x6, x7, …, etc,…, x499, x500 Your Data Science Pipeline Feature Selection (any method) x1, x3, x5, x6, x493
How much do you trust your choices? x1, x2, x3, x4, x5, x6, x7, …, etc,…, x499, x500 Feature Selection Cross-validate Your Data Science Pipeline x1, x3, x5, x6, x493
How much do you trust your choices? x1, x2, x3, x4, x5, x6, x7, …, etc,…, x499, x500 Drop a random 1% of examples. Will not make a difference …or will it? Feature Selection Cross-validate Your Data Science Pipeline x491 x2 x1, x3, x5, x6, x493 “Stability”
Stability of Feature Selection: “the change in the selected feature subset caused by tiny changes in the training data”
Selected subset: [ x1, x3, x5, x6, x493 ]
“Selection vector” over z1 z2 z3 z4 z5 z6 z7 … z493 …: [ 1 0 1 0 1 1 0 0 … 1 … ]
ECML 2016: tinyurl.com/ecml2016stability (a few differences ECML -> JMLR)
JMLR 2018: Nogueira et al, “On The Stability of Feature Selection”
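The selection-vector encoding on this slide can be sketched in a few lines. The helper name `to_selection_vector` and the total of d = 500 features (taken from the x1 … x500 listing on the earlier slides) are assumptions made for illustration.

```python
import numpy as np

def to_selection_vector(selected, d):
    """Encode 1-based feature indices as a binary vector of length d."""
    z = np.zeros(d, dtype=int)
    z[np.asarray(selected) - 1] = 1   # shift to 0-based positions
    return z

# The subset shown on the slide: x1, x3, x5, x6, x493.
s = to_selection_vector([1, 3, 5, 6, 493], d=500)
print(s[:8])    # first eight entries of the selection vector
print(s.sum())  # number of selected features
```

The first eight entries match the vector printed on the slide, and the entry for x493 is the lone 1 further along.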
Stability: set intersection (i.e. num features in common). Repeat M times: I perturb my data, select features. This gives selection vectors s1, s2, s3, s4, …, sM over z1 z2 z3 z4 z5 z6 z7 … z493 …:
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 1 1 1 0 0 … 1 … ]
[ 1 0 0 0 0 0 0 0 … 1 … ]
[ 1 0 1 0 1 1 1 1 … 1 … ]
[ 0 1 0 1 0 0 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 0 0 1 0 0 … 1 … ]
e.g. Kalousis 2005, Kuncheva 2007, Lustgarden 2008, etc
Stability: set intersection (i.e. num features in common). Repeat M times: I perturb my data, select features. Selection vectors s1, s2, s3, s4, …, sM:
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 1 1 1 0 0 … 1 … ]
[ 1 0 0 0 0 0 0 0 … 0 … ]
[ 1 0 1 0 1 1 1 1 … 1 … ]
[ 0 1 0 1 0 0 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 0 0 1 0 0 … 1 … ]
e.g. Kalousis 2005, Kuncheva 2007, Lustgarden 2008, etc
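The "repeat M times: perturb, select" loop can be sketched as follows. Everything concrete here is an assumption for illustration: the synthetic data, the hypothetical `select_top_k` filter (the slides say any selection method can be plugged in), and the choice of dropping a random 1% of examples as the perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 examples, 20 features,
# with a target that depends on the first three features.
n, d, k, M = 200, 20, 5, 50
X = rng.normal(size=(n, d))
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=n)

def select_top_k(X, y, k):
    """Hypothetical filter: keep the k features most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    z = np.zeros(X.shape[1], dtype=int)
    z[np.argsort(corr)[-k:]] = 1
    return z

# Build the M x d matrix of selection vectors, one row per perturbed run.
Z = np.empty((M, d), dtype=int)
n_keep = n - max(1, n // 100)          # drop a random 1% of the examples
for i in range(M):
    keep = rng.choice(n, size=n_keep, replace=False)
    Z[i] = select_top_k(X[keep], y[keep], k)

# Pairwise set intersections: entry (i, j) counts features shared by runs i and j.
overlap = Z @ Z.T
pairs = overlap[np.triu_indices(M, k=1)]
print("mean pairwise intersection:", pairs.mean())
```

A mean intersection close to k suggests the runs mostly agree; how to normalise that number is what the stability measures cited on the slide differ on.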
Nogueira et al, “On The Stability of Feature Selection”, Journal of Machine Learning Research 2018
Stability. Repeat M times: I perturb my data, select features. Selection vectors s1, s2, s3, s4, …, sM over z1 z2 z3 z4 z5 z6 z7 … z493 …:
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 1 1 1 0 0 … 1 … ]
[ 1 0 1 0 0 0 0 0 … 1 … ]
[ 1 0 1 0 1 1 1 1 … 1 … ]
[ 0 1 0 1 0 0 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 0 0 1 0 0 … 1 … ]
Terms annotated in the stability formula: probability of selecting f; average num selected; total number of features.
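The three annotated terms correspond to the estimator in the cited JMLR 2018 paper. As recalled from that paper (the symbols below are assumptions worth checking against it), with $\hat{p}_f$ the fraction of the M runs that select feature f, $\bar{k}$ the average number of features selected, and $d$ the total number of features, it reads:

```latex
\hat{\Phi}(\mathcal{Z})
  = 1 - \frac{\dfrac{1}{d}\sum_{f=1}^{d} s_f^{2}}
             {\dfrac{\bar{k}}{d}\left(1 - \dfrac{\bar{k}}{d}\right)},
\qquad
s_f^{2} = \frac{M}{M-1}\,\hat{p}_f\,\bigl(1 - \hat{p}_f\bigr)
```

The numerator is the average per-feature selection variance; the denominator is the value that variance would take if $\bar{k}$ of the $d$ features were picked at random.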
Stability. Selection vectors over z1 z2 z3 z4 z5 z6 z7 … z493 …:
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
Stability = 1.0 (assuming large sample size M; usually M=50 is sufficient). Constant processes give stability ONE.
Stability. Selection vectors over z1 z2 z3 z4 z5 z6 z7 … z493 …:
[ 0 0 1 0 1 1 0 0 … 1 … ]
[ 1 0 0 1 0 0 0 1 … 0 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 0 0 1 0 0 0 0 … 1 … ]
[ 1 0 1 0 0 1 0 1 … 0 … ]
[ 1 0 0 0 0 0 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 0 … ]
Stability = 0.0 (assuming large sample size M; usually M=50 is sufficient). Random processes give stability ZERO.
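The two limiting cases can be checked numerically. This is a minimal sketch assuming the estimator of the cited JMLR 2018 paper has the form 1 minus the average per-feature selection variance, divided by the variance expected under random selection. The function name `stability` and the sizes d = 500, k = 5, M = 50 are illustrative choices.

```python
import numpy as np

def stability(Z):
    """Stability estimate for an (M, d) binary selection matrix Z (assumed form)."""
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p_hat = Z.mean(axis=0)                    # selection frequency per feature
    s2 = M / (M - 1) * p_hat * (1 - p_hat)    # unbiased per-feature variance
    k_bar = Z.sum(axis=1).mean()              # average number selected
    return 1.0 - s2.mean() / ((k_bar / d) * (1 - k_bar / d))

rng = np.random.default_rng(0)
d, k, M = 500, 5, 50

# Constant process: the same subset in every run.
row = np.zeros(d)
row[[0, 2, 4, 5, 492]] = 1
constant = np.tile(row, (M, 1))

# Random process: k features drawn uniformly at random in every run.
random_sel = np.zeros((M, d))
for i in range(M):
    random_sel[i, rng.choice(d, size=k, replace=False)] = 1

print(stability(constant))    # 1.0, since every per-feature variance is zero
print(stability(random_sel))  # close to 0
```

The sketch does not guard against the degenerate cases where no features or all features are selected in every run, for which the denominator is zero.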
Stability. Selection vectors over z1 z2 z3 z4 z5 z6 z7 … z493 …:
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 1 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 1 1 1 0 0 … 1 … ]
[ 1 0 1 0 1 1 0 0 … 1 … ]
[ 0 1 0 1 1 1 0 0 … 1 … ]
It’s alternating… Stability = 0.26. Not very stable? But if I tell you…
- feature x1 is highly correlated with x2?
- feature x3 is measuring the same thing as x4?
On the Stability of Feature Selection in the Presence of Correlations
• This paper proves:
• any stability measure will systematically underestimate stability in the presence of correlations, leading to an overly pessimistic estimate.
• This paper provides:
• a correction to incorporate domain knowledge on feature equivalences and/or correlations, giving the effective stability.
Intersection vs. effective intersection of two selection vectors over z1 z2 z3 z4 z5 z6 z7 … z493 …:
s1 = [ 1 0 0 0 1 1 0 0 … 1 … ]
s2 = [ 0 1 0 0 1 1 0 0 … 1 … ]
Domain knowledge… Features x1 and x2 are the same thing. This is coded as a binary matrix whose entry for a pair of features is 1 if the two features are to be treated as the same; alternatively, an entry can hold a partial correlation.
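The slide's two quantities can be illustrated on the vectors shown. This sketch assumes the intersection is the inner product of s1 and s2 and the effective intersection is the same product with the domain-knowledge matrix in the middle. The name `C` for that matrix is an assumption, and only the first eight entries of each vector are used because the rest are elided on the slide.

```python
import numpy as np

# First eight entries of the two selection vectors on the slide
# (the remaining entries are elided there, so they are left out here).
s1 = np.array([1, 0, 0, 0, 1, 1, 0, 0])
s2 = np.array([0, 1, 0, 0, 1, 1, 0, 0])

# Domain knowledge: x1 and x2 are the same thing.
# C starts as the identity (each feature matches itself) and gets a 1
# in both off-diagonal positions linking x1 and x2.
C = np.eye(8, dtype=int)
C[0, 1] = C[1, 0] = 1

intersection = s1 @ s2                # 2: only x5 and x6 are shared
effective_intersection = s1 @ C @ s2  # 3: x1 in s1 also matches x2 in s2
print(intersection, effective_intersection)
```

Replacing the off-diagonal 1s with a value between 0 and 1 would give partial credit for correlated, rather than identical, features.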
Proposed by LASSO: [ 1 0 1 0 1 1 0 0 … 1 … ]
Proposed by Mut Info: [ 0 0 1 1 0 1 0 0 … 0 … ]
[Plot: accuracy (3-nn) vs. stability, both axes from 0.0 to 1.0, with points for LASSO and MIM.]
Accuracy/Stability is a trade-off. See “Explicit Control of Feature Relevance and Selection Stability Through Pareto Optimality”, Victor Hamer and Pierre Dupont, IAL workshop, ECML 2019.
[Plots: accuracy (3-nn) vs. stability, and accuracy (3-nn) vs. effective stability, all axes from 0.0 to 1.0, each with points for LASSO and MIM.]
Accuracy vs Stability: Pareto-optimality. [Plots: stability and effective stability.] Effective stability identifies a solution as more stable than expected.
Accuracy vs Stability: Pareto-optimality. Effective stability alters the ‘optimal’ choice of feature set in 7/10 datasets.
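To see how a change in the stability estimate can alter the Pareto-optimal choice, here is a small sketch. The function `pareto_front`, the candidate names A, B, C and all numbers are made up purely for illustration; they are not results from the paper.

```python
def pareto_front(points):
    """Return the names of points not dominated in (accuracy, stability)."""
    front = []
    for name, (acc, stab) in points.items():
        # A point is dominated if another is at least as good on both
        # criteria and strictly better on at least one.
        dominated = any(
            a >= acc and s >= stab and (a > acc or s > stab)
            for other, (a, s) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Made-up (accuracy, stability) pairs for three hypothetical feature sets.
by_stability = {"A": (0.90, 0.40), "B": (0.85, 0.70), "C": (0.80, 0.60)}
# The same feature sets, with stability replaced by a made-up effective stability.
by_effective = {"A": (0.90, 0.85), "B": (0.85, 0.75), "C": (0.80, 0.65)}

print(pareto_front(by_stability))  # ['A', 'B']
print(pareto_front(by_effective))  # ['A']
```

In this toy case, A looks like a trade-off against B under plain stability, but once its instability is attributed to swaps between equivalent features it dominates outright.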
Empirical Study: Stability of Biomarker Selection. Efficacy of gefitinib vs chemotherapy for lung cancer.
All EGFR gene mutations (known to play a role in NSCLC). Measure within-group stability to see what’s happening… Changes our view of the “best” algorithm to invest in.
Conclusions: A simple closed-form estimator for the effective stability, incorporating domain knowledge on feature correlations and equivalences. Empirically demonstrated on biomarker identification tasks; allows measurement of trust in data science pipelines.
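A minimal sketch of such a closed-form estimator follows. It assumes the form 1 − tr(C S) / tr(C Σ0), where S is the unbiased sample covariance of the selection vectors, Σ0 is their covariance under purely random selection, and C is the domain-knowledge matrix. The form, the symbol names and the function `effective_stability` are recalled or chosen for illustration and should be checked against the paper.

```python
import numpy as np

def effective_stability(Z, C):
    """Sketch of 1 - tr(C S) / tr(C Sigma0) for an (M, d) binary matrix Z."""
    Z = np.asarray(Z, dtype=float)
    C = np.asarray(C, dtype=float)
    M, d = Z.shape
    S = np.cov(Z, rowvar=False)           # unbiased sample covariance, d x d
    k = Z.sum(axis=1)                     # subset size in each run
    k_bar, k2_bar = k.mean(), (k ** 2).mean()
    # Covariance of the selection vector if features were picked at random.
    diag = (k_bar / d) * (1 - k_bar / d)
    off = (k2_bar - k_bar) / (d * d - d) - (k_bar / d) ** 2
    Sigma0 = np.full((d, d), off)
    np.fill_diagonal(Sigma0, diag)
    return 1.0 - np.trace(C @ S) / np.trace(C @ Sigma0)

# Alternating pattern, restricted to eight features (an assumption:
# the slides elide the remaining columns).
A = [1, 0, 1, 0, 1, 1, 0, 0]
B = [0, 1, 0, 1, 1, 1, 0, 0]
Z = np.array([A, B, A, A, B, A, B])

# Domain knowledge: x1 is equivalent to x2, and x3 is equivalent to x4.
C = np.eye(8)
C[0, 1] = C[1, 0] = 1
C[2, 3] = C[3, 2] = 1

plain = effective_stability(Z, np.eye(8))   # identity C: no domain knowledge
effective = effective_stability(Z, C)
print(round(plain, 3), round(effective, 3))  # about 0.429 and 1.0
```

With the identity matrix the sketch reduces to the plain stability estimate. Because only eight columns are used, the plain value differs from the 0.26 quoted on the slides; what carries over is that the alternation between equivalent features stops counting as instability.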
Your Data Science Pipeline Predictions Conclusions Investments How much do you trust your data science pipeline? How reproducible / defendable are your decisions?