320 likes | 345 Views
Learn how to select relevant features to improve lung cancer prediction, avoiding pitfalls. Explore causal relationships and feature relevance in this informative lecture by Guyon et al.
E N D
Lecture 3:Causality and Feature Selection Isabelle Guyon isabelle@clopinet.com
Variable/feature selection Y X Remove features Xi to improve (or least degrade) prediction of Y.
What can go wrong? Guyon-Aliferis-Elisseeff, 2007
X2 Y Y X2 X1 X1 What can go wrong? Guyon-Aliferis-Elisseeff, 2007
Y Y X Causal feature selection Uncover causal relationships between Xi and Y.
Causal feature relevance Lung cancer
Causal feature relevance Lung cancer
Causal feature relevance Lung cancer
Markov Blanket Lung cancer Strongly relevant features (Kohavi-John, 1997) Markov Blanket (Tsamardinos-Aliferis, 2003)
Feature relevance • Surely irrelevant feature Xi: P(Xi, Y |S\i) = P(Xi |S\i)P(Y |S\i) for all S\i X\i and all assignment of values to S\i • Strongly relevant feature Xi: P(Xi, Y |X\i) P(Xi |X\i)P(Y |X\i) for some assignment of values to X\i • Weakly relevant feature Xi: P(Xi, Y |S\i) P(Xi |S\i)P(Y |S\i) for some assignment of values to S\i X\i
Markov Blanket Lung cancer Strongly relevant features (Kohavi-John, 1997) Markov Blanket (Tsamardinos-Aliferis, 2003)
Markov Blanket PARENTS Lung cancer Strongly relevant features (Kohavi-John, 1997) Markov Blanket (Tsamardinos-Aliferis, 2003)
Markov Blanket Lung cancer CHILDREN Strongly relevant features (Kohavi-John, 1997) Markov Blanket (Tsamardinos-Aliferis, 2003)
Markov Blanket SPOUSES Lung cancer Strongly relevant features (Kohavi-John, 1997) Markov Blanket (Tsamardinos-Aliferis, 2003)
Causal relevance • Surely irrelevant feature Xi: P(Xi, Y |S\i) = P(Xi |S\i)P(Y |S\i) for all S\i X\i and all assignment of values to S\i • Causally relevant feature Xi: P(Xi,Y|do(S\i)) P(Xi |do(S\i))P(Y|do(S\i)) for some assignment of values to S\i • Weak/strong causal relevance: • Weak=ancestors, indirect causes • Strong=parents, direct causes.
Examples Lung cancer
Immediate causes (parents) Genetic factor1 Smoking Lung cancer
Non-immediate causes (other ancestors) Smoking Anxiety Lung cancer
Non causes (e.g. siblings) Genetic factor1 Other cancers Lung cancer
X || Y | C CHAIN FORK C C X X Y
Hidden more direct cause Smoking Tar in lungs Anxiety Lung cancer
Confounder Smoking Genetic factor2 Lung cancer
Immediate consequences (children) Lung cancer Metastasis Coughing Biomarker1
X || Y but X || Y | C Lung cancer X X C C C X Strongly relevant features (Kohavi-John, 1997) Markov Blanket (Tsamardinos-Aliferis, 2003)
Non relevant spouse (artifact) Lung cancer Bio-marker2 Biomarker1
Systematic noise Another case of confounder Lung cancer Bio-marker2 Biomarker1
Truly relevant spouse Lung cancer Allergy Coughing
Sampling bias Lung cancer Metastasis Hormonal factor
Tar in lungs Genetic factor2 Bio-marker2 Allergy Systematic noise Biomarker1 Metastasis Coughing Hormonal factor Causal feature relevance Genetic factor1 Smoking Other cancers Anxiety Lung cancer (b)
Conclusion • Feature selection focuses on uncovering subsets of variables X1, X2, …predictive of the target Y. • Multivariate feature selection is in principle more powerful than univariate feature selection, but not always in practice. • Taking a closer look at the type of dependencies in terms of causal relationships may help refining the notion of variable relevance.
Acknowledgements and references • Feature Extraction, • Foundations and Applications • I. Guyon et al, Eds. • Springer, 2006. • http://clopinet.com/fextract-book • 2) Causal feature selection • I. Guyon, C. Aliferis, A. Elisseeff • In “Computational Methods of Feature Selection”, • Huan Liu and Hiroshi Motoda Eds., • Chapman and Hall/CRC Press, 2007. • http://clopinet.com/isabelle/Papers/causalFS.pdf