Three feature selection problems (with solutions). Jose M. Peña, Computational Biology, Linköping University, Sweden. jmp@ifm.liu.se, www.ifm.liu.se/~jmp. Joint work with Roland Nilsson, Johan Björkegren and Jesper Tegnér. JMP at IDAMAP 2007
Outline • Problem I: Posterior distribution. • Solution: Markov boundary. • Peña, J. M., Nilsson, R., Björkegren, J. and Tegnér, J. (2007). Towards Scalable and Data Efficient Learning of Markov Boundaries. International Journal of Approximate Reasoning, 45(2), 211-232. • Problem II: Class label. • Solution: Bayes relevant features. • Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research, 8, 589-612. • Problem III: All relevant features. • Solution: RIT algorithm. • Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Detecting Multivariate Differentially Expressed Genes. BMC Bioinformatics, 8:150.
Preliminaries • Classifier, g: X → Y. • Bayes classifier, g*(X) = arg max_y p(y|X). • Risk, R(g) = p(g(X) ≠ Y).
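The two definitions above can be made concrete on a toy discrete distribution. The sketch below is illustrative only: the joint table and the helper names (`posterior`, `bayes_classifier`, `risk`) are assumptions introduced here, not part of the original slides.

```python
# Toy joint distribution p(x, y) over one binary feature X and a binary label Y.
# The numbers are made up for illustration.
joint = {
    (0, 0): 0.40, (0, 1): 0.10,
    (1, 0): 0.15, (1, 1): 0.35,
}

def posterior(joint, x):
    """Return {y: p(y | x)} computed from the joint table."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

def bayes_classifier(joint, x):
    """g*(x) = arg max_y p(y | x)."""
    post = posterior(joint, x)
    return max(post, key=post.get)

def risk(joint, g):
    """R(g) = p(g(X) != Y), summed over the joint table."""
    return sum(p for (x, y), p in joint.items() if g(x) != y)

def g_star(x):
    return bayes_classifier(joint, x)

def always_zero(x):
    # A deliberately poor classifier, used only for comparison.
    return 0
```

On this table, `risk(joint, g_star)` is no larger than `risk(joint, always_zero)`, which is what the Bayes classifier guarantees for any competing classifier.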
Problem I: Posterior distribution • The Markov boundary of Y, SM, is the minimal set of features such that p(p(Y|X) = p(Y|SM)) = 1. • If p(X) > 0 then SM is unique. • If p(X) > 0 then Z ∈ SM iff p(p(Y|X) ≠ p(Y|X\Z)) > 0. Callouts on this condition: "Data inefficient"; "Z is strongly relevant".
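The membership condition above can be checked by brute force when the distribution is known and tiny. The following is a minimal sketch under that assumption; the distribution and the names `P_X`, `P_Y1`, `p_y1_given` and `mass_where_posterior_changes` are hypothetical. It conditions on all remaining features at once, which is exactly the kind of test that needs a lot of data when only samples are available.

```python
from itertools import product

# Three independent, uniform binary features. Y depends on X0 and X1 only.
# The numbers are made up for illustration.
P_X = {x: 1 / 8 for x in product((0, 1), repeat=3)}
P_Y1 = {x: 0.1 + 0.5 * x[0] + 0.3 * x[1] for x in P_X}  # p(Y=1 | x)

def p_y1_given(keep, x):
    """p(Y=1 | X_keep = x_keep) for the feature indices in `keep`."""
    same = [xx for xx in P_X if all(xx[i] == x[i] for i in keep)]
    return sum(P_X[xx] * P_Y1[xx] for xx in same) / sum(P_X[xx] for xx in same)

def mass_where_posterior_changes(z, tol=1e-12):
    """p( p(Y|X) != p(Y|X\\Z) ) for the feature with index z."""
    everything = (0, 1, 2)
    rest = [i for i in everything if i != z]
    return sum(
        P_X[x] for x in P_X
        if abs(p_y1_given(everything, x) - p_y1_given(rest, x)) > tol
    )

# Since p(X) > 0 here, the features with positive mass form the Markov boundary.
markov_boundary = [z for z in range(3) if mass_where_posterior_changes(z) > 0]
```

Here `markov_boundary` contains the indices of X0 and X1 but not X2, since dropping X2 never changes the posterior.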
Algorithms for SM • IAMB (Tsamardinos et al., 2003) is consistent under the composition property assumption (X ╨ Y | Z ∧ X ╨ W | Z → X ╨ YW | Z). • The composition property is satisfied by: • Gaussian distributions. • Distributions perfect to some graph. • It is closed under marginalization and conditioning*.
Algorithms for SM • Consistent under the same conditions as IAMB.
Thrombin data • Data provided by DuPont Pharmaceuticals for KDD Cup 2001. • 1909 training instances + 634 testing instances. • 139351 binary features (3-D properties of a drug compound tested for binding to thrombin, a key receptor in blood clotting).
Problem II: Class label • Z is Bayes relevant iff p(g*(X) ≠ g*(X\Z)) > 0. • Let S* denote the set of Bayes relevant features. Then, • S* is unique if g* is unique, and • g* is unique if p(p(Y=0|X) = p(Y=1|X)) = 0 (Devroye et al. 1996). • If p(X) > 0 then S* is the minimal set of features such that p(g*(X) = g*(S*)) = 1. Callout: "Assumption".
S* may differ from SM • S* ⊆ SM. • But the converse may not be true.
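A small example helps to see why the inclusion can be strict: a feature may shift the posterior without ever flipping the Bayes decision. The sketch below uses a made-up distribution over two independent binary features; all names (`P_X`, `P_Y1`, `g_star`, `strongly_relevant`, `bayes_relevant`) are assumptions for illustration.

```python
# Two independent, uniform binary features; p(Y=1 | x0, x1) is made up.
P_Y1 = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.9}
P_X = {x: 0.25 for x in P_Y1}
FEATURES = (0, 1)

def p_y1_given(keep, x):
    """p(Y=1 | X_keep = x_keep) for the feature indices in `keep`."""
    same = [xx for xx in P_X if all(xx[i] == x[i] for i in keep)]
    return sum(P_X[xx] * P_Y1[xx] for xx in same) / sum(P_X[xx] for xx in same)

def g_star(keep, x):
    """Bayes classifier restricted to the features in `keep`."""
    return int(p_y1_given(keep, x) > 0.5)

def strongly_relevant(z, tol=1e-12):
    """True iff dropping feature z changes the posterior somewhere."""
    rest = [i for i in FEATURES if i != z]
    # Every x has positive mass, so "somewhere" means "with probability > 0".
    return any(abs(p_y1_given(FEATURES, x) - p_y1_given(rest, x)) > tol for x in P_X)

def bayes_relevant(z):
    """True iff dropping feature z changes the Bayes decision somewhere."""
    rest = [i for i in FEATURES if i != z]
    return any(g_star(FEATURES, x) != g_star(rest, x) for x in P_X)

S_M = [z for z in FEATURES if strongly_relevant(z)]
S_star = [z for z in FEATURES if bayes_relevant(z)]
```

In this toy case both features are strongly relevant, but only X0 is Bayes relevant: X1 moves the posterior between 0.1 and 0.2 (or 0.8 and 0.9) and never crosses 0.5.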
Algorithm for S* • Polynomial in the number of features if ĉ is so (e.g., empirical risk of the k-NN classifier on some testing data).
UCI data sets
Problem III: All relevant features • Z is weakly relevant iff p(p(Y|X) = p(Y|X\Z)) = 1 but p(p(Y|S) ≠ p(Y|S,Z)) > 0 for some S ⊆ X\Z. • The set of relevant features, SA, is the set of strongly and weakly relevant features.
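Weak relevance is easiest to see with a redundant feature. In the sketch below, which uses a made-up distribution and hypothetical helper names, X1 is a noisy copy of X0 and Y depends on X0 only: X1 adds nothing once X0 is known, yet it is informative about Y on its own (take S to be the empty set).

```python
from itertools import combinations

# X0 is a uniform bit, X1 agrees with X0 with probability 0.9,
# and Y depends on X0 only. All numbers are made up for illustration.
P_X = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
P_Y1 = {x: 0.2 + 0.6 * x[0] for x in P_X}  # p(Y=1 | x)
FEATURES = (0, 1)

def p_y1_given(keep, x):
    """p(Y=1 | X_keep = x_keep) for the feature indices in `keep`."""
    same = [xx for xx in P_X if all(xx[i] == x[i] for i in keep)]
    return sum(P_X[xx] * P_Y1[xx] for xx in same) / sum(P_X[xx] for xx in same)

def posterior_changes(small, large, tol=1e-12):
    """True iff p( p(Y|X_small) != p(Y|X_large) ) > 0 (every x has mass > 0)."""
    return any(abs(p_y1_given(small, x) - p_y1_given(large, x)) > tol for x in P_X)

def strongly_relevant(z):
    rest = tuple(i for i in FEATURES if i != z)
    return posterior_changes(rest, FEATURES)

def weakly_relevant(z):
    """Not strongly relevant, but informative given some subset S of the rest."""
    if strongly_relevant(z):
        return False
    rest = [i for i in FEATURES if i != z]
    subsets = (s for r in range(len(rest) + 1) for s in combinations(rest, r))
    return any(posterior_changes(s, s + (z,)) for s in subsets)

S_A = [z for z in FEATURES if strongly_relevant(z) or weakly_relevant(z)]
```

The subset enumeration in `weakly_relevant` is exponential in the number of features, so this brute-force check is only meant to illustrate the definition on tiny examples.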
Why is this important?
Algorithm for SA • There exists f(X,Y) > 0 such that searching for SA implies an exhaustive search. • RIT is consistent under the following assumptions: • strict positivity (f(X) > 0), • composition (X ╨ Y | Z ∧ X ╨ W | Z → X ╨ YW | Z), and • weak transitivity (X ╨ Y | Z ∧ X ╨ Y | ZV → X ╨ V | Z ∨ V ╨ Y | Z). • These assumptions are satisfied by: • Gaussian distributions. • Distributions perfect to some graph. • They are closed under marginalization and conditioning*. • RIT performs at most |SA||X| tests (|SA| < |X|).
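The slides do not spell out the RIT procedure in text, so the following is only a sketch of one plausible reading: a recursion that uses nothing but marginal (unconditional) independence tests, first collecting the features dependent on the target and then recursing from each of them on the remaining features. The function `rit`, the `dependent` callback and the toy oracle are all assumptions introduced here; a real application would replace the oracle with a statistical test.

```python
def rit(target, candidates, dependent):
    """Return the candidates reachable from `target` through chains of
    marginal dependence. `dependent(a, b)` is a user-supplied test."""
    found = [x for x in candidates if dependent(target, x)]
    rest = [x for x in candidates if x not in found]
    result = list(found)
    for x in found:
        # Recurse only on features not selected so far.
        extra = rit(x, rest, dependent)
        result.extend(extra)
        rest = [r for r in rest if r not in extra]
    return result

# Hypothetical oracle: Y and A are dependent, A and B are dependent,
# everything else is marginally independent (as in a structure Y -> A <- B).
DEPENDENT_PAIRS = {frozenset(("Y", "A")), frozenset(("A", "B"))}
calls = []

def oracle(a, b):
    calls.append((a, b))  # record every test so they can be counted
    return frozenset((a, b)) in DEPENDENT_PAIRS

relevant = rit("Y", ["A", "B", "C"], oracle)
```

In this toy run B is never marginally dependent on Y, but it is picked up through A, and the number of recorded tests stays within the |SA||X| bound quoted above (2 relevant features, 3 features in total).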
Algorithm for SA
Algorithm for SA with FDR control
Diabetes data • Data from Gunton et al. (2005) Cell, 122. • 7 normal vs. 15 type 2 diabetic patients, and 5000 genes kept after filtering out those with low variance. • 3 genes are univariately differentially expressed: Arnt, Cdc14a and Ddx3Y (370 if no control for multiplicity). • Dopey1 was recently shown to be active in the vesicle traffic system, the mechanism that delivers insulin receptors to the cell surface. • 4 genes encoded TFs, which is intriguing since a large fraction of previously discovered diabetes-related genes are TFs. • So does Ddx3Y (only 6 genes annotated with this function).
Summary • Problem I: Posterior distribution. • Solution: Markov boundary. • Peña, J. M., Nilsson, R., Björkegren, J. and Tegnér, J. (2007). Towards Scalable and Data Efficient Learning of Markov Boundaries. International Journal of Approximate Reasoning, 45(2), 211-232. • Problem II: Class label. • Solution: Bayes relevant features. • Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research, 8, 589-612. • Problem III: All relevant features. • Solution: RIT algorithm. • Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Detecting Multivariate Differentially Expressed Genes. BMC Bioinformatics, 8:150.