Hierarchical multilabel classification trees for gene function prediction. Leander Schietgat, Hendrik Blockeel, Jan Struyf, Katholieke Universiteit Leuven (Belgium); Amanda Clare, University of Aberystwyth (Wales); Sašo Džeroski, Jožef Stefan Institute, Ljubljana (Slovenia).
Hierarchical multilabel classification trees for gene function prediction Leander Schietgat Hendrik Blockeel Jan Struyf Katholieke Universiteit Leuven (Belgium) Amanda Clare University of Aberystwyth (Wales) Sašo Džeroski Jožef Stefan Institute Ljubljana (Slovenia) Probabilistic Modeling and Machine Learning in Structural and Systems Biology Tuusula, Finland, 17-18 June 2006
Overview • The application • gene function prediction • The machine learning context • hierarchical multilabel classification • Decision trees for HMC • the algorithm: Clus-HMC • Experimental results • Conclusions PMSB 2006 2/21
Gene Function Prediction • Task • Given a data set with descriptions of genes and the functions they have • Learn a model that can predict for a new gene what functions it performs • Genes can have multiple functions • These functions are hierarchically organised [Figure: example class hierarchy with classes c1, c2, c3, c21, c22]
Machine Learning • Classifier • predicts for unseen instances the class to which they belong • learned from already classified training examples • Different techniques • decision trees • support vector machines • Bayesian networks • …
Hierarchical Multilabel Classification • Normal classification setting • only predicts a single class • HMC • predict multiple classes at once • classes are organized in a hierarchy • Hierarchy constraint • instances of a class must be instances of its superclasses
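The hierarchy constraint can be illustrated with a small sketch. It assumes classes are written as slash-separated codes in the style of the FunCat examples in this deck (e.g. 40/3 is a subclass of 40); the function names are hypothetical and are not part of Clus.

```python
def ancestors(label):
    """Return all proper superclasses of a slash-separated class code.

    Example: "40/3" has the single superclass "40".
    """
    parts = label.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def close_under_hierarchy(labels):
    """Add every superclass so that the label set respects the hierarchy."""
    closed = set(labels)
    for label in labels:
        closed.update(ancestors(label))
    return closed


def satisfies_hierarchy_constraint(labels):
    """True if every class in the set is accompanied by all its superclasses."""
    return close_under_hierarchy(labels) == set(labels)
```

For instance, the label set {5, 5/1} satisfies the constraint, while {5/1} alone does not, because its superclass 5 is missing.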
Two HMC approaches 1. Learn a model for each class and combine the predictions • Advantage • a lot of machine learning algorithms available • Disadvantages • efficiency: one model has to be learned per class • skewed class distributions • hierarchy constraint is not automatically imposed [Figure: separate models m1, m2, …, mn, each answering one question: c1?, c2?, …, cn?]
Two HMC approaches (cont'd) 2. Learn a single model that predicts all the classes together • Advantages • faster to learn • easier to interpret • hierarchy constraint automatically imposed • selection of features relevant for all classes • Disadvantage • may have worse predictive performance [Figure: a single model M predicting the class vector [c1, c2, …, cn]]
Related work on HMC • Barutcuoglu et al. (2006) • learn classes separately with SVMs and combine the predictions with Naïve Bayes • Clare (2003) • extension of the C4.5 decision tree method that learns all classes together • A lot of work in the area of text classification • Rousu et al. (2005) give an overview of SVM methods that learn a single model for all classes
Why decision trees? • fast to build • fast to use • accurate predictions • easy to interpret
[Figure: training examples and an example decision tree with the tests "Nitrogen depletion <= -2.74?" and "Heat shock > 1.28?", each with yes/no branches; leaves are labelled Positive (+) or Negative (-)]
Training examples:
Gene  ND  HS  …  MF?
G1    25  29  …  -
G2    32  40  …  +
G3    19  0   …  -
G4    44  45  …  +
…     …   …   …  …
Decision trees for HMC • The Clus system • created by Jan Struyf • propositional DT learner, implemented in Java • uses ideas of: • C4.5 [Quinlan93] and CART [Breiman84] • Predictive Clustering Trees [Blockeel98] • Heuristic for HMC • look for test that minimizes the intra-cluster variance (= generalisation of CART)
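A minimal sketch of the variance-based heuristic, not the Clus implementation. It assumes each example's classes are encoded as a 0/1 vector and that variance is the mean squared Euclidean distance to the mean vector; the slide does not specify the distance measure, so plain unweighted Euclidean distance is an assumption here, and all function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def variance(vectors):
    """Mean squared Euclidean distance of the vectors to their mean."""
    if not vectors:
        return 0.0
    mean = mean_vector(vectors)
    total = sum(sum((x - m) ** 2 for x, m in zip(v, mean)) for v in vectors)
    return total / len(vectors)


def split_score(values, labels, threshold):
    """Size-weighted intra-cluster variance after the test `value > threshold`."""
    yes = [lab for val, lab in zip(values, labels) if val > threshold]
    no = [lab for val, lab in zip(values, labels) if val <= threshold]
    return (len(yes) * variance(yes) + len(no) * variance(no)) / len(labels)


def best_threshold(values, labels):
    """Threshold on one numeric attribute with the lowest split score.

    Returns None when the attribute has a single distinct value.
    """
    candidates = sorted(set(values))[:-1]  # keeps both branches non-empty
    if not candidates:
        return None
    return min(candidates, key=lambda t: split_score(values, labels, t))
```

With attribute values [1.0, 2.0, 3.0, 4.0] and class vectors [[1, 0], [1, 0], [0, 1], [0, 1]], the threshold 2.0 separates the two label patterns perfectly and gives a split score of 0.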
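Decision trees for HMC (cont'd) • can be used for HMC (Clus-HMC) … • … as well as binary classification (Clus-SC ~ CART) [Figure: one Clus-HMC tree with label sets such as {c1}, {c1,c21,c22}, {c1,c2,c21}, {c1,c3}, {c2,c21,c22}, next to n separate Clus-SC trees 1, 2, …, n answering c1?, c2?, …, cn?]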
Experiments in yeast functional genomics • Saccharomyces cerevisiae or baker’s/brewer’s yeast • MIPS FunCat hierarchy • 250 functions of yeast genes • 12 datasets [Clare03] • Sequence structure (seq) • Phenotype growth (pheno) • Secondary structure (struc) • Homology search (hom) • Microarray data • cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)
Excerpt of the FunCat hierarchy:
1 METABOLISM
1/1 amino acid metabolism
1/2 nitrogen and sulfur metabolism
…
2 ENERGY
2/1 glycolysis and gluconeogenesis
…
Example run • each leaf contains multiple classes • which classes to predict? • problem: different class frequencies • use of threshold • precision-recall curves: independent of a specific threshold
[Figure: data table with a Name column, description attributes A1, A2, …, An and function columns 1, …, 5, 5/1, …, 40, 40/3, 40/16, …; genes G1 to G6 are marked with an x for each function they have. Next to it, a tree with the tests nitrogen_depletion > 5 and 37C_to_25C_shock > 1.28; label sets shown at the leaves include {1,5}, {5,5/1,40}, {5,5/1,40,40/3}, {40,40/16}, {40,40/3,40/16} and {1,5,5/1,3,3/5}, with the resulting predictions.]
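The thresholding step can be sketched as follows. This is an illustration under assumptions, not the Clus-HMC code: a leaf is represented by the label sets of the training genes that reach it, a class is predicted when its proportion in the leaf is at least the threshold, and the function names are hypothetical.

```python
def leaf_proportions(leaf_examples):
    """Map each class to the fraction of examples in the leaf that have it.

    leaf_examples: non-empty list of label sets, one per training gene.
    """
    counts = {}
    for labels in leaf_examples:
        for cls in labels:
            counts[cls] = counts.get(cls, 0) + 1
    n = len(leaf_examples)
    return {cls: count / n for cls, count in counts.items()}


def predict(leaf_examples, threshold):
    """Predict every class whose proportion in the leaf reaches the threshold."""
    proportions = leaf_proportions(leaf_examples)
    return {cls for cls, p in proportions.items() if p >= threshold}


# Hypothetical leaf holding three genes; the label sets are borrowed from the
# figure, but their grouping into one leaf is an assumption for illustration.
leaf = [{"5", "5/1", "40"}, {"5", "5/1", "40", "40/3"}, {"40", "40/16"}]
prediction_half = predict(leaf, 0.5)   # classes present in at least half the genes
prediction_strict = predict(leaf, 0.9)
```

Lowering the threshold predicts more classes (higher recall, lower precision), which is why a precision-recall curve is used instead of a single threshold. When every training label set respects the hierarchy, a subclass can never be more frequent in a leaf than its superclass, so thresholded predictions respect the hierarchy as well.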
Comparison of Clus-HMC with [Clare03] • Average precision-recall curves • PRECISION = proportion of (instance, class) predictions that are correct • RECALL = proportion of true (instance, class) cases that are predicted
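The two definitions can be made concrete with a short sketch that counts (instance, class) pairs. The dictionary layout and the function name are assumptions for illustration; returning 0.0 when a denominator is empty is an arbitrary convention chosen here.

```python
def precision_recall(predicted, actual):
    """Precision and recall over (instance, class) pairs.

    predicted, actual: dicts mapping a gene name to a set of classes.
    """
    predicted_pairs = {(g, c) for g, classes in predicted.items() for c in classes}
    true_pairs = {(g, c) for g, classes in actual.items() for c in classes}
    correct = len(predicted_pairs & true_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical predictions and true labels for two genes.
predicted = {"G1": {"5", "5/1"}, "G2": {"40"}}
actual = {"G1": {"5"}, "G2": {"40", "40/3"}}
precision, recall = precision_recall(predicted, actual)
```

Here two of the three predicted pairs are correct and two of the three true pairs are found, so both precision and recall are 2/3. Repeating the computation for a range of thresholds gives the points of a precision-recall curve.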
Extracting rules • e.g. predictions for class 40/3 in “gasch1” dataset
IF Nitrogen_Depletion_8_h <= -2.74
AND Nitrogen_Depletion_2_h > -1.94
AND 1point5_mM_diamide_5_min > -0.03
AND 1M_sorbitol___45_min_ > -0.36
AND 37C_to_25C_shock___60_min > 1.28
THEN 40,40/3
Precision: 0.97 Recall: 0.15
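As a sketch of how such a rule could be applied to a new gene, the conditions can be stored as data and checked one by one. The attribute names and thresholds are those of the rule above; representing a gene as a dictionary with every attribute present, and the function names, are assumptions.

```python
# Conditions of the extracted rule: (attribute, operator, threshold).
RULE_CONDITIONS = [
    ("Nitrogen_Depletion_8_h", "<=", -2.74),
    ("Nitrogen_Depletion_2_h", ">", -1.94),
    ("1point5_mM_diamide_5_min", ">", -0.03),
    ("1M_sorbitol___45_min_", ">", -0.36),
    ("37C_to_25C_shock___60_min", ">", 1.28),
]
RULE_PREDICTION = {"40", "40/3"}


def rule_fires(gene):
    """True if the gene's expression values satisfy every condition."""
    for attribute, operator, threshold in RULE_CONDITIONS:
        value = gene[attribute]  # assumes no missing values
        if operator == "<=" and not value <= threshold:
            return False
        if operator == ">" and not value > threshold:
            return False
    return True


def apply_rule(gene):
    """Return the predicted classes, or an empty set if the rule does not fire."""
    return set(RULE_PREDICTION) if rule_fires(gene) else set()
```

A rule with precision 0.97 and recall 0.15 is rarely wrong when it fires, but it covers only a small share of the genes that have the predicted functions.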
HMC vs. single classification • Tree sizes • on average • HMC tree: 24 nodes • SC tree: 33 nodes (250 such trees) • Time to grow trees • a single SC tree is grown faster than a single HMC tree • but 250 SC trees have to be built • HMC is on average 37 times faster • Predictive performance • next slide
HMC vs. single classification • Average precision-recall curves
Explanation of the results • The classes are not independent • different trees for different classes actually share structure • explains some complexity reduction achieved by Clus-HMC • one class carries information on other classes • this increases the signal-to-noise ratio • provides better guidance when learning the tree (explaining good predictive performance) • avoids overfitting (explaining further reduction of tree size) • this was confirmed empirically
Conclusions • HMC decision trees are a useful tool for gene function prediction • fast to learn • high interpretability • Compared to regular tree learning, HMC tree learning: • is even faster • yields trees that: • are smaller • are easier to interpret • have equal or better predictive performance
Further work • Comparison to other HMC learning algorithms • kernel methods studied by Rousu et al. and Barutcuoglu et al. • other suggestions are welcome! • Use more advanced hierarchy such as Gene Ontology • thousands of classes, spread over 19 levels • how to handle the part_of relationship? • if a function A is part-of a function B then does a gene with function A also have function B? • gene “has” function B X vs. gene “is involved” in function B
[Figure: excerpt of the Gene Ontology (GO) with its three branches cellular component, biological process and molecular function; terms shown include cell, cytosol, physiological process, amino acid metabolism, amino acid biosynthesis, branched chain family amino acid metabolism, branched chain family amino acid biosynthesis, leucine metabolism, leucine biosynthesis, catalytic activity and 3-isopropylmalate dehydratase activity]
Questions?