Predicting gene functions using hierarchical multi-label decision tree ensembles
Celine Vens, Leander Schietgat, Jan Struyf, Hendrik Blockeel, Dragi Kocev, Sašo Džeroski
K.U.Leuven, Department of Computer Science
Hierarchical Multi-Label Classification (HMC) for Gene Function Prediction
• Classification: a common machine learning task, e.g.,
  • Given: genes with known function
  • Task: predict the function of new genes
• Special case: hierarchical multi-label classification (HMC)
  • a gene can have multiple functions
  • functions are organized in a hierarchy
    • tree (e.g., MIPS FunCat)
    • DAG (e.g., Gene Ontology)
• Hierarchy constraint: if a gene is labeled with function X, then it is also labeled with all parents of X (see the sketch below)
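The hierarchy constraint can be made concrete with a minimal sketch that closes a predicted label set under its ancestors. The FunCat-like codes and the parent map below are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of the hierarchy constraint: any predicted class implies all of
# its ancestors. The FunCat-like codes and the parent map are hypothetical.

def ancestors(label, parent):
    """Return all ancestors of `label` under a tree-shaped hierarchy."""
    result = set()
    while label in parent:
        label = parent[label]
        result.add(label)
    return result

def enforce_hierarchy(predicted, parent):
    """Close a predicted label set under the hierarchy constraint."""
    closed = set(predicted)
    for label in predicted:
        closed |= ancestors(label, parent)
    return closed

# Hypothetical FunCat-like tree: "01.01.03" is a child of "01.01", and so on.
parent = {"01.01": "01", "01.01.03": "01.01", "02.10": "02"}
print(enforce_hierarchy({"01.01.03"}, parent))  # {'01.01.03', '01.01', '01'}
```

For a DAG-shaped hierarchy such as the Gene Ontology, the map would return a set of parents per class and the ancestor walk would follow all of them.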
Predictions in Functional Genomics
• S. cerevisiae (13 datasets) and A. thaliana (12 datasets)
  • two of biology's model organisms
  • most genes are annotated, which makes them ideal for testing purposes
  • the method can be applied to other organisms
• Data: based on sequence statistics, phenotype, secondary structure, homology, microarray data, …
Predictive Clustering Trees
• Our focus is on decision trees
• Advantages: fast to build, fast to apply, noise-resistant, accurate predictions, easy to interpret, …
• General framework: predictive clustering trees (PCTs), built by top-down induction (a splitting sketch follows below)
• [Figure: input — a table of genes (G1, G2, …) with features A1, …, An and known function annotations; algorithm — top-down induction of PCTs; output — a PCT]
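As a rough illustration of top-down PCT induction for HMC, the sketch below scores a candidate split by variance reduction, where the variance of a set of genes is the mean weighted squared Euclidean distance between their label vectors and the mean label vector, with class weights decreasing with hierarchy depth (w(c) = w0^depth(c), following the Clus-HMC formulation). The toy labels, depths, w0 value, and candidate split are assumptions for illustration only.

```python
import numpy as np

def hmc_variance(labels, weights):
    """Mean weighted squared Euclidean distance of label vectors to their mean."""
    mean = labels.mean(axis=0)
    return float(np.mean(((labels - mean) ** 2 * weights).sum(axis=1)))

def variance_reduction(labels, mask, weights):
    """Score a candidate binary split: variance before minus weighted variance after."""
    left, right = labels[mask], labels[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(labels)
    return hmc_variance(labels, weights) - (
        len(left) / n * hmc_variance(left, weights)
        + len(right) / n * hmc_variance(right, weights)
    )

# Toy data: 4 genes, 3 hierarchical classes at depths 0, 1, 2; w0 = 0.75.
weights = np.array([0.75 ** d for d in (0, 1, 2)])
labels = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 0]], dtype=float)
split = np.array([True, True, False, False])  # hypothetical test on a feature
print(variance_reduction(labels, split, weights))
```

The split with the largest variance reduction is chosen at each node, and the procedure recurses on the resulting subsets.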
Decision Trees for HMC: Different Approaches
• Our approach: learns one single tree for all classes
• Special-purpose approach: learns one tree per class + hierarchy constraint
• Standard approach: learns one tree per class
Predictive Clustering Forests
• Ensembles
  • less interpretability
  • better performance
• Algorithm: Clus-HMC-Ens (a bagging sketch follows below)
• [Figure: the training set is resampled into 50 bootstrap replicates; Clus-HMC builds one PCT per replicate (50 PCTs); on the test set, the 50 predictions are combined into a single prediction]
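A minimal sketch of the bagging scheme behind Clus-HMC-Ens: draw 50 bootstrap replicates of the training set, grow one tree per replicate, and average the per-class predictions. Since Clus-HMC itself is not available as a Python library, scikit-learn's multi-output DecisionTreeRegressor merely stands in for it here, and the toy data is assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_hmc(X_train, Y_train, X_test, n_trees=50, seed=0):
    """Average the label-vector predictions of n_trees trees, each trained on a
    bootstrap replicate of the training set (stand-in for Clus-HMC-Ens)."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((X_test.shape[0], Y_train.shape[1]))
    for _ in range(n_trees):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap replicate
        tree = DecisionTreeRegressor().fit(X_train[idx], Y_train[idx])
        votes += tree.predict(X_test)
    return votes / n_trees  # per-(gene, class) scores in [0, 1]

# Toy usage with random features and a 3-class 0/1 label matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
Y = rng.integers(0, 2, size=(20, 3)).astype(float)
print(bagged_hmc(X[:15], Y[:15], X[15:]).round(2))
```

Averaging 50 label-vector predictions yields a score per (gene, class) couple, which is what the precision-recall evaluation below thresholds.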
Decision Trees for HMC: Different Approaches
• Our approach: learns one single tree for all classes
• Variant of our approach: learns a forest
• Special-purpose approach: learns one tree per class + hierarchy constraint
• Standard approach: learns one tree per class
Evaluation
• Evaluation measure: precision-recall
  • precision: percentage of predicted functions that are correct (TP / (TP + FP))
  • recall: percentage of actual functions that are predicted by the algorithm (TP / (TP + FN))
• Average PR curve
  • consider (instance, class) couples
  • a couple is predicted if the instance is predicted to have the class, and true if the instance actually has the class (see the sketch below)
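A small sketch of the pooled precision-recall computation over (instance, class) couples: at a given threshold, a couple counts as predicted if its score reaches the threshold and as true if the instance actually has the class; sweeping the threshold traces the average PR curve. The scores and labels below are toy values.

```python
import numpy as np

def precision_recall(scores, labels, threshold):
    """Precision and recall over pooled (instance, class) couples at one threshold."""
    predicted = scores >= threshold                 # predicted couples
    tp = np.logical_and(predicted, labels).sum()    # predicted and true
    fp = np.logical_and(predicted, ~labels).sum()   # predicted but not true
    fn = np.logical_and(~predicted, labels).sum()   # true but not predicted
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = np.array([[0.9, 0.2], [0.6, 0.7], [0.1, 0.4]])   # 3 genes x 2 classes
labels = np.array([[True, False], [True, True], [False, False]])
for t in (0.3, 0.5, 0.8):
    print(t, precision_recall(scores, labels, t))
```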
• [PR curves: S. cerevisiae-FunCat (hom), A. thaliana-GO (seq), S. cerevisiae-FunCat (expr), A. thaliana-GO (interpro)]
• Clus-HMC-Ens is better than Clus-HMC (average AUC improvement of 7%)
• Clus-HMC is better than C4.5H, a state-of-the-art system for HMC (at the same recall as C4.5H, average precision improvement of 20.9%)
Comparison with SVMs (Barutcuoglu et al.)
• learn one SVM per class
• correct for hierarchy constraint violations with a Bayesian model
Conclusions
• Clus-HMC outperforms (or is comparable to) state-of-the-art methods on functional genomics tasks
• Ensembles of Clus-HMC boost performance further, if the user is willing to give up some interpretability
• "Revenge of the decision trees"