Decision trees for hierarchical multilabel classification. A case study in functional genomics. Work by Hendrik Blockeel, Leander Schietgat, Jan Struyf (Katholieke Universiteit Leuven, Belgium), Amanda Clare (University of Aberystwyth, Wales), Sašo Džeroski
Decision trees for hierarchical multilabel classification A case study in functional genomics
Work by • Hendrik Blockeel • Leander Schietgat • Jan Struyf Katholieke Universiteit Leuven (Belgium) • Amanda Clare University of Aberystwyth (Wales) • Sašo Džeroski Jozef Stefan Institute Ljubljana (Slovenia)
Overview • Hierarchical Multilabel Classification • task description • Predictive Clustering Trees for HMC • the algorithm: Clus-HMC • Evaluation on yeast datasets
Hierarchical multilabel classification (HMC) • Classification • predict class for unseen instances based on (classified) training examples • HMC • instance can belong to multiple classes • classes are organised in a hierarchy • Example • toy hierarchy: top-level classes 1, 2 and 3, with 2/1 and 2/2 as subclasses of 2; the number in parentheses is the class's position in the label vector: 1 (1), 2 (2), 2/1 (3), 2/2 (4), 3 (5) • Advantages • efficiency • skewed class distributions • hierarchical relationships
Predictive clustering trees • ~ decision trees [Blockeel et al. 1998] • each node (including leaves) is a cluster • tests in nodes are descriptions of clusters • Heuristic • minimise intra-cluster variance • maximise inter-cluster variance • Can be extended to perform HMC • distance measure d (quantifies similarity) • prediction function p (maps a cluster in a leaf onto a prediction)
Instantiating d • Class labels are represented in a vector with one component per class, in the order 1 (1), 2 (2), 2/1 (3), 2/2 (4), 3 (5) • $v_i = [1,1,0,1,0]$ • Distance between vectors is defined as the weighted Euclidean distance: • $d(x_1, x_2) = \sqrt{\sum_k w_k \, (v_{1,k} - v_{2,k})^2}$, with $w_k = w^{\mathrm{depth}(c_k)}$ • Example • $S_i = \{1, 2, 2/2\}$, $S_j = \{2\}$ • $d_{\mathrm{Eucl}}([1,1,0,1,0], [0,1,0,0,0]) = \sqrt{w + w^2}$
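The distance on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`depth`, `to_vector`, `weighted_distance`) and the value w = 0.75 are hypothetical choices, not taken from the slides.

```python
import math

# Toy hierarchy from the slide; list order fixes vector positions (1)..(5).
CLASSES = ["1", "2", "2/1", "2/2", "3"]

def depth(label):
    # "1" has depth 1, "2/2" has depth 2.
    return label.count("/") + 1

def to_vector(label_set):
    # Binary vector with a 1 for every class the instance belongs to.
    return [1 if c in label_set else 0 for c in CLASSES]

def weighted_distance(v1, v2, w=0.75):
    # w = 0.75 is an illustrative value (assumption), giving w_k = w ** depth(c_k).
    total = 0.0
    for c, a, b in zip(CLASSES, v1, v2):
        total += (w ** depth(c)) * (a - b) ** 2
    return math.sqrt(total)

s_i = to_vector({"1", "2", "2/2"})   # [1, 1, 0, 1, 0]
s_j = to_vector({"2"})               # [0, 1, 0, 0, 0]
d = weighted_distance(s_i, s_j)      # equals sqrt(w + w**2)
```

The two vectors differ in class 1 (depth 1) and class 2/2 (depth 2), so the sum under the root is w + w², matching the example on the slide. With w < 1, disagreements deeper in the hierarchy count less.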
Instantiating p • Each leaf contains multiple classes (organised in a hierarchy) • Which classes to predict? • binary classification: predict positive if the instance ends up in a leaf with at least 50% positives • multilabel classification: class distributions are skewed, so a fixed 50% threshold is too strict for rare classes • Threshold • an instance ending up in some leaf is predicted to belong to class ci if vi ≥ ti, with vi the proportion of instances in the leaf belonging to ci, and ti some threshold • by varying the threshold, we obtain different points on the precision-recall curve
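A minimal sketch of the thresholded prediction function, assuming a leaf is stored as the list of label vectors of its training instances. The function names and the toy leaf are made up for illustration.

```python
def leaf_proportions(leaf_vectors):
    # Proportion of instances in the leaf that belong to each class.
    n = len(leaf_vectors)
    return [sum(col) / n for col in zip(*leaf_vectors)]

def predict(proportions, thresholds):
    # Predict class c_i when its proportion reaches the threshold t_i.
    return [1 if v >= t else 0 for v, t in zip(proportions, thresholds)]

# Hypothetical leaf with four training instances (classes 1, 2, 2/1, 2/2, 3).
leaf = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0],
]
props = leaf_proportions(leaf)       # [0.25, 1.0, 0.25, 0.5, 0.0]
strict = predict(props, [0.5] * 5)   # [0, 1, 0, 1, 0]
loose = predict(props, [0.2] * 5)    # [1, 1, 1, 1, 0]
```

Lowering the threshold predicts more classes, which tends to raise recall at the cost of precision. Sweeping it traces out the precision-recall curve.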
Clus-HMC algorithm • Pseudo code (not reproduced here; the slide highlights the stopping criterion)
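Since the pseudo code itself is not available here, the following is a rough sketch of top-down induction with a variance-reduction heuristic. It is not the authors' Clus-HMC code. It assumes numeric attributes, tests of the form "attribute <= value", and variance defined as the mean squared weighted distance to the cluster mean. The minimum-reduction stopping rule is a simple stand-in for the stopping criterion on the slide, and all names are hypothetical.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def variance(vectors, weights):
    # Mean squared weighted Euclidean distance to the cluster mean (assumption).
    m = mean_vector(vectors)
    return sum(
        sum(w * (x - mu) ** 2 for w, x, mu in zip(weights, v, m))
        for v in vectors
    ) / len(vectors)

def best_split(X, Y, weights):
    # Try every "attribute <= value" test; keep the largest variance reduction.
    n = len(X)
    parent = variance(Y, weights)
    best = None  # (reduction, attribute index, threshold)
    for a in range(len(X[0])):
        for t in sorted({row[a] for row in X})[:-1]:
            left = [y for row, y in zip(X, Y) if row[a] <= t]
            right = [y for row, y in zip(X, Y) if row[a] > t]
            within = (len(left) * variance(left, weights)
                      + len(right) * variance(right, weights)) / n
            if best is None or parent - within > best[0]:
                best = (parent - within, a, t)
    return best

def induce(X, Y, weights, min_reduction=1e-9):
    split = best_split(X, Y, weights)
    # Stand-in stopping criterion: no test reduces the variance noticeably.
    if split is None or split[0] <= min_reduction:
        return {"leaf": mean_vector(Y)}
    _, a, t = split
    left = [i for i, row in enumerate(X) if row[a] <= t]
    right = [i for i, row in enumerate(X) if row[a] > t]
    return {
        "attr": a, "thr": t,
        "yes": induce([X[i] for i in left], [Y[i] for i in left], weights, min_reduction),
        "no": induce([X[i] for i in right], [Y[i] for i in right], weights, min_reduction),
    }

# Toy data: one attribute, label vectors over classes 1, 2, 2/1, 2/2, 3.
X = [[0.1], [0.2], [0.8], [0.9]]
Y = [[1, 1, 0, 1, 0], [1, 1, 0, 1, 0], [0, 1, 0, 0, 0], [0, 1, 0, 0, 0]]
weights = [0.75, 0.75, 0.5625, 0.5625, 0.75]  # w ** depth with w = 0.75 (assumed)
tree = induce(X, Y, weights)
```

On this toy data the single split at 0.2 separates the two label patterns perfectly, so both children become leaves. Each leaf stores the per-class proportions, which a thresholded prediction function can then turn into class predictions.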
Experiments in yeast functional genomics • Saccharomyces cerevisiae or baker's/brewer's yeast • MIPS FunCat hierarchy • function of yeast genes • excerpt: 1 METABOLISM • 1/1 amino acid metabolism • 1/2 nitrogen and sulfur metabolism • … • 2 ENERGY • 2/1 glycolysis and gluconeogenesis • … • 12 data sets [Clare 2003] • Sequence structure (seq) • Phenotype growth (pheno) • Secondary structure (struc) • Homology search (hom) • Microarray data: cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)
Experimental evaluation • Objectives • Comparison with C4.5H [Clare 2003] • Evaluation of the improvement obtainable with HMC trees over single classification trees • Evaluation with precision-recall curves • precision = TP / Yes = TP / (TP+FP), where Yes is the number of instances predicted positive • recall = TP / + = TP / (TP+FN), where + is the number of actually positive instances • advantages
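The two measures can be computed for a single class as in the sketch below. The function name and the toy labels are illustrative, and the values returned when a denominator is zero are an assumed convention, not something the slides specify.

```python
def precision_recall(true_labels, predicted_labels):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    # Zero-denominator conventions below are assumptions.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for one class over eight instances.
true = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 1, 0, 0, 0]
p, r = precision_recall(true, pred)   # TP = 2, FP = 1, FN = 2
```

Here precision is 2/3 and recall is 1/2. Neither measure counts true negatives, so the many instances that do not belong to a rare class do not inflate the score.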
Comparison with C4.5H • C4.5H = hierarchical multilabel extension of C4.5 [Clare 2003] • Designed by Amanda Clare • Heuristic: information gain • adaptation of entropy (sum over all classes) • Prediction: most frequent set of classes + significance test • Clus-HMC method • Tuning: try different F-test settings on validation data and choose the one with the highest AUPRC (area under the precision-recall curve)
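The tuning step can be sketched as follows: sweep the prediction threshold to get precision-recall points, approximate the area under the curve, and keep the setting with the largest area on validation data. The trapezoidal approximation, the candidate names and the scores are all assumptions for illustration. The slides do not say how the area is computed.

```python
def pr_points(true_labels, scores):
    # One (recall, precision) point per distinct score used as threshold.
    # Assumes at least one positive instance.
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(true_labels, scores) if y and s >= t)
        fp = sum(1 for y, s in zip(true_labels, scores) if not y and s >= t)
        fn = sum(1 for y, s in zip(true_labels, scores) if y and s < t)
        points.append((tp / (tp + fn), tp / (tp + fp)))
    return points

def auprc(points):
    # Trapezoidal rule over recall; a rough approximation (assumption).
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

def pick_best(candidates, true_labels):
    # candidates maps a setting name to its predicted scores on validation data.
    return max(candidates,
               key=lambda name: auprc(pr_points(true_labels, candidates[name])))

# Hypothetical validation labels and scores for two candidate settings.
validation = [1, 1, 0, 1, 0, 0]
candidates = {
    "setting_a": [0.9, 0.8, 0.7, 0.6, 0.3, 0.1],
    "setting_b": [0.2, 0.9, 0.8, 0.1, 0.7, 0.6],
}
area_a = auprc(pr_points(validation, candidates["setting_a"]))
best = pick_best(candidates, validation)
```

In this toy case setting_a ranks the positives higher and therefore wins. Linear interpolation between precision-recall points can be optimistic, so the area should be read as a comparison device rather than an exact value.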
Comparison between Clus-HMC and C4.5H • Average case
Comparison between Clus-HMC and C4.5H • Specific classes (plot divided into quadrants I to IV) • 25 wins (II), 6 losses (IV)
Comparing rules • e.g. predictions for class 40/3 in "gasch1" data set
• C4.5H: two rules
IF 29C_Plus1M_sorbitol_to_33C_Plus_1M_sorbitol___15_minutes <= 0.03 AND constant_0point32_mM_H202_20_min_redo <= 0.72 AND 1point5_mM_diamide_60_min <= -0.17 AND steady_state_1M_sorbitol > -0.37 AND DBYmsn2_4__37degree_heat___20_min <= -0.67 THEN 40/3
Precision: 0.52 Recall: 0.26
IF Heat_Shock_10_minutes_hs_1 <= 1.82 AND Heat_Shock_030inutes__hs_2 <= -0.48 AND 29C_Plus1M_sorbitol_to_33C_Plus_1M_sorbitol___5_minutes > -0.1 THEN 40/3
Precision: 0.56 Recall: 0.18
• Clus-HMC (most precise rule)
IF Nitrogen_Depletion_8_h <= -2.74 AND Nitrogen_Depletion_2_h > -1.94 AND 1point5_mM_diamide_5_min > -0.03 AND 1M_sorbitol___45_min_ > -0.36 AND 37C_to_25C_shock___60_min > 1.28 THEN 40/3
Precision: 0.97 Recall: 0.15
HMC vs. single classification • Method • Average case
HMC vs. single classification • Specific classes • numbers are AUPRC(Clus-HMC) - AUPRC(Clus-SC), so positive values favour HMC • HMC performs better!
Conclusions • Use of precision-recall curves to optimize the learned models and to evaluate the results • Improvement over C4.5H • HMC compared to SC • Comparable predictive performance • Faster • Easier to interpret
References • Hendrik Blockeel, Luc De Raedt, Jan Ramon, Top-down induction of clustering trees (1998) • Amanda Clare, Machine learning and data mining for yeast functional genomics, Doctoral dissertation (2003) • Jan Struyf, Sašo Džeroski, Hendrik Blockeel, Amanda Clare, Hierarchical multi-classification with predictive clustering trees in functional genomics (2005)