
Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics

Presentation Transcript


  1. Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics. Hendrik Blockeel (1), Leander Schietgat (1), Jan Struyf (1,2), Saso Dzeroski (3), Amanda Clare (4). (1) Katholieke Universiteit Leuven, (2) University of Wisconsin, Madison, (3) Jozef Stefan Institute, Ljubljana, (4) University of Wales, Aberystwyth

  2. Overview • The task: Hierarchical multilabel classification (HMC) • Applied to functional genomics • Decision trees for HMC • Multiple prediction with decision trees • HMC decision trees • Experiments • How does HMC tree learning compare to learning multiple standard trees? • Conclusions

  3. Classification settings • Normally, in classification, we assign one class label ci from a set C = {c1, …, ck} to each example • In multilabel classification, we have to assign a subset S ⊆ C to each example • i.e., one example can belong to multiple classes • Some applications: • Text classification: assign subjects (newsgroups) to texts • Functional genomics: assign functions to genes • In hierarchical multilabel classification (HMC), the classes C form a hierarchy (C, ≤) • The partial order ≤ expresses “is a superclass of”
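To make the setting concrete, here is a minimal Python sketch (not from the slides) of one way to represent a class hierarchy and multilabel annotations; the class names, the parent map and the example genes are invented for illustration.

```python
# Hypothetical illustration of the HMC setting: a small class hierarchy
# given as a parent map, plus examples annotated with *sets* of classes.
parents = {
    "1": None, "2": None,      # top-level classes
    "1/1": "1", "1/2": "1",    # subclasses of class "1"
    "2/1": "2",                # subclass of class "2"
}

def superclasses(c):
    """All ancestors of class c in the hierarchy (excluding c itself)."""
    result = []
    while parents[c] is not None:
        c = parents[c]
        result.append(c)
    return result

# Multilabel annotations: each example carries a subset S of the classes C.
labels = {"G1": {"1", "1/2"}, "G2": {"1", "1/1", "2", "2/1"}}

print(superclasses("1/2"))  # ['1'] -- class "1" is a superclass of "1/2"
```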

  4. Hierarchical multilabel classification • Hierarchy constraint: • ci ≤ cj ⟹ coverage(cj) ⊆ coverage(ci) • Elements of a class must be elements of its superclasses • Should hold for the given data as well as for the predictions • Straightforward way to learn an HMC model: • Learn k binary classifiers, one for each class • Disadvantages: • 1. difficult to guarantee the hierarchy constraint • 2. skewed class distributions (few positives, many negatives) • 3. relatively slow • 4. no simple interpretable model • Alternative: learn one classifier that predicts a vector of classes • Quite natural for, e.g., neural networks • We will do this with (interpretable) decision trees
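A hedged sketch of how the hierarchy constraint could be checked on a label set, using the same kind of parent map as in the previous sketch (the helper name is invented):

```python
def satisfies_hierarchy_constraint(label_set, parents):
    """True iff every labelled class also carries all of its superclasses.
    `parents` maps each class to its direct superclass (None at the top)."""
    for c in label_set:
        p = parents[c]
        while p is not None:
            if p not in label_set:
                return False
            p = parents[p]
    return True

# With the hypothetical hierarchy from the previous sketch:
#   satisfies_hierarchy_constraint({"1", "1/2"}, parents)  -> True
#   satisfies_hierarchy_constraint({"1/2"}, parents)       -> False ("1" is missing)
```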

  5. Goal of this work • There has been work on extending decision tree learning to the HMC case • Multiple prediction trees: Blockeel et al., ICML 1998; Clare and King, ECML 2001; … • HMC trees: Blockeel et al., 2002; Clare, 2003; Struyf et al., 2005 • HMC trees were evaluated in functional genomics, with good results (a proof of concept) • But: no comparison with learning multiple single-classification trees has been made • Size of trees, predictive accuracy, runtimes… • Previous work focused on the knowledge discovery aspect • We compare both approaches for functional genomics

  6. Functional genomics • Task: given a data set with descriptions of genes and the functions they have, learn a model that can predict for a new gene which functions it performs • A gene can have multiple functions (out of 250 possible functions, in our case) • Could be done with decision trees, with all the advantages that brings (fast, interpretable)… But: • Decision trees predict only one class, not a set of classes • Should we learn a separate tree for each function? • 250 functions = 250 trees: not so fast and interpretable anymore! [Table: each gene G1, G2, G3, … is described by attributes A1, A2, …, An and marked with the subset of the 250 functions it performs]

  7. Multiple prediction trees • A multiple prediction tree (MPT) makes multiple predictions at once • Basic idea (Blockeel, De Raedt, Ramon, 1998): • A decision tree learner prefers tests that yield much information on the “class” attribute (measured using information gain (C4.5) or variance reduction (CART)) • An MPT learner prefers tests that reduce the variance of all target variables together • Variance = mean squared distance of the vectors to the mean vector, in k-dimensional space [Figure: a multiple prediction tree over the gene table; each leaf predicts a set of functions, e.g. {4, 12, 105, 250} or {1, 5}]
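The variance used to score a split can be written down directly from that definition. A minimal sketch (NumPy; the helper name is invented):

```python
import numpy as np

def target_variance(Y):
    """Variance of a set of target vectors: the mean squared Euclidean
    distance of the vectors to their mean vector (in k-dimensional space)."""
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    return float(((Y - mean) ** 2).sum(axis=1).mean())

# Toy class vectors (invented): three examples, four target classes.
print(target_variance([[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]))
```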

  8. The algorithm
Procedure MPTree(T) returns tree
  (t*, h*, P*) = (none, ∞, ∅)
  for each possible test t
    P = partition induced by t on T
    h = Σ_{Tk ∈ P} |Tk| / |T| · Var(Tk)
    if (h < h*) and acceptable(t, P)
      (t*, h*, P*) = (t, h, P)
  if t* ≠ none
    for each Tk ∈ P*: tree_k = MPTree(Tk)
    return node(t*, ∪_k {tree_k})
  else
    return leaf(v)
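For illustration, a compact Python sketch of the MPTree procedure above, re-using the target_variance helper from the previous sketch. The numeric "attribute <= threshold" tests, the min_leaf stopping rule standing in for acceptable(t, P), and the use of the node's own variance as the initial heuristic value are simplifying assumptions, not the actual Clus implementation.

```python
import numpy as np

def mp_tree(X, Y, min_leaf=5):
    """Greedy multiple-prediction tree: choose the split that minimises the
    size-weighted sum of target variances, then recurse on the partition.
    X: (n, d) feature matrix, Y: (n, k) 0/1 class-vector matrix."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n = len(Y)
    best_test, best_h = None, target_variance(Y)  # accept only variance-reducing splits
    for a in range(X.shape[1]):
        for thr in np.unique(X[:, a]):
            left = X[:, a] <= thr
            nl = int(left.sum())
            if nl < min_leaf or n - nl < min_leaf:
                continue                           # stand-in for acceptable(t, P)
            h = nl / n * target_variance(Y[left]) + (n - nl) / n * target_variance(Y[~left])
            if h < best_h:
                best_test, best_h = (a, thr), h
    if best_test is None:
        return {"leaf": Y.mean(axis=0)}            # leaf predicts the mean class vector
    a, thr = best_test
    left = X[:, a] <= thr
    return {"test": (a, thr),
            "left": mp_tree(X[left], Y[left], min_leaf),
            "right": mp_tree(X[~left], Y[~left], min_leaf)}
```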

  9. HMC tree learning • A special case of MPT learning • The class vector contains all classes in the hierarchy • Main characteristics: • Errors higher up in the hierarchy are more important • Use a weighted Euclidean distance (higher weight for higher classes) • Need to ensure the hierarchy constraint • Normally, a leaf predicts ci iff the proportion of ci examples in the leaf is above some threshold ti (often 0.5) • We will let ti vary (see further) • To ensure compliance with the hierarchy constraint: • ci ≤ cj ⟹ ti ≤ tj • Automatically fulfilled if all ti are equal
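Two small sketches of these ingredients, under the same invented representation as before: a depth-based class weight for the weighted Euclidean distance (w0 = 0.5 matches the example on the next slide; the base is a parameter), and a thresholded leaf prediction that respects the hierarchy whenever t(superclass) <= t(subclass).

```python
def class_weight(c, parents, w0=0.5):
    """Weight of class c in the weighted Euclidean distance: w0 ** depth(c),
    so differences in classes higher up the hierarchy cost more."""
    depth = 0
    while parents[c] is not None:
        c = parents[c]
        depth += 1
    return w0 ** depth

def predict_classes(leaf_proportions, thresholds):
    """Predict class c iff the proportion of c-examples in the leaf reaches
    its threshold t_c; with t(superclass) <= t(subclass) (e.g. all equal),
    the predicted set automatically satisfies the hierarchy constraint."""
    return {c for c, p in leaf_proportions.items() if p >= thresholds[c]}
```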

  10. Example • Weights: 1 for the top-level classes c1, c2, c3 and 0.5 for the second-level classes c4, c5, c6, c7 • x1: {c1, c3, c5} = [1,0,1,0,1,0,0] • x2: {c1, c3, c7} = [1,0,1,0,0,0,1] • x3: {c1, c2, c5} = [1,1,0,0,1,0,0] • d²(x1, x2) = 0.25 + 0.25 = 0.5 (x1 and x2 differ only in the low-weight classes c5 and c7) • d²(x1, x3) = 1 + 1 = 2 (they differ in the high-weight classes c2 and c3) • So x1 is more similar to x2 than to x3 • The decision tree tries to create leaves with “similar” examples, i.e., leaves that are relatively pure w.r.t. the class sets [Figure: the class hierarchy with the classes of x1, x2 and x3 highlighted]
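The two distances on this slide can be reproduced with a few lines of NumPy (a sketch; the weight vector follows the slide, with the per-class weight applied inside the square so that the numbers match):

```python
import numpy as np

# Weights: 1 for the top-level classes c1-c3, 0.5 for the subclasses c4-c7.
w = np.array([1, 1, 1, 0.5, 0.5, 0.5, 0.5])

def d2(a, b):
    """Squared weighted Euclidean distance between two class vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(((w * (a - b)) ** 2).sum())

x1 = [1, 0, 1, 0, 1, 0, 0]   # {c1, c3, c5}
x2 = [1, 0, 1, 0, 0, 0, 1]   # {c1, c3, c7}
x3 = [1, 1, 0, 0, 1, 0, 0]   # {c1, c2, c5}

print(d2(x1, x2))   # 0.5 -- differs only in the low-weight classes c5, c7
print(d2(x1, x3))   # 2.0 -- differs in the high-weight classes c2, c3
```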

  11. Evaluating HMC trees • Original work by Clare et al.: • Derive rules with high “accuracy” and “coverage” from the tree • The quality of individual rules was assessed • No simple overall criterion to assess the quality of the tree • In this work: using precision-recall curves • Precision = P(pos | predicted pos) • Recall = P(predicted pos | pos) • The P, R of a tree depend on the thresholds ti used • By changing the threshold ti from 1 to 0, a precision-recall curve emerges • For 250 classes: • Precision = P(X | predicted X) [with X any of the 250 classes] • Recall = P(predicted X | X) • This gives a PR curve that is a kind of “average” of the individual PR curves for each class
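A sketch of how such a curve can be traced by sweeping a single threshold over all classes at once (micro-averaging); the function name and the threshold grid are invented:

```python
import numpy as np

def pr_curve(Y_true, Y_score, thresholds=np.linspace(1, 0, 21)):
    """Micro-averaged (precision, recall) points over all classes, one point
    per threshold. Y_true: (n, k) 0/1 matrix of true class memberships,
    Y_score: (n, k) matrix of predicted class proportions from the leaves."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_score = np.asarray(Y_score, dtype=float)
    points = []
    for t in thresholds:
        pred = Y_score >= t
        tp = int((pred & Y_true).sum())
        precision = tp / pred.sum() if pred.sum() else 1.0
        recall = tp / Y_true.sum()
        points.append((precision, recall))
    return points
```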

  12. The Clus system • Created by Jan Struyf • Propositional DT learner, implemented in Java • Implements ideas from • C4.5 (Quinlan, ’93) • CART (Breiman et al., ’84) • predictive clustering trees (Blockeel et al., ’98) • Includes multiple prediction trees and hierarchical multilabel classification trees • Reads data in ARFF format (Weka) • We used two versions for our experiments: • Clus-HMC: the HMC version as explained • Clus-SC: the single-classification version, roughly comparable to CART

  13. The datasets • 12 datasets from functional genomics • Each with a different description of the genes • Sequence statistics (1) • Phenotype (2) • Predicted secondary structure (3) • Homology (4) • Micro-array data (5-12) • Each with the same class hierarchy • 250 classes distributed over 4 levels • Number of examples: 1592 to 3932 • Number of attributes: 52 to 47034

  14. Our expectations… • How does HMC tree learning compare to the “straightforward” approach of learning 250 trees? • We expect: • Faster learning: Learning 1 HMCT is slower than learning 1 SPT (single prediction tree), but faster than learning 250 SPT’s • Much faster prediction: Using 1 HMCT for prediction is as fast as using 1 SPT for prediction, and hence 250 times faster than using 250 SPT’s • Larger trees: HMCT is larger than average tree for 1 class, but smaller than set of 250 trees • Less accurate: HMCT is less accurate than set of 250 SPT’s (but hopefully not much less accurate) • So how much faster / simpler / less accurate are our HMC trees?

  15. The results • The HMCT is on average less complex than one single SPT • HMCT has 24 nodes, SPT’s on average 33 nodes • … but you’d need 250 of the latter to do the same job • The HMCT is on average slightly more accurate than a single SPT • Measured using “average precision-recall curves” (see graphs) • Surprising, as each SPT is tuned for one specific prediction task • Expectations w.r.t. efficiency are confirmed • Learning: min. speedup factor = 4.5x, max 65x, average 37x • Prediction: >250 times faster (since tree is not larger) • Faster to learn, much faster to apply

  16. Precision-recall curves • Precision: proportion of predictions that is correct, P(X | predicted X) • Recall: proportion of class memberships correctly identified, P(predicted X | X) [Figure: the averaged precision-recall curves referred to on the previous slide]

  17. An example rule • High interpretability: IF-THEN rules extracted from the HMCT are quite simple
IF Nitrogen_Depletion_8_h <= -2.74
AND Nitrogen_Depletion_2_h > -1.94
AND 1point5_mM_diamide_5_min > -0.03
AND 1M_sorbitol___45_min_ > -0.36
AND 37C_to_25C_shock___60_min > 1.28
THEN 40, 40/3, 5, 5/1
For class 40/3: recall = 0.15, precision = 0.97 (the rule covers 15% of all class 40/3 cases, and 97% of the cases fulfilling these conditions are indeed 40/3)
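For illustration only, the same rule written as a Python predicate over a gene's measurements (the dictionary keys follow the attribute names on the slide; the function name is invented):

```python
def rule_fires(gene):
    """The example rule extracted from the HMC tree, as a predicate.
    `gene` maps attribute names to expression measurements."""
    return (gene["Nitrogen_Depletion_8_h"] <= -2.74
            and gene["Nitrogen_Depletion_2_h"] > -1.94
            and gene["1point5_mM_diamide_5_min"] > -0.03
            and gene["1M_sorbitol___45_min_"] > -0.36
            and gene["37C_to_25C_shock___60_min"] > 1.28)

# If the rule fires, the tree predicts the classes 40, 40/3, 5 and 5/1.
```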

  18. The effect of merging • Instead of 250 separate trees, each optimized for a single class c1, c2, …, c250, we learn one tree optimized for c1, c2, …, c250 together • The merged tree is smaller than the average individual tree • … and more accurate than the average individual tree [Figure: 250 single-class trees merged into one HMC tree]

  19. Any explanation for these results? • Seems too good to be true… how is it possible? • Answer: the classes are not independent • Different trees for different classes actually share structure • This explains some of the complexity reduction achieved by the HMC tree, but not all! • One class carries information on other classes • This increases the signal-to-noise ratio • Provides better guidance when learning the tree (explaining the good accuracy) • Avoids overfitting (explaining the further reduction in tree size) • This was confirmed empirically

  20. Overfitting • To check our “overfitting” hypothesis: • Compared area under PR curve on training set (Atr) and test set (Ate) • For SPC: Atr – Ate = 0.219 • For HMCT: Atr – Ate = 0.024 • (to verify, we tried Weka’s M5’ too: 0.387) • So HMCT clearly overfits much less
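A small sketch of how such a train/test gap can be computed from the (precision, recall) points produced by the pr_curve helper above (trapezoidal area; the function name is invented):

```python
def pr_auc(points):
    """Trapezoidal area under a PR curve given (precision, recall) points."""
    pts = sorted(points, key=lambda pr: pr[1])            # sort by recall
    return sum((r2 - r1) * (p1 + p2) / 2
               for (p1, r1), (p2, r2) in zip(pts, pts[1:]))

# Overfitting indicator used on the slide (the curves are placeholders here):
#   gap = pr_auc(pr_curve(Y_train, S_train)) - pr_auc(pr_curve(Y_test, S_test))
```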

  21. Conclusions • Surprising discovery: a single tree can be found that • predicts 250 different functions with, on average, equal or better accuracy than special-purpose trees for each function • is not more complex than a single special-purpose tree (hence, 250 times simpler than the whole set) • is (much) more efficient to learn and to apply • The reason for this is to be found in the dependencies between the gene functions, which • provide better guidance when learning the tree • help to avoid overfitting • Multiple prediction / HMC trees have a lot of potential and should be used more often!

  22. Ongoing work • More extensive experimentation • Predicting classes in a lattice instead of a tree-shaped hierarchy
