
Decision Trees



  1. Decision Trees Klassifikations- und Clustering-Methoden für die Computerlinguistik. Sabine Schulte im Walde, Irene Cramer, Stefan Schacht. Universität des Saarlandes, Winter 2004/2005

  2. Outline • Example • What are decision trees? • Some characteristics • A (tentative) definition • How to build them? • Lots of questions … • Discussion • Advantages & disadvantages • When should we use them?

  3. Illustration – Classification example Remember: the example at the blackboard

  4. Discussion: Illustration – Results • Let's gather some characteristics of our decision tree: • binary decision questions (yes/no questions) • not necessarily balanced • arbitrary tree depth, depending on the desired granularity • features and classes are fixed in advance • annotated data • nominal and ordinal features • What questions arose? • a feature like size cannot be answered with yes/no • the order of the questions matters; depending on it, the tree may be unbalanced

  5. Our First Definition: • A decision tree is a graph • It consists of nodes, edges and leaves • nodes → questions about features • edges → possible values of a feature • leaves → class labels • Path from root to leaf → conjunction of questions (rules) • A decision tree is learned by splitting the source data into subsets based on features/rules (how: we will see later on) • This process is repeated recursively until splitting is either not feasible, or a single classification can be applied to each element of the derived subset
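
  A minimal Python sketch of this recursive splitting procedure. The data layout (a list of (feature-dict, label) pairs) and all helper names are illustrative assumptions, not part of the slides; a real learner would pick the best question at each step instead of taking the first remaining feature.

    from collections import Counter

    def majority_label(samples):
        # class of most objects among (features, label) pairs
        return Counter(label for _, label in samples).most_common(1)[0][0]

    def build_tree(samples, features):
        labels = {label for _, label in samples}
        if len(labels) == 1:              # a single class remains -> leaf
            return {"leaf": labels.pop()}
        if not features:                  # splitting no longer feasible -> leaf
            return {"leaf": majority_label(samples)}
        question = features[0]            # placeholder choice of the next question
        children = {}
        for value in {f[question] for f, _ in samples}:   # one edge per feature value
            subset = [(f, l) for f, l in samples if f[question] == value]
            children[value] = build_tree(subset, features[1:])
        return {"question": question, "children": children}

    # Tiny usage example: every path from the root to a {"leaf": ...} node
    # corresponds to one conjunction of questions (a rule).
    data = [({"pos": "NN", "capitalised": True}, "name"),
            ({"pos": "NN", "capitalised": False}, "noun"),
            ({"pos": "VB", "capitalised": False}, "verb")]
    print(build_tree(data, ["pos", "capitalised"]))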

  6. Building Decision Trees • We meet a lot of questions while building/using decision trees: • Should we only allow binary questions? Why? • Which features (properties) should we use? Thus, what questions should we ask? • Under what circumstances is a node a leaf? • How large should our tree become? • How should the category labels be assigned? • What should we do with corrupted data?

  7. Only Binary Questions? Taken from the web: http://www.smartdraw.com/resources/examples/business/images/decision_tree_diagram.gif

  8. Only Binary Questions? Taken from the web: http://www.cs.cf.ac.uk/Dave/AI2/dectree.gif

  9. Only Binary Questions? • Branching factor = how many outgoing edges does a node have? • Binary → branching factor = 2 • All decision trees can be converted into binary ones • Binary trees are very expressive • Binary decision trees are simpler to train • With a binary tree: 2^n possible classifications (n is the number of features)
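
  A tiny illustration of the conversion claim (feature values and class names are made up): a three-way question over a nominal feature can always be re-expressed as a chain of yes/no questions.

    def classify_three_way(colour):
        # one node with branching factor 3
        return {"red": "class_1", "green": "class_2", "blue": "class_3"}[colour]

    def classify_binary(colour):
        # the same decision as two nested yes/no questions (branching factor 2)
        if colour == "red":
            return "class_1"
        if colour == "green":
            return "class_2"
        return "class_3"

    assert all(classify_three_way(c) == classify_binary(c) for c in ("red", "green", "blue"))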

  10. What Questions Should We Ask? • Try to follow Ockham’s Razor → prefer the simplest model, thus prefer those features/questions that lead to a simple tree (not very helpful?)

  11. What Questions Should We Ask? • Measure impurity at each split • Impurity (i(N)): • Metaphorically speaking, shows how many different classes we have at each node • Best would be: just one class → leaf • Some impurity measures: • Entropy Impurity • Gini Impurity • Misclassification Impurity

  12. What Questions Should We Ask? • Entropy Impurity: i(N) = −Σ_j P(ω_j) log₂ P(ω_j) • Gini Impurity: i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = 1 − Σ_j P(ω_j)² • Misclassification Impurity: i(N) = 1 − max_j P(ω_j), where P(ω_j) is the fraction of patterns at node N that are in class ω_j
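
  A short Python/NumPy sketch of the three measures; the function names are my own, and probs is the vector of class fractions P(ω_j) at a node.

    import numpy as np

    def entropy_impurity(probs):
        probs = np.asarray(probs, dtype=float)
        nonzero = probs[probs > 0]            # convention: 0 * log 0 = 0
        return -np.sum(nonzero * np.log2(nonzero))

    def gini_impurity(probs):
        probs = np.asarray(probs, dtype=float)
        return 1.0 - np.sum(probs ** 2)

    def misclassification_impurity(probs):
        return 1.0 - float(np.max(probs))

    # A node with class fractions 0.5/0.5 is maximally impure,
    # a pure node (1.0/0.0) gets impurity 0 under all three measures.
    for measure in (entropy_impurity, gini_impurity, misclassification_impurity):
        print(measure.__name__, measure([0.5, 0.5]), measure([1.0, 0.0]))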

  13. What Questions Should We Ask? • Calculate the best question/rule at a node via the drop in impurity: Δi(N) = i(N) − P_L · i(N_L) − (1 − P_L) · i(N_R), where N_L and N_R are the left and right descendant nodes, i(N_L) and i(N_R) are their impurities, and P_L is the fraction of patterns in node N that will go to N_L when this question is used • Δi(N) should be as high as possible • Most common: Entropy Impurity
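
  A small sketch of this criterion with entropy impurity on toy data; the data, the candidate questions and the helper names are illustrative only.

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def impurity_drop(labels, goes_left):
        # delta i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R)
        labels, goes_left = np.asarray(labels), np.asarray(goes_left)
        p_left = goes_left.mean()
        return (entropy(labels)
                - p_left * entropy(labels[goes_left])
                - (1 - p_left) * entropy(labels[~goes_left]))

    # Try every candidate question "x <= t" and keep the one with the largest drop.
    x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
    y = np.array(["A", "A", "A", "B", "B", "B"])
    drop, threshold = max(((impurity_drop(y, x <= t), t) for t in x[:-1]),
                          key=lambda pair: pair[0])
    print(f"best question: x <= {threshold}  (impurity drop {drop:.2f})")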

  14. What Questions Should We Ask? • Additional information about questions: • monothetic (each question tests a single feature) vs. polythetic (a question may combine several features) • we now understand why binary trees are simpler • Keep in mind: a local optimum isn’t necessarily a global one!

  15. When to Declare Node = Leaf? • On the one hand … on the other: • if i(N) near 0 → (possible) overfitting • tree too small → (highly) erroneous classification • 2 solutions: • stop before i(N) = 0 → how to decide when? • pruning → how?

  16. When to Declare Node = Leaf? • When to stop growing? • Cross validation • split training data into two subsets • train with the bigger set • validate with the smaller one • Δi(N) < threshold • may yield an unbalanced tree • what threshold is reasonable? • P(N_L), P(N_R) < threshold • reasonable thresholds: 5%, 10% of the data • advantage: good partitions where the data density is high • Δi(N) ≠ 0 → significant? • Hypothesis testing • …
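
  If scikit-learn is at hand, several of these stopping criteria map directly onto constructor parameters of its tree learner; a minimal sketch (the threshold values are arbitrary examples, not recommendations).

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(
        criterion="entropy",          # entropy impurity for the splits
        min_impurity_decrease=0.01,   # do not split if the (weighted) impurity decrease is below this threshold
        min_samples_leaf=8,           # each leaf must receive at least this many samples (cf. the P(N_L), P(N_R) thresholds)
        random_state=0,
    ).fit(X, y)
    print("leaves:", clf.get_n_leaves(), "depth:", clf.get_depth())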

  17. Large Tree vs. Small Tree? • Tree too large? Prune! • first grow the tree fully, then cut • cut those nodes/leaves where i(N) is very small • avoid the horizon effect • Tree too large? Merge branches or rules!
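
  scikit-learn implements the "grow fully, then cut back" strategy as minimal cost-complexity pruning; a brief sketch, where in practice the pruning strength alpha would be chosen on held-out data (e.g. by cross-validation) rather than picked as below.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    full = DecisionTreeClassifier(random_state=0).fit(X, y)     # grow the tree fully
    path = full.cost_complexity_pruning_path(X, y)              # candidate pruning strengths
    alpha = path.ccp_alphas[-2]                                 # a fairly aggressive choice, for illustration only
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(full.get_n_leaves(), "leaves before pruning,", pruned.get_n_leaves(), "after")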

  18. When to Assign Which Category Label → Leaf? • If i(N) = 0, then the category label is the class of all objects • If i(N) > 0, then the category label is the class of most objects

  19. Discussion: What we have learnt so far • characteristics of decision trees • which decision questions are posed • difference between entropy impurity and Gini impurity? • the question of the optimal tree is still open – finding it is an NP-complete problem

  20. Examples: Scanned from Pattern Classification by Duda, Hart, and Stork

  21. Examples: Scanned from Pattern Classification by Duda, Hart, and Stork

  22. Examples: Scanned from Pattern Classification by Duda, Hart, and Stork

  23. What to do with “Corrupted” Data? Missing attributes … • during classification • look for surrogate questions • use a virtual value • during training • calculate the impurity on the basis of the attributes at hand • dirty solution: don’t consider data with missing attributes
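
  A small pandas sketch of the two quick fixes mentioned for training data; the column names and values are made up, and proper surrogate questions (as in CART) need support inside the tree learner itself.

    import pandas as pd

    data = pd.DataFrame({
        "pos":    ["NN", "VB", None, "NN"],
        "length": [4.0, 7.0, 3.0, None],
        "label":  ["a", "b", "a", "a"],
    })

    clean   = data.dropna()                                  # "dirty" solution: drop incomplete rows
    virtual = data.fillna({"pos": "UNK", "length": -1.0})    # use a virtual (default) value instead
    print(len(clean), "complete rows;", int(virtual.isna().sum().sum()), "missing cells left")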

  24. Some Terminology • CART (classification and regression trees) • general framework → can be instantiated in many ways • see the questions on the previous slides • ID3 • for unordered nominal attributes (if real-valued variables → intervals) • seldom binary • the algorithm continues until a node is pure or no more variables are left • no pruning • C4.5 • refinement of ID3 (in various aspects, e.g. real-valued variables, pruning, etc.)

  25. Advantages & Disadvantages • Advantages of decision trees: • non-metric data (nominal features) → yes/no questions • easily interpretable for humans • the information in the tree can be converted into rules • expert knowledge can be included • Disadvantages of decision trees: • deduced rules can be very complex • the decision tree can be suboptimal (i.e. cross-check, overfitting) • needs annotated data
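
  The "tree can be converted into rules" point can be made concrete with scikit-learn's text export, where every printed root-to-leaf path is one if-then rule (the data set here is only a placeholder).

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
    clf.fit(iris.data, iris.target)
    # Each printed path from the root to a leaf is one human-readable rule.
    print(export_text(clf, feature_names=list(iris.feature_names)))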

  26. Discussion: When could we use decision trees? • Named Entity Recognition • verb classification • polysemy • spam filtering • whenever we have nominal features • POS tagging

  27. Literature: • Richard O. Duda, Peter E. Hart and David G. Stork (2000): Pattern Classification. John Wiley & Sons, New York. • Tom M. Mitchell (1997): Machine Learning. McGraw-Hill, Boston. • www.wikipedia.org
