Decision Trees. Klassifikations- und Clustering-Methoden für die Computerlinguistik. Sabine Schulte im Walde, Irene Cramer, Stefan Schacht. Universität des Saarlandes, Winter 2004/2005
Outline • Example • What are decision trees? • Some characteristics • A (tentative) definition • How to build them? • Lots of questions … • Discussion • Advantages & disadvantages • When should we use them?
Illustration – Classification Example • Remember: the example at the blackboard
Discussion: Illustration – Results • Let's gather some characteristics of our decision tree: • binary decision questions (yes/no questions) • not necessarily balanced • arbitrary tree depth, depending on the desired granularity • features and classes are fixed in advance • annotated data • nominal and ordinal features • What questions arose? • "size" cannot be answered with yes/no • the order of the questions matters; depending on it, the tree may become unbalanced
Our First Definition: • A decision tree is a graph • It consists of nodes, edges and leaves • nodes → questions about features • edges → possible values of a feature • leaves → class labels • A path from root to leaf → a conjunction of questions (rules) • A decision tree is learned by splitting the source data into subsets based on features/rules (how: we will see later on) • This process is repeated recursively until splitting is no longer feasible, or a single class label can be assigned to every element of the derived subset (a small sketch follows below)
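To make the definition concrete, here is a minimal sketch (the features, values and class labels are invented for the example): the tree as nested nodes, and classification as a walk from the root to a leaf.

```python
# A minimal sketch of the definition above: internal nodes hold a question
# about one feature, edges carry the possible values, leaves carry class
# labels. Features and classes here are invented for illustration.

def classify(node, sample):
    """Follow the path from the root to a leaf; the leaf is the class label."""
    while isinstance(node, dict):                         # internal node: ask
        node = node["branches"][sample[node["feature"]]]  # follow the edge
    return node                                           # leaf: class label

tree = {
    "feature": "colour",
    "branches": {
        "yellow": "banana",
        "red": {"feature": "size",
                "branches": {"small": "cherry", "big": "apple"}},
    },
}

print(classify(tree, {"colour": "red", "size": "big"}))   # -> apple
```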
Building Decision Trees • We encounter a lot of questions while building/using decision trees: • Should we only allow binary questions? Why? • Which features (properties) should we use? That is, what questions should we ask? • Under what circumstances is a node a leaf? • How large should our tree become? • How should the category labels be assigned? • What should we do with corrupted data?
Only Binary Questions? Taken from the web: http://www.smartdraw.com/resources/examples/business/images/decision_tree_diagram.gif
Only Binary Questions? Taken from the web: http://www.cs.cf.ac.uk/Dave/AI2/dectree.gif
Only Binary Questions? • Branching factor = the number of outgoing edges per node • Binary → branching factor = 2 • All decision trees can be converted into binary ones (a sketch of the conversion follows below) • Binary trees are very expressive • Binary decision trees are simpler to train • With a binary tree: 2^n possible classifications (n is the number of features)
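A small sketch of the conversion claim above (feature and class names are invented): any multi-way nominal question can be replaced by a chain of yes/no questions of the form "feature == value?".

```python
# Three-way question:   colour? -> red / yellow / green
# Equivalent chain of binary (yes/no) questions:

def classify_multiway(sample):
    return {"red": "A", "yellow": "B", "green": "C"}[sample["colour"]]

def classify_binary(sample):
    if sample["colour"] == "red":      # yes/no question 1
        return "A"
    if sample["colour"] == "yellow":   # yes/no question 2
        return "B"
    return "C"                         # only "green" remains

assert classify_multiway({"colour": "green"}) == classify_binary({"colour": "green"})
```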
What Questions Should We Ask? • Try to follow Ockham's Razor: prefer the simplest model, i.e. prefer those features/questions that lead to a simple tree (not very helpful by itself?)
What Questions Should We Ask? • Measure the impurity at each split • Impurity i(N): • metaphorically speaking, it shows how many different classes we have at a node • best case: just one class → leaf • Some impurity measures: • Entropy impurity • Gini impurity • Misclassification impurity
What Questions Should We Ask? • Entropy impurity: i(N) = −Σj P(ωj) log2 P(ωj) • Gini impurity: i(N) = Σi≠j P(ωi) P(ωj) = 1 − Σj P(ωj)² • Misclassification impurity: i(N) = 1 − maxj P(ωj) • where P(ωj) is the fraction of patterns at node N that are in class ωj (a code sketch follows below)
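The three measures are easy to state in code; a minimal sketch (the function names are mine, the formulas are the ones above):

```python
from collections import Counter
from math import log2

def probabilities(labels):
    """P(wj): fraction of patterns at node N that are in class wj."""
    return [n / len(labels) for n in Counter(labels).values()]

def entropy_impurity(labels):
    return -sum(p * log2(p) for p in probabilities(labels))

def gini_impurity(labels):
    return 1.0 - sum(p * p for p in probabilities(labels))

def misclassification_impurity(labels):
    return 1.0 - max(probabilities(labels))

node = ["A", "A", "A", "B"]
print(entropy_impurity(node))            # ~0.811
print(gini_impurity(node))               # 0.375
print(misclassification_impurity(node))  # 0.25
```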
What Questions Should We Ask? • Calculate the best question/rule at a node via the impurity drop: Δi(N) = i(N) − PL · i(NL) − (1 − PL) · i(NR), where NL and NR are the left and right descendent nodes, i(NL) and i(NR) are their impurities, and PL is the fraction of patterns at node N that will go to NL when this question is used • Δi(N) should be as high as possible • Most common: entropy impurity (see the sketch below)
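Continuing the sketch above (gini_impurity as defined there), the impurity drop Δi(N) for a candidate split; the greedy algorithm picks the question that maximizes it:

```python
def impurity_drop(parent, left, right, impurity=gini_impurity):
    """Delta-i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R)."""
    p_left = len(left) / len(parent)
    return (impurity(parent)
            - p_left * impurity(left)
            - (1.0 - p_left) * impurity(right))

# A split that separates the classes perfectly gets the maximal drop i(N):
print(impurity_drop(["A", "B", "B"], ["A"], ["B", "B"]))  # ~0.444
```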
What Questions Should We Ask? • Additional information about questions: • monothetic (one feature per question) vs. polythetic (several features combined in one question) • we now understand why binary trees are simpler • Keep in mind: a local optimum isn't necessarily a global one!
When to Declare Node = Leaf? • On the one hand … on the other: • if i(N) is driven near 0 → (possible) overfitting • if the tree is too small → (highly) erroneous classification • 2 solutions: • stop before i(N) = 0 → how to decide when? • pruning → how?
When to Declare Node = Leaf? • When to stop growing? • Cross-validation • split the training data into two subsets • train with the bigger set • validate with the smaller one • Δi(N) < threshold • may yield an unbalanced tree • what threshold is reasonable? • P(NL), P(NR) < threshold • reasonable thresholds: 5% or 10% of the data • advantage: fine partitions where the data density is high • test whether Δi(N) ≠ 0 is significant • hypothesis testing • … (a sketch of the threshold criteria follows below)
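A hedged sketch of the two threshold criteria from this list (the 5% figure comes from the slide; the Δi threshold value is an assumption made for the example):

```python
def should_stop(n_left, n_right, n_total, best_drop,
                min_fraction=0.05,       # from the slide: 5% (or 10%) of the data
                min_drop=0.01):          # Delta-i(N) threshold (assumed value)
    """Declare node N a leaf instead of applying the best candidate split?"""
    if min(n_left, n_right) / n_total < min_fraction:  # P(NL) or P(NR) too small
        return True
    if best_drop < min_drop:             # the best question barely reduces impurity
        return True
    return False
```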
Large Tree vs. Small Tree? • Tree too large? Prune! • first grow the tree fully, then cut • cut those nodes/leaves where i(N) is already very small (further splitting gains little) • this avoids the horizon effect • Tree too large? Merge branches or rules! (a pruning sketch follows below)
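A bottom-up pruning sketch (the node layout and the threshold are assumptions made for the example): the tree is grown fully first, so the real usefulness of every subtree is known before anything is cut, which is what avoids the horizon effect.

```python
def prune(node, min_drop=0.01):
    """Replace subtrees whose split gained almost nothing by a single leaf.
    Nodes are dicts {"drop": Delta-i of the split, "majority": majority
    class, "left": subtree, "right": subtree}; leaves are plain labels."""
    if not isinstance(node, dict):
        return node                       # already a leaf
    node["left"] = prune(node["left"], min_drop)
    node["right"] = prune(node["right"], min_drop)
    if node["drop"] < min_drop:           # split contributed (almost) nothing
        return node["majority"]           # cut: the subtree becomes a leaf
    return node
```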
When to Assign Which Category Label to a Leaf? • If i(N) = 0, the category label is the class of all objects at the leaf • If i(N) > 0, the category label is the class of most objects (majority vote; sketch below)
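Both cases collapse into one rule, sketched here: take the majority class, which is the unique class whenever i(N) = 0.

```python
from collections import Counter

def leaf_label(labels):
    """Category label of a leaf: the majority class; if i(N) = 0 this is
    simply the class of all objects at the leaf."""
    return Counter(labels).most_common(1)[0][0]

print(leaf_label(["A", "A", "B"]))  # -> A
```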
Discussion: What We Have Learned So Far • characteristics of decision trees • which decision questions to ask • the difference between entropy impurity and Gini impurity • the question of the optimal tree is still open: finding it is an NP-complete problem
Examples: Scanned from Pattern Classification by Duda, Hart, and Stork
What to Do with "Corrupted" Data? Missing attributes … • during classification • look for surrogate questions • use a virtual value (see the sketch below) • during training • calculate the impurity on the basis of the attributes at hand • dirty solution: don't consider data with missing attributes
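A small sketch of the "virtual value" idea (the sentinel value and helper name are my own): a designated value stands in for absent attributes, so every question still has an answer during classification.

```python
MISSING = "?"                      # virtual value for absent attributes (assumed convention)

def answer(sample, feature):
    """Answer a node's question even when the attribute was not observed."""
    return sample.get(feature, MISSING)

# During training, the impurity of a question would be computed only from
# the samples that actually carry the attribute at hand (not shown here).
print(answer({"colour": "red"}, "size"))  # -> ?
```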
Some Terminology • CART (Classification and Regression Trees) • a general framework that can be instantiated in many ways • see the questions on the previous slides • ID3 • for unordered nominal attributes (real-valued variables are discretized into intervals) • seldom binary • the algorithm continues until a node is pure or no more variables are left • no pruning • C4.5 • a refinement of ID3 (in various aspects, e.g. real-valued variables, pruning, etc.)
Advantages & Disadvantages • Advantages of decision trees: • non-metric data (nominal features) → yes/no questions • easily interpretable for humans • the information in the tree can be converted into rules • expert knowledge can be included • Disadvantages of decision trees: • the deduced rules can be very complex • the decision tree can be suboptimal (overfitting; cross-check needed) • annotated data is needed
Discussion: When Could We Use Decision Trees? • Named entity recognition • Verb classification • Polysemy • Spam filtering • whenever we have nominal features • POS tagging
Literature: • Richard O. Duda, Peter E. Hart and David G. Stork (2000): Pattern Classification. John Wiley & Sons, New York. • Tom M. Mitchell (1997): Machine Learning. McGraw-Hill, Boston. • www.wikipedia.org