Decision Trees. Klassifikations- und Clustering-Methoden für die Computerlinguistik. Sabine Schulte im Walde, Irene Cramer, Stefan Schacht. Universität des Saarlandes, Winter 2004/2005
Outline • Example • What are decision trees? • Some characteristics • A (tentative) definition • How to build them? • Lots of questions … • Discussion • Advantages & disadvantages • When should we use them?
Illustration – Classification Example • Remember: the example at the blackboard
Discussion: Illustration – Results • Let's gather some characteristics of our decision tree: • binary decision questions (yes/no questions) • not necessarily balanced • arbitrary tree depth, depending on the desired granularity • features and classes are fixed in advance • annotated data • nominal and ordinal features • What questions arose? • "size" cannot be answered with yes/no • the order of the questions matters; depending on it, the tree may become unbalanced
Our First Definition: • A decision tree is a graph • It consists of nodes, edges and leaves • nodes → questions about features • edges → possible values of a feature • leaves → class labels • A path from root to leaf → a conjunction of questions (rules) • A decision tree is learned by splitting the source data into subsets based on features/rules (how: we will see later on) • This process is repeated recursively until splitting is no longer feasible, or a single class label can be assigned to every element of the derived subset (a small sketch follows below)
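To make the definition concrete, here is a minimal sketch (the features, values and class labels are invented for the example): the tree as nested nodes, and classification as a walk from the root to a leaf.

```python
# A minimal sketch of the definition above: internal nodes hold a question
# about one feature, edges carry the possible values, leaves carry class
# labels. Features and classes here are invented for illustration.

def classify(node, sample):
    """Follow the path from the root to a leaf; the leaf is the class label."""
    while isinstance(node, dict):                         # internal node: ask
        node = node["branches"][sample[node["feature"]]]  # follow the edge
    return node                                           # leaf: class label

tree = {
    "feature": "colour",
    "branches": {
        "yellow": "banana",
        "red": {"feature": "size",
                "branches": {"small": "cherry", "big": "apple"}},
    },
}

print(classify(tree, {"colour": "red", "size": "big"}))   # -> apple
```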
Building Decision Trees • We encounter a lot of questions while building/using decision trees: • Should we only allow binary questions? Why? • Which features (properties) should we use? That is, what questions should we ask? • Under what circumstances is a node a leaf? • How large should our tree become? • How should the category labels be assigned? • What should we do with corrupted data?
Only Binary Questions? Taken from the web: http://www.smartdraw.com/resources/examples/business/images/decision_tree_diagram.gif
Only Binary Questions? Taken from the web: http://www.cs.cf.ac.uk/Dave/AI2/dectree.gif
Only Binary Questions? • Branching factor = the number of outgoing edges per node • Binary → branching factor = 2 • All decision trees can be converted into binary ones (a sketch of the conversion follows below) • Binary trees are very expressive • Binary decision trees are simpler to train • With a binary tree: 2^n possible classifications (n is the number of features)
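A small sketch of the conversion claim above (feature and class names are invented): any multi-way nominal question can be replaced by a chain of yes/no questions of the form "feature == value?".

```python
# Three-way question:   colour? -> red / yellow / green
# Equivalent chain of binary (yes/no) questions:

def classify_multiway(sample):
    return {"red": "A", "yellow": "B", "green": "C"}[sample["colour"]]

def classify_binary(sample):
    if sample["colour"] == "red":      # yes/no question 1
        return "A"
    if sample["colour"] == "yellow":   # yes/no question 2
        return "B"
    return "C"                         # only "green" remains

assert classify_multiway({"colour": "green"}) == classify_binary({"colour": "green"})
```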
What Questions Should We Ask? • Try to follow Ockham's Razor: prefer the simplest model, i.e. prefer those features/questions that lead to a simple tree (not very helpful by itself?)
What Questions Should We Ask? • Measure the impurity at each split • Impurity i(N): • metaphorically speaking, it shows how many different classes we have at a node • best case: just one class → leaf • Some impurity measures: • Entropy impurity • Gini impurity • Misclassification impurity
What Questions Should We Ask? • Entropy impurity: i(N) = −Σj P(ωj) log2 P(ωj) • Gini impurity: i(N) = Σi≠j P(ωi) P(ωj) = 1 − Σj P(ωj)² • Misclassification impurity: i(N) = 1 − maxj P(ωj) • where P(ωj) is the fraction of patterns at node N that are in class ωj (a code sketch follows below)
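The three measures are easy to state in code; a minimal sketch (the function names are mine, the formulas are the ones above):

```python
from collections import Counter
from math import log2

def probabilities(labels):
    """P(wj): fraction of patterns at node N that are in class wj."""
    return [n / len(labels) for n in Counter(labels).values()]

def entropy_impurity(labels):
    return -sum(p * log2(p) for p in probabilities(labels))

def gini_impurity(labels):
    return 1.0 - sum(p * p for p in probabilities(labels))

def misclassification_impurity(labels):
    return 1.0 - max(probabilities(labels))

node = ["A", "A", "A", "B"]
print(entropy_impurity(node))            # ~0.811
print(gini_impurity(node))               # 0.375
print(misclassification_impurity(node))  # 0.25
```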
What Questions Should We Ask? • Calculate the best question/rule at a node via the impurity drop: Δi(N) = i(N) − PL · i(NL) − (1 − PL) · i(NR), where NL and NR are the left and right descendent nodes, i(NL) and i(NR) are their impurities, and PL is the fraction of patterns at node N that will go to NL when this question is used • Δi(N) should be as high as possible • Most common: entropy impurity (see the sketch below)
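Continuing the sketch above (gini_impurity as defined there), the impurity drop Δi(N) for a candidate split; the greedy algorithm picks the question that maximizes it:

```python
def impurity_drop(parent, left, right, impurity=gini_impurity):
    """Delta-i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R)."""
    p_left = len(left) / len(parent)
    return (impurity(parent)
            - p_left * impurity(left)
            - (1.0 - p_left) * impurity(right))

# A split that separates the classes perfectly gets the maximal drop i(N):
print(impurity_drop(["A", "B", "B"], ["A"], ["B", "B"]))  # ~0.444
```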
What Questions Should We Ask? • Additional information about questions: • monothetic (one feature per question) vs. polythetic (several features combined in one question) • we now understand why binary trees are simpler • Keep in mind: a local optimum isn't necessarily a global one!
When to Declare Node = Leaf? • On the one hand … on the other: • if i(N) is driven near 0 → (possible) overfitting • if the tree is too small → (highly) erroneous classification • 2 solutions: • stop before i(N) = 0 → how to decide when? • pruning → how?
When to Declare Node = Leaf? • When to stop growing? • Cross-validation • split the training data into two subsets • train with the bigger set • validate with the smaller one • Δi(N) < threshold • may yield an unbalanced tree • what threshold is reasonable? • P(NL), P(NR) < threshold • reasonable thresholds: 5% or 10% of the data • advantage: fine partitions where the data density is high • test whether Δi(N) ≠ 0 is significant • hypothesis testing • … (a sketch of the threshold criteria follows below)
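A hedged sketch of the two threshold criteria from this list (the 5% figure comes from the slide; the Δi threshold value is an assumption made for the example):

```python
def should_stop(n_left, n_right, n_total, best_drop,
                min_fraction=0.05,       # from the slide: 5% (or 10%) of the data
                min_drop=0.01):          # Delta-i(N) threshold (assumed value)
    """Declare node N a leaf instead of applying the best candidate split?"""
    if min(n_left, n_right) / n_total < min_fraction:  # P(NL) or P(NR) too small
        return True
    if best_drop < min_drop:             # the best question barely reduces impurity
        return True
    return False
```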
Large Tree vs. Small Tree? • Tree too large? Prune! • first grow the tree fully, then cut • cut those nodes/leaves where i(N) is already very small (further splitting gains little) • this avoids the horizon effect • Tree too large? Merge branches or rules! (a pruning sketch follows below)
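A bottom-up pruning sketch (the node layout and the threshold are assumptions made for the example): the tree is grown fully first, so the real usefulness of every subtree is known before anything is cut, which is what avoids the horizon effect.

```python
def prune(node, min_drop=0.01):
    """Replace subtrees whose split gained almost nothing by a single leaf.
    Nodes are dicts {"drop": Delta-i of the split, "majority": majority
    class, "left": subtree, "right": subtree}; leaves are plain labels."""
    if not isinstance(node, dict):
        return node                       # already a leaf
    node["left"] = prune(node["left"], min_drop)
    node["right"] = prune(node["right"], min_drop)
    if node["drop"] < min_drop:           # split contributed (almost) nothing
        return node["majority"]           # cut: the subtree becomes a leaf
    return node
```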
When to Assign Which Category Label to a Leaf? • If i(N) = 0, the category label is the class of all objects at the leaf • If i(N) > 0, the category label is the class of most objects (majority vote; sketch below)
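Both cases collapse into one rule, sketched here: take the majority class, which is the unique class whenever i(N) = 0.

```python
from collections import Counter

def leaf_label(labels):
    """Category label of a leaf: the majority class; if i(N) = 0 this is
    simply the class of all objects at the leaf."""
    return Counter(labels).most_common(1)[0][0]

print(leaf_label(["A", "A", "B"]))  # -> A
```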
Discussion: What We Have Learned So Far • characteristics of decision trees • which decision questions to ask • the difference between entropy impurity and Gini impurity • the question of the optimal tree is still open: finding it is an NP-complete problem
Examples: Scanned from Pattern Classification by Duda, Hart, and Stork
What to Do with "Corrupted" Data? Missing attributes … • during classification • look for surrogate questions • use a virtual value (see the sketch below) • during training • calculate the impurity on the basis of the attributes at hand • dirty solution: don't consider data with missing attributes
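A small sketch of the "virtual value" idea (the sentinel value and helper name are my own): a designated value stands in for absent attributes, so every question still has an answer during classification.

```python
MISSING = "?"                      # virtual value for absent attributes (assumed convention)

def answer(sample, feature):
    """Answer a node's question even when the attribute was not observed."""
    return sample.get(feature, MISSING)

# During training, the impurity of a question would be computed only from
# the samples that actually carry the attribute at hand (not shown here).
print(answer({"colour": "red"}, "size"))  # -> ?
```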
Some Terminology • CART (Classification and Regression Trees) • a general framework that can be instantiated in many ways • see the questions on the previous slides • ID3 • for unordered nominal attributes (real-valued variables are discretized into intervals) • seldom binary • the algorithm continues until a node is pure or no more variables are left • no pruning • C4.5 • a refinement of ID3 (in various aspects, e.g. real-valued variables, pruning, etc.)
Advantages & Disadvantages • Advantages of decision trees: • non-metric data (nominal features) → yes/no questions • easily interpretable for humans • the information in the tree can be converted into rules • expert knowledge can be included • Disadvantages of decision trees: • the deduced rules can be very complex • the decision tree can be suboptimal (overfitting; cross-check needed) • annotated data is needed
Discussion: When Could We Use Decision Trees? • Named entity recognition • Verb classification • Polysemy • Spam filtering • whenever we have nominal features • POS tagging
Literature: • Richard O. Duda, Peter E. Hart and David G. Stork (2000): Pattern Classification. John Wiley & Sons, New York. • Tom M. Mitchell (1997): Machine Learning. McGraw-Hill, Boston. • www.wikipedia.org