Decision Trees
What is a decision tree? • Input = assignment of values for given attributes • Discrete (often Boolean) or continuous • Output = predicted value • Discrete - classification • Continuous - regression • Structure: • Internal node - tests one attribute • Leaf node - output value (or linear function of attributes)
Example Attributes • Alternate - Boolean • Bar - Boolean • Fri/Sat - Boolean • Hungry - Boolean • Patrons - {None, Some, Full} • Price - {$, $$, $$$} • Raining - Boolean • Reservation - Boolean • Type - {French, Italian, Thai, Burger} • WaitEstimate - {0-10 min, 10-30 min, 30-60 min, >60 min}
Decision Tree Properties • Propositional: attribute-value pairs • Not useful for relationships between objects • Universal: space of possible decision trees includes every Boolean function • Not compact for every function: some (e.g., parity, majority) require exponentially large trees • Good for disjunctive functions
Decision Tree Learning • From a training set of examples, construct a tree that predicts the output for unseen inputs
Methods for Constructing Decision Trees • Trivial: one path for each example in the training set • Not robust, little predictive value • Better: look for a "small" tree that fits the examples • Applying Ockham's razor • ID3: simple greedy, non-backtracking search through the space of decision trees
Choose-Attribute • Greedy heuristic: at each node, use the attribute that gives the greatest immediate information gain (a sketch follows below)
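A minimal sketch of this greedy construction in Python. The dict-based tree representation, the `id3` name, and the `information_gain` helper (defined after the entropy slides below) are illustrative assumptions, not from the slides:

```python
from collections import Counter

def id3(examples, attributes, target):
    """Greedy, non-backtracking decision-tree construction.

    examples:   list of dicts mapping attribute names to values
    attributes: attribute names still available to test
    target:     name of the output attribute
    """
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # all examples agree: leaf
        return labels[0]
    if not attributes:                 # nothing left to test: majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: attribute with the greatest immediate information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```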
Information/Entropy • Information provided by knowing an answer: • Possible answers vi with probability P(vi) • I(P(v1), ..., P(vn)) = Σi -P(vi) log2 P(vi) • I(0.5, 0.5) = -0.5*(-1) - 0.5*(-1) = 1 bit • I(0.01, 0.99) = -0.01*(-6.6) - 0.99*(-0.014) ≈ 0.08 bits • Estimate the probabilities from the set of examples • Example: 6 yes, 6 no, so estimate P(yes) = P(no) = 0.5 • 1 bit required • After testing an attribute, take the weighted sum of the information required for each subset of the examples
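As a sketch, the formula above translates directly into Python (the function name `information` is an assumption):

```python
import math

def information(probs):
    """I({P(vi)}) = sum_i -P(vi) * log2 P(vi), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information([0.5, 0.5]))    # 1.0
print(information([0.01, 0.99]))  # ~0.08
```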
After Testing “Type” • Remaining information: 2/12*I(1/2,1/2) + 2/12*I(1/2,1/2) + 4/12*I(1/2,1/2) + 4/12*I(1/2,1/2) = 1 bit • Gain = 1 - 1 = 0 bits: Type tells us nothing
After Testing “Patrons” • Remaining information: 2/12*I(0,1) + 4/12*I(1,0) + 6/12*I(2/6,4/6) ≈ 0.46 bits • Gain = 1 - 0.46 ≈ 0.54 bits, so Patrons is the better attribute to test first
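A sketch of the weighted-sum computation behind these two slides, reusing `information` from above; `information_gain` matches the helper assumed in the `id3` sketch:

```python
from collections import Counter

def information_gain(examples, attribute, target):
    """Entropy of the whole set minus the weighted sum of the
    entropy of each subset produced by testing `attribute`."""
    def entropy(exs):
        counts = Counter(e[target] for e in exs).values()
        return information([c / len(exs) for c in counts])
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

# Reproducing the slide's numbers directly from the subset counts:
rem_patrons = (2/12) * information([0, 1]) \
            + (4/12) * information([1, 0]) \
            + (6/12) * information([2/6, 4/6])
print(rem_patrons)   # ~0.46, so gain = 1 - 0.46 = 0.54 bits
```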
Repeat, adding tests recursively on each subset of examples • Note: the induced tree need not match the “true” function • It is the best hypothesis we can form from the given examples
Overfitting • The algorithm can find meaningless regularity in the data • E.g., using date & time to predict the roll of a die • Approaches to fixing: • Stop the tree from growing too far • Allow the tree to overfit, then prune
χ² Pruning • Is a split irrelevant? • Information gain close to zero - but how close? • Assume no underlying pattern (the null hypothesis) • If statistical analysis shows that a deviation this large would occur by chance with probability < 5% under the null hypothesis, assume the attribute is relevant • Pruning provides noise tolerance
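A sketch of that significance test, assuming a Boolean output with both classes present, and SciPy for the χ² distribution; the `(n_yes, n_no)` counts interface is invented for illustration:

```python
from scipy.stats import chi2

def split_is_relevant(subsets, alpha=0.05):
    """subsets: one (n_yes, n_no) pair per attribute value.

    Under the null hypothesis the attribute is irrelevant, so each
    subset should contain yes/no in the same proportion as the whole
    set; measure the total deviation from those expected counts.
    """
    p = sum(y for y, _ in subsets)     # total positive examples
    n = sum(m for _, m in subsets)     # total negative examples
    dev = 0.0
    for yk, nk in subsets:
        k = yk + nk
        ey = p * k / (p + n)           # expected yes count
        en = n * k / (p + n)           # expected no count
        dev += (yk - ey) ** 2 / ey + (nk - en) ** 2 / en
    # Probability that chance alone produces a deviation this large:
    p_value = chi2.sf(dev, df=len(subsets) - 1)
    return p_value < alpha
```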
Rule Post-Pruning • Convert the tree into a set of rules (one per root-to-leaf path) • Prune (generalize) each rule by dropping preconditions whenever doing so improves accuracy • Rules may now overlap • When classifying, consider the rules in order of accuracy
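A sketch of the conversion step, using the nested-dict tree representation assumed in the `id3` sketch above:

```python
def tree_to_rules(tree, preconditions=()):
    """One rule per root-to-leaf path: ([(attr, value), ...], output)."""
    if not isinstance(tree, dict):                 # leaf node
        return [(list(preconditions), tree)]
    (attribute, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, preconditions + ((attribute, value),))
    return rules
```

Each rule's preconditions can then be dropped one at a time, keeping a deletion whenever accuracy on held-out data does not degrade.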
Continuous-valued attributes • Split based on some threshold • X < 97 vs. X >= 97 • Many possible split points, but only a few need testing (see the sketch below)
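A sketch of enumerating the thresholds worth testing; only midpoints between consecutive sorted values whose class labels differ can maximize information gain, so the rest can be skipped (function name invented):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive distinct sorted values whose
    class labels differ - the only thresholds worth evaluating."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb and a != b]

print(candidate_thresholds([90, 85, 97, 95, 102],
                           ["no", "no", "yes", "yes", "no"]))
# -> [92.5, 99.5]
```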
Multi-valued attributes • Information gain of an attribute with many values may be huge (e.g., Date) • Rather than absolute info gain, use the ratio of gain to SplitInformation, where SplitInformation(S, A) = -Σi (|Si|/|S|) log2(|Si|/|S|) is the entropy of the split itself
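A sketch of that ratio, reusing `information` from the entropy sketch above (function names are assumptions):

```python
def split_information(subset_sizes):
    """Entropy of the split itself: large when an attribute such as
    Date shatters the data into many small subsets."""
    total = sum(subset_sizes)
    return information([s / total for s in subset_sizes])

def gain_ratio(gain, subset_sizes):
    # Penalize attributes whose gain comes only from having many values
    # (assumes the attribute actually splits the data into 2+ subsets).
    return gain / split_information(subset_sizes)

# Date splits 12 examples into 12 singletons: gain may reach 1.0 bit,
# but SplitInformation = log2(12) ~ 3.58, so the ratio stays small.
print(gain_ratio(1.0, [1] * 12))   # ~0.28
```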
Continuous-valued outputs • Leaf holds a linear function of the attributes, not a single output value • Regression tree
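A sketch of fitting such a leaf by least squares, assuming NumPy is available (names invented):

```python
import numpy as np

def fit_leaf(X, y):
    """Fit a linear function of the attributes at a regression-tree
    leaf, instead of storing one constant output value."""
    X = np.asarray(X, dtype=float)
    A = np.hstack([X, np.ones((len(X), 1))])       # append intercept column
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return lambda x: float(np.append(x, 1.0) @ coef)
```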
Missing Attribute Values • An attribute value in an example may not be known • Option 1: assign the most common value among comparable examples • Option 2: split into fractional examples based on the observed distribution of values (sketched below)
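A sketch of the fractional-example option, assuming each example dict carries an optional "weight" key defaulting to 1.0; both the representation and the function name are invented for illustration:

```python
def split_with_missing(examples, attribute):
    """Route examples down the branches for `attribute`; an example
    with the value missing goes down every branch as a fractional
    example, weighted by the observed distribution of values."""
    known = [e for e in examples if e.get(attribute) is not None]
    total = sum(e.get("weight", 1.0) for e in known)
    # Observed weight of each value among examples where it is known.
    value_weight = {}
    for e in known:
        w = e.get("weight", 1.0)
        value_weight[e[attribute]] = value_weight.get(e[attribute], 0.0) + w
    branches = {v: [] for v in value_weight}
    for e in known:
        branches[e[attribute]].append(e)
    for e in examples:
        if e.get(attribute) is None:
            for value, w in value_weight.items():
                frac = dict(e)
                frac["weight"] = e.get("weight", 1.0) * w / total
                branches[value].append(frac)
    return branches
```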
Credits • Diagrams from “Artificial Intelligence: A Modern Approach” by Russell and Norvig