Decision Tree Rong Jin
A Decision Tree for Determining MPG
cylinders = 4, displacement = low, horsepower = low, weight = low, acceleration = high, modelyear = 75to78, maker = asia → mpg = good
From slides of Andrew Moore
Decision Tree Learning
• Extremely popular method
  • Credit risk assessment
  • Medical diagnosis
  • Market analysis
• Good at dealing with symbolic features
• Easy to comprehend, compared to logistic regression models and support vector machines
Representational Power
• Q: Can trees represent arbitrary Boolean expressions?
• Q: How many Boolean functions are there over N binary attributes?
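These are left as questions on the slide; for reference (standard facts, not recovered from the slide image): a tree that tests all N attributes along every path can represent any Boolean function, and the number of distinct Boolean functions over N binary attributes is

\[
2^{\,2^{N}},
\]

since each of the 2^N possible input combinations can independently be assigned an output of 0 or 1.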
A Simple Idea
• Enumerate all possible trees
• Check how well each tree matches the training data
• Pick the one that works best
Problems?
• Too many trees
• How do we determine the quality of a decision tree?
Solution: A Greedy Approach
• Choose the most informative feature
• Split the data set on it
• Recurse until each data item is classified correctly
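A minimal sketch of this greedy recursion in Python (not from the slides; it assumes records are dicts of discrete feature values, and the names `entropy`, `info_gain`, and `build_tree` are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    """Mutual information between attribute `attr` and the class label."""
    gain = entropy(labels)
    for value in set(r[attr] for r in records):
        subset = [y for r, y in zip(records, labels) if r[attr] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def build_tree(records, labels, attrs):
    """Greedy recursion: pick the most informative attribute, split, recurse."""
    if len(set(labels)) == 1 or not attrs:            # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: predict the majority class
    best = max(attrs, key=lambda a: info_gain(records, labels, a))
    node = {"split_on": best, "children": {}}
    for value in set(r[best] for r in records):
        keep = [i for i, r in enumerate(records) if r[best] == value]
        node["children"][value] = build_tree(
            [records[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attrs if a != best])
    return node
```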
How to Determine the Best Feature?
• Which feature is most informative about MPG?
• What metric should be used? Mutual information!
From Andrew Moore's slides
Mutual Information for Selecting Best Features From Andrew Moore’s slides
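The formula on this slide is an image and is not reproduced in the text; the standard definition presumably intended here is the mutual information between a feature X and the class label Y (the information gain of splitting on X):

\[
I(X;Y) \;=\; H(Y) - H(Y\mid X) \;=\; H(Y) - \sum_{x} P(X=x)\, H(Y\mid X=x),
\qquad
H(Y) = -\sum_{y} P(Y=y)\,\log_2 P(Y=y).
\]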
Example: Playing Tennis
Two candidate splits of the full data set (9+, 5−):
• Humidity: High → (3+, 4−), Normal → (6+, 1−)
• Wind: Weak → (6+, 2−), Strong → (3+, 3−)
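These counts match the standard playing-tennis data set; a worked comparison of the two candidate splits (the arithmetic is mine, not text recovered from the slide):

\[
\begin{aligned}
H(Y) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940,\\
I(\text{Humidity};Y) &\approx 0.940 - \tfrac{7}{14}(0.985) - \tfrac{7}{14}(0.592) \approx 0.151,\\
I(\text{Wind};Y) &\approx 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.000) \approx 0.048,
\end{aligned}
\]

so Humidity is the more informative split.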
Prediction for Nodes
What is the prediction for each node? (Typically, the majority class among the training records that reach that node.)
From Andrew Moore's slides
Recursively Growing Trees
Take the original data set and partition it according to the value of the attribute we split on: cylinders = 4, cylinders = 5, cylinders = 6, cylinders = 8.
From Andrew Moore's slides
Recursively Growing Trees
Build a tree from the records in each partition (cylinders = 4, 5, 6, and 8).
From Andrew Moore's slides
Recursively Growing Trees: A Two-Level Tree
When Should We Stop Growing Trees?
Should we split this node?
Base Cases
• Base Case One: if all records in the current data subset have the same output, don't recurse.
• Base Case Two: if all records have exactly the same set of input attributes, don't recurse.
Base Cases: An Idea
• Base Case One: if all records in the current data subset have the same output, don't recurse.
• Base Case Two: if all records have exactly the same set of input attributes, don't recurse.
• Proposed Base Case 3: if all attributes have zero information gain, don't recurse.
Is this a good idea?
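The usual answer is no; the classic counterexample is XOR, where every single attribute has zero information gain at the root even though a two-level tree classifies the data perfectly:

\[
\begin{array}{cc|c}
a & b & y = a \oplus b\\\hline
0 & 0 & 0\\
0 & 1 & 1\\
1 & 0 & 1\\
1 & 1 & 0
\end{array}
\qquad
I(a;y) = I(b;y) = H(y) - H(y\mid a) = 1 - 1 = 0.
\]

Proposed Base Case 3 would therefore stop before splitting at all.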
Pruning
What should we do?
Pruning Decision Trees
• Stop growing the tree early, or
• Build the full decision tree as before, and when you can grow it no more, start to prune:
  • Reduced-error pruning
  • Rule post-pruning
Reduced-Error Pruning
• Split the data into a training set and a validation set
• Build a full decision tree over the training set
• Keep removing the node whose removal most increases validation-set accuracy
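A minimal sketch of reduced-error pruning (assumptions: the tree is a nested dict in which every internal node stores the attribute it splits on, its children, and its majority class under a "majority" key, a small extension of the earlier sketch; `predict`, `accuracy`, `internal_nodes`, and `prune_at` are illustrative helper names, not an established API):

```python
import copy

def predict(tree, record):
    """Walk the nested-dict tree down to a leaf label; fall back to the node's
    majority class when the record has an attribute value the tree never saw."""
    while isinstance(tree, dict):
        tree = tree["children"].get(record[tree["split_on"]], tree["majority"])
    return tree

def accuracy(tree, records, labels):
    """Fraction of (record, label) pairs the tree classifies correctly."""
    return sum(predict(tree, r) == y for r, y in zip(records, labels)) / len(labels)

def internal_nodes(tree, path=()):
    """Yield the path (sequence of child values) leading to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from internal_nodes(child, path + (value,))

def prune_at(tree, path):
    """Return a copy of the tree with the internal node at `path` collapsed
    into a leaf that predicts that node's majority class."""
    pruned = copy.deepcopy(tree)
    if not path:
        return pruned["majority"]          # pruning the root leaves a single leaf
    node = pruned
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = node["children"][path[-1]]["majority"]
    return pruned

def reduced_error_prune(tree, val_records, val_labels):
    """Greedily collapse the node whose removal most improves (or at least does
    not hurt) validation accuracy; stop when every removal hurts."""
    while isinstance(tree, dict):
        base = accuracy(tree, val_records, val_labels)
        scored = [(accuracy(prune_at(tree, p), val_records, val_labels), p)
                  for p in internal_nodes(tree)]
        best_acc, best_path = max(scored, key=lambda s: s[0])
        if best_acc < base:
            break
        tree = prune_at(tree, best_path)
    return tree
```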
Rule Post-Pruning
• Convert the tree into rules, one rule per root-to-leaf path
• Prune each rule by removing preconditions that do not hurt its estimated accuracy
• Sort the final rules by their estimated accuracy
The most widely used method (e.g., in C4.5). Other methods: statistical significance tests (chi-square).
Real-Valued Inputs
• How should we deal with real-valued inputs?
Information Gain
• x: a real-valued input
• t: a split value (threshold)
• Find the split value t such that the mutual information I(x, y : t) between x and the class label y is maximized.
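A minimal sketch of that search in Python (illustrative, not the C4.5 implementation): it tries the midpoint between every pair of consecutive sorted values of x and keeps the threshold with the highest gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_value(x, y):
    """Return (t, gain): the threshold for the split x < t vs. x >= t that
    maximizes the mutual information I(x, y : t)."""
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    xs = [v for v, _ in pairs]
    n = len(pairs)
    base = entropy(y)
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                       # no decision boundary between equal values
        t = (xs[i - 1] + xs[i]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```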
Conclusions
• Decision trees are the single most popular data mining tool
  • Easy to understand
  • Easy to implement
  • Easy to use
  • Computationally cheap
• It's possible to get into trouble with overfitting
• They do classification: predict a categorical output from categorical and/or real-valued inputs
Software
• The most widely used decision tree learner is C4.5 (and its successor, C5.0)
• Source code and a tutorial: http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html