Learn about the key components of decision trees and the importance of pruning in data analysis. Explore stopping criteria, handling missing values, and the process of tree pruning. Discover how to balance simplicity and accuracy in tree selection.
CART (Classification and Regression Tree): Key Parts of Tree-Structured Data Analysis • Tree growing • Splitting rules to generate tree • Stopping criteria: how far to grow? • Missing values: using surrogates (substitutes) • Tree pruning • Trimming off parts of the tree that don’t work • Ordering the nodes of a large tree by contribution to tree accuracy … which nodes come off first? • Optimal tree selection • Deciding on the best tree after growing and pruning • Balancing simplicity against accuracy
Stopping criteria for growing the tree • All instances in the node belong to the same class • The maximum tree depth has been reached • Size of the data in the node is below a threshold (e.g. 5% of the original dataset) • The best splitting criterion is below a threshold • …
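The stopping criteria above can be sketched as a single check run at each node before splitting. This is a minimal illustration, not CART's actual implementation: the function name `should_stop` and the default values `max_depth=5`, `min_fraction=0.05`, and `min_gain=0.01` are assumptions chosen for the example.

```python
def should_stop(labels, depth, n_total, best_gain,
                max_depth=5, min_fraction=0.05, min_gain=0.01):
    """Return True if the node should become a leaf.

    labels    : class labels of the instances in this node
    depth     : depth of this node (root = 0)
    n_total   : size of the original dataset
    best_gain : value of the best splitting criterion found for this node
    All default thresholds are hypothetical, for illustration only.
    """
    if len(set(labels)) <= 1:                  # all instances in the same class
        return True
    if depth >= max_depth:                     # maximum tree depth reached
        return True
    if len(labels) < min_fraction * n_total:   # node too small (e.g. < 5% of data)
        return True
    if best_gain < min_gain:                   # best split is below the threshold
        return True
    return False


# A mixed node with 40 of 100 instances and a useful split keeps growing.
print(should_stop(["a", "b"] * 20, depth=1, n_total=100, best_gain=0.5))
```

Any one criterion is enough to stop, so the order of the checks only matters for speed.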
How to Address Overfitting • Pre-Pruning (Early Stopping Rule) • Stop the algorithm before it becomes a fully-grown tree • Typical stopping conditions for a node: • Stop if all instances belong to the same class • Stop if all the attribute values are the same • More restrictive conditions: • Stop if number of instances is less than some user-specified threshold • Stop if class distribution of instances is independent of the available features (e.g., using χ² test) • Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
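The last pre-pruning condition (no improvement in impurity) can be made concrete with the Gini index. The sketch below computes Gini impurity and the decrease obtained by a binary split; the function names `gini` and `gini_decrease` are illustrative, and the toy labels are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


parent = ["a", "a", "b", "b"]
# A split that separates the classes removes all impurity ...
print(gini_decrease(parent, ["a", "a"], ["b", "b"]))
# ... while a split that leaves both children mixed improves nothing,
# so a pre-pruning rule would stop here.
print(gini_decrease(parent, ["a", "b"], ["a", "b"]))
```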
How to Address Overfitting… • Post-pruning • Grow decision tree to its entirety • Trim the nodes of the decision tree in a bottom-up fashion • If generalization error improves after trimming, replace sub-tree by a leaf node. • Class label of leaf node is determined from majority class of instances in the sub-tree • Can use the minimum description length (MDL) principle for post-pruning
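One simple way to realize bottom-up post-pruning is to estimate generalization error on held-out data and collapse any sub-tree that does no better than a single leaf. The sketch below assumes a hypothetical tree representation (nested dicts with `feature`, `threshold`, `majority`, `left`, `right`, or `leaf` keys) and lets ties go to the simpler tree; neither choice comes from the slides.

```python
def predict(node, x):
    """Follow the splits until a leaf is reached."""
    while "leaf" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def errors(node, data):
    """Number of held-out (x, y) pairs the (sub-)tree misclassifies."""
    return sum(predict(node, x) != y for x, y in data)


def prune(node, data):
    """Prune bottom-up using the held-out pairs that reach this node."""
    if "leaf" in node:
        return node
    f, t = node["feature"], node["threshold"]
    left_data = [(x, y) for x, y in data if x[f] <= t]
    right_data = [(x, y) for x, y in data if x[f] > t]
    # Children are pruned first, so trimming proceeds from the bottom up.
    node = dict(node, left=prune(node["left"], left_data),
                right=prune(node["right"], right_data))
    # Candidate leaf labelled with the majority class of the sub-tree.
    leaf = {"leaf": node["majority"]}
    # Assumption: on a tie, prefer the leaf (the simpler tree).
    return leaf if errors(leaf, data) <= errors(node, data) else node


tree = {
    "feature": 0, "threshold": 0.5, "majority": "a",
    "left": {"leaf": "a"},
    "right": {"feature": 1, "threshold": 0.5, "majority": "b",
              "left": {"leaf": "b"}, "right": {"leaf": "a"}},
}
held_out = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "b"),
            ((1, 1), "b"), ((1, 1), "b")]
pruned = prune(tree, held_out)
print(pruned["right"])  # the lower split is replaced by a leaf
```

In this toy tree the lower split misclassifies two held-out cases while a single "b" leaf misclassifies none, so it is trimmed; the root split is kept because collapsing it would add errors.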
CART Pruning Method: Grow Full Tree, Then Prune • You will never know when to stop . . . so don’t! • Instead . . . grow trees that are obviously too big • Largest tree grown is called “maximal” tree • Maximal tree could have hundreds or thousands of nodes • usually instruct CART to grow only moderately too big • rule of thumb: should grow trees about twice the size of the truly best tree • This becomes first stage in finding the best tree • Next we will have to get rid of the parts of the overgrown tree that don’t work (not supported by test data)
Tree Pruning • Take a very large tree (“maximal” tree) • Tree may be radically over-fit • Tracks all the idiosyncrasies of THIS data set • Tracks patterns that may not be found in other data sets • At bottom of tree splits based on very few cases • Analogous to a regression with very large number of variables • PRUNE away branches from this large tree • But which branch to cut first? • CART determines a pruning sequence: • the exact order in which each node should be removed • pruning sequence determined for EVERY node • sequence determined all the way back to root node
Order of Pruning: Weakest Link Goes First • Prune away "weakest link" — the nodes that add least to overall accuracy of the tree • contribution to overall tree a function of both increase in accuracy and size of node • accuracy gain is weighted by share of sample • small nodes tend to get removed before large ones ! • If several nodes have same contribution they all prune away simultaneously • Hence more than two terminal nodes could be cut off in one pruning • Sequence determined all the way back to root node • need to allow for possibility that entire tree is bad • if target variable is unpredictable we will want to prune back to root . . . the no model solution
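The slide does not give a formula for a node's contribution. A common formalization, assumed here, is the cost-complexity measure from the CART literature: g(t) = (R(t) − R(T_t)) / (|T_t| − 1), where R(t) is the error of node t if it were made a leaf (weighted by its share of the sample), R(T_t) is the error of the sub-tree rooted at t, and |T_t| is its number of terminal nodes. The node names and numbers below are invented for illustration.

```python
def link_strength(r_node, r_subtree, n_leaves):
    """Accuracy gained per extra terminal node by keeping the sub-tree."""
    return (r_node - r_subtree) / (n_leaves - 1)


def weakest_links(nodes, tol=1e-12):
    """Return every internal node whose strength equals the minimum.

    nodes maps a node name to (R(t), R(T_t), number of terminal nodes).
    Ties are returned together, since tied nodes prune away simultaneously.
    """
    g = {name: link_strength(*vals) for name, vals in nodes.items()}
    g_min = min(g.values())
    return sorted(name for name, value in g.items() if value - g_min <= tol)


# Hypothetical internal nodes of a small tree.
nodes = {
    "t1": (0.50, 0.20, 4),   # g = 0.30 / 3
    "t2": (0.20, 0.10, 2),   # g = 0.10 / 1
    "t3": (0.15, 0.10, 2),   # g = 0.05 / 1, the smallest
}
print(weakest_links(nodes))
```

Repeating this step on the pruned tree, and recomputing the strengths each time, yields the full pruning sequence back to the root.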
Pruning Sequence Example • [Figure: the tree at four points in the pruning sequence, with 24, 21, 20, and 18 terminal nodes]
Now we test every tree in the pruning sequence • Take a test data set and drop it down the largest tree in the sequence and measure its predictive accuracy • how many cases right and how many wrong • measure accuracy overall and by class • Do same for 2nd largest tree, 3rd largest tree, etc • Performance of every tree in sequence is measured • Results reported in table and graph formats • Note that this critical stage is impossible to complete without test data • CART procedure requires test data to guide tree evaluation
Training Data Vs. Test Data Error Rates • Compare error rates measured by • learn data • large test set • Learn R(T) always decreases as tree grows (Q: Why?) • Test R(T) first declines then increases (Q: Why?) • Overfitting is the result of too much reliance on learn R(T) • Can lead to disasters when applied to new data

No. Terminal Nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
9                    .32    .34
7                    .41    .47
6                    .46    .54
5                    .53    .61
2                    .75    .82
1                    .86    .91
(** marks the tree with the lowest test error rate)
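The comparison in the table can be reproduced in a few lines. The sketch below copies the table's values, checks that the learn error never rises as the tree grows, and picks the tree with the lowest test error; the variable names are illustrative.

```python
# (terminal nodes, learn error R(T), test error Rts(T)), copied from the table
rows = [
    (71, 0.00, 0.42), (63, 0.00, 0.40), (58, 0.03, 0.39), (40, 0.10, 0.32),
    (34, 0.12, 0.32), (19, 0.20, 0.31), (10, 0.29, 0.30), (9, 0.32, 0.34),
    (7, 0.41, 0.47), (6, 0.46, 0.54), (5, 0.53, 0.61), (2, 0.75, 0.82),
    (1, 0.86, 0.91),
]

# Learn error never increases as the number of terminal nodes grows.
by_size = sorted(rows)  # smallest tree first
learn_monotone = all(a[1] >= b[1] for a, b in zip(by_size, by_size[1:]))

# Test error is what identifies the best tree in the sequence.
best_nodes, best_learn, best_test = min(rows, key=lambda r: r[2])

print(learn_monotone)          # True
print(best_nodes, best_test)   # 10 0.3
```

Choosing by learn error alone would select the 71-node tree, whose test error (.42) is well above the 10-node tree's (.30).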
Why look at training data error rates (or cost) at all? • First, provides a rough guide of how you are doing • Truth will typically be WORSE than training data measure • If the tree performs poorly on training data, you may not want to pursue it further • Training data error rate more accurate for smaller trees • So reasonable guide for smaller trees • Poor guide for larger trees • At optimal tree training and test error rates should be similar • if not something is wrong • useful to compare not just overall error rate but also within node performance between training and test data
CART: Optimal Tree • Within a single CART run which tree is best? • Process of pruning the maximal tree can yield many sub-trees • Test data set or cross-validation measures the error rate of each tree • Current wisdom: select the tree with smallest error rate • Only drawback: minimum may not be precisely estimated • Typical error rate as a function of tree size has flat region • Minimum could be anywhere in this region
In what sense is the optimal tree “best”? • Optimal tree has lowest or near lowest cost as determined by a test procedure • Tree should exhibit very similar accuracy when applied to new data • BUT Best Tree is NOT necessarily the one that happens to be most accurate on a single test database • trees somewhat larger or smaller than “optimal” may be preferred • Room for user judgment • judgment not about split variable or values • judgment as to how much of tree to keep • determined by story tree is telling • willingness to sacrifice a small amount of accuracy for simplicity
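One way to express the trade of a little accuracy for simplicity is to accept the smallest tree whose test error is within a chosen tolerance of the minimum. This is only a sketch of that idea, not a rule stated in the slides: the function `simplest_within` and the tolerance values are hypothetical, and the test errors are the Rts(T) values from the error-rate table, stored in hundredths to avoid floating-point comparisons.

```python
# terminal nodes -> test error in hundredths (Rts(T) values from the table)
test_error = {71: 42, 63: 40, 58: 39, 40: 32, 34: 32, 19: 31,
              10: 30, 9: 34, 7: 47, 6: 54, 5: 61, 2: 82, 1: 91}


def simplest_within(test_error, tolerance):
    """Smallest tree whose test error is within `tolerance` of the minimum.

    tolerance is in hundredths and is a user judgment, not a CART default.
    """
    limit = min(test_error.values()) + tolerance
    return min(size for size, err in test_error.items() if err <= limit)


print(simplest_within(test_error, 0))  # 10: the strict minimum
print(simplest_within(test_error, 4))  # 9: one node fewer for .04 more error
```

With this table the flat region lies on the large-tree side (10 to 40 nodes all sit within .02 of the minimum), so a small tolerance changes nothing and a larger one buys only one fewer node.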
Decision Tree Summary • Decision Trees • splits – binary, multi-way • split criteria – entropy, gini, … • missing value treatment • pruning • rule extraction from trees • Both C4.5 and CART are robust tools • No method is always superior – experiment! (Witten & Eibe)