Explore techniques for building decision trees, understanding splits, computing information gain, dealing with highly branching attributes, and strategies for branching and discretization in tree-based learning. Discover best practices for achieving optimal results in decision-making processes.
Machine Learning in Practice Lecture 17 Carolyn Penstein Rosé Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day • Announcements • Questions? • Quiz • Assignment 7 • Progress on Term Projects? • Tree Based Learning
Thinking about optimization… Foreshadowing… Optimal Solution Locally Optimal Solution
Building Decision Trees • 3 main types of Splits • Nominal: split all values • [red,green,blue] [red] [green] [blue] • Nominal: binary split • [red,green,blue] [red] [green,blue] • Numeric split • [5-25] [<=10] [>10]
How do trees select an attribute? • Use a measure of “purity” to get the cleanest split • Given the mix of instances (with respect to the target classification), how many bits of information would be required to determine the correct class for each instance? • The value of a feature can be computed in terms of the average gain in “purity” that splitting on that feature would give you • Remember that the gain for numeric features depends on where you split! [Figure: a node containing Yes yes no no yes no yes no is split by “?” into Yes yes yes no and No no no yes]
Computing Information • We have 4 yes’s and 4 no’s (Yes yes no no yes no yes no) • Info[4,4] = (-4/8)*log2(4/8) + (-4/8)*log2(4/8) = 1 • We have 5 yes’s and 1 no (Yes yes no yes yes yes) • Info[5,1] = (-5/6)*log2(5/6) + (-1/6)*log2(1/6) = .65
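The Info computation on this slide can be sketched in a few lines of Python. The function name `info` simply mirrors the slide's notation and is an illustrative choice, not part of any library.

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Skip zero counts: 0 * log2(0) is taken to be 0 by convention.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(info([4, 4]), 2))  # 4 yes, 4 no -> 1.0
print(round(info([5, 1]), 2))  # 5 yes, 1 no -> 0.65
```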
Computing Information Gain • Gain = Info[4,4] - Info([3,1],[1,3]) = 1 - (.5*.81 + .5*.81) = .19 • You can compute a separate gain score for every possible split • So some attributes can have more than one possible gain score, depending on the split [Figure: parent node Yes yes no no yes no yes no (1 bit) split by “?” into Yes yes yes no (.81 bits) and No no no yes (.81 bits)]
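As a rough sketch, the gain for a split is the parent's information minus the size-weighted average information of the daughter nodes. The helper names `info` and `gain` are illustrative assumptions that follow the slide's notation.

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    """Information gain of splitting `parent` counts into `children` counts."""
    total = sum(parent)
    # Weight each daughter node by the fraction of instances it receives.
    remainder = sum(sum(child) / total * info(child) for child in children)
    return info(parent) - remainder

# The slide's example: [4 yes, 4 no] split into [3,1] and [1,3].
print(round(gain([4, 4], [[3, 1], [1, 3]]), 2))  # 0.19
```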
General Problem with Highly Branching Attributes • Highly branching attributes look good because they split the data into the smallest subsets, which are more likely to be skewed in one direction or the other • In the extreme case, each instance is in a different set, so accuracy on training set is 100% • But what will it probably be on test set?
General Problem with Highly Branching Attributes • Gain ratio is a normalized version of Information Gain • Takes into account number and size of daughter nodes – penalizes highly branching attributes • Sometimes Gain Ratio over-compensates • You can compensate for this over-compensation by considering both Gain Ratio and Information Gain • Tweaking formulas like this is a major part of research in machine learning!!!
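The slide does not give the formula, so the sketch below assumes the usual definition of gain ratio: information gain divided by the information of the split itself (the entropy of the daughter-node sizes). The function names are illustrative.

```python
from math import log2

def info(counts):
    """Entropy in bits of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    total = sum(parent)
    return info(parent) - sum(sum(ch) / total * info(ch) for ch in children)

def gain_ratio(parent, children):
    """Gain normalized by split information (assumed standard definition)."""
    split_info = info([sum(ch) for ch in children])  # depends only on subset sizes
    if split_info == 0:  # everything went down one branch; nothing was split
        return 0.0
    return gain(parent, children) / split_info

binary = [[3, 1], [1, 3]]
one_per_instance = [[1, 0]] * 4 + [[0, 1]] * 4  # extreme highly branching case
print(gain([4, 4], one_per_instance), gain_ratio([4, 4], one_per_instance))
print(gain([4, 4], binary), gain_ratio([4, 4], binary))
```

With one instance per branch the raw gain is a full bit, but dividing by a split information of 3 bits cuts the score to a third, while the binary split's score is unchanged because its split information is exactly 1 bit.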
Branching/Discretization Where should we branch? • 65 68 69 70 71 72 72 75 75 80 81 83 85 • Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
Branching/Discretization • Do you pick a different majority class for each subset after the split? • Do you get a better accuracy after the split than before? • If you answered no to either question, then try a different split Where should we branch? • 65 68 69 70 71 72 72 75 75 80 81 83 85 • Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
Branching/Discretization • On this data we can do a binary split between 70 and 71 • Note that deciding exactly where to split is much trickier when you have a real-valued attribute! • With this split we get 71% accuracy • Better than always predicting the majority class, which gives 64% Predict Yes Predict No • 65 68 69 70 71 72 72 75 75 80 81 83 85 • Yes No Yes Yes Yes No No Yes No No No Yes Yes No
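The two questions asked of a candidate split (different majority class on each side, and better accuracy than before) can be checked mechanically. The sketch below uses a small made-up dataset rather than the slide's data, and the function names are hypothetical.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the overall majority class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def split_accuracy(values, labels, threshold):
    """Accuracy when each side of the threshold predicts its own majority class."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    correct = sum(Counter(side).most_common(1)[0][1] for side in (left, right) if side)
    return correct / len(labels)

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: split_accuracy(values, labels, t))

# Hypothetical toy data, not the data from the slide.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
t = best_threshold(values, labels)
print(t, split_accuracy(values, labels, t), baseline_accuracy(labels))
```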
Branching/Discretization • We could insert boundaries wherever there are shifts in categories • If this were possible, our accuracy (at least on the training set) would be 100% • But that doesn’t work if the same value is assigned two different categories • One way of resolving it would be to use adjacent instances as tie-breakers. 71 is closest to 72, and it’s a No. Where should we branch? • 65 68 69 70 71 72 72 75 75 80 81 83 85 • Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
Branching/Discretization • Accuracy on training set is still high: 93% Where should we branch? • 65 68 69 70 71 72 72 75 75 80 81 83 85 • Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
Branching/Discretization • Large number of splits creates a many-branching attribute, which is prone to over-fitting • Note that discretization based on the training data this way is a supervised filter! • A simpler discretization is more likely to generalize Where should we branch? • 65 68 69 70 71 72 72 75 75 80 81 83 85 • Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
Branching/Discretization • What do we do with missing values? • 1R treats a missing value as a separate nominal value (after the discretization) • The majority of cases where this attribute is unknown are Yes, so the missing value will be associated with the Yes class Where should we branch? • 65 68 69 ? 71 72 72 ? 75 80 ? 83 85 • Yes No Yes Yes Yes No No Yes No No No Yes Yes No
Branching/Discretization • Other possibilities: assigning the majority class over the whole set, assigning the majority value of that attribute in the case of an unknown value, assigning the mean value (in the case of numeric attributes) Where should we branch? • 65 68 69 ? 71 72 72 ? 75 80 ? 83 85 • Yes No Yes Yes Yes No No Yes No No No Yes Yes No
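Two of these alternatives, mean substitution for a numeric attribute and majority-value substitution for a nominal one, might look like the sketch below. Missing values are represented as `None`, the function names are illustrative, and the nominal example values are made up.

```python
from collections import Counter

def fill_with_mean(values):
    """Replace missing numeric values (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_with_majority(values):
    """Replace missing nominal values (None) with the most frequent known value."""
    known = [v for v in values if v is not None]
    majority = Counter(known).most_common(1)[0][0]
    return [majority if v is None else v for v in values]

# The attribute values as they appear on the slide, with ? written as None.
temps = [65, 68, 69, None, 71, 72, 72, None, 75, 80, None, 83, 85]
print(fill_with_mean(temps))

# Hypothetical nominal attribute.
print(fill_with_majority(["sunny", None, "rainy", "sunny"]))
```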
What is pruning? • Minimum description length principle: the simplest model that fits your data will generalize better • Complex models are more likely to over-fit • Decision trees can become complex when you try to optimize performance on training data • Simplifying the tree will make performance on training data go down, but may lead to increases in performance on the test data
Prepruning versus Postpruning • Prepruning: knowing when to stop growing a tree • Hard to do because sometimes the value of a feature doesn’t become clear until lower down on the tree • Interactions between features • You always have to have a stopping criterion, so in some sense you are doing prepruning • But in practice you over-shoot and then do post-pruning
Prepruning versus Postpruning • Postpruning: simplifying a tree after it is built • Easier because hindsight is 20/20 • Less efficient because you might have done a lot of work that you’re going to throw away now
Subtree Replacement * Most common form of pruning. Simple Tree Original Tree
Subtree Raising * Less common form of pruning, not always worth it. Usually restricted to use on most popular branch.
How do we decide when/where/how much to prune? • If we can estimate the error rate at each node of both the original and resulting tree (after pruning), we can make our decision rationally • Won’t work to use the training set to compute the error rate because the original tree was optimized over this set already • You can hold back some of the training data for a validation set to use for pruning • This is called Reduced Error Pruning • The downside is that you train on less data
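A minimal sketch of pruning against a held-out validation set, using subtree replacement. The tree representation (nested dicts with `attr`, `majority`, and `children` keys, and plain strings as leaves) is an assumption made for illustration and does not correspond to any particular toolkit.

```python
def predict(tree, instance):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        # Unseen attribute values fall back to the node's majority class.
        tree = tree["children"].get(instance[tree["attr"]], tree["majority"])
    return tree

def errors(tree, data):
    """Number of (instance, label) pairs the tree misclassifies."""
    return sum(predict(tree, x) != y for x, y in data)

def prune(tree, data):
    """Bottom-up subtree replacement using held-out (instance, label) pairs."""
    if not isinstance(tree, dict):
        return tree
    for value, child in list(tree["children"].items()):
        subset = [(x, y) for x, y in data if x[tree["attr"]] == value]
        tree["children"][value] = prune(child, subset)
    leaf = tree["majority"]
    # Replace the subtree when a single leaf does at least as well on the
    # held-out data (this includes the case where no held-out data reaches it).
    return leaf if errors(leaf, data) <= errors(tree, data) else tree

# Hypothetical tree and validation set.
tree = {"attr": "outlook", "majority": "yes", "children": {
    "sunny": {"attr": "windy", "majority": "no",
              "children": {"true": "no", "false": "yes"}},
    "rainy": "yes"}}
validation = [({"outlook": "sunny", "windy": "true"}, "no"),
              ({"outlook": "sunny", "windy": "false"}, "no"),
              ({"outlook": "rainy", "windy": "false"}, "yes")]
pruned = prune(tree, validation)
print(pruned)
```

In this toy case the `windy` test under `sunny` is collapsed into a single "no" leaf because it makes one more validation error than the leaf does, while the root split survives.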
How do we decide when/where/how much to prune? • The alternative is to make an estimate of the error based on the training data • Similar principle to computing confidence intervals • Remember that we set c, which is a constant indicating the level of confidence we want (default is 25%) • Lowering the confidence value causes more pruning • We use the estimate “pessimistically” by using the upper confidence limit on the error estimate
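One way to sketch the pessimistic estimate is the upper limit of a normal-approximation confidence interval around the observed training error rate. The formula below is that general approximation, not necessarily what any specific implementation computes, and the function name and example counts are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def pessimistic_error(num_errors, n, c=0.25):
    """Upper confidence limit on the error rate at a node.

    num_errors: training errors at the node; n: training instances at the node;
    c: confidence value in (0, 1), where smaller c gives a more pessimistic bound.
    """
    f = num_errors / n                  # observed error rate
    z = NormalDist().inv_cdf(1 - c)     # one-sided normal deviate for c
    numerator = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)

# Hypothetical node: 2 errors out of 6 training instances.
print(round(pessimistic_error(2, 6, c=0.25), 2))
print(round(pessimistic_error(2, 6, c=0.10), 2))
```

Lowering `c` raises the estimated error at every node, and small nodes are inflated the most, so deep subtrees look worse relative to a single leaf and more of them get pruned.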
Thinking about the Confidence Factor • The error at lower levels will normally look lower than the error at higher levels • Are we confident enough about the difference we see to risk the effect of the added complexity? • So look at the upper end of the confidence interval around the error rate • If I’m pessimistic and it still looks lower, then it probably is lower [Figure: example tree with class counts [10A,10B], [5A,1B], [5A,10B], [5A,2B], [0A,7B]]
Confidence Factors vs Confidence Intervals • Confidence factor = .25 means you are 75% sure the error rate is within the interval • Confidence factor = .10 means you are 90% sure the error rate is within the interval
Thinking about the Confidence Factor • Higher confidence factors mean smaller confidence intervals • Smaller confidence intervals mean it’s easier to conclude that the difference you see is real • Thus, lower confidence values make it more likely that pruning seems safe • With a wider interval, you can’t conclude that the increase in accuracy from the extra branches is real [Figure: example tree with class counts [10A,10B], [5A,1B], [5A,10B], [5A,2B], [0A,7B]]
Take Home Message • Tree based learners are “divide and conquer learners” • Global maximization of performance on each iteration • Three main types of splits • It will pick the one that gives it the greatest information gain • Pruning simplifies the learned tree • Reduces performance on the training data • Increases performance on the testing data • Use performance on a validation set or confidence intervals over error rate to decide how much to prune