Understand decision trees, overfitting, bias-variance tradeoff, pruning techniques, regression trees, ensemble learners, boosting, bagging, and random forests. Learn to divide datasets for training and testing sets and ways to handle overfitting with pruning to create robust prediction models.
Decision Trees II CSC 576: Data Mining
Today • Decision Trees • Overfitting / Underfitting • Bias-Variance Tradeoff • Pruning • Regression Trees • Ensemble Learners • Boosting • Bagging • Random Forests
Training Set vs. Test Set • Overall dataset is divided into: • Training set – used to build model • Test set – evaluates model • (sometimes a Validation set is also used; more later)
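A minimal sketch of dividing a dataset into a training set and a test set. The function name `train_test_split`, the 30% test fraction, and the toy `records` list are assumptions chosen for illustration; the slides do not prescribe a split ratio.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))              # hypothetical dataset of 10 records
train, test = train_test_split(records, test_fraction=0.3)
print(len(train), "training records,", len(test), "test records")
```

The model is built only from `train`; `test` is held back so that the error rate is measured on records the model has not seen.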
Problem • Error rates on training set vs. testing set might be drastically different. • No guarantee that the model with the smallest training error rate will have the smallest testing error rate
Overfitting • Overfitting: occurs when model “memorizes” the training set data • very low error rate on training data • yet, high error rate on test data • Model does not generalize to the overall problem • This is bad! We wish to avoid overfitting.
Bias and Variance • Bias: the error introduced by modeling a real-life problem (usually extremely complicated) by a much simpler problem • The more flexible (complex) a method is, the less bias it will generally have. • Variance: how much the learned model would change if the training set were different • Does changing a few observations in the training set dramatically affect the model? • Generally, the more flexible (complex) a method is, the more variance it has.
Data point observations created by: Y = f(X) + ε • Example: we wish to build a model that separates the dark-colored points from the light-colored points. • Black line is a simple, linear model • Currently, some classification error • Low variance • Bias present
More complex model (curvy line instead of linear) Zero classification error for these data points • No linear model bias • Higher Variance?
More data has been added. Re-train both models (linear line, and curvy line) in order to minimize error rate • Variance: • Linear model doesn’t change much • Curvy line significantly changes Which model is better?
Model Overfitting • Errors committed by a classification model are generally divided into: • Training errors: misclassifications on training set records • Generalization errors (testing errors): errors made on the testing set / previously unseen instances • A good model has low training error and low generalization error. • Overfitting: model has a low training error rate, but a high generalization error rate
Model Underfitting and Overfitting • When tree is small: • Underfitting • Large training error rate • Large testing error rate • Structure of data isn’t yet learned • When tree gets too large: • Beware of overfitting • Training error rate decreases while testing error rate increases • Tree is too complex • Tree “almost perfectly fits” training data, but doesn’t generalize to testing examples • Figure regions: model underfitting (small trees), model overfitting (large trees)
Additional Reasons for Overfitting • Presence of noise • Lack of representative samples
Training Set Two training records are mislabeled. Tree perfectly fits training data.
Testing Set Test error rate: 30% • Reasons for misclassifications: • Mislabeled records in training data • “Exceptional case” • Unavoidable • Minimal error rate achievable by any classifier
Training error rate: 0% • Test error rate: 30% • overfitting • Training error rate: 20% • Test error rate: 10%
Overfitting and Decision Trees • The likelihood of overfitting increases as a tree gets deeper • because the resulting classifications are based on smaller and smaller subsets of the full training dataset • Overfitting often involves splitting the data on an irrelevant feature.
Pruning: Handling Overfitting in Decision Trees • Tree pruning identifies and removes subtrees within a decision tree that are likely to be due to noise and sample variance in the training set used to induce it • Pruning will result in decision trees that are not consistent with the training set used to build them. • But we are more interested in creating prediction models that generalize well to new data! • Pre-pruning (Early Stopping) • Post-pruning
Pre-pruning Techniques • Stop creating subtrees when: • the number of instances in a partition falls below a threshold • the information gain measured at a node is not deemed sufficient to make partitioning the data worthwhile • the depth of the tree goes beyond a predefined limit • … other more advanced approaches • Benefits: Computationally efficient; works well for small datasets. • Downsides: Stopping too early will fail to create the most effective trees.
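The three stopping criteria can be combined into a single check, sketched here. The function name `should_stop` and the default thresholds (`min_instances=5`, `min_gain=0.01`, `max_depth=4`) are hypothetical values for illustration, not recommendations from the slides.

```python
def should_stop(n_instances, info_gain, depth,
                min_instances=5, min_gain=0.01, max_depth=4):
    """Return True if any pre-pruning (early stopping) criterion fires."""
    too_few = n_instances < min_instances   # partition is too small
    weak_split = info_gain < min_gain       # split is not worthwhile
    too_deep = depth >= max_depth           # depth limit reached
    return too_few or weak_split or too_deep

# A large partition with a useful split near the root keeps growing ...
print(should_stop(n_instances=40, info_gain=0.30, depth=1))
# ... while a tiny partition is turned into a leaf.
print(should_stop(n_instances=3, info_gain=0.30, depth=1))
```

A tree-growing routine would call such a check before each split and create a leaf whenever it returns True.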
Post-pruning • Decision tree is initially grown to its maximum size • Then each branch is examined • Branches that are deemed likely to be due to overfitting are pruned. • Post-pruning tends to give better results than pre-pruning • Which is faster? • Pre-pruning: post-pruning is more computationally expensive because the entire tree is grown first
Reduced Error Pruning • Starting at the leaves, each node is replaced with its most popular class. • If the accuracy is not affected, then the change is kept. • Evaluate accuracy on a validation set • Set aside some of the training set as a validation set • Advantages: simplicity and speed
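Reduced error pruning can be sketched on a tiny hand-built tree. The dictionary-based tree layout (`feature`, `threshold`, `left`, `right`, `majority`, `label`), the example tree, and the validation records are all assumptions made for this illustration; "accuracy is not affected" is interpreted here as "validation accuracy does not decrease".

```python
def predict(node, x):
    """Walk from the given node down to a leaf and return its label."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def reduced_error_prune(tree, node, validation):
    """Bottom-up: try replacing each internal node with its majority class."""
    if "label" in node:
        return
    reduced_error_prune(tree, node["left"], validation)
    reduced_error_prune(tree, node["right"], validation)
    before = accuracy(tree, validation)
    saved = dict(node)                  # shallow copy so the change can be undone
    node.clear()
    node["label"] = saved["majority"]   # node becomes a leaf
    if accuracy(tree, validation) < before:
        node.clear()
        node.update(saved)              # pruning hurt accuracy: restore subtree

# Hypothetical tree: the right subtree's last split fits a noisy training record.
tree = {
    "feature": "x", "threshold": 5, "majority": "A",
    "left": {"label": "A"},
    "right": {"feature": "x", "threshold": 8, "majority": "B",
              "left": {"label": "B"}, "right": {"label": "A"}},
}
validation = [({"x": 2}, "A"), ({"x": 4}, "A"), ({"x": 6}, "B"), ({"x": 9}, "B")]

print("accuracy before pruning:", accuracy(tree, validation))
reduced_error_prune(tree, tree, validation)
print("accuracy after pruning: ", accuracy(tree, validation))
```

In this example the lower split is replaced by a leaf, while the root split is kept because removing it would lower validation accuracy.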
Post-Pruning Example Example validation set: • Induced decision tree from training data • Need to prune?
Post-Pruning Example • Pruned nodes in black
Occam’s Razor General Principle (Occam’s Razor): given two models with same generalization (testing) errors, the simpler model is preferred over the more complex model • Additional components in a more complex model have greater chance at being fitted purely by chance Problem solving principle by philosopher William of Ockham (1287-1347)
Advantages of Pruning • Smaller trees are easier to interpret • Increased generalization accuracy.
Regression Trees • Target attribute: • Decision (classification) trees: qualitative • Regression trees: continuous • Decision trees: reduce the entropy in each subtree • Regression trees: reduce the variance in each subtree • Idea: adapt the ID3 algorithm’s Information Gain measure to use variance rather than entropy as the node impurity
Regression Tree Splits • Classification trees • Gain: “goodness of the split” • larger gain => better split (better test condition) • $\text{Gain} = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(n_j)}{N}\, I(n_j)$ • I(n) = impurity measure at node n • k = number of attribute values • N(n) = total number of records at child node n • N = total number of records at parent node • Regression trees • Impurity at a node: the variance of the target values at that node • Select the feature to split on that minimizes the weighted variance across all resulting partitions
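A sketch of choosing a regression-tree split by weighted variance. It assumes the sample variance (n − 1 denominator) as the impurity measure and weights each partition by its share of the records; the slides do not state which variance estimator is used. The bike-rental style rows and the feature names `season` and `workday` are made up for illustration.

```python
def variance(values):
    """Sample variance (n - 1 denominator); 0.0 for fewer than two values."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def weighted_variance(rows, feature, target):
    """Sum over partitions of (partition size / total) * partition variance."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[feature], []).append(row[target])
    total = len(rows)
    return sum(len(p) / total * variance(p) for p in partitions.values())

def best_split_feature(rows, features, target):
    """Pick the feature whose split minimizes the weighted variance."""
    return min(features, key=lambda f: weighted_variance(rows, f, target))

# Hypothetical data: the target is a continuous count.
rows = [
    {"season": "winter", "workday": "yes", "rentals": 10},
    {"season": "winter", "workday": "no",  "rentals": 12},
    {"season": "summer", "workday": "yes", "rentals": 50},
    {"season": "summer", "workday": "no",  "rentals": 54},
]
for f in ("season", "workday"):
    print(f, weighted_variance(rows, f, "rentals"))
print("split on:", best_split_feature(rows, ["season", "workday"], "rentals"))
```

Splitting on `season` groups similar target values together, so its weighted variance is far lower than that of `workday`.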
Need to watch out for Overfitting • Want to avoid overfitting: • Early stopping criterion • Stop partitioning the dataset if the number of training instances is less than some threshold • (5% of the dataset)
Advantages and Disadvantages of Trees • Advantages • Trees are very easy to explain • Easier to explain than linear regression • Trees can be displayed graphically and interpreted by a non-expert • Decision trees may more closely mirror human decision-making • Trees can easily handle qualitative predictors • No dummy variables • Disadvantages • Trees usually do not have the same level of predictive accuracy as other data mining algorithms • But, predictive performance of decision trees can be improved by aggregating trees. • Techniques: bagging, boosting, random forests
Sampling • Common approach for selecting a subset of data objects to be analyzed • Select only some instances for the training set, instead of all of them • Sampling without replacement - as each item is sampled, it is removed from the population • Sampling with replacement - the same object/instance can be picked more than once
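The two sampling schemes map directly onto Python's standard library: `random.sample` draws without replacement and `random.choices` draws with replacement. The population of ten instance IDs and the seed are arbitrary choices for this sketch.

```python
import random

rng = random.Random(42)                 # seeded for repeatable output
population = list(range(1, 11))         # hypothetical instance IDs 1..10

# Without replacement: each instance can appear at most once.
without_replacement = rng.sample(population, k=5)

# With replacement: the same instance can be picked more than once.
with_replacement = rng.choices(population, k=10)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```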
Ensemble Methods • Currently using one single classifier induced from training data as our model, to predict the class of a test instance • What if we used multiple decision trees? • Motivation: a committee of experts working together is likely to solve a problem better than a single expert • But no “group think”: each model should make predictions independently of other models in the ensemble • In practice: methods work surprisingly well, usually greatly improve decision tree accuracy • Can also be applied to other learning algorithms
Ensemble Characteristics • Build multiple models from the same training data by creating each model on a modified version of the training data. • Make a final, ensemble prediction by aggregating the predictions of the individual models • Classification prediction: Let each model have a vote on the correct class prediction. Assign the class with the most votes. • Regression prediction: Measure of central tendency (mean or median)
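The two aggregation rules can be sketched in a few lines. The helper name `vote` and the example predictions are assumptions for illustration.

```python
from collections import Counter
from statistics import mean, median

def vote(predictions):
    """Classification: return the class predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three models for one test instance.
class_predictions = ["yes", "no", "yes"]
numeric_predictions = [3.0, 4.0, 8.0]

print("majority vote:", vote(class_predictions))
print("mean:", mean(numeric_predictions), "median:", median(numeric_predictions))
```

The median is less sensitive than the mean to a single model that predicts an extreme value.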
Rationale • How can an ensemble method improve a classifier’s performance? • Assume we have 25 binary classifiers • Each has error rate: ε= 0.35 • If all 25 classifiers are identical: • They will vote the same way on each test instance • Ensemble error rate: ε= 0.35
Rationale • How can an ensemble method improve a classifier’s performance? • Assume we have 25 binary classifiers • Each has error rate: ε = 0.35 • If all 25 classifiers are independent (errors are uncorrelated): • Ensemble method only makes a wrong prediction if more than half of the base classifiers (13 or more) predict incorrectly. • Ensemble error rate: $\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$ • 6% is much less than 35%
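The 6% figure can be checked numerically: with independent base classifiers, the number of wrong votes follows a binomial distribution, and a majority vote fails only when more than half are wrong. The function name `ensemble_error` is an assumption for this sketch.

```python
from math import comb

def ensemble_error(n_classifiers, base_error):
    """P(more than half of n independent classifiers are wrong)."""
    n, e = n_classifiers, base_error
    majority = n // 2 + 1               # 13 wrong votes out of 25
    return sum(comb(n, i) * e**i * (1 - e)**(n - i)
               for i in range(majority, n + 1))

print(round(ensemble_error(25, 0.35), 3))   # independent classifiers
print(ensemble_error(1, 0.35))              # single classifier, for comparison
```

The calculation relies on the independence assumption; correlated errors reduce the benefit.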
Rationale • Conditions necessary for an ensemble classifier to perform better than a single classifier: • Base classifiers should be independent of each other • Base classifiers should do better than a classifier doing random guessing • Example: for a two-class problem, base classifier error rate: ε < 0.5 • In practice, it is difficult to have base classifiers that are completely independent. • Ensemble methods have been shown to improve classification accuracies even when there is some correlation.
Decision Tree Model Ensembles • Boosting • Bagging • Random Forests
Boosting • Boosting works by iteratively creating models and adding them to the ensemble • Iteration stops when a predefined number of models have been added • Each new model added to the ensemble is biased to pay more attention to instances that previous models misclassified (weighted dataset).
General Boosting Algorithm • Initially instances are assigned weights of 1/N • Each is equally likely to be chosen for sample • Sample drawn with replacement: Di • Classifier induced on Di • Weights of training examples are updated: • Instances classified incorrectly have weights increased • Instances classified correctly have weights decreased
General Boosting Algorithm • During each iteration the algorithm: • Induces a model and calculates the total error, ε, by summing the weights of the training instances for which the predictions made by the model are incorrect. • Increases the weights for the instances misclassified • Decreases the weights for the instances correctly classified • Calculate a confidence factor α, for the model such that α increases as ε decreases
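One boosting round can be sketched as follows. The slides only say that α increases as ε decreases; the specific formulas used here, α = ½·ln((1 − ε)/ε) and multiplicative updates by e^{±α} followed by normalization, are the AdaBoost choices and are an assumption. The function name `boosting_round` and the five-instance example are hypothetical, and the sketch assumes 0 < ε < 1.

```python
from math import exp, log

def boosting_round(weights, correct):
    """Return (epsilon, alpha, new_weights) for one AdaBoost-style round.

    weights: instance weights that sum to 1
    correct: one bool per instance, True if the model classified it correctly
    """
    # Total error: sum of the weights of the misclassified instances.
    epsilon = sum(w for w, ok in zip(weights, correct) if not ok)
    # Confidence factor: grows as the error shrinks.
    alpha = 0.5 * log((1 - epsilon) / epsilon)
    # Raise weights of misclassified instances, lower the rest.
    updated = [w * exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return epsilon, alpha, [w / total for w in updated]

weights = [0.2] * 5                            # initial weights 1/N, N = 5
correct = [True, True, True, False, True]      # instance #4 is misclassified
epsilon, alpha, new_weights = boosting_round(weights, correct)
print("epsilon:", epsilon, "alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update the misclassified instance carries half of the total weight, so the next model is pushed to get it right.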
Boosting Example • Suppose that Instance #4 is hard to classify. • Weight for this instance will be increased in future iterations, as it gets misclassified repeatedly. • Examples not chosen in the previous round (Instances #1, #5) may also have a better chance of being selected in the next round. • Why? Predictions for them in the previous round are likely to be wrong, since the model was not trained on them. • As boosting rounds proceed, the instances that are hardest to classify become even more prevalent.
Prediction • Once the set of models has been created, the ensemble makes predictions using a weighted aggregate of the predictions made by the individual models. • The weights used in this aggregation are simply the confidence factors associated with each model.
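A sketch of the weighted aggregate for classification: each model's vote counts in proportion to its confidence factor α. The function name `weighted_vote`, the labels, and the α values are made up for illustration.

```python
from collections import defaultdict

def weighted_vote(predictions, alphas):
    """Return the class with the largest total confidence."""
    totals = defaultdict(float)
    for label, alpha in zip(predictions, alphas):
        totals[label] += alpha
    return max(totals, key=totals.get)

# Hypothetical ensemble of three models predicting one test instance.
predictions = ["+1", "-1", "-1"]
alphas = [1.2, 0.4, 0.5]
print(weighted_vote(predictions, alphas))
```

Here a single confident model outvotes two weaker ones (1.2 versus 0.9), which an unweighted majority vote would not allow.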
Boosting Algorithms • Several different boosting algorithms exist • Different by: • How weights of training instances are updated after each boosting round • How predictions made by each classifier are combined • Each boosting round produces one base classifier
Bagging • Footnote: also called bootstrap aggregating • Ensemble method that “manipulates the training set” • Action: repeatedly sample with replacement according to a uniform probability distribution • Every instance has equal chance of being picked • Some instances may be picked multiple times; others may not be chosen • Sample size: same as training set • Di: each bootstrap sample • On average, each Di will contain about 63% of the distinct instances in the original training data. • Probability of an instance being selected for Di: 1 – (1 – 1/N)^N • Converges to: 1 – 1/e ≈ 0.632 • Consequently, every bootstrap sample will be missing some of the instances from the dataset, so each bootstrap sample will be different, and models trained on different bootstrap samples will also be different
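The 63% figure can be checked both from the formula and by simulation. The function names and the seed are assumptions for this sketch.

```python
import random
from math import e

def inclusion_probability(n):
    """Chance that a given instance appears in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

def simulated_fraction(n, seed=0):
    """Draw one bootstrap sample and measure the share of distinct instances."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

for n in (10, 100, 10_000):
    print(n, round(inclusion_probability(n), 4))
print("limit 1 - 1/e:", round(1 - 1 / e, 4))
print("simulated, n = 10,000:", simulated_fraction(10_000))
```

The probability approaches the limit quickly, so the rule of thumb holds even for modest training set sizes.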
Bagging Algorithm • Model Generation: • Let n be the number of instances in the training data. • For each of t iterations: • Sample n instances with replacement from training data. • Apply the learning algorithm to the sample. • Store the resulting model. • Classification: • For each of the t models: • Predict class of instance using model. • Return class that has been predicted most often.
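The algorithm above can be sketched with decision stumps as the learning algorithm. The stump learner, the function names, the labels +1/−1, and the six-record dataset are assumptions for illustration; this is not the dataset from the bagging example in the slides.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Find the rule 'x <= k -> left label, else right label' with most hits."""
    xs = sorted({x for x, _ in sample})
    # Candidate split points: midpoints between consecutive distinct x values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for k in candidates:
        for left, right in ((1, -1), (-1, 1)):
            hits = sum((left if x <= k else right) == y for x, y in sample)
            if best is None or hits > best[0]:
                best = (hits, k, left, right)
    return best[1:]                      # (k, left label, right label)

def stump_predict(stump, x):
    k, left, right = stump
    return left if x <= k else right

def bagging_fit(data, t, seed=0):
    """Model generation: t stumps, each trained on a bootstrap sample."""
    rng = random.Random(seed)
    n = len(data)
    return [fit_stump(rng.choices(data, k=n)) for _ in range(t)]

def bagging_predict(models, x):
    """Classification: return the class predicted most often."""
    votes = Counter(stump_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (x, y) pairs with y in {+1, -1}.
data = [(0.1, 1), (0.2, 1), (0.3, 1), (0.4, -1), (0.5, -1), (0.6, -1)]
models = bagging_fit(data, t=11)
print([bagging_predict(models, x) for x, _ in data])
```

An odd number of rounds avoids tied votes in a two-class problem.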
Bagging Example • Dataset: 10 instances • Predictor variable: x • Target variable: y • Now going to apply bagging and create many decision stump base classifiers. • Decision stump: one-level binary decision tree • What’s the best performance of a decision stump on this data? • Splitting condition will be x <= k, where k is the split point • Best splits: x <= 0.35 or x <= 0.75 • Best accuracy: 60%
Bagging Example • First choose how many “bagging rounds” to perform • Chosen by analyst • We’ll do 10 bagging rounds in this example:
Bagging Example In each round, create Di by sampling with replacement Round 1: Learn decision stump. What stump will be learned? If x <= 0.35 then y = 1 If x > 0.35 then y = -1