Decision Trees
Advanced Statistical Methods in NLP (Ling572), January 10, 2012
Information Gain
• InfoGain(S,A): expected reduction in entropy due to A
• Select the A with maximum InfoGain
  • Resulting in the lowest average entropy
Computing Average Entropy
[Figure: a set S of |S| instances is split by a test into Branch 1 and Branch 2, each with its own class distribution]
AvgEntropy(S, A) = Σ_i (|S_i| / |S|) × Entropy(S_i)
• |S_i| / |S|: fraction of samples down branch i
• Entropy(S_i): disorder of the class distribution on branch i
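To make the formula concrete, here is a minimal Python sketch of these quantities (entropy, average entropy after a split, and InfoGain). The function names and the layout (instances as dicts of feature values with a parallel list of labels) are illustrative assumptions, not the course's own code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Disorder of a class distribution, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def avg_entropy(instances, labels, feature):
    """Weighted average entropy of the branches created by testing `feature`."""
    total = len(labels)
    branches = {}
    for x, y in zip(instances, labels):
        branches.setdefault(x[feature], []).append(y)
    return sum(len(ys) / total * entropy(ys) for ys in branches.values())

def info_gain(instances, labels, feature):
    """InfoGain(S, A) = Entropy(S) - AvgEntropy(S, A)."""
    return entropy(labels) - avg_entropy(instances, labels, feature)
```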
Picking a Test
[Figure: the eight sunburn examples split by each candidate feature]
• Hair Color: Blonde → Sarah:B, Dana:N, Annie:B, Katie:N; Red → Emily:B; Brown → Alex:N, Pete:N, John:N
• Height: Short → Alex:N, Annie:B, Katie:N; Average → Sarah:B, Emily:B, John:N; Tall → Dana:N, Pete:N
• Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Alex:N, Annie:B; Heavy → Emily:B, Pete:N, John:N
• Lotion: Yes → Dana:N, Alex:N, Katie:N; No → Sarah:B, Annie:B, Emily:B, Pete:N, John:N
Entropy in Sunburn Example
S = [3B, 5N], so Entropy(S) = -(3/8)log(3/8) - (5/8)log(5/8) = 0.954
• InfoGain(Hair color) = 0.954 - [4/8·(-(2/4)log(2/4) - (2/4)log(2/4)) + 1/8·0 + 3/8·0] = 0.954 - 0.5 = 0.454
• InfoGain(Height) = 0.954 - 0.69 = 0.264
• InfoGain(Weight) = 0.954 - 0.94 = 0.014
• InfoGain(Lotion) = 0.954 - 0.61 = 0.344
• Hair color has the highest gain, so it becomes the first test
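These numbers can be checked with a short script. The feature values below are the sunburn data as recoverable from the "Picking a Test" splits above; this is a sketch for verification, and small differences from the slide's figures come from rounding.

```python
from collections import Counter
from math import log2

DATA = {  # name: (hair color, height, weight, lotion, label)
    "Sarah": ("Blonde", "Average", "Light",   "No",  "B"),
    "Dana":  ("Blonde", "Tall",    "Average", "Yes", "N"),
    "Alex":  ("Brown",  "Short",   "Average", "Yes", "N"),
    "Annie": ("Blonde", "Short",   "Average", "No",  "B"),
    "Emily": ("Red",    "Average", "Heavy",   "No",  "B"),
    "Pete":  ("Brown",  "Tall",    "Heavy",   "No",  "N"),
    "John":  ("Brown",  "Average", "Heavy",   "No",  "N"),
    "Katie": ("Blonde", "Short",   "Light",   "Yes", "N"),
}
FEATURES = ["Hair color", "Height", "Weight", "Lotion"]

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, i):
    branches = {}
    for r in rows:
        branches.setdefault(r[i], []).append(r[-1])
    avg = sum(len(ys) / len(rows) * entropy(ys) for ys in branches.values())
    return entropy([r[-1] for r in rows]) - avg

rows = list(DATA.values())
for i, name in enumerate(FEATURES):
    print(f"{name}: {info_gain(rows, i):.3f}")
# Prints roughly 0.454, 0.266, 0.016, 0.348; the slide's 0.454, 0.264, 0.014,
# 0.344 come from rounding the intermediate average entropies.
```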
Picking a Test: Blonde branch
After splitting on Hair color, the Blonde branch still holds Sarah:B, Dana:N, Annie:B, Katie:N
• Height: Short → Annie:B, Katie:N; Average → Sarah:B; Tall → Dana:N
• Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Annie:B
• Lotion: Yes → Dana:N, Katie:N; No → Sarah:B, Annie:B
Entropy in Sunburn Example: Blonde branch
S = [2B, 2N], so Entropy(S) = 1
• InfoGain(Height) = 1 - [2/4·(-(1/2)log(1/2) - (1/2)log(1/2)) + 1/4·0 + 1/4·0] = 1 - 0.5 = 0.5
• InfoGain(Weight) = 1 - [2/4·(-(1/2)log(1/2) - (1/2)log(1/2)) + 2/4·(-(1/2)log(1/2) - (1/2)log(1/2))] = 1 - 1 = 0
• InfoGain(Lotion) = 1 - 0 = 1
• Lotion separates the remaining examples perfectly, so it becomes the next test
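The same style of check for the Blonde branch, again as an illustrative sketch; it confirms that Weight's gain here is 0 (its average branch entropy is 1).

```python
from collections import Counter
from math import log2

ROWS = [  # (height, weight, lotion, label) for the four blonde-haired examples
    ("Average", "Light",   "No",  "B"),   # Sarah
    ("Tall",    "Average", "Yes", "N"),   # Dana
    ("Short",   "Average", "No",  "B"),   # Annie
    ("Short",   "Light",   "Yes", "N"),   # Katie
]

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, i):
    branches = {}
    for r in rows:
        branches.setdefault(r[i], []).append(r[-1])
    avg = sum(len(ys) / len(rows) * entropy(ys) for ys in branches.values())
    return entropy([r[-1] for r in rows]) - avg

for i, name in enumerate(["Height", "Weight", "Lotion"]):
    print(f"{name}: {info_gain(ROWS, i):.1f}")   # Height 0.5, Weight 0.0, Lotion 1.0
```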
Building Decision Trees with Information Gain
• Until there are no inhomogeneous leaves:
  • Select an inhomogeneous leaf node
  • Replace that leaf node by a test node creating the subsets that yield the highest information gain
• Effectively creates a set of rectangular regions
  • Repeatedly draws lines in different axes
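A minimal recursive (ID3-style) sketch of this loop, assuming categorical features stored in a dict per instance; the tuple-based tree representation and the names (build_tree, info_gain) are illustrative, not the course toolkit's.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, feat):
    """rows: list of (feature_dict, label) pairs."""
    branches = {}
    for x, y in rows:
        branches.setdefault(x[feat], []).append(y)
    avg = sum(len(ys) / len(rows) * entropy(ys) for ys in branches.values())
    return entropy([y for _, y in rows]) - avg

def build_tree(rows, features):
    """Leaf = class label; internal node = (feature, {value: subtree})."""
    labels = [y for _, y in rows]
    if len(set(labels)) == 1 or not features:        # homogeneous leaf, or no tests left
        return Counter(labels).most_common(1)[0][0]  # majority label
    best = max(features, key=lambda f: info_gain(rows, f))
    children = {}
    for x, y in rows:                                # split on the best feature
        children.setdefault(x[best], []).append((x, y))
    rest = [f for f in features if f != best]
    return (best, {v: build_tree(sub, rest) for v, sub in children.items()})
```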
Alternate Measures
• Issue with Information Gain: it favors features with more values
• Option: Gain Ratio
  • GainRatio(S,A) = InfoGain(S,A) / SplitInfo(S,A)
  • SplitInfo(S,A) = -Σ_a (|S_a| / |S|) log(|S_a| / |S|)
  • S_a: elements of S with value A = a
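A hedged sketch of the gain-ratio computation under the formula above, using the same (feature_dict, label) row layout as the earlier sketches.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, feat):
    """rows: list of (feature_dict, label) pairs."""
    total = len(rows)
    branches = {}
    for x, y in rows:
        branches.setdefault(x[feat], []).append(y)
    avg = sum(len(ys) / total * entropy(ys) for ys in branches.values())
    gain = entropy([y for _, y in rows]) - avg
    # SplitInfo grows when a feature shatters S into many small subsets S_a,
    # which is what penalizes many-valued features.
    split_info = -sum(len(ys) / total * log2(len(ys) / total) for ys in branches.values())
    return gain / split_info if split_info > 0 else 0.0
```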
Overfitting
• Overfitting: the model fits the training data TOO well
  • Fits noise and irrelevant details
• Why is this bad?
  • Harms generalization: fits the training data too well, fits new data badly
• For a model m: training_error(m) and D_error(m), where D = all data
• If m overfits, then for some other model m':
  • training_error(m) < training_error(m'), but
  • D_error(m) > D_error(m')
Avoiding Overfitting
• Strategies to avoid overfitting:
• Early stopping (checks sketched in code below):
  • Stop when InfoGain < threshold
  • Stop when number of instances < threshold
  • Stop when tree depth > threshold
• Post-pruning:
  • Grow the full tree, then remove branches
• Which is better?
  • Unclear; both are used
  • For some applications, post-pruning works better
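A small sketch of how the early-stopping checks might look in code; the threshold values are placeholders chosen for illustration, not recommendations from the slides.

```python
# Placeholder thresholds; the slides do not prescribe specific values.
MIN_GAIN = 0.01        # stop when InfoGain < threshold
MIN_INSTANCES = 5      # stop when number of instances < threshold
MAX_DEPTH = 10         # stop when tree depth > threshold

def should_stop(best_gain, n_instances, depth):
    """Return True if this node should be left as a leaf instead of being split."""
    return (best_gain < MIN_GAIN
            or n_instances < MIN_INSTANCES
            or depth > MAX_DEPTH)
```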
Post-Pruning
• Divide the data into:
  • Training set: used to build the original tree
  • Validation set: used to perform pruning
• Build the decision tree on the training data
• Repeat until further pruning would reduce validation set performance:
  • Compute performance for pruning each node (and its children)
  • Greedily remove nodes whose removal does not reduce validation set performance
• Yields a smaller tree with the best performance
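One common bottom-up realization of this procedure is reduced-error pruning; the sketch below assumes the tuple-based tree representation from the earlier build_tree sketch and routes both training and validation rows down each branch. It is an illustration, not the course's own implementation.

```python
from collections import Counter

def classify(tree, x, default="N"):
    """Walk a (feature, {value: subtree}) tree; a leaf is just a label."""
    while isinstance(tree, tuple):
        feat, children = tree
        tree = children.get(x.get(feat), default)
    return tree

def accuracy(tree, rows):
    return sum(classify(tree, x) == y for x, y in rows) / len(rows)

def majority_label(rows):
    return Counter(y for _, y in rows).most_common(1)[0][0]

def prune(tree, train_rows, val_rows):
    """Bottom-up reduced-error pruning: replace a node by its majority-label
    leaf whenever that does not hurt accuracy on the validation rows reaching it."""
    if not isinstance(tree, tuple) or not val_rows:
        return tree
    feat, children = tree
    for value, sub in list(children.items()):
        sub_train = [(x, y) for x, y in train_rows if x.get(feat) == value]
        sub_val = [(x, y) for x, y in val_rows if x.get(feat) == value]
        if sub_train:
            children[value] = prune(sub, sub_train, sub_val)
    leaf = majority_label(train_rows)
    if accuracy(leaf, val_rows) >= accuracy(tree, val_rows):
        return leaf          # pruning here does not reduce validation performance
    return tree
```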
Performance Measures
• Compute accuracy on:
  • a validation set
  • k-fold cross-validation
• Weighted classification error cost:
  • Weight some types of errors more heavily
• Minimum description length:
  • Favor good accuracy on compact models
  • MDL = error(tree) + model_size(tree) (sketched below)
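A minimal sketch of the MDL-style score; counting tree nodes as the model size and adding an optional trade-off weight alpha are assumptions made here for illustration, not details given on the slide.

```python
def tree_size(tree):
    """Number of nodes in the (feature, {value: subtree}) / leaf representation."""
    if not isinstance(tree, tuple):
        return 1
    _, children = tree
    return 1 + sum(tree_size(sub) for sub in children.values())

def mdl_score(tree, n_errors, alpha=1.0):
    """MDL = error(tree) + model_size(tree); smaller is better.
    alpha is an illustrative knob for trading accuracy against compactness."""
    return n_errors + alpha * tree_size(tree)
```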
Rule Post-Pruning
• Convert the tree to rules
• Prune rules independently
• Sort the final rule set
• Probably the most widely used method (in toolkits)
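A sketch of the first step, converting each root-to-leaf path into an IF-THEN rule, again over the illustrative tuple-based representation; the example tree is the one the sunburn computations above lead to.

```python
def tree_to_rules(tree, conditions=()):
    """Yield (conditions, label) pairs, one per root-to-leaf path, over the
    (feature, {value: subtree}) representation used in the earlier sketches."""
    if not isinstance(tree, tuple):              # leaf: emit the accumulated rule
        yield list(conditions), tree
        return
    feat, children = tree
    for value, sub in children.items():
        yield from tree_to_rules(sub, conditions + ((feat, value),))

# The tree the sunburn example arrives at, in the same representation:
tree = ("Hair color", {"Blonde": ("Lotion", {"Yes": "N", "No": "B"}),
                       "Red": "B",
                       "Brown": "N"})
for conds, label in tree_to_rules(tree):
    print("IF " + " AND ".join(f"{f}={v}" for f, v in conds) + f" THEN {label}")
# e.g. IF Hair color=Blonde AND Lotion=Yes THEN N
```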
Modeling Features
• Different types of features need different tests:
  • Binary: test branches on true/false
  • Discrete: one branch for each discrete value
  • Continuous? Need to discretize
    • Enumerate the observed values as candidate split points (one common approach sketched below)
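One common way to discretize a continuous feature (assumed here; the slide does not spell out the method) is to enumerate the midpoints between sorted observed values and keep the binary threshold with the highest information gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, info_gain) for the best binary test `value <= t`."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                        # candidate threshold: midpoint
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```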