Decision Trees
Advanced Statistical Methods in NLP (Ling572), January 10, 2012
Information Gain
• InfoGain(S,A): expected reduction in entropy due to A
• Select the A with maximum InfoGain
  • Resulting in the lowest average entropy
Computing Average Entropy
[Figure: a set S of |S| instances is split by a test into Branch 1 and Branch 2, each with its own class distribution]
AvgEntropy(S, A) = Σ_i (|S_i| / |S|) × Entropy(S_i)
• |S_i| / |S|: fraction of samples down branch i
• Entropy(S_i): disorder of the class distribution on branch i
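To make the formula concrete, here is a minimal Python sketch of these quantities (entropy, average entropy after a split, and InfoGain). The function names and the layout (instances as dicts of feature values with a parallel list of labels) are illustrative assumptions, not the course's own code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Disorder of a class distribution, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def avg_entropy(instances, labels, feature):
    """Weighted average entropy of the branches created by testing `feature`."""
    total = len(labels)
    branches = {}
    for x, y in zip(instances, labels):
        branches.setdefault(x[feature], []).append(y)
    return sum(len(ys) / total * entropy(ys) for ys in branches.values())

def info_gain(instances, labels, feature):
    """InfoGain(S, A) = Entropy(S) - AvgEntropy(S, A)."""
    return entropy(labels) - avg_entropy(instances, labels, feature)
```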
Picking a Test
[Figure: the eight sunburn examples split by each candidate feature]
• Hair Color: Blonde → Sarah:B, Dana:N, Annie:B, Katie:N; Red → Emily:B; Brown → Alex:N, Pete:N, John:N
• Height: Short → Alex:N, Annie:B, Katie:N; Average → Sarah:B, Emily:B, John:N; Tall → Dana:N, Pete:N
• Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Alex:N, Annie:B; Heavy → Emily:B, Pete:N, John:N
• Lotion: Yes → Dana:N, Alex:N, Katie:N; No → Sarah:B, Annie:B, Emily:B, Pete:N, John:N
Entropy in Sunburn Example
S = [3B, 5N], so Entropy(S) = -(3/8)log(3/8) - (5/8)log(5/8) = 0.954
• InfoGain(Hair color) = 0.954 - [4/8·(-(2/4)log(2/4) - (2/4)log(2/4)) + 1/8·0 + 3/8·0] = 0.954 - 0.5 = 0.454
• InfoGain(Height) = 0.954 - 0.69 = 0.264
• InfoGain(Weight) = 0.954 - 0.94 = 0.014
• InfoGain(Lotion) = 0.954 - 0.61 = 0.344
• Hair color has the highest gain, so it becomes the first test
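These numbers can be checked with a short script. The feature values below are the sunburn data as recoverable from the "Picking a Test" splits above; this is a sketch for verification, and small differences from the slide's figures come from rounding.

```python
from collections import Counter
from math import log2

DATA = {  # name: (hair color, height, weight, lotion, label)
    "Sarah": ("Blonde", "Average", "Light",   "No",  "B"),
    "Dana":  ("Blonde", "Tall",    "Average", "Yes", "N"),
    "Alex":  ("Brown",  "Short",   "Average", "Yes", "N"),
    "Annie": ("Blonde", "Short",   "Average", "No",  "B"),
    "Emily": ("Red",    "Average", "Heavy",   "No",  "B"),
    "Pete":  ("Brown",  "Tall",    "Heavy",   "No",  "N"),
    "John":  ("Brown",  "Average", "Heavy",   "No",  "N"),
    "Katie": ("Blonde", "Short",   "Light",   "Yes", "N"),
}
FEATURES = ["Hair color", "Height", "Weight", "Lotion"]

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, i):
    branches = {}
    for r in rows:
        branches.setdefault(r[i], []).append(r[-1])
    avg = sum(len(ys) / len(rows) * entropy(ys) for ys in branches.values())
    return entropy([r[-1] for r in rows]) - avg

rows = list(DATA.values())
for i, name in enumerate(FEATURES):
    print(f"{name}: {info_gain(rows, i):.3f}")
# Prints roughly 0.454, 0.266, 0.016, 0.348; the slide's 0.454, 0.264, 0.014,
# 0.344 come from rounding the intermediate average entropies.
```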
Picking a Test: Blonde branch
After splitting on Hair color, the Blonde branch still holds Sarah:B, Dana:N, Annie:B, Katie:N
• Height: Short → Annie:B, Katie:N; Average → Sarah:B; Tall → Dana:N
• Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Annie:B
• Lotion: Yes → Dana:N, Katie:N; No → Sarah:B, Annie:B
Entropy in Sunburn Example: Blonde branch
S = [2B, 2N], so Entropy(S) = 1
• InfoGain(Height) = 1 - [2/4·(-(1/2)log(1/2) - (1/2)log(1/2)) + 1/4·0 + 1/4·0] = 1 - 0.5 = 0.5
• InfoGain(Weight) = 1 - [2/4·(-(1/2)log(1/2) - (1/2)log(1/2)) + 2/4·(-(1/2)log(1/2) - (1/2)log(1/2))] = 1 - 1 = 0
• InfoGain(Lotion) = 1 - 0 = 1
• Lotion separates the remaining examples perfectly, so it becomes the next test
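The same style of check for the Blonde branch, again as an illustrative sketch; it confirms that Weight's gain here is 0 (its average branch entropy is 1).

```python
from collections import Counter
from math import log2

ROWS = [  # (height, weight, lotion, label) for the four blonde-haired examples
    ("Average", "Light",   "No",  "B"),   # Sarah
    ("Tall",    "Average", "Yes", "N"),   # Dana
    ("Short",   "Average", "No",  "B"),   # Annie
    ("Short",   "Light",   "Yes", "N"),   # Katie
]

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, i):
    branches = {}
    for r in rows:
        branches.setdefault(r[i], []).append(r[-1])
    avg = sum(len(ys) / len(rows) * entropy(ys) for ys in branches.values())
    return entropy([r[-1] for r in rows]) - avg

for i, name in enumerate(["Height", "Weight", "Lotion"]):
    print(f"{name}: {info_gain(ROWS, i):.1f}")   # Height 0.5, Weight 0.0, Lotion 1.0
```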
Building Decision Trees with Information Gain
• Until there are no inhomogeneous leaves:
  • Select an inhomogeneous leaf node
  • Replace that leaf node by a test node creating the subsets that yield the highest information gain
• Effectively creates a set of rectangular regions
  • Repeatedly draws lines in different axes
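A minimal recursive (ID3-style) sketch of this loop, assuming categorical features stored in a dict per instance; the tuple-based tree representation and the names (build_tree, info_gain) are illustrative, not the course toolkit's.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, feat):
    """rows: list of (feature_dict, label) pairs."""
    branches = {}
    for x, y in rows:
        branches.setdefault(x[feat], []).append(y)
    avg = sum(len(ys) / len(rows) * entropy(ys) for ys in branches.values())
    return entropy([y for _, y in rows]) - avg

def build_tree(rows, features):
    """Leaf = class label; internal node = (feature, {value: subtree})."""
    labels = [y for _, y in rows]
    if len(set(labels)) == 1 or not features:        # homogeneous leaf, or no tests left
        return Counter(labels).most_common(1)[0][0]  # majority label
    best = max(features, key=lambda f: info_gain(rows, f))
    children = {}
    for x, y in rows:                                # split on the best feature
        children.setdefault(x[best], []).append((x, y))
    rest = [f for f in features if f != best]
    return (best, {v: build_tree(sub, rest) for v, sub in children.items()})
```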
Alternate Measures
• Issue with Information Gain: it favors features with more values
• Option: Gain Ratio
  • GainRatio(S,A) = InfoGain(S,A) / SplitInfo(S,A)
  • SplitInfo(S,A) = -Σ_a (|S_a| / |S|) log(|S_a| / |S|)
  • S_a: elements of S with value A = a
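A hedged sketch of the gain-ratio computation under the formula above, using the same (feature_dict, label) row layout as the earlier sketches.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, feat):
    """rows: list of (feature_dict, label) pairs."""
    total = len(rows)
    branches = {}
    for x, y in rows:
        branches.setdefault(x[feat], []).append(y)
    avg = sum(len(ys) / total * entropy(ys) for ys in branches.values())
    gain = entropy([y for _, y in rows]) - avg
    # SplitInfo grows when a feature shatters S into many small subsets S_a,
    # which is what penalizes many-valued features.
    split_info = -sum(len(ys) / total * log2(len(ys) / total) for ys in branches.values())
    return gain / split_info if split_info > 0 else 0.0
```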
Overfitting
• Overfitting: the model fits the training data TOO well
  • Fits noise and irrelevant details
• Why is this bad?
  • Harms generalization: fits the training data too well, fits new data badly
• For a model m: training_error(m) and D_error(m), where D = all data
• If m overfits, then for some other model m':
  • training_error(m) < training_error(m'), but
  • D_error(m) > D_error(m')
Avoiding Overfitting
• Strategies to avoid overfitting:
• Early stopping (checks sketched in code below):
  • Stop when InfoGain < threshold
  • Stop when number of instances < threshold
  • Stop when tree depth > threshold
• Post-pruning:
  • Grow the full tree, then remove branches
• Which is better?
  • Unclear; both are used
  • For some applications, post-pruning works better
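A small sketch of how the early-stopping checks might look in code; the threshold values are placeholders chosen for illustration, not recommendations from the slides.

```python
# Placeholder thresholds; the slides do not prescribe specific values.
MIN_GAIN = 0.01        # stop when InfoGain < threshold
MIN_INSTANCES = 5      # stop when number of instances < threshold
MAX_DEPTH = 10         # stop when tree depth > threshold

def should_stop(best_gain, n_instances, depth):
    """Return True if this node should be left as a leaf instead of being split."""
    return (best_gain < MIN_GAIN
            or n_instances < MIN_INSTANCES
            or depth > MAX_DEPTH)
```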
Post-Pruning
• Divide the data into:
  • Training set: used to build the original tree
  • Validation set: used to perform pruning
• Build the decision tree on the training data
• Repeat until further pruning would reduce validation set performance:
  • Compute performance for pruning each node (and its children)
  • Greedily remove nodes whose removal does not reduce validation set performance
• Yields a smaller tree with the best performance
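One common bottom-up realization of this procedure is reduced-error pruning; the sketch below assumes the tuple-based tree representation from the earlier build_tree sketch and routes both training and validation rows down each branch. It is an illustration, not the course's own implementation.

```python
from collections import Counter

def classify(tree, x, default="N"):
    """Walk a (feature, {value: subtree}) tree; a leaf is just a label."""
    while isinstance(tree, tuple):
        feat, children = tree
        tree = children.get(x.get(feat), default)
    return tree

def accuracy(tree, rows):
    return sum(classify(tree, x) == y for x, y in rows) / len(rows)

def majority_label(rows):
    return Counter(y for _, y in rows).most_common(1)[0][0]

def prune(tree, train_rows, val_rows):
    """Bottom-up reduced-error pruning: replace a node by its majority-label
    leaf whenever that does not hurt accuracy on the validation rows reaching it."""
    if not isinstance(tree, tuple) or not val_rows:
        return tree
    feat, children = tree
    for value, sub in list(children.items()):
        sub_train = [(x, y) for x, y in train_rows if x.get(feat) == value]
        sub_val = [(x, y) for x, y in val_rows if x.get(feat) == value]
        if sub_train:
            children[value] = prune(sub, sub_train, sub_val)
    leaf = majority_label(train_rows)
    if accuracy(leaf, val_rows) >= accuracy(tree, val_rows):
        return leaf          # pruning here does not reduce validation performance
    return tree
```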
Performance Measures
• Compute accuracy on:
  • a validation set
  • k-fold cross-validation
• Weighted classification error cost:
  • Weight some types of errors more heavily
• Minimum description length:
  • Favor good accuracy on compact models
  • MDL = error(tree) + model_size(tree) (sketched below)
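A minimal sketch of the MDL-style score; counting tree nodes as the model size and adding an optional trade-off weight alpha are assumptions made here for illustration, not details given on the slide.

```python
def tree_size(tree):
    """Number of nodes in the (feature, {value: subtree}) / leaf representation."""
    if not isinstance(tree, tuple):
        return 1
    _, children = tree
    return 1 + sum(tree_size(sub) for sub in children.values())

def mdl_score(tree, n_errors, alpha=1.0):
    """MDL = error(tree) + model_size(tree); smaller is better.
    alpha is an illustrative knob for trading accuracy against compactness."""
    return n_errors + alpha * tree_size(tree)
```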
Rule Post-Pruning
• Convert the tree to rules
• Prune rules independently
• Sort the final rule set
• Probably the most widely used method (in toolkits)
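A sketch of the first step, converting each root-to-leaf path into an IF-THEN rule, again over the illustrative tuple-based representation; the example tree is the one the sunburn computations above lead to.

```python
def tree_to_rules(tree, conditions=()):
    """Yield (conditions, label) pairs, one per root-to-leaf path, over the
    (feature, {value: subtree}) representation used in the earlier sketches."""
    if not isinstance(tree, tuple):              # leaf: emit the accumulated rule
        yield list(conditions), tree
        return
    feat, children = tree
    for value, sub in children.items():
        yield from tree_to_rules(sub, conditions + ((feat, value),))

# The tree the sunburn example arrives at, in the same representation:
tree = ("Hair color", {"Blonde": ("Lotion", {"Yes": "N", "No": "B"}),
                       "Red": "B",
                       "Brown": "N"})
for conds, label in tree_to_rules(tree):
    print("IF " + " AND ".join(f"{f}={v}" for f, v in conds) + f" THEN {label}")
# e.g. IF Hair color=Blonde AND Lotion=Yes THEN N
```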
Modeling Features
• Different types of features need different tests:
  • Binary: test branches on true/false
  • Discrete: one branch for each discrete value
  • Continuous? Need to discretize
    • Enumerate the observed values as candidate split points (one common approach sketched below)
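One common way to discretize a continuous feature (assumed here; the slide does not spell out the method) is to enumerate the midpoints between sorted observed values and keep the binary threshold with the highest information gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, info_gain) for the best binary test `value <= t`."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                        # candidate threshold: midpoint
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```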