Decision Tree Pruning
Problem Statement • We would like to output a small decision tree • Model Selection • Building continues until zero training error • Option 1: Stop early • Stop when the decrease in the index function is small • Cons: may miss structure • Option 2: Prune after building.
Pruning • Input: tree T • Sample: S • Output: Tree T’ • Basic Pruning: T’ is a sub-tree of T • Can only replace inner nodes by leaves • More advanced: • Replace an inner node by one of its children
Reduced Error Pruning • Split the sample into two parts, S1 and S2 • Use S1 to build a tree • Use S2 to decide whether to prune • Process every inner node v • after all of its children have been processed • Compute the observed error of T_v and of leaf(v) • If leaf(v) has fewer errors, replace T_v by leaf(v) • (a sketch of this bottom-up pass follows below)
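A minimal Python sketch of this bottom-up reduced-error-pruning pass, assuming a simple binary decision-tree representation; the names Node, predict, errors, and reduced_error_prune are illustrative and not from the original slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    label: int                      # majority label at this node (used if it becomes a leaf)
    feature: Optional[int] = None   # split feature; None means the node is a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def predict(node: Node, x: List[float]) -> int:
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

def errors(node: Node, S: List[Tuple[List[float], int]]) -> int:
    """Number of examples in S misclassified by the (sub-)tree rooted at node."""
    return sum(1 for x, y in S if predict(node, x) != y)

def reduced_error_prune(node: Node, S2: List[Tuple[List[float], int]]) -> Node:
    """Bottom-up pass: prune the children first, then compare T_v with leaf(v) on S2."""
    if node.feature is None or not S2:          # leaf, or no validation data reaches v: keep as is
        return node
    left_S = [(x, y) for x, y in S2 if x[node.feature] <= node.threshold]
    right_S = [(x, y) for x, y in S2 if x[node.feature] > node.threshold]
    node.left = reduced_error_prune(node.left, left_S)
    node.right = reduced_error_prune(node.right, right_S)
    leaf_errors = sum(1 for _, y in S2 if y != node.label)
    if leaf_errors <= errors(node, S2):         # leaf(v) has no more errors than T_v: prune
        return Node(label=node.label)
    return node
```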
Pruning: CV & SRM • For each pruning size, compute the minimal-error pruning • There are at most m different sub-trees • Select among the prunings using: • Cross validation • Structural Risk Minimization • Any other index method
Finding the minimum pruning • Procedure Compute • Inputs: • k : number of errors • T : tree • S : sample • Output: • P : pruned tree • size : size of P
Procedure Compute • IF IsLeaf(T) • IF Errors(T) ≤ k THEN size = 1 ELSE size = ∞ • P = T; return • IF Errors(root(T)) ≤ k • size = 1; P = root(T); return
Procedure Compute (continued) • FOR i = 0 TO k DO • Call Compute(i, T[0], S0, size_{i,0}, P_{i,0}) • Call Compute(k-i, T[1], S1, size_{i,1}, P_{i,1}) • size = min_i { size_{i,0} + size_{i,1} + 1 } • I = argmin_i { size_{i,0} + size_{i,1} + 1 } • P = MakeTree(root(T), P_{I,0}, P_{I,1}) • What is the time complexity? (a sketch of the procedure follows below)
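A minimal Python sketch of Procedure Compute, assuming the Node and errors helpers from the sketch above; it returns the smallest pruning of T that makes at most k errors on S, with math.inf marking "no such pruning exists". The exact interface is an assumption.

```python
import math
from typing import List, Tuple

def compute(k: int, T: Node, S: List[Tuple[List[float], int]]) -> Tuple[float, Node]:
    """Return (size, P): the smallest pruning P of T with at most k errors on S."""
    leaf = Node(label=T.label)
    leaf_errors = sum(1 for _, y in S if y != T.label)
    if T.feature is None:                       # T is a leaf
        return (1, T) if leaf_errors <= k else (math.inf, T)
    if leaf_errors <= k:                        # collapsing T to root(T) already suffices
        return 1, leaf
    S0 = [(x, y) for x, y in S if x[T.feature] <= T.threshold]
    S1 = [(x, y) for x, y in S if x[T.feature] > T.threshold]
    best_size, best_P = math.inf, leaf          # size stays inf if no split of the budget works
    for i in range(k + 1):                      # split the error budget k between the children
        size0, P0 = compute(i, T.left, S0)
        size1, P1 = compute(k - i, T.right, S1)
        if size0 + size1 + 1 < best_size:
            best_size = size0 + size1 + 1
            best_P = Node(label=T.label, feature=T.feature,
                          threshold=T.threshold, left=P0, right=P1)
    return best_size, best_P
```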
Cross Validation • Split the sample into S1 and S2 • Build a tree using S1 • Compute the candidate prunings • Select using S2 • Output the pruning with the smallest error on S2 (see the sketch below)
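A minimal sketch of the selection step, assuming the compute and errors helpers above: generate one candidate pruning per error count on S1 and keep the one with the smallest error on the held-out part S2.

```python
import math

def cv_select(T: Node, S1, S2) -> Node:
    candidates = []
    for d in range(len(S1) + 1):                # one candidate pruning per allowed error count d on S1
        size, P = compute(d, T, S1)
        if size != math.inf:
            candidates.append(P)
    return min(candidates, key=lambda P: errors(P, S2))   # pick the best pruning on S2
```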
SRM • Build a tree T using S • Compute the candidate prunings • k_d = the size of the pruning that makes d errors • Select using the SRM formula (a hedged form is sketched below)
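The exact SRM formula is not given in the text; one standard SRM-style rule, with the constant c and the log factors as assumptions, trades off the training error of each candidate pruning against a complexity penalty that grows with its size k_d:

$$ \hat d \;=\; \arg\min_{d}\ \left(\ \frac{d}{m} \;+\; c\,\sqrt{\frac{k_d \log|H| + \log(1/\delta)}{m}}\ \right), $$

and the output is the corresponding pruning of size $k_{\hat d}$.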
Drawbacks • Running time • Since |T| = O(m) • the running time is O(m²) • Many passes over the data • A significant drawback for large data sets
Linear Time Pruning • A single bottom-up pass • Linear time • Uses an SRM-like formula • Local soundness • Competitive with any pruning
Algorithm • Process a node after processing its children • Local parameters: • T_v: the current sub-tree at v, of size size_v • S_v: the sample reaching v, of size m_v • l_v: the length of the path leading to v • Local test: obs(T_v, S_v) + a(m_v, size_v, l_v, δ) > obs(root(T_v), S_v) (see the sketch below)
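A minimal Python sketch of the single bottom-up pass, assuming the Node and errors helpers above. The penalty follows the O(sqrt(r_v/m_v)) form quoted later in the slides, with r_v = (l_v + size_v) log|H| + log(m_v/δ); the constants, and the reading that the node is pruned to a leaf when the local test holds, are assumptions.

```python
import math

def tree_size(node: Node) -> int:
    return 1 if node.feature is None else 1 + tree_size(node.left) + tree_size(node.right)

def penalty(m_v: int, size_v: int, l_v: int, log_H: float, delta: float) -> float:
    """a(m_v, size_v, l_v, delta) ~ sqrt(r_v / m_v), r_v = (l_v + size_v) log|H| + log(m_v / delta)."""
    r_v = (l_v + size_v) * log_H + math.log(max(m_v, 1) / delta)
    return math.sqrt(r_v / max(m_v, 1))

def bottom_up_prune(node: Node, S_v, l_v: int, log_H: float, delta: float) -> Node:
    if node.feature is None:
        return node
    S0 = [(x, y) for x, y in S_v if x[node.feature] <= node.threshold]
    S1 = [(x, y) for x, y in S_v if x[node.feature] > node.threshold]
    node.left = bottom_up_prune(node.left, S0, l_v + 1, log_H, delta)
    node.right = bottom_up_prune(node.right, S1, l_v + 1, log_H, delta)
    m_v = len(S_v)
    obs_tree = errors(node, S_v) / max(m_v, 1)                            # obs(T_v, S_v)
    obs_leaf = sum(1 for _, y in S_v if y != node.label) / max(m_v, 1)    # obs(root(T_v), S_v)
    # Local test: the sub-tree's advantage over leaf(v) does not exceed the penalty, so prune.
    if obs_tree + penalty(m_v, tree_size(node), l_v, log_H, delta) > obs_leaf:
        return Node(label=node.label)
    return node
```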
The function a() • Parameters: • paths(H,l): the set of paths of length l over H • trees(H,s): the set of trees of size s over H • Formula:
The function a() • Finite class H: • |paths(H,l)| ≤ |H|^l • |trees(H,s)| ≤ (4|H|)^s • Formula: (a hedged form is given below) • Infinite classes: use the VC dimension
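The exact formula is not given in the text; a plausible finite-class instantiation, consistent with the counts above and with the O(sqrt(r_v/m_v)) bound quoted later (the constants are assumptions), is

$$ a(m_v, \mathrm{size}_v, l_v, \delta) \;=\; \sqrt{\frac{\ln|\mathrm{trees}(H,\mathrm{size}_v)| + \ln|\mathrm{paths}(H,l_v)| + \ln(m_v/\delta)}{2\,m_v}} \;\le\; \sqrt{\frac{\mathrm{size}_v\ln(4|H|) + l_v\ln|H| + \ln(m_v/\delta)}{2\,m_v}} . $$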
Example (figure): a(m_v, size_v, l_v, δ) illustrated for l_v = 3, as a function of the node sample size m_v and the sub-tree size size_v, for a sample of size m.
Local uniform convergence • Sample S • S_c = { x ∈ S | c(x) = 1 }, m_c = |S_c| • Finite classes C and H • e(h|c) = Pr[ h(x) ≠ f(x) | c(x) = 1 ] • obs(h|c): the observed error on S_c • Lemma: with probability 1-δ, obs(h|c) is close to e(h|c) uniformly over h ∈ H and c ∈ C (one standard form is given below)
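The exact statement is not given in the text; one standard form such a local uniform convergence lemma could take, using Hoeffding's inequality and a union bound over H and C conditioned on the m_c examples reaching c (the constants are assumptions), is

$$ \Pr\!\left[\ \exists\, h \in H,\ c \in C:\ \bigl|\, e(h \mid c) - \mathrm{obs}(h \mid c) \,\bigr| \;>\; \sqrt{\frac{\ln|H| + \ln|C| + \ln(2/\delta)}{2\,m_c}}\ \right] \;\le\; \delta . $$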
Global Analysis • Notation: • T: the original tree (depends on S) • T*: the pruned tree • T_opt: the optimal pruned tree • r_v = (l_v + size_v) log|H| + log(m_v/δ) • a(m_v, size_v, l_v, δ) = O( sqrt{ r_v / m_v } )
Sub-Tree Property • Lemma: with probability 1-δ, T* is a sub-tree of T_opt • Proof: • Assume that all the local lemmas hold • Each pruning step reduces the error • Assume T* has a subtree outside T_opt • Adding that subtree to T_opt would improve it, a contradiction!
Comparing T* and T_opt • Additional pruned nodes: V = { v1, … , vt } • Additional error: • e(T*) - e(T_opt) = Σ_i ( e(v_i) - e(T*_{v_i}) ) Pr[v_i] • Claim: with high probability this difference is small
Analysis • Lemma: with probability 1-δ • IF Pr[v_i] > 12( l_opt log|H| + log(1/δ) ) / m = b • THEN Pr[v_i] < 2 obs(v_i) • Proof: • Relative Chernoff bound • Union bound over the |H|^{l_opt} paths • Define V' = { v_i ∈ V | Pr[v_i] > b }
Analysis of the error sum • The sum over V - V' is bounded by size_opt · b
Analysis of the error sum (continued) • Σ_v m_v < l_opt · size_opt • Σ_v r_v < size_opt ( l_opt log|H| + log(m/δ) ) • Putting it all together