150 likes | 188 Views
This outline covers the motivation behind model selection, the dangers of overfitting in machine learning, and strategies like structural risk minimization and cross-validation. Theoretical models, penalties, and performance metrics are discussed, highlighting the importance of balancing complexity with training error. Minimum description length as a penalty and the application of the Minimum Message Length principle are also explored.
E N D
Outline • Motivation • Overfitting • Structural Risk Minimization • Cross Validation • Minimum Description Length
Motivation: • Suppose we have a class of infinite Vcdim • We have too few examples • How can we find the best hypothesis • Alternatively, • Usually we choose the hypothesis class • How should we go about doing it?
Overfitting • Concept class: Intervals on a line • Can classify any training set • Zero training error: The only goal?!
Overfitting: Intervals • Can always get zero error • Are we interested?! • Recall Occam Razor!
Overfitting • Simple concept plus noise • A very complex concept • insufficient number of examples + noise 1/3
Theoretical Model • Nested Hypothesis classes • H1H2H3 … Hi • Let VC-dim(Hi)=I • For simplicity |Hi| = 2i • There is a target function c(x), • For some i, c Hi • e(h) = Pr [ h c] • ei = minhHi e(h) • e* = miniei
Theoretical Model • Training error • obs(h) = Pr [ h c] • obsi = minhHi obs(h) • Complexity of h • d(h) = mini {h Hi} • Add a penalty for d(h) • minimize: obs(h)+penalty(h)
Structural Risk Minimization • Penalty based. • Chose the hypothesis which minimizes: • obs(h)+penalty(h) • SRM penalty:
SRM: Performance • THEOROM • With probability 1- • h* : best hypothesis • g* : SRM choice • e(h*) e(g*) e(h*)+ 2 penalty(h*) • Claim: The theorem is “tight” • Hiincludes 2i coins
Proof • Bounding the error in Hi • Bounding the error across Hi
Cross Validation • Separate sample to training and selection. • Using the training • Select from each Hi a candidate gi • Using the selection sample • select between g1, … ,gm • The split size • (1-)m training set • m selection set
Cross Validation: Performance • Errors • ecv(m), eA(m) • Theorem: with probability 1- • Is CV always near-optimal ?!
Minimum Description length • Penalty: size of h • Related to MAP • size of h: log(Pr[h]) • errors: log(Pr[D|h])