1 / 44

Model Assessment and Selection

Model Assessment and Selection. Lecture Notes for Comp540 Chapter7 Jian Li Mar.2007. Goal. Model Selection Model Assessment. A Regression Problem. y = f(x) + noise Can we learn f from this data? Let ’ s consider three methods. . Linear Regression. Quadratic Regression.

borka
Download Presentation

Model Assessment and Selection

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Model Assessment and Selection Lecture Notes for Comp540 Chapter7 Jian Li Mar.2007

  2. Goal • Model Selection • Model Assessment

  3. A Regression Problem • y = f(x) + noise • Can we learn f from this data? • Let’s consider three methods...

  4. Linear Regression

  5. Quadratic Regression

  6. Joining the dots

  7. Which is best? • Why not choose the method with the best fit to the data? “How well are you going to predict future data drawn from the same distribution?”

  8. Model Selection and Assessment • Model Selection: Estimating performances of different models to choose the best one (produces the minimum of the test error) • Model Assessment: Having chosen a model, estimating the prediction error on new data

  9. Why Errors • Why do we want to study errors? • In a data-rich situation split the data: Train Validation Test Model Selection Model assessment • But, that’s not usually the case

  10. Overall Motivation • Errors • Measurement of errors (Loss functions) • Decomposing Test Error into Bias & Variance • Estimating the true error • Estimating in-sample error (analytically ) AIC, BIC, MDL, SRM with VC • Estimating extra-sample error (efficientsample reuse) Cross Validation & Bootstrapping

  11. Measuring Errors: Loss Functions • Typical regression loss functions • Squared error: • Absolute error:

  12. Measuring Errors: Loss Functions • Typical classification loss functions • 0-1 Loss: • Log-likelihood (cross-entropy loss / deviance):

  13. The Goal: Low Test Error • We want to minimize generalization error or test error: • But all we really know is training error: ? • And this is a bad estimate of test error

  14. Bias, Variance & Complexity Training error can always be reduced when increasing model complexity, but risks over-fitting. Typically

  15. For squared-error loss & additive noise: Decomposing Test Error Model: Deviation of the average estimate from the true function’s mean Irreducible error of target Y Expected squared deviation of our estimate around its mean

  16. Average Model Bias Average Estimation Bias Further Bias Decomposition For standard linear regression, Estimation Bias = 0 • For linear models (eg. Ridge), bias can be further decomposed: * is the best fitting linear approximation

  17. Closest fit (given our observation)  Shrunken fit Closest fit In population (if epsilon=0) Graphical representation of bias & variance Model Space (basic linear regression) Hypothesis Space Realization Model Fitting Truth Regularized Model Space (ridge regression) Model Bias Estimation Variance Estimation Bias

  18. Averaging over the training set: Linear weights on y: Bias & Variance Decomposition Examples • kNN Regression • Linear Regression

  19. Simulated Example of Bias Variance Decomposition Prediction error -- + -- = -- -- + -- = -- Bias2 Regression with squared error loss Variance Bias-Variance different for 0-1 loss than for squared error loss -- + -- <> -- -- + -- <> -- Classification with 0-1 loss Estimation errors on the right side of the boundary don’t hurt!

  20. Optimism of The Training Error Rate • Typically: training error rate < true error (same data is being used to fit the method and assess its error) < overly optimistic

  21. Estimating Test Error • Can we estimate the discrepancy between err and Err? extra-sample error Expectation over N new responses at each xi Errin --- In-sample error: Adjustment for optimism of training error

  22. For linear fit with d indep inputs/basis funcs: optimism linearly with # d Optimism as training sample size Optimism Summary: for squared error, 0-1 and other loss functions:

  23. Ways to Estimate Prediction Error • In-sample error estimates: • AIC • BIC • MDL • SRM • Extra-sample error estimates: • Cross-Validation • Leave-one-out • K-fold • Bootstrap

  24. Estimates of In-Sample Prediction Error • General form of the in-sample estimate: • For linear fit :

  25. AIC & BIC Similarly: Akaike Information Criterion (AIC) Bayesian Information Criterion (BIC)

  26. AIC & BIC

  27. MDL(Minimum Description Length) • Regularity ~ Compressibility • Learning ~ Finding regularities Learning model Input Samples Rn Predictions R1 Real class R1 Real model =? error

  28. MDL(Minimum Description Length) • Regularity ~ Compressibility • Learning ~ Finding regularities Description of the model under optimal coding Length of transmitting the discrepancy given the model + optimal coding under the given model MDL principle: choose the model with the minimum description length Equivalent to maximizing the posterior:

  29. Vapnik showed that with probability 1- SRM with VC (Vapnik-Chernovenkis) Dimension As h increases A method of selecting a class F from a family of nested classes

  30. Errin Estimation • A trade-off between the fit to the data and the model complexity

  31. Estimation of Extra-Sample Err • Cross Validation • Bootstrap

  32. Cross-Validation test train K-fold ……

  33. How many folds? Computation increases Variance decreases bias decreases k fold Leave-one-out k increases

  34. Cross-Validation: Choosing K Popular choices for K: 5,10,N

  35. Generalized Cross-Validation • LOOCV can be computational expensive for linear fitting with large N • Linear fitting • For linear fitting under squared-error loss: • GCV provides a computationally cheaper approximation

  36. Bootstrap: Main Concept “The bootstrap is a computer-based method of statistical inference that can answer many real statistical questions without formulas” (An Introduction to the Bootstrap, Efron and Tibshirani, 1993) Step 2: Calculate the statistic Step 1: Draw samples with replacement

  37. Sampling distribution of sample mean The sample stands for the population and the distribution of in many resamples stands for the sampling distribution How is it coming In practice cannot afford large number of random samples The theory tells us the sampling distribution

  38. Bootstrap: Error Estimation with Errboot Depends on the unknown true distribution F A straightforward application of bootstrap to error prediction

  39. Bootstrap: Error Estimation with Err(1) A CV-inspired improvement on Errboot

  40. Bootstrap: Error Estimation with Err(.632) An improvement on Err(1)in light-fitting cases ?

  41. Bootstrap: Error Estimation with Err(.632+) An improvement on Err(.632)by adaptively accounting for overfitting • Depending on the amount of overfitting, the best error estimate is as little as Err(.632) , or as much as Err(1),or something in between • Err(.632+)is like Err(.632)with adaptive weights, with Err(1) weighted at least .632 • Err(.632+) adaptively mixes training error and leave-one-out error using the relative overfitting rate (R)

  42. Bootstrap: Error Estimation with Err(.632+)

  43. Cross Validation & Bootstrap • Why bother with cross-validation and bootstrap when analytical estimates are known? • AIC, BIC, MDL, SRM all requires knowledge of d, which is difficult to attain in most situations. 2) Bootstrap and cross validation gives similar results to above but also applicable in more complex situation. 3) Estimating the noise variance requires a roughly working model, cross validation and bootstrap will work well even if the model is far from correct.

  44. Conclusion • Test error plays crucial roles in model selection • AIC, BIC and SRMVC have the advantage that you only need the training error • If VC-dimension is known, then SRM is a good method for model selection – requires much less computation than CV and bootstrap, but is wildly conservative • Methods like CV, Bootstrap give tighter error bounds, but might have more variance • Asymptotically AIC and Leave-one-out CV should be the same • Asymptotically BIC and a carefully chosen k-fold should be the same • BIC is what you want if you want the best structure instead of the best predictor • Bootstrap has much wider applicability than just estimating prediction error

More Related