
Lecture 12 – Model Assessment and Selection



Presentation Transcript


  1. Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006

  2. Summary • Bias, variance, model complexity • Optimism of the training error rate • Estimates of in-sample prediction error, AIC • Effective number of parameters • The Bayesian approach and BIC • Vapnik-Chervonenkis dimension • Cross-validation • Bootstrap methods

  3. Model Selection Criteria • Training error: the average loss over the training sample, err = (1/N) Σi L(yi, f^(xi)) • Loss function: a measure L(Y, f^(X)) of the error between Y and the prediction f^(X), e.g., squared error (Y − f^(X))² or 0-1 loss • Generalization (test) error: the expected prediction error on new data, Err = E[L(Y, f^(X))]

  4. Training Error vs. Test Error

  5. Model Selection and Assessment • Model selection: estimating the performance of different models in order to choose the best one • Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data • If we were rich in data, we would split it into three parts: a training set, a validation set, and a test set
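A minimal sketch of such a three-way split; the 50/25/25 proportions and the NumPy-based implementation are illustrative assumptions, not from the lecture:

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.25, test_frac=0.25, seed=0):
    """Randomly partition (X, y) into train / validation / test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_frac * len(y))
    n_val = int(val_frac * len(y))
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```

The validation set is used to compare candidate models (model selection); the test set is held out until the end to assess the chosen model (model assessment).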

  6. Bias-Variance Decomposition • As we have seen before, for squared-error loss and the additive model Y = f(X) + ε, the expected prediction error at a point x0 decomposes as Err(x0) = σε² + [E f^(x0) − f(x0)]² + E[f^(x0) − E f^(x0)]² = irreducible error + bias² + variance • The first term is the variance of the target around the true mean f(x0); the second term is the squared amount by which our average estimate is off from the true mean; the last term is the variance of f^(x0) • The more complex the model f^, the lower the (squared) bias, but the higher the variance

  7. Bias-Variance Decomposition (cont’d) • For a k-nearest-neighbor fit: Err(x0) = σε² + [f(x0) − (1/k) Σ_{l=1..k} f(x_(l))]² + σε²/k • For a linear regression fit f^p(x0) with p inputs: Err(x0) = σε² + [f(x0) − E f^p(x0)]² + ‖h(x0)‖² σε²

  8. Bias-Variance Decomposition (cont’d) • For linear regression, h(x0) = X(XᵀX)⁻¹x0 is the vector of weights that produces the fit f^p(x0) = x0ᵀ(XᵀX)⁻¹Xᵀy, and hence Var[f^p(x0)] = ‖h(x0)‖² σε² • This variance changes with x0, but its average over the sample values xi is (p/N) σε²

  9. Example • 50 observations and 20 predictors, uniformly distributed in the hypercube [0,1]²⁰ • Left: Y is 0 if X1 ≤ 1/2 and 1 if X1 > 1/2, and a k-NN fit is applied • Right: Y is 1 if Σ_{j=1..10} Xj is greater than 5, and 0 otherwise • (Figure: prediction error, squared bias, and variance as model complexity varies)
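A rough Monte Carlo sketch of how the squared bias and variance of a k-NN fit could be estimated for the left-panel setup; the test point x0, the value of k, and the number of simulations are arbitrary assumptions of this illustration:

```python
import numpy as np

def knn_predict(Xtr, ytr, x0, k):
    """Average the responses of the k nearest training points to x0."""
    dist = np.sum((Xtr - x0) ** 2, axis=1)
    return ytr[np.argsort(dist)[:k]].mean()

def knn_bias_variance(k=7, n_train=50, p=20, n_sims=500, seed=0):
    rng = np.random.default_rng(seed)
    x0 = np.full(p, 0.25)            # arbitrary test point (assumption)
    f_x0 = float(x0[0] > 0.5)        # true function: Y = 1{X1 > 1/2}
    preds = np.empty(n_sims)
    for s in range(n_sims):
        X = rng.uniform(size=(n_train, p))
        y = (X[:, 0] > 0.5).astype(float)   # noiseless labels, as in the left panel
        preds[s] = knn_predict(X, y, x0, k)
    bias_sq = (preds.mean() - f_x0) ** 2
    var = preds.var()
    return bias_sq, var, bias_sq + var      # prediction error (no noise term here)
```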

  10. Example – Loss Function • (Figure: prediction error, squared bias, and variance for the same examples)

  11. Optimism of the Training Error • The training error err = (1/N) Σi L(yi, f^(xi)) is typically less than the true (extra-sample) error • The in-sample error Err_in replaces each yi by a new response Yi⁰ observed at the same training point xi • The optimism is the difference op ≡ Err_in − err • For squared error, 0-1, and other loss functions, one can show in general that E_y[op] = (2/N) Σi Cov(ŷi, yi)

  12. Optimism (cont’d) • Thus, the amount by which the training error underestimates the true error depends on how strongly yi affects its own prediction • For a linear fit with d inputs or basis functions, Σi Cov(ŷi, yi) = d σε² • For the additive model Y = f(X) + ε this gives E_y[op] = 2·(d/N)·σε², and thus optimism increases linearly with the number of inputs or basis functions d, and decreases as the training size N increases

  13. How to Account for Optimism? • Estimate the optimism and add it to the training error, e.g., AIC, BIC, etc. • Bootstrap and cross-validation are direct estimates of the prediction (extra-sample) error

  14. Estimates of In-Sample Prediction Error • The general form of the in-sample estimate is Err^_in = err + op^, the training error plus an estimate of the optimism • Cp statistic: for an additive error model, when d parameters are fit under squared-error loss, Cp = err + 2·(d/N)·σε², where σε² here is an estimate of the noise variance (e.g., from the mean squared error of a low-bias model) • This criterion adjusts the training error by a factor proportional to the number of basis functions used • The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of Err_in, used when a log-likelihood loss function is adopted
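A small sketch of the Cp computation under these assumptions (squared-error loss, d fitted parameters, and a noise-variance estimate supplied by the caller):

```python
import numpy as np

def cp_statistic(y, y_hat, d, sigma2_hat):
    """C_p = training error + 2 * (d / N) * sigma2_hat."""
    err = np.mean((y - y_hat) ** 2)          # training error under squared-error loss
    return err + 2.0 * d / len(y) * sigma2_hat
```

Here `sigma2_hat` would typically come from the mean squared error of a low-bias (large) model, as noted above.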

  15. Akaike Information Criterion (AIC) • AIC relies on a relationship that holds asymptotically as N → ∞: −2·E[log Pr_θ^(Y)] ≈ −(2/N)·E[loglik] + 2·(d/N) • Pr_θ(Y) is a family of densities for Y (containing the “true” density), θ^ is the maximum likelihood estimate of θ, and “loglik” is the maximized log-likelihood, loglik = Σi log Pr_θ^(yi)

  16. AIC (cont’d) • For the Gaussian model (with the variance assumed known), AIC is equivalent to Cp • For logistic regression, using the binomial log-likelihood, AIC = −(2/N)·loglik + 2·(d/N) • Choose the model that produces the smallest AIC • What if we don’t know d? • What if there are tuning parameters?
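A sketch of the binomial-log-likelihood AIC for a fitted logistic model; y are 0/1 labels, p_hat the fitted probabilities, and d the parameter count, all supplied by the caller (an assumption of this illustration, not a fixed API):

```python
import numpy as np

def aic_binomial(y, p_hat, d):
    """AIC = -(2/N) * loglik + 2 * d / N with the binomial log-likelihood."""
    eps = 1e-12                              # guard against log(0)
    loglik = np.sum(y * np.log(p_hat + eps) + (1 - y) * np.log(1 - p_hat + eps))
    return -2.0 / len(y) * loglik + 2.0 * d / len(y)
```

Model selection then amounts to computing this value for each candidate model and keeping the one with the smallest AIC.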

  17. AIC (cont’d) • Given a set of models f_α(x) indexed by a tuning parameter α, denote by err(α) and d(α) the training error and the number of parameters; then AIC(α) = err(α) + 2·(d(α)/N)·σε² • The function AIC(α) provides an estimate of the test error curve, and we find the tuning parameter α that minimizes it • By choosing the best-fitting model with d inputs, the effective number of parameters fit is more than d

  18. AIC – Example: Phoneme Recognition

  19. The Effective Number of Parameters • Generalize the number of parameters to regularized fits: for a linear fitting method, ŷ = Sy, where S is an N×N matrix that depends on the xi but not on the yi • The effective number of parameters is d(S) = trace(S) • The in-sample error estimate becomes Err^_in = err + 2·(trace(S)/N)·σε²

  20. The Effective Number of Parameters (cont’d) • Thus, for a regularized model fit by penalized least squares (e.g., ridge regression), the fit is still linear in y: ŷ = S_λ y with S_λ = X(XᵀX + λI)⁻¹Xᵀ • Hence the effective number of parameters is d(λ) = trace(S_λ) • and this d(λ) replaces d in criteria such as Cp and AIC
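A concrete instance of d(S) = trace(S), assuming ridge regression as the regularized fit (so S_λ = X(XᵀX + λI)⁻¹Xᵀ); other penalties would change S but not the trace recipe:

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective number of parameters d(lambda) = trace(S_lambda),
    where y_hat = S_lambda y and S_lambda = X (X^T X + lambda I)^-1 X^T."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)
```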

  21. The Bayesian Approach and BIC • The Bayesian information criterion (BIC) is BIC = −2·loglik + (log N)·d • BIC/2 is also known as the Schwarz criterion • BIC is proportional to AIC (Cp), with the factor 2 replaced by log N; BIC penalizes complex models more heavily, preferring simpler models

  22. BIC (cont’d) • BIC is asymptotically consistent as a selection criterion: given a family of models that includes the true one, the probability of selecting the true model approaches 1 as N → ∞ • Suppose we have a set of candidate models M_m, m = 1,…,M, with corresponding model parameters θ_m, and we wish to choose the best model • Assuming a prior distribution Pr(θ_m | M_m) for the parameters of each model M_m, compute the posterior probability of each model!

  23. BIC (cont’d) • The posterior probability is Pr(M_m | Z) ∝ Pr(M_m)·Pr(Z | M_m), where Z represents the training data • To compare two models M_m and M_l, form the posterior odds Pr(M_m | Z) / Pr(M_l | Z) = [Pr(M_m) / Pr(M_l)] · [Pr(Z | M_m) / Pr(Z | M_l)] • If the posterior odds are greater than one, choose model m; otherwise choose model l

  24. BIC (cont’d) • The Bayes factor is the rightmost term in the posterior odds, Pr(Z | M_m) / Pr(Z | M_l) • We need to approximate Pr(Z | M_m) = ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m • A Laplace approximation to the integral gives log Pr(Z | M_m) ≈ log Pr(Z | θ^_m, M_m) − (d_m/2)·log N + O(1) • θ^_m is the maximum likelihood estimate and d_m is the number of free parameters of model M_m • If the loss function is set to −2·log Pr(Z | M_m, θ^_m), this is equivalent to the BIC criterion

  25. BIC (cont’d) • Thus, choosing the model with minimum BIC is equivalent to choosing the model with the largest (approximate) posterior probability • If we compute the BIC criterion for a set of M models, BIC_m, m = 1,…,M, then the posterior probability of each model is estimated as e^(−BIC_m/2) / Σ_{l=1..M} e^(−BIC_l/2) • Thus, we can not only estimate the best model, but also assess the relative merits of the models considered
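A small sketch of converting a list of BIC values into approximate posterior model probabilities; subtracting the minimum BIC is just for numerical stability:

```python
import numpy as np

def bic_posterior_weights(bic_values):
    """Pr(M_m | Z) ~ exp(-BIC_m / 2) / sum_l exp(-BIC_l / 2)."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))    # shift by the minimum for numerical stability
    return w / w.sum()
```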

  26. Vapnik-Chervonenkis Dimension • It is often difficult to specify the number of parameters • The Vapnik-Chervonenkis (VC) theory provides a general measure of complexity and associated bounds on the optimism • Consider a class of functions {f(x, α)} indexed by a parameter vector α, with x ∈ ℝᵖ • Assume f is an indicator function, taking values 0 or 1 • If α = (α0, α1) and f is the linear indicator function I(α0 + α1ᵀx > 0), then it is reasonable to say its complexity is p + 1 parameters • But what about f(x, α) = I(sin(α·x) > 0)?

  27. VC Dimension (cont’d)

  28. VC Dimension (cont’d) • The Vapnik-Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be • The VC dimension of the class {f(x, α)} is defined to be the largest number of points (in some configuration) that can be shattered by members of {f(x, α)}

  29. VC Dimension (cont’d) • A set of points is shattered by a class of functions if, no matter how we assign a binary label to each point, some member of the class can separate them perfectly • Example: the VC dimension of linear indicator functions in the plane is 3 – three points in general position can be shattered, but no configuration of four points can (see the sketch below)
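A brute-force sketch of the shattering check for linear indicator functions I(a0 + aᵀx > 0) in the plane; the random search over candidate hyperplanes is an assumption of this illustration, so a True answer is only correct with high probability:

```python
import numpy as np
from itertools import product

def can_shatter(points, n_candidates=20000, seed=0):
    """Check whether every binary labeling of `points` is realized by some
    linear indicator function I(a0 + a.x > 0), using random candidate hyperplanes."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    a = rng.normal(size=(n_candidates, pts.shape[1]))
    a0 = rng.normal(size=n_candidates)
    labelings = (pts @ a.T + a0 > 0).astype(int)        # shape (n_points, n_candidates)
    realized = {tuple(int(v) for v in col) for col in labelings.T}
    return all(lab in realized for lab in product((0, 1), repeat=len(pts)))

# can_shatter([(0, 0), (1, 0), (0, 1)])          -> True: 3 points can be shattered
# can_shatter([(0, 0), (1, 0), (0, 1), (1, 1)])  -> False: 4 points cannot, so VC dim = 3
```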

  30. VC Dimension (cont’d) • Using the concept of VC dimension, one can prove results about the optimism of the training error when fitting with a class of functions, e.g.: if we fit N data points using a class of functions {f(x, α)} having VC dimension h, then with probability at least 1 − η over training sets, Err_T ≤ err + (ε/2)(1 + √(1 + 4·err/ε)) for binary classification and Err_T ≤ err / (1 − c√ε)₊ for regression, where ε = a1·[h(log(a2·N/h) + 1) − log(η/4)]/N • For regression, Cherkassky and Mulier (1998) suggest a1 = a2 = 1

  31. VC Dimension (cont’d) • These bounds suggest that the optimism increases with h and decreases with N, in qualitative agreement with the AIC correction d/N • The VC results are stronger, however: they give probabilistic upper bounds that hold for all functions f(x, α), and hence allow for searching over the class

  32. VC Dimension (cont’d) • Vapnik’s Structural Risk Minimization (SRM) is built around the bounds described above • SRM fits a nested sequence of models of increasing VC dimension h1 < h2 < …, and then chooses the model with the smallest value of the upper bound • A drawback is the difficulty of computing the VC dimension of a class of functions • A crude upper bound on the VC dimension may not be adequate

  33. Example – AIC, BIC, SRM

  34. Cross-Validation (CV) • The most widely used method for estimating prediction error • It directly estimates the generalization error by applying the model to held-out test samples • K-fold cross-validation: use one part of the data to build the model and a different part to test it • Do this for k = 1, 2, …, K and compute the prediction error on the k-th part each time

  35. CV (cont’d) • A mapping κ: {1,…,N} → {1,…,K} indicates the partition (fold) to which each observation is allocated • f^(−k)(x) denotes the fitted function computed with the k-th part of the data removed • The CV estimate of prediction error is CV = (1/N) Σi L(yi, f^(−κ(i))(xi)) • If K = N, this is called leave-one-out CV • Given a set of models f(x, α), let f^(−k)(x, α) denote the α-th model fit with the k-th part removed; for this set of models we have CV(α) = (1/N) Σi L(yi, f^(−κ(i))(xi, α))

  36. CV (cont’d) • CV(α) should be minimized over α • What should we choose for K? • With K = N, CV is approximately unbiased, but it can have high variance since the N training sets are almost identical to one another • Computational cost is also a concern, since N separate fits are required; a sketch of K-fold CV follows below
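As referenced above, a generic K-fold CV sketch; `fit`, `predict`, and `loss` are hypothetical callables supplied by the user (with `loss` returning one value per observation), not part of the lecture:

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, loss, K=10, seed=0):
    """CV = (1/N) * sum_i L(y_i, f^{-kappa(i)}(x_i)) for a random K-way partition."""
    rng = np.random.default_rng(seed)
    N = len(y)
    kappa = rng.permutation(N) % K                    # kappa(i): fold of observation i
    per_obs_loss = np.empty(N)
    for k in range(K):
        held_out = kappa == k
        model = fit(X[~held_out], y[~held_out])       # fit with the k-th part removed
        per_obs_loss[held_out] = loss(y[held_out], predict(model, X[held_out]))
    return per_obs_loss.mean()
```

Setting K = N gives leave-one-out CV; K = 5 or 10 trades a little bias for lower variance and far less computation.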

  37. CV (cont’d)

  38. CV (cont’d) • With a lower K, CV has lower variance, but bias can become a problem! • The most common choices are 5-fold and 10-fold CV!

  39. CV (cont’d) • Generalized cross-validation (GCV) provides a convenient approximation to leave-one-out CV for linear fitting under squared-error loss, ŷ = Sy • For many linear fits, (1/N) Σi [yi − f^(−i)(xi)]² = (1/N) Σi [(yi − f^(xi)) / (1 − S_ii)]², where S_ii is the i-th diagonal element of S • The GCV approximation is GCV = (1/N) Σi [(yi − f^(xi)) / (1 − trace(S)/N)]² • GCV can be advantageous in settings where the trace of S is computed more easily than the individual S_ii’s
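A sketch of both shortcuts for a linear smoother, assuming the full smoother matrix S (with ŷ = Sy) is available:

```python
import numpy as np

def loocv_and_gcv(y, S):
    """Leave-one-out CV and GCV for a linear fit y_hat = S y under squared-error loss."""
    resid = y - S @ y
    loocv = np.mean((resid / (1.0 - np.diag(S))) ** 2)            # uses the individual S_ii
    gcv = np.mean((resid / (1.0 - np.trace(S) / len(y))) ** 2)    # replaces S_ii by trace(S)/N
    return loocv, gcv
```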

  40. Bootstrap • Denote the training set by Z = (z1,…,zN), where zi = (xi, yi) • Randomly draw datasets with replacement from the training data, each the same size as the original training set • Do this B times (e.g., B = 100) • Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the B replications • From the bootstrap samples we can estimate any aspect of the distribution of S(Z), where S(Z) is any quantity computed from the data
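A minimal sketch of the basic bootstrap loop; `stat` is any user-supplied function of the data, and Z is assumed here to be a NumPy array with one row per observation zi:

```python
import numpy as np

def bootstrap_replications(Z, stat, B=100, seed=0):
    """Return S(Z*1), ..., S(Z*B) for B bootstrap datasets drawn with replacement from Z."""
    rng = np.random.default_rng(seed)
    N = len(Z)
    return np.array([stat(Z[rng.integers(0, N, size=N)]) for _ in range(B)])

# e.g. a variance estimate for the sample median:
# reps = bootstrap_replications(z, np.median, B=200)
# var_hat = reps.var(ddof=1)
```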

  41. Bootstrap – Schematic • (Figure: bootstrap replications Z*1,…,Z*B drawn from the training set Z, each yielding S(Z*b)) • For example, the variance of S(Z) can be estimated by Var^[S(Z)] = (1/(B−1)) Σ_{b=1..B} (S(Z*b) − S̄*)², where S̄* = (1/B) Σb S(Z*b)

  42. Bootstrap (cont’d) • A first attempt is to use the bootstrap to estimate the prediction error directly: Err^_boot = (1/B)(1/N) Σb Σi L(yi, f^*b(xi)) • Err^_boot does not provide a good estimate • Each bootstrap dataset acts as both training and test data, and the two share common observations • The overfit predictions therefore look unrealistically good • Better bootstrap estimates mimic CV: only keep track of predictions from bootstrap samples that do not contain the observation being predicted

  43. Bootstrap (cont’d) • The leave-one-out bootstrap estimate of prediction error is Err^(1) = (1/N) Σi (1/|C^(−i)|) Σ_{b ∈ C^(−i)} L(yi, f^*b(xi)) • C^(−i) is the set of indices of the bootstrap samples b that do not contain observation i • We must either choose B large enough to ensure that every |C^(−i)| is greater than zero, or simply leave out the terms corresponding to |C^(−i)| = 0
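A sketch of the leave-one-out bootstrap estimate; `fit`, `predict`, and `loss` are hypothetical callables as in the CV sketch, and observations with an empty C^(−i) are simply dropped:

```python
import numpy as np

def loo_bootstrap_error(X, y, fit, predict, loss, B=100, seed=0):
    """Err^(1) = (1/N) sum_i (1/|C^-i|) sum_{b in C^-i} L(y_i, f*_b(x_i))."""
    rng = np.random.default_rng(seed)
    N = len(y)
    loss_sum = np.zeros(N)
    n_out = np.zeros(N)                         # |C^-i| for each observation i
    for b in range(B):
        idx = rng.integers(0, N, size=N)        # bootstrap sample b
        model = fit(X[idx], y[idx])
        out = np.setdiff1d(np.arange(N), idx)   # observations NOT in sample b
        loss_sum[out] += loss(y[out], predict(model, X[out]))
        n_out[out] += 1
    covered = n_out > 0                         # drop i's never left out (or increase B)
    return np.mean(loss_sum[covered] / n_out[covered])
```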

  44. Bootstrap (cont’d) • The leave-one-out bootstrap solves the overfitting problem, but it has a training-set-size bias • The average number of distinct observations in each bootstrap sample is about 0.632·N • Thus, if the learning curve has considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the error • A number of methods have been proposed to alleviate this problem, e.g., the .632 estimator, Err^(.632) = 0.368·err + 0.632·Err^(1), and refinements based on an estimated (no-information) overfitting rate
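A one-line sketch of the .632 estimator mentioned above, combining the training error err with the leave-one-out bootstrap error Err^(1):

```python
def err_632(err_bar, err_loo_boot):
    """Err^(.632) = 0.368 * training error + 0.632 * leave-one-out bootstrap error."""
    return 0.368 * err_bar + 0.632 * err_loo_boot
```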

  45. Bootstrap (Example) • Five-fold CV and the .632 estimator applied to the same problems as before • Any of the measures could be biased, but this does not affect model selection as long as the relative performance of the models stays the same
