This resource explores the use of cross-validation methods for selecting statistical models, emphasizing predictive ability and accurate estimation. Topics include AIC, BIC, leave-k-out cross-validation, biases, and model complexity. Learn how to balance data-fit and model complexity to choose the best model.
Cross-validation for the selection of statistical models • Simon J. Mason and Michael K. Tippett, IRI
The Model Selection Problem • Given: a family of models M_a and a set of observations. • Question: which model should be used? • Goals: • Maximize predictive ability given limited observations. • Accurately estimate that predictive ability. • Example: linear regression: • observations (n = 50); • candidate predictors sorted by their correlation with the predictand; • M1 uses the first predictor, M2 the first two predictors, etc. (sketched below).
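The nested family of models described above can be set up in a few lines. The following sketch uses synthetic data, and the sample size, predictor pool, and function names are illustrative choices, not the authors' data or code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_pred = 50, 20                       # 50 observations, pool of 20 candidate predictors
X = rng.standard_normal((n, n_pred))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)   # only two predictors truly matter

# Sort the candidate predictors by the magnitude of their correlation with y.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_pred)])
order = np.argsort(-np.abs(corrs))

def model_Ma(a):
    """Design matrix for model M_a: an intercept plus the a best-correlated predictors."""
    return np.column_stack([np.ones(n), X[:, order[:a]]])

# M1 uses the single best predictor, M2 the best two, and so on.
beta_M2, *_ = np.linalg.lstsq(model_Ma(2), y, rcond=None)
```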
Estimating predictive ability • The wrong way: calibrate each model with all of the data and choose the model that best fits that same data.
In-sample skill estimates • Akaike information criterion (AIC): AIC = -2 log(L) + 2p • an asymptotic estimate of the expected out-of-sample error. • Minimizing Mallows’ Cp is equivalent to minimizing AIC (for linear regression). • Bayesian information criterion (BIC): BIC = -2 log(L) + p log(n) • the difference in BIC between two models approximates -2 log of their Bayes factor. • L = likelihood, p = number of parameters, n = number of samples. Both maximize fit while penalizing complexity.
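For Gaussian linear regression, -2 log(L) reduces (up to an additive constant) to n log(RSS/n), so both criteria can be computed directly from an ordinary least-squares fit. A minimal sketch, with p counted as the number of regression coefficients (some conventions add one for the error variance):

```python
import numpy as np

def aic_bic(X, y):
    """Return (AIC, BIC) for an OLS fit with design matrix X (intercept included in X)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    neg2logL = n * np.log(rss / n)        # Gaussian -2 log(L), additive constant dropped
    return neg2logL + 2 * p, neg2logL + p * np.log(n)
```

Applied to the nested family M1, M2, … above, the model minimizing AIC or BIC would be selected.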
AIC and BIC AIC = -2 log (L) + 2p BIC = -2 log(L)+p log(n) • BIC tends to select simpler models. • AIC is asymptotically (many obs.) inconsistent. • BIC consistent. • For constant model size, pick best fit. • Large pool of predictors leads to over-fitting.
Out-of-sample skill estimates Calibrate and validate models using independent data sets. • Split the data into calibration and validation sets; or • repeatedly divide the data: • leave-1-out cross-validation; • leave-k-out cross-validation (a sketch follows). What are the properties of these estimates?
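One common form of leave-k-out CV in this setting withholds a block of k cases centred on each target case, refits the model on the remainder, and verifies only the central case; k = 1 recovers leave-1-out CV. The sketch below assumes this variant, and the function name and details are illustrative:

```python
import numpy as np

def leave_k_out_predictions(X, y, k=1):
    """Cross-validated predictions: for each case i, withhold a block of k cases
    centred on i (k odd), refit by OLS on the rest, and predict case i."""
    n = len(y)
    half = k // 2
    yhat = np.empty(n)
    for i in range(n):
        held_out = np.arange(max(0, i - half), min(n, i + half + 1))
        train = np.setdiff1d(np.arange(n), held_out)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        yhat[i] = X[i] @ beta
    return yhat   # compare against y with correlation or RMS error
```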
Leave-k-out CV is biased • Single predictor and predictand: CV underestimates the correlation; increasing k reduces (increases) the bias for low (high) correlations (Barnston & van den Dool 1993). • Multivariate linear regression: CV overestimates the RMS error, with a bias ~ k/[n(n-k)] (Burman 1989). For a given model with significant skill, a large k therefore underestimates the skill.
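To get a feel for the Burman approximation, the bias factor k/[n(n-k)] can be evaluated for the n = 50 example; the printed values are order-of-magnitude only.

```python
n = 50
for k in (1, 5, 10, 25):
    print(k, k / (n * (n - k)))
# k = 1 gives ~0.0004; k = 25 gives 0.02: leaving more out inflates the RMS-error bias.
```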
On the other hand … selection bias. “If one begins with a very large collection of rival models, then we can be fairly sure that the winning model will have an accidentally high maximum likelihood term.” (Forster) • The true predictive skill is therefore likely to be overestimated. • This undermines both goals: optimal model choice and accurate skill estimation. • Ideally, use an independent data set to estimate skill.
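Selection bias is easy to reproduce in a small Monte Carlo experiment: with a large pool of predictors that are pure noise, the best in-sample correlation is "accidentally" high even though the true skill is zero. Pool size, sample size, and trial count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_pred, n_trials = 50, 100, 200
best = []
for _ in range(n_trials):
    y = rng.standard_normal(n)
    X = rng.standard_normal((n, n_pred))                 # predictors unrelated to y
    r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_pred)])
    best.append(r.max())
print(np.mean(best))   # typically around 0.4, despite zero true predictive skill
```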
In-sample and CV estimates • Leave-1-out cross-validation is asymptotically equivalent to AIC (and to Mallows’ Cp; Stone 1979). • Leave-k-out cross-validation is asymptotically equivalent to BIC for a suitably chosen k. • Increasing k tends to select simpler models: • CV with large k penalizes complex models by requiring them to estimate many parameters from little data.
Leave-k-out cross-validation • Leaving more out tends to select simpler models. • The choice of verification metric matters: • correlation and RMS error are not simply related (toy example below); • RMS error selects simpler models in the numerical experiments.
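A toy illustration (made-up numbers, not the authors' experiments) of why the metric matters: two forecast series for the same observations rank differently under correlation and under RMS error.

```python
import numpy as np

y   = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
f_a = 3.0 * y                                       # perfectly correlated, badly scaled
f_b = y + np.array([0.5, -0.5, 0.5, -0.5, 0.5])     # small errors, slightly lower correlation

for name, f in (("f_a", f_a), ("f_b", f_b)):
    corr = np.corrcoef(y, f)[0, 1]
    rmse = np.sqrt(np.mean((y - f) ** 2))
    print(name, round(corr, 2), round(rmse, 2))
# Correlation prefers f_a; RMS error prefers f_b.
```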
Impact on skill estimates • Leaving more out reduces skill estimate biases in numerical experiments.
Better model selected? • If the “true” model is simple, leaving out more selects a better model.
Conclusions • Increasing the pool of predictors increases the chance of over-fitting and of over-estimating skill. • AIC and BIC balance data-fit against model complexity; BIC chooses simpler models. • Leave-k-out cross-validation also penalizes model complexity (leave-1-out is asymptotically equivalent to AIC). • Leaving more out • selects simpler models and • reduces the bias in skill estimates.