This chapter discusses the bias-variance tradeoff, model selection methods, and the concept of cross-validation in stochastic search and optimization. It also explores the Fisher information matrix and the decomposition of mean squared model error.
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall • CHAPTER 13: MODELING CONSIDERATIONS AND STATISTICAL INFORMATION • “All models are wrong; some are useful.” (George E. P. Box) • Organization of chapter in ISSO: • Bias-variance tradeoff • Model selection: Cross-validation • Fisher information matrix: Definition, examples, and efficient computation
Model Definition and MSE • Assume model $z = h(\theta, x) + v$, where $z$ is output, $h(\cdot)$ is some function, $x$ is input, $v$ is noise, and $\theta$ is vector of model parameters • $h(\cdot)$ may represent simulation model • $h(\cdot)$ may represent “metamodel” (response surface) of existing simulation • A fundamental goal is to take $n$ data points and estimate $\theta$, forming $\hat{\theta}_n$ • A common measure of effectiveness for estimate is mean of squared model error (MSE) at fixed $x$: $E\{[h(\hat{\theta}_n, x) - E(z \mid x)]^2 \mid x\}$
Bias-Variance Decomposition • The MSE of the model at a fixed $x$ can be decomposed as: $E\{[h(\hat{\theta}_n, x) - E(z \mid x)]^2 \mid x\} = E\{[h(\hat{\theta}_n, x) - E(h(\hat{\theta}_n, x))]^2 \mid x\} + [E(h(\hat{\theta}_n, x)) - E(z \mid x)]^2$ = variance at $x$ + (bias at $x$)$^2$, where expectations are computed w.r.t. $\hat{\theta}_n$ • Above implies: Model too simple $\Rightarrow$ high bias/low variance; model too complex $\Rightarrow$ low bias/high variance
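The decomposition follows from adding and subtracting the mean prediction inside the square. A short sketch of the algebra, using the shorthand $\hat{h} = h(\hat{\theta}_n, x)$ and $m = E(z \mid x)$ (abbreviations introduced here for brevity, not taken from the slides), with all expectations conditional on $x$:

```latex
\begin{aligned}
E\big[(\hat{h} - m)^2\big]
  &= E\big[(\hat{h} - E\hat{h} + E\hat{h} - m)^2\big] \\
  &= E\big[(\hat{h} - E\hat{h})^2\big]
     + 2\,(E\hat{h} - m)\,E\big[\hat{h} - E\hat{h}\big]
     + (E\hat{h} - m)^2 \\
  &= \underbrace{E\big[(\hat{h} - E\hat{h})^2\big]}_{\text{variance at } x}
     + \underbrace{(E\hat{h} - m)^2}_{(\text{bias at } x)^2}
\end{aligned}
```

The cross term drops out because $E\hat{h} - m$ is nonrandom and $E[\hat{h} - E\hat{h}] = 0$.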
Unbiased Estimator May Not be Best (Example 13.1 from ISSO) • Unbiased estimator is such that $E[h(\hat{\theta}_n, x) \mid x] = E(z \mid x)$ (i.e., mean of prediction is same as mean of data $z$) • Example: Let $\bar{z}_n$ denote sample mean of scalar i.i.d. data as estimator of true mean $\mu$ ($h(\theta, x) = \mu$ in notation above) • Alternative biased estimator of $\mu$ is $r\bar{z}_n$, where $0 < r < 1$ • MSE of biased and unbiased estimators generally satisfy $E[(r\bar{z}_n - \mu)^2] < E[(\bar{z}_n - \mu)^2]$ • Biased estimate better in MSE sense • However, optimal value of $r$ requires knowledge of unknown (true) $\mu$
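A worked version of the comparison may help. It assumes the biased estimator is the scaled sample mean $r\bar{z}_n$ and writes $\sigma^2$ for the variance of each data point (a symbol introduced here, not in the slides), so that $\bar{z}_n$ has mean $\mu$ and variance $\sigma^2/n$:

```latex
\begin{aligned}
E\big[(\bar{z}_n - \mu)^2\big] &= \frac{\sigma^2}{n}, \\
E\big[(r\bar{z}_n - \mu)^2\big]
  &= \underbrace{\frac{r^2\sigma^2}{n}}_{\text{variance}}
   + \underbrace{(1 - r)^2\mu^2}_{\text{bias}^2}, \\
E\big[(r\bar{z}_n - \mu)^2\big] < E\big[(\bar{z}_n - \mu)^2\big]
  &\iff r > \frac{\mu^2 - \sigma^2/n}{\mu^2 + \sigma^2/n}, \\
r^{*} &= \frac{\mu^2}{\mu^2 + \sigma^2/n}
  \quad \text{(minimizer of the MSE over } r\text{)}.
\end{aligned}
```

Under these assumptions the biased estimator wins whenever $r$ is close enough to 1, and the minimizing $r^{*}$ depends on the unknown $\mu$ (and on $\sigma^2$), which is why it cannot be used directly.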
Example 13.2 in ISSO: Bias-Variance Tradeoff • Suppose true process produces output according to $z = f(x)$ + noise, where $f(x) = (x + x^2)^{1.1}$ • Compare linear, quadratic, and cubic approximations • Table below gives average bias, variance, and MSE • Overall pattern of decreasing bias and increasing variance; optimal tradeoff is quadratic model
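The pattern described in this example can be explored with a small Monte Carlo sketch. The input grid, noise level, sample size, and number of replications below are assumptions chosen for illustration and are not the settings used in ISSO, so the numbers will not match the book's table; only $f(x)$ comes from the slide.

```python
import numpy as np

def f(x):
    """True process mean from the slide: f(x) = (x + x^2)^1.1."""
    return (x + x**2) ** 1.1

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 20)   # assumed fixed design points
sigma = 0.5                     # assumed noise standard deviation
reps = 2000                     # assumed number of simulated data sets

results = {}
for degree in (1, 2, 3):        # linear, quadratic, cubic
    X = np.vander(x, degree + 1)        # polynomial design matrix
    H = X @ np.linalg.pinv(X)           # least-squares "hat" matrix
    Z = f(x) + sigma * rng.standard_normal((reps, x.size))
    preds = Z @ H.T                     # fitted values, one row per data set
    bias2 = np.mean((preds.mean(axis=0) - f(x)) ** 2)  # average squared bias
    var = np.mean(preds.var(axis=0))                   # average variance
    mse = np.mean((preds - f(x)) ** 2)                 # average MSE
    results[degree] = (bias2, var, mse)

for degree, (bias2, var, mse) in results.items():
    print(f"degree {degree}: bias^2={bias2:.4f}  variance={var:.4f}  MSE={mse:.4f}")
```

In each row the MSE equals squared bias plus variance, as in the decomposition. Which degree gives the lowest MSE depends on the assumed noise level and design.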
Model Selection • The bias-variance tradeoff provides conceptual framework for determining a good model • Bias-variance tradeoff not directly useful • Need a practical method for optimizing bias-variance tradeoff • Practical aim is to pick a model that minimizes a criterion: $f_1$(fitting error from given data) + $f_2$(model complexity), where $f_1$ and $f_2$ are increasing functions • All methods based on a tradeoff between fitting error (high variance) and model complexity (low bias) • Criterion above may/may not be explicitly used in given method
Methods for Model Selection • Among many popular methods are: • Akaike Information Criterion (AIC) (Akaike, 1974); popular in time series analysis • Bayesian selection (Akaike, 1977) • Bootstrap-based selection (Efron and Tibshirani, 1997) • Cross-validation (Stone, 1974) • Minimum description length (Rissanen, 1978) • V-C dimension (Vapnik and Chervonenkis, 1971); popular in computer science • Cross-validation appears to be most popular model fitting method
Cross-Validation • Cross-validation is a simple, general method for comparing candidate models • Other specialized methods may work better in specific problems • Cross-validation uses the training set of data • Method is based on iteratively partitioning the full set of training data into training and test subsets • For each partition, estimate model from training subset and evaluate model on test subset • Number of training (or test) subsets = number of model fits required • Select model that performs best over all test subsets
Choice of Training and Test Subsets • Let $n$ denote total size of data set, $n_T$ denote size of test subset, $n_T < n$ • Common strategy is leave-one-out: $n_T = 1$ • Implies $n$ test subsets during cross-validation process • Often better to choose $n_T > 1$ • Sometimes more efficient (sampling w/o replacement) • Sometimes more accurate model selection • If $n_T > 1$, sampling may be with or without replacement • “With replacement” indicates that there are “$n$ choose $n_T$” test subsets, written $\binom{n}{n_T}$ • With replacement may be prohibitive in practice: e.g., $n = 30$, $n_T = 6$ implies nearly 600K model fits! • Sampling without replacement reduces number of test subsets to $n/n_T$ (disjoint test subsets)
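The subset counts quoted above are easy to check. A minimal sketch using the slide's values $n = 30$ and $n_T = 6$ (the variable names are illustrative only):

```python
import math

n, n_T = 30, 6

# "With replacement" in the slide's sense: every size-n_T test subset is used
all_subsets = math.comb(n, n_T)

# Without replacement: disjoint test subsets that together cover the data
disjoint_subsets = n // n_T

# Leave-one-out (n_T = 1): one test subset per data point
leave_one_out_subsets = n

print(all_subsets, disjoint_subsets, leave_one_out_subsets)  # 593775 5 30
```

Each test subset costs one model fit per candidate model, so the gap between 593,775 and 5 fits is what makes disjoint subsets attractive.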
Conceptual Example of Sampling Without Replacement: Cross-Validation with 3 Disjoint Test Subsets
Typical Steps for Cross-Validation • Step 0 (initialization): Determine size of test subsets and candidate model. Let $i$ be counter for test subset being used. • Step 1 (estimation): For the $i$th test subset, let the remaining data be the $i$th training subset. Estimate $\theta$ from this training subset. • Step 2 (error calculation): Based on estimate for $\theta$ from Step 1 ($i$th training subset), calculate MSE (or other measure) with data in $i$th test subset. • Step 3 (new training and test subsets): Update $i$ to $i + 1$ and return to Step 1. Form mean of MSE when all test subsets have been evaluated. • Step 4 (new model): Repeat Steps 1 to 3 for next model. Choose model with lowest mean MSE as best.
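The steps above can be sketched in Python for polynomial candidate models with disjoint test subsets. This is an illustration under stated assumptions, not the ISSO implementation: the function name `cross_validate`, the noisy-sine data, the noise level, the input range, and the seeds are all hypothetical choices made here.

```python
import numpy as np
from numpy.polynomial import Polynomial

def cross_validate(x, z, degree, n_subsets, rng):
    """Mean test-subset MSE of a polynomial model of the given degree."""
    # Step 0: shuffle the indices and split them into disjoint test subsets
    test_subsets = np.array_split(rng.permutation(x.size), n_subsets)
    mses = []
    for test_idx in test_subsets:
        train = np.ones(x.size, dtype=bool)
        train[test_idx] = False
        # Step 1: estimate the model from the i-th training subset
        model = Polynomial.fit(x[train], z[train], degree)
        # Step 2: MSE on the i-th test subset
        mses.append(np.mean((z[test_idx] - model(x[test_idx])) ** 2))
    # Step 3: mean MSE once all test subsets have been evaluated
    return float(np.mean(mses))

# Assumed data: noisy sine, 30 points, noise standard deviation 0.1
data_rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 30)
z = np.sin(x) + 0.1 * data_rng.standard_normal(x.size)

# Step 4: repeat for each candidate model; the same seed gives each
# model the same partition, so the comparison is like for like
cv_mse = {d: cross_validate(x, z, d, 5, np.random.default_rng(1))
          for d in (1, 3, 10)}
best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_degree)
```

`Polynomial.fit` rescales the inputs internally, which keeps the high-order fit numerically stable. Which degree is selected depends on the assumed noise level and partition.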
Numerical Illustration of Cross-Validation (Example 13.4 in ISSO) • Consider true system corresponding to a sine function of the input with additive normally distributed noise • Consider three candidate models • Linear (affine) model • 3rd-order polynomial • 10th-order polynomial • Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement) • Based on RMS error (equiv. to MSE) over test subsets, 3rd-order polynomial is preferred • See following plot
Numerical Illustration (cont’d): Relative Fits for 3 Models with Low-Noise Observations • [Plot: sine wave (process mean) together with linear, 3rd-order, and 10th-order fits]
Fisher Information Matrix • Fundamental role of data analysis is to extract information from data • Parameter estimation for models is central to process of extracting information • The Fisher information matrix plays a central role in parameter estimation for measuring information: it summarizes the amount of information in the data relative to the parameters being estimated
Problem Setting • Consider the classical statistical problem of estimating parameter vector $\theta$ from $n$ data vectors $z_1, z_2, \ldots, z_n$ • Suppose we have a probability density and/or mass function associated with the data • The parameters $\theta$ appear in the probability function and affect the nature of the distribution • Example: $z_i \sim N(\text{mean}(\theta), \text{covariance}(\theta))$ for all $i$ • Let $l(\theta \mid z_1, z_2, \ldots, z_n)$ represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of $\theta$ conditioned on the data
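Information Matrix: Definition • Recall likelihood function $l(\theta \mid z_1, z_2, \ldots, z_n)$ • Information matrix defined as $F_n(\theta) \equiv E\left[\frac{\partial \log l}{\partial \theta} \cdot \frac{\partial \log l}{\partial \theta^T}\right]$, where expectation is w.r.t. $z_1, z_2, \ldots, z_n$ • Equivalent form based on Hessian matrix: $F_n(\theta) = -E\left[\frac{\partial^2 \log l}{\partial \theta \, \partial \theta^T}\right]$ • $F_n(\theta)$ is positive semidefinite of dimension $p \times p$ ($p = \dim(\theta)$)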
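As a numerical check of the two forms of the definition, consider a scalar case not taken from the slides: i.i.d. data $z_i \sim N(\mu, \sigma^2)$ with known $\sigma$ and $\theta = \mu$. The log-likelihood gradient is $\sum_i (z_i - \mu)/\sigma^2$ and the Hessian is the constant $-n/\sigma^2$, so both forms should give $F_n = n/\sigma^2$. The parameter values, seed, and replication count below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 2.0, 1.5, 10   # assumed true mean, known std. dev., sample size
reps = 20000                  # assumed number of Monte Carlo data sets

# Simulate many data sets and evaluate the gradient of log l at the true mu
Z = rng.normal(mu, sigma, size=(reps, n))
score = (Z - mu).sum(axis=1) / sigma**2

# Gradient (outer-product) form: E[score^2], estimated by averaging
fisher_gradient_form = float(np.mean(score**2))

# Hessian form: -E[d^2 log l / d mu^2]; the Hessian is the constant -n/sigma^2
fisher_hessian_form = n / sigma**2

print(fisher_gradient_form, fisher_hessian_form)
```

The Monte Carlo average of the squared gradient should land close to $n/\sigma^2$, up to sampling error.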
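Information Matrix: Two Key Properties • Connection of $F_n(\theta)$ and uncertainty in estimate $\hat{\theta}_n$ is rigorously specified via two famous results ($\theta^*$ = true value of $\theta$): 1. Asymptotic normality: $\sqrt{n}\,(\hat{\theta}_n - \theta^*) \xrightarrow{\text{dist}} N(0, \bar{F}^{-1})$, where $\bar{F} \equiv \lim_{n \to \infty} F_n(\theta^*)/n$ 2. Cramér-Rao inequality: $\operatorname{cov}(\hat{\theta}_n) \geq F_n(\theta^*)^{-1}$ for all $n$ (for unbiased $\hat{\theta}_n$) • Above two results indicate: greater variability of $\hat{\theta}_n$ $\Rightarrow$ “smaller” $F_n(\theta)$ (and vice versa)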
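Both results can be illustrated in a simple scalar case that is not from the slides: estimating the mean $\mu$ of i.i.d. $N(\mu, \sigma^2)$ data with known $\sigma$ by the sample mean. There $F_n = n/\sigma^2$, the sample mean is unbiased, and its variance $\sigma^2/n$ equals the Cramér-Rao bound $F_n^{-1}$. The parameter values, seed, and replication count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 2.0, 1.5, 25   # assumed true mean, known std. dev., sample size
reps = 20000                  # assumed number of Monte Carlo data sets

fisher_n = n / sigma**2       # F_n for this model
cr_bound = 1.0 / fisher_n     # Cramér-Rao lower bound on the variance

# Empirical variance of the sample mean across simulated data sets
estimates = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
var_estimate = float(estimates.var())

# Asymptotic-normality scaling: var of sqrt(n)*(estimate - mu) vs 1 / (F_n / n)
scaled_var = n * var_estimate
f_bar_inverse = n / fisher_n  # equals sigma^2 here

print(var_estimate, cr_bound, scaled_var, f_bar_inverse)
```

A larger $F_n$ (more data or less noise) gives a smaller bound and a less variable estimate, which is the "vice versa" in the slide.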
Selected Applications • Information matrix is measure of performance for several applications. Four uses are: 1. Confidence regions for parameter estimation • Uses asymptotic normality and/or Cramér-Rao inequality 2. Prediction bounds for mathematical models 3. Basis for “D-optimal” criterion for experimental design • Information matrix serves as measure of how well $\theta$ can be estimated for a given set of inputs 4. Basis for “noninformative prior” in Bayesian analysis • Sometimes used for “objective” Bayesian inference