
Modeling Considerations and Statistical Information in Stochastic Optimization

This chapter discusses the bias-variance tradeoff, model selection methods, and the concept of cross-validation in stochastic search and optimization. It also explores the Fisher information matrix and the decomposition of mean squared model error.

christinah


Presentation Transcript


  1. Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall • CHAPTER 13: MODELING CONSIDERATIONS AND STATISTICAL INFORMATION • “All models are wrong; some are useful.” (George E. P. Box) • Organization of chapter in ISSO: • Bias-variance tradeoff • Model selection: cross-validation • Fisher information matrix: definition, examples, and efficient computation

  2. Model Definition and MSE • Assume model $z = h(\theta, x) + v$, where $z$ is output, $h(\cdot)$ is some function, $x$ is input, $v$ is noise, and $\theta$ is vector of model parameters • $h(\cdot)$ may represent simulation model • $h(\cdot)$ may represent “metamodel” (response surface) of existing simulation • A fundamental goal is to take $n$ data points and estimate $\theta$, forming $\hat\theta_n$ • A common measure of effectiveness for estimate is mean of squared model error (MSE) at fixed $x$: $E\{[h(\hat\theta_n, x) - E(z|x)]^2 \mid x\}$

  3. Bias-Variance Decomposition • The MSE of the model at a fixed $x$ can be decomposed as: $E\{[h(\hat\theta_n,x) - E(z|x)]^2 \mid x\} = E\{[h(\hat\theta_n,x) - E(h(\hat\theta_n,x))]^2 \mid x\} + [E(h(\hat\theta_n,x)) - E(z|x)]^2$ = variance at $x$ + (bias at $x$)$^2$, where expectations are computed w.r.t. $\hat\theta_n$ • Above implies: Model too simple ⇒ High bias/low variance; Model too complex ⇒ Low bias/high variance
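
The decomposition on this slide follows from adding and subtracting the mean prediction. The short derivation below is a sketch of that step; the shorthand symbols $\hat h$, $\bar h$, and $m$ are introduced here for illustration and are not notation from the slides.

```latex
% Shorthand (assumed for this sketch):
%   \hat h = h(\hat\theta_n, x),  \bar h = E(\hat h \mid x),  m = E(z \mid x)
\begin{align*}
E\{(\hat h - m)^2 \mid x\}
  &= E\{[(\hat h - \bar h) + (\bar h - m)]^2 \mid x\} \\
  &= E\{(\hat h - \bar h)^2 \mid x\}
     + 2(\bar h - m)\,E\{\hat h - \bar h \mid x\}
     + (\bar h - m)^2 \\
  &= \underbrace{E\{(\hat h - \bar h)^2 \mid x\}}_{\text{variance at } x}
     + \underbrace{(\bar h - m)^2}_{(\text{bias at } x)^2}
\end{align*}
% The cross term vanishes because \bar h - m is nonrandom given x
% and E\{\hat h - \bar h \mid x\} = 0 by the definition of \bar h.
```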

  4. Unbiased Estimator May Not be Best (Example 13.1 from ISSO) • Unbiased estimator is such that $E[h(\hat\theta_n, x) \mid x] = E(z|x)$ (i.e., mean of prediction is same as mean of data $z$) • Example: Let $\bar z$ denote sample mean of scalar i.i.d. data as estimator of true mean $\mu$ ($h(\theta, x) = \mu$ in notation above) • Alternative biased estimator of $\mu$ is $r\bar z$, where $0 < r < 1$ • MSE of biased and unbiased estimators generally satisfy $E[(r\bar z - \mu)^2] < E[(\bar z - \mu)^2]$ • Biased estimate better in MSE sense • However, optimal value of $r$ requires knowledge of unknown (true) $\mu$
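
The MSE comparison in this example can be checked in closed form. For i.i.d. data with mean mu and variance sigma2, the scaled sample mean has variance r^2 * sigma2 / n and bias (r - 1) * mu. The sketch below evaluates that expression; the function names and the numerical values of `mu`, `sigma2`, and `n` are hypothetical choices for illustration, not values from ISSO.

```python
def mse_scaled_mean(r, mu, sigma2, n):
    """MSE of r * zbar as an estimator of mu: variance plus squared bias."""
    variance = r**2 * sigma2 / n
    bias = (r - 1.0) * mu
    return variance + bias**2


def optimal_r(mu, sigma2, n):
    """Minimizer of mse_scaled_mean over r (set the derivative in r to zero)."""
    return mu**2 / (mu**2 + sigma2 / n)


# Hypothetical settings, chosen only for illustration.
mu, sigma2, n = 1.0, 4.0, 10

r_star = optimal_r(mu, sigma2, n)
mse_unbiased = mse_scaled_mean(1.0, mu, sigma2, n)   # r = 1 is the sample mean
mse_biased = mse_scaled_mean(r_star, mu, sigma2, n)

print(f"r* = {r_star:.4f}")
print(f"MSE unbiased = {mse_unbiased:.4f}, MSE biased = {mse_biased:.4f}")
```

Note that `optimal_r` depends on the true `mu` (and on `sigma2`), which is the practical obstacle the last bullet of the slide points to.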

  5. Bias-Variance Tradeoff in Model Selection in Simple Problem

  6. Example 13.2 in ISSO: Bias-Variance Tradeoff • Suppose true process produces output according to $z = f(x)$ + noise, where $f(x) = (x + x^2)^{1.1}$ • Compare linear, quadratic, and cubic approximations • Table below gives average bias, variance, and MSE • Overall pattern of decreasing bias and increasing variance; optimal tradeoff is quadratic model
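
A Monte Carlo sketch of this kind of comparison is given below. It fits linear, quadratic, and cubic polynomials to repeated noisy samples of $f(x) = (x + x^2)^{1.1}$ and averages squared bias, variance, and MSE over the input points. The input grid, noise level, number of replications, and the helper name `bias_variance` are assumptions made here for illustration; they are not the settings of Example 13.2, so the numbers will not match the table in ISSO.

```python
import numpy as np


def f(x):
    """True process mean from the slide: f(x) = (x + x^2)^1.1 (x >= 0 here)."""
    return (x + x**2) ** 1.1


def bias_variance(degree, x, sigma=1.0, n_rep=2000, seed=0):
    """Average squared bias, variance, and MSE of a polynomial fit over x."""
    rng = np.random.default_rng(seed)
    truth = f(x)
    preds = np.empty((n_rep, x.size))
    for k in range(n_rep):
        z = truth + sigma * rng.standard_normal(x.size)  # noisy outputs
        coef = np.polyfit(x, z, degree)                  # least-squares fit
        preds[k] = np.polyval(coef, x)
    bias2 = (preds.mean(axis=0) - truth) ** 2
    var = preds.var(axis=0)                  # population variance (ddof=0)
    mse = ((preds - truth) ** 2).mean(axis=0)
    return bias2.mean(), var.mean(), mse.mean()


# Hypothetical input grid; not the design used in ISSO.
x = np.linspace(0.0, 2.0, 10)
results = {d: bias_variance(d, x) for d in (1, 2, 3)}

for d, (b2, v, m) in results.items():
    print(f"degree {d}: bias^2 = {b2:.4f}, variance = {v:.4f}, MSE = {m:.4f}")
```

Because the empirical variance uses `ddof=0`, the printed MSE equals bias^2 plus variance up to rounding, mirroring the decomposition on the earlier slide.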

  7. Model Selection • The bias-variance tradeoff provides conceptual framework for determining a good model • Bias-variance tradeoff not directly useful • Need a practical method for optimizing bias-variance tradeoff • Practical aim is to pick a model that minimizes a criterion: $f_1$(fitting error from given data) + $f_2$(model complexity), where $f_1$ and $f_2$ are increasing functions • All methods based on a tradeoff between fitting error (high variance) and model complexity (low bias) • Criterion above may/may not be explicitly used in given method

  8. Methods for Model Selection • Among many popular methods are: • Akaike Information Criterion (AIC) (Akaike, 1974) • Popular in time series analysis • Bayesian selection (Akaike, 1977) • Bootstrap-based selection (Efron and Tibshirani, 1997) • Cross-validation (Stone, 1974) • Minimum description length (Rissanen, 1978) • V-C dimension (Vapnik and Chervonenkis, 1971) • Popular in computer science • Cross-validation appears to be most popular model fitting method

  9. Cross-Validation • Cross-validation is simple, general method for comparing candidate models • Other specialized methods may work better in specific problems • Cross-validation uses the training set of data • Method is based on iteratively partitioning the full set of training data into training and test subsets • For each partition, estimate model from training subset and evaluate model on test subset • Number of training (or test) subsets = number of model fits required • Select model that performs best over all test subsets

  10. Choice of Training and Test Subsets • Let $n$ denote total size of data set, $n_T$ denote size of test subset, $n_T < n$ • Common strategy is leave-one-out: $n_T = 1$ • Implies $n$ test subsets during cross-validation process • Often better to choose $n_T > 1$ • Sometimes more efficient (sampling w/o replacement) • Sometimes more accurate model selection • If $n_T > 1$, sampling may be with or without replacement • “With replacement” indicates that there are “$n$ choose $n_T$” test subsets, written $\binom{n}{n_T}$ • With replacement may be prohibitive in practice: e.g., $n = 30$, $n_T = 6$ implies nearly 600K model fits! • Sampling without replacement reduces number of test subsets to $n/n_T$ (disjoint test subsets)
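
The subset counts quoted on this slide are easy to verify. The short sketch below uses the slide's own numbers (n = 30, test subsets of size 6); the variable names are chosen here for illustration.

```python
import math

n, n_T = 30, 6  # values from the slide

# All possible test subsets of size n_T ("with replacement" in the slide's terms).
n_subsets_all = math.comb(n, n_T)

# Disjoint test subsets ("without replacement"): n / n_T of them.
n_subsets_disjoint = n // n_T

print(n_subsets_all)       # 593775, i.e. nearly 600K model fits
print(n_subsets_disjoint)  # 5
```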

  11. Conceptual Example of Sampling Without Replacement: Cross-Validation with 3 Disjoint Test Subsets

  12. Typical Steps for Cross-Validation • Step 0 (initialization): Determine size of test subsets and candidate model. Let $i$ be counter for test subset being used. • Step 1 (estimation): For the $i$th test subset, let the remaining data be the $i$th training subset. Estimate $\theta$ from this training subset. • Step 2 (error calculation): Based on estimate for $\theta$ from Step 1 ($i$th training subset), calculate MSE (or other measure) with data in $i$th test subset. • Step 3 (new training and test subsets): Update $i$ to $i + 1$ and return to Step 1. Form mean of MSE when all test subsets have been evaluated. • Step 4 (new model): Repeat Steps 1 to 3 for next model. Choose model with lowest mean MSE as best.
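
The steps above can be sketched for polynomial candidate models with disjoint test subsets. This is a minimal illustration under assumptions made here: the function name `cross_validate`, the random assignment of points to test subsets, and the synthetic sine data (30 points, 5 test subsets, noise level 0.1) are all hypothetical and are not taken from ISSO, so the selected model need not match the one reported in the slides.

```python
import numpy as np


def cross_validate(x, z, degree, n_folds, seed=0):
    """Mean test-subset MSE of a degree-`degree` polynomial (Steps 1 to 3)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(x.size)
    folds = np.array_split(idx, n_folds)          # Step 0: disjoint test subsets
    fold_mse = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)   # remaining data = training subset
        coef = np.polyfit(x[train_idx], z[train_idx], degree)   # Step 1: estimate
        resid = z[test_idx] - np.polyval(coef, x[test_idx])
        fold_mse.append(np.mean(resid**2))        # Step 2: MSE on test subset
    return float(np.mean(fold_mse))               # Step 3: mean over test subsets


# Hypothetical data: one period of a sine wave with additive normal noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
z = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# Step 4: repeat for each candidate model and keep the lowest mean MSE.
scores = {d: cross_validate(x, z, d, n_folds=5) for d in (1, 3, 10)}
best_degree = min(scores, key=scores.get)

for d, s in scores.items():
    print(f"degree {d}: mean test MSE = {s:.4f}")
print("selected degree:", best_degree)
```

The same seed is used for every candidate model so that all models are scored on identical test subsets, which keeps the comparison in Step 4 like-for-like.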

  13. Numerical Illustration of Cross-Validation (Example 13.4 in ISSO) • Consider true system corresponding to a sine function of the input with additive normally distributed noise • Consider three candidate models • Linear (affine) model • 3rd-order polynomial • 10th-order polynomial • Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement) • Based on RMS error (equiv. to MSE) over test subsets, 3rd-order polynomial is preferred • See following plot

  14. Numerical Illustration (cont’d): Relative Fits for 3 Models with Low-Noise Observations • [Plot; curves shown: sine wave (process mean), linear, 3rd-order, 10th-order]

  15. Fisher Information Matrix • Fundamental role of data analysis is to extract information from data • Parameter estimation for models is central to process of extracting information • The Fisher information matrix plays a central role in parameter estimation for measuring information • Information matrix summarizes the amount of information in the data relative to the parameters being estimated

  16. Problem Setting • Consider the classical statistical problem of estimating parameter vector $\theta$ from $n$ data vectors $z_1, z_2, \ldots, z_n$ • Suppose have a probability density and/or mass function associated with the data • The parameters $\theta$ appear in the probability function and affect the nature of the distribution • Example: $z_i \sim N(\text{mean}(\theta), \text{covariance}(\theta))$ for all $i$ • Let $l(\theta \mid z_1, z_2, \ldots, z_n)$ represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of $\theta$ conditioned on the data

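  17. Information Matrix: Definition • Recall likelihood function $l(\theta \mid z_1, z_2, \ldots, z_n)$ • Information matrix defined as $F_n(\theta) \equiv E\left(\frac{\partial \log l}{\partial \theta} \cdot \frac{\partial \log l}{\partial \theta^T}\right)$, where expectation is w.r.t. $z_1, z_2, \ldots, z_n$ • Equivalent form based on Hessian matrix: $F_n(\theta) = -E\left(\frac{\partial^2 \log l}{\partial \theta\, \partial \theta^T}\right)$ • $F_n(\theta)$ is positive semidefinite of dimension $p \times p$ ($p = \dim(\theta)$)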
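
The two forms of the information matrix (expected outer product of the log-likelihood gradient, and negative expected Hessian) can be compared numerically. The sketch below does this by Monte Carlo for i.i.d. scalar normal data with parameters (mean, variance), a case where the exact answer is n * diag(1/s, 1/(2 s^2)) with s the variance. The example distribution, the parameter values, and the function names `score` and `hessian` are assumptions made here for illustration.

```python
import numpy as np


def score(z, mu, s):
    """Gradient of the normal log-likelihood w.r.t. (mu, s), with s = variance."""
    d = z - mu
    return np.array([d.sum() / s,
                     -z.size / (2.0 * s) + (d**2).sum() / (2.0 * s**2)])


def hessian(z, mu, s):
    """Hessian of the normal log-likelihood w.r.t. (mu, s)."""
    d = z - mu
    n = z.size
    h_mm = -n / s
    h_ms = -d.sum() / s**2
    h_ss = n / (2.0 * s**2) - (d**2).sum() / s**3
    return np.array([[h_mm, h_ms], [h_ms, h_ss]])


# Hypothetical true parameter values and sample size.
mu, s, n = 0.5, 2.0, 20
n_rep = 20000
rng = np.random.default_rng(0)

F_outer = np.zeros((2, 2))  # Monte Carlo estimate of E[score * score^T]
F_hess = np.zeros((2, 2))   # Monte Carlo estimate of -E[Hessian]
for _ in range(n_rep):
    z = mu + np.sqrt(s) * rng.standard_normal(n)
    g = score(z, mu, s)
    F_outer += np.outer(g, g) / n_rep
    F_hess -= hessian(z, mu, s) / n_rep

F_exact = n * np.diag([1.0 / s, 1.0 / (2.0 * s**2)])

print("outer-product form:\n", F_outer.round(2))
print("Hessian form:\n", F_hess.round(2))
print("exact:\n", F_exact)
```

In this example the Hessian-based average is noticeably less noisy than the outer-product average for the same number of replications, though both estimate the same matrix.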

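  18. Information Matrix: Two Key Properties • Connection of $F_n(\theta)$ and uncertainty in estimate $\hat\theta_n$ is rigorously specified via two famous results ($\theta^*$ = true value of $\theta$): 1. Asymptotic normality: $\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{\text{dist}} N(0, \bar F^{-1})$, where $\bar F \equiv \lim_{n\to\infty} F_n(\theta^*)/n$ 2. Cramér-Rao inequality: $\operatorname{cov}(\hat\theta_n) \ge F_n(\theta^*)^{-1}$ for unbiased $\hat\theta_n$ • Above two results indicate: greater variability of $\hat\theta_n$ ⇒ “smaller” $F_n(\theta)$ (and vice versa)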
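
The Cramér-Rao inequality can be illustrated with a simple simulation. For i.i.d. normal data with known variance sigma2, the information for the mean is n / sigma2, so the bound on the variance of an unbiased estimator is sigma2 / n. The sketch below compares the simulated variance of two unbiased estimators of the mean (sample mean and sample median) against that bound; the choice of estimators, the parameter values, and the variable names are assumptions made here for illustration.

```python
import numpy as np

# Hypothetical settings.
mu, sigma2, n, n_rep = 1.0, 4.0, 25, 20000
rng = np.random.default_rng(0)

# Each row is one simulated data set of size n.
Z = mu + np.sqrt(sigma2) * rng.standard_normal((n_rep, n))

crlb = sigma2 / n                         # inverse of F_n = n / sigma2
var_mean = Z.mean(axis=1).var()           # sample mean: attains the bound
var_median = np.median(Z, axis=1).var()   # sample median: above the bound

print(f"Cramer-Rao bound   = {crlb:.4f}")
print(f"var(sample mean)   = {var_mean:.4f}")
print(f"var(sample median) = {var_median:.4f}")
```

A larger information value (more data or less noise) lowers the bound, which is the "smaller information, greater variability" relationship stated on the slide.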

  19. Selected Applications • Information matrix is measure of performance for several applications. Four uses are: 1. Confidence regions for parameter estimation • Uses asymptotic normality and/or Cramér-Rao inequality 2. Prediction bounds for mathematical models 3. Basis for “D-optimal” criterion for experimental design • Information matrix serves as measure of how well $\theta$ can be estimated for a given set of inputs 4. Basis for “noninformative prior” in Bayesian analysis • Sometimes used for “objective” Bayesian inference
