Additional Topics in Prediction Methodology
Introduction
• The predictive distribution of the random variable Y₀ is meant to capture all the information about Y₀ that is contained in the training data Yⁿ.
• It does not completely specify Y₀, but it does provide a probability distribution over the more likely and less likely values of Y₀.
• E{Y₀ | Yⁿ} is the best MSPE (minimum mean squared prediction error) predictor of Y₀, as sketched below.
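As a reminder of why the conditional mean is MSPE-optimal (the standard decomposition argument, not specific to this chapter): for any predictor Ŷ₀(Yⁿ),

\[
E\big[(Y_0 - \hat{Y}_0(Y^n))^2\big]
= E\big[(Y_0 - E\{Y_0 \mid Y^n\})^2\big]
+ E\big[(E\{Y_0 \mid Y^n\} - \hat{Y}_0(Y^n))^2\big],
\]

so the MSPE is minimized by choosing \(\hat{Y}_0 = E\{Y_0 \mid Y^n\}\).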
Hierarchical models have two stages
• x ∈ X ⊆ R^d
• f₀ = f(x₀): known p×1 vector of regressors at the prediction site x₀
• F = (f_j(x_i)): known n×p matrix of regressors for the training data
• β: unknown p×1 vector of regression coefficients
• R = (R(x_i − x_j)): known n×n matrix of correlations among the training data Yⁿ
• r₀ = (R(x_i − x₀)): known n×1 vector of correlations of Y₀ with Yⁿ
A small sketch of these quantities appears below.
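A minimal sketch of these quantities in Python, assuming a Gaussian (squared-exponential) correlation function with a hypothetical scale parameter theta and a constant-mean regression f(x) = 1; the inputs and names are illustrative, not the book's:

```python
import numpy as np

def gauss_corr(h, theta=1.0):
    """Gaussian correlation R(h) = exp(-theta * ||h||^2) (an illustrative choice)."""
    return np.exp(-theta * np.sum(np.atleast_2d(h) ** 2, axis=-1))

# hypothetical training inputs x_1, ..., x_n in R^d and prediction site x_0
X = np.array([[0.1], [0.4], [0.7], [0.9]])     # n = 4 sites, d = 1
x0 = np.array([0.5])

F = np.ones((len(X), 1))                        # n x p regression matrix (p = 1, constant mean)
f0 = np.ones(1)                                 # p x 1 vector of regressors at x0
R = gauss_corr(X[:, None, :] - X[None, :, :])   # n x n correlation matrix of Y^n
r0 = gauss_corr(X - x0)                         # n x 1 correlations of Y0 with Y^n
```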
Interesting features of (a) and (b)
• The non-informative prior is the limit of the normal prior as the prior variance tends to infinity.
• While the non-informative prior is not a proper distribution, the corresponding predictive distribution is proper.
• The same conditioning argument can be applied to derive the posterior mean under either the non-informative prior or the normal prior.
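To make the limiting statement concrete (standard normal–normal conjugacy, written here with Σ = σ_z²R for the covariance of Yⁿ and a hypothetical normal prior β ~ N(b₀, τ²V₀)):

\[
E\{\beta \mid Y^n\}
= \big(F^{\top}\Sigma^{-1}F + (\tau^2 V_0)^{-1}\big)^{-1}
  \big(F^{\top}\Sigma^{-1}Y^n + (\tau^2 V_0)^{-1}b_0\big)
\;\longrightarrow\;
\big(F^{\top}\Sigma^{-1}F\big)^{-1}F^{\top}\Sigma^{-1}Y^n
\quad \text{as } \tau^2 \to \infty,
\]

the generalized least squares estimator β̂ that the non-informative prior produces directly.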
The mean and variance of the predictive distribution (mean)
• μ_{0|n}(x₀) and σ²_{0|n}(x₀) depend on x₀ only through the regression vector f₀ and the correlation vector r₀.
• μ_{0|n}(x₀) is a linear unbiased predictor of Y(x₀).
• The continuity and other smoothness properties of μ_{0|n}(x₀) are inherited from the correlation function R(·) and the regressors {f_j(·)}, j = 1, …, p.
• μ_{0|n}(x₀) depends on the variance parameters σ_z² and τ² only through their ratio.
• μ_{0|n}(x₀) interpolates the training data: when x₀ = x_i, f₀ = f(x_i) and r₀ᵀR⁻¹ = e_iᵀ, the ith unit vector, so the predictor returns the observed value. A sketch checking this property follows.
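A sketch of the posterior mean under the non-informative prior (the BLUP form), continuing the hypothetical setup above, with the interpolation property checked at a training site:

```python
# continues the setup above: F, f0, R, r0, plus hypothetical observations y
y = np.array([0.3, 1.1, 0.8, 0.2])

Rinv = np.linalg.inv(R)
beta_hat = np.linalg.solve(F.T @ Rinv @ F, F.T @ Rinv @ y)   # GLS estimate of beta

def mu_post(f0, r0):
    """Posterior mean mu_{0|n}(x0) = f0' beta_hat + r0' R^{-1} (y - F beta_hat)."""
    return f0 @ beta_hat + r0 @ Rinv @ (y - F @ beta_hat)

# interpolation check: at x0 = x_i, r0' R^{-1} = e_i', so the predictor returns y_i
r0_train = gauss_corr(X - X[2])       # correlations when x0 equals the 3rd training site
assert np.isclose(mu_post(f0, r0_train), y[2])
```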
The mean and variance of the predictive distribution (variance)
• MSPE(μ_{0|n}(x₀)) = σ²_{0|n}(x₀).
• The variance of the posterior of Y(x₀) given Yⁿ should be 0 whenever x₀ = x_i, and indeed σ²_{0|n}(x_i) = 0 at every training point.
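For reference, under the non-informative prior on β the posterior variance takes the familiar universal-kriging form (a standard expression consistent with the definitions above, stated as an aid rather than as the source's exact display):

\[
\sigma^2_{0|n}(x_0) = \sigma_z^2\Big[1 - r_0^{\top}R^{-1}r_0
+ \big(f_0 - F^{\top}R^{-1}r_0\big)^{\top}\big(F^{\top}R^{-1}F\big)^{-1}\big(f_0 - F^{\top}R^{-1}r_0\big)\Big].
\]

At x₀ = x_i, r₀ᵀR⁻¹ = e_iᵀ gives r₀ᵀR⁻¹r₀ = 1 and f₀ = Fᵀe_i, so both correction terms vanish and σ²_{0|n}(x_i) = 0.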
Predictive distributions when R and r₀ are known
• The posterior is a location-shifted and scaled univariate t distribution whose degrees of freedom are enhanced when there is informative prior information for either β or σ_z².
Degrees of freedom
• Base value for the degrees of freedom: ν = n − p.
• p additional degrees of freedom when the prior for β is informative.
• ν₀ additional degrees of freedom when the prior for σ_z² is informative.
Location shift
• The same centering value as in Theorem 4.1.1 (known σ_z²).
• The non-informative prior gives the BLUP.
Scale factor σ²_ν(x₀) (compare (4.1.15) with (4.1.6))
• An estimate of the scale factor σ²_{0|n}(x₀).
• Q²_ν/ν estimates σ_z².
• Q²_ν combines information about σ_z² from the conditional distribution of Yⁿ given σ_z² with information from the prior of σ_z².
• σ²_ν(x_i) = 0 whenever x_i is any of the training data points.
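Putting the three ingredients together, the predictive distribution has the generic shifted-and-scaled t form (schematic; the specific ν and Q²_ν depend on which priors are informative):

\[
Y(x_0) \mid Y^n \;\sim\; \mu_{0|n}(x_0) + \sigma_\nu(x_0)\, T_\nu,
\]

where T_ν has a standard t distribution with ν degrees of freedom, μ_{0|n}(x₀) is the location shift, and σ²_ν(x₀) is the scale factor, with Q²_ν/ν playing the role of the σ_z² estimate inside it.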
Predictive distributions when correlation parameters are unknown
• What if the correlations among the observations are unknown (R and r₀ are unknown)?
• Assume Y(·) has a Gaussian prior with correlation function R(·|ψ), where ψ is an unknown vector of parameters.
• Two issues:
• the standard error of the plug-in predictor μ_{0|n}(x₀|ψ̂), obtained by substituting an estimate ψ̂ from MLE or REML (see the sketch below);
• a Bayesian approach to the uncertainty in ψ, which models ψ by a prior distribution.
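A minimal sketch of the plug-in route, maximizing the concentrated Gaussian log-likelihood over the hypothetical scale parameter theta from the earlier example (β and σ_z² profiled out; all names continue the illustrative setup above):

```python
from scipy.optimize import minimize_scalar

def neg_profile_loglik(theta):
    """Negative concentrated log-likelihood of theta (beta, sigma_z^2 profiled out)."""
    Rt = gauss_corr(X[:, None, :] - X[None, :, :], theta)
    Rt_inv = np.linalg.inv(Rt)
    b = np.linalg.solve(F.T @ Rt_inv @ F, F.T @ Rt_inv @ y)
    resid = y - F @ b
    sigma2 = resid @ Rt_inv @ resid / len(y)     # profiled-out process variance
    _, logdet = np.linalg.slogdet(Rt)
    return len(y) * np.log(sigma2) + logdet

theta_hat = minimize_scalar(neg_profile_loglik, bounds=(1e-3, 1e3),
                            method="bounded").x
# plug-in prediction: rebuild R and r0 at theta_hat, then reuse mu_post as before
```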
Prediction for multiple response models
• Several outputs may be available from a computer experiment:
• several codes are available for computing the same response (e.g., a fast and a slow code)
• competing responses
• Several stochastic models for the joint response
• These models are used to describe the optimal predictor for one of the several computed responses.
Modeling multiple outputs
• Z_i(·): marginally mean-zero stationary Gaussian stochastic processes with unknown variance σ_i² and correlation function R_i.
• Stationarity of Z_i(·) implies that the correlation between Z_i(x₁) and Z_i(x₂) depends only on x₁ − x₂.
• Assume Cov(Z_i(x₁), Z_j(x₂)) = σ_i σ_j R_ij(x₁ − x₂), where R_ij(·) is the cross-correlation function of Z_i(·) and Z_j(·).
• Linear model: the global mean of the Y_i process, where the f_i(·) are known regression functions and the β_i are unknown regression parameters.
Selection of correlation and cross-correlation functions is complicated
• Reason: for any input sites x_l^i, the multivariate normally distributed random vector (Z₁(x₁¹), …)ᵀ must have a nonnegative definite covariance matrix.
• Solution: construct the Z_i(·) from a set of elementary processes (usually these processes are mutually independent), as sketched below.
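A sketch of why the elementary-process construction works (a generic linear-combination version, not necessarily the book's specific one): if the W_k(·) are mutually independent, mean-zero processes with valid correlation functions ρ_k(·), and

\[
Z_i(x) = \sum_k a_{ik}\, W_k(x),
\qquad \text{then} \qquad
\operatorname{Cov}\big(Z_i(x_1), Z_j(x_2)\big) = \sum_k a_{ik}\, a_{jk}\, \rho_k(x_1 - x_2),
\]

which is nonnegative definite automatically, because it is the covariance of an actual joint process rather than a postulated one.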
Example by Kennedy and O'Hagan
• Y_i(x): prior for the ith code level (i = m is the top-level code). The autoregressive model:
• Y_i(x) = ρ_{i−1} Y_{i−1}(x) + δ_i(x), i = 2, …, m
• The output of each successively higher-level code i at x is the output of the less precise code i−1 at x plus a refinement δ_i(x).
• Cov(Y_i(x), Y_{i−1}(w) | Y_{i−1}(x)) = 0 for all w ≠ x
• No additional second-order knowledge of code i at x can be obtained from the lower-level code i−1 once the value of code i−1 at x is known (a Markov property on the hierarchy of codes).
• When there is no natural hierarchy of computer codes, something better is needed.
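Under the usual additional assumption that the refinement δ_i(·) is independent of Y_{i−1}(·), the autoregressive structure propagates covariances up the hierarchy:

\[
\operatorname{Cov}\big(Y_i(x_1), Y_i(x_2)\big)
= \rho_{i-1}^2\, \operatorname{Cov}\big(Y_{i-1}(x_1), Y_{i-1}(x_2)\big)
+ \operatorname{Cov}\big(\delta_i(x_1), \delta_i(x_2)\big),
\]

so specifying a valid covariance for level 1 and for each refinement yields a valid joint model for all code levels.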
A more reasonable model
• Each constraint function is associated with the objective function plus a refinement:
• Y_i(x) = ρ_i Y_1(x) + δ_i(x), i = 2, …, m+1
• Ver Hoef and Barry
• form models in the environmental sciences
• that include an unknown smooth surface plus a random measurement error
• using moving averages over white noise processes (see the sketch below)
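The moving-average (process convolution) idea in schematic form, with hypothetical square-integrable kernels g_i and a common white noise process W(·):

\[
Z_i(x) = \int g_i(u - x)\, W(u)\, du
\quad\Longrightarrow\quad
\operatorname{Cov}\big(Z_i(x_1), Z_j(x_2)\big) = \int g_i(u - x_1)\, g_j(u - x_2)\, du,
\]

which is again nonnegative definite by construction, with the cross-covariances determined by the overlap of the kernels.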
Morris and Mitchell model
• Prior information about y(x) is specified by a Gaussian process Y(·).
• Prior information about the partial derivatives y^{(j)}(x) is obtained by considering the derivative processes of Y(·):
• Y_1(·) = Y(·), Y_2(·) = Y^{(1)}(·), …, Y_{1+m}(·) = Y^{(m)}(·), where Y^{(j)}(·) = ∂Y(·)/∂x_j.
• The natural prior for y^{(j)}(x) is the derivative process Y^{(j)}(·).
• The covariances between Y(x₁) and Y^{(j)}(x₂), and between Y^{(i)}(x₁) and Y^{(j)}(x₂), are obtained by differentiating the covariance function of Y(·), as recorded below.
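The covariance formulas the slide alludes to are the standard ones for mean-square differentiable stationary processes, stated here for Cov(Y(x₁), Y(x₂)) = σ_z² R(x₁ − x₂):

\[
\operatorname{Cov}\big(Y(x_1), Y^{(j)}(x_2)\big)
= -\,\sigma_z^2\, \frac{\partial R(h)}{\partial h_j}\bigg|_{h = x_1 - x_2},
\qquad
\operatorname{Cov}\big(Y^{(i)}(x_1), Y^{(j)}(x_2)\big)
= -\,\sigma_z^2\, \frac{\partial^2 R(h)}{\partial h_i\, \partial h_j}\bigg|_{h = x_1 - x_2}.
\]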
Optimal predictors for multiple outputs
• The best MSPE predictor of Y₀ based on the training data is the conditional mean E{Y₀ | Y_1^{n_1}, …, Y_m^{n_m}},
• where Y₀ = Y₁(x₀), Y_i^{n_i} = (Y_i(x_1^i), …, Y_i(x_{n_i}^i))ᵀ, and y_i^{n_i} is the observed value of Y_i^{n_i} for i = 1, …, m.
The joint distribution of Y₀ and the training data is multivariate normal.
Conditional expectation
• In practice, the exact conditional expectation is unusable as stated (it requires knowledge of the marginal correlation functions, the joint cross-correlation functions, and the ratios of all the process variances).
• Empirical versions are of practical use:
• assume each of the correlation matrices R_i and cross-correlation matrices R_ij is known up to a vector of parameters,
• estimate those parameters by MLE or REML; a compact sketch follows.
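A compact sketch of the empirical version, assuming the joint covariance of (Y₀, all training outputs) has already been assembled from parameter estimates ψ̂ into a cross-covariance vector c0 and a joint covariance matrix C (all names hypothetical); the formula is the usual multivariate-normal conditional mean:

```python
import numpy as np

def cond_mean(mu0, mu, c0, C, yobs):
    """Conditional mean of Y0 given jointly normal training data Y = yobs:
    E{Y0 | Y} = mu0 + c0' C^{-1} (yobs - mu)."""
    return mu0 + c0 @ np.linalg.solve(C, yobs - mu)

# yobs stacks all observed outputs, e.g. yobs = np.concatenate([y1_obs, y2_obs]);
# C is the corresponding joint covariance built from the estimated R_i and R_ij.
```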
Example 1
• The 14-point training data set has the feature that it allows learning over the entire input space: it is space-filling.
• Compare two models:
• the predictor of y(·) based on y(·) alone,
• the predictor of y(·) based on (y(·), y^{(1)}(·), y^{(2)}(·)).
• The second is both a visually better fit and has 24% smaller ERMSPE (empirical root mean squared prediction error).