Additional Topics in Prediction Methodology
Introduction
• The predictive distribution of the random variable Y₀ is meant to capture all the information about Y₀ that is contained in the training data Yⁿ.
• It does not completely specify Y₀, but it does provide a probability distribution over the more likely and less likely values of Y₀.
• E{Y₀ | Yⁿ} is the best MSPE (minimum mean squared prediction error) predictor of Y₀, as sketched below.
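As a reminder of why the conditional mean is MSPE-optimal (the standard decomposition argument, not specific to this chapter): for any predictor Ŷ₀(Yⁿ),

\[
E\big[(Y_0 - \hat{Y}_0(Y^n))^2\big]
= E\big[(Y_0 - E\{Y_0 \mid Y^n\})^2\big]
+ E\big[(E\{Y_0 \mid Y^n\} - \hat{Y}_0(Y^n))^2\big],
\]

so the MSPE is minimized by choosing \(\hat{Y}_0 = E\{Y_0 \mid Y^n\}\).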
Hierarchical models have two stages
• x ∈ X ⊆ R^d
• f₀ = f(x₀): known p×1 vector of regressors at the prediction site x₀
• F = (f_j(x_i)): known n×p matrix of regressors for the training data
• β: unknown p×1 vector of regression coefficients
• R = (R(x_i − x_j)): known n×n matrix of correlations among the training data Yⁿ
• r₀ = (R(x_i − x₀)): known n×1 vector of correlations of Y₀ with Yⁿ
A small sketch of these quantities appears below.
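A minimal sketch of these quantities in Python, assuming a Gaussian (squared-exponential) correlation function with a hypothetical scale parameter theta and a constant-mean regression f(x) = 1; the inputs and names are illustrative, not the book's:

```python
import numpy as np

def gauss_corr(h, theta=1.0):
    """Gaussian correlation R(h) = exp(-theta * ||h||^2) (an illustrative choice)."""
    return np.exp(-theta * np.sum(np.atleast_2d(h) ** 2, axis=-1))

# hypothetical training inputs x_1, ..., x_n in R^d and prediction site x_0
X = np.array([[0.1], [0.4], [0.7], [0.9]])     # n = 4 sites, d = 1
x0 = np.array([0.5])

F = np.ones((len(X), 1))                        # n x p regression matrix (p = 1, constant mean)
f0 = np.ones(1)                                 # p x 1 vector of regressors at x0
R = gauss_corr(X[:, None, :] - X[None, :, :])   # n x n correlation matrix of Y^n
r0 = gauss_corr(X - x0)                         # n x 1 correlations of Y0 with Y^n
```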
Interesting features of (a) and (b)
• The non-informative prior is the limit of the normal prior as the prior variance tends to infinity.
• While the non-informative prior is not a proper distribution, the corresponding predictive distribution is proper.
• The same conditioning argument can be applied to derive the posterior mean under either the non-informative prior or the normal prior.
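To make the limiting statement concrete (standard normal–normal conjugacy, written here with Σ = σ_z²R for the covariance of Yⁿ and a hypothetical normal prior β ~ N(b₀, τ²V₀)):

\[
E\{\beta \mid Y^n\}
= \big(F^{\top}\Sigma^{-1}F + (\tau^2 V_0)^{-1}\big)^{-1}
  \big(F^{\top}\Sigma^{-1}Y^n + (\tau^2 V_0)^{-1}b_0\big)
\;\longrightarrow\;
\big(F^{\top}\Sigma^{-1}F\big)^{-1}F^{\top}\Sigma^{-1}Y^n
\quad \text{as } \tau^2 \to \infty,
\]

the generalized least squares estimator β̂ that the non-informative prior produces directly.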
The mean and variance of the predictive distribution (mean)
• μ_{0|n}(x₀) and σ²_{0|n}(x₀) depend on x₀ only through the regression vector f₀ and the correlation vector r₀.
• μ_{0|n}(x₀) is a linear unbiased predictor of Y(x₀).
• The continuity and other smoothness properties of μ_{0|n}(x₀) are inherited from the correlation function R(·) and the regressors {f_j(·)}, j = 1, …, p.
• μ_{0|n}(x₀) depends on the variance parameters σ_z² and τ² only through their ratio.
• μ_{0|n}(x₀) interpolates the training data: when x₀ = x_i, f₀ = f(x_i) and r₀ᵀR⁻¹ = e_iᵀ, the ith unit vector, so the predictor returns the observed value. A sketch checking this property follows.
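A sketch of the posterior mean under the non-informative prior (the BLUP form), continuing the hypothetical setup above, with the interpolation property checked at a training site:

```python
# continues the setup above: F, f0, R, r0, plus hypothetical observations y
y = np.array([0.3, 1.1, 0.8, 0.2])

Rinv = np.linalg.inv(R)
beta_hat = np.linalg.solve(F.T @ Rinv @ F, F.T @ Rinv @ y)   # GLS estimate of beta

def mu_post(f0, r0):
    """Posterior mean mu_{0|n}(x0) = f0' beta_hat + r0' R^{-1} (y - F beta_hat)."""
    return f0 @ beta_hat + r0 @ Rinv @ (y - F @ beta_hat)

# interpolation check: at x0 = x_i, r0' R^{-1} = e_i', so the predictor returns y_i
r0_train = gauss_corr(X - X[2])       # correlations when x0 equals the 3rd training site
assert np.isclose(mu_post(f0, r0_train), y[2])
```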
The mean and variance of the predictive distribution (variance)
• MSPE(μ_{0|n}(x₀)) = σ²_{0|n}(x₀).
• The variance of the posterior of Y(x₀) given Yⁿ should be 0 whenever x₀ = x_i, and indeed σ²_{0|n}(x_i) = 0 at every training point.
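For reference, under the non-informative prior on β the posterior variance takes the familiar universal-kriging form (a standard expression consistent with the definitions above, stated as an aid rather than as the source's exact display):

\[
\sigma^2_{0|n}(x_0) = \sigma_z^2\Big[1 - r_0^{\top}R^{-1}r_0
+ \big(f_0 - F^{\top}R^{-1}r_0\big)^{\top}\big(F^{\top}R^{-1}F\big)^{-1}\big(f_0 - F^{\top}R^{-1}r_0\big)\Big].
\]

At x₀ = x_i, r₀ᵀR⁻¹ = e_iᵀ gives r₀ᵀR⁻¹r₀ = 1 and f₀ = Fᵀe_i, so both correction terms vanish and σ²_{0|n}(x_i) = 0.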
Predictive distributions when R and r₀ are known
• The posterior is a location-shifted and scaled univariate t distribution whose degrees of freedom are enhanced when there is informative prior information for either β or σ_z².
Degrees of freedom
• Base value for the degrees of freedom: ν = n − p.
• p additional degrees of freedom when the prior for β is informative.
• ν₀ additional degrees of freedom when the prior for σ_z² is informative.
Location shift
• The same centering value as in Theorem 4.1.1 (known σ_z²).
• The non-informative prior gives the BLUP.
Scale factor σ²_ν(x₀) (compare (4.1.15) with (4.1.6))
• An estimate of the scale factor σ²_{0|n}(x₀).
• Q²_ν/ν estimates σ_z².
• Q²_ν combines information about σ_z² from the conditional distribution of Yⁿ given σ_z² with information from the prior of σ_z².
• σ²_ν(x_i) = 0 whenever x_i is any of the training data points.
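Putting the three ingredients together, the predictive distribution has the generic shifted-and-scaled t form (schematic; the specific ν and Q²_ν depend on which priors are informative):

\[
Y(x_0) \mid Y^n \;\sim\; \mu_{0|n}(x_0) + \sigma_\nu(x_0)\, T_\nu,
\]

where T_ν has a standard t distribution with ν degrees of freedom, μ_{0|n}(x₀) is the location shift, and σ²_ν(x₀) is the scale factor, with Q²_ν/ν playing the role of the σ_z² estimate inside it.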
Predictive distributions when correlation parameters are unknown
• What if the correlations among the observations are unknown (R and r₀ are unknown)?
• Assume Y(·) has a Gaussian prior with correlation function R(·|ψ), where ψ is an unknown vector of parameters.
• Two issues:
• the standard error of the plug-in predictor μ_{0|n}(x₀|ψ̂), obtained by substituting an estimate ψ̂ from MLE or REML (see the sketch below);
• a Bayesian approach to the uncertainty in ψ, which models ψ by a prior distribution.
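A minimal sketch of the plug-in route, maximizing the concentrated Gaussian log-likelihood over the hypothetical scale parameter theta from the earlier example (β and σ_z² profiled out; all names continue the illustrative setup above):

```python
from scipy.optimize import minimize_scalar

def neg_profile_loglik(theta):
    """Negative concentrated log-likelihood of theta (beta, sigma_z^2 profiled out)."""
    Rt = gauss_corr(X[:, None, :] - X[None, :, :], theta)
    Rt_inv = np.linalg.inv(Rt)
    b = np.linalg.solve(F.T @ Rt_inv @ F, F.T @ Rt_inv @ y)
    resid = y - F @ b
    sigma2 = resid @ Rt_inv @ resid / len(y)     # profiled-out process variance
    _, logdet = np.linalg.slogdet(Rt)
    return len(y) * np.log(sigma2) + logdet

theta_hat = minimize_scalar(neg_profile_loglik, bounds=(1e-3, 1e3),
                            method="bounded").x
# plug-in prediction: rebuild R and r0 at theta_hat, then reuse mu_post as before
```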
Prediction for multiple response models
• Several outputs may be available from a computer experiment:
• several codes are available for computing the same response (e.g., a fast and a slow code)
• competing responses
• Several stochastic models for the joint response
• These models are used to describe the optimal predictor for one of the several computed responses.
Modeling multiple outputs
• Z_i(·): marginally mean-zero stationary Gaussian stochastic processes with unknown variance σ_i² and correlation function R_i.
• Stationarity of Z_i(·) implies that the correlation between Z_i(x₁) and Z_i(x₂) depends only on x₁ − x₂.
• Assume Cov(Z_i(x₁), Z_j(x₂)) = σ_i σ_j R_ij(x₁ − x₂), where R_ij(·) is the cross-correlation function of Z_i(·) and Z_j(·).
• Linear model: the global mean of the Y_i process, where the f_i(·) are known regression functions and the β_i are unknown regression parameters.
Selection of correlation and cross-correlation functions is complicated
• Reason: for any input sites x_l^i, the multivariate normally distributed random vector (Z₁(x₁¹), …)ᵀ must have a nonnegative definite covariance matrix.
• Solution: construct the Z_i(·) from a set of elementary processes (usually these processes are mutually independent), as sketched below.
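A sketch of why the elementary-process construction works (a generic linear-combination version, not necessarily the book's specific one): if the W_k(·) are mutually independent, mean-zero processes with valid correlation functions ρ_k(·), and

\[
Z_i(x) = \sum_k a_{ik}\, W_k(x),
\qquad \text{then} \qquad
\operatorname{Cov}\big(Z_i(x_1), Z_j(x_2)\big) = \sum_k a_{ik}\, a_{jk}\, \rho_k(x_1 - x_2),
\]

which is nonnegative definite automatically, because it is the covariance of an actual joint process rather than a postulated one.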
Example by Kennedy and O'Hagan
• Y_i(x): prior for the ith code level (i = m is the top-level code). The autoregressive model:
• Y_i(x) = ρ_{i−1} Y_{i−1}(x) + δ_i(x), i = 2, …, m
• The output of each successively higher-level code i at x is the output of the less precise code i−1 at x plus a refinement δ_i(x).
• Cov(Y_i(x), Y_{i−1}(w) | Y_{i−1}(x)) = 0 for all w ≠ x
• No additional second-order knowledge of code i at x can be obtained from the lower-level code i−1 once the value of code i−1 at x is known (a Markov property on the hierarchy of codes).
• When there is no natural hierarchy of computer codes, something better is needed.
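Under the usual additional assumption that the refinement δ_i(·) is independent of Y_{i−1}(·), the autoregressive structure propagates covariances up the hierarchy:

\[
\operatorname{Cov}\big(Y_i(x_1), Y_i(x_2)\big)
= \rho_{i-1}^2\, \operatorname{Cov}\big(Y_{i-1}(x_1), Y_{i-1}(x_2)\big)
+ \operatorname{Cov}\big(\delta_i(x_1), \delta_i(x_2)\big),
\]

so specifying a valid covariance for level 1 and for each refinement yields a valid joint model for all code levels.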
A more reasonable model
• Each constraint function is associated with the objective function plus a refinement:
• Y_i(x) = ρ_i Y_1(x) + δ_i(x), i = 2, …, m+1
• Ver Hoef and Barry
• form models in the environmental sciences
• that include an unknown smooth surface plus a random measurement error
• using moving averages over white noise processes (see the sketch below)
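The moving-average (process convolution) idea in schematic form, with hypothetical square-integrable kernels g_i and a common white noise process W(·):

\[
Z_i(x) = \int g_i(u - x)\, W(u)\, du
\quad\Longrightarrow\quad
\operatorname{Cov}\big(Z_i(x_1), Z_j(x_2)\big) = \int g_i(u - x_1)\, g_j(u - x_2)\, du,
\]

which is again nonnegative definite by construction, with the cross-covariances determined by the overlap of the kernels.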
Morris and Mitchell model
• Prior information about y(x) is specified by a Gaussian process Y(·).
• Prior information about the partial derivatives y^{(j)}(x) is obtained by considering the derivative processes of Y(·):
• Y_1(·) = Y(·), Y_2(·) = Y^{(1)}(·), …, Y_{1+m}(·) = Y^{(m)}(·), where Y^{(j)}(·) = ∂Y(·)/∂x_j.
• The natural prior for y^{(j)}(x) is the derivative process Y^{(j)}(·).
• The covariances between Y(x₁) and Y^{(j)}(x₂), and between Y^{(i)}(x₁) and Y^{(j)}(x₂), are obtained by differentiating the covariance function of Y(·), as recorded below.
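The covariance formulas the slide alludes to are the standard ones for mean-square differentiable stationary processes, stated here for Cov(Y(x₁), Y(x₂)) = σ_z² R(x₁ − x₂):

\[
\operatorname{Cov}\big(Y(x_1), Y^{(j)}(x_2)\big)
= -\,\sigma_z^2\, \frac{\partial R(h)}{\partial h_j}\bigg|_{h = x_1 - x_2},
\qquad
\operatorname{Cov}\big(Y^{(i)}(x_1), Y^{(j)}(x_2)\big)
= -\,\sigma_z^2\, \frac{\partial^2 R(h)}{\partial h_i\, \partial h_j}\bigg|_{h = x_1 - x_2}.
\]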
Optimal predictors for multiple outputs
• The best MSPE predictor of Y₀ based on the training data is the conditional mean E{Y₀ | Y_1^{n_1}, …, Y_m^{n_m}},
• where Y₀ = Y₁(x₀), Y_i^{n_i} = (Y_i(x_1^i), …, Y_i(x_{n_i}^i))ᵀ, and y_i^{n_i} is the observed value of Y_i^{n_i} for i = 1, …, m.
The joint distribution of Y₀ and the training data is multivariate normal.
Conditional expectation
• In practice, the exact conditional expectation is unusable as stated (it requires knowledge of the marginal correlation functions, the joint cross-correlation functions, and the ratios of all the process variances).
• Empirical versions are of practical use:
• assume each of the correlation matrices R_i and cross-correlation matrices R_ij is known up to a vector of parameters,
• estimate those parameters by MLE or REML; a compact sketch follows.
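A compact sketch of the empirical version, assuming the joint covariance of (Y₀, all training outputs) has already been assembled from parameter estimates ψ̂ into a cross-covariance vector c0 and a joint covariance matrix C (all names hypothetical); the formula is the usual multivariate-normal conditional mean:

```python
import numpy as np

def cond_mean(mu0, mu, c0, C, yobs):
    """Conditional mean of Y0 given jointly normal training data Y = yobs:
    E{Y0 | Y} = mu0 + c0' C^{-1} (yobs - mu)."""
    return mu0 + c0 @ np.linalg.solve(C, yobs - mu)

# yobs stacks all observed outputs, e.g. yobs = np.concatenate([y1_obs, y2_obs]);
# C is the corresponding joint covariance built from the estimated R_i and R_ij.
```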
Example 1
• The 14-point training data set has the feature that it allows learning over the entire input space: it is space-filling.
• Compare two models:
• the predictor of y(·) based on y(·) alone,
• the predictor of y(·) based on (y(·), y^{(1)}(·), y^{(2)}(·)).
• The second is both a visually better fit and has 24% smaller ERMSPE (empirical root mean squared prediction error).