Chapter 4 Prediction and Bayesian Inference • 4.1 Estimators versus predictors • 4.2 Prediction for one-way ANOVA models • Shrinkage estimation, types of predictions • 4.3 Best linear unbiased predictors (BLUPs) • 4.4 Mixed model predictors • 4.5 Bayesian inference • 4.6 Case study: Forecasting lottery sales • 4.7 Credibility theory • Appendix 4A Linear unbiased predictors
4.1 Estimators versus predictors • In the longitudinal data model, y_it = z_it′ α_i + x_it′ β + ε_it, the variables {α_i} describe subject-specific effects. • Given the data {y_it, z_it, x_it}, in some problems it is of interest to “summarize” subject effects. • We have discussed how to estimate the fixed, unknown parameters. • It is also of interest to summarize subject-specific effects, such as those described by the random variable α_i. • Predictors are “estimators” of random variables. • Like estimators, predictors are said to be linear if they are formed from a linear combination of the response y.
Applications of prediction • In animal and plant breeding, one wishes to predict the production of milk for cows based on (1) their lineage (random) and (2) herds (fixed). • In credibility theory, one wishes to predict expected claims for a policyholder given exposure to several risk factors. • In sample surveys, one wishes to predict the size of a specific age-sex-race cohort within a small geographical area (known as “small area estimation”). • In a survey article, Robinson (1991) also cites (1) ore reserve estimation in geological surveys, (2) measuring the quality of a production plan, and (3) ranking baseball players' abilities.
4.2 Prediction for one-way ANOVA models • Consider the traditional one-way random effects ANOVA (analysis of variance) model: y_it = μ + α_i + ε_it. • Suppose that we wish to summarize the subject-specific conditional mean, μ + α_i. • For contrast, first consider the fixed effects model with μ = 0. • There, the subject sample mean ȳ_i is the “best” (Gauss-Markov) estimate of α_i. • This estimate is unbiased, that is, E ȳ_i = α_i. • It has minimum variance among all linear unbiased estimators (BLUE).
Shrinkage estimator • Now use the one-way random effects model. • Consider an “estimator” of μ + α_i that is a linear combination of ȳ_i and ȳ, that is, c_1 ȳ_i + c_2 ȳ for constants c_1 and c_2. • Calculations show that the values of c_1 and c_2 that minimize the mean square error E(c_1 ȳ_i + c_2 ȳ − (μ + α_i))² satisfy c_2 = 1 − c_1, with c_1 depending on the variance components. • For large n, this yields the shrinkage estimator, or predictor, of μ + α_i, ȳ_i,s = ζ_i ȳ_i + (1 − ζ_i) ȳ, where ζ_i = T_i σ_α² / (T_i σ_α² + σ²).
Example of shrinkage estimator • Hypothetical Run Times for Three Machines:
Machine 1: run times 14, 12, 10, 12; average run time ȳ_1 = 12
Machine 2: run times 9, 16, 15, 12; average run time ȳ_2 = 13
Machine 3: run times 8, 10, 7, 7; average run time ȳ_3 = 8
• Notation: y_ij means the jth run from the ith machine. • For example, y_21 = 9 and y_23 = 15. • Are there real differences among machines?
Example - Continued • To see the “shrinkage” effect, compare the subject-specific means to the shrinkage estimators. • Figure 4.1 Comparison of Subject-Specific Means to Shrinkage Estimators: the machine means 12, 13, and 8 (overall mean 11) are pulled toward 11, giving shrinkage estimators 11.825, 12.650, and 8.525.
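To make the shrinkage computation concrete, here is a minimal Python sketch (not part of the original slides) that reproduces the Figure 4.1 values from the machine run-time data. It assumes the usual one-way ANOVA moment estimators of σ_α² and σ², since the slide does not state how the variance components were estimated.

```python
import numpy as np

# Hypothetical run times for the three machines (Section 4.2 example)
machines = {
    1: np.array([14, 12, 10, 12]),
    2: np.array([9, 16, 15, 12]),
    3: np.array([8, 10, 7, 7]),
}

T = 4                                               # runs per machine (balanced design)
n = len(machines)                                   # number of machines
ybar_i = {i: y.mean() for i, y in machines.items()} # subject means: 12, 13, 8
ybar = np.mean([y for y in machines.values()])      # overall mean: 11

# One-way ANOVA moment estimators of the variance components (an assumption here)
sse = sum(((y - ybar_i[i]) ** 2).sum() for i, y in machines.items())
mse = sse / (n * (T - 1))                           # within-machine variance (sigma^2)
msb = T * sum((m - ybar) ** 2 for m in ybar_i.values()) / (n - 1)
sigma2_alpha = (msb - mse) / T                      # between-machine variance (sigma_alpha^2)

# Shrinkage weight and shrinkage estimators
zeta = T * sigma2_alpha / (T * sigma2_alpha + mse)  # about 0.825
shrinkage = {i: zeta * ybar_i[i] + (1 - zeta) * ybar for i in machines}

print(f"zeta = {zeta:.3f}")
for i in machines:
    print(f"machine {i}: mean = {ybar_i[i]:.1f}, shrinkage estimator = {shrinkage[i]:.3f}")
# Prints 11.825, 12.650, and 8.525, matching Figure 4.1.
```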
More on shrinkage estimators • Under the random effects model, ȳ_i is an unbiased predictor of μ + α_i in the sense that E[ȳ_i − (μ + α_i)] = 0. • However, ȳ_i is inefficient in the sense that the shrinkage estimator ȳ_i,s has a smaller mean square error than ȳ_i. • Here, ȳ_i has been “shrunk” toward the stable estimator ȳ. • The “estimator” ȳ_i,s is said to “borrow strength” from the stable estimator ȳ. • Recall ζ_i = T_i σ_α² / (T_i σ_α² + σ²). • Note that ζ_i → 1 as either (i) T_i → ∞ or (ii) σ_α² / σ² → ∞.
Best predictors • From Section 3.1, it is easy to check that the generalized least squares estimator of μ is m_α,GLS = (Σ_i ζ_i ȳ_i) / (Σ_i ζ_i). • The linear predictor of μ + α_i that has minimum variance is ζ_i ȳ_i + (1 − ζ_i) m_α,GLS. • Here, the acronym BLUP stands for best linear unbiased predictor.
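For the balanced machine example, every ζ_i equals the same ζ, so m_α,GLS reduces to the overall mean and the BLUP coincides with the shrinkage estimator. A short worked check, using ζ ≈ 0.825 from the sketch above:

```latex
m_{\alpha,\mathrm{GLS}} = \frac{\sum_i \zeta_i \bar{y}_i}{\sum_i \zeta_i}
  = \frac{\zeta\,(12 + 13 + 8)}{3\zeta} = 11,
\qquad
\zeta \bar{y}_1 + (1-\zeta)\, m_{\alpha,\mathrm{GLS}}
  \approx 0.825(12) + 0.175(11) = 11.825 .
```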
Types of Predictors • We have now introduced the BLUP of μ + α_i. This quantity is a linear combination of global parameters and subject-specific effects. • Two other types of predictors are of interest. • Residuals. Here, we wish to “predict” ε_it. The BLUP residual turns out to be e_it,BLUP = y_it − (ζ_i ȳ_i + (1 − ζ_i) m_α,GLS). • Forecasts. Here, we wish to predict y_i,T_i+L, for L lead time units into the future. • Without serial correlation, this predictor is the same as the predictor of μ + α_i. However, we will see that its mean square error turns out to be larger.
4.3 Best linear unbiased predictors • This section develops best linear unbiased predictors in the context of mixed linear models and then specializes them to the longitudinal data mixed model. • BLUPs are developed by examining the minimum mean square error predictor of a random variable w. • We give a development due to Harville (1976). • The argument is originally due to Goldberger (1962), who coined the phrase “best linear unbiased predictor.” • The acronym was first used by Henderson (1973). • BLUPs can also be developed as conditional expectations under multivariate normality. • BLUPs can also be developed in a Bayesian context.
Mixed linear models • Suppose that we observe an N × 1 random vector y with mean E y = Xβ and variance Var y = V. • We wish to predict a random variable w that has mean E w = λ′β and Var w = σ_w². • Denote the covariance between w and y as Cov(w, y) = cov_wy, a 1 × N vector. • Assuming known regression parameters β, the best linear (in y) predictor of w is w* = E w + cov_wy V⁻¹ (y − E y) = λ′β + cov_wy V⁻¹ (y − Xβ). • If w and y are multivariate normal, then w* equals E(w | y) and hence is a minimum mean square predictor of w. • The predictor w* is also a minimum mean square predictor of w without the assumption of normality. See Appendix 4A.1.
BLUPs as predictors • To develop the BLUP, define b_GLS = (X′ V⁻¹ X)⁻¹ X′ V⁻¹ y to be the generalized least squares (GLS) estimator of β. • This is the best linear unbiased estimator (BLUE). • Replace β by b_GLS in the definition of w* to get the BLUP: w_BLUP = λ′ b_GLS + cov_wy V⁻¹ (y − X b_GLS) = (λ′ − cov_wy V⁻¹ X) b_GLS + cov_wy V⁻¹ y. • See Appendix 4A.2 for a check establishing w_BLUP as the best linear unbiased predictor of w. • From Appendix 4A.3, we also have the form of the minimum mean square error: Var(w_BLUP − w) = (λ′ − cov_wy V⁻¹ X) (X′ V⁻¹ X)⁻¹ (λ′ − cov_wy V⁻¹ X)′ − cov_wy V⁻¹ cov_wy′ + σ_w².
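The two displays above translate directly into a few lines of numpy. The following is a minimal sketch (the function name and argument layout are illustrative, not from the text):

```python
import numpy as np

def blup(y, X, V, lam, cov_wy, sigma2_w):
    """BLUP of a random variable w with E w = lam' beta, Var w = sigma2_w,
    and Cov(w, y) = cov_wy (length-N vector), given E y = X beta, Var y = V."""
    Vinv = np.linalg.inv(V)
    XtVinvX = X.T @ Vinv @ X
    b_gls = np.linalg.solve(XtVinvX, X.T @ Vinv @ y)        # GLS (BLUE) estimator of beta
    w_blup = lam @ b_gls + cov_wy @ Vinv @ (y - X @ b_gls)  # BLUP of w
    # Minimum mean square error, in the Appendix 4A.3 form
    d = lam - X.T @ Vinv @ cov_wy                           # (lam' - cov_wy V^{-1} X)'
    mse = d @ np.linalg.solve(XtVinvX, d) - cov_wy @ Vinv @ cov_wy + sigma2_w
    return b_gls, w_blup, mse
```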
Example: One-way model • Recall y_it = μ + α_i + ε_it. • Thus, y_i = 1_i (μ + α_i) + ε_i, so that X_i = 1_i and V_i = σ² I_i + σ_α² 1_i 1_i′. • With this, we note that V_i⁻¹ (y_i − X_i b_GLS) = (1/σ²) [ (y_i − 1_i m_α,GLS) − ζ_i 1_i (ȳ_i − m_α,GLS) ]. • Thus, for predicting w = μ + α_i we have λ = 1 and Cov(w, y_i) = σ_α² 1_i′ for the ith subject, 0 otherwise. Thus, w_BLUP = m_α,GLS + ζ_i (ȳ_i − m_α,GLS) = ζ_i ȳ_i + (1 − ζ_i) m_α,GLS.
Random effects ANOVA model • For predicting the residual w = ε_it we have λ = 0 and Cov(w, y_i) = σ² 1_it′ for the ith subject, tth time period, 0 otherwise. • Here, 1_it is a T_i × 1 vector with a one in the tth position and zeros elsewhere. Thus, e_it,BLUP = σ² 1_it′ V_i⁻¹ (y_i − X_i b_GLS) = y_it − (ζ_i ȳ_i + (1 − ζ_i) m_α,GLS) is our BLUP residual.
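Continuing the sketches above (and reusing their variance estimates, which are illustrative), the generic blup() function reproduces the one-way results on this slide for the machine data:

```python
# BLUP of (mu + alpha_1) and a BLUP residual for the machine example,
# reusing machines, T, mse, sigma2_alpha, and blup() from the earlier sketches.
sigma2 = mse                                              # within-machine variance estimate
y_all = np.concatenate([machines[i] for i in (1, 2, 3)])  # stacked response (N = 12)
X_all = np.ones((12, 1))                                  # one-way model: X = 1
Vi = sigma2 * np.eye(T) + sigma2_alpha * np.ones((T, T))  # V_i for each machine
V_all = np.kron(np.eye(3), Vi)                            # block-diagonal V

# Predict w = mu + alpha_1: lambda = 1, Cov(w, y) = sigma_alpha^2 on machine 1's rows
cov_wy = np.zeros(12)
cov_wy[0:4] = sigma2_alpha
b_gls, w_blup, _ = blup(y_all, X_all, V_all, np.array([1.0]), cov_wy, sigma2_alpha)
print(b_gls, w_blup)      # about 11 and 11.825, the BLUP for machine 1

# BLUP residual for y_11 = 14: lambda = 0, Cov(w, y) = sigma^2 in the y_11 position
cov_ey = np.zeros(12)
cov_ey[0] = sigma2
_, e_blup, _ = blup(y_all, X_all, V_all, np.array([0.0]), cov_ey, sigma2)
print(e_blup)             # about 14 - 11.825 = 2.175
```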
4.4 Mixed model predictors • Recall the longitudinal data mixed model y_i = Z_i α_i + X_i β + ε_i. • As described in Section 3.3, this is a special case of the mixed linear model. We use V = block diagonal(V_1, ..., V_n), where V_i = Z_i D Z_i′ + R_i, and X = (X_1′, ..., X_n′)′. • For BLUP calculations, note that cov_wy = (Cov(w, y_1), …, Cov(w, y_n)).
Longitudinal data mixed model BLUP • Recall that the random variable w has mean E w = λ′β and Var w = σ_w². • With this block-diagonal structure, the BLUP is w_BLUP = λ′ b_GLS + Σ_i Cov(w, y_i) V_i⁻¹ (y_i − X_i b_GLS). • The mean square error is Var(w_BLUP − w) = (λ′ − Σ_i Cov(w, y_i) V_i⁻¹ X_i) (Σ_i X_i′ V_i⁻¹ X_i)⁻¹ (λ′ − Σ_i Cov(w, y_i) V_i⁻¹ X_i)′ − Σ_i Cov(w, y_i) V_i⁻¹ Cov(w, y_i)′ + σ_w².
BLUP special cases • Global parameters and subject-specific effects. Suppose that the interest is in predicting linear combinations of the global parameters β and the subject-specific effects α_i. Consider linear combinations of the form w = c_1′ α_i + c_2′ β. • Residuals. Here, w = ε_it. • Forecasts. Suppose that the ith subject is included in the data set; predict y_i,T_i+L, the response L lead time units in the future.
Predicting global parameters and subject-specific effects • Consider linear combinations of the form w = c_1′ α_i + c_2′ β. • Straightforward calculations show that • E w = c_2′ β, so that λ = c_2, • Cov(w, y_j) = c_1′ D Z_i′ for j = i, • Cov(w, y_j) = 0 for j ≠ i. • Thus, w_BLUP = c_2′ b_GLS + c_1′ D Z_i′ V_i⁻¹ (y_i − X_i b_GLS).
Special case 1 • Take c_2 = 0. Because the mean and variance expressions hold for all vectors c_1, we may write this in vector notation to get the BLUP of α_i, the vector a_i,BLUP = D Z_i′ V_i⁻¹ (y_i − X_i b_GLS). • This is unbiased in the sense that E(a_i,BLUP − α_i) = 0. • This predictor has minimum variance among all linear unbiased predictors (BLUP). • In the case of the error components model (z_it = 1), this reduces to a_i,BLUP = ζ_i (ȳ_i − x̄_i′ b_GLS). • For comparison, recall the fixed effects parameter estimate, a_i = ȳ_i − x̄_i′ b.
Motivating BLUPs • We can also motivate BLUPs using normal theory: • Consider the case where α_i and ε_i are multivariate normally distributed. • Then, it can be shown that E(α_i | y_i) = D Z_i′ V_i⁻¹ (y_i − X_i β). • To motivate this, consider asking the question: what realization of α_i could be associated with y_i? The (conditional) expectation! • The BLUP is the BLUE of E(α_i | y_i). (That is, replace β by b_GLS.)
Special case 2 • As another example, it is of interest to predict the conditional mean z_it′ α_i + x_it′ β. • Choose c_1 = z_it and c_2 = x_it. • This yields w_BLUP = z_it′ a_i,BLUP + x_it′ b_GLS. • This predictor is of interest in actuarial science, where it is known as the credibility estimator.
BLUP Residuals • Here, w = ε_it. Because E w = 0, it follows that λ = 0. • Straightforward calculations show that • Cov(w, y_j) = σ_ε² 1_it′ for j = i and • Cov(w, y_j) = 0 for j ≠ i. • Here, the symbol 1_it denotes a T_i × 1 vector that has a one in the tth position and is zero otherwise. • Thus, e_it,BLUP = σ_ε² 1_it′ V_i⁻¹ (y_i − X_i b_GLS). • This can also be expressed as e_it,BLUP = y_it − (z_it′ a_i,BLUP + x_it′ b_GLS).
Predicting future observations • Suppose that the ith subject is included in the data set; predict y_i,T_i+L = z_i,T_i+L′ α_i + x_i,T_i+L′ β + ε_i,T_i+L • for L lead time units in the future. • We will assume that the covariates z_i,T_i+L and x_i,T_i+L are known. • It follows that E w = x_i,T_i+L′ β, so that λ = x_i,T_i+L. • Straightforward calculations show that Cov(w, y_i) = z_i,T_i+L′ D Z_i′ + Cov(ε_i,T_i+L, ε_i). • Thus, the forecast of y_i,T_i+L is ŷ_i,T_i+L = x_i,T_i+L′ b_GLS + z_i,T_i+L′ a_i,BLUP + Cov(ε_i,T_i+L, ε_i) V_i⁻¹ (y_i − X_i b_GLS). • Thus, the forecast is the estimate of the conditional mean plus a serial correlation correction factor.
Predicting future observations • To illustrate, consider the special case of autoregressive of order 1 (AR(1)), serially correlated errors. • Thus, we have Var ε_it = σ_ε² and Cov(ε_ir, ε_is) = σ_ε² ρ^|r−s|. • After some algebra, the L-step forecast is ŷ_i,T_i+L = x_i,T_i+L′ b_GLS + z_i,T_i+L′ a_i,BLUP + ρ^L e_i,T_i,BLUP, that is, the conditional mean estimate plus ρ^L times the most recent BLUP residual.
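A minimal sketch of this forecast equation; b_GLS, a_i,BLUP, the most recent BLUP residual, and ρ are assumed to have been estimated already, and the numbers in the usage line are purely illustrative:

```python
import numpy as np

def ar1_forecast(x_future, z_future, b_gls, a_blup, last_blup_resid, rho, L):
    """L-step-ahead forecast under AR(1) errors: the conditional-mean estimate
    plus the serial correlation correction rho**L times the last BLUP residual."""
    cond_mean = x_future @ b_gls + z_future @ a_blup
    return cond_mean + rho**L * last_blup_resid

# Intercept-only illustration: conditional mean 11 + 0.83, residual 2.17, rho = 0.5, L = 2
print(ar1_forecast(np.array([1.0]), np.array([1.0]), np.array([11.0]),
                   np.array([0.83]), 2.17, 0.5, 2))        # 11.83 + 0.25 * 2.17
```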
4.5 Bayesian Inference • With Bayesian statistical models, one views both the model parameters and the data as random variables. • We assume distributions for each type of random variable. • Given the parameters α and β, the response model is y = Zα + Xβ + ε. • Specifically, we assume that the responses y conditional on α and β are normally distributed and that E(y | α, β) = Zα + Xβ and Var(y | α, β) = R. • Assume that α is distributed normally with mean μ_α and variance D and that β is distributed normally with mean μ_β and variance Σ_β, each independent of the other.
Distributions • The joint distribution of (α, β) is known as the prior distribution. • To summarize, the joint distribution of (α, β, y) is multivariate normal, with E y = Z μ_α + X μ_β, Var y = V + X Σ_β X′, Cov(α, y) = D Z′, and Cov(β, y) = Σ_β X′, • where V = R + Z D Z′.
Posterior Distribution • The distribution of the parameters given the data is known as the posterior distribution. • The posterior distribution of (α, β) given y is normal. • The conditional moments follow from the usual multivariate normal formulas; for example, the posterior mean is E[(α′, β′)′ | y] = (μ_α′, μ_β′)′ + Cov((α′, β′)′, y) (Var y)⁻¹ (y − Z μ_α − X μ_β).
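A minimal sketch of the posterior-mean computation via the standard multivariate normal conditioning formula (the function name and argument layout are illustrative):

```python
import numpy as np

def posterior_mean(y, Z, X, R, D, mu_alpha, Sigma_beta, mu_beta):
    """Posterior mean of (alpha, beta) in the model y | alpha, beta ~ N(Z alpha + X beta, R),
    with independent priors alpha ~ N(mu_alpha, D) and beta ~ N(mu_beta, Sigma_beta)."""
    Ey = Z @ mu_alpha + X @ mu_beta
    Vy = Z @ D @ Z.T + X @ Sigma_beta @ X.T + R   # = V + X Sigma_beta X', with V = R + Z D Z'
    # Cov((alpha, beta), y): stack Cov(alpha, y) = D Z' and Cov(beta, y) = Sigma_beta X'
    C = np.vstack([D @ Z.T, Sigma_beta @ X.T])
    prior_mean = np.concatenate([mu_alpha, mu_beta])
    return prior_mean + C @ np.linalg.solve(Vy, y - Ey)
```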
Relation with BLUPs • In longitudinal data applications, one typically has more information about the global parameters β than about the subject-specific parameters α. • Consider first the case Σ_β = 0, so that β = μ_β with probability one. • Intuitively, this means that β is precisely known, generally from collateral information. • Assuming μ_α = 0, it is easy to check that the best linear unbiased estimator (BLUE) of E(α | y) is a_BLUP = D Z′ V⁻¹ (y − X b_GLS). • Recall from equation (4.11) that a_BLUP is also the best linear unbiased predictor in the frequentist (non-Bayesian) model framework.
Relation with BLUPs • Consider second the case where Σ_β⁻¹ = 0. • In this case, prior information about the parameter β is vague; this is known as using a diffuse prior. • Assuming μ_α = 0, one can show that E(α | y) = a_BLUP. • It is interesting that in both extreme cases we arrive at the statistic a_BLUP as a predictor of α. • This analysis assumes D and R are matrices of fixed parameters. • It is also possible to assume distributions for these parameters; typically, independent Wishart distributions are used for D⁻¹ and R⁻¹, as these are conjugate priors. • The general strategy of substituting point estimates for certain parameters in a posterior distribution is called empirical Bayes estimation.
Example – One-way random effects ANOVA model • The posterior means turn out to be weighted averages of the sample-based estimators and the prior means. • The weight attached to the data reflects the precision of our knowledge about β. Specifically, this weight approaches one as σ_β² → ∞ and approaches zero as σ_β² → 0.
4.6 Wisconsin Lottery Sales • T = 40 weeks of sales from n = 50 ZIP codes
Lottery Sales Data Analysis • Cross-sectional analysis shows that population size heavily influences sales, with Kenosha as an outlier • Multiple time series plots • show the effect of jackpots that is common to all postal codes • show the heterogeneity among postal codes (reaffirmed by a pooling test) • show the heteroscedasticity that is accommodated through a logarithmic transformation
Lottery Sales Model Selection • In-sample results show that • One-way error components dominates pooled cross-sectional models • An AR(1) error specification significantly improves the fit. • The best model is probably the two-way error component model, with an AR(1) error specification (not yet documented) • Out-of-sample analysis suggests that • logarithmic sales is the preferred choice of response; it outperforms sales and percentage change.
4.7. What is Credibility? • Hickman’s (1975) Analogy • In politics, leaders begin with a reservoir of credibility which decreases as executive experience is compiled. • Insurance behaves in a reverse fashion! • Here, credibility increases as experience increases.
Credibility Theory • Credibility is a technique for predicting future expected claims for a risk class, given past claims of that and related risk classes. • Importance • Credibility is widely used for pricing property and casualty, workers' compensation, and health care coverages. • According to Rodermund (1989), “the concept of credibility has been the casualty actuaries’ most important and enduring contribution to casualty actuarial science.”
History • Mowbray (1914 - PCAS) • Asked the question, “how extensive is an exposure necessary to give a dependable pure premium?” • This approach is now known as the “limited fluctuation” or “American” credibility • Question 1 – do we have enough exposure to give full weight to the risk class under consideration? • Question 2 – if not, how can we combine information from this and related risk classes?
More History • Whitney (1918 - PCAS) • introduced the idea of using a weighted average of the average claims of (1) a given risk class and (2) all risk classes. • The weight is known as the credibility factor. • It is of the form New Premium = Z × Claims Experience + (1 − Z) × Old Premium.
Example - Balanced Bühlmann • Consider the model y_it = μ + α_i + ε_it. • The credibility factor is Z = T σ_α² / (T σ_α² + σ²). • The traditional credibility estimator is Z ȳ_i + (1 − Z) ȳ.
Example • Hypothetical Claims for Three Towns:
Town 1: claims 14, 12, 10, 12; average claim ȳ_1 = 12
Town 2: claims 9, 16, 15, 12; average claim ȳ_2 = 13
Town 3: claims 8, 10, 7, 7; average claim ȳ_3 = 8
• Are there real differences among towns? • Mowbray - does Town 3 have enough data to support its own estimate of pure premiums? • Whitney - how can I use the information in Towns 1 and 2 to help determine my rate for Town 3?
Response to Whitney • Known as the “shrinkage” effect • Comparison of Subject-Specific Means to Credibility Estimators: the town means 12, 13, and 8 (overall mean 11) are pulled toward 11, giving credibility estimators 11.825, 12.650, and 8.525.
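As a worked check, using the variance-component estimates from the Section 4.2 machine example (the data are identical), the credibility estimate for Town 3 is

```latex
Z = \frac{T \hat{\sigma}_\alpha^2}{T \hat{\sigma}_\alpha^2 + \hat{\sigma}^2}
  = \frac{4(5.78)}{4(5.78) + 4.89} \approx 0.825,
\qquad
Z \bar{y}_3 + (1 - Z)\,\bar{y} \approx 0.825(8) + 0.175(11) = 8.525 .
```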
Why study credibility theory? • Long history of applications – “a business necessity” • More recently, many theoretical advances with fewer innovative applications • Credibility techniques required in legal statutes and standards of practice • Standard of Practice 25 by the Actuarial Standards Board of the American Academy of Actuaries • Wisconsin statutes on credibility insurance and disability income • Advanced techniques are critical for keeping up with competition (health insurance – health economists) • Innovative techniques enhance the “credibility” of the profession