Chapter 4 Prediction and Bayesian Inference • 4.1 Estimators versus predictors • 4.2 Prediction for one-way ANOVA models • Shrinkage estimation, types of predictions • 4.3 Best linear unbiased predictors (BLUPs) • 4.4 Mixed model predictors • 4.5 Bayesian inference • 4.6 Case study: Forecasting lottery sales • 4.7 Credibility Theory • Appendix 4A Linear unbiased predictors
4.1 Estimators versus predictors • In the longitudinal data model, $y_{it} = z_{it}'\alpha_i + x_{it}'\beta + \varepsilon_{it}$, the variables $\{\alpha_i\}$ describe subject-specific effects. • Given the data $\{y_{it}, z_{it}, x_{it}\}$, in some problems it is of interest to “summarize” subject effects. • We have discussed how to estimate fixed, unknown parameters. • It is also of interest to summarize subject-specific effects, such as those described by the random variable $\alpha_i$. • Predictors are “estimators” of random variables. • Like estimators, predictors are said to be linear if they are formed from a linear combination of the responses $y$.
Applications of prediction • In animal and plant breeding, one wishes to predict the production of milk for cows based on (1) their lineage (random) and (2) herds (fixed). • In credibility theory, one wishes to predict expected claims for a policyholder given exposure to several risk factors. • In sample surveys, one wishes to predict the size of a specific age-sex-race cohort within a small geographical area (known as “small area estimation”). • In a survey article, Robinson (1991) also cites (1) ore reserve estimation in geological surveys, (2) measuring quality of a production plan and (3) ranking baseball players' abilities.
4.2 Prediction for one-way ANOVA models • Consider the traditional one-way random effects ANOVA (analysis of variance) model: $y_{it} = \mu_\alpha + \alpha_i + \varepsilon_{it}$. • Suppose that we wish to summarize the subject-specific conditional mean, $\mu_\alpha + \alpha_i$. • For contrast, first consider using the fixed effects model with $\mu_\alpha = 0$. • Here, we have that the subject average $\bar{y}_i = T_i^{-1}\sum_{t=1}^{T_i} y_{it}$ is the “best” (Gauss-Markov) estimate of $\alpha_i$. • This estimate is unbiased, that is, $\mathrm{E}\,\bar{y}_i = \alpha_i$. • This estimate has minimum variance among all linear unbiased estimators (BLUE).
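Shrinkage estimator • Now use the one-way random effects model, with $\sigma_\alpha^2 = \mathrm{Var}\,\alpha_i$ and $\sigma^2 = \mathrm{Var}\,\varepsilon_{it}$. • Consider an “estimator” of $\mu_\alpha + \alpha_i$ that is a linear combination of the subject average $\bar{y}_i$ and the overall average $\bar{y}$ of all $N = \sum_j T_j$ observations, that is, $c_1\bar{y}_i + c_2\bar{y}$, for constants $c_1$ and $c_2$. • Calculations show that the best values of $c_1$ and $c_2$ that minimize $\mathrm{E}\,(c_1\bar{y}_i + c_2\bar{y} - (\mu_\alpha + \alpha_i))^2$ are $c_2 = 1 - c_1$ and $c_1 = T_i^*\sigma_\alpha^2/(\sigma^2 + T_i^*\sigma_\alpha^2)$, where $T_i^* = (1 - 2T_i/N + N^{-2}\sum_j T_j^2)/(T_i^{-1} - N^{-1})$. • For large n, $T_i^* \approx T_i$ and we have the shrinkage estimator, or predictor, of $\mu_\alpha + \alpha_i$ to be $\bar{y}_{i,s} = \zeta_i\bar{y}_i + (1 - \zeta_i)\bar{y}$, where $\zeta_i = T_i\sigma_\alpha^2/(T_i\sigma_\alpha^2 + \sigma^2)$.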
Example of shrinkage estimator • Hypothetical Run Times for Three Machines
Machine   Run Times        Average Run Time
1         14, 12, 10, 12   $\bar{y}_1 = 12$
2         9, 16, 15, 12    $\bar{y}_2 = 13$
3         8, 10, 7, 7      $\bar{y}_3 = 8$
• Notation: $y_{ij}$ means the jth run from the ith machine. • For example, $y_{21} = 9$ and $y_{23} = 15$. • Are there real differences among machines?
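Example - Continued • To see the “shrinkage” effect, consider Figure 4.1, Comparison of Subject-Specific Means to Shrinkage Estimators. • Subject-specific means: $\bar{y}_3 = 8$, $\bar{y}_1 = 12$, $\bar{y}_2 = 13$, with overall mean $\bar{y} = 11$. • Shrinkage estimators: 8.525 (machine 3), 11.825 (machine 1), 12.650 (machine 2).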
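The shrinkage values in Figure 4.1 can be reproduced, up to rounding, with a few lines of Python. This is a minimal sketch, not the author's calculation: the slides do not say how the variance components were estimated, so the ANOVA (method-of-moments) estimates below are an assumption, and all variable names are illustrative.

```python
# Hypothetical run times from the example (three machines, four runs each).
runs = {1: [14, 12, 10, 12], 2: [9, 16, 15, 12], 3: [8, 10, 7, 7]}

n = len(runs)                      # number of subjects (machines)
T = 4                              # runs per machine (balanced design)
means = {i: sum(v) / T for i, v in runs.items()}
grand_mean = sum(means.values()) / n

# ASSUMPTION: ANOVA (method-of-moments) variance component estimates.
sse = sum((y - means[i]) ** 2 for i, v in runs.items() for y in v)
s2 = sse / (n * (T - 1))                                # within-machine variance
msb = T * sum((m - grand_mean) ** 2 for m in means.values()) / (n - 1)
s2_alpha = max((msb - s2) / T, 0.0)                     # between-machine variance

# Shrinkage weight zeta = T s2_alpha / (T s2_alpha + s2), the same for every machine.
zeta = T * s2_alpha / (T * s2_alpha + s2)
shrunk = {i: zeta * m + (1 - zeta) * grand_mean for i, m in means.items()}

print("zeta:", round(zeta, 3))
for i in sorted(shrunk):
    print("machine", i, "mean", means[i], "shrunk", round(shrunk[i], 3))
```

Under this assumption the weight comes out close to 0.825, and each machine average moves toward the overall mean of 11 by the same proportion.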
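More on shrinkage estimators • Under the random effects model, $\bar{y}_i$ is an unbiased predictor of $\mu_\alpha + \alpha_i$ in the sense that $\mathrm{E}\,(\bar{y}_i - (\mu_\alpha + \alpha_i)) = 0$. • However, $\bar{y}_i$ is inefficient in the sense that the shrinkage estimator $\bar{y}_{i,s} = \zeta_i\bar{y}_i + (1 - \zeta_i)\bar{y}$ has a smaller mean square error than $\bar{y}_i$. • Here, $\bar{y}_{i,s}$ has been “shrunk” towards the stable estimator $\bar{y}$. • The “estimator” $\bar{y}_{i,s}$ is said to “borrow strength” from the stable estimator $\bar{y}$. • Recall $\zeta_i = T_i\sigma_\alpha^2/(T_i\sigma_\alpha^2 + \sigma^2)$. • Note that $\zeta_i \to 1$ as either (i) $T_i \to \infty$ or (ii) $\sigma_\alpha^2/\sigma^2 \to \infty$.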
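Best predictors • From Section 3.1, it is easy to check that the generalized least squares estimator of $\mu_\alpha$ is $m_{\alpha,GLS} = \sum_i \zeta_i\bar{y}_i / \sum_i \zeta_i$. • The linear unbiased predictor of $\mu_\alpha + \alpha_i$ that has minimum variance is $\bar{y}_{i,BLUP} = \zeta_i\bar{y}_i + (1 - \zeta_i)\,m_{\alpha,GLS}$. • Here, the acronym BLUP stands for best linear unbiased predictor.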
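With unbalanced data the weights $\zeta_i$ differ by subject, so the GLS estimator is a weighted rather than a simple average. The sketch below illustrates this with known variance components; the function name `one_way_blup`, the data and the variance values are all hypothetical and not taken from the slides.

```python
def one_way_blup(groups, s2_alpha, s2):
    """GLS estimate of the overall mean and BLUPs of mu_alpha + alpha_i.

    groups maps a subject label to its list of responses; the variance
    components s2_alpha (between) and s2 (within) are treated as known.
    """
    means = {i: sum(v) / len(v) for i, v in groups.items()}
    # zeta_i = T_i s2_alpha / (T_i s2_alpha + s2)
    zeta = {i: len(v) * s2_alpha / (len(v) * s2_alpha + s2)
            for i, v in groups.items()}
    # GLS estimator: zeta-weighted average of the subject means.
    m_gls = sum(zeta[i] * means[i] for i in groups) / sum(zeta.values())
    # BLUP: shrink each subject mean toward the GLS estimate.
    blup = {i: zeta[i] * means[i] + (1 - zeta[i]) * m_gls for i in groups}
    return m_gls, zeta, blup

# Hypothetical unbalanced data and variance components (not from the source).
groups = {"A": [10.0, 12.0], "B": [20.0, 22.0, 21.0, 19.0, 18.0, 24.0], "C": [15.0]}
m_gls, zeta, blup = one_way_blup(groups, s2_alpha=4.0, s2=8.0)
for label in groups:
    print(label, round(zeta[label], 3), round(blup[label], 3))
```

Subjects with more observations receive a larger $\zeta_i$ and are shrunk less; in a balanced design all weights coincide and $m_{\alpha,GLS}$ reduces to the overall average.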
Types of Predictors • We have now introduced the BLUP of $\mu_\alpha + \alpha_i$. This quantity is a linear combination of global parameters and subject-specific effects. • Two other types of predictors are of interest. • Residuals. Here, we wish to “predict” $\varepsilon_{it}$. The BLUP residual turns out to be $e_{it,BLUP} = y_{it} - \bar{y}_{i,BLUP}$. • Forecasts. Here, we wish to predict, for “L” lead time units into the future, $y_{i,T_i+L} = \mu_\alpha + \alpha_i + \varepsilon_{i,T_i+L}$. • Without serial correlation, the predictor is the same as the predictor of $\mu_\alpha + \alpha_i$. However, we will see that the mean square error turns out to be larger.
4.3 Best linear unbiased predictors • This section develops best linear unbiased predictors in the context of mixed linear models, then specializes the consideration to longitudinal data mixed models. • BLUPs are developed by examining the minimum mean square error predictor of a random variable, w. • We give a development due to Harville (1976). • The argument is originally due to Goldberger (1962), who coined the phrase best linear unbiased predictor. • The acronym was first used by Henderson (1973). • BLUPs can also be developed as conditional expectations using multivariate normality • BLUPs can also be developed in a Bayesian context.
Mixed linear models • Suppose that we observe an $N \times 1$ random vector $y$ with mean $\mathrm{E}\,y = X\beta$ and variance $\mathrm{Var}\,y = V$. • We wish to predict a random variable $w$ that has mean $\mathrm{E}\,w = \lambda'\beta$ and $\mathrm{Var}\,w = \sigma_w^2$. • Denote the covariance between $w$ and $y$ as $\mathrm{Cov}(w, y) = \mathrm{cov}_{wy}$, a $1 \times N$ vector. • Assuming known regression parameters ($\beta$), the best linear (in $y$) predictor of $w$ is $w^* = \mathrm{E}\,w + \mathrm{cov}_{wy}V^{-1}(y - \mathrm{E}\,y) = \lambda'\beta + \mathrm{cov}_{wy}V^{-1}(y - X\beta)$. • If $w, y$ are multivariate normal, then $w^*$ equals $\mathrm{E}(w \mid y)$ and hence is a minimum mean square error predictor of $w$. • Without the assumption of normality, $w^*$ is still the minimum mean square error predictor of $w$ among linear predictors. See Appendix 4A.1.
BLUPs as predictors • To develop the BLUP, • define $b_{GLS} = (X'V^{-1}X)^{-1}X'V^{-1}y$ to be the generalized least squares (GLS) estimator of $\beta$. • This is the best linear unbiased estimator (BLUE). • Replace $\beta$ by $b_{GLS}$ in the definition of $w^*$ to get the BLUP $w_{BLUP} = \lambda' b_{GLS} + \mathrm{cov}_{wy}V^{-1}(y - Xb_{GLS}) = (\lambda' - \mathrm{cov}_{wy}V^{-1}X)\,b_{GLS} + \mathrm{cov}_{wy}V^{-1}y$. • See Appendix 4A.2 for a check, establishing $w_{BLUP}$ as the best linear unbiased predictor of $w$. • From Appendix 4A.3, we also have the form for the minimum mean square error: $\mathrm{Var}(w_{BLUP} - w) = (\lambda' - \mathrm{cov}_{wy}V^{-1}X)(X'V^{-1}X)^{-1}(\lambda' - \mathrm{cov}_{wy}V^{-1}X)' - \mathrm{cov}_{wy}V^{-1}\mathrm{cov}_{wy}' + \sigma_w^2$.
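The BLUP and its mean square error translate directly into matrix code. The following NumPy sketch is one possible transcription of the two formulas above, not a reference implementation; the function name `blup`, its argument names and the variance components (5.0 each) are assumptions chosen for illustration. It is checked against the one-way model using the machine run times.

```python
import numpy as np

def blup(y, X, V, lam, cov_wy, var_w):
    """BLUP of a scalar w and its mean square error in the mixed linear model.

    y: (N,) responses; X: (N, K) design; V: (N, N) Var y;
    lam: (K,) with E w = lam' beta; cov_wy: (N,) Cov(w, y); var_w: Var w.
    """
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    XtVinvX = X.T @ Vinv_X
    b_gls = np.linalg.solve(XtVinvX, X.T @ Vinv_y)     # GLS estimator of beta
    Vinv_c = np.linalg.solve(V, cov_wy)                # V^{-1} cov_wy' (V is symmetric)
    w_blup = lam @ b_gls + Vinv_c @ (y - X @ b_gls)
    d = lam - X.T @ Vinv_c                             # (lam' - cov_wy V^{-1} X)'
    mse = d @ np.linalg.solve(XtVinvX, d) - cov_wy @ Vinv_c + var_w
    return w_blup, mse, b_gls

# One-way model check with the machine data; variance components are assumed.
s2_alpha, s2, T, n = 5.0, 5.0, 4, 3
y = np.array([14, 12, 10, 12, 9, 16, 15, 12, 8, 10, 7, 7], dtype=float)
X = np.ones((n * T, 1))
V = np.kron(np.eye(n), s2 * np.eye(T) + s2_alpha * np.ones((T, T)))

# Predict w = mu_alpha + alpha_1: lam = 1, Cov(w, y_1) = s2_alpha 1', Var w = s2_alpha.
cov_wy = np.concatenate([s2_alpha * np.ones(T), np.zeros((n - 1) * T)])
w_blup, mse, b_gls = blup(y, X, V, np.array([1.0]), cov_wy, s2_alpha)
print(w_blup, mse, b_gls)
```

With these assumed variance components the weight is $\zeta = 0.8$, so the matrix formula should agree with the scalar shrinkage form $\zeta\bar{y}_1 + (1 - \zeta)\bar{y}$. Solving linear systems with `np.linalg.solve` rather than forming $V^{-1}$ explicitly is a numerical design choice.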
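Example: One-way model • Recall, $y_{it} = \mu_\alpha + \alpha_i + \varepsilon_{it}$. • Thus, $y_i = 1_i(\mu_\alpha + \alpha_i) + \varepsilon_i$, so that $X_i = 1_i$ and $V_i = \sigma_\alpha^2 J_i + \sigma^2 I_i$, where $1_i$ is a $T_i \times 1$ vector of ones, $J_i = 1_i 1_i'$ and $I_i$ is the $T_i \times T_i$ identity matrix. • With this, we note that $V_i^{-1}(y_i - X_i b_{GLS}) = \sigma^{-2}\left((y_i - 1_i m_{\alpha,GLS}) - \zeta_i 1_i(\bar{y}_i - m_{\alpha,GLS})\right)$. • Thus, for predicting $w = \mu_\alpha + \alpha_i$ we have $\lambda = 1$ and $\mathrm{Cov}(w, y_i) = \sigma_\alpha^2 1_i'$ for the ith subject, 0 otherwise. Thus, $w_{BLUP} = m_{\alpha,GLS} + \sigma_\alpha^2 1_i' V_i^{-1}(y_i - 1_i m_{\alpha,GLS}) = m_{\alpha,GLS} + \zeta_i(\bar{y}_i - m_{\alpha,GLS}) = \zeta_i\bar{y}_i + (1 - \zeta_i)\,m_{\alpha,GLS}$.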
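Random effect ANOVA model • For predicting residuals $\varepsilon_{it}$ we have $\lambda = 0$ and $\mathrm{Cov}(w, y_{js}) = \sigma_\varepsilon^2$ for the ith subject, tth time period ($j = i$, $s = t$), 0 otherwise. Here $\sigma_\varepsilon^2 = \mathrm{Var}\,\varepsilon_{it}$, written $\sigma^2$ in Section 4.2. • Let $1_{it}$ be a $T_i \times 1$ vector with a 1 in the tth position, 0 otherwise. Thus, • $e_{it,BLUP} = \sigma_\varepsilon^2 1_{it}' V_i^{-1}(y_i - 1_i m_{\alpha,GLS}) = y_{it} - (\zeta_i\bar{y}_i + (1 - \zeta_i)\,m_{\alpha,GLS})$ is our BLUP residual.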
4.4 Mixed model predictors • Recall the longitudinal data mixed model $y_i = Z_i\alpha_i + X_i\beta + \varepsilon_i$. • As described in Section 3.3, this is a special case of the mixed linear model. We use $V = \text{block diagonal}(V_1, \ldots, V_n)$, where $V_i = Z_i D Z_i' + R_i$, and $X = (X_1', \ldots, X_n')'$. • For BLUP calculations, note that $\mathrm{cov}_{wy} = (\mathrm{Cov}(w, y_1), \ldots, \mathrm{Cov}(w, y_n))$.
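Longitudinal data mixed model BLUP • Recall that the r.v. $w$ has mean $\mathrm{E}\,w = \lambda'\beta$ and $\mathrm{Var}\,w = \sigma_w^2$. • The BLUP is $w_{BLUP} = \lambda' b_{GLS} + \sum_{i=1}^n \mathrm{Cov}(w, y_i)V_i^{-1}(y_i - X_i b_{GLS})$. • The mean square error is $\mathrm{Var}(w_{BLUP} - w) = \left(\lambda' - \sum_i \mathrm{Cov}(w, y_i)V_i^{-1}X_i\right)\left(\sum_i X_i'V_i^{-1}X_i\right)^{-1}\left(\lambda' - \sum_i \mathrm{Cov}(w, y_i)V_i^{-1}X_i\right)' - \sum_i \mathrm{Cov}(w, y_i)V_i^{-1}\mathrm{Cov}(w, y_i)' + \sigma_w^2$.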
BLUP special cases • Global parameters and subject-specific effects. • Suppose that the interest is in predicting linear combinations of global parameters $\beta$ and subject-specific effects $\alpha_i$. • Consider linear combinations of the form $w = c_1'\alpha_i + c_2'\beta$. • Residuals. Here, $w = \varepsilon_{it}$. • Forecasts. Suppose that the ith subject is included in the data set; predict $y_{i,T_i+L}$ • for L lead time units in the future.
Predicting global parameters and subject-specific effects • Consider linear combinations of the form $w = c_1'\alpha_i + c_2'\beta$. • Straightforward calculations show that • $\mathrm{E}\,w = c_2'\beta$ so that $\lambda = c_2$, • $\mathrm{Cov}(w, y_j) = c_1' D Z_i'$ for $j = i$, • $\mathrm{Cov}(w, y_j) = 0$ for $j \ne i$. • Thus, $w_{BLUP} = c_2' b_{GLS} + c_1' D Z_i' V_i^{-1}(y_i - X_i b_{GLS})$.
Special case 1 • Take $c_2 = 0$. Because the mean and variance expressions are true for all vectors $c_1$, we may write this in vector notation to get the BLUP of $\alpha_i$, the vector $a_{i,BLUP} = D Z_i' V_i^{-1}(y_i - X_i b_{GLS})$. • This is unbiased in the sense that $\mathrm{E}\,(a_{i,BLUP} - \alpha_i) = 0$. • This predictor has minimum variance among all linear unbiased predictors (BLUP). • In the case of the error components model ($z_{it} = 1$), this reduces to $a_{i,BLUP} = \zeta_i(\bar{y}_i - \bar{x}_i' b_{GLS})$. • For comparison, recall the fixed effects parameter estimate, $a_i = \bar{y}_i - \bar{x}_i' b$, where $b$ is the fixed effects estimator of $\beta$.
Motivating BLUPs • We can also motivate BLUPs using normal theory: • Consider the case where $\alpha_i$ and $\varepsilon_i$ are multivariate normally distributed. • Then, it can be shown that $\mathrm{E}(\alpha_i \mid y_i) = D Z_i' V_i^{-1}(y_i - X_i\beta)$. • To motivate this, consider asking the question: what realization of $\alpha_i$ could be associated with $y_i$? The expectation! • The BLUP is the BLUE of $\mathrm{E}(\alpha_i \mid y_i)$. (That is, replace $\beta$ by $b_{GLS}$.)
Special case 2 • As another example, it is of interest to predict • Choose and • This yields • This predictor is of interest in actuarial science, where it is known as the credibility estimator.
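BLUP Residuals • Here, $w = \varepsilon_{it}$. Because $\mathrm{E}\,w = 0$, it follows that $\lambda = 0$. • Straightforward calculations show that • $\mathrm{Cov}(w, y_j) = \sigma_\varepsilon^2 1_{it}'$ for $j = i$ and • $\mathrm{Cov}(w, y_j) = 0$ for $j \ne i$. • Here, the symbol $1_{it}$ denotes a $T_i \times 1$ vector that has a “one” in the tth position and is zero otherwise. • Thus $e_{it,BLUP} = \sigma_\varepsilon^2 1_{it}' V_i^{-1}(y_i - X_i b_{GLS})$. • This can also be expressed as $e_{it,BLUP} = y_{it} - (z_{it}' a_{i,BLUP} + x_{it}' b_{GLS})$.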
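The subject-effect BLUP and the two forms of the BLUP residual can be checked numerically. The sketch below simulates a small error components data set and compares the matrix expressions with their scalar counterparts; the sample sizes, parameter values and random seed are hypothetical, and $R_i = \sigma_\varepsilon^2 I_i$ is assumed, as in the covariance expression above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 4                              # hypothetical: subjects, time periods
sigma2, sigma2_alpha = 1.5, 2.0          # Var eps_it (R_i = sigma2 * I) and Var alpha_i
D = np.array([[sigma2_alpha]])
beta_true = np.array([1.0, -0.5])

# Simulated data for the error components model (z_it = 1).
X = [rng.normal(size=(T, 2)) for _ in range(n)]
Z = [np.ones((T, 1)) for _ in range(n)]
y = [X[i] @ beta_true
     + Z[i] @ rng.normal(scale=np.sqrt(sigma2_alpha), size=1)
     + rng.normal(scale=np.sqrt(sigma2), size=T) for i in range(n)]
V = [Z[i] @ D @ Z[i].T + sigma2 * np.eye(T) for i in range(n)]

# GLS estimator assembled from subject-level sums (V is block diagonal).
A = sum(X[i].T @ np.linalg.solve(V[i], X[i]) for i in range(n))
c = sum(X[i].T @ np.linalg.solve(V[i], y[i]) for i in range(n))
b_gls = np.linalg.solve(A, c)

zeta = T * sigma2_alpha / (T * sigma2_alpha + sigma2)
a_blup, e_blup = [], []
for i in range(n):
    r = np.linalg.solve(V[i], y[i] - X[i] @ b_gls)   # V_i^{-1}(y_i - X_i b_GLS)
    a_blup.append(D @ Z[i].T @ r)                    # BLUP of alpha_i
    e_blup.append(sigma2 * r)                        # BLUP residuals for t = 1..T

# Compare with y_it - (z_it' a_i,BLUP + x_it' b_GLS) and zeta (ybar_i - xbar_i' b_GLS).
same_residuals = all(
    np.allclose(e_blup[i], y[i] - Z[i] @ a_blup[i] - X[i] @ b_gls) for i in range(n))
same_shrinkage = all(
    np.isclose(a_blup[i][0], zeta * (y[i].mean() - X[i].mean(axis=0) @ b_gls))
    for i in range(n))
print(same_residuals, same_shrinkage)
```

Both comparisons rest on the identity $y_i - X_i b_{GLS} - Z_i a_{i,BLUP} = R_i V_i^{-1}(y_i - X_i b_{GLS})$, which follows from $V_i - Z_i D Z_i' = R_i$.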
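Predicting future observations • Suppose that the ith subject is included in the data set; predict $y_{i,T_i+L} = z_{i,T_i+L}'\alpha_i + x_{i,T_i+L}'\beta + \varepsilon_{i,T_i+L}$ • for L lead time units in the future. • We will assume that $z_{i,T_i+L}$ and $x_{i,T_i+L}$ are known. • It follows that $\mathrm{E}\,y_{i,T_i+L} = x_{i,T_i+L}'\beta$, so that $\lambda = x_{i,T_i+L}$. • Straightforward calculations show that $\mathrm{Cov}(y_{i,T_i+L}, y_j) = z_{i,T_i+L}' D Z_i' + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)$ for $j = i$ and 0 for $j \ne i$. • Thus, the forecast of $y_{i,T_i+L}$ is $\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{i,T_i+L}' a_{i,BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)R_i^{-1}e_{i,BLUP}$, where $e_{i,BLUP} = y_i - Z_i a_{i,BLUP} - X_i b_{GLS}$ is the vector of BLUP residuals. • Thus, the forecast is the estimate of the conditional mean plus the serial correlation correction factor $\mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)R_i^{-1}e_{i,BLUP}$.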
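Predicting future observations • To illustrate, consider the special case where we have autoregressive of order 1 (AR(1)), serially correlated errors. • Thus, we have $R_i = \sigma^2\left(\rho^{|r-s|}\right)_{r,s = 1, \ldots, T_i}$, so that $\mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_{it}) = \sigma^2\rho^{T_i + L - t}$. • After some algebra, the L step forecast is $\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{i,T_i+L}' a_{i,BLUP} + \rho^L e_{iT_i,BLUP}$.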
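The “some algebra” behind the AR(1) forecast amounts to showing that $\mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)R_i^{-1}$ has $\rho^L$ in its last position and zeros elsewhere, so that only the most recent BLUP residual enters the correction. A short numerical check, assuming a stationary AR(1) process; the function name and parameter values are illustrative:

```python
import numpy as np

def ar1_correction_weights(T, L, rho, sigma2=1.0):
    """Return Cov(eps_{T+L}, eps_i) R_i^{-1} for stationary AR(1) errors."""
    idx = np.arange(1, T + 1)
    R = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])   # Var eps_i
    cov_future = sigma2 * rho ** (T + L - idx)                # Cov(eps_{T+L}, eps_t), t = 1..T
    # R is symmetric, so solving R w = cov' gives the transpose of cov R^{-1}.
    return np.linalg.solve(R, cov_future)

weights = ar1_correction_weights(T=6, L=3, rho=0.7)
print(np.round(weights, 6))
```

The covariance vector with the future error is $\rho^L$ times the last row of $R_i$, which is why every weight except the last vanishes.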
4.5 Bayesian Inference • With Bayesian statistical models, one views both the model parameters and the data as random variables. • We assume distributions for each type of random variable. • Given the parameters $\beta$ and $\alpha$, the response model is $y = Z\alpha + X\beta + \varepsilon$. • Specifically, we assume that the responses $y$ conditional on $\alpha$ and $\beta$ are normally distributed and that $\mathrm{E}(y \mid \alpha, \beta) = Z\alpha + X\beta$ and $\mathrm{Var}(y \mid \alpha, \beta) = R$. • Assume that $\alpha$ is distributed normally with mean $\mu_\alpha$ and variance $D$ and that $\beta$ is distributed normally with mean $\mu_\beta$ and variance $\Sigma_\beta$, each independent of the other.
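Distributions • The joint distribution of $(\alpha, \beta)$ is known as the prior distribution. • To summarize, the joint distribution of $(\alpha, \beta, y)$ is $\begin{pmatrix}\alpha \\ \beta \\ y\end{pmatrix} \sim N\left(\begin{pmatrix}\mu_\alpha \\ \mu_\beta \\ Z\mu_\alpha + X\mu_\beta\end{pmatrix}, \begin{pmatrix} D & 0 & DZ' \\ 0 & \Sigma_\beta & \Sigma_\beta X' \\ ZD & X\Sigma_\beta & V + X\Sigma_\beta X'\end{pmatrix}\right)$, • where $V = R + ZDZ'$.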
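Posterior Distribution • The distribution of parameters given the data is known as the posterior distribution. • The posterior distribution of $(\alpha, \beta)$ given $y$ is normal. • With $V = R + ZDZ'$, the conditional moments are $\mathrm{E}(\alpha \mid y) = \mu_\alpha + DZ'(V + X\Sigma_\beta X')^{-1}(y - Z\mu_\alpha - X\mu_\beta)$, $\mathrm{E}(\beta \mid y) = \mu_\beta + \Sigma_\beta X'(V + X\Sigma_\beta X')^{-1}(y - Z\mu_\alpha - X\mu_\beta)$ and $\mathrm{Var}\left(\begin{pmatrix}\alpha \\ \beta\end{pmatrix} \mid y\right) = \begin{pmatrix} D & 0 \\ 0 & \Sigma_\beta\end{pmatrix} - \begin{pmatrix} DZ' \\ \Sigma_\beta X'\end{pmatrix}(V + X\Sigma_\beta X')^{-1}\begin{pmatrix} ZD & X\Sigma_\beta\end{pmatrix}$.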
Relation with BLUPs • In longitudinal data applications, one typically has more information about the global parameters $\beta$ than subject-specific parameters $\alpha$. • Consider first the case $\Sigma_\beta = 0$ (zero prior variance for $\beta$), so that $\beta = \mu_\beta$ with probability one. • Intuitively, this means that $\beta$ is precisely known, generally from collateral information. • Assuming that $\mu_\alpha = 0$, it is easy to check that the best linear unbiased estimator (BLUE) of $\mathrm{E}(\alpha \mid y)$ is $a_{BLUP} = DZ'V^{-1}(y - Xb_{GLS})$. • Recall from equation (4.11) that $a_{BLUP}$ is also the best linear unbiased predictor in the frequentist (non-Bayesian) model framework.
Relation with BLUPs • Consider second the case where $\Sigma_\beta^{-1} = 0$ (zero prior precision for $\beta$). • In this case, prior information about the parameter $\beta$ is vague; this is known as using a diffuse prior. • Assuming $\mu_\alpha = 0$, one can show that $\mathrm{E}(\alpha \mid y) = a_{BLUP}$. • It is interesting that in both extreme cases, we arrive at the statistic $a_{BLUP}$ as a predictor of $\alpha$. • This analysis assumes $D$ and $R$ are matrices of fixed parameters. • It is also possible to assume distributions for these parameters; typically, independent Wishart distributions are used for $D^{-1}$ and $R^{-1}$ as these are conjugate priors. • The general strategy of substituting point estimates for certain parameters in a posterior distribution is called empirical Bayes estimation.
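The diffuse-prior limit can be illustrated numerically by taking $\Sigma_\beta = \kappa I$ with a large $\kappa$ and comparing the posterior mean of $\alpha$ with $a_{BLUP}$. This is a minimal sketch under stated assumptions: the dimensions, variance values, responses, the prior mean for $\beta$ and the choice $\kappa = 10^8$ are all hypothetical, and $\mu_\alpha = 0$ as in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 4, 3                                   # hypothetical: subjects, time periods
N = n * T
Z = np.kron(np.eye(n), np.ones((T, 1)))       # N x n, one random intercept per subject
X = np.column_stack([np.ones(N), rng.normal(size=N)])
D = 2.0 * np.eye(n)                           # Var alpha (assumed value)
R = 1.0 * np.eye(N)                           # Var (y | alpha, beta) (assumed value)
V = R + Z @ D @ Z.T
y = rng.normal(size=N) + 3.0                  # hypothetical responses

def posterior_mean_alpha(mu_beta, Sigma_beta):
    """E(alpha | y) when mu_alpha = 0, from the conditional moments."""
    M = V + X @ Sigma_beta @ X.T
    return D @ Z.T @ np.linalg.solve(M, y - X @ mu_beta)

# Frequentist BLUP of alpha.
b_gls = np.linalg.solve(X.T @ np.linalg.solve(V, X), X.T @ np.linalg.solve(V, y))
a_blup = D @ Z.T @ np.linalg.solve(V, y - X @ b_gls)

# Diffuse prior: very large prior variance, arbitrary prior mean for beta.
a_diffuse = posterior_mean_alpha(np.array([5.0, -3.0]), 1e8 * np.eye(2))
# Known beta (Sigma_beta = 0), with the unknown prior mean replaced by b_GLS.
a_known = posterior_mean_alpha(b_gls, np.zeros((2, 2)))

print(np.max(np.abs(a_diffuse - a_blup)), np.max(np.abs(a_known - a_blup)))
```

As $\kappa$ grows, the prior mean $\mu_\beta$ stops mattering and the posterior mean approaches $a_{BLUP}$; in the first extreme case the two agree once $\mu_\beta$ is replaced by $b_{GLS}$.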
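Example – One-way random effects ANOVA model • For the balanced model $y_{it} = \beta + \alpha_i + \varepsilon_{it}$ ($T_i = T$) with $\mu_\alpha = 0$, $D = \sigma_\alpha^2 I_n$, $R = \sigma^2 I_N$ and prior $\beta \sim N(\mu_\beta, \sigma_\beta^2)$, the posterior means turn out to be $\mathrm{E}(\beta \mid y) = \zeta_\beta\bar{y} + (1 - \zeta_\beta)\mu_\beta$ and $\mathrm{E}(\alpha_i \mid y) = \zeta\left((\bar{y}_i - \mu_\beta) - \zeta_\beta(\bar{y} - \mu_\beta)\right)$, • where $\zeta = T\sigma_\alpha^2/(T\sigma_\alpha^2 + \sigma^2)$ and $\zeta_\beta = n\zeta\sigma_\beta^2/(\sigma_\alpha^2 + n\zeta\sigma_\beta^2)$. • Note that $\zeta_\beta$ measures the precision of knowledge about $\beta$. Specifically, we see that $\zeta_\beta$ approaches one as $\sigma_\beta^2 \to \infty$, and approaches zero as $\sigma_\beta^2 \to 0$.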
4.6 Wisconsin Lottery Sales • T = 40 weeks of sales from n = 50 zip codes
Lottery Sales Data Analysis • Cross-sectional analysis shows that population size heavily influences sales, with Kenosha as an outlier • Multiple time series plots • show the effect of jackpots that is common to all postal codes • show the heterogeneity among postal codes (reaffirmed by a pooling test) • show the heteroscedasticity that is accommodated through a logarithmic transformation
Lottery Sales Model Selection • In-sample results show that • The one-way error components model dominates pooled cross-sectional models. • An AR(1) error specification significantly improves the fit. • The best model is probably the two-way error components model, with an AR(1) error specification (not yet documented). • Out-of-sample analysis suggests that • logarithmic sales is the preferred choice of response; it outperforms sales and percentage change.
4.7 What is Credibility? • Hickman’s (1975) Analogy • In politics, leaders begin with a reservoir of credibility which decreases as executive experience is compiled. • Insurance behaves in a reverse fashion! • Here, credibility increases as experience increases.
Credibility Theory • Credibility is a technique for predicting future expected claims for a risk class, given past claims of that and related risk classes. • Importance • Credibility is widely used for pricing property and casualty, workers’ compensation and health care coverages. • According to Rodermund (1989), “the concept of credibility has been the casualty actuaries’ most important and enduring contribution to casualty actuarial science.”
History • Mowbray (1914 - PCAS) • Asked the question, “how extensive is an exposure necessary to give a dependable pure premium?” • This approach is now known as the “limited fluctuation” or “American” credibility • Question 1 – do we have enough exposure to give full weight to the risk class under consideration? • Question 2 – if not, how can we combine information from this and related risk classes?
More History • Whitney (1918 - PCAS) • introduced the idea of using a weighted average of average claims of (1) a given risk class and (2) all risk classes. • The weight is known as the credibility factor. • It is of the form New Premium = Z × Claims Experience + (1 – Z) × Old Premium.
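Example - Balanced Bühlmann • Consider the model $y_{it} = \mu_\alpha + \alpha_i + \varepsilon_{it}$, with $T$ observations per risk class. • The credibility factor is $\zeta = T\sigma_\alpha^2/(T\sigma_\alpha^2 + \sigma^2)$. • The traditional credibility estimator is $\bar{y}_{i,BLUP} = \zeta\bar{y}_i + (1 - \zeta)\bar{y}$.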
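The balanced credibility formulas are simple enough to sketch as two small functions. The function names and the variance values below are illustrative assumptions, not part of the slides; the second call uses the rounded factor 0.825 and the averages 8 and 11 from the hypothetical three-class example.

```python
def credibility_factor(T, s2_alpha, s2):
    """zeta = T s2_alpha / (T s2_alpha + s2) for a balanced design."""
    return T * s2_alpha / (T * s2_alpha + s2)

def credibility_premium(class_mean, overall_mean, zeta):
    """Weighted average of own experience and overall experience."""
    return zeta * class_mean + (1 - zeta) * overall_mean

# Credibility grows with the amount of experience T (assumed variance components).
factors = [credibility_factor(T, s2_alpha=2.0, s2=10.0) for T in (1, 5, 25, 125)]
print([round(z, 3) for z in factors])

# A class averaging 8 against an overall mean of 11, with zeta rounded to 0.825.
premium = credibility_premium(8.0, 11.0, 0.825)
print(round(premium, 3))
```

This mirrors Hickman's analogy: the factor rises toward one as experience accumulates, so a class with a long history is rated almost entirely on its own claims.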
Example • Hypothetical Claims for Three Towns
Town   Claims           Average Claim
1      14, 12, 10, 12   $\bar{y}_1 = 12$
2      9, 16, 15, 12    $\bar{y}_2 = 13$
3      8, 10, 7, 7      $\bar{y}_3 = 8$
• Are there real differences among towns? • Mowbray - does Town 3 have enough data to support its own estimator of pure premiums? • Whitney - how can I use the information in Towns 1 and 2 to help determine my rate for Town 3?
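Response to Whitney • Known as the “shrinkage” effect. • Comparison of Subject-Specific Means to Credibility Estimators. • Subject-specific means: 8 (Town 3), 12 (Town 1), 13 (Town 2), with overall mean 11. • Credibility estimators: 8.525 (Town 3), 11.825 (Town 1), 12.650 (Town 2).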
Why study credibility theory? • Long history of applications – “a business necessity” • More recently, many theoretical advances with fewer innovative applications • Credibility techniques required in legal statutes and standards of practice • Standard of Practice 25 by the Actuarial Standards Board of the American Academy of Actuaries • Wisconsin statutes on credibility insurance and disability income • Advanced techniques are critical for keeping up with competition (health insurance – health economists) • Innovative techniques enhance the “credibility” of the profession