Hypothesis tests for slopes in the multiple linear regression model
Using the general linear test and sequential sums of squares
Study on heart attacks in rabbits
• An experiment in 32 anesthetized rabbits subjected to an infarction ("heart attack")
• Three experimental groups:
  • Hearts cooled to 6° C within 5 minutes of the artery being occluded ("early cooling")
  • Hearts cooled to 6° C within 25 minutes of the artery being occluded ("late cooling")
  • Hearts not cooled at all ("no cooling")
Study on heart attacks in rabbits
• Measurements made at the end of the experiment:
  • Size of the infarct area (in grams)
  • Size of the region at risk for infarction (in grams)
• Primary research question:
  • Does the mean size of the infarcted area differ among the three treatment groups (no cooling, early cooling, late cooling) when controlling for the size of the region at risk for infarction?
A potential regression model

yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi

where …
• yi is the size of the infarcted area (in grams) of rabbit i
• xi1 is the size of the region at risk (in grams) of rabbit i
• xi2 = 1 if early cooling of rabbit i, 0 if not
• xi3 = 1 if late cooling of rabbit i, 0 if not
and the independent error terms εi follow a normal distribution with mean 0 and equal variance σ².
The estimated regression function

The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3
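For readers who want to reproduce this fit, here is a minimal sketch in Python using statsmodels. The file name coolhearts.csv and its column names are assumptions for illustration only; the actual data layout may differ.

```python
# Minimal sketch of fitting the rabbit infarct model (assumed data layout).
# "coolhearts.csv" and its column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

rabbits = pd.read_csv("coolhearts.csv")  # columns: InfSize, AreaSize, X2, X3
fit = smf.ols("InfSize ~ AreaSize + X2 + X3", data=rabbits).fit()

print(fit.params)  # coefficients: should be close to -0.135, 0.613, -0.243, -0.0657
print(fit.fvalue)  # overall F statistic for the full model
```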
Possible hypothesis tests for slopes

#1. Is the regression model containing all three predictors useful in predicting the size of the infarct?

#2. Is the size of the infarct significantly (linearly) related to the area of the region at risk?
Possible hypothesis tests for slopes #3. (Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment after controlling for the size of the region at risk for infarction?
Three basic steps
• Define a (larger) full model.
• Define a (smaller) reduced model.
• Use an F statistic to decide whether or not to reject the smaller reduced model in favor of the larger full model.
The full model

The full model (or unrestricted model) is the model thought to be most appropriate for the data. For simple linear regression, the full model is:

yi = β0 + β1xi + εi
The reduced model

The reduced model (or restricted model) is the model described by the null hypothesis H0. For simple linear regression, the null hypothesis is H0: β1 = 0. Therefore, the reduced model is:

yi = β0 + εi
The general linear test approach
• "Fit the full model" to the data.
  • Obtain the least squares estimates of β0 and β1.
  • Determine the error sum of squares, "SSE(F)."
• "Fit the reduced model" to the data.
  • Obtain the least squares estimate of β0.
  • Determine the error sum of squares, "SSE(R)."
The general linear test approach
• Compare SSE(R) and SSE(F).
  • SSE(R) is always larger than (or the same as) SSE(F).
  • If SSE(F) is close to SSE(R), then the variation around the fitted full model regression function is almost as large as the variation around the fitted reduced model regression function.
  • If SSE(F) and SSE(R) differ greatly, then the additional parameter(s) in the full model substantially reduce the variation around the fitted regression function.
How close is close?

The test statistic is a function of SSE(R) - SSE(F):

F* = [ (SSE(R) - SSE(F)) / (dfR - dfF) ] / [ SSE(F) / dfF ]

The degrees of freedom (dfR and dfF) are those associated with the reduced and full model error sums of squares, respectively. Reject H0 if F* is large (or if the P-value is small).
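As a computational aid (not part of the original slides), here is a minimal Python sketch of this statistic, assuming SciPy is available for the F distribution:

```python
# Sketch: general linear test statistic F* and its P-value.
from scipy import stats

def general_linear_test(sse_r, sse_f, df_r, df_f):
    """sse_r, df_r: error SS and error df of the reduced model;
    sse_f, df_f: error SS and error df of the full model."""
    f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
    p_value = stats.f.sf(f_star, df_r - df_f, df_f)  # upper-tail probability
    return f_star, p_value
```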
But for simple linear regression, it’s just the same F test as before
The formal F-test for the slope parameter β1

Null hypothesis: H0: β1 = 0
Alternative hypothesis: HA: β1 ≠ 0
Test statistic: F* = MSR / MSE
P-value: the probability that we would get an F* statistic as large as the one observed, if the null hypothesis were true. The P-value is determined by comparing F* to an F distribution with 1 numerator degree of freedom and n - 2 denominator degrees of freedom.
Example: Alcoholism and muscle strength?
• Report on strength tests for a sample of 50 alcoholic men
• x = total lifetime dose of alcohol (kg per kg of body weight)
• y = strength of the deltoid muscle in the man's non-dominant arm
The ANOVA table

Analysis of Variance
Source          DF        SS        MS        F      P
Regression       1    504.04   504.040  33.5899  0.000
Error           48    720.27    15.006
Total           49   1224.32

Here SSE(R) = SSTO = 1224.32 and SSE(F) = SSE = 720.27.
There is a statistically significant linear association between alcoholism and arm strength.
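Plugging the values from this table into the general_linear_test sketch above (SSE(R) = SSTO = 1224.32 with 49 df, SSE(F) = 720.27 with 48 df) reproduces the reported F statistic:

```python
f_star, p_value = general_linear_test(1224.32, 720.27, 49, 48)
print(round(f_star, 2), round(p_value, 4))  # about 33.59 and 0.0000
```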
Another aside: sequential (or extra) sums of squares
What is a sequential sum of squares?
It can be viewed in either of two ways:
• It is the reduction in the error sum of squares (SSE) when one or more predictor variables are added to the model.
• Or, it is the increase in the regression sum of squares (SSR) when one or more predictor variables are added to the model.
Notation
• The error sum of squares (SSE) and the regression sum of squares (SSR) depend on which predictors are in the model, so the notation records the variables in the model:
  • SSE(X1) denotes the error sum of squares when X1 is the only predictor in the model
  • SSR(X1, X2) denotes the regression sum of squares when X1 and X2 are both in the model
Notation
• The sequential sum of squares of adding:
  • X2 to the model in which X1 is the only predictor is denoted SSR(X2 | X1)
  • X1 to the model in which X2 is the only predictor is denoted SSR(X1 | X2)
  • X1 to the model in which X2 and X3 are predictors is denoted SSR(X1 | X2, X3)
  • X1 and X2 to the model in which X3 is the only predictor is denoted SSR(X1, X2 | X3)
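To make the notation concrete, here is a small Python sketch that computes a sequential sum of squares directly from two nested fits; the data below are synthetic, generated only so the example runs.

```python
# Sketch: SSR(X2 | X1) from two nested least-squares fits (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=69), "x2": rng.normal(size=69)})
df["y"] = 4 + 0.03 * df["x1"] + 0.03 * df["x2"] + rng.normal(scale=0.7, size=69)

fit_x1 = smf.ols("y ~ x1", data=df).fit()          # model with X1 only
fit_x1_x2 = smf.ols("y ~ x1 + x2", data=df).fit()  # model with X1 and X2

# In statsmodels, .ssr is the residual (error) sum of squares, i.e. SSE.
# SSR(X2 | X1) = SSE(X1) - SSE(X1, X2)  (reduction in SSE)
#             = SSR(X1, X2) - SSR(X1)   (increase in regression SS)
seq_ss = fit_x1.ssr - fit_x1_x2.ssr
print(seq_ss)
```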
Allen Cognitive Level (ACL) Study
• David and Riley (1990) investigated the relationship of the ACL test to the level of psychopathology in a set of 69 patients in a hospital psychiatry unit:
  • Response y = ACL score
  • x1 = vocabulary (Vocab) score on the Shipley Institute of Living Scale
  • x2 = abstraction (Abstract) score on the Shipley Institute of Living Scale
  • x3 = score on the Symbol-Digit Modalities Test (SDMT)
Regress y = ACL on x1 = Vocab

The regression equation is
ACL = 4.23 + 0.0298 Vocab
...
Analysis of Variance
Source          DF        SS       MS      F      P
Regression       1    2.6906   2.6906   4.47  0.038
Residual Error  67   40.3590   0.6024
Total           68   43.0496
Regress y = ACL on x1 = Vocab and x3 = SDMT

The regression equation is
ACL = 3.85 - 0.0068 Vocab + 0.0298 SDMT
...
Analysis of Variance
Source          DF        SS       MS       F      P
Regression       2   11.7778   5.8889   12.43  0.000
Residual Error  66   31.2717   0.4738
Total           68   43.0496

Source    DF   Seq SS
Vocab      1   2.6906
SDMT       1   9.0872
The sequential sum of squares SSR(X3 | X1)

SSR(X3 | X1) is the reduction in the error sum of squares when X3 is added to the model in which X1 is the only predictor:

SSR(X3 | X1) = SSE(X1) - SSE(X1, X3)
The sequential sum of squares SSR(X3 | X1)

SSR(X3 | X1) is the increase in the regression sum of squares when X3 is added to the model in which X1 is the only predictor:

SSR(X3 | X1) = SSR(X1, X3) - SSR(X1)
The sequential sum of squares SSR(X3 | X1)

The regression equation is
ACL = 3.85 - 0.0068 Vocab + 0.0298 SDMT
...
Analysis of Variance
Source          DF        SS       MS       F      P
Regression       2   11.7778   5.8889   12.43  0.000
Residual Error  66   31.2717   0.4738
Total           68   43.0496

Source    DF   Seq SS
Vocab      1   2.6906
SDMT       1   9.0872
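Worked check using this output and the earlier Vocab-only fit: SSR(X3 | X1) = SSE(X1) - SSE(X1, X3) = 40.3590 - 31.2717 = 9.0873, or equivalently SSR(X1, X3) - SSR(X1) = 11.7778 - 2.6906 = 9.0872. Up to rounding, both match the "Seq SS" of 9.0872 reported for SDMT.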
Regress y = ACL on x3 = SDMT
(The order in which predictors are added determines the "Seq SS" you get.)

The regression equation is
ACL = 3.75 + 0.0281 SDMT
...
Analysis of Variance
Source          DF       SS       MS       F      P
Regression       1   11.680   11.680   24.95  0.000
Residual Error  67   31.370    0.468
Total           68   43.050
Regress y = ACL on x3 = SDMT and x1 = Vocab
(The order in which predictors are added determines the "Seq SS" you get.)

The regression equation is
ACL = 3.85 + 0.0298 SDMT - 0.0068 Vocab
...
Analysis of Variance
Source          DF        SS       MS       F      P
Regression       2   11.7778   5.8889   12.43  0.000
Residual Error  66   31.2717   0.4738
Total           68   43.0496

Source    DF   Seq SS
SDMT       1  11.6799
Vocab      1   0.0979
The sequential sum of squares SSR(X1 | X3)

SSR(X1 | X3) is the reduction in the error sum of squares when X1 is added to the model in which X3 is the only predictor:

SSR(X1 | X3) = SSE(X3) - SSE(X1, X3)
The sequential sum of squares SSR(X1 | X3)

SSR(X1 | X3) is the increase in the regression sum of squares when X1 is added to the model in which X3 is the only predictor:

SSR(X1 | X3) = SSR(X1, X3) - SSR(X3)
Regress y = ACL on x3 = SDMT and x1 = Vocab
(The order in which predictors are added determines the "Seq SS" you get.)

The regression equation is
ACL = 3.85 + 0.0298 SDMT - 0.0068 Vocab
...
Analysis of Variance
Source          DF        SS       MS       F      P
Regression       2   11.7778   5.8889   12.43  0.000
Residual Error  66   31.2717   0.4738
Total           68   43.0496

Source    DF   Seq SS
SDMT       1  11.6799
Vocab      1   0.0979
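Worked check: SSR(X1 | X3) = SSE(X3) - SSE(X1, X3) = 31.370 - 31.2717 ≈ 0.098, or equivalently SSR(X1, X3) - SSR(X3) = 11.7778 - 11.680 ≈ 0.098. Up to the rounding in the SDMT-only output, both match the "Seq SS" of 0.0979 reported for Vocab.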
More sequential sums of squares (regress y on x3, x1, x2)

The regression equation is
ACL = 3.95 + 0.0274 SDMT - 0.0174 Vocab + 0.0122 Abstract
...
Analysis of Variance
Source          DF        SS       MS      F      P
Regression       3   12.3009   4.1003   8.67  0.000
Residual Error  65   30.7487   0.4731
Total           68   43.0496

Source    DF   Seq SS
SDMT       1  11.6799
Vocab      1   0.0979
Abstract   1   0.5230
Two- (or three- or more-) degree-of-freedom sequential sums of squares

The regression equation is
ACL = 3.95 + 0.0274 SDMT - 0.0174 Vocab + 0.0122 Abstract
...
Analysis of Variance
Source          DF        SS       MS      F      P
Regression       3   12.3009   4.1003   8.67  0.000
Residual Error  65   30.7487   0.4731
Total           68   43.0496

Source    DF   Seq SS
SDMT       1  11.6799
Vocab      1   0.0979
Abstract   1   0.5230
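For example, the two-degree-of-freedom sequential sum of squares for adding Vocab and Abstract to the model containing only SDMT is the sum of their one-degree-of-freedom Seq SS values: SSR(X1, X2 | X3) = 0.0979 + 0.5230 = 0.6209, which (up to rounding) equals SSR(X1, X2, X3) - SSR(X3) = 12.3009 - 11.680.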
Possible hypothesis tests for slopes

#1. Is the regression model containing all three predictors useful in predicting the size of the infarct?

#2. Is the size of the infarct significantly (linearly) related to the area of the region at risk?
Possible hypothesis tests for slopes #3. (Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?
Testing that all slope parameters are 0

The null hypothesis is H0: β1 = β2 = β3 = 0.

Full model: yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi
Reduced model: yi = β0 + εi
Testing that all slope parameters are 0

The general linear test statistic:

F* = [ (SSE(R) - SSE(F)) / (dfR - dfF) ] / [ SSE(F) / dfF ]

becomes the usual overall F-test:

F* = [ SSR / (p - 1) ] / [ SSE / (n - p) ] = MSR / MSE

since, for this test, SSE(R) = SSTO, SSE(F) = SSE, and SSE(R) - SSE(F) = SSR.
Testing that all slope parameters are 0

Use the overall F-test and P-value reported in the ANOVA table.

The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3
...
Analysis of Variance
Source          DF        SS        MS       F      P
Regression       3   0.95927   0.31976   16.43  0.000
Residual Error  28   0.54491   0.01946
Total           31   1.50418
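As a final check, plugging the full-model ANOVA values into the general_linear_test sketch from earlier (SSE(R) = SSTO = 1.50418 with 31 df, SSE(F) = 0.54491 with 28 df) reproduces the reported overall F statistic:

```python
f_star, p_value = general_linear_test(1.50418, 0.54491, 31, 28)
print(round(f_star, 2), round(p_value, 4))  # about 16.43 and 0.0000
```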