A few more important topics in regression analysis: General linear test approach and descriptive measures of linear association
Example: Alcoholism and Muscle strength? • Urbano-Marquez (1989) report on strength tests for a sample of 50 alcoholic men • X = total lifetime dose of alcohol (kg per kg of body weight) • Y = strength of deltoid muscle in man’s non-dominant arm (determined by taking 5 measurements over a 20-minute period using an electronic myometer)
General linear test approach • Another way of looking at the F test for testing H0: β1 = 0 versus HA: β1 ≠ 0. • It's more general: in multiple regression (more than one predictor variable), it allows us to test whether any subset of the slope parameters is 0. • But the approach is easiest to understand when applied to the simple linear regression model.
The Full Model The full model (or unrestricted model) is the model that is thought to be most appropriate for the data. For the simple linear regression case, the full model is:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
The Reduced Model The reduced model (or restricted model) is the model described by the null hypothesis H0. For the simple linear regression case, the reduced model (H0: β1 = 0) is:
$$Y_i = \beta_0 + \varepsilon_i$$
General idea of the general linear test approach • “Fit the full model” (obtain LS estimates of β0 and β1) to the data. Determine the error sum of squares – call it SSE(F). • “Fit the reduced model” (obtain LS estimate of β0) to the data. Determine the error sum of squares – call it SSE(R).
General idea of the general linear test approach (cont’d) • Compare SSE(R) and SSE(F). • It can be shown that SSE(F) is never greater than SSE(R). • If SSE(F) is “close to” SSE(R), then the variation around the full model regression line is almost as great as the variation around the reduced model regression line. • Conclude that the reduced model (H0 holds) does a good enough job of describing the data.
General idea of the general linear test approach (cont’d) • On the other hand, if SSE(F) and SSE(R) differ greatly, the additional parameter(s) in the full model substantially reduce the variation of the observations around the fitted regression line. • Conclude that the full model (HA holds) does a better job of describing the data.
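The two fitting steps can be sketched in a few lines of Python. This is only an illustration: the data values and the helper names (`fit_full`, `sse`) are made up here and are not the study's measurements. It fits the full model by least squares, fits the reduced model (whose least-squares estimate of β0 is just the sample mean of Y), and compares the two error sums of squares.

```python
# Hypothetical data, invented for illustration only (not the study's data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.8, 9.1, 8.7, 7.9, 7.6, 6.8]


def mean(values):
    return sum(values) / len(values)


def fit_full(x, y):
    """Least-squares estimates (b0, b1) for the full model Y = b0 + b1*X + error."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(y, fitted):
    """Error sum of squares: squared deviations of Y around the fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


# Full model: fitted values lie on the least-squares line.
b0, b1 = fit_full(x, y)
sse_full = sse(y, [b0 + b1 * xi for xi in x])

# Reduced model (beta1 = 0): the fitted value is the sample mean for every point.
sse_reduced = sse(y, [mean(y)] * len(y))

print("SSE(F) =", round(sse_full, 4))
print("SSE(R) =", round(sse_reduced, 4))
```

For any data set, `sse_full` should come out no larger than `sse_reduced`, because the reduced model is a special case of the full model.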
How close is close? The test statistic is a function of SSE(R) − SSE(F):
$$F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F}$$
The degrees of freedom (dfR and dfF) are those associated with the reduced and full model error sums of squares, respectively. A large F* leads to rejecting the null H0 in favor of the alternative HA.
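But for the simple linear regression model, it’s the same F test as before. Here SSE(R) = SSTO with dfR = n − 1, and SSE(F) = SSE with dfF = n − 2, so
$$F^* = \frac{SSTO - SSE}{(n-1) - (n-2)} \div \frac{SSE}{n-2} = \frac{SSR}{1} \div \frac{SSE}{n-2} = \frac{MSR}{MSE}$$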
Example: ANOVA Table

Analysis of Variance
Source       DF        SS        MS         F      P
Regression    1    504.04   504.040   33.5899  0.000
Error        48    720.27    15.006
Total        49   1224.32

SSE(R) = SSTO and SSE(F) = SSE. There is a statistically significant linear association between alcoholism and arm strength.
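As a quick check, the sums of squares and degrees of freedom from the ANOVA table can be plugged into the general linear test statistic. The function name `general_linear_f` is just a label chosen for this sketch; the formula it uses is F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] ÷ [SSE(F) / dfF].

```python
def general_linear_f(sse_reduced, sse_full, df_reduced, df_full):
    """General linear test statistic from the two error sums of squares."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Values from the ANOVA table: SSE(R) = SSTO (49 df), SSE(F) = SSE (48 df).
f_star = general_linear_f(sse_reduced=1224.32, sse_full=720.27,
                          df_reduced=49, df_full=48)
print(round(f_star, 2))
```

The result agrees with the table's F of 33.5899 to two decimal places; the small difference in later digits arises because the table's sums of squares are rounded.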
Coefficient of determination • R² is a number between 0 and 1, inclusive. • An R² of 1 means all of the data points fall perfectly on the regression line. Predictor X accounts for all of the variation in Y. • An R² of 0 means the fitted regression line is perfectly horizontal. Predictor X accounts for none of the variation in Y. • Interpretation: “(R²) × 100 percent of the variation in Y is explained by the variation in the predictor X.” (But not in the sense that the change in X causes the change in Y!)
Correlation coefficient • r = ±√R²: the coefficient of correlation gets a plus sign if the slope of the fitted regression line is positive and a negative sign if the slope is negative. • So, r is a number between -1 and 1, inclusive. • r = -1 implies a perfect negative linear relationship • r = 1 implies a perfect positive linear relationship • r = 0 implies no linear relationship
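The link between the two measures can be checked numerically. The sketch below uses invented data and assumes the usual definition R² = 1 − SSE/SSTO for the simple linear regression fit. It computes R², attaches the sign of the fitted slope to its square root, and compares the result with the Pearson correlation computed directly. The function names are illustrative only.

```python
import math

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.8, 9.1, 8.7, 7.9, 7.6, 6.8]


def r_squared_and_r(x, y):
    """Return (R^2, r), where r takes the sign of the fitted slope b1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - sse / ssto
    # Guard against a tiny negative value caused by floating-point rounding.
    r = math.copysign(math.sqrt(max(r2, 0.0)), b1)
    return r2, r


def pearson_r(x, y):
    """Pearson correlation computed directly from sums of cross-products."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


r2, r = r_squared_and_r(x, y)
print("R^2 =", round(r2, 4))
print("r from R^2 and slope sign =", round(r, 4))
print("Pearson r computed directly =", round(pearson_r(x, y), 4))
```

Both routes to r should agree up to floating-point error, and r is negative here because the invented Y values fall as X rises.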
[Figure: scatterplot with labeled points for Norway, Finland, U.S., Italy, and France] R² = 70.1% and r = -0.84
Cautions about using R² and r • A correlation coefficient is a measure of linear association. It is possible to get r = 0 with a perfect curvilinear relationship. • A large correlation coefficient does not necessarily imply that the estimated regression line fits the data well.
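A small example makes the first caution concrete. The data below are constructed for illustration: Y is exactly X², so the relationship is perfect but curved, and the x-values are symmetric about zero. The helper `pearson_r` is a name chosen for this sketch.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed directly from sums of cross-products."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Constructed example: Y is a perfect (quadratic) function of X.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

r_curved = pearson_r(x, y)
print("r =", r_curved)
```

Here r comes out as zero even though Y is completely determined by X, because the positive and negative cross-products cancel.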
[Figure: scatterplot with one game's data point highlighted] R² = 25% and r = 0.50. With this game removed, R² = 7.8% and r = 0.28.
Caution about using R² and r • The value of the correlation coefficient can be greatly affected by one (outlying) data point.
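The effect of a single point can be reproduced with made-up numbers. In the sketch below (hypothetical data, not the data behind any figure in these slides), six points show only a weak association, and adding one extreme point pushes r close to 1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed directly from sums of cross-products."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: six points with a weak association.
x_base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_base = [3.0, 1.0, 4.0, 2.0, 5.0, 2.0]

# The same data with one outlying point added at (15, 14).
x_out = x_base + [15.0]
y_out = y_base + [14.0]

r_without = pearson_r(x_base, y_base)
r_with = pearson_r(x_out, y_out)
print("r without the outlying point:", round(r_without, 2))
print("r with the outlying point:   ", round(r_with, 2))
```

With these invented values the correlation is weak without the extra point and strong with it, which is why a scatterplot should always accompany r and R².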
Caution about using R² and r • A large correlation coefficient does not necessarily mean that useful predictions can be made: it’s still possible to get wide prediction intervals. • Don’t put too much weight on just one summary measure. Look at the whole picture, including the underlying science.