Lecture 8: ANOVA tables F-tests

Lecture 8: ANOVA tables, F-tests
BMTRY 701 Biostatistical Methods II

ANOVA
• Analysis of Variance
• Similar in derivation to the ANOVA that generalizes the two-sample t-test
• Partitions the variance into several parts
  • that due to the ‘model’: SSR
  • that due to ‘error’: SSE
• The sum of the two parts is the total sum of squares: SST = SSR + SSE

Total Deviations: $Y_i - \bar{Y}$

Regression Deviations: $\hat{Y}_i - \bar{Y}$

Error Deviations: $Y_i - \hat{Y}_i$

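Definitions
• $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
• $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
• $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$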

Example: logLOS ~ BEDS
> ybar <- mean(data$logLOS)
> yhati <- reg$fitted.values
> sst <- sum((data$logLOS - ybar)^2)
> ssr <- sum((yhati - ybar)^2)
> sse <- sum((data$logLOS - yhati)^2)
>
> sst
[1] 3.547454
> ssr
[1] 0.6401715
> sse
[1] 2.907282
> sse+ssr
[1] 3.547454

Degrees of Freedom
• Degrees of freedom for SST: n - 1
  • one df is lost because it is used to estimate the mean of Y
• Degrees of freedom for SSR: 1
  • only one df because all estimates are based on the same fitted regression line
• Degrees of freedom for SSE: n - 2
  • two lost due to estimating the regression line (slope and intercept)

Mean Squares
• “Scaled” version of Sum of Squares
• Mean Square = SS/df
  • MSR = SSR/1
  • MSE = SSE/(n-2)
• Notes:
  • mean squares are not additive! That is, MSR + MSE ≠ SST/(n-1)
  • MSE is the same estimate of σ² that we saw previously
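The mean squares for the logLOS ~ BEDS example can be recomputed by hand from the sums of squares printed above. This is a minimal sketch: the sample size n = 113 is not stated on the slides and is inferred here from the 111 residual degrees of freedom in the ANOVA table.

```r
# Sums of squares copied from the logLOS ~ BEDS example
ssr <- 0.6401715
sse <- 2.907282
sst <- ssr + sse

# Assumption: n is inferred from the residual df (n - 2 = 111)
n <- 111 + 2

msr <- ssr / 1          # regression mean square
mse <- sse / (n - 2)    # error mean square
mst <- sst / (n - 1)    # total sum of squares divided by its df

f_star <- msr / mse
print(c(MSR = msr, MSE = mse, F = f_star))

# Sums of squares add, mean squares do not
print(c(MSR_plus_MSE = msr + mse, SST_over_df = mst))
```

The ratio MSR/MSE reproduces, up to rounding, the F value reported by anova(reg).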

Standard ANOVA Table
Source       SS    df     MS
Regression   SSR   1      MSR = SSR/1
Error        SSE   n-2    MSE = SSE/(n-2)
Total        SST   n-1

ANOVA for logLOS ~ BEDS
> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619

Inference?
• What is of interest and how do we interpret?
• We’d like to know if BEDS is related to logLOS.
• How do we do that using the ANOVA table?
• We need to know the expected values of the MSR and MSE:
  • $E(MSE) = \sigma^2$
  • $E(MSR) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$

Implications
• The mean of the sampling distribution of MSE is σ² regardless of whether or not β1 = 0
• If β1 = 0, E(MSE) = E(MSR)
• If β1 ≠ 0, E(MSE) < E(MSR)
• To test significance of β1, we can test if MSR and MSE are of the same magnitude.
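A small simulation can make these implications concrete. The sketch below is illustrative only: the function name sim_mean_squares, the seed, the design (50 equally spaced x values), σ = 1, and the slope values 0 and 3 are all assumptions chosen for the demonstration, not values from the lecture.

```r
set.seed(701)  # arbitrary seed, for reproducibility only

# Hypothetical helper: average MSR and MSE over repeated simulated data sets
sim_mean_squares <- function(beta1, n = 50, sigma = 1, reps = 1000) {
  x <- seq(0, 1, length.out = n)               # fixed design
  out <- replicate(reps, {
    y <- 2 + beta1 * x + rnorm(n, sd = sigma)  # true model: simple linear regression
    anova(lm(y ~ x))[["Mean Sq"]]              # c(MSR, MSE) for this data set
  })
  setNames(rowMeans(out), c("MSR", "MSE"))
}

null_ms <- sim_mean_squares(beta1 = 0)  # H0 true
alt_ms  <- sim_mean_squares(beta1 = 3)  # H0 false

print(null_ms)
print(alt_ms)
```

Under β1 = 0 both averages should sit near σ² = 1, while under β1 = 3 the average MSE should stay near 1 and the average MSR should be much larger.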

F-test
• Derived naturally from the arguments just made
• Hypotheses:
  • H0: β1 = 0
  • H1: β1 ≠ 0
• Test statistic: F* = MSR/MSE
• Based on the earlier argument, we expect F* > 1 if H1 is true.
• Only large values of F* are evidence against H0, which implies a one-sided (upper-tail) test.

F-test
• The distribution of F under the null has two sets of degrees of freedom
  • numerator degrees of freedom
  • denominator degrees of freedom
• These correspond to the df as shown in the ANOVA table
  • numerator df = 1
  • denominator df = n-2
• Test is based on $F^* \sim F(1, n-2)$ under H0

Implementing the F-test
• The decision rule
  • If F* > F(1-α; 1, n-2), then reject H0
  • If F* ≤ F(1-α; 1, n-2), then fail to reject H0

ANOVA for logLOS ~ BEDS
> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619
> qf(0.95, 1, 111)
[1] 3.926607
> 1-pf(24.44,1,111)
[1] 2.739016e-06

More interesting: MLR
• You can test that several coefficients are zero at the same time
• Otherwise, the F-test gives the same result as a t-test
• That is: for testing the significance of ONE covariate in a linear regression model, an F-test and a two-sided t-test give the same result (F* = (t*)², with the same p-value):
  • H0: β1 = 0
  • H1: β1 ≠ 0
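The equivalence of the two tests for a single covariate can be checked numerically. This is a minimal sketch on simulated data; the seed, sample size, and coefficients are arbitrary assumptions and have nothing to do with the hospital data used in the lecture.

```r
set.seed(8)  # arbitrary seed
n <- 40
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)   # simulated response, for illustration only
fit <- lm(y ~ x)

coefs <- summary(fit)$coefficients
aov_tab <- anova(fit)

t_star <- coefs["x", "t value"]
p_t    <- coefs["x", "Pr(>|t|)"]
f_star <- aov_tab["x", "F value"]
p_f    <- aov_tab["x", "Pr(>F)"]

# The squared t statistic equals the F statistic, and the p-values agree
print(c(t_squared = t_star^2, F = f_star))
print(c(p_from_t = p_t, p_from_F = p_f))
```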

General F-testing approach
• The previous test seems simple
• It is in this case, but it can be generalized to be more useful
• Imagine a more general test:
  • H0: small model
  • Ha: large model
• Constraint: the small model must be ‘nested’ in the large model
• That is, the small model must be a ‘subset’ of the large model (for example, the large model with some of its coefficients set to zero)
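One common way to write the statistic behind this small-versus-large comparison is sketched below. The subscripted names are notation assumed here for illustration, not taken from the slides: SSE is the error sum of squares of each fitted model and df is its residual degrees of freedom.

```latex
F^* = \frac{\left( SSE_{\text{small}} - SSE_{\text{large}} \right) / \left( df_{\text{small}} - df_{\text{large}} \right)}
           {SSE_{\text{large}} / df_{\text{large}}}
\;\sim\; F\!\left( df_{\text{small}} - df_{\text{large}},\; df_{\text{large}} \right)
\quad \text{under } H_0
```

In simple linear regression the small model is the intercept-only model, so its SSE equals SST with n - 1 df, and the statistic reduces to MSR/MSE with 1 and n - 2 degrees of freedom.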

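Example of ‘nested’ models
Model 1: $\text{LOS} = \beta_0 + \beta_1 \text{INFRISK} + \beta_2 \text{ms} + \beta_3 \text{NURSE} + \beta_4 \text{nurse2} + \varepsilon$
Model 2: $\text{LOS} = \beta_0 + \beta_1 \text{INFRISK} + \beta_3 \text{NURSE} + \beta_4 \text{nurse2} + \varepsilon$
Model 3: $\text{LOS} = \beta_0 + \beta_1 \text{INFRISK} + \beta_2 \text{ms} + \varepsilon$
• Models 2 and 3 are nested in Model 1
• Model 2 is not nested in Model 3
• Model 3 is not nested in Model 2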

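Testing: Models must be nested!
• To test Model 1 vs. Model 2
  • we are testing that β2 = 0
  • H0: β2 = 0 vs. Ha: β2 ≠ 0
• If we conclude that β2 = 0 (that is, if we fail to reject the null hypothesis), then Model 2 is preferred to Model 1
Model 1: $\text{LOS} = \beta_0 + \beta_1 \text{INFRISK} + \beta_2 \text{ms} + \beta_3 \text{NURSE} + \beta_4 \text{nurse2} + \varepsilon$
Model 2: $\text{LOS} = \beta_0 + \beta_1 \text{INFRISK} + \beta_3 \text{NURSE} + \beta_4 \text{nurse2} + \varepsilon$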

R
reg1 <- lm(LOS ~ INFRISK + ms + NURSE + nurse2, data=data)
reg2 <- lm(LOS ~ INFRISK + NURSE + nurse2, data=data)
reg3 <- lm(LOS ~ INFRISK + ms, data=data)
> anova(reg1)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.4043 8.115e-10 ***
ms          1  12.897  12.897  5.0288   0.02697 *
NURSE       1   1.097   1.097  0.4277   0.51449
nurse2      1   1.789   1.789  0.6976   0.40543
Residuals 108 276.981   2.565
---

R
> anova(reg2)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 44.8865 9.507e-10 ***
NURSE       1   8.212   8.212  3.1653     0.078 .
nurse2      1   1.782   1.782  0.6870     0.409
Residuals 109 282.771   2.594
---
> anova(reg1, reg2)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + NURSE + nurse2
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    109 282.771 -1    -5.789 2.2574 0.1359

R
> summary(reg1)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.355e+00  5.266e-01  12.068  < 2e-16 ***
INFRISK      6.289e-01  1.339e-01   4.696 7.86e-06 ***
ms           7.829e-01  5.211e-01   1.502    0.136
NURSE        4.136e-03  4.093e-03   1.010    0.315
nurse2      -5.676e-06  6.796e-06  -0.835    0.405
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.601 on 108 degrees of freedom
Multiple R-squared: 0.3231, Adjusted R-squared: 0.2981
F-statistic: 12.89 on 4 and 108 DF, p-value: 1.298e-08

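Testing more than one covariate
• To test Model 1 vs. Model 3
  • we are testing that β3 = 0 AND β4 = 0
  • H0: β3 = β4 = 0 vs. Ha: β3 ≠ 0 or β4 ≠ 0
• If we conclude that β3 = β4 = 0 (that is, if we fail to reject the null hypothesis), then Model 3 is preferred to Model 1
Model 1: $\text{LOS} = \beta_0 + \beta_1 \text{INFRISK} + \beta_2 \text{ms} + \beta_3 \text{NURSE} + \beta_4 \text{nurse2} + \varepsilon$
Model 3: $\text{LOS} = \beta_0 + \beta_1 \text{INFRISK} + \beta_2 \text{ms} + \varepsilon$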

R
> anova(reg3)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.7683 6.724e-10 ***
ms          1  12.897  12.897  5.0691   0.02634 *
Residuals 110 279.867   2.544
---
> anova(reg1, reg3)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + ms
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    110 279.867 -2    -2.886 0.5627 0.5713
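The F statistics in the two model comparisons can be reproduced by hand from the residual sums of squares and residual degrees of freedom printed by anova(). The sketch below hard-codes those printed (rounded) values, so it agrees with R's output only up to rounding; partial_f is a hypothetical helper name introduced here, not a function from R or the lecture.

```r
# Residual sums of squares and residual df copied from the printed anova() output
rss_full <- 276.981; df_full <- 108   # LOS ~ INFRISK + ms + NURSE + nurse2
rss_m2   <- 282.771; df_m2   <- 109   # LOS ~ INFRISK + NURSE + nurse2
rss_m3   <- 279.867; df_m3   <- 110   # LOS ~ INFRISK + ms

# Hypothetical helper: F test of a small (nested) model against a large model
partial_f <- function(rss_small, df_small, rss_large, df_large) {
  num <- (rss_small - rss_large) / (df_small - df_large)  # extra SS per dropped coefficient
  den <- rss_large / df_large                             # MSE of the large model
  f <- num / den
  p <- pf(f, df_small - df_large, df_large, lower.tail = FALSE)
  c(F = f, p = p)
}

res_12 <- partial_f(rss_m2, df_m2, rss_full, df_full)  # H0: beta2 = 0
res_13 <- partial_f(rss_m3, df_m3, rss_full, df_full)  # H0: beta3 = beta4 = 0

print(round(res_12, 4))
print(round(res_13, 4))
```

For the single-coefficient comparison, the result can also be checked against summary(reg1): the t value for ms is 1.502, and 1.502² ≈ 2.256, which matches the F of 2.2574 up to rounding.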

R
> summary(reg3)

Call:
lm(formula = LOS ~ INFRISK + ms, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9037 -0.8739 -0.1142  0.5965  8.5568

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
INFRISK       0.6998     0.1156   6.054    2e-08 ***
ms            0.9717     0.4316   2.251   0.0263 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10

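Testing multiple coefficients simultaneously
• Region: it is a ‘factor’ variable with 4 categories
• With an intercept in the model, it enters as 3 indicator variables, so testing Region means testing 3 coefficients at once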
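Testing a 4-category factor with a nested-model F-test might look like the sketch below. It uses simulated data rather than the lecture's data set: the region labels, the covariate x, the seed, and the sample size are all assumptions made for illustration, and region is simulated with no true effect.

```r
set.seed(57)  # arbitrary seed
n <- 120

# Hypothetical 4-category factor and one continuous covariate
region <- factor(sample(c("NE", "NC", "S", "W"), n, replace = TRUE))
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)   # region has no true effect in this simulation

small <- lm(y ~ x)            # model without region
large <- lm(y ~ x + region)   # adds 3 indicator variables for region

# F test of H0: all 3 region coefficients are zero
cmp <- anova(small, large)
print(cmp)
```

The comparison row should show 3 numerator degrees of freedom, one per indicator variable, rather than a separate test for each region category.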
