Lecture 15: Logistic Regression: Inference and link functions
BMTRY 701 Biostatistical Methods II
More on our example

> pros5.reg <- glm(cap.inv ~ log(psa) + gleason, family=binomial)
> summary(pros5.reg)

Call:
glm(formula = cap.inv ~ log(psa) + gleason, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -8.1061     0.9916  -8.174 2.97e-16 ***
log(psa)      0.4812     0.1448   3.323 0.000892 ***
gleason       1.0229     0.1595   6.412 1.43e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 512.29  on 379  degrees of freedom
Residual deviance: 403.90  on 377  degrees of freedom
AIC: 409.9
What is a good multiple regression model?
• Principles of model building are analogous to linear regression
• We use the same approach:
  • look for significant covariates in simple models
  • consider multicollinearity
  • look for confounding (i.e., change in betas when a covariate is removed; see the sketch below)
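A minimal sketch of the confounding check in R, reusing the variables from the example above (the object names fit.full and fit.red, and the informal "more than about 10-20% change" rule of thumb, are additions, not from the slides):

# Fit the model with and without gleason and compare the log(psa) coefficient
fit.full <- glm(cap.inv ~ log(psa) + gleason, family=binomial)
fit.red  <- glm(cap.inv ~ log(psa), family=binomial)
# A large relative change (often taken as >10-20%) in the log(psa) beta
# when gleason is removed suggests confounding by gleason
(coef(fit.full)["log(psa)"] - coef(fit.red)["log(psa)"]) / coef(fit.red)["log(psa)"]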
Multiple regression model proposal
• Gleason, log(PSA), volume, digital rectal exam result, detection in rectal exam
• But, what about collinearity? 5 choose 2 = 10 pairs to check (continuous pairs below; see the sketch after the matrix)

         gleason log.psa.   vol
gleason     1.00     0.46 -0.06
log.psa.    0.46     1.00  0.05
vol        -0.06     0.05  1.00
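A sketch of how a correlation matrix like the one above might be obtained (the use = "pairwise.complete.obs" argument is an assumption, added to handle the one missing vol value noted later):

# Pairwise correlations among the continuous candidate covariates
round(cor(cbind(gleason, log.psa. = log(psa), vol),
          use = "pairwise.complete.obs"), 2)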
Categorical pairs

> dpros.dcaps <- epitab(dpros, dcaps)
> dpros.dcaps$tab
         Outcome
Predictor   1        p0  2         p1 oddsratio     lower     upper      p.value
        1  95 0.2802360  4 0.09756098  1.000000        NA        NA           NA
        2 123 0.3628319  9 0.21951220  1.737805 0.5193327  5.815089 4.050642e-01
        3  84 0.2477876 12 0.29268293  3.392857 1.0540422 10.921270 3.777900e-02
        4  37 0.1091445 16 0.39024390 10.270270 3.2208157 32.748987 1.271225e-05

> fisher.test(table(dpros, dcaps))

        Fisher's Exact Test for Count Data

data:  table(dpros, dcaps)
p-value = 2.520e-05
alternative hypothesis: two.sided
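For completeness: epitab() is presumably from the epitools package (an assumption; the library() call is not shown on the slide), so the snippet above would be preceded by something like:

library(epitools)   # provides epitab() for odds-ratio tables by exposure level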
Categorical vs. continuous
• t-tests and ANOVA: means by category

> summary(lm(log(psa)~dcaps))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.2506     0.1877   6.662 9.55e-11 ***
dcaps         0.8647     0.1632   5.300 1.97e-07 ***
---
Residual standard error: 0.9868 on 378 degrees of freedom
Multiple R-squared: 0.06917,    Adjusted R-squared: 0.06671
F-statistic: 28.09 on 1 and 378 DF,  p-value: 1.974e-07

> summary(lm(log(psa)~factor(dpros)))

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     2.1418087  0.0992064  21.589  < 2e-16 ***
factor(dpros)2 -0.1060634  0.1312377  -0.808    0.419
factor(dpros)3  0.0001465  0.1413909   0.001    0.999
factor(dpros)4  0.7431101  0.1680055   4.423 1.28e-05 ***
---
Residual standard error: 0.9871 on 376 degrees of freedom
Multiple R-squared: 0.07348,    Adjusted R-squared: 0.06609
F-statistic: 9.94 on 3 and 376 DF,  p-value: 2.547e-06
Categorical vs. continuous

> summary(lm(vol~dcaps))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.905      3.477   6.587 1.51e-10 ***
dcaps         -6.362      3.022  -2.106   0.0359 *
---
Residual standard error: 18.27 on 377 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.01162,    Adjusted R-squared: 0.009003
F-statistic: 4.434 on 1 and 377 DF,  p-value: 0.03589

> summary(lm(vol~factor(dpros)))

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      17.417      1.858   9.374   <2e-16 ***
factor(dpros)2   -1.638      2.453  -0.668    0.505
factor(dpros)3   -1.976      2.641  -0.748    0.455
factor(dpros)4   -3.513      3.136  -1.120    0.263
---
Residual standard error: 18.39 on 375 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.003598,   Adjusted R-squared: -0.004373
F-statistic: 0.4514 on 3 and 375 DF,  p-value: 0.7164
Categorical vs. continuous

> summary(lm(gleason~dcaps))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.2560     0.1991  26.401  < 2e-16 ***
dcaps         1.0183     0.1730   5.885 8.78e-09 ***
---
Residual standard error: 1.047 on 378 degrees of freedom
Multiple R-squared: 0.08394,    Adjusted R-squared: 0.08151
F-statistic: 34.63 on 1 and 378 DF,  p-value: 8.776e-09

> summary(lm(gleason~factor(dpros)))

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      5.9798     0.1060  56.402  < 2e-16 ***
factor(dpros)2   0.4217     0.1403   3.007  0.00282 **
factor(dpros)3   0.4890     0.1511   3.236  0.00132 **
factor(dpros)4   0.9636     0.1795   5.367 1.40e-07 ***
---
Residual standard error: 1.055 on 376 degrees of freedom
Multiple R-squared: 0.07411,    Adjusted R-squared: 0.06672
F-statistic: 10.03 on 3 and 376 DF,  p-value: 2.251e-06
Lots of “correlation” between covariates
• We should expect that there will be insignificance and confounding.
• Still, try the ‘full model’ and see what happens.
Full model results

> mreg <- glm(cap.inv ~ gleason + log(psa) + vol + dcaps + factor(dpros), family=binomial)
> summary(mreg)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    -8.617036   1.102909  -7.813 5.58e-15 ***
gleason         0.908424   0.166317   5.462 4.71e-08 ***
log(psa)        0.514200   0.156739   3.281  0.00104 **
vol            -0.014171   0.007712  -1.838  0.06612 .
dcaps           0.464952   0.456868   1.018  0.30882
factor(dpros)2  0.753759   0.355762   2.119  0.03411 *
factor(dpros)3  1.517838   0.372366   4.076 4.58e-05 ***
factor(dpros)4  1.384887   0.453127   3.056  0.00224 **
---
    Null deviance: 511.26  on 378  degrees of freedom
Residual deviance: 376.00  on 371  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 392
What next?
• Drop or retain?
• How to interpret? (see the odds-ratio sketch below)
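One common way to start on the interpretation question is the odds-ratio scale; a minimal sketch using the full-model fit mreg from the previous slide (the choice of Wald intervals via confint.default() is an assumption, not from the slides):

# Exponentiate coefficients and Wald 95% CIs to get odds ratios
round(exp(cbind(OR = coef(mreg), confint.default(mreg))), 2)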
Likelihood Ratio Test
• Recall testing multiple coefficients in linear regression
• Approach: ANOVA
• We don’t have ANOVA for logistic regression
• More general approach: Likelihood Ratio Test
• Based on the likelihood (or log-likelihood) for “competing” nested models
Likelihood Ratio Test
• H0: small (reduced) model
• Ha: large (full) model
• Example (sketched below):
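A sketch of what the example might look like, written out with the covariates used in the mreg0/mreg1 comparison later in this lecture (filling in the example this way is an assumption):

H0: logit(π) = β0 + β1·gleason + β2·log(psa) + β3·vol
Ha: logit(π) = β0 + β1·gleason + β2·log(psa) + β3·vol + β4·I(dpros=2) + β5·I(dpros=3) + β6·I(dpros=4)

Equivalently, H0: β4 = β5 = β6 = 0, i.e., a test of 3 coefficients at once.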
Estimating the log-likelihood
• Recall that we use the log-likelihood because it is simpler (back to linear regression)
• MLEs:
  • betas are selected to maximize the likelihood
  • betas also maximize the log-likelihood
• If we plug in the estimated betas, we get our ‘maximized’ log-likelihood for that model (see the quick check below)
• We compare the log-likelihoods from competing (nested) models
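A quick check in R using the pros5.reg fit from earlier (the note that -2·logL equals the residual deviance for ungrouped binary data is an addition, previewing the next slides):

logLik(pros5.reg)                    # maximized log-likelihood for the fitted model
-2*as.numeric(logLik(pros5.reg))     # for ungrouped binary data, equals the residual deviance
deviance(pros5.reg)                  # the "Residual deviance" reported by summary()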
Likelihood Ratio Test
• LR statistic: G² = -2*(logL(H0) - logL(H1))
• Under the null: G² ~ χ² with p - q degrees of freedom (p and q are the numbers of parameters in the large and small models)
• If G² < χ²(p-q, 1-α), conclude H0
• If G² > χ²(p-q, 1-α), conclude H1
• Here χ²(p-q, 1-α) is the 1-α quantile of the χ² distribution with p - q degrees of freedom
LRT in R
• -2·logL = residual deviance (for ungrouped binary data, as here)
• So, G² = Dev(0) - Dev(1)
• Fit two models:
> mreg1 <- glm(cap.inv ~ gleason + log(psa) + vol + factor(dpros),
+              family=binomial)
> mreg0 <- glm(cap.inv ~ gleason + log(psa) + vol, family=binomial)

> mreg1

Coefficients:
   (Intercept)         gleason        log(psa)             vol
      -8.31383         0.93147         0.53422        -0.01507
factor(dpros)2  factor(dpros)3  factor(dpros)4
       0.76840         1.55109         1.44743

Degrees of Freedom: 378 Total (i.e. Null);  372 Residual
  (1 observation deleted due to missingness)
Null Deviance:      511.3
Residual Deviance: 377.1     AIC: 391.1

> mreg0

Coefficients:
(Intercept)      gleason     log(psa)          vol
   -7.76759      0.99931      0.50406     -0.01583

Degrees of Freedom: 378 Total (i.e. Null);  375 Residual
  (1 observation deleted due to missingness)
Null Deviance:      511.3
Residual Deviance: 399       AIC: 407
Testing DPROS
• Dev(0) - Dev(1) = ?
• p - q = ?
• χ²(p-q, 1-α) = ?
• Conclusion?
• p-value? (worked through below)
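Working through these using the mreg0/mreg1 output above:
• Dev(0) - Dev(1) = 399.02 - 377.06 = 21.96
• p - q = 7 - 4 = 3 (equivalently, 375 - 372 residual degrees of freedom)
• χ²(3, 0.95) ≈ 7.81
• Conclusion: 21.96 > 7.81, so conclude H1: factor(dpros) adds significantly to the model
• p-value: 1 - pchisq(21.96, 3) ≈ 6.6e-05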
More in R

qchisq(0.95, 3)
-2*(logLik(mreg0) - logLik(mreg1))
1 - pchisq(21.96, 3)

> anova(mreg0, mreg1)
Analysis of Deviance Table

Model 1: cap.inv ~ gleason + log(psa) + vol
Model 2: cap.inv ~ gleason + log(psa) + vol + factor(dpros)
  Resid. Df Resid. Dev Df Deviance
1       375     399.02
2       372     377.06  3    21.96
>
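A small extension not shown on the slide: anova() for glm fits can also be asked for the chi-square test directly, which appends the LRT p-value to the table.

anova(mreg0, mreg1, test = "Chisq")   # adds a Pr(>Chi) column with the LRT p-value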
Notes on LRT
• Again, the models have to be NESTED
• For comparing models that are not nested, you need to use other approaches (sketched below)
• Examples:
  • AIC
  • BIC
  • DIC
• Next time…
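A minimal sketch of the AIC/BIC comparison in R, reusing the mreg0 and mreg1 fits from above (R computes these directly for glm objects; lower values indicate the preferred model):

AIC(mreg0, mreg1)   # the AIC values also appear in the glm print-out above (407 vs. 391.1)
BIC(mreg0, mreg1)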
For next time, read the following article:
Low Diagnostic Yield of Elective Coronary Angiography. Patel, Peterson, Dai, et al. NEJM, 362(10), pp. 886-95, March 11, 2010.
http://content.nejm.org/cgi/content/short/362/10/886?ssource=mfv