Lecture 15: Logistic Regression: Inference and link functions
BMTRY 701 Biostatistical Methods II
More on our example

> pros5.reg <- glm(cap.inv ~ log(psa) + gleason, family=binomial)
> summary(pros5.reg)

Call:
glm(formula = cap.inv ~ log(psa) + gleason, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -8.1061     0.9916  -8.174 2.97e-16 ***
log(psa)      0.4812     0.1448   3.323 0.000892 ***
gleason       1.0229     0.1595   6.412 1.43e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 512.29  on 379  degrees of freedom
Residual deviance: 403.90  on 377  degrees of freedom
AIC: 409.9
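To put these coefficients on a more interpretable scale, one option (a minimal sketch, reusing the pros5.reg object fit above) is to exponentiate the estimates and their profile-likelihood confidence limits:

# Odds ratios and 95% CIs for capsular involvement
# exp(beta) = multiplicative change in the odds per 1-unit increase in that covariate
exp(cbind(OR = coef(pros5.reg), confint(pros5.reg)))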
What is a good multiple regression model?
• Principles of model building are analogous to linear regression
• We use the same approach:
  • Look for significant covariates in simple models (sketched below)
  • Consider multicollinearity
  • Look for confounding (i.e., change in betas when a covariate is removed)
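A rough sketch of the "simple models first" step (hypothetical: the loop and the data frame name pros are not from the lecture, which works with attached variables):

# Fit each candidate covariate in its own simple logistic model
candidates <- c("gleason", "log(psa)", "vol", "factor(dpros)", "dcaps")
for (x in candidates) {
  fit <- glm(as.formula(paste("cap.inv ~", x)), family = binomial, data = pros)
  print(summary(fit)$coefficients)  # Wald z tests, one covariate at a time
}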
Multiple regression model proposal
• Candidates: Gleason score, log(PSA), volume, digital rectal exam result (dpros), detection of capsular involvement on rectal exam (dcaps)
• But what about collinearity? With 5 candidates there are 5 choose 2 = 10 pairs to check. For the continuous ones (call sketched below):

          gleason log.psa.   vol
gleason      1.00     0.46 -0.06
log.psa.     0.46     1.00  0.05
vol         -0.06     0.05  1.00
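The correlation matrix above can be reproduced along these lines (a sketch; use = "complete.obs" drops the one observation with a missing vol value):

# Pairwise correlations among the continuous candidates, rounded to 2 decimals
round(cor(cbind(gleason, log.psa. = log(psa), vol), use = "complete.obs"), 2)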
Categorical pairs

> dpros.dcaps <- epitab(dpros, dcaps)
> dpros.dcaps$tab
         Outcome
Predictor   1        p0  2         p1 oddsratio     lower     upper      p.value
        1  95 0.2802360  4 0.09756098  1.000000        NA        NA           NA
        2 123 0.3628319  9 0.21951220  1.737805 0.5193327  5.815089 4.050642e-01
        3  84 0.2477876 12 0.29268293  3.392857 1.0540422 10.921270 3.777900e-02
        4  37 0.1091445 16 0.39024390 10.270270 3.2208157 32.748987 1.271225e-05

> fisher.test(table(dpros, dcaps))

        Fisher's Exact Test for Count Data

data:  table(dpros, dcaps)
p-value = 2.520e-05
alternative hypothesis: two.sided
Categorical vs. continuous
• t-tests and ANOVA: means by category

> summary(lm(log(psa)~dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2506     0.1877   6.662 9.55e-11 ***
dcaps         0.8647     0.1632   5.300 1.97e-07 ***
---
Residual standard error: 0.9868 on 378 degrees of freedom
Multiple R-squared: 0.06917, Adjusted R-squared: 0.06671
F-statistic: 28.09 on 1 and 378 DF,  p-value: 1.974e-07

> summary(lm(log(psa)~factor(dpros)))
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.1418087  0.0992064  21.589  < 2e-16 ***
factor(dpros)2 -0.1060634  0.1312377  -0.808    0.419    
factor(dpros)3  0.0001465  0.1413909   0.001    0.999    
factor(dpros)4  0.7431101  0.1680055   4.423 1.28e-05 ***
---
Residual standard error: 0.9871 on 376 degrees of freedom
Multiple R-squared: 0.07348, Adjusted R-squared: 0.06609
F-statistic: 9.94 on 3 and 376 DF,  p-value: 2.547e-06
Categorical vs. continuous

> summary(lm(vol~dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   22.905      3.477   6.587 1.51e-10 ***
dcaps         -6.362      3.022  -2.106   0.0359 *  
---
Residual standard error: 18.27 on 377 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.01162, Adjusted R-squared: 0.009003
F-statistic: 4.434 on 1 and 377 DF,  p-value: 0.03589

> summary(lm(vol~factor(dpros)))
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      17.417      1.858   9.374   <2e-16 ***
factor(dpros)2   -1.638      2.453  -0.668    0.505    
factor(dpros)3   -1.976      2.641  -0.748    0.455    
factor(dpros)4   -3.513      3.136  -1.120    0.263    
---
Residual standard error: 18.39 on 375 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.003598, Adjusted R-squared: -0.004373
F-statistic: 0.4514 on 3 and 375 DF,  p-value: 0.7164
Categorical vs. continuous

> summary(lm(gleason~dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.2560     0.1991  26.401  < 2e-16 ***
dcaps         1.0183     0.1730   5.885 8.78e-09 ***
---
Residual standard error: 1.047 on 378 degrees of freedom
Multiple R-squared: 0.08394, Adjusted R-squared: 0.08151
F-statistic: 34.63 on 1 and 378 DF,  p-value: 8.776e-09

> summary(lm(gleason~factor(dpros)))
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.9798     0.1060  56.402  < 2e-16 ***
factor(dpros)2   0.4217     0.1403   3.007  0.00282 ** 
factor(dpros)3   0.4890     0.1511   3.236  0.00132 ** 
factor(dpros)4   0.9636     0.1795   5.367 1.40e-07 ***
---
Residual standard error: 1.055 on 376 degrees of freedom
Multiple R-squared: 0.07411, Adjusted R-squared: 0.06672
F-statistic: 10.03 on 3 and 376 DF,  p-value: 2.251e-06
Lots of “correlation” between covariates
• We should expect that there will be insignificance and confounding.
• Still, try the ‘full model’ and see what happens.
Full model results

> mreg <- glm(cap.inv ~ gleason + log(psa) + vol + dcaps + factor(dpros), family=binomial)
> summary(mreg)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.617036   1.102909  -7.813 5.58e-15 ***
gleason         0.908424   0.166317   5.462 4.71e-08 ***
log(psa)        0.514200   0.156739   3.281  0.00104 ** 
vol            -0.014171   0.007712  -1.838  0.06612 .  
dcaps           0.464952   0.456868   1.018  0.30882    
factor(dpros)2  0.753759   0.355762   2.119  0.03411 *  
factor(dpros)3  1.517838   0.372366   4.076 4.58e-05 ***
factor(dpros)4  1.384887   0.453127   3.056  0.00224 ** 
---
    Null deviance: 511.26  on 378  degrees of freedom
Residual deviance: 376.00  on 371  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 392
What next?
• Drop or retain?
• How to interpret?
Likelihood Ratio Test
• Recall testing multiple coefficients in linear regression
• Approach: ANOVA
• We don’t have ANOVA for logistic regression
• More general approach: Likelihood Ratio Test
• Based on the likelihood (or log-likelihood) for “competing” nested models
Likelihood Ratio Test
• H0: small model
• Ha: large model
• Example:
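One way the example might be written out (a sketch based on the dpros comparison carried out later in this lecture): the small model contains gleason, log(psa), and vol, and the large model adds the three factor(dpros) dummy variables, so we test

$$H_0:\ \beta_{\mathrm{dpros}2} = \beta_{\mathrm{dpros}3} = \beta_{\mathrm{dpros}4} = 0 \quad \text{(small model)}$$
$$H_a:\ \text{at least one of } \beta_{\mathrm{dpros}2}, \beta_{\mathrm{dpros}3}, \beta_{\mathrm{dpros}4} \neq 0 \quad \text{(large model)}$$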
Estimating the log-likelihood
• Recall that we use the log-likelihood because it is simpler to work with (as in linear regression)
• MLEs:
  • Betas are selected to maximize the likelihood
  • Betas also maximize the log-likelihood
• If we plug in the estimated betas, we get our ‘maximized’ log-likelihood for that model
• We compare the log-likelihoods from competing (nested) models
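For reference, the log-likelihood being maximized is the standard Bernoulli form (not specific to this dataset):

$$\ell(\beta) = \sum_{i=1}^{n}\left[\, y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i)\,\right], \qquad \pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$

Plugging the fitted betas back into this sum gives the maximized log-likelihood that the LRT compares across nested models.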
Likelihood Ratio Test
• LR statistic = G2 = -2*(logL(H0) - logL(H1))
• Under the null: G2 ~ χ2(p-q)
• If G2 < χ2(p-q),1-α, conclude H0
• If G2 > χ2(p-q),1-α, conclude H1
LRT in R
• -2LogL = Residual Deviance (for ungrouped binary data; in general the two differ by the saturated-model log-likelihood, which cancels in the difference)
• So, G2 = Dev(0) - Dev(1)
• Fit two models:
> mreg1 <- glm(cap.inv ~ gleason + log(psa) + vol + factor(dpros),
+              family=binomial)
> mreg0 <- glm(cap.inv ~ gleason + log(psa) + vol, family=binomial)

> mreg1

Coefficients:
   (Intercept)         gleason        log(psa)             vol  
      -8.31383         0.93147         0.53422        -0.01507  
factor(dpros)2  factor(dpros)3  factor(dpros)4  
       0.76840         1.55109         1.44743  

Degrees of Freedom: 378 Total (i.e. Null);  372 Residual
  (1 observation deleted due to missingness)
Null Deviance:     511.3
Residual Deviance: 377.1    AIC: 391.1

> mreg0

Coefficients:
(Intercept)      gleason     log(psa)          vol  
   -7.76759      0.99931      0.50406     -0.01583  

Degrees of Freedom: 378 Total (i.e. Null);  375 Residual
  (1 observation deleted due to missingness)
Null Deviance:     511.3
Residual Deviance: 399      AIC: 407
Testing DPROS
• Dev(0) - Dev(1) =
• p - q =
• χ2(p-q),1-α =
• Conclusion?
• p-value?
More in R

qchisq(0.975, 3)
-2*(logLik(mreg0) - logLik(mreg1))
1 - pchisq(21.96, 3)

> anova(mreg0, mreg1)
Analysis of Deviance Table

Model 1: cap.inv ~ gleason + log(psa) + vol
Model 2: cap.inv ~ gleason + log(psa) + vol + factor(dpros)
  Resid. Df Resid. Dev Df Deviance
1       375     399.02            
2       372     377.06  3    21.96
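A small follow-up sketch: anova() will also append the LRT p-value itself if asked for a chi-square test, which should match the hand computation 1 - pchisq(21.96, 3) above:

# Same analysis of deviance table, with a Pr(>Chi) column added
anova(mreg0, mreg1, test = "Chisq")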
Notes on LRT
• Again, models have to be NESTED
• For comparing models that are not nested, you need to use other approaches
• Examples (a quick sketch below):
  • AIC
  • BIC
  • DIC
• Next time….
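For the non-nested comparisons previewed above, a minimal sketch of an information-criterion comparison, reusing the two models already fit (they happen to be nested here, but the same calls apply to non-nested fits):

# Smaller AIC/BIC indicates a better fit-complexity trade-off; no nesting required
AIC(mreg0, mreg1)
BIC(mreg0, mreg1)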
For next time, read the following article:
Mary K. Townsend, Gary C. Curhan, Neil M. Resnick, Francine Grodstein. Oral Contraceptive Use and Incident Urinary Incontinence in Premenopausal Women. The Journal of Urology, In Press. http://www.sciencedirect.com/science/article/B7XMT-4VVN50M-K/2/31c7620a20865c25c70c93736ef2814d
Keywords: urinary incontinence; contraceptives; oral; epidemiology