Lecture 15: Logistic Regression: Inference and link functions
BMTRY 701 Biostatistical Methods II
More on our example

> pros5.reg <- glm(cap.inv ~ log(psa) + gleason, family=binomial)
> summary(pros5.reg)

Call:
glm(formula = cap.inv ~ log(psa) + gleason, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -8.1061     0.9916  -8.174 2.97e-16 ***
log(psa)      0.4812     0.1448   3.323 0.000892 ***
gleason       1.0229     0.1595   6.412 1.43e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 512.29  on 379  degrees of freedom
Residual deviance: 403.90  on 377  degrees of freedom
AIC: 409.9
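To put these coefficients on a more interpretable scale, one option (a minimal sketch, reusing the pros5.reg object fit above) is to exponentiate the estimates and their profile-likelihood confidence limits:

# Odds ratios and 95% CIs for capsular involvement
# exp(beta) = multiplicative change in the odds per 1-unit increase in that covariate
exp(cbind(OR = coef(pros5.reg), confint(pros5.reg)))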
What is a good multiple regression model?
• Principles of model building are analogous to linear regression
• We use the same approach:
  • Look for significant covariates in simple models (sketched below)
  • Consider multicollinearity
  • Look for confounding (i.e., change in betas when a covariate is removed)
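A rough sketch of the "simple models first" step (hypothetical: the loop and the data frame name pros are not from the lecture, which works with attached variables):

# Fit each candidate covariate in its own simple logistic model
candidates <- c("gleason", "log(psa)", "vol", "factor(dpros)", "dcaps")
for (x in candidates) {
  fit <- glm(as.formula(paste("cap.inv ~", x)), family = binomial, data = pros)
  print(summary(fit)$coefficients)  # Wald z tests, one covariate at a time
}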
Multiple regression model proposal
• Candidates: Gleason score, log(PSA), volume, digital rectal exam result (dpros), detection of capsular involvement on rectal exam (dcaps)
• But what about collinearity? With 5 candidates there are 5 choose 2 = 10 pairs to check. For the continuous ones (call sketched below):

          gleason log.psa.   vol
gleason      1.00     0.46 -0.06
log.psa.     0.46     1.00  0.05
vol         -0.06     0.05  1.00
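The correlation matrix above can be reproduced along these lines (a sketch; use = "complete.obs" drops the one observation with a missing vol value):

# Pairwise correlations among the continuous candidates, rounded to 2 decimals
round(cor(cbind(gleason, log.psa. = log(psa), vol), use = "complete.obs"), 2)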
Categorical pairs

> dpros.dcaps <- epitab(dpros, dcaps)
> dpros.dcaps$tab
         Outcome
Predictor   1        p0  2         p1 oddsratio     lower     upper      p.value
        1  95 0.2802360  4 0.09756098  1.000000        NA        NA           NA
        2 123 0.3628319  9 0.21951220  1.737805 0.5193327  5.815089 4.050642e-01
        3  84 0.2477876 12 0.29268293  3.392857 1.0540422 10.921270 3.777900e-02
        4  37 0.1091445 16 0.39024390 10.270270 3.2208157 32.748987 1.271225e-05

> fisher.test(table(dpros, dcaps))

        Fisher's Exact Test for Count Data

data:  table(dpros, dcaps)
p-value = 2.520e-05
alternative hypothesis: two.sided
Categorical vs. continuous
• t-tests and ANOVA: means by category

> summary(lm(log(psa)~dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2506     0.1877   6.662 9.55e-11 ***
dcaps         0.8647     0.1632   5.300 1.97e-07 ***
---
Residual standard error: 0.9868 on 378 degrees of freedom
Multiple R-squared: 0.06917, Adjusted R-squared: 0.06671
F-statistic: 28.09 on 1 and 378 DF,  p-value: 1.974e-07

> summary(lm(log(psa)~factor(dpros)))
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.1418087  0.0992064  21.589  < 2e-16 ***
factor(dpros)2 -0.1060634  0.1312377  -0.808    0.419    
factor(dpros)3  0.0001465  0.1413909   0.001    0.999    
factor(dpros)4  0.7431101  0.1680055   4.423 1.28e-05 ***
---
Residual standard error: 0.9871 on 376 degrees of freedom
Multiple R-squared: 0.07348, Adjusted R-squared: 0.06609
F-statistic: 9.94 on 3 and 376 DF,  p-value: 2.547e-06
Categorical vs. continuous

> summary(lm(vol~dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   22.905      3.477   6.587 1.51e-10 ***
dcaps         -6.362      3.022  -2.106   0.0359 *  
---
Residual standard error: 18.27 on 377 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.01162, Adjusted R-squared: 0.009003
F-statistic: 4.434 on 1 and 377 DF,  p-value: 0.03589

> summary(lm(vol~factor(dpros)))
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      17.417      1.858   9.374   <2e-16 ***
factor(dpros)2   -1.638      2.453  -0.668    0.505    
factor(dpros)3   -1.976      2.641  -0.748    0.455    
factor(dpros)4   -3.513      3.136  -1.120    0.263    
---
Residual standard error: 18.39 on 375 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.003598, Adjusted R-squared: -0.004373
F-statistic: 0.4514 on 3 and 375 DF,  p-value: 0.7164
Categorical vs. continuous

> summary(lm(gleason~dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.2560     0.1991  26.401  < 2e-16 ***
dcaps         1.0183     0.1730   5.885 8.78e-09 ***
---
Residual standard error: 1.047 on 378 degrees of freedom
Multiple R-squared: 0.08394, Adjusted R-squared: 0.08151
F-statistic: 34.63 on 1 and 378 DF,  p-value: 8.776e-09

> summary(lm(gleason~factor(dpros)))
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.9798     0.1060  56.402  < 2e-16 ***
factor(dpros)2   0.4217     0.1403   3.007  0.00282 ** 
factor(dpros)3   0.4890     0.1511   3.236  0.00132 ** 
factor(dpros)4   0.9636     0.1795   5.367 1.40e-07 ***
---
Residual standard error: 1.055 on 376 degrees of freedom
Multiple R-squared: 0.07411, Adjusted R-squared: 0.06672
F-statistic: 10.03 on 3 and 376 DF,  p-value: 2.251e-06
Lots of “correlation” between covariates
• We should expect that there will be insignificance and confounding.
• Still, try the ‘full model’ and see what happens.
Full model results

> mreg <- glm(cap.inv ~ gleason + log(psa) + vol + dcaps + factor(dpros), family=binomial)
> summary(mreg)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.617036   1.102909  -7.813 5.58e-15 ***
gleason         0.908424   0.166317   5.462 4.71e-08 ***
log(psa)        0.514200   0.156739   3.281  0.00104 ** 
vol            -0.014171   0.007712  -1.838  0.06612 .  
dcaps           0.464952   0.456868   1.018  0.30882    
factor(dpros)2  0.753759   0.355762   2.119  0.03411 *  
factor(dpros)3  1.517838   0.372366   4.076 4.58e-05 ***
factor(dpros)4  1.384887   0.453127   3.056  0.00224 ** 
---
    Null deviance: 511.26  on 378  degrees of freedom
Residual deviance: 376.00  on 371  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 392
What next?
• Drop or retain?
• How to interpret?
Likelihood Ratio Test
• Recall testing multiple coefficients in linear regression
• Approach: ANOVA
• We don’t have ANOVA for logistic regression
• More general approach: Likelihood Ratio Test
• Based on the likelihood (or log-likelihood) for “competing” nested models
Likelihood Ratio Test
• H0: small model
• Ha: large model
• Example:
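One way the example might be written out (a sketch based on the dpros comparison carried out later in this lecture): the small model contains gleason, log(psa), and vol, and the large model adds the three factor(dpros) dummy variables, so we test

$$H_0:\ \beta_{\mathrm{dpros}2} = \beta_{\mathrm{dpros}3} = \beta_{\mathrm{dpros}4} = 0 \quad \text{(small model)}$$
$$H_a:\ \text{at least one of } \beta_{\mathrm{dpros}2}, \beta_{\mathrm{dpros}3}, \beta_{\mathrm{dpros}4} \neq 0 \quad \text{(large model)}$$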
Estimating the log-likelihood
• Recall that we use the log-likelihood because it is simpler to work with (as in linear regression)
• MLEs:
  • Betas are selected to maximize the likelihood
  • Betas also maximize the log-likelihood
• If we plug in the estimated betas, we get our ‘maximized’ log-likelihood for that model
• We compare the log-likelihoods from competing (nested) models
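For reference, the log-likelihood being maximized is the standard Bernoulli form (not specific to this dataset):

$$\ell(\beta) = \sum_{i=1}^{n}\left[\, y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i)\,\right], \qquad \pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$

Plugging the fitted betas back into this sum gives the maximized log-likelihood that the LRT compares across nested models.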
Likelihood Ratio Test
• LR statistic = G2 = -2*(logL(H0) - logL(H1))
• Under the null: G2 ~ χ2(p-q)
• If G2 < χ2(p-q),1-α, conclude H0
• If G2 > χ2(p-q),1-α, conclude H1
LRT in R
• -2LogL = Residual Deviance (for ungrouped binary data; in general the two differ by the saturated-model log-likelihood, which cancels in the difference)
• So, G2 = Dev(0) - Dev(1)
• Fit two models:
> mreg1 <- glm(cap.inv ~ gleason + log(psa) + vol + factor(dpros),
+              family=binomial)
> mreg0 <- glm(cap.inv ~ gleason + log(psa) + vol, family=binomial)

> mreg1

Coefficients:
   (Intercept)         gleason        log(psa)             vol  
      -8.31383         0.93147         0.53422        -0.01507  
factor(dpros)2  factor(dpros)3  factor(dpros)4  
       0.76840         1.55109         1.44743  

Degrees of Freedom: 378 Total (i.e. Null);  372 Residual
  (1 observation deleted due to missingness)
Null Deviance:     511.3
Residual Deviance: 377.1    AIC: 391.1

> mreg0

Coefficients:
(Intercept)      gleason     log(psa)          vol  
   -7.76759      0.99931      0.50406     -0.01583  

Degrees of Freedom: 378 Total (i.e. Null);  375 Residual
  (1 observation deleted due to missingness)
Null Deviance:     511.3
Residual Deviance: 399      AIC: 407
Testing DPROS
• Dev(0) - Dev(1) =
• p - q =
• χ2(p-q),1-α =
• Conclusion?
• p-value?
More in R

qchisq(0.975, 3)
-2*(logLik(mreg0) - logLik(mreg1))
1 - pchisq(21.96, 3)

> anova(mreg0, mreg1)
Analysis of Deviance Table

Model 1: cap.inv ~ gleason + log(psa) + vol
Model 2: cap.inv ~ gleason + log(psa) + vol + factor(dpros)
  Resid. Df Resid. Dev Df Deviance
1       375     399.02            
2       372     377.06  3    21.96
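A small follow-up sketch: anova() will also append the LRT p-value itself if asked for a chi-square test, which should match the hand computation 1 - pchisq(21.96, 3) above:

# Same analysis of deviance table, with a Pr(>Chi) column added
anova(mreg0, mreg1, test = "Chisq")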
Notes on LRT
• Again, models have to be NESTED
• For comparing models that are not nested, you need to use other approaches
• Examples (a quick sketch below):
  • AIC
  • BIC
  • DIC
• Next time….
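For the non-nested comparisons previewed above, a minimal sketch of an information-criterion comparison, reusing the two models already fit (they happen to be nested here, but the same calls apply to non-nested fits):

# Smaller AIC/BIC indicates a better fit-complexity trade-off; no nesting required
AIC(mreg0, mreg1)
BIC(mreg0, mreg1)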
For next time, read the following article:
Mary K. Townsend, Gary C. Curhan, Neil M. Resnick, Francine Grodstein. Oral Contraceptive Use and Incident Urinary Incontinence in Premenopausal Women. The Journal of Urology, In Press. http://www.sciencedirect.com/science/article/B7XMT-4VVN50M-K/2/31c7620a20865c25c70c93736ef2814d
Keywords: urinary incontinence; contraceptives; oral; epidemiology