Logistic Regression

Logistic Regression

Example: Birth defects • We want to know if the probability of a certain birth defect is higher among women of a certain age • Outcome (y) = presence/absence of birth defect • Explanatory (x) = maternal age a birth > bdlog<-glm(bd$casegrp~bd$MAGE,family=binomial("logit")) Since “logit” is the default, you can actually use: > bdlog<-glm(bd$casegrp~bd$MAGE,binomial) > summary(bdlog)

Example in R Call: glm(formula = bd$casegrp ~ bd$MAGE, family = binomial("logit")) Deviance Residuals: Min 1Q Median 3Q Max -0.56672 -0.24047 -0.14728 -0.08994 3.60539 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.82555 0.32491 2.541 0.0111 * bd$MAGE -0.19793 0.01489 -13.290 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Null deviance: 2364.1 on 11892 degrees of freedom Residual deviance: 2130.8 on 11891 degrees of freedom AIC: 2134.8

Plot > plot(MAGE~ fitted(glm(casegrp~ MAGE,binomial)), xlab=“Maternal Age”, ylab=“P(Birth Defect)”, pch=15)

Example: Categorical x • Sometimes, it’s easier to interpret logistic regression output if the x variables are categorical • Suppose we categorize maternal age into 3 categories: > bd$magecat3 <- ifelse(bd$MAGE>25, c(1),c(0)) > bd$magecat2 <- ifelse(bd$MAGE>=20 & bd$MAGE<=25, c(1),c(0)) > bd$magecat1 <- ifelse(bd$MAGE<20, c(1),c(0))

Example in R > bdlog2<-glm(casegrp~magecat1+magecat2,binomial) > summary(bdlog2) • Remember, with a set of dummy variables, you always put in one less variable than category

Example in R Call: glm(formula = casegrp ~ magecat1 + magecat2, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -0.3752 -0.2349 -0.1050 -0.1050 3.2259 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.1977 0.1671 -31.101 <2e-16 *** magecat1 2.5794 0.1964 13.137 <2e-16 *** magecat2 1.6208 0.1942 8.345 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 2364.1 on 11892 degrees of freedom Residual deviance: 2148.6 on 11890 degrees of freedom AIC: 2154.6

Interpretation: odds ratios > exp(cbind(OR=coef(bdlog2),confint(bdlog2))) Waiting for profiling to be done... OR 2.5 % 97.5 % (Intercept) 0.005529105 0.003910843 0.007544544 magecat1 13.189149619 9.066531887 19.622868917 magecat2 5.057367954 3.492376718 7.495720840 • This tells us that: • women <20 years have a 13 times greater odds of a birth defect than women >24 years • women 20-24 years have a 5 times greater odds of a birth defect than women >24 years

What variables can we consider dropping? > anova(bd.log,test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: casegrp Terms added sequentially (first to last) Df Deviance Resid. DfResid. Dev P(>|Chi|) NULL 11880 2355.87 magecat1 1 130.58 11879 2225.28 < 2.2e-16 *** magecat2 1 82.73 11878 2142.56 < 2.2e-16 *** bthparity2 1 13.37 11877 2129.18 0.0002555 *** smoke 1 16.10 11876 2113.09 6.022e-05 *** • Small p-values indicate that all variables are needed to explain the variation in y

Goodness of fit statistics Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.9100 0.1870 -26.252 < 2e-16 *** magecat1 2.2534 0.2073 10.872 < 2e-16 *** magecat2 1.4732 0.1965 7.497 6.52e-14 *** bthparity2parous -0.5932 0.1497 -3.962 7.45e-05 *** smokesmoker 0.6515 0.1546 4.213 2.52e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Null deviance: 2355.9 on 11880 degrees of freedom Residual deviance: 2113.1 on 11876 degrees of freedom AIC: 2123.1 • -2 log L • AIC

Binned residual plot > x<-predict(bd.log) > y<-resid(bd.log) > binnedplot(x,y) Plots the average residual and the average fitted (predicted) value for each bin, or category Category is based on the fitted values 95% of all values should fall within the dotted line

Poisson Regression Using count data

What is a Poisson distribution?

Example: children ever born • The dataset has 70 rows representing group-level data on the number of children ever born to women in Fiji: • Number of children ever born • Number of women in the group • Duration of marriage • 1=0-4, 2=5-9, 3=10-14, 4=15-19, 5=20-24, 6=25-29 • Residence • 1=Suva (capital city), 2=Urban, 3=Rural • Education • 1=none, 2=lower primary, 3=upper primary, 4=secondary+

Poisson regression in R > ceb1<-glm(y ~ educ + res, offset=log(n), family = "poisson", data = ceb) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.43029 0.01795 79.691 <2e-16 *** educnone 0.21462 0.02183 9.831 <2e-16 *** educsec+ -1.00900 0.05217 -19.342 <2e-16 *** educupper -0.40485 0.02956 -13.696 <2e-16 *** resSuva -0.05997 0.02819 -2.127 0.0334 * resurban 0.06204 0.02442 2.540 0.0111 * --- Null deviance: 3731.5 on 69 degrees of freedom Residual deviance: 2646.5 on 64 degrees of freedom AIC: Inf Need to account for different population sizes in each area/group unless data are from same-size populations

Assessing model fit 1. Examine AIC score – smaller is better 2. Examine the deviance as an approximate goodness of fit test • Expect the residual deviance/degrees of freedom to be approximately 1 > ceb2$deviance/ceb2$df.residual [1] 41.35172 3. Compare residual deviance to a 2 distribution > pchisq(2646.5, 64, lower=F) [1] 0

Model fitting: analysis of deviance • Similar to logistic regression, we want to compare the differences in the size of residuals between models > ceb1<-glm(y~educ, family=“poisson", offset=log(n), data= ceb) > ceb2<-glm(y~educ+res, family=“poisson", offset=log(n), data= ceb) > 1-pchisq(deviance(ceb1)-deviance(ceb2), df.residual(ceb1)-df.residual(ceb2)) [1] 0.0007124383 • Since the p-value is small, there is evidence that the addition of res explains a significant amount (more) of the deviance

Overdispersion in Poission models • A characteristic of the Poisson distribution is that its mean is equal to its variance • Sometimes the observed variance is greater than the mean • Known as overdispersion • Another common problem with Poisson regression is excess zeros • Are more zeros than a Poisson regression would predict

Overdispersion • Use family=“quasipoisson” instead of “poisson” to estimate the dispersion parameter • Doesn't change the estimates for the coefficients, but may change their standard errors > ceb2<-glm(y~educ+res, family="quasipoisson", offset=log(n), data=ceb)

Poisson vs. quasipoisson Family = “poisson” Family = “quasipoisson” Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.43029 0.01795 79.691 <2e-16 *** educnone 0.21462 0.02183 9.831 <2e-16 *** educsec+ -1.00900 0.05217 -19.342 <2e-16 *** educupper -0.40485 0.02956 -13.696 <2e-16 *** resSuva -0.05997 0.02819 -2.127 0.0334 * resurban 0.06204 0.02442 2.540 0.0111 * --- (Dispersion parameter for poisson family taken to be 1) Null deviance: 3731.5 on 69 degrees of freedom Residual deviance: 2646.5 on 64 degrees of freedom Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.43029 0.10999 13.004 < 2e-16 *** educnone 0.21462 0.13378 1.604 0.11358 educsec+ -1.00900 0.31968 -3.156 0.00244 ** educupper -0.40485 0.18115 -2.235 0.02892 * resSuva -0.05997 0.17277 -0.347 0.72965 resurban 0.06204 0.14966 0.415 0.67988 --- (Dispersion parameter for quasipoisson taken to be 37.55359) Null deviance: 3731.5 on 69 degrees of freedom Residual deviance: 2646.5 on 64 degrees of freedom

Models for overdispersion • When overdispersion is a problem, use a negative binomial model • Will adjust  estimates and standard errors > install.packages(“MASS”) > library(MASS) > ceb.nb <- glm.nb(y~educ+res+offset(log(n)), data= ceb) OR > ceb.nb<-glm.nb(ceb2) > summary(ceb.nb)

NB model in R glm.nb(formula = ceb2, x = T, init.theta = 3.38722121141125, link = log) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.490043 0.160589 9.279 < 2e-16 *** educnone 0.002317 0.183754 0.013 0.98994 educsec+ -0.630343 0.200220 -3.148 0.00164 ** educupper -0.173138 0.184210 -0.940 0.34727 resSuva -0.149784 0.165622 -0.904 0.36580 resurban 0.055610 0.165391 0.336 0.73670 --- (Dispersion parameter for Negative Binomial(3.3872) family taken to be 1) Null deviance: 85.001 on 69 degrees of freedom Residual deviance: 71.955 on 64 degrees of freedom AIC: 740.55 Theta: 3.387 Std. Err.: 0.583 2 x log-likelihood: -726.555 > ceb.nb$deviance/ceb.nb$df.residual [1] 1.124297

What if your data looked like…

Zero-inflated Poisson model (ZIP) • If you have a large number of 0 counts… > install.packages(“pscl”) > library(pscl) > ceb.zip <- zeroinfl(y~educ+res, offset=log(n), data= ceb)

Logistic Regression