
Introduction to Biostatistics (ZJU 2008)




  1. Introduction to Biostatistics (ZJU 2008) Wenjiang Fu, Ph.D Associate Professor Division of Biostatistics, Department of Epidemiology Michigan State University East Lansing, Michigan 48824, USA Email: fuw@msu.edu www: http://www.msu.edu/~fuw

  2. Logistic regression model • Why use logistic regression? • Estimation by maximum likelihood • Interpretation of coefficients • Hypothesis testing • Evaluating the performance of the model

  3. Why logistic regression? • Many important research topics have a binary dependent variable, e.g. disease vs no disease, damage vs no damage, death vs survival, etc. • Binary logistic regression is a type of regression analysis in which the dependent variable is a dummy variable, coded 0 (absence of disease) or 1 (presence of disease), etc. • To explain the variability of the binary variable by other variables, either continuous or categorical (such as age, sex, BMI, marital status, socio-economic status), use a statistical model that relates the probability of the response event to the explanatory variables.

  4. Logistic Regression Model • Event (Y = 1), no event (Y = 0); we want to model the mean: E(Y) = P(Y=1)·1 + P(Y=0)·0 = P(Y=1). But π = P(Y=1) is bounded between 0 and 1, while the linear predictor β0 + x1β1 + x2β2 + x3β3 + … is a linear combination and may take any value. The "logit" model solves this problem. Single independent variable: ln[p/(1-p)] = α + βX, where the probability p = P(Y=1 | X) • p/(1-p) is the "odds": the ratio of the probability that the event occurs to the probability that it does not, given X • ln[p/(1-p)] is the "log odds" or "logit"

  5. More: • The logistic distribution constrains the estimated probabilities to lie between 0 and 1. • The estimated probability is p = 1 / [1 + exp(-α - βX)] = exp(α + βX) / [1 + exp(α + βX)] • if α + βX = 0, then p = 0.50 • as α + βX gets really big, p approaches 1 • as α + βX gets really small (large and negative), p approaches 0
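These limits can be checked numerically in R. A small sketch with hypothetical values of the intercept and slope (a = -2, b = 0.5 are made up for illustration):

```r
# Logistic function p = exp(a + b*x) / (1 + exp(a + b*x)),
# with hypothetical intercept a and slope b
a <- -2
b <- 0.5
x <- c(-10, 0, 4, 20)
p <- exp(a + b*x) / (1 + exp(a + b*x))  # same as plogis(a + b*x)
round(p, 3)  # 0.001 0.119 0.500 1.000: always in (0, 1); p = 0.5 when a + b*x = 0
```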

  6. Comparing Linear Regression and Logit Models • [Figure omitted: binary outcome Y (Y=0, Y=1) plotted against X, with the straight linear regression line and the S-shaped logit curve]

  7. Maximum Likelihood Estimation (MLE) • MLE is a statistical method for estimating the coefficients of a model. • The likelihood function L measures the probability of observing the particular set of dependent variable values (Y1 = y1, …, Yn = yn): L = Prob(Y1 = y1, …, Yn = yn). The higher L is, the higher the probability of observing the sample data at hand.

  8. Maximum Likelihood Estimator • MLE involves finding the coefficients (α, β) that make the log of the likelihood function as large as possible (the log-likelihood is always < 0 here) • The maximum likelihood estimator maximizes the likelihood function L(α, β) = Πi pi^yi (1 - pi)^(1 - yi), where pi = exp(α + βXi) / [1 + exp(α + βXi)]

  9. Maximum Likelihood Estimator • Or equivalently, the MLE maximizes the log-likelihood function logLik(α, β) = Σi [ yi log(pi) + (1 - yi) log(1 - pi) ], where pi = exp(α + βXi) / [1 + exp(α + βXi)] (the logit link function) • The MLE is a biased estimator, but consistent: by large-sample theory, the estimator converges to the true model parameters fast enough as the sample size increases.
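As a sketch, the log-likelihood above can be maximized numerically and compared against R's glm. The data are simulated, with hypothetical true values α = -0.5 and β = 1:

```r
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.5 + 1 * x))  # simulate from a logistic model

# log-likelihood: sum over i of y*log(p) + (1 - y)*log(1 - p)
loglik <- function(theta) {
  p <- plogis(theta[1] + theta[2] * x)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

mle <- optim(c(0, 0), loglik, control = list(fnscale = -1))$par  # maximize
fit <- glm(y ~ x, family = binomial)
# mle and coef(fit) agree to numerical tolerance
```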

  10. Interpretation of Coefficients • Since ln[p/(1-p)] = α + βX, the slope β is interpreted as the rate of change in the "log odds" as X changes … not very useful by itself. • More useful: p = Pr(Y=1|X), and p/(1-p) is the odds given condition X. • Example. X: smoker; Y = 1: lung cancer. p/(1-p) is the odds of lung cancer with respect to smoking status.

  11. Interpretation of Coefficients • If X is continuous, exp(β) measures the change in the odds for a one-unit increase in X. • ln(OR) = ln[odds(X+1)] - ln[odds(X)] = ln[ p(Y=1|X+1) / (1 - p(Y=1|X+1)) ] - ln[ p(Y=1|X) / (1 - p(Y=1|X)) ] = α + β(X+1) - (α + βX) = β • ln(OR) = β, so OR = exp(β)

  12. Interpretation of Coefficients • If X is binary, exp(β) measures the odds for one group relative to the other group: ln[ p(Y=1|X=1) / (1 - p(Y=1|X=1)) ] - ln[ p(Y=1|X=0) / (1 - p(Y=1|X=0)) ] = α + β·1 - (α + β·0) = β • ln(OR) = β, so OR = exp(β)
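This identity can be verified in R: with a single binary X the logistic model is saturated, so exp(β̂) equals the sample odds ratio from the 2×2 table. The counts below are hypothetical:

```r
# Hypothetical 2x2 data: 100 exposed (X = 1) with 30 events,
# 100 unexposed (X = 0) with 10 events
X <- rep(c(1, 0), times = c(100, 100))
Y <- rep(c(1, 0, 1, 0), times = c(30, 70, 10, 90))

or_table <- (30/70) / (10/90)        # sample odds ratio = 27/7 = 3.857...
fit <- glm(Y ~ X, family = binomial)
or_model <- exp(coef(fit)[["X"]])    # exp(beta) from the logistic fit
# or_model equals or_table (up to convergence tolerance)
```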

  13. Interpretation of parameter β Model : logit (p) = β0 +x1β1

  14. Model Assessment • Several statistics can be used for comparing alternative models or evaluating the performance of a single model: • LRT and Wald tests • Percent correct predictions

  15. Model Chi-Square Statistic • The model likelihood ratio test (LRT) statistic is LR = [-2 logLik(reduced model)] - [-2 logLik(full model)]. Example: to test β, LR = -2 [logLik(α) - logLik(α, β)], where logLik(α) is the log-likelihood of the model with only the intercept, and logLik(α, β) that of the model with the intercept and X • The LR statistic follows a chi-square distribution with r degrees of freedom, where r is the difference in the number of parameters between the two models • Use the LRT statistic to determine whether the overall model is statistically significant.
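A sketch of this test in R with simulated data; anova(..., test = "Chisq") computes the same deviance difference, since the deviance is -2 times the log-likelihood up to a constant that cancels:

```r
set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 0.8 * x))   # simulated binary outcome

reduced <- glm(y ~ 1, family = binomial)    # intercept only
full    <- glm(y ~ x, family = binomial)    # intercept and X

LR   <- deviance(reduced) - deviance(full)  # [-2 logLik(reduced)] - [-2 logLik(full)]
pval <- pchisq(LR, df = 1, lower.tail = FALSE)
anova(reduced, full, test = "Chisq")        # same LR statistic and p-value
```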

  16. Percent Correct Predictions • Predicted outcome (majority vote method): if the predicted probability p̂(x) ≥ 0.5, assign ŷ = 1; otherwise assign ŷ = 0 • Compare the predicted outcome ŷ with the actual outcome y and compute the percentage of correct predictions.
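A sketch in R (simulated data): classify with the 0.5 cutoff and tabulate predicted against actual outcomes:

```r
set.seed(3)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.5 * x))  # simulated binary outcome

fit  <- glm(y ~ x, family = binomial)
phat <- fitted(fit)                  # predicted probabilities
yhat <- ifelse(phat >= 0.5, 1, 0)    # majority vote: 1 if phat >= 0.5

mean(yhat == y)                      # percent correct predictions
table(predicted = yhat, actual = y)  # confusion table
```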

  17. An Example:

  18. Testing significance of variables • Omitted variables can bias the coefficient estimates. To test for omitted variables, conduct a likelihood ratio test: LR[q] = [-2 logLik(constrained model, k - q parameters)] - [-2 logLik(unconstrained model, k parameters)], where LR follows a chi-square distribution with q degrees of freedom, q being the number of omitted variables

  19. An Example: B/se(B)

  20. Constructing the LR Test • "Since the chi-square value is less than the critical value, the set of coefficients is not statistically significant. The full model is not an improvement over the partial model."

  21. Multiple Logistic Regression • Probability of an event, recorded as a binary outcome • Logistic regression model (logit link): log[π/(1-π)] = β0 + x1β1 + … + xpβp = η, equivalent to π = exp(η) / [1 + exp(η)] • Why multiple logistic regression? A simple R×C table cannot solve all problems, and can be misleading.

  22. Multiple Logistic Regression: Formulation • The relationship between π and x is S-shaped • The logit (log-odds) transformation is the link function: logit(π) = log[π/(1-π)] = β0 + x1β1 + … + xpβp

  23. Multiple Logistic Regression: Assessing Risk Factors • Individually: H0: βk = 0 • Globally: H0: βm = … = βm+t = 0 • in both cases while controlling for confounders and other important determinants of the event

  24. Interpretation of the parameters • If π is the probability of an event, then the odds for that event are O = π / (1 - π) • The link function in logistic regression gives the log-odds: log(O) = log[π/(1-π)] = β0 + x1β1 + … + xpβp

  25. Example: Snoring & Heart Disease • An epidemiologic study surveyed 2484 subjects to examine whether snoring was a possible risk factor for heart disease.

  26. Constructing Indicator variables • Let Z1=1 if occasional, 0 otherwise • Let Z2=1 if nearly every night, 0 otherwise • Let Z3=1 if every night, 0 otherwise
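The same indicator coding can be done in R (the snoring vector below just lists the four levels for illustration):

```r
snoring <- c("never", "occa", "nearly", "every")  # hypothetical level labels
Z1 <- as.numeric(snoring == "occa")               # 1 if occasional
Z2 <- as.numeric(snoring == "nearly")             # 1 if nearly every night
Z3 <- as.numeric(snoring == "every")              # 1 if every night
cbind(snoring, Z1, Z2, Z3)  # "never" is the reference group: Z1 = Z2 = Z3 = 0
```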

  27. SAS Code

data hd;
input hd $ snoring $ count;
Z1=(snoring="occa");
Z2=(snoring="nearly");
Z3=(snoring="every");
cards;
yes never 24
yes occa 35
yes nearly 21
yes every 30
no never 1355
no occa 603
no nearly 192
no every 224
;
run;

proc logistic data=hd descending;
model hd (event='yes') = Z1 Z2 Z3;
freq count;
run;

  28. SAS Output

Response Profile
Ordered           Total
Value    hd   Frequency
1        yes        110
2        no        2374

Probability modeled is hd='yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
            Intercept   Intercept and
Criterion        Only      Covariates
AIC           902.827         842.923
SC            908.645         866.193
-2 Log L      900.827         834.923

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio      65.9045    3       <.0001
Score                 72.7821    3       <.0001
Wald                  58.9513    3       <.0001

  29. The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
                         Standard        Wald
Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept   1   -4.0335    0.2059    383.6641      <.0001
Z1          1    1.1869    0.2695     19.3959      <.0001
Z2          1    1.8205    0.3086     34.8027      <.0001
Z3          1    2.0231    0.2832     51.0313      <.0001

Odds Ratio Estimates
            Point          95% Wald
Effect   Estimate  Confidence Limits
Z1          3.277    1.932     5.558
Z2          6.175    3.373    11.306
Z3          7.561    4.341    13.172

  30. Calculating Probabilities • The fitted logistic regression function is logit(π) = -4.0335 + 1.1869 Z1 + 1.8205 Z2 + 2.0231 Z3 • So the probability of heart disease for someone who never snores is exp(-4.0335) / [1 + exp(-4.0335)] = 0.0174 • For occasional snorers, exp(-4.0335 + 1.1869) / [1 + exp(-4.0335 + 1.1869)] = 0.0549
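These probabilities can be reproduced directly from the fitted estimates (plogis is R's inverse-logit function):

```r
# Fitted coefficients from the snoring model
b <- c(int = -4.0335, Z1 = 1.1869, Z2 = 1.8205, Z3 = 2.0231)

p_never <- plogis(b[["int"]])              # 0.0174: never snores
p_occa  <- plogis(b[["int"]] + b[["Z1"]])  # 0.0549: snores occasionally
```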

  31. Calculating Odds Ratios • If Z1 = Z2 = Z3 = 0, the odds are exp(-4.0335) • If Z2 = Z3 = 0 but Z1 = 1, the odds are exp(-4.0335 + 1.1869) • The ratio of these odds is exp(1.1869) = 3.2769 • Interpretation: compared with people who never snore, the odds of developing heart disease for occasional snorers are 3.28 times as high. • What is the odds ratio comparing those who snore nearly every night with occasional snorers? • What is the odds ratio comparing those who snore every night with those who snore nearly every night?
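The two questions at the end can be answered the same way: in a logit model, the log odds ratio between two groups is the difference of their coefficients.

```r
exp(1.1869)           # occasional vs never: 3.277 (as above)
exp(1.8205 - 1.1869)  # nearly every night vs occasional: about 1.88
exp(2.0231 - 1.8205)  # every night vs nearly every night: about 1.22
```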

  32. Example: Genetic Association Study • Idiopathic Pulmonary Fibrosis (IPF) is known to be associated with age and gender (older people and males are more likely to be affected) • One study, with 174 cases and 225 controls, found an association of IPF with the genotype of one gene, COX2.8473 (C → T). • P-value by Pearson chi-square test: p = 0.0241. • Q: Is this association true?

Genotype    CC    CT   TT  Total
Case        88    72   14    174
Control     84   113   28    225
Total      172   185   42    399

  33. Example on genetic study • Logistic regression model: logit[Pr(IPF)] = intercept + SNP + sex + age • Results:

Effect  DF  Wald Chi-Square  P-value
SNP      2          2.7811    0.2489
sex      1          9.1172    0.0025
age      1        100.454     <.0001

  34. Example on genetic study • Investigate why this happens, by age and sex • Disease (N = normal; D = diseased) by age:

        0-29  30-49  50-64  65-74  75+
N        104     77     35      7    2
D          0     10     42     68   54
Total    104     87     77     75   56

  35. Example on genetic study • Disease (N = normal; D = diseased) by sex:

        male  female
N         72     153
D        108      66
Total    180     219

  36. Example on genetic study • SNP genotype by sex:

        male  female
CC        79      93
CT        75     110
TT        26      16
Total    180     219

Pearson chi-square test: X2 = 6.3911, p = 0.0409

  37. Example on genetic study • Age class by genotype:

         CC   CT  TT
0-29     36   58  10
30-49    34   42  11
50-64    35   32  10
65-74    37   30   8
75+      30   23   3
Total   172  185  42

Pearson chi-square test: X2 = 10.01, p = 0.2643

  38. Loglinear regression model • Why use loglinear regression? Observations are often counts (≥ 0), together with the number of people exposed to risk (the total population at risk) • Estimation by maximum likelihood (maximizing the log-likelihood), with the log link function

  39. R Program for Logistic and Loglinear Models • Both logistic and loglinear models are special cases of generalized linear models (GLMs) • R fits GLMs with the function 'glm', using different family/link choices: • Logistic regression: binomial family • Loglinear regression: Poisson family • Logistic: fit <- glm(as.factor(y) ~ x, family = binomial(link = logit)) • Loglinear: fit <- glm(y ~ x, family = poisson(link = log))

  40. R Output for Logistic and Loglinear Models

> fit
Call:  glm(formula = y ~ x, family = binomial(link = logit))

Coefficients:
(Intercept)         x1         x2         x3         x4
  -0.725709   0.014827   0.015248   0.007396   0.041097

Degrees of Freedom: 19 Total (i.e. Null);  15 Residual
Null Deviance:      24.43
Residual Deviance:  21.22    AIC: 31.22

  41. R Output for Logistic

> summary(fit)
Call:
glm(formula = y ~ x, family = binomial(link = logit))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.3708 -0.9643 -0.3896  1.2518  1.6197

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.725709   1.988678  -0.365    0.715
x1           0.014827   0.021880   0.678    0.498
x2           0.015248   0.022781   0.669    0.503
x3           0.007396   0.018840   0.393    0.695
x4          -0.041097   0.030172  -1.362    0.173

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 24.435 on 19 degrees of freedom
Residual deviance: 21.223 on 15 degrees of freedom

  42. Disease Rate (per 10^5) and Frequency • [Table omitted: disease rates and frequencies by year (period), age, and birth cohort]

  43. Statistical Models • Loglinear model: log(ERij) = log(nij) + μ + αi + βj + γij • μ: intercept; ERij: expected frequency in cell (i, j); nij: population-years of exposure (the offset term) • α1, …, αa: row (age) effects, with Σ(i=1 to a) αi = 0 • β1, …, βp: column (period) effects, with Σ(j=1 to p) βj = 0 • γij: interaction effects, with Σ γij = 0

  44. Statistical Models • APC loglinear model: log(ERij) = log(nij) + μ + αi + βj + γk • μ: intercept; ERij: expected frequency in cell (i, j); nij: population-years of exposure (the offset term) • α1, …, αa: row (age) effects, with Σ(i=1 to a) αi = 0 • β1, …, βp: column (period) effects, with Σ(j=1 to p) βj = 0 • γ1, …, γ(a+p-1): diagonal (cohort) effects, with Σ(k=1 to a+p-1) γk = 0
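A sketch of fitting such a loglinear model in R with glm: Poisson family, log link, and log(n) as the offset. The 3×3 counts are hypothetical, only age and period effects are included here, and the sum-to-zero constraints from the slide are requested via contr.sum:

```r
age    <- factor(rep(1:3, each  = 3))     # rows of a hypothetical 3x3 table
period <- factor(rep(1:3, times = 3))     # columns
y <- c(5, 8, 12, 20, 28, 35, 40, 55, 70)  # hypothetical event counts
n <- rep(1e5, 9)                          # population-years at risk per cell

fit <- glm(y ~ age + period, family = poisson(link = log),
           offset = log(n),
           contrasts = list(age = "contr.sum", period = "contr.sum"))
coef(fit)  # mu, alpha_1, alpha_2, beta_1, beta_2 (last levels follow from sum-to-zero)
```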

  45. Problem • Linear dependency among covariates: Period – Age = Cohort • Rows, columns and diagonals on a Lexis diagram • Multiple estimators !

  46. Disease Rate (per 10^5) and Frequency • [Table omitted: disease rates and frequencies by year and age]

  47. Matrix form • Models in matrix form: log(ER) = log(n) + Xb, with b = (μ, α1, …, α(a-1), β1, …, β(p-1), γ1, …, γ(a+p-2))^T • X is a singular design matrix: rank is 1 less than full rank

  48. Matrix form • Regression design matrix X for the APC model with a 3x3 table:

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
 [1,]    1    0    1    0    0    0    1    0
 [2,]    1    0    0    1    0    0    0    1
 [3,]    1    0   -1   -1   -1   -1   -1   -1
 [4,]    0    1    1    0    0    1    0    0
 [5,]    0    1    0    1    0    0    1    0
 [6,]    0    1   -1   -1    0    0    0    1
 [7,]   -1   -1    1    0    1    0    0    0
 [8,]   -1   -1    0    1    0    1    0    0
 [9,]   -1   -1   -1   -1    0    0    1    0

Eigenvalues of X^T X:
12.16 9.23 4.98 3.35 3.00 2.15 1.12 0.00

  49. R Program to Generate the APC Matrix

apcmat1 <- function (a = 3, p = 3) {
  ## construct APC matrix for APC analysis
  x <- matrix(0, a*p, 2*(a+p-2))
  gammac <- rbind(diag(1, a+p-2), -1)
  for (i in 1:(a-1)) {
    x[(i-1)*p + (1:p), i] <- 1                         # for alpha
    x[(i-1)*p + (1:(p-1)), a:(a+p-2)] <- diag(1, p-1)  # for beta
    x[i*p, a:(a+p-2)] <- -1
    for (j in 1:p) x[(i-1)*p + j, -1:-(a+p-2)] <- gammac[a-i+j, ]
  }
  # last block (last row of the APC table) for gamma
  for (j in 1:p) x[(a-1)*p + j, -1:-(a+p-2)] <- gammac[j, ]
  gammac <- NULL
  x[(a-1)*p + (1:p), 1:(a-1)] <- -1                    # last block for alpha
  x[(a-1)*p + (1:(p-1)), a:(a+p-2)] <- diag(1, p-1)    # last block for beta
  x[a*p, a:(a+p-2)] <- -1
  x
}
