Logistic Regression
Linear Regression Review
FEV1 = β0 + β1Age + β2Height
E(FEV1) = μ = β0 + β1Age + β2Height
where the data are assumed to be normally distributed with mean equal to μ.
Models such as these are appropriate for continuous outcome measures such as FEV1, weight, and blood pressure.
What if our outcome is binary?
Common binary outcome measures • Healthy vs unhealthy • E.g., heart disease (y/n), Cancer (y/n), COPD (y/n) • Progressive disease vs stable disease • Based on, e.g., cancer stage • Alive vs dead
Convenient coding of binary outcomes
• COPD = 0 if FEV1/FVC ≥ 0.70
       = 1 if FEV1/FVC < 0.70
• Large = 0 if tumor size is “small”
        = 1 if tumor size is “large”
• Dead = 0 if alive
       = 1 if deceased
Note the use of 0/1 coding and of descriptive names that define “1”.
The Logistic Regression Model
• Consider the case of a binary indicator of vital status
• Dead = 0 if alive
       = 1 if deceased
• If Dead is coded 0/1, then its expected value is equal to the probability that Dead = 1, i.e., E(Dead) = P = probability of death
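A minimal Python sketch of why 0/1 coding is convenient: the mean of a 0/1 variable is the proportion of 1s, which is the sample analogue of E(Dead) = P. The data values below are made up for illustration.

```python
# Hypothetical vital-status data: 1 = deceased, 0 = alive (made up for illustration)
dead = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# The sample mean of a 0/1 variable...
mean_dead = sum(dead) / len(dead)

# ...is the same number as the proportion of observations with Dead = 1
prop_dead = dead.count(1) / len(dead)

print(mean_dead, prop_dead)
```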
The Logistic Regression Model Suppose we want to model the association between vital status and age… • If we fit the data using standard linear regression, our model would be of the form P = β0 + β1Age • That is, we assume the probability of death varies in a linear manner with age.
The Logistic Regression Model
When Age = 60, the estimated value of Dead is 0.6. Is this a sensible result? What if the predicted value is > 1 or < 0?
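To make the problem concrete, here is a small Python sketch of a straight-line probability model. The intercept and slope are hypothetical, chosen only so that the line passes through 0.6 at age 60 as on the slide; they are not estimates from any data.

```python
# Hypothetical fitted line, chosen so that Age = 60 gives 0.6 (not real estimates)
B0, B1 = -0.6, 0.02

def linear_prob(age):
    """Predicted 'probability' of death under the linear model P = B0 + B1*Age."""
    return B0 + B1 * age

for age in (20, 60, 90):
    print(age, round(linear_prob(age), 2))
```

With these assumed coefficients the prediction is negative at age 20 and above 1 at age 90, neither of which can be a probability.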
The Logistic Regression Model
• Logistic regression analysis is a tool for modeling binary data that overcomes some of the limitations of linear regression.
• Rather than assuming the data are normally distributed, which we know isn’t true, we first assume the data follow a binomial distribution, which implicitly assumes we have a series of 0/1 observations, each with probability P of being dead, i.e., Dead = 1.
The Logistic Regression Model
Rather than assuming P is a linear combination of variables of interest, e.g.,
P = β0 + β1Age + β2Male
we instead assume
P = exp(β0 + β1Age + β2Male) / [1 + exp(β0 + β1Age + β2Male)]
or equivalently,
ln[P/(1-P)] = β0 + β1Age + β2Male
The Logistic Regression Model
ln[P/(1-P)] = β0 + β1Age + β2Male
• The function ln[P/(1-P)] is referred to as the “logit” of P, hence the term “logistic” regression!
• Unlike the linear regression model, the logistic model has the desirable property that the implied value of P is always between 0 and 1, even though the logit itself can take any value from -∞ to +∞.
• The logit also turns out to have some statistical properties that make it a particularly desirable function of P to estimate.
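A short Python sketch of the logit and its inverse may help here. The function names `logit` and `inv_logit` are illustrative choices, not part of the slides. Whatever real number is fed to the inverse logit, the result stays strictly between 0 and 1.

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse of the logit: maps any real number x to a value between 0 and 1."""
    return 1 / (1 + math.exp(-x))

# The linear predictor can be any real number; the implied P is always a valid probability
for x in (-5, 0, 5):
    p = inv_logit(x)
    print(x, round(p, 4), round(logit(p), 4))
```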
The Logistic Regression Model: Interpretation of coefficients
ln[P/(1-P)] = β0 + β1Age + β2Male
• Recall that P/(1-P) is the odds of our outcome of interest, in this case death.
• Hence the logit of P is the same as the ln(odds) of death, and so the odds of death can be written
P/(1-P) = exp(β0 + β1Age + β2Male)
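The Logistic Regression Model: Interpretation of coefficients
OR (male vs female, at the same age) = odds(Male = 1) / odds(Male = 0)
 = exp(β0 + β1Age + β2) / exp(β0 + β1Age)
 = exp(β2)
=> β2 = ln(OR males vs females)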
The Logistic Regression Model: Interpretation of coefficients
Similarly, we can calculate the OR associated with an increase in age of 10 years (at the same sex) as
OR (10 yr incr in age) = exp(β0 + β1(Age + 10) + β2Male) / exp(β0 + β1Age + β2Male)
 = exp(10β1)
=> 10β1 = ln(OR 10 year increase in age)
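The two odds-ratio results can be checked numerically. The Python sketch below uses hypothetical coefficient values (they are not estimates from any data in these slides) and confirms that the ratio of odds depends only on the coefficient of the variable being changed.

```python
import math

# Hypothetical coefficients for ln[P/(1-P)] = B0 + B1*Age + B2*Male (not real estimates)
B0, B1, B2 = -7.0, 0.08, 0.5

def odds(age, male):
    """Odds of death implied by the model at a given age and sex."""
    return math.exp(B0 + B1 * age + B2 * male)

# Male vs female at the same age: everything cancels except exp(B2)
or_male = odds(60, 1) / odds(60, 0)

# 10-year increase in age at the same sex: everything cancels except exp(10*B1)
or_age10 = odds(70, 0) / odds(60, 0)

print(or_male, math.exp(B2))
print(or_age10, math.exp(10 * B1))
```

The same ratios result at any other starting age, which is what it means for the model to have a constant odds ratio per covariate.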
The Logistic Regression Model: Interpretation of coefficients
Odds ratio for smokers vs. never-smokers = e^b1 = 1.67, where b1 is the estimated coefficient for smk
STATA logistic regression output for: logit dead smk
         Coef.       Std. Err.   z       P>|z|   [95% Conf. Interval]
smk      .5118667    .28225      1.81    0.070   -.0413331   1.065067
_cons    -2.335375   .2468438    -9.46   0.000   -2.81918    -1.85157
OR = e^0.512 = 1.67
The Logistic Regression Model: Hypothesis testing and confidence intervals
• Testing H0: ln(OR) = b1 = 0 vs. Ha: b1 ≠ 0 is equivalent to testing H0: OR = e^b1 = 1 vs. Ha: e^b1 ≠ 1
• Use the large-sample normality of b1 to compute p-values and to construct confidence limits
• b1 / SE(b1) should look like a z-score under H0; use it to compute the p-value
• b1 ± 1.96 * SE(b1) is an approximate 95% confidence interval
STATA logistic regression output for: logit dead smk
         Coef.       Std. Err.   z       P>|z|   [95% Conf. Interval]
smk      .5118667    .28225      1.81    0.070   -.0413331   1.065067
_cons    -2.335375   .2468438    -9.46   0.000   -2.81918    -1.85157
The Logistic Regression Model: Computing CIs for the odds ratio
Because b1 is more nearly normally distributed than e^b1, we construct CIs for the ln(OR) and then exponentiate the limits to get the corresponding CI for the OR.
95% CI for ln(OR) = (-0.04, 1.07)
95% CI for OR = (e^-0.0413, e^1.0651) = (0.96, 2.90)
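As a check on the arithmetic, the z statistic, p-value, and both confidence intervals can be recomputed in a few lines of Python from the coefficient and standard error in the smk row of the Stata output. This is only a sketch of the Wald calculations described above. It uses 1.96 where Stata uses a slightly more precise normal quantile, so the last digits differ a little.

```python
import math

# Coefficient and standard error for smk, copied from the Stata output
b1, se = 0.5118667, 0.28225

# Wald z statistic and two-sided p-value from the standard normal, P(|Z| > |z|)
z = b1 / se
p = math.erfc(abs(z) / math.sqrt(2))

# Approximate 95% CI on the ln(OR) scale, then exponentiate each limit
lo, hi = b1 - 1.96 * se, b1 + 1.96 * se
or_hat, or_lo, or_hi = math.exp(b1), math.exp(lo), math.exp(hi)

print(round(z, 2), round(p, 3))
print(round(lo, 2), round(hi, 2))
print(round(or_hat, 2), round(or_lo, 2), round(or_hi, 2))
```

Note that the interval for the OR is not symmetric around 1.67, because it is symmetric on the log scale instead.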
Variations in software output
STATA logistic regression output: logit dead smk
         Coef.       Std. Err.   z       P>|z|   [95% Conf. Interval]
smk      .5118667    .28225      1.81    0.070   -.0413331   1.065067
_cons    -2.335375   .2468438    -9.46   0.000   -2.81918    -1.85157
Default output is on the log (coefficient) scale
STATA logistic regression output: logit dead smk, or
         Odds Ratio  Std. Err.   z       P>|z|   [95% Conf. Interval]
smk      1.668403    .4709067    1.81    0.070   .9595094    2.901032
Output requested on the transformed (odds ratio) scale
The Logistic Regression Model: Adjusting for potential confounder variables
Suppose we conduct a cross-sectional study to investigate the association between gender and COPD. If P is the probability of having COPD and Male is a 0/1 indicator of male sex, then we might fit the logistic model
ln[P/(1-P)] = β0 + β1Male
to assess the OR for COPD associated with male sex. Might this association be confounded by smoking status, and if so, how might we adjust for the potentially confounding effects of smoking?
The Logistic Regression Model: Adjusting for potential confounder variables
If ES is a 0/1 indicator of ever having smoked, we might fit the model
ln[P/(1-P)] = β0 + β1Male + β2ES
Under this model, we say the effect of male sex is now adjusted for the potentially confounding effect of having ever smoked. The resulting odds ratio is analogous to the pooled OR that you would get from a stratified 2x2 table analysis that crosses Male by COPD for each level of ES. We could adjust for additional potential confounders, including continuous variables, by adding them to the model as main effects.
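For readers who want to see what a pooled OR from a stratified 2x2 analysis looks like, here is a Python sketch using the Mantel-Haenszel estimator, one common choice of pooled estimator (the slides do not say which one is meant). All counts are made up for illustration, and the logistic regression estimate would be similar to, not identical to, this number.

```python
# Hypothetical 2x2 counts per smoking stratum (made up for illustration):
# (male & COPD, male & no COPD, female & COPD, female & no COPD)
strata = {
    "never smoked": (10, 90, 10, 190),
    "ever smoked": (60, 140, 15, 85),
}

def odds_ratio(a, b, c, d):
    """Odds ratio for male vs female from one 2x2 table."""
    return (a * d) / (b * c)

# Crude OR: collapse the strata into a single 2x2 table, ignoring smoking
totals = [sum(col) for col in zip(*strata.values())]
crude_or = odds_ratio(*totals)

# Mantel-Haenszel pooled OR: sum(a*d/n) / sum(b*c/n) over strata
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
mh_or = num / den

for name, cells in strata.items():
    print(name, round(odds_ratio(*cells), 2))
print("crude", round(crude_or, 2), "pooled", round(mh_or, 2))
```

In this invented example the crude OR is noticeably larger than the pooled OR because men are more likely to have smoked, which is the pattern confounding by smoking would produce.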
The Logistic Regression Model: Adjusting for potential effect modification
Now suppose we want to know whether smoking modifies the effect of male sex on COPD prevalence. In classical epidemiology this means we want to know if the OR associated with male sex varies by smoking status. How would we test for the presence of effect modification in our logistic model? As we learned previously, we use interaction terms!
The Logistic Regression Model: Adjusting for potential effect modification
ln[P/(1-P)] = β0 + β1Male + β2ES + β3Male*ES
β1 = ln(OR) for male sex in never smokers
β2 = ln(OR) for ever smoking in women
β3 = difference in ln(OR) for ever smoking between men & women
   = difference in ln(OR) for male sex between ever & never smokers
Testing H0: β3 = 0 is a test of whether there is effect modification.
Some Examples
STATA output from: logit dead smk, or
         Odds Ratio  Std. Err.   z       P>|z|   [95% Conf. Interval]
Smk      1.668403    .4709067    1.81    0.070   .9595094    2.901032
STATA output from: logit dead age smk, or
         Odds Ratio  Std. Err.   z       P>|z|   [95% Conf. Interval]
Smk      4.219142    1.397128    4.35    0.000   2.204738    8.074048
Age1     1.108467    .0147424    7.74    0.000   1.079946    1.137742
• The second model gives the OR for death associated with smoking after adjusting for age
• Note the change in the size of the smoking OR between the two models. What might explain this change?
Some Examples In the PAD Trial, (non-medical) lay volunteers were trained to respond to cardiac arrests in public and to perform CPR. Volunteers received retraining at various intervals to see how long it took before their CPR skills degraded to the point that they were unlikely to perform adequate CPR. We want to know 1) whether the amount of time between trainings is related to CPR quality, and 2) whether the relationship of CPR quality and time between trainings differs across age groups.
Some Examples
• We define the following variables:
• Response variable: cprok = 0 if CPR performed during testing was inadequate
                           = 1 if CPR performed during testing was adequate
• Predictor variable: agegt50 = 0 if age is < 50
                              = 1 if age is > 50
• Predictor variable: late = 0 if volunteer was tested/retrained ≤ 7 months after initial training
                           = 1 if volunteer was tested/retrained > 7 months after initial training
Some Examples
STATA output from: logit cprok late agegt50, or robust
          Odds Ratio  Robust Std. Err.  z       P>|z|   [95% Conf. Interval]
late      0.932       0.094             -0.69   0.490   0.7646   1.1372
agegt50   0.4129      0.0486            -7.51   0.000   0.3278   0.5201
STATA output from: logit cprok late if agegt50==1, or robust
          Odds Ratio  Robust Std. Err.  z       P>|z|   [95% Conf. Interval]
late      1.3303      0.2252            1.69    0.092   0.9546   1.8538
STATA output from: logit cprok late if agegt50==0, or robust
          Odds Ratio  Robust Std. Err.  z       P>|z|   [95% Conf. Interval]
late      0.74        0.0976            -2.25   0.025   0.5764   0.9631
Some Examples
STATA output from: logit cprok late agegt50 agelate, or robust (agelate = late*agegt50)
          Odds Ratio  Robust Std. Err.  z       P>|z|   [95% Conf. Interval]
Late      0.7709      0.1090            -1.84   0.066   0.5843   1.0170
agegt50   0.4129      0.0486            -7.51   0.000   0.3278   0.5201
agelate   1.6661      0.4134            2.06    0.040   1.0243   2.7099
We reject the null hypothesis of no interaction (p = 0.040) and conclude that the impact of time between retrainings on CPR performance differs significantly between those over and under the age of 50.
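To read stratum-specific effects off an interaction model, add the coefficients on the log scale, which is the same as multiplying the odds ratios. The Python sketch below does this with the Late and agelate odds ratios from the output above. It illustrates the arithmetic only and gives no confidence interval for the combined OR, which would also need the covariance of the two estimates.

```python
import math

# Odds ratios copied from the interaction-model output
or_late, or_agelate = 0.7709, 1.6661

# Work on the log-odds (coefficient) scale, where effects add
b_late, b_agelate = math.log(or_late), math.log(or_agelate)

# agegt50 = 0: the interaction term drops out, so the OR for late is exp(b_late)
or_late_under50 = math.exp(b_late)

# agegt50 = 1: the OR for late is exp(b_late + b_agelate)
or_late_over50 = math.exp(b_late + b_agelate)

print(round(or_late_under50, 2), round(or_late_over50, 2))
```

The estimated OR for late retraining is below 1 in the younger group and above 1 in the older group, which is the difference the interaction test is detecting.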