April 6
• Logistic Regression
• Estimating probability based on logistic model
• Testing differences among multiple groups
• Assumptions for model
Logistic regression equation
Model log odds of outcome as a linear function of one or more variables
Xi = predictors, independent variables
b is the increase in log odds for a 1-unit increase in X
e^b is the relative odds for a 1-unit increase in X
The model is: log(p / (1 - p)) = b0 + b1x1 + b2x2
Logistic Regression Prediction: Estimating Probability of Y=1
Goal: Estimate p for a set of X values
The model is: log(p / (1 - p)) = b0 + b1x1 + b2x2
Solve for p:
p = ODDS / (1 + ODDS) = exp(b0 + b1x1 + b2x2) / (1 + exp(b0 + b1x1 + b2x2))
Steps in Estimating p
• Pick values for x1, x2, …, xp
• Compute log odds for your values of Xs using results
  • LO = b0 + b1x1 + b2x2 + … + bpxp
• Exponentiate LO to get odds
  • Odds = EXP(LO)
• Compute estimate of p
  • p = ODDS/(ODDS + 1)
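The three steps above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `estimate_probability` and the coefficient values are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import math

def estimate_probability(intercept, coefs, xs):
    """Return (log_odds, odds, p) for one set of predictor values."""
    # Step 1: log odds is the linear predictor b0 + b1*x1 + ... + bp*xp
    log_odds = intercept + sum(b * x for b, x in zip(coefs, xs))
    # Step 2: exponentiate the log odds to get the odds
    odds = math.exp(log_odds)
    # Step 3: convert odds to a probability
    p = odds / (odds + 1)
    return log_odds, odds, p

# Hypothetical coefficients (not from any fitted model):
# b0 = -1.0, b1 = 0.5, b2 = 0.25, evaluated at x1 = 2, x2 = 0
lo, odds, p = estimate_probability(-1.0, [0.5, 0.25], [2.0, 0.0])
print(lo, odds, p)
```

With these made-up values the log odds are 0, so the odds are 1 and p is one half, which is a quick sanity check on the conversion.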
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

                          Standard        Wald
Parameter   DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept    1   -6.0621    1.2884     22.1395      <.0001
AGE          1    0.0605    0.0223      7.3310      0.0068
women        1   -0.3967    0.3166      1.5701      0.2102

log(odds) = -6.0621 + 0.0605*age - 0.3967*women

What is the estimated probability of CVD for a man 60 years old?
log(odds) = -6.0621 + 0.0605(60) - 0.3967(0) = -2.4321
Odds = exp(-2.4321) = 0.0878
Prob = 0.0878 / (1 + 0.0878) = 0.0808

How old does a woman have to be to have the same risk?
Each year of age increases log(odds) by 0.0605
Being female decreases log(odds) by 0.3967
Compute 0.3967/0.0605 = 6.6, so a woman would have to be 66.6 years old to have P = 0.0808
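The worked example can be re-checked with a short Python sketch that uses the coefficients from the output above. The function and variable names are hypothetical. The key step is that the female coefficient is offset by 0.3967/0.0605 extra years of age.

```python
import math

# Coefficients from the PROC LOGISTIC output above
B0, B_AGE, B_WOMEN = -6.0621, 0.0605, -0.3967

def cvd_probability(age, women):
    """Estimated probability of CVD; women is 1 for a woman, 0 for a man."""
    log_odds = B0 + B_AGE * age + B_WOMEN * women
    odds = math.exp(log_odds)
    return odds / (1 + odds)

p_man_60 = cvd_probability(60, 0)

# Years of age needed to cancel the female coefficient in the log odds
extra_years = -B_WOMEN / B_AGE
age_woman = 60 + extra_years
p_woman = cvd_probability(age_woman, 1)

print(round(p_man_60, 4), round(extra_years, 1), round(p_woman, 4))
```

Because the offset is computed on the log-odds scale, the two probabilities agree up to floating-point rounding.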
Getting Odds Ratio for Differences Other Than 1

PROC LOGISTIC DATA=temp DESCENDING;
  MODEL clinical = age women / CLODDS=WALD;
  UNITS age = 5 women = 1;
RUN;

SAS OUTPUT
Wald Confidence Interval for Adjusted Odds Ratios

Effect      Unit  Estimate  95% Confidence Limits
AGE       5.0000     1.353      1.087       1.685
women     1.0000     0.673      0.362       1.251

The AGE estimate is EXP(5*0.0605)
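As a rough check on the table above, the odds ratio for a k-unit difference is exp(k*b), and a Wald interval exponentiates k*(b ± 1.96*SE). The sketch below assumes the estimates and standard errors from the earlier age and women model; the function name is hypothetical, and small differences from the SAS limits come from using rounded inputs.

```python
import math

def odds_ratio_for_units(beta, se, units, z=1.96):
    """Odds ratio and Wald 95% limits for a difference of `units` in X."""
    point = math.exp(units * beta)
    lower = math.exp(units * (beta - z * se))
    upper = math.exp(units * (beta + z * se))
    return point, lower, upper

# Estimates and standard errors taken from the age + women model output
or_age5 = odds_ratio_for_units(0.0605, 0.0223, 5)    # 5-year difference in age
or_women = odds_ratio_for_units(-0.3967, 0.3166, 1)  # women vs men

print([round(v, 3) for v in or_age5])
print([round(v, 3) for v in or_women])
```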
Testing Differences Among Multiple Groups Using Logistic Regression
• Ho: p1 = p2 = p3
• Ha: pi not all equal
• Can test using logistic regression, since if the p's are equal then the log odds are equal
• Can code in SAS two ways
  • Create dummy (design) variables to represent the groups
  • Use a CLASS statement under PROC LOGISTIC
TOMHS Example: Is CVD Rate Equal in Four Clinical Centers?
• Ho: p1 = p2 = p3 = p4
• SAS CODE in datastep (create own design variables):

DATA temp;
  SET tomhs.bpstudy;
  clinicA = 0; clinicB = 0; clinicC = 0; clinicD = 0;
  if clinic = 'A' then clinicA = 1;
  else if clinic = 'B' then clinicB = 1;
  else if clinic = 'C' then clinicC = 1;
  else if clinic = 'D' then clinicD = 1;
RUN;
Do Simple Analyses First

PROC MEANS N MEAN SUM MIN MAX DATA=temp;
  CLASS clinic;
  VAR clinical;
RUN;

Analysis Variable : CLINICAL Indicator - Clinical Endpoint

CLINIC  N Obs    N       Mean         Sum  Minimum    Maximum
-------------------------------------------------------------
A         195  195  0.0974359  19.0000000        0  1.0000000
B         251  251  0.0517928  13.0000000        0  1.0000000
C         296  296  0.0472973  14.0000000        0  1.0000000
D         160  160  0.0312500   5.0000000        0  1.0000000

The relative odds (A/D) should be about 3.
All betas should be > 0
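The "about 3" figure can be checked directly from the event counts in the PROC MEANS table. This Python sketch is only an illustration, and its names are hypothetical. It computes each clinic's crude odds and its relative odds against clinic D; a log relative odds above zero corresponds to a positive beta.

```python
import math

# (CVD events, participants) per clinic, from the PROC MEANS output above
clinic_counts = {"A": (19, 195), "B": (13, 251), "C": (14, 296), "D": (5, 160)}

def crude_odds(events, n):
    """Odds of the event: events divided by non-events."""
    return events / (n - events)

reference_odds = crude_odds(*clinic_counts["D"])

# Relative odds of each clinic versus clinic D, and the matching log odds ratio
relative_odds = {c: crude_odds(e, n) / reference_odds
                 for c, (e, n) in clinic_counts.items()}
log_relative_odds = {c: math.log(ro) for c, ro in relative_odds.items()}

for clinic in "ABC":
    print(clinic, round(relative_odds[clinic], 3), round(log_relative_odds[clinic], 4))
```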
PROC LOGISTIC CODE

* Using class statement;
PROC LOGISTIC DATA=TEMP DESCENDING SIMPLE;
  CLASS clinic / PARAM=REF;
  MODEL clinical = clinic;
RUN;

* Using user defined design variables;
PROC LOGISTIC DATA=TEMP DESCENDING SIMPLE;
  MODEL clinical = clinicA clinicB clinicC;
RUN;

PARAM=REF: uses 0/1 coding, with the last group as reference
SIMPLE: gives summary statistics
SAS OUTPUT USING CLASS STATEMENT

Response Profile
Ordered                Total
  Value  CLINICAL  Frequency
      1         1         51
      2         0        851

Probability modeled is CLINICAL=1.

Class Level Information
                 Design Variables
Class   Value    1    2    3
CLINIC  A        1    0    0
        B        0    1    0
        C        0    0    1
        D        0    0    0

Same coding as in datastep
Clinic D is the reference
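The design-variable table above follows a simple rule: one 0/1 column per non-reference level, with the reference level coded as all zeros. A minimal Python sketch of that rule follows. The helper name `reference_coding` is hypothetical, and this is an illustration of the coding scheme rather than of how SAS builds it internally.

```python
def reference_coding(levels, reference):
    """Map each level to its 0/1 design variables; the reference gets all zeros."""
    columns = [level for level in levels if level != reference]
    return {level: [1 if level == col else 0 for col in columns]
            for level in levels}

design = reference_coding(["A", "B", "C", "D"], reference="D")
for level, row in design.items():
    print(level, row)
```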
SAS OUTPUT USING CLASS STATEMENT

Testing Global Null Hypothesis: BETA=0
Test              Chi-Square  DF  Pr > ChiSq
Likelihood Ratio      7.9632   3      0.0468
Score                 8.6122   3      0.0349
Wald                  8.1300   3      0.0434

Type III Analysis of Effects
                 Wald
Effect  DF  Chi-Square  Pr > ChiSq
CLINIC   3      8.1300      0.0434

The global Wald test and the Type III test for CLINIC are equal because no other variables are in the model
SAS OUTPUT USING CLASS STATEMENT

Analysis of Maximum Likelihood Estimates
                          Standard        Wald
Parameter   DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept    1   -3.4339    0.4544     57.1196      <.0001
CLINIC A     1    1.2080    0.5145      5.5114      0.0189
CLINIC B     1    0.5266    0.5363      0.9644      0.3261
CLINIC C     1    0.4311    0.5305      0.6604      0.4164

Odds Ratio Estimates
                   Point       95% Wald
Effect          Estimate  Confidence Limits
CLINIC A vs D      3.347    1.221     9.175
CLINIC B vs D      1.693    0.592     4.844
CLINIC C vs D      1.539    0.544     4.353
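Each row of the output above can be approximately rebuilt from its estimate and standard error alone. The Wald chi-square is (estimate/SE)^2, its 1-df p-value is the two-sided normal tail, and the odds ratio with its limits comes from exponentiating. The Python sketch below uses the CLINIC A row; the function name is hypothetical, and small differences from the SAS values reflect rounded inputs.

```python
import math

def wald_summary(estimate, se, z=1.96):
    """Wald chi-square (1 df), its p-value, the odds ratio and 95% Wald limits."""
    chi_square = (estimate / se) ** 2
    # For 1 df, P(chi-square > x) equals the two-sided normal tail erfc(sqrt(x/2))
    p_value = math.erfc(math.sqrt(chi_square / 2))
    odds_ratio = math.exp(estimate)
    limits = (math.exp(estimate - z * se), math.exp(estimate + z * se))
    return chi_square, p_value, odds_ratio, limits

# CLINIC A row from the output above
chi_a, p_a, or_a, ci_a = wald_summary(1.2080, 0.5145)
print(round(chi_a, 4), round(p_a, 4), round(or_a, 3), [round(v, 3) for v in ci_a])
```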
SAS OUTPUT USING MODEL clinical = clinicA clinicB clinicC

Analysis of Maximum Likelihood Estimates
                          Standard        Wald
Parameter   DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept    1   -3.4339    0.4544     57.1196      <.0001
clinicA      1    1.2080    0.5145      5.5114      0.0189
clinicB      1    0.5266    0.5363      0.9644      0.3261
clinicC      1    0.4311    0.5305      0.6604      0.4164

Odds Ratio Estimates
            Point       95% Wald
Effect   Estimate  Confidence Limits
clinicA     3.347    1.221     9.175
clinicB     1.693    0.592     4.844
clinicC     1.539    0.544     4.353
Maybe clinic rates of CVD differ because age varies among centers

SAS OUTPUT USING MODEL clinical = clinic age

Testing Global Null Hypothesis: BETA=0
Test              Chi-Square  DF  Pr > ChiSq
Likelihood Ratio     16.5582   4      0.0024
Score                17.2001   4      0.0018
Wald                 16.2760   4      0.0027

The global tests ask whether age and clinic are related to CVD

Type III Analysis of Effects
                 Wald
Effect  DF  Chi-Square  Pr > ChiSq
CLINIC   3      8.9604      0.0298
AGE      1      8.4904      0.0036
SAS OUTPUT USING MODEL clinical = clinic age

Analysis of Maximum Likelihood Estimates
                          Standard        Wald
Parameter   DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept    1   -7.2250    1.4096     26.2725      <.0001
CLINIC A     1    1.3211    0.5189      6.4816      0.0109
CLINIC B     1    0.6448    0.5400      1.4256      0.2325
CLINIC C     1    0.5163    0.5335      0.9366      0.3332
AGE          1    0.0662    0.0227      8.4904      0.0036

Odds Ratio Estimates
                   Point       95% Wald
Effect          Estimate  Confidence Limits
CLINIC A vs D      3.747    1.355    10.361
CLINIC B vs D      1.906    0.661     5.492
CLINIC C vs D      1.676    0.589     4.768
AGE                1.068    1.022     1.117
Assumptions: Linear Versus Logistic Regression

Linear regression                                  Logistic regression
Y normally distributed                             Y binary
μY linearly related to X                           Log odds linearly related to X
σ² constant over X                                 N/A
Each observation independent of other              Each observation independent of other
  observations                                       observations
Large N not needed for tests if Y is               Large enough N to justify using χ²
  normally distributed
Illustration of Linearity in Log Odds Assumption

Log odds = -6.2428 + 0.0613*Age

AGE   ODDS
50    0.039
60    0.072   RO = 1.85 = .072/.039
70    0.134   RO = 1.85 = .134/.072

The increase in relative odds from going from 50 to 60 years is the same as from going from 60 to 70 years
Note: Absolute risk is not linear with age
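The constant relative odds can be checked numerically. This Python sketch uses the rounded coefficients from the equation above, and its names are hypothetical. The odds it produces need not match the tabulated values to the last digit, since those may have been computed from unrounded estimates. The point is that every 10-year step multiplies the odds by exp(10*0.0613), while the step in probability keeps changing.

```python
import math

# Rounded coefficients from the fitted equation above
B0, B_AGE = -6.2428, 0.0613

def odds_at(age):
    return math.exp(B0 + B_AGE * age)

def prob_at(age):
    odds = odds_at(age)
    return odds / (1 + odds)

# Relative odds for each 10-year step: identical on the odds scale
ratio_50_60 = odds_at(60) / odds_at(50)
ratio_60_70 = odds_at(70) / odds_at(60)

# Absolute risk differences for the same steps: not identical
risk_step_50_60 = prob_at(60) - prob_at(50)
risk_step_60_70 = prob_at(70) - prob_at(60)

print(round(ratio_50_60, 2), round(ratio_60_70, 2))
print(round(risk_step_50_60, 4), round(risk_step_60_70, 4))
```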
Fitted regression line
Curve based on: p = exp(b0 + b1x) / (1 + exp(b0 + b1x))
b0 affects location
b1 affects curvature