Chocolate Cake Seminar Series on Statistical Applications. Today’s Talk: Binary Logistic and Probit Models at Work By Dr. Olga Korosteleva. Outline of Presentation. Binary Logistic Regression Model Probit Model for Binary Outcome. Binary Logistic Regression Model.
Chocolate Cake Seminar Series on Statistical Applications Today’s Talk: Binary Logistic and Probit Models at Work By Dr. Olga Korosteleva
Outline of Presentation • Binary Logistic Regression Model • Probit Model for Binary Outcome
Binary Logistic Regression Model
Suppose a pair of variables $(x, y)$ is observed on some individuals, where $x$ is a continuous variable, whereas $y$ is a binary (or dichotomous) variable, that is, $y$ assumes only two values.
Examples (binary by nature):
(i) relief v. no relief from a certain medical condition;
(ii) voted ‘yes’ v. ‘no’ on proposition XX;
(iii) HIV infection v. no infection;
(iv) won v. lost;
(v) dead v. alive;
(vi) accept offer v. decline offer.
Examples (continuous by nature but dichotomized):
(i) excess body weight loss <20% v. 20% or more;
(ii) PTSD symptom score ranges from 17 to 85 with a cutoff at 50: diagnosed with PTSD if score >= 50 v. no PTSD if score < 50;
(iii) spends above $X on entertainment weekly v. spends less than $X;
(iv) runs a marathon in under 3 hours v. takes longer than 3 hours.
Scatterplot
If we plot $y$ (with values coded 0 and 1) against $x$, the scatterplot may look something like this:
[Figure: scatterplot of $y$ (coded 0 and 1) against $x$]
A linear regression model is invalid in this case: even if we fit a straight line to the data, the error terms cannot be normally distributed, because for any given value of the predictor they take only two possible values. So, what model should we use?
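A related symptom can be seen numerically. The sketch below fits an ordinary least-squares line to a small set of hypothetical 0/1 data (the values are invented for illustration, not taken from the talk) and shows that the fitted "probabilities" leave the interval from 0 to 1.

```python
# Hypothetical toy data: y switches from 0 to 1 as x grows.
xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares slope and intercept for the line y = a + b*x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Fitted values along the line and the corresponding residuals.
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

print("fitted at x=0:", round(fitted[0], 3))
print("fitted at x=9:", round(fitted[-1], 3))
```

At the smallest x the line dips below 0 and at the largest x it rises above 1, which no probability can do.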
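Binary Logistic Regression Model
• A binary (dichotomous) logistic regression is used to model $\pi = P(y = 1)$. The model with predictors $x_1, \dots, x_k$ has the form
$$\pi = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}.$$
• Define the odds in favor of $y = 1$ as the ratio $\pi / (1 - \pi)$. We can rewrite the logistic regression above in terms of the odds,
$$\frac{\pi}{1 - \pi} = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k).$$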
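The two forms of the model can be checked against each other numerically. This is a minimal sketch in which $\pi$ denotes the probability that $y = 1$; the coefficients and predictor values are hypothetical and chosen only for illustration.

```python
import math

def linear_predictor(beta0, betas, xs):
    """beta0 + beta1*x1 + ... + betak*xk."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

def probability(eta):
    """Logistic model: probability that y = 1 for linear predictor eta."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def odds(pi):
    """Odds in favor of y = 1."""
    return pi / (1.0 - pi)

# Hypothetical coefficients and predictor values (not from the talk).
beta0, betas = -1.0, [0.5, -0.2]
xs = [2.0, 3.0]

eta = linear_predictor(beta0, betas, xs)
pi = probability(eta)

print("probability:", round(pi, 4))
print("odds:", round(odds(pi), 4), "exp(eta):", round(math.exp(eta), 4))
```

The last line prints the same number twice, which is the content of the odds form of the model.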
Goodness of Model Fit
There are three ways to check how well the model fits the data:
• The pseudo R-square resembles the R-square of linear regression in that higher values indicate better model fit, but it cannot be interpreted as the proportion of variation in $y$ that is accounted for by the model.
• The max-rescaled R-square is defined as the pseudo R-square divided by its maximum attainable value.
• The Hosmer-Lemeshow goodness-of-fit test has the null hypothesis that the model has a good fit. A P-value in excess of 0.05 is desirable.
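A minimal sketch of the first two measures follows. It assumes they are the generalized (Cox-Snell) R-square and its Nagelkerke rescaling, which is what the SAS rsq option used later in this talk reports. The numbers come from the psoriasis example that follows: 45 patients, 25 of whom reported relief (counted from the data listing), and a reported pseudo R-square of 0.3304. The function names are illustrative.

```python
import math

def null_loglik(n_events, n_total):
    """Log-likelihood of the intercept-only model for a binary outcome."""
    p = n_events / n_total
    return n_events * math.log(p) + (n_total - n_events) * math.log(1.0 - p)

def pseudo_r2_maximum(loglik_null, n_total):
    """Largest value the generalized (Cox-Snell) R-square can attain."""
    return 1.0 - math.exp(2.0 * loglik_null / n_total)

def pseudo_r2(loglik_null, loglik_model, n_total):
    """Generalized (Cox-Snell) R-square from two log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n_total)

# Psoriasis example: 45 patients, 25 with relief (counted from the data).
n_total, n_events = 45, 25
reported_pseudo_r2 = 0.3304  # value reported in the talk

ll0 = null_loglik(n_events, n_total)
r2_max = pseudo_r2_maximum(ll0, n_total)
max_rescaled = reported_pseudo_r2 / r2_max

print(round(r2_max, 4), round(max_rescaled, 4))
```

Under these assumptions the maximum is about 0.7469, and dividing the reported 0.3304 by it gives about 0.4424, which agrees with the max-rescaled R-squared reported for the logistic model.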
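Interpretation of Regression Coefficients
• When $x_1$ is continuous, the quantity $(\exp(\hat\beta_1) - 1) \cdot 100\%$ represents the estimated percent change in the odds in favor of $y = 1$ when $x_1$ is increased by one unit and the other variables are held fixed. Indeed,
$$\frac{\widehat{\text{odds}}(x_1 + 1) - \widehat{\text{odds}}(x_1)}{\widehat{\text{odds}}(x_1)} \cdot 100\% = \left(\frac{\exp(\hat\beta_0 + \hat\beta_1 (x_1 + 1) + \dots + \hat\beta_k x_k)}{\exp(\hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_k x_k)} - 1\right) \cdot 100\% = (\exp(\hat\beta_1) - 1) \cdot 100\%.$$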
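Interpretation of Regression Coefficients
• If $x_1$ is a categorical variable with two levels (its own dummy variable), then the quantity $\exp(\hat\beta_1) \cdot 100\%$ represents the estimated percent ratio of the odds for the upper level of $x_1$ (when $x_1 = 1$) to the odds for the lower level (when $x_1 = 0$), provided the other variables are held fixed. To see that, write
$$\frac{\widehat{\text{odds}}(x_1 = 1)}{\widehat{\text{odds}}(x_1 = 0)} = \frac{\exp(\hat\beta_0 + \hat\beta_1 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k)}{\exp(\hat\beta_0 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k)} = \exp(\hat\beta_1).$$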
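Interpretation of Regression Coefficients
• If $x_1$ is a categorical variable with $c$ levels, then $c - 1$ dummy variables that correspond to $x_1$ are included in the model, with the $c$th level being the reference level. Writing $\hat\beta_j$ for the estimated coefficient of the dummy variable for level $j$, the quantity $\exp(\hat\beta_j) \cdot 100\%$ represents the estimated percent ratio of the odds for level $j$ to the odds for the reference level, provided the other variables are held fixed. This follows from the fact that
$$\frac{\widehat{\text{odds}}(\text{level } j)}{\widehat{\text{odds}}(\text{reference level})} = \exp(\hat\beta_j).$$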
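Dummy coding with a reference level can be sketched as follows. The example uses the three-level drug variable (A, B, placebo P) from the psoriasis example later in the talk, with placebo as the reference. The helper name `dummy_code` is hypothetical; the resulting names drugA and drugB match the variables used in the SPSS syntax.

```python
def dummy_code(value, levels, reference, prefix):
    """Return 0/1 indicators for every level except the reference level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {f"{prefix}{lvl}": int(value == lvl)
            for lvl in levels if lvl != reference}

# Three levels give two dummy variables; the reference level is all zeros.
for drug in ["A", "B", "P"]:
    print(drug, dummy_code(drug, ["A", "B", "P"], reference="P", prefix="drug"))
```

Because the reference level is coded as all zeros, each dummy coefficient compares its level directly with the reference level.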
Example Dermatologists at a large hospital study patients with acute psoriasis, a skin disease. They randomly assign patients to three groups: taking drug A, drug B, or placebo. There are 45 patients in the study, 15 per group. The outcome is whether the patient felt relief from psoriasis symptoms (1=relief, 0=no relief). Data are collected on gender, age, and group. The following SAS code fits the logistic regression model to the data.
SAS Applications: Code
    data psoriasis;
      input gender$ age drug$ relief$ @@;
      datalines;
    M 25 A Yes  M 25 A Yes  M 41 A Yes  M 42 A Yes  M 43 A Yes
    M 51 A Yes  M 59 A Yes  M 59 A Yes  F 29 A Yes  F 35 A Yes
    F 42 A Yes  F 56 A Yes  F 65 A Yes  F 40 A No   F 61 A No
    M 29 B Yes  M 33 B Yes  M 39 B Yes  M 42 B Yes  M 46 B Yes
    M 42 B No   M 48 B No   M 62 B No   F 36 B Yes  F 47 B Yes
    F 28 B No   F 38 B No   F 39 B No   F 50 B No   F 60 B No
    M 42 P Yes  M 46 P Yes  M 24 P No   M 25 P No   M 60 P No
    M 67 P No   F 28 P Yes  F 32 P Yes  F 35 P Yes  F 42 P No
    F 48 P No   F 53 P No   F 57 P No   F 58 P No   F 65 P No
    ;
    proc logistic data=psoriasis;
      class gender (ref='F') drug (ref='P') / param=ref;
      model relief(event='Yes') = gender age drug / rsq lackfit;
    run;
The Important Features of the SAS Code
• Options (ref='F') and (ref='P') define reference categories for gender and drug.
• Option param=ref creates proper dummy variables for gender and drug.
• Option rsq computes the pseudo R-square and max-rescaled R-square.
• Option lackfit performs the Hosmer-Lemeshow goodness-of-fit test.
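For readers without SAS, the same kind of model can be fitted by hand. The sketch below is not the SAS procedure itself: it is a plain Newton-Raphson maximum-likelihood fit written in standard-library Python, with the data copied from the listing and dummy variables built with female and placebo as reference levels. All function names are illustrative. If the listing was transcribed faithfully, the exponentiated slopes should land near the odds ratios quoted in the talk, but that is not guaranteed here.

```python
import math

# Records copied from the SAS datalines: gender, age, drug, relief.
RAW = """
M 25 A Yes M 25 A Yes M 41 A Yes M 42 A Yes M 43 A Yes
M 51 A Yes M 59 A Yes M 59 A Yes F 29 A Yes F 35 A Yes
F 42 A Yes F 56 A Yes F 65 A Yes F 40 A No F 61 A No
M 29 B Yes M 33 B Yes M 39 B Yes M 42 B Yes M 46 B Yes
M 42 B No M 48 B No M 62 B No F 36 B Yes F 47 B Yes
F 28 B No F 38 B No F 39 B No F 50 B No F 60 B No
M 42 P Yes M 46 P Yes M 24 P No M 25 P No M 60 P No
M 67 P No F 28 P Yes F 32 P Yes F 35 P Yes F 42 P No
F 48 P No F 53 P No F 57 P No F 58 P No F 65 P No
"""
NAMES = ["intercept", "male", "age", "drugA", "drugB"]

def parse(raw):
    """Turn the listing into (design row, outcome) pairs."""
    tok = raw.split()
    rows = []
    for i in range(0, len(tok), 4):
        g, age, d, r = tok[i:i + 4]
        # Reference levels: female for gender, placebo (P) for drug.
        x = [1.0, float(g == "M"), float(age), float(d == "A"), float(d == "B")]
        rows.append((x, 1.0 if r == "Yes" else 0.0))
    return rows

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def loglik(rows, beta):
    """Binomial log-likelihood of the logistic model."""
    total = 0.0
    for x, y in rows:
        eta = sum(b * v for b, v in zip(beta, x))
        total += y * eta - math.log1p(math.exp(eta))
    return total

def fit(rows, iters=25):
    """Newton-Raphson iterations starting from all-zero coefficients."""
    p = len(rows[0][0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, y in rows:
            eta = sum(b * v for b, v in zip(beta, x))
            pi = 1.0 / (1.0 + math.exp(-eta))
            w = pi * (1.0 - pi)
            for j in range(p):
                grad[j] += (y - pi) * x[j]
                for k in range(p):
                    hess[j][k] += w * x[j] * x[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

rows = parse(RAW)
beta = fit(rows)
for name, b in zip(NAMES[1:], beta[1:]):
    print(name, "estimated odds ratio:", round(math.exp(b), 3))
```

The fixed number of iterations keeps the sketch short; a production fit would stop on a convergence criterion and report standard errors.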
Results • Age and drug A are significant predictors of relief from psoriasis (age at the 5%, drug A at the 1%). • This model has a good fit because the P-value of the Hosmer-Lemeshow test is 0.6390 > 0.05. Also the pseudo R-squared (0.3304) and max-rescaled R-squared (0.4424) are not very small. • The fitted model is
Interpretation of Beta Coefficients
• The odds in favor of psoriasis relief for males are 3.028 times those for females (302.8%).
• As age increases by one year, the odds in favor of psoriasis relief decrease by 7% = (0.93-1)100%.
• The odds in favor of psoriasis relief for drug A patients are 19.744 times those of placebo patients (or 1,974.4%).
• The odds in favor of psoriasis relief for drug B patients are 1.411 times those of placebo patients (or 141.1%).
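The arithmetic behind these statements can be sketched as follows, using only the odds ratios quoted above. The ten-year figure is an extra illustration of how a per-year odds ratio compounds; it is not a result from the talk.

```python
import math

# Estimated odds ratios quoted in the interpretation above.
odds_ratios = {
    "male v. female": 3.028,
    "age (per year)": 0.93,
    "drug A v. placebo": 19.744,
    "drug B v. placebo": 1.411,
}

# Each coefficient on the log-odds scale is the log of its odds ratio.
betas = {name: math.log(ratio) for name, ratio in odds_ratios.items()}

# Percent change in odds is (odds ratio - 1) * 100%.
pct_change = {name: (ratio - 1.0) * 100.0 for name, ratio in odds_ratios.items()}

# Odds ratios multiply: ten more years of age scale the odds by 0.93**10.
ten_year_ratio = odds_ratios["age (per year)"] ** 10

for name in odds_ratios:
    print(name, "beta:", round(betas[name], 4),
          "percent change:", round(pct_change[name], 1))
print("ten-year odds ratio for age:", round(ten_year_ratio, 3))
```

Because odds ratios act multiplicatively, a 7% decrease per year does not add up to a 70% decrease over ten years; the ten-year ratio is the tenth power of the yearly one.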
SPSS Applications: Syntax
    LOGISTIC REGRESSION VARIABLES relief
      /METHOD=ENTER gender age drugA drugB
      /PRINT=GOODFIT CI(95)
      /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
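Probit Model for Binary Outcome
Note that the binary logistic model may be written in the form
$$\ln\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k.$$
The function $\text{logit}(\pi) = \ln\frac{\pi}{1 - \pi}$ is called a logit function. It is a link function between what is being predicted and the linear regression term:
$$\text{logit}(\pi) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k.$$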
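Probit Model for Binary Outcome
The probit model has the probit link function $\Phi^{-1}(\pi)$, where $\Phi$ is the cumulative distribution function of a standard normal random variable. The probit model with predictors $x_1, \dots, x_k$ has the form
$$\Phi^{-1}(\pi) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \quad \text{or, equivalently,} \quad \pi = \Phi(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k).$$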
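The two link functions can be compared side by side. This is a minimal sketch using Python's standard-library `statistics.NormalDist` for the standard normal distribution; the helper names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def logit(p):
    """Logit link: log-odds of probability p."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logit: probability for a given log-odds value."""
    return 1.0 / (1.0 + math.exp(-eta))

def probit(p):
    """Probit link: the z-score whose standard normal CDF equals p."""
    return STD_NORMAL.inv_cdf(p)

def inv_probit(z):
    """Inverse probit: standard normal CDF evaluated at z."""
    return STD_NORMAL.cdf(z)

for p in (0.1, 0.5, 0.9, 0.975):
    print(p, "logit:", round(logit(p), 4), "probit:", round(probit(p), 4))
```

Both links map probabilities in (0, 1) onto the whole real line and both equal 0 at probability 0.5, but they use different scales, so logistic and probit coefficients are not directly comparable.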
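Interpretation of Regression Coefficients
• If $x_1$ is continuous, then $\hat\beta_1$ represents the estimated change in the z-score of the probability of $y = 1$ for a unit increase in $x_1$, provided all other variables are held fixed. Indeed,
$$\Phi^{-1}(\hat\pi(x_1 + 1)) - \Phi^{-1}(\hat\pi(x_1)) = \hat\beta_0 + \hat\beta_1 (x_1 + 1) + \dots + \hat\beta_k x_k - (\hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_k x_k) = \hat\beta_1.$$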
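Interpretation of Regression Coefficients
• If $x_1$ is categorical with $c$ levels, then $\hat\beta_j$, the estimated coefficient of the dummy variable for level $j$, represents the estimated difference in the z-score of the probability of $y = 1$ between level $j$ and the reference level $c$, provided all other variables are held fixed. To see that, write
$$\Phi^{-1}(\hat\pi(\text{level } j)) - \Phi^{-1}(\hat\pi(\text{reference level})) = \hat\beta_j.$$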
SAS Application: Example
To fit the probit model to the psoriasis data, use the following code:
    proc logistic data=psoriasis;
      class gender (ref='F') drug (ref='P') / param=ref;
      model relief(event='Yes') = gender age drug / link=probit rsq lackfit;
    run;
Results Similar to the results for the logistic regression, • Age and drug A are significant predictors of relief from psoriasis. • This model has a good fit. The Hosmer-Lemeshow test has P-value = 0.3652 > 0.05. • The pseudo R-squared = 0.3343 (v. 0.3304 for logistic), and max-rescaled R-squared = 0.4475 (v. 0.4424 for logistic). • The fitted model is
Interpretation of Beta Coefficients • The estimated z-score for the probability of psoriasis relief for males is 0.6862 points larger than that for females. • For a one-year increase in age, the z-score decreases by 0.0428 points. • The z-score for drug A is 1.8175 points larger than that for placebo. • The z-score for drug B is 0.2642 points larger than that for placebo.
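A z-score shift translates into a probability only once a starting point is fixed. The sketch below applies the shifts quoted above to a hypothetical baseline z-score (the baseline values are assumptions, not estimates from the talk) and converts the result to a probability with the standard normal CDF.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Estimated z-score shifts quoted in the interpretation above.
shifts = {
    "male v. female": 0.6862,
    "one more year of age": -0.0428,
    "drug A v. placebo": 1.8175,
    "drug B v. placebo": 0.2642,
}

def prob_after_shift(baseline_z, shift):
    """Probability of relief after moving the z-score by `shift`."""
    return STD_NORMAL.cdf(baseline_z + shift)

# Hypothetical baseline: z-score 0, that is, a 50% chance of relief.
baseline_z = 0.0
probs = {name: prob_after_shift(baseline_z, s) for name, s in shifts.items()}
for name, p in probs.items():
    print(name, "->", round(p, 4))

# The same shift changes the probability less from a higher baseline.
gain_from_0 = prob_after_shift(0.0, 0.6862) - STD_NORMAL.cdf(0.0)
gain_from_1 = prob_after_shift(1.0, 0.6862) - STD_NORMAL.cdf(1.0)
print(round(gain_from_0, 4), round(gain_from_1, 4))
```

This is why probit coefficients are stated on the z-score scale: the shift is constant, while the implied change in probability depends on the values of the other predictors.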
SPSS Applications: Syntax
    GENLIN relief (REFERENCE=0) BY gender drugA drugB (ORDER=DESCENDING) WITH age
      /MODEL gender age drugA drugB INTERCEPT=YES DISTRIBUTION=BINOMIAL LINK=PROBIT
      /PRINT SOLUTION.