Logistic Regression Part I - Introduction
Logistic Regression • Regression where the response variable is dichotomous (not continuous) • Examples • effect of concentration of drug on whether symptoms go away • effect of age on whether or not a patient survived treatment • effect of negative cognitions about SELF, WORLD, or Self-BLAME on whether a participant has PTSD
Simple Linear Regression • Relationship between continuous response variable and continuous explanatory variable • Example • Effect of concentration of drug on reaction time • Effect of age of patient on number of years of post-operation survival
Simple Linear Regression • RT (ms) = β0 + β1 × concentration (mg) • β0 is the value of RT when concentration is 0 • β1 is the change in RT caused by a change in concentration of 1 mg • E.g. RT = 400 + 50 × concentration
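As a sketch, the example line written as a small Python function (400 and 50 are the slide's illustrative values, not estimates from real data):

```python
def predicted_rt(concentration_mg):
    """Reaction time (ms) predicted by the example line RT = 400 + 50 * concentration."""
    b0 = 400.0  # RT when concentration is 0 mg
    b1 = 50.0   # change in RT per 1 mg of concentration
    return b0 + b1 * concentration_mg

print(predicted_rt(2.0))  # 500.0 ms predicted at 2 mg
```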
Logistic Regression • What do we do when we have a response variable which is not continuous, but dichotomous?
[Figure: three plots of Probability of Disease, Odds of Disease, and Log(Odds) of Disease, each against Concentration]
Odds • Odds are simply the ratio of the proportions for the two possible outcomes • If p is the proportion for one outcome, then 1 − p is the proportion for the second outcome
Odds (Example) • At concentration level 16 we observe 75 participants out of 100 showing no disease (healthy) • If p is the probability of being healthy, then p = 0.75 • Then 1 − p, the probability of not being healthy, is 0.25 • Odds of being healthy rather than not healthy at concentration level 16: p / (1 − p) = 0.75/0.25 = 3 • Means that a person is 3 times more likely to be healthy than not healthy at concentration level 16
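A minimal sketch of the odds calculation, using the slide's counts:

```python
def odds(p):
    """Odds in favour of an outcome that occurs with probability p."""
    return p / (1 - p)

p_healthy = 75 / 100    # 75 of 100 participants healthy at 16 mg
print(odds(p_healthy))  # 3.0 -> healthy is 3 times as likely as not healthy
```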
Logarithms • Logarithms are a way of expressing numbers as powers of a base • Example • 10² = 100 • 10 is called the “base” • The power, 2 in this case, is called the “exponent” • Therefore 10² = 100 means that log₁₀(100) = 2
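A quick check of this in Python:

```python
import math

print(10 ** 2)          # 100
print(math.log10(100))  # 2.0, because 10 raised to the power 2 is 100
```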
Log Odds • Odds of being healthy after 16 mg of drug are 3 • Log odds are log(3) = 1.1 • Let's say that the odds of being healthy after 2 mg of drug are 0.25 • Means that a person is four times more likely to be not healthy than healthy after 2 mg of drug • Log odds are log(0.25) = −1.39
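These log odds use the natural logarithm (base e), which is what logistic regression software reports. A quick check:

```python
import math

print(math.log(3))     # ~1.10: log odds of being healthy at 16 mg
print(math.log(0.25))  # ~-1.39: log odds of being healthy at 2 mg
```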
Logistic Regression • With log odds we can now look at the linear relationship between a dichotomous response and a continuous explanatory variable: log(p / (1 − p)) = β0 + β1X, where, for example, p is the probability of being healthy at different levels of drug concentration, X
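Solving this link for p gives the familiar logistic curve. A minimal sketch (b0 and b1 stand in for fitted estimates, not values from the slides):

```python
import math

def prob_from_logodds(b0, b1, x):
    """Solve log(p / (1 - p)) = b0 + b1 * x for p."""
    log_odds = b0 + b1 * x
    return 1 / (1 + math.exp(-log_odds))  # the logistic (inverse logit) function
```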
Example: Simple Logistic Regression • Look at the effect of drug concentration on the probability of NOT having the disease (i.e. being healthy) • Use SPSS to do the regression (we'll all do this soon) • Get the parameter estimates b0 and b1 for the fitted model log(p / (1 − p)) = b0 + b1 × concentration
Interpreting parameters (b0 and b1) in logistic regression is a little tricky • An increase of 1 mg of concentration increases the log(odds) of being healthy by 0.106 • An increase of 1 mg of concentration therefore increases the odds of being healthy by a factor of exp(0.106) • That is, increasing concentration by 1 mg increases the odds of being healthy by a factor of 1.11
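Exponentiating a coefficient on the log-odds scale gives the multiplicative change in the odds. A quick check:

```python
import math

b1 = 0.106           # slope on the log-odds scale (from the slide)
print(math.exp(b1))  # ~1.11: the odds of being healthy multiply by this per mg
```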
Slope Parameter • Parameter β1 in general: • if positive, then increasing X increases the odds (and hence p) • if negative, then increasing X decreases the odds • the larger β1 is in magnitude, the larger the effect of X on p • As in simple linear regression, we can test whether or not β1 is significantly different from 0
Let's break to do a simple Logistic Regression • Open XYZ.sav in SPSS • Fit a logistic regression with • PTSD (Y/N) as the response variable • Self-BLAME as the explanatory variable • Is the effect of Self-BLAME significant? • Get the parameter estimates • Write the equation of the model • What are the odds of having PTSD given a Self-BLAME score of 3? • Use the interpretation of the regression coefficient to work out the odds given Self-BLAME of 4.
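For reference (not part of the SPSS exercise), a rough Python sketch of the same fit using pandas and statsmodels; the column names PTSD and BLAME are assumptions about what XYZ.sav contains:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("XYZ.sav")  # reading .sav files requires pyreadstat
# Assumed column names; PTSD must be coded 0/1 for logit to accept it.
fit = smf.logit("PTSD ~ BLAME", data=df).fit()
print(fit.summary())          # estimates b0, b1 and their p-values
```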
Logistic Regression Part II – Multiple Logistic Regression
Multiple Linear Regression • Simple linear regression extended to more than one explanatory variable • Example • Effect of both concentration and age on reaction time • Effect of age, number of previous operations, time in anaesthesia, cholesterol level, etc. on number of years of post-operation survival
Multiple Linear Regression • RT (ms) = β0 + β1 × concentration (mg) + β2 × age + β3 × gender (0 = male, 1 = female) • β0 is the value of RT when concentration, age, and gender are all 0 • β1 is the change in RT caused by a change in concentration of 1 mg, holding the other variables constant • β2 is the change in RT caused by a change in age of 1 year, holding the other variables constant • β3 is the change in RT caused by going from male to female
Multiple Logistic Regression • Look at the effect of drug concentration, age, and gender on the probability of NOT having the disease: log(p / (1 − p)) = β0 + β1X1 + β2X2 + β3X3, where p is the probability of not having the disease, X1 is the concentration of drug (mg), X2 is age (years), and X3 is gender (0 for males, 1 for females)
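As a sketch, the same three-predictor model written as a Python function (the coefficient vector b is a placeholder for whatever SPSS estimates):

```python
import math

def p_healthy(b, conc_mg, age_yr, gender):
    """p from log(p / (1 - p)) = b0 + b1*conc + b2*age + b3*gender."""
    log_odds = b[0] + b[1] * conc_mg + b[2] * age_yr + b[3] * gender
    return 1 / (1 + math.exp(-log_odds))  # gender: 0 = male, 1 = female
```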
Again, use SPSS to fit the logistic model • Increasing concentration increases the odds of not having the disease (again, being healthy) • Increasing age decreases the odds of being healthy • “Increasing” gender (from male to female) increases the odds of being healthy • In particular, increasing age by 1 year decreases the odds of being healthy by a factor of 0.95 • M to F increases the odds by a factor of 1.001
Was it worth adding the factors? • When we add parameters we make our model more complicated • We really want this addition to be “worth it” • In other words, adding age and gender should improve our explanation of disease • But what constitutes an improvement?
Was it worth adding the factors? • Quality (badness) of model fit is given by −2log(L) • If we want to see whether it was worth adding parameters, we can compare the quality of fit of the simple model and the more complex model • Quality of model fit follows a chi-square (χ²) distribution with degrees of freedom (df) equal to the number of parameters in the model • The difference between qualities of fit also follows a χ² distribution, with df equal to the difference in the number of parameters between the two models
Was it worth adding these factors? • The simple logistic regression model has a −2log(L) of 45.7 • The multiple logistic regression model, with 2 extra parameters, has a −2log(L) of 40.02 • Test whether the difference, χ² = 45.7 − 40.02 = 5.68, is a significant improvement • The critical χ² for 2 df is 5.99 • Our χ² is smaller and so NO, not worth it
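A sketch of the same test in Python using scipy (the two −2log(L) values are taken from the slide):

```python
from scipy.stats import chi2

lr_stat = 45.7 - 40.02       # drop in -2log(L) from adding 2 parameters
print(lr_stat)               # ~5.68
print(chi2.ppf(0.95, df=2))  # ~5.99, the critical value at alpha = .05
print(chi2.sf(lr_stat, df=2))  # p ~ .058 -> not a significant improvement
```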
BUT… • It doesn't look like gender is having much of an effect • Check the SPSS output and see that the Wald χ² for Gender is 0.527, which has p = .47 • Perhaps it wasn't worth adding both parameters, but it may be worth adding just Age • Age has Wald χ² = 4.33, p = .03 • When we add only Age, the change in χ² = 5.5, and we test it against a χ² with 1 df, which gives p = .02
Logistic Regression Model Building • What if we have a whole host of possible explanatory variables? • We want to build a model which predicts whether a person will have a disease given a set of explanatory variables • Same methods as in multiple linear regression: • Forward selection • Backward elimination • Stepwise • All subsets • Hierarchical
How to know if a model is good • It is all about having a model which does a good job of classifying participants as having the disease or not • In particular, the model predicts how many people have the disease and how many people don't • The model can be • Correct in two ways • Correctly categorise a person who has the disease as having the disease • Correctly say no disease when there is no disease • Incorrect in two ways • Incorrectly categorise a person who has the disease as not having the disease • Incorrectly say no disease when there is disease
Accuracy of model • Proportion of correct classifications • Number of correct disease participants plus number of correct no disease participants, divided by the total number of participants
Sensitivity of model • Proportion of ‘successes’ correctly identified • Number of correct no disease participants divided by the total number of no disease participants
Specificity of model • Proportion of ‘failures’ correctly identified • Number of correct disease participants divided by the total number of disease participants
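Putting the three definitions together, a small sketch (note these slides treat ‘no disease’ as the success category; the counts in the example call are hypothetical, purely for illustration):

```python
def model_metrics(correct_disease, correct_healthy, n_disease, n_healthy):
    """Accuracy, sensitivity, specificity as defined on the three slides above
    (here 'success' = no disease, matching the slides' coding)."""
    accuracy = (correct_disease + correct_healthy) / (n_disease + n_healthy)
    sensitivity = correct_healthy / n_healthy  # successes correctly identified
    specificity = correct_disease / n_disease  # failures correctly identified
    return accuracy, sensitivity, specificity

# Hypothetical counts, for illustration only:
print(model_metrics(correct_disease=40, correct_healthy=35,
                    n_disease=45, n_healthy=55))
```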
Now…a real example • Startup, Makgekgenene and Webster (2007) looked at whether or not the subscales of the Posttraumatic Cognitions Inventory (PTCI) are good predictors of Posttraumatic Stress Disorder (PTSD) • Subscales are • Negative Cognitions About SELF • Negative Cognitions about the WORLD • Self-BLAME
Descriptive Results • PTSD participants showed higher scores than non-PTSD participants on all three subscales
Multiple Logistic Regression • Response variable: • whether or not the participant has PTSD • Explanatory variables: • Negative Cognitions About SELF • Negative Cognitions about the WORLD • Self-BLAME
Let's do the Logistic Regression • Open XYZ.sav in SPSS • Run the appropriate regression • What are the parameter estimates for our three explanatory variables? • Which of these are significant (at α = .05)? • What are the odds ratios for those that are significant? • Anything unusual?
Self-BLAME • Self-BLAME has a negative coefficient (an odds ratio below 1) • This means that increasing self-blame decreases the chance of having PTSD • This is surprising, especially since participants with PTSD showed higher Self-BLAME scores • What's going on?
Self-BLAME and SELF scales • Startup et al. (2007) explain this by stating that Self-BLAME is made up of both behavioural and characterological questions • SELF, however, may also tap into characterological aspects of self-blame • Behavioural self-blame can be considered adaptive. It may help avoid PTSD • Characterological self-blame, however, may be detrimental, and lead to PTSD
Suppressor Effect • The relationship between SELF and PTSD is strong, and it accounts for the negative side of the relationship, including the effect of characterological self-blame • The variation in PTSD that is left for Self-BLAME to account for is the positive aspect of the relationship between Self-BLAME scores and PTSD • The negative aspect of Self-BLAME has been suppressed (already accounted for by SELF), so the positive aspect of Self-BLAME can now come out
Homework (haha) • Evaluate the model by looking at • Accuracy of model’s predictions • Sensitivity of model’s predictions • Specificity of model’s predictions