Logistic Regression using SAS, prepared by Voytek Grus for the SAS user group, Halifax, February 24, 2006.
What is Logistic Regression?
• Regression analysis in which the response variable Y is discrete and represents either categories or counts. There are no restrictions on the predictors.
• A linear regression equation of the type $y_i = \alpha + \beta x_i + \varepsilon_i$ is not appropriate …
• … but, as in linear regression analysis, logistic regression is used to
  • test the statistical significance of the relationship between the response and the predictor variables
  • predict the category of an outcome given its predictors
• It falls into the category of generalized linear models and either complements or offers a flexible alternative to
  • Multiple linear regression: similar equations and statistical diagnostics
  • Contingency tables (cross tabulation)
  • Loglinear models
  • Discriminant analysis: logistic regression answers similar questions but is less restrictive
• A relatively new statistical tool for the analysis of categorical data
  • Contingency tables: 1900s
  • Regression analysis: 1970s
  • Loglinear models: 1975
  • Logistic regression: late 1970s and early 1980s, but it became more popular in the 1990s
Fields of application
• Health sciences: questions about disease (yes or no?)
• Social sciences, which deal with a great many dichotomous variables: employed vs. unemployed, married vs. unmarried, etc.
  • Attitude to work based on demographic or behavioral predictors
  • Racial bias in judicial decisions, etc.
• Political science
  • Which party will voters vote for, and why?
  • Which voters will vote for a particular party?
  • Public opinion polls
• Economics and marketing, to study consumer choice
  • Banks use it to assess the credit rating of customers
  • Some regulators require that utilities submit customer choice studies on energy conservation options
  • Choice of mode of transportation
  • Demand forecasting
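Why not use OLS to estimate the categorical response equation?
• Multiple linear regression of a categorical (binary) response variable does not satisfy two of the assumptions a linear model needs to produce unbiased and efficient coefficients. Homoscedasticity and normality of the errors fail; the other assumptions still hold.
  • Linearity in the coefficients: $y_i = \alpha + \beta x_i + \varepsilon_i$ (holds)
  • $E(\varepsilon_i) = 0$ (holds)
  • Homoscedasticity fails: $\mathrm{var}(\varepsilon_i) \neq \sigma^2$
    • $E(y_i) = 1 \cdot P(y_i = 1) + 0 \cdot P(y_i = 0) = p_i = \alpha + \beta x_i$
    • $\mathrm{var}(\varepsilon_i) = \mathrm{var}(y_i) = p_i (1 - p_i) = (\alpha + \beta x_i)(1 - \alpha - \beta x_i)$, which changes with $x_i$
  • Errors are uncorrelated: $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ (holds)
  • Normality fails: the errors follow a two-point (binomial-type) distribution
    • Each error takes only two values, $\varepsilon_i = 1 - \alpha - \beta x_i$ or $\varepsilon_i = 0 - \alpha - \beta x_i$, so it lies between -1 and 1 whenever $p_i$ is a valid probability
• As a result
  • Coefficient estimates are no longer efficient
  • Standard error estimates are no longer consistent
  • Estimated values of the response variable Y may be implausible: the linear probability model $E(y_i) = p_i = \alpha + \beta x_i$ is unbounded, so fitted probabilities can fall outside the (0, 1) interval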
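The two problems above, a variance that depends on x and fitted probabilities outside (0, 1), can be checked with a few lines of arithmetic. The following is a minimal Python sketch (Python rather than SAS, purely for illustration); the coefficients alpha = 0.1 and beta = 0.2 are hypothetical values, not taken from the slides.

```python
def lpm_probability(alpha, beta, x):
    """Linear probability model: E(y) = p = alpha + beta * x (not bounded)."""
    return alpha + beta * x


def lpm_error_variance(alpha, beta, x):
    """var(eps) = p * (1 - p); it depends on x, so the errors are heteroscedastic."""
    p = lpm_probability(alpha, beta, x)
    return p * (1.0 - p)


if __name__ == "__main__":
    alpha, beta = 0.1, 0.2  # hypothetical coefficients chosen for illustration
    for x in range(0, 6):
        p = lpm_probability(alpha, beta, x)
        var = lpm_error_variance(alpha, beta, x)
        valid = 0.0 < p < 1.0
        print(f"x={x}  p={p:.2f}  var={var:.4f}  valid probability: {valid}")
```

With these values the implied variance rises from x = 0 to x = 2, and at x = 5 the fitted "probability" exceeds 1 and the implied variance turns negative.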
Logit transformation: a remedy for the violation of OLS assumptions
• Instead of estimating the linear equation $y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$, we can apply the logit transformation: $\log[p_i/(1 - p_i)] = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$, where $p_i/(1 - p_i)$ is the odds that the event $y = 1$ will occur.
• Consequences:
  • $p_i = \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}) / (1 + \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}))$ happens to be a cumulative logistic distribution function.
  • No matter what the coefficients are, $p_i$ is always between 0 and 1.
  • The absence of $\varepsilon_i$ complicates statistical analysis: how should standardized coefficients be defined?
  • The derivative of $p_i$ with respect to a predictor is a function of $p_i$: $\partial p_i / \partial x_{ij} = \beta_j p_i (1 - p_i)$. It reflects the changing slope of the S curve, which makes the coefficients difficult to interpret. Be cautious when interpreting coefficients in terms of probabilities.
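The transformation and its derivative can be made concrete with a small Python sketch (Python rather than SAS, for illustration only). The function names and the coefficient values in the demo loop are assumptions made here, not part of the slides.

```python
import math


def inv_logit(eta):
    """Map a linear predictor eta to a probability; written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def marginal_effect(alpha, beta, x):
    """dp/dx = beta * p * (1 - p): the slope of the S curve at x."""
    p = inv_logit(alpha + beta * x)
    return beta * p * (1.0 - p)


if __name__ == "__main__":
    alpha, beta = -2.0, 0.8  # hypothetical coefficients
    for x in [0, 1, 2.5, 4, 6]:
        p = inv_logit(alpha + beta * x)
        print(f"x={x}  p={p:.3f}  dp/dx={marginal_effect(alpha, beta, x):.3f}")
```

The same coefficient beta produces a different change in probability at each x, which is why the slide warns against reading coefficients directly as probability effects; the slope is steepest where p = 0.5.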
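Alternatives to the logit transformation in the context of latent variables: probit and complementary log-log
• In a perfect world there is a model for a continuous response variable $z_i$; the dichotomous logit model is only a simplification of it. There is a true equation $z_i = \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \dots + \alpha_k x_{ik} + \sigma \varepsilon_i$, but $z_i$ cannot be observed: it is latent. Instead we observe a dichotomous $y_i$, which equals 1 or 0 depending on the value of $z_i$ (for example, $y_i = 1$ when $z_i$ exceeds a threshold). The relationship between Y and the predictors X depends on the probability distribution of $\varepsilon$: a logistic distribution gives the logit model, a normal distribution the probit model, and an extreme-value distribution the complementary log-log model.
• The assumed distribution of $\varepsilon$ helps determine standardized coefficients.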
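Each assumed distribution for the latent error leads to a different curve linking the linear predictor to P(y = 1). The Python sketch below (not SAS; function names are chosen here for illustration) evaluates the three standard inverse links, assuming the usual correspondence of logistic errors with logit, normal errors with probit, and extreme-value errors with complementary log-log.

```python
import math


def logistic_cdf(eta):
    """Inverse logit link: P(y=1) = 1 / (1 + exp(-eta))."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def normal_cdf(eta):
    """Inverse probit link: standard normal CDF, computed with math.erf."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def cloglog_cdf(eta):
    """Inverse complementary log-log link: P(y=1) = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


if __name__ == "__main__":
    for eta in [-2.0, -1.0, 0.0, 1.0, 2.0]:
        print(f"eta={eta:+.1f}  logit={logistic_cdf(eta):.3f}  "
              f"probit={normal_cdf(eta):.3f}  cloglog={cloglog_cdf(eta):.3f}")
```

The logit and probit curves are symmetric around eta = 0, while the complementary log-log curve is not, so the choice matters most when the event is rare or very common.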
Logistic Regression in the context of the generalized linear models.
I Logistic Regression compared to ordinary linear regression
II Logistic Regression compared to ordinary linear regression
Summary of SAS procedures for logistic regression analysis
• Binary logit analysis
  • Procedures: LOGISTIC, GENMOD, CATMOD, PROBIT, MDC, NLMIXED
• Multinomial logit analysis
  • Predictors are characteristics of the individual
  • Nominal (no ordering of the Y categories): PROC LOGISTIC, PROC CATMOD
  • Ordinal (inherent ordering of the Y categories): PROC LOGISTIC, PROC CATMOD, PROC GENMOD
• Conditional logit analysis
  • Predictors are characteristics of the response alternatives (the choices)
  • Can use PROC MDC and PROC PHREG
• Logit analysis of clustered data
  • PROC LOGISTIC (or PROC PHREG)
  • PROC GENMOD (GEE)
Binary Logit Models
• PROC LOGISTIC at its simplest: main-effects model
  1. Individual-level data:
       PROC LOGISTIC DATA=input;
         FREQ frequency; /* optional */
         MODEL y = X1 X2;
       RUN;
  2. Grouped data:
       PROC LOGISTIC DATA=input;
         MODEL events/trials = X1 X2;
       RUN;
• PROC LOGISTIC with more features
       PROC LOGISTIC DATA=lrdata.penalty DESCENDING;
         CLASS culp;
         MODEL death = blackd|whitvic|culp / STB LACKFIT AGGREGATE RSQ
               LINK=logit TECHNIQUE=newton CLODDS=BOTH
               SELECTION=stepwise SCALE=WILLIAMS CORRB INFLUENCE IPLOTS;
         UNITS culp=2 / DEFAULT=1;
         OUTPUT OUT=results PRED=phat LOWER=lb UPPER=up RESCHI=stres DFBETAS=dfs;
       RUN;
• PROC GENMOD at its simplest
       PROC GENMOD DATA=lrdata.penalty;
         MODEL y = X1 X2 / DIST=BINOMIAL;
       RUN;
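The procedures above estimate the coefficients by maximum likelihood, and TECHNIQUE=newton requests Newton-Raphson iterations. As a rough illustration of what that iteration does for a single predictor, here is a minimal Python sketch. It is not SAS's implementation; the toy data and the function names are hypothetical, and the code assumes the x values are not all equal and the data are not perfectly separated.

```python
import math


def predict(a, b, x):
    """Fitted probability P(y=1 | x) under the logit model."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def fit_logit_newton(xs, ys, max_iter=25, tol=1e-10):
    """Newton-Raphson for log[p/(1-p)] = a + b*x; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = predict(a, b, x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det   # solve the 2x2 Newton system
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical predictor values
    ys = [0, 0, 1, 0, 1, 0, 1, 1]   # hypothetical binary responses
    a, b = fit_logit_newton(xs, ys)
    print(f"intercept={a:.4f}  slope={b:.4f}  odds ratio per unit x={math.exp(b):.4f}")
```

At the solution the score equations are zero, that is, the fitted probabilities sum to the observed number of events. Exponentiating the slope gives the odds ratio per unit change in x, which is the quantity the CLODDS option reports intervals for.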
Multinomial Logit Models
• Multinomial logit for a nominal response (generalized logit)
  • A separate logit transformation $\log(p_i/(1 - p_i))$ for each of more than two categories does not work, because the resulting category probabilities are not constrained to sum to 1: $\sum_{j=1}^{k} p_{ij} \neq 1$
  • Instead, k-1 equations are estimated against a reference category k: $\log(p_{ij}/p_{ik}) = \alpha_j + \beta_j x_i$, where $j = 1, 2, \dots, k-1$
• Multinomial logit for an ordinal response (cumulative, adjacent-categories, continuation-ratio)
  • The inherent ordering of the Y responses removes the need for a separate set of coefficients in each odds equation
  • Estimate k-1 equations for the odds of the cumulative probabilities $F_{ij}$
  • $\log(F_{ij}/(1 - F_{ij})) = \alpha_j + \beta x_i$: all coefficients except the intercept stay the same across equations
  • Because there is a hierarchy in the categories of the response variable
    • The model is easier to estimate and interpret
    • Hypothesis tests are more powerful
    • There is one coefficient for each predictor but k-1 intercepts
• Available tools in SAS:
  1. PROC LOGISTIC DATA=lrdata.wallet;
       MODEL wallet = male business punish explain / LINK=glogit; /* or LINK=clogit */
     RUN;
  2. PROC CATMOD DATA=lrdata.wallet;
       DIRECT male business punish explain;
       MODEL wallet = male business punish explain / NOITER PRED;
     RUN;
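To see how k-1 equations turn into k category probabilities, the Python sketch below (not SAS; all names and numbers are hypothetical) converts generalized-logit linear predictors and cumulative-logit intercepts into probabilities. It assumes the last category is the reference for the generalized logit and that the cumulative intercepts are increasing.

```python
import math


def inv_logit(eta):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def generalized_logit_probs(etas):
    """etas[j] = log(p_j / p_k) for the k-1 non-reference categories.

    Returns k probabilities; the reference category k comes last.
    """
    expo = [math.exp(e) for e in etas]
    denom = 1.0 + sum(expo)
    return [e / denom for e in expo] + [1.0 / denom]


def cumulative_logit_probs(alphas, beta, x):
    """Cumulative logit: log(F_j / (1 - F_j)) = alphas[j] + beta * x.

    alphas must be increasing; category probabilities are differences
    of successive cumulative probabilities.
    """
    cum = [inv_logit(a + beta * x) for a in alphas] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


if __name__ == "__main__":
    print(generalized_logit_probs([0.4, -0.3]))           # three nominal categories
    print(cumulative_logit_probs([-1.0, 0.5], 0.8, 1.0))  # three ordered categories
```

In the generalized logit each non-reference category has its own linear predictor, while the cumulative logit shares one slope and only shifts the intercept, which is the "one coefficient per predictor but k-1 intercepts" point made on the slide.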
Conditional Logit Models
• Consumer choice studies
  • Consumer taste preferences, choice of mode of transportation, locational characteristics for a retail store
• Conditional logit:
    proc mdc;
      model decision = x1 x2 / type=clogit choice=(mode 1 2 3);
      id pid;
    run;
• Nested logit:
    proc mdc data=newdata;
      model decision = ttime / type=nlogit choice=(mode 1 2 3) covest=hess;
      id pid;
      utility u(1,) = ttime;
      nest level(1) = (1 2 @ 1, 3 @ 2),
           level(2) = (1 2 @ 1);
    run;
• Analysis of clustered data
  • Observations within clusters are often dependent: longitudinal data, students clustered in classrooms or schools, husbands and wives clustered in families, etc.
  • Dependent observations produce underestimated standard errors, overstated test statistics, and inefficient coefficient estimates.
  • Remedies: GEE (PROC GENMOD), conditional logit (PROC LOGISTIC or PROC PHREG), and other methods such as mixed models or hybrids of the above.
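In a conditional logit model the predictors describe the alternatives (for example, travel time and cost of each mode), and the probability of choosing an alternative is its exponentiated utility divided by the sum over all alternatives. The Python sketch below illustrates that standard formula; it is not PROC MDC output, and the attribute values and coefficients are hypothetical.

```python
import math


def conditional_logit_probs(attributes, betas):
    """Choice probabilities for one decision maker.

    attributes: one row of attribute values per alternative.
    betas: one coefficient per attribute, shared by all alternatives.
    """
    utilities = [sum(b * v for b, v in zip(betas, row)) for row in attributes]
    top = max(utilities)                      # subtract the max for numerical stability
    expo = [math.exp(u - top) for u in utilities]
    total = sum(expo)
    return [e / total for e in expo]


if __name__ == "__main__":
    # Hypothetical (travel time in minutes, cost) for three modes.
    modes = [[30, 5.0], [45, 2.0], [60, 1.0]]
    betas = [-0.05, -0.2]                     # hypothetical coefficients
    for mode, p in zip((1, 2, 3), conditional_logit_probs(modes, betas)):
        print(f"mode {mode}: P(choice) = {p:.3f}")
```

A variable that takes the same value for every alternative, such as a characteristic of the individual, cancels out of the ratio and leaves the probabilities unchanged. That is why the predictors in this model must vary across alternatives.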
Consumer Choice Modeling: Nested Logit Example
• Example
    proc mdc data=travel2 maxit=200 outest=a;
      model choice = ttime time cost / type=nlogit choice=(mode);
      id id;
      utility u(1, 1 2 3 @ 1) = ttime time cost,
              u(1, 4 @ 2) = time cost;
      nest level(1) = (1 2 3 @ 1, 4 @ 2),
           level(2) = (1 2 @ 1);
    run;
[Figure: decision tree with Level 2 and Level 1 nests]
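The nesting in the example groups alternatives 1, 2 and 3 in one nest and alternative 4 in another. A Python sketch of how nested logit probabilities are commonly computed for such a tree follows. It uses the textbook form with an inclusive value and a dissimilarity parameter per nest, which is an assumption here and not necessarily the parametrization PROC MDC estimates; the utilities and parameter values are hypothetical.

```python
import math


def nested_logit_probs(utilities, nests, taus):
    """Two-level nested logit choice probabilities.

    utilities: dict alternative -> systematic utility V.
    nests: dict nest -> list of alternatives in that nest.
    taus: dict nest -> dissimilarity parameter in (0, 1].
    """
    # Within-nest sums of exp(V / tau) and their logs (the inclusive values).
    within = {n: sum(math.exp(utilities[a] / taus[n]) for a in alts)
              for n, alts in nests.items()}
    inclusive = {n: math.log(s) for n, s in within.items()}
    denom = sum(math.exp(taus[n] * inclusive[n]) for n in nests)

    probs = {}
    for n, alts in nests.items():
        p_nest = math.exp(taus[n] * inclusive[n]) / denom      # P(nest)
        for a in alts:
            p_within = math.exp(utilities[a] / taus[n]) / within[n]  # P(alt | nest)
            probs[a] = p_nest * p_within
    return probs


if __name__ == "__main__":
    v = {1: -1.0, 2: -1.2, 3: -0.8, 4: -1.5}   # hypothetical utilities
    nests = {1: [1, 2, 3], 2: [4]}             # mirrors level(1) in the example
    taus = {1: 0.6, 2: 1.0}                    # hypothetical dissimilarity parameters
    for alt, p in sorted(nested_logit_probs(v, nests, taus).items()):
        print(f"alternative {alt}: P(choice) = {p:.3f}")
```

When every dissimilarity parameter equals 1 the formula collapses to the ordinary conditional logit, so the nests only matter when alternatives within a nest are closer substitutes for each other than for alternatives outside it.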