Limited Dependent Variables Required Readings: • Long, J. Scott and Jeremy Freese*. ”Regression Models for Categorical Dependent Variables Using STATA”. 2006 *Chapter 2 of this book is great for learning codes
Outline • Today (30 Jan) – • Overview of Logit & Probit regression using a dichotomous Dep. Variable, interpretation of results, workshop exercise - Long & Freese chapter 5 (recommended 6 also) • Tomorrow – model diagnostics, limited and categorical dependent variables, Ordered & multinomial logit. STATA exercise - Long & Freese chapter 7 • Friday – cont., count outcome variables and censored variables: Poisson, negative binomial models, Tobit, Heckman selection models - Long & Freese, chapter 8
Learning objectives for this part of the course • To understand the basic purpose and idea behind regression with binary and limited outcome variables & why OLS is inappropriate • To understand how to set up a model (logit, probit, ologit, mlogit, Poisson, nbreg, etc.), run estimates and interpret results with odds ratios, predicted probabilities and marginal effects using STATA • To be able to evaluate overall model strength and compare the relative explanatory strength of different model specifications • Understand some of the potential problems with logistic estimation and be able to diagnose and evaluate such issues • To be able to clearly present your results for readers outside of your research field in several different ways with tables & visuals
Why use Logistic regression? • OLS is great, but often it’s not appropriate for our data.. • There are many important research topics for which the dependent variable is "limited." • Binary/Dichotomous logistic (or Probit) regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote) • Or when the Dep. Variable is “limited” (non-continuous), or takes on only a few values. We cover this next class.. • Very important: The choice to use a logistic model is determined by the dependent variable rather than any independent variables in your model!!
Some Common examples in Social Sciences 1. Political science/IR: -Why do some individuals vote for a certain candidate or party? -Why/under what circumstances do two countries go to war with one another? 2. Economics/marketing -why a firm enters a marketplace -models explaining why individuals choose to buy a certain product (e.g. Coke instead of Pepsi) 3. Criminology -models explaining why people commit a crime or not 4. Sociology -models on high school/university graduation -marital status -health issues (having a disease or not)
Why can’t we just use OLS?? The Problem: In OLS regression, ‘e’ (the error term) is assumed to be uncorrelated with the regressors (exogeneity), constant for all levels of X (homoskedasticity), and normally distributed – but not where Y = (0, 1) *This does not necessarily bias the coefficients (+ or -), but will bias (underestimate) the SE’s (lead to type I error) *results in misleading hypothesis testing & produce wrong estimates of the MAGNITUDE of X on Y, especially at large/small values.. Prior to Logit/probit models, most people just used…
1. Initial model: the Linear Probability Model (LPM) • Tries to literally fit a linear estimation (OLS) to a binary outcome: Pr(Y=1|X) = α + βX + e *This is just like running OLS with a binary/limited DV*
1. Initial model: the Linear Probability Model (LPM) • Assumes that the effect of ’X’ on ’Y’ is constant (linear), thus has the same problems associated with the OLS estimates of the previous slides, plus more issues. • Places no restrictions on IV’s (can be dummy, continuous, etc.) just like other binary models (logit, probit) • The LPM is not used in many contemporary quantitative studies (however, there are some exceptions).. • Let’s take an example….
Simple Example: 2016 US Election – ’Trump vote’ and Voter Income • Take the votes and incomes of 20 US voters DV = VoteTrump (1/0) IV = yearly income (in thousands $) How to estimate this relationship?? Several options (sketched in STATA below): • linear OLS (LPM) • ’reverse’ the IV and DV and do a simple t-test of means (no controls..) • Probit model • Logit model
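A minimal STATA sketch of the four options, assuming the example’s variable names (votetrump, income) – the dataset itself is not distributed with these slides:
regress votetrump income // 1. linear OLS: the LPM
ttest income, by(votetrump) // 2. ’reversed’: compare mean income across the two vote groups
probit votetrump income // 3. probit model
logit votetrump income // 4. logit model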
Effect of Income on Vote for Trump • What do we observe? • On average, the higher the income, the more likely to vote for Trump.. • Regressing OLS on this DV is what Long calls the ’LPM’, & it gives us Pr(Y=1) Model interpretation • Probability of someone with a $5k yearly income? • Someone with a $130k income?
Calculating probabilities of Xi in our LPM For 5k, we get: -0.058 + (0.0086*5) = -0.015, or -1.5% pr=1 For 130k, we get: -0.058 + (0.0086*130) = 1.06, or 106% pr=1 What do we make of these predictions?!?
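The same predictions in STATA (a sketch, again assuming the votetrump/income names; _b[] pulls the stored coefficients):
regress votetrump income // the LPM: OLS on a 0/1 outcome
display _b[_cons] + _b[income]*5 // ≈ -.015: a negative ’probability’
display _b[_cons] + _b[income]*130 // ≈ 1.06: a ’probability’ over 100%
predict yhat // fitted Pr(Y=1) for every voter, not constrained to [0,1]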
Let’s plot the residuals (Y - yhat) against Pr(Y=1) from OLS: ’rvfplot’
Our residuals look like this because – when Y = (0, 1) then: e = -yhat (when Yi=0) or e = 1 - yhat (when Yi=1) Remember - ’yhat’ = predicted value of Yi from our model Our residual plots have a lower line (for Yi=0 cases) & an upper line (for Yi=1 cases), & for both, as yhat goes up, absolute error levels go down.. So, if yhat = .2 for example, our ’e’ value can ONLY be -.2 or .8 So if Y = (0,1), we are left with several key problems if using OLS: 1a. ‘e’ in a binomial distribution is thus HETEROSKEDASTIC – in OLS, under homoskedasticity, when we plot y-yhat (residuals) over yhat, what SHOULD it look like??? Our simple example shows this is NOT the case.. (efficiency & bias)
Key problems 1b. If Var(e|X) = σ² is assumed, this means OLS treats ALL values of X as having the same variance, e.g. it doesn’t matter if Xi values are high or low -result: Errors are not NORMALLY DISTRIBUTED. Where Y = (0,1), errors can only take on 2 values for a given Xi – a clear violation (efficiency) 2. Exogeneity: For dichotomous DV’s, this is ONLY the case if Pr(Y=1) = 0.5. Our regressors are ENDOGENOUS (e.g. X’s correlated with the error term) Thus error terms are incorrect & hypothesis testing is less reliable 3. Linearity of Betas – we want to know the Pr(Y=1), which ranges between 0-1 – OLS cannot constrain values, thus we get unrealistic predictions & specify the wrong FUNCTIONAL FORM of X. Plus, is it fair to say the marginal effect of X is constant? (bias)
How to ’fix’ this - Link Functions • These ’link’ the actual ’Y’ values to the linear predictor in our statistical models. • So we take what is called a ”link function” F(Y), that takes Y and makes it something continuous we can model linearly • These can be logged (as we discussed in OLS) or even sq. root DV’s in OLS (which transforms the ’real Y’ to a ’logged Y’).. But in our case, we want to go from just estimating Y = (0,1) to Y as Odds or Probability
Link functions cont. • We would need to transform our dichotomous DV (Y) into a (somewhat) continuous DV, for example, the log of the odds: ln[p/(1-p)] • For dichotomous DV’s we need to find a function F(Y) that goes from (0,1), is normally distributed & predicts that as Xi increases, Pr(Y=1) increases (or decreases). • For starters, statisticians discovered that we could use the probability density function (PDF, e.g. the normal bell curve) from which we draw hypothesis testing with Z-scores, etc. • If you test the significance of any Beta in your OLS model you get a p-value (that corresponds to a Z-score), which ranges from 0 to 1. • We simply take the inverse of this (called the ’cumulative density function’ – CDF), which is also normally distributed
Key concept: Cumulative density function (CDF) • Just like a normal probability distribution function (PDF), we want to know, given the value of Xi, what is the Yi, in terms of Pr(Y=1). • For this, we use the same logic from a standardized Bell curve (z-scores): a value of -1 implies about 16% Pr(Y=1), 2 implies 97.7% Pr(Y=1), etc.. • BUT, the effect is NON-linear (little effect on the DV for low/high values, and strong effect in the middle of the distribution). • So the normal PDF is transformed into the CDF to better capture this effect
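These CDF values are easy to verify with STATA’s built-in normal() function (core Stata, not specific to these slides):
display normal(-1) // .1587 – about 16% Pr(Y=1) at z = -1
display normal(2) // .9772 – about 97.7% Pr(Y=1) at z = 2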
The Probit Model (Bliss and Fisher 1935)
Our first Alternative: The Probit Model • We see from the CDF that the effect of Xi on the Pr(Y=1) is non-linear, but probit wants to estimate it like a linear model – Long and Freese ch 5 • So, we adjust for this with: Pr(Y=1|Xi) = Φ(α + βXi), where Φ is the standard normal CDF • We thus impose the standard normal function on our link function • The function gives us the PROBABILITY AREA in the standard normal distribution (like in a bell curve) of Pr(Y=1|Xi) • Error term assumed (like OLS) to have a mean of 0 and variance of 1 • So, probit produces z-scores that you can look up, just like with standardized variables, or with hypothesis testing in OLS • To make the model ’linear-like’ and fit the probit distribution, we take the inverse of the standard normal CDF: Φ⁻¹(Pr(Y=1|Xi)) = α + βXi • Coefficients are NOT probabilities, but scaled as the inverse of the standard normal distribution.. But easy to calculate into probabilities..
ex. predicting Y=1 at values of Xi • Instead of OLS, we just run: probit votetrump income in STATA
ex. predicting Y=1 at values of Xi • Instead of OLS, we just run: probit yvar xvar in STATA • We get Pr(Y=1) = Φ(α + βIncome(0.038)), here with α = -2.31 • Let’s say we want to know Pr(voteTrump) for someone with 80k income.. • Thus -2.31 + (0.038*80) = 0.73 • Φ(0.73) = 0.767, or about a 77% likelihood of voting for Trump given someone has an income of 80k • That was really cool, how did I do that?? 1. Look it up on a z-score table…. OR, 2. In Stata, type: display normal(.73) (back to this later…) 3. predict hat – and check the yhat value of any observation with the x-value of interest (more later on this..)
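Putting those steps together in STATA (a sketch using the example’s assumed variable names):
probit votetrump income // fit the probit model
display normal(_b[_cons] + _b[income]*80) // Pr(voteTrump=1 | income = 80), ≈ .767
predict hat // predicted Pr(Y=1) for each observation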
So, for any value of X, probit just takes the area to the left of Xi. This is distributed normally and the probability of Y=1 is the AREA covered in the PDF. At 65k (z = 0.16) for ex. ≈ 56%; *for income levels of 160k, we get z = 3.77, and Φ(3.77) ≈ 0.9999 = 99.99%
Or put in terms of the CDF – this is EXACTLY the same area!! (in Stata: ’twoway (connected hat income)’)
The Logit Model (David Cox, 1958)
another option: Logit Models • another nonlinear regression model that forces the output (predicted values) to be between 0 and 1: Pr(Y=1|Xi) = exp(α + βXi)/[1 + exp(α + βXi)] • Like probit, logit models estimate the probability of your dependent variable being 1 (Y=1|Xi). This is the probability that some event happens, given a certain level of X. You can always reverse this if you want (Pr(Y=0)).. • ln[p/(1-p)] is the log odds ratio, or "logit" (which is what is different from probit, which uses the inverse normal CDF, Φ⁻¹(p)) • The logistic error term has mean 0 and variance π²/3, whereas probit’s was 0, 1..
Like probit with the CDF (Φ), we need a formula for the logistic transformation (our ‘link function’): ln[p/(1-p)] = α + βX Here p/(1-p) is the odds. As the probability increases (from zero to 1), the odds increase from 0 to infinity. Odds CANNOT be negative So if β is ‘large’ then as X increases the log of the odds will increase steeply. The log of the odds then increases from –infinity to +infinity. The steepness of the curve will therefore increase as β gets larger
Odds vs. probability • What is the difference?? • Really, they express the same thing – the chance that a given outcome will occur Simple difference: • Probability = # of times event occurred/total number of tries or observations • Odds = the probability an event will occur/(1 – the probability an event will occur). *For example, say we want to know the Pr(graduate) and we observe that out of 100 students, 80 did and 20 did not. *The probability of a student graduating from our sample is thus 80/100 = .80 or 80% The odds of a student graduating are .80/.20 = 4/1 A probability of 0 = odds of 0, and a 0.5 probability = 1.00 odds (’even money’) All Pr<0.5 range between 0 and 1 for odds All Pr>0.5 range between 1 and infinity for odds
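The conversion is quick to check in STATA with display (plain arithmetic, nothing dataset-specific):
display .80/(1 - .80) // odds of graduating = 4 (’4 to 1’)
display 4/(1 + 4) // and back: probability = .80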
Logit Model Cont. • In comparison to the linear probability (LPM) estimates, the logistic distribution constrains the estimated probabilities to be between 0 and 1. The estimated probability is defined as (Long & Freese p192): Pr(Yi=1|Xi) = 1/[1 + exp(-α - βXi)] • as α + βX increases, p approaches 1 • as α + βX decreases, p approaches 0 • if you let α + βX = 0, then p = .50
Logit vs. Normal curve (e.g. probit) • The standard logistic curve is flatter than the normal (probit) distribution since it has a slightly larger variance for the error term (π²/3 ≈ 3.29 vs. 1) • Logit and probit will ALWAYS have the same signs for βs (given the same model) • Coefficients will thus be about 1.7x greater for Logit than Probit for the same model • Probit will have slightly higher probabilities for Xi around the mean, but logit greater at more extreme values..
Model fit – OLS vs Logit • So that’s what we want to do, but how do we do it? • With OLS we tried to minimize the squares of the residuals (which is why it’s called “least squares”..), to get the best fitting line for each IV regressed onto Y. • When the DV is binary, there are only 2 values & the errors won’t be normally distributed. Thus the ‘least squares’ technique does not seem really logical.. • So instead, for logit and probit, we use something called maximum likelihood to estimate what the β and α are.
Fitting Logit models.. What’s going on here? • Maximum likelihood (ML) is an iterative process that estimates the best fitted equation. (see p84 in Long and Freese) • Iterative? This just means that STATA tries lots of models until we get to a situation where alternative ways do not improve the ‘fit’ of the model given our constraints (e.g. IV’s that are in our model) • The ML process is pretty complicated, although the basic idea is very intuitive: we find the coefficient values that make the observed data most likely. (more on that in a bit..) • In either case, the coefficients produced (while direction & sig. are interesting) for both logit and probit are essentially meaningless – e.g. not marginal effects like in OLS ***logit requires a bit larger sample – rule of thumb: 20 obs per IV
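To make ‘most likely’ concrete: for logit, ML picks the α and β that maximize the standard (textbook) log-likelihood of the observed 0/1 outcomes:
ln L(α, β) = Σi [ yi·ln(pi) + (1 - yi)·ln(1 - pi) ], where pi = 1/[1 + exp(-α - βXi)]
Each ‘iteration’ line in the STATA output is one step uphill on this function, until no further step improves it.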
Logit Regression output: let’s compare & interpret • Logit regression shows the impact of Income on Pr(voteTrump) is positive & sig. • BUT we cannot interpret the Beta coefficients as marginal effects like in OLS due to our link function; we need to take a more ’interpretable’ number: • Predicted Probabilities (logit and probit) • Odds Ratios (only logit) • Marginal effects (logit and probit)
The Effect of Income on Voting for Trump: Predicted Probabilities Simple Interpretation: What is the probability that a voter voted for Trump with an income of: • 5k? • 65k? • 130k? I. Calculate using the formula: 1/(1 + exp(3.85 - (65*0.064))) ≈ 0.571 = 57.1% Or… II. Based on the logit regression estimates, we can produce the predicted probability for each voter Pr(Y=1) using the post-estimation STATA command: ’predict y_hat’ Like probit, the range is between 0 and 1
Or visually… (scatter y_hat income) • So, you can see by the scatterplot, when income is 65k per year, the Pr(voteTrump) = 57.1% • STATA command: twoway (connected y_hat income)
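As a sketch, the full sequence in STATA (variable names as assumed throughout the example; exact numbers depend on the data):
logit votetrump income // fit the logit model
predict y_hat // predicted Pr(voteTrump=1) for each voter
display 1/(1 + exp(-(_b[_cons] + _b[income]*65))) // ≈ .57 at income = 65
twoway (connected y_hat income) // plot the S-shaped predicted curve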
The Effect of Income on Voting for Trump: ODDS RATIOS • Remember, since: ln[Pr(vote Trump)/(1 - Pr(vote Trump))] = α + β·income + e We interpret the slope on income as the rate of change in the "log odds" as income changes by one unit… does anyone know what that means?? • So, like taking the Predicted Prob, we can also take the “Odds Ratio” – remember, Probability = 1/[1 + exp(-α - β·income)] • The marginal effect of a change in income on the probability is: Δp/Δincome = β·f(α + β·income), where f is the logistic density • Since the exponential function is the inverse of the log, the odds are just: p/(1-p) = exp(α + βX), and exp(β) is the odds ratio for a one-unit change in X • In STATA, we can get the odds ratios just by running the following: logit yvar xvar , or **N.B. – the odds-ratio output shows no constant & reports z-scores
cont. • Odds ratios >1 imply a positive effect, & <1 a negative effect. • Also, a value of 2 is of equal strength as 0.5. Same with 4 and 0.25, etc. • Let’s take another example, using a multivariate analysis, adding gender (see the sketch below) interpretation • Basically, if OR < 1, subtract the value from 1 (1 – 0.798 = 0.202): ”holding income constant, the odds of voting for Trump decrease by 0.20 (20%) for women compared with men” (p is insig..) • If OR > 1, subtract ’1’ from the OR value (for income, 1.065 - 1 = 0.065): ”odds increase by .065 (or 6.5%) for each increase in 10k”
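A hedged sketch of that multivariate model – ’female’ is a hypothetical name for the gender dummy, which is not named on the slide:
logit votetrump income female, or // the ’or’ option reports odds ratios instead of betas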
More Interpretation: marginal effects & interactions (pp. 227-260) • Ok, let’s use a bigger dataset and multivariate analysis (from Long and Freese) - the example data on explaining cases of women’s workforce participation in the US. • Here we are simply interested in knowing whether kids and other factors affect the likelihood of women’s workforce participation (see p194) • The data in GUL or epost: US women's wfp.dta
Let’s simplify the age variable into 3 categories: agecat & label the variable’s categories • gen agecat=1 if age<40 • replace agecat=2 if age>39 & age<50 • replace agecat=3 if age>49 • label define agecat 1 "30-39" 2 "40-49" 3 "50+" • label values agecat agecat
Our model: logit lfp k5 k618 i.agecat wc hc lwg inc • having young kids (5 and under) negatively affects LFP, but no effect w/ kids 6-18. • Younger women are more likely than older ones to have LFP • A woman going to college (wc) positively affects LFP, but husband’s college (hc) has no effect • A wife’s estimated wages (lwg) are positively associated with LFP, while family income minus the wife’s (inc) is negatively associated • Can we make these results more ”tangible”?? YES!!
So, in other words, Here we will continue using the margins and mchange commands **see Chapter 6 in the book for more details Some other ways you might want to interpret: • Predict the probability of the DV with all IV’s at their means • Predict the Pr(DV) at various levels of an IV, given all others at their mean or at certain levels (specific predictions) (MER’s) • Marginal effects – e.g. show changes in Y as X changes (AME’s and MEM’s) • Interaction effects / quadratic effects
1. Predict the DV when all IV’s are at their means • Very simple: what is the probability of LFP for a woman who has the ’mean value’ of all IV’s (simultaneously)? • We can say that the average person has a 56.8% chance of LFP, but is this super useful?? Maybe a little… • To get this, just type ’margins’ after the regression (plain ’margins’ reports the average predicted probability; add the atmeans option for the prediction with every IV at its mean – see the sketch below)
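A sketch with the model above (margins is built-in STATA post-estimation):
logit lfp k5 k618 i.agecat wc hc lwg inc
margins // average predicted Pr(lfp=1)
margins, atmeans // predicted Pr(lfp=1) with every IV at its mean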
2. Predict the DV with 1 (sig) IV at different levels, given all others at their mean • How about finding out what the Pr(LFP) is at different ages, holding all else constant? • Here we see the mean values of the other IV’s held constant, but ’age’ is taken at 30’s, 40’s and 50+. • Pr(LFP) in one’s 30’s is 69.7%, while when over 50, it is only 39.1%
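In STATA (since agecat entered the model as i.agecat, margins can profile it directly):
margins agecat, atmeans // Pr(lfp=1) for each age category, other IV’s at their means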
2b. Predict DV with specific ’meaningful’ values for all IV’s What about Pr(Y=1) for a woman in her 30’s with one small child who, along with her husband, went to college and has mean personal and family income? Or a female age 50+ with 4 kids (2 <5, 2 >5), no college, and one standard deviation below the mean for personal and family income? (a sketch of the first profile follows)
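A hedged sketch of the first profile using margins’ at() option (k618=0 is an assumption, since the profile mentions only one small child; the second profile would additionally need the sample SDs, e.g. from summarize):
margins, at(agecat=1 k5=1 k618=0 wc=1 hc=1 (mean) lwg inc)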
3. Marginal Effects “A ME [marginal effect], or partial effect, most often measures the effect on the conditional mean of Y of a change in one of the regressors, say Xk. In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, this is no longer the case, leading to remarkably many different methods for calculating MEs.” (Cameron and Trivedi 2004: 333) • There are several ways to show that Xi’s effect on Pr(Y=1) is NOT just its own, but also depends on where the other IV’s in our model are, and on various levels of X.. 1. We can see the effect of X just assuming that all others are held at means (MEM), or the average effect of X over all values of X (AME) – see pp. 244-245 Long & Freese for a discussion on which is ‘better’! 2. Another way is to let the other IV’s vary and see Xi’s effect across a range of values for other IV’s of interest.. (MER’s)
Marginal effects - Cont. 1. For categorical (dummy) IV’s, the ME shows how Pr(Y=1) changes as Xi changes from 0 to 1 (male to female, employed to unemployed, etc.), or: Pr(Y=1|X, Xi=1) - Pr(Y=1|X, Xi=0) 2. For continuous IV’s, the ME shows how Pr(Y=1) changes as Xi increases by ONE UNIT (age would be years for ex.) – the predicted probability itself still ranges from 0-1
a. Average Marginal Effects (AME’s): the ’mchange’ command *can also be done with margins, see p248. We also see the ’mean prediction’ given our model (sketched below)
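Both routes in STATA (mchange comes from Long & Freese’s SPost add-on, installed separately; margins is built in):
logit lfp k5 k618 i.agecat wc hc lwg inc
margins, dydx(*) // AME’s: average marginal effects
margins, dydx(*) atmeans // MEM’s: marginal effects at the means
mchange // SPost equivalent; also reports the mean prediction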