640 likes | 648 Views
Learn about the assumptions, examples, and interpretation of multinomial logit models. Understand how to estimate the relative probability of categorical outcomes using a common set of parameters.
E N D
Limited DV Day 3 - Agenda Yesterday • Modeldiagnostics for logit • Ordered logit - assumptions, example and interpretation Today • Multinomial logit • In classexercise: ordered and multinomial logit • Estimatingcountvariables – Poisson and negative binmialmodels • Estiatingcensured data (iftime) –tobit, heckmanmodels
2. Multinomial Logit • Similartoordered logit, whenour DV takes on 2+ values, but still limited – 3, 4, 5 categories for example. • Unlikeordered logit, the categoriesof the DV are ’not ordered’, butarenominal categories(aka ’categorical’).. • Weareinterested in the relative probabilityoftheseoutcomesusing a common set of parameters (IV’s) • For example- given a set ofIV’s (education, country/regional origin, parent’sincome, rural/urban) wemightwanttoknow the following: • Choice of a foreignlanguage – English, Spanish, Chinese, Swedish • Choice of drink: coffee, Coke, juice, wine • Choice ofoccupation – police, teacher, or healthcareworker • mode oftransportation – car, but, tram, train • Voting for a party or bloc – R-G, Alliansen or S.D. (whatused to be anyway…)
Assumptionsof’mlogit’ models: • a common set of parameters (IV’s) canlinearlypredictprobabilitiesof DV categoricaloutcomes, but do not assumeerror term is constantacross Y outcomes.. • UnlikeOlogit, theseIV’sareCASE SPECIFIC – have independent effects on eachcategoryof the DV (e.g. different Betas acrosscategories – no ’parallel odds assumption’..) • ”Independenceof Irrelevant Alternatives” (IIA, from Arrow’s ’impossibilitytheorom) – the odds/probabilityofchosingonecaseof the DV over anotherdoes not depend on another’spresence or absence, ’irrelevant alternatives’ **strong assumption** *Multinomial logit is not appropriate if the assumption is violated.
Multinomial Logit Assumptions: IIA • IIA Example 1: Voting for certain parties **For ex., the probabilities of someoneS, V, L, M, KD or, SD vs. M does not changeif MP is added or taken away • Is IIA assumption likely met in this election model? • Probably not.. If MP were removed, those voters would maybe vote for S, V or C. • Removal of MP wouldincreaselikleyhood for S, C or V relative to M • IIA Example 2: Consumer Preferences • Options: coffee, juice, wine, Coke • Might meet IIA assumption • Options: coffee, juice, Coke, Pepsi • Won’t meet IIA assumption. Coke & Pepsi are very similar – substitutable. • Removal of Pepsi will drastically change odds ratios for coke vs. others.
Multinomial Logit Assumptions: IIA What to do about this issue? Long and Freese (2006): • “Multinomial and conditional logit models should only be used in cases where the alternatives “can plausibly be assumed to be distinct and weighed independently in the eyes of the decision-maker.” • Categories should be “distinct alternatives”, not substitutes. • Theory & argument very important • Note: There are some formal tests for violation of IIA. But they don’t always work well. Be cautious of them. • See Long and Freese (2006: 243)
Let’s do a simple example: Danishelection (from 2014 ESS data), testingfor howsupport for EU impactsone’s party vote, controlling for otherIV’s age, gender, unemployment, education, region • 1st, Wehave a sampleof ca. 1000 Dansk respondents and lots ofparties- butlet’sgrouptheminto 3 groupstoreduceviolationof IIA (’Votebloc3’): 1= Red Bloc(A+B+F+Ø+Å) 2= BlueBloc(V+I+C+K) 3=Dansk Folkparti (DF) **Mlogit is more complicated than logit or even ologit, because it needs to estimate more parameters (K+1) + (Y-1), K=number of IV’s (plus a constant), Y= number of outcomes in the DV. The interpretation is a bit more complicated & cumbersome… • Seemailfor data – what do weobserve?
Simple look (bivariate) Wewant to seehowvoter’s feelings about the EUallignwiththeir national party choices (euftf - ’EU unificationgonetoo far – not gone far enough , 0-10). Let’slook at the summary stats by bloc: What do wesee? Does thisholdwithcontrolvariables?
How is the modeltestingthis? • The outcomeof an ’mlogit’ modelalways has Y-1 parts (total outcomes – a baslinecategory. Let’suse Red, the (then) incumbentblocofparties. **Coefficientsarethus all relative to the ’baseline’ or ’common reference’ (like dummy variables as IVs in a model..) • Theycorrespondto Y-1 seperateequations (in ourcase 2 equations): • = • = ***euftf is also a limited, ordinalvariable, butwith 10 categories, let’s try it is a continuous IV (wecanalwaysuse dummys for eachcateogryifwewantto as well..)
Baselinemlogitmodel What do wesee?? Log likelihood & Psuedo R2 allowsustocomparemodels Chi2 tellsusmodelsignificance Coefficientsgiveus the relative log odds ofourX’s for eachoutcome relative to the baseline (Red)
*seewordfile for clearerpic..** • Notice the ’rrr’ command? Thisproduces the ’relative risk’, or odds ratio.. • What do wefindnow?? • EU support holds • Control variables? What do younotice? Gender f/e.. • regional variable – comparison in DK01 (Copenhagen) Note: Wecanchange the baselinegroup: baseline (x)
Sig. test ofvariable(s) Post-regression command: -withcontrols, wecan test toseeif the overall effectofeuftf (or anyother IV) is significant on our DV with ’test’, (the Ho is that the effectof the variable in question =0) Clearlysignificant overall
Interpretation, Interpretation, Interpretation • wefindsomeinterestingresults: • Most positive toward EU (wantmore integration) mostlikelytovote Red, leastlikleytovote DF • Comparedwith Red, Bluevotersaremoremale, less educated, less urban • Unemployedmostlikleytovote Red • No sig. Difference for gender between Red and DF (butwomen less likelytovoteBlue..) • Voters from S. Denmark and Midtjylland most likley to vote DF or Blue over Red. Voters in Copenhagen region most likely go for Red. *Canwe be morespecific? YES!! • start with the ’margins’ commandto do PredictedProbabilities for each party….
Diagnositics withMLogit • Again, like logit (and ologit), we test the signficanceof the full modelwith the χ² statistic, and ’improvements’ (or omitted/ irellevantvariables) with an LR test using the log likelihoodratios. • Again, Psudeo-R2 is meaningless by itself – onlycomparedtoothermodelswith the same sample..BUT, the higher, the better • Detectingoutliers is harder in mlogitthan logit. Similar stats applyhowever
Testing for omitted or irrelevant IV’s: an LR (all IV’s) or Wald Chi-Sq test – no factorIV’showever (’i’ & ’c’)… mlogtest, all
Testing the IIA & CombiningDependentcategoriesof Y • Hausman test • Small-Hsiao test (in mlogtest, all) Both test the Ho that the IIA is NOT violated (e.g. independenceofcategories) 2 similar tests (Wald & LR) **wefindthat the predictors for the 3 votingblocsarestatisticallysignficant. **If violated, considercollapsingcategories, or running simple logit
Exampleof logit and mlogit regression Charron, N. and A. Bågenholm. 2016. Ideology, party systems and corruption voting in Europeandemocracies. ElectoralStudies 41, 35-49 see GUL, day 3 logit
Topic 3: Event Count Models • Again, wedetermine the useof an Event Count model by the structureofour DV • So far, we’velooked at variablesthathave normal and binary distributions (OLS, and Logit). We’llnowconsider a 3rd type, ’Gamma’ distributions • In thiscase, the DV is: • a FIXED numberofoutcomes & NOT binary • For ex., can be unitsoftime (days, years, etc), units in fixedtime (individual or geographicunit) • Ordinal (more later ifyour DV is continuous) • Positive (butcantake ’0’)
Some examples • Numberof new politicalparties entering parliament in a given electionyear • The numberofpolitical protests or coup d’Etats in a country-year • Numberofpresidential vetos in a year or mandate period • Numberofchildren in a household • Numberof vaccinations a child gets in a year, or doctor visits an adult makes • Numberofcivicorganizations an individualjoins or is a memberof in a given year.
Keycharacteristicsof ’Event Data’ • The countof events is non-negative • areindependent ofoneanother • Counts must be integers (e.g. discrete) – cannot be 2.2, 3.7 but 2 or 4. • Canhave 1-parameter (λ) distribution (mean=VAR) • Using a histogram, weseethat the distribution of Yi outcomes is usuallylarge in 0 or 1, and diminishes rapidlyfrom the 2nd or 3rd outcome on… • The distribution is thus NOT normal (’Gausian’)– it is a ’gamma distribution: for count data weusethesemodels: 1.Poisson 2. negativel binomal
1. PoissonModels: Assumptions & workings • Like logit, estimateswith Maximum Likelihoodestimation (MLE), whichfinds the valueof the parameter that fits the model ’best’ (log likelihood) • Our ”linkfunction” in thiscase is Lambda – λ Goalsareto: 1) estimate the increasePr(Y=n) for a unitchange in X. In Poisson regression, the model expresses the log outcome rate as a linear function of a set of predictors. (like Logit, β’s need to be transformed for interpretation) 2) predict the expectedcount-outcome (group) for an observation (like ologit). Butbecauseofour DV distribution, the normal/logit curvecan’t be used, thus the Gamma distribution fillsthis gap..
Whybetterthan OLS?? -like ourTrumpexample from logit, OLS willproduce a linearestimateof the relationship betweenβX and Y thatwill be less than 0 and greaterthanourhighestcount (unrealisticpredictions). -OLS assumes the difference is the same between all counts in Y (0 to 1 is the same as 3 to 4), like Ologit, Poissondoes not. -wewillalmostalwayshaveheteroskadasticity (as therewillprobably be more VAR in Y-outcomeswithmore observations) -error term not normally distributed
Cont. , The Poisson Distribution The Poisson distribution is the following: is calculated as the meanof Yi is equalto the exponent inverseof Lambda K is the numberofoutcomes in Y K! is the factorialof K (ex. 4! = 4 × 3 × 2 × 1 = 24) is the expectedvalueof Yi (meanof DV) and alsoitsvariance: So: = E(Y) = Var(Y) Noticewhen =1 the CDF is highlyconcentratedbetween 0 and 10, as Lamdaincreases, whatdoes the CDF look like?
Importantassumptionsof a PoissonModel • The observations areassumedto be independent ofoneanother • Logarithm of rate changes in the DV are expressed linearly with equal increment increases in the IV’s • ”Equidispersion” – e.g., the meanof the DV = the Variance (althoughthisdoes not happenthatveryoften..). -Breakingthis is called ”overdispearsion” –whan VAR in our data is greaterthan the modelassumes. If violated, wecan’tusePoisson for hypothesistesting.. **If outcomecasesof Y are not independent, thenwewillmostlylikelysee ”overdispersion” – whichiflargeenough, willleadustouse a Negative Binomialmodel(more later…)
Overdispersion: Causes & Consequences Possible causes: 1. a poorlyfittedmodel • Omitedvariables • Outliers • Wrongfunctional form of 1+ ofourIV’s in the model • Un-accountedheteroskadescticity from structural breaks. 2. ) (varianceofour data greaterthan the mean) -very common withindividuallevel data! Consequences • UnderestimatedSE’s (thinkoppositeeffectofmulticollinearity) • Overstimated p-values & poorprediections
Simple EmpiricalExample – SES school data H: the number of awards earned by students at one high school is positively related to their score on their final exam in math, and can be predicted by the type of program in which the student was enrolled (e.g., vocational, general or academic). use http://www.ats.ucla.edu/stat/stata/dae/poisson_sim, clear Let’s take a look at the mean of our DV by the 3 academic programs - it looks like ‘academic’ program has a significantly higher level of awards..
OurPoissonmodel #Awards= academicprogram+math score+ constant + error • Just like Logit models, wecanuse the Waldχ²toevaluate the model on whole. The pseudo R² & log pseudolikelihoodcan be usedtocompare different models • Wecanuse robust s.e.’s to correct for slightviolation of Mean=VAR For ordered/catagoricalIV’s • Let’sfirst test for overall significanceof ’Programs’ on ’Awards’ (general=baseline)
Important extra model test in Poisson • Before going on to interpret the model’s Betas, weneedtoknowwhetherwe’ve ’chosen correctly’ withPoisson – does the Poissonestimation form fit our data?? E.g. is the Gamma distribution appropriate? • Otherwise, wemightconsiderologit • A ’goodnessof fit’ test (χ²) willletusknowifwehave a problem from – the Ho is the themodel’s form DOES fit our data, a rejectionof Ho meansthatPoissonmight be the WRONG estimation…. • Otherreasons for rejectionwould be omittedIV’s or incorrectfunctional forms • Canwereject??
Ok, nowtimeto interpret! • Like logit, the Betas arebasicallymeaningless, but - Poissoncangiveus Odds ratio (IRR), or ’incident rate ratio’ = exponentiated Betas (like logit) =) =exp(1.08) = 2.95 • Ex., holding math score constant, a student in an academic program (comparedwith general) has 2.95 times the incident rate • Also, weseethat for everyincrease in oneunit in a math score (e.g. ’1’), the percent change in the incident rate increases by 7%, holding program constant
No let’s try predictedprobabilities… • Withmath score at itsmean (52), wefind the predictednumberofawards for someonewith an academic program education = 0.62, whilesomeonewith a general is only 0.21 (or 2.95 timesgreater, as in our IRR score!!) • Someonewith a vocationalbackground is predictedtohave .31 awards, holding math score at itsmean.
Testing the impactofMath Scores • Wecan do marginal effectsofMath scores at itsmean ”margins, dydx(math) atmeans’ or at certainlevels – • Wefindthat the predictedcountofawards for a student with a 35 (nearlow) math score is 0.13, but at 75 (high) it is 2.17 – • Thisseemsverymeritocratic
Showing the effectof a categorical IV over other IV values.. • To seehowmath scores interactwithacademic programs, wecan show marginal effetsof the programs at variouslevelsofmath scores.. • Whenmath scores arelow, program plays no role for awards, BUT, as it increases, it seems students withacademic program favored!!
Alternatively • A similarwayto show this is withpredictedprobabilities over ’actual observations’ • This shows only the relevant range(butslightlymorecommandsthan just margins) Stata graphcommands • predict c • separate c, by(prog) • twoway scatter c1 c2 c3 math, connect(1 1 1) sort
Final diagonistic checks of the model • Run ’fitstat’ to get model statistics, especiallywhenrunningmoreIV’s and testing for omitted &/or irrelevant IV’s • Overdispersion: since it is VERY rare that the meanof the Y = VAR(Y), weshould do one last test – it is recomendedtorun a negative binomialmodel(with same DV & IV’s) and compare the ’alpha’ and otherestimatesof the IV’s – LR test ’nbreg’ • STATA doesthis for us – Ho: no difference –sincePoisson is moreefficient , wechoosethis!!
2. Negative BinomialModels (NBM) • Are also ”count” models for limitedDV’s, verysimilartoPoisson in bothassumptions and interpretation • Uses a version of Lambda as a linkfunctiontoestimatePr(Y) as well • Keydifference from Poisson is that the Var(Y) is assumedto be largerthan the Mean(Y) (e.g. ’overdispersion’). • Also, ifwecannotassumethat the outcomesof Y are independent from oneanother, than a NBM might be moreappropriate • A matterofefficiency: wepreferPoissonbecasueofgreaterefficiency, butthere is a clear solution whenweviolatekeymodelassumptions, so wetake NBM instead..
Cont. • Like Poisson, the NBM assumesconstantvariance in Y, which is estimated by maximum likelihood as: Var(Y) = λ • the ’dispearsion parameter’ (set at ’0’ in Poisson), so insteadofone parameter being estimated, thereare 2 (which is why less ’efficient’) • Useslogged Betas, so like logit (& Poisson) canuse Odds ratios • So, NBM’sarebasically a more general typeofPoissonmodel. Keydifferences • becauseof the quadradicfunction in the assumed Var(Y), theyare LESS EFFICIENT – Poissonwillproduce SMALLER s.e.’s for beta estimates, in med-largesamples, the estimatesareconsistant (not-biased) however. • Following, NBM’swillresult in largerexpectedprobabilities for smallercounts (e.g. # of Yi outcomes) comparedwithPoisson • NBM’swillhaveslightlylargerprobabilities for largercounts
Example: common Poisson vs. Negative binomial distributions ofour data
Ex. usehttp://www.ats.ucla.edu/stat/stata/dae/nb_data, clear • Let’ssaywewanttoexplain student absences from a given school term • Let’s look at our DV: • tabstat DV, stats (mean v n) 2. ‘histogram AbsAdj, discrete freq scheme(s1mono) normal’ Ok, it’sprettyclearthat the Poissonassumptions (mean=Var) does not hold as the Var ofthe DV is morethat 2*mean Thus wehave ’over-dispersed data’, and right for a NBM
The estimation • Let’s get reallycreative and estimate the type of school program & student performance (math scores) on the # of absences from a term.. • What do welearn from this? -Model fit: chi2 saysmodel is significant -’alpha’ is usedto get ourestimationof Var(Y) in ML (0.45) (this is like ’modelingoverdispearsion’ The LR test is Ho: ourmodel = Poisson, whichwecanreject.
Cont. -betterMathscores areassociatedwith a decrease in absences -comparedwith general program, academic & vocational student have less absences – wecan test the overall sig. of ’program’ with the ’test’ after regression. Odds ratios (again ’irr’) canhelp.. Weseethat for everyincrease in math scores, the rate ofabsencesdecreases by ca. 1% Comparedwith general, the ’incident rate’ of an absences is .57 and .38 for academic and vocation students respectively
Interpretation – individualpredictions • What is the numberofabsencespredicted by the model for a vocational student with a 79 math score? 1.854 + (79*-.0045) + (1*-.9557) = .541477 Like logit, this is the log, so wetake the exponent ofthis: = 1.718 absences To check this, wecan ’predict c’ to get predictedprobabilities, and go in an look at any observation..
Or just check for marginal effects - MEM • At meanlevelsofmathscores, predictednumberofabsences by program: • General = 5.07 • Academic = 2.91 • Vocational = 1.95 ** Vocational/General = 1.952/5.076 = 0.384 =our Incident rate!
Marginal EffectsofMath scores • Math scores range from 1-99, so wetakepredictednumberofabsences at scores at intervals of 20 points • Predicted # ofabsences for a student withlowest score = 3.4, whileonly 2.16 for highest • Buthow do our 2 variablesinteract??