
Dummy Variables


maik

Presentation Transcript


  1. Dummy Variables

  2. Outline • Objective • Why dummy variables are needed to use nominal variables as independent variables in regressions. • How to use and interpret dummy variables. • Rules of use. • Recommended best practices. • Interpretation example.

  3. Objective • Learn how to use nominal variables as independent variables in regression models. • These include variables like: • Continent/region (Africa, Western Europe, etc). • U.S. Party Vote (Democrats, Republicans, Other). • Marital status (Married, Single, Widowed, etc). • Religion (Catholic, Protestant, Muslim, etc).

  4. Independent Variables in Regressions • When you run an ordinary least squares (OLS) regression analysis, each B coefficient can be interpreted as the predicted change in Y (the dependent variable) that results from increasing the independent variable (X) by one unit, holding the other independent variables constant. • For example: To explain differences in countries’ life expectancy, a regression was run using literacy rates as an independent variable. • Both variables are interval.

  5. Interpreting Coefficients • The B (unstandardized) coefficient for literacy rates was 0.28. • We interpret this coefficient as: • When literacy increases by one point, the model predicts that life expectancy will increase by 0.28 points when controlling for all other variables.
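The interpretation above can be written out as a short derivation. This is a sketch in notation introduced here, not taken from the slides: a is the intercept, X is the literacy rate, and the other independent variables are held fixed so that their terms cancel in the difference.

```latex
\hat{Y}(X) = a + B X + (\text{terms for the other variables})

\hat{Y}(X + 1) - \hat{Y}(X) = B (X + 1) - B X = B = 0.28
```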

  6. What if? • What if another reader suggested that the relationship between literacy and life expectancy was different in Africa than everywhere else in the world? • Fortunately, there is a variable in the dataset for continent/region: North America, South America, Western Europe, Eastern Europe, Africa, the Middle East, Central & South Asia, East Asia and Oceania.

  7. A problem with nominal variables • The continent/region variable is nominal. • This poses a problem when used in regression analyses, because without an order to the values, we cannot interpret the coefficient. • It would be silly to say that “for every one point increase in region…” or “for every one point increase from North America…”

  8. Solution for nominal variables • Transform the nominal variable into a set of dichotomous variables, called “dummies.” • Dichotomous variables have only two value categories or options, like “yes” and “no”. • So, recode the region variable so that all African countries are coded as 1 and all other countries are coded as 0. • With only two options, the coefficient can be interpreted as the difference between one value category and the other. • The coefficient for a dichotomous variable for African countries would be interpreted as the difference between African and non-African countries.
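The recoding step can be sketched in a few lines of Python. The sample list and the helper name `make_dummy` are hypothetical, introduced only for illustration; the region labels come from the slides.

```python
# Hypothetical sample: the region of each country in a small dataset.
regions = ["Africa", "Western Europe", "Africa", "East Asia", "Oceania"]

def make_dummy(values, category):
    """Return 1 where the value equals `category`, otherwise 0."""
    return [1 if v == category else 0 for v in values]

# Africa = 1, all other regions = 0.
africa = make_dummy(regions, "Africa")
print(africa)  # [1, 0, 1, 0, 0]
```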

  9. Dummy interpretation • IV: Africa = 1, all others = 0. • DV: life expectancy. • Unstandardized B coefficient = ##. • The model predicts that, compared to all other countries, countries in Africa have ## lower/higher life expectancy when controlling for all other variables.
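To see why a dummy's coefficient reads as a difference between groups, consider the simplest case: a regression with the Africa dummy as the only independent variable. In that case the OLS coefficient equals the difference between the two group means. The sketch below uses made-up numbers (not real life expectancy data) and a hypothetical helper, `ols_slope_intercept`; with additional control variables the coefficient is an adjusted difference and will generally not match the raw difference in means.

```python
from statistics import mean

# Made-up illustrative numbers, NOT real data.
africa = [1, 1, 1, 0, 0, 0, 0]                          # Africa = 1, all others = 0
life_exp = [55.0, 60.0, 62.0, 78.0, 74.0, 80.0, 72.0]   # dependent variable

def ols_slope_intercept(x, y):
    """One-regressor OLS: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx

b, a = ols_slope_intercept(africa, life_exp)

mean_africa = mean(y for d, y in zip(africa, life_exp) if d == 1)
mean_other = mean(y for d, y in zip(africa, life_exp) if d == 0)

# Up to floating-point rounding: b is the Africa-minus-others difference in means,
# and the intercept a is the mean of the group coded 0.
print(b, mean_africa - mean_other)
print(a, mean_other)
```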

  10. Rules for Dummies • All dummy variables must be dichotomous with only two options or categories. • Continents: • Africa=1, all other regions = 0 AND/OR a separate variable that is: • Western Europe =1, all other regions = 0. • Party voted for when there is a Green Party, Tea Party or other third party candidate: • Democrat=1, all other parties = 0 AND/OR a separate variable that is: • Republican = 1, all other parties= 0.

  11. More rules for dummies • You can use more than one dummy variable as independent variables in a regression equation. • Region/continents example: • Africa = 1, all other regions = 0. • Western Europe = 1, all other regions = 0. • East Asia = 1, all other regions = 0. • When you add new dummies, the number of observations covered by the omitted category (zero) decreases.

  12. Note on adding additional dummies • Each time you create a new dummy variable out of a nominal variable, that category is no longer included in the omitted category (zero). • For example, if you have only one dummy variable, Africa = 1, then all other regions = 0. • If you add a dummy for Western Europe = 1, then “all other regions” really means “all other regions except Africa and Western Europe.” • If you add a dummy for East Asia = 1 too, then for each of the three dummies, 0 = “all other regions except Africa, Western Europe and East Asia.”
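The shrinking omitted category can be checked directly. This is a minimal sketch with a hypothetical eight-country sample; `omitted_group` is an illustrative helper name, not something from the slides.

```python
# Hypothetical sample of eight countries, identified by region only.
regions = ["Africa", "Western Europe", "East Asia", "Oceania",
           "Africa", "North America", "East Asia", "South America"]

def omitted_group(values, dummy_categories):
    """Observations coded 0 on every dummy, i.e. the omitted (baseline) group."""
    return [v for v in values if v not in dummy_categories]

# Add one dummy at a time and watch the baseline group shrink.
for cats in (["Africa"],
             ["Africa", "Western Europe"],
             ["Africa", "Western Europe", "East Asia"]):
    rest = omitted_group(regions, cats)
    print(len(cats), "dummies ->", len(rest), "baseline observations:", sorted(set(rest)))
```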

  13. Maximum number of dummies • The number of dummy variables used must be NO MORE than one less than the total number of value categories in the original nominal variable. • For example, the original continent/region variable had NINE value categories: • North America, South America, Western Europe, Eastern Europe, Africa, the Middle East, Central & South Asia, East Asia and Oceania. • Therefore, one can use up to EIGHT different dummy variables. • There must always be at least one region as a baseline, remaining as zero.
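One way to see the reason for this limit: if every category gets its own dummy, each observation has exactly one dummy equal to 1, so the dummies always add up to 1. That sum duplicates the regression's constant term, and the coefficients can no longer be estimated separately (perfect multicollinearity). The sketch below illustrates this with a hypothetical five-country sample; `dummy_rows` and the choice of Western Europe as baseline are assumptions made for illustration.

```python
# The nine region categories listed on the slide.
REGIONS = ["North America", "South America", "Western Europe", "Eastern Europe",
           "Africa", "Middle East", "Central & South Asia", "East Asia", "Oceania"]

def dummy_rows(values, categories):
    """One row per observation, one 0/1 column per category in `categories`."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical sample.
sample = ["Africa", "Oceania", "East Asia", "Africa", "Western Europe"]

# All NINE dummies: every row sums to 1, duplicating the constant term.
all_nine = dummy_rows(sample, REGIONS)
row_sums_all = [sum(r) for r in all_nine]

# EIGHT dummies with one baseline left out: baseline rows are all zeros.
baseline = "Western Europe"
eight = dummy_rows(sample, [c for c in REGIONS if c != baseline])
row_sums_eight = [sum(r) for r in eight]

print(row_sums_all)    # [1, 1, 1, 1, 1]
print(row_sums_eight)  # [1, 1, 1, 1, 0]
```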

  14. Which dummy do you exclude? • If you include the maximum number of dummies, it does not matter to your overall model which category you exclude. • However, there are best practices to follow when choosing the excluded category. The excluded category serves as a baseline, so certain choices make results easier to understand and interpret. • It is best if the excluded category is: • The mode, or most common category. • Relatively homogeneous, meaning the observations in that category are similar to one another.
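The claim that the excluded category does not matter to the overall model can be checked numerically: switching the baseline changes the individual coefficients but leaves the predictions unchanged. The following is a sketch under stated assumptions: the data are made up, the model contains only the dummies and a constant, and the small least-squares routine (`solve`, `ols`) is written here for self-containment rather than taken from any library.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def design(groups, categories, baseline):
    """Constant column plus one dummy per non-baseline category."""
    kept = [c for c in categories if c != baseline]
    return [[1] + [1 if g == c else 0 for c in kept] for g in groups]

def predict(X, coefs):
    return [sum(x * c for x, c in zip(row, coefs)) for row in X]

# Made-up illustrative data, NOT real survey results.
votes = ["Dem", "Dem", "Dem", "Rep", "Rep", "Other"]
score = [60.0, 70.0, 65.0, 30.0, 40.0, 50.0]
cats = ["Dem", "Rep", "Other"]

X_dem = design(votes, cats, baseline="Dem")   # dummies: Rep, Other
X_rep = design(votes, cats, baseline="Rep")   # dummies: Dem, Other
coef_dem = ols(X_dem, score)
coef_rep = ols(X_rep, score)
pred_dem = predict(X_dem, coef_dem)
pred_rep = predict(X_rep, coef_rep)

print("baseline Dem:", coef_dem)   # coefficients differ between the two codings...
print("baseline Rep:", coef_rep)
print(pred_dem)                    # ...but the predictions agree up to rounding
print(pred_rep)
```

Only the reference point moves: with Democrats as the baseline the Republican coefficient is the Republican-minus-Democrat difference, and with Republicans as the baseline the Democrat coefficient is the same difference with the opposite sign.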

  15. Recommended: exclude the mode • Exclude the mode, that is, the most common or best-known category. • For example, if your original variable was U.S. vote choice with three categories (Democrats, Republicans, or “Other”), exclude the well-known Democrats or Republicans. It will be easier for you and your readers to interpret each dummy variable’s coefficient relative to a well-known group of voters.
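Finding the mode is a simple frequency count. A minimal sketch, assuming a hypothetical list of survey responses; it picks the most common category as the baseline and leaves the rest to be turned into dummies.

```python
from collections import Counter

# Hypothetical vote-choice responses.
votes = ["Democrat", "Republican", "Democrat", "Other", "Democrat", "Republican"]

counts = Counter(votes)
baseline, n = counts.most_common(1)[0]                   # most frequent category and its count
dummy_categories = [c for c in counts if c != baseline]  # categories that get a dummy

print(baseline, n)        # Democrat 3
print(dummy_categories)   # ['Republican', 'Other']
```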

  16. Recommended: homogeneous baseline • Since the excluded category provides a baseline, interpretations are easier when the excluded category is relatively homogeneous. • In the region example, it may make sense to exclude Western Europe, since almost all of the countries in that region share certain attributes, such as high levels of literacy relative to countries in other regions. • A homogeneous category of this kind may often appear “extreme” compared with the other categories.

  17. Dummy interpretation: all others? • IVs = Canadian party vote. Canada has five parties represented in Parliament: the Conservatives, the Liberals, the NDP, the Bloc Quebecois, and the Greens. Other parties also run. • Conservatives = 1, all others = 0 • NDP = 1, all others = 0 • Bloc Quebecois = 1, all others = 0 • Greens = 1, all others = 0 • Other small parties = 1, all others = 0 • What party or parties are included in “all others” at this point?

  18. Example when the maximum number of dummies is used • Canadian vote example from the previous slide. • Independent variables = vote (all others = last remaining party = Liberals): • Conservatives = 1, Liberals = 0 • NDP = 1, Liberals = 0 • Bloc Quebecois = 1, Liberals = 0 • Greens = 1, Liberals = 0 • Other small parties = 1, Liberals = 0 • Dependent variable = feeling towards Prime Minister Harper (Conservative).
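The coding scheme for this example can be laid out as a small table. This is a sketch with illustrative names (`PARTIES`, `DUMMIES`, `code_vote`); it shows that a Liberal voter is the only respondent coded 0 on all five dummies, which is what makes Liberals the baseline.

```python
# The six vote categories from the slides; Liberals are the excluded baseline.
PARTIES = ["Conservatives", "Liberals", "NDP", "Bloc Quebecois",
           "Greens", "Other small parties"]
BASELINE = "Liberals"
DUMMIES = [p for p in PARTIES if p != BASELINE]   # five dummy variables

def code_vote(party):
    """Return the five 0/1 dummy values for one respondent's vote."""
    return {d: int(party == d) for d in DUMMIES}

# Print one row of the coding table per party.
for party in PARTIES:
    print(f"{party:20s}", list(code_vote(party).values()))
```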

  19. Interpretation of example • Independent variable = party vote, dependent variable is feelings towards Prime Minister Harper, the unstandardized B coefficient is ##. • The model predicts that compared to Liberals, NDP voters’ opinions are ## lower [or higher] when controlling for all other variables. • The model predicts that compared to Liberals, Conservative voters’ opinions are ## higher [or lower] when controlling for all other variables.
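These comparisons follow from writing the model out. The notation is introduced here for illustration and is not from the slides: a is the intercept, each D is a 0/1 party dummy, and each B is that dummy's coefficient; any other control variables are omitted for brevity.

```latex
\hat{Y} = a + B_{\text{Con}} D_{\text{Con}} + B_{\text{NDP}} D_{\text{NDP}}
        + B_{\text{BQ}} D_{\text{BQ}} + B_{\text{Green}} D_{\text{Green}}
        + B_{\text{Other}} D_{\text{Other}}

\hat{Y}_{\text{Liberal}} = a, \qquad
\hat{Y}_{\text{NDP}} = a + B_{\text{NDP}}, \qquad
\hat{Y}_{\text{NDP}} - \hat{Y}_{\text{Liberal}} = B_{\text{NDP}}
```

Because a Liberal voter has every dummy equal to 0, the intercept is the prediction for Liberals, and each coefficient is the predicted gap between that party's voters and Liberal voters.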
