
Unit 8: Categorical predictors, I: Dichotomies


Presentation Transcript


  1. Unit 8: Categorical predictors, I: Dichotomies "There are two kinds of people in the world: Those who believe there are two kinds of people in the world and those who don't." –Robert Benchley, American Humorist (1888-1946)

  2. The S-030 roadmap: Where’s this unit in the big picture?
     Building a solid foundation
       Unit 1: Introduction to simple linear regression
       Unit 2: Correlation and causality
       Unit 3: Inference for the regression model
     Mastering the subtleties
       Unit 4: Regression assumptions: Evaluating their tenability
       Unit 5: Transformations to achieve linearity
     Adding additional predictors
       Unit 6: The basics of multiple regression
       Unit 7: Statistical control in depth: Correlation and collinearity
     Generalizing to other types of predictors and effects
       Unit 8: Categorical predictors I: Dichotomies
       Unit 9: Categorical predictors II: Polychotomies
       Unit 10: Interaction and quadratic effects
     Pulling it all together
       Unit 11: Regression modeling in practice

  3. In this unit, we’re going to learn about…
     • Categorical predictors and regression: An unusual marriage
     • Creating and naming conventions for a dummy (or indicator) variable
     • Regressing Y on a dummy variable: How this relates to the two-sample t-test
     • What happens if we change the reference category?
     • Including a dummy variable in a multiple regression model: How and why it operates
     • Another example of using the simple and partial correlation matrices to foreshadow results
     • Adjusted means: A simple way of presenting findings for categorical question predictors
     • Graphic displays of regression findings: How do you decide which effects to highlight?
     • Displaying and interpreting prototypical trajectories

  4. Categorical predictors and regression: An unusual marriage?
     Regression assumptions focus on Y and ε → no assumptions about the X’s
     Categorical predictors are predictors whose values denote categories
     • Nominal predictors (unordered values): Sex, Religion, Political Party
     • Ordinal predictors (ordered values): Education, Religiosity, Neighborhood Integration
     Another important distinction
     • Dichotomies (only 2 categories)
     • Polychotomies (>2 categories)
     Dummy (or indicator) variables: Variables whose values offer no meaningful quantitative information but simply distinguish between categories
     • By convention, the variable name corresponds to the category given the value 1
     • By convention, the category given the value 0 is called the reference category

  5. Do “primary” seat belt laws save lives?
     Source: Calkins, LN & Zlatoper, TJ (2001). The effects of mandatory seat belt laws on motor vehicle fatalities in the United States, Social Science Quarterly, 82(4), 716-732
     n = 50
     • 36 states did not have a primary seat belt law (72%)
     • 14 states had a mandatory primary seat belt law (28%)
     Variables
     • DPFat: Occupant fatalities (driver & passenger)
     • NOFat: Non-occupant fatalities (pedestrians & bicyclists)
     • SeatBeltLaw: = 0 if no law, = 1 if law
     • Miles Driven: Potentially important covariate

     ID   State   DPFat   NOFat   SeatBeltLaw   MilesDriven
     40   RI         44      13        0             7071
      2   AK         47      17        0             4387
     46   VT         61      19        0             6466
     35   ND         71       9        0             7123
     51   WY         71      15        0             7576
     30   NH         76      28        0            11202
      8   DE         84      25        0             8007
     ....
     47   VA        630     147        0            70320
     26   MO        775     143        0            62980
      1   AL        777     124        0            53458
     43   TN        789     172        0            60526
     14   IL        824     313        0            99319
     23   MI        846     259        0            91755
     36   OH        944     253        0           103675
     39   PA        975     271        0            98015
     10   FL       1478     835        0           134007
     12   HI         83      36        1             7947
      7   CT        199      96        1            28552
     32   NM        237      97        1            21937
     38   OR        306      99        1            32268
     16   IA        312      61        1            27984
     21   MD        345     146        1            46609
     19   LA        512     184        1            38840
     37   OK        541     107        1            41400
     15   IN        615     133        1            68620
     33   NY        822     546        1           120778
     34   NC        870     269        1            81893
     11   GA        973     257        1            93317
      5   CA       1817    1102        1           285612
     44   TX       2012     613        1           198700

     RQ: Do states with primary seat belt laws have lower traffic fatality rates?
     • Hypothesis 1: Seat belt laws save lives because seat belts save lives
     • Hypothesis 2: The Offset hypothesis: Seat belts encourage riskier driving behavior that may offset any benefit associated with increased seat belt use

  6. Do seat belt laws save lives?
     A 2 sample t-test tests the null hypothesis that 2 population means are the same.
     [t-test output panels: Occupant Fatalities; Non-occupant Fatalities]
     • Should we believe these t-tests?
     • Is the homoscedasticity assumption tenable?
     • Should we be concerned about the skewness of these outcomes?

  7. Can transformation help make the outcome distributions more symmetric?
     [Figure: stem-and-leaf displays with boxplots in four panels: Occupant Fatalities, Non-occupant Fatalities, Loge(Occupant Fatalities), Loge(Non-occupant Fatalities)]

  8. Distribution of loge(n fatalities) by presence of seat belt law
     [Figure panels: Loge(Occupant Fatalities); Loge(Non-occupant Fatalities)]
     Loge(Occupant fatalities)
       LAW             6.21
       NO LAW          5.64
       Diff in means   0.57
       t (for diff)    1.89
       p (for diff)    0.0643

  9. Simple regression with one dichotomous predictor: How & why it works
     Dependent Variable: LDPFat
                           Parameter   Standard
     Variable       DF      Estimate      Error   t Value   Pr > |t|
     Intercept       1       5.64258    0.15809     35.69     <.0001
     SeatBeltLaw     1       0.56572    0.29877      1.89     0.0643

     Loge(Occupant fatalities)
       LAW             6.21
       NO LAW          5.64
       Diff in means   0.57
       t (for diff)    1.89
       p (for diff)    0.0643

     • The y-intercept is the estimated value of Y when the dichotomous predictor = 0 (here, the mean loge(occupant fatalities) for non-seat belt law states): 5.64, states without laws
     • The slope is the estimated difference in Y between categories of the dichotomous predictor (here, the mean difference in Y between states with and without seat belt laws): + 0.57, giving 6.21 for states with laws
     What would have happened if we’d changed the reference category (when X=0)?
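The claim on this slide (intercept = mean of the reference category, slope = difference in means) can be checked numerically. The following is a minimal Python sketch, not part of the course's SAS code: the six outcome values are made up for illustration, and `fit_simple_ols` is a hypothetical helper that applies the usual closed-form least-squares formulas.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: dummy = 0 marks the reference category, 1 the other group
dummy = [0, 0, 0, 0, 1, 1]
y = [5.0, 5.5, 6.0, 5.9, 6.1, 6.5]

intercept, slope = fit_simple_ols(dummy, y)

# Group means computed directly, for comparison with the regression estimates
mean0 = mean(v for d, v in zip(dummy, y) if d == 0)
mean1 = mean(v for d, v in zip(dummy, y) if d == 1)

print("intercept:", intercept, "mean of group 0:", mean0)
print("slope:", slope, "difference in means:", mean1 - mean0)
```

With any data coded this way, the two printed pairs should agree up to floating-point error, which is the same relationship the slide shows between the regression output and the t-test summary.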

  10. What happens if we change the “reference category”?
     SeatBeltLaw: 0 = no law, 1 = law
     Dependent Variable: LDPFat
                           Parameter   Standard
     Variable       DF      Estimate      Error   t Value   Pr > |t|
     Intercept       1       5.64258    0.15809     35.69     <.0001
     SeatBeltLaw     1       0.56572    0.29877      1.89     0.0643

     NoSeatBeltLaw: 0 = law, 1 = no law
     Dependent Variable: LDPFat
                           Parameter   Standard
     Variable       DF      Estimate      Error   t Value   Pr > |t|
     Intercept       1       6.20830    0.25352     24.49     <.0001
     NoSeatBeltLaw   1      -0.56572    0.29877     -1.89     0.0643

     Loge(Occupant fatalities)
       LAW             6.21
       NO LAW          5.64
       Diff in means   0.57
       t (for diff)    1.89
       p (for diff)    0.0643

     • The intercept is always the estimated value of Y in the reference category
     • The sign of the slope is reversed
     • The se of the slope remains the same
     • The se and hypothesis test for the intercept change to focus on the reference category
     • Results of hypothesis tests are identical, regardless of how a dichotomous predictor is coded
     What happens if we statistically control for covariates? (Weather, Urbanicity, Vehicle miles)
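As a worked check using only the estimates printed in the two outputs above, substituting NoSeatBeltLaw = 1 - SeatBeltLaw into the first fitted equation reproduces the second one:

```latex
\begin{align*}
\hat{Y} &= 5.64258 + 0.56572\,\text{SeatBeltLaw} \\
        &= 5.64258 + 0.56572\,(1 - \text{NoSeatBeltLaw}) \\
        &= (5.64258 + 0.56572) - 0.56572\,\text{NoSeatBeltLaw} \\
        &= 6.20830 - 0.56572\,\text{NoSeatBeltLaw}
\end{align*}
```

The new intercept, 6.20830, is the mean for states with laws (the new reference category), and the slope keeps its magnitude and standard error while changing sign.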

  11. Vehicle Miles: A theoretically important covariate
     [Figure: scatterplots of each outcome against vehicle miles]
     • Loge(Occupant Fatalities): r = 0.96***
     • Loge(Non-occupant Fatalities): r = 0.96***

  12. What about the effect of SeatBeltLaws after controlling for LMiles?
     [Figure: fitted lines for states with and without seat belt laws, one panel per outcome]
     • Loge(Occupant Fatalities): Controlling for LMiles, states with laws have fewer occupant fatalities than states without laws
     • Loge(Non-occupant Fatalities): Controlling for LMiles, states with laws have more non-occupant fatalities than states without laws

  13. Including a dichotomous predictor in a MR model: How & why it works
     [Figure: Loge(Non-occupant Fatalities) with two parallel fitted lines, SeatBelt = 1 and SeatBelt = 0]
     • β̂1 = effect of Seat Belt Laws, controlling for vehicle miles
     • Realize that these lines are parallel because we’ve assumed that they’re parallel. This is the main effects assumption that we’ll learn how to examine (and if necessary relax) in Unit 10.
     • In many fields, this model is known as an Analysis of Covariance (ANCOVA) model
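The parallel-lines point can be written out explicitly. The sketch below assumes the usual notation for a main-effects model with one dummy and one continuous covariate; the symbols for the intercept, the covariate slope, and the error term are not shown in the transcript and are introduced here for illustration.

```latex
\begin{align*}
Y &= \beta_0 + \beta_1\,\text{SeatBelt} + \beta_2\,\text{LMiles} + \varepsilon \\
\text{SeatBelt} = 0:\quad \hat{Y} &= \hat{\beta}_0 + \hat{\beta}_2\,\text{LMiles} \\
\text{SeatBelt} = 1:\quad \hat{Y} &= (\hat{\beta}_0 + \hat{\beta}_1) + \hat{\beta}_2\,\text{LMiles}
\end{align*}
```

Both lines share the slope on LMiles, so they differ only by the vertical shift given by the dummy's coefficient. That constant shift is what the main effects assumption builds in.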

  14. The effect of SeatBeltLaws on Occupant Fatalities (controlling for LMiles)
     The REG Procedure
     Dependent Variable: LDPFat
     Number of Observations Read   50
     Number of Observations Used   50

     Analysis of Variance
                                Sum of       Mean
     Source             DF     Squares     Square   F Value   Pr > F
     Model               2    43.08715   21.54358    304.21   <.0001
     Error              47     3.32849    0.07082
     Corrected Total    49    46.41564

     Root MSE          0.26612   R-Square   0.9283
     Dependent Mean    5.80098   Adj R-Sq   0.9252
     Coeff Var         4.58747

     Parameter Estimates
                           Parameter   Standard
     Variable       DF      Estimate      Error   t Value   Pr > |t|
     Intercept       1      -4.12267    0.41399     -9.96     <.0001
     SeatBeltLaw     1      -0.05016    0.08775     -0.57     0.5703
     LMiles          1       0.95514    0.04026     23.72     <.0001

     • Model is stat sig (p<.0001)
     • R2 statistic is very high
     • Estimated effect of vehicle miles is large: Controlling for seatbelt laws, states whose total vehicle miles differ by 1% have occupant fatalities that differ by an average of approximately 1% as well (p<.0001)
     • The effect of seatbelt laws disappears: Controlling for vehicle miles, there is no relationship between seatbelt laws and the number of occupant fatalities
     What has happened???
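The "1% goes with approximately 1%" reading follows from both the outcome and vehicle miles being on the natural-log scale. As a short worked check using the LMiles estimate above (the subscripts 1 and 2 are introduced here to label two hypothetical states with the same SeatBeltLaw value):

```latex
\begin{align*}
\log_e(\widehat{\text{Fat}}_2) - \log_e(\widehat{\text{Fat}}_1)
  &= 0.95514\,\bigl[\log_e(\text{Miles}_2) - \log_e(\text{Miles}_1)\bigr] \\
\frac{\widehat{\text{Fat}}_2}{\widehat{\text{Fat}}_1}
  &= \left(\frac{\text{Miles}_2}{\text{Miles}_1}\right)^{0.95514},
  \qquad 1.01^{0.95514} \approx 1.0095
\end{align*}
```

So a 1% difference in vehicle miles corresponds to a predicted difference of roughly 0.95% in occupant fatalities, which the slide rounds to "approximately 1%".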

  15. Developing your instinct for the effects of a dichotomous predictor: Comparing uncontrolled and controlled effects on occupant fatalities
     Estimated effect of having a primary seat belt law on number of occupant fatalities
       Uncontrolled                     +0.57   t = 1.89    p = .0643
       Controlling for vehicle miles    -0.05   t = -0.57   p = .5703
     [Figure: Loge(Occupant Fatalities) against vehicle miles, with parallel fitted lines for states with and without laws. Labeled values: 7.34 (states without laws) and 7.29 (states with laws); 6.21 (states with laws) and 5.64 (states without laws); 4.00 and 3.95; 10.22 and 10.87 on the horizontal axis]
     • States with Seat Belt Laws have many more vehicle miles. Might this explain why they have many more occupant fatalities???
     • Holding vehicle miles constant, states with Seat Belt Laws have no more occupant fatalities than those without laws
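One way to connect the two estimates is the following worked check. It assumes that the figure labels 10.22 and 10.87 are the mean LMiles for states without and with laws respectively; that reading is an inference from the figure residue, not something the transcript states. The coefficients are those of the two-predictor model for LDPFat.

```latex
\begin{align*}
\bar{Y}_{\text{law}} - \bar{Y}_{\text{no law}}
  &= \hat{\beta}_{\text{SeatBeltLaw}}
   + \hat{\beta}_{\text{LMiles}}\,
     \bigl(\overline{\text{LMiles}}_{\text{law}} - \overline{\text{LMiles}}_{\text{no law}}\bigr) \\
  &\approx -0.05016 + 0.95514\,(10.87 - 10.22) \\
  &= -0.05016 + 0.62084 = 0.57068 \approx 0.57
\end{align*}
```

Under that assumption, essentially all of the uncontrolled +0.57 gap comes from the difference in vehicle miles between the two groups of states; the small discrepancy from 0.56572 reflects rounding in the group means.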

  16. What’s the effect of SeatBelt laws on non-occupant fatalities?
     The REG Procedure
     Dependent Variable: LNOFat
     Number of Observations Read   50
     Number of Observations Used   50

     Analysis of Variance
                                Sum of       Mean
     Source             DF     Squares     Square   F Value   Pr > F
     Model               2    55.30826   27.65413    262.68   <.0001
     Error              47     4.94808    0.10528
     Corrected Total    49    60.25634

     Root MSE          0.32447   R-Square   0.9179
     Dependent Mean    4.52624   Adj R-Sq   0.9144
     Coeff Var         7.16856

     Parameter Estimates
                           Parameter   Standard
     Variable       DF      Estimate      Error   t Value   Pr > |t|
     Intercept       1      -6.41013    0.50476    -12.70     <.0001
     SeatBeltLaw     1       0.18784    0.10698      1.76     0.0857
     LMiles          1       1.04607    0.04909     21.31     <.0001

     • Model is stat sig (DUH!)
     • R2 is very high (DUH!)
     • Estimated effect of vehicle miles is large: Controlling for seatbelt laws, states whose total vehicle miles differ by 1% have non-occupant fatalities that differ by an average of approximately 1% as well (p<.0001)
     • The effect of seatbelt laws diminishes: Controlling for vehicle miles, the effect of seatbelt laws on the number of non-occupant fatalities is no longer statistically significant at the 0.05 level (although this difference between stat sig & not stat sig is undoubtedly not stat sig!)
     At this point in our analysis, our question predictor seems to have NO effect (controlling for LMILES)... Oy!

  17. Making sense of the correlation matrix: Considering old & new variables
     Pearson Correlation Coefficients, N = 50
     Prob > |r| under H0: Rho=0

                    LDPFat    LNOFat   SeatBeltLaw    LMiles     LTemp   PctUrban   lpopden
     LDPFat        1.00000   0.92716       0.26363   0.96322   0.50966    0.33674   0.40214
                              <.0001        0.0643    <.0001    0.0002     0.0168    0.0038
     LNOFat        0.92716   1.00000       0.35271   0.95525   0.50989    0.56415   0.50772
                    <.0001                  0.0120    <.0001    0.0002     <.0001    0.0002
     SeatBeltLaw   0.26363   0.35271       1.00000   0.29585   0.33251    0.21303   0.20434
                    0.0643    0.0120                  0.0370    0.0183     0.1374    0.1546
     LMiles        0.96322   0.95525       0.29585   1.00000   0.41615    0.49945   0.53365
                    <.0001    <.0001        0.0370              0.0026     0.0002    <.0001
     LTemp         0.50966   0.50989       0.33251   0.41615   1.00000    0.28925   0.33662
                    0.0002    0.0002        0.0183    0.0026               0.0416    0.0168
     PctUrban      0.33674   0.56415       0.21303   0.49945   0.28925    1.00000   0.67812
                    0.0168    <.0001        0.1374    0.0002    0.0416               <.0001
     lpopden       0.40214   0.50772       0.20434   0.53365   0.33662    0.67812   1.00000
                    0.0038    0.0002        0.1546    <.0001    0.0168     <.0001

     • The two outcomes are highly correlated (but this does NOT mean we should only analyze one of them!)
     • SeatBelt states have more fatalities (Knew this from t-test results)
     • Vehicle miles is a strong predictor: Partial LMILES out and look again
     • SeatBelt states have more vehicle miles: And that seems to have been enough to make the SeatBelt effects disappear!
     • Warmer states have more fatalities: Might want to include?
     • Warmer states have SeatBelt laws and more vehicle miles: Might controlling for LTemp change things?
     • More urban states have more fatalities (esp non-occupants): Might want to include?
     • SeatBelt states are more urban, but this difference is not stat sig…
     • More urban states are warmer & have more vehicle miles: Getting a sense that we need to make sure that we can really include all these additional predictors…
     • The urbanicity variables are highly correlated: This does NOT mean they are definitely collinear, but we need to determine if both are needed in a model
     Correlation matrix guidance:
     • Keep your eyes on the question predictor
     • See a control predictor with a big effect? → Partial it out and look again...

  18. What changes & what remains the same when we partial out LMiles?
     Pearson Partial Correlation Coefficients, N = 50
     Prob > |r| under H0: Partial Rho=0

                     LDPFat     LNOFat   SeatBeltLaw     LTemp   PctUrban    lpopden
     LDPFat         1.00000    0.08868      -0.08310   0.44532   -0.61999   -0.49233
                                0.5446        0.5703    0.0013     <.0001     0.0003
     LNOFat         0.08868    1.00000       0.24809   0.41772    0.33969   -0.00821
                     0.5446                   0.0857    0.0028     0.0169     0.9554
     SeatBeltLaw   -0.08310    0.24809       1.00000   0.24107    0.07887    0.05751
                     0.5703     0.0857                  0.0952     0.5901     0.6947
     LTemp          0.44532    0.41772       0.24107   1.00000    0.10334    0.14895
                     0.0013     0.0028        0.0952               0.4798     0.3071
     PctUrban      -0.61999    0.33969       0.07887   0.10334    1.00000    0.56177
                     <.0001     0.0169        0.5901    0.4798                <.0001
     lpopden       -0.49233   -0.00821       0.05751   0.14895    0.56177    1.00000
                     0.0003     0.9554        0.6947    0.3071     <.0001

     • The two outcomes are no longer highly correlated (Hmmm...)
     • SeatBelt states no longer have more fatalities (We knew this from regression results)
     • Warmer states still have more fatalities: Really need to include LTEMP, don’t we!
     • Warmer states still are more likely to have SeatBelt Laws (but the partial is now n.s.)
     • More urban states now have fewer occupant fatalities! (Hmmm...)
     • States with a greater %age of urban roads still have more non-occupant fatalities, but population density, by itself, no longer seems to matter (Hmmm...): We probably want to include PctUrban, but we’re now unsure about LPopDen, so we need to see what happens
     • Urbanicity is now uncorrelated with either SeatBelt laws or temperature: So we can probably add at least one urbanicity variable (but need to check about both)
     • The urbanicity variables are still highly correlated: But this still does NOT mean they are necessarily collinear!
     • In general, the inter-correlations between predictors are smaller after we control for LMILES...
     • But also, some of these correlations have changed sign!
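The partial correlations above can be recovered from the simple correlations with the standard first-order partial correlation formula. The following is a minimal Python sketch, not the course's SAS code; `partial_corr` is a hypothetical helper name, and the inputs are copied from the simple correlation matrix on the previous slide.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Simple correlations from the N = 50 matrix (z = LMiles in both cases)
r_law_miles = 0.29585

# LDPFat with SeatBeltLaw, partialling out LMiles
occ_law = partial_corr(r_xy=0.26363, r_xz=0.96322, r_yz=r_law_miles)

# LNOFat with SeatBeltLaw, partialling out LMiles
nonocc_law = partial_corr(r_xy=0.35271, r_xz=0.95525, r_yz=r_law_miles)

print("LDPFat, SeatBeltLaw | LMiles:", round(occ_law, 4))
print("LNOFat, SeatBeltLaw | LMiles:", round(nonocc_law, 4))
```

The two results should land very close to the -0.08310 and 0.24809 reported in the partial correlation matrix, with any small differences due to the five-decimal rounding of the inputs. The sign change for occupant fatalities happens because the product of the two correlations with LMiles exceeds the simple correlation.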

  19. What happens when we control for these additional covariates? Non-occupant fatalities
     The REG Procedure
     Dependent Variable: LNOFat
     Number of Observations Read   50
     Number of Observations Used   50

     Analysis of Variance
                                Sum of       Mean
     Source             DF     Squares     Square   F Value   Pr > F
     Model               5    56.89264   11.37853    148.84   <.0001
     Error              44     3.36370    0.07645
     Corrected Total    49    60.25634

     Root MSE          0.27649   R-Square   0.9442
     Dependent Mean    4.52624   Adj R-Sq   0.9378
     Coeff Var         6.10863

     Parameter Estimates
                           Parameter   Standard
     Variable       DF      Estimate      Error   t Value   Pr > |t|
     Intercept       1      -9.48379    1.12472     -8.43     <.0001
     SeatBeltLaw     1       0.10353    0.09409      1.10     0.2772
     LMiles          1       0.97871    0.05106     19.17     <.0001
     LTemp           1       0.91980    0.29841      3.08     0.0035
     PctUrban        1       1.04986    0.31712      3.31     0.0019
     lpopden         1      -0.09652    0.04100     -2.35     0.0231

     • R2 statistic is even higher
     • The effect of the seatbelt law has disappeared! Any observed differential in non-occupant fatalities between states with and without seatbelt laws is now well within the limits of sampling variation, once you control for vehicle miles, temperature and urbanicity. This suggests little support for the offset hypothesis
     • Effect of Vehicle Miles remains stable
     • Positive effect of temperature: Controlling for all other variables in the model, states whose average temperatures are 1% higher have non-occupant fatality rates that are .92% higher
     • Urbanicity variables tell a complex story. On the one hand, the higher the percentage of urban roads in a state, the higher the number of non-occupant fatalities; on the other hand, the higher the population density, the lower the number of non-occupant fatalities

  20. What happens when we control for these additional covariates? Occupant fatalities
     The REG Procedure
     Dependent Variable: LDPFat
     Number of Observations Read   50
     Number of Observations Used   50

     Analysis of Variance
                                Sum of       Mean
     Source             DF     Squares     Square   F Value   Pr > F
     Model               5    45.49932    9.09986    436.96   <.0001
     Error              44     0.91632    0.02083
     Corrected Total    49    46.41564

     Root MSE          0.14431   R-Square   0.9803
     Dependent Mean    5.80098   Adj R-Sq   0.9780
     Coeff Var         2.48769

     Parameter Estimates
                           Parameter   Standard
     Variable       DF      Estimate      Error   t Value   Pr > |t|
     Intercept       1      -8.40250    0.58703    -14.31     <.0001
     SeatBeltLaw     1      -0.10059    0.04911     -2.05     0.0465
     LMiles          1       1.01653    0.02665     38.14     <.0001
     LTemp           1       1.10171    0.15575      7.07     <.0001
     PctUrban        1      -0.87943    0.16551     -5.31     <.0001
     lpopden         1      -0.06346    0.02140     -2.97     0.0049

     • R2 statistic is almost perfect!
     • The effect of the seatbelt law is now reversed! States with primary seat belt laws have lower numbers of occupant fatalities than states without these laws, once you control for vehicle miles, temperature and urbanicity
     • Effect of Vehicle Miles remains stable
     • Positive effect of temperature: Controlling for all other variables in the model, states whose average temperatures are 1% higher have occupant fatality rates that are 1.1% higher
     • City driving is safer for car occupants. The higher the percentage of urban roads and the denser the population, the lower the number of occupant fatalities.
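Because the outcome is on the natural-log scale, the SeatBeltLaw coefficient can be translated into an approximate percentage difference in the number of occupant fatalities. This back-transformation is a standard step that the slide itself does not show; it uses only the estimate printed above.

```latex
\begin{align*}
\log_e(\widehat{\text{Fat}}_{\text{law}}) - \log_e(\widehat{\text{Fat}}_{\text{no law}}) &= -0.10059 \\
\frac{\widehat{\text{Fat}}_{\text{law}}}{\widehat{\text{Fat}}_{\text{no law}}} &= e^{-0.10059} \approx 0.904
\end{align*}
```

Holding vehicle miles, temperature and the urbanicity variables constant, states with primary seat belt laws are therefore predicted to have roughly 9.6% fewer occupant fatalities. This is a point estimate only; the p-value of 0.0465 indicates it is not estimated with great precision.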

  21. How would we present these regression results? We (the people) conclude: Seatbelt laws save the lives of people in cars and don’t hurt people on the streets

  22. Adjusted means: A simple way of presenting findings when your question predictor is CATEGORICAL (here, dichotomous)
     To calculate adjusted means: Set all predictors, except for the categorical question predictor, to their sample means and then compute the predicted value of the outcome at each value of the categorical predictor
     Sample means used for the other predictors: (0.52) (3.98) (4.29) (10.40)
     • For occupants: Diff = -0.10, t = -2.05, p = 0.0465
     • For non-occupants: Diff = 0.10, t = 1.10, p = 0.2772
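The recipe above can be sketched in a few lines of Python (an illustration, not the course's SAS code). The coefficients are those of the five-predictor occupant-fatalities model. The assignment of the four means to variables (PctUrban 0.52, LTemp 3.98, lpopden 4.29, LMiles 10.40) is an assumption based on how these values are labeled on later slides, and `adjusted_mean` is a hypothetical helper name.

```python
# Coefficients from the five-predictor model for LDPFat (occupant fatalities)
coef = {
    "Intercept": -8.40250,
    "SeatBeltLaw": -0.10059,
    "LMiles": 1.01653,
    "LTemp": 1.10171,
    "PctUrban": -0.87943,
    "lpopden": -0.06346,
}

# Rounded sample means of the covariates (variable assignment is an assumption)
covariate_means = {"LMiles": 10.40, "LTemp": 3.98, "PctUrban": 0.52, "lpopden": 4.29}

def adjusted_mean(law):
    """Predicted outcome with covariates at their means and SeatBeltLaw = law."""
    prediction = coef["Intercept"] + coef["SeatBeltLaw"] * law
    prediction += sum(coef[name] * value for name, value in covariate_means.items())
    return prediction

no_law = adjusted_mean(0)
law = adjusted_mean(1)

print("adjusted mean, no law:", round(no_law, 2))
print("adjusted mean, law:   ", round(law, 2))
print("difference:           ", round(law - no_law, 2))
```

Whatever covariate values are plugged in, the difference between the two adjusted means equals the SeatBeltLaw coefficient, because the covariate terms cancel. The levels themselves depend on the exact (unrounded) means, so with the two-decimal means used here they may not match published adjusted means to the second decimal.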

  23. Presenting unadjusted and adjusted means
     Adjusted mean = controlling for temp, miles driven & urbanicity
     The difference between the means of the dichotomous predictor’s two categories is equal to the dichotomous predictor’s slope coefficient in a particular model. For example, for occupant fatalities:
     • Unadjusted means → 6.21 - 5.64 = 0.57 in the uncontrolled model
     • Adjusted means → 5.77 - 5.87 = -0.10 in the controlled model

  24. Another example presenting adjusted differences between groups British Medical Journal, 2005, 331, 1306-1311

  25. Towards a graphic display of the regression findings: Which predictors would we want to highlight in a graph?
     • Question Predictor: Definitely want to document in a graph ...but as dichotomy, we probably don’t want it on the x-axis!
     • Obvious Covariate: Don’t need to document in a graph ...so hold it constant at its mean?
     • Interesting Covariate: Might want to document in a graph ...if so, how?
     • Interesting Covariate: Probably want to document in a graph ...Should we emphasize the difference in signs for the two outcomes?
     • Small, similar effect: Don’t need to document in a graph ...so hold it constant at its mean?

  26. Sketching out the expected graph documenting the effects of Seatbelt laws, PctUrban and Temperature
     • Q1: With 2 outcomes, do I want 1 graph or 2? Depends on where the lines fall
     • Q2: Which of the predictors should go on the X axis? Usually the question predictor, but because SeatBelt is a dichotomy, I’m choosing PctUrban to highlight the sign difference
     • Q3: How should we display the effects of the other continuous predictor, LTemp? Need to choose prototypical values: ‘warm’ and ‘cold’ states
     • Q4: What will the lines look like for states with and without seatbelt laws? Attend now to ranking; worry about scale later

  27. Displaying prototypical trajectories, step one: Setting control variables at their means (4.29) (10.40)
     $\widehat{\log(\text{NonOcc})} = -9.48 + 9.81 + 0.10\,\text{SeatBelt} + 0.92\,\text{LTemp} + 1.05\,\text{PctUrban}$

  28. Displaying prototypical trajectories, step two: Computing predicted values for selected levels of the remaining predictors
     • SeatBelt: Only 2 values, 0 & 1
     • LTemp: Selecting prototypical temperature values. Mean(LTemp) = 3.98, sd = 0.15
       Cold → 1 sd below mean = 3.85 (47º F) ~ Illinois/Michigan
       Warm → 1 sd above mean = 4.15 (63º F) ~ Mississippi/Tenn
     • PctUrban: Mean = 0.52, sd = 0.15. Displayed on X axis: calculate at .33 and .66
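The step described here amounts to evaluating the fitted equation from step one on a small grid of prototypical values. The following Python sketch illustrates this (it is not the course's SAS code). It uses the rounded coefficients from the step-one equation, in which 9.81 is the constant the slide shows for the control variables held at their means, and `predict_log_nonocc` is a hypothetical helper name.

```python
import math
from itertools import product

def predict_log_nonocc(law, ltemp, pct_urban):
    """Fitted Loge(non-occupant fatalities), with the controls at their means."""
    return -9.48 + 9.81 + 0.10 * law + 0.92 * ltemp + 1.05 * pct_urban

# Prototypical values taken from the slide
temps = {"cold": 3.85, "warm": 4.15}   # LTemp, natural-log scale
laws = {"no law": 0, "law": 1}         # SeatBelt dummy
urban_values = (0.33, 0.66)            # PctUrban points used for the X axis

# One predicted value per combination of temperature, law status and PctUrban
for (temp_label, ltemp), (law_label, law), urban in product(
    temps.items(), laws.items(), urban_values
):
    fitted = predict_log_nonocc(law, ltemp, urban)
    print(f"{temp_label:4s} {law_label:6s} PctUrban={urban:.2f} -> {fitted:.2f}")

# Back-transform the prototypical LTemp values to degrees Fahrenheit
fahrenheit = {label: round(math.exp(value)) for label, value in temps.items()}
print(fahrenheit)
```

Each of the four temperature-by-law combinations yields two points, which is enough to draw one straight line per combination against PctUrban. The back-transformed temperatures round to the 47º F and 63º F shown on the slide.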

  29. The effects of seat belt laws, urbanicity & temperature on traffic fatalities, controlling for vehicle miles and population density
     [Figure: Loge(Fatalities) (4.0 to 6.5) against Pct Urban Roads (0.25 to 0.75). Fitted lines for Warm and Cold states, each with Seat Belt Law and No Law, shown separately for Occupant Fatalities and Non-occupant Fatalities]
     • Note: These differences in occupant fatalities by Seat Belt Law are statistically significant
     • Note: These differences in non-occupant fatalities by Seat Belt Law are not statistically significant
     What would this graph look like if we were to also “just control” for the effect of LTemp?

  30. In some situations, you might prefer a simpler display
     The effects of seat belt laws and urbanicity on traffic fatalities, controlling for vehicle miles, population density and temperature (with Ltemp set at its mean of 3.98)
     [Figure: Loge(Fatalities) (4 to 6.5) against Pct Urban Roads (0.25 to 0.75)]
     • Occupant Fatalities: No Law 5.87, Seat Belt Law 5.77. Note: These differences in occupant fatalities by Seat Belt Law are statistically significant
     • Non-occupant Fatalities: Seat Belt Law 4.63, No Law 4.53. Note: These differences in non-occupant fatalities by Seat Belt Law are not statistically significant
     How does this graph relate to the adjusted means?

  31. What’s the big takeaway from this unit?
     • Regression models can easily include dichotomous predictors
       • All assumptions are about Y at particular values of X (or X’s); there are no assumptions about the distribution of the predictors
       • The same toolkit we’ve developed for continuous predictors can be used for dichotomous predictors (including hypothesis tests, correlations and plots)
     • Controlled effects are often different from uncontrolled effects
       • One of the major reasons we use multiple regression is that we have several predictors that affect the outcome for which we want to statistically control
       • Not only can we control for a single covariate, we can control for many covariates simultaneously (in this example, we had 4 covariates in addition to our question variable)
     • Results of complex analyses can be displayed more simply using tables and graphs
       • As your models become more complex, the need for simpler numerical and graphical displays remains
       • Always important to think about how you will communicate your results to colleagues and broader audiences
       • Adjusted means and prototypical trajectories are powerful tools

  32. Appendix: Annotated PC-SAS Code for Using Dichotomous Predictors
     Note that this is just an abstract from the full program

     *------------------------------------------------------------------*
      Creating boxplots of DPFAT & NOFAT distributions for
      SEATBELTLAW=0 and SEATBELTLAW=1
     *------------------------------------------------------------------*;
     proc boxplot data=one;
       title2 "Fatalities by Presence/Absence of SeatBelt Laws";
       plot (DPFat NOFat)*SeatBeltLaw;

     proc boxplot, when used for dichotomous predictors, creates pairs of boxplots comparing the outcome variables’ values across the two categories of the dichotomous predictor. The plot statement specifies the outcome variables to be used and the dichotomous predictor. Its syntax is outcome*predictor (note the use of parentheses because of the two outcome variables)

     *------------------------------------------------------------------*
      Display DPFAT & NOFAT univariate summary information in tables for
      SEATBELTLAW=0 & SEATBELTLAW=1
     *------------------------------------------------------------------*;
     proc means data=one;
       by SeatBeltLaw;
       var DPFat NOFat;

     proc means is a very useful tool to create table summaries of descriptive statistics, especially for categorical predictors. The by statement specifies the categorical predictor to be used in grouping the data. The var statement specifies the variables for which you require descriptive statistics.

     *------------------------------------------------------------------*
      Comparing mean values of DPFAT & NOFAT for
      SEATBELTLAW=0 and SEATBELTLAW=1
     *------------------------------------------------------------------*;
     proc ttest data=one;
       class SeatBeltLaw;
       var DPFat NOFat;

     proc ttest runs a two-sample t-test comparing the means of two groups. The class statement specifies the categorical predictor used to differentiate the two groups.

  33. Appendix: Annotated PC-SAS Code for Using Dichotomous Predictors

     *------------------------------------------------------------------*
      For pedagogic purposes only: What happens if we change the
      reference category?
      Creating new dichotomous predictor NOSEATBELTLAW
     *------------------------------------------------------------------*;
     data one;
       set one;
       NoSeatBeltLaw = 1 - SeatBeltLaw;

     Use the data step in the middle of the program to add new variables to the same data. The set statement specifies the dataset to which the variable is added. You can then run new PROCs on the same data, using the new variables.

     *------------------------------------------------------------------*
      Controlling for vehicle miles
      Inspect bivariate scatterplots
        LDPFAT vs MILES, LDPFAT vs LMILES, LNOFAT vs MILES, LNOFAT vs LMILES
      Inspect same plots showing SEATBELTLAW=0 and SEATBELTLAW=1
     *------------------------------------------------------------------*;
     proc gplot data=one;
       title2 "Examining the effect of vehicle miles";
       plot (LDPFat LNOFat)*(miles lmiles);
       plot (LDPFat LNOFat)*(miles lmiles)=SeatBeltLaw;

     proc gplot can also be used to represent a three-way plot with plotting symbols denoting the 3rd (here categorical) predictor. The plot statement syntax is outcome*predictor=categorical predictor. If you use a symbol statement in the program, SAS will use dots ● of different colors for each category of the predictor. Note you can have multiple plot statements in a single GPLOT.

     *------------------------------------------------------------------*
      Estimating partial correlations controlling for LMILES
     *------------------------------------------------------------------*;
     proc corr data=one;
       title2 "Partial correlation matrix controlling for Lmiles";
       var LDPFat LNOFat SeatBeltLaw ltemp PctUrban lpopden;
       partial lmiles;

     proc corr estimates bivariate correlations between variables you specify. By adding a partial statement to the syntax, it will estimate partial correlations, controlling for the variable named in the partial statement.

  34. Glossary terms included in Unit 8
     • 2 sample t-test
     • Adjusted mean
     • Categorical predictor (nominal and ordinal)
     • Dichotomous predictor
     • Dummy variable
     • Main effects assumption
