Introduction to Statistics: Political Science (Class 9) Review
Probability of having cardiovascular disease • Purpose of statistics: • Inferences about populations using samples • We draw a random sample of 1,000 adults and 405 have some form of CVD • Based on our sample, if we randomly select one adult from the population: what is the probability that they have cardiovascular disease?
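The estimate asked for on this slide is the sample proportion. A minimal sketch in Python, where the variable names are illustrative and not from the slides:

```python
# Sample described on the slide: 405 of 1,000 randomly sampled adults have CVD.
n_sampled = 1000
n_with_cvd = 405

# The sample proportion serves as our estimate of the population probability.
p_cvd = n_with_cvd / n_sampled
print(f"Estimated P(CVD) = {p_cvd:.3f}")
```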
Conditional Probability • Probability of exercising <3 days/week? • Probability of CVD among those who exercise <3 days/week? • Probability of CVD among those who exercise 3 or more days/week?
Association between exercise and CVD? • p1 = 28.9/(30.3+28.9) = 0.488 • p2 = 10.6/(30.2+10.6) = 0.260 • Difference = 0.488 - 0.260 = 0.228 • Those who exercise less than 3 days/week are 22.8 percentage points (0.228) more likely to have CVD
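The conditional probabilities can be reproduced from the four joint percentages used in the arithmetic on this slide. This is a sketch only: the group labels are illustrative, and which cell holds which percentage is inferred from the slide's calculation.

```python
# Joint percentages of the sample, as they appear in the slide's arithmetic.
# Keys are (exercise group, CVD status); the labels are illustrative.
joint_pct = {
    ("<3 days", "cvd"): 28.9,
    ("<3 days", "no cvd"): 30.3,
    ("3+ days", "cvd"): 10.6,
    ("3+ days", "no cvd"): 30.2,
}

def p_cvd_given(group):
    """P(CVD | exercise group) = share with CVD in the group / total share of the group."""
    with_cvd = joint_pct[(group, "cvd")]
    group_total = with_cvd + joint_pct[(group, "no cvd")]
    return with_cvd / group_total

p1 = p_cvd_given("<3 days")   # exercise less than 3 days/week
p2 = p_cvd_given("3+ days")   # exercise 3 or more days/week
diff = p1 - p2
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, difference = {diff:.3f}")
```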
Specifying and testing hypotheses • Difference of proportions = .228 • What’s our null hypothesis? • Why a “null hypothesis”? Why not test whether the difference is .228? • Central limit theorem • In repeated sampling, the distribution of our estimates of the mean (or difference of means or slope) will be normally distributed and centered over the true population value
Central limit theorem [Figure: sampling distribution centered on the proposed true value (0), with a distance of 1 standard error marked]
Comparing proportions • Difference of proportions = .228 • p1 = 28.9/(30.3+28.9) = 0.488 (N=602) • p2 = 10.6/(30.2+10.6) = 0.260 (N=398) • Standard error of this difference: sqrt[ p1*(1-p1)/N1 + p2*(1-p2)/N2 ]
Comparing proportions • So, standard error of difference is the square root of: (.488*(1-.488)/602)+(.260*(1-.260)/398) • Which is approximately .030 • Difference of proportions = .228
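The standard error calculation can be checked with a few lines of Python. The function name is illustrative; the inputs are the rounded proportions and group sizes from the slide.

```python
import math

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Rounded proportions and group sizes from the slide.
se = se_diff_proportions(0.488, 602, 0.260, 398)
print(f"Standard error of the difference = {se:.4f}")
```

The result comes out at roughly .03.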
Hypotheses • Null hypothesis: • There is no difference in the rate of CVD between those who exercise less than 3 days/week and those who exercise 3 or more days/week • Alternate hypothesis: • There is a difference in the rate of CVD between those who exercise less than 3 days/week and those who exercise 3 or more days/week • (i.e., the difference is not 0)
If 0 were the true difference, it would be very unlikely that we would find a difference about 7.6 (.228/.030) standard errors from that value by chance [Figure: sampling distribution centered on the proposed true value (0), with a distance of 1 standard error marked]
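How unlikely such a difference would be under the null can be sketched as a two-sided z-test. This assumes the normal approximation from the central limit theorem and uses the rounded proportions and group sizes from the comparing-proportions slides; the variable names are illustrative.

```python
import math

# Rounded proportions and group sizes from the comparing-proportions slides.
p1, n1 = 0.488, 602
p2, n2 = 0.260, 398
null_difference = 0.0  # value proposed by the null hypothesis

diff = p1 - p2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Number of standard errors between the observed difference and the null value.
z = (diff - null_difference) / se

# Two-sided p-value under a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, two-sided p-value = {p_value:.2e}")
```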
Does exercise cause lower CVD? • Reverse causation? Might CVD cause exercise? • Failure to account for confounds • Typically leads to over-estimating the strength of a relationship (not always… but usually)
Multivariate Regression: Specification and Interpretation
Does exercise make CVD less likely? • Regression (predict CVD) • Estimated likelihood of CVD if exercise 4 days/week? • What might confound our estimate of the relationship between exercise and CVD?
                       Coef.   SE     T      P-value
Days Exercise (0-7)    -0.06   .001   ?      0.000
Constant                0.56   .002   ?      0.000
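The predicted value for someone who exercises 4 days/week is the constant plus the slope times 4. A minimal sketch, with illustrative names:

```python
# Coefficients from the bivariate regression table on this slide.
constant = 0.56
coef_days_exercise = -0.06

def predicted_cvd(days_exercise):
    """Predicted probability of CVD for a given number of exercise days (0-7)."""
    return constant + coef_days_exercise * days_exercise

print(f"Predicted P(CVD) at 4 days/week: {predicted_cvd(4):.2f}")
```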
Controlling for confounds
                       Coef.   SE     T      P-value
Days Exercise (0-7)    -0.03   .001   -3.0   0.002
Days Fast Food (0-7)    0.04   .002    2.0   0.048
Constant                0.42   .002   21.0   0.000
[Figure: % chance of CVD (y-axis) by days per week of exercise (x-axis), shown separately for high and low fast food]
Controlling for dichotomous confounds
                       Coef.   SE     T      P-value
Days Exercise (0-7)    -0.03   .001   -3.0   0.002
Days Fast Food (0-7)    0.04   .002    2.0   0.048
Smoker (1=yes)          0.11   .001   11.0   0.000
Constant                0.38   .002   19.0   0.000
• Predicted probability of CVD for 2 days exercise, 2 days fast food, smoker?
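With several predictors, the predicted value is the constant plus each coefficient times that person's value on the variable. A sketch using the coefficients from this slide (the dictionary keys and function name are illustrative):

```python
# Coefficients from the regression table on this slide.
constant = 0.38
coefs = {
    "days_exercise": -0.03,
    "days_fast_food": 0.04,
    "smoker": 0.11,  # dichotomous: 1 = yes, 0 = no
}

def predict(values):
    """Constant plus the sum of coefficient * value over all predictors."""
    return constant + sum(coefs[name] * value for name, value in values.items())

# The case asked about on the slide: 2 days exercise, 2 days fast food, smoker.
person = {"days_exercise": 2, "days_fast_food": 2, "smoker": 1}
print(f"Predicted P(CVD) = {predict(person):.2f}")
```

Written out, this is 0.38 - 0.03*2 + 0.04*2 + 0.11*1 = 0.51.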
Nominal Variables • Variable that does not have an “order” to it • Nothing is “higher” or “lower” • Create set of dichotomous variables • Always interpret coefficients with respect to the reference category
Controlling for nominal confounds
                       Coef.   SE     T      P-value
Days Exercise (0-7)    -0.03   .001   -3.0   0.002
Days Fast Food (0-7)    0.03   .002    1.5   0.135
Smoker (1=yes)          0.09   .001    9.0   0.000
South (1=yes)           0.03   .002    1.5   0.137
West (1=yes)           -0.01   .002   -0.5   0.642
Northeast (1=yes)       0.02   .002    1.0   0.410
Constant                0.34   .002   17.0   0.000
(Midwest is excluded category)
• What if we wanted to test whether including region indicators improves fit of the model?
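One common way to answer the question about the region indicators is an F-test that compares the model with the indicators (unrestricted) to the model without them (restricted). The sketch below computes the F statistic from the two R-squared values. The R-squared values and sample size are hypothetical placeholders, since the slides do not report them, and the function name is illustrative.

```python
def f_statistic(r2_restricted, r2_unrestricted, n_obs, n_added, n_predictors_unrestricted):
    """F statistic for jointly testing a set of added predictors.

    F = [(R2_u - R2_r) / q] / [(1 - R2_u) / (n - k - 1)]
    where q = number of added predictors and k = predictors in the unrestricted model.
    """
    numerator = (r2_unrestricted - r2_restricted) / n_added
    denominator = (1 - r2_unrestricted) / (n_obs - n_predictors_unrestricted - 1)
    return numerator / denominator

# HYPOTHETICAL inputs for illustration only; not taken from the slides.
f_stat = f_statistic(
    r2_restricted=0.200,          # model without region indicators (hypothetical)
    r2_unrestricted=0.205,        # model with region indicators (hypothetical)
    n_obs=1000,                   # hypothetical sample size
    n_added=3,                    # South, West, Northeast
    n_predictors_unrestricted=6,  # exercise, fast food, smoker + 3 regions
)
print(f"F = {f_stat:.2f}")
```

The statistic would then be compared with the critical value of the F distribution with 3 and n - k - 1 degrees of freedom.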
Logarithms Why use a logarithmic transformation? You think the relationship looks like this…
Squared term – U(or ∩)-shaped relationship Age and political ideology (-2=very conservative, 2=very liberal)
Create indicators from an ordered variable • Party Identification (-3 to 3) • Seven Variables: • Strong Republican (1=yes) • Weak Republican (1=yes) • Lean Republican (1=yes) • Pure Independent (1=yes) • Lean Democrat (1=yes) • Weak Democrat (1=yes) • Strong Democrat (1=yes)
Predict Obama Favorability (1-4) Excluded category: Pure Independents
Predict Obama Favorability (1-4) New excluded category: Leaning Republicans
Interactions • One variable moderates the effect of another – i.e., the relationship between one variable and an outcome depends on the value of another variable
Regression estimates an equation…
• 61.100 + 1.286*Party - 1.138*Voted + 3.575*Party*Voted + u
• 61.100 + Party*1.286 + Party*Voted*3.575 - 1.138*Voted + u
• OR
• 61.100 + Party*1.286 + Voted*Party*3.575 - Voted*1.138 + u
• Grouping the Party terms: 61.100 + Party(1.286 + Voted*3.575) - 1.138*Voted + u
• Grouping the Voted terms: 61.100 + Party*1.286 + Voted(Party*3.575 - 1.138) + u
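The grouped form of the equation shows that the slope on Party depends on Voted. A short sketch of that idea, assuming Voted is coded 0/1 (the slides do not state the coding of either variable) and ignoring the error term u; function names are illustrative.

```python
def predicted_outcome(party, voted):
    """Predicted value from the estimated equation, without the error term u."""
    return 61.100 + 1.286 * party - 1.138 * voted + 3.575 * party * voted

def party_slope(voted):
    """Change in the predicted outcome per one-unit change in Party, at a given Voted."""
    return 1.286 + 3.575 * voted

# ASSUMPTION: Voted is a 0/1 indicator.
for voted in (0, 1):
    print(f"Voted = {voted}: slope on Party = {party_slope(voted):.3f}")
```

Under that coding, the slope on Party is 1.286 when Voted is 0 and 1.286 + 3.575 = 4.861 when Voted is 1.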
Dealing with confounds • Theory + multivariate regression • Experiments
Dealing with reverse causation • Theory • Experiments
Experiments • What is the key characteristic of an experiment? • How does this address reverse causality? • How does it address confounds? • Weaknesses/limitations of experiments?
Exam Expectations • Describe probabilities / conditional probabilities • Write hypotheses • Demonstrate understanding of how null hypotheses relate to the central limit theorem • Test difference of proportions (formula for SE will be provided) • Interpreting multivariate regression • Relationships (slopes) • Predicted values • Sketch graphs of relationships • Discuss strengths and limitations of analyses • Why an estimated slope might be biased • Benefits and limitations of experiments
Notes • Homework 3 graded • Homework 4 due Thursday 12/9 • Office hours next week – email to come • Exam December 14 at 2pm