200 likes | 374 Views
Exploratory data analysis with two qualitative variables. Not in FPP. Exploratory data analysis with two qualitative/categorical variables. Main tools Contigency tables Conditional, marginal, and joint frequencies. Motivating example. Surviving the Titanic
E N D
Exploratory data analysis with two qualitative variables Not in FPP
Exploratory data analysis with two qualitative/categorical variables • Main tools • Contigency tables • Conditional, marginal, and joint frequencies
Motivating example • Surviving the Titanic • Was there a class discrimination in survival of the wreck of the Titanic? • “It has been suggested before the Enquiry that the third-class passengers had been unfairly treated, that their access to the boat deck had been impeded; and that when they reached the deck the first and second-class passengers were given precedence in getting places in the boats.” Lord Mersey, 1912
Titanic: Marginal frequencies • % Dead = 1513/2224 = 0.68 • % Alive = 711/2224 = 0.32 • % in first class = 325/2224 = 0.14 • % in second class = 285/2224 = 0.13 • % in third class = 706/2224 = 0.32 • % crew = 908/2224 = 0.41
Titanic: Conditional frequenceis • % (Alive | 1st) = 203/325 = 0.625 • % (Alive | 2nd) = 118/285 = 0.414 • % (Alive | 3rd) = 178/706 = 0.252 • % (Alive | Crew) = 212/908 = 0.233 • Based on these frequencies does there appear to be class discrimination?
Titanic: percentage of men in each class • % (Man | 1st) = 175/325 = 0.54 • % (Man | 2nd) = 168/285 = 0.59 • % (Man | 3rd) = 462/706 = 0.65 • % (Man | Crew) = 885/908 = 0.97 • There are larger percentages of men in third class and crew
Surviving the Titanic • A reason for class differences in survival: • Larger percentages of men died • 3rd class consisted of mostly men. • Hence, a larger percentage of 3rd class passengers died. • Once again keep in mind possible lurking variables that could be driving the relationship seen between two measured variables
Relative risk and odds ratios • Motivating example • Physicians’ health study (1989): randomized experiment with 22071 male physicians at least 40 years old • Half the subjects assigned to take aspirin every other day • Other half assigned to take a placebo, a dummy pill that looked and tasted like aspirin
Physicians’ health study • Here are the number of people in each cell:
Relative risk Risk of y1 for level x1=a/(a+b) Risk of y1 for level x2=c/(c+d)
Relative risk for physicians’ health study • Relative risk of a heart attack when taking aspirin versus when taking a placebo equals • People that took aspirin are 0.55 times as likely to have a heart attack than people that took the placebo • Or people that took placebo are 1/0.55 = 1.82 times as likely to have a heart attack than people that took aspirin
Odds ratios Odds of y1 for level x1=a/b Odds of y1 for level x2=c/d
Odds ratios for physicians’ health study • Relative risk of a heart attack when taking aspirin versus taking a placebo is • Odds of having a heart attack when taking aspirin over odds of a heart attack when taking a placebo (odds ratio)
Interpreting odds ratios and relative risks • When the variables X and Y are independent • odds ratio = 1 relative risk = 1 • When subjects with level x1 are more likely to have y1 than subjects with level x2, the • odds ratio > 1 relative risk > 1 • When subjects with level x1 are less likely to have y1 than subjects with level x2, then • odds ratio < 1 relative risk < 1
Which one should be used? • If Relative Risk is available then it should be used • In a cohort study, the relative risk can be calculated directly • In a case-control study the relative risk cannot be calculated directly, so an odds ratio is used instead • Case-control studies is an example. They compare subjects who have a “condition” to subjects that don’t but have similar controls • In this type of study we know %(exposure|disease). But to compute the RR we need %(disease|exposure). • Recall that RR = %(disease|exposure)/%(disease|placebo) • Not available in more complex modeling (logistic regression)
Odds ratio vs relative risk • When is odds ratio a good approximation of relative risk • When cases are representative of diseased population • When controls are representative of population without disease • When the disease being studied occurs at low frequency • Of itself, an odds ratio is a useful measure of association
Relative risk vs absolute risk • % smokers who get lung cancer: 8% (conservative guess here) • Relative risk of lung cancer for smokers: 800% • Getting lung cancer is not commonplace, even for smokers. But, smokers’ chances of getting lung cancer are much, much higher than non-smokers’ chances.
Simpsons paradox • When a third variable seemingly reverses the association between two other variables • Hot hand example