Exploratory data analysis with two qualitative variables

mhester

Presentation Transcript


  1. Exploratory data analysis with two qualitative variables Not in FPP

  2. Exploratory data analysis with two qualitative/categorical variables • Main tools • Contingency tables • Conditional, marginal, and joint frequencies

  3. Motivating example • Surviving the Titanic • Was there class discrimination in who survived the wreck of the Titanic? • “It has been suggested before the Enquiry that the third-class passengers had been unfairly treated, that their access to the boat deck had been impeded; and that when they reached the deck the first and second-class passengers were given precedence in getting places in the boats.” (Lord Mersey, 1912)

  4. Titanic: Class by survival

  5. Titanic: Marginal frequencies • % Dead = 1513/2224 = 0.68 • % Alive = 711/2224 = 0.32 • % in first class = 325/2224 = 0.15 • % in second class = 285/2224 = 0.13 • % in third class = 706/2224 = 0.32 • % crew = 908/2224 = 0.41
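The marginal frequencies on this slide can be reproduced with a few lines of Python. This is only a sketch: the counts are the ones quoted on the slide, while the dictionary layout and the helper name `marginal` are illustrative choices, not part of the original deck.

```python
# Row and column totals as quoted on the slide (layout and names are illustrative).
class_totals = {"1st": 325, "2nd": 285, "3rd": 706, "Crew": 908}
survival_totals = {"Dead": 1513, "Alive": 711}

n = sum(class_totals.values())  # grand total of people on board


def marginal(counts):
    """Return each category's share of the grand total."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


class_marginals = marginal(class_totals)
survival_marginals = marginal(survival_totals)

# Print every marginal frequency to three decimals.
for name, p in {**survival_marginals, **class_marginals}.items():
    print(f"{name:>5}: {p:.3f}")
```

Each marginal frequency uses only one variable at a time (class or survival), so on its own it says nothing about how the two variables are related.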

  6. Titanic: Conditional frequencies • % (Alive | 1st) = 203/325 = 0.625 • % (Alive | 2nd) = 118/285 = 0.414 • % (Alive | 3rd) = 178/706 = 0.252 • % (Alive | Crew) = 212/908 = 0.233 • Based on these frequencies, does there appear to be class discrimination?
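As a rough sketch, the conditional frequencies can be computed by dividing each class's survivor count by that class's total. The counts are the ones quoted on the slides; the number of deaths per class is not stated there and is derived here by subtraction, and the variable names are illustrative.

```python
# Survivors and totals per class, as quoted on the slides.
alive = {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212}
total = {"1st": 325, "2nd": 285, "3rd": 706, "Crew": 908}

# Deaths per class are derived (total minus survivors), not quoted.
dead = {k: total[k] - alive[k] for k in total}

# Conditional frequency of survival given class.
p_alive_given_class = {k: alive[k] / total[k] for k in total}

for k in total:
    print(
        f"{k:>4}: alive={alive[k]:4d}  dead={dead[k]:4d}  "
        f"%(Alive | {k}) = {p_alive_given_class[k]:.3f}"
    )
```

The derived deaths add up to the 1513 reported in the marginal frequencies, which is a useful consistency check on the table.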

  7. Titanic: Class by person type

  8. Titanic: percentage of men in each class • % (Man | 1st) = 175/325 = 0.54 • % (Man | 2nd) = 168/285 = 0.59 • % (Man | 3rd) = 462/706 = 0.65 • % (Man | Crew) = 885/908 = 0.97 • There are larger percentages of men in third class and crew

  9. Surviving the Titanic • A possible reason for class differences in survival: • A larger percentage of men died • 3rd class consisted mostly of men • Hence, a larger percentage of 3rd class passengers died • Once again, keep in mind possible lurking variables that could be driving the relationship seen between two measured variables

  10. Relative risk and odds ratios • Motivating example • Physicians’ health study (1989): randomized experiment with 22071 male physicians at least 40 years old • Half the subjects assigned to take aspirin every other day • Other half assigned to take a placebo, a dummy pill that looked and tasted like aspirin

  11. Physicians’ health study • Here is the number of people in each cell:

  12. Relative risk • Risk of y1 for level x1 = a/(a+b) • Risk of y1 for level x2 = c/(c+d)
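Written out, the relative risk is the ratio of the two risks. The expression below assumes the usual 2×2 layout implied by the risk formulas: row x1 holds counts a (outcome y1) and b (outcome y2), and row x2 holds counts c and d. The table itself is not reproduced in this transcript, so that layout is an assumption.

```latex
\[
\mathrm{RR}
  = \frac{\text{risk of } y_1 \text{ for } x_1}{\text{risk of } y_1 \text{ for } x_2}
  = \frac{a/(a+b)}{c/(c+d)}
\]
```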

  13. Relative risk for physicians’ health study • Relative risk of a heart attack when taking aspirin versus when taking a placebo equals 0.55 • People who took aspirin were 0.55 times as likely to have a heart attack as people who took the placebo • Equivalently, people who took the placebo were 1/0.55 = 1.82 times as likely to have a heart attack as people who took aspirin

  14. Odds ratios • Odds of y1 for level x1 = a/b • Odds of y1 for level x2 = c/d
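Both measures can be sketched as small Python functions. The cell counts for the physicians' health study are not reproduced in this transcript, so the example reuses the Titanic counts quoted earlier (first class versus third class, with "alive" as the outcome). The function names and the a, b, c, d argument order are assumptions that follow the 2×2 layout implied by the formulas.

```python
def relative_risk(a, b, c, d):
    """Risk of y1 in row x1 divided by risk of y1 in row x2."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of y1 in row x1 divided by odds of y1 in row x2."""
    return (a / b) / (c / d)


# Row x1 = first class, row x2 = third class; y1 = alive, y2 = dead.
# Survivor counts and class totals are from the Titanic slides;
# deaths are derived by subtraction.
a, b = 203, 325 - 203
c, d = 178, 706 - 178

rr = relative_risk(a, b, c, d)
or_ = odds_ratio(a, b, c, d)

print(f"relative risk of survival, 1st vs 3rd: {rr:.2f}")
print(f"odds ratio of survival, 1st vs 3rd:    {or_:.2f}")
```

For these counts the relative risk is about 2.5 and the odds ratio about 4.9. The two agree in direction (both above 1) but not in size, because survival was not a rare outcome.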

  15. Odds ratios for physicians’ health study • Relative risk of a heart attack when taking aspirin versus taking a placebo is • Odds of having a heart attack when taking aspirin over odds of a heart attack when taking a placebo (odds ratio)

  16. Interpreting odds ratios and relative risks • When the variables X and Y are independent, then • odds ratio = 1, relative risk = 1 • When subjects with level x1 are more likely to have y1 than subjects with level x2, then • odds ratio > 1, relative risk > 1 • When subjects with level x1 are less likely to have y1 than subjects with level x2, then • odds ratio < 1, relative risk < 1

  17. Which one should be used? • If the relative risk is available, it should be used • In a cohort study, the relative risk can be calculated directly • In a case-control study, the relative risk cannot be calculated directly, so an odds ratio is used instead • Case-control studies compare subjects who have a “condition” (cases) to otherwise similar subjects who do not (controls) • In this type of study we know %(exposure|disease), but to compute the RR we need %(disease|exposure) • Recall that RR = %(disease|exposure)/%(disease|no exposure) • The relative risk is also not directly available in more complex modeling (logistic regression), which works with odds ratios

  18. Odds ratio vs relative risk • When is the odds ratio a good approximation of the relative risk? • When cases are representative of the diseased population • When controls are representative of the population without the disease • When the disease being studied occurs at low frequency • In itself, an odds ratio is a useful measure of association
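The low-frequency condition can be illustrated numerically. The counts below are invented purely for illustration (they do not come from the slides): both tables have a relative risk of 2, but the disease is rare in one and common in the other.

```python
def relative_risk(a, b, c, d):
    """Risk in the exposed row divided by risk in the unexposed row."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds in the exposed row divided by odds in the unexposed row."""
    return (a / b) / (c / d)


# Hypothetical counts: (diseased, healthy) for exposed, then unexposed.
rare = (10, 990, 5, 995)       # risks of 1% vs 0.5%
common = (600, 400, 300, 700)  # risks of 60% vs 30%

rr_rare, or_rare = relative_risk(*rare), odds_ratio(*rare)
rr_common, or_common = relative_risk(*common), odds_ratio(*common)

print(f"rare disease:   RR = {rr_rare:.2f}, OR = {or_rare:.2f}")
print(f"common disease: RR = {rr_common:.2f}, OR = {or_common:.2f}")
```

With the rare disease the odds ratio stays close to the relative risk. With the common disease it overstates it considerably, even though the relative risk is the same in both tables.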

  19. Relative risk vs absolute risk • % smokers who get lung cancer: 8% (conservative guess here) • Relative risk of lung cancer for smokers: 800% • Getting lung cancer is not commonplace, even for smokers. But, smokers’ chances of getting lung cancer are much, much higher than non-smokers’ chances.
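A short sketch of the arithmetic behind this slide. It assumes that "800%" means smokers are 8 times as likely as non-smokers to get lung cancer. That reading, and the variable names, are assumptions; the 8% figure is the slide's own conservative guess.

```python
risk_smokers = 0.08   # absolute risk for smokers (the slide's guess)
relative_risk = 8.0   # assumption: "800%" read as 8 times the non-smoker risk

# Absolute risk for non-smokers implied by those two numbers.
risk_nonsmokers = risk_smokers / relative_risk

# Absolute difference in risk between the two groups.
risk_difference = risk_smokers - risk_nonsmokers

print(f"non-smoker risk: {risk_nonsmokers:.1%}")
print(f"smoker risk:     {risk_smokers:.1%}")
print(f"difference:      {risk_difference:.1%}")
```

Under that reading, a large relative risk sits alongside a modest absolute difference of about seven percentage points, which is the contrast the slide draws.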

  20. Simpson’s paradox • When a third variable seemingly reverses the association between two other variables • Hot hand example
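The data for the hot hand example are not included in this transcript, so the sketch below uses invented counts to show how the reversal can arise. Option A has the higher success rate within each group, yet option B has the higher rate once the groups are pooled, because the two options are spread very unevenly across the groups. The group labels, option labels, and numbers are all hypothetical.

```python
# Hypothetical (successes, attempts) for two options within two groups.
data = {
    "group 1": {"A": (90, 100), "B": (800, 1000)},
    "group 2": {"A": (300, 1000), "B": (20, 100)},
}

# Success rate of each option within each group.
within = {
    group: {option: s / n for option, (s, n) in arms.items()}
    for group, arms in data.items()
}

# Success rate of each option after pooling the groups.
pooled = {}
for option in ("A", "B"):
    successes = sum(data[group][option][0] for group in data)
    attempts = sum(data[group][option][1] for group in data)
    pooled[option] = successes / attempts

for group, rates in within.items():
    print(f"{group}: A = {rates['A']:.2f}, B = {rates['B']:.2f}")
print(f"pooled:  A = {pooled['A']:.2f}, B = {pooled['B']:.2f}")
```

Here the group is the third variable: conditioning on it reverses the association seen in the pooled table.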
