Principles of Epidemiology for Public Health (EPID600)
Data analysis and causal inference – 1
Victor J. Schoenbach, PhD
Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill
www.unc.edu/epid600/
The Physicist, the Chemist, and the Statistician
From “Science Jokes”, posted to Usenet groups by Joachim Verhagen (verhagen@fys.ruu.nl); downloaded from Keith M. Gregg, keith.gregg@stanford.edu, www-leland.stanford.edu/~keithg/humor.shtml
“Three professors (a physicist, a chemist, and a statistician) are called in to see their dean. Just as they arrive the dean is called out of his office, leaving the three professors there. The professors see with alarm that there is a fire in the wastebasket.
“The physicist says, ‘I know what to do! We must cool down the materials until their temperature is lower than the ignition temperature and then the fire will go out.’
“The chemist says, ‘No! No! I know what to do! We must cut off the supply of oxygen so that the fire will go out due to lack of one of the reactants.’
“While the physicist and chemist debate what course to take, they both are alarmed to see the statistician running around the room starting other fires. They both scream, ‘What are you doing?’ To which the statistician replies, ‘Trying to get an adequate sample size.’”
Data management
• Managing epidemiologic data is “mass production”
• A systematic, organized, professional approach is critical for detecting and avoiding problems
“You can never, never take anything for granted.”
Noel Hinners, vice president for flight systems at Lockheed Martin Astronautics, whose engineering team reported measurements in English units that the Mars Climate Orbiter navigation team assumed were metric units.
Without the documentation, the data may be of little if any value (1995 NSFG)
00000000000003122222222402143041000
00000000000001144112131 070520310
00000000000003233112131 072331040
000000000000011163322227070350110
00000000000003133022221 02451121000
00000000000001111112131 02110041000
00000000000002111112131 07307131000
00000000000002122112131 01073041000
Data analysis and causal inference
• “Our data say nothing at all.” (Epidemiology guru Sander Greenland, Congress of Epidemiology 2001, Toronto)
• Data are observer notes, respondent answers, biochemical measurements, contents of medical records, machine-readable datasets, …
• What does one do with them?
Steps in data management
• Design the data collection process
• Write down all data collection procedures
• Train and supervise data collectors
• Monitor all data collection activities
• Document all data collection experiences
• Keep track of, document, and safeguard data
Data processing
• Review, edit, and code data forms, documenting exceptions and actions
• Convert to electronic form
• “Clean” data: check for illegal or improbable values and combinations of values
• Prepare summaries
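The cleaning step above can be sketched in a few lines of Python. This is only an illustration: the field names, the legal ranges, the coding of sex (1 = male) and the records themselves are all hypothetical, not taken from any study in these slides.

```python
# Hypothetical rules: each field maps to a test for a legal value.
RULES = {
    "age": lambda v: 0 <= v <= 110,
    "sex": lambda v: v in (1, 2),        # assumed coding: 1 = male, 2 = female
    "sbp": lambda v: 60 <= v <= 260,     # systolic blood pressure, mmHg
}


def find_problems(records):
    """Return (record index, field, value) for each missing or illegal value."""
    problems = []
    for i, rec in enumerate(records):
        for field, is_legal in RULES.items():
            value = rec.get(field)
            if value is None or not is_legal(value):
                problems.append((i, field, value))
        # Improbable combination of values (hypothetical cross-field check).
        if rec.get("sex") == 1 and rec.get("pregnant") == 1:
            problems.append((i, "sex/pregnant", (1, 1)))
    return problems


# Made-up records for illustration.
records = [
    {"age": 34, "sex": 2, "sbp": 118, "pregnant": 1},
    {"age": 340, "sex": 1, "sbp": 130, "pregnant": 0},  # keying error in age
    {"age": 51, "sex": 1, "sbp": 125, "pregnant": 1},   # implausible combination
]

for problem in find_problems(records):
    print(problem)
```

Each flagged value would then be checked against the original form and the action documented, as the first bullet requires.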
The case of the missing eights
• Cancer Prevention Study II (N=1.2 million)
• Contractor keyed 20,000 forms/wk; checked weekly
• 28-item food frequency had peculiar pattern of missing values
• Pulled original questionnaires to check
• Programmer checked code
• Cause: “O” instead of “0”
Steven D. Stellman. Am J Epidemiol 1989;129(4):857-860
Can you find the data management error?
* get non-hispanic white population in county for 2000, first by adding
  ages 15-24, 25-34, 35-44, and 45-64, then by excluding ages 45-64;
CWHITES=CST00609+CST00610+CST00611+CST00612;
CWHITES2=CWHITES-CST00612;
* get non-hispanic black population in county;
CBLACKS=CST00616+CST00617+CST00618+CST00619;
CBLACKS2=CBLACKS-CST00619;
* get hispanic or latino population in county;
CHISPS=CST00623+CST00624+CST00625+CST00626;
CHISPS2=CHISPS-CST00626;
Variable labels:
CST00637 Female population white alone aged 15-24, 2000 – county
CST00638 Female population white alone aged 25-34, 2000 – county
CST00639 Female population white alone aged 35-44, 2000 – county
CST00640 Female population white alone aged 45-64, 2000 – county
CST00644 Female population black* alone aged 15-24, 2000 – county
CST00645 Female population black* alone aged 25-34, 2000 – county
CST00646 Female population black* alone aged 35-44, 2000 – county
CST00647 Female population black* alone aged 45-64, 2000 – county
CST00651 Female population Hispanic* aged 15-24, 2000 – county
CST00652 Female population Hispanic* aged 25-34, 2000 – county
CST00653 Female population Hispanic* aged 35-44, 2000 – county
CST00654 Female population Hispanic* aged 45-64, 2000 – county
* Full variable name: “black or African American”, “Hispanic or Latino”
* get non-hispanic white female population in county;
CWFEMALES=CST00637+CST00638+CST00639+CST00640;
CWFEMALES2=CWFEMALES-CST00640;
* get non-hispanic black female population in county;
CBFEMALES=CST00644+CST00645+CST00646+CST00647;
CBFEMALES2=CBFEMALES-CST00646;
* get hispanic female population in county;
CHFEMALES=CST00651+CST00652+CST00653+CST00654;
CHFEMALES2=CHFEMALES-CST00654;
Can you find the data management error?
The statement CBFEMALES2=CBFEMALES-CST00646; subtracts the 35-44 age group (CST00646), whereas every parallel statement subtracts the 45-64 group. To exclude ages 45-64 it should read:
CBFEMALES2=CBFEMALES-CST00647;
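One way to make this kind of slip less likely is to list the wanted age groups once and sum them directly, instead of adding all four and then subtracting a retyped variable name. The Python sketch below illustrates the idea; the variable names come from the slides, but the counts are invented and the helper `total` is hypothetical.

```python
# Made-up counts for one county, keyed by the variable names on the slides.
county = {
    "CST00644": 1200,  # black females aged 15-24
    "CST00645": 1100,  # black females aged 25-34
    "CST00646": 1000,  # black females aged 35-44
    "CST00647": 1900,  # black females aged 45-64
}


def total(record, variables):
    """Sum the named variables; a misspelled name raises KeyError."""
    return sum(record[name] for name in variables)


AGES_15_64 = ["CST00644", "CST00645", "CST00646", "CST00647"]
AGES_15_44 = AGES_15_64[:-1]  # drop the last (45-64) group without retyping names

cbfemales = total(county, AGES_15_64)
cbfemales2 = total(county, AGES_15_44)

# What the erroneous statement computes: the total minus the 35-44 group.
as_written = cbfemales - county["CST00646"]

# A consistency check that exposes the error when it is present.
assert cbfemales - cbfemales2 == county["CST00647"]

print(cbfemales2, as_written)
```

With these invented counts the correct 15-44 total is 3300, while the statement as written yields 4200. The two agree only when the 35-44 and 45-64 counts happen to be equal, which is why the error can pass unnoticed.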
Data exploration
• Examine the data (frequency distributions, cross-tabulations, scatterplots); be alert for surprises and suspicious findings
• Examine means and prevalence for factors of interest, overall and within interesting subgroups
• Look at associations, prevalence ratios, relative risks, odds ratios, correlations
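As a small illustration of the association measures in the last bullet, the following Python sketch computes a risk ratio and an odds ratio from a 2x2 table. The counts are hypothetical and the function name is invented for this example.

```python
def two_by_two_measures(a, b, c, d):
    """Measures of association from a 2x2 table.

    a, b = exposed with / without the outcome
    c, d = unexposed with / without the outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed have the outcome.
measures = two_by_two_measures(30, 70, 10, 90)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Here the risk ratio is 3.0 while the odds ratio is about 3.86; the odds ratio lies further from 1 because the outcome is not rare in this made-up table.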
Carry out focused data analysis
• Desirable to have a written analysis plan based on the research questions
• Typically carry out “crude” analyses and analyses controlling for important variables
• Methods of control: stratification, mathematical modeling
Distribution of U.S. household income, 2007 (CPS data)
[Figure: distribution of household income; x-axis: income in $1000s/year]
Source: http://img55.imageshack.us/i/incomedistr07jo6.jpg/
Stratified analysis
• Divide the dataset into subsets according to relevant covariables (e.g., age, sex, smoking, …)
• Examine the estimates and associations within each subset (unless there are too many)
• Take averages across the subsets
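The slide does not say which average to take. One widely used choice is the Mantel-Haenszel pooled odds ratio, a weighted average of the stratum-specific odds ratios. The sketch below assumes that choice and uses invented counts for two strata.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table (a, b = exposed; c, d = unexposed)."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata.

    Each stratum is (a, b, c, d). The estimator is
    sum(a*d/n) / sum(b*c/n), where n is the stratum total.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical strata (for example, younger and older participants).
strata = [
    (10, 90, 5, 95),
    (40, 60, 25, 75),
]

stratum_ors = [odds_ratio(*s) for s in strata]
pooled = mantel_haenszel_or(strata)
print([round(x, 2) for x in stratum_ors], round(pooled, 2))
```

The pooled value falls between the two stratum-specific odds ratios (about 2.11 and 2.00), as a weighted average should.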
Mathematical modeling
• Express the outcome as some mathematical function of the relevant covariables
• “Fit” this function to the data, so that it models the relations in the data
• Interpret the resulting model to draw inferences about associations
Selecting a pattern to sew a pair of pants
• Want one that fits the need
• Can sew without a pattern, but takes time and may not look good
• Select a pattern that will be well received
– Have you seen anyone wearing it?
– Has it been featured in magazines?
The strategy of statistical data analysis
Look for an available statistical model that will fit the situation (e.g., binomial, normal, chi-square, linear)
• Have others used it?
• Has it appeared in a methodology article?
Summarize the data in terms of the statistical model
• Mean
• Standard deviation
• Other parameters
But should always look at the data
• Distributions can have the same mean and standard deviation but look very different
[Figure: two distributions with the same mean (5)]
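To see the point numerically, the Python sketch below takes three made-up samples with very different shapes and rescales each to a mean of 5 and a standard deviation of 2 (the SD of 2 is an arbitrary choice for this illustration). The summary statistics then match, yet the median and range still give the shapes away.

```python
import statistics


def rescale(values, target_mean, target_sd):
    """Shift and stretch values to the requested mean and sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [target_mean + target_sd * (v - mean) / sd for v in values]


# Invented samples: symmetric, right-skewed, and two-peaked.
samples = {
    "symmetric": rescale([1, 2, 3, 4, 5, 6, 7, 8, 9], 5, 2),
    "skewed": rescale([1, 1, 1, 1, 2, 2, 3, 5, 20], 5, 2),
    "bimodal": rescale([1, 1, 1, 1, 9, 9, 9, 9], 5, 2),
}

for name, data in samples.items():
    print(
        f"{name:10s} mean={statistics.mean(data):.2f} "
        f"sd={statistics.stdev(data):.2f} "
        f"median={statistics.median(data):.2f} "
        f"min={min(data):.2f} max={max(data):.2f}"
    )
```

All three lines report the same mean and SD, but only a look at the data (or at least at the median and extremes) shows that the second sample has a long right tail and the third has no observations near its mean.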
Regression models - Conceptual
• Suppose risk factors of:
Age 50 years
BP 130 mmHg systolic
CHL 220 mg/dL
SMK 30 pack-years
Example of an additive model:
Risk of CHD = Risk from Age (“Age_risk”) + Risk from BP (“BP_risk”) + Risk from CHL (“CHL_risk”) + Risk from SMK (“SMK_risk”)
Propose the model
Risk of CHD = Age_risk + BP_risk + CHL_risk + SMK_risk
Age_risk = Age in years x risk increase per year
BP_risk = BP in mmHg x risk increase per mmHg
CHL_risk = Cholest. in mg/dL x risk increase per mg/dL
SMK_risk = Pack-years x risk increase per pack-year
Fit the model – estimate the coefficients
• Risk = β0 + β1Age + β2BP + β3CHL + β4SMK
β0 = baseline risk
β1 = risk increase per year
β2 = risk increase per mmHg
β3 = risk increase per mg/dL
β4 = risk increase per pack-year
• Use the data and statistical techniques to estimate β1, β2, β3, β4.
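Once coefficients have been estimated, applying the additive model is simple arithmetic. The Python sketch below evaluates it for the risk-factor profile used earlier (age 50, BP 130, CHL 220, SMK 30). The coefficient values are invented purely for illustration; in practice they would come from fitting the model to data with regression software.

```python
# Hypothetical coefficients, NOT estimates from any study.
COEFFICIENTS = {
    "baseline": 0.01,  # beta0: baseline risk
    "age": 0.001,      # beta1: risk increase per year
    "bp": 0.0005,      # beta2: risk increase per mmHg
    "chl": 0.0002,     # beta3: risk increase per mg/dL
    "smk": 0.002,      # beta4: risk increase per pack-year
}


def additive_risk(age, bp, chl, smk, coef=COEFFICIENTS):
    """Risk = beta0 + beta1*Age + beta2*BP + beta3*CHL + beta4*SMK."""
    contributions = {
        "baseline": coef["baseline"],
        "Age_risk": coef["age"] * age,
        "BP_risk": coef["bp"] * bp,
        "CHL_risk": coef["chl"] * chl,
        "SMK_risk": coef["smk"] * smk,
    }
    return sum(contributions.values()), contributions


risk, parts = additive_risk(age=50, bp=130, chl=220, smk=30)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
print(f"Risk of CHD: {risk:.3f}")
```

Printing the separate contributions mirrors the “Age_risk + BP_risk + CHL_risk + SMK_risk” decomposition and makes it easy to see which term dominates under a given set of coefficients.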
P-values and Power
• P-value: “the probability of obtaining an interesting-looking sample from a boring population” (1 – specificity)
• Power: “the probability of obtaining an interesting-looking sample from an interesting population” (sensitivity)
The P-value
If my study observes 0.5 [e.g., ln(OR)]
[Figure: distributions of samples from a boring population (centered at 0) and an interesting population (centered at 0.7, ln(OR)), with the P-value marked for an observed value of 0.5]
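For a rough numerical version of this picture, the Python sketch below computes a two-sided P-value for an observed ln(OR) of 0.5 and the power to detect a true ln(OR) of 0.7. It assumes a normal sampling distribution and a standard error of 0.25; the standard error is a made-up value, since the slides do not give one.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p(estimate, se, null_value=0.0):
    """P(sample at least this far from the null | boring population)."""
    z = (estimate - null_value) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def power(true_value, se, alpha=0.05, null_value=0.0):
    """P(sample is significant at level alpha | interesting population)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = (true_value - null_value) / se
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


SE = 0.25  # hypothetical standard error of ln(OR)

p_value = two_sided_p(0.5, SE)
study_power = power(0.7, SE)
print(f"P-value: {p_value:.4f}")
print(f"Power:   {study_power:.3f}")
```

Under these assumptions the P-value is a little under 0.05 and the power is roughly 80%. Both are probabilities of a sample given a population, which is why neither answers the question of which population a given sample came from.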
The Problem with the P-value
But the P-value does not tell me the probability that what I observed was due to chance
[Figure: boring population at 0, interesting population at 0.7]
If I study only boring populations
[Figure: distributions of samples from boring populations, centered at 0]
If I study only interesting populations
[Figure: distributions of samples from interesting populations, centered at 0.7]
Many boring populations
[Figure: boring populations at 0, interesting populations at 0.7]
Many interesting populations
[Figure: boring populations at 0, interesting populations at 0.7]
Do epidemiologists study boring populations?
That probability depends on how many boring populations there are. If we study 10 interesting populations and 100 boring populations with 90% power and a 5% significance level, we expect to obtain 9 interesting samples from the interesting populations and 5 from the boring populations.
P-values and predictive values
Results:
• 14 interesting samples
• 5 came from boring populations
Probability that an interesting sample came from a boring population: 5/14 = 36%, not 5%!
Analogous to positive predictive value
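The arithmetic behind these numbers can be written out as a short Python sketch. The function name is invented; the inputs (10 interesting and 100 boring populations, 90% power, 5% significance level) are the ones given above.

```python
def expected_interesting_samples(n_interesting, n_boring, power, alpha):
    """Expected counts of interesting-looking samples, by true population type."""
    from_interesting = n_interesting * power  # true positives
    from_boring = n_boring * alpha            # false positives
    total = from_interesting + from_boring
    return {
        "from_interesting": from_interesting,
        "from_boring": from_boring,
        "total": total,
        "share_from_boring": from_boring / total,
        "positive_predictive_value": from_interesting / total,
    }


result = expected_interesting_samples(10, 100, power=0.90, alpha=0.05)
print(f"Interesting samples: {result['total']:.0f}")
print(f"From boring populations: {result['from_boring']:.0f}")
print(f"Share from boring populations: {result['share_from_boring']:.0%}")
```

Changing the mix of populations changes the answer even though power and significance level stay fixed: with 100 interesting and 10 boring populations, the same function gives a share from boring populations of well under 1%.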
Analogy to positive predictive value
Meta-analysis
• Literature reviews
• Systematic literature reviews
• Every study is an observation from a population of possible studies
• The set of studies that have been published may be a biased sample from that population
What should guide data analysis
• What are the research questions?
– Estimate means (e.g., cholesterol) and prevalences (e.g., HIV)
– Assess associations (e.g., Is blood lead associated with elevated blood pressure? Do prepaid health plans provide more preventative care? Do bednets protect against malaria?)
Association of helmet use with death in motorcycle crashes: a matched-pair cohort study (Daniel Norvell and Peter Cummings, AJE 2002;156:483-7)
• Data from the National Highway Traffic Safety Administration’s Fatality Analysis Reporting System
• Exposure: helmet use; Outcome: death
• Potential confounders: sex, seat position, age, state helmet law
• 9,222 driver-passenger pairs after exclusions
• Relative risk of death for a helmeted rider was 0.65 (0.57-0.74); 0.61 adjusted for seat position
• Examined effect measure modification by seat position and by type of crash
When the proofreader takes a week off
[Figure: chart from 12/29/2009, B5; x-axis labels: Dec 22, 23, 24, 25, 28]
www.google.com/finance/historical?q=INDEXDJX:.DJI
I hope he’s having a good break!
[Figure: chart from 12/31/2009, B6; x-axis labels: Dec 23, 24, 25, 28, 29]
www.google.com/finance/historical?q=INDEXDJX:.DJI
Thank you • Arigato • Asante • Dhanyavaad • Dumela • Gracias • Merci • Obrigado • Xie xie