Hierarchical models for combining multiple data sources measured at individual and small area levels
Chris Jackson, with Nicky Best and Sylvia Richardson
Department of Epidemiology and Public Health, Imperial College, London
chris.jackson@imperial.ac.uk
BIAS project: http://www.bias-project.org.uk

Outline
• Infer some individual-level relationship, e.g. influence of individual socio-economic circumstances on risk of ill health
• Use combination of datasets, individual and aggregate, to answer the question.
• Multi-level models on multi-level data.
Examples:
• Hospital admission for cardiovascular disease and socio-demographic factors
• Low birth weight and air pollution

Combining different forms of observational data
Aggregate data
• Sources: census, national registers, environmental monitors
• Strengths
  • Abundant, routinely collected
  • Covers whole population
  • Can study small-area variations
• Weaknesses
  • Ecological bias
  • Distinguishing individual from area-level effects
  • Not many variables
Individual data
• Sources: surveys, cohort studies, case-control, census SAR
• Strengths
  • Direct information on exposure-outcome relationship
  • More variables available
• Weaknesses
  • Low power
  • Little geographical information (confidentiality)
COMBINED
• Reduce confounding and bias
• Maximise power
• Separate individual and area-level effects
• Conflicts between information from each

Example 1: Cardiovascular hospitalisation
Question
• Socio-demographic predictors of hospitalisation for heart and circulatory disease for individuals
• Is there any evidence of contextual effects (area-level as well as individual predictors)?
Design
Data synthesis using
• Area-level administrative data: hospital episode statistics and census small-area statistics
• Individual-level survey data: Health Survey for England
Issue
• Reduce ecological bias and improve power, compared to using datasets singly.

Example 2: Low birth weight and pollution
Question
• Influence of traffic-related air pollution (PM10, NO2, CO) on risk of intrauterine growth retardation (low birth weight)
Design
Data synthesis using two individual-level datasets
• National births register, 2000 (~600,000 births)
• Millennium Cohort Study (~20,000 births)
Issue
• Geographical identifiers (pollution exposure), and outcome, available for both datasets
• Important confounders (maternal age, smoking, ethnicity…) only available in the small dataset.
• Combine to increase power.

Multilevel models for individual and area data
Most commonly used to model
• individual-level outcomes $y_{ij}$ (individual j, area i)
in terms of
• individual-level predictors $x_{ij}$
• group-level (e.g. area-level) predictors $x_i$
• Allow baseline risk (possibly also covariate effects) to vary by area:
  $y_{ij} \sim \alpha_i + \beta x_{ij} + b\, x_i$
However
We want to model area-level outcomes $y_i$ as well as individual outcomes $y_{ij}$.

Modelling the area-level outcome
[Diagram, two panels. First panel: individual exposure $x_{ij}$, aggregate exposure $x_i$, individual outcome $y_{ij}$. Second panel: the same nodes plus an aggregate outcome $y_i$.]

Ecological inference
• Determining individual-level exposure-outcome relationships using aggregate data.
• A simple ecological model:
  $Y_i \sim \text{Binomial}(p_i, N_i), \quad \text{logit}(p_i) = a + b X_i$
  $Y_i$ is the number of disease cases in area i
  $N_i$ is the population in area i
  $X_i$ is the proportion of individuals in area i with e.g. low social class
  $p_i$ is the area-specific disease rate
• $\exp(b)$ = odds ratio associated with exposure $X_i$
• This is the group-level association. Not necessarily equal to individual-level association → ecological bias

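The simple ecological model can be written down directly as a binomial log-likelihood. The sketch below is illustrative only: the function names and the `(Y_i, N_i, X_i)` tuple layout are assumptions introduced here, not something the slides specify.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def ecological_loglik(a, b, areas):
    """Binomial log-likelihood of the simple ecological model.

    areas is a list of (Y_i, N_i, X_i): cases, population and
    proportion exposed in each area (hypothetical layout).
    """
    total = 0.0
    for y, n, x in areas:
        p = expit(a + b * x)  # logit(p_i) = a + b * X_i
        log_choose = (math.lgamma(n + 1) - math.lgamma(y + 1)
                      - math.lgamma(n - y + 1))
        total += log_choose + y * math.log(p) + (n - y) * math.log(1.0 - p)
    return total

def group_level_odds_ratio(b):
    """exp(b): the group-level association, which may differ from the
    individual-level odds ratio (ecological bias)."""
    return math.exp(b)
```

Maximising `ecological_loglik` over `a` and `b` gives the group-level fit; nothing in it corrects for within-area variability of exposure.
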
Ecological bias
Bias in ecological studies can be caused by:
• Confounding, as in all observational studies
  • Confounders can be area-level (between-area) or individual-level (within-area).
  • Solution: try to account for confounders.
• Non-linear exposure-response relationship, combined with within-area variability of exposure
  • No bias if exposure is constant in area (contextual effect)
  • Bias increases as within-area variability increases
  • …unless models are refined to account for this hidden variability

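The second source of bias can be illustrated numerically. The sketch below assumes a logistic individual-level risk `expit(a + b*x)` and a normally distributed exposure within the area. Both choices, the function names and the parameter values are illustrative assumptions rather than anything taken from the slides. It compares the true area-average risk with the risk obtained by plugging in the area mean exposure.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def area_average_risk(a, b, mean, sd, n_points=2001):
    """Average of expit(a + b*x) over x ~ Normal(mean, sd),
    computed by midpoint quadrature over mean +/- 8 sd."""
    if sd == 0:
        return expit(a + b * mean)  # exposure constant within the area
    lo = mean - 8.0 * sd
    dx = 16.0 * sd / n_points
    total = 0.0
    for k in range(n_points):
        x = lo + (k + 0.5) * dx
        density = math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += expit(a + b * x) * density * dx
    return total

def plug_in_risk(a, b, mean):
    """Risk evaluated at the area mean exposure, ignoring within-area spread."""
    return expit(a + b * mean)

# Gap between the two, for increasing within-area standard deviation.
for sd in (0.0, 0.5, 1.0, 2.0):
    gap = area_average_risk(-3.0, 1.0, 0.0, sd) - plug_in_risk(-3.0, 1.0, 0.0)
    print(f"sd={sd}: gap={gap:.4f}")
```

With no within-area spread the two quantities coincide; in this example the gap widens as the spread grows, which is the pattern described in the bullets above.
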
Improving ecological inference
• Alleviate bias associated with within-area exposure variability.
• Get some information on within-area distribution $f_i(x)$ of exposures, e.g. from individual-level exposure data.
• Use this to form well-specified model for ecological data by integrating the underlying individual-level model:
  $Y_i \sim \text{Binomial}(p_i, N_i), \quad p_i = \int p_{ik}(x)\, f_i(x)\, dx$
  $p_i$ is average group-level risk
  $p_{ik}(x)$ is individual-level model (e.g. logistic regression)
  $f_i(x)$ is distribution of exposure x within area i (or joint distribution of multiple exposures)

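As a worked special case, suppose the exposure is binary and a proportion $X_i$ of area i is exposed, so that $f_i$ puts mass $1 - X_i$ on $x = 0$ and $X_i$ on $x = 1$. Assuming, for illustration only, a logistic individual-level model with intercept $a$ and log odds ratio $b$ (and dropping the stratum index $k$), the integral reduces to a two-term sum:

```latex
\begin{align*}
p_i &= \sum_{x \in \{0,1\}} p(x)\, f_i(x) \\
    &= (1 - X_i)\,\operatorname{expit}(a) + X_i\,\operatorname{expit}(a + b),
\qquad \operatorname{expit}(z) = \frac{1}{1 + e^{-z}} .
\end{align*}
```

Under this assumption $p_i$ is linear in $X_i$ on the probability scale, whereas a model of the form $\text{logit}(p_i) = a + b X_i$ is linear on the logit scale. The two agree at $X_i = 0$ and $X_i = 1$ but generally differ in between.
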
When ecological inference can work
• Using well-specified model
• Information on within-area distribution of exposure
  • Information, e.g. from a sample of individual exposures, to estimate the unbiased model that accounts for this distribution.
• High between-area contrasts in exposure
  • Information on the variation in outcome between areas with low exposure rates and high exposure rates
  • E.g. to determine ethnic differences in health, better to study areas in London (more diverse) than areas in a rural region.
When there is insufficient information in ecological data:
• May be able to incorporate individual-level exposure-outcome data…

Hierarchical related regression
Infer individual-level relationships using both individual and aggregate data
Individual-level model
• Logistic regression for individual-level outcome
• Includes individual or area-level predictors
• Use this to
  • model the individual-level data
  • construct correct model for aggregate data
Model for aggregate data
• Based on averaging the individual model over the within-area joint distribution of covariates.
• Alleviates ecological bias.
Combined model
• Individual and aggregate data assumed to be generated by the same baseline and relative risk parameters.
• Estimate these parameters using both datasets simultaneously.

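The idea of sharing parameters between the two data sources can be sketched as a joint log-likelihood. This is a deliberately simplified illustration and not the authors' implementation: it assumes a single binary exposure, one shared intercept `a` instead of area-specific baselines, and a crude grid search in place of a Bayesian fit. All names and the toy data are hypothetical.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def individual_loglik(a, b, individuals):
    """Bernoulli log-likelihood; individuals is a list of (x_ij, y_ij), both 0/1."""
    total = 0.0
    for x, y in individuals:
        p = expit(a + b * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def aggregate_loglik(a, b, areas):
    """Binomial log-likelihood (constant term omitted);
    areas is a list of (cases, population, proportion_exposed)."""
    total = 0.0
    for cases, pop, x_bar in areas:
        # Individual model averaged over the within-area exposure distribution.
        p = (1.0 - x_bar) * expit(a) + x_bar * expit(a + b)
        total += cases * math.log(p) + (pop - cases) * math.log(1.0 - p)
    return total

def combined_loglik(a, b, individuals, areas):
    """Both datasets are driven by the same (a, b)."""
    return individual_loglik(a, b, individuals) + aggregate_loglik(a, b, areas)

# Hypothetical toy data, for illustration only.
individuals = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
areas = [(30, 1000, 0.1), (55, 1000, 0.4), (80, 1000, 0.7)]

# Crude grid search over a and b in [-5, 5] with step 0.1.
grid = [i / 10.0 for i in range(-50, 51)]
best = max(((a, b) for a in grid for b in grid),
           key=lambda ab: combined_loglik(ab[0], ab[1], individuals, areas))
print("a, b maximising the combined log-likelihood on the grid:", best)
```

Because the aggregate term uses the averaged individual model rather than a logit-linear model in the area proportion, the two likelihood contributions refer to the same individual-level parameters and can simply be added.
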
Combining ecological and case-control data
• If outcome is rare, individual-level data from surveys or cohorts will usually contain little information.
• Supplement ecological data with case-control data instead.
• Haneuse and Wakefield (2005) describe a hybrid likelihood for combination of ecological and case-control data.
• Even including individual data from the cases only can reduce ecological bias to acceptable levels.

Issues with combining data
• Some variables missing in one dataset
  • e.g. smoking, blood pressure available in survey but not administrative data
• Different but related information in each
  • e.g. self-reported disease versus hospital admission records
• Conflicts between datasets in information on what is nominally the same variable
  • e.g. self-completed and interviewed responses to surveys
• Ideally the individual and aggregate data are from the same source (e.g. census small-area and SAR)

Example: Cardiovascular disease (CVD)
AGGREGATE
Hospital Episode Statistics
• number of CVD admissions in area in 1998, by age group/sex
Census small area statistics
• marginal proportions non-white, social class IV/V,…
Census Samples of Anonymised Records (2%)
• full within-area cross-classification of individuals, age/sex/ethnicity/social class/car ownership - required for correct aggregate model
INDIVIDUAL
Health Survey for England
• Self-reported admission to hospital for CVD (1998 only)
• Self-reported long-term CVD (1997, 1998, 1999, 2000, 2001)
• Multiple imputation for missing hospital admission in years other than 1998
• individual age and sex
• individual ethnicity
• individual social class
• individual car access
Baseline and relative risk of CVD admission for individual

Are aggregate and individual data consistent?
[Figure. Axis labels: Health Survey for England aggregated over districts; Census covariates or Hospital Episode Statistics data.]

Basic illustration of combining individual and aggregate data
[Diagram: graphical model.
DATA: aggregate census data (exposure $x_i$, e.g. proportion low social class; disease $y_i$, area admissions count; areas i) and individual survey data (exposure $x_{ij}$, individual social class; disease $y_{ij}$, CVD admission; areas i, individuals j).
UNKNOWNS: area baseline risk $m_i$; relative risk for individuals $b$.]

More complex models for disease, more confounders, need another data source.
[Diagram: graphical model.
DATA: Census Samples of Anonymised Records ($x_{il}$; areas i, individuals l); aggregate census data ($y_{ik}$, $x_{ir}$, $x_{is}$, $x_{ik}$; areas i), with cross-classification of individuals $x_{irsk}$ by social class r, employment status s, age/sex strata k; individual survey data (exposures $x_{ij}$; disease $y_{ij}$, CVD admission; areas i, individuals j).
Also shown: area/stratum baseline risk $m_{ik}$; relative risk for exposures $b$.]

Imputing missing outcomes in individual data
[Diagram: graphical model.
DATA: Census Samples of Anonymised Records ($x_{il}$; areas i, individuals l); aggregate census data ($y_{ik}$, $x_{ir}$, $x_{is}$, $x_{ik}$; areas i), with cross-classification of individuals $x_{irsk}$ by social class r, employment status s, age/sex strata k; survey data (1998) with CVD admissions $y_{ij}$ (areas i, individuals j); survey data (1997-2001) with self-reported CVD $y_{ij}^*$, exposures $x_{ij}$, and CVD admissions $y_{ij}$ including imputed values (areas i, individuals j).
Also shown: area/stratum baseline risk $m_{ik}$; relative risk for exposures $b$.]

Estimated coefficients (with 95% CI) for multiple regression model of the risk of hospitalisation
[Figure, with three groups of estimates: individual data only; aggregate data only; models combining individual and aggregated data.]

Individual and area-level predictors
• Area-level covariates in underlying model for hospitalisation risk (Carstairs deprivation index)
  • No significant influence of Carstairs, after accounting for individual-level factors
• Random effects models
  • Random area-level baseline risk quantifies remaining variability between areas.
  • After adjusting for covariates, variance partitioned into individual / area-level components
  • 4% of residual variance between wards attributable to unobserved area-level factors (2% for districts)
• Little evidence of contextual effects

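The slides do not say how the variance was partitioned. One common convention for logistic random-effects models, assumed here purely for illustration, treats the individual-level residual on the latent scale as standard logistic with variance π²/3. Under that assumption the area-level share of residual variance can be computed as follows (function names are hypothetical):

```python
import math

# Variance of the standard logistic distribution, used as the individual-level
# residual variance on the latent scale (an assumed convention).
LOGISTIC_VARIANCE = math.pi ** 2 / 3.0

def variance_partition(area_variance):
    """Share of residual variance attributable to the area level."""
    return area_variance / (area_variance + LOGISTIC_VARIANCE)

def area_variance_for(share):
    """Area-level random-effect variance implied by a given share (inverse of the above)."""
    return share * LOGISTIC_VARIANCE / (1.0 - share)

# Random-effect variances that would correspond to shares of 4% and 2%
# under this convention.
print(area_variance_for(0.04))
print(area_variance_for(0.02))
```

If a different partitioning method was used in the study, the implied variances would differ; the point is only that a share of a few percent corresponds to a small area-level variance relative to individual-level variation.
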
Example: Low birth weight and pollution
• Geographically complete individual dataset from national register, with exposure, outcome but not confounders
• Geographically sparse survey dataset with all variables
→ missing data problem
• Impute missing covariates that are likely to be confounded with the pollution exposure.
• Information for this imputation
  • from aggregate data (e.g. ethnicity, from census)
  • from sparse survey dataset

[Diagram: pollution regression model fitted to two datasets, with parameter labels $m$, $b_e$, $b_c$.
National register data (LARGE): low birth weight; pollution; confounders sex, age, socioeconomic, with further confounders missing (? ? ? ?).
Survey data (small): low birth weight; pollution; confounders sex, age, socioeconomic, smoking, ethnicity, maternal age, etc.
Aggregate census data: ethnicity.]

Parallel regression models
• Desire unbiased inference on the effect of the primary exposure.
• Available from the small dataset with all confounders, but with low power.
• Information for imputation comes from the small dataset or ecological data: is the resulting uncertainty worth the precision gained?
• Work in progress, currently awaiting some data.

Summary
• Combining datasets can increase power and reduce bias, making use of strengths of each
• Problems may arise when data are incompatible or inconsistent.
• Bayesian hierarchical models useful in cases of conflicts.
• All our methods can be implemented in WinBUGS
• More applied studies needed to demonstrate the utility of the approach.

Publications
Our papers available from http://www.bias-project.org.uk
• C. Jackson, N. Best, S. Richardson. Hierarchical related regression for combining aggregate and survey data in studies of socio-economic disease risk factors. Under revision, Journal of the Royal Statistical Society, Series A.
• C. Jackson, N. Best, S. Richardson. Improving ecological inference using individual-level data. Statistics in Medicine (2006) 25(12):2136-2159.
• C. Jackson, S. Richardson, N. Best. Studying place effects on health by synthesising area-level and individual data. Submitted.
• S. Haneuse and J. Wakefield. The combination of ecological and case-control data. Submitted.