This study explores the use of multiple imputation to address missing race data in a pre-invasive cervical cancer study across three states. It compares the results with a complete case method and examines the correlation between race and cervical cancer.
Multiple Imputation and Missing Race in the Pre-Invasive Cervical Cancer Study among Three States 2010 NAACCR Conference Quebec City, June 22, 2010 Bin Huang Kentucky Cancer Registry University of Kentucky
The Pre-invasive Cervical Cancer Study • HPV vaccine • Quadrivalent vaccine licensed for females in June 2006 • ACS developed the guideline for HPV vaccine use June 2007 • Anticipated reductions in cervical cancers, other anogenital cancers • Need for surveillance systems • Collection of population data for pre-invasive cervical cancer cases • Monitoring effectiveness and efficacy • CDC funded study • Includes three cancer registries – Michigan, Kentucky, Louisiana • Pre-pilot period (Sept-Dec 2008) • Data collection Jan 2009-Dec 2009
Missing Data in the Study • Missing data issue • Race: 30% missing. • Overall cases with complete data: 68.7% • Missing data can bias estimates or make analyses inefficient.
Missing Data Mechanism • Missing completely at random (MCAR). • The missingness is independent of both the missing response and the observed response. • Missing at random (MAR). • The missingness is independent of the missing response given the observed values. • Not missing at random (NMAR). • The missingness depends on the missing responses themselves, even after accounting for the observed values.
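The three mechanisms can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the study data: the covariate `x`, the outcome `y`, and all probabilities are hypothetical. It compares the true mean of `y` with the complete case mean under each mechanism.

```python
import random

def simulate(mechanism, n=20000, seed=0):
    """Return (true_mean, complete_case_mean) of a binary outcome y.

    All probabilities below are hypothetical and chosen only to make
    the contrast between mechanisms visible.
    """
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.random() < 0.5                 # fully observed covariate
        y = rng.random() < (0.7 if x else 0.3) # outcome depends on x
        if mechanism == "MCAR":
            p_miss = 0.3                       # same for everyone
        elif mechanism == "MAR":
            p_miss = 0.5 if x else 0.1         # depends only on observed x
        elif mechanism == "NMAR":
            p_miss = 0.5 if y else 0.1         # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_miss:             # y is observed
            observed_y.append(y)
    return sum(all_y) / n, sum(observed_y) / len(observed_y)

if __name__ == "__main__":
    for mech in ("MCAR", "MAR", "NMAR"):
        true_mean, cc_mean = simulate(mech)
        print(f"{mech}: true={true_mean:.3f} complete-case={cc_mean:.3f}")
```

In this sketch the complete case mean is close to the truth only under MCAR. Under MAR the marginal complete case mean is off, but the bias can be removed by conditioning on `x`, which is what an imputation model does. Under NMAR no adjustment using observed data alone is enough.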
Methods to Treat Missing Data Available Case Methods • Complete case method (listwise deletion). • Pairwise deletion Single Imputation methods • Mean substitution • Hot deck imputation • Regression substitution Modern Approaches • Maximum Likelihood (ML) method • Bayesian method • Multiple Imputation (MI)
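To make the contrast between two of the simpler methods concrete, here is a minimal sketch of complete case analysis and mean substitution on a made-up list of ages (the values are hypothetical, not study data). It shows why single imputation by the mean is discouraged: the mean is unchanged, but the variance is artificially reduced.

```python
from statistics import mean, variance

def complete_case(values):
    """Listwise deletion: keep only the observed values."""
    return [v for v in values if v is not None]

def mean_substitute(values):
    """Single imputation: replace each missing value with the observed mean."""
    fill = mean(complete_case(values))
    return [fill if v is None else v for v in values]

# Hypothetical ages; None marks a missing value.
ages = [23, 31, None, 45, 28, None, 39, 52]

cc = complete_case(ages)
ms = mean_substitute(ages)

print("complete case: n =", len(cc), "mean =", mean(cc), "variance =", variance(cc))
print("mean substitution: n =", len(ms), "mean =", mean(ms), "variance =", variance(ms))
```

The filled-in values sit exactly at the mean, so they add to the sample size without adding any spread. Standard errors computed from the mean-substituted data are therefore too small.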
Multiple Imputation (MI) MI is a three-step approach to estimation for incomplete data, first proposed by Rubin in 1977. MI assumes missing data are MAR. • Imputation - the missing data are filled in m times to generate m complete data sets. The imputation model preserves the distributional relationship between the missing values and the observed values. • Analysis - each of the m complete data sets is analyzed separately using standard statistical analyses. • Combination - the results from the m complete data sets are combined to produce a single set of inferential results.
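The combination step is usually carried out with Rubin's rules. The following is a minimal sketch for a single scalar parameter (for example, one logistic regression coefficient); the function name `pool_rubin` and the example numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors.

    Rubin's rules for a scalar parameter:
      pooled estimate  Q = mean of the m estimates
      within variance  W = mean of the m squared standard errors
      between variance B = sample variance of the m estimates (divisor m - 1)
      total variance   T = W + (1 + 1/m) * B
    Returns (Q, sqrt(T)).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")
    q_bar = mean(estimates)
    within = mean(variances)
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)

# Hypothetical coefficients and squared standard errors from m = 3 analyses.
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, std_error)
```

The between-imputation term is what separates MI from single imputation: it carries the extra uncertainty due to the missing values into the final standard error.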
Software Available • SAS • PROC MI; PROC MIANALYZE. • MCMC option - assumption of multivariate normality. • SOLAS (Statistical Solutions Inc) • Same assumption as SAS PROC MI. • S-Plus: NORM • IVEware: SAS callable • PROC IMPUTE; PROC DESCRIBE; PROC REGRESS • Does not assume multivariate normality.
Aim of the Study • To impute the missing race with MI • To examine the differences in estimates between the complete case method and the MI method • Percentage of each race • The correlation between having adenocarcinoma in situ (AIS) and race.
Data – Pre-Cervical Cancer Cases • Three states – Kentucky, Louisiana and Michigan • Total – 3843 • Kentucky: 953 (24.8%), Louisiana: 653 (17.0%), Michigan: 2237 (58.2%) • Variables (17) • Demographics: race, address, age, ethnicity • Data sources: reporting facility, facility type, time at diagnosis • Disease data: site, histology code, histology terminology code, sequence code • Added variables (2) – 2000 US Census • % of Whites at county level • % of Blacks at county level
Missing Cases – Race, State at Diagnosis, County at Diagnosis
MI Methods • IVEware and SAS PROC MI • Used both methods • Only results from IVEware are presented • IVEware: http://www.isr.umich.edu/src/smp/ive/
Associations Multivariate logistic regression showed: • Race is significantly associated with ethnicity, histological terminology type, age, and state. • Most notably, the county-level racial composition (percent White and percent Black) is the dominant predictor of race.
Imputation Model • Variables included: race, registry, age, ethnicity, facility type, site, histology terminology code, sequence code, and the percentages of each race at county level • 10 imputation sets
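As a rough illustration of what "10 imputation sets" means in practice, here is a minimal sketch that fills in a missing categorical value 10 times. It is not the IVEware procedure used in the study: it swaps in a much simpler technique, an approximate Bayesian bootstrap hot deck within strata, and uses a single stratum variable as a stand-in for the full list of predictors. The function name, field names, and toy records are all hypothetical.

```python
import random

def abb_impute(records, key, stratum, m=10, seed=0):
    """Return m completed copies of `records` (a list of dicts).

    Missing values of `key` are marked None. Within each stratum:
      1. resample the observed values (donors) with replacement,
      2. fill each missing value with a random draw from that resample.
    Step 1 is what lets the m sets differ by more than pure draw noise.
    Assumes every stratum with a missing value has at least one donor.
    """
    rng = random.Random(seed)
    donors = {}
    for r in records:
        if r[key] is not None:
            donors.setdefault(r[stratum], []).append(r[key])
    completed_sets = []
    for _ in range(m):
        boot = {s: rng.choices(pool, k=len(pool)) for s, pool in donors.items()}
        filled = []
        for r in records:
            r2 = dict(r)                      # leave the input untouched
            if r2[key] is None:
                r2[key] = rng.choice(boot[r2[stratum]])
            filled.append(r2)
        completed_sets.append(filled)
    return completed_sets

# Toy records (hypothetical, not study data).
toy = [
    {"state": "KY", "race": "White"},
    {"state": "KY", "race": "Black"},
    {"state": "KY", "race": None},
    {"state": "MI", "race": "White"},
    {"state": "MI", "race": None},
]
imputed = abb_impute(toy, key="race", stratum="state", m=10)
print(len(imputed), "completed data sets")
```

Each completed set would then be analyzed separately and the results pooled. A regression-based imputation model, as used in the study, conditions on many predictors at once rather than on one stratum.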
Logistic Regression Analysis with AIS Status as the Dependent Variable
Summary • The high percentage of cases with missing race likely introduced bias into the estimated race proportions, mainly among data from Michigan. • The results show that Whites have a much higher risk of AIS than Blacks. • Quantitative differences in estimates between the two methods were found in the logistic model. • MI is relatively easy to implement and is appropriate for a wide range of datasets.
Acknowledgements CDC – Deblina Datta and staff Kentucky Cancer Registry: Thomas Tucker, Mary Jane Byrne, Brent Shelton Michigan Cancer Registry: Glenn Copland, Won Silva and staff Louisiana Cancer Registry: Vivien Chen and staff Macro International - Benita O’Colma
Words to Share John Wooden - “Be quick, but don’t hurry” “If you don’t have time to do it right, how will you find time to do it again?”
Questions? Bin Huang bhuang@kcr.uky.edu 859-219-0773 x 280 Thank You ! Merci !