Data Quality • Sharp project • 5 June 2010
Statistical Problems with Data Quality in EHR • Missing Data • Uncertain Diagnosis • Uneven/unequal precision / measurement error • Bias • …
Missing Data: (All the rage in statistical theory) • Common problem with observational / retrospective data • Statistical approaches • Imputation • Multiple imputation (MI) (Statisticians have acronyms too) • Regression with residual error • Draw from posterior distribution
Missing Data: Empirical approach • Regression on Y with missing X-variables • “X is missing” is also information. • Analyze data set using • Imputation (mean?) • “missing” indicator • Empirical approach: let the data tell you what to do
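The mean-imputation-plus-indicator idea on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's analysis: the data are invented, and `solve_linear`, `ols`, and `fit_with_missing_indicator` are hypothetical helper names. Missing X values are filled with the observed mean, and a 0/1 "X is missing" column enters the regression so that missingness itself can carry information.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(rows, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear(xtx, xty)


def fit_with_missing_indicator(x, y):
    """Regress y on [1, x filled with observed mean, missing indicator]."""
    observed = [v for v in x if v is not None]
    fill = sum(observed) / len(observed)
    rows = [[1.0, fill if v is None else v, 1.0 if v is None else 0.0] for v in x]
    return ols(rows, y)


# Invented toy data: None marks a missing X value.
x = [1.0, 2.0, 3.0, 4.0, None, None]
y = [5.0, 8.0, 11.0, 14.0, 20.0, 22.0]
intercept, slope, missing_effect = fit_with_missing_indicator(x, y)
print(intercept, slope, missing_effect)
```

In this toy data the complete cases follow y = 2 + 3x exactly, so the slope is recovered from them, while the indicator coefficient absorbs how far the rows with missing X sit from that line. A large indicator coefficient is the data "telling you" that missingness is informative.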
Uncertain Diagnosis • Universal problem with health data • No gold standard • Disease/health is a spectrum, not a dichotomy • Probabilistic perspective • Probability (Peripheral Arterial Disease) • From {0,1} to [0,1] as phenotype • More realistic phenotype?
Uncertain Diagnosis • Result is a probability • Probability is a posterior distribution of a 0/1 variable • Use p itself (certainty equivalent) • Analogous to single imputation • Use multiple imputation • “1” with probability p, “0” with probability 1-p
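The two options on this slide can be compared on a simple prevalence estimate. The sketch below is illustrative only: the probabilities are invented, and the function names are hypothetical. The first function uses p itself; the second draws "1" with probability p and "0" otherwise, repeats the analysis m times, and pools the results with Rubin's rules (total variance = average within-imputation variance + (1 + 1/m) × between-imputation variance).

```python
import random


def certainty_equivalent_prevalence(p):
    """Single-imputation analogue: use each p in place of the 0/1 status."""
    return sum(p) / len(p)


def multiple_imputation_prevalence(p, m=50, seed=0):
    """Draw 0/1 status m times and pool the estimates with Rubin's rules."""
    rng = random.Random(seed)
    n = len(p)
    estimates, within = [], []
    for _ in range(m):
        # "1" with probability p, "0" with probability 1 - p
        status = [1 if rng.random() < pi else 0 for pi in p]
        q = sum(status) / n
        estimates.append(q)
        within.append(q * (1 - q) / n)  # binomial sampling variance of q
    pooled = sum(estimates) / m
    w_bar = sum(within) / m
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total_var = w_bar + (1 + 1 / m) * between
    return pooled, total_var


# Invented posterior probabilities of disease for 200 patients.
p = [0.05] * 120 + [0.5] * 40 + [0.9] * 40
single = certainty_equivalent_prevalence(p)
pooled, total_var = multiple_imputation_prevalence(p)
print(single, pooled, total_var)
```

Both routes give nearly the same point estimate here; the difference is that the multiple-imputation variance includes a between-imputation term reflecting the diagnostic uncertainty.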
Uncertain Diagnosis: PAD example (eMERGE) • Mayo Vascular Lab Database: n=18000 • Gold Standard: Ankle/Brachial Index (ABI) • Use of Diagnostic / procedural codes • ICD-9 / HICDA / CPT • Logistic regression of gold standard (PAD by ABI) on diagnostic codes • posterior probability of PAD
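A logistic regression of a gold-standard label on code indicators can be sketched as below. Everything specific here is an assumption for illustration: the two indicator columns (`has_dx_code`, `has_proc_code`), the counts, and the plain gradient-ascent fit are invented and are not the eMERGE model; a real analysis would use a statistical package. The point is only that the fitted model turns a code pattern into a posterior probability of PAD.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, steps=2000):
    """Maximize the logistic log-likelihood by plain gradient ascent."""
    beta = [0.0] * len(X[0])
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, yi in zip(X, y):
            err = yi - sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def prob_pad(beta, has_dx_code, has_proc_code):
    """Posterior probability of PAD for a given (hypothetical) code pattern."""
    return sigmoid(beta[0] + beta[1] * has_dx_code + beta[2] * has_proc_code)


# Invented counts: (has_dx_code, has_proc_code) -> (PAD by ABI, no PAD by ABI)
counts = {(0, 0): (5, 40), (1, 0): (20, 10), (0, 1): (12, 8), (1, 1): (27, 3)}
X, y = [], []
for (dx, proc), (n_pad, n_no_pad) in counts.items():
    X += [[1.0, float(dx), float(proc)]] * (n_pad + n_no_pad)
    y += [1] * n_pad + [0] * n_no_pad

beta = fit_logistic(X, y)
p_none = prob_pad(beta, 0, 0)
p_dx = prob_pad(beta, 1, 0)
p_both = prob_pad(beta, 1, 1)
print(p_none, p_dx, p_both)
```

Patients with no relevant codes receive a low probability and patients with both codes a high one; these probabilities, rather than a hard 0/1 label, are what would be carried into later analyses.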
Uncertain Diagnosis • Model for Pr{PAD}: 90% predictive value • Export model for Pr{PAD} to patients without gold standard ascertainment? • (Coding practices?)
Uncertain Diagnosis • Use Pr{PAD} in analysis of • Incidence of PAD • Incidence trends • Surveillance • Analysis of etiology, risk factors
Unequal Precision of continuous phenotype • eMERGE example: Red Blood Count • Use retrospective Laboratory Data • N=3000, K=20,000 • 1 to 100 measurements/subject • Account for differential precision • Components of variance • Weighted regression? • Posterior distribution: same model fits
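One way to account for differential precision is a normal components-of-variance model, sketched below. The model and all numbers are assumptions for illustration (between-subject variance `tau2`, within-subject variance `sigma2`, population mean `mu`), not values from the eMERGE data. A subject's mean of n measurements has variance tau2 + sigma2/n, which gives both a regression weight and a posterior distribution for the subject's true level.

```python
def regression_weight(n, tau2, sigma2):
    """Inverse-variance weight for a subject mean based on n measurements."""
    return 1.0 / (tau2 + sigma2 / n)


def posterior_for_subject(ybar, n, mu, tau2, sigma2):
    """Posterior mean and variance of a subject's true level, assuming
    true level ~ N(mu, tau2) and each measurement ~ N(true level, sigma2)."""
    sampling_var = sigma2 / n
    weight = tau2 / (tau2 + sampling_var)  # shrinkage factor in (0, 1)
    post_mean = mu + weight * (ybar - mu)
    post_var = weight * sampling_var
    return post_mean, post_var


# Invented variance components and population mean.
mu, tau2, sigma2 = 4.7, 0.04, 0.09

# Same observed mean, very different numbers of measurements.
mean_1, var_1 = posterior_for_subject(5.3, 1, mu, tau2, sigma2)
mean_100, var_100 = posterior_for_subject(5.3, 100, mu, tau2, sigma2)
print(mean_1, var_1)
print(mean_100, var_100)
print(regression_weight(1, tau2, sigma2), regression_weight(100, tau2, sigma2))
```

The subject with a single measurement is pulled strongly toward the population mean and keeps a wide posterior; the subject with 100 measurements stays close to the observed mean. The regression weights level off near 1/tau2, so heavily measured subjects do not dominate the fit.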
Sample from Posterior Distribution • Missing Data, uncertain diagnosis, unequal precision can all be represented by sampling from posterior distribution • They are all the “same problem” • Statistical / computational tools for this have been developed • Markov Chain Monte Carlo (MCMC) • Multiple Imputation
Summary: Data Quality • ‘Data’ is not ‘a number’ but ‘a posterior distribution’ • Mean and variance • Posterior probability • Data quality • Don’t try to change it • Measure it • Allow for it: propagation of error
What is “Data”? • Data is whatever input goes into the next procedure. • (= output from previous procedure) • ‘Propagation of error’ • Output of NLP is also “Data”
How to Assess Data Quality? • What if there is no Gold Standard? • Use any external standard • E.g. outcome data • Stronger predictive relationship = better signal/noise ratio? • “Errors-in-variables” principle • Larger error in X -> Smaller beta for Y|X
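The errors-in-variables principle can be checked with a small simulation. This is a sketch with invented settings (true slope 1, unit-variance X, normal errors); `simulate_slopes` is a hypothetical helper. Under classical measurement error the fitted slope shrinks by roughly var(X) / (var(X) + var(error)), so a weaker relationship with an external outcome signals a noisier X.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_slopes(error_sds, beta=1.0, n=20000, seed=0):
    """Fit Y on X observed with different amounts of measurement error."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * x + rng.gauss(0, 0.5) for x in true_x]
    result = {}
    for sd in error_sds:
        observed_x = [x + rng.gauss(0, sd) for x in true_x]
        result[sd] = slope(observed_x, y)
    return result


slopes = simulate_slopes([0.0, 0.5, 1.0])
for sd, b in slopes.items():
    expected = 1.0 / (1.0 + sd ** 2)  # attenuation factor when var(X) = 1
    print(sd, round(b, 3), round(expected, 3))
```

With these settings the fitted slopes fall close to the attenuation factors 1, 0.8, and 0.5, which is the sense in which the strength of a predictive relationship measures signal/noise in X.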
Summary: Help! • What are the important tasks in Data Quality? • Measurement? • Allowance for? • Important tasks for this Project? • Integrate with other projects