Data Quality Sharp project 5

aimee

Presentation Transcript


  1. Data Quality: Sharp project 5, June 2010

  2. Statistical Problems with Data Quality in EHR • Missing Data • Uncertain Diagnosis • Uneven/unequal precision / measurement error • Bias • …

  3. Missing Data: (Rage in Statistical Theory) • Common problem with observational/retrospective data • Statistical approaches • Imputation • Multiple imputation (MI) (Statisticians have acronyms too) • Regression with residual error • → draw from posterior distribution
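The "regression with residual error" idea on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names and the toy data are hypothetical, and the sketch is a simplified form of multiple imputation because it holds the fitted regression coefficients fixed instead of also drawing them from their posterior. The completed data sets are combined with Rubin's rules.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

def multiply_impute(xs, ys, m=20, seed=0):
    """Return m completed copies of ys. Each None is replaced by the
    regression prediction plus a random residual drawn from N(0, residual_sd)."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in obs], [y for _, y in obs])
    return [
        [y if y is not None else a + b * x + rng.gauss(0, sd)
         for x, y in zip(xs, ys)]
        for _ in range(m)
    ]

def rubin_mean(completed):
    """Combine per-data-set means with Rubin's rules: (estimate, total variance)."""
    m = len(completed)
    ests = [statistics.fmean(d) for d in completed]
    within = statistics.fmean(statistics.variance(d) / len(d) for d in completed)
    between = statistics.variance(ests)
    return statistics.fmean(ests), within + (1 + 1 / m) * between

# Toy data (made up): y is roughly 1 + 2x, with two values missing.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, 0.2, -0.3]
xs = [float(i) for i in range(12)]
ys = [1.0 + 2.0 * x + e for x, e in zip(xs, noise)]
ys[3] = ys[8] = None

completed = multiply_impute(xs, ys)
est, total_var = rubin_mean(completed)
```

The between-imputation term in `rubin_mean` is what carries the extra uncertainty caused by the missing values; single imputation would drop it.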

  4. Missing Data: Empirical approach • Regression on Y with Missing X-variables • “X is missing” is also information. • Analyze data set using • Imputation (mean?) • “missing” indicator • Empirical approach: let the data tell you what to do
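A rough sketch of the mean-imputation-plus-indicator regression described on this slide. The helper names and the toy numbers are assumptions made for illustration. The small least-squares solver is included only to keep the example free of external libraries.

```python
import statistics

def ols(rows, y):
    """Least squares via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def design_with_missing_indicator(x):
    """Columns: intercept, x with missing values mean-filled, 1/0 'x is missing' flag."""
    mean_x = statistics.fmean(v for v in x if v is not None)
    return [[1.0, mean_x if v is None else v, 1.0 if v is None else 0.0] for v in x]

# Toy data (made up): y = 2x where x is observed; rows with missing x have higher y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, None, None, None]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 15.0, 16.0, 17.0]

intercept, slope_x, missing_effect = ols(design_with_missing_indicator(x), y)
```

In this toy case the coefficient on the indicator is clearly nonzero, which is the sense in which "X is missing" carries information about Y. The indicator method is known to give biased slopes in some settings, so it is best read as an exploratory device.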

  5. Uncertain diagnosis • Universal problem with health data • No gold standard • Disease/health is a spectrum, not a dichotomy • Probabilistic perspective • Probability (Peripheral Arterial Disease) • From {0, 1} to [0, 1] as phenotype • More realistic phenotype?

  6. Uncertain Diagnosis • Result is a probability • Probability is a posterior distribution of a 0/1 variable • Use p itself (certainty equivalent) • Analogous to single imputation • Use multiple imputation • “1” with probability p, “0” with probability 1-p
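The two options on this slide can be compared in a few lines. This is a sketch under assumed inputs: the probabilities in `p` are invented, and the function names are hypothetical. The first function uses p directly (the certainty equivalent); the second draws "1" with probability p and "0" with probability 1-p, once per imputed data set.

```python
import random
import statistics

def certainty_equivalent_prevalence(p):
    """Use each patient's probability p directly as the phenotype."""
    return statistics.fmean(p)

def imputed_prevalences(p, m=200, seed=1):
    """Draw m imputed 0/1 diagnosis vectors and return the prevalence of each."""
    rng = random.Random(seed)
    return [
        statistics.fmean(1 if rng.random() < pi else 0 for pi in p)
        for _ in range(m)
    ]

# Hypothetical posterior probabilities of disease for eight patients.
p = [0.05, 0.9, 0.5, 0.2, 0.95, 0.1, 0.7, 0.3]

direct = certainty_equivalent_prevalence(p)
draws = imputed_prevalences(p)
mi_estimate = statistics.fmean(draws)
mi_spread = statistics.stdev(draws)
```

Both routes target the same prevalence; the spread across the imputed draws is the part that the single-number approach does not show.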

  7. Uncertain Diagnosis: PAD example (eMERGE) • Mayo Vascular Lab Database: n=18000 • Gold Standard: Ankle/Brachial Index (ABI) • Use of Diagnostic / procedural codes • ICD-9 / HICDA / CPT • Logistic regression of gold standard (PAD by ABI) on diagnostic codes • → posterior probability of PAD
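A minimal sketch of the modelling step on this slide: a logistic regression of a gold-standard 0/1 label on a diagnostic-code indicator, returning a probability per patient. The counts below are invented toy data, not the Mayo data, and the single binary "code present" feature, the function names, and the plain gradient-ascent fit are all assumptions made to keep the example self-contained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, r):
    """Fitted probability for one feature row r (leading 1.0 = intercept)."""
    return sigmoid(sum(wj * rj for wj, rj in zip(w, r)))

def fit_logistic(rows, y, steps=5000, lr=0.5):
    """Maximum-likelihood logistic regression by plain gradient ascent
    on the average log-likelihood."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, yi in zip(rows, y):
            err = yi - predict(w, r)
            for j, rj in enumerate(r):
                grad[j] += err * rj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

# Toy data (made up): 10 patients without the code (1 has the disease),
# 10 patients with the code (8 have the disease).
rows = [[1.0, 0.0]] * 10 + [[1.0, 1.0]] * 10
gold = [1] + [0] * 9 + [1] * 8 + [0] * 2

w = fit_logistic(rows, gold)
p_without_code = predict(w, [1.0, 0.0])
p_with_code = predict(w, [1.0, 1.0])
```

With a single binary predictor the fitted probabilities reproduce the observed proportions in each group; with several codes the same fit yields one probability per code combination.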

  8. Uncertain Diagnosis • Model for Pr{PAD}: 90% predictive value • Export model for Pr{PAD} to patients without gold standard ascertainment? • (Coding practices?)

  9. Uncertain Diagnosis • Use Pr{PAD} in analysis of • Incidence of PAD • Incidence trends • Surveillance • Analysis of etiology, risk factors

  10. Unequal Precision of continuous phenotype • eMERGE example: Red Blood Count • Use retrospective Laboratory Data • N=3000, K=20,000 • 1 to 100 measurements/subject • Account for differential precision • Components of variance • Weighted regression? • Posterior distribution: same model fits
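One way to read "components of variance" and "weighted regression" together is inverse-variance weighting of each subject's mean. The sketch below assumes a simple two-component model in which a subject mean based on n measurements has variance var_between + var_within / n. The variance values and function names are illustrative assumptions, not estimates from the eMERGE data.

```python
def precision_weights(n_per_subject, var_between, var_within):
    """Inverse-variance weight for each subject's mean under the
    two-component model Var(mean_i) = var_between + var_within / n_i."""
    return [1.0 / (var_between + var_within / n) for n in n_per_subject]

def weighted_mean(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Two hypothetical subjects: one measurement versus one hundred.
n_per_subject = [1, 100]
subject_means = [4.9, 4.5]
weights = precision_weights(n_per_subject, var_between=1.0, var_within=4.0)
pooled = weighted_mean(subject_means, weights)
```

The weights are capped at 1 / var_between, so the subject with 100 measurements counts for more than the subject with one, but not 100 times more.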

  11. Sample from Posterior Distribution • Missing Data, uncertain diagnosis, unequal precision can all be represented by sampling from posterior distribution • They are all the “same problem” • Statistical / computational tools for this have been developed • Markov Chain Monte Carlo (MCMC) • Multiple Imputation

  12. Summary: Data Quality • ‘Data’ is not ‘a number’ but ‘a posterior distribution’ • Mean and variance • Posterior probability • Data quality • Don’t try to change it • Measure it • Allow for it: propagation of error

  13. What is “Data”? • Data is whatever input goes into the next procedure. • (= output from previous procedure) • ‘Propagation of error’ • Output of NLP is also “Data”

  14. How to Assess Data Quality? • What if there is no Gold Standard? • Use any external standard • E.g. outcome data • Stronger predictive relationship = better signal/noise ratio? • “Errors-in-variables” principle • Larger error in X → Smaller beta for Y|X
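The errors-in-variables principle on this slide can be checked with a small simulation. This sketch uses assumed values throughout (true slope 2, unit-variance X, the listed error standard deviations). For classical additive measurement error, the fitted slope is expected to shrink by the factor var(X) / (var(X) + var(error)).

```python
import random
import statistics

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(42)
beta, n = 2.0, 20000
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [beta * x + rng.gauss(0, 0.5) for x in x_true]

slopes = {}
expected = {}
for error_sd in (0.0, 0.5, 1.0):
    # Observed X = true X plus independent measurement error.
    x_obs = [x + rng.gauss(0, error_sd) for x in x_true]
    slopes[error_sd] = slope(x_obs, y)
    expected[error_sd] = beta * 1.0 / (1.0 + error_sd ** 2)
```

Read in reverse, this is the slide's suggestion: between two versions of X, the one with the larger fitted beta against an external outcome is plausibly the one measured with less error.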

  15. Summary: Help! • What are the important tasks in Data Quality? • Measurement? • Allowance for? • Important tasks for this Project? • Integrate with other projects
