
De-identification: A Critical Success Factor in Clinical and Population Research

De-identification: A Critical Success Factor in Clinical and Population Research. Steven Merahn, MD, and Dee Lang, RHIT. Prepared for 2007 APIII, Pittsburgh, PA, September 10, 2007. Major gaps exist today between patient care, clinical research and evidence-based medicine. Sharing data is the key.

Presentation Transcript


  1. De-identification: A Critical Success Factor in Clinical and Population Research • Steven Merahn, MD • Dee Lang, RHIT • Prepared for 2007 APIII, Pittsburgh, PA, September 10, 2007

  2. Major gaps exist today between patient care, clinical research and evidence-based medicine.

  3. Sharing Data is the Key • “Amassing large quantities of anonymized clinical and non-clinical information from medical records and reports and analyzing that data for patterns and other observations (is the best way to) support continuous quality improvement, shape best practices and inform clinical and population-based decision making” • “A Rapid Learning Health System,” Health Affairs 26(2), January 2007

  4. Processing Predicated on Protecting Patient Privacy • “Clinical records can be an important source of information…most of the information in these records is in the form of free text and extracting useful information from them requires automatic processing (e.g., index, semantically interpret, and search). A prerequisite to the distribution of clinical records outside of hospitals, be it for Natural Language Processing (NLP) or medical research, is de-identification” • J Am Med Inform Assoc. 2007;14:550-563. DOI 10.1197/jamia.M2444.

  5. Problems to Solve • Sources of data • Protecting patient privacy • Creating and maintaining a corpus of HIPAA compliant and searchable data • Building collaborations; creating networks of institutions sharing data • Emerging patient “data rights” issues

  6. Sources of Data • EMR/CIS systems • Large amounts of free text; not all data is parsed or field-limited • Transcribed Records and Reports • Even in systems without CIS, most transcriptions are delivered as electronic files • Pathology Reports (cf. caTIES) • Surgical Notes • Radiology Reports • Discharge Summaries • No need to wait for an EMR to create an RLHS

  7. Protecting Patient Privacy • De-identification is a well-defined, but limited, step in a broader research workflow or protocol • The defined nature of the step includes managing individually identifiable information in records and reports • Such schemes include redaction, elimination, categorical replacement (e.g., place, age range), replacement with proxies (Dr X), and offsets (day 1) • A process which must be constantly “tuned” in response to dynamic input variables and patterns of documentation
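
The replacement schemes listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `ProxyTable` class, the `MRN` pattern, and the "90+" top band are all hypothetical choices made for the example, not the presenters' method.

```python
import re
from datetime import date


def age_range(age: int, width: int = 10) -> str:
    """Categorical replacement: map an exact age to a band such as '40-49'."""
    if age >= 90:
        # Assumption: collapse the oldest ages into a single top band.
        return "90+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def day_offset(event: date, anchor: date) -> str:
    """Offset: express a date relative to an anchor date (e.g. admission) as 'day N'."""
    return f"day {(event - anchor).days + 1}"


class ProxyTable:
    """Proxy replacement: give each distinct name one stable stand-in."""

    def __init__(self, prefix: str = "Dr") -> None:
        self.prefix = prefix
        self._proxies: dict[str, str] = {}

    def proxy_for(self, name: str) -> str:
        key = name.strip().lower()
        if key not in self._proxies:
            # Hypothetical naming scheme: Dr X1, Dr X2, ...
            self._proxies[key] = f"{self.prefix} X{len(self._proxies) + 1}"
        return self._proxies[key]


# Hypothetical record-number format, used only to illustrate redaction.
MRN_PATTERN = re.compile(r"\bMRN[:\s]*\d+\b")


def redact_mrn(text: str) -> str:
    """Redaction: replace a matched identifier with a fixed marker."""
    return MRN_PATTERN.sub("[REDACTED]", text)
```

Keeping each scheme as a separate small function makes it easier to "tune" one of them when local documentation patterns change without touching the others.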

  8. [Workflow diagram; labels only] FIREWALL, Transcribed Reports, CIS, Query Interface, De-identification Methodology, De-identified Database, De-identified Data, NLP, Other processes, RE-ID Method, Trusted Proxy, Admin, QA, QA
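
The diagram labels a re-identification method and a trusted proxy but does not say how they work. One common arrangement, sketched here purely as an assumption, is a lookup table held behind the firewall that maps random study IDs back to the original record identifiers. The `TrustedProxy` class and its method names are hypothetical.

```python
import secrets


class TrustedProxy:
    """Holds the only link between study IDs and original identifiers.

    Assumption for this sketch: the table lives behind the firewall and
    only the de-identified study IDs leave it.
    """

    def __init__(self) -> None:
        self._study_to_original: dict[str, str] = {}
        self._original_to_study: dict[str, str] = {}

    def study_id_for(self, original_id: str) -> str:
        """Return a stable random study ID for an original identifier."""
        if original_id not in self._original_to_study:
            study_id = secrets.token_hex(8)
            # Regenerate in the unlikely event of a collision.
            while study_id in self._study_to_original:
                study_id = secrets.token_hex(8)
            self._original_to_study[original_id] = study_id
            self._study_to_original[study_id] = original_id
        return self._original_to_study[original_id]

    def reidentify(self, study_id: str, authorized: bool) -> str:
        """Look up the original identifier; refuse unless explicitly authorized."""
        if not authorized:
            raise PermissionError("re-identification requires authorization")
        return self._study_to_original[study_id]
```

Because the study ID is random rather than derived from the original identifier, it cannot be reversed without access to the table itself.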

  9. Considerations • When choosing a de-identification methodology, four things need consideration • What is the reliability and validity of the methodology? • Can the method maintain its specificity and sensitivity in local use? • What are the limitations of the methodology? • Can files be re-identified?

  10. Consistency, Reliability and Validity • Fundamental problems are inter-record reliability, manpower resources and time constraints • The issue then becomes the quality of the quality: over-marking (specificity) and under-marking (sensitivity) • What are acceptable levels of sensitivity and specificity? • 100% sensitivity for names • What is the benchmark? • What is the value of consistency?
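
To make the two measures on this slide concrete, here is a minimal sketch that scores a tool against a manually marked reference, token by token. It assumes the standard definitions with "token is an identifier" as the positive class, so under-marking lowers sensitivity and over-marking lowers specificity. The function name and the boolean-list input format are assumptions made for the example.

```python
def sensitivity_specificity(gold: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Compare tool marks to reference marks; True means 'marked as an identifier'."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fn = tn = fp = 0
    for g, p in zip(gold, predicted):
        if g and p:
            tp += 1  # identifier correctly marked
        elif g and not p:
            fn += 1  # identifier missed (under-marking)
        elif not g and p:
            fp += 1  # ordinary text marked (over-marking)
        else:
            tn += 1  # ordinary text correctly left alone

    # Undefined ratios (no positives or no negatives) are reported as NaN.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


gold = [True, True, False, False, False, False]
tool = [True, False, True, False, False, False]
print(sensitivity_specificity(gold, tool))  # (0.5, 0.75)
```

In the example, one of two identifiers is missed (sensitivity 0.5) and one of four ordinary tokens is wrongly marked (specificity 0.75).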

  11. Automated Methodologies: As Good As? / Better? • Classification of tokens • Sequence tracking problem (using Hidden Markov Models or Conditional Random Fields) • Rule-based system utilizing global features (sentence position), local features (lexical cues, special characters, and format patterns), and syntactic features • Hybrid systems of rules, pattern matching algorithms, heuristics and dictionaries
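
As a rough illustration of the hybrid approach in the last bullet, the sketch below labels tokens using format patterns, a name dictionary, and one lexical-cue heuristic. It is a toy example, not any published system: the patterns, the dictionary entries, and the cue list are all assumptions chosen for the demonstration.

```python
import re

# Format patterns (local features). Hypothetical, deliberately narrow.
PATTERNS = [
    ("DATE", re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("PHONE", re.compile(r"^\d{3}-\d{3}-\d{4}$")),
]

# Dictionary of known surnames. Hypothetical sample entries.
NAME_DICTIONARY = {"smith", "jones"}

# Lexical cues: a capitalized token after a title is treated as a name.
TITLE_CUES = {"dr", "mr", "mrs", "ms"}

PUNCTUATION = ".,;:()"


def classify_tokens(text: str) -> list[tuple[str, str]]:
    """Label each whitespace-separated token as DATE, PHONE, NAME, or O (other)."""
    tokens = text.split()
    labeled = []
    for i, token in enumerate(tokens):
        core = token.strip(PUNCTUATION)
        label = "O"
        # Rule 1: pattern matching on the token's format.
        for name, pattern in PATTERNS:
            if pattern.match(core):
                label = name
                break
        # Rule 2: dictionary lookup, then the title-cue heuristic.
        if label == "O":
            previous = tokens[i - 1].strip(PUNCTUATION).lower() if i > 0 else ""
            if core.lower() in NAME_DICTIONARY:
                label = "NAME"
            elif previous in TITLE_CUES and core[:1].isupper():
                label = "NAME"
        labeled.append((token, label))
    return labeled
```

For the input `"Dr. Lee saw Smith on 9/10/2007."`, the title cue catches "Lee", the dictionary catches "Smith", and the date pattern catches "9/10/2007."; the other tokens are labeled `O`.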

  12. Local Use • Can your methodology be customized to meet local needs? • While some methods may have good ‘numbers’, will they hold up in local use? • Every community has its own acronyms, place names and other local vocabulary • What is the protocol to manage local quality? • Regular checks against manual review • Formal evaluation research
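
One way to run the "regular checks against manual review" named on this slide is to tally the terms that reviewers marked but the tool missed, since frequent misses are likely local vocabulary. The sketch below assumes each reviewed record is available as a pair of sets (terms marked manually, terms marked by the tool); that input format and the function name are hypothetical.

```python
from collections import Counter


def missed_terms(reviewed_records: list[tuple[set[str], set[str]]]) -> Counter:
    """Count, across records, the terms marked by reviewers but not by the tool."""
    missed: Counter = Counter()
    for manual_marks, tool_marks in reviewed_records:
        tool_lower = {term.lower() for term in tool_marks}
        for term in manual_marks:
            if term.lower() not in tool_lower:
                missed[term.lower()] += 1
    return missed
```

Sorting the result with `most_common()` gives a short list of candidate acronyms and place names to review for the local dictionary.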

  13. “Data Rights” Issues • Legal models exist • Make “de-identified” data sharing part of informed consent • Offer different tiers of consent • Publicly-funded research • Academic research • Commercial research • Make the general public aware of the level of existing data sharing • Claims data already widely shared and sold

  14. Building Collaboration • [Diagram; labels only] FIREWALL, Query Interface, De-identified Database, QA

  15. Call to Action: Pathology Informatics Community • caBIG and caTIES are models for cross-institutional data sharing • Major institutions are establishing data repositories of pathology reports • Help facilitate data aggregation among other departments • Radiology (Radiology Reports) • Surgery (Surgical Notes) • Medicine (Discharge Summaries) • Establish cross-department “Rapid Learning” teams
