De-identification: A Critical Success Factor in Clinical and Population Research

Steven Merahn, MD, and Dee Lang, RHIT. Prepared for 2007 APIII, Pittsburgh, PA, September 10, 2007.

Major gaps exist today between patient care, clinical research, and evidence-based medicine.

Sharing Data is the Key • “Amassing large quantities of anonymized clinical and non-clinical information from medical records and reports and analyzing that data for patterns and other observations (is the best way) to support continuous quality improvement, shape best practices and inform clinical and population-based decision making” • A Rapid Learning Health System. Health Affairs 26(2), January 2007

Processing Predicated on Protecting Patient Privacy • Clinical records can be an important source of information…most of the information in these records is in the form of free text and extracting useful information from them requires automatic processing (e.g., index, semantically interpret, and search). A prerequisite to the distribution of clinical records outside of hospitals, be it for Natural Language Processing (NLP) or medical research, is de-identification • J Am Med Inform Assoc. 2007;14:550-563. DOI 10.1197/jamia.M2444.

Problems to Solve • Sources of data • Protecting patient privacy • Creating and maintaining a corpus of HIPAA-compliant, searchable data • Building collaborations; creating networks of institutions sharing data • Emerging patient “data rights” issues

Sources of Data • EMR/CIS systems • Large amounts of free text; not all data is parsed or field-limited • Transcribed Records and Reports • Even in systems without CIS, most transcriptions are delivered as electronic files • Pathology Reports (cf. caTIES) • Surgical Notes • Radiology Reports • Discharge Summaries • No need to wait for an EMR to create a Rapid Learning Health System (RLHS)

Protecting Patient Privacy • De-identification is a well-defined, but limited, step in a broader research workflow or protocol • The defined nature of the step includes managing individually identifiable information in records and reports • Such schemas include redaction, elimination, categorical replacement (e.g., place, age range), replacement with proxies (Dr X), and offsets (day 1) • A process that must be constantly “tuned” in response to dynamic input variables and patterns of documentation
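The replacement schemas listed above can be illustrated with a minimal Python sketch. It assumes the identifying values have already been located in the text; the function and class names (`redact`, `age_range`, `ProxyMap`, `day_offset`) are hypothetical and chosen only for illustration, and numbered proxies ("Dr 1") stand in for the "Dr X" style mentioned on the slide.

```python
from datetime import date


def redact(text, span, label="[REDACTED]"):
    """Redaction: replace every occurrence of an identifying span with a label."""
    return text.replace(span, label)


def age_range(age, width=10):
    """Categorical replacement: map an exact age to a range such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


class ProxyMap:
    """Proxy replacement: give each distinct name a stable stand-in (Dr 1, Dr 2, ...)."""

    def __init__(self, prefix="Dr"):
        self.prefix = prefix
        self._seen = {}

    def proxy(self, name):
        # Normalize so that 'Smith' and ' smith ' map to the same proxy.
        key = name.strip().lower()
        if key not in self._seen:
            self._seen[key] = f"{self.prefix} {len(self._seen) + 1}"
        return self._seen[key]


def day_offset(event, anchor):
    """Offset replacement: express a date relative to an anchor date (anchor = day 1)."""
    return f"day {(event - anchor).days + 1}"


if __name__ == "__main__":
    proxies = ProxyMap()
    print(redact("MRN 12345 seen in clinic", "12345"))
    print(age_range(47))
    print(proxies.proxy("Smith"), proxies.proxy("Jones"), proxies.proxy("smith"))
    print(day_offset(date(2007, 9, 12), date(2007, 9, 10)))
```

Keeping the proxy map and the anchor date inside the firewall is one way a trusted party could later reverse the mapping; that design choice is an assumption of this sketch, not something the slides specify.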

[Figure: de-identification workflow diagram. Labels: Firewall; Transcribed Reports; CIS; De-identification Methodology; QA; De-identified Database; De-identified Data; Query Interface; NLP; Other processes; Re-ID Method; Trusted Proxy Admin]

Considerations • When choosing a de-identification methodology, four things need consideration • What is the reliability and validity of the methodology? • Can the method maintain its specificity and sensitivity in local use? • What are the limitations of the methodology? • Can files be re-identified?

Consistency, Reliability and Validity • Fundamental problems are inter-record reliability, manpower resources, and time constraints • The issue then becomes the quality of the quality: over-marking (which lowers specificity) and under-marking (which lowers sensitivity) • What are acceptable levels of sensitivity and specificity? • 100% sensitivity for names • What is the benchmark? • What is the value of consistency?
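Sensitivity and specificity, as used above, can be computed by comparing a method's token-level marks against a manual review. The sketch below assumes each token carries a boolean (True means "marked as an identifier"); the function name `marking_metrics` and the input layout are illustrative assumptions.

```python
def marking_metrics(gold, predicted):
    """Compare predicted identifier marks against a manually reviewed gold standard.

    gold, predicted: equal-length sequences of booleans, one per token,
    where True means the token is (or was marked as) an identifier.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # identifier correctly marked
    fn = sum(1 for g, p in pairs if g and not p)      # under-marking
    fp = sum(1 for g, p in pairs if not g and p)      # over-marking
    tn = sum(1 for g, p in pairs if not g and not p)  # non-identifier left alone

    # Sensitivity falls with under-marking; specificity falls with over-marking.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "sensitivity": sensitivity, "specificity": specificity}


if __name__ == "__main__":
    gold = [True, True, False, False, False]
    predicted = [True, False, True, False, False]
    print(marking_metrics(gold, predicted))
```

A missed name (false negative) is a privacy failure, while an over-marked clinical term (false positive) only costs research value, which is one reason a target such as 100% sensitivity for names may be set separately from the specificity target.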

Automated Methodologies: As Good As? Better? • Classification of tokens • Sequence tracking problem (using Hidden Markov Models or Conditional Random Fields) • Rule-based systems utilizing global features (sentence position), local features (lexical cues, special characters, and format patterns), and syntactic features • Hybrid systems of rules, pattern-matching algorithms, heuristics and dictionaries
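As a rough illustration of the hybrid approach (not the HMM or CRF sequence models), the following Python sketch classifies tokens using a small dictionary, format patterns, and one lexical cue. The name list, title cues, regular expressions, and label set are all hypothetical; a real system would need far larger resources and local tuning.

```python
import re

# Hypothetical resources; a real deployment would load much larger local lists.
NAME_DICTIONARY = {"smith", "jones"}
TITLE_CUES = {"dr", "dr.", "mr", "mr.", "mrs", "mrs.", "ms", "ms."}
DATE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")


def classify_tokens(text):
    """Label each whitespace-separated token as DATE, PHONE, NAME, or O (other)."""
    tokens = text.split()
    labeled = []
    for i, token in enumerate(tokens):
        core = token.strip(".,;:()")  # ignore surrounding punctuation
        previous = tokens[i - 1].lower() if i > 0 else ""
        if DATE_PATTERN.match(core):
            label = "DATE"    # format pattern
        elif PHONE_PATTERN.match(core):
            label = "PHONE"   # format pattern
        elif core.lower() in NAME_DICTIONARY:
            label = "NAME"    # dictionary lookup
        elif previous in TITLE_CUES and core[:1].isupper():
            label = "NAME"    # lexical cue: capitalized word after a title
        else:
            label = "O"
        labeled.append((token, label))
    return labeled


if __name__ == "__main__":
    for token, label in classify_tokens(
        "Dr. Patel saw Smith on 9/10/2007, call 412-555-0100."
    ):
        print(f"{token}\t{label}")
```

The lexical cue catches a name ("Patel") that is absent from the dictionary, which shows why such hybrids combine several evidence sources rather than relying on a list alone.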

Local Use • Can your methodology be customized to meet local needs? • While some methods may have good ‘numbers’, will they hold up in local use? • Every community has its own acronyms, place names and other local vocabulary • What is the protocol to manage local quality? • Regular checks against manual review • Formal evaluation research

“Data Rights” Issues • Legal models exist • Make “de-identified” data sharing part of informed consent • Offer different tiers of consent • Publicly funded research • Academic research • Commercial research • Make the general public aware of the level of existing data sharing • Claims data is already widely shared and sold

Building Collaboration • [Figure: collaboration diagram. Labels: Firewall; Query Interface; De-identified Database; QA]

Call to Action: Pathology Informatics Community • caBIG and caTIES are models for cross-institutional data sharing • Major institutions are establishing data repositories of pathology reports • Help facilitate data aggregation among other departments • Radiology (Radiology Reports) • Surgery (Surgical Notes) • Medicine (Discharge Summaries) • Establish cross-departmental “Rapid Learning” teams
