Issues in Deterministic and Probabilistic Record Linkage
Scott DuVall
Salt Lake City VHA MC
the age of information
informatician information = information = information
Missing information affects patient care.¹
• Transitions in care cause breakdowns in communication.²
¹ Stiell et al. Prevalence of information gaps in the emergency department and the effect on patient outcomes. CMAJ 2003;169(10):1023-8.
² Coleman et al. Lost in transition: challenges and opportunities for improving the quality of transitional care. Ann Intern Med 2004;141(7):533-6.
Resolving duplicates can cost $60 per case.¹
¹ Thornton SN, Hood SK. Reducing Duplicate Patient Creation Using a Probabilistic Matching Algorithm in an Open-access Community Data Sharing Environment. Proc AMIA Symp 2005:1135.
“between $0.30 and $0.40 of every dollar spent on health care is wasted on overuse, under use, misuse, duplication, system failures, unnecessary repetition, poor communications and inefficiency.”¹
¹ Reid PP, Compton WD, Grossman JH, Fanjiang G. Building a Better Delivery System: A New Engineering/Health Care Partnership. National Academies Press, 2005:99.
Record linkage is a key element of health care information exchange and interoperability, which is estimated to be able to reduce costs by $77.8 billion annually.¹
¹ Walker J, Pan E, Johnston D, Adler-Milstein J, Bates DW, Middleton B. The value of health care information exchange and interoperability. Health Aff (Millwood). 2005 Jan-Jun;Suppl Web Exclusives:W5-10-W5-18.
Record Matching
• Many systems have record matching software.
• Errors still exist:
  • 50% missed in CDC survey¹
  • 5% missed in 1.5 million records = 75,000²
¹ User Manual for the CDC Deduplication Evaluation Toolkit.
² Snow LA, DuVall SL. Clinical Data Exchange Through A Looking Glass: A Gray-Box Approach To Record Linkage. NLM 2005.
probability score Score Is Not Probability
Name + Date of Birth + Social Security Number MPI
Deterministic Linkage
1) IF r1.social_security_number = r2.social_security_number THEN match.
2) IF SoundexCompare(r1.last_name, r2.last_name)
   AND SoundexCompare(r1.first_name, r2.first_name)
   AND EditDistance(r1.birth_place, r2.birth_place) < 2
   AND r1.birth_date = r2.birth_date
   AND r1.multiplicity = r2.multiplicity
   AND r1.birth_order = r2.birth_order
   THEN match.
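The two rules on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the `Record` class, the simplified Soundex routine, and the guard against empty Social Security numbers in rule 1 are all assumptions made for the example.

```python
from dataclasses import dataclass


def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters_only = [c for c in name.upper() if c.isalpha()]
    if not letters_only:
        return ""
    result = letters_only[0]
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = code
    return (result + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


@dataclass
class Record:  # hypothetical record layout, named after the slide's fields
    social_security_number: str
    last_name: str
    first_name: str
    birth_place: str
    birth_date: str
    multiplicity: int
    birth_order: int


def is_match(r1: Record, r2: Record) -> bool:
    # Rule 1: identical SSN. Ignoring empty SSNs is an added assumption.
    if r1.social_security_number and \
            r1.social_security_number == r2.social_security_number:
        return True
    # Rule 2: phonetic names, near-identical birth place, exact other fields.
    return (soundex(r1.last_name) == soundex(r2.last_name)
            and soundex(r1.first_name) == soundex(r2.first_name)
            and edit_distance(r1.birth_place, r2.birth_place) < 2
            and r1.birth_date == r2.birth_date
            and r1.multiplicity == r2.multiplicity
            and r1.birth_order == r2.birth_order)
```

With these rules, a pair such as Smith/Smyth with a one-character typo in the birth place still links through rule 2, while any difference in birth date blocks the match unless the SSNs agree.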
IF contains(0..9) THEN NUMBER
IF contains(North, South, East, West) THEN DIRECTION
IF contains(Street, Road, Lane, Drive, ...) THEN STREET_TYPE
ELSE STREET_NAME
IF (r1.NUMBER = r2.NUMBER) AND (r1.DIRECTION = r2.DIRECTION) AND (r1.STREET_NAME = r2.STREET_NAME) AND (r1.STREET_TYPE = r2.STREET_TYPE) THEN MATCH
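A rough Python rendering of these address rules is shown below. It is a sketch under stated assumptions: the function names are hypothetical, tokens are split on whitespace and lower-cased, and the street-type set contains only the four types the slide names, because the slide's list is truncated with "...".

```python
DIRECTIONS = {"north", "south", "east", "west"}
# Only the types named on the slide; the original list continues ("...").
STREET_TYPES = {"street", "road", "lane", "drive"}


def parse_address(address: str) -> dict:
    """Assign every token to NUMBER, DIRECTION, STREET_TYPE or STREET_NAME."""
    parts = {"NUMBER": [], "DIRECTION": [], "STREET_TYPE": [], "STREET_NAME": []}
    for token in address.replace(",", " ").split():
        word = token.strip(".").lower()
        if any(ch.isdigit() for ch in word):
            parts["NUMBER"].append(word)
        elif word in DIRECTIONS:
            parts["DIRECTION"].append(word)
        elif word in STREET_TYPES:
            parts["STREET_TYPE"].append(word)
        else:
            parts["STREET_NAME"].append(word)
    return {label: " ".join(words) for label, words in parts.items()}


def addresses_match(a: str, b: str) -> bool:
    """Deterministic rule: all four parsed components must be equal."""
    return parse_address(a) == parse_address(b)
```

Because the comparison is exact, "123 North Main Street" matches "123 Main Street North" but not "123 N Main St": the abbreviations fall through to STREET_NAME, which illustrates how brittle such rules are without standardization.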
Probabilistic Linkage
• Each field is given an AGREEMENT weight and a DISAGREEMENT weight.
• Each weight is proportional to the field’s DISCRIMINATION and RELIABILITY.
• Many more parameters, with the possibility of better matching.
Record Matching
Understand your Data + Understand Mistakes in your Data
Good Strategy for Linkage
MANUAL REVIEW
Understanding the Data
• Compare characteristics of records in the duplicate subset with records in the full enterprise data warehouse.
• Describe instances where records in the duplicate subset are not typical of the database at large.
• Provide considerations for others looking at duplicate records in master patient indexes.
Extension of the Probabilistic Model for Approximate Field Comparators
Probabilistic Model
Field in Record A = Field in Record B → Agreement Weight
Field in Record A ≠ Field in Record B → Disagreement Weight
$M$: probability that the field matches in a duplicate pair
$U$: probability that the field matches in a non-duplicate pair
Agreement Weight = $\log\left(\dfrac{M}{U}\right)$
Disagreement Weight = $\log\left(\dfrac{1-M}{1-U}\right)$
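The weight formulas can be tried out with a few lines of Python. This is only a sketch: the slide does not state the base of the logarithm, so base 2 is assumed here, and the field names and M/U values in `FIELD_PARAMS` are invented for illustration.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """log(M / U); base 2 is an assumption, the slide only says LOG."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """log((1 - M) / (1 - U))."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical (M, U) values per field, not taken from the presentation.
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "sex": (0.99, 0.5),
}


def score_pair(agreements: dict) -> float:
    """Sum the agreement or disagreement weight of every field."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if agreements[field]:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total
```

Under these assumed values, agreement on sex contributes far less than agreement on last name, because sex agrees by chance in about half of all non-duplicate pairs (U = 0.5), which is the "discrimination" idea from the earlier slide.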
Approximate Comparator
Edit Distance: ED(Johnathan, Jonathan) = 1
Approximate Comparator Weight = $\log\left(\dfrac{M_\delta}{U_\delta}\right)$
$M_\delta$: probability that the field approximately matches by $\delta$ in a duplicate pair
$U_\delta$: probability that the field approximately matches by $\delta$ in a non-duplicate pair
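One way to obtain such weights is to count how often each edit distance occurs in known duplicate and non-duplicate pairs. The sketch below assumes that "matches by δ" means an edit distance of exactly δ, that distances of `max_delta` or more share one final bin, and that add-one smoothing and a base-2 logarithm are acceptable; none of these choices is stated in the slides, and `delta_weights` is a hypothetical name.

```python
import math
from collections import Counter


def delta_weights(dup_distances, nondup_distances, max_delta):
    """Return {delta: log2(M_delta / U_delta)} for delta = 0..max_delta.

    The last bin collects every distance >= max_delta. Add-one smoothing
    keeps both probabilities strictly positive.
    """
    dup_counts = Counter(min(d, max_delta) for d in dup_distances)
    non_counts = Counter(min(d, max_delta) for d in nondup_distances)
    n_bins = max_delta + 1
    weights = {}
    for delta in range(n_bins):
        m_delta = (dup_counts[delta] + 1) / (len(dup_distances) + n_bins)
        u_delta = (non_counts[delta] + 1) / (len(nondup_distances) + n_bins)
        weights[delta] = math.log2(m_delta / u_delta)
    return weights
```

In this setup an exact match (δ = 0) would typically receive the largest weight, small distances a reduced but still positive weight, and large distances a negative weight, so the single agree/disagree decision becomes a graded one.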
Load and randomize training set → Initial Parameters → Classify with estimated parameters → Estimate Dups and Non-Dups → Update Parameters
Load and randomize training set → Updated Parameters → Classify with updated parameters → Re-estimate Dups and Non-Dups → Update Parameters
Load and randomize validation set → Training Set Parameters → Classify with training set parameters → Classified Dups and Non-Dups
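The training and validation loop on these slides might look roughly like the following. This is one possible reading, not the presenter's code: it uses hard classification at a score threshold of 0, add-one smoothing, and binary agreement vectors, all of which are assumptions. The classical alternative for estimating M and U is the EM algorithm with soft assignments, which is not what this sketch does.

```python
import math
import random


def pair_score(vector, m, u):
    """Sum of agreement / disagreement weights for one comparison vector."""
    score = 0.0
    for agrees, m_k, u_k in zip(vector, m, u):
        if agrees:
            score += math.log2(m_k / u_k)
        else:
            score += math.log2((1 - m_k) / (1 - u_k))
    return score


def classify(vectors, m, u, threshold=0.0):
    """True = estimated duplicate. The threshold of 0 is an assumption."""
    return [pair_score(v, m, u) > threshold for v in vectors]


def update(vectors, labels, n_fields):
    """Re-estimate M and U per field from the current dup / non-dup split."""
    dups = [v for v, is_dup in zip(vectors, labels) if is_dup]
    nons = [v for v, is_dup in zip(vectors, labels) if not is_dup]
    m = [(sum(v[k] for v in dups) + 1) / (len(dups) + 2) for k in range(n_fields)]
    u = [(sum(v[k] for v in nons) + 1) / (len(nons) + 2) for k in range(n_fields)]
    return m, u


def train(vectors, m, u, max_iter=20, seed=0):
    """Classify and update parameters until the labels stop changing."""
    vectors = list(vectors)
    random.Random(seed).shuffle(vectors)  # "load and randomize training set"
    labels = None
    for _ in range(max_iter):
        new_labels = classify(vectors, m, u)
        if new_labels == labels:
            break
        labels = new_labels
        m, u = update(vectors, labels, len(m))
    return m, u
```

For validation, the parameters returned by `train` are held fixed and passed to `classify` on a separate set of comparison vectors, mirroring the last slide.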