Blindfolded Record Linkage • Presented by Gautam Sanka • Susan C. Weber, Henry Lowe, Amar Das, Todd Ferris
Introduction and Objectives • Challenges • Patient privacy vs. building cross-site records • Solutions • Mandate that identifiers be disclosed • Privacy officers find this unacceptable • Keep only de-identified information in the registry, but share with third parties an algorithm for generating an anonymous identifier
De-identification Explained • This anonymous identifier will be created in such a way that: • Probability of same identifier generated at two different sites is high for the same person • And low for different people
What can be used? • Using SSN – Bad Idea • Using names and DOB may seem best but: • Nicknames at one site and full name at another • Misspellings • Different Titles (Mr. Ms. Mrs.)
Goal of Project • Breast Cancer Patients at PAMF (Palo Alto Medical Foundation) and Stanford University Medical Center • Merge the Data with de-identification under HIPAA and IRB approval
Interesting Approaches • Bigrams • For the names Ann and Anne • Ann: [AN, NN] • Anne: [AN, NN, NE] • The Dice coefficient is 2 × 2 / (2 + 3) = 4/5 • Bloom Filter • Neither was implemented because of the complexity
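The bigram comparison above can be sketched in a few lines of Python. This is only an illustration of the Dice coefficient on the Ann/Anne example, not code from the project (the slide notes the approach was not implemented). The function names `bigrams` and `dice` are hypothetical, and treating the bigrams as a set (so repeated bigrams count once) is an assumption.

```python
def bigrams(name: str) -> set:
    """Return the set of adjacent two-letter substrings, upper-cased."""
    s = name.strip().upper()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Assumption: two names too short to have bigrams are treated as identical.
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


# Ann -> {AN, NN}, Anne -> {AN, NN, NE}: 2 shared bigrams out of 2 + 3 in total.
score = dice("Ann", "Anne")
```

Here `score` works out to 2 × 2 / 5, matching the 4/5 on the slide.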
A single SHA-1 string was constructed based on • Gender • DOB • Zip • Three-letter prefix of last name • In their case, only the first two letters of patients’ first and last names were used
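A minimal sketch of such a hashed identifier, assuming the variant built from DOB plus the first two letters of the first and last names. The field order, the `|` separator, the upper-casing, and the name `composite_id` are all assumptions made for illustration; the slides do not specify them. Both sites would have to agree on exactly the same normalization for the hashes to match.

```python
import hashlib
from datetime import date


def composite_id(first: str, last: str, dob: date) -> str:
    """SHA-1 hex digest over two-letter name prefixes and DOB (hypothetical layout)."""
    parts = [
        first.strip().upper()[:2],  # first two letters of first name
        last.strip().upper()[:2],   # first two letters of last name
        dob.isoformat(),            # DOB as YYYY-MM-DD
    ]
    key = "|".join(parts)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# "Ann" at one site and "Anne" at another share the prefix "AN",
# so both records map to the same anonymous identifier.
id_a = composite_id("Ann", "Smith", date(1950, 5, 1))
id_b = composite_id("Anne", "Smith", date(1950, 5, 1))
```

Using only a short prefix is what makes the identifier tolerant of many spelling variants later in the name, at the cost of more collisions between different people.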
Composite Identifier • Felt that a combination of DOB and the first two letters of names would uniquely identify a patient • Most applicable when: • Compliance restrictions preclude the exchange of actual identifiers • Total number of comparisons is less than 10^8 • Names and DOB are easily available • DOB has a low error rate
Methods • Measured the rate of false positives in the data • Dropped name prefixes • Dropped records with a DOB of 1/1/1900 or 1/1/1901 • Performed a self-join on three sets of 1.5M rows, 0.5M rows and 10,000 rows
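The self-join above can be approximated by grouping records on the composite key and counting pairs of distinct patients that collide. The sketch below is an assumption-laden illustration: the row layout `(person_id, first, last, dob)`, the `TITLES` list, and reading "name prefixes" as titles such as Mr. or Mrs. are all hypothetical, not taken from the study.

```python
from collections import defaultdict
from datetime import date

TITLES = {"MR", "MS", "MRS", "DR"}  # hypothetical list of name prefixes to drop
DROPPED_DOBS = {date(1900, 1, 1), date(1901, 1, 1)}  # DOBs excluded on the slide


def clean_first_name(name: str) -> str:
    """Upper-case the name and strip leading titles such as 'Mr.'."""
    parts = name.replace(".", " ").upper().split()
    while parts and parts[0] in TITLES:
        parts.pop(0)
    return " ".join(parts)


def count_false_positive_pairs(rows) -> int:
    """Count pairs of distinct person_ids that share one composite key."""
    groups = defaultdict(set)
    for person_id, first, last, dob in rows:
        if dob in DROPPED_DOBS:
            continue
        key = (clean_first_name(first)[:2], last.strip().upper()[:2], dob)
        groups[key].add(person_id)
    # n distinct people under one key yield n * (n - 1) / 2 colliding pairs.
    return sum(len(ids) * (len(ids) - 1) // 2 for ids in groups.values())


rows = [
    (1, "Mr. John", "Smith", date(1950, 5, 1)),
    (2, "Joan", "Smythe", date(1950, 5, 1)),   # collides with person 1
    (3, "Ann", "Lee", date(1900, 1, 1)),       # dropped DOB
    (4, "Anne", "Lewis", date(1900, 1, 1)),    # dropped DOB
]
collisions = count_false_positive_pairs(rows)
```

Grouping by key avoids materializing the full quadratic join, which matters at the 1.5M-row scale mentioned on the slide.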
Measuring False Negatives • Both sites exchanged cryptographic hashes based on SSNs • The number of matches found by SSN but not by composite identifier became the lower bound for false negatives • Removal of all false positives based on real identifiers
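The lower bound described above is a set difference: record pairs linked by the SSN hash but missed by the composite identifier. A rough sketch, assuming each site holds a mapping from a local record ID to a hash value (`None` when the field is missing). The function names and data layout are hypothetical.

```python
def matched_pairs(site_a: dict, site_b: dict) -> set:
    """Return (id_a, id_b) pairs whose hash values are equal; skip missing hashes."""
    index = {}
    for rid, h in site_b.items():
        if h is not None:
            index.setdefault(h, []).append(rid)
    return {
        (a, b)
        for a, h in site_a.items()
        if h is not None
        for b in index.get(h, [])
    }


def false_negative_lower_bound(ssn_pairs: set, composite_pairs: set) -> int:
    """Pairs found via SSN hashes that the composite identifier did not find."""
    return len(ssn_pairs - composite_pairs)


# Toy data: record a2/b2 share an SSN hash but differ in composite identifier.
ssn_a = {"a1": "s1", "a2": "s2", "a3": None}
ssn_b = {"b1": "s1", "b2": "s2", "b3": "s9"}
comp_a = {"a1": "c1", "a2": "c2", "a3": "c3"}
comp_b = {"b1": "c1", "b2": "cX", "b3": "c3"}

bound = false_negative_lower_bound(
    matched_pairs(ssn_a, ssn_b), matched_pairs(comp_a, comp_b)
)
```

It is only a lower bound because a true match with a missing or mistyped SSN at either site is invisible to both methods.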
Sensitivity: • Specificity:
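For reference, the standard definitions of these two measures are given below. The symbols are notation introduced here, not taken from the slides: TP and FN are true matches that were found and missed, and TN and FP are non-matching pairs that were correctly rejected and wrongly linked.

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]
```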
“This was a very interesting result in that it provided us with a measure of how much better our approach is compared to using full names rather than two-letter prefixes.”
Reasons for False Negatives in Composite Identification • Found by SSN and later confirmed manually
Simply Using SSN • SSNs found only 1806 out of 2028 • Rate of false negatives is 10% higher than with a composite identifier • Reasons • 172 of the 222 false negatives had a missing SSN
What about the other 50? • In conclusion • 57 false positives for SSN matches • 3 false positives for the composite identifier • Roughly 20 times worse
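The SSN-only figures on the last two slides can be checked with a few lines of arithmetic. The numbers are the ones quoted on the slides. The variable names are illustrative, and reading 2028 as the total number of true matches is an assumption.

```python
total_matches = 2028   # assumed: all true matches between the two sites
found_by_ssn = 1806    # matches recovered using SSN alone
missing_ssn = 172      # misses explained by a missing SSN

missed = total_matches - found_by_ssn   # false negatives for SSN-only matching
miss_rate = missed / total_matches      # fraction of matches SSN alone misses
unexplained = missed - missing_ssn      # "the other 50"

fp_ssn = 57            # false positives among SSN matches
fp_composite = 3       # false positives for the composite identifier
fp_ratio = fp_ssn / fp_composite        # how many times worse SSN is

print(missed, round(miss_rate, 3), unexplained, fp_ratio)
```

The ratio comes out at 19, which the slide rounds to "20 times worse".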
When should we use this tool? • Most useful where privacy policies preclude the full exchange of the identifiers required by more sophisticated and sensitive linkage algorithms • For data sets of high quality, this approach (in comparison to complex algorithms): • Is easy to explain • Adheres to the minimum rules set by HIPAA • Is faster and less cumbersome