Learn about probabilistic record linkage concepts and techniques at the LinkPlus Summer Institute held in Portland, OR. This software training will cover deterministic matching, probabilistic matching, advantages of LinkPlus, and how to use it effectively. Join us for linkage exercises and open discussion.
Probabilistic Record Linkage Software LinkPlus Summer Institute, OHSU Portland, OR June 2015
CDC–NPCR Faculty Melissa Jim, CDC E-mail: mjim@cdc.gov or melissa.jim@ihs.gov 404-718-2602 David Espey, CDC/IHS E-mail: despey@cdc.gov 404-718-2603
Acknowledgements Kathleen Thoburn, CDC/NPCR Contractor KThoburn@cdc.gov Peter Kim, CDC/NPCR Contractor PWKim@cdc.gov David Gu, Developer
Training Outline • Brief Overview of Record Linkage • Deterministic Matching • Probabilistic Matching • Advantages of LinkPlus • Using LinkPlus • Linkage Exercises • Open Discussion
Overview of Record Linkage • “Record Linkage” aka “Matching” aka “Merge” • Combining information from a variety of data sources for the same individual • Merge information from a record in one data source (file 1) with information from another data source (file 2) • Example: merging cancer information from cancer registry file with death information from vital statistics file
Overview of Record Linkage • Can be accomplished manually, by visually comparing records from two separate sources • Manual approach becomes time consuming, tedious, inefficient, and impractical as the number of records in file 1 and file 2 increases • Technological advances in computer systems and programming techniques make computerized record linkage between large files economically feasible • Computerized linkage is efficient and relatively accurate
Duplicate Detection • Fundamental requirement for accuracy and validity of counts in any disease registry • Example: National Program of Cancer Registries/ North American Association of Central Cancer Registries standard • Maintain <= 0.1% (<=1 per 1,000) duplicates
Examples from cancer registry: Follow-Up • Death Clearance – State vital statistics file • Hospital discharge data – Statewide file • Department of Motor Vehicles – Drivers' licenses and renewals • Social Security Death Master – SSA maintained file of death benefit claims • Medicare/Medicaid – Files of state enrollees • Voter Registration/Voter History – Statewide file of last 6 elections • National Change of Address (U.S. Postal Service) – File of individuals reporting change of address in last 3 years
IHS Linkage examples • Show IHS Linkage presentation
Deterministic Matching • Computerized comparison where EVERYTHING needs to match EXACTLY:
Deterministic Matching • Often slight variations exist in the data between the two files for the same variables: • Or variables are missing from one of the files: • These variations would prevent a match from being identified
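The effect of a slight variation on deterministic matching can be sketched in a few lines of Python. This is an illustration only: the field names and record values are made up for the example, and the function is not part of Link Plus.

```python
def deterministic_match(rec1, rec2, fields):
    """Return True only if every listed field agrees exactly."""
    return all(rec1.get(f) == rec2.get(f) for f in fields)

# Hypothetical field names and values, for illustration only.
FIELDS = ["last_name", "first_name", "ssn", "dob"]

registry = {"last_name": "SMITH", "first_name": "JOHN",
            "ssn": "123456789", "dob": "19500312"}
vital_exact = dict(registry)                  # identical record
vital_typo = dict(registry, dob="19500321")   # day digits transposed

print(deterministic_match(registry, vital_exact, FIELDS))  # every field agrees
print(deterministic_match(registry, vital_typo, FIELDS))   # one field differs
```

A single transposed digit in the date of birth is enough to make the second comparison fail, even though a human reviewer would probably accept the pair.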
Deterministic Matching: Manual Review • When we manually review, we use intuition to help us identify positive matches for records containing slight variations in, or missing information for, data between the two files for the same variables • Example: a typo in SSN or a transposition of digits in the day component of DOB; a reviewer would still deem the pair a match
Probabilistic Matching • Translating intuition into formal decision rules • Use the concept of PROBABILITY and perform PROBABILISTIC matching • Recommended over traditional deterministic (exact matching) methods when the data contain coding errors, reporting variations, missing data, or duplicate records • Estimate probability/likelihood that two records are for the same person versus not
Probabilistic Matching • Find the records in File 2 that seem to match records in File 1 • Calculate a score that indicates, for any pair of records, how likely it is that they both refer to the same person • Sort the likely and possible matched pairs in order of their scores • Define two thresholds (upper and lower Cut Off values) for automatically accepting and rejecting a potential link • Discard unlikely matched pairs (scores below the lower, 2nd Cut Off) • Gray area: range of scores between the two cut off values considered uncertain matches • Manually review uncertain matches
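The sort-and-threshold step can be sketched as follows. The cut off values, record identifiers, and scores below are hypothetical; in practice the cut offs are chosen by the user after looking at the score distribution.

```python
def classify(score, upper_cutoff, lower_cutoff):
    """Place a scored pair into one of three groups using two cut off values."""
    if score >= upper_cutoff:
        return "match"        # accepted automatically
    if score < lower_cutoff:
        return "non-match"    # discarded
    return "review"           # gray area: manual review

# Hypothetical scored pairs: (File 1 id, File 2 id, score).
pairs = [("A1", "B7", 4.2), ("A2", "B3", 18.5), ("A3", "B9", 9.0)]

# Sort from highest to lowest score, then classify each pair.
ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
results = [(a, b, classify(s, upper_cutoff=15.0, lower_cutoff=7.0))
           for a, b, s in ranked]

for row in results:
    print(row)
```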
Probabilistic Matching • The total score for a linkage between any two records is the sum of the scores generated from matching individual fields • The score assigned to a matching of individual fields is: • Based on the probability that a matching variable agrees given that a comparison pair is a match • M Probability - similar to "sensitivity" • Reduced by the probability that a matching variable agrees given that a comparison pair is not a match • U Probability - similar to 1 minus "specificity" (the false-positive rate)
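One common way to turn M and U probabilities into field scores is the log-ratio weighting from the Fellegi-Sunter model. The slides do not give a formula, so the sketch below is an assumption about the general technique rather than a description of what Link Plus computes internally; the M and U values are invented for illustration.

```python
import math

def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (normally negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))

# Hypothetical (m, u) values, chosen only for illustration.
FIELD_PROBS = {"ssn": (0.90, 0.0001), "sex": (0.95, 0.50)}

def total_score(agreements):
    """Sum field weights; `agreements` maps field name -> True/False."""
    score = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBS[field]
        score += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return score

print(total_score({"ssn": True, "sex": True}))
print(total_score({"ssn": False, "sex": True}))
```

Under these assumed values, agreement on SSN contributes far more than agreement on Sex, because SSN rarely agrees by chance between two different people.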
Probabilistic Matching • Agreement argues for linkage • Disagreement argues against linkage • Full agreement argues more strongly for linkage than partial agreement • Some types of partial agreements are stronger than others • Rare surname versus Sex
Probabilistic Matching • Agreement on an uncommon value argues more strongly for linkage than a common value • Espey versus Smith • Agreement on a more specific variable argues more strongly for linkage than agreement on a less specific one • SSN versus Sex • Agreement on more variables/disagreement on few argues for linkage
Probabilistic Matching • Once comparisons are made, a weight is calculated for each field comparison • A total weight (or “score”) is derived by summing these separate field comparisons across all fields being compared • Probabilistic weights are • Field-specific – Birth date versus Sex • Value-specific - “Jane” versus “Janiqua”
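Value-specific weighting can be sketched by estimating the chance agreement rate of each value from its relative frequency in the file. This is an assumed, simplified approach (it reuses a log2(m / u) style weight with a fixed, hypothetical m of 0.95) and is not taken from the Link Plus documentation; the surname counts are invented.

```python
import math
from collections import Counter

def value_specific_weights(values, m=0.95):
    """Agreement weight per value, using relative frequency as the chance
    agreement rate. The default m is a hypothetical placeholder."""
    counts = Counter(values)
    total = len(values)
    return {v: math.log2(m / (c / total)) for v, c in counts.items()}

# Invented surname distribution: one rare name among common ones.
surnames = ["SMITH"] * 50 + ["JONES"] * 30 + ["ESPEY"] * 1 + ["OTHER"] * 19
weights = value_specific_weights(surnames)

for name in ("SMITH", "ESPEY"):
    print(name, round(weights[name], 2))
```

Agreement on the rare surname earns a larger weight than agreement on the common one, which mirrors the "Espey versus Smith" intuition.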
Review • Applications of data linkages • Deterministic vs Probabilistic linkages • Weights: • For individual variable (M and U prob.) • Total weight (“score”)
Linkage basics • Blocking variables • Matching variables • Advantages of LinkPlus • Using LinkPlus
Concept of Blocking • With so many comparisons, large files can make impossible resource demands • Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files • Sort and match the two files by one or more identifying (“blocking”) variables • Comparisons subsequently made only within blocks • Discard very unlikely record-pairings from the start
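A minimal sketch of blocking, assuming records are plain dictionaries and the blocking variable is a hypothetical `last` field: File 2 is indexed by the blocking value, and each File 1 record is compared only with File 2 records in the same block.

```python
from collections import defaultdict

def candidate_pairs(file1, file2, key):
    """Yield only the record pairs that share the same blocking value."""
    index = defaultdict(list)
    for rec in file2:
        index[rec[key]].append(rec)
    for rec1 in file1:
        for rec2 in index.get(rec1[key], []):
            yield rec1, rec2

# Invented example records.
file1 = [{"id": 1, "last": "SMITH"}, {"id": 2, "last": "JONES"},
         {"id": 3, "last": "ESPEY"}]
file2 = [{"id": "a", "last": "SMITH"}, {"id": "b", "last": "SMITH"},
         {"id": "c", "last": "JONES"}, {"id": "d", "last": "KIM"}]

blocked = list(candidate_pairs(file1, file2, "last"))
print(len(file1) * len(file2), "comparisons without blocking")
print(len(blocked), "comparisons with blocking")
```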
Blocking Variables • Sock Color: 7 of 13 socks fall outside pattern block • 6 of 13 socks within pattern block
Matching Within Blocks • Blocking: Color • Matching: Size & Pattern • High Score • Gray Area • Low Score
Link Plus Software • Stand-alone probabilistic record linkage program • Combines ease of use and statistical sophistication • Detects duplicates within a single database, or links 2 database files • Supports North American Association of Central Cancer Registries files, fixed width files, delimited files, and CRS Plus database
Link Plus Software • Can handle missing values of matching variables • automatically treats null or empty values as missing data and allows user to indicate additional values to be treated as missing data • Facilitates blocking ("OR blocking") by indexing the variables and comparing the pairs with the identical values on at least one of those variables • Provides support for manual review of uncertain matches
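The "OR blocking" idea, where a pair is compared if it has identical values on at least one blocking variable, can be sketched as below. This is an illustrative reimplementation under simple assumptions (dictionary records, empty strings treated as missing), not the Link Plus indexing code.

```python
from collections import defaultdict

def or_block(file1, file2, keys):
    """Return index pairs (i, j) that agree on at least one blocking key.
    Empty or absent values are treated as missing and never block together."""
    pairs = set()
    for key in keys:
        index = defaultdict(list)
        for j, rec2 in enumerate(file2):
            if rec2.get(key):
                index[rec2[key]].append(j)
        for i, rec1 in enumerate(file1):
            value = rec1.get(key)
            if value:
                for j in index.get(value, []):
                    pairs.add((i, j))   # the set removes repeated pairs
    return sorted(pairs)

# Invented records; the second File 1 record has a missing SSN.
file1 = [{"last": "SMITH", "ssn": "111"}, {"last": "JONES", "ssn": ""}]
file2 = [{"last": "SMYTH", "ssn": "111"}, {"last": "JONES", "ssn": "222"},
         {"last": "SMITH", "ssn": "111"}]

or_pairs = or_block(file1, file2, ["last", "ssn"])
print(or_pairs)
```

The SMITH/SMYTH pair is still brought in for comparison because the SSNs agree, even though the last names differ.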
Link Plus Is Free $0.00
Link Plus Is Easy To Use Link Plus gets you from HERE: Cancer Registry data for John Smith: Vital Statistics data for John Smith:
Link Plus Is Easy To Use To HERE: Linked data for John Smith:
Link Plus Is Easy To Use Without having to go HERE:
Link Plus Is Easy To Use • Designed especially for cancer registry work • HOWEVER, can be used with any data • Mathematics largely hidden from user • Practical default values supplied for many tasks • Familiar Windows interface • Includes Help and test examples
Getting Started • Make sure you know your data! • Review and clean data files • Frequency distributions very helpful • Look for errant values • e.g. - DOB month component = 16 - DOB day component = 32 - DOB year component = 732
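A frequency distribution and a simple range check can surface errant date components like those listed above. The sketch assumes DOB is stored as an 8-character `YYYYMMDD` string and that plausible birth years fall between 1850 and 2015; both are assumptions for the example, not Link Plus requirements.

```python
from collections import Counter

def month_frequencies(dobs):
    """Frequency distribution of the month component (characters 5-6)."""
    return Counter(d[4:6] for d in dobs)

def flag_errant(dob, min_year=1850, max_year=2015):
    """List which components of a YYYYMMDD string fall outside valid ranges."""
    year, month, day = int(dob[0:4]), int(dob[4:6]), int(dob[6:8])
    problems = []
    if not 1 <= month <= 12:
        problems.append("month")
    if not 1 <= day <= 31:
        problems.append("day")
    if not min_year <= year <= max_year:
        problems.append("year")
    return problems

# Invented values reproducing the errant examples: month 16, day 32, year 732.
dobs = ["19500312", "19501605", "19500332", "07320101"]
print(month_frequencies(dobs))
for d in dobs:
    print(d, flag_errant(d))
```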
Data Cleaning Tips • Last Name • Link Plus automatically cleans punctuation and strips off suffixes and numerals (e.g., III) • First Name • May find Dr. Bill or Rev Bill or Sister Mary • Remove prefix in First Name field • Middle Name • Link Plus automatically cleans numbers, weird symbols • NMI (no middle initial) or NMN (no middle name) • DOB • Review day, month, and year components • Replace errant values with missing • Sex • Make sure files use same coding convention: M, F, or Blank OR 1, 2, 9
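A few of these cleaning steps can be sketched in Python. The prefix list, the treatment of NMI/NMN as missing, and the mapping of M/F/blank onto 1/2/9 are assumptions made for the example; adjust them to the coding conventions of the files being linked.

```python
import re

PREFIXES = {"DR", "REV", "SISTER"}        # illustrative list, not exhaustive
NO_MIDDLE = {"NMI", "NMN"}                # placeholders meaning "no middle name"
SEX_MAP = {"M": "1", "F": "2", "": "9",   # assumed mapping between conventions
           "1": "1", "2": "2", "9": "9"}

def clean_first_name(name):
    """Drop punctuation and any leading title such as Dr. or Rev."""
    tokens = re.sub(r"[^A-Za-z ]", " ", name).upper().split()
    while tokens and tokens[0] in PREFIXES:
        tokens.pop(0)
    return " ".join(tokens)

def clean_middle_name(name):
    """Treat NMI/NMN placeholders as missing (empty string)."""
    value = name.strip().upper()
    return "" if value in NO_MIDDLE else value

def recode_sex(value):
    """Map either coding convention onto 1/2/9; unknown codes become 9."""
    return SEX_MAP.get(value.strip().upper(), "9")

print(clean_first_name("Dr. Bill"), clean_first_name("Sister Mary"))
print(repr(clean_middle_name("NMI")), recode_sex("f"), recode_sex(" "))
```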
Data Cleaning Example • Show in SAS
Linkage Overview Two main types of linkage: • Linkage of 2 files • Probabilistically link one file to another file • Deduplication • Special case of record linkage • Records in the same file are blocked, compared, and scored against each other • Result is a ranked list of record pairs • High-scoring pairs may be duplicates
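Deduplication can be sketched as blocking a file against itself and pairing each record only with later records in the same block, so that no pair is generated twice and no record is paired with itself. The records and the `last` blocking field are invented for illustration; scoring of the pairs is omitted.

```python
from collections import defaultdict
from itertools import combinations

def duplicate_candidates(records, key):
    """Return index pairs (i, j), i < j, for records sharing a blocking value."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        if rec.get(key):                 # skip missing blocking values
            blocks[rec[key]].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

# Invented single-file example.
records = [{"last": "SMITH", "first": "JOHN"},
           {"last": "JONES", "first": "MARY"},
           {"last": "SMITH", "first": "JON"},
           {"last": "", "first": "ANN"}]

dup_pairs = duplicate_candidates(records, "last")
print(dup_pairs)
```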
File Import: File 1 versus File 2 • File 1 and File 2 have a one to many relationship • File 1 (ONE); File 2 (MANY) • Death record is File 1; cancer registry is File 2: • One John Smith in death record file can link to many John Smiths in cancer registry file • Cancer registry is File 1; death record is File 2: • One John Smith in cancer registry file can link to many John Smiths in death record file
File Layout • After designating File 1 and File 2, you will need to create a file layout for each file • Link Plus provides a view of the first 20 records of each input file • Verify that data is being read in properly
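Blocking Variables • Records must match exactly on a blocking variable to fall in the same block • Blocks define the groups of records within which matching variables are compared • Common blocking variables are: • Last Name • First Name • Social Security Number • Date of Birth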
Phonetic Systems • Phonetic coding involves coding a string based on how it is pronounced • Link Plus offers a choice of 2 Phonetic Coding Systems: Soundex • Code for a name consisting of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants • Reduces matching problems due to different spellings • Smith and Smyth have same code • Simple and fast
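The Soundex description above can be made concrete with a short sketch of the commonly documented American Soundex rules (adjacent letters with the same digit are coded once, H and W do not separate them, vowels do). Whether Link Plus follows exactly this variant is an assumption.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character code: first letter followed by three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            if digit != prev:      # collapse repeats of the same digit
                code += digit
            prev = digit
        elif ch not in "HW":       # vowels reset the repeat check; H/W do not
            prev = ""
    return (code + "000")[:4]      # pad with zeros or truncate to length 4

print(soundex("Smith"), soundex("Smyth"), soundex("Robert"))
```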
Phonetic Systems New York State Identification and Intelligence System (NYSIIS) • Maps similar phonemes to the same letter; maintains relative vowel positioning • String can be pronounced by the reader without decoding • Improvement to the Soundex algorithm • More distinctive; people are more likely to have the same Soundex than the same NYSIIS • Reported accuracy increase of 2.7% over Soundex • Studies suggest NYSIIS performs better than Soundex when Spanish names are used • However, Soundex may bring more pairs for comparison when it is used for blocking
Matching Variables • Up to 10 fields may be selected for matching • Recommended variables (Matching Methods): • Social Security Number (SSN) • Birth Date (Date) • Name--Last (LastName) • Name--First (FirstName) • Name--Middle (MiddleName) • Sex (Exact) • Date of Death (Date)
Matching Methods • Exact • Case insensitive character-for-character string comparison method • Results are either yes or no • Generic String • Uses edit distance function (Levenshtein distance) to compute the similarity of two long strings • Minimum number of operations (insertion, deletion, or substitution of a single character) needed to transform one string into the other • Last Name/First Name • Incorporate both partial matching and value-specific matching to account for minor typographical errors, misspellings, and hyphenated names
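The edit distance behind the Generic String method can be sketched with the standard dynamic-programming algorithm. The `similarity` helper, which rescales the distance to a 0 to 1 range, is a hypothetical convenience for the example and not necessarily how Link Plus converts distance into a score.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Hypothetical 0-1 similarity: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))
print(similarity("123 MAIN ST", "123 MAIN STREET"))
```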
Matching Methods • SSN • Specifically for Social Security Number • Incorporates partial matching to account for typographical errors and transposition of digits • Date • Incorporates partial matching to account for missing month values and/or day values • Middle Name • Accounts for occurrence of the middle initial only versus the full middle name
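Partial matching on SSN can be sketched by classifying how two values differ before assigning a weight. The category names and the rule set below (one digit off, one adjacent transposition) are assumptions for illustration; the slides do not describe the exact rules Link Plus applies.

```python
def compare_ssn(a, b):
    """Classify agreement between two SSN strings of digits."""
    if not a or not b:
        return "missing"
    if a == b:
        return "exact"
    if len(a) != len(b):
        return "disagree"
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diffs) == 1:
        return "one_digit_off"          # likely typographical error
    if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
            and a[diffs[0]] == b[diffs[1]] and a[diffs[1]] == b[diffs[0]]):
        return "transposition"          # two adjacent digits swapped
    return "disagree"

print(compare_ssn("123456789", "123456798"))
print(compare_ssn("123456789", "123456780"))
print(compare_ssn("123456789", "987654321"))
```

A scoring step could then give "one_digit_off" and "transposition" a weight between full agreement and disagreement.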