360 likes | 377 Views
Make the most of your data using CDC’s Link Plus program. Learn about deterministic and probabilistic matching methods to link records effectively and accurately. Understand the benefits, challenges, and practical applications, including case finding, follow-up, and duplicate detection. Explore techniques like phonetic systems and blocking to enhance matching accuracy. Join CDC experts for insights at the NCRA 2011 Annual Conference.
E N D
Make the Most of Your Data Using CDC’s Link Plus Free, Fast, and Efficient Probabilistic Record Linkage Program Kathleen K. Thoburn CDC/NPCR Contractor Joe Rogers Team Lead, Data Analysis and Support Team, NPCR, CDC National Center for Chronic Disease Prevention and Health Promotion NCRA 2011 Annual Conference NPCR QC TrackOrlando, FloridaJune 24, 2010 Division of Cancer Prevention and Control
Overview of Record Linkage • Can be accomplished manually, by visually comparing records from two separate sources or reviewing a single data source for duplicate records • Approach becomes time consuming, tedious, inefficient, and unpractical as the number of records in the files increases • Technological advances in computer systems and programming techniques • Economically feasible to perform computerized record linkage on large files • Efficient and relatively accurate
Central Cancer Registry Record Linkage • Case Finding • Linking New Reports Consolidation • Follow Up • Special Studies • Duplicate Detection
Duplicate Detection • Fundamental requirement for accuracy and validity of counts in any disease registry • National Program of Cancer Registries and North American Association of Central Cancer Registries standard • Maintain <= 0.1% (<=1 per 1,000) duplicates
Deterministic Matching • Computerized comparison where EVERYTHING needs to match EXACTLY:
Deterministic Matching • Often slight variations exist in the data between the two files for the same variables: • Or variables are missing from one of the files:
When we manually review, we use intuition to help us identify positive matches for records containing slight variations in, or missing information for, data between the two files for the same variables Deterministic MatchingManual Review • Typo in SSN, transposition of digits in the day component of DOB, but would still deem a match
Probabilistic Matching • What do Humans know? • How can we translate intuition into formal decision rules to be used by a computer? • Use the concept of PROBABILITY and perform PROBABILISTIC matching • Recommended over traditional deterministic (exact matching) methods when: • coding errors, reporting variations, missing data or duplicate records • Estimate probability/likelihood that two records are for the same person versus not
Probabilistic Matching • Find the records in File 2 that seem to match records in File 1 • Calculate a score that indicates, for any pair of records, how likely it is that they both refer to the same person • Sort the likely and possible matched pairs in order of their scores • Define a threshold (Cut Off value) for automatically accepting and rejecting a potential link • Discard unlikely matched pairs (scores below Cut Off) • Gray area: range of scores considered as uncertain matches • Manually review uncertain matches
Probabilistic Matching • Agreement argues for linkage (higher score) • Disagreement argues against linkage (lower score) • Full agreement argues more strongly for linkage than partial agreement • Some types of partial agreements are stronger than others; probabilistic scores are • Field-specific – Birth date versus Sex • Value-specific - “Jane” versus “Janiqua”
Phonetic Systems • Phonetic coding involves coding a string based on how it is pronounced Soundex (120 + years old) • Code for a name consisting of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants • Zeroes are added at the end if necessary to produce a four-character code. Additional letters are disregarded. • Washington is coded W-252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded • Reduces matching problems due to different spellings • Simple and fast
Phonetic Systems New York State Identification and Intelligence System (NYSIIS; 1970 +) • Maps similar phonemes to the same letter; maintains relative vowel positioning • String can be pronounced by the reader without decoding • Deborah Walker = DABARA WALCAR • Improvement to the Soundex algorithm • More distinctive; people are more likely to have the same Soundex than the same NYSIIS • Reported accuracy increase of 2.7% over Soundex • Studies suggest NYSIIS performs better than Soundex when Spanish names are used • Soundex may bring more pairs for comparison when used for blocking
Concept of Blocking • With so many comparisons, large files can make impossible resource demands • Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files • Sort and match the two files by one or more identifying (“blocking”) variables • Comparisons subsequently made only within blocks • Discard very unlikely record-pairings from the start
Blocking Variables • Exact matches • Blocks of data to compare variables within • Common blocking variables are: • Last Name • Social Security Number • Date of Birth
Matching Variables • Probabilistic matching algorithms • Comparing variables within blocks • Common matching variables: • Name--Last • Name--First • Name--Middle • Sex • Race • Birth Date • Social Security Number
Blocking Sock Pattern: 7 of 13 socks fall outside pattern block 6 of 13 socks withinpattern block
Matching Within Blocks Blocking: Sock PatternMatching: Sock Color & Size High Score Gray Area Low Score
Link Plus Software • Stand-alone probabilistic record linkage program • Combines ease of use and statistical sophistication • Detects duplicates within a cancer registry, or links cancer registry files to external files • Supports North American Association of Central Cancer Registries files, fixed width files, delimited files, and CRS Plus database • Provides powerful support for manual review of uncertain matches
CDC–NPCR Link Plus Contacts Kathleen K. Thoburn, CDC/NPCR Contractor E-mail: kthoburn@cdc.gov David Gu, CDC/NPCR Contractor E-mail: dgu@cdc.gov Tom Rawson, CDC Computer Programmer
Link Plus Is Free $0.00
Link Plus Is Easy To Use Link Plus gets you from HERE: Cancer Registry data for John Smith: Vital Statistics data for John Smith: To HERE: Linked data for John Smith:
Link Plus Is Easy To Use Without having to go HERE:
Link Plus Is Easy To Use • Designed especially for cancer registry work • HOWEVER, can be used with any data • Mathematics largely hidden from user • Practical default values supplied for many tasks • Familiar Windows interface • Includes Help and test examples
Link Plus Linkage Overview Deduplication Linkage Steps: • Select Data Type for File • Locate/Identify File • Data Import for File • Select Blocking Variables & Phonetic System • Select Matching Variables & Matching Methods • Select ID Variables • Define Missing Values • Enter Cut-off Value • Select Direct/EM Method • Specify Linkage File Name and Location • Perform Manual Review of Uncertain Matches • Export Merged File
Blocking Variables • Exact matches • Blocks of data to compare variables within • Up to 10 fields may be selected for blocking • Common blocking variables are: • Last Name • Social Security Number • Date of Birth
Matching Variables • Up to 10 fields may be selected for matching • Recommended variables (Matching Methods): • Name--Last (LastName) • Name--First (FirstName) • Name--Middle (MiddleName) • Sex (Exact) • Race (Value-Specific) • Birth Date (Date) • Social Security Number (SSN)
Matching Methods • Exact • Generic String • Last Name/First Name/Middle • SSN (Social Security Number) • Zip Code • Date • Generic ID • Confirmation • Value-Specific (Frequency-Based)
Missing Values • Specify date format on the missing value grid
Cut Off Value • The score value above which comparison pairs are accepted as potential links and presented for review • Value should always be positive • Initial value of around 7-10 recommended when using the recommended Matching Variables • Run linkage, and quickly review potential matches to identify lower and upper cut off scores • At what score do perfect matches end and uncertain matches (gray area) begin? • At what score do false matches begin?
Running Linkage &Linkage Process Progress Window • Linkage Process Progress window appears and provides the user with feedback about the linkage process as it is run • The progress window provides feedback regarding the preparation of the configuration, the reading in of the data files, the blocking of the files, and the calculation of the linkage scores
Enhancements New to Version 3.0 Data Link: • Removes the limitation on the number of records included in File 2; the program works for any number of records in File 2 as long as the computer has sufficient memory to read in data from File 1 • Users can choose whether to write all potential matches (many-to-many linkages) or only the matches with the highest score to the linkage report (1-to-many linkages) • Provides confirmation-like matching method for variables such as address that contributes positive weight for the linkage score with agreement but 0 weight with disagreement • Provides SSN-like matching method for a generic ID • Provides a new name matching method that is more robust against the frequency of names or outlier of names
Enhancements New to Version 3.0 ManualReview • Users can use “Assign Set ID” (de-duplication linkages only) to group matches into mutually exclusive match sets • Removes the limitation of the maximum size of 30,000 pairs on the manual review window; the new maximum is 300,000 pairs • Provides option to allow users to assign match status by linkage score without overwriting any existing assigned match status Export • Users can export the results of manual review to a NAACCR formatted file or any other fixed width file format
Thank You! Be sure to stop by the Registry Plus booth with questions or for further demonstrations! Kathleen K. Thoburn, kthoburn@cdc.gov Joe Rogers, jrogers@cdc.gov The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. National Center for Chronic Disease Prevention and Health Promotion Division of Cancer Prevention and Control