670 likes | 678 Views
Learn about the concepts and techniques of record linkage, including deterministic and probabilistic matching. This training covers the Link Plus software and its application in record linkage. Held in Portland, Oregon in May 2006.
E N D
Probabilistic Record Linkage Software Link Plus • Application Overview & User Training • Portland, Oregon • May 2006
CDC–NPCR Faculty Kathleen K. Thoburn, CDC/NPCR Contractor E-mail: kthoburn@cdc.gov David Gu, CDC/NPCR Contractor E-mail: dgu@cdc.gov Tom Rawson, CDC Computer Programmer
Acknowledgements Rich Pinder Supervisor of Follow-UpLos Angeles Cancer Surveillance Program Melissa Jim, MPH Epidemiologist Centers for Disease Control Division of Cancer Prevention and Control
Training Outline • Brief Overview of Record Linkage • Central Cancer Registry Record Linkage • Deterministic Matching • Probabilistic Matching • Link Plus Software Overview • Link Plus Linkage Overview • Linkage Exercises • Open Discussion
Overview of Record Linkage • Combine or merge together information describing the same individual from a variety of data sources • Merge information from individual’s record in 1st data source (file 1) with information from individual’s record in 2nd second data source (file 2) • Cancer information from cancer registry file, death information from vital statistics file • “Merge” aka “Record Linkage”
Overview of Record Linkage • Can be accomplished manually, by visually comparing records from two separate sources • Approach becomes time consuming, tedious, inefficient, and unpractical as the number of records in file 1 and file 2 increases • Technological advances in computer systems and programming techniques • Economically feasible to perform computerized record linkage between large files • Efficient and relatively accurate
Central Cancer Registry Record Linkage • Case Finding • Linking New Reports Consolidation • Duplicate Detection • Follow Up • Special Studies
Case Finding • Matching reports from • Pathology labs • Medical Records Disease Index • Treatment centers • No Match: tumor has not yet been reported • Request report of cancer from facility of diagnosis • Positive Match: tumor is already reported • New diagnostic/treatment information can be added to existing tumor record
Linking New Reports • Multiple notifications of the same cancer due to multiple reporting sources • Efficient record linkage procedures on same individual very important • Consolidation…Is this a • new person? • new tumor for an existing person? • new report for an existing person/tumor? • Failure in record linkage process results in missed cases and/or duplicate registrations • Leads to inaccurate counts and rates
Duplicate Detection • Fundamental requirement for accuracy and validity of counts in any disease registry • National Program of Cancer Registries/ North American Association of Central Cancer Registries standard • Maintain <= 0.1% (<=1 per 1,000) duplicates
Follow Up • Death Clearance – State vital statistic file • Hospital discharge data – Statewide file • Department of Motor Vehicles – Drivers' licenses and renewals • Social Security Death Master – SSA maintained file of death benefit claims • Medicare/Medicaid – Files of state enrollees • Voter Registration/Voter History - Statewide file of last 6 elections • National Change of Address (U.S. Postal Service) - File of individuals reporting change of address in last 3 years
Special Studies • Research questions often require linking external data against the registry • Allows hypothesis testing not available using other methods • Efficiency is a key feature • Faster, more efficient linkage process allows more linkages for less $$ and staff time • More research • Increased utilization of registry data
Deterministic Matching • Computerized comparison where EVERYTHING needs to match EXACTLY:
Deterministic Matching • Often slight variations exist in the data between the two files for the same variables: • Or variables are missing from one of the files: • These variations would prevent a match from being identified
Deterministic Matching • Describes an algorithm in which the correctnext step is PRE-defined (match/no match) • Good for production environments • Easily incorporated into existing data systems HOWEVER, • Will miss significant numbers of true matches • Will require enormous amount of manual review of results for missed matches
Deterministic MatchingManual Review • When we manually review, we use intuition to help us identify positive matches for records containing slight variations in, or missing information for, data between the two files for the same variables • Typo in SSN, transposition of digits in the day component of DOB, but would still deem a match
Probabilistic Matching • What do Humans know? • How can we translate intuition into formal decision rules to be used by a computer? • Use the concept of PROBABILITY and perform PROBABILISTIC matching • Recommended over traditional deterministic (exact matching) methods when: • coding errors, reporting variations, missing data or duplicate records • Estimate probability/likelihood that two records are for the same person versus not
Probabilistic Matching Definition of Probability: • Measure of how likely it is that some event will occur • “What is the probability of rain tonight?" • The likelihood that a given event will occur • “There is little probability of rain tonight.”
Probabilistic Matching • Find the records in File 2 that seem to match records in File 1 • Calculate a score that indicates, for any pair of records, how likely it is that they both refer to the same person • Sort the likely and possible matched pairs in order of their scores • Define a threshold (Cut Off value) for automatically accepting and rejecting a potential link • Discard unlikely matched pairs (scores below Cut Off) • Gray area: range of scores considered as uncertain matches • Manually review uncertain matches
Probabilistic Matching • The total score for a linkage between any two records is the sum of the scores generated from matching individual fields • The score assigned to a matching of individual fields is: • Based on the probability that a matching variable agrees given that a comparison pair is a match • M Probability - similar to "sensitivity“ • Reduced by the probability that a matching variable agrees given that a comparison pair is not a match • U Probability - similar to "specificity"
Probabilistic Matching • Agreement argues for linkage • Disagreement argues against linkage • Full agreement argues more strongly for linkage than partial agreement • Some types of partial agreements are stronger than others • Rare surname verses residence county code
Probabilistic Matching • Agreement on an uncommon value argues more strongly for linkage than a common value • Thoburn verses Smith • Agreement on a more specific variable argues more strongly for linkage than agreement on a less specific one • SSN verses Sex • Agreement on more variables/disagreement on few argues for linkage
Probabilistic Matching • Once comparisons are made, a weight is calculated for each field comparison • A total weight is derived by summing these separate field comparisons across all fields being compared • Probabilistic weights are • Field-specific – Birth date versus Sex • Value-specific - “Jane” versus “Janiqua”
Concept of Blocking • With so many comparisons, large files can make impossible resource demands • Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files • Sort and match the two files by one or more identifying (“blocking”) variables • Comparisons subsequently made only within blocks • Discard very unlikely record-pairings from the start
Blocking Variables Sock Pattern: 7 of 13 socks fall outside pattern block 6 of 13 socks withinpattern block
Matching Within Blocks Blocking: PatternMatching: Color & Size High Score Gray Area Low Score
Link Plus Software • Stand-alone probabilistic record linkage program • Combines ease of use and statistical sophistication • Detects duplicates within a cancer registry, or links cancer registry files to external files • Supports North American Association of Central Cancer Registries files, fixed width files, delimited files, and CRS Plus database
Link Plus Software • Computes probabilistic record linkage scores based on the theoretical frame work developed by Fellegi and Sunter • Fellegi, I. P., and A. B. Sunter (1969), "A Theory for Record Linkage," Journal of the American Statistical Association, 64, pp. 1183-1210 • Can handle missing values of matching variables • automatically treats null or empty values as missing data and allows user to indicate additional values to be treated as missing data
Link Plus Software • Facilitates a simple and efficient blocking ("OR blocking") mechanism by indexing the variables for blocking and comparing the pairs with the identical values on at least one of those variables • Provides powerful support for manual review of uncertain matches
Link Plus Is Free $0.00
Link Plus Is Easy To Use Link Plus gets you from HERE: Cancer Registry data for John Smith: Vital Statistics data for John Smith:
Link Plus Is Easy To Use To HERE: Linked data for John Smith:
Link Plus Is Easy To Use Without having to go HERE:
Link Plus Is Easy To Use • Designed especially for cancer registry work • HOWEVER, can be used with any data • Mathematics largely hidden from user • Practical default values supplied for many tasks • Familiar Windows interface • Includes Help and test examples
Link Plus Is Robust • Program written by a mathematical statistician • Specifications based on research into the published literature • Tested by researchers experienced in record-linkage • Results are clear and accessible to novice users
Link Plus Linkage OverviewPrior to Linkage • Review and clean data files • Set up two data files • Link Plus provides view of first 20 records of each input file • Verify that data is being read in properly
Link Plus Linkage OverviewData Cleaning Tips • Make sure you know your data! • Frequency distributions very helpful • Look for errant values • e.g. - DOB day component = 16
Data Cleaning Tips • Last Name • Link Plus automatically cleans punctuation and strips off suffices, numbers III • First Name • May find Dr. Bill or Rev Bill or Sister Mary • Remove prefix in First Name field • Middle Name • Link Plus automatically cleans numbers, weird symbols • NMI-no middle initial or NMN-no middle name • DOB • Review day, mo, yyyy component • Replace errant values with missing • Sex • Make sure files use same coding convention; M, F, or Blank OR 1, 2 9
Link Plus Linkage Overview Two main types of linkage: • External Linkage • Probabilistically link one file to another file • Deduplication • Special case of record linkage • Records in the same file are blocked, compared, and scored against each other • Result is a ranked list of record pairs • High-scoring pairs may be duplicates
Link Plus Linkage Overview External Linkage Steps: • Select Data Type for File 1 • Locate/Identify File 1 • Data Import for File 1 • Select Data Type for File 2 • Locate/Identify File 2 • Data Import for File 2 • Select Blocking Variables & Phonetic System • Select Matching Variables & Matching Methods • Select ID Variables • Define Missing Values • Select Direct/EM Method • Enter Cut-off Value • Specify Linkage File Name and Location • Perform Manual Review of Uncertain Matches • Export Merged File
Link Plus Linkage Overview Deduplication Linkage Steps: • Select Data Type for File • Locate/Identify File • Data Import for File • Select Blocking Variables & Phonetic System • Select Matching Variables & Matching Methods • Select ID Variables • Define Missing Values • Select Direct/EM Method • Enter Cut-off Value • Specify Linkage File Name and Location • Perform Manual Review of Uncertain Matches • Export Merged File
File ImportFile 1 verses File 2 • File 1 and File 2 have a one to many relationship • File 1 (ONE); File 2 (MANY) • Cancer Registry is File 1; External file is File 2: • Each John Smith in External file has as many chances as possible to match a John Smith record in CCR file • Non_MatchReport.txt contains records from External file not matched to records in CCR (death clearance for unreported tumors) • External File is File 1; Cancer Registry is File 2 • Each John Smith in CCR file has as many chances as possible to match a John Smith record in External file • Non_MatchReport.txt contains records from CCR file not matched to records in External file (follow up for date of death)
Blocking Variables • Exact matches • Blocks of data to compare variables within • Common blocking variables are: • Last Name • Social Security Number • Date of Birth
Phonetic Systems • Phonetic coding involves coding a string based on how it is pronounced • Link Plus offers a choice of 2 Phonetic Coding Systems: Soundex • Code for a name consisting of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants • Reduces matching problems due to different spellings • Simple and fast
Phonetic Systems New York State Identification and Intelligence System (NYSIIS) • Maps similar phonemes to the same letter; maintains relative vowel positioning • String can be pronounced by the reader without decoding • Improvement to the Soundex algorithm • More distinctive; people are more likely to have the same Soundex than the same NYSIIS • Reported accuracy increase of 2.7% over Soundex • Studies suggest NYSIIS performs better than Soundex when Spanish names are used • However, Soundex may bring more pairs for comparison when it used for blocking
Matching Variables • Up to 10 fields may be selected for matching • Recommended variables (Matching Methods): • Name--Last (LastName) • Name--First (FirstName) • Name--Middle (MiddleName) • Sex (Exact) • Race (Exact) • Birth Date (Date) • Social Security Number (SSN)