Probabilistic Record Linkage Software Link Plus • Application Overview & User Training • Provided to Missouri Cancer Registry Staff • Via web cast • July 16, 2008
CDC–NPCR Link Plus Contacts Kathleen K. Thoburn, CDC/NPCR Contractor E-mail: kthoburn@cdc.gov David Gu, CDC/NPCR Contractor E-mail: dgu@cdc.gov Tom Rawson, CDC Computer Programmer
Acknowledgements • Rich Pinder, Supervisor of Follow-Up, Los Angeles Cancer Surveillance Program • Melissa Jim, MPH, Epidemiologist, Centers for Disease Control, Division of Cancer Prevention and Control
Training Outline • Brief Overview of Record Linkage • Central Cancer Registry Record Linkage • Deterministic Matching • Probabilistic Matching • Link Plus Software Overview • Link Plus Linkage Overview • Linkage Exercises • Open Discussion
Overview of Record Linkage • Combine or merge together information describing the same individual from a variety of data sources • Merge information from individual’s record in first data source (file 1) with information from individual’s record in second data source (file 2) • Cancer information from cancer registry file, death information from vital statistics file • “Merge” aka “Record Linkage”
Overview of Record Linkage • Can be accomplished manually, by visually comparing records from two separate sources • Approach becomes time consuming, tedious, inefficient, and impractical as the number of records in file 1 and file 2 increases • Technological advances in computer systems and programming techniques • Economically feasible to perform computerized record linkage between large files • Efficient and relatively accurate
Central Cancer Registry Record Linkage • Case Finding • Linking New Reports (Consolidation) • Duplicate Detection • Follow Up • Special Studies
Case Finding • Matching reports from • Pathology labs • Medical Records Disease Index • Treatment centers • No Match: tumor has not yet been reported • Request report of cancer from facility of diagnosis • Positive Match: tumor is already reported • New diagnostic/treatment information can be added to existing tumor record
Linking New Reports • Multiple notifications of the same cancer due to multiple reporting sources • Efficient record linkage procedures on same individual very important • Consolidation…Is this a • new person? • new tumor for an existing person? • new report for an existing person/tumor? • Failure in record linkage process results in missed cases and/or duplicate registrations • Leads to inaccurate counts and rates
Duplicate Detection • Fundamental requirement for accuracy and validity of counts in any disease registry • National Program of Cancer Registries/ North American Association of Central Cancer Registries standard • Maintain <= 0.1% (<=1 per 1,000) duplicates
Follow Up • Death Clearance – State vital statistics file • Hospital discharge data – Statewide file • Department of Motor Vehicles – Drivers' licenses and renewals • Social Security Death Master – SSA maintained file of death benefit claims • Medicare/Medicaid – Files of state enrollees • Voter Registration/Voter History – Statewide file of last 6 elections • National Change of Address (U.S. Postal Service) – File of individuals reporting change of address in last 3 years
Special Studies • Research questions often require linking external data against the registry • Allows hypothesis testing not available using other methods • Efficiency is a key feature • Faster, more efficient linkage process allows more linkages for less $$ and staff time • More research • Increased utilization of registry data
Deterministic Matching • Computerized comparison where EVERYTHING needs to match EXACTLY:
Deterministic Matching • Often slight variations exist in the data between the two files for the same variables • Or variables are missing from one of the files • These variations would prevent a match from being identified
Deterministic Matching • Describes an algorithm in which the correct next step is PRE-defined (match/no match) • Good for production environments • Easily incorporated into existing data systems • HOWEVER: • Will miss significant numbers of true matches • Will require an enormous amount of manual review of results for missed matches
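The limitation described above can be illustrated with a minimal sketch. The records, field names, and values here are invented for illustration; they are not taken from the training data set.

```python
# Hypothetical records; field names and values are invented for illustration.
registry = {"last": "SMITH", "first": "JOHN", "dob": "19450312", "ssn": "123456789"}
vital_ok = {"last": "SMITH", "first": "JOHN", "dob": "19450312", "ssn": "123456789"}
# Same person, but the day digits of the DOB were transposed at data entry.
vital_typo = {"last": "SMITH", "first": "JOHN", "dob": "19450321", "ssn": "123456789"}

FIELDS = ("last", "first", "dob", "ssn")


def deterministic_match(a, b, fields=FIELDS):
    """Return True only if every listed field agrees exactly."""
    return all(a[f] == b[f] for f in fields)


print(deterministic_match(registry, vital_ok))    # True
print(deterministic_match(registry, vital_typo))  # False
```

The second comparison fails on a single transposed pair of digits, which is the kind of true match a deterministic rule misses.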
Deterministic Matching: Manual Review • When we manually review, we use intuition to help us identify positive matches for records containing slight variations in, or missing information for, data between the two files for the same variables • Example: with a typo in SSN or transposed digits in the day component of DOB, a reviewer would still deem the pair a match
Probabilistic Matching • What do Humans know? • How can we translate intuition into formal decision rules to be used by a computer? • Use the concept of PROBABILITY and perform PROBABILISTIC matching • Recommended over traditional deterministic (exact matching) methods when the data contain: • coding errors, reporting variations, missing data, or duplicate records • Estimate probability/likelihood that two records are for the same person versus not
Probabilistic Matching Definition of Probability: • Measure of how likely it is that some event will occur • “What is the probability of rain tonight?" • The likelihood that a given event will occur • “There is little probability of rain tonight.”
Probabilistic Matching • Find the records in File 2 that seem to match records in File 1 • Calculate a linkage score that indicates, for any pair of records, how likely it is that they both refer to the same person • Sort the likely and possible matched pairs in order of their scores • Define a threshold (Cut Off value) for automatically accepting and rejecting a potential link • Discard unlikely matched pairs (scores below Cut Off) • Gray area: range of scores considered as uncertain matches • Manually review uncertain matches
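The sort-and-threshold steps above can be sketched as follows. This is only an illustration: the two cutoff values, the pair identifiers, and the scores are hypothetical, and the use of a separate upper and lower cutoff to bound the gray area is an assumption rather than a description of how Link Plus stores its Cut Off value.

```python
# Hypothetical cutoffs bounding the gray area (assumed values).
LOWER_CUTOFF = 7.0   # below this: discard as unlikely match
UPPER_CUTOFF = 15.0  # at or above this: accept as likely match


def classify(score, lower=LOWER_CUTOFF, upper=UPPER_CUTOFF):
    """Assign a scored record pair to one of three outcomes."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "review"  # gray area: send to manual review
    return "non-match"


# (File 1 id, File 2 id, linkage score); all values invented.
scored_pairs = [("A2", "B3", 9.1), ("A1", "B7", 22.4), ("A5", "B9", 2.0)]

# Sort pairs from highest to lowest score, then classify each one.
ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
for id1, id2, score in ranked:
    print(id1, id2, score, classify(score))
```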
Probabilistic Matching • The total score for a linkage between any two records is the sum of the scores generated from matching individual fields • The score assigned to a matching of individual fields is: • Based on the probability that a matching variable agrees given that a comparison pair is a match • M Probability - similar to "sensitivity" • Reduced by the probability that a matching variable agrees given that a comparison pair is not a match • U Probability - related to "specificity" (U = 1 − specificity, i.e., the false-positive rate)
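A minimal sketch of this scoring, using the standard Fellegi-Sunter log-ratio weights: log2(m/u) when a field agrees and log2((1-m)/(1-u)) when it disagrees. The M and U values below are invented for illustration, and the sketch is not claimed to reproduce the exact weights Link Plus computes.

```python
import math

# Hypothetical (M, U) probabilities per field; values are assumptions.
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "sex": (0.98, 0.5),
}


def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added (usually negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def total_score(agreements):
    """Sum field weights; `agreements` maps field name -> True/False."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBS[field]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total


all_agree = total_score({"last_name": True, "birth_date": True, "sex": True})
dob_differs = total_score({"last_name": True, "birth_date": False, "sex": True})
print(all_agree, dob_differs)
```

Because sex agrees by chance about half the time under these assumed values, agreement on sex adds far less to the score than agreement on birth date.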
Probabilistic Matching • Agreement argues for linkage (higher score) • Disagreement argues against linkage (lower score) • Full agreement argues more strongly for linkage than partial agreement • Some types of partial agreements are stronger than others; probabilistic scores are • Field-specific – Birth date versus Sex • Value-specific - “Jane” versus “Janiqua”
Phonetic Systems • Phonetic coding involves coding a string based on how it is pronounced • Link Plus offers a choice of 2 Phonetic Coding Systems: Soundex (120+ years old) • Code for a name consisting of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants • Zeroes are added at the end if necessary to produce a four-character code. Additional letters are disregarded. • Washington is coded W-252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded) • Reduces matching problems due to different spellings • Simple and fast
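The Soundex rules above can be sketched in a few lines. This follows the common American Soundex convention (adjacent letters with the same code collapse, and H and W do not separate them); whether Link Plus applies exactly these conventions is an assumption. The function returns the code without the hyphen shown on the slide.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            if digit != prev:      # skip repeats of the same code
                code += digit
            prev = digit
        elif ch not in "HW":       # vowels separate repeated codes; H and W do not
            prev = ""
    return (code + "000")[:4]      # pad with zeroes, drop extra digits


print(soundex("Washington"))  # W252
print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```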
Phonetic Systems New York State Identification and Intelligence System (NYSIIS; 1970 +) • Maps similar phonemes to the same letter; maintains relative vowel positioning • String can be pronounced by the reader without decoding • Deborah Walker = DABARA WALCAR • Improvement to the Soundex algorithm • More distinctive; people are more likely to have the same Soundex than the same NYSIIS • Reported accuracy increase of 2.7% over Soundex • Studies suggest NYSIIS performs better than Soundex when Spanish names are used • Soundex may bring more pairs for comparison when used for blocking
Concept of Blocking • With so many comparisons, large files can make impossible resource demands • Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files • Sort and match the two files by one or more identifying (“blocking”) variables • Comparisons subsequently made only within blocks • Discard very unlikely record-pairings from the start
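A minimal sketch of blocking on a single variable, showing how it cuts the number of comparisons. The records and the field name `last` are hypothetical.

```python
from collections import defaultdict


def block_pairs(file1, file2, key):
    """Yield only the record pairs that share the same blocking value."""
    index = defaultdict(list)
    for rec in file2:
        index[rec[key]].append(rec)
    for rec1 in file1:
        for rec2 in index.get(rec1[key], []):
            yield rec1, rec2


# Invented records for illustration.
file1 = [{"id": 1, "last": "SMITH"}, {"id": 2, "last": "JONES"}, {"id": 3, "last": "GARCIA"}]
file2 = [{"id": "a", "last": "SMITH"}, {"id": "b", "last": "SMITH"}, {"id": "c", "last": "GARCIA"}]

all_comparisons = len(file1) * len(file2)
blocked = list(block_pairs(file1, file2, "last"))
print(all_comparisons, len(blocked))  # 9 3
```

With three records per file the saving is small, but the number of unblocked comparisons grows as the product of the two file sizes.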
Blocking Variables • Sock Pattern: 7 of 13 socks fall outside pattern block • 6 of 13 socks within pattern block
Matching Within Blocks • Blocking: Pattern • Matching: Color & Size • High Score • Gray Area • Low Score
Link Plus Software • Stand-alone probabilistic record linkage program • Combines ease of use and statistical sophistication • Detects duplicates within a data file, or links two data files together • Supports fixed width files, delimited files, and North American Association of Central Cancer Registries files
Link Plus Software • Computes probabilistic record linkage scores based on the theoretical framework developed by Fellegi and Sunter • Fellegi, I. P., and A. B. Sunter (1969), "A Theory for Record Linkage," Journal of the American Statistical Association, 64, pp. 1183-1210 • Can handle missing values of matching variables • Automatically treats null or empty values as missing data and allows user to indicate additional values to be treated as missing data
Link Plus Software • Facilitates a simple and efficient blocking ("OR blocking") mechanism by indexing the variables for blocking and comparing the pairs with the identical values on at least one of those variables • Provides powerful support for manual review of uncertain matches
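The "OR blocking" idea can be sketched as follows: index each blocking variable separately and keep any pair that agrees on at least one of them. The records and field names are hypothetical, skipping empty values is an assumption, and this is not the Link Plus implementation.

```python
from collections import defaultdict


def or_block_pairs(file1, file2, keys):
    """Return (i, j) index pairs that agree on at least one blocking key."""
    seen = set()
    pairs = []
    for key in keys:
        index = defaultdict(list)
        for j, rec in enumerate(file2):
            if rec.get(key):              # assumption: empty values never block
                index[rec[key]].append(j)
        for i, rec in enumerate(file1):
            value = rec.get(key)
            if not value:
                continue
            for j in index.get(value, []):
                if (i, j) not in seen:    # keep each pair once
                    seen.add((i, j))
                    pairs.append((i, j))
    return pairs


# Invented records for illustration.
file1 = [{"last": "SMITH", "ssn": "111"}, {"last": "JONES", "ssn": "222"}]
file2 = [{"last": "SMYTHE", "ssn": "111"}, {"last": "JONES", "ssn": ""},
         {"last": "JONES", "ssn": "222"}]

print(sorted(or_block_pairs(file1, file2, ("last", "ssn"))))
# [(0, 0), (1, 1), (1, 2)]
```

The SMITH/SMYTHE pair is still compared because the SSNs agree, even though the last names fall in different blocks.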
Link Plus Is Free $0.00
Link Plus Is Easy To Use Link Plus gets you from HERE: Cancer Registry data for John Smith: Vital Statistics data for John Smith:
Link Plus Is Easy To Use To HERE: Linked data for John Smith:
Link Plus Is Easy To Use Without having to go HERE:
Link Plus Is Easy To Use • Designed especially for cancer registry work • HOWEVER, can be used with any data • Mathematics largely hidden from user • Practical default values supplied for many tasks • Familiar Windows interface • Includes Help and test examples
Link Plus Is Robust • Program written by a mathematical statistician • Specifications based on research into the published literature • Tested by researchers experienced in record-linkage • Results are clear and accessible to novice users
Link Plus Linkage Overview: Prior to Linkage • Review and clean data files • Make sure you know your data! • Data cleaning tips in online help • Set up two data files • Make sure files use same coding convention • Link Plus provides view of first 20 records of each input file • Verify that data is being read in properly
Data Cleaning Tips • Last Name • Link Plus automatically cleans punctuation and strips off suffixes and numerals (e.g., III) • First Name • May find Dr. Bill or Rev Bill or Sister Mary • Remove prefix in First Name field • Middle Name • Link Plus automatically cleans numbers, weird symbols • NMI (no middle initial) or NMN (no middle name) • DOB • Review day, month, and year components • Replace errant values with missing • Sex • Make sure files use same coding convention: M, F, or Blank OR 1, 2, 9
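A rough sketch of pre-linkage cleaning along the lines of these tips. The prefix list, the placeholder list, and the sex-code mapping are assumptions chosen for illustration; a real registry would tailor them to its own data.

```python
import re

# Hypothetical lookup lists; extend to suit the data at hand.
PREFIXES = {"DR", "REV", "SISTER", "MR", "MRS", "MS"}
NO_MIDDLE = {"NMI", "NMN"}
SEX_CODES = {"1": "M", "2": "F", "9": "", "M": "M", "F": "F"}


def clean_first_name(value):
    """Upper-case, drop punctuation, and strip leading titles."""
    tokens = re.sub(r"[^A-Za-z\s]", " ", value).upper().split()
    while tokens and tokens[0] in PREFIXES:
        tokens.pop(0)
    return " ".join(tokens)


def clean_middle_name(value):
    """Treat 'no middle name' placeholders as missing."""
    cleaned = re.sub(r"[^A-Za-z]", "", value).upper()
    return "" if cleaned in NO_MIDDLE else cleaned


def clean_sex(value):
    """Map 1/2/9 and M/F onto one convention; anything else becomes missing."""
    return SEX_CODES.get(value.strip().upper(), "")


print(clean_first_name("Dr. Bill"))   # BILL
print(clean_middle_name("NMI"))       # (empty string)
print(clean_sex("2"))                 # F
```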
Link Plus Linkage Overview Two main types of linkage: • External Linkage • Probabilistically link one file to another file • Deduplication • Special case of record linkage • Records in the same file are blocked, compared, and scored against each other • Result is a ranked list of record pairs • High-scoring pairs may be duplicates
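Deduplication can be sketched as blocking a file against itself and comparing each pair of records within a block once. The records and the blocking field below are hypothetical; scoring of the candidate pairs is left out.

```python
from collections import defaultdict
from itertools import combinations


def dedup_candidate_pairs(records, key):
    """Yield each unordered pair of records that share a blocking value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    for members in blocks.values():
        # combinations() never pairs a record with itself or repeats a pair.
        yield from combinations(members, 2)


# Invented records for illustration.
records = [
    {"id": 1, "last": "SMITH"},
    {"id": 2, "last": "SMITH"},
    {"id": 3, "last": "JONES"},
    {"id": 4, "last": "SMITH"},
]

candidates = [(a["id"], b["id"]) for a, b in dedup_candidate_pairs(records, "last")]
print(candidates)  # [(1, 2), (1, 4), (2, 4)]
```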
Link Plus Linkage Overview External Linkage Steps: • Select Data Type for File 1 • Locate/Identify File 1 • Data Import for File 1 • Select Data Type for File 2 • Locate/Identify File 2 • Data Import for File 2 • Select Blocking Variables & Phonetic System • Select Matching Variables & Matching Methods • Select ID Variables • Define Missing Values • Select Direct/EM Method • Enter Cut-off Value • Specify Linkage File Name and Location • Perform Manual Review of Uncertain Matches • Export Merged File
Link Plus Linkage Configuration • Identify/Import Data Files • Specify Data Type • Select Blocking Variables/Phonetic System • Select ID Variables • Select Matching Variables/Methods • Save Linkage Configuration • Direct Method/EM Algorithm • Enter Cutoff • Specify Missing Values • Specify Linkage File Name and Location • Run Linkage!
Link Plus Linkage Overview Deduplication Linkage Steps: • Select Data Type for File • Locate/Identify File • Data Import for File • Select Blocking Variables & Phonetic System • Select Matching Variables & Matching Methods • Select ID Variables • Define Missing Values • Select Direct/EM Method • Enter Cut-off Value • Specify Linkage File Name and Location • Perform Manual Review of Uncertain Matches • Export Merged File
File Import: File 1 versus File 2 • Designation of File 1 and File 2 is important • File 1 is generally the larger of the two files • U probabilities are based on FILE 1 • Non_MatchReport.txt contains records from File 2 not matched to records in File 1
Blocking Variables • Blocking variables are compared as exact matches • They define the blocks of data within which matching variables are compared • Up to 10 fields may be selected for blocking • Common blocking variables are: • Last Name • Social Security Number • Date of Birth
Phonetic Systems • Phonetic coding involves coding a string based on how it is pronounced • Reduces matching problems due to different spellings • Link Plus offers a choice of 2 Phonetic Coding Systems: • Soundex • New York State Identification and Intelligence System (NYSIIS)
Matching Variables • Up to 10 fields may be selected for matching • Recommended variables (Matching Methods): • Name--Last (LastName) • Name--First (FirstName) • Name--Middle (MiddleName) • Sex (Exact) • Race (Exact) • Birth Date (Date) • Social Security Number (SSN)
Matching Methods • Exact • Case insensitive character-for-character string comparison method • Results are either yes or no • Generic String • Uses edit distance function (Levenshtein distance) to compute the similarity of two long strings • Minimum number of operations (insertion, deletion, or substitution of a single character) needed to transform one string into the other • Last Name/First Name • Partial matching and value-specific matching accounts for minor typographical errors, misspellings, and hyphenation • Use of nicknames in First Name Matching Method
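The edit distance used by the Generic String method can be sketched with the standard dynamic-programming algorithm. The `similarity` helper that scales the distance to a 0 to 1 range is an assumption added for illustration; the slides do not say how Link Plus converts the distance into a score.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string as the row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical 0-1 similarity: 1 minus distance over the longer length."""
    a, b = a.upper(), b.upper()          # case-insensitive comparison
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("Thoburn", "THOBURN"))  # 1.0
```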
Matching Methods • SSN (Social Security Number) • Partial matching accounts for typographical errors and transposition of digits • Accepts 4 digit SSNs • Zip Code • Can match 5 digit zip code to 9 digit zip code • Date • Incorporates partial matching to account for missing month values and/or day values • Middle Name • Accounts for occurrence of the middle initial only versus the full middle name
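The kinds of partial agreement listed above can be sketched as simple comparison functions. The category names and the rules here are hypothetical; they illustrate the idea of graded agreement and are not the rules Link Plus uses internally.

```python
def compare_ssn(a, b):
    """Classify agreement between two SSNs (full 9-digit or last 4 digits)."""
    a = "".join(ch for ch in a if ch.isdigit())
    b = "".join(ch for ch in b if ch.isdigit())
    if not a or not b:
        return "missing"
    if len(a) == 4 or len(b) == 4:       # one side holds only the last 4 digits
        return "last4_agree" if a[-4:] == b[-4:] else "disagree"
    if len(a) != 9 or len(b) != 9:
        return "disagree"
    if a == b:
        return "exact"
    diffs = [i for i in range(9) if a[i] != b[i]]
    if len(diffs) == 1:
        return "one_digit_off"
    if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
            and a[diffs[0]] == b[diffs[1]] and a[diffs[1]] == b[diffs[0]]):
        return "transposition"           # two adjacent digits swapped
    return "disagree"


def compare_middle(a, b):
    """Classify agreement between middle names, allowing initial vs full name."""
    a = a.strip().upper().rstrip(".")
    b = b.strip().upper().rstrip(".")
    if not a or not b:
        return "missing"
    if a == b:
        return "exact"
    if (len(a) == 1 or len(b) == 1) and a[0] == b[0]:
        return "initial_agree"
    return "disagree"


print(compare_ssn("123456789", "123456798"))  # transposition
print(compare_middle("K.", "Kathleen"))       # initial_agree
```

In a probabilistic linkage each category would carry its own weight, with partial agreement scoring between full agreement and disagreement.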