(De-Identified) Record Linkage Dongqiuye Pu, Ashraf Farrag, Javed Mostafa
Background • Identify duplicate records within a file or across files • Also known as: object identification, data cleaning, entity resolution, etc.
Motivation • Lack of unique identifiers • Spelling variations, misspellings, typos…
For Instance… (A) (B)
Methods In a Nutshell • Deterministic matching: straightforward, no human review needed, but suffers from low recall • Approximate matching: harder to implement, human review needed, higher recall
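The contrast between the two methods can be illustrated on a single pair of names. This is a minimal sketch, not the authors' implementation: the function names, the sample names, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def deterministic_match(a: str, b: str) -> bool:
    # Exact comparison after trivial normalization (case, surrounding spaces).
    return a.strip().lower() == b.strip().lower()


def approximate_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # Similarity ratio in [0, 1]; the threshold is an illustrative assumption.
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold


# Hypothetical sample pairs: a case variant, a one-letter typo, a different person.
pairs = [
    ("Jonathan Smith", "jonathan smith"),
    ("Jonathan Smith", "Jonathon Smith"),
    ("Jonathan Smith", "Maria Lopez"),
]
for a, b in pairs:
    print(a, "|", b, "|", deterministic_match(a, b), approximate_match(a, b))
```

The typo pair is missed by the deterministic rule and caught by the approximate one, which is the recall gap the slide describes. Pairs that score close to the threshold are the ones that would typically go to human review.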
Research Plan • Exact matching • Fuzzy matching for the rest
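One way to read this plan is as a two-pass pipeline: link everything that matches exactly, then run fuzzy matching only on the records left over. The sketch below assumes records are `(id, name)` tuples, that each record links to at most one partner, and that `difflib.SequenceMatcher` with a 0.85 threshold stands in for the fuzzy step. None of these details come from the slides.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not block a match.
    return " ".join(name.lower().split())


def link_records(file_a, file_b, threshold=0.85):
    """Return (exact, fuzzy) lists of (id_a, id_b) pairs."""
    # Pass 1: exact matching on the normalized name.
    index_b = {}
    for id_b, name_b in file_b:
        index_b.setdefault(normalize(name_b), []).append(id_b)

    exact, unmatched_a, used_b = [], [], set()
    for id_a, name_a in file_a:
        candidates = [i for i in index_b.get(normalize(name_a), []) if i not in used_b]
        if candidates:
            exact.append((id_a, candidates[0]))
            used_b.add(candidates[0])
        else:
            unmatched_a.append((id_a, name_a))

    # Pass 2: fuzzy matching for the rest, taking the best-scoring unused candidate.
    fuzzy = []
    for id_a, name_a in unmatched_a:
        best_id, best_score = None, 0.0
        for id_b, name_b in file_b:
            if id_b in used_b:
                continue
            score = SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()
            if score > best_score:
                best_id, best_score = id_b, score
        if best_id is not None and best_score >= threshold:
            fuzzy.append((id_a, best_id))
            used_b.add(best_id)
    return exact, fuzzy


# Hypothetical sample data.
file_a = [("a1", "Jonathan Smith"), ("a2", "Maria Lopez"), ("a3", "Wei Chen")]
file_b = [("b1", "jonathan smith"), ("b2", "Maria Lopes"), ("b3", "Ahmed Hassan")]
exact, fuzzy = link_records(file_a, file_b)
print(exact)
print(fuzzy)
```

Running the exact pass first keeps the quadratic fuzzy comparison confined to the leftover records, which is usually a much smaller set.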
Evaluating the accuracy of anonymous record linkage • Evaluate the collision rate of the hashing algorithm (most likely zero)
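A collision check of this kind could be sketched as follows. The slides do not name the hashing algorithm, so HMAC-SHA-256 with a shared secret key is an assumption here, as are the function names, the normalization step, and the definition of collision rate as the fraction of distinct identifiers that share a digest with a different identifier.

```python
import hashlib
import hmac
from collections import defaultdict


def normalize(identifier: str) -> str:
    # Lowercase and collapse whitespace before hashing.
    return " ".join(identifier.lower().split())


def deidentify(identifier: str, key: str) -> str:
    # Keyed hash of the normalized identifier (HMAC-SHA-256 is an assumption).
    digest = hmac.new(key.encode("utf-8"), normalize(identifier).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def collision_rate(identifiers, key: str) -> float:
    # Group distinct normalized identifiers by digest; any group holding more
    # than one distinct identifier is a collision.
    distinct = {normalize(i) for i in identifiers}
    by_digest = defaultdict(set)
    for ident in distinct:
        by_digest[deidentify(ident, key)].add(ident)
    colliding = sum(len(group) for group in by_digest.values() if len(group) > 1)
    return colliding / len(distinct) if distinct else 0.0


# Hypothetical sample identifiers and key.
names = ["Jonathan Smith", "Maria Lopez", "Wei Chen", "jonathan  smith"]
print(collision_rate(names, key="example-secret"))
```

Identifiers that differ only in case or spacing normalize to the same string and are counted once, so they link on purpose rather than registering as collisions.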