Record Linkage in a Distributed Environment Huang Yipeng, Wing group meeting, 11 March 2011
Record Linkage Determining whether pairs of personal records refer to the same entity. E.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee> Introduction
The Distributed Environment • Why? • Dealing with large data: naive matching requires nC2 pairwise comparisons • Limitations of blocking • Advantages • Parallel computation • Data source flexibility • Complementary to blocking methods (figure: pairwise comparison space for records sharing the blocking key "Amanda") Introduction
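The quadratic blow-up behind "nC2" can be made concrete. A minimal sketch, assuming equal-sized blocks purely for illustration:

```python
from math import comb

def pairwise_comparisons(n):
    # Without blocking, every record pair must be compared: C(n, 2) pairs.
    return comb(n, 2)

def blocked_comparisons(block_sizes):
    # With blocking, only records sharing a blocking key are compared.
    return sum(comb(k, 2) for k in block_sizes)

# 10,000 records compared naively vs. split into 100 equal blocks of 100.
full = pairwise_comparisons(10_000)         # 49,995,000 pairs
blocked = blocked_comparisons([100] * 100)  # 495,000 pairs
```

Even idealised blocking cuts the workload by two orders of magnitude here, which is why the two approaches are complementary rather than competing.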
The Distributed Environment • MapReduce • Distributed environment for large data sets • Hadoop • Open source implementation • Convenient model for scaling Record Linkage • Protects users from system level concerns Introduction
Research Problem There is a disconnect between the generic parallel framework and the specific Record Linkage problem The goal: tailor Hadoop for Record Linkage tasks Introduction
Outline Introduction Related Work Methodology Evaluation Conclusion
Related Work • Record Linkage Literature • Blocking techniques • Parallel Record Linkage Literature • P-Febrl (P Christen 2003), • P-Swoosh (H Kawai 2006), • Parallel Linkage (H Kim 2007) • Hadoop Literature • Evaluation Metrics • Pairwise comparisons (T Elsayed 2008) Related Work
Outline Introduction Related Work Methodology Evaluation Conclusion
MapReduce Workflow (diagram: the Partitioner sits between the Map and Reduce phases) Methodology
Implementation • Map • Purpose: • Parallelism • Data manipulation • Blocking • Reads lines of input and outputs <key, value> pairs. • Reduce • Purpose: • Parallelism • Record Linkage ops • Records with the same <key> in same Reduce(). • Linkage results Methodology
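A toy, single-process sketch of the Map/Reduce contract described above. The comma-separated record layout, the surname in the second field, and the first-letter-of-surname blocking key are all assumptions for illustration, as is the exact-surname match rule:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: read lines of input and emit <key, value> pairs.
    # Assumed blocking key: first letter of the surname (second CSV field).
    for line in lines:
        rec = line.split(",")
        surname = rec[1].strip()
        yield surname[:1].lower(), rec

def reduce_phase(records):
    # Reduce: all records with the same key arrive at the same Reduce();
    # compare them pairwise (toy rule: exact surname match).
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i][1].strip() == records[j][1].strip():
                matches.append((records[i][0], records[j][0]))
    return matches

def run(lines):
    # Shuffle stage stand-in: group mapped records by key, then reduce.
    groups = defaultdict(list)
    for key, rec in map_phase(lines):
        groups[key].append(rec)
    return {k: reduce_phase(v) for k, v in groups.items()}
```

The point of the sketch is the contract, not the match rule: parallelism comes for free once records sharing a blocking key are routed to the same Reduce().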
Hash Partitioner Default implementation: Hash(key) mod N Good for uniform data but not for skewed distributions (chart: 5,416,986 comparisons on one node vs. 210 on another) Methodology
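The skew problem is easy to reproduce. A sketch in which `zlib.crc32` stands in for Hadoop's key hash (an assumption, not Hadoop's actual hash function), and each block's reducer workload is its pairwise-comparison count:

```python
import zlib

def hash_partition(blocks, n_nodes):
    # Analogue of the default HashPartitioner: node = hash(key) mod N.
    # A block of k records costs k*(k-1)/2 comparisons in its reducer.
    loads = [0] * n_nodes
    for key, size in blocks:
        node = zlib.crc32(key.encode()) % n_nodes
        loads[node] += size * (size - 1) // 2
    return loads

# Hypothetical skewed name distribution: one common surname dominates.
blocks = [("lee", 3000), ("tan", 300), ("ng", 300), ("lim", 300)]
loads = hash_partition(blocks, 4)
```

Whichever node the "lee" block hashes to carries at least 4,498,500 of the 4,633,050 total comparisons; hashing balances the number of keys per node, not the work per node.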
Record Linkage Partitioner Goal: have all nodes finish the reduce phase at the same time, attaining a better runtime while retaining the same level of accuracy Methodology
Domain principles Counting pairwise comparisons gives a more accurate picture of the true computational workload than counting records The distribution of names tends to follow a power-law distribution in many countries (D Zanette 2001), (S Miyazima 2000) Methodology
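Both principles fit in a few lines. The Zipf-style frequencies below are an assumed illustration of a power law, not data from the cited studies:

```python
def block_workload(k):
    # The true cost of a block is its pairwise comparisons, k*(k-1)/2,
    # not its record count.
    return k * (k - 1) // 2

# Assumed Zipf-like surname frequencies: the i-th most common name
# appears roughly C / i times.
C = 10_000
sizes = [C // rank for rank in range(1, 6)]  # [10000, 5000, 3333, 2500, 2000]
workloads = [block_workload(k) for k in sizes]
```

Because workload is quadratic in block size, the head of the distribution dominates even more in comparisons than in record counts: halving a block's size quarters its workload.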
Record Linkage Workflow Round 1 Range partition based on comparison workload Round 2 Merge lost comparisons from Round 1 Round 3 Remove cross duplicates Methodology
Round 1 1. Calculate the average comparison workload over N nodes 2. Check whether a block will exceed the average; if so, divide it by the minimum number of nodes needed to drop below the average 3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any 4. Recurse until the comparison load can be evenly distributed among the nodes Methodology
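A greedy, non-recursive sketch of Round 1. It simplifies the slides' recursive average update into a single split-then-assign pass, so it is an approximation of the described algorithm, not the presentation's implementation:

```python
import math

def round1_partition(blocks, n_nodes):
    # blocks: {blocking key: record count}.
    # Split any block whose pairwise-comparison workload exceeds the
    # per-node average, then assign shards greedily to the least-loaded node.
    # Comparisons between shards of a split block are the "lost comparisons"
    # that Round 2 must recover.
    workload = {k: v * (v - 1) // 2 for k, v in blocks.items()}
    avg = sum(workload.values()) / n_nodes

    shards = []
    for key, count in blocks.items():
        if workload[key] > avg:
            # Minimum number of shards for each to drop below the average.
            parts = math.ceil(workload[key] / avg)
            shards.extend((key, count // parts) for _ in range(parts))
        else:
            shards.append((key, count))

    loads = [0.0] * n_nodes
    assignment = [[] for _ in range(n_nodes)]
    for key, size in sorted(shards, key=lambda s: -s[1]):
        node = loads.index(min(loads))  # largest shard to least-loaded node
        assignment[node].append((key, size))
        loads[node] += size * (size - 1) / 2
    return assignment, loads
```

On the skewed example from the Hash Partitioner slide, the dominant block is split four ways and no node carries the whole 4.5 million comparisons, at the price of deferring the cross-shard pairs to Round 2.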
Round 2 Acts only on the lost comparisons Because its input is indistinct, a third round of deduplication may be needed Methodology
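A sketch of what Rounds 2 and 3 owe each other, under the assumption that "lost comparisons" means the cross-shard pairs of blocks split in Round 1:

```python
def lost_pairs(shards):
    # Round 2: the comparisons lost in Round 1 are exactly the pairs that
    # span two different shards of the same split block.
    pairs = []
    for i in range(len(shards)):
        for j in range(i + 1, len(shards)):
            for a in shards[i]:
                for b in shards[j]:
                    pairs.append((a, b))
    return pairs

def dedupe(pairs):
    # Round 3: because Round 2's input is indistinct, the same pair can be
    # emitted more than once; normalise the order and keep one copy.
    return sorted({tuple(sorted(p)) for p in pairs})
```

Normalising each pair's order before deduplicating is what catches cross duplicates such as (a, b) and (b, a) arriving from different nodes.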
Outline Introduction Related Work Methodology Evaluation Conclusion
Performance Metrics • Performance evaluated as absolute runtime, speedup & scaleup on a shared cluster • "It's what users care about" • Representative of real operations Evaluation
Input Records <rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9> 10 million records, 0.9 million original, 0.1 million duplicate, up to 9 duplicates per record, 1 modification per field, 1 modification per record; duplicates follow a Poisson distribution Methodology
Data sets • Synthetic data produced with Febrl data generator • Artificially skewed distribution Methodology
Utilization (cluster utilization charts; annotated regions A, B, C) Evaluation
Round 2 (chart: node utilization across jobs J1–J6, ranging from 50–100%) Evaluation
Results so far… (runtime comparison chart) • RL Workflow runtime • Similar to the Hash-based runtime on small datasets • Better as the size of the dataset grows Evaluation
Conclusion • Parallelism is a step in the right direction for record linkage • Complementary to existing approaches • Hadoop can be tailored for Record Linkage tasks • The "Record Linkage" Partitioner / Workflow is just one example of possible improvements Conclusion