Record Linkage in a Distributed Environment Huang Yipeng, Wing group meeting, 11 March 2011
Record Linkage Determining whether pairs of personal records refer to the same entity. E.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee> Introduction
The Distributed Environment • Why? • Dealing with large data: naive matching requires nC2 pairwise comparisons • Limitations of blocking • Advantages • Parallel computation • Data source flexibility • Complementary to blocking methods (figure: pairwise comparison space for records sharing the blocking key "Amanda") Introduction
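The quadratic blow-up behind "nC2" can be made concrete. A minimal sketch, assuming equal-sized blocks purely for illustration:

```python
from math import comb

def pairwise_comparisons(n):
    # Without blocking, every record pair must be compared: C(n, 2) pairs.
    return comb(n, 2)

def blocked_comparisons(block_sizes):
    # With blocking, only records sharing a blocking key are compared.
    return sum(comb(k, 2) for k in block_sizes)

# 10,000 records compared naively vs. split into 100 equal blocks of 100.
full = pairwise_comparisons(10_000)         # 49,995,000 pairs
blocked = blocked_comparisons([100] * 100)  # 495,000 pairs
```

Even idealised blocking cuts the workload by two orders of magnitude here, which is why the two approaches are complementary rather than competing.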
The Distributed Environment • MapReduce • Distributed environment for large data sets • Hadoop • Open source implementation • Convenient model for scaling Record Linkage • Protects users from system level concerns Introduction
Research Problem There is a disconnect between the generic parallel framework and the specific Record Linkage problem The goal: tailor Hadoop for Record Linkage tasks Introduction
Outline Introduction Related Work Methodology Evaluation Conclusion
Related Work • Record Linkage Literature • Blocking techniques • Parallel Record Linkage Literature • P-Febrl (P Christen 2003), • P-Swoosh (H Kawai 2006), • Parallel Linkage (H Kim 2007) • Hadoop Literature • Evaluation Metrics • Pairwise comparisons (T Elsayed 2008) Related Work
Outline Introduction Related Work Methodology Evaluation Conclusion
MapReduce Workflow (diagram: the Partitioner sits between the Map and Reduce phases) Methodology
Implementation • Map • Purpose: • Parallelism • Data manipulation • Blocking • Reads lines of input and outputs <key, value> pairs. • Reduce • Purpose: • Parallelism • Record Linkage ops • Records with the same <key> in same Reduce(). • Linkage results Methodology
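A toy, single-process sketch of the Map/Reduce contract described above. The comma-separated record layout, the surname in the second field, and the first-letter-of-surname blocking key are all assumptions for illustration, as is the exact-surname match rule:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: read lines of input and emit <key, value> pairs.
    # Assumed blocking key: first letter of the surname (second CSV field).
    for line in lines:
        rec = line.split(",")
        surname = rec[1].strip()
        yield surname[:1].lower(), rec

def reduce_phase(records):
    # Reduce: all records with the same key arrive at the same Reduce();
    # compare them pairwise (toy rule: exact surname match).
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i][1].strip() == records[j][1].strip():
                matches.append((records[i][0], records[j][0]))
    return matches

def run(lines):
    # Shuffle stage stand-in: group mapped records by key, then reduce.
    groups = defaultdict(list)
    for key, rec in map_phase(lines):
        groups[key].append(rec)
    return {k: reduce_phase(v) for k, v in groups.items()}
```

The point of the sketch is the contract, not the match rule: parallelism comes for free once records sharing a blocking key are routed to the same Reduce().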
Hash Partitioner Default implementation: Hash(key) mod N Good for uniform data but not for skewed distributions (chart: 5,416,986 comparisons on one node vs. 210 on another) Methodology
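The skew problem is easy to reproduce. A sketch in which `zlib.crc32` stands in for Hadoop's key hash (an assumption, not Hadoop's actual hash function), and each block's reducer workload is its pairwise-comparison count:

```python
import zlib

def hash_partition(blocks, n_nodes):
    # Analogue of the default HashPartitioner: node = hash(key) mod N.
    # A block of k records costs k*(k-1)/2 comparisons in its reducer.
    loads = [0] * n_nodes
    for key, size in blocks:
        node = zlib.crc32(key.encode()) % n_nodes
        loads[node] += size * (size - 1) // 2
    return loads

# Hypothetical skewed name distribution: one common surname dominates.
blocks = [("lee", 3000), ("tan", 300), ("ng", 300), ("lim", 300)]
loads = hash_partition(blocks, 4)
```

Whichever node the "lee" block hashes to carries at least 4,498,500 of the 4,633,050 total comparisons; hashing balances the number of keys per node, not the work per node.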
Record Linkage Partitioner Goal: have all nodes finish the reduce phase at the same time, attaining a better runtime while retaining the same level of accuracy Methodology
Domain principles Counting pairwise comparisons gives a more accurate picture of the true computational workload than counting records The distribution of names tends to follow a power-law distribution in many countries (D Zanette 2001), (S Miyazima 2000) Methodology
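Both principles fit in a few lines. The Zipf-style frequencies below are an assumed illustration of a power law, not data from the cited studies:

```python
def block_workload(k):
    # The true cost of a block is its pairwise comparisons, k*(k-1)/2,
    # not its record count.
    return k * (k - 1) // 2

# Assumed Zipf-like surname frequencies: the i-th most common name
# appears roughly C / i times.
C = 10_000
sizes = [C // rank for rank in range(1, 6)]  # [10000, 5000, 3333, 2500, 2000]
workloads = [block_workload(k) for k in sizes]
```

Because workload is quadratic in block size, the head of the distribution dominates even more in comparisons than in record counts: halving a block's size quarters its workload.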
Record Linkage Workflow Round 1 Range partition based on comparison workload Round 2 Merge lost comparisons from Round 1 Round 3 Remove cross duplicates Methodology
Round 1 1. Calculate the average comparison workload over N nodes 2. Check whether a block will exceed the average; if so, divide it by the minimum number of nodes needed to drop below the average 3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any 4. Recurse until the comparison load can be evenly distributed among the nodes Methodology
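A greedy, non-recursive sketch of Round 1. It simplifies the slides' recursive average update into a single split-then-assign pass, so it is an approximation of the described algorithm, not the presentation's implementation:

```python
import math

def round1_partition(blocks, n_nodes):
    # blocks: {blocking key: record count}.
    # Split any block whose pairwise-comparison workload exceeds the
    # per-node average, then assign shards greedily to the least-loaded node.
    # Comparisons between shards of a split block are the "lost comparisons"
    # that Round 2 must recover.
    workload = {k: v * (v - 1) // 2 for k, v in blocks.items()}
    avg = sum(workload.values()) / n_nodes

    shards = []
    for key, count in blocks.items():
        if workload[key] > avg:
            # Minimum number of shards for each to drop below the average.
            parts = math.ceil(workload[key] / avg)
            shards.extend((key, count // parts) for _ in range(parts))
        else:
            shards.append((key, count))

    loads = [0.0] * n_nodes
    assignment = [[] for _ in range(n_nodes)]
    for key, size in sorted(shards, key=lambda s: -s[1]):
        node = loads.index(min(loads))  # largest shard to least-loaded node
        assignment[node].append((key, size))
        loads[node] += size * (size - 1) / 2
    return assignment, loads
```

On the skewed example from the Hash Partitioner slide, the dominant block is split four ways and no node carries the whole 4.5 million comparisons, at the price of deferring the cross-shard pairs to Round 2.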
Round 2 Acts only on the lost comparisons Because its input is indistinct, a third round of deduplication may be needed Methodology
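A sketch of what Rounds 2 and 3 owe each other, under the assumption that "lost comparisons" means the cross-shard pairs of blocks split in Round 1:

```python
def lost_pairs(shards):
    # Round 2: the comparisons lost in Round 1 are exactly the pairs that
    # span two different shards of the same split block.
    pairs = []
    for i in range(len(shards)):
        for j in range(i + 1, len(shards)):
            for a in shards[i]:
                for b in shards[j]:
                    pairs.append((a, b))
    return pairs

def dedupe(pairs):
    # Round 3: because Round 2's input is indistinct, the same pair can be
    # emitted more than once; normalise the order and keep one copy.
    return sorted({tuple(sorted(p)) for p in pairs})
```

Normalising each pair's order before deduplicating is what catches cross duplicates such as (a, b) and (b, a) arriving from different nodes.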
Outline Introduction Related Work Methodology Evaluation Conclusion
Performance Metrics • Performance evaluated as absolute runtime, speedup & scaleup on a shared cluster • "It's what users care about" • Representative of real operations Evaluation
Input Records <rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9> 10 million records, 0.9 million original, 0.1 million duplicate, up to 9 duplicates per record, 1 modification per field, 1 modification per record; duplicates follow a Poisson distribution Methodology
Data sets • Synthetic data produced with Febrl data generator • Artificially skewed distribution Methodology
Utilization (cluster utilization charts; annotated regions A, B, C) Evaluation
Round 2 (chart: node utilization across jobs J1–J6, ranging from 50–100%) Evaluation
Results so far… (runtime comparison chart) • RL Workflow runtime • Similar to the Hash-based runtime on small datasets • Better as the size of the dataset grows Evaluation
Conclusion • Parallelism is a step in the right direction for record linkage • Complementary to existing approaches • Hadoop can be tailored for Record Linkage tasks • The "Record Linkage" Partitioner / Workflow is just one example of possible improvements Conclusion