Record Linkage in a Distributed Environment Literature Review
Contents • Record linkage • Runtime reduction techniques • Blocking • Canopies • Sorted Neighborhood • Shift to parallel computing • Research directions
Record Linkage Problem • Determining if pairs of records refer to the same entity • E.g. Distinguishing between data belonging to… Yipeng, the NUS student and Yipeng, the son of PM Lee
Record Linkage Applications • Dedup Two Lists: O(M*N) • Dedup Single List: O(N²)
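To make the two costs concrete, here is a minimal sketch that simply enumerates the candidate pairs in each case. The function names and the sample lists are illustrative assumptions, not part of the original slides.

```python
from itertools import combinations, product

def pairs_two_lists(list_a, list_b):
    """Every record in list_a against every record in list_b: M*N pairs."""
    return list(product(list_a, list_b))

def pairs_single_list(records):
    """Every unordered pair within one list: N*(N-1)/2 pairs, i.e. O(N^2)."""
    return list(combinations(records, 2))

# Hypothetical sample data
list_a = ["Amanda", "David"]            # M = 2
list_b = ["Amanda", "Daniel", "David"]  # N = 3

two_list_pairs = pairs_two_lists(list_a, list_b)
single_list_pairs = pairs_single_list(list_b)

print(len(two_list_pairs))     # M*N comparisons
print(len(single_list_pairs))  # N*(N-1)/2 comparisons
```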
Dealing with Large Data • Pairwise comparison increasingly expensive • Blocking techniques • Reduce the search space (Figure: example records Amanda, Amanda, David, Daniel)
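A minimal sketch of blocking, assuming the blocking key is the first letter of the name (the key choice and helper names are assumptions for illustration). Only records that share a block are compared, so the four example records produce 2 candidate pairs instead of 6.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key_fn):
    """Group records by blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_fn(record)].append(record)
    return blocks

def candidate_pairs(blocks):
    """Yield pairs only from within each block."""
    for members in blocks.values():
        yield from combinations(members, 2)

records = ["Amanda", "Amanda", "David", "Daniel"]

# Assumed blocking key: first letter of the name
blocks = block_by_key(records, key_fn=lambda name: name[0])
pairs = list(candidate_pairs(blocks))
print(pairs)
```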
Sorted Neighborhood • Comparison window: 2w−1
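A minimal sketch of the sorted neighborhood idea: sort on a key, then compare each record only with its near neighbours in sorted order. Here each record is paired with the next w−1 records, so it meets up to w−1 neighbours on each side. Reading the slide's 2w−1 as the span of that neighbourhood (the record plus w−1 on each side) is an assumption, as are the function name and the sort key.

```python
def sorted_neighborhood_pairs(records, key_fn, w):
    """Sort by key, then pair each record with the next w-1 records."""
    ordered = sorted(records, key=key_fn)
    pairs = []
    for i, left in enumerate(ordered):
        # Slice stops at the end of the list, so the tail is handled safely
        for right in ordered[i + 1 : i + w]:
            pairs.append((left, right))
    return pairs

records = ["David", "Amanda", "Daniel", "Amanda"]

# Assumed sort key: the lower-cased name
pairs = sorted_neighborhood_pairs(records, key_fn=str.lower, w=2)
print(pairs)
```

Unlike strict blocking, neighbouring records with different first letters can still be compared, as long as they end up adjacent after sorting.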
Dealing with Large Data • Pairwise comparison increasingly expensive • Blocking techniques • Reduce the search space • Limitations • Single node computation • Localized data source • Conflicting in function (Figure: example records Amanda, Amanda, David, Daniel)
Shift to Parallel Computing • Multi-node computation • Data source flexibility • Complementary to blocking methods • Frontrunners: • P-Febrl (P Christen 2003) • P-Swoosh (H Kawai 2006) • Parallel Linkage (H Kim 2007)
Parallel Record Linkage Contributions • Peter Christen • Parallelized Febrl with MPI • Linear speedup but did not scale up well • Hideki Kawai • Designed P-Swoosh in a simulated environment • Match-based parallelism • 2x speedup with use of domain knowledge
Parallel Record Linkage Contributions • Hung-sik Kim, Dongwon Lee • Explored parallel record linkage for different input cases in MATLAB • Consistent Speedup • Not validated with very large datasets
MapReduce and Hadoop • Handles system level concerns… • E.g. Data distribution, fault tolerance, dynamic load balancing, portability and scalability • Convenient model for scaling record linkage • Better scaleup on pairwise comparisons (T Elsayed 2008) • Runtime increased linearly with dataset (R Vernica 2010)
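To show why the model fits record linkage, here is a single-process simulation of the map, shuffle and reduce phases in plain Python. It does not use the Hadoop API. The blocking key (first letter) and the exact-equality match test are placeholder assumptions, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(record):
    """Emit (blocking key, record). Assumed key: first letter, lower-cased."""
    yield record[0].lower(), record

def shuffle(mapped):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Compare all pairs inside one block and emit the matching ones."""
    for a, b in combinations(values, 2):
        if a == b:  # placeholder for a real similarity function
            yield (a, b)

def run_job(records):
    mapped = (kv for record in records for kv in map_phase(record))
    groups = shuffle(mapped)
    return [m for key, values in groups.items() for m in reduce_phase(key, values)]

matches = run_job(["Amanda", "Amanda", "David", "Daniel"])
print(matches)
```

Each reduce call touches only one block, so blocks can be processed on different nodes without coordination.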
Research Directions • Tailoring Hadoop for record linkage problems • E.g. Bin packing blocks of different sizes • Experimenting with different problem types • E.g. Bipartite data centers • Adapting existing parallel clustering algorithms onto the MapReduce model
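One way to picture the bin packing direction is a greedy heuristic that hands the most expensive remaining block to the least loaded worker. This is a sketch of a generic load-balancing heuristic, not a method taken from the cited work. The cost model (pairwise comparisons within a block), the block sizes and the function names are all assumptions.

```python
import heapq

def comparison_cost(block_size):
    """Assumed cost model: number of pairwise comparisons in a block."""
    return block_size * (block_size - 1) // 2

def assign_blocks(block_sizes, num_workers):
    """Greedy assignment: largest cost first, always to the least loaded worker."""
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(num_workers)}
    by_cost = sorted(block_sizes.items(),
                     key=lambda kv: comparison_cost(kv[1]),
                     reverse=True)
    for name, size in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + comparison_cost(size), worker))
    loads = {worker: sum(comparison_cost(block_sizes[b]) for b in blocks)
             for worker, blocks in assignment.items()}
    return assignment, loads

# Hypothetical block sizes (records per block)
block_sizes = {"A": 10, "B": 6, "C": 5, "D": 4}
assignment, loads = assign_blocks(block_sizes, num_workers=2)
print(assignment, loads)
```

Because cost grows quadratically with block size, one large block can dominate a worker's load, which is why uneven block sizes matter here.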
Conclusions • Parallelism is a step in the right direction • Complementary to existing approaches • Consistent with the object orientation • But… • Parallel design and implementation are difficult • MapReduce is a viable solution