This work discusses the problem of managing probabilistic duplicates in databases and proposes a data fusion solution based on data dependency and information gain. The motivation behind this research is to solve real-life problems related to data fusion, such as passport inspection and news verification. The proposed solution handles conflicts and inconsistencies during data fusion by combining conflict handling strategies with unified detectors.
Managing Probabilistic Duplicates in Databases
Mohamed M. Hafez
Supervisors: Prof. Osman Hegazy Mohamed, Prof. Ali Hamid El-Bastawissy
Agenda • Introduction • Motivation • Data Conflict Problem • Data Fusion Existing Solutions • Proposed Solution • Data Fusion based on Data Dependency • Data Fusion based on Information Gain • Evaluation & Results • Conclusion & Future Work
Introduction: Data Integration Components [architecture figure: the user interface issues a fusion query against the global unified schema; the data integration system (mediator) maps it onto the local schemas of the underlying sources; cleansing and fusion return a consistent and unambiguous answer]
Introduction (cont.): Common Three Levels of Inconsistencies [figure: Step 1: Data Sources; Step 2: Schema Matching; Step 3: Duplicate Detection; then Data Fusion, up to the Application]
Motivation • Data is now everywhere, often in multiple versions (old vs. new). • Different social networks provide different kinds of data. • We focus on English textual data. • How can data fusion solve real-life problems? Examples: passport inspection, news verification, and building a knowledge-base network.
Motivation (cont.) What are the JobTitle, Email, and City of Mohamed Hafez?
Data Conflict Problem
Select JobTitle, Email, City From GS.Employee Where Name like 'Mohamed%Hafez'
Which one to choose?
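To make the conflict concrete, here is a minimal Python sketch of the kind of contradictory answers the mediator collects for one real-world person. The source names FCI, Banks, and PrivateSector come from later slides, but the attribute values are illustrative assumptions, not the thesis data.

    # Hypothetical conflicting answers for one person, gathered from three sources.
    # Source names are from the slides; all attribute values are illustrative only.
    candidates = [
        {"source": "FCI",           "JobTitle": "Assistant Lecturer", "Email": "m.hafez@fci.example",  "City": "Cairo"},
        {"source": "Banks",         "JobTitle": "Developer",          "Email": "mhafez@bank.example",  "City": "Giza"},
        {"source": "PrivateSector", "JobTitle": "Senior Developer",   "Email": "mohamed@corp.example", "City": "Cairo"},
    ]

    # The fusion problem: for each requested attribute, decide which value to return.
    for attr in ("JobTitle", "Email", "City"):
        values = {c[attr] for c in candidates}
        print(attr, "->", "conflict" if len(values) > 1 else "consistent", values)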
Data Fusion Existing Solutions: Classification of the Conflict Handling Strategies [figure residue: a sample relation (SSN, Name, Address) containing NULL values, and existing fusion systems such as Fusionplex]
Proposed Solution (cont.) Within the classification of the conflict handling strategies, the proposed techniques provide full automation with no user intervention. Assumptions: non-federated data sources; no duplicates within the individual data sources.
Data Fusion based on Data Dependency
Select JobTitle, Email From GS.Employee
Two scores are used: GAS and LAS.
Data Fusion based on Data Dependency (cont.) 1st Score (GAS)
Data Fusion based on Data Dependency (cont.) 1st Score (GAS): the only one with no conflict
Data Fusion based on Data Dependency (cont.) Local Detectors: the lowest dependent attribute, in addition to the intersection of the attributes that appear with a dependency score greater than zero in all of the contributing dependency lists. Example local detectors: FCI {Department, City}, Banks {Employer, City}, PrivateSector {EmployerName, City}. A small sketch of this computation follows.
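A minimal sketch of how one source's local detectors could be assembled under this rule, assuming dependency scores between attributes have already been computed; all scores and dependency lists below are placeholder values, not thesis measurements.

    # Hypothetical dependency lists for one source (FCI): for each query-requested
    # attribute, the attributes it depends on, with a dependency score in [0, 1].
    # All scores are placeholders for illustration.
    dependency_lists = {
        "JobTitle": {"Department": 0.9, "City": 0.4, "Name": 0.0},
        "Email":    {"Department": 0.7, "City": 0.3, "Name": 0.0},
    }

    # Attributes that appear with score > 0 in every contributing dependency list.
    common = set.intersection(
        *({a for a, s in deps.items() if s > 0} for deps in dependency_lists.values())
    )

    # The lowest dependent attribute (smallest positive score over all lists).
    positive = [(s, a) for deps in dependency_lists.values() for a, s in deps.items() if s > 0]
    lowest = min(positive)[1]

    local_detectors = common | {lowest}
    print(local_detectors)   # e.g. {'Department', 'City'} for this toy input

For these toy numbers the result coincides with the FCI detector set {Department, City} shown on the slide.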
Data Fusion based on Data Dependency (cont.) Unified Detectors: we define the unified detectors set, over all contributing data sources, as the intersection of all the sets of local detectors. Local detectors: FCI {Department, City}, Banks {Employer, City}, PrivateSector {EmployerName, City}. Unified detectors: {City}. A sketch follows.
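The unified detectors then follow directly from the slide's definition as a set intersection; this short sketch uses the local detector sets exactly as listed above.

    # Local detector sets per source, as listed on the slide.
    local_detectors = {
        "FCI":           {"Department", "City"},
        "Banks":         {"Employer", "City"},
        "PrivateSector": {"EmployerName", "City"},
    }

    # Unified detectors: attributes common to every source's local detector set.
    unified_detectors = set.intersection(*local_detectors.values())
    print(unified_detectors)   # {'City'}, matching the slide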
Data Fusion based on Data Dependency (cont.) 2nd Score (LAS)
Data Fusion based on Data Dependency (cont.) 1st Score (GAS), 2nd Score (LAS), and the Combined Score (TAS)
Data Fusion based on Information Gain. In the literature, information gain has been used in signal processing (compression, source and channel coding). City or Department: which attribute gives less access to the JobTitle values? Department >> City. The lower the information gain, the better the partitioning is as a detector for the target attribute.
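For reference, the standard underlying quantities, written here as a sketch (the thesis may define and normalize its gain score differently): the entropy of a target attribute T, the entropy remaining after partitioning the records by a candidate attribute A, and their difference.

    H(T) = -\sum_{t} p(t) \log_2 p(t)
    H(T \mid A) = \sum_{a} \frac{|T_a|}{|T|} H(T_a)
    IG(T, A) = H(T) - H(T \mid A)

where T_a is the subset of records with A = a.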
Data Fusion based on Information Gain (cont.) Example JobTitle distribution: "Teaching Assistant" (3 times), "Assistant Lecturer" (3 times), "Associate Professor" (3 times), and "Professor" (1 time).
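Plugging the slide's JobTitle counts (3, 3, 3, 1 over 10 records) into the entropy formula gives the base entropy; the short Python check below is my own calculation, not a figure taken from the thesis.

    import math

    # JobTitle counts from the slide: 3 + 3 + 3 + 1 = 10 records.
    counts = {"Teaching Assistant": 3, "Assistant Lecturer": 3, "Associate Professor": 3, "Professor": 1}
    total = sum(counts.values())

    # Base entropy H(JobTitle) = -sum p * log2(p)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    print(round(entropy, 3))   # ~1.895 bits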
Data Fusion based on Information Gain (cont.) The lower the information gain, the better the splitter is as a detector for the split attribute.
Data Fusion based on Information Gain (cont.) Local Detectors: the attribute with the lowest information gain in each of the query-requested attributes' detectors lists, in addition to the intersection of all the contributing query-requested attributes' detectors lists, restricted to attributes whose information gain is not equal to the basic entropy of the requested attribute. Local detectors: FCI {City, Department, Name}, Banks {City, Profession}, PrivateSector {City}. Unified detectors: {City}. A sketch of this selection rule follows.
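A sketch of that selection rule in Python, assuming per-source information-gain scores have already been computed for every candidate attribute; the numbers below are placeholders, not thesis measurements, and any tie-breaking the thesis applies is not shown on the slide.

    # Hypothetical per-source scores: for each query-requested attribute, the
    # information gain of every candidate detector attribute, plus the base
    # entropy of the requested attribute itself. All numbers are placeholders.
    base_entropy = {"JobTitle": 1.895, "Email": 2.100}
    gains = {
        "JobTitle": {"City": 0.40, "Department": 0.30, "Name": 1.895},
        "Email":    {"City": 0.55, "Department": 0.90, "Name": 2.100},
    }

    # Candidates whose gain equals the base entropy of the requested attribute
    # carry no information about it and are dropped from its detectors list.
    lists = {t: {a: g for a, g in c.items() if g != base_entropy[t]} for t, c in gains.items()}

    # Local detectors: the lowest-gain attribute of each list, plus the
    # attributes common to all of the lists.
    lowest = {min(l, key=l.get) for l in lists.values()}
    common = set.intersection(*(set(l) for l in lists.values()))
    local_detectors = lowest | common
    print(local_detectors)   # e.g. {'City', 'Department'} for these toy numbers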
Data Fusion based on Information Gain (cont.) 2nd Score (DGS)
Data Fusion based on Information Gain (cont.) 1st Score (GAS), 2nd Score (DGS), and the Combined Score (TAPS)
Evaluation & Results Simulation Environment • Data sources: 2, 3, 5, 10, 20 • No. of attributes: 3, 5, 7, 10, 50 • No. of distinct values: [5,15], [45,55], [95,105], [495,505], [995,1005] • No. of records: 50, 100, 300, 500, 1000 • No. of runs for each parameter set: 10 • No. of simulation input parameter sets: 5 * 5 * 5 * 5 = 625 • Total number of simulation runs: 625 * 10 = 6,250 • Technology used: Microsoft Visual Studio .NET (C#), Microsoft SQL Server
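The 625 parameter sets are simply the Cartesian product of the four parameter lists; a small sketch of how such a grid can be enumerated (the simulator itself is the thesis's C#/SQL Server implementation and is not reproduced here).

    from itertools import product

    # Simulation input parameters as listed on the slide.
    no_ds   = [2, 3, 5, 10, 20]                                          # number of data sources
    no_att  = [3, 5, 7, 10, 50]                                          # number of attributes
    no_dval = [(5, 15), (45, 55), (95, 105), (495, 505), (995, 1005)]    # distinct-value ranges
    no_rec  = [50, 100, 300, 500, 1000]                                  # number of records

    parameter_sets = list(product(no_ds, no_att, no_dval, no_rec))
    print(len(parameter_sets))        # 625 parameter sets
    print(len(parameter_sets) * 10)   # 6250 simulation runs (10 runs each)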
Evaluation & Results (cont.) Partial Matching: cases when Data Dependency is better than Information Gain [charts of superior successful parameter sets]
Evaluation & Results (cont.) Partial Matching: cases when Information Gain is better than Data Dependency [charts of superior successful parameter sets]
Evaluation & Results (cont.) Partial Matching: cases when both techniques are successful and give the same result, for No_ATT = 50 and No_DS = 2 [chart of superior perfect successful parameter sets over No_REC in {50, 100, 300, 500, 1000} and No_DVAL in {[5,15], [45,55], [95,105], [495,505], [995,1005]}]
Evaluation & Results (cont.) Partial Matching: cases when both techniques are successful and give the same result, for No_ATT = 50 and No_DS = 3 [chart over No_REC in {50, 100, 500} and the same No_DVAL ranges]
Evaluation & Results (cont.) Partial Matching: cases when both techniques are successful and give the same result, for No_ATT = 50, No_DS in {5, 10}, No_REC in {50, 100}, and No_DVAL = [5,15]
Evaluation & Results (cont.) Partial Matching: cases when both techniques are unsuccessful [chart of total superior failure parameter sets]
Evaluation & Results (cont.) Full Matching: cases when Data Dependency is better than Information Gain
Evaluation & Results (cont.) Full Matching: cases when Information Gain is better than Data Dependency [chart of Information Gain superior successful parameter sets with DD_Full_MATCH > 0%]
Evaluation & Results (cont.) Full Matching: cases when both techniques are successful and give the same result
Evaluation & Results (cont.) Full Matching: cases when both techniques are unsuccessful
Conclusion & Future Work Conclusion • The larger the number of attributes (No_ATT) in each data source, the higher the partial matching scores for both techniques; conversely, a lower No_ATT gives lower scores. • Increasing the number of records (No_REC), the number of data sources (No_DS), and the number of distinct values (No_DVAL) leads to lower matching for both techniques. • Data Dependency performs better when the average dependency between the attributes' data (AvgDataDep) is very high. It also scores high matching when the ratio between No_REC and No_DVAL is high, which means the value pairs have a higher probability of repeating.
Conclusion & Future Work Conclusion (cont.) • Information Gain performs better when AvgDataDep is relatively low and the ratio between No_REC and No_DVAL is relatively low, because Information Gain takes any pair into account even if it appears only once, and is based on partitioning rather than dependency. • Both techniques behave the same when AvgDataDep is in the middle (neither low nor high) and the ratio is also in the middle. Both techniques also score very high matching, and sometimes 100% partial matching, when all simulation input parameters are at their minimum values. • Both techniques fail when the ratio between No_REC and No_DVAL is extreme, i.e. No_DVAL is very large and No_REC is small; such a ratio makes it very difficult for either technique to achieve good matching results.