Deduplication & Fusion

Robert Ventura Simon, rventura@ac.upc.edu

Presentation Transcript


  1. Deduplication & Fusion • Robert Ventura Simon • rventura@ac.upc.edu

  2. Index • Introduction • Process • Successful stories • Architecture • Demo

  3. Index • Introduction • Process • Successful stories • Architecture • Demo

  4. Introduction: Benefits • Identification of suspected duplicate records inside a database • Merging of data from several databases with different formats, detecting duplicate records • Validation tools for the detected similarities

  5. Introduction: Deduplication

  6. Introduction: Deduplication • Configuration • Automatic execution • Validation of results • Personalized export

  7. Introduction: Deduplication • Configuration • Automatic execution • Validation of results • Personalized export

  8. Introduction: Fusion

  9. Introduction: Fusion • Configuration • Automatic execution • Validation of results • Personalized export

  10. Introduction: Fusion • Configuration • Automatic execution • Validation of results • Personalized export

  11. Introduction: Features

  12. Index • Introduction • Process • Deduplication • Fusion • Successful stories • Architecture • Demo

  13. Deduplication: Configurations • Input data file format: CSV • Select the relevant columns for linking records • Assign a type to each column so that the most suitable automatic filters can be applied (Process diagram: CSV → Configurations → Execution → Validation → Exportation → CSV)
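
The configuration step above can be pictured with a minimal Python sketch. It assumes a hypothetical column configuration (`COLUMN_TYPES`, with invented type names `text` and `number`) and is not the tool's actual configuration format; it only shows the idea of keeping the selected columns and casting each one by its assigned type.

```python
import csv
import io

# Hypothetical configuration: the columns used for linking and their types.
# Column names and type names are assumptions made for this sketch.
COLUMN_TYPES = {"name": "text", "surname1": "text", "birth_year": "number"}

def load_records(csv_text, column_types):
    """Parse CSV text, keep only the configured columns, cast by type."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for column, kind in column_types.items():
            raw = (row.get(column) or "").strip()
            if raw == "":
                record[column] = None        # missing value
            elif kind == "number":
                record[column] = float(raw)
            else:
                record[column] = raw
        records.append(record)
    return records

sample = "id,name,surname1,birth_year\n1,Pepe,Garcia,1980\n2,Jose,Garcia,\n"
print(load_records(sample, COLUMN_TYPES))
```

Columns that are not listed in the configuration (here `id`) are simply ignored when linking.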

  14. Deduplication: Configurations • Comparison type: exact value, text estimation, numerical estimation • Percentage weight of each column in the similarity computation (example: 100% = 30% + 35% + 35%)
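
As an illustration of how per-column weights could combine into one similarity score, here is a sketch in Python. The three comparison functions are stand-ins chosen for this example (`difflib` for text, a relative difference for numbers); the slides do not say which measures the tool actually uses, and the column names are hypothetical.

```python
from difflib import SequenceMatcher

def exact(a, b):
    return 1.0 if a == b else 0.0

def text_estimate(a, b):
    # difflib's ratio is a stand-in for the tool's text measure (assumption).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def numeric_estimate(a, b):
    # 1.0 for equal numbers, decreasing with the relative difference.
    biggest = max(abs(a), abs(b))
    return 1.0 if biggest == 0 else max(0.0, 1.0 - abs(a - b) / biggest)

COMPARATORS = {"exact": exact, "text": text_estimate, "number": numeric_estimate}

def record_similarity(rec_a, rec_b, config):
    """config maps column -> (comparison type, weight); weights sum to 1."""
    total = sum(weight for _, weight in config.values())
    assert abs(total - 1.0) < 1e-9, "column weights must add up to 100%"
    return sum(
        weight * COMPARATORS[kind](rec_a[col], rec_b[col])
        for col, (kind, weight) in config.items()
    )

# The 30% / 35% / 35% split from the slide, on hypothetical columns.
config = {
    "name": ("text", 0.30),
    "surname1": ("exact", 0.35),
    "birth_year": ("number", 0.35),
}
a = {"name": "Jose", "surname1": "Garcia", "birth_year": 1980}
b = {"name": "Jose", "surname1": "Lopez", "birth_year": 1980}
print(record_similarity(a, b, config))
```

Here the two records agree on name and birth year but not on surname, so only the name and birth-year weights contribute to the score.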

  15. Deduplication: Configurations • Use filters to normalize values • Automatic and specific filters are available for values such as names, dates, addresses, etc. (Diagram label: filters applied)

  16. Deduplication: Configurations • Editing of filters (create new filters, delete or update existing ones) • Use of dictionaries: name-converter dictionary (e.g. Pepe → Jose)
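
A normalization filter chain with a name-converter dictionary might look like the following Python sketch. The individual filters (lowercasing, accent stripping, space collapsing) and the direction of the dictionary entry (Pepe converted to Jose) are assumptions for illustration, not the tool's documented filter set.

```python
import unicodedata

# Hypothetical name-converter dictionary, using the slide's example pair.
NAME_DICTIONARY = {"pepe": "jose"}

def strip_accents(value):
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def collapse_spaces(value):
    return " ".join(value.split())

def apply_dictionary(value, dictionary):
    # Replace each word that has a dictionary entry; keep the others.
    return " ".join(dictionary.get(word, word) for word in value.split())

def normalize_name(value):
    """Apply the filters in order: lowercase, accents, spaces, dictionary."""
    value = value.lower()
    value = strip_accents(value)
    value = collapse_spaces(value)
    return apply_dictionary(value, NAME_DICTIONARY)

print(normalize_name("  Pepe   García "))
```

Running the filters before comparison means that two spellings of the same name reach the similarity computation as identical strings.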

  17. Deduplication: Configurations • Similarity computation algorithm called Record Linkage. Parameters: • Size of the sliding window: the number of records each record is compared to • Sorting columns: the columns by which records are ordered • Similarity acceptance threshold
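
The three parameters above can be sketched in Python as a sorted sliding-window comparison. This is a minimal sketch under stated assumptions: the `similarity` function is a stand-in (mean `difflib` ratio), the record fields are hypothetical, and the tool's real implementation is not shown in the slides.

```python
from difflib import SequenceMatcher

def similarity(rec_a, rec_b, columns):
    """Stand-in score: mean difflib ratio over the compared columns."""
    ratios = [
        SequenceMatcher(None, rec_a[c].lower(), rec_b[c].lower()).ratio()
        for c in columns
    ]
    return sum(ratios) / len(ratios)

def sliding_window_pairs(records, sort_column, window, threshold, columns):
    """Sort, then compare each record with the next `window` records only."""
    ordered = sorted(records, key=lambda r: r[sort_column].lower())
    matches = []
    for i, current in enumerate(ordered):
        for other in ordered[i + 1 : i + 1 + window]:
            score = similarity(current, other, columns)
            if score >= threshold:
                matches.append((current["id"], other["id"], round(score, 2)))
    return matches

records = [
    {"id": 1, "name": "Jose", "surname1": "Garcia"},
    {"id": 2, "name": "Anna", "surname1": "Puig"},
    {"id": 3, "name": "Jose", "surname1": "Garcia"},
    {"id": 4, "name": "Marc", "surname1": "Soler"},
]
print(sliding_window_pairs(records, "surname1", 2, 0.5, ["name", "surname1"]))
```

Sorting brings likely duplicates next to each other, so a small window keeps the number of comparisons roughly proportional to the number of records instead of comparing every pair.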

  18. Deduplication: Execution • Order by Surname 1 • Sliding window = 2

  19. Deduplication: Execution • Similarities detected • Window = 2 • Similarity degree

  20. Deduplication: Execution • Similarities detected • Window = 2 • Similarity degree

  21. Deduplication: Execution • List of detected similarities

  22. Deduplication: Execution • List of detected similarities with a percentage greater than the 50% threshold

  23. Deduplication: Validation • Validation of results (only those above the threshold) • View by similarity or by group • Bulk validation • Validation can be shared between several supervisors
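
One plausible reading of the "by group" view is that matched pairs are merged into groups of records that all refer to the same entity. The Python sketch below shows that idea with a small union-find; it is an assumption about what the view contains, not a description of the tool's behaviour.

```python
def group_matches(pairs):
    """Merge matched pairs into groups with a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for member in parent:
        groups.setdefault(find(member), []).append(member)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical record ids: 1~3 and 3~7 end up in one group.
print(group_matches([(1, 3), (3, 7), (2, 5)]))
```

A supervisor could then validate a whole group in one step instead of confirming each pair separately.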

  24. Deduplication: Exportation • Select output format

  25. Index • Introduction • Process • Deduplication • Fusion • Successful stories • Architecture • Demo

  26. Fusion: Configurations • Input data file format: CSV • Select the relevant columns for linking records • Define the relation between columns from different data sources (only when merging) • Assign a type to each column so that the most suitable automatic filters can be applied
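
The relation between columns of different sources can be sketched as a rename into a shared schema before comparison. In this Python sketch the source names, the column names and the `COLUMN_MAP` structure are all hypothetical.

```python
# Hypothetical mapping from each source's column names to shared names.
COLUMN_MAP = {
    "source_a": {"nombre": "name", "ciudad": "city"},
    "source_b": {"full_name": "name", "town": "city"},
}

def to_common_schema(record, source):
    """Rename a record's columns to the shared names, dropping the rest."""
    mapping = COLUMN_MAP[source]
    common = {target: record.get(original) for original, target in mapping.items()}
    common["source"] = source    # remember where the record came from
    return common

row_a = {"nombre": "Bar Pepe", "ciudad": "BCN", "telefono": "555"}
row_b = {"full_name": "Bar Pepe", "town": "BARCELONA"}
print(to_common_schema(row_a, "source_a"))
print(to_common_schema(row_b, "source_b"))
```

Once both sources share column names, the same comparison and weighting steps used for deduplication can run over the combined records.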

  27. Fusion: Configurations • Comparison type: exact value, text estimation, numerical estimation • Percentage weight of each column in the similarity computation (example: 100% = 80% + 20%)

  28. Fusion: Configurations • Specific percentage for records with null-valued columns • Use filters to standardize values • Automatic and specific filters are available for values such as names, dates, addresses, etc.
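
The slide does not say how the percentage for null-valued columns enters the computation. One possible interpretation, sketched here in Python, is that a column with a missing value contributes a fixed, configurable score instead of a computed one. The `null_score` parameter, the exact-match comparison and the column names are assumptions.

```python
def exact(a, b):
    return 1.0 if a == b else 0.0

def column_score(value_a, value_b, compare, null_score=0.5):
    """Score one column; use a fixed score when either value is missing."""
    if value_a is None or value_b is None:
        return null_score
    return compare(value_a, value_b)

def record_score(rec_a, rec_b, weights, null_score=0.5):
    """Weighted sum over columns; `weights` maps column -> weight."""
    return sum(
        weight * column_score(rec_a.get(col), rec_b.get(col), exact, null_score)
        for col, weight in weights.items()
    )

weights = {"name": 0.8, "city": 0.2}    # an 80% / 20% split, as on slide 27
a = {"name": "Bar Pepe", "city": "BARCELONA"}
b = {"name": "Bar Pepe", "city": None}
print(record_score(a, b, weights))
```

With this reading, a missing city neither counts as a full match nor as a full mismatch, so incomplete records can still pass the threshold when the remaining columns agree.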

  29. Fusion: Configurations • Edit filters (create new filters, delete or update existing ones) • Use of dictionaries: name-converter dictionary (e.g. BCN → BARCELONA)

  30. Fusion: Configurations • Similarity computation algorithm called Record Linkage. Parameters: • Size of the sliding window: the number of records each record is compared to • Sorting columns: the columns by which records are ordered • Similarity acceptance threshold

  31. Fusion: Execution • Order by City • Sliding window = 2

  32. Fusion: Execution • Similarities detected • Window = 2 • Similarity degree

  33. Fusion: Execution • Similarities detected • Window = 2 • Similarity degree

  34. Fusion: Execution • List of detected similarities

  35. Fusion: Execution • List of detected similarities with a percentage greater than the 50% threshold

  36. Fusion: Validation • Validation of results (only those above the threshold) • View by similarity or by group • Bulk validation • Validation can be shared between several supervisors

  37. Fusion: Exportation • Output format • Select the values to keep for every similarity
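
Selecting values for a similarity can be pictured as building one output record from a matched pair, column by column. The Python sketch below assumes a hypothetical `choices` structure (column mapped to "a" or "b") and a default of taking the first non-missing value; the tool's actual export options are not described in the slides.

```python
def merge_pair(rec_a, rec_b, choices=None):
    """Build one record; `choices` maps column -> "a" or "b"."""
    choices = choices or {}
    merged = {}
    for column in sorted(set(rec_a) | set(rec_b)):
        value_a, value_b = rec_a.get(column), rec_b.get(column)
        pick = choices.get(column)
        if pick == "a":
            merged[column] = value_a
        elif pick == "b":
            merged[column] = value_b
        else:
            # Default (assumption): keep the first non-missing value.
            merged[column] = value_a if value_a not in (None, "") else value_b
    return merged

a = {"name": "Bar Pepe", "city": "BCN", "phone": None}
b = {"name": "Bar Pepe S.L.", "city": "BARCELONA", "phone": "555"}
print(merge_pair(a, b, choices={"city": "b"}))
```

In this example the city is taken from the second record by explicit choice, the phone falls back to the only available value, and the name defaults to the first record.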

  38. Index • Introduction • Process • Successful stories • Architecture • Demo

  39. Successful stories: Health Service • Who? Health Service • Objective: Detect repeated health ID cards • Solution: Detect repeated records in the database and delete them (deduplication with DAURUM) • Result: Health ID card database cleaned of repetitions

  40. Successful stories: Beer manufacturer • Who? Beer manufacturer • Objective: Detect dealers that deliver to centers not previously assigned to them • Solution: Identify duplicates in each dealer's delivery database and delete them (deduplication with DAURUM); detect deliveries to centers shared between different dealers (fusion with DAURUM) • Result: Master database clean of repetitions, and detection of dealers with wrong deliveries

  41. Index • Introduction • Process • Successful stories • Architecture • Demo

  42. Architecture • Struts 2: Model-View-Controller • Hibernate: Database manipulation

  43. Index • Introduction • Process • Successful stories • Architecture • Demo

  44. Demo

  45. Thanks for your attention • Any questions? • SPARSITY-TECHNOLOGIES, Jordi Girona, 1-3, Edifici K2M, 08034 Barcelona, info@sparsity-technologies.com, http://www.sparsity-technologies.com • DAMA-UPC, Data Management (UPC), Departament d'Arquitectura de Computadors, Edifici C6-S103, Campus Nord, Jordi Girona, 1-3, 08034 Barcelona, www.dama.upc.edu