Deduplication & Fusion Robert Ventura Simon rventura@ac.upc.edu
Index • Introduction • Process • Successful stories • Architecture • Demo
Introduction Benefits • Identification of suspected duplicate records within a database • Merging of data from several databases with different formats, detecting duplicate records • Validation tools for the detected similarities
Introduction Deduplication • Configuration • Automatic execution • Validation of results • Personalized export
Introduction Fusion • Configuration • Automatic execution • Validation of results • Personalized export
Index • Introduction • Process • Deduplication • Fusion • Successful stories • Architecture • Demo
Deduplication Configurations • Input data file format: CSV • Select the relevant columns for linking records • Assign a type to each column so that the most suitable automatic filters can be applied • Pipeline: CSV → Configurations → Execution → Validation → Exportation → CSV
Deduplication Configurations • Comparison type: exact value, approximate text match, approximate numerical match • Weight of each column in the similarity computation, as a percentage; the weights add up to 100% (for example, 30% + 35% + 35%)
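One way to read the comparison types and column weights above is as a weighted sum of per-column scores. The following is only a minimal sketch under that assumption: the column names, the use of `difflib.SequenceMatcher` for text, and the relative-difference formula for numbers are illustrative choices, not the tool's actual implementation.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """Exact-value comparison: 1.0 if equal, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def text(a, b):
    """Approximate text comparison (assumed here: difflib's ratio in [0, 1])."""
    return SequenceMatcher(None, a, b).ratio()


def numeric(a, b):
    """Approximate numerical comparison (assumed here: 1 - relative difference)."""
    a, b = float(a), float(b)
    largest = max(abs(a), abs(b))
    return 1.0 if largest == 0 else max(0.0, 1.0 - abs(a - b) / largest)


# Hypothetical configuration: column -> (comparison function, weight).
# The weights mirror the 30% / 35% / 35% split and must add up to 1.0.
COLUMNS = {
    "name": (text, 0.30),
    "surname": (text, 0.35),
    "birth_year": (numeric, 0.35),
}


def record_similarity(r1, r2, columns=COLUMNS):
    """Weighted sum of the per-column similarities, in [0, 1]."""
    return sum(weight * compare(r1[col], r2[col])
               for col, (compare, weight) in columns.items())


if __name__ == "__main__":
    a = {"name": "Jose", "surname": "Garcia", "birth_year": "1980"}
    b = {"name": "Josep", "surname": "Garcia", "birth_year": "1981"}
    print(f"{record_similarity(a, b):.2%}")
```

Keeping the comparison function and the weight together per column makes it easy to change either one without touching the scoring loop.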
Deduplication Configurations • Use filters to normalize values • Automatic and type-specific filters are available for values such as names, dates, addresses, etc.
Deduplication Configurations • Editing of filters (create new filters, delete or update existing ones) • Use of dictionaries: name-converter dictionary (e.g., Pepe → Jose)
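The filters and dictionaries described above can be pictured as a chain of small functions applied to each value before comparison. This is a hedged sketch, not the product's filter set: the filter names, their order, and the dictionary contents (beyond the Pepe/Jose example) are assumptions.

```python
import unicodedata

# Hypothetical name-converter dictionary; only the Pepe -> Jose pair
# comes from the slides.
NAME_DICTIONARY = {"pepe": "jose"}


def strip_accents(value):
    """Remove diacritics so that 'José' and 'Jose' compare equal."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_whitespace(value):
    """Trim the value and collapse runs of whitespace into single spaces."""
    return " ".join(value.split())


def apply_dictionary(value, dictionary=NAME_DICTIONARY):
    """Replace each word that has a dictionary entry by its canonical form."""
    return " ".join(dictionary.get(word, word) for word in value.split())


# Assumed filter chain for a column of type "name"; order matters because
# the dictionary keys are lowercase and unaccented.
NAME_FILTERS = [str.lower, strip_accents, normalize_whitespace, apply_dictionary]


def apply_filters(value, filters=NAME_FILTERS):
    """Run the value through every filter in order."""
    for filter_function in filters:
        value = filter_function(value)
    return value


if __name__ == "__main__":
    print(apply_filters("  Pepe   García "))
```

Because each filter is a plain function, creating, deleting or updating a filter amounts to editing the list for the corresponding column type.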
Deduplication Configurations • Similarity computation algorithm called Record Linkage. Parameters: • Sliding window size: the number of records each record is compared to • Sorting columns: the columns used to order the records • Similarity acceptance threshold
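The three parameters above (window size, sorting columns, threshold) can be sketched as a sorted sliding-window comparison. This is a minimal illustration, assuming the window counts the records that follow each record in sort order and using a plain `difflib` average as the similarity; the column names are hypothetical and the real tool may differ.

```python
from difflib import SequenceMatcher


def similarity(r1, r2, columns):
    """Placeholder similarity: unweighted mean of difflib ratios per column."""
    ratios = [SequenceMatcher(None, r1[c], r2[c]).ratio() for c in columns]
    return sum(ratios) / len(ratios)


def sliding_window_pairs(records, sort_columns, window):
    """Yield index pairs (i, j): each record paired with the `window`
    records that follow it once the records are sorted by `sort_columns`."""
    order = sorted(range(len(records)),
                   key=lambda i: tuple(records[i][c] for c in sort_columns))
    for position, i in enumerate(order):
        for j in order[position + 1: position + 1 + window]:
            yield i, j


def find_similarities(records, sort_columns, compare_columns,
                      window=2, threshold=0.5):
    """Return (i, j, score) for every compared pair scoring above the threshold."""
    found = []
    for i, j in sliding_window_pairs(records, sort_columns, window):
        score = similarity(records[i], records[j], compare_columns)
        if score > threshold:
            found.append((i, j, score))
    return found


if __name__ == "__main__":
    people = [
        {"surname1": "Garcia", "name": "Jose"},
        {"surname1": "Lopez", "name": "Ana"},
        {"surname1": "Garcia", "name": "Jose"},
        {"surname1": "Martin", "name": "Eva"},
    ]
    for i, j, score in find_similarities(people, ["surname1"],
                                         ["surname1", "name"]):
        print(i, j, f"{score:.0%}")
```

Sorting first means likely duplicates end up next to each other, so only a few neighbours per record need to be compared instead of every possible pair.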
Deduplication Execution • Order by Surname 1 • Sliding window = 2
Deduplication Execution • Similarities detected • Window = 2 • Similarity degree
Deduplication Execution • List of detected similarities
Deduplication Execution • List of detected similarities with a similarity percentage above the 50% threshold
Deduplication Validation • Validation of results (only those above the threshold) • View by similarity or by group • Bulk validation • Validation can be shared among several supervisors
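Viewing results "by group" suggests that pairwise similarities are collected into groups of records that refer to the same entity. The slides do not say how groups are built; the sketch below assumes a group is a set of records connected by accepted similarities and builds it with a small union-find structure.

```python
def group_similarities(pairs):
    """Group record ids connected by accepted similarity pairs.

    `pairs` is an iterable of (id_a, id_b). Returns a list of sorted groups,
    themselves ordered by their smallest id. Assumption: two records belong
    to the same group if a chain of accepted similarities links them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted((sorted(group) for group in groups.values()),
                  key=lambda group: group[0])


if __name__ == "__main__":
    accepted = [(0, 2), (2, 5), (1, 3)]
    print(group_similarities(accepted))
```

A supervisor could then validate or reject a whole group at once instead of pair by pair, which is one plausible reading of bulk validation.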
Deduplication Exportation • Select output format
Index • Introduction • Process • Deduplication • Fusion • Successful stories • Architecture • Demo
Fusion Configurations • Input data file format: CSV • Select the relevant columns for linking records • Map columns between the different data sources (only when merging) • Assign a type to each column so that the most suitable automatic filters can be applied • Pipeline: CSV → Configurations → Execution → Validation → Exportation → CSV
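Relating columns from different data sources can be sketched as renaming each source's columns to a shared set of names while loading its CSV. The column names, the mapping and the `_source` marker below are all hypothetical; the sketch only illustrates the idea.

```python
import csv
import io

# Hypothetical mapping from the second source's column names to the
# shared names used for comparison.
COLUMN_MAP_B = {"nombre": "name", "ciudad": "city"}


def load_with_mapping(csv_text, column_map, source):
    """Parse CSV text and rename its columns to the shared names.

    Columns without an entry in `column_map` keep their original name.
    Each record is tagged with the source it came from.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {column_map.get(col, col): value for col, value in row.items()}
        record["_source"] = source
        records.append(record)
    return records


if __name__ == "__main__":
    source_a = "name,city\nBar Sol,Barcelona\n"
    source_b = "nombre,ciudad\nBar Sol,BCN\n"
    merged = (load_with_mapping(source_a, {}, "A")
              + load_with_mapping(source_b, COLUMN_MAP_B, "B"))
    for record in merged:
        print(record)
```

Once both sources share column names, the same filters and similarity computation used for deduplication can run over the combined records.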
Fusion Configurations • Comparison type: exact value, approximate text match, approximate numerical match • Weight of each column in the similarity computation, as a percentage; the weights add up to 100% (for example, 80% + 20%)
Fusion Configurations • Specific percentage for records with null-valued columns • Use filters to standardize values • Automatic and type-specific filters are available for values such as names, dates, addresses, etc.
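The slides do not spell out how the percentage for null-valued columns is used. One possible interpretation, sketched below, is that a column missing in either record contributes a fixed, configurable score instead of being compared. The `NULL_SCORE` value and the column names are assumptions; the 80% / 20% weights echo the example split.

```python
from difflib import SequenceMatcher

# Hypothetical setting: score given to a column when either value is null.
NULL_SCORE = 0.5


def column_similarity(a, b, null_score=NULL_SCORE):
    """Compare two values, falling back to `null_score` if either is empty."""
    if not a or not b:
        return null_score
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1, r2, weights, null_score=NULL_SCORE):
    """Weighted sum of column similarities; `weights` should add up to 1.0."""
    return sum(weight * column_similarity(r1.get(col), r2.get(col), null_score)
               for col, weight in weights.items())


if __name__ == "__main__":
    weights = {"name": 0.8, "city": 0.2}
    a = {"name": "Bar Sol", "city": "Barcelona"}
    b = {"name": "Bar Sol", "city": ""}
    print(f"{record_similarity(a, b, weights):.0%}")
```

A neutral score for nulls keeps an otherwise strong match from being rejected just because one source left a field empty.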
Fusion Configurations • Edit filters (create new filters, delete or update existing ones) • Use of dictionaries: name-converter dictionary (e.g., BCN → BARCELONA)
Fusion Configurations • Similarity computation algorithm called Record Linkage. Parameters: • Sliding window size: the number of records each record is compared to • Sorting columns: the columns used to order the records • Similarity acceptance threshold
Fusion Execution • Order by City • Sliding window = 2
Fusion Execution • Similarities detected • Window = 2 • Similarity degree
Fusion Execution • List of detected similarities
Fusion Execution • List of detected similarities with a similarity percentage above the 50% threshold
Fusion Validation • Validation of results (only those above the threshold) • View by similarity or by group • Bulk validation • Validation can be shared among several supervisors
Fusion Exportation • Output format • Select the values to keep for every similarity
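Selecting values for every similarity can be pictured as building one output record from a group of matched records, column by column. The sketch below assumes a simple rule: use the reviewer's explicit choice where one exists, otherwise take the first non-empty value. The function name, the `choices` format and the sample data are made up for illustration.

```python
def fuse_records(records, choices=None):
    """Build a single record from a list of matched records.

    `choices` maps a column name to the index (in `records`) of the record
    whose value should be kept. Columns without an explicit choice take the
    first non-empty value found, in record order.
    """
    choices = choices or {}

    # Collect column names in first-seen order across all records.
    columns = []
    for record in records:
        for column in record:
            if column not in columns:
                columns.append(column)

    fused = {}
    for column in columns:
        if column in choices:
            fused[column] = records[choices[column]].get(column, "")
        else:
            fused[column] = next(
                (r[column] for r in records if r.get(column)), "")
    return fused


if __name__ == "__main__":
    a = {"name": "Bar Sol", "city": "BCN", "phone": ""}
    b = {"name": "Bar El Sol", "city": "BARCELONA", "phone": "555"}
    # The reviewer prefers the city spelling from the second record.
    print(fuse_records([a, b], choices={"city": 1}))
```

The fused record can then be written out in whichever output format was selected.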
Index • Introduction • Process • Successful stories • Architecture • Demo
Successful stories Health Service • Who? Health Service • Objective: Detect repeated health ID cards • Solution: Detect repeated records in the database and delete them (deduplication with DAURUM) • Result: Health ID card database cleaned of repetitions
Successful stories Beer manufacturer • Who? Beer manufacturer • Objective: Detect dealers that deliver to centers not previously assigned to them • Solution: Identify duplicates in each dealer's delivery database and delete them (deduplication with DAURUM); detect deliveries to centers shared between different dealers (fusion with DAURUM) • Result: Master database clean of repetitions, and detection of dealers with wrong deliveries
Index • Introduction • Process • Successful stories • Architecture • Demo
Architecture • Struts 2: Model-View-Controller • Hibernate: Database manipulation
Index • Introduction • Process • Successful stories • Architecture • Demo
Thanks for your attention • Any questions? • SPARSITY-TECHNOLOGIES, Jordi Girona, 1-3, Edifici K2M, 08034 Barcelona, info@sparsity-technologies.com, http://www.sparsity-technologies.com • DAMA-UPC. DATA MANAGEMENT (UPC), Departament d'Arquitectura de Computadors, Edifici C6-S103, Campus Nord, Jordi Girona, 1-3, 08034 - Barcelona, www.dama.upc.edu