Publish-Time Data Integration for Open Data Platforms

WOD 2013 Publish-Time Data Integration for Open Data Platforms Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden)

Motivation

Premise Continuous publishing without standardization will continuously increase heterogeneity on the platform. Is there a solution without predefined schemata / ontologies?

Problem
• Different names for attributes of the same meaning
• Different meanings for attributes with the same values

System Overview

Offline
• Domain Clustering
  • Bottom-up clustering on schema level
  • Used online to limit the search space
  • But also to improve accuracy
• Domain Statistics
  • Create different forms of value set synopses
  • Used to save comparison work online
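The offline domain clustering step can be illustrated with a small sketch. The slides only state that clustering is bottom-up and works on schema level; the Jaccard similarity over attribute names, the merge threshold of 0.3, and the names `cluster_domains` and `schemas` are assumptions made for this example, not the authors' implementation.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_domains(schemas, threshold=0.3):
    """Bottom-up clustering: repeatedly merge the two most similar
    clusters until no pair reaches the (assumed) threshold."""
    # Each cluster is (dataset ids, union of their attribute names).
    clusters = [([ds], set(attrs)) for ds, attrs in sorted(schemas.items())]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < threshold:
            break
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(ids) for ids, _ in clusters]


# Hypothetical schemas: dataset id -> attribute names.
schemas = {
    "bus_stops": {"stop", "city", "line"},
    "tram_stops": {"stop", "city", "track"},
    "budget_2012": {"department", "amount", "year"},
    "budget_2013": {"department", "amount", "year", "currency"},
}
domains_found = cluster_domains(schemas)
```

With these toy schemas the two transport datasets and the two budget datasets end up in separate domains, which is the grouping the online phase would later use to limit its search space.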

Online
• Input
  • New dataset ds+ with value sets vs+
• Output
  • Attribute name suggestions
• Constraint
  • Instantaneous response time (publish time!)
• Basic Approach
  • Assign ds+ to a domain based on schema information
  • Generate recommendations based on values

Naiv-C
• Most Naive Approach:
  • Iterate over corpus C
  • Return the names of all attributes with sufficiently similar value sets
  • Order them by overall frequency in the corpus
• Properties:
  • Finds all similar value sets
  • Generates the largest possible number of recommendations
  • Extremely long run time
  • Might generate too many recommendations
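A minimal sketch of the Naiv-C loop described above. The slides do not name the similarity measure or what counts as "sufficiently similar", so Jaccard similarity and a threshold of 0.5 are assumptions here, as are the corpus layout (a list of name and value set pairs) and the function name `naive_c`.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_c(new_values, corpus, threshold=0.5):
    """Scan the whole corpus and return the names of all attributes whose
    value sets are similar enough, most frequent name first."""
    # Overall frequency of each attribute name in the corpus.
    frequency = Counter(name for name, _ in corpus)
    matches = {name for name, values in corpus
               if jaccard(new_values, values) >= threshold}
    # Ties are broken alphabetically to keep the order deterministic.
    return sorted(matches, key=lambda n: (-frequency[n], n))


# Hypothetical corpus: (attribute name, value set) pairs.
corpus = [
    ("city", {"Dresden", "Berlin", "Leipzig"}),
    ("city", {"Dresden", "Berlin", "Hamburg"}),
    ("town", {"Dresden", "Berlin", "Leipzig", "Chemnitz"}),
    ("year", {"2011", "2012", "2013"}),
]
suggestions = naive_c({"Dresden", "Berlin", "Leipzig"}, corpus)
```

Every incoming value set is compared against every value set in the corpus, which is why this variant finds all matches but has the longest run time.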

Naiv-D
• Domain-based Approach:
  • Classify incoming dataset into domain D
  • Iterate over domain D
  • Continue as in Naiv-C
• Properties:
  • Finds fewer similar value sets
  • Shorter run time
  • Only generates recommendations from one domain
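The domain-based variant can be sketched in the same way. How the incoming dataset is classified is not given on the slides; picking the domain whose attribute names overlap most with the incoming schema is an assumption for illustration, as are the Jaccard measure, the threshold, and the names `classify_domain` and `naive_d`.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_domain(schema, domains):
    """Pick the domain whose attribute names overlap most with the schema
    (an assumed classifier, not taken from the slides)."""
    def vocabulary(domain):
        return {name for name, _ in domains[domain]}
    return max(sorted(domains), key=lambda d: jaccard(schema, vocabulary(d)))


def naive_d(schema, new_values, domains, threshold=0.5):
    """Classify first, then run the Naiv-C scan inside one domain only."""
    domain = classify_domain(schema, domains)
    # Name frequency is still counted over the whole corpus.
    frequency = Counter(name for members in domains.values()
                        for name, _ in members)
    matches = {name for name, values in domains[domain]
               if jaccard(new_values, values) >= threshold}
    return domain, sorted(matches, key=lambda n: (-frequency[n], n))


# Hypothetical corpus, already split into domains.
domains = {
    "geography": [("city", {"Dresden", "Berlin", "Leipzig"}),
                  ("country", {"Germany", "France"})],
    "people": [("birthplace", {"Dresden", "Berlin", "Leipzig"}),
               ("surname", {"Meier", "Schmidt"})],
}
domain, suggestions = naive_d({"city", "population"},
                              {"Dresden", "Berlin", "Leipzig"}, domains)
```

In this toy corpus the attribute "birthplace" has an identical value set but lives in another domain, so it is not suggested. This is the trade-off listed under Properties: fewer comparisons, but recommendations from one domain only.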

Cluster / Analysis-D
• Synopsis-based Approaches:
  • Create representative value sets (RVS) for datasets in a domain
  • Match only against RVS
• Clustering-D
  • Cluster value sets (VS) in the domain, create RVS
  • Pre-compute the recommendation list as all attribute names of value sets participating in the final cluster
  • Online: find the single most similar RVS in D
• Analysis-D
  • Create RVS directly for sets of VS with equal name
  • Online: find the set of similar RVS in D
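As an illustration of the Analysis-D idea, the sketch below builds one representative value set per attribute name and matches a new value set only against those. The slides do not say how an RVS is formed or how results are ranked; taking the union of equally named value sets, ranking by Jaccard similarity, and the threshold of 0.5 are all assumptions, and the function names are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_rvs_by_name(domain):
    """Offline: one RVS per attribute name, here the union of all value
    sets sharing that name (the union is an assumption)."""
    rvs = defaultdict(set)
    for name, values in domain:
        rvs[name] |= set(values)
    return dict(rvs)


def analysis_d(new_values, rvs_by_name, threshold=0.5):
    """Online: return the names of all sufficiently similar RVS,
    best match first (ranking by similarity is an assumption)."""
    scored = [(jaccard(new_values, rvs), name)
              for name, rvs in rvs_by_name.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score >= threshold]


# Hypothetical value sets of a single domain D.
domain_d = [
    ("city", {"Dresden", "Berlin"}),
    ("city", {"Berlin", "Leipzig"}),
    ("town", {"Dresden", "Leipzig", "Chemnitz"}),
    ("year", {"2012", "2013"}),
]
rvs_by_name = build_rvs_by_name(domain_d)
suggestions = analysis_d({"Dresden", "Berlin", "Leipzig"}, rvs_by_name)
```

The online step performs one comparison per distinct attribute name instead of one per value set, which is where the synopsis saves work. Clustering-D would differ in that the RVS come from clusters of value sets and only the single most similar RVS is looked up.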

Evaluation

Quality I

Quality II

Runtimes

Cluster Size

Conclusion
• We need statistics-based data integration at publish time to limit the growth of heterogeneity in large public dataset corpora.
• Lots of work to do: clustering, matching, statistics, indexing, performance.
