WOD 2013
Publish-Time Data Integration for Open Data Platforms
Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden)
Premise
• Continuous publishing without standardization will continuously increase heterogeneity on the platform.
• Is there a solution without predefined schemata / ontologies?
Problem
• Different names for attributes of the same meaning
• Different meanings for attributes with the same values
Offline
• Domain Clustering
  • Bottom-up clustering on schema-level
  • Used online to limit search space
  • But also to improve accuracy
• Domain Statistics
  • Create different forms of value set synopses
  • Used to save comparison work online
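The bottom-up clustering on schema-level could be sketched as follows. The slides name neither a similarity measure nor a linkage strategy, so this sketch assumes Jaccard similarity over attribute names and merges the most similar pair of clusters until no pair reaches a threshold. The function names, the threshold and the sample schemas are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the slides name none)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_domains(schemas, threshold=0.3):
    """Group datasets into domains by agglomerative clustering.

    schemas: dict mapping a dataset id to its set of attribute names.
    Returns a list of sets of dataset ids, one set per domain.
    """
    # Start with one cluster per dataset: (member ids, combined attribute names).
    clusters = [({ds}, set(attrs)) for ds, attrs in sorted(schemas.items())]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if sim >= threshold and (best is None or sim > best[0]):
                    best = (sim, i, j)
        if best is None:
            # No remaining pair is similar enough: clustering is finished.
            return [members for members, _ in clusters]
        _, i, j = best
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)


# Hypothetical sample schemas.
schemas = {
    "a": {"country", "capital", "population"},
    "b": {"country", "population", "area"},
    "c": {"currency", "rate"},
    "d": {"currency", "rate", "date"},
}
domains = cluster_domains(schemas)
```

With these sample schemas, datasets a and b end up in one domain and c and d in another, because the two groups share no attribute names.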
Online
• Input
  • New dataset ds+ with value sets vs+
• Output
  • Attribute name suggestions
• Constraint
  • Instantaneous response time (Publish-Time!)
• Basic Approach
  • Assign ds+ to domain based on schema information
  • Generate recommendations based on values
Naiv-C
• Most Naive Approach:
  • Iterate over Corpus C
  • return the names of all attributes with sufficiently similar value sets
  • order them by overall frequency in the corpus
• Properties:
  • Finds all similar value sets
  • Generates the largest possible number of recommendations
  • Extremely long run time
  • Might generate too many recommendations
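A minimal sketch of the Naiv-C steps, under stated assumptions: the corpus is modeled as a flat list of (attribute name, value set) pairs, "sufficiently similar" is taken to mean Jaccard similarity at or above a threshold, and ties in frequency are broken alphabetically. None of these choices come from the slides; the function names and sample data are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the slides name none)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naiv_c(corpus, new_values, threshold=0.5):
    """Suggest attribute names for a new value set by scanning the whole corpus.

    corpus: list of (attribute_name, value_set) pairs over all datasets.
    """
    # Overall frequency of each attribute name in the corpus.
    frequency = Counter(name for name, _ in corpus)
    # Names of all attributes whose value set is sufficiently similar.
    matches = {name for name, values in corpus
               if jaccard(values, new_values) >= threshold}
    # Most frequent names first; ties are broken alphabetically.
    return sorted(matches, key=lambda name: (-frequency[name], name))


# Hypothetical sample corpus.
corpus = [
    ("country", {"DE", "FR", "US"}),
    ("country", {"DE", "FR", "IT"}),
    ("nation", {"DE", "FR", "US", "IT"}),
    ("year", {"2011", "2012"}),
]
suggestions = naiv_c(corpus, {"DE", "FR", "US"})
```

Every value set in the corpus is compared once per incoming attribute, which is where the extremely long run time comes from.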
Naiv-D
• Domain-based Approach:
  • Classify incoming dataset into domain D
  • Iterate over Domain D
  • continue as in Naiv-C
• Properties:
  • Finds fewer similar value sets
  • Shorter run time
  • Only generates recommendations from one domain
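The domain-based variant might look like the sketch below. It assumes that classification picks the domain whose attribute names overlap most with the incoming schema, and it again uses Jaccard similarity and a threshold for matching value sets. These are illustrative assumptions, not the authors' classifier; all names and data are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the slides name none)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(domains, schema):
    """Pick the domain whose attribute names overlap most with the schema."""
    def overlap(domain_name):
        domain_schema = {attr for attr, _ in domains[domain_name]}
        return jaccard(domain_schema, schema)
    # sorted() makes the choice deterministic when similarities tie.
    return max(sorted(domains), key=overlap)


def naiv_d(domains, schema, new_values, threshold=0.5):
    """Classify the new dataset, then run the Naiv-C scan inside that domain only."""
    domain = classify(domains, schema)
    attributes = domains[domain]
    frequency = Counter(name for name, _ in attributes)
    matches = {name for name, values in attributes
               if jaccard(values, new_values) >= threshold}
    return domain, sorted(matches, key=lambda name: (-frequency[name], name))


# Hypothetical domains, each a list of (attribute_name, value_set) pairs.
domains = {
    "geography": [("country", {"DE", "FR", "US"}),
                  ("capital", {"Berlin", "Paris"})],
    "finance": [("currency", {"EUR", "USD"}),
                ("nation", {"DE", "FR", "US"})],
}
domain, suggestions = naiv_d(domains, {"capital", "population"},
                             {"DE", "FR", "US"})
```

In this example the dataset is assigned to the geography domain, so the matching attribute "nation" in the finance domain is never considered. This illustrates the trade-off listed under Properties: less work, but recommendations come from one domain only.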
Cluster / Analysis-D
• Synopsis-based Approaches:
  • Create representative value sets RVS for datasets in domain
  • Match only against RVS
• Clustering-D
  • Cluster VS in domain, create RVS
  • Pre-compute recommendation list as all attribute names of value sets participating in final cluster
  • Online: find single most similar RVS in D
• Analysis-D
  • Create RVS directly for sets of VS with equal name
  • Online: Find set of similar RVS in D
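As an illustration of Analysis-D, the sketch below builds one RVS per attribute name offline and matches only against those RVS online. The slides do not say how an RVS is derived from its value sets; taking the union is an assumption made here for simplicity, as are the Jaccard measure, the threshold and all names.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the slides name none)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_rvs(domain):
    """Offline: one representative value set per attribute name.

    Assumption: the RVS is the union of all value sets with that name.
    """
    rvs = defaultdict(set)
    for name, values in domain:
        rvs[name] |= values
    return dict(rvs)


def analysis_d(rvs, new_values, threshold=0.5):
    """Online: names of all RVS sufficiently similar to the new value set."""
    scored = [(jaccard(values, new_values), name)
              for name, values in rvs.items()]
    # Most similar first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for sim, name in scored if sim >= threshold]


# Hypothetical domain: a list of (attribute_name, value_set) pairs.
domain = [
    ("country", {"DE", "FR"}),
    ("country", {"FR", "US"}),
    ("nation", {"DE", "FR", "US", "IT"}),
    ("year", {"2011", "2012"}),
]
rvs = build_rvs(domain)
suggestions = analysis_d(rvs, {"DE", "FR", "US"})
```

The online step compares against three RVS instead of four value sets. The saving grows with the number of value sets that share a name.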
Conclusion
• We need statistics-based data integration at publish time to limit the growth of heterogeneity in large public dataset corpora.
• Lots of work to do: clustering, matching, statistics, indexing, performance.