Core Schema Mappings

Core Schema Mappings Gianni Mecca*Paolo Papotti° Salvatore Raunich* * Università della Basilicata, Italy °Università Roma Tre, Italy Sigmod - Providence - 2009, July 2nd

I just wanttomove data Source Target Given a minimal abstractspecification IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher IBLBook[0..*] name title pubId IBLPublisher[0..*] id name Core Schema Mappings

I just wanttomove data Source Target Given a minimal abstractspecification IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id name Core Schema Mappings

I just wanttomove data Source Target Given a minimal abstractspecification With source semanticspreserved in the target instance IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] Target: Book Target: Publisher id name IBDBook LOC [Clio Vldb02,Vldb06,Vldb08] [HepTox Vldb05] [ADO.net Sigmod07] [MapForce09, StylusStudio09] IBLPublisher IBLBook Core Schema Mappings

Mapping generation Source Target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id name • A first generation schema mappingtoolgenerates • m1 ∀ t1: IBDBook(t1) → ∃N: Book(t1, N ) Core Schema Mappings

Mapping generation Source Target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id name • A first generation schema mappingtoolgenerates • m1∀ t1: IBDBook(t1) → ∃N: Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) Core Schema Mappings

Mapping generation Source Target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id name • A first generation schema mappingtoolgenerates • m1 ∀ t1: IBDBook(t1) → ∃N: Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) • m4 IBLPublisher(i’, p’) → Publisher (i’, p’) Core Schema Mappings

I just wanttomove data Source Target Given a minimal abstractspecification With source semanticspreserved in the target instance IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] Target: Book Target: Publisher id name IBDBook LOC IBLPublisher IBLBook Core Schema Mappings

I just wanttomove data Source Target Given a minimal abstractspecification With source semanticspreserved in the target instance With no redundancy in the target instance IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId canonical IBLPublisher[0..*] Target: Book Target: Publisher id name Post processing step [TODS05, JACM08] core

I just wanttomove data Source Target Given a minimal abstractspecification With source semanticspreserved in the target instance With no redundancy in the target instance IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] Target: Book Target: Publisher id name null→ 245 Post processing step [TODS05, JACM08]

Source Target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id Theoretically, the problemissolved name

Source Target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id • Theoretically, the problemissolved • In practice, for a simple schema mappingwith 5000 source tuples • canonicalsolution = 1 sec • corecomputation= 8 hours name

What’s the problem? • Semantics of data exchange is defined using the chase [ICDT03]: chase(I,M)= canonical univ sol (I,M) • chase = SQL scripts • speed • flexibility and reuse • Post processing looksrecursivelyforhomomorphismsbetweeninstances • custom engine Canonicalsolutioncomputation Core Schema Mappings

What’s the solution? Coresolutioncomputation Core Schema Mappings

What’s the solution? • Core schema mappings • Findhomomorphismsbetweenformulas • Correlate mappingstoavoidhomomorphismbetweenfacts • negation • sophisticated skolemizationfornulls • Corecomputationasresultof the (standard) chase • chase(I,M)= core(I,M) • Enables a scalablesolution Coresolutioncomputation Core Schema Mappings

Outline • Howdifficultismymapping? • Subsumptions • Coverages • Self-joins • Howdoesit scale? • Complexity • SQL experiments • Modularity Core Schema Mappings

Subsumptions IBDBook Source Target Target: Book IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name IBLBook title pubId N1 to 245? • m1IBDBook(t1) → Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) Core Schema Mappings

Subsumptions IBDBook Source Target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name IBLBook title pubId • m1IBDBook(t1) → Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) Core Schema Mappings

Subsumptions IBDBook Source Target Target: Book IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name IBLBook title pubId • m1IBDBook(t1) → Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) Core Schema Mappings

Subsumptions IBDBook Source Target Target: Book IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name IBLBook title pubId N1 to 245? • m1IBDBook(t1) → Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) Core Schema Mappings

Subsumptions Source Target m3subsumes m1 generate target tuples for m3, the “more informative” mapping for m1 generate only those tuples that add new content to the target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId • Ouralgorithmidentifieshomomorphisms in the target side of the tgds • m1IBDBook(t1) → Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) homomorphism: t1 → t3, N → i Core Schema Mappings

Subsumptions Source Target m3subsumes m1 generate target tuples for m3, the “more informative” mapping for m1 generate only those tuples that add new content to the target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId m’1 : IBDBook(t1) ∧ ¬(IBLBook(t1, i)) → Book(t1, N) • m1IBDBook(t1) →Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i)→ Book(t3, i) Core Schema Mappings

Subsumptions also m2 subsumes m1 Source Target IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId m’1 : IBDBook(t1) ∧ ¬(IBLBook(t1, i)) ∧ ¬(LOC(t1, p)) → Book(t1, N) • m1IBDBook(t1) → Book(t1, N) • m2LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) homomorphism: t1 → t2, N → N’ Core Schema Mappings

Coverages Source Target • m’1IBDBook(t1) ∧ ¬(IBLBook(t1, i)) ∧ • ¬ (LOC(t1, p)) → Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) • m4 IBLPublisher(i’, p’) → Publisher (i’, p’) IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id name IBDBook LOC IBLPublisher IBLBook Core Schema Mappings

Coverages Source Target • m’1IBDBook(t1) ∧ ¬(IBLBook(t1, i)) ∧ • ¬ (LOC(t1, p)) → Book(t1, N) • m2 LOC(t2, p) → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) • m4 IBLPublisher(i’, p’) → Publisher (i’, p’) IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id name IBDBook LOC Target: Book Target: Publisher IBLPublisher IBLBook Core Schema Mappings

Coverages Source Target • m’1IBDBook(t1) ∧ ¬(IBLBook(t1, i)) ∧ • ¬ (LOC(t1, p)) → Book(t1, N) • m’2 LOC(t2, p) ∧ • ¬(IBLBook (t2, i) ∧ IBLPublisher(i, p)) • → Book(t2, N’) Publisher (N’, p) • m3 IBLBook (t3, i) → Book(t3, i) • m4 IBLPublisher(i’, p’) → Publisher (i’, p’) IBDBook[0..*] Book [0..*] title title pubId LOC [0..*] title Publisher [0..*] publisher id IBLBook[0..*] name title pubId IBLPublisher[0..*] id name more differences, new joins IBDBook LOC Target: Book Target: Publisher IBLPublisher IBLBook Core Schema Mappings

Self-joins m1 R(a, b, c, d) → S (x5, b, x1, x2, a) ∧ S (x5, c, x3, x4, a) ∧ S (d, c, x3, x4, b) m2 R(a, b, c, d) → S (d, a, a, x1, b) ∧ S (x, a, a, x, a) ∧ S (x, c, x, x, x) Examplefrom [TODS05] techniques discussed so far are of little help: the chase will either generate three tuples or none duplicate symbols: many diﬀerent ways of satisfying these constraints Core Schema Mappings

Self-joins: two-phase first exchange secondexchange Expansions S1 R S S2 m’1 R(a, b, c, d) → S1 (x5 , b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) m′2 R(e, f, g, h) → S4 (h, e, e, y1, f ) ∧ S5 (y5, e, e, y1, e) ∧ S6 (y5, g, y2, y3, y4) S1 (x5 , b, x1, x2, a) … → S (…) S2 (x5, c, x3, x4, a) … → S (…) S (…) S3 (d, c, x3, x4, b) … → S (…) S4 (h, e, e, y1, f ) … → S (…) S5 (y5, e, e, y1, e) … → S (…) S6 (y5, g, y2, y3, y4) … → S (…) … • we rewrite the tgds using distinct symbols and variables and do a first exchange • second exchange considers all possible combinations and copy the data in the original relation avoiding redundancy Core Schema Mappings

Self-joins m’1 R(a, b, c, d) → S1 (x5, b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) m′2 R(e, f, g, h) → S4 (h, e, e, y1, f ) ∧ S5 (y5, e, e, y1, e) ∧ S6 (y5, g, y2, y3, y4) R(n, n, n, k) Core Schema Mappings

Self-joins: expansions m’1 R(a, b, c, d) → S1 (x5, b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) m′2 R(e, f, g, h) → S4 (h, e, e, y1, f ) ∧ S5 (y5, e, e, y1, e) ∧ S6 (y5, g, y2, y3, y4) R(n, n, n, k) Core Schema Mappings

Self-joins: twophase first exchange secondexchange Expansions S1 R S S2 m1 R(a, b, c, d) → S1 (x5 , b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) m′2 R(e, f, g, h) → S4 (h, e, e, y1, f ) ∧ S5 (y5, e, e, y1, e) ∧ S6 (y5, g, y2, y3, y4) e12 S2(x5, c, x3, x4, a) ∧ S3(d, c, x3, x4, b) ∧ (S1(x5, b, x1, x2, a) ∧ b = c) → S (…) S (…) e13 … → S (…) … • useexpansionsaspremisesof the secondexchange • rewritethesenewtgdsusingsubsumptionstoavoidredundancy • favor more compact and more informative Core Schema Mappings

Outline • Howdifficultismymapping? • Subsumptions • Coverages • Self-joins • Howdoesit scale? • Complexity • SQL experiments • Modularity Core Schema Mappings

Complexity Realcases O(n) In realisticcases are lower n: numberoftgds d: maximumnumberofdifferenttgdsthatwriteinto a table k: maximumnumberofatoms in a tgdconclusion expression complexity (not data complexity!) Core Schema Mappings

Experimentssettings • Algorithmsimplemented in +Spicy • Scripts in SQL (and XQuery) • PostgreSQL 8.3 on a Intel CoreDuo 2.4Ghz/4GB Ram/Linux • Scenariosfrom the literature, mostlyfromSTBenchmark[Vldb08 – www.stbenchmark.org] • Each SQL test run with 10k, 100k, 250k, 500k, and 1M tuples in the source instance • Time limit = 1 hour • custom engine exceeded the time limit in all scenarios Core Schema Mappings

Experimentsresults Subsumption and coverages Selfjoins Times (sec) Times (sec) #tuples in the source #tuples in the source Core Schema Mappings

Experimentsresults Subsumption and coverages Selfjoins Times (sec) Times (sec) #tuples in the source #tuples in the source Times (sec) Scalabilityexperimentswith up to 100 tables (82 tgds, 51 subsuptions, 12 coverages): rewritingalgorithmran in 6 secs Core Schema Mappings

Modular solution canonicalsolution subsumption-freesolution coverage-freesolution core (self-join-coverage-freesolution) Algorithms allow to produce approximations of the core for expensive computations: “reduced” rewriting wrt to a subset of homomorphisms Core Schema Mappings

Repeatability & Workability Evaluation We participated in the ACM SIGMOD 2009 Repeatability & Workability Evaluation (cf., http://homepages.cwi.nl/~manegold/SIGMOD-2009-RWE/) The reviewers were able to repeat all the experiments presented in our paper, yielding results that in most cases match the ones published in our paper, except from insignificant and to be expected variation due to randomness and/or hardware/software differences. In addition, workability experiments confirmed - with few exceptions - the soundness of our results beyond the parameter setting and/or data sets presented in our paper. The detailed reports will shortly be made publicly available by ACM SIGMOD. Core Schema Mappings

Bridging the gap • +Spicy is the first mapping tool which generates core solutions efficiently • Standard solution for mapping generation • Novel algorithms for (natural) mapping rewriting • Execution times orders of magnitude faster than post processing • Twomainresults • Bridge the gap between the practiceofmapping generation [Popaet al. Vldb02] and the theoryof data exchange [Faginet al. TODS05, Gottlob&Nash JACM08] • Enable schema mappingsfor more practicalapplications Core Schema Mappings

Thankyou seeyou in Lyonfor the demo: Mecca, Papotti, Raunich, Buoncristiano “Concise and Expressive Mappings with +Spicy” VLDB 2009 Core Schema Mappings

Core Schema Mappings