Debugging Schema Mappings with Routes

Debugging Schema Mappings with Routes Laura Chiticariu UC Santa Cruz (joint work with Wang-Chiew Tan)

SPIDER: A Schema Mapping Debugger Demo group B Today 14:00-15:30 Thursday 11:00-12:30

I Source instance Schema Mappings • A schema mapping is a logical assertion that describes the correspondence between two schemas • Key element in data exchange and data integration systems • Data Exchange [FKMP05] • Translate data conforming to a source schema S into data conforming to a target schema T so that the schema mapping M is satisfied M Schema S Schema T J Target instance

I Source instance Debugging a Data Exchange Today M • Debugging at the (low) level of the implementation • Specific to the data exchange engine • Specific to the implementation language: XQuery, SQL, etc • Debugging at the level of schema mappings NO SUPPORT!!! Schema S Schema T XQuery/XSLT/Java J Target instance

I Source instance Debugging Schema Mappings M • Debugging schema mappings: the process of exploring, understanding and refining a schema mapping through the use of (test) data at the level of schema mappings Schema S Schema T J Target instance

Outline • Overview • Motivation • Debugging schema mappings with routes • Motivating example • What are routes? • Computing routes • Related work • Performance evaluation • Conclusions

Motivation • Schema mappings are good • Higher-level, declarative programming constructs • Hide implementation details, allow for optimization • Typically easier to understand vs. SQL/XSLT/XQuery/Java • Serve a similar goal as model management [Bernstein03, MBHR05] • Uniformity in specifying and debugging • Reduce programming effort by allowing a user to specify and debug at the level of schema mappings • Schema mappings are often generated by schema matching tools • Close to user’s intention, but may need further refinements • Hard to understand without the help of tools

Language for Schema Mappings • Tuple generating dependencies (tgds) • 8x ((x) !9y(x,y)) • Equality generating dependencies (egds) • 8x ((x) ! x1 = x2) • Remarks: • Widely used for relational schema mappings in data exchange and data integration [Kolaitis05,Lenzerini02] • TGDs generalize LAV, GAV and are equivalent to GLAV assertions in the terminology of data integration • Extended to handle XML data exchange [PVMHF02]

I Source instance Relational Schema Mappings [FKMP03] • Schema mapping M = (S, T, st[t) • S, T: relational schemas with no relation symbols in common • Source-to-target dependencies st: • Source-to-target tgds (s-t tgds) S(x)!9y T(x,y) • Target dependencies t: • Target tgds: T(x)!9y T(x,y) • Target egds:  T(x)!x1 = x2 ∑st ∑t Schema S Schema T J Target instance

Example Schema Mapping S: T: MANHATTAN CREDIT CardHolders: cardNo ² limit ² ssn ² name ² Dependents: accNo ² ssn ² name ² Source-to-target dependencies, st: m1: CardHolders(cn,l,s,n) ! 9L (Accounts(cn,L,s)  Clients(s,n)) m2: Dependents(an,s,n) ! Clients(s,n) Target dependencies,t: m3: Clients(s,n) !A L (Accounts(A,L,s)) FARGO FINANCE Accounts: ² accNo ² creditLine ² accHolder Clients: ² ssn ² name m1 fk1 m3 m2 Solution for I under the schema mapping Target instance J Source instance I CardHolders Accounts Clients Dependents

Example Debugging Scenario 1 Target instance J Source instance I CardHolders Accounts Clients Dependents Unknown credit limit? A route for the Accounts tuple Accounts CardHolders 123 L1 ID1 m1 123 $15K ID1 Alice Clients ID1 Alice 15K is not copied over to the target m1: CardHolders(cn,l,s,n) ! 9L (Accounts(cn,L,s) ^ Clients(s,n))

Example Debugging Scenario 1 Target instance J Source instance I CardHolders Accounts Clients Dependents Unknown credit limit? A route for the Accounts tuple Accounts CardHolders 123 L1 ID1 m1 123 $15K ID1 Alice Clients ID1 Alice 15K is not copied over to the target m1: CardHolders(cn,l,s,n) ! (Accounts(cn,l,s) ^ Clients(s,n))

Route for Accounts tuple with accNo A2 Dependents Accounts Clients m2 m3 123 ID2 Bob ID2 Bob A2 L2 ID2 Example Debugging Scenario 2 Target instance J Source instance I CardHolders Accounts Clients Dependents Unknown account number? 123 is not copied over to the target as Bob’s account number m2: Dependents(an,s,n) ! Clients(s,n)

Route for Accounts tuple with accNo A2 Dependents Accounts Clients m2 m3 123 ID2 Bob ID2 Bob A2 L2 ID2 Example Debugging Scenario 2 Target instance J Source instance I CardHolders Accounts Clients Dependents Unknown account number? 123 is not copied over to the target as Bob’s account number m’2: CardHolders(an,l,s’,n’)^ Dependents(an,s,n) ! Accounts(an,l,s)^ Clients(s,n)

Debugging Schema Mappings with Routes • Main intuition: routes describe the relationships between source and target data with the schema mapping • Definition: Let: • M be a schema mapping • I be a source instance • J be a solution for I under M and Jsµ J A route for Js with M and (I,J) is a finite non-empty sequence of satisfaction steps (I,;) ! (I,J1) ! … ! (I,Jn) such that: • Jiµ J, mi2st [ t, where 1· i· n • Jsµ Jn mn, hn m1, h1 m2, h2

Example of Satisfaction Step Target instance J Source instance I CardHolders Accounts Clients Dependents Unknown credit limit? Accounts CardHolders m1, h1 Clients m1: CardHolders(cn, l, s, n) !9L (Accounts(cn, L, s ) ^ Clients(s, n )) h1={cn ! ‘123’, l ! $15K, s ! ID1, n ! Alice, L ! L1}

Compute all routes • The schema mapping M is fixed • Input: source instance I, a solution J for I under M, a set of target tuples Jsµ J • Output: a forest representing all routes for Js • Algorithm idea: • For each tuple t in Js, consider every possible 2st[t and h for witnessing t • Do the same for all target tuples encountered during the process until tuples from the source instance are obtained

6, x  a T4(a) T6(a) Compute all routes: A simple example T7(a) • st: • 1: S1(x) ! T1(x) • 2: S2(x) ! T2(x) Æ T6(x) • t: • 3: T2(x) ! T3(x) • 4: T3(x) ! T4(x) • 5: T4(x) Æ T1(x) ! T5(x) • 6: T4(x) Æ T6(x) ! T7(x) • 7: T5(x) ! T3(x) • Source instance, I: • S1(a), S2(a) • A solution, J: • T1(a), …, T7(a)

4, x  a T3(a) Compute all routes: A simple example T7(a) • st: • 1: S1(x) ! T1(x) • 2: S2(x) ! T2(x) Æ T6(x) • t: • 3: T2(x) ! T3(x) • 4: T3(x) ! T4(x) • 5: T4(x) Æ T1(x) ! T5(x) • 6: T4(x) Æ T6(x) ! T7(x) • 7: T5(x) ! T3(x) • Source instance, I: • S1(a), S2(a) • A solution, J: • T1(a), …, T7(a) 6 T4(a) T6(a)

7 T5(a) Compute all routes: A simple example T7(a) • st: • 1: S1(x) ! T1(x) • 2: S2(x) ! T2(x) Æ T6(x) • t: • 3: T2(x) ! T3(x) • 4: T3(x) ! T4(x) • 5: T4(x) Æ T1(x) ! T5(x) • 6: T4(x) Æ T6(x) ! T7(x) • 7: T5(x) ! T3(x) • Source instance, I: • S1(a), S2(a) • A solution, J: • T1(a), …, T7(a) 6 T4(a) T6(a) 4 T3(a)

5 T4(a) T1(a) 1 S1(a) Compute all routes: A simple example T7(a) • st: • 1: S1(x) ! T1(x) • 2: S2(x) ! T2(x) Æ T6(x) • t: • 3: T2(x) ! T3(x) • 4: T3(x) ! T4(x) • 5: T4(x) Æ T1(x) ! T5(x) • 6: T4(x) Æ T6(x) ! T7(x) • 7: T5(x) ! T3(x) • Source instance, I: • S1(a), S2(a) • A solution, J: • T1(a), …, T7(a) 6 T4(a) T6(a) 4 T3(a) 7 T5(a)

2 S2(a) 3 T2(a) 5 2 T4(a) T1(a) S2(a) 1 S1(a) Compute all routes: A simple example T7(a) • st: • 1: S1(x) ! T1(x) • 2: S2(x) ! T2(x) Æ T6(x) • t: • 3: T2(x) ! T3(x) • 4: T3(x) ! T4(x) • 5: T4(x) Æ T1(x) ! T5(x) • 6: T4(x) Æ T6(x) ! T7(x) • 7: T5(x) ! T3(x) • Source instance, I: • S1(a), S2(a) • A solution, J: • T1(a), …, T7(a) 6 T4(a) T6(a) 4 T3(a) 7 T5(a)

5 T4(a) T1(a) 1 S1(a) Compute all routes: A simple example T7(a) • st: • 1: S1(x) ! T1(x) • 2: S2(x) ! T2(x) Æ T6(x) • t: • 3: T2(x) ! T3(x) • 4: T3(x) ! T4(x) • 5: T4(x) Æ T1(x) ! T5(x) • 6: T4(x) Æ T6(x) ! T7(x) • 7: T5(x) ! T3(x) • Source instance, I: • S1(a), S2(a) • A solution, J: • T1(a), …, T7(a) 6 T4(a) T6(a) 4 8 T3(a) S2(a) 7 3 T5(a) T2(a) 2 S2(a) Route for T7(a): 2, 3, 4, 8, 6

Properties of compute all routes • Completeness: Let F denote the route forest by our algorithm returned on Js. If R is a minimal route for Js, then it is represented in F. • Running time: polynomial in the sizes of I, J and Js • Every “branch of a tuple” once explored, is never explored again • Polynomial number of branches for each tuple since M is fixed • Challenge: • Exponentially many routes, but polynomial-size representation constructed in polynomial time

Compute one route • Our experimental results indicate that compute all routes can be expensive • Generate one route fast and alternative routes as needed? • Our solution: adapt compute all routes to compute only one route • Non-exhaustive: Stops when one witness is found. A witness that uses source tuples is preferred • Inference procedure: to deduce all consequences of a proven tuple and avoid recomputation of “branches” • Key step for polynomial time analysis • Completeness: If there is a route for Js, then our algorithm will produce a route for Js

Related work • Commercial data exchange systems • e.g., Altova MapForce, Stylus Studio • Use “lower-level” languages (e.g., XSLT, XQuery) to specify the exchange • Debugging is done at this low level • Source tuple centric • Data viewer [YMHF01] • Constructs an “example” source instance illustrative for the behavior of the schema mapping • Complementary to our approach • Works only for relational schema mappings

Related work • Computing routes for target data is related to computing provenance (aka lineage) of data

Empirical Evaluation • Implementation: on top of the Clio data exchange system from IBM Almaden Research Center • Scalable: push computation to the database • Handles relational and XML schema mappings [PVMHF02] • Testbed: • Created relational and XML schema mappings based on the TPCH schema • Created schema mappings based on Mondial, DBLP and Amalgam schemas • Methodology - measured the influence of: • The sizes of I, J and Js • The complexity of st[t • i.e., the number of tgds and the number of atoms in each tgd • Setup: P4 2.8GHz, 2Gb RAM, 256MB DB2 buffer pool • Our regret: No benchmark to base our comparisons

ComputeOneRoute with Rel. schema mappingInfluence of the Sizes of I and J

ComputeOneRoute with Rel. schema mappingInfluence of the Complexity of st[t

ComputeOneRoute vs. ComputeAllRoutes

Experimental results with Mondial, DBLP and Amalgam

Experimental results with Mondial, DBLP and Amalgam • Two DBLP schemas and datasets, both XML: • DBLP1, DBLP2 • First relational schema from Amalgam test suite

Experimental results with Mondial, DBLP and Amalgam • Two DBLP schemas and datasets, both XML: • DBLP1, DBLP2 • First relational schema from Amalgam test suite • Two Mondial schemas and datasets: • one relational (Mondial1), the other XML (Mondial2) • Designed st and used the foreign key constraints as t

Experimental results with Mondial, DBLP and Amalgam • Compute one route: under 3 seconds for 1-10 randomly selected tuples • Compute all routes: can take much longer • 18 seconds to construct the route forest for 10 selected tuples in the target instance of Mondial • Compute one route took under 1 second

Conclusions • Debugging schema mappings with routes • Complete, polynomial time algorithms for computing routes • Extension for routes for selected source data • Routes have declarative semantics, based on the logical satisfaction of tgds • What we don’t do: illustrate data merging • Future work: • Illustrate grouping semantics for nested schema mappings • Adapt target instance to changes in the schema mapping and data sources

Compute one/all routes Alternative routes Guided computation of routes Standard debugging features Breakpoints “Watch” windows Schema-level routes SPIDER: A Schema Mappings Debugger Demo group B Today 14:00-15:30 Thursday 11:00-12:30

Thank you!

How do we do it? M Source schema S Target Schema T Schema mappings debugger Source instance I Target instance J routes Witness selected target data with source data and M

How do we do it? M Source schema S Target Schema T Schema mappings debugger Source instance I Target instance J routes Illustrate consequences of selected source data with M

Key Concept: ROUTES - describe the relationships between source and target data with the schema mapping M Source schema S Target Schema T Schema mappings debugger Source instance I Target instance J routes

Data Data Schema Schema Clio Source Target • A semi-automatic schema mapping system • Supports user-guided mapping from source to target with constraints • Schema mapping language: a nested extension of tgds and egds • Automatically generate XQuery/SQL/XSLT scripts for the actual data transferring based on the schema mapping • Generates universal solutions under relational-to-relational schema mappings • Implemented our techniques on top of Clio, but… • Routes have declarative semantics • Independent of Clio’s transformation engine Mapping XQuery/SQL/XSLT

Provenance information Q’ Related work • Computing routes for target data is related to computing provenance (aka lineage) of data Q

Related work • Computing routes for target data is related to computing provenance (aka lineage) of data Q No reengineering of the query

Provenance information Eager Q’ Related work • Approaches to computing provenance: • Eager: changes the transformation to carry provenance information • Requires re-engineering of Q to Q’. No subsequent source access or access to the definition of Q or Q’. • Lazy: does not • No re-engineering of Q. Subsequent source access and access to the definition of Q may be needed. Q

Related work • Computing routes for target data is related to computing provenance (aka lineage) of data

Programming Languages vs. Schema Mappings • Debugging programming languages vs. debugging schema mappings • Procedural PL • We may have a specification (e.g. compute x2 on input x) which completely determines the output • Well-defined notion of correct answer • The program is an implementation of the specification • If the correct answer is not obtained, there’s a bug – need to debug the implementation • However, the specification may also not be that concrete • E.g., build a visual interface for … • Functional PL • Debugging is performed by analyzing a trace of the execution • Declarative approach for debugging [Nilsson94] • Schema mapping IS the specification • Infinite number of solutions consistent with the schema mapping • Best we can do: look at the target instance – if something looks wrong (e.g., the clients’ names are not copied to the target) go back to the schema mapping and try to refine it (or debug it)

Related Work: Computing Provenance of Data over SQL queries • Compute the provenance of relational data in a view in data warehouses [CWW2000] • The provenance of a tuple t in a view is described as the tuples in the base tables that witness the existence of t Provenance answered using two reverse queries: R(a,b) :- R(a,b) Æ S(b,c) Æ a=1 Æ c=3 S(a,b) :- R(a,b) Æ S(b,c) Æ a=1 Æ c=3 T View definition: T(a,c) :- R(a,b) Æ S(b,c) DB R S

Related Work: Computing Provenance of data over SQL queries • DBNotes: an annotation management system for relational databases • Each data value has zero or more annotations • pSQL: a query language for propagating annotations • 3 propagation schemes: DEFAULT, DEFAULT-ALL, CUSTOM • By default, annotations propagate according to provenance • Eager approach: annotations propagate along with data as data is transformed through queries • Provenance information readily available in the output • Automatically trace the provenance and flow of data over multiple transformation steps • Systematically maintains provenance annotations that describe the exact location of data values DB1 DB2 Transformation: T(a,c):-R(a,b)ÆS(b,c) R S T

Related Work: Computing the Provenance of Data over Schema Mappings • MXQL system over relational/XML schema mappings • Eager approach • Additional info about source schema elements and mappings that contribute to the creation of target data is propagated and stored • Our approach is lazy: no reengineering • Non-automatic approach for answering provenance • The additional info needs to be queried using MXQL • We automatically compute routes for selected data • Data involved in the transformation not considered • Our routes contain information about schema elements, dependencies and data involved

Debugging Schema Mappings with Routes