Data Exchange: Semantics and Query Answering

Data Exchange: Semantics and Query Answering Ronald Fagin -- IBM Almaden Research Center Phokion G. Kolaitis -- UC Santa Cruz Renee J. Miller -- University of Toronto Lucian Popa -- IBM Almaden Research Center IBM Almaden - November 12, 2002 (To appear in ICDT 2003)

Motivation and Overview • Data exchange problem: • How to restructure data from a source schema to a target schema, according to a given specification • Main motivation for this work: • Understanding of fundamental issues that lie underneath data exchange systems such as EXPRESS and Clio • Main challenge: • Inherent under-specification: • Specification (as we shall see) must be simple and intuitive, but • There are many ways in which the restructuring can be performed ! • Question: did we make the right choice in the design of Clio ? • Our approach (only relational case, so far): • Define and study universal solutions • Show this is the “best” way of performing data exchange • Study computational aspects • Study what happens after data exchange: query answering

The Data Exchange Problem Target schema T Source schema S t st Assume a data exchange setting: • source schemaS, • target schemaTwith a set tdependencies(see next) • set stsource-to-target dependencies(see next) • The data exchange problem is the following: Input: • source instanceI Output: • target instanceJsuch that:<I, J> standJ t (call suchJa solution forI) I J

Source-to-target Dependencies • For most practical purposes, st contains: • source-to-targettuple-generating dependencies (tgds) : S(x)  y T(x, y) e.g. DeptEmp(did, mgr_name, eid) M. Dept (did, M, mgr_name)  Emp (eid, did) (Move data from source table DeptEmp into two target tables, Dept and Emp. The existential variable M is an “unspecified” manager id) Dept did mgr_id mgr_name DeptEmp did mgr_name eid Emp eid did

Target Dependencies • The second, equally important, part of the specification, are the target dependencies t: • tgds : T(x)  y T(x, y) e.g. Dept (did, mgr_id, mgr_name) D. Emp (mgr_id, D) (A foreign key constraint in the target) • equality generating dependencies (egds): T(x)  (x1=x2) e.g. Emp (e, d1)  Emp (e, d2)  (d1 = d2) (A target key constraint) Dept did mgr_id mgr_name Emp eid did

Questions (To be Answered Next) • When more than one solution exists, how do we choose a “best” one ? • How do we compute a “best” solution ? • Is there always a solution ? Is there always a “best” solution ? • How does query answering on the chosen solution behave ?

Universal Solutions = “Best” Solutions

J1 T a0 b0 c0 <a0 b’0 c’0> P(a,b,c)  YZ.T(a,Y,Z) J T h1= {Y0 -> b0, Z0 -> c0, … } a0Y0 Z0 X0 b0U0 V0 W0 c0 Q(a,b,c)  XU.T(X,b,U) <a’’0 b0 c’’0> J2 h2 T a0 b0Z1 V1 W1c0 R(a,b,c)  VW.T(V,W,c) <a’’’0 b’’’0 c0> Existence of Multiple Solutions source target X0 , Y0 , Z0 … represent “unknown” values (or “nulls”) P A B C • There may be manysolutions for the target instance (J, J1, J2, etc.) • However,J seems to be more general: • there exist homomorphismsh1: J  J1 andh2: J  J2 (see definition next) • but none fromJ1 or J2toJ • intuitively,J1andJ2have extra information T Q A B C A B C R A B C

J1 T a0 b0 c0 J T a0Y0 Z0 X0b0U0 V0 W0 c0 h1= {Y0 -> b0, Z0 -> c0, … } Homomorphisms • As we have seen, the values of a target instance can be either: • constants (i.e. values coming from the source instance), or • nulls (unknown values) • Definition.Assume J1 and J2 are such target instances. A homomorphismh: J1 -> J2 is a mapping from values of J1 to values of J2 such that: • h(c) = c, for constants c (nulls of J1 can be mapped to any values of J2) • for every tuple <a1, …, an> in relation T of instance J1: < h(a1), … h(an) > must be a tuple in relation T of instance J2 • Example:

J T a0Y0 Z0 X0b0U0 V0 W0 c0 Universal Solution Definition. Assume a data exchange setting(S, T, st, t). Given source instance I, a universal solution forIis a target instance Jsuch that: (1)J is a solution for I (2) for every solutionJ’ for I, there exists homomorphismh: J  J’ • For the previous example, J is a universal solution. J1 and J2 are not. • Among all solutions, universal solutions are special: • They contain no more and no less than the amount of information given by the specification

Fact: • Uniqueness up to homomorphic equivalence: • If J1 and J2 are universal for I then there are homomorphisms between J1 and J2 in both directions • Representation of the space of solutions: • Sol(I1) = Sol(I2) iff J1 and J2 are homomorphically equivalent • We adopt the universal solution as the notion of “best” solution. • Later we will see another justification for universal solutions in terms of query answering.

When do universal solutions exist ? • How do we compute a universal solution ?

Computing Universal Solutions

Added in a first chase step (M0 is a null) J < CS M0 Mary > I CS Mary E003 < E003 CS > < M0 D0 > Added in a second chase step st : DeptEmp(did, mgr_name, eid) -> M. Dept (did, M, mgr_name)  Emp (eid, did) t : Dept (did, mgr_id, mgr_name) -> D.Emp (mgr_id, D) Chase • We canonically generate a universal solution by using the chase: • Given source instance I, start with an empty target instance J • Generate tuples in J by applying the dependencies in standt. • Example: Dept did mgr_id mgr_name DeptEmp did mgr_name eid Emp eid did

This process is repeatedly applied: • for all the source tuples and for the generated tuples, • as long as there are dependencies that are not yet satisfied • The chase may be infinite (cyclic t ) … • … or it may fail (e.g. target key constraints that are not possible to satisfy for the given source data) • (details in the paper) • However, if the chase successfully terminates, the resulting target instance is a solution.

Canonical Generation of Universal Solutions • Theorem.Assume a data exchange setting(S, T, st, t). Given source instance I: • If the chase is finite and successful then its result is a universal solution. • If the chase fails then there is no solution. • Thus, the chase is a procedure for computing universal solutions, provided that: • Solutions exist, and • The chase is finite • We call universal solutions computed by the chase canonical universal solutions When can we guarantee that the chase is finite ?

Weakly Acyclic Sets of Dependencies • Some cyclic sets of dependencies may cause infinite chases • In such case no universal solution may exist, and the semantics of the data exchange is undefined • Still there are cyclic sets of dependencies that behave well and are quite useful • Weakly acyclic sets of dependencies (defined in the paper): • Cover many practical cases of target constraints • Allow for restricted cyclicity • The chase is guaranteed to be finite

Polynomial-Time Chase Theorem. Let be a weakly acyclic set of dependencies. For every instance K, the chase of K with  can be computed in polynomial time. • Corollary.Assume a data exchange setting(S, T, st, t)such that tis a weakly acyclic set of dependencies. • For every source instance I, the existence of a solution can be checked in polynomial time • For every source instance I, if a solution exists then a universal solution can be produced in polynomial time.

Next: what happens after data exchange ? • In particular, how is subsequent query answering affected by our choice of a solution (universal solution) ?

Query Answering

Query Evaluation on a Solution q Target schema T Source schema S t st • Assume a fixed data exchange setting with a source instance I. Suppose that a System 1 chooses a solution J for data exchange. • A query q can now be asked against the target. • The evaluation of q, in System 1, is q(J). • However, a System 2 materializing a different solution J’ may give a different evaluation q(J’). • Different choices of J (for the same I) imply possibly different query evaluations. • Is there a notion of the “right” set of answers to qwith respect to I ? I J

Certain Answers • We will use a notion that has been around in the context of data integration and incomplete databases, where queries are asked against a set of possible databases. Definition. GivenI andq, a tupletis a certain answer if: • t  q(J), for every solution J • Notation: certain(q, I)= the set of all certain answers • Thus,tis certain if it is in the answer ofqon every solution. • The certain answers provide well-defined semantics to query answering because they are independent of the choice of a solution.

Can we compute the certain answers based just on our chosen (universal) solution ?

Positive Queries • Proposition.Assume a data exchange setting(S, T, st, t) and a source instance I. • Let q be a positive query. If J is a universal solution, then certain(q, I) = q(J) . • Let J be a solution such that for every positive query q we have that certain(q, I) = q(J) .Then J is a universal solution. Note: In the above: • Positive query means union of SPJ queries • q(J) means evaluate q on J and then throw away tuples that contain nulls) • Thus, the certain answers of positive queries can be computed by evaluating them on any universal solution. • Moreover, this property characterizes universal solutions.

q(u, v) :- xz. T(u, x) T(z, v)  xz a0 X0 Z0 a0 J(universal) R <a0, b0> d1 A B T A B S d2 J2 (not universal) A B a0 a0 <a1, a0> d1: R(a,b)  X.T(a,X) d2: S(a,b)  Z.T(Z,b) Conjunctive Queries with Inequalities • The situation changes when negation is involved (even in the very simple form of ). • Example: • It can be verified that: • <a0,a0> q(J), but<a0,a0>  q(J2)(thus, not a certain answer). • Hencecertain(q, I)  q(J) • The universal solution gives extra answers

For conjunctive queries with inequalities, we have seen that simple query evaluation on a universal solution is not enough for computing the certain answers. • Question: Can we find a different SQL query q* such that when evaluated on a universal solution gives the set of certain answers of q ? • There are examples for which such query q* exists. • However, we show next that the answer is “no”, in general.

Complexity: Two or More Inequalities Theorem.Computing the certain answers of unions of conjunctive queries with at most two inequalities per term is coNP-hard, even in a restricted data exchange setting (LAV). • [AD98] proved a similar result for the case of conjunctive queries with six or more inequalities. • The coNP-hardness implies: • the certain answers cannot be computed by evaluating the query q (or any other SQL query q*) on a polynomial-time generated universal solution (unless P = NP).

Complexity: One Inequality Theorem.Assume a data exchange setting(S, T, st, t)such that tis a weakly acyclic set of dependencies. Let q be a union of conjunctive queries with at most one inequality per term. Let I be a source instance and let J be an arbitrary universal solution for I. Then there exists a polynomial-time algorithm with input Jthat computes certain(q, I). • Thus, computing the certain answers for such queries is a tractable problem. • Moreover, this computation can take place on any universal solution. • The universal solution has all the information needed to compute the certain answers. • We show next that the problem of computing the certain answers, even for this tractable case, cannot be solved by means of SQL query evaluation.

First-Order Inexpressibility Theorem.There exists a data exchange setting and a boolean conjunctive query q with one inequality, for which there is no first-order query q* over the canonical universal solution such that certain(q, I) = q*(J) . • This is a strong inexpressibility result that shows that in data exchange we cannot use the notion of certain answers for answering queries with inequalities. • (In practice, instead of going for certain answers, we should just use query evaluation on the universal solution) • The proof uses an original combination of finite model theory techniques and the chase.

Conclusions • Universal solutions are a good candidate for using in data exchange • Clio produces such universal solution (in the relational case) • All universal solutions are equally good for answering positive queries. • Simple query evaluation has the same semantics as that of the certain answers. • For queries with inequalities, different universal solutions may give different query evaluations which may yet be different from the certain answers. • There is no hope to find the certain answers by means of SQL query evaluation on a universal solution

Future Work • Among all universal solutions, is there a universal solution that approximates, in a best way, the certain answers ? If yes, can this be computed efficiently ? • Extension to semantics and query answering for data exchange in the nested (XML) case.

The End

Relationship to Clio • Source-to-target tgds are the same formalism that Clio uses internally (for the relational case): • st is generated by Clio in the semantic translation phase [VLDB02] from correspondences. • User input • Then Clio generates, based on st , a set of queries in the data translation phase[VLDB02] • These queries compute a solution • Is Clio’s solution a good one ? (Since other solutions are also possible) • Here we try to understand, formally, the concept of “good” solutions • t is more general than the target constraints that Clio can currently handle.

Data Exchange: Semantics and Query Answering

Data Exchange: Semantics and Query Answering

Presentation Transcript

Semantics: Representations and Analyses

Query Execution in Databases: An Introduction

A Brief Introduction to MySQL

Join Synopses for Approximate Query Answering

COMP 4060 Natural Language Processing

Leveraging Social data with Semantics

IS698: Database Management

Information Exchange Workgroup

Semantics: Representations and Analyses

Answering Metric Skyline Queries by PM-tree

IHE ITI Profile Proposal XCA Query and Retrieve

CS245A – Syllabus (2005)

IST-2001-34825

Data exchange: semantics and query answering

SQL: The Query Language Part 1

Answering Approximate Queries Efficiently

Approximate XML Query Answers

XML Query Languages

FQAS 2011 Flexible Query Answering Systems GHENT

Query and Retrieve

RDF FOR DEVELOPERS

Query Answering on Trajectory Cuboids using Prime Numbers Encodings