290 likes | 395 Views
Recomputing Materialized Instances After Changes to Mappings and Data. Todd J. Green Zachary G. Ives ICDE ’12 Washington, DC April 2, 2012. &. Change is a Constant in Data Management. Databases are highly dynamic ; many kinds of changes need to be propagated efficiently:
E N D
Recomputing Materialized Instances After Changes to Mappings and Data Todd J. Green Zachary G. Ives ICDE ’12 Washington, DC April 2, 2012 &
Change is a Constant in Data Management • Databases are highly dynamic; many kinds of changes need to be propagated efficiently: • To data (“view maintenance”) • To viewdefinitions (“view adaptation”) • Others, such as schema evolution, etc. • Collaborative data sharing systems (e.g., Orchestra [Ives+ 05]), declarative programming platforms (e.g., LogicBlox [Huang+11]) exacerbate this need: • Large numbers (100s to 10s of 1000s) of materialized views • Frequent updates to data, schemas, mapping/view definitions
Change Propagation: a Problem of Computing Differences View maintenance view definition source data materialized view Given: R V change to source data (difference wrt current version) Goal: compute change to materialized view (difference) R¢ V¢ View adaptation view definition source data materialized view Given: R V Goal: change to view definition (another kind of difference) compute change to materialized view V¢
Challenges in Change Propagation • View maintenance: studied since at least the mid-eighties [Blakeley+ 86], but existing solutions quite narrow and limited • Various known methods to compute changes “incrementally”, e.g., count algorithm[Gupta+ 93] • How do we optimize this process? What is space of all update plans? • View adaptation: less attention, but renewed importance in context of data exchange/collaborative data sharing systems • Previous approaches: limited to case-based methods for simple changes [Gupta+ 01] • Complex changes? Again, space of all update plans? • Key challenge: compute changes using database queries!
Contributions • We build on our previous theoretical work [G+ 09] to implement a unifiedand cost-based approach to handling updates to source data, view definitions, or both • Practical implementation in Orchestra CDSS, based on rewriting queries using materialized views and enriched data model • Core of our engine is unaware of which kind of update (to source data or to view definition) it is dealing with! • Engine also automatically exploits provenance information, if present • We demonstrate significant practical benefits on workloads typical of data sharing in the life sciences
Background I: Orchestra CDSS [Ives+08]a Collaborative Data Sharing System • Set of peers (e.g., collaborating life scientists), each with database, agree to share information • Peers linked via network of compositional schema mappings • define how data/updates applied to one peer instance should be transformed and applied to other peer instances • System tracks provenance (lineage) information [Green+ 07] as updates are mapped/transformed • Basis of provenance-based trust policies • Also used to guide update propagation
Example: Sharing Morphological Data Alice’s field observations: a Carol wants to gather information from Alice, Bob, uBio, and put into own data repository: Bob’s field observations: b, c Carol’s Guide to Primate Hand Colors schema mappings Can do this using schemamappings Standard species names: d
Example: Sharing Morphological Data (2) Alice’s field observations: a Datalog mappings relating databases e(Name, Color) :– b(Id, “hand color”, Color), c(Id, Species,_), d(Species, Name). e(Name, Color) :– a(Id, Species,_, “hand color”, Color), d(Species, Name). Bob’s field observations: b, c Carol’s Guide to Primate Hand Colors: e Standard species names: d
Example: Sharing Morphological Data (2) Alice’s field observations: a Datalog mappings relating databases e(Name, Color) :– b(Id, “hand color”, Color), c(Id, Species,_), d(Species, Name). e(Name, Color) :– a(Id, Species,_, “hand color”, Color), d(Species, Name). Bob’s field observations: b, c Carol’s Guide to Primate Hand Colors: e join Standard species names: d
Example: Sharing Morphological Data (2) Alice’s field observations: a Datalog mappings relating databases e(Name, Color) :– b(Id, “hand color”, Color), c(Id, Species,_), d(Species, Name). e(Name, Color) :– a(Id, Species,_, “hand color”, Color), d(Species, Name). Bob’s field observations: b, c Carol’s Guide to Primate Hand Colors: E join Standard species names: d
Background II: Data Updates as Z-Relations [G+09] • Z-relation: a relation where each tuple is associated with a (positive or negative) count • Positive counts indicate (multiple) insertions; negative counts, (multiple) deletions • Uniform representation for both data and changes to data • Update application = union (a query!) • Can think of changes to data as a kind of annotated relation r¢ r’ = r [ r¢
Relational Algebra (RA) on Z-Relations [G+09] join (⋈) multiplies counts union ([), projection (¼) add counts selection (¾) multiplies counts by 0 or 1 difference (–) subtracts counts Note,Note, difference can lead to negative counts (unlike “proper subtraction” in bag semantics where negative counts are truncated to 0) (same as for semiring-annotated relations [G+07])
Incremental View Maintenance: An Application of Z-Relations Materialized view (with duplicates): Source relation: v(X,Y) :– r(X,Z), r(Z,Y) R 2 copies of (b,b) deletion R¢ V¢ delete 1 copy of (b,b) insertion insert 1 copy of (b,d) Delta rules[Gupta+ 93]for v with Z-relations semantics: v¢(X,Y) :– r(X,Z), r¢(Z,Y) v¢(X,Y) :– r¢(X,Z), r’(Z,Y)
Z-Relations are Amenable to Advanced Optimization Strategies [G+09] • For change propagation, fundamental need for difference in query language => full relational algebra (RA) • Under set or bag (multiset) semantics, basic optimization tasks---e.g., testing query equivalence---are undecidable for RA • Under Z-semantics, equivalence of RA queries is, surprisingly, decidable • Even better, rewriting queries using materialized views can be done via sound and complete procedure
Our Approach in This Work • Cast view maintenance/view adaptation as special cases of rewriting queries using views • Use cost-based search to find a good (not perfect) plan within time budget • Emulate Z-semantics with an off-the-shelf DBMS via encoding scheme • Handle / exploit provenance information, if we have it • Provenance has sizable storage overhead • But also unlocks many useful new rewritings
View Maintenance: a Special Case ofRewriting Queries Using Views on Z-Relations Materialized views: Query (to compute diff.): v(X,Y) :– r(X,Z), r(Z,Y) r’(X,Y) :– r(X,Y) r’(X,Y) :– r¢(X,Y) v¢(X,Y) :– r’(X,Z), r’(Z,Y) – v¢(X,Y) :– r(X,Z), r(Z,Y) rewrite v¢ using the materialized views ... OTHER PLANS…? Another delta rules rewriting: Delta rules rewriting: v¢(X,Y) :– r(X,Z), r¢(Z,Y) v¢(X,Y) :– r¢(X,Z), r’(Z,Y) v¢(X,Y) :– r¢(X,Z), r(Z,Y) v¢(X,Y) :– r’(X,Z), r¢(Z,Y)
View Adaptation: Another Application of Rewriting Queries Using Views Old view definition: New view definition: v(X,Y) :– r(X,Z), r(Z,Y). v(X,Y) :– s(X,Y,_). v’(X,Y) :– r(X,Z), r(Z,Y). reformulate using materialized view v ... OTHER PLANS…? A plan to “adapt” v into v’: v’(X,Y) :– v(X,Y). – v’(X,Y) :– s(X,Y,_).
Searching the Space of Rewritings:Time-Boxed, Cost-Based Hill-Climbing original plan p1 for q’ with its (est’d) cost : 27 p1: 27 … “one-step” rewritings of p1 using views : + costs p2: 45 p3 : 17 p14 : 20 “two-step” rewritings of p1 using views : + costs p15 : 12 p17 : 18 p16 : 74 (none) OUT OF TIME return best plan found, p15 : 12
How Does Provenance Fit In? • For view adaptation, often useful to “separate” disjuncts of a union, or “recover” values projected away • Would like some sort of index structure for this • Such a structure already exists in CDSS in form of provenance information
Graphical Model of Data Provenance Datalog mappings: m1: e(Name, Color) :– a(Id, Species, “hand color”, Color), d(Species, Name). ¢ ¢ Provenance table for m1: = a.Species = d.Comm. Name = a.Character Compress table using mapping’s correspondences
How to Compute Provenance Graph?Use Datalog! To record provenance for mapping m1 we convert it to a pair of mappings “Just more Datalogviews” => automatically exploited by reformulation engine! • engine doesn’t even know that it’s using provenance! e(N, C) :– a(I, S, “...”, C), d(S, N). The first rule builds the provenance table for m1 m1(I, S, N, C) :– a(I, S, “...”, C), d(S, N). e(N, C) :– m1(_, _, C, N). The second rule projects over m1 to populate e
Exploiting Provenance Information in View Adaptation Example (WITHOUT provenance): e(…) :– a(…), d(…). e(…) :– c(…), g(…). mapping revision cost ≈ 2-way join + 3-way join e’(…) :– a(…), d(…). e’(…) :– c(…), g(…), f(…). Incremental plan to compute e’ using e (faster???) e’(…) :– e(…). e’(…) :– c(…), g(…), f(…). – e’(…) :– c(…), g(…). cost ≈ 2-way join + 3-way join…
How Provenance Information Enables New Rewritings (cont’d) Same example (but WITH provenance): e(…) :– m1(…). m1(...) :– a(…), d(…). e(…):– m2(…). m2(,...) :– c(…), g(…). mapping revision cost ≈ 2-way join + 3-way join e’(…) :– m1(...). m1’(…) :– a(…), d(…). e’(…):– m2(…). m2’(…) :– c(…), g(…), f(…). Incremental plan to compute e’ (and mapping tables) cost ≈ 2-way join! e’(…) :– m1’(...). m1’(…) :– m1(...). e’(…):– m2’(...). m2’(...) :– m2(...), f(…).
Experimental Evaluation • Synthetic workload based on SWISS-PROT biological dataset • Generate source tables, mappings, changes to mappings and data • Changes to mapping definitions guided by empirical observations of schema changes in practice • Start with 16 source relations, 24 views • Apply sequences of 24 “primitive modifications” to view definitions • add/drop column in rule head, add/drop data source for view, add correspondence table, reorder columns, ...
Highlights of Experiments Net speedup ~40% Net speedup ~80% mapping changes only Net speedup ~20% Net speedup ~45% mapping changes + data changes without provenance with provenance
Summary • We’ve shown that optimized change propagation is feasible, and can yield large speedups • Can handle updates to mappings, data, or both via a generic reformulation engine based on optimizing queries using materialized views • For systems like Orchestra that store provenance information, even more opportunities for optimization • Easy to retrofit any Datalog-based system (e.g., LogicBlox) to store same kind of provenance information • (Benefits must be balanced with storage costs…)
Related Work • Incremental view maintenance [Blakeley+ 86], [Gupta+ 93], ... • “deltas” [Gupta+ 93]: an early form of our Z-relations • Answering queries using views [Levy+ 95], [Chaudhuri+ 95], [Afrati&Pavlaki 06], Chase&Backchase [Deutsch,Popa,Tannen 99], ... • Bag-containment/bag-equivalence of CQs/UCQs[Lovász 67], [Chaudhuri&Vardi 93], [Ioannidis&Ramakrishnan 95], [Cohen+ 99], [Jayram+ 06] • View adaptation [Mohania&Dong 96], [Gupta+ 01]
Related Work (cont) • Mapping evolution [Velegrakis+ 03] • Recursively-compiled view maintenance plans [Ahmad&Koch 09, Koch 10] • Data exchange [Fagin+05], P2P data exchange [Fuxman+05] • Youtopia[Koch09] • Mapping adaptation [Yu&Popa05]