Rapid, Collaborative Sharing of Dynamic Data
Zachary G. Ives, University of Pennsylvania
with Nicholas Taylor, T. J. Green, Grigoris Karvounarakis, Val Tannen
North Carolina State University, October 6, 2006
Funded by NSF IIS-0477972, IIS-0513778
An Elusive Goal: Building a Web of Structured Data A longtime goal of the computer science field: creating a “smarter” Web • e.g., Tim Berners-Lee’s “Semantic Web”, 15 years of Web data integration Envisioned capabilities: • Link and correlate data from different sources to answer questions that need semantics • Provide a convenient means of exchanging data with business partners, collaborators, etc.
Why Is This So Hard? Semantics is a fuzzy concept • Different terminology, units, or ways of representing things • e.g., in real estate, “full + half baths” vs. “bathrooms” • Difficult to determine and specify equivalences • e.g., conference paper vs. publication – how do they relate precisely? • Linking isn’t simply a matter of hyperlinking and counting on a human to interpret the connection • Instead we need to develop and specify mappings (converters, synonyms) Real data is messy, uncertain, inconsistent • Typos; uncertain data; non-canonical names; data that doesn’t fit into a standard form/schema But (we believe): the data sharing architecture is the big bottleneck
Data Sharing, DB-Style: One Instance to Rule Them All? • Data warehouse/exchange: one schema, one consistent instance • Data integration / peer data management systems: • Map heterogeneous data into one or a few virtual schemas • Remove any data that’s inconsistent [Arenas+]
[Figure: data integration architecture – queries and results against a mediated schema with a source catalog, connected by schema mappings to autonomous data sources]
A Common Need: Partial, Peer-to-Peer Collaborative Data Exchange Sometimes we need to exchange data in a less rigid fashion… • Our cell phone directory with a friend’s – with different nicknames • Citation DBs with different conference abbreviations • Restaurant reviews and ratings • Scientific databases, where inconsistency or uncertainty are common • “Peer to peer” in that no one DB is all-encompassing or authoritative • Participation is totally voluntary, must not impede local work • Each must be able to override or supplement data from elsewhere
Target Domain: Data Exchange among Bioinformatics DBs & Biologists (e.g., PlasmoDB, CryptoDB, EBI) Bioinformatics groups and biologists want to share data in their databases and warehouses • Data overlaps – some DBs are specialized, others general (but with data that is less validated) • Each source is updated, curated locally • Updates are published periodically We are providing mechanisms to: • Support local queries and edits to data in each DB • Allow on-demand publishing of updates made to the local DB • Import others’ updates to each local DB despite different schemas • Accommodate the fact that not all sites agree on the edits! (Not probabilistic – sometimes, no consensus on the “right” answer!)
Challenges Multi-“everything”: • Multiple schemas, multiple peers with instances, multiple possibilities for consistent overall instances Voluntary participation: • Group may publish infrequently, drop off the network, etc. • Inconsistency with “the rest of the world” must not prevent the user from doing an operation Unlike cvs or distributed DBs, where consistency with everyone else is always enforced Conflicts need to be captured at the right granularity: • Tuples aren’t added independently – they are generally part of transactions, which may have causal dependencies
Collaborative Data Sharing Philosophy: rather than enforcing a global instance, support many overlapping instances in many schemas (Conflicts are localized!) Collaborative Data Sharing System (CDSS): • Accommodate disagreement with an extended data model Track provenance and support trust policies • “Reconcile” databases by sharing transactions Detect conflicts via constraints and incompatible updates • Define update translation mappings to get all data into the target schema Based on schema mappings and provenance We are implementing the ORCHESTRA CDSS
A Peer’s Perspective of the CDSS • User interacts with a standard database (RDBMS) • CDSS coordinates with other participants • Ensures availability of published updates • Finds consistent set of trusted updates (reconciliation) • Updates may first need to be mapped into the target schema
[Figure: participant P_C with local instance D_C issues queries and answers and exchanges updates (Δ_C) with the CDSS (ORCHESTRA), obtaining a consistent, trusted subset of the other peers’ data (D_E) in P_C’s schema]
A CDSS Maps among Sources that Each Publish Updates in Transactions
[Figure: PlasmoDB (P_P) with relation R_P (GUSv1), EBI (P_E) with R_E (MIAME), and CryptoDB (P_C) with R_C (GUSv3); each peer publishes local deltas (Δ_P, Δ_E, Δ_C), and the peers are connected by schema mappings m_{CE→P} and m_{C↔E}]
Along with Schema Mappings, We Add Prioritized Trust Conditions
[Figure: the same peers and mappings, now annotated with trust conditions – e.g., priority 5 if a condition on the mapped data holds and priority 1 always for mapping m_{CE→P}, and priority 3 always for mapping m_{C↔E}]
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing • Accommodate disagreement with an extended data model • Reconcile updates at the transaction level • Define update translation mappings to get all data into the target schema
Multi-Viewpoint Tables (MVTs): Specialized Conditional Tables + Provenance • Each peer’s instance is the subset of tuples in which the peer’s name appears in the viewpoint set • Reconciling peer’s trust conditions assign priorities based on data, provenance, and viewpoint set – each (datalog rule body; priority) pair below is one trust condition over GUSv1:Study(A,B):
Peer2:Study(A,B) :- {(GUSv1:Study(A,B; prv, _, _) & contains(prv, Peer1:*); 5),
                     (GUSv1:Study(A,B; _, vpt, _) & contains(vpt, Peer3); 2)}
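A minimal sketch of how a rule like the one above could be evaluated. The MVTuple representation, the annotation strings (e.g. "Peer1:txn7"), and the “highest matching priority wins” reading are illustrative assumptions, not the actual MVT implementation:

```python
from dataclasses import dataclass

@dataclass
class MVTuple:
    values: tuple   # the tuple's data values, e.g. (A, B)
    prov: set       # provenance annotations, e.g. {"Peer1:txn7"}
    vpt: set        # viewpoint set: peers whose instances contain the tuple

def peer2_priority(t: MVTuple):
    """Priority Peer2 assigns to a GUSv1:Study tuple, or None if untrusted."""
    if any(p.startswith("Peer1:") for p in t.prov):   # contains(prv, Peer1:*)
        return 5
    if "Peer3" in t.vpt:                              # contains(vpt, Peer3)
        return 2
    return None                                       # not imported by Peer2

study = MVTuple(("pf001", "expr-study"), prov={"Peer1:txn7"}, vpt={"Peer1", "Peer3"})
print(peer2_priority(study))   # 5 -- the higher-priority condition matches first
```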
Summary of MVTs • Allow us to have one representation for disagreeing data instances – necessary for expressing constraints among different data sources • Really, we focus on updates rather than data • Relations of deltas (tuple edits), as opposed to tuples themselves
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing • Accommodate disagreement with an extended data model • Reconcile updates at the transaction level • Define update translation mappings to get all data into the target schema
CDSS Reconciliation [Taylor+Ives SIGMOD06] Operations are between one participant and “the system”: • Publishing • Reconciliation – the participant applies a consistent subset of updates • May get its own unique instance
[Figure: a participant with a local instance and update log publishes new updates (δ) to the ORCHESTRA system and sends reconciliation requests, receiving the published updates in return]
Challenges of Reconciliation • Updates occur in atomic transactions • Transactions have causal dependencies (antecedents) • Peers may participate intermittently • (Requires us to make maximal progress at each step)
Ground Rules of Reconciliation Clearly, we must not: • Apply a transaction without having data it depends on (i.e., we need its antecedents) • Apply a transaction chain that causes constraint violations • Apply two transaction chains that affect the same tuple in incompatible ways Also, we believe we should: • Exhibit consistent, predictable behavior to the user • Monotonic treatment of updates: transaction acceptances are final • Always prefer higher priority transactions • Make progress despite conflicts with no clear winner • Allow user conflict resolutions to be deferred
Reconciliation in ORCHESTRA Accept the highest-priority transactions (and any necessary antecedents)
[Figure: example over R(X,Y) with key constraint X → Y; single-update transactions of high, medium, and low priority – +(A,4), +(B,4), +(A,3), +(B,3), +(C,5), +(A,2), +(D,8), +(D,9), +(C,6) – are each marked Accept, Reject, or Defer across two successive reconciliations]
Transaction Chains Possible problem: transient conflicts We flatten chains of antecedent transactions
[Figure: Peer 1 publishes +(C,5); Peer 3 publishes +(C,6) followed by a modification (C,6) → (D,6), so Peer 3’s flattened chain is effectively +(D,6) and the conflict on C is only transient]
Flattening and Antecedents
[Figure: example over R(X,Y) with key X → Y; chains such as +(A,2); +(D,6) followed by (D,6) → (D,7); +(A,1), +(B,3), +(F,4) followed by (B,3) → (B,4); and +(C,5) followed by (C,5) → (E,5) are flattened before being accepted, rejected, or deferred]
Reconciliation Algorithm: Greedy, Hence Efficient Input: flattened, trusted, applicable transaction chains Output: set A of accepted transactions For each priority p from pmax down to 1: • Let C be the set of chains at priority p • If some t in C conflicts with a non-subsumed u in A, REJECT t • If some t in C uses a deferred value, or conflicts with a non-subsumed, non-rejected u in C, DEFER t • Otherwise, ACCEPT t by adding it to A
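The greedy loop can be sketched in a few lines of Python. The Chain fields, the conflicts() test, and the omission of subsumption and antecedent handling are simplifying assumptions, not the ORCHESTRA implementation (which is in Java):

```python
from dataclasses import dataclass

@dataclass
class Chain:
    tid: str
    priority: int
    writes: frozenset = frozenset()   # keys the flattened chain modifies
    reads: frozenset = frozenset()    # keys whose prior values it depends on

def conflicts(t, u):
    # Hypothetical conflict test: two chains modify the same key incompatibly.
    return bool(t.writes & u.writes)

def reconcile(chains):
    accepted, rejected, deferred = [], [], []
    deferred_keys = set()
    for p in sorted({c.priority for c in chains}, reverse=True):  # pmax .. 1
        level = [c for c in chains if c.priority == p]
        # 1. Reject chains that conflict with something already accepted.
        rej = [t for t in level if any(conflicts(t, u) for u in accepted)]
        survivors = [t for t in level if t not in rej]
        # 2. Defer chains that use a deferred value, or that conflict with a
        #    surviving same-priority chain (no clear winner).
        dfr = [t for t in survivors
               if (t.reads & deferred_keys)
               or any(conflicts(t, u) for u in survivors if u is not t)]
        for t in dfr:
            deferred_keys |= t.writes
        # 3. Accept everything else at this priority.
        accepted += [t for t in survivors if t not in dfr]
        rejected += rej
        deferred += dfr
    return accepted, rejected, deferred

# Two equal-priority chains touching the same key are deferred; the rest proceed:
a = Chain("t1", 5, writes=frozenset({"A"}))
b = Chain("t2", 5, writes=frozenset({"A"}))
c = Chain("t3", 3, writes=frozenset({"B"}))
acc, rej, dfr = reconcile([a, b, c])
print([t.tid for t in acc], [t.tid for t in rej], [t.tid for t in dfr])
# ['t3'] [] ['t1', 't2']
```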
ORCHESTRA Reconciliation Module Java reconciliation algorithm at each participant • Poly-time in the size of the update load + antecedent chain length Distributed update store built upon Pastry + BerkeleyDB • Stores updates persistently • Computes antecedent chains
[Figure: each participant’s RDBMS and reconciliation algorithm publish new updates to, and send reconciliation requests and fetch published updates from, a distributed update store shared by all participants]
Experimental Highlight: Performance Is Adequate for Periodic Reconciliation Simulated (Zipfian-skewed) update distribution over a subset of SWISS-PROT at each peer (insert/replace workload); 10 peers each publish 500 single-update transactions • Infrequent reconciliation is more efficient • Fetch times (i.e., network latency) dominate
[Figure: reconciliation time, centralized vs. distributed implementation]
Skewed Updates, Infrequent Changes Don’t Result in Huge Divergence Effect of the reconciliation interval on synchronicity • synchronicity = avg. no. of values per key • ten peers each publish 500 single-update transactions • Infrequent reconciliation changes synchronicity only slowly
Summary of Reconciliation Distributed implementation is practical • We don’t really need “real-time” updates, and performance is reasonable • (We are currently running 100s of virtual peers) • Many opportunities for query processing research (caching, replication) Other experiments (in the SIGMOD06 paper): • How much disagreement arises? Transactions with > 2 updates have negligible impact; adding more peers has a sublinear effect • Performance with more peers: execution time increases linearly Next: we need all of the data in one target schema…
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing • Accommodate disagreement with an extended data model • Reconcile updates at the transaction level • Define update translation mappings to get all data into the target schema
Reconciling with Many Schemas Reconciliation needs transactions over the target schema: • Break txns into constituent updates (deltas), tagged with txn IDs • Translate the deltas using schema mappings • Reassemble transactions by grouping deltas with the same txn ID • Reconcile! (A sketch of this pipeline follows.)
[Figure: as before, participant P_C’s RDBMS holds a consistent, trusted subset of the other peers’ data (D_E) in P_C’s schema and exchanges updates (Δ_C) with the CDSS (ORCHESTRA)]
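A minimal end-to-end sketch of those four steps. The translate_delta() function is a placeholder that merely renames relations; real translation uses the update translation mappings described below, and all names here are illustrative:

```python
from collections import defaultdict

def translate_delta(delta, relation_map):
    """Placeholder translation: rewrite a delta's relation name into the target
    schema. Real translation applies the update translation mappings and may
    produce several target-schema deltas per source delta."""
    op, rel, tup = delta
    return [(op, relation_map.get(rel, rel), tup)]

def translate_transactions(txns, relation_map):
    """txns: {txn_id: [(op, relation, tuple), ...]} in the source schemas."""
    out = defaultdict(list)
    for txn_id, deltas in txns.items():                  # 1. break txns into deltas
        for d in deltas:
            for d2 in translate_delta(d, relation_map):  # 2. translate each delta
                out[txn_id].append(d2)                   # 3. regroup by txn ID
    return dict(out)                                     # 4. hand to reconciliation

txns = {"txn1": [("+", "R_P", ("gene1", "descA")),
                 ("-", "R_P", ("gene2", "descB"))]}
print(translate_transactions(txns, {"R_P": "R_C"}))
# {'txn1': [('+', 'R_C', ('gene1', 'descA')), ('-', 'R_C', ('gene2', 'descB'))]}
```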
Given a Set of Mappings, What Data Should Be in Each Peer’s Instance? PDMS semantics [H+03]: each peer provides all certain answers
[Figure: PlasmoDB’s R_P, EBI’s R_E, and CryptoDB’s R_C connected by the mappings m_{CE→P} and m_{C↔E}]
Schema Mappings from Data Exchange: A Basic Foundation Data exchange (Clio group at IBM, esp. Popa and Fagin): schema mappings are tuple-generating dependencies (TGDs) R(x,y), S(y,z) → ∃w T(x,w,z), U(z,w,y) • Chase [PT99] over sources, TGDs, to compute target instances • Resulting instance: canonical universal solution [FKMP03], and queries over it give all certain answers Our setting adds some important twists…
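A minimal sketch, over toy set-valued relations, of one chase step for the TGD above: each matching pair of source tuples gets a fresh labeled null for the existential variable w. The real chase also checks whether the head is already satisfied before inventing a null; this illustration skips that:

```python
import itertools

_nulls = itertools.count()
def fresh_null():
    return f"N{next(_nulls)}"        # labeled null standing in for ∃w

def chase_step(R, S):
    """Apply R(x,y), S(y,z) -> ∃w T(x,w,z), U(z,w,y) once to every match."""
    T, U = set(), set()
    for (x, y) in R:
        for (y2, z) in S:
            if y == y2:              # join on the shared variable y
                w = fresh_null()
                T.add((x, w, z))
                U.add((z, w, y))
    return T, U

R = {(1, 2), (3, 2)}
S = {(2, 5)}
T, U = chase_step(R, S)
print(T)   # e.g. {(1, 'N0', 5), (3, 'N1', 5)}
print(U)   # e.g. {(5, 'N0', 2), (5, 'N1', 2)}
```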
Semantics of Consistency: Input, Edit, and Output Relations
[Figure: each peer’s relation is split into an input relation (R_P^i, R_E^i, R_C^i) populated via the mappings m_{CE→P} and m_{C↔E}, an edit table of local updates (Δ_P, Δ_E, Δ_C), and an output relation (R_P^o, R_E^o, R_C^o) that combines them]
Incremental Reconciliation in a CDSS [Green, Karvounarakis, Tannen, Ives submission] • Re-compute each peer’s instance individually, in accordance with the input-edit-output model • Don’t re-compute from scratch • Translate all “new” updates into the target schema, maintaining transaction and sequencing info • Then perform reconciliation as we described previously • This problem requires new twists on view maintenance
Mapping Updates: Starting Point Given schema mappings: R(x,y), S(y,z) → ∃w T(x,z), U(z,w,y)
Convert these into update translation mappings that convert “deltas” over relations (similar to the rules of the [GM95] count algorithm):
-R(x,y), S(y,z) → ∃w -T(x,z), -U(z,w,y)
R(x,y), -S(y,z) → ∃w -T(x,z), -U(z,w,y)
-R(x,y), -S(y,z) → ∃w -T(x,z), -U(z,w,y)
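A minimal sketch, over toy set-valued relations, of applying the deletion rules above for the T(x,z) head (U and the existential are omitted). It deliberately shows why applying the rules naively isn’t enough – which is exactly the wrinkle the next slides address:

```python
def deleted_T(R, S, dR, dS):
    """Candidate deletions from T(x,z) induced by deleting dR ⊆ R and dS ⊆ S,
    where R and S are the states *before* the deletions, per the rules above."""
    out = set()
    # -R(x,y), S(y,z) -> -T(x,z)   (with S taken pre-deletion, this also
    #                               covers the -R(x,y), -S(y,z) rule)
    out |= {(x, z) for (x, y) in dR for (y2, z) in S if y == y2}
    # R(x,y), -S(y,z) -> -T(x,z)
    out |= {(x, z) for (x, y) in R for (y2, z) in dS if y == y2}
    return out

R, S = {(1, 2), (1, 3)}, {(2, 7), (3, 7)}
print(deleted_T(R, S, dR=set(), dS={(2, 7)}))
# {(1, 7)} -- yet T(1,7) is still derivable from R(1,3), S(3,7):
# applying the rules naively over-deletes, which is the wrinkle addressed next.
```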
A Wrinkle: Incremental Deletion Suppose our mapping is R(x,y) → S(x) And we are given: R = {(1,2), (1,3), (2,4)} and S = {1, 2} Then:
A Wrinkle: Incremental Deletion We want a deletion rule like: -R(x,y) → -S(x) But this doesn’t quite work (with R = {(1,2), (1,3), (2,4)}, S = {1, 2}): • If we delete R(1,2), then S should be unaffected • If we map -R(1,2) to -S(1), we can’t delete S(1) yet… • Only if we also delete R(1,3) should we delete S(1) • The source of the problem is that S(1) has several distinct derivations! (Similar to bag semantics)
A First Try… Counting [GM95] • (Gupta and Mumick’s counting algorithm) • When computing S, add a count of the # of derivations: S(1) has count 2, S(2) has count 1 • When we use -R(x,y) → -S(x), for each deletion decrement the count, and only remove the tuple when the count reaches 0 (deleting R(1,2) takes S(1)’s count from 2 to 1; also deleting R(1,3) takes it to 0, removing S(1))
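A minimal sketch of this counting idea for the R(x,y) → S(x) example, assuming a toy Counter-based store rather than the actual [GM95] machinery:

```python
from collections import Counter

def initial_counts(R):
    """Materialize S(x) for the mapping R(x,y) -> S(x), with derivation counts."""
    return Counter(x for (x, y) in R)

def delete_from_R(counts, dR):
    """Apply -R(x,y) -> -S(x) by decrementing counts; S(x) survives while > 0."""
    for (x, _y) in dR:
        counts[x] -= 1
    return {x for x, c in counts.items() if c > 0}

R = {(1, 2), (1, 3), (2, 4)}
counts = initial_counts(R)              # S(1) has count 2, S(2) has count 1
print(delete_from_R(counts, {(1, 2)}))  # {1, 2}: S(1)'s count drops to 1
print(delete_from_R(counts, {(1, 3)}))  # {2}:    count hits 0, S(1) is removed
```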
Where this Fails… • Suppose we have a cyclic definition (two peers want to exchange data): M1: R(x,y) → S(y,x) M2: S(x,y) → R(x,y)
[Figure: starting from R = {(1,2), (2,4)}, repeatedly applying M1 and M2 copies the tuples back and forth, growing R and S to {(1,2), (2,4), (2,1), (4,2)} and re-deriving the same tuples again and again]
How many times is each tuple derived? We need a finite fixpoint, or else this isn’t implementable! What happens if R deletes the tuple? If S does? …
Desiderata for a Solution • Record a trace of each distinct derivation of a tuple, w.r.t. its original relation and every mapping • Different from, e.g., Cui & Widom’s provenance traces, which only maintain source info • In cyclic cases, only count “identical loops” a finite number of times (say once) • This gives us a least fixpoint, in terms of tuples and their derivations • … It also requires a non-obvious solution, since we can’t use sets, trees, etc. to define provenance … An idea: think of the derivation as being analogous to a recurrence relation…
Our Approach: S-Tables Trace tuple provenance as a semiring polynomial (S,+,*,0,1), to which we add mapping application M(…): x + 0 = x, x + x = x, x * 0 = 0, (x+y)+z = x+(y+z), (x*y)*z = x*(y*z), x + y = y + x, x(y + z) = xy + xz, M(x + y) = M(x) + M(y) A tuple with provenance 0 is considered to not be part of the instance. M1: R(x,y) → S(y,x) M2: S(x,y) → R(x,y)
[Figure: S-tables for the cyclic example – initially R = {(1,2): p0 = t0, (2,4): p1 = t1, (2,1): p2 = M2(p4), (4,2): p3 = M2(p5)} and S = {(2,1): p4 = M1(p0), (4,2): p5 = M1(p1)}; at the fixpoint R(1,2)’s provenance becomes p0 = t0 + M2(p6), R(2,4)’s becomes p1 = t1 + M2(p7), and S gains (1,2): p6 = M1(p2) and (2,4): p7 = M1(p3)]
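A minimal sketch of provenance expressions obeying the identities above, encoded as plain Python tuples. M(0) = 0 is an extra assumption (needed so deletions propagate through mappings), and the circular equations of the fixpoint example need a least-fixpoint computation, sketched separately after the Deletion slide:

```python
# Provenance expressions are plain tuples: ("var", name), ("plus", a, b),
# ("times", a, b), ("map", M, a), or ZERO.
ZERO = ("zero",)

def simplify(e):
    tag = e[0]
    if tag == "plus":
        a, b = simplify(e[1]), simplify(e[2])
        if a == ZERO: return b                  # x + 0 = x
        if b == ZERO: return a
        if a == b:    return a                  # x + x = x
        return ("plus", a, b)
    if tag == "times":
        a, b = simplify(e[1]), simplify(e[2])
        if a == ZERO or b == ZERO: return ZERO  # x * 0 = 0
        return ("times", a, b)
    if tag == "map":
        a = simplify(e[2])
        if a == ZERO:                           # assumed M(0) = 0, so a mapped
            return ZERO                         # tuple vanishes with its source
        if a[0] == "plus":                      # M(x + y) = M(x) + M(y)
            return simplify(("plus", ("map", e[1], a[1]), ("map", e[1], a[2])))
        return ("map", e[1], a)
    return e                                    # "var" and ZERO are already simple

# R(2,4) at the fixpoint has provenance p1 = t1 + M2(p7); if the derivation
# behind p7 is deleted (becomes 0), the local insertion t1 still supports it:
p1 = ("plus", ("var", "t1"), ("map", "M2", ZERO))
print(simplify(p1))                               # ('var', 't1') -- tuple stays
print(simplify(("times", ZERO, ("var", "t1"))))   # ('zero',) -- would be dropped
```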
Incremental Insertion with S-tables Inserting a tuple t: • If there’s already an identical tuple t’, update the provenance of t’ to be prov(t) + prov(t’) • Then simplify – note the result may be no change! • Else insert t with its provenance
Deletion M1: R(x,y) → S(y,x) M2: S(x,y) → R(x,y) • Given -R(1,2) and -S(2,4) • Use the deletion rules M1: -R(x,y) → -S(y,x) and M2: -S(x,y) → -R(x,y) • Set p0 and p7 := 0 • Simplify (may be nontrivial if there is mutual recursion)
[Figure: after simplification, R shrinks from {(1,2): p0 = t0 + M2(p6), (2,4): p1 = t1 + M2(p7), (2,1): p2 = M2(p4), (4,2): p3 = M2(p5)} to {(2,4): p1 = t1, (4,2): p3 = M2(p5)}, and S shrinks from {(2,1): p4 = M1(p0), (4,2): p5 = M1(p1), (1,2): p6 = M1(p2), (2,4): p7 = M1(p3)} to {(4,2): p5 = M1(p1)}]
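A standalone toy sketch of the “zero it out and simplify to a least fixpoint” step. The provenance equations above are encoded flatly (mapping labels dropped, since all that matters here is whether an expression simplifies to 0), and for brevity only -R(1,2) is propagated rather than the slide’s -R(1,2) and -S(2,4):

```python
# Each provenance equation is a sum (outer list) of products (inner lists) of
# the base tokens and other provenance names it depends on.
eqs = {
    "p0": [["t0"], ["p6"]],   # p0 = t0 + M2(p6)   -- R(1,2)
    "p1": [["t1"], ["p7"]],   # p1 = t1 + M2(p7)   -- R(2,4)
    "p2": [["p4"]],           # p2 = M2(p4)        -- R(2,1)
    "p3": [["p5"]],           # p3 = M2(p5)        -- R(4,2)
    "p4": [["p0"]],           # p4 = M1(p0)        -- S(2,1)
    "p5": [["p1"]],           # p5 = M1(p1)        -- S(4,2)
    "p6": [["p2"]],           # p6 = M1(p2)        -- S(1,2)
    "p7": [["p3"]],           # p7 = M1(p3)        -- S(2,4)
}

def nonzero(base_tokens):
    """Least fixpoint: a provenance is nonzero iff some summand has all of its
    factors nonzero (a surviving base token or another nonzero provenance)."""
    nz = set(base_tokens)
    changed = True
    while changed:
        changed = False
        for p, sums in eqs.items():
            if p not in nz and any(all(f in nz for f in prod) for prod in sums):
                nz.add(p)
                changed = True
    return nz

print(sorted(nonzero({"t0", "t1"}) - {"t0", "t1"}))  # all of p0..p7 survive
print(sorted(nonzero({"t1"}) - {"t1"}))              # ['p1', 'p3', 'p5', 'p7']:
# zeroing t0 (i.e. -R(1,2)) makes p0, p2, p4, p6 collapse to 0, so R(1,2),
# R(2,1), S(2,1), and S(1,2) all disappear despite the cyclic mappings.
```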
Summary: S-Tables and Provenance More expressive than “why & where provenance” [Buneman+ 01], lineage tracing [Cui & Widom 01], other formalisms • Similar in spirit to “mapping routes” [Chiticariu+ 06], irrelevant rule elimination [Levy+ 92] If the set of mappings has a least fixpoint in datalog, it has one in our semantics • Our polynomial captures all possible derivation paths “through the mappings” – a form of “how provenance” (Tannen) Gives us a means of performing incremental maintenance in a fully P2P model, even with cycles (that have least fixpoints)
Ongoing Work Implementing the provenance-based maintenance algorithm • Procedure can be cast as a set of datalog rules • But: needs “slightly more” than SQL or stratified datalog semantics Inverse mappings • We propagate updates “down” a mapping – what about upwards? Necessary to support mirroring… • Provenance makes it quite different from existing view update literature Performance! Lots of opportunities for caching antecedents, reusing computations across reconciliations, answering queries using views, multi-query optimization!
SHARQ [with Davidson, Tannen, Stoeckert, White] ORCHESTRA is the core engine of a larger effort in bioinformatics information management: • SHARQ (Sharing Heterogeneous, Autonomous Resources and Queries) • Develop a network of database instances, views, query forms, etc. that: • Is incrementally extensible with new data, views, query templates • Supports search for “the right” query form to answer a question • Accommodates a variety of different sub-communities • Supports both browsing and searching modes of operation • … Perhaps even supports text extraction and approximate matches
Related Work • Incomplete information [Imielinski & Lipski 84], info source tracking [Sadri 98] • Inconsistency repair [Bry 97], [Arenas+ 99] • Provenance [Alagar+ 95], [Cui & Widom 01], [Buneman+ 01], [Widom+ 05] • Distributed concurrency control • Optimistic CC [KR 81], version vectors [PPR+ 83], … • View update [Dayal & Bernstein 82], [Keller 84, 85], … • Incremental maintenance [Gupta & Mumick 95], [Blakeley 86, 89], … • File synchronization and distributed filesystems • Harmony [Foster+ 04], Unison [Pierce+ 01]; CVS, Subversion, etc. • Ivy [MMGC 02], Coda [Braam 98, KS 95], Bayou [TTP+ 96], … • Trio [Widom+], MystiQ [Suciu+] • Peer data management systems: Piazza [Halevy+ 03, 04], Hyperion [Kementsietsidis+ 04], [Calvanese+ 04], peer data exchange [Fuxman+ 05], Trento/Toronto LRM [Bernstein+ 02]
Conclusions ORCHESTRA focuses on trying to coordinate disagreement, rather than enforcing agreement • Accommodate disagreement with an extended data model and trust policies • Reconcile updates at the transaction level • Define update translation mappings to get all data into the target schema Ongoing work: implementing update mappings, caching, replication, biological applications