CHAPTER 14: DATA PROVENANCE
PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES
“Where Did this Data Come from?”
Challenge: integrated data may come from many sources and mappings, of different quality or trustworthiness!
• How did I get this particular result?
• What mappings produced it?
• How much should I trust (believe) it?
Data provenance (lineage) captures the relationships between tuples in a set of data instances.
An Example: View Tuple Derivations
Source relations: R, S
View: V1 = (R ⋈ S) ∪ (S ⋈ S)
Formulating a Provenance Model
Conceptually, provenance captures the operations and operands going into a result.
There are many options to do this, and many levels of detail!
A “good” provenance model should:
• Have a formal semantics
• Have equivalence properties such that equivalent query plans produce equivalent provenance
• Connect to notions of value, quality or score
Outline
• The two views of provenance
• Applications of data provenance
• Provenance semirings: one ring to rule them all
• Storing provenance
Provenance as Annotations on Data
• Annotate each derivation with an “explanation” in terms of relational algebra and the tuple operands
• Lets us “look up” the derivation of a result
• View V1 (in Datalog):
  V1(x,z) :- R(x,y), S(y,z)
  V1(x,x) :- S(x,y), S(y,x)
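The two Datalog rules for V1 can be sketched in Python as a pair of nested loops that record, for every view tuple, which rule and which operand tuples produced it. This is a minimal sketch, not the book's implementation: the contents of R and S are the tuples that appear in this deck's derivation example, and treating them as the complete relations is an assumption. The function name derive_v1 is hypothetical.

```python
# Tuples that appear in the deck's derivation example; assumed to be complete.
R = {(1, 2), (1, 4)}
S = {(2, 3), (3, 2), (4, 3)}


def derive_v1(R, S):
    """Return {view tuple: [(rule, operand, operand), ...]} for view V1."""
    derivations = {}
    # Rule 1: V1(x,z) :- R(x,y), S(y,z)
    for (x, y) in sorted(R):
        for (y2, z) in sorted(S):
            if y == y2:
                derivations.setdefault((x, z), []).append(
                    ("rule 1", ("R", (x, y)), ("S", (y2, z))))
    # Rule 2: V1(x,x) :- S(x,y), S(y,x)
    for (x, y) in sorted(S):
        if (y, x) in S:
            derivations.setdefault((x, x), []).append(
                ("rule 2", ("S", (x, y)), ("S", (y, x))))
    return derivations


v1 = derive_v1(R, S)
for view_tuple, explanations in sorted(v1.items()):
    print(view_tuple, explanations)
```

Looking up v1[(1, 3)] returns two explanations, one per join path, which is the “look up the derivation” idea from the slide.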
Provenance as a Graph of Relationships
• Bipartite graph: tuple nodes connected via “derivation nodes”
• Encodes a hypergraph (hyperedges = derivations)
• Makes direct derivation relationships more explicit
[Figure: derivation graph; each derivation node is labeled “derives via V1”]
  R(1,4), S(4,3) → V1(1,3)
  R(1,2), S(2,3) → V1(1,3)
  S(2,3), S(3,2) → V1(2,2)
  S(3,2), S(2,3) → V1(3,3)
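One way to picture the bipartite encoding is as a plain edge list in Python: tuple nodes point to derivation nodes, and derivation nodes point to the tuple they produce. The derivation IDs d1 to d4 and the helper base_tuples are hypothetical names introduced only for this sketch.

```python
from collections import defaultdict

# Each derivation node: (id, mapping, input tuple nodes, output tuple node).
derivations = [
    ("d1", "V1", ["R(1,2)", "S(2,3)"], "V1(1,3)"),
    ("d2", "V1", ["R(1,4)", "S(4,3)"], "V1(1,3)"),
    ("d3", "V1", ["S(2,3)", "S(3,2)"], "V1(2,2)"),
    ("d4", "V1", ["S(3,2)", "S(2,3)"], "V1(3,3)"),
]

edges = []  # directed edges of the bipartite graph
for d_id, _mapping, inputs, output in derivations:
    edges.extend((t, d_id) for t in inputs)  # tuple node -> derivation node
    edges.append((d_id, output))             # derivation node -> tuple node

incoming = defaultdict(list)
for src, dst in edges:
    incoming[dst].append(src)


def base_tuples(node):
    """Collect the source tuples that a node ultimately derives from."""
    if not incoming[node]:
        return {node}  # no incoming edges: this is a base tuple
    result = set()
    for parent in incoming[node]:
        result |= base_tuples(parent)
    return result


print(sorted(base_tuples("V1(1,3)")))
```

Every edge connects a tuple node to a derivation node or the reverse, never two nodes of the same kind, which is what makes the graph bipartite.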
Making the Two Interchangeable
• We can make these equivalent by introducing provenance tokens (equivalently, node IDs) for each tuple
• Derived tuples’ annotations = expressions over tokens
[Figure: the derivation graph with tuple nodes labeled by tokens: r1, r2 for R; s1, s2, s3 for S; v1, v2, v3 for V1]
Where Can We Use Provenance?
Explanations
• Help the user understand why an item exists
Scoring
• Provide a ranked list of “most relevant” results
Reasoning about interactions
• Help the user understand data relationships
Examples of Provenance’s Utility
Schema mapping debugging: We may have a bad result
• Determine why that result exists, what is faulty
Bioinformatics data integration: Different sources have different levels of reliability or authoritativeness
• Rank results by score!
Probabilistic databases: We may need to know that results are correlated
• Encode the relationships, use to assign probabilities
The Notion of Provenance as Annotations
• Many formalisms were defined for using query computations to produce annotations
• Each captured certain subtleties
• The key question: Is there one “most powerful” model that captures the properties of the relational algebra*?
• Equivalent queries should produce equivalent provenance
* over multi-sets or bags, as used by “real” systems
The Provenance Semiring Model
To represent provenance, use:
• A set of provenance tokens or tuple IDs, K
• Abstract operators representing combination of tuples
Abstract sum operator, ⊕, for union or projection
• has identity element 0 (a ⊕ 0 ≡ 0 ⊕ a ≡ a)
Abstract product operator, ⊗, for join
• has identity element 1 (a ⊗ 1 ≡ 1 ⊗ a ≡ a)
• is annihilated by 0 (a ⊗ 0 ≡ 0 ⊗ a ≡ 0)
This is formally a commutative semiring
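The semiring laws can be checked mechanically on small sample sets. The sketch below is illustrative only: the Semiring dataclass, the check_laws helper, and the two instances (counting over the naturals, which matches bag semantics, and Boolean, which matches set semantics) are names chosen for this example rather than anything defined in the chapter.

```python
import operator
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # abstract sum, used for union/projection
    times: Callable[[Any, Any], Any]  # abstract product, used for join
    zero: Any                         # identity for plus, annihilator for times
    one: Any                          # identity for times


def check_laws(k, samples):
    """Assert the commutative-semiring laws on every combination of samples."""
    for a in samples:
        assert k.plus(a, k.zero) == a == k.plus(k.zero, a)
        assert k.times(a, k.one) == a == k.times(k.one, a)
        assert k.times(a, k.zero) == k.zero == k.times(k.zero, a)
    for a, b, c in product(samples, repeat=3):
        assert k.plus(a, b) == k.plus(b, a)
        assert k.times(a, b) == k.times(b, a)
        assert k.plus(k.plus(a, b), c) == k.plus(a, k.plus(b, c))
        assert k.times(k.times(a, b), c) == k.times(a, k.times(b, c))
        # product distributes over sum
        assert k.times(a, k.plus(b, c)) == k.plus(k.times(a, b), k.times(a, c))
    return True


COUNTING = Semiring(operator.add, operator.mul, 0, 1)
BOOLEAN = Semiring(operator.or_, operator.and_, False, True)

print(check_laws(COUNTING, [0, 1, 2, 3]), check_laws(BOOLEAN, [False, True]))
```

A finite check like this is only a sanity test on the sampled values, not a proof that the laws hold for the whole carrier set.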
The Provenance Semiring Model
• We can re-express our example as below, using the semiring operators instead of the relational algebra ones
[Figure: the derivation graph over tokens r1, r2, s1, s2, s3 and v1, v2, v3, with semiring operators on the derivation nodes]
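Written out as expressions, the example might look as follows. The token assignment is an assumption inferred from the layout of the figure residue, not stated in the text: r1 = R(1,2), r2 = R(1,4), s1 = S(2,3), s2 = S(3,2), s3 = S(4,3), and v1, v2, v3 for V1(1,3), V1(2,2), V1(3,3).

```latex
\begin{align*}
v_1 &= (r_1 \otimes s_1) \oplus (r_2 \otimes s_3) \\
v_2 &= s_1 \otimes s_2 \\
v_3 &= s_2 \otimes s_1
\end{align*}
```

Because ⊗ is commutative, v2 and v3 carry equivalent annotations even though they are distinct view tuples.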
Tokens for Mappings
• Sometimes we would like to assign a token to the actual mapping or rule used, so we can assign it a value
• View V1 (in Datalog):
  V1(x,z) :- R(x,y), S(y,z)   (call this m1)
  V1(x,x) :- S(x,y), S(y,x)   (call this m2)
Example Application: Provenance Visualization
[Figure: provenance graph visualization. Labels: base tuple derivation (token not shown); tuple nodes; derivation by mapping M5]
Example Application: Tuple Scoring
• For ranked query results, we may adopt the following model commonly used in ranking:
  • Assign each base tuple a score = −log2(probability)
  • Use arithmetic sum as ⊗
  • Use min as ⊕
• Suppose prob(r1) = 0.5, prob(s1) = 0.5, and all other probabilities are 1.0
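The scoring model can be tried out directly by evaluating each view tuple's provenance with sum as ⊗ and min as ⊕. In this sketch the provenance expressions are written by hand as sums of products, and the token-to-tuple assignment (r1 = R(1,2), r2 = R(1,4), s1 = S(2,3), s2 = S(3,2), s3 = S(4,3)) is an assumption inferred from the figures; the helper name evaluate is hypothetical.

```python
import math

# Probabilities from the slide: r1 and s1 are 0.5, all others 1.0.
prob = {"r1": 0.5, "r2": 1.0, "s1": 0.5, "s2": 1.0, "s3": 1.0}
score = {token: -math.log2(p) for token, p in prob.items()}

# Provenance of each view tuple: outer list = ⊕ of terms, inner list = ⊗ of tokens.
# Assumed token assignment; see the note above.
provenance = {
    "v1": [["r1", "s1"], ["r2", "s3"]],
    "v2": [["s1", "s2"]],
    "v3": [["s2", "s1"]],
}


def evaluate(sum_of_products):
    """Evaluate with ⊗ = arithmetic sum and ⊕ = min."""
    return min(sum(score[t] for t in term) for term in sum_of_products)


view_scores = {v: evaluate(p) for v, p in provenance.items()}
print(view_scores)
```

Under these assumptions v1 gets score 0, because its cheapest derivation (through r2 and s3) uses only probability-1 tuples, while v2 and v3 each get score 1, since both depend on s1. Lower scores rank higher, and 2 raised to the negated score recovers the probability of the most likely derivation.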
Storing Provenance
• Use tuple keys as tokens
• Encode provenance graph as relations
• View V1 (in Datalog):
  V1(x,z) :- R(x,y), S(y,z)   (relate tuples with table Pv1-1)
  V1(x,x) :- S(x,y), S(y,x)   (relate tuples with table Pv1-2)
[Figure: relations R, S and V1 with provenance tables Pv1-1 and Pv1-2. Callout: “These are redundant if we know the Datalog”]
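A relational encoding along these lines can be sketched with Python's built-in sqlite3 module. The schema below is an assumption made for illustration: the column names, the use of underscores in Pv1_1 and Pv1_2, and the choice to store only the variable bindings x, y, z per derivation are not taken from the slides. The contents of R and S are the tuples that appear in the deck's derivation example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (a INTEGER, b INTEGER, PRIMARY KEY (a, b));
    CREATE TABLE S (a INTEGER, b INTEGER, PRIMARY KEY (a, b));
    INSERT INTO R VALUES (1, 2), (1, 4);
    INSERT INTO S VALUES (2, 3), (3, 2), (4, 3);

    -- One row per derivation through rule 1: V1(x,z) :- R(x,y), S(y,z).
    -- Only x, y, z are stored; the keys of the R, S and V1 tuples
    -- follow from the rule body and head.
    CREATE TABLE Pv1_1 AS
        SELECT R.a AS x, R.b AS y, S.b AS z
        FROM R JOIN S ON R.b = S.a;

    -- One row per derivation through rule 2: V1(x,x) :- S(x,y), S(y,x).
    CREATE TABLE Pv1_2 AS
        SELECT s1.a AS x, s1.b AS y
        FROM S AS s1 JOIN S AS s2 ON s1.b = s2.a AND s2.b = s1.a;
""")

# Which source tuples produced V1(1,3)?
rows = conn.execute(
    "SELECT x, y, z FROM Pv1_1 WHERE x = 1 AND z = 3 ORDER BY y"
).fetchall()
why_1_3 = [(("R", (x, y)), ("S", (y, z))) for x, y, z in rows]
print(why_1_3)
```

Storing one provenance table per rule keeps each row narrow, and the view itself can be recomputed by projecting the head variables out of the provenance tables.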
Data Provenance Wrap-up
• Provenance is critical to understanding and assessing the believability of data, and in debugging
• Two equivalent representations: annotations vs. graph
• Provenance semiring model preserves the “expected” equivalences of the relational algebra
• We can take semiring provenance and evaluate it with different semirings to get useful scores
• We can store provenance using relations
• Recent work beyond the scope of the book:
  • Extending provenance to more complex queries, e.g., with aggregation
  • Languages for querying provenance (primarily as a graph)