CHAPTER 14: DATA PROVENANCE

Principles of Data Integration
AnHai Doan, Alon Halevy, Zachary Ives

“Where Did this Data Come from?”
Challenge: integrated data may come from many sources and mappings, of different quality or trustworthiness!
• How did I get this particular result?
• What mappings produced it?
• How much should I trust (believe) it?
Data provenance (lineage) captures the relationships between tuples in a set of data instances

An Example: View Tuple Derivations
Source relations: R, S
View V1 = (R ⋈ S) ∪ (S ⋈ S)

Formulating a Provenance Model
Conceptually, provenance captures the operations and operands going into a result
There are many options to do this, and many levels of detail!
A “good” provenance model should:
• Have a formal semantics
• Have equivalence properties such that equivalent query plans produce equivalent provenance
• Connect to notions of value, quality or score

Outline
• The two views of provenance
• Applications of data provenance
• Provenance semirings: one ring to rule them all
• Storing provenance

Provenance as Annotations on Data
• Annotate each derivation with an “explanation” in terms of relational algebra and the tuple operands
• Lets us “look up” the derivation of a result
• View V1 (in Datalog):
  • V1(x,z) :- R(x,y), S(y,z)
  • V1(x,x) :- S(x,y), S(y,x)
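To make the idea of "looking up" a derivation concrete, here is a minimal Python sketch that evaluates the two Datalog rules and records, for each V1 tuple, the operand tuples of every derivation. The source tuples are the ones named in the deck's derivation figure; the function name `derive_v1` and the dictionary layout are illustrative assumptions, not part of the original material.

```python
# Source tuples as named in the deck's derivation figure.
R = {(1, 2), (1, 4)}
S = {(2, 3), (3, 2), (4, 3)}


def derive_v1(r, s):
    """Evaluate both V1 rules; map each result tuple to its list of derivations.

    Each derivation is a pair of (relation name, tuple) operands.
    (Hypothetical helper, written for illustration only.)
    """
    derivations = {}
    # Rule 1: V1(x,z) :- R(x,y), S(y,z)
    for (x, y) in sorted(r):
        for (y2, z) in sorted(s):
            if y == y2:
                derivations.setdefault((x, z), []).append(
                    (("R", (x, y)), ("S", (y, z)))
                )
    # Rule 2: V1(x,x) :- S(x,y), S(y,x)
    for (x, y) in sorted(s):
        if (y, x) in s:
            derivations.setdefault((x, x), []).append(
                (("S", (x, y)), ("S", (y, x)))
            )
    return derivations


if __name__ == "__main__":
    for result, ways in sorted(derive_v1(R, S).items()):
        print("V1%s has %d derivation(s): %s" % (result, len(ways), ways))
```

Under these inputs V1(1,3) is reachable in two ways (through y = 2 and y = 4), which is exactly the case where an annotation has to record alternatives rather than a single explanation.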

Provenance as a Graph of Relationships
• Bipartite graph: tuple nodes connected via “derivation nodes”
• Encodes a hypergraph (hyperedges = derivations)
• Makes direct derivation relationships more explicit
[Figure: bipartite provenance graph with tuple nodes R(1,2), R(1,4), S(2,3), S(3,2), S(4,3), V1(1,3), V1(2,2), V1(3,3) and four “derives via V1” derivation nodes]

Making the Two Interchangeable
• We can make these equivalent by introducing provenance tokens (equiv. node IDs) for each tuple
• Derived tuples’ annotations = expressions over tokens
[Figure: the provenance graph with tuple nodes labeled by tokens r1, r2, s1, s2, s3 (source tuples) and v1, v2, v3 (V1 tuples)]
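The interchangeability claimed above can be sketched in Python: a list of derivation nodes (the graph view) is turned into one token expression per derived tuple (the annotation view) and back again. The token-to-tuple assignment used here (r1 = R(1,2), r2 = R(1,4), s1 = S(2,3), s2 = S(3,2), s3 = S(4,3), v1 = V1(1,3), v2 = V1(2,2), v3 = V1(3,3)) is an assumption, since the figure's labels did not survive extraction; the function names are hypothetical as well.

```python
# Graph view: one entry per derivation node, as (source tokens, target token).
# The token assignment is assumed, not taken from the original figure.
DERIVATIONS = [
    (("r1", "s1"), "v1"),
    (("r2", "s3"), "v1"),
    (("s1", "s2"), "v2"),
    (("s2", "s1"), "v3"),
]


def graph_to_annotations(derivations):
    """Build one sum-of-products expression (as a string) per derived token."""
    products = {}
    for sources, target in derivations:
        products.setdefault(target, []).append(" ⊗ ".join(sources))
    return {
        target: " ⊕ ".join("(%s)" % p for p in terms)
        for target, terms in products.items()
    }


def annotations_to_graph(annotations):
    """Recover the derivation nodes from the sum-of-products expressions."""
    derivations = []
    for target, expression in annotations.items():
        for term in expression.split(" ⊕ "):
            sources = tuple(term.strip("()").split(" ⊗ "))
            derivations.append((sources, target))
    return derivations


if __name__ == "__main__":
    annotations = graph_to_annotations(DERIVATIONS)
    for token, expression in annotations.items():
        print(token, "=", expression)
    # The round trip returns the original derivation nodes.
    print(annotations_to_graph(annotations) == DERIVATIONS)
```

The string encoding is only for display; a real system would more likely keep the expressions as structured data.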

Where Can We Use Provenance?
Explanations
• Help the user understand why an item exists
Scoring
• Provide a ranked list of “most relevant” results
Reasoning about interactions
• Help the user understand data relationships

Examples of Provenance’s Utility
Schema mapping debugging: We may have a bad result
• Determine why that result exists and what is faulty
Bioinformatics data integration: Different sources have different levels of reliability or authoritativeness
• Rank results by score!
Probabilistic databases: We may need to know that results are correlated
• Encode the relationships and use them to assign probabilities

The Notion of Provenance as Annotations
• Many formalisms were defined for using query computations to produce annotations
• Each captured certain subtleties
• The key question: Is there one “most powerful” model that captures the properties of the relational algebra*?
• Equivalent queries should produce equivalent provenance
* over multi-sets or bags, as used by “real” systems

The Provenance Semiring Model
To represent provenance, use:
• A set of provenance tokens or tuple IDs, K
• Abstract operators representing combination of tuples
Abstract sum operator, ⊕, for union or projection
• has identity element 0 (a ⊕ 0 ≡ 0 ⊕ a ≡ a)
Abstract product operator, ⊗, for join
• has identity element 1 (a ⊗ 1 ≡ 1 ⊗ a ≡ a)
• also (a ⊗ 0 ≡ 0 ⊗ a ≡ 0)
This is formally a commutative semiring
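As a rough illustration of the abstract operators, the Python sketch below defines a generic semiring record and evaluates the two V1 rules over annotated relations, using ⊗ for joins and ⊕ to merge alternative derivations. The `Semiring` class, the `evaluate_v1` helper and the choice of the counting and Boolean semirings as instances are assumptions made for the example, not definitions from the slides.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (K, plus, times, zero, one). Illustrative only."""
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]   # abstract sum: union / projection
    times: Callable[[Any, Any], Any]  # abstract product: join


# Two standard instances: counting (bag semantics) and Boolean (set semantics).
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)


def evaluate_v1(r, s, k):
    """Evaluate V1 over annotated relations (dicts: tuple -> annotation)."""
    out = {}

    def add(key, annotation):
        # Alternative derivations of the same tuple are combined with plus.
        out[key] = k.plus(out.get(key, k.zero), annotation)

    # V1(x,z) :- R(x,y), S(y,z)
    for (x, y), a in r.items():
        for (y2, z), b in s.items():
            if y == y2:
                add((x, z), k.times(a, b))
    # V1(x,x) :- S(x,y), S(y,x)
    for (x, y), a in s.items():
        for (y2, x2), b in s.items():
            if y == y2 and x2 == x:
                add((x, x), k.times(a, b))
    return out


if __name__ == "__main__":
    r = {(1, 2): 1, (1, 4): 1}
    s = {(2, 3): 1, (3, 2): 1, (4, 3): 1}
    # Counting semiring: each annotation is the number of derivations.
    print(evaluate_v1(r, s, COUNTING))
```

With every source tuple annotated 1 in the counting semiring, V1(1,3) comes out as 2 and the other two tuples as 1, matching the number of derivations each has.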

The Provenance Semiring Model
• We can re-express our example as below, using the semiring operators instead of the relational algebra ones
[Figure: the token-labeled provenance graph (tokens r1, r2, s1, s2, s3, v1, v2, v3), with derivations expressed using ⊕ and ⊗]

Tokens for Mappings
• Sometimes we would like to assign a token to the actual mapping or rule used, so we can assign it a value
• View V1 (in Datalog):
  • V1(x,z) :- R(x,y), S(y,z) (call this m1)
  • V1(x,x) :- S(x,y), S(y,x) (call this m2)

Example Application: Provenance Visualization
[Figure: provenance graph visualization. Callouts: base tuple derivation (token not shown); tuple nodes; derivation by mapping M5]

Example Application: Tuple Scoring
• For ranked query results, we may adopt the following model commonly used in ranking:
  • Assign a score to each base tuple = −log2(probability)
  • Use arithmetic sum as ⊗
  • Use min as ⊕
• Suppose
  • prob(r1) = 0.5, prob(s1) = 0.5, others are 1.0
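The scoring model above can be worked through in a few lines of Python. The probabilities are the ones given on the slide; the provenance expressions assume the token assignment r1 = R(1,2), r2 = R(1,4), s1 = S(2,3), s2 = S(3,2), s3 = S(4,3), which is a guess because the figure's labels were lost. The names `PROVENANCE` and `score` are hypothetical.

```python
import math

# Probabilities from the slide: r1 and s1 are 0.5, all others 1.0.
PROB = {"r1": 0.5, "s1": 0.5, "r2": 1.0, "s2": 1.0, "s3": 1.0}

# Base score = -log2(probability); "+ 0.0" only normalizes -0.0 to 0.0.
SCORE = {token: -math.log2(p) + 0.0 for token, p in PROB.items()}

# Provenance of each V1 tuple as a sum (outer list) of products (inner lists).
# The token-to-tuple assignment is assumed for illustration.
PROVENANCE = {
    "V1(1,3)": [["r1", "s1"], ["r2", "s3"]],
    "V1(2,2)": [["s1", "s2"]],
    "V1(3,3)": [["s2", "s1"]],
}


def score(expression):
    """Evaluate a sum-of-products with arithmetic sum as ⊗ and min as ⊕."""
    return min(
        (sum(SCORE[token] for token in product) for product in expression),
        default=math.inf,  # identity of min, i.e. the semiring's 0
    )


if __name__ == "__main__":
    # Lower score means higher probability, so sort ascending.
    for name in sorted(PROVENANCE, key=lambda n: score(PROVENANCE[n])):
        print(name, score(PROVENANCE[name]))
```

Under these assumptions V1(1,3) scores 0, because its second derivation uses only tuples with probability 1.0, while V1(2,2) and V1(3,3) each score 1, corresponding to probability 0.5.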

Useful Semirings

Storing Provenance
• Use tuple keys as tokens
• Encode provenance graph as relations
• View V1 (in Datalog):
  • V1(x,z) :- R(x,y), S(y,z) (relate tuples with table Pv1-1)
  • V1(x,x) :- S(x,y), S(y,x) (relate tuples with table Pv1-2)
[Figure: tables R, S and V1, with provenance tables Pv1-1 and Pv1-2]

Storing Provenance
• Use tuple keys as tokens
• Encode provenance graph as relations
• View V1 (in Datalog):
  • V1(x,z) :- R(x,y), S(y,z)
  • V1(x,x) :- S(x,y), S(y,x)
[Figure: tables R, S and V1, with provenance tables Pv1-1 and Pv1-2. Callout: “These are redundant if we know the Datalog”]
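One way to picture the relational encoding is the following sketch using Python's built-in sqlite3 module. It keeps one provenance table per rule, with one row per derivation. The column layout is an assumption: it reads the "redundant" callout as saying that, once the rule is known, the variable bindings (x, y, z) are enough to reconstruct the keys of the participating R, S and V1 tuples. Table names Pv1_1 and Pv1_2 stand in for the slide's Pv1-1 and Pv1-2, and `derivations_of` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R(a INTEGER, b INTEGER, PRIMARY KEY (a, b));
CREATE TABLE S(a INTEGER, b INTEGER, PRIMARY KEY (a, b));
INSERT INTO R VALUES (1, 2), (1, 4);
INSERT INTO S VALUES (2, 3), (3, 2), (4, 3);

-- One row per derivation via V1(x,z) :- R(x,y), S(y,z).
CREATE TABLE Pv1_1 AS
  SELECT R.a AS x, R.b AS y, S.b AS z FROM R JOIN S ON R.b = S.a;

-- One row per derivation via V1(x,x) :- S(x,y), S(y,x).
CREATE TABLE Pv1_2 AS
  SELECT s1.a AS x, s1.b AS y
  FROM S AS s1 JOIN S AS s2 ON s1.b = s2.a AND s2.b = s1.a;

-- The view itself is a projection of the provenance tables.
CREATE VIEW V1 AS
  SELECT x, z FROM Pv1_1 UNION SELECT x, x FROM Pv1_2;
""")


def derivations_of(x, z):
    """Look up every stored derivation of V1(x, z) as pairs of source tuples."""
    rows = conn.execute(
        "SELECT y FROM Pv1_1 WHERE x = ? AND z = ? ORDER BY y", (x, z)
    ).fetchall()
    found = [(("R", (x, y)), ("S", (y, z))) for (y,) in rows]
    if x == z:
        rows = conn.execute(
            "SELECT y FROM Pv1_2 WHERE x = ? ORDER BY y", (x,)
        ).fetchall()
        found += [(("S", (x, y)), ("S", (y, x))) for (y,) in rows]
    return found


if __name__ == "__main__":
    for (x, z) in conn.execute("SELECT x, z FROM V1 ORDER BY x, z"):
        print("V1(%d,%d):" % (x, z), derivations_of(x, z))
```

Because the tuple keys double as tokens, no separate token column is needed in this layout; the design trades some query-time reconstruction for narrower provenance tables.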

Data Provenance Wrap-up
• Provenance is critical to understanding and assessing the believability of data, and in debugging
• Two equivalent representations: annotations vs. graph
• Provenance semiring model preserves the “expected” equivalences of the relational algebra
• We can take semiring provenance and evaluate it with different semirings to get useful scores
• We can store provenance using relations
• Recent work beyond the scope of the book:
  • Extending provenance to more complex queries, e.g., with aggregation
  • Languages for querying provenance (primarily as a graph)
