Lecture 11: Provenance and Data privacy. December 8, 2010. Outline. Database provenance Slides based on Val Tannen’s Keynote talk at EDBT 2010 Data privacy Slides from my UW colloquium talk in 2005. Data Provenance. provenance, n .
Lecture 11: Provenance and Data privacy
December 8, 2010
Dan Suciu -- CSEP544 Fall 2010

Outline
• Database provenance
  • Slides based on Val Tannen’s Keynote talk at EDBT 2010
• Data privacy
  • Slides from my UW colloquium talk in 2005
Data Provenance
provenance, n. The fact of coming from some particular source or quarter; origin, derivation [Oxford English Dictionary]
• Data provenance [BunemanKhannaTan01]: aims to explain how a particular result (in an experiment, simulation, query, workflow, etc.) was derived.
• Most science today is data-intensive. Scientists, e.g., biologists and astronomers, worry about data provenance all the time.
EDBT Keynote, Lausanne
Provenance? Lineage? Pedigree?
• Cf. Peter Buneman:
  • Pedigree is for dogs
  • Lineage is for kings
  • Provenance is for art
• For data, let’s be artistic (artsy?)

Database transformations?
• Queries
• Views
• ETL tools
• Schema mappings (as used in data exchange)

Outline
• What’s with the semirings? Annotation propagation [GK&T PODS 07, GKI&T VLDB 07]
• Housekeeping in the zoo of provenance models
Propagating annotations through database operations
[Figure: JOIN (on B) of R(A, B, C) and S(D, B, E), giving R ⋈ S(A, B, C, D, E)]
The annotation p⋅r means joint use of data annotated by p and data annotated by r

Another way to propagate annotations
[Figure: UNION of R(A, B, C) and S(A, B, C), giving R ⋃ S(A, B, C)]
The annotation p+r means alternative use of data

Another use of +
[Figure: PROJECT of R(A, B, C) onto A and B, giving Π_AB R(A, B)]
+ means alternative use of data
An example in positive relational algebra (SPJU)
$Q = \sigma_{C=e}\,\Pi_{AC}\left(\Pi_{AC}R \bowtie \Pi_{BC}R \;\cup\; \Pi_{AB}R \bowtie \Pi_{BC}R\right)$
[Figure: R(A, B, C) and the result Q(A, C)]
For selection we multiply with two special annotations, 0 and 1
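The propagation rules on these slides (multiply for join, add for union and projection, multiply by 0 or 1 for selection) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the talk: a relation is a dict from rows to annotations, annotations are plain integers (multiplicities), and the helper names `row`, `union`, `project`, `join`, `select` and the sample data are all made up here.

```python
from collections import defaultdict

# Assumption: a K-relation is a dict {row: annotation}; a row is a frozenset
# of (attribute, value) pairs and annotations are ints (bag multiplicities).

def row(**attrs):
    return frozenset(attrs.items())

def union(r, s):
    """Alternative use: annotations of equal rows are added."""
    out = defaultdict(int)
    for rel in (r, s):
        for t, k in rel.items():
            out[t] += k
    return dict(out)

def project(r, attrs):
    """Rows that collapse to the same projected row have their annotations added."""
    out = defaultdict(int)
    for t, k in r.items():
        out[frozenset((a, v) for a, v in t if a in attrs)] += k
    return dict(out)

def join(r, s):
    """Joint use: annotations of matching rows are multiplied (natural join)."""
    out = defaultdict(int)
    for t1, k1 in r.items():
        for t2, k2 in s.items():
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out[frozenset({**d1, **d2}.items())] += k1 * k2
    return dict(out)

def select(r, pred):
    """Selection multiplies by 1 (predicate holds) or 0 (row becomes absent)."""
    out = {t: k * (1 if pred(dict(t)) else 0) for t, k in r.items()}
    return {t: k for t, k in out.items() if k != 0}

# Made-up sample data: two R rows sharing B = "b", one S row.
R = {row(A="a", B="b", C="c"): 2, row(A="d", B="b", C="e"): 3}
S = {row(D="x", B="b", E="y"): 5}

joined = join(R, S)                               # annotations 2*5 and 3*5
projected = project(R, {"B"})                     # annotations 2+3
selected = select(R, lambda t: t["C"] == "e")     # keeps only the second row
```

Rows whose annotation becomes 0 are dropped from the dict, which matches the convention that absent tuples carry the annotation 0.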
Summary so far
• A space of annotations, K
• K-relations: every tuple annotated with some element from K.
• Binary operations on K: ⋅ corresponds to joint use (join), and + corresponds to alternative use (union and projection).
• We assume K contains special annotations 0 and 1.
• “Absent” tuples are annotated with 0!
• 1 is a “neutral” annotation (no restrictions).
• Algebra of annotations? What are the laws of (K, +, ⋅, 0, 1)?
Annotated relational algebra
• DBMS query optimizers assume certain equivalences:
  • union is associative, commutative
  • join is associative, commutative, distributes over union
  • projections and selections commute with each other and with union and join (when applicable)
  • Etc., but no R ⋈ R = R ⋃ R = R (i.e., no idempotence, to allow for bag semantics)
• Equivalent queries should produce the same annotations!
Proposition. The above identities hold for queries on K-relations iff (K, +, ⋅, 0, 1) is a commutative semiring.
Hence, for each commutative semiring K we have a K-annotated relational algebra.
What is a commutative semiring?
An algebraic structure (K, +, ⋅, 0, 1) where:
• K is the domain
• + is associative, commutative, with 0 identity
• ⋅ is associative, with 1 identity
• ⋅ distributes over +
• a⋅0 = 0⋅a = 0
(the properties above define a semiring)
• ⋅ is also commutative
Unlike a ring, no requirement for inverses to +
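The laws listed on this slide are easy to spot-check mechanically. The sketch below is an illustration only: the `Semiring` record and the `is_commutative_semiring` helper are hypothetical names, and the check runs over a finite sample, so it can refute the laws but proves them only when the sample is the whole domain (as it is for the Booleans).

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

def is_commutative_semiring(k: Semiring, sample) -> bool:
    """Check every law from the slide on all triples drawn from `sample`."""
    p, t = k.plus, k.times
    for a, b, c in product(sample, repeat=3):
        ok = (
            p(p(a, b), c) == p(a, p(b, c))            # + associative
            and p(a, b) == p(b, a)                    # + commutative
            and p(a, k.zero) == a                     # 0 is the identity of +
            and t(t(a, b), c) == t(a, t(b, c))        # . associative
            and t(a, k.one) == a                      # 1 is the identity of .
            and t(a, p(b, c)) == p(t(a, b), t(a, c))  # . distributes over +
            and t(a, k.zero) == k.zero                # 0 annihilates
            and t(a, b) == t(b, a)                    # . commutative
        )
        if not ok:
            return False
    return True

# Set semantics: Booleans with "or" as + and "and" as .
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
# Bag semantics: natural numbers with ordinary arithmetic.
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# A deliberate non-example: subtraction is neither associative nor commutative.
BROKEN = Semiring(lambda a, b: a - b, lambda a, b: a * b, 0, 1)
```

With these definitions, `is_commutative_semiring(BOOL, [False, True])` and `is_commutative_semiring(NAT, range(4))` both pass, while `BROKEN` fails on the same sample.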
Back to the example
[Figure: R(A, B, C) and the annotated result Q(A, C)]

Using the laws: polynomials
Polynomials with coefficients in N and annotation tokens as indeterminates p, r, s capture a very general form of provenance

Provenance reading of the polynomials
• three different ways to derive d e
• two of the ways use only r
  • but they use it twice
• the third way uses r once and s once
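To make the polynomial reading concrete, here is a small Python sketch that evaluates the SPJU query from the earlier slide over polynomial annotations. Two things are assumptions: the contents of R, which did not survive in this transcript and are taken here from the running example of the cited GKT PODS 07 paper (tuples a b c, d b e, f g e annotated p, r, s), and the representation of a polynomial as a `Counter` mapping a monomial (a sorted tuple of tokens) to its coefficient. All helper names are hypothetical.

```python
from collections import Counter

def token(x):
    """The polynomial consisting of the single indeterminate x."""
    return Counter({(x,): 1})

def padd(f, g):
    return f + g                      # add coefficients of equal monomials

def pmul(f, g):
    out = Counter()
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def row(**attrs):
    return frozenset(attrs.items())

def project(r, attrs):
    out = {}
    for t, f in r.items():
        key = frozenset((a, v) for a, v in t if a in attrs)
        out[key] = padd(out.get(key, Counter()), f)
    return out

def union(r, s):
    out = dict(r)
    for t, f in s.items():
        out[t] = padd(out.get(t, Counter()), f)
    return out

def join(r, s):
    out = {}
    for t1, f in r.items():
        for t2, g in s.items():
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                key = frozenset({**d1, **d2}.items())
                out[key] = padd(out.get(key, Counter()), pmul(f, g))
    return out

# Assumed input relation (from the cited paper's running example).
R = {
    row(A="a", B="b", C="c"): token("p"),
    row(A="d", B="b", C="e"): token("r"),
    row(A="f", B="g", C="e"): token("s"),
}

AB, BC, AC = (project(R, set(x)) for x in ("AB", "BC", "AC"))
Q = project(union(join(AC, BC), join(AB, BC)), {"A", "C"})
Q = {t: f for t, f in Q.items() if dict(t)["C"] == "e"}   # selection C = e

# Annotation of the output tuple (d, e): {('r','r'): 2, ('r','s'): 1}, i.e. 2r^2 + rs
de = Q[row(A="d", C="e")]
```

Under these assumptions the annotation of (d, e) is 2r² + rs, which is the reading given on the slide: three derivations, two using r twice and one using r and s once each.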
Low-hanging fruit: deletion propagation
We used this in Orchestra [VLDB07] for update propagation
[Figure: R(A, B, C) and successive versions of Q(A, C)]
Delete d b e from R? Set r = 0!
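Setting a token to 0 in a provenance polynomial simply removes every monomial that mentions it. The sketch below illustrates this on polynomials that are assumed rather than read off the slide (the figure is lost): they are the annotations 2r² + rs, 2s² + rs and pr that the cited paper's running example yields for the output tuples (d, e), (f, e) and (a, e). The function names are hypothetical.

```python
# A polynomial is a dict {monomial: coefficient}; a monomial is a tuple of tokens.
# Assumed annotations of Q's output tuples (see the lead-in).
Q = {
    ("a", "e"): {("p", "r"): 1},
    ("d", "e"): {("r", "r"): 2, ("r", "s"): 1},
    ("f", "e"): {("r", "s"): 1, ("s", "s"): 2},
}

def set_to_zero(poly, tok):
    """Substitute tok = 0: every monomial containing tok vanishes."""
    return {m: c for m, c in poly.items() if tok not in m}

def propagate_deletion(q, tok):
    """Delete the source tuple annotated tok; drop outputs whose polynomial becomes 0."""
    updated = {t: set_to_zero(f, tok) for t, f in q.items()}
    return {t: f for t, f in updated.items() if f}

after = propagate_deletion(Q, "r")
```

In this sketch, deleting the tuple annotated r leaves only (f, e), with annotation 2s²; the view is updated without re-running the query.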
But are there useful commutative semirings?
[Figure: annotation labels including “public” and “top secret”]
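One way security labels such as "public" and "top secret" can form a commutative semiring is to order them and take min as + and max as ⋅. The sketch below is an assumption made for illustration, not a transcription of the slide: the intermediate levels, the extra `NEVER` element playing the role of 0, and the example polynomials (2r² + rs and 2s² + rs, from the cited paper's running example) are all supplied here.

```python
from enum import IntEnum
from functools import reduce

class Level(IntEnum):
    PUBLIC = 0        # plays the role of 1: no restriction
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3
    NEVER = 4         # plays the role of 0: the tuple is absent for everyone

def plus(a, b):
    """Alternative use: the least restrictive derivation is enough."""
    return min(a, b)

def times(a, b):
    """Joint use: the reader needs clearance for every input."""
    return max(a, b)

def evaluate(poly, val):
    """Evaluate a polynomial {monomial: coefficient} in the security semiring."""
    total = Level.NEVER                                   # the semiring's 0
    for mono, coeff in poly.items():
        term = reduce(times, (val[x] for x in mono), Level.PUBLIC)
        for _ in range(coeff):                            # coefficient n = n-fold sum
            total = plus(total, term)
    return total

# Assumed scenario: source tuple r is top secret, source tuple s is public.
val = {"r": Level.TOP_SECRET, "s": Level.PUBLIC}
level_de = evaluate({("r", "r"): 2, ("r", "s"): 1}, val)   # every derivation needs r
level_fe = evaluate({("r", "s"): 1, ("s", "s"): 2}, val)   # one derivation uses only s
```

Here (d, e) comes out top secret because all of its derivations use r, while (f, e) comes out public because it can be derived from s alone.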
Outline
• What’s with the semirings? Annotation propagation
• Housekeeping in the zoo of provenance models [GK&T PODS 07, FG&T PODS 08, Green ICDT 09]
Semirings for various models of provenance (1)
[Figure: Q(A, C) computed from R(A, B, C)]
Lineage [CuiWidomWiener00 etc.]
Sets of contributing tuples
Semiring: (Lin(X), ⋃, ⋃*, ∅, ∅*)

Semirings for various models of provenance (2)
[Figure: Q(A, C) computed from R(A, B, C)]
(Witness, Proof) why-provenance [BunemanKhannaTan01] & [Buneman+ PODS08]
Sets of witnesses (witness = set of contributing tuples)
Semiring: (Why(X), ⋃, ⋓, ∅, {∅})
Semirings for various models of provenance (3)
[Figure: Q(A, C) computed from R(A, B, C)]
Minimal witness why-provenance [BunemanKhannaTan01]
Sets of minimal witnesses
Semiring: (PosBool(X), ∨, ∧, ⊥, ⊤)
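The why-provenance semiring (Why(X), ⋃, ⋓, ∅, {∅}) is small enough to write out directly. The sketch below is an illustration with hypothetical names: a witness is a frozenset of tokens, an annotation is a frozenset of witnesses, ⋓ is the pairwise union of witnesses, and `minimal` keeps only the minimal witnesses. The example expression r⋅r + r⋅r + r⋅s is assumed from the cited paper's running example.

```python
def why_plus(a, b):
    """+ : union of the two sets of witnesses."""
    return a | b

def why_times(a, b):
    """. : pairwise union of witnesses, one taken from each side."""
    return frozenset(w1 | w2 for w1 in a for w2 in b)

ZERO = frozenset()                   # no witnesses at all
ONE = frozenset({frozenset()})       # a single empty witness

def tok(x):
    """Annotation of a source tuple: one witness containing just that tuple."""
    return frozenset({frozenset({x})})

def minimal(a):
    """Keep only witnesses that have no proper subset among the witnesses."""
    return frozenset(w for w in a if not any(v < w for v in a))

r, s = tok("r"), tok("s")

# Assumed example: the derivations r.r + r.r + r.s of one output tuple.
why_de = why_plus(why_plus(why_times(r, r), why_times(r, r)), why_times(r, s))
min_de = minimal(why_de)
```

In this sketch `why_de` is {{r}, {r, s}} and `min_de` is {{r}}: multiplicities and exponents are forgotten, and the minimal-witness view additionally absorbs {r, s} into {r}.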
Semirings for various models of provenance (4)
[Figure: Q(A, C) computed from R(A, B, C)]
Notation: { } set, [ ] bag
Trio lineage [Das Sarma+ 08]
Bags of sets of contributing tuples (of witnesses)
Semiring: (Trio(X), +, ⋅, 0, 1) (defined in [Green, ICDT 09])

Semirings for various models of provenance (5)
[Figure: Q(A, C) computed from R(A, B, C)]
Notation: { } set, [ ] bag
Polynomials with boolean coefficients [Green, ICDT 09] (B[X]-provenance)
Sets of bags of contributing tuples
Semiring: (B[X], +, ⋅, 0, 1)

Semirings for various models of provenance (6)
[Figure: Q(A, C) computed from R(A, B, C)]
Provenance polynomials [GKT, PODS 07] (N[X]-provenance)
Bags of bags of contributing tuples
Semiring: (N[X], +, ⋅, 0, 1)
A provenance hierarchy
most informative
N[X]
Trio(X)    B[X]
Why(X)
Lin(X)    PosBool(X)
least informative
One semiring to rule them all… (apologies!)
Example: 2x²y + xy + 5y² + z in N[X]
• N[X] to B[X], drop coefficients: x²y + xy + y² + z
• N[X] to Trio(X), drop exponents: 3xy + 5y + z
• to Why(X), drop both exponents and coefficients: xy + y + z
• Why(X) to PosBool(X), apply absorption (ab + b = b): y + z
• Why(X) to Lin(X), collapse terms: xyz
A path downward from K1 to K2 indicates that there exists an onto (surjective) semiring homomorphism h : K1 → K2
Using homomorphisms to relate models
Example: 2x²y + xy + 5y² + z in N[X], mapped down the hierarchy to B[X], Trio(X), Why(X), PosBool(X) and Lin(X)
Homomorphism? h(x+y) = h(x)+h(y), h(xy) = h(x)h(y), h(0) = 0, h(1) = 1
Moreover, for these homomorphisms h(x) = x
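The downward maps in the hierarchy can be written as plain functions on a concrete representation of polynomials. The sketch below is an illustration with hypothetical function names: a polynomial in N[X] is a dict from monomials (tuples of tokens, repeated according to the exponent) to coefficients, and each function forgets exactly what the corresponding arrow forgets. The zero polynomial is not treated specially here.

```python
from collections import Counter

# 2x^2y + xy + 5y^2 + z
P = {("x", "x", "y"): 2, ("x", "y"): 1, ("y", "y"): 5, ("z",): 1}

def to_bx(p):
    """N[X] -> B[X]: drop coefficients (a set of bags)."""
    return set(p)

def to_trio(p):
    """N[X] -> Trio(X): drop exponents (a bag of sets)."""
    out = Counter()
    for mono, coeff in p.items():
        out[frozenset(mono)] += coeff
    return out

def to_why(p):
    """N[X] -> Why(X): drop both (a set of sets)."""
    return {frozenset(mono) for mono in p}

def to_posbool(p):
    """Why(X) -> PosBool(X): absorption keeps only the minimal sets."""
    w = to_why(p)
    return {m for m in w if not any(n < m for n in w)}

def to_lin(p):
    """Why(X) -> Lin(X): collapse all terms into one set of tokens."""
    return frozenset(x for mono in p for x in mono)
```

Applied to `P`, these reproduce the values on the slide: `to_trio` gives 3xy + 5y + z, `to_why` gives xy + y + z, `to_posbool` gives y + z, and `to_lin` gives the single set {x, y, z}.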
Containment and Equivalence [Green ICDT 09]
[Figure: four diagrams (SPJ containment, SPJ equivalence, SPJU containment, SPJU equivalence) relating N[X], B[X], Trio(X), Why(X), Lin(X), PosBool(X), N and B]
Arrow from K1 to K2 indicates K1 containment (equivalence) implies K2 containment (equivalence)
All implications not marked ↔ are strict
Data Security
• Based on my colloquium talk from 2005

Data Security
Dorothy Denning, 1982:
• Data Security is the science and study of methods of protecting data (...) from unauthorized disclosure and modification
• Data Security = Confidentiality + Integrity

Data Security
• Distinct from systems and network security
  • Assumes these are already secure
• Tools:
  • Cryptography, information theory, statistics, …
• Applications:
  • An enabling technology

Outline
• An attack
• Data security research today
Latanya Sweeney’s Finding
• In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees
• GIC has to publish the data:
GIC(zip, dob, sex, diagnosis, procedure, ...)
This is private! Right?

Latanya Sweeney’s Finding
• Sweeney paid $20 and bought the voter registration list for Cambridge, Massachusetts:
VOTER(name, party, ..., zip, dob, sex)
GIC(zip, dob, sex, diagnosis, procedure, ...)
This is private! Right?
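The two schemas share the attributes (zip, dob, sex), which is what makes the published GIC data linkable to names. The sketch below illustrates such a linkage in Python; it is an assumption about how the attack proceeds, and every record in it is invented for illustration.

```python
# All records below are invented; only the attribute names come from the slide.
voter = [
    {"name": "Voter A", "party": "X", "zip": "00001", "dob": "1950-01-01", "sex": "F"},
    {"name": "Voter B", "party": "Y", "zip": "00002", "dob": "1960-02-02", "sex": "M"},
]
gic = [
    {"zip": "00001", "dob": "1950-01-01", "sex": "F", "diagnosis": "D1"},
    {"zip": "00003", "dob": "1970-03-03", "sex": "M", "diagnosis": "D2"},
]

KEY = ("zip", "dob", "sex")   # attributes common to VOTER and GIC

def link(voter_rows, gic_rows):
    """Join the two tables on KEY and report medical rows matching exactly one voter."""
    index = {}
    for v in voter_rows:
        index.setdefault(tuple(v[k] for k in KEY), []).append(v)
    linked = []
    for g in gic_rows:
        matches = index.get(tuple(g[k] for k in KEY), [])
        if len(matches) == 1:                 # a unique match re-identifies the record
            linked.append((matches[0]["name"], g["diagnosis"]))
    return linked

reidentified = link(voter, gic)
```

In this toy data one medical record matches exactly one voter on (zip, dob, sex), so its diagnosis is attached to a name even though GIC itself contains no names.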