1 / 59

Lecture 11: Provenance and Data privacy

Lecture 11: Provenance and Data privacy. December 8, 2010. Outline. Database provenance Slides based on Val Tannen’s Keynote talk at EDBT 2010 Data privacy Slides from my UW colloquium talk in 2005. Data Provenance. provenance, n .

ajay
Download Presentation

Lecture 11: Provenance and Data privacy

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 11:Provenance and Data privacy December 8, 2010 Dan Suciu -- CSEP544 Fall 2010

  2. Outline • Database provenance • Slides based on Val Tannen’s Keynote talk at EDBT 2010 • Data privacy • Slides from my UW colloquium talk in 2005 Dan Suciu -- CSEP544 Fall 2010

  3. Data Provenance provenance, n. The fact of coming from some particular source or quarter∅ origin, derivation[Oxford English Dictionary] • Data provenance[BunemanKhannaTan01]: aims to explain how a particular result (in an experiment, simulation, query, workflow, etc.) was derived. • Most science today is data-intensive. Scientists, eg., biologists, astronomers, worry about data provenance all the time. EDBT Keynote, Lausanne

  4. Provenance? Lineage? Pedigree? • Cf. Peter Buneman: • Pedigree is for dogs • Lineage is for kings • Provenance is for art • For data, let’s be artistic (artsy?) EDBT Keynote, Lausanne

  5. Database transformations? • Queries • Views • ETL tools • Schema mappings (as used in data exchange) EDBT Keynote, Lausanne

  6. Outline • What’s with the semirings? Annotation propagation [GK&T PODS 07, GKI&T VLDB 07] • Housekeeping in the zoo of provenance models EDBT Keynote, Lausanne

  7. Propagating annotations through database operations R A B C R ⋈ S A B C D E JOIN (on B) S D B E The annotation p⋅rmeans joint use of data annotated by p and data annotated by r EDBT Keynote, Lausanne

  8. Another way to propagate annotations R A B C R ⋃ S A B C UNION S A B C The annotation p+rmeans alternative use of data EDBT Keynote, Lausanne

  9. Another use of + R A B C ΠABR A B PROJECT +means alternative use of data EDBT Keynote, Lausanne

  10. An example in positive relational algebra (SPJU) Q = C=eΠAC( ΠACR ⋈ΠBCR ⋃ΠABR ⋈ΠBCR ) R A B C A C For selection we multiply with two special annotations, 0 and 1 EDBT Keynote, Lausanne

  11. Summary so far EDBT Keynote, Lausanne

  12. Summary so far A space of annotations, K EDBT Keynote, Lausanne

  13. Summary so far A space of annotations, K K-relations: every tuple annotated with some element from K. EDBT Keynote, Lausanne

  14. Summary so far A space of annotations, K K-relations: every tuple annotated with some element from K. Binaryoperations on K: ⋅ corresponds to joint use (join), and +corresponds to alternative use (union and projection). EDBT Keynote, Lausanne

  15. Summary so far A space of annotations, K K-relations: every tuple annotated with some element from K. Binaryoperations on K: ⋅ corresponds to joint use (join), and +corresponds to alternative use (union and projection). We assume Kcontainsspecial annotations 0 and 1. EDBT Keynote, Lausanne

  16. Summary so far A space of annotations, K K-relations: every tuple annotated with some element from K. Binaryoperations on K: ⋅ corresponds to joint use (join), and +corresponds to alternative use (union and projection). We assume Kcontainsspecial annotations 0 and 1. ‘‘Absent’’ tuples are annotatedwith0! EDBT Keynote, Lausanne

  17. Summary so far A space of annotations, K K-relations: every tuple annotated with some element from K. Binaryoperations on K: ⋅ corresponds to joint use (join), and +corresponds to alternative use (union and projection). We assume Kcontainsspecial annotations 0 and 1. ‘‘Absent’’ tuples are annotatedwith0! 1isa ‘‘neutral’’ annotation (no restrictions). EDBT Keynote, Lausanne

  18. Summary so far A space of annotations, K K-relations: every tuple annotated with some element from K. Binaryoperations on K: ⋅ corresponds to joint use (join), and +corresponds to alternative use (union and projection). We assume Kcontainsspecial annotations 0 and 1. ‘‘Absent’’ tuples are annotatedwith0! 1is a ‘‘neutral’’ annotation (no restrictions). Algebra of annotations?What are the laws of (K, +, ⋅, 0, 1) ? EDBT Keynote, Lausanne

  19. Annotated relational algebra • DBMS query optimizers assume certain equivalences: • union is associative, commutative • join is associative, commutative, distributes over union • projections and selections commute with each other and with union and join (when applicable) • Etc., but no R ⋈ R = R ⋃ R = R (i.e., no idempotence, to allow for bag semantics) • Equivalent queries should produce same annotations! EDBT Keynote, Lausanne

  20. Annotated relational algebra • DBMS query optimizers assume certain equivalences: • union is associative, commutative • join is associative, commutative, distributes over union • projections and selections commute with each other and with union and join (when applicable) • Etc., but no R ⋈ R = R ⋃ R = R (i.e., no idempotence, to allow for bag semantics) • Equivalent queries should produce same annotations! Proposition. Above identities hold for queries on K-relations iff(K, +, ⋅, 0, 1) is a commutative semiring Hence, for each commutative semiringK we have a K-annotated relational algebra. EDBT Keynote, Lausanne

  21. Annotated relational algebra • DBMS query optimizers assume certain equivalences: • union is associative, commutative • join is associative, commutative, distributes over union • projections and selections commute with each other and with union and join (when applicable) • Etc., but no R ⋈ R = R ⋃ R = R (i.e., no idempotence, to allow for bag semantics) • Equivalent queries should produce same annotations! Hence, for each commutative semiringK we have a K-annotated relational algebra. EDBT Keynote, Lausanne

  22. What is a commutative semiring? An algebraic structure (K, +,⋅, 0, 1) where: • K is the domain • + is associative, commutative, with 0 identity • ⋅is associative, with 1 identity semiring • ⋅distributes over + • a⋅0 = 0⋅a = 0 • ⋅is also commutative Unlike ring, no requirement for inverses to + EDBT Keynote, Lausanne

  23. Back to the example Q R A C A C A C A B C EDBT Keynote, Lausanne

  24. Using the laws: polynomials Q R A C A C A C A B C Polynomials with coefficients in N and annotation tokens as indeterminatesp, r, s capture a very general form of provenance EDBT Keynote, Lausanne

  25. Provenance reading of the polynomials Q R A C A C A C A B C • three different ways to derivede • two of the ways use only r • but they use it twice • the third way uses ronce and sonce EDBT Keynote, Lausanne

  26. Low-hanging fruit: deletion propagation We used this in Orchestra[VLDB07]for update propagation Q R A B C A C EDBT Keynote, Lausanne

  27. Low-hanging fruit: deletion propagation We used this in Orchestra[VLDB07]for update propagation Q R A B C A C Delete dbefrom R ? EDBT Keynote, Lausanne

  28. Low-hanging fruit: deletion propagation We used this in Orchestra[VLDB07]for update propagation Q R A B C A C Delete dbefrom R ? Set r= 0 ! EDBT Keynote, Lausanne

  29. Low-hanging fruit: deletion propagation We used this in Orchestra[VLDB07]for update propagation Q Q R A B C A C A C Delete dbefrom R ? Set r= 0 ! EDBT Keynote, Lausanne

  30. Low-hanging fruit: deletion propagation We used this in Orchestra[VLDB07]for update propagation Q Q Q R A B C A C A C A C Delete dbefrom R ? Set r= 0 ! EDBT Keynote, Lausanne

  31. But are there useful commutative semirings? EDBT Keynote, Lausanne

  32. But are there useful commutative semirings? public EDBT Keynote, Lausanne

  33. But are there useful commutative semirings? top secret public EDBT Keynote, Lausanne

  34. Outline • What’s with the semirings? Annotation propagation • Housekeeping in the zoo of provenance models [GK&T PODS 07, FG&T PODS 08, Green ICDT 09] EDBT Keynote, Lausanne

  35. Semirings for various models of provenance (1) Q A C A B C R Lineage[CuiWidomWiener00 etc.] Sets of contributing tuples Semiring: (Lin(X), ⋃, ⋃*, ∅, ∅*) EDBT Keynote, Lausanne

  36. Semirings for various models of provenance (2) Q A C A B C R (Witness, Proof) why-provenance [BunemanKhannaTan01] & [Buneman+ PODS08] Sets of witnesses (w. =set of contributing tuples) Semiring: (Why(X), ⋃, ⋓, ∅, {∅}) EDBT Keynote, Lausanne

  37. Semirings for various models of provenance (3) Q A C A B C R Minimal witness why-provenance [BunemanKhannaTan01] Sets of minimal witnesses Semiring: (PosBool(X), ∧, ∨, ⊤, ⊥) EDBT Keynote, Lausanne

  38. Semirings for various models of provenance (4) Q A C A B C R Notation: { } set [ ] bag Trio lineage [Das Sarma+ 08] Bags of sets of contributing tuples (of witnesses) Semiring: (Trio(X), +, ⋅, 0, 1) (defined in [Green, ICDT 09]) EDBT Keynote, Lausanne

  39. Semirings for various models of provenance (5) Q A C A B C R Notation: { } set [ ] bag Polynomials with boolean coefficients [Green, ICDT 09] ( B[X]-provenance ) Sets of bags of contributing tuples Semiring: (B[X], +, ⋅, 0, 1) EDBT Keynote, Lausanne

  40. Semirings for various models of provenance (6) Q A C A B C R Provenance polynomials [GKT, PODS 07] ( N[X]-provenance ) Bags of bags of contributing tuples Semiring: (N[X], +, ⋅, 0, 1) EDBT Keynote, Lausanne

  41. A provenance hierarchy most informative N[X] Trio(X) B[X] Why(X) Lin(X) PosBool(X) least informative EDBT Keynote, Lausanne

  42. One semiring to rule them all… (apologies!) Example: 2x2y + xy + 5y2 + z N[X] drop coefficients x2y + xy + y2 + z drop exponents 3xy + 5y + z Trio(X) B[X] drop both exp. and coeff. xy + y + z Why(X) apply absorption (ab + b =b) y + z collapse terms xyz Lin(X) PosBool(X) A path downward from K1 to K2 indicates that there exists an onto (surjective) semiring homomorphism h : K1K2 EDBT Keynote, Lausanne

  43. Using homomorphisms to relate models Example: 2x2y + xy + 5y2 + z N[X] drop coefficients x2y + xy + y2 + z drop exponents 3xy + 5y + z Trio(X) B[X] drop both exp. and coeff. xy + y + z Why(X) apply absorption (ab + b =b) y + z collapse terms xyz Lin(X) PosBool(X) Homomorphism? h(x+y) =h(x)+h(y)h(xy)=h(x)h(y) h(0)=0 h(1)=1 Moreover, for these homomorphismsh(x)=x EDBT Keynote, Lausanne

  44. Containment and Equivalence[Green ICDT 09] N N[X] B[X] N[X] Trio(X) Why(X) Trio(X) B[X] B[X] N[X] B[X] N[X] N Why(X) Why(X) Trio(X) Why(X) Trio(X) Lin(X) Lin(X) Lin(X) Lin(X) N N PosBool(X) B PosBool(X) B PosBool(X) B PosBool(X) B SPJ containment SPJ equivalence SPJU containment SPJU equivalence Arrow from K1 to K2 indicates K1 containment (equivalence) implies K2 cont. (equiv.) All implications not marked↔are strict

  45. Data Security • Based on my colloquium talk from 2005 EDBT Keynote, Lausanne

  46. Data Security Dorothy Denning, 1982: • Data Security is the science and study of methods of protecting data (...) from unauthorized disclosure and modification • Data Security = Confidentiality + Integrity

  47. Data Security • Distinct from systems and network security • Assumes these are already secure • Tools: • Cryptography, information theory, statistics, … • Applications: • An enabling technology

  48. Outline • An attack • Data security research today

  49. Latanya Sweeney’s Finding • In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees • GIC has to publish the data: GIC(zip, dob, sex, diagnosis, procedure, ...) This is private ! Right ?

  50. Latanya Sweeney’s Finding • Sweeney paid $20 and bought the voter registration list for Cambridge Massachusetts: VOTER(name, party, ..., zip, dob, sex) GIC(zip, dob, sex, diagnosis, procedure, ...) This is private ! Right ?

More Related