Computing Provenance and Annotations of Derived Data
Wang-Chiew Tan
UC Santa Cruz
Provenance of data
• When you see some data on the Web, do you know
  • where it came from?
  • why it is there?
• This information (provenance) is typically lost in the process of copying, transcribing, or transforming databases
• Loss of provenance is an acute problem in some scientific databases
Complex interdependencies (example from scientific databases)
[Figure: flow of data among GERD, TRRD, EpoDB, BEAD, Swissprot, GAIA, EMBL, GenBank, Transfac, and DDBJ]
• Various problems:
  • Trace provenance of data
  • Propagate annotations
Two kinds of provenance
• Why? (Why-provenance)
• Where? (Where-provenance)

NYRestaurants (Source table)
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022

NYHotels (Source table)
Hotel            Zip    Rating
Waldorf Astoria  10022  4.5
Holiday Inn DT   10013  4.0

View (JOIN, PROJECT)
Restaurant          Cost  Hotel            Rating
Peacock Alley       $$$   Waldorf Astoria  4.5
Bull & Bear         $$$   Waldorf Astoria  4.5
Soho Kitchen & Bar  $     Waldorf Astoria  4.5
Pacifica            $     Holiday Inn DT   4.0
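The two kinds of provenance can be illustrated on the restaurant and hotel tables. The sketch below is only an illustration, not the author's method: it assumes the view joins on Zip and projects (Restaurant, Cost, Hotel, Rating), uses list indexes as stand-in tuple identifiers, and introduces the hypothetical helper `view_with_provenance`. Each view row has a single derivation in this data, so each row gets one witness.

```python
# Source tables from the slide; list indexes stand in for tuple identifiers.
NY_RESTAURANTS = [
    {"Restaurant": "Peacock Alley", "Cost": "$$$", "Type": "French", "Zip": "10022"},
    {"Restaurant": "Bull & Bear", "Cost": "$$$", "Type": "Seafood", "Zip": "10022"},
    {"Restaurant": "Pacifica", "Cost": "$", "Type": "Chinese", "Zip": "10013"},
    {"Restaurant": "Soho Kitchen & Bar", "Cost": "$", "Type": "American", "Zip": "10022"},
]
NY_HOTELS = [
    {"Hotel": "Waldorf Astoria", "Zip": "10022", "Rating": "4.5"},
    {"Hotel": "Holiday Inn DT", "Zip": "10013", "Rating": "4.0"},
]


def view_with_provenance(restaurants, hotels):
    """Join on Zip, project (Restaurant, Cost, Hotel, Rating), record provenance."""
    rows = []
    for i, r in enumerate(restaurants):
        for j, h in enumerate(hotels):
            if r["Zip"] != h["Zip"]:
                continue
            values = {"Restaurant": r["Restaurant"], "Cost": r["Cost"],
                      "Hotel": h["Hotel"], "Rating": h["Rating"]}
            # Why-provenance: the source tuples that together justify this row.
            why = {("NYRestaurants", i), ("NYHotels", j)}
            # Where-provenance: the source location each value was copied from.
            where = {
                "Restaurant": ("NYRestaurants", i, "Restaurant"),
                "Cost": ("NYRestaurants", i, "Cost"),
                "Hotel": ("NYHotels", j, "Hotel"),
                "Rating": ("NYHotels", j, "Rating"),
            }
            rows.append({"values": values, "why": why, "where": where})
    return rows


VIEW = view_with_provenance(NY_RESTAURANTS, NY_HOTELS)
for row in VIEW:
    print(row["values"]["Restaurant"], "| why:", sorted(row["why"]),
          "| Rating copied from:", row["where"]["Rating"])
```

For the Pacifica row, the why-provenance names the Pacifica tuple and the Holiday Inn DT tuple, while the where-provenance of its Rating points at the Rating field of the Holiday Inn DT tuple.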
SDSS - Sloan Digital Sky Survey (SkyServer)
SELECT Specobj.z, photoobj.g, photoobj.r
FROM Specobj, photoobj
WHERE Specobj.objid = photoobj.objid
  AND Specobj.specclass = 3
  AND Specobj.zconf > .95
Compute provenance
• Question: Suppose a database is created by a query. Can we compute the why- and where-provenance of an element?
• Answer: Yes. For relational queries built from select, project, join, and union, both why- and where-provenance can be computed from the query and the source data.
Annotations
• Add value to data
  • knowledge sharing: annotations can be read and reviewed by independent parties
• Annotations are loosely structured
• Annotations on data at various levels of granularity, annotations on annotations
• Source data:
  • proprietary
  • fixed schema
• A system that overlays annotations on existing data
  • Useful tool for scientific databases
• Annotations should spread back to the source and forward to other databases
Propagating annotations

NYRestaurants (Source Table)
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022

All Restaurants (View 1)
Restaurant          Cost  Type
Peacock Alley       $$$   French
Bull & Bear         $$$   Seafood
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American

Cheap Restaurants (View 2)
Restaurant          Cost  Type
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American

Annotations shown in the figure:
• "Serves fine French cuisine in elegant setting. Jackets required."
• "Extensive wine list!"
• "Yummy chicken curry!!"
Location and Propagation Rules
• A location is a triple (R, t, A), where R is a relation name, t is a tuple in R, and A is an attribute in the schema of R
• Propagation rules, one for each operator:
  • Select
  • Project
  • Join
  • Union
[Diagrams: one per rule, showing how annotations on attributes A1, A2, A3 of the source relations R, R1, R2 propagate to the output]
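The rule diagrams do not survive in this transcript, so the following is only a sketch under stated assumptions: an annotation follows the value it is attached to into the output, a natural join merges the annotations of both join columns, and rows that become identical (after project or union) pool their annotations. The helper names and the labels note-1, note-2, and note-3 are hypothetical.

```python
def cell(value, *notes):
    """A value together with the set of annotations attached to its location."""
    return (value, set(notes))


def values_of(row):
    """Values of a row as a hashable key, ignoring annotations."""
    return tuple(sorted((a, v) for a, (v, _) in row.items()))


def merge_duplicates(rows):
    """Collapse rows with equal values, unioning annotations per attribute."""
    merged = {}
    for row in rows:
        key = values_of(row)
        if key not in merged:
            merged[key] = {a: (v, set(notes)) for a, (v, notes) in row.items()}
        else:
            for a, (_, notes) in row.items():
                merged[key][a][1].update(notes)
    return list(merged.values())


def select(rel, pred):
    # Assumed rule: surviving tuples keep their annotations unchanged.
    return merge_duplicates(
        row for row in rel if pred({a: v for a, (v, _) in row.items()}))


def project(rel, attrs):
    # Assumed rule: kept attributes keep their annotations.
    return merge_duplicates([{a: row[a] for a in attrs} for row in rel])


def join(r1, r2):
    # Assumed rule: a join attribute collects annotations from both inputs.
    out = []
    for x in r1:
        for y in r2:
            common = x.keys() & y.keys()
            if all(x[a][0] == y[a][0] for a in common):
                row = {**x, **y}
                for a in common:
                    row[a] = (x[a][0], x[a][1] | y[a][1])
                out.append(row)
    return merge_duplicates(out)


def union(r1, r2):
    # Assumed rule: identical tuples from both inputs pool their annotations.
    return merge_duplicates(r1 + r2)


RESTAURANTS = [
    {"Restaurant": cell("Pacifica", "note-1"), "Cost": cell("$"),
     "Zip": cell("10013", "note-2")},
    {"Restaurant": cell("Peacock Alley"), "Cost": cell("$$$"),
     "Zip": cell("10022")},
]
HOTELS = [{"Hotel": cell("Holiday Inn DT"), "Zip": cell("10013", "note-3")}]

cheap = project(select(RESTAURANTS, lambda t: t["Cost"] == "$"),
                ["Restaurant", "Cost"])
joined = join(RESTAURANTS, HOTELS)
everything = union(cheap, project(RESTAURANTS, ["Restaurant", "Cost"]))
print(cheap)
print(joined[0]["Zip"])
```

Under these assumptions, the annotation on the Pacifica name reaches the cheap-restaurants view, and the joined Zip value carries the annotations of both source Zip locations.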
Computing annotation propagation
• Model:
  • Source: relational database
  • Query
  • View: result of the query applied on the source
• Question: Suppose a database is created by a query over some source data. Can we compute how to propagate an annotation on a data element back to the source with minimum side-effects?
• Answer: Computing the minimum side-effect annotation placement is NP-hard in general
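For one small, fixed query the placement question can be explored by exhaustive search. The sketch below is an illustration under assumptions, not the author's algorithm: the view is taken to be the join of the restaurant and hotel tables on Zip, projected onto (Restaurant, Hotel, Zip); the Zip value in the view is assumed to be copied from both source Zip locations; and a side-effect is any other view location that also receives the annotation. All function names are hypothetical.

```python
RESTAURANTS = [("Peacock Alley", "10022"), ("Bull & Bear", "10022"),
               ("Pacifica", "10013"), ("Soho Kitchen & Bar", "10022")]
HOTELS = [("Waldorf Astoria", "10022"), ("Holiday Inn DT", "10013")]


def view_where_provenance():
    """Map each view location (row, attr) to the source locations it copies."""
    where = {}
    row = 0
    for i, (_, rzip) in enumerate(RESTAURANTS):
        for j, (_, hzip) in enumerate(HOTELS):
            if rzip == hzip:
                where[(row, "Restaurant")] = {("NYRestaurants", i, "Restaurant")}
                where[(row, "Hotel")] = {("NYHotels", j, "Hotel")}
                # Assumption: the joined Zip is copied from both inputs.
                where[(row, "Zip")] = {("NYRestaurants", i, "Zip"),
                                       ("NYHotels", j, "Zip")}
                row += 1
    return where


def propagate(source_loc, where):
    """View locations that receive an annotation placed on source_loc."""
    return {v for v, sources in where.items() if source_loc in sources}


def best_placement(target, where):
    """Return (number of side-effects, source location) with fewest side-effects."""
    scored = [(len(propagate(s, where) - {target}), s) for s in where[target]]
    return min(scored)


WHERE = view_where_provenance()
print(best_placement((0, "Zip"), WHERE))
print(best_placement((0, "Hotel"), WHERE))
```

To annotate the Zip of the first view row, placing the annotation on the restaurant's Zip has no side-effects, whereas placing it on the hotel's Zip would also annotate two other view rows. For the Hotel value there is only one candidate location, so the two side-effects cannot be avoided.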
Related Work on Annotations (not exhaustive!)
• Superimposed Information (D. Maier, L. Delcambre [WebDB’99])
  • data “placed over” existing information, e.g. bookmark files, schema of a database
• Annotation Systems
  • Annotea (W3C)
    • annotates web pages
  • Multivalent Browser (R. Wilensky, T. A. Phelps, UC Berkeley DL Project)
    • annotates PDF files, HTML, etc.
  • BioDAS (Distributed Annotation Server) (L. Stein et al.)
    • annotates genome sequences
• No one has formally studied the annotation placement problem
Provenance and Annotations
• Where-provenance & annotation placement
  • Where should the annotation be placed in the source in order to propagate the annotation to view data d?
  • Annotate the source data in one of the source locations in the where-provenance of d
• Provenance & archiving
  • Trace a piece of data to its correct source version
• Why-provenance & view deletion
  • Which source data should be deleted in order to delete view data d?
  • A combination of source data that together “disable” every witness for d
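The link between why-provenance and view deletion can be sketched as a search for source tuples that hit every witness. The example below is only an illustration under assumptions: the view is taken to be the Hotel column of the restaurant and hotel join on Zip, witnesses are pairs of tuple identifiers, and the brute-force search ignores side-effects on other view data. The function names are hypothetical.

```python
from itertools import combinations

RESTAURANT_ZIPS = ["10022", "10022", "10013", "10022"]  # NYRestaurants rows 0..3
HOTELS = [("Waldorf Astoria", "10022"), ("Holiday Inn DT", "10013")]


def witnesses_for_hotel(hotel):
    """Each witness is a set of source tuples that alone puts `hotel` in the view."""
    return [{("NYRestaurants", i), ("NYHotels", j)}
            for i, rzip in enumerate(RESTAURANT_ZIPS)
            for j, (name, hzip) in enumerate(HOTELS)
            if name == hotel and rzip == hzip]


def smallest_deletion(witnesses):
    """Smallest set of source tuples that intersects (disables) every witness."""
    tuples = sorted(set().union(*witnesses))
    for size in range(1, len(tuples) + 1):
        for combo in combinations(tuples, size):
            if all(w & set(combo) for w in witnesses):
                return set(combo)
    return set()


waldorf = witnesses_for_hotel("Waldorf Astoria")
print(len(waldorf), smallest_deletion(waldorf))
```

Waldorf Astoria has three witnesses, one per restaurant in the same Zip, and all of them contain the hotel's own tuple, so deleting that single tuple disables every witness.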
How do we attach annotations to data?
• Relational tables: Identify a particular attribute (column) of a particular tuple of a particular relation: (R, t, A)
• Tree-like data: Need a canonical path to the data element
Lots more to do!
• Further study on provenance for queries that involve negation or aggregates, e.g.
  SELECT SUM(sal) FROM Employee WHERE sal > 50K
• Handle “irregular” annotations and annotations on tree-like data
• How about databases which are manually constructed and annotated?
  • Organize data with keys
• Use of constraints and special cases to derive efficient algorithms for propagating annotations back
• Language-specific issues
Inconsistencies in “annotation-aware” language(s)

Emp
Name  Sal  Dept
Joe   50K  Marketing

Department
Dept       Manager
Marketing  Jane

• The same query in different languages, but different annotation behavior
  • Relational Algebra: Emp JOIN Department
    Result: [Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]
  • SQL: SELECT e.Name, e.Sal, e.Dept, d.Manager FROM Emp e, Department d WHERE e.Dept = d.Dept
    Result: [Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]
  • The tuples are identical; what differs is which source Dept value (Emp.Dept, Department.Dept, or both) passes its annotations to the output
• Equivalent queries in the same language, but different annotation behavior
  • Q1 = SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = '50K'
    Result: [Name:"Joe", Sal:50K]
  • Q2 = SELECT e.Name, '50K' AS Sal FROM Emp e WHERE e.Sal = '50K'
    Result: [Name:"Joe", Sal:50K]
  • The tuples are identical, but Q1 copies the Sal value from Emp, while Q2 outputs a constant that has no source location
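The difference between Q1 and Q2 can be made concrete with a small simulation. This is a sketch under an assumed propagation scheme (an output value inherits the annotations of the source value its SELECT expression references, and a constant inherits none), not a real SQL engine. The annotation labels a1 and a2 and the function names are hypothetical.

```python
# Each cell is (value, annotations); a1 and a2 are hypothetical annotation labels.
EMP = [{"Name": ("Joe", {"a1"}), "Sal": ("50K", {"a2"}),
        "Dept": ("Marketing", set())}]


def q1(emp):
    """SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = '50K'"""
    return [{"Name": e["Name"], "Sal": e["Sal"]}
            for e in emp if e["Sal"][0] == "50K"]


def q2(emp):
    """SELECT e.Name, '50K' AS Sal FROM Emp e WHERE e.Sal = '50K'"""
    # The constant '50K' is not copied from any source location,
    # so under the assumed scheme it carries no annotations.
    return [{"Name": e["Name"], "Sal": ("50K", set())}
            for e in emp if e["Sal"][0] == "50K"]


def values_only(rows):
    """Drop annotations so the plain query answers can be compared."""
    return [{a: v for a, (v, _) in row.items()} for row in rows]


print(values_only(q1(EMP)) == values_only(q2(EMP)))
print(q1(EMP)[0]["Sal"][1], q2(EMP)[0]["Sal"][1])
```

Both queries return the same values, yet only Q1 carries the annotation on Joe's salary into the result.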
Do we need an “annotation-aware” QL?
• Relational algebra suggests a natural set of propagation rules
• SQL suggests another natural propagation rule
  • based on variable bindings
• Question: Can we extend or design the query language(s) so that
  • equivalent queries have the same annotation behavior?
  • translation of a query from one language (e.g. SQL) into another (e.g. relational algebra) yields the same annotation behavior?
• Perhaps a more fundamental question...
  • Should a query language be “annotation-aware”?
  • Perhaps we should have language constructs that allow the user to explicitly control annotation propagation?