Brief Introduction to Provenance. "As data becomes plentiful, verifiable truth becomes scarce" http://go-to-hellman.blogspot.com/2010/02/named-graphs-argleton-and-truth-economy.html. For JISC KeepIt course on Digital Preservation Tools for Repository Managers
Brief Introduction to Provenance
"As data becomes plentiful, verifiable truth becomes scarce" http://go-to-hellman.blogspot.com/2010/02/named-graphs-argleton-and-truth-economy.html
For JISC KeepIt course on Digital Preservation Tools for Repository Managers
Module 3, Primer on preservation workflow, formats and characterisation
Westminster-Kingsway College, London, 2 March 2010
Provenance: example
The following excerpt and slides are taken with permission from Moreau, L., The Open Provenance Model: Towards inter-operability of Provenance Systems, http://users.ecs.soton.ac.uk/lavm/talks/iam09.pdf
Example: the provenance of a bottle of wine includes:
• Grapes from which it is made
• Where those grapes grew
• Process in the wine's preparation
• How the wine was stored
• Between which parties the wine was transported, e.g. producer to distributor to retailer
• Where it was auctioned
Provenance Definition
• Oxford English Dictionary:
  • the fact of coming from some particular source or quarter; origin, derivation
  • the history or pedigree of a work of art, manuscript, rare book, etc.
  • concretely, a record of the passage of an item through its various owners
• The provenance of a piece of data is the process that led to that piece of data
The Science Lifecycle
[Figure: a cycle linking scientists, graduate students, undergraduate students and next generation researchers with experimentation, certified experimental results & analyses, preprints & metadata repositories, technical reports, peer-reviewed journal & conference papers, reprints, digital libraries, the local web and the virtual learning environment. Flowing through the cycle: data, metadata, provenance, scripts, workflows, services, ontologies, blogs, ...]
Adapted from David De Roure's slides
Finding the provenance of research outputs across all the systems that data transited through
[Figure: the Science Lifecycle diagram from the previous slide, repeated with the same elements: virtual learning environment, reprints, peer-reviewed journal & conference papers, technical reports, local web, preprints & metadata repositories, certified experimental results & analyses, students, researchers, digital libraries, scientists, experimentation. Flowing through the cycle: data, metadata, provenance, scripts, workflows, services, ontologies, blogs, ...]
Open Provenance Model (OPM)
• Allows us to express all the causes of an item
• Allows for process-oriented and dataflow-oriented views
• Based on a notion of annotated causality graph
Moreau, L., et al.: OPM v1.00 (Dec 2007), OPM v1.01 (Jul 2008), OPM v1.1 (Dec 2009)
OPM Requirements
• To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model
• To allow developers to build and share tools that operate on such a provenance model
• To define the model in a precise, technology-agnostic manner
• To define bindings to XML/RDF separately
• To support a digital representation of provenance for any "thing", whether produced by computer systems or not
OPM Serialisation
• OPM is an abstract data model for representing past executions and what caused data and processes to occur
• OPM can be serialised in different formats, referred to as "technology bindings" or serialisations:
  • OPM XML schema (http://openprovenance.org/model/v1.01.a)
  • OPM RDF schema
  • OPM OWL ontology
• Effort underway to ensure full equivalence of representations
Nodes
• Artifact (A): Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.
• Process (P): Action or series of actions performed on or caused by artifacts, and resulting in new artifacts.
• Agent (Ag): Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.
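Edges
• used(R): process P used artifact A, in role R
• wasTriggeredBy: process P2 was triggered by process P1
• wasGeneratedBy(R): artifact A was generated by process P, in role R
• wasDerivedFrom: artifact A2 was derived from artifact A1
• wasControlledBy(R): process P was controlled by agent Ag, in role R
Edge labels are in the past tense to express that they describe past executions.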
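The node and edge kinds above can be sketched as a small data model. This is a minimal illustration, not one of the official OPM bindings: the class names `Node` and `Edge`, the field names, and the `EDGE_SIGNATURES` table are assumptions made for this example. Each edge is stored from effect to cause, matching the past-tense reading of the labels.

```python
from dataclasses import dataclass
from typing import Optional

# The three OPM node kinds named on the "Nodes" slide.
NODE_KINDS = {"artifact", "process", "agent"}

# For each edge label: (kind of the effect node, kind of the cause node).
# The table itself is a hypothetical helper, not part of any OPM schema.
EDGE_SIGNATURES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasControlledBy": ("process", "agent"),
}


@dataclass(frozen=True)
class Node:
    id: str
    kind: str

    def __post_init__(self):
        if self.kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {self.kind}")


@dataclass(frozen=True)
class Edge:
    label: str
    effect: Node
    cause: Node
    role: Optional[str] = None  # the "(R)" annotation, where the edge has one

    def __post_init__(self):
        expected = EDGE_SIGNATURES.get(self.label)
        if expected is None:
            raise ValueError(f"unknown edge label: {self.label}")
        if (self.effect.kind, self.cause.kind) != expected:
            raise ValueError(
                f"{self.label} must link {expected[0]} to {expected[1]}"
            )


# Usage: a process P that used artifact A in role R.
a = Node("A", "artifact")
p = Node("P", "process")
used = Edge("used", effect=p, cause=a, role="R")
```

Checking the node kinds at construction time keeps malformed edges, such as an artifact that "used" a process, out of the graph.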
Illustration
[Figure: process P (type=division) with input artifacts A1 and A2 on edges used(dividend) and used(divisor), and output artifacts A3 and A4 on edges wasGeneratedBy(quotient) and wasGeneratedBy(rest)]
• Process "used" artifacts and "generated" artifacts
• Edge "roles" indicate the function of the artifact with respect to the process (akin to function parameters)
• Edges and nodes can be typed
Causation chain:
• P was caused by A1 and A2
• A3 and A4 were caused by P
• Does it mean that A3 and A4 were caused by A1 and A2?
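The causation chain can be explored with a short sketch that follows edges from effect to cause. The tuple layout and the function names `direct_causes` and `all_causes` are assumptions for illustration, as is the pairing of A1 with the dividend role and A2 with the divisor role (the figure's layout is not recoverable here). Reachability only shows which nodes could have influenced a result; whether that amounts to a true dependency is the question the slide raises.

```python
from collections import defaultdict

# Each edge is (effect, label, role, cause), read as "effect <label> cause".
edges = [
    ("P", "used", "dividend", "A1"),   # assumed pairing of A1 with dividend
    ("P", "used", "divisor", "A2"),    # assumed pairing of A2 with divisor
    ("A3", "wasGeneratedBy", "quotient", "P"),
    ("A4", "wasGeneratedBy", "rest", "P"),
]


def direct_causes(edges):
    """Map each node to the set of nodes it points to as causes."""
    causes = defaultdict(set)
    for effect, _label, _role, cause in edges:
        causes[effect].add(cause)
    return causes


def all_causes(node, edges):
    """Collect every node reachable from `node` by following cause edges."""
    causes = direct_causes(edges)
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for cause in causes.get(current, ()):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


print(sorted(all_causes("A3", edges)))  # ['A1', 'A2', 'P']
```

In this graph both inputs are reachable from both outputs, so the traversal alone cannot say which input a given output really depends on; that is the kind of distinction an explicit wasDerivedFrom edge can record.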
Time Constraints
[Figure: process P (start: T2, end: T5), controlled by agent Ag via wasControlledBy(R). An input artifact generated at T1 is used by P at T3; an output artifact is generated by P at T4 and used by a later process at T6.]
• T1<T3 (artifact must exist before being used)
• T2<T3 (process must have started before using artifacts)
• T3<T5 (process uses artifacts before it ends)
• T2<T4 (process must have started before generating artifacts)
• T4<T5 (process generates artifacts before it ends)
• T4<T6 (artifact must exist before being used)
• T2<T5 (process must have started before ending)
• No constraint between T3 and T4
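The constraints above lend themselves to a mechanical check. The sketch below is only an illustration: the function name `check_time_constraints` and the dictionary keyed by "T1" to "T6" are assumptions, and the timestamps can be any mutually comparable values.

```python
# The seven orderings listed on the slide, as (earlier, later, reason).
CONSTRAINTS = [
    ("T1", "T3", "artifact must exist before being used"),
    ("T2", "T3", "process must have started before using artifacts"),
    ("T3", "T5", "process uses artifacts before it ends"),
    ("T2", "T4", "process must have started before generating artifacts"),
    ("T4", "T5", "process generates artifacts before it ends"),
    ("T4", "T6", "artifact must exist before being used"),
    ("T2", "T5", "process must have started before ending"),
]


def check_time_constraints(times):
    """Return a list of violated constraints; an empty list means consistent."""
    return [
        f"{earlier}<{later} violated ({reason})"
        for earlier, later, reason in CONSTRAINTS
        if not times[earlier] < times[later]
    ]


# T3 and T4 are unordered, so generation may precede use of the input.
ok = {"T1": 1, "T2": 2, "T3": 4.5, "T4": 3, "T5": 5, "T6": 6}
print(check_time_constraints(ok))  # []

# An input used before it exists and before the process started.
bad = {"T1": 1, "T2": 2, "T3": 0.5, "T4": 4, "T5": 5, "T6": 6}
for problem in check_time_constraints(bad):
    print(problem)
```

The first example is deliberately chosen with T4 earlier than T3 to show that the model places no ordering between use and generation.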
Dublin Core Profile (draft)
with Simon Miles and Joe Futrelle
• To many people, provenance is primarily about attribution, citation, bibliographic information
• DC provides terms to relate resources to such information
• The DC profile aims to map Dublin Core terms to OPM concepts and graph patterns
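DC to OPM example: dc:publisher
[Figure: artifact A1 (state=unpublished) is used by process P (publish); artifact A2 (state=published) wasGeneratedBy P; A1 and A2 are linked by wasSameResourceAs; P is linked by wasActionOf to agent Ag (person, name=Luc)]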
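The dc:publisher pattern can be sketched as a function that expands one Dublin Core statement into a small graph. Everything about the encoding is an assumption for illustration: the function name `expand_dc_publisher`, the identifier scheme, the dictionary layout, and the direction chosen for the wasSameResourceAs edge. Only the edge labels and node annotations come from the slide, and the profile itself is a draft.

```python
def expand_dc_publisher(resource_id, publisher_name):
    """Expand 'resource dc:publisher name' into nodes and (effect, label, cause) edges."""
    # Hypothetical identifiers for the four nodes in the pattern.
    unpublished = f"{resource_id}#unpublished"
    published = f"{resource_id}#published"
    process = f"{resource_id}#publish"
    agent = f"agent:{publisher_name}"

    nodes = {
        unpublished: {"kind": "artifact", "state": "unpublished"},
        published: {"kind": "artifact", "state": "published"},
        process: {"kind": "process", "type": "publish"},
        agent: {"kind": "agent", "type": "person", "name": publisher_name},
    }
    edges = [
        (process, "used", unpublished),
        (published, "wasGeneratedBy", process),
        # Direction assumed: the published state refers back to the earlier one.
        (published, "wasSameResourceAs", unpublished),
        (process, "wasActionOf", agent),
    ]
    return nodes, edges


nodes, edges = expand_dc_publisher("doc1", "Luc")
for effect, label, cause in edges:
    print(f"{effect} --{label}--> {cause}")
```

Treating the unpublished and published versions as two artifacts keeps each artifact immutable, in line with the OPM definition, while the wasSameResourceAs link records that they are states of one resource.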
What have we learned about provenance?
• Provenance describes and records the results of processes on objects over time
• OPM is an abstract model that represents provenance as an annotated causality graph
• OPM can be serialised in different formats: XML, and RDF for the Semantic Web
• OPM is a work in progress
• By working with an open standard model that can pass information as XML and in standard serialisation formats (e.g. RDF), it should be possible to build provenance services into repository environments