Security and privacy in provenance Simon Miles King’s College London
Outline • Provenance • Models and Systems • Illustrative Application • Privacy and Security Issues
What Provenance Is • Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the history or pedigree of a work of art, manuscript, rare book, etc.; • concretely, a record of the passage of an item through its various owners. • Provenance is important for: • Interpretation • Judging value
Causation • Everything that is part of the provenance of an item is a cause of that item being as it is • For example, provenance of a bottle of wine includes: • Grapes from which it is made • Where those grapes grew • Steps in the wine’s preparation • How the wine was stored • Between which parties the wine was transported, e.g. producer to distributor to retailer
Motivating Applications • We and other projects interviewed and supported users with issues regarding provenance in a range of domains, including: • Bioinformatics • Particle Physics • Proteomics • Organ transplant • Aircraft simulation • Police database integration • Social planning • Chemical analysis • Genetic diseases • Grid service fault tolerance • Brain image analysis • Astronomy
Provenance Questions • How did I (or someone else) come by this result? • What was common and relevant in the history of this set of successful outcomes? • Was the process claimed to be performed the one which was actually performed?
Provenance Questions • What inputs were used to derive this output? • What software produced this data? • Can I generalise from the process by which this result was produced to a re-usable plan?
Provenance Questions • Were these regulations followed in producing this result? • Are these two independent conclusions actually based on the same faulty assumption/input? • What differed between the way these two results were produced?
Shared Histories and Futures • Multiple data can be produced by one process • One process can use data from many sources as input • The provenance (and futures) of data items overlap • It is suspect to assume that one data item = one provenance, or that provenance is stored with the data
Causal graphs • Donor Organ Decision: Yes, decision based on Family Consent Decision: Yes and Blood Test Results: -ve • Family Consent Decision: Yes, response to Family Consent Request: 432 • Blood Test Results: -ve, response to Blood Test Request: 432 • Family Consent Request: 432 and Blood Test Request: 432, each triggered by Patient Brain Death: PID 432
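The causal graph built up over the preceding slides can be sketched as a small data structure. The following is a minimal illustration, assuming each effect simply maps to a list of labelled causes; the names `CAUSES` and `provenance` are hypothetical and not taken from the talk.

```python
# Each entry maps an effect to its direct causes, labelled with the
# relationship used in the diagram. The representation is an assumption.
CAUSES = {
    "Donor Organ Decision: Yes": [
        ("decision based on", "Family Consent Decision: Yes"),
        ("decision based on", "Blood Test Results: -ve"),
    ],
    "Family Consent Decision: Yes": [("response to", "Family Consent Request: 432")],
    "Blood Test Results: -ve": [("response to", "Blood Test Request: 432")],
    "Family Consent Request: 432": [("triggered by", "Patient Brain Death: PID 432")],
    "Blood Test Request: 432": [("triggered by", "Patient Brain Death: PID 432")],
}


def provenance(item, graph=CAUSES):
    """Return every occurrence that directly or indirectly caused `item`."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for _label, cause in graph.get(current, []):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen
```

Calling `provenance("Donor Organ Decision: Yes")` walks back through both branches to the patient's brain death, which has no recorded causes of its own in this sketch.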
Causal Connections • Causes and effects are occurrences • Occurrence of an event, or • Occurrence of a data item or physical object being in a particular state Donation operation Patient after donation with two kidneys
Documentation and Provenance • We can distinguish • process documentation (the documentation recorded into a store about processes) • provenance (everything that caused an item to be as it is) • Process documentation is recorded as processes are executed • The data items that a process will ultimately produce may not be known at that time • Provenance of an entity is obtained as the result of a query over process documentation (Figure: Process documentation, Provenance)
Process Documentation • Documentation of one process comes from multiple, possibly independent, sources • May share a store or use separate ones (Figure: the transplant causal graph from Patient Brain Death: PID 432 to Donor Organ Decision: Yes, with its documentation contributed by the Doctor, the Family and the Testing Lab)
Provenance Scope • An item is caused to be as it is by previous events, which were themselves caused by other events • The causal graph could go back to the beginning of time • If all this information were provided as the result of a query, it would be unmanageable and mostly irrelevant to the querier • Therefore, the querier needs to scope the query to that which is relevant
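One simple way to picture scoping is a bound on how far back the query follows causes. This is only a sketch under that assumption: the talk does not say how scope is expressed, and `scoped_provenance`, `max_depth` and the sample `chain` are hypothetical.

```python
from collections import deque


def scoped_provenance(item, causes, max_depth):
    """Walk breadth-first from `item` towards its causes, at most `max_depth` steps.

    Returns a dict mapping each cause found to its distance from `item`.
    """
    found = {}
    queue = deque([(item, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # scope boundary reached: do not expand further back
        for cause in causes.get(current, []):
            if cause not in found:
                found[cause] = depth + 1
                queue.append((cause, depth + 1))
    return found


# Hypothetical chain of occurrences, each caused by the next.
chain = {
    "decision": ["blood test results"],
    "blood test results": ["blood test request"],
    "blood test request": ["patient brain death"],
}
```

With `max_depth=2` the query stops at the blood test request; a real system might scope by time, organisation or type of relationship instead of depth.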
Open Data Model • Distributed processes involve functionality from multiple independent organisations • Each needs to record documentation independently • We need a common, open data model and interfaces for recording and querying data in that model (Figure: provenance stores at Organisation 1, Organisation 2 and Organisation 3)
Inference • Digitally controlled process: Blood Test Request, Blood Test Results • Inferred physical process between them: Sent Blood Sample, Received Blood Sample
Anonymised User Actions • Provenance records for healthcare will include documentation regarding the actions of patients (or samples of theirs) • Going to see a particular (their) GP • Undergoing surgery at a particular hospital • Their blood sample being sent to a testing lab • Even if the patient is anonymised within the records, the pattern of their actions can be enough to uniquely identify them
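The re-identification risk can be illustrated with a toy check for pseudonyms whose pattern of actions is unique. The records, pseudonyms and the function `unique_patterns` below are invented for illustration only.

```python
from collections import Counter

# Hypothetical anonymised provenance records: (pseudonym, documented action).
records = [
    ("anon-1", "visited GP practice A"),
    ("anon-1", "surgery at hospital H"),
    ("anon-1", "blood sample sent to lab L"),
    ("anon-2", "visited GP practice A"),
    ("anon-2", "surgery at hospital H"),
    ("anon-3", "visited GP practice B"),
    ("anon-3", "surgery at hospital H"),
    ("anon-4", "visited GP practice A"),
    ("anon-4", "surgery at hospital H"),
]


def unique_patterns(records):
    """Return pseudonyms whose set of actions is shared with no other pseudonym."""
    actions = {}
    for pseudonym, action in records:
        actions.setdefault(pseudonym, set()).add(action)
    counts = Counter(frozenset(a) for a in actions.values())
    return sorted(p for p, a in actions.items() if counts[frozenset(a)] == 1)
```

Here `anon-1` and `anon-3` stand out: anyone who knows that a particular person followed that combination of steps can link the pseudonym back to them, even though no name is stored.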
Data and Metadata Rights • Provenance is often viewed as metadata to the data of which it provides a history • Provenance information is usually generated automatically at runtime, and it is not known in advance what that information will be, yet appropriate rights have to be applied to the provenance • How do access rights of the provenance metadata relate to those of the data?
Multi-Data Metadata • Furthermore, provenance is often metadata to multiple data items • For example, a record of the process of a transplant operation is the provenance of • The transplanted organ, • The decision to transplant, • Blood tests carried out to decide to transplant, etc. • Each may be stored separately and have very different access control policies
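How the rights on provenance should relate to the rights on the data is left open in the talk. Purely as an illustration of one candidate rule, the sketch below assumes the most restrictive option: a provenance record is readable only by those allowed to read every data item it describes. The reader sets and the function `provenance_readers` are hypothetical.

```python
# Hypothetical access control lists for the separately stored data items.
data_readers = {
    "transplanted organ": {"surgeon", "coordinator"},
    "decision to transplant": {"surgeon", "coordinator", "auditor"},
    "blood tests": {"surgeon", "lab technician"},
}


def provenance_readers(described_items, data_readers):
    """Intersect the reader sets of every data item the record is provenance of."""
    reader_sets = [data_readers[item] for item in described_items]
    return set.intersection(*reader_sets) if reader_sets else set()
```

Under this rule only the surgeon could read the record of the whole transplant process, which shows how quickly a conservative policy narrows access when one record is metadata to many items.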
Necessary Distribution of Query • It is sometimes necessary to distribute parts of the provenance data about a process into multiple stores • For example, in the OTM case, by EU law the data regarding activity within each hospital had to remain within that hospital • To answer a provenance question, we need to query across distributed stores
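Querying across distributed stores can be sketched as a traversal that asks every store for the direct causes of each occurrence, so that the data itself never leaves the store holding it. The store contents and the functions `query_store` and `distributed_provenance` are assumptions made for this example.

```python
def query_store(store, item):
    """Stand-in for a remote call: ask one store for the direct causes of `item`."""
    return store.get(item, [])


def distributed_provenance(item, stores):
    """Collect all causes of `item` by asking every store at each step."""
    seen = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for store in stores.values():
            for cause in query_store(store, current):
                if cause not in seen:
                    seen.add(cause)
                    frontier.append(cause)
    return seen


# Hypothetical stores, each holding only the documentation of its own activity.
stores = {
    "hospital A": {"donor organ decision": ["blood test results"]},
    "testing lab": {"blood test results": ["blood test request"]},
    "hospital B": {"blood test request": ["patient brain death"]},
}
```

No single store can answer the question about the donor organ decision; the full chain only appears once the partial answers are combined by the querier.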
Automatic Capture • Provenance is often viewed as metadata to the data of which it provides a history • Provenance information is usually generated automatically at runtime, and it is not known in advance what that information will be, yet appropriate rights have to be applied to the provenance • How do access rights of the provenance metadata relate to those of the data?
Traffic Confidentiality and Inference • Traffic confidentiality means hiding the fact that a service was used by a client, even where transmitted data is encrypted • Example: a pharmaceutical company querying a small lab’s public database concerning a particular disease • Intermediaries that use multiple services can help achieve confidentiality • But the actual service used could be inferred from provenance, which is set up to allow inferences
Extra Material Index • Motivation for general provenance models • Interoperability and the Open Provenance Model • Provenance technologies in database research, digital libraries, semantic web • Provenance in Tupelo (from NCSA) • Provenance in Taverna (from Manchester) • The Provenance Challenges • Open research issues
Separately Documented Aspects • Attribution and related events • Modified by Simon Miles, compressed by X • Created at time T1, deposited at T2 • Documentation of the processing of data • Enactment of workflows • Chain of ownership • Versioning • Differing practice, technologies, emphasis: workflows, DB research, libraries, semweb
Preparation for Questions • Don’t know in advance of something being produced that it will be produced • When documenting events, can’t yet associate that documentation with what those events ultimately produce • Don’t know in advance of being asked (about provenance) what will be asked • When documenting provenance, can’t restrict documentation to that which you know will be used
Alternative Accounts • In some disciplines or for some kinds of data, provenance can be disputed • Even within a computer system, there can be multiple accounts of apparently the same event (Figure: one account states “A sent X to B”, another states “A sent Y to B”, with corruption occurring between A and B)
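Keeping alternative accounts side by side, rather than overwriting one with the other, makes such disagreements visible. A minimal sketch, assuming each assertion is recorded as an (observer, event, content) triple; the identifiers are hypothetical.

```python
from collections import defaultdict


def conflicting_events(assertions):
    """Return the events for which observers recorded different content."""
    by_event = defaultdict(dict)
    for observer, event, content in assertions:
        by_event[event][observer] = content
    return {
        event: accounts
        for event, accounts in by_event.items()
        if len(set(accounts.values())) > 1
    }


# Hypothetical accounts: A and B disagree about what was sent in message-1.
assertions = [
    ("A", "message-1", "X"),
    ("B", "message-1", "Y"),
    ("A", "message-2", "Z"),
    ("B", "message-2", "Z"),
]
```

The function reports only `message-1`, where the sender's and the receiver's accounts differ, and leaves the choice of which account to trust to the querier.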
Common General Models • Provide skeleton for documenting all aspects of provenance • Record lots without (much) regard to particular questions... • Then query as relevant to required usage • System interoperation through common serialisation • Can connect records from different systems involved in producing one data item
Open Provenance Model http://openprovenance.org • Can describe any process (not just WF execution) • Allows alternate accounts by different observers
OPM Requirements • To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model. • To allow developers to build and share tools that operate on such a provenance model. • To define the model in a precise, technology-agnostic manner. • To support a digital representation of provenance for any “thing”, whether produced by computer systems or not.
OPM Non-Requirements • OPM does not specify the internal representations that systems have to adopt to store and manipulate provenance internally. • OPM does not define a computer-parsable syntax for this model (but prototype RDF, XML schemas have been developed) • OPM does not specify protocols to store such provenance information in provenance repositories. • OPM does not specify protocols to query provenance repositories.
Contributors • Original contributors from: • Universities: Southampton, Indiana, King’s College, Manchester, Davis, Hasselt, Utah, Southern California • Microsoft, NCSA, PNNL • Plus 3rd challenge participants including: • Universities: Harvard, Chicago, Santa Barbara, Amsterdam • SDSC
Open Provenance Model • 3 node types – artifact, process, agent • 5 arc types – used, generated, triggered, derived, controlled – and inference rules • Generic – extensibility via annotation • Choice of granularity and focus (e.g., artifact or process-centric)
Entities • Artifact: Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system. • Process: Action or series of actions performed on or caused by artifacts, and resulting in new artifacts. • Agent: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.
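Edges • used: process P used artifact A • was generated by: artifact A was generated by process P • was triggered by: process P was triggered by process P • was derived from: artifact A was derived from artifact A • Role identifiers on edges specify in what way an artifact relates to a process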
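The node and arc types can be sketched as a small typed structure that rejects edges between the wrong kinds of node. OPM deliberately does not prescribe an internal representation, so the classes below and the spelling of the arc names are illustrative choices rather than part of the model.

```python
from dataclasses import dataclass

# Allowed (effect kind, cause kind) for each of the five arc types.
ARC_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasControlledBy": ("process", "agent"),
}


@dataclass(frozen=True)
class Node:
    kind: str  # "artifact", "process" or "agent"
    name: str


@dataclass(frozen=True)
class Edge:
    arc: str
    effect: Node
    cause: Node
    role: str = ""  # how the artifact or agent relates to the process

    def __post_init__(self):
        # Reject edges whose endpoints do not match the arc type.
        expected = ARC_TYPES[self.arc]
        if (self.effect.kind, self.cause.kind) != expected:
            raise ValueError(f"{self.arc} must link {expected[0]} to {expected[1]}")
```

For example, `Edge("used", Node("process", "P"), Node("artifact", "A"), role="input")` is accepted, while swapping the two nodes raises `ValueError`.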
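Pegasus Example • Agent: Pegasus / Condor DAGMan • Process: Produce Sky Mosaic, was controlled by (enactor) Pegasus / Condor DAGMan • Artifact: FITS DataSet, used (inputSet) by Produce Sky Mosaic • Artifact: Degree, used (size) by Produce Sky Mosaic • Artifact: Mosaic, was generated by (output) Produce Sky Mosaic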
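The Pegasus example can be written down as a handful of edges and then queried, for instance to answer "what inputs were used to derive this output?". The tuple layout and the function `inputs_of` are assumptions made for this sketch, not a format defined by OPM or Pegasus.

```python
# Each edge is (effect, arc type, cause, role), following the diagram's labels.
edges = [
    ("Produce Sky Mosaic", "used", "FITS DataSet", "inputSet"),
    ("Produce Sky Mosaic", "used", "Degree", "size"),
    ("Mosaic", "wasGeneratedBy", "Produce Sky Mosaic", "output"),
    ("Produce Sky Mosaic", "wasControlledBy", "Pegasus / Condor DAGMan", "enactor"),
]


def inputs_of(artifact, edges):
    """Map role -> input artifact for every process that generated `artifact`."""
    producers = {
        cause
        for effect, arc, cause, _role in edges
        if arc == "wasGeneratedBy" and effect == artifact
    }
    return {
        role: cause
        for effect, arc, cause, role in edges
        if arc == "used" and effect in producers
    }
```

`inputs_of("Mosaic", edges)` returns the FITS data set under the role `inputSet` and the degree under the role `size`, which is where the role identifiers on edges become useful.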
Mapping Attribution to OPM agent Simon Miles wasActionOf artifact used was generated by creation A artifact process used A dc:creator “Simon Miles” artifact