
Security and privacy in provenance


Presentation Transcript


  1. Security and privacy in provenance Simon Miles King’s College London

  2. Outline • Provenance • Models and Systems • Illustrative Application • Privacy and Security Issues

  3. Provenance

  4. What Provenance Is • Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the history or pedigree of a work of art, manuscript, rare book, etc.; • concretely, a record of the passage of an item through its various owners. • Provenance is important for: • Interpretation • Judging value

  5. Causation • Everything that is part of the provenance of an item is a cause of that item being as it is • For example, provenance of a bottle of wine includes: • Grapes from which it is made • Where those grapes grew • Steps in the wine’s preparation • How the wine was stored • Between which parties the wine was transported, e.g. producer to distributor to retailer

  6. Motivating Applications • We and other projects interviewed and supported users with issues regarding provenance in a range of domains, including: • Bioinformatics • Particle Physics • Proteomics • Organ transplant • Aircraft simulation • Police database integration • Social planning • Chemical analysis • Genetic diseases • Grid service fault tolerance • Brain image analysis • Astronomy

  7. Provenance Questions • How did I (or someone else) come by this result? • What was common and relevant in the history of this set of successful outcomes? • Was the process claimed to be performed the one which was actually performed?

  8. Provenance Questions • What inputs were used to derive this output? • What software produced this data? • Can I generalise from the process by which this result was produced to a re-usable plan?

  9. Provenance Questions • Were these regulations followed in producing this result? • Are these two independent conclusions actually based on the same faulty assumption/input? • What differed between the way these two results were produced?

  10. Shared Histories and Futures • Multiple data can be produced by one process • One process can use data from many sources as input • The provenance (and futures) of data items overlap • It is therefore suspect to assume that one data item has exactly one provenance, or that provenance should be stored with the data

  11. Causal Provenance Models: Illustrative Application

  12. Causal graphs Donor Organ Decision: Yes

  13. Causal graphs Family Consent Decision: Yes Blood Test Results: -ve decision based on Donor Organ Decision: Yes

  14. Causal graphs Family Consent Request: 432 Blood Test Request: 432 response to response to Family Consent Decision: Yes Blood Test Results: -ve decision based on Donor Organ Decision: Yes

  15. Causal graphs Patient Brain Death: PID 432 triggered by Family Consent Request: 432 Blood Test Request: 432 response to response to Family Consent Decision: Yes Blood Test Results: -ve decision based on Donor Organ Decision: Yes

  16. Causal graphs • Donor Organ Decision: Yes, decision based on Family Consent Decision: Yes and Blood Test Results: -ve • Family Consent Decision: Yes, response to Family Consent Request: 432 • Blood Test Results: -ve, response to Blood Test Request: 432 • Family Consent Request: 432 and Blood Test Request: 432, each triggered by Patient Brain Death: PID 432
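
The completed causal graph can be sketched as a small Python adjacency map, with the provenance of an item computed by following cause edges backwards. This is a minimal illustration and not code from the talk; the names `CAUSES` and `provenance` are assumptions introduced here.

```python
from collections import deque

# Each entry reads: effect -> [(relation, cause), ...], following the arrows
# of the organ donation example. The structure itself is an assumption.
CAUSES = {
    "Donor Organ Decision: Yes": [
        ("decision based on", "Family Consent Decision: Yes"),
        ("decision based on", "Blood Test Results: -ve"),
    ],
    "Family Consent Decision: Yes": [
        ("response to", "Family Consent Request: 432"),
    ],
    "Blood Test Results: -ve": [
        ("response to", "Blood Test Request: 432"),
    ],
    "Family Consent Request: 432": [
        ("triggered by", "Patient Brain Death: PID 432"),
    ],
    "Blood Test Request: 432": [
        ("triggered by", "Patient Brain Death: PID 432"),
    ],
}


def provenance(item, graph=CAUSES):
    """Return every occurrence that directly or indirectly caused `item`."""
    seen = []
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for _relation, cause in graph.get(current, []):
            if cause not in seen:
                seen.append(cause)
                queue.append(cause)
    return seen


if __name__ == "__main__":
    for cause in provenance("Donor Organ Decision: Yes"):
        print(cause)
```

Asking for the provenance of the donor organ decision returns all five earlier occurrences, ending with the patient's brain death, which both requests share as a common cause.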

  17. Causal Connections • Causes and effects are occurrences • Occurrence of an event, or • Occurrence of a data item or physical object being in a particular state Donation operation Patient after donation with two kidneys

  18. Documentation and Provenance • We can distinguish • process documentation (the documentation recorded into a store about processes) • provenance (everything that caused an item to be as it is) • Process documentation is recorded as processes are executed • The data items that a process will ultimately produce may not be known at that time • Provenance of an entity is obtained as the result of a query over process documentation

  19. Process Documentation • Documentation of one process comes from multiple, possibly independent, sources • May share a store or use separate ones Family Consent Request Doctor Family Patient Brain Death: PID 432 Family Consent Decision Blood Test Request Donor Organ Decision: Yes Testing Lab Blood Test Results

  20. Provenance Scope • An item is caused to be as it is by previous events, which were themselves caused by other events • The causal graph could go back to the beginning of time • If all this information was provided as a result of a query, it would be unmanageable and mostly irrelevant to the querier • Therefore, the querier needs to scope the query to that which is relevant
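
One way to picture scoping is a traversal that stops after a fixed number of causal steps and, optionally, only follows certain kinds of relation. The sketch below is an assumption made for illustration: the talk does not say how scope is expressed, and `scoped_provenance`, `max_depth` and `relations` are hypothetical names.

```python
def scoped_provenance(item, graph, max_depth, relations=None):
    """Collect causes of `item` at most `max_depth` steps back.

    `graph` maps an effect to a list of (relation, cause) pairs.
    If `relations` is given, only edges with those relation names are followed.
    Returns a dict mapping each cause found to the depth at which it was found.
    """
    found = {}
    frontier = [item]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for effect in frontier:
            for relation, cause in graph.get(effect, []):
                if relations is not None and relation not in relations:
                    continue
                if cause not in found:
                    found[cause] = depth
                    next_frontier.append(cause)
        frontier = next_frontier
    return found


# Hypothetical, shortened version of the organ donation graph.
GRAPH = {
    "Donor Organ Decision": [
        ("decision based on", "Blood Test Results"),
        ("decision based on", "Family Consent Decision"),
    ],
    "Blood Test Results": [("response to", "Blood Test Request")],
    "Family Consent Decision": [("response to", "Family Consent Request")],
    "Blood Test Request": [("triggered by", "Patient Brain Death")],
}

if __name__ == "__main__":
    print(scoped_provenance("Donor Organ Decision", GRAPH, max_depth=1))
    print(scoped_provenance("Donor Organ Decision", GRAPH, max_depth=3))
```

With a depth of 1 the querier sees only the two inputs to the decision; widening the depth pulls in the requests and, eventually, the triggering event.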

  21. Open Data Model • Distributed processes involve functionality from multiple independent organisations • Each needs to record documentation independently • We need a common, open data model and interfaces for recording and querying data in that model Provenance Stores Organisation 2 Organisation 1 Organisation 3

  22. Inference Digitally Controlled Process Blood Test Request Blood Test Results

  23. Inference Digitally Controlled Process Inferred Physical Process Blood Test Request Sent Blood Sample Received Blood Sample Blood Test Results

  24. Privacy and Security Issues

  25. Anonymised User Actions • Provenance records for healthcare will include documentation regarding the actions of patients (or samples of theirs) • Going to see a particular (their) GP • Undergoing surgery at a particular hospital • Their blood sample being sent to a testing lab • Even if the patient is anonymised within the records, the pattern of their actions can be enough to uniquely identify them
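
The re-identification risk can be illustrated by grouping anonymised records by pseudonym and checking which action patterns occur only once. The records, pseudonyms and function names below are invented for this sketch and do not come from the talk.

```python
from collections import Counter

# Hypothetical anonymised provenance records: (pseudonym, documented action).
RECORDS = [
    ("anon-1", "visit GP practice A"),
    ("anon-1", "surgery at hospital H"),
    ("anon-1", "blood sample sent to lab L"),
    ("anon-2", "visit GP practice A"),
    ("anon-2", "surgery at hospital H"),
    ("anon-3", "visit GP practice A"),
    ("anon-3", "surgery at hospital H"),
]


def action_patterns(records):
    """Map each pseudonym to the set of actions recorded for it."""
    patterns = {}
    for pseudonym, action in records:
        patterns.setdefault(pseudonym, set()).add(action)
    return {p: frozenset(actions) for p, actions in patterns.items()}


def uniquely_identifiable(records):
    """Return pseudonyms whose action pattern is shared with nobody else."""
    patterns = action_patterns(records)
    counts = Counter(patterns.values())
    return sorted(p for p, pattern in patterns.items() if counts[pattern] == 1)


if __name__ == "__main__":
    print(uniquely_identifiable(RECORDS))
```

Here `anon-2` and `anon-3` share a pattern, but `anon-1` is the only patient whose sample went to lab L, so anyone who knows that fact from another source can single out the rest of that patient's record.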

  26. Data and Metadata Rights • Provenance is often viewed as metadata to the data of which it provides a history • Provenance information is usually generated automatically at runtime, and it is not known in advance what that information will be, yet appropriate rights have to be applied to the provenance • How do access rights of the provenance metadata relate to those of the data?

  27. Multi-Data Metadata • Furthermore, provenance is often metadata to multiple data items • For example, a record of the process of a transplant operation is the provenance of • The transplanted organ, • The decision to transplant, • Blood tests carried out to decide to transplant, etc. • Each may be stored separately and have very different access control policies
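
The talk leaves open how the rights on a provenance record should be derived from the rights on the data items it describes. The sketch below only contrasts two candidate rules, a restrictive intersection and a permissive union; the roles, policies and the function `readers_of_provenance` are all hypothetical.

```python
# Hypothetical per-item read policies: data item -> roles allowed to read it.
DATA_READERS = {
    "transplanted organ": {"surgeon", "transplant coordinator"},
    "decision to transplant": {"surgeon", "transplant coordinator", "ethics board"},
    "blood tests": {"surgeon", "lab technician"},
}


def readers_of_provenance(data_items, policies=DATA_READERS, mode="intersection"):
    """Derive who may read a provenance record shared by several data items.

    mode="intersection": only roles allowed to read every related item.
    mode="union": any role allowed to read at least one related item.
    """
    role_sets = [policies[item] for item in data_items]
    if not role_sets:
        return set()
    if mode == "intersection":
        return set.intersection(*role_sets)
    if mode == "union":
        return set.union(*role_sets)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    items = list(DATA_READERS)
    print(sorted(readers_of_provenance(items, mode="intersection")))
    print(sorted(readers_of_provenance(items, mode="union")))
```

The intersection rule leaves only the surgeon able to read the transplant record, while the union rule would let a lab technician see the family's consent decision. Neither is obviously right, which is the point the slide raises.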

  28. Necessary Distribution of Query • It is sometimes necessary to distribute parts of the provenance data about a process into multiple stores • For example, in the OTM case, by EU law the data regarding activity within each hospital had to remain within that hospital • To answer a provenance question, we need to query across distributed stores
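
A query across distributed stores can be pictured as repeatedly asking each store for the direct causes of an item it knows about and stitching the answers together. This is a simplified sketch under assumptions: the `ProvenanceStore` class, its `causes_of` method and the store names are hypothetical, and access control and networking are ignored.

```python
class ProvenanceStore:
    """A store holding only the process documentation of one organisation."""

    def __init__(self, name, edges):
        self.name = name
        self._edges = edges  # effect -> list of direct causes recorded here

    def causes_of(self, item):
        """Return the direct causes of `item` that this store has recorded."""
        return list(self._edges.get(item, []))


def distributed_provenance(item, stores):
    """Follow causes across all stores; return {cause: name of store that knew it}."""
    found = {}
    pending = [item]
    while pending:
        current = pending.pop()
        for store in stores:
            for cause in store.causes_of(current):
                if cause not in found:
                    found[cause] = store.name
                    pending.append(cause)
    return found


# Hypothetical split of the organ donation documentation over two stores.
hospital_a = ProvenanceStore("Hospital A", {
    "Donor Organ Decision": ["Blood Test Results", "Family Consent Decision"],
    "Family Consent Decision": ["Family Consent Request"],
})
hospital_b = ProvenanceStore("Hospital B", {
    "Blood Test Results": ["Blood Test Request"],
})

if __name__ == "__main__":
    result = distributed_provenance("Donor Organ Decision", [hospital_a, hospital_b])
    for cause, store_name in result.items():
        print(f"{cause} (from {store_name})")
```

No single store can answer the question alone: the blood test request is only reachable because the query crosses from one store to the other.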

  29. Automatic Capture • Provenance is often viewed as metadata to the data of which it provides a history • Provenance information is usually generated automatically at runtime, and it is not known in advance what that information will be, yet appropriate rights have to be applied to the provenance • How do access rights of the provenance metadata relate to those of the data?

  30. Traffic Confidentiality and Inference • Traffic confidentiality means hiding the fact that a service was used by a client, even where transmitted data is encrypted • For example, a pharmaceutical company querying a small lab’s public database concerning a particular disease • Using intermediaries who use multiple services can help achieve confidentiality • But the actual service used could still be inferred from provenance, which is set up to allow inferences

  31. Extra Material

  32. Extra Material Index • Motivation for general provenance models • Interoperability and the Open Provenance Model • Provenance technologies in database research, digital libraries, semantic web • Provenance in Tupelo (from NCSA) • Provenance in Taverna (from Manchester) • The Provenance Challenges • Open research issues

  33. Motivation for Common, General Provenance Models

  34. Separately Documented Aspects • Attribution and related events • Modified by Simon Miles, compressed by X • Created at time T1, deposited at T2 • Documentation of the processing of data • Enactment of workflows • Chain of ownership • Versioning • Differing practice, technologies, emphasis: workflows, DB research, libraries, semweb

  35. Preparation for Questions • Don’t know in advance of something being produced that it will be produced • When documenting events, can’t yet associate that documentation with what those events ultimately produce • Don’t know in advance of being asked (about provenance) what will be asked • When documenting provenance, can’t restrict documentation to that you know will be used

  36. Shared Histories and Futures • Multiple data can be produced by one process • One process can use data from many sources as input • The provenance (and futures) of data items overlap • It is therefore suspect to assume that one data item has exactly one provenance, or that provenance should be stored with the data

  37. Alternative Accounts • In some disciplines or for some kinds of data, provenance can be disputed • Even within a computer system, there can be multiple accounts of apparently the same event A sent X to B A sent Y to B corruption A B

  38. Common General Models • Provide skeleton for documenting all aspects of provenance • Record lots without (much) regard to particular questions... • Then query as relevant to required usage • System interoperation through common serialisation • Can connect records from different systems involved in producing 1 data item

  39. Provenance Scope • An item is caused to be as it is by previous events, which were themselves caused by other events • The causal graph could go back to the beginning of time • If all this information was provided as a result of a query, it would be unmanageable and mostly irrelevant to the querier • Therefore, the querier needs to scope the query to that which is relevant

  40. Interoperability

  41. Open Data Model • Distributed processes involve functionality from multiple independent organisations • Each needs to record documentation independently • We need a common, open data model and interfaces for recording and querying data in that model Provenance Stores Organisation 2 Organisation 1 Organisation 3

  42. Open Provenance Model http://openprovenance.org • Can describe any process (not just workflow execution) • Allows alternate accounts by different observers

  43. OPM Requirements • To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model. • To allow developers to build and share tools that operate on such provenance model. • To define the model in a precise, technology-agnostic manner. • To support a digital representation of provenance for any “thing”, whether produced by computer systems or not.

  44. OPM Non-Requirements • OPM does not specify the internal representations that systems have to adopt to store and manipulate provenance internally. • OPM does not define a computer-parsable syntax for this model (but prototype RDF, XML schemas have been developed) • OPM does not specify protocols to store such provenance information in provenance repositories. • OPM does not specify protocols to query provenance repositories.

  45. Contributors • Original contributors from: • Universities: Southampton, Indiana, King’s College, Manchester, Davis, Hasselt, Utah, Southern California • Microsoft, NCSA, PNNL • Plus 3rd challenge participants including: • Universities: Harvard, Chicago, Santa Barbara, Amsterdam • SDSC

  46. Open Provenance Model • 3 node types – artifact, process, agent • 5 arc types – used, generated, triggered, derived, controlled – and inference rules • Generic – extensibility via annotation • Choice of granularity and focus (e.g., artifact or process-centric)

  47. Entities • Artifact: Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system. • Process: Action or series of actions performed on or caused by artifacts, and resulting in new artifacts. • Agent: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.

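  48. Edges • used: from a process (P) to an artifact (A) • was generated by: from an artifact (A) to a process (P) • was triggered by: from a process (P) to a process (P) • was derived from: from an artifact (A) to an artifact (A) • Role identifiers on edges specify in what way an artifact relates to a process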

  49. Pegasus Example • Process: Produce Sky Mosaic • was controlled by (enactor) the agent Pegasus / Condor DAGMan • used (inputSet) the artifact FITS DataSet • used (size) the artifact Degree • The artifact Mosaic was generated by (output) the process
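
The three node types, five arc types and the Pegasus example can be sketched as a small typed data structure. This is an illustration only: OPM deliberately does not prescribe an internal representation, so the classes and the arc spellings such as `wasGeneratedBy` (camel-case renderings of the edge labels on the slides) are assumptions made here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeType(Enum):
    ARTIFACT = "artifact"
    PROCESS = "process"
    AGENT = "agent"


# Arc name -> (type of the node the arc starts at, type of the node it points to).
ARC_SIGNATURES = {
    "used": (NodeType.PROCESS, NodeType.ARTIFACT),
    "wasGeneratedBy": (NodeType.ARTIFACT, NodeType.PROCESS),
    "wasTriggeredBy": (NodeType.PROCESS, NodeType.PROCESS),
    "wasDerivedFrom": (NodeType.ARTIFACT, NodeType.ARTIFACT),
    "wasControlledBy": (NodeType.PROCESS, NodeType.AGENT),
}


@dataclass(frozen=True)
class Node:
    name: str
    type: NodeType


@dataclass(frozen=True)
class Edge:
    arc: str
    effect: Node
    cause: Node
    role: Optional[str] = None

    def __post_init__(self):
        # Reject edges that connect the wrong kinds of node for this arc type.
        expected = ARC_SIGNATURES[self.arc]
        actual = (self.effect.type, self.cause.type)
        if actual != expected:
            raise TypeError(
                f"{self.arc} cannot link {actual[0].value} to {actual[1].value}"
            )


pegasus = Node("Pegasus / Condor DAGMan", NodeType.AGENT)
produce_mosaic = Node("Produce Sky Mosaic", NodeType.PROCESS)
fits = Node("FITS DataSet", NodeType.ARTIFACT)
degree = Node("Degree", NodeType.ARTIFACT)
mosaic = Node("Mosaic", NodeType.ARTIFACT)

PEGASUS_EXAMPLE = [
    Edge("wasControlledBy", produce_mosaic, pegasus, role="enactor"),
    Edge("used", produce_mosaic, fits, role="inputSet"),
    Edge("used", produce_mosaic, degree, role="size"),
    Edge("wasGeneratedBy", mosaic, produce_mosaic, role="output"),
]

if __name__ == "__main__":
    for edge in PEGASUS_EXAMPLE:
        print(f"{edge.effect.name} {edge.arc} ({edge.role}) {edge.cause.name}")
```

Checking node types when an edge is built catches mistakes such as an artifact "using" a process, and the role string carries the slide's role identifiers (enactor, inputSet, size, output).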

  50. Mapping Attribution to OPM agent Simon Miles wasActionOf artifact used was generated by creation A artifact process used A dc:creator “Simon Miles” artifact
