1. Knowledge and Provenance: A knowledge model perspective. Carole Goble, University of Manchester, UK
It’s a framework for application integration
Integrative encyclopaedia of life – it’s the same as myGrid.
Basic strategy is annotation, integration and presentation.
It’s a middleware project – you don’t have to buy into everything. It’s toolkits.
Provenance and knowledge - musing on what users might want.
Given the emphasis on "characteristics of provenance-in-the-large":
From a user's perspective, one of the main uses of provenance data could be to bring the user up to date with the current working context, or to summarise the areas of activity in a given period.
- what has happened (within this project) since I last logged on to my experimental environment (virtual organisation)?
- what have been the new experiments between date X and date Y?
For situations such as "what has happened of interest to me", a provenance record of the relevant user proxy could provide the basis for answers.
For general queries, where a user may be exploring an environment from different perspectives, there is no reason for the user to distinguish between provenance metadata and other metadata.
A key aspect is that this gives a requirement for queries over large volumes of metadata (and data) and possible outputs include:
workflow_486 has been run 37 times in the last month
workflow_486 has been run 7 times by user_34
workflow_486 has been run 7 times by user_34 with lsid:3213 as input_geneId
workflow_486 has been run 5 times by user_21
...
Affymetrix_probe_ids have been used 1041 times this month
lsid:3213 has been used 45 times
lsid:3213 has been input to workflow_486 7 times
...
This contrasts with provenance that concentrates on the lineage or origin of a specific data item. If the lineage is to allow others to verify in-silico experiments, then the user and time are usually considered irrelevant. In addition, the value of the "coarse provenance queries" above is that the user does not need to know exactly which data is of interest (compared with asking the detailed "Where did this data come from, and is it up to date?").
Mark
PS This line of thinking raises one area of interest as the relationship between provenance metadata and other metadata.
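As a sketch, the coarse aggregate queries listed above amount to simple counting over a log of provenance records. The record structure below (workflow, user, input id) is an assumption for illustration; the identifiers echo the examples in the text.

```python
from collections import Counter

# Hypothetical provenance records as (workflow, user, input_id) tuples.
runs = [
    ("workflow_486", "user_34", "lsid:3213"),
    ("workflow_486", "user_34", "lsid:3213"),
    ("workflow_486", "user_21", "lsid:9999"),
    ("workflow_501", "user_34", "lsid:3213"),
]

# "workflow_486 has been run N times"
total_486 = sum(1 for w, _, _ in runs if w == "workflow_486")

# "workflow_486 has been run N times by user_X"
runs_by_user = Counter(u for w, u, _ in runs if w == "workflow_486")

# "lsid:3213 has been used N times"
lsid_uses = sum(1 for _, _, d in runs if d == "lsid:3213")

print(total_486)                # 3
print(runs_by_user["user_34"])  # 2
print(lsid_uses)                # 3
```

The point is that none of these queries name a specific data item of interest up front; they summarise activity across the whole record.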
2. Talk roadmap
3. my Context Knowledge-driven Middleware for data intensive in silico experiments in biology
http://www.mygrid.org.uk
4. A real bio provenance log
5. Any and every experimental item attracts provenance (so long as you can ID it). Experimental design components
workflow specifications; query specifications; notes describing objectives; applications; databases; relevant papers; the web pages of important workers, services
Experimental instances that are records of enacted experiments
data results; a history of services invoked by a workflow engine; instances of services invoked; parameters set for an application; notes commenting on the results
Experimental glue that groups and links design and instance components
a query and its results; a workflow linked with its outcome; links between a workflow and its previous and subsequent versions; a group of all these things linked to a document discussing the conclusions of the biologist Integrating components
6. Provenance is metadata … intended for sharing, retrieving, integrating, aggregating and processing.
generated with the hope that it is comprehensive enough to be future-proofed.
recorded for those who we do not yet know will use the object and who will likely use it in a different way.
machine-processable: free text is of limited help.
Provenance is the knowledge that makes
An item interpretable and reusable within a context
An item reproducible or at least repeatable.
It’s part of the information model of any system
8. Provenance is contextual metadata We look at the same things in different ways and different things in the same way
Our data alone does not describe our work
We have to capture this context.
HERO knowledge management
HEROINE’s mission:
Describing and sharing how different people, in different places and at different times, conceive of elements of human-environment interaction
Enabling the discovery of new concepts and relationships that aid these descriptions
Developing tools and approaches for HERO knowledge management
9. Provenance forms Derivations
A path like a workflow, script or query.
Linking items, usually in a directed graph.
An explanation of when, who and how something was produced.
Execution Process-centric
Annotations
Attached to items or collections of items, in a structured, semi-structured or free text form.
Annotations on one item or linking items.
An explanation of why, when, where, who, what, how.
Data-centric
A workflow or a derivation graph
A history/audit
A recipe/plan
Play forward: how do I get from here?
Play backwards: how do I get to here?
Annotations and notes
about experiments by scientists (recording the process of biological experiments for e-Science, the purpose and results of experiments).
Derivation paths
A workflow or database query between data inputs and data outputs
A program & its parameters to a data result
A path between an original and a refined workflow template etc
Every item in the myGrid Information Repository has (Dublin Core based) provenance attributes
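The "play forward / play backwards" idea over a derivation graph can be sketched as a traversal. The graph below maps each item to the items it was derived from; all node names are invented placeholders, not real myGrid identifiers.

```python
# Map each item to the items it was derived from.
derived_from = {
    "result_1": ["workflow_run_7"],
    "workflow_run_7": ["input_gene", "workflow_spec"],
}

def play_backwards(item, graph):
    """Trace lineage: everything 'item' was (transitively) derived from."""
    seen = []
    stack = [item]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

def play_forward(item, graph):
    """Invert the graph to ask: what has been derived from 'item'?"""
    inverted = {}
    for child, parents in graph.items():
        for parent in parents:
            inverted.setdefault(parent, []).append(child)
    return play_backwards(item, inverted)

print(play_backwards("result_1", derived_from))
# ['workflow_run_7', 'input_gene', 'workflow_spec']
print(play_forward("input_gene", derived_from))
# ['workflow_run_7', 'result_1']
```

The same directed graph serves as history/audit (backwards) and as recipe/plan (forwards); only the direction of traversal changes.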
10. Workflows as in silico experiments Freefluo workflow enactment engine
WSFL
Scufl
Semantic Workflow discovery
Finding workflows that others have done, and that I have done myself
Semantic service discovery
Finding classes of services
Guiding service composition
(We don’t do automated composition)
Dynamic workflow enactment service discovery and invocation
Choose services instances when running workflow
User involvement
Soaplab: SOAP-based Analysis Web Service. Soaplab is a set of Web Services providing programmatic access to some applications on remote computers. Because such applications, especially in the scientific environment where Soaplab was born, usually analyse data, Soaplab is often referred to as an Analysis (Web) Service.
Soaplab was developed at the European Bioinformatics Institute (EBI), within the eScience initiative, as a component of the myGrid project.
Soaplab is both a specification for an Analysis Service (based on other approved specifications, see the Architecture Guide) and its implementation. It is freely available for download - but bear in mind that installing this Web Service does not give you any analyses - they are not part of Soaplab. The EBI has a Soaplab service running on top of several tens of analyses (most of them coming from EMBOSS, an independent package of high-quality free open-source software for sequence analysis) - but it is an experimental service which may not have 24/7 availability.
11. Semantic discovery – services & workflows
12. Provenance forms in myGrid Derivations
FreeFluo Workflow Enactment Engine provides a detailed provenance record stored in the myGrid Information Repository (mIR) describing what was done, with what services and when
XML document, soon to be an RDF model
Annotations
Every mIR object has Dublin Core provenance properties described in an attribute-value model.
Annotations and notes
about experiments by scientists (recording the process of biological experiments for e-Science, the purpose and results of experiments).
Derivation paths
A workflow or database query between data inputs and data outputs
A program & its parameters to a data result
A path between an original and a refined workflow template etc
Every item in the myGrid Information Repository has (Dublin Core based) provenance attributes
13. Provenance of data: operational execution trail
The provenance of connections/relationships
An example: in the Graves' disease example, the annotation pipeline workflow takes in an Affymetrix probe id (indicating a gene) and its outputs include: an EMBL id, a Swiss-Prot id, a set of OMIM ids, a set of GO terms.
For simplicity let us just take the Swiss-Prot id. In the user's environment there is now a new data item, the Swiss-Prot id. In one respect we can say that the origin of that id was the instance of the annotation workflow, and it depended on the probe id (gene id) given as input. (The detailed provenance record of the workflow instance will identify the "transformations" that led from the input to the output.)
However, the annotation pipeline did not create the Swiss-Prot id (just the copy of it within the myGrid environment). The important thing is that the annotation workflow creates a relationship between its input (probe id/gene id) and the Swiss-Prot id. If a user has another workflow (using different services) that can create this relationship, she or he will want to know if it identifies the same Swiss-Prot id. Having two pieces of evidence to support some knowledge is better than one.
(The argument is essentially the same if the workflow result is the Swiss-Prot record rather than just the Swiss-Prot id.)
14. Provenance of knowledge Particularly in Biology where you are finding LINKS between items rather than generating new items. Generating relationships.
15. Provenance of knowledge
16. Provenance of knowledge
17. 20,000 feet and ground level Top Down provenance
What is going on?
Unification and summaries of collective provenance knowledge.
Collaborative, Awareness, Experience base, Scientific Corporate memory.
“What projects have something to do with human SNPs?”
“What experiments use the PSI-BLAST service regardless of version?” Bottom Up provenance
Where did this data object http://doh.dah.ac.uk/… come from?
Which version of Swiss-Prot was run in workflow http:/blah.ac.uk/…?
18. Provenance for People and Machines
Explicitly capture context, and gather and integrate provenance from different places, times, services, locations, etc. - and the domain “knowledge” provenance with the execution provenance.
19. 1. Explicitly capture Context Reuse methods and strategies (e.g., protocols)
Make explicit the situational bias that is normally implicit
Enable future generations of scientists to follow our work
To capture meaning, we must devise a way of representing concepts and their relationships
20. 1. Explicitly capture Context Using models and terms
that can be shared and interpreted
that are extensible and preclude premature restrictions
that are navigable and computationally processable
21. 2. Bridge islands of exported provenance Common data model
Universal identification mechanism
Shared conceptualisation through an ontology
Inference and reasoning
22. Not all exports are the same But we want to link together anyway.
Common data model
Universal identification mechanism
Shared conceptualisation through an ontology
Inference and reasoning
23. So we need to… Uniquely identify items through URIs and Life Science Identifiers (GSH/GSR/Handle.net…)
Explicitly expose provenance by assertions in a common data model…
Publish and share consensually agreed ontologies so we can share the provenance metadata and add in background knowledge…
Then we can query, filter, integrate and aggregate the provenance metadata …
and reason over it to infer more provenance metadata using rules …
and attribute trust to the provenance …
Flexibly so that do not cast in stone models and terms, and so can cope with different degrees of description.
24. W3C Metadata language/model Resource Description Framework Common model for metadata
Assertions as triples (subject, predicate, object) forming graphs.
Associate URIs (LSIDs) with other URIs (LSIDs).
Associate URIs with OWL concepts (which are URIs).
RDQL, repositories, integration tools, presentation tools
Query over, Link together, Aggregate, Integrate assertions.
Avoids pre-commitment
Self-describing
Incremental
Extensible
Advantage and drawback.
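A minimal sketch of the triple model the slide describes: assertions as (subject, predicate, object) tuples forming a graph, queried by pattern matching with wildcards in the spirit of RDQL. The URIs/LSIDs below are invented placeholders, not real myGrid identifiers.

```python
# Assertions as (subject, predicate, object) triples forming a graph.
triples = {
    ("lsid:3213", "rdf:type", "mygrid:GeneId"),
    ("lsid:3213", "mygrid:inputTo", "mygrid:workflow_486"),
    ("mygrid:workflow_486", "mygrid:produced", "lsid:7001"),
}

def query(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which workflows was lsid:3213 an input to?
print(query(s="lsid:3213", p="mygrid:inputTo"))
# [('lsid:3213', 'mygrid:inputTo', 'mygrid:workflow_486')]
```

Because triples are self-describing, new assertions can be added incrementally without pre-committing to a schema - the "advantage and drawback" the slide notes.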
25. Bridging islands Common data model
Universal identification mechanism
Shared conceptualisation through an ontology
Inference and reasoning
26. Bridging islands: Concepts and LSID Common data model
Universal identification mechanism
Shared conceptualisation through an ontology
Inference and reasoning
27. Continuum of expressivity
Concepts, roles, individuals, axioms
From simple frames to description logics
Sound and complete formal semantics
Compositional and property based
Reasoning to infer classification
Eas(ier) to extend and evolve and merge ontologies
A web language
Tools, tools, tools! W3C Ontology language/model: OWL
28. Bridging islands: Concepts and LSIDs Common data model
Universal identification mechanism
Shared conceptualisation through an ontology
Inference and reasoning
29. Bridging islands: Concepts and LSIDs Common data model
Universal identification mechanism
Shared conceptualisation through an ontology
Inference and reasoning
30. Layers of Knowledge Languages Complexity could be a problem
31. myGrid everything has a concept & LSID
32. Linking objects to objects via URIs and LSIDs
Just like DiscoveryNet.
34. Annotating a workflow log with concepts
35. Generating provenance A ws-info document which contains ontological descriptions associated with the inputs, outputs and services in a workflow, similar to the DAML-S profile.
This document has two roles: the myGrid registry uses it to advertise and hence discover workflows based on their semantics; the myGrid workbench environment uses it when interrogating the mIR for data inputs that semantically match a workflow, for providing configuration and default information for service parameters, and for identifying the data type of the workflow data results.
Figure 8 gives a screenshot of a ws-info document.
37. Figure 3. Portal interface to a user's personal, group, and community workspaces.
38. Two views of a gravity model concept from the HERO CODEX web tool
Figure 2. Two views of a gravity model concept (red nodes). An ontological description (left) shows how one geoscientist constructs such a model; a social network (right) reveals which users favor different instances of the model, with edge length suggesting the degree of support. (Concept graphing in Codex modified from open-source Touchgraph [www.touchgraph.com])
39. Collaboratory for Multi-Scale Chemical Science
Central panes of the CMCS Pedigree Browser showing the metadata and relationships of the selected data set.
Figure 4. CMCS “Pedigree Graph” portlet showing provenance relationships between resources (color coded by original relationship type).
40. Provenance dimensions connected by concepts and identifiers Infrastructure for the coordinated sharing of data and knowledge.
Developers create a distributed knowledge or data base for their particular domain-oriented applications.
The representation language, the communication protocols, and the access control and authentication are handled by the Semantic Web.
41. Reflections: annotations
The annotation metadata model for myGrid holdings is a graph
If it waddles like RDF and quacks like RDF, it’s RDF
Experiments in RDF scalability
Co-existence of RDF and other data models (relational)
Acquisition of annotations and adverts
Automated by mining WSDL docs, mining ws-info docs
Deep annotation works ok for bioinformatic service concepts (it’s an EMBL record) but…
Annotating with biologically meaningful concepts is harder
Data in the mIR (it’s a lymphocyte)
Manual annotation cost is high!
Service/workflow publication tools
Dealing with change
Ontology changes; service changes; annotations change.
Finding a service / workflow that will fulfil some task, e.g. aligning biological sequences.
Finding a service / workflow that will accept or produce some kind of data.
Type management when forming workflows
42. Random Thoughts Where does the knowledge come from (see Luc)?
How do we model trust (see Luc)?
Scalability of Semantic Web technologies?
Visualisation of knowledge (see Monica)?
What’s the lifecycle of provenance?
Different knowledge models for different disciplines?
Layers of provenance
Provenance that is domain knowledge
Provenance for context vs execution
People vs machine
Different models for different items but still needs to be integrated
Technologies for sharing and integrating that are flexible.
43. Talk provenance myGrid http://www.mygrid.org.uk
Jun Zhao, Mark Greenwood, Chris Wroe, Phil Lord, Chris Greenhalgh, Luc Moreau, Robert Stevens
Hero http://hero.geog.psu.edu/
William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal
Collaboratory for Multi-Scale Chemical Science (CMCS)
James D. Myers, Carmen Pancerella, Carina Lansing, Karen L. Schuchardt, Brett Didier
Chimera
Michael Wilde, Ian Foster
Knowledge Space
Novartis
And special thanks to Ian Cottam for heroic support when my laptop died yesterday afternoon.