
Knowledge and Provenance: A knowledge model perspective

Talk roadmap . . . Knowledge for Provenance. The Provenance of Knowledge. Knowledge technologies. Where do knowledge assertions come from? What is this provenance about and for? How do we represent knowledge for and about provenance? my Context. Knowledge-driven Middleware for data intensive


Presentation Transcript


    1. Knowledge and Provenance: A knowledge model perspective Carole Goble, University of Manchester, UK It's a framework for application integration. Integrative encyclopedia of life: it's the same as myGrid. Basic strategy is annotation, integration and presentation. It's a middleware project; you don't have to buy into everything. It's toolkits. Provenance and knowledge: musing on what users might want. Given the emphasis on "characteristics of provenance-in-the-large": From a user's perspective, one of the main uses of provenance data could be to bring the user up to date with the current working context, or to summarise the areas of activity in a given period. - What has happened (within this project) since I last logged on to my experimental environment (virtual organisation)? - What have been the new experiments between date X and date Y? For situations such as "what has happened of interest to me", a provenance record of the relevant user proxy could provide the basis for answers. For general queries, where a user may be exploring an environment from different perspectives, there is no reason for the user to distinguish between provenance metadata and other metadata. A key aspect is that this creates a requirement for queries over large volumes of metadata (and data). Possible outputs include: workflow_486 has been run 37 times in the last month; workflow_486 has been run 7 times by user_34; workflow_486 has been run 7 times by user_34 with lsid:3213 as input_geneId; workflow_486 has been run 5 times by user_21; ... Affymetrix_probe_ids have been used 1041 times this month; lsid:3213 has been used 45 times; lsid:3213 has been input to workflow_486 7 times; ...
    This contrasts with provenance that concentrates on the lineage or origin of a specific data item. If the lineage is to allow others to verify in-silico experiments, then the user and time are usually considered irrelevant. In addition, the value of the "coarse provenance queries" above is that the user does not need to know exactly which data is of interest (compared with asking the detailed "Where did this data come from, and is it up to date?"). Mark. PS: This line of thinking raises one area of interest: the relationship between provenance metadata and other metadata.
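
The coarse, summary-style queries listed in the notes above can be sketched as simple counts over a log of workflow runs. This is a minimal illustration only: the `RunRecord` shape, the `summarise` helper and the sample records are assumptions made for the example, not the myGrid data model or real usage figures.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One enactment of a workflow (hypothetical record shape)."""
    workflow: str
    user: str
    inputs: tuple  # (parameter_name, identifier) pairs


# Illustrative records only; not real myGrid data.
LOG = [
    RunRecord("workflow_486", "user_34", (("input_geneId", "lsid:3213"),)),
    RunRecord("workflow_486", "user_34", (("input_geneId", "lsid:3213"),)),
    RunRecord("workflow_486", "user_21", (("input_geneId", "lsid:9001"),)),
    RunRecord("workflow_112", "user_34", (("input_geneId", "lsid:3213"),)),
]


def summarise(log):
    """Count runs per workflow, runs per (workflow, user), and uses per input item."""
    runs_by_workflow = Counter(r.workflow for r in log)
    runs_by_user = Counter((r.workflow, r.user) for r in log)
    uses_of_item = Counter(ident for r in log for _, ident in r.inputs)
    return runs_by_workflow, runs_by_user, uses_of_item


if __name__ == "__main__":
    by_workflow, by_user, by_item = summarise(LOG)
    for name, count in by_workflow.items():
        print(f"{name} has been run {count} times")
```

Note that none of these counts needs the user to name a specific data item in advance, which is the point the notes make about coarse queries.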

    2. Talk roadmap

    3. my Context Knowledge-driven Middleware for data intensive in silico experiments in biology http://www.mygrid.org.uk

    4. A real bio provenance log

    5. Any and every experimental item attracts provenance (so long as you can ID it). Integrating components. Experimental design components: workflow specifications; query specifications; notes describing objectives; applications; databases; relevant papers; the web pages of important workers; services; and so on. Experimental instances that are records of enacted experiments: data results; a history of services invoked by a workflow engine; instances of services invoked; parameters set for an application; notes commenting on the results; and so on. Experimental glue that groups and links design and instance components: a query and its results; a workflow linked with its outcome; links between a workflow and its previous and subsequent versions; a group of all these things linked to a document discussing the conclusions of the biologist; and so on.

    6. Provenance is metadata … intended for sharing, retrieving, integrating, aggregating and processing; generated with the hope that it is comprehensive enough to be future-proofed; recorded for users we do not yet know, who will likely use the object in a different way; machine computational: free text is of limited help. Provenance is the knowledge that makes an item interpretable and reusable within a context, and an item reproducible or at least repeatable. It's part of the information model of any system.

    8. Provenance is contextual metadata. We look at the same things in different ways and different things in the same way. Our data alone does not describe our work. We have to capture this context. HEROINE's mission: describing and sharing how different people, in different places and at different times, conceive of elements of human environment interaction; enabling the discovery of new concepts and relationships that aid these descriptions; developing tools and approaches for hero knowledge management.

    9. Provenance forms. Derivations: a path like a workflow, script or query. Linking items, usually in a directed graph. An explanation of when, who, how something was produced. Execution. Process-centric. Annotations: attached to items or collections of items, in a structured, semi-structured or free text form. Annotations on one item or linking items. An explanation of why, when, where, who, what, how. Data-centric. A workflow or a derivation graph: a history/audit; a recipe/plan. Play forward: how do I get from here? Play backwards: how do I get to here? Annotations and notes about experiments by scientists (recording the process of biological experiments for e-Science, the purpose and results of experiments). Derivation paths: a workflow or database query between data inputs and data outputs; a program and its parameters to a data result; a path between an original and a refined workflow template; etc. Every item in the myGrid Information Repository has (Dublin Core based) provenance attributes.
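
The "play forward" and "play backwards" readings of a derivation graph can be sketched as reachability over the same set of directed edges, followed in opposite directions. The item names and the `build` and `reachable` helpers below are hypothetical and chosen only for illustration; they do not describe how myGrid stores derivations.

```python
from collections import defaultdict

# Hypothetical derivation edges: (source item, item derived from it).
EDGES = [
    ("input_sequence", "alignment"),
    ("alignment", "tree"),
    ("alignment", "report"),
    ("parameters", "alignment"),
]


def build(edges):
    """Index the edges in both directions."""
    forward, backward = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)
    return forward, backward


def reachable(start, adjacency):
    """Return every item reachable from start by following adjacency."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


if __name__ == "__main__":
    forward, backward = build(EDGES)
    # Play forward: what can I get to from here?
    print(sorted(reachable("input_sequence", forward)))
    # Play backwards: how did I get to here?
    print(sorted(reachable("report", backward)))
```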

    10. Workflows as in silico experiments. Freefluo workflow enactment engine: WSFL, Scufl. Semantic workflow discovery: finding workflows that others have done, and that I have done myself. Semantic service discovery: finding classes of services; guiding service composition (we don't do automated composition). Dynamic workflow enactment: service discovery and invocation; choose service instances when running workflow; user involvement. Soaplab: SOAP-based Analysis Web Service. Soaplab is a set of Web Services providing programmatic access to some applications on remote computers. Because such applications, especially in the scientific environment where Soaplab was born, usually analyze data, Soaplab is often referred to as an Analysis (Web) Service. Soaplab was developed at the European Bioinformatics Institute (EBI), within the eScience initiative, as a component of the myGrid project. Soaplab is both a specification for an Analysis Service (based on other approved specifications, see the Architecture Guide) and its implementation. It is freely available for downloading, but bear in mind that installing this Web Service does not give you any analyses: they are not part of Soaplab. The EBI has a Soaplab service running on top of several tens of analyses (most of them coming from EMBOSS, an independent package of high-quality free Open Source software for sequence analysis), but it is an experimental service which may not have 24/7 availability.

    11. Semantic discovery – services & workflows

    12. Provenance forms in myGrid. Derivations: the FreeFluo Workflow Enactment Engine provides a detailed provenance record, stored in the myGrid Information Repository (mIR), describing what was done, with what services and when. XML document, soon to be an RDF model. Annotations: every mIR object has Dublin Core provenance properties described in an attribute value model. Annotations and notes about experiments by scientists (recording the process of biological experiments for e-Science, the purpose and results of experiments). Derivation paths: a workflow or database query between data inputs and data outputs; a program and its parameters to a data result; a path between an original and a refined workflow template; etc. Every item in the myGrid Information Repository has (Dublin Core based) provenance attributes.

    13. Provenance of data. Operational execution trail. The provenance of connections/relationships. An example: in the Graves' disease example, the annotation pipeline workflow takes in an Affymetrix probe id (indicating a gene) and its outputs include: an embl id, a swissprot id, a set of OMIM ids, a set of GO terms. For simplicity let us just take the swissprot id. In the user's environment there is now a new data item, the swissprot id. In one respect we can say that the origin of that id was the instance of the annotation workflow, and that it depended on the probe id (gene id) given as input. (The detailed provenance record of the workflow instance will identify the "transformations" that led from the input to the output.) However, the annotation pipeline did not create the swissprot id (just the copy of it within the myGrid environment). The important thing is that the annotation workflow creates a relationship between its input (probe id/gene id) and the swissprot id. If a user has another workflow (using different services) that can create this relationship, she or he will want to know if it identifies the same swissprot id. Having two pieces of evidence to support some knowledge is better than one. (The argument is essentially the same if the workflow result is the swissprot record rather than just the swissprot id.)
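
One way to picture "provenance of a relationship" is to record, for each asserted link between an input and an output, which workflow runs asserted it; a link backed by two independent runs then carries two pieces of evidence. The identifiers, the `hasSwissProtId` relation name, the run names and the `evidence` helper are all invented for this sketch.

```python
from collections import defaultdict

# Each assertion: (input id, relation, output id, workflow run that asserted it).
# All identifiers and run names are invented for illustration.
ASSERTIONS = [
    ("probe:123", "hasSwissProtId", "sp:P00001", "annotation_workflow/run1"),
    ("probe:123", "hasSwissProtId", "sp:P00001", "alternative_workflow/run7"),
    ("probe:456", "hasSwissProtId", "sp:P00002", "annotation_workflow/run2"),
]


def evidence(assertions):
    """Group the asserting runs under the relationship they support."""
    support = defaultdict(set)
    for subj, rel, obj, source in assertions:
        support[(subj, rel, obj)].add(source)
    return support


if __name__ == "__main__":
    for link, sources in evidence(ASSERTIONS).items():
        print(link, "supported by", len(sources), "run(s)")
```

Keying on the whole relationship rather than on the output id alone reflects the point in the notes: the workflow did not create the swissprot id, it created the link to it.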

    14. Provenance of knowledge. Particularly in Biology, where you are finding LINKS between items rather than generating new items. Generating relationships.

    15. Provenance of knowledge

    16. Provenance of knowledge

    17. 20,000 feet and ground level Top Down provenance What is going on? Unification and summaries of collective provenance knowledge. Collaborative, Awareness, Experience base, Scientific Corporate memory. “What projects have something to do with human SNPs?” “What experiments use the PSI-BLAST service regardless of version?” Bottom Up provenance Where did this data object http://doh.dah.ac.uk/… come from? Which version of Swiss-Prot was run in workflow http:/blah.ac.uk/…?

    18. Provenance for People and Machines. Explicitly capture context, and gather and integrate provenance from different places, times, services, locations, etc. And integrate the domain "knowledge" provenance with the execution provenance.

    19. 1. Explicitly capture Context Reuse methods and strategies (e.g., protocols) Make explicit the situational bias that is normally implicit Enable future generations of scientists to follow our work To capture meaning, we must devise a way of representing concepts and their relationships

    20. 1. Explicitly capture Context Using models and terms that can be shared and interpreted that are extensible and preclude premature restrictions that are navigable and computationally processable

    21. 2. Bridge islands of exported provenance. Common data model. Universal identification mechanism. Shared conceptualisation through an ontology. Inference and reasoning.

    22. Not all exports are the same. But we want to link together anyway. Common data model. Universal identification mechanism. Shared conceptualisation through an ontology. Inference and reasoning.

    23. So we need to… Uniquely identify items through URIs and Life Science Identifiers (GSH/GSR/Handle.net…). Explicitly expose provenance by assertions in a common data model… Publish and share consensually agreed ontologies so we can share the provenance metadata and add in background knowledge… Then we can query, filter, integrate and aggregate the provenance metadata … and reason over it to infer more provenance metadata using rules … and attribute trust to the provenance … Flexibly, so that we do not cast models and terms in stone, and so can cope with different degrees of description.
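
"Reason over it to infer more provenance metadata using rules" can be illustrated with a single rule applied until nothing new appears: if C was derived from B and B from A, then C was derived from A. The `derivedFrom` predicate, the item names and the `infer_transitive` helper are assumptions for this sketch; the talk does not specify a rule language.

```python
def infer_transitive(assertions, predicate):
    """Apply one transitivity rule to (subject, predicate, object) assertions
    until no new assertion can be inferred."""
    closed = set(assertions)
    while True:
        new = {
            (a, predicate, d)
            for (a, p1, b) in closed if p1 == predicate
            for (c, p2, d) in closed if p2 == predicate and c == b
        } - closed
        if not new:
            return closed
        closed |= new


# Hypothetical assertions; identifiers are illustrative only.
FACTS = {
    ("result_c", "derivedFrom", "intermediate_b"),
    ("intermediate_b", "derivedFrom", "input_a"),
    ("result_c", "dc:creator", "user_34"),
}

if __name__ == "__main__":
    for triple in sorted(infer_transitive(FACTS, "derivedFrom") - FACTS):
        print("inferred:", triple)
```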

    24. W3C Metadata language/model Resource Description Framework Common model for metadata Assertions as triples (subject, predicate, object) forming graphs. Associate URIs (LSIDs) with other URIs (LSIDs). Associate URIs with OWL concepts (which are URIs). RDQL, repositories, integration tools, presentation tools Query over, Link together, Aggregate, Integrate assertions. Avoids pre-commitment Self-describing Incremental Extensible Advantage and drawback.
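
The triple model on this slide can be sketched without any RDF library: assertions are (subject, predicate, object) tuples, and a query is a pattern in which unspecified positions act as wildcards. The `example.org` identifiers, the `ex:` property names and the `match` helper are hypothetical; a real deployment would use an RDF store and a query language such as the RDQL mentioned above.

```python
# Hypothetical identifiers; example.org LSIDs and ex: names are illustrative.
WF = "urn:lsid:example.org:workflow:486"

TRIPLES = {
    ("urn:lsid:example.org:data:1", "ex:producedBy", WF),
    ("urn:lsid:example.org:data:2", "ex:producedBy", WF),
    ("urn:lsid:example.org:data:1", "rdf:type", "ex:ProteinIdentifier"),
    (WF, "dc:creator", "ex:user_34"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


if __name__ == "__main__":
    # Which items did the workflow produce?
    for subject, _, _ in match(TRIPLES, p="ex:producedBy", o=WF):
        print(subject)
```

Because every assertion has the same shape, triples exported by different sources can be merged by set union, which is the "avoids pre-commitment" and "incremental" property the slide lists.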

    25. Bridging islands. Common data model. Universal identification mechanism. Shared conceptualisation through an ontology. Inference and reasoning.

    26. Bridging islands: Concepts and LSID. Common data model. Universal identification mechanism. Shared conceptualisation through an ontology. Inference and reasoning.

    27. Continuum of expressivity Concepts, roles, individuals, axioms From simple frames to description logics Sound and complete formal semantics Compositional and property based Reasoning to infer classification Eas(ier) to extend and evolve and merge ontologies A web language Tools, tools, tools! W3C Ontology language/model: OWL

    28. Bridging islands: Concepts and LSIDs. Common data model. Universal identification mechanism. Shared conceptualisation through an ontology. Inference and reasoning.

    29. Bridging islands: Concepts and LSIDs. Common data model. Universal identification mechanism. Shared conceptualisation through an ontology. Inference and reasoning.

    30. Layers of Knowledge Languages. Complexity could be a problem.

    31. myGrid everything has a concept & LSID

    32. Linking objects to objects via URIs and LSIDs. Just like DiscoveryNet.

    34. Annotating a workflow log with concepts

    35. Generating provenance. A ws-info document contains ontological descriptions associated with the inputs, outputs and services in a workflow, similar to the DAML-S profile. This document has two roles: the myGrid registry uses it to advertise and hence discover workflows based on their semantics; the myGrid workbench environment uses it when interrogating the mIR for data inputs that semantically match a workflow, for providing configuration and default information for service parameters, and for identifying the data type of the workflow data results. Figure 8 gives a screenshot of a ws-info document.

    37. Figure 3. Portal interface to a user's personal, group, and community workspaces.

    38. Two views of a gravity model concept from the Hero CODEX web tool. Figure 2. Two views of a gravity model concept (red nodes). An ontological description (left) shows how one geoscientist constructs such a model; a social network (right) reveals which users favor different instances of the model, with edge length suggesting the degree of support. (Concept graphing in Codex modified from open-source Touchgraph [www.touchgraph.com])

    39. Collaboratory for Multi-Scale Chemical Science. Central panes of the CMCS Pedigree Browser showing the metadata and relationships of the selected data set. 4. CMCS "Pedigree Graph" portlet showing provenance relationships between resources (color coded by original relationship type).

    40. Provenance dimensions connected by concepts and identifiers. Infrastructure for the coordinated sharing of data and knowledge. Developers create a distributed knowledge or data base for their particular domain-oriented applications. The representation language, the communication protocols, and the access control and authentication are handled by the Semantic Web.

    41. Reflections: annotations. The annotation metadata model for myGrid holdings is a graph. If it waddles like RDF and quacks like RDF, it's RDF. Experiments in RDF scalability. Co-existence of RDF and other data models (relational). Acquisition of annotations and adverts: automated by mining WSDL docs, mining ws-info docs. Deep annotation works ok for bioinformatic service concepts (it's an EMBL record) but… annotating with biologically meaningful concepts is harder: data in the mIR (it's a lymphocyte). Manual annotation cost is high! Service/workflow publication tools. Dealing with change: ontology changes; service changes; annotations change. Finding a service / workflow that will fulfil some task, e.g. aligning of biological sequences. Finding a service / workflow that will accept or produce some kind of data. Type management when forming workflows.

    42. Random Thoughts. Where does the knowledge come from (see Luc)? How do we model trust (see Luc)? Scalability of Semantic Web technologies? Visualisation of knowledge (see Monica)? What's the lifecycle of provenance? Different knowledge models for different disciplines? Layers of provenance. Provenance that is domain knowledge. Provenance for context vs execution. People vs machine. Different models for different items, but they still need to be integrated. Technologies for sharing and integrating that are flexible.

    43. Talk provenance. myGrid http://www.mygrid.org.uk: Jun Zhao, Mark Greenwood, Chris Wroe, Phil Lord, Chris Greenhalgh, Luc Moreau, Robert Stevens. Hero http://hero.geog.psu.edu/: William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal. Collaboratory for Multi-Scale Chemical Science (CMCS): James D. Myers, Carmen Pancerella, Carina Lansing, Karen L. Schuchardt, Brett Didier. Chimera: Michael Wilde, Ian Foster. Knowledge Space: Novartis. And special thanks to Ian Cottam for heroic support when my laptop died yesterday afternoon. Hero codex web tool.
