
All the King's Horses, and All the King's Men

Defragmenting Biological Information Through BioMOBY. Mark Wilkinson, Plant Biotechnology Institute, National Research Council of Canada, mwilkinson@gene.pbi.nrc.ca.


Presentation Transcript


  1. All the King's Horses, and All the King's Men: Defragmenting Biological Information Through BioMOBY. Mark Wilkinson, Plant Biotechnology Institute, National Research Council of Canada, mwilkinson@gene.pbi.nrc.ca

  2. The Holy Grail: Align the promoters of all serine/threonine kinases involved exclusively in some hypothetical signaling cascade which leads to nitrogen fixation. Retrieve and align 2000nt 5' from every serine/threonine kinase in Fabaceae expressed exclusively in the root cortex whose expression increases 5X or more upon infection by Rhizobium but is not affected by osmotic or heavy-metal stresses and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species.

  3. Yeah… we wish! ;-) Unfortunately, the current situation is…

  4. What's wrong with this picture??

  5. So how do we get there? There are already some landmarks of successful data integration projects we can use as a guide. They have several things in common. In particular they define: • A question format • A response format • How to describe objects in the question • A common location to ask questions

  6. Successful Integration Tools (complete and in progress): • DAS: "Distributed Annotation System" • ISYS: Integration of Desktop Tools • GO: Gene Ontology Consortium • Continues to expand • …and numerous other controlled vocabulary and ontology projects… • BioMOBY: Integration of online biological databases and analysis services

  7. Problems left over from GO, DAS & ISYS? • Many biological data types are not "covered" • e.g. microarray, two-hybrid, citations • How to find this information • Generally must do each query by hand • Each database has a different "look and feel". • Often one at a time. • Information returned is often still in willy-nilly formats • Cut-n-paste, cut-n-paste, cut-n-paste…

  8. What new 'tool' is needed? • A mechanism by which a researcher is able to simultaneously interact with multiple sources of highly disparate biological data regardless of the underlying data format or schema. • The mechanism must also allow for the automated, dynamic identification of data, and the relationships between data from different sources.

  9. What lessons do we learn from GO, DAS & ISYS? • Be Open Source!! • Include a wide variety of organisms in development • Use a central registry • Pass predictable data objects • Define query/response formats • Use/Abuse ontologies until it hurts • ‘KOSH’: Keep Objects Simple, Honey!

  10. Do You MOBY? • Open-source interoperability research project • "MOBY-DIC" Conference, Emma Lake, 2001. • "Model Organism Bring Your-own Database Interface Conference" • Currently running as a prototype on a public server • Modeled after those successful integration projects, with great advice from them (thanks!) • Compatible with most other XML-based object model standards.

  11. Michael Ashburner, Jason Stewart, Damian Gessler and Midori Harris

  12. Suzanna Lewis Paul Gordon

  13. Lincoln Stein, Martin Weems, Lukas Mueller

  14. AAAaaghhhh! Brian Gilman!

  15. Missing, but involved: • Judy Blake (Mouse) • Heiko Schoof (MIPS) • David Block (Novartis) • Richard Bruskiewich (IRRI) • Matthew Links (PBI-NRC) • Stanford Microarray • Expanding collaboration with myGrid & I3C • Mailing list > 55 “adherents”

  16. Overview [diagram; labels only]: MOBY data hosts & services (Align, Phylogeny, Primers, Sequence, Express., Protein, Alleles, …) and MOBY Central. Other labels: Gene names, alignment, Sequence.

  17. Components of the MOBY system • MOBY Objects: the data itself (XML) • MOBY Central: central registry of servers • MOBY Server: any compliant service • MOBY Client: the bit that you see!

  18. MOBY-Objects • Desire (initially) ~200 biological data types • Sequence • GO_Term • Citation • XML objects, defined by a common set of W3C XML Schema (XSD) stored at MOBY-Central. • An Object minimally consists of a "MOBY-Triple": • ObjectType, Namespace, ID • e.g. Sequence, GenBank/Acc, AC34576

  19. MOBY Objects: The ISYS lesson in minimalism • A base MOBY object consists of a "Triple": instance/namespace/id <Object namespace="GenBank/Acc" id="D21125.1"/> • A different instance of the same data <VirtualSequence namespace="GenBank/Acc" id="D21125.1"> <Length>562</Length> </VirtualSequence>
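
A minimal sketch of how the two objects on the slide above could be built programmatically, assuming Python's standard `xml.etree.ElementTree`. The helper name `make_object` and its parameters are hypothetical and not part of any BioMOBY API; the element and attribute names are copied from the slide.

```python
import xml.etree.ElementTree as ET


def make_object(object_type, namespace, identifier, payload=None):
    """Build a MOBY-style element (hypothetical helper, not a BioMOBY API).

    The tag carries the object type; the attributes carry namespace and id.
    Optional payload entries become child elements.
    """
    element = ET.Element(object_type, {"namespace": namespace, "id": identifier})
    for name, value in (payload or {}).items():
        child = ET.SubElement(element, name)
        child.text = str(value)
    return element


# The base Triple and the VirtualSequence instance from the slide.
base = make_object("Object", "GenBank/Acc", "D21125.1")
virtual = make_object("VirtualSequence", "GenBank/Acc", "D21125.1", {"Length": 562})

print(ET.tostring(base, encoding="unicode"))
print(ET.tostring(virtual, encoding="unicode"))
```

Both elements share the same namespace and id, so a client that only understands the base Triple can still identify the record and ignore the `Length` payload.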

  20. MOBY-Objects • In addition to the Triple, two additional elements may appear in an object: • Cross Reference Information Block ("CRIB") • Payload • CRIB contains relevant cross-referencing Triples • Payloads carry the actual "data" • e.g. "Length" in the previous example • Should be as lightweight as possible • Vary according to the Instance of the object

  21. MOBY-Objects • Unlike many other web-services projects, MOBY Classes are used as both the Query and the Response to a service transaction. • Query and Response are enveloped in a "MOBY envelope" • Contains authority information about the service provider • Collects multiple responses according to their source. • Gives biologists the ability to judge data quality

  22. Response mock-up (parts labelled on the slide: Envelope, Class, X-Refs, Payload). The Sequence object for Apetala3 might be: <MOBY Authority="ncbi.nlm.nih.gov" log="Query/ID" id="1334543"> <Sequence namespace="GenBank/Acc" id="D21125.1"> <CrossReference> <Object namespace="PubMed/ID" id="7948893"/> <Object namespace="SwissProt/ID" id="BAA65.1"/> <Object namespace="TAIR/Locus" id="AP3"/> <Object namespace="GO/Acc" id="GO:0001835"/> <Object namespace="EMBL/ID" id="AF056541"/> </CrossReference> <Length>876</Length> <SequenceString>gatcaatcca tgttagtttc taactgtggc caacttagtt …. </SequenceString> </Sequence> </MOBY>
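
To illustrate how a client might take the mock-up above apart, here is a rough sketch using Python's standard `xml.etree.ElementTree`. It assumes the response is well-formed XML (all attribute values quoted). The function name `summarize` is hypothetical, and the sequence string is truncated exactly as on the slide.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the slide's mock-up; the sequence is truncated on the slide.
RESPONSE = """
<MOBY Authority="ncbi.nlm.nih.gov" log="Query/ID" id="1334543">
  <Sequence namespace="GenBank/Acc" id="D21125.1">
    <CrossReference>
      <Object namespace="PubMed/ID" id="7948893"/>
      <Object namespace="SwissProt/ID" id="BAA65.1"/>
      <Object namespace="TAIR/Locus" id="AP3"/>
      <Object namespace="GO/Acc" id="GO:0001835"/>
      <Object namespace="EMBL/ID" id="AF056541"/>
    </CrossReference>
    <Length>876</Length>
    <SequenceString>gatcaatcca tgttagtttc taactgtggc caacttagtt ....</SequenceString>
  </Sequence>
</MOBY>
"""


def summarize(envelope_xml):
    """Split an envelope into authority, Triple, cross-references and payload."""
    envelope = ET.fromstring(envelope_xml)
    obj = envelope[0]  # the single object carried inside the envelope
    triple = (obj.tag, obj.get("namespace"), obj.get("id"))
    xrefs = [(x.get("namespace"), x.get("id"))
             for x in obj.findall("CrossReference/Object")]
    payload = {child.tag: (child.text or "").strip()
               for child in obj if child.tag != "CrossReference"}
    return envelope.get("Authority"), triple, xrefs, payload


authority, triple, xrefs, payload = summarize(RESPONSE)
print(authority, triple)
for namespace, identifier in xrefs:
    print("cross-reference:", namespace, identifier)
```

Each cross-reference is itself a namespace/id pair, so it can be fed back into the registry as a new query.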

  23. MOBY-Servers: • Accepts MOBY Object(s) as queries via SOAP • Returns MOBY Object(s) in response via SOAP • for the moment, only using HTTP protocol • Query/Response Object types + Protocol = "Port" • Service provider registers with MOBY Central • "Port" • Service Type (from service hierarchy) • URL of service • Human-readable description of service

  24. Post-Registration Specification of Services • WSDL service definition for each service • Web Service Description Language – XML based • WSDL has three main components: • Class : XSD template of the Objects' data structures • Port : Input/output object Class & protocol (e.g. http) • Service : the actual URL & associated "Port" • WSDL is auto-generated at MOBY-Central

  25. MOBY-Central (monolithic!…) • Repository for Object schemas • Repository for the Object Ontology • Repository for the Service Ontology • Repository of the URL for every service from every Server, and their input and output Object types • Answers Client queries based on Input, Output, Service Type and/or Service Provider • Builds, "on the fly", an appropriate WSDL service definition document in response to a Client query
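
The registry role described above can be sketched as a small in-memory lookup. This is only an illustration under stated assumptions: the class names `Service` and `Registry`, the field names, and the example services and URLs are all hypothetical, and the real MOBY Central additionally stores schemas and ontologies and generates WSDL, none of which is modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One registered service (hypothetical record layout)."""
    name: str
    input_type: str
    output_type: str
    service_type: str
    provider: str
    url: str


class Registry:
    """Toy stand-in for the central registry: register, then query by any field."""

    def __init__(self):
        self._services = []

    def register(self, service):
        self._services.append(service)

    def find(self, **criteria):
        # A service matches when every supplied field equals the requested value.
        return [s for s in self._services
                if all(getattr(s, field) == value for field, value in criteria.items())]


registry = Registry()
# Hypothetical entries; example.org URLs are placeholders.
registry.register(Service("get_sequence", "Object", "Sequence", "Retrieval",
                          "example.org", "http://example.org/moby/get_sequence"))
registry.register(Service("align", "Sequence", "Alignment", "Analysis",
                          "example.org", "http://example.org/moby/align"))

for service in registry.find(input_type="Sequence"):
    print(service.name, service.url)
```

Querying by input type, output type, service type, or provider mirrors the four criteria the slide lists for client queries.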

  26. Basically, the Registration Process is: [diagram; labels only] MOBY Services (Service @ URL_1, Service @ URL_2) and MOBY Central. Objects: X, Y, Z. Services: 1, 2, 3. Registered entries: X to Y via 1 @ URL_1; Y to Z via 2 @ URL_2.

  27. Response Example again… <MOBY Authority="ncbi.nlm.nih.gov" log="Query/ID" id="1334543"> <Sequence namespace="GenBank/Acc" id="D21125.1"> <CrossReference> <Object namespace="PubMed/ID" id="7948893"/> <Object namespace="SwissProt/ID" id="BAA65.1"/> <Object namespace="TAIR/Locus" id="AP3"/> <Object namespace="GO/Acc" id="GO:0001835"/> <Object namespace="EMBL/ID" id="AF056541"/> </CrossReference> <Length>876</Length> <SequenceString>gatcaatcca tgttagtttc taactgtggc caacttagtt …. </SequenceString> </Sequence> </MOBY>

  28. What a simple client might display [mock-up; labels only]: Apetala3 cDNA_Sequence: gatcaatcca tgttagtttc aaaacaacag taactgtggc caacttagtt ttgaaacaac …… Related Data Available (New Info!): Locus, Citation, Primary_AA, Primary_Seq, GO_Annotation, Homologues, Expression profile.

  29. Where does new info come from? • "SERVICE AMPLIFICATION" [diagram; labels only] Question: "What can I do with Medline, EMBL, GO …???" Answers: "I can find functionally similar genes from GO_Ids"; "I can get Expression data from EMBL ID". Other labels: MOBY Central; Medline, SwissProt, PubMed, EMBL, GO_Annotation; Citation, AA, Locus, DNA, GO_Annot, Homologues, Expression; Sequence attccg ggtcac.
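
One way to picture "service amplification" is as a lookup from each cross-reference namespace to the services that accept it. The sketch below is an assumption-laden illustration: the table `SERVICES_BY_NAMESPACE` and the function `amplify` are hypothetical names, and only the two service descriptions quoted on the slide are included.

```python
# Hypothetical lookup table: which services accept which namespace.
# The two descriptions are the ones quoted on the slide.
SERVICES_BY_NAMESPACE = {
    "GO/Acc": ["find functionally similar genes"],
    "EMBL/ID": ["get expression data"],
}


def amplify(cross_references):
    """Map each cross-reference to the services that could consume it."""
    offers = {}
    for namespace, identifier in cross_references:
        for service in SERVICES_BY_NAMESPACE.get(namespace, []):
            offers.setdefault((namespace, identifier), []).append(service)
    return offers


# Cross-references taken from the response mock-up on slide 22.
xrefs_in_hand = [
    ("PubMed/ID", "7948893"),
    ("GO/Acc", "GO:0001835"),
    ("EMBL/ID", "AF056541"),
]

for (namespace, identifier), services in amplify(xrefs_in_hand).items():
    print(namespace, identifier, "->", services)
```

Every object returned carries more cross-references, so each round of lookups can widen the set of services on offer.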

  30. The Holy Grail: Retrieve and align 2000nt 5' from every serine/threonine kinase in Fabaceae expressed exclusively in the root cortex whose expression increases 5X or more upon infection by Rhizobium but is not affected by osmotic or heavy-metal stresses and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species.

  31. Workflow to solve query [diagram; labels only]: Keyword search: Serine/Threonine Kinase; GO Database; GO/Acc GO:0004674; Get Serine/Threonine Kinase; ProteinDomain Object; prints, Prosite; BeanGene/Acc; MAGE/ML objects; Expression objects; MOBY Central; SMD, GO, BG.

  32. And so on… • Get GO IDs for the term "cell cycle" • Use GO IDs to retrieve sequences • Use some homology or alignment service to do the homology testing… (Emboss via MOBY) • End up with a set of valid sequence accession numbers • Obtain the sequence coordinates for those accessions (MOBY interface to DAS) • Use DAS over MOBY to retrieve the 2000nt upstream • Use an alignment service to align them

  33. Integration of the various tools to solve this problem: • MOBY finds the data for you, creates the queries, and retrieves the responses. • CRIB provides entry into related data sources • GO allows you to move from organism to organism based on a guaranteed biological descriptor. • DAS enables sequence retrieval from genome databases in a generic, coordinate-based manner.

  34. Why have Class and Service Ontologies? • Discovery of new sub-classes of services • "BLAST" finds tblastx, tblastn, psi-blast, and marks_super_blast. • "Alignment" finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch • Allows backward-compatibility with simpler clients: • 'AnnotatedContig' object contains a Sequence object: AnnotatedContig IS A Sequence • Clients extract only the bits that they understand • Expanded selection of services presented based on expansion of in-hand object
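
The ontology lookups described above can be sketched with a simple child-to-parent table. The term names come from the slide; the functions `descendants` and `is_a` are hypothetical. Two assumptions are made here and are not stated in the talk: "Blast" and "BLAST" are treated as the same term, and a search on a term is assumed to return sub-classes transitively.

```python
# Child -> parent links for the service ontology (terms from the slide).
# Assumption: "Blast" under "Alignment" is the same node as "BLAST".
SERVICE_PARENT = {
    "ClustalW": "Alignment",
    "BLAST": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
    "marks_super_blast": "BLAST",
}

# Child -> parent link for the object ontology (from the slide).
OBJECT_PARENT = {"AnnotatedContig": "Sequence"}


def descendants(parent_of, term):
    """Return every term below `term`, following links transitively (assumed)."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in parent_of.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def is_a(parent_of, term, ancestor):
    """True when `term` equals `ancestor` or inherits from it."""
    while term is not None:
        if term == ancestor:
            return True
        term = parent_of.get(term)
    return False


print(sorted(descendants(SERVICE_PARENT, "BLAST")))
print(is_a(OBJECT_PARENT, "AnnotatedContig", "Sequence"))
```

With `is_a`, a client holding an AnnotatedContig can also be offered every service registered for Sequence, which is the backward-compatibility point the slide makes.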

  35. What about the Object Models? • Are we duplicating effort? • The MOBY payload can and will be used to transport objects from other ontologies or systems (e.g. OMG, MAGE, free-text), thus we are fully compatible. • Whenever possible, the lightweight MOBY objects will be designed to be parents of one of the heavier objects from another ontology • We are aware of, and follow developments in other object-creation projects • The lightweight nature of the base MOBY objects necessitates the creation of our own object models.

  36. What about the I3C? Why not use UDDI? • It is far too early in the evolution of biological web-services for anyone to claim to have the 'best solution'. Research is a Good Thing ™ • BioMOBY attempts to approach data discovery from a unique "biologist's" perspective • UDDI is not open-source…really… • UDDI has a business-centric API, and (IMHO) is bulky and awkward. • Nothing precludes us from using UDDI in the future. • I3C members do participate in BioMOBY development.

  37. How long will this take? • MOBY is a product of the host community. • high degree of acceptance & enthusiasm! • MOBY Services are exceedingly simple to generate. • Host institutions can gradually implement MOBY • MOBY is 'modular' • Difficult to be obstructive to implementation • In the short-term, proxies can be written around non-participating CGI-based interfaces

  38. Current Status: • MOBY Central prototype registry running • Midori (GO) & myGrid participants are working on unifying their Service ontologies. • Lukas (TAIR) and I are working on the Object ontology • Brian Gilman porting MOBY Central to Java • On June 5th, the first MOBY Service was discovered and automatically executed by a simple MOBY Client! • ~10 services exist spanning 3 institutions, and growing… … things will move quickly from here on! • Award from Genome Prairie to pursue BioMOBY research • CURRENTLY LOOKING FOR MATCHING FUNDS FROM EXTERNAL SPONSOR (hint hint :-) )

  39. Planned Research Themes (Canadian) • "Core" BioMOBY development • MOBY Central • Object definitions • Modification of existing UIs and creation of a new MOBY-specific client. • Creation of HPC services running over MOBY • Annotation pipelines • Analysis "Wizards" • Automated Workflows • Artificial Intelligence and Machine Learning • Particularly in the discovery process • Integration with natural-language processors • Extensive ontology development

  40. Machine Learning using MOBY: Discovery of ‘safe’ Service Paths [diagram; labels only] MOBY Services (Service @ URL_1, Service @ URL_2) and MOBY Central. Registered entries: X to Y via 1 @ URL_1; Y to Z via 2 @ URL_2.
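
As a rough illustration of what a "service path" is, the sketch below chains the two registered entries from the slide (X to Y via 1, Y to Z via 2) using a plain breadth-first search. This is a swapped-in technique for illustration only: it finds reachable paths and does not attempt the machine-learning notion of a "safe" path that the slide refers to. The function name `find_path` and the tuple layout are hypothetical.

```python
from collections import deque

# Registered entries from the slide: (input object, output object, service, URL).
ENTRIES = [
    ("X", "Y", "1", "URL_1"),
    ("Y", "Z", "2", "URL_2"),
]


def find_path(entries, start, goal):
    """Breadth-first search for the shortest chain of services from start to goal.

    Returns a list of entries to call in order, an empty list when start == goal,
    or None when no chain exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, path = queue.popleft()
        if current == goal:
            return path
        for entry in entries:
            source, target = entry[0], entry[1]
            if source == current and target not in seen:
                seen.add(target)
                queue.append((target, path + [entry]))
    return None


for source, target, service, url in find_path(ENTRIES, "X", "Z"):
    print(f"{source} to {target} via {service} @ {url}")
```

Because every service declares its input and output object types at registration, a client can compose such chains without any service knowing about the others.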

  41. The Holy Grail: Retrieve and align 2000nt 5' from every serine/threonine kinase in Fabaceae expressed exclusively in the root cortex whose expression increases 5X or more upon infection by Rhizobium but is not affected by osmotic or heavy-metal stresses and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species. The Holiest Grail! Align the promoters of all serine/threonine kinases involved exclusively in a signaling cascade leading to nitrogen fixation

  42. CBIN: Canadian Bioinformatics Integration Network (NCE proposal) [diagram; labels only]: Life Sciences Community; Genomics Data; Protein Data; Interaction & Pathway Data; Phenotype Data; Biodiversity Data; Data Type/Database Integration; Ontology Development; Tools & Application Development; Algorithm Development; Integrated Knowledge; Knowledge Generation.

  43. Special Thanks To: • Genome Prairie/Genome Canada • Dr. William Crosby and the Crosby Lab: • Dave Block, Matthew Links • Gene Ontology Consortium • Michael Ashburner & Suzanna Lewis • Midori Harris & Chris Mungall • DAS project • Robin Dowell • Lincoln Stein • NCGR ISYS project • Damian Gessler • Whitehead Institute/OmniGene • Brian Gilman • OpenInformatics • Jason Stewart • All the other BioMOBY participants

  44. References: • MOBY: http://biomoby.org • DAS: http://biodas.org • GO: http://geneontology.org • ISYS: http://www.ncgr.org/isys • PBI Bioinformatics Lab: http://bioinfo.pbi.nrc.ca
