All the King's Horses, and All the King's Men: Defragmenting Biological Information Through BioMOBY. Mark Wilkinson, Plant Biotechnology Institute, National Research Council of Canada, mwilkinson@gene.pbi.nrc.ca
The Holy Grail: Align the promoters of all serine/threonine kinases involved exclusively in some hypothetical signaling cascade which leads to nitrogen fixation. Retrieve and align 2000nt 5' from every serine/threonine kinase in Fabaceae expressed exclusively in the root cortex whose expression increases 5X or more upon infection by Rhizobium but is not affected by osmotic or heavy-metal stresses and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species.
Yeah… we wish! ;-) Unfortunately, the current situation is…
So how do we get there? There are already some landmarks of successful data integration projects we can use as a guide. They have several things in common. In particular they define: • A question format • A response format • How to describe objects in the question • A common location to ask questions
Successful Integration Tools (complete and in progress): • DAS: "Distributed Annotation System" • ISYS: Integration of Desktop Tools • GO: Gene Ontology Consortium • Continues to expand • …and numerous other controlled vocabulary and ontology projects… • BioMOBY: Integration of online biological databases and analysis services
Problems left over from GO, DAS & ISYS? • Many biological data types are not "covered" • e.g. microarray, two-hybrid, citations • How to find this information • Generally must do each query by hand • Each database has a different "look and feel". • Often one at a time. • Information returned is often still in willy-nilly formats • Cut-n-paste, cut-n-paste, cut-n-paste…
What new 'tool' is needed? • A mechanism by which a researcher is able to simultaneously interact with multiple sources of highly disparate biological data regardless of the underlying data format or schema. • The mechanism must also allow for the automated, dynamic identification of data, and the relationships between data from different sources.
What lessons do we learn from GO, DAS & ISYS? • Be Open Source!! • Include wide variety of organisms in development • Use a central registry • Pass predictable data objects • Define query/response formats • Use/Abuse ontologies until it hurts • 'KOSH': Keep Objects Simple Honey!
Do You MOBY? • Open-source interoperability research project • "MOBY-DIC" Conference, Emma Lake, 2001. • "Model Organism Bring Your-own Database Interface Conference" • Currently running as a prototype on a public server • Modeled after those successful integration projects, with great advice from them (thanks!) • Compatible with most other XML-based object model standards.
Michael Ashburner, Jason Stewart, Damian Gessler and Midori Harris
Missing, but involved: • Judy Blake (Mouse) • Heiko Schoof (MIPS) • David Block (Novartis) • Richard Bruskiewich (IRRI) • Matthew Links (PBI-NRC) • Stanford Microarray • Expanding collaboration with myGrid & I3C • Mailing list > 55 “adherents”
MOBY data hosts & services Align Phylogeny Primers Sequence Express. Protein Alleles … MOBY Central Gene names Overview: alignment Sequence
Components of the MOBY system • MOBY Objects: the data itself (XML) • MOBY Central: central registry of servers • MOBY Server: any compliant service • MOBY Client: the bit that you see!
MOBY- Objects • Desire (initially) ~200 biological data types • Sequence • GO_Term • Citation • XML objects, defined by a common set of W3C XML Schema (XSD) stored at MOBY-Central. • An Object minimally consists of a "MOBY-Triple": • ObjectType, Namespace, ID • e.g. Sequence, GenBank/Acc, AC34576
MOBY Objects: The ISYS lesson in minimalism • A base MOBY object consists of a "Triple": instance/namespace/id <Object namespace="GenBank/Acc" id="D21125.1"/> • A different instance of the same data <VirtualSequence namespace="GenBank/Acc" id="D21125.1"> <Length>562</Length> </VirtualSequence>
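The Triple idea can be sketched in a few lines of Python. This is only an illustration using the standard library's `xml.etree.ElementTree`; the helper name `make_moby_object` and its `payload` argument are hypothetical and are not part of any BioMOBY API.

```python
import xml.etree.ElementTree as ET


def make_moby_object(object_type, namespace, identifier, payload=None):
    """Build an element from a Triple (object type, namespace, id).

    `payload` is an optional dict mapping child-element names to values.
    Hypothetical helper for illustration only.
    """
    elem = ET.Element(object_type, {"namespace": namespace, "id": identifier})
    for name, value in (payload or {}).items():
        child = ET.SubElement(elem, name)
        child.text = str(value)
    return elem


# The two instances from the slide: a bare Object and a VirtualSequence
# that adds a lightweight payload to the same namespace/id pair.
base = make_moby_object("Object", "GenBank/Acc", "D21125.1")
virtual = make_moby_object(
    "VirtualSequence", "GenBank/Acc", "D21125.1", {"Length": 562}
)

print(ET.tostring(base, encoding="unicode"))
print(ET.tostring(virtual, encoding="unicode"))
```

Both elements carry the same namespace and id; only the element name and the payload differ, which is the sense in which they are different instances of the same data.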
MOBY-Objects • In addition to the Triple, two additional elements may appear in an object: • Cross Reference Information Block ("CRIB") • Payload • CRIB contains relevant cross-referencing Triples • Payloads carry the actual "data" • e.g. "Length" in the previous example • Should be as lightweight as possible • Vary according to the Instance of the object
MOBY-Objects • Unlike many other web-services projects, MOBY Classes are used as both the Query and the Response to a service transaction. • Query and Response are enveloped in a "MOBY envelope" • Contains authority information about service provider • Collects multiple responses according to their source. • Gives biologists ability to judge data quality
Response mock-up (diagram labels: Envelope, Class, X-Refs, Payload). The Sequence object for Apetala3 might be:
<MOBY Authority="ncbi.nlm.nih.gov" log="Query/ID" id="1334543">
  <Sequence namespace="GenBank/Acc" id="D21125.1">
    <CrossReference>
      <Object namespace="PubMed/ID" id="7948893"/>
      <Object namespace="SwissProt/ID" id="BAA65.1"/>
      <Object namespace="TAIR/Locus" id="AP3"/>
      <Object namespace="GO/Acc" id="GO:0001835"/>
      <Object namespace="EMBL/ID" id="AF056541"/>
    </CrossReference>
    <Length>876</Length>
    <SequenceString>gatcaatcca tgttagtttc taactgtggc caacttagtt …. </SequenceString>
  </Sequence>
</MOBY>
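A client could split such a response into its parts (envelope, Triple, CRIB, payload) roughly as follows. This is a sketch, not BioMOBY client code: it parses a well-formed, fully quoted copy of the mock-up with Python's standard `xml.etree.ElementTree`, and the function name `summarize` and the shape of its return value are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the mock-up; the sequence text stays truncated
# exactly as on the slide.
RESPONSE = """
<MOBY Authority="ncbi.nlm.nih.gov" log="Query/ID" id="1334543">
  <Sequence namespace="GenBank/Acc" id="D21125.1">
    <CrossReference>
      <Object namespace="PubMed/ID" id="7948893"/>
      <Object namespace="SwissProt/ID" id="BAA65.1"/>
      <Object namespace="TAIR/Locus" id="AP3"/>
      <Object namespace="GO/Acc" id="GO:0001835"/>
      <Object namespace="EMBL/ID" id="AF056541"/>
    </CrossReference>
    <Length>876</Length>
    <SequenceString>gatcaatcca tgttagtttc taactgtggc caacttagtt ….</SequenceString>
  </Sequence>
</MOBY>
"""


def summarize(xml_text):
    """Return one dict per object found inside the envelope."""
    envelope = ET.fromstring(xml_text.strip())
    results = []
    for obj in envelope:
        # CRIB: the cross-referencing Triples, reduced to (namespace, id).
        crib = [
            (ref.get("namespace"), ref.get("id"))
            for ref in obj.findall("CrossReference/Object")
        ]
        # Payload: every child element other than the CRIB.
        payload = {
            child.tag: (child.text or "").strip()
            for child in obj
            if child.tag != "CrossReference"
        }
        results.append({
            "authority": envelope.get("Authority"),
            "triple": (obj.tag, obj.get("namespace"), obj.get("id")),
            "crib": crib,
            "payload": payload,
        })
    return results


summary = summarize(RESPONSE)
for entry in summary:
    print(entry["authority"], entry["triple"], len(entry["crib"]), "cross-references")
```

Keeping the authority alongside each object is what would let a client group results by source, as the envelope slide describes.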
MOBY-Servers: • Accepts MOBY Object(s) as queries via SOAP • Returns MOBY Object(s) in response via SOAP • for the moment, only using HTTP protocol • Query/Response Object types + Protocol = "Port" • Service provider registers with MOBY Central • "Port" • Service Type (from service hierarchy) • URL of service • Human-readable description of service
Post-Registration Specification of Services • WSDL service definition for each service • Web Service Description Language – XML based • WSDL has three main components: • Class : XSD template of the Objects' data structures • Port : Input/output object Class & protocol (e.g. http) • Service : the actual URL & associated "Port" • WSDL is auto-generated at MOBY-Central
MOBY-Central (monolithic!…) • Repository for Object schemas • Repository for the Object Ontology • Repository for the Service Ontology • Repository of the URL for every service from every Server, and their input and output Object types • Answers Client queries based on Input, Output, Service Type and/or Service Provider • Builds, "on the fly", an appropriate WSDL service definition document in response to a Client query
Basically, the Registration Process is: (diagram) each MOBY Service registers its Service @ URL with MOBY Central. MOBY Central records OBJECTS (X, Y, Z) and SERVICES (1, 2, 3), for example "X to Y via 1 @ URL_1" and "Y to Z via 2 @ URL_2".
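The registration and lookup roles of MOBY Central can be mimicked with a toy in-memory registry. This is a minimal sketch under stated assumptions: `ServiceRecord` and `ToyRegistry` are hypothetical names, matching is by exact type name only, and nothing here reflects the real MOBY Central interface (which is reached over SOAP and also serves WSDL).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    """One registration: 'input_type to output_type via service @ url'."""
    service: str
    input_type: str
    output_type: str
    url: str
    description: str = ""


class ToyRegistry:
    """In-memory stand-in for a central registry (illustrative only)."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def find(self, input_type=None, output_type=None):
        """Return records matching every criterion that was supplied."""
        return [
            r for r in self._records
            if (input_type is None or r.input_type == input_type)
            and (output_type is None or r.output_type == output_type)
        ]


central = ToyRegistry()
central.register(ServiceRecord("1", "X", "Y", "URL_1"))
central.register(ServiceRecord("2", "Y", "Z", "URL_2"))

# A client holding an object of type X asks what it can do with it.
for record in central.find(input_type="X"):
    print(f"{record.input_type} to {record.output_type} "
          f"via {record.service} @ {record.url}")
```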
Response Example again…
<MOBY Authority="ncbi.nlm.nih.gov" log="Query/ID" id="1334543">
  <Sequence namespace="GenBank/Acc" id="D21125.1">
    <CrossReference>
      <Object namespace="PubMed/ID" id="7948893"/>
      <Object namespace="SwissProt/ID" id="BAA65.1"/>
      <Object namespace="TAIR/Locus" id="AP3"/>
      <Object namespace="GO/Acc" id="GO:0001835"/>
      <Object namespace="EMBL/ID" id="AF056541"/>
    </CrossReference>
    <Length>876</Length>
    <SequenceString>gatcaatcca tgttagtttc taactgtggc caacttagtt …. </SequenceString>
  </Sequence>
</MOBY>
Related Data Available Locus Citation Primary_AA Primary_Seq GO_Annotation Homologues Expression profile Apetala3 cDNA_Sequence: gatcaatcca tgttagtttc aaaacaacag taactgtggc caacttagtt ttgaaacaac …… New Info! What a simple client might display
"I can find functionally similar genes From GO_Ids" "I can get Expression data From EMBL ID" Medline SwissProt PubMed EMBL GO_Annotation Citation AA Locus DNA GO_Annot Homologues Expression MOBY Central Sequence attccg ggtcac What can I do with Medline, EMBL, GO …??? Where does new info come from? • “SERVICE AMPLIFICATION”
The Holy Grail: Retrieve and align 2000nt 5' from every serine/threonine kinase in Fabaceae expressed exclusively in the root cortex whose expression increases 5X or more upon infection by Rhizobium but is not affected by osmotic or heavy-metal stresses and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species.
Workflow to solve query (diagram; labels only): Keyword search "Serine/Threonine Kinase"; GO Database; GO/Acc GO:0004674; MOBY Central; Get Serine/Threonine Kinase; ProteinDomain Object (PRINTS, PROSITE); BeanGene/Acc; Expression objects; MAGE/ML objects; data sources SMD, GO, BG.
And so on… • Get GO IDs for the term "cell cycle" • Use GO IDs to retrieve sequences • Use some homology or alignment service to do the homology testing… (Emboss via MOBY) • End up with a set of valid sequence accession numbers • Obtain the sequence coordinates for those accessions (MOBY interface to DAS) • Use DAS over MOBY to retrieve the 2000nt upstream • Use an alignment service to align them
Integration of the various tools to solve this problem: • MOBY finds the data for you, creates the queries, and retrieves the responses. • CRIB provides entry into related data-sources • GO allows you to move from organism to organism based on guaranteed biological descriptor. • DAS enables sequence retrieval from genome databases in a generic, coordinate-based manner.
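"Service amplification" through the CRIB can be pictured as a lookup from each cross-reference namespace to the services that accept it. The sketch below is illustrative only: the table `SERVICES_BY_NAMESPACE` is a hypothetical stand-in for a query to MOBY Central, and its two entries paraphrase the example offers from the amplification slide.

```python
# Hypothetical stand-in for asking the registry which services accept
# identifiers from a given namespace.
SERVICES_BY_NAMESPACE = {
    "GO/Acc": ["find functionally similar genes"],
    "EMBL/ID": ["get expression data"],
}


def amplify(cross_refs, services_by_namespace):
    """Map each cross-reference to the services that could consume it.

    Cross-references with no known service are left out of the result.
    """
    offers = {}
    for namespace, identifier in cross_refs:
        services = services_by_namespace.get(namespace, [])
        if services:
            offers[(namespace, identifier)] = list(services)
    return offers


# Cross-references taken from the response mock-up.
crib = [
    ("PubMed/ID", "7948893"),
    ("GO/Acc", "GO:0001835"),
    ("EMBL/ID", "AF056541"),
]

offers = amplify(crib, SERVICES_BY_NAMESPACE)
for (namespace, identifier), services in offers.items():
    print(f"{namespace} {identifier}: {', '.join(services)}")
```

Each response therefore widens the set of services a client can offer next, without the client knowing any of those services in advance.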
Why have Class and Service Ontologies? • Discovery of new sub-classes of services • "BLAST" finds tblastx, tblastn, psi-blast, and marks_super_blast. • "Alignment" finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch • Allows backward-compatibility with simpler clients: • 'AnnotatedContig' object contains a Sequence object: AnnotatedContig IS A Sequence • Clients extract only the bits that they understand • Expanded selection of services presented based on expansion of in-hand object
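Both uses of the ontologies reduce to walking IS-A links: downward to discover sub-classes of a service, upward to find the simpler classes an object can be treated as. The sketch below assumes a single-parent hierarchy stored as a child-to-parent dict and assumes the search is transitive (so "Alignment" also reaches the BLAST variants); neither assumption comes from the slides, and the function names are hypothetical.

```python
# Child -> parent links, using the examples named on the slide.
SERVICE_ISA = {
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
    "marks_super_blast": "BLAST",
    "BLAST": "Alignment",
    "ClustalW": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
}

OBJECT_ISA = {
    "AnnotatedContig": "Sequence",
}


def descendants(term, isa):
    """All terms below `term` in the hierarchy, sorted by name."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in isa.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return sorted(found)


def ancestors(term, isa):
    """Chain of parents above `term`, nearest first."""
    chain = []
    while term in isa:
        term = isa[term]
        chain.append(term)
    return chain


print(descendants("BLAST", SERVICE_ISA))
print(ancestors("AnnotatedContig", OBJECT_ISA))
```

A client holding an AnnotatedContig could thus also be offered every service registered for Sequence, which is the backward-compatibility point made above.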
What about the Object Models? • Are we duplicating effort? • The MOBY payload can and will be used to transport objects from other ontologies or systems (e.g. OMG, MAGE, free-text), thus we are fully compatible. • Whenever possible, the lightweight MOBY objects will be designed to be parents of one of the heavier objects from another ontology • We are aware of, and follow developments in other object-creation projects • The lightweight nature of the base MOBY objects necessitates the creation of our own object models.
What about the I3C? Why not use UDDI? • It is far too early in the evolution of biological web-services for anyone to claim to have the 'best solution'. Research is a Good Thing ™ • BioMOBY attempts to approach data discovery from a unique "biologist's" perspective • UDDI is not open-source…really… • UDDI has a business-centric API, and (IMHO) is bulky and awkward. • Nothing precludes us from using UDDI in the future. • I3C members do participate in BioMOBY development.
How long will this take? • MOBY is a product of the host community. • high degree of acceptance & enthusiasm! • MOBY Services are exceedingly simple to generate. • Host institutions can gradually implement MOBY • MOBY is 'modular' • Difficult to be obstructive to implementation • In the short-term, proxies can be written around non-participating CGI-based interfaces
Current Status: • MOBY Central prototype registry running • Midori (GO) & myGrid participants are working on unifying their Service ontologies. • Lukas (TAIR) and I are working on the Object ontology • Brian Gilman porting MOBY Central to Java • On June 5th, the first MOBY Service was discovered and automatically executed by a simple MOBY Client! • ~10 services exist spanning 3 institutions, and growing… … things will move quickly from here on! • Award from Genome Prairie to pursue BioMOBY research • CURRENTLY LOOKING FOR MATCHING FUNDS FROM EXTERNAL SPONSOR (hint hint :-) )
Planned Research Themes (Canadian) • "Core" BioMOBY development • MOBY Central • Object definitions • Modification of existing UIs and creation of new MOBY-specific client. • Creation of HPC services running over MOBY • Annotation pipelines • Analysis "Wizards" • Automated Workflows • Artificial Intelligence and Machine Learning • Particularly in the discovery process • Integration with natural-language processors • Extensive ontology development
Machine Learning using MOBY: Discovery of 'safe' Service Paths (diagram: MOBY Services at URL_1 and URL_2 register with MOBY Central as "X to Y via 1 @ URL_1" and "Y to Z via 2 @ URL_2")
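Before any learning is involved, finding a candidate service path is a graph search over the registered input/output types. The sketch below uses a plain breadth-first search over the two registrations in the diagram; it finds type-compatible chains only and makes no attempt to judge whether a path is 'safe', since that criterion is not defined here. The names `REGISTRATIONS` and `find_service_path` are hypothetical.

```python
from collections import deque

# (input type, output type, service, url), as in the diagram.
REGISTRATIONS = [
    ("X", "Y", "1", "URL_1"),
    ("Y", "Z", "2", "URL_2"),
]


def find_service_path(registrations, start, goal):
    """Shortest chain of services turning `start` into `goal`, or None.

    Returns a list of (service, url) steps; an empty list means the
    start type already is the goal type.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, path = queue.popleft()
        if current == goal:
            return path
        for src, dst, service, url in registrations:
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(service, url)]))
    return None


path = find_service_path(REGISTRATIONS, "X", "Z")
print(path)
```

A learning component could then rank or prune the chains this search proposes, for example by how often each step has returned usable results.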
The Holy Grail: Retrieve and align 2000nt 5' from every serine/threonine kinase in Fabaceae expressed exclusively in the root cortex whose expression increases 5X or more upon infection by Rhizobium but is not affected by osmotic or heavy-metal stresses and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species. The Holiest Grail! Align the promoters of all serine/threonine kinases involved exclusively in a signaling cascade leading to nitrogen fixation
Canadian Bioinformatics Integration Network Life Sciences Community Genomics Data Protein Data Data Type/Database Integration Integrated Knowledge Interaction & Pathway Data Ontology Development CBIN Tools & Application Development Phenotype Data Algorithm Development Biodiversity Data Knowledge Generation CBIN – Canadian Bioinformatics Integration Network NCE proposal
Special Thanks To: • Genome Prairie/Genome Canada • Dr. William Crosby and the Crosby Lab: • Dave Block, Matthew Links • Gene Ontology Consortium • Michael Ashburner & Suzanna Lewis • Midori Harris & Chris Mungall • DAS project • Robin Dowell • Lincoln Stein • NCGR ISYS project • Damian Gessler • Whitehead Institute/OmniGene • Brian Gilman • OpenInformatics • Jason Stewart • All the other BioMOBY participants
References: • MOBY: http://biomoby.org • DAS: http://biodas.org • GO: http://geneontology.org • ISYS: http://www.ncgr.org/isys • PBI Bioinformatics Lab: http://bioinfo.pbi.nrc.ca