Learn about myGrid's objectives in coordinating bioinformatics web services, personalized environments for data-intensive experiments, and workflow processes for in silico biology. This system aims to enhance the quality and traceability of knowledge evolution while enabling personalized data sets, views, and workflows. Explore topics like provenance, event notification, and service coordination to optimize scientific discovery and collaboration.
myGrid: Using Workflow and Ontologies to coordinate Bioinformatics Web Services Carole Goble http://www.mygrid.org.uk Wellcome Trust/eScience Programme UK BioGrid Meeting 1-3rd October, 2002, Hinxton Genome Campus
Roadmap • myGrid’s objectives • Seeking services • Taste of myGrid 0. • Observations
myGrid: personalised extensible environments for data-intensive in silico experiments in biology • EPSRC eScience pilot project • official start 01/10/01 • actual 01/01/02 • end 30/03/05 • 16 RAs, 9 studentships (start 09/03)
Information Weaving • Large amounts of different kinds of data & many applications. • Highly heterogeneous. • Different types, algorithms, forms, implementations, communities, service providers • High autonomy. • Highly complex and inter-related, & volatile.
Circadian Rhythms • Has anyone else studied the effect of neurotransmitters on the circadian rhythms in Drosophila? • I’ve got a cluster of proteins from my experiment. How do their functions interrelate? And what are the proteins with a particular function? • Is a structure known for my protein? What other proteins have a similar structure? • Publish my results by adding to some annotation in a database.
Workflow • Know how. • Associate base resources with derived data. • Keep, describe, find, compare, protect, share. • Repeat/reuse/re-enact • Specialise/Customise/Personalise • Evolution – notification, knowledge • Quality & best practice • It would be good if the workflows were good. • = good experimental practice.
Personalisation • Dynamic creation of personal data sets. • Personal views over repositories. • Personalisation of workflows. • Personal notification • Annotation of datasets and workflows. • Personalisation of service descriptions – what I think the service does.
Provenance • Who, what, where, why, when, (w)how? • The traceability of knowledge as it evolves and as it is derived. • Identity – the Life Sciences ID • The Lab Book. Methods in papers. • Immutable Metadata • Migration – travels with its data but may not be stored with it. • Aggregates as data aggregates • Private vs Shared provenance records. • Ownership => success -> being sued? • Credit.
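One way to picture the provenance points above is a small immutable record that answers the who/what/where/why/when/how questions and links back to the data it was derived from. This is a minimal sketch, not myGrid's schema: every field name, identifier, and value below is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: the record cannot be modified once written
class ProvenanceRecord:
    data_id: str   # identifier of the data item described (hypothetical format)
    who: str
    what: str
    where: str
    why: str
    when: str      # ISO 8601 date kept as text
    how: str       # e.g. the workflow or method that produced the data
    derived_from: Tuple[str, ...] = ()  # ids of the source data items


def lineage(data_id: str, store: Dict[str, ProvenanceRecord]) -> List[str]:
    """Return the ids of everything data_id was derived from, nearest first."""
    seen: List[str] = []
    queue = list(store[data_id].derived_from) if data_id in store else []
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        if current in store:
            queue.extend(store[current].derived_from)
    return seen


# Hypothetical chain: raw sequences -> alignment -> annotation.
raw = ProvenanceRecord("raw", "lab", "sequencing run", "bench 2",
                       "circadian study", "2002-09-01", "instrument")
aligned = ProvenanceRecord("aligned", "alice", "alignment", "cluster",
                           "find homologues", "2002-09-10", "workflow-7",
                           derived_from=("raw",))
annotated = ProvenanceRecord("annotated", "alice", "annotation", "desktop",
                             "publish results", "2002-09-20", "manual",
                             derived_from=("aligned",))
store = {r.data_id: r for r in (raw, aligned, annotated)}
ancestors = lineage("annotated", store)
```

Keeping `derived_from` as identifiers rather than embedded records reflects the point that provenance travels with its data but need not be stored with it.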
Event Notification • Has PDB changed since I last ran this? • Has the record I derived my record from changed? • Has the workflow I adapted my workflow from changed? • Did the provenance record change? • Has a service I am using right now gone? Has an equivalent one sprung up? • Event notification service myGrid 0.1
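The questions above all have the same shape: "tell me when a resource I depend on changes". A minimal publish/subscribe sketch of that idea follows. It is an illustration only, not the myGrid 0.1 notification service; the class name, method names, and version strings are assumptions.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[str, str], None]  # called with (resource, new_version)


class NotificationService:
    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._versions: Dict[str, str] = {}

    def subscribe(self, resource: str, handler: Handler) -> None:
        """Register interest in changes to a named resource."""
        self._subscribers[resource].append(handler)

    def publish(self, resource: str, version: str) -> bool:
        """Record a version; notify subscribers only if it differs from the last one.

        The first version ever published for a resource counts as a change.
        """
        if self._versions.get(resource) == version:
            return False
        self._versions[resource] = version
        for handler in list(self._subscribers.get(resource, [])):
            handler(resource, version)
        return True


inbox: List[str] = []
service = NotificationService()
service.subscribe("PDB", lambda res, ver: inbox.append(f"{res} changed to {ver}"))
service.publish("PDB", "release-A")
service.publish("PDB", "release-A")  # unchanged: no notification
service.publish("PDB", "release-B")
```

The same mechanism would cover records, workflows, and provenance records if each is treated as a named resource with a version.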
myGrid in silico experimentation • Resource Interoperation. • Workflow Coordination & Database Integration -> MALCOLM • Provenance & Change Propagation. • Improving quality of experiments & data. • Personalisation & Collaborative working. • Scientific discovery is personal & global. • Security, ownership -> valuable assets • Service based architecture (formerly known as agents) • Publication, discovery, interoperation, composition, decommissioning of myGrid services • Metadata. • Describing stuff, using ontologies, Semantic Web.
Who is myGrid for? myGrid users IS specialists biologists systems administrators tool builders infrequent problem specific service provider bioinformaticians bioinformatics tool builders
Knowledge (ontologies) Security Personalisation Provenance Metadata A marketitecture diagram Portals Applications Data mining, PRINTS annotation workbench BioMedical Services Library: DAS, Talisman, workflow sets Upper level knowledge-based Grid Common Services: Semantic integration, knowledge based querying, workflow composition, visualisation, provenance mgt, semantic service discovery Middle level Grid Common Services: Database access, distributed query processing, service discovery, workflow enactment, event notification Low level Grid Common Services (OGSI) Co-scheduling, data shipping, authentication, job execution, resource monitoring …
User Agent Custom Application Presentation Services Collaboration Support Management Tools Portal Client Framework Semantic Data Integration Semantic Aspect Semantic Workflow Design Information Extraction Provenance Validation & Assessment Semantic Discovery Ontology Service Preferences Metadata Aspect Versioning Availability Preferences Third-party Metadata QoS QoS QoS Provenance Coordination Services Distributed Query Workflow Enactment Syntactic Discovery Event Notification ‘White Pages’ & ‘Yellow Pages’ Discovery Networked Services Personal Repository Database Access JobExecution Device Access Device Access Security: Authentication & Authorization Database Distributed Resources resources: data and tools
Current programme • Use case scenarios. • Rolling programme of prototyping. • April myGrid 0.0, October myGrid 0.1 … • Identifying the most important services. • Agreeing consistent interfaces. • Integrating with other Grid services. • Implementing core services. • Describing services. • Connecting with other efforts.
Service based architecture • Each bio resource is a service • Database, archive, analysis, tool, person, instrument, a workflow … • Each myGrid architectural component is a service • Workflow enactment engine, event notification, registry, scheduler… • Web services • Grid services (OGSA)
Service based architecture • Find them • Publication, registration, discovery, matchmaking, deregistration. • Run them. • Execution, monitoring, exception handling. • Organise them. • Interoperation, composition, substitution.
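The "find them / organise them" steps above can be sketched as a tiny in-memory registry supporting publication, discovery, deregistration, and substitution. This is a sketch under assumptions: the class, its methods, and the service names are hypothetical and do not describe the myGrid registry.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ServiceEntry:
    name: str      # hypothetical naming scheme, e.g. "BLAST@NCBI"
    task: str      # the kind of work the service performs
    provider: str


class Registry:
    def __init__(self) -> None:
        self._entries: Dict[str, ServiceEntry] = {}

    def publish(self, entry: ServiceEntry) -> None:
        """Register a service, replacing any entry with the same name."""
        self._entries[entry.name] = entry

    def deregister(self, name: str) -> None:
        """Remove a service; unknown names are ignored."""
        self._entries.pop(name, None)

    def discover(self, task: str) -> List[ServiceEntry]:
        """Return all registered services for a task, ordered by name."""
        found = (e for e in self._entries.values() if e.task == task)
        return sorted(found, key=lambda e: e.name)

    def substitute(self, name: str, task: str) -> Optional[ServiceEntry]:
        """Pick a different registered service that performs the same task."""
        for entry in self.discover(task):
            if entry.name != name:
                return entry
        return None


registry = Registry()
registry.publish(ServiceEntry("BLAST@NCBI", "sequence_alignment", "NCBI"))
registry.publish(ServiceEntry("BLAST@EBI", "sequence_alignment", "EBI"))
registry.deregister("BLAST@NCBI")  # the service being used has gone
fallback = registry.substitute("BLAST@NCBI", "sequence_alignment")
```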
Service Discovery • Find appropriate type of services • sequence alignment • Find appropriate instances of that service • BLAST (an algorithm for sequence alignment), as delivered by NCBI • Assist in forming an appropriate assembly of discovered services. • Find, select and execute instances of services while the workflow is being enacted. Knowledge in the head of expert bioinformatician
1. User selects values from a drop down list to create a property based description of their required service. Values are constrained to provide only sensible alternatives. 2. Once the user has entered a partial description they submit it for matching. The results are displayed below. 3. The user adds the operation to the growing workflow. 4. The workflow specification is complete and ready to match against those in the workflow repository.
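Steps 1 and 2 above can be sketched as matching a partial, property-based description against stored service descriptions. The property names are borrowed from the class definitions shown later in the talk; the services, their property values, and both function names are illustrative assumptions, not the myGrid matcher.

```python
from typing import Dict, List

Description = Dict[str, str]  # property name -> required value

# Hypothetical service descriptions, simplified to flat property/value pairs.
SERVICES: Dict[str, Description] = {
    "blastn": {
        "performs_task": "aligning",
        "requires_input": "nucleotide_sequence",
        "uses_resource": "nucleotide_database",
    },
    "blastp": {
        "performs_task": "aligning",
        "requires_input": "protein_sequence",
        "uses_resource": "protein_database",
    },
    "clustalw": {
        "performs_task": "aligning",
        "requires_input": "protein_sequence",
    },
}


def allowed_values(prop: str, services: Dict[str, Description]) -> List[str]:
    """Values to offer in the drop-down list for one property (step 1)."""
    return sorted({d[prop] for d in services.values() if prop in d})


def match(partial: Description, services: Dict[str, Description]) -> List[str]:
    """Names of services that satisfy every property in the partial description (step 2)."""
    return sorted(
        name
        for name, desc in services.items()
        if all(desc.get(prop) == value for prop, value in partial.items())
    )


choices = allowed_values("requires_input", SERVICES)
hits = match({"performs_task": "aligning", "requires_input": "protein_sequence"}, SERVICES)
```

Offering only values that occur in the stored descriptions is one simple way to constrain the user to sensible alternatives; an ontology-backed system could instead derive the alternatives from the class hierarchy.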
Client framework myGrid 0.0 Portal Repository Client Workflow Client Ontology Client Personal Repository Workflow Repository (Meta Data) Ontology Server DAML+OIL Reasoner (FaCT) (Meta Data) Service Type Directory Workflow enactment Matcher and Ranker Service instance directory Bioinformatics services REGISTRY
Metadata & Ontologies • Metadata – computationally accessible data about the services • Ontologies – the shared and common understanding of a domain • A vocabulary of terms • Definition of what those terms mean. • A shared understanding for people and machines • Usually organised into a taxonomy.
Why ontologies for services? • A shared vocabulary for describing a service • Service classifications • “BLAST” finds tblastx, tblastn, psi-blast, and marks_super_blast. • “Alignment” finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch • Guiding service composition • Blastn compares a nucleotide query sequence against a nucleotide sequence database (usually – intelligent misuse of services…) • Not the only way to find a service.
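The classification examples above amount to walking a taxonomy: a query for a general term returns everything filed beneath it. A minimal sketch follows, using the service names from the slide; the child-to-parent table and the function names are assumptions made for illustration.

```python
from typing import Dict, List

# Child -> parent links, assumed from the two examples on the slide.
PARENT: Dict[str, str] = {
    "BLAST": "Alignment",
    "ClustalW": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
    "marks_super_blast": "BLAST",
}


def is_a(term: str, ancestor: str) -> bool:
    """True if term equals ancestor or sits anywhere below it in the taxonomy."""
    current = term
    while current != ancestor:
        if current not in PARENT:
            return False
        current = PARENT[current]
    return True


def find(query: str) -> List[str]:
    """All terms classified strictly below the query term."""
    return sorted(t for t in PARENT if t != query and is_a(t, query))


blast_hits = find("BLAST")
alignment_hits = find("Alignment")
```

Note that a transitive search for "Alignment" also returns the BLAST variants, which a plain keyword search on the word "alignment" would miss.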
Four tiered service descriptions Domain “semantic” • Class of service: • a protein sequence alignment, a protein sequence database. • Specific example of an abstract service: • BLASTn is a tool for computing sequence homology that uses the BLAST algorithm over nucleotides; • Instance service description of a specific service: • BLASTn service by the NCBI is 80% reliable. • Invoked instance service description: • BLAST as offered by the EBI on a particular date, with particular parameters when a service was actually enacted. Business “operational”
W3C: DAML+OIL/OWL • From the Semantic Web community • DAML+OIL / OWL designed to describe ontologies • Information about classes, properties, and individuals as a sequence of axioms and facts & inclusion references to other ontologies, each of which can have an ID which is a URI reference. • OWL ontologies are web documents referenced by a URI • Ontologies also reference XML Schema datatypes. • Automated reasoning for inferring classification lattice and checking concepts are consistent • OWL Web Ontology Language 1.0 Reference • W3C Working Draft 29 July 2002 • http://www.w3.org/TR/owl-ref/
W3C -> Lots of Tools! http://oiled.man.ac.uk/
class-def defined pairwise_sequence_alignment_service subclass-of atomic_service_operation has_Class performs_task (aligning has_Class has_feature local has_Class has_feature pairwise) has_Class produces_result (report has_Class is_report_of sequence_alignment) has_Class uses_resource (database has_Class contains (data has_Class encodes (sequence has_Class is_sequence_of nucleic_acid_molecule))) has_Class requires_input (data has_Class encodes (sequence has_Class is_sequence_of nucleic_acid_molecule)) has_Class is_function_of (BLAST_application)
class-def defined BLAST-n_service_operation subclass-of atomic_service_operation has_Class performs_task (aligning has_Class has_feature local has_Class has_feature pairwise) has_Class produces_result (report has_Class is_report_of sequence_alignment) has_Class uses_resource (database has_Class contains (data has_Class encodes (sequence has_Class is_sequence_of nucleic_acid_molecule))) has_Class requires_input (data has_Class encodes (sequence has_Class is_sequence_of nucleic_acid_molecule)) has_Class is_function_of (BLAST_application)
Reasoning in DAML+OIL • Consistency— check if knowledge is meaningful • Subsumption— structure knowledge, compute classification • Equivalence— check if two classes denote same set of instances • Instantiation— check if individual i instance of class C • Retrieval— retrieve set of individuals that instantiate C
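To give a feel for the five reasoning tasks above, here is a deliberately simplified sketch in which a class is just a set of (property, value) restrictions and subsumption is set inclusion. This is a toy, not how a description logic reasoner such as FaCT works; the class names, the flat restriction format, and the disjointness table are all assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Restriction = Tuple[str, str]         # (property, value)
ClassDef = FrozenSet[Restriction]     # a class is a conjunction of restrictions

CLASSES: Dict[str, ClassDef] = {
    "alignment_service": frozenset({("performs_task", "aligning")}),
    "pairwise_alignment_service": frozenset(
        {("performs_task", "aligning"), ("has_feature", "pairwise")}
    ),
    "blastn_operation": frozenset(
        {
            ("performs_task", "aligning"),
            ("has_feature", "pairwise"),
            ("requires_input", "nucleotide_sequence"),
        }
    ),
}

# Pairs of values that one property cannot hold at the same time (assumed).
DISJOINT: Set[FrozenSet[str]] = {frozenset({"nucleotide_sequence", "protein_sequence"})}


def subsumes(general: ClassDef, specific: ClassDef) -> bool:
    """Subsumption: every restriction of the general class holds for the specific one."""
    return general <= specific


def equivalent(a: ClassDef, b: ClassDef) -> bool:
    """Equivalence: each class subsumes the other."""
    return a == b


def consistent(c: ClassDef) -> bool:
    """Consistency: no property is restricted to two disjoint values."""
    return not any(
        p1 == p2 and frozenset({v1, v2}) in DISJOINT
        for (p1, v1) in c
        for (p2, v2) in c
    )


def instance_of(facts: ClassDef, c: ClassDef) -> bool:
    """Instantiation: the individual's asserted facts satisfy the class."""
    return c <= facts


def retrieve(individuals: Dict[str, ClassDef], c: ClassDef) -> List[str]:
    """Retrieval: names of all individuals that instantiate the class."""
    return sorted(name for name, facts in individuals.items() if instance_of(facts, c))


individuals = {
    "ncbi_blastn": CLASSES["blastn_operation"],
    "some_aligner": CLASSES["alignment_service"],
}
pairwise = retrieve(individuals, CLASSES["pairwise_alignment_service"])
```

Computing `subsumes` for every pair of classes would yield the classification lattice; a real reasoner does this over far richer class expressions, including nested restrictions like those in the definitions shown earlier.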
Suite Specialises. All concepts are subclassed from those in the more general ontology. Contributes concepts to form definitions. Upper level ontology Informatics ontology Molecular biology ontology Publishing ontology Organisation ontology Task ontology Bioinformatics ontology Web service ontology
Uses of ontology • Labelling data items in databases. • Semantic typing for controlling inputs and outputs of workflows • Use by distributed query processing. • Workflow, database classification. • Linking & browsing XML-based components • COHSE • Soft build of portals. • Link with the Life Science Identifier (I3C) • BioMOBY Central service classification
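Semantic typing of workflow inputs and outputs, listed above, can be sketched as a check that each step's output type is the same as, or a subtype of, the next step's expected input type. The type hierarchy, step names, and function names below are hypothetical.

```python
from typing import Dict, List, NamedTuple

# Child -> parent links in a small, assumed data-type hierarchy.
TYPE_PARENT: Dict[str, str] = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "data",
    "alignment_report": "data",
}


class Step(NamedTuple):
    name: str
    input_type: str
    output_type: str


def is_subtype(actual: str, expected: str) -> bool:
    """True if actual is expected or a descendant of it."""
    current = actual
    while current != expected:
        if current not in TYPE_PARENT:
            return False
        current = TYPE_PARENT[current]
    return True


def check_workflow(steps: List[Step]) -> List[str]:
    """Return one message per pair of adjacent steps whose types do not fit."""
    errors: List[str] = []
    for producer, consumer in zip(steps, steps[1:]):
        if not is_subtype(producer.output_type, consumer.input_type):
            errors.append(
                f"{producer.name} produces {producer.output_type}, "
                f"but {consumer.name} requires {consumer.input_type}"
            )
    return errors


translate = Step("translate", "nucleotide_sequence", "protein_sequence")
blastp = Step("blastp", "protein_sequence", "alignment_report")
blastn = Step("blastn", "nucleotide_sequence", "alignment_report")

good = check_workflow([translate, blastp])
bad = check_workflow([translate, blastn])
```

A checker like this reports mismatches rather than forbidding them, which leaves room for the deliberate "intelligent misuse" of services mentioned earlier in the talk.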
(some) Registry Issues • Find services based on name, signature, types, a word (not just using the ontology). • Registry management – weeding, authorisation, decommissioning. • Publishing of services. Keeping their descriptions up to date and faithful. • Alternative descriptions of services. • Staged descriptions. • Maintenance and evolution of the ontology • Multiple registries – personal, local, enterprise
We are not alone • Open Source: Open Bio Foundation, BioJava, BioPerl … • Other Projects: Astrogrid, Geodise, CLEF, Comb-e-chem, BIRN, OGSA-DAI • (De Facto) Standards: OMG LSR, I3C, MGED, Gene Ontology • Semantic Web: RDF, RDFS, DAML+OIL • Bioinformatics integration platforms: DAS, OpenBSA, ISYS, OpenMMS, Kleisli, Ensembl, AppLab, SRS, BioNavigator, DiscoveryLink, K1, TAMBIS, BioMOBY … • Web Services: XML, SOAP, WSDL, UDDI • Distributed Computing Environments: CORBA, RMI, JavaOne • GRID: Globus/SRB/Condor/Sun Grid Engine
What about other efforts? • Integration • DAS: Distributed Annotation System • ISYS: Integration of Desktop Tool • DiscoveryLink: wrapper and distributed query environment • GO: Gene Ontology etc… • Service discovery and common typing • BioMOBY: Integration of online biological databases and analysis services • Tackling parts of the problem. • myGrid is a framework for a platform.
Top 10 thoughts • Application driven by use cases • Open Source • Data object types, APIs, protocols, ontologies have a longer life span than s/w • Components are useful – don’t have to buy into the whole shooting match. • Don’t reinvent the wheel • Get others to build services / applications • Lower barriers of entry • Keep it simple. • It’s distributed and global • One solution won’t work
The myGrid team • Carole Goble • Norman Paton • Brian Warboys • Stephen Pettifer • Alvaro Fernandes • Luc Moreau • Dave De Roure • Chris Greenhalgh • Tom Rodden • John Brooke • Paul Watson • Alan Robinson • Rob Gaizauskas • Robert Stevens • Ian Horrocks • Neil Wipat • Matthew Addis • Nick Sharman • Rich Cawley • Simon Harper • Karon Mee • Simon Miles • (Vijay Dailani) • Xiaojian Liu • Tom Oinn • Martin Senger • Milena Radenkovic • Kevin Glover • (Angus Roberts) • Chris Wroe • Mark Greenwood • Phil Lord • Neil Davis • Darren Marvin • Justin Ferris • Peter Li • Nedim Alpdemir • Luca Toldo • Robin McEntire • Anne Westcott • Tony Storey • Bernard Horan • Paul Smart • Robert Haynes