myGrid Grid Services for Data-Intensive Bioinformatics Alvaro A A Fernandes Department of Computer Science University of Manchester, UK OiBC 2002, Washington, DC, USA, 19 November 2002
outline • myGrid: an e-science project • distributed queries in myGrid
acknowledgements

• most of the slides used in this presentation were originally designed by members of the myGrid and Polar* teams (on which, see later) other than me (Alvaro A A Fernandes)
• my stated authorship of this talk should therefore be qualified in light of this fact

why grids?

• large-scale science and engineering requires the dynamic, on-demand interaction of people, heterogeneous data and computing resources, and instruments, all of which are geographically and organizationally dispersed
• a grid aims to facilitate such interactions so as to make them routine, thereby supporting large-scale science and engineering
what is a grid? • resource sharing and coordinated problem solving in dynamic, multi-institutional virtual organizations • on-demand, ubiquitous access to data and computing services • new capabilities constructed dynamically and transparently from distributed services • no central location, no central control, no existing trust relationships, minimal predetermination
grids are evolving

• classical grids run as low-level middleware for discovery, certification, allocation, sharing, transport, etc., of primary resources
• the latest grids are service-based, stacked as: grid protocols → web services → open grid services architecture → myGrid

myGrid as a project

• one of 6 pilot projects funded by the UK EPSRC in the first round of the UK DTI e-science programme
• running Jan 02 to Mar 05 with 16 post-docs, joined by 9 PhD students from Mar 03
• a collaboration of researchers from the universities of Manchester, Southampton, Newcastle, Nottingham and Sheffield, and EMBL-EBI (Hinxton)
• with industrial partners
myGrid • users, aims and outcomes • concept: an e-science workbench • role: an open, extensible, standards-compliant upper-middleware • functionality: an ontology-driven orchestration tool for integrating data and processes • architecture: a collection of Grid services
myGrid users

• biologists
• bioinformaticians
• IS specialists
• systems administrators
• infrequent tool builders
• bioinformatics tool builders
• problem-specific service providers
myGrid aims • to facilitate personalized discovery, interoperation, integration and sharing of knowledge, materials and methods • to enable coordination of experimental steps into distributed workflows and distributed queries • to improve quality control of in silico experiments through provenance and notification of change
myGrid outcomes • for e-scientists: • a grid-enabled workbench • applications (e.g., model organism gene expression analysis, GPCR fingerprints database annotation, etc.) • for developers: a myGrid-in-a-box kit • for the specification of services • over service ontologies and models • implemented APIs and protocols • linking in existing integration platforms for the life sciences
myGrid as an e-science workbench

• a personalized collaborative problem-solving platform for scientists interoperating in a heterogeneous, distributed environment, where they can
  • specify, as well as discover, re-use, adapt, evolve, and ultimately perform long-lived in silico experiments
  • then publish findings in shared repositories, with appropriate records of provenance and recency of the tools and data used

myGrid as upper-middleware

• myGrid is an extensible open platform for data and tools interoperability
• in the form of upper-middleware embedded in the Open Grid Services Architecture
• inspired by technologies meant for Web Services and the Semantic Web

myGrid as software functionality

• ontology-driven orchestration of access and integration tasks for heterogeneous, autonomous, distributed processes and data resources
myGrid layered services

• presentation services: user agent, portal, custom applications, collaboration support, management tools, client framework
• coordination services (the semantic aspect): semantic data integration, provenance and validation, semantic workflow design, information extraction, semantic discovery, metadata and ontology services, distributed query processing, workflow enactment, syntactic discovery, event notification, authentication and authorization
• networked services: personal repository, database services, device access, task enactment, white/yellow pages discovery
• distributed resources (biology and informatics): repositories, workflows, data and tools, spanning many algorithms, many implementations, many service providers, and many types and representations

• semantic description and discovery: scientists can search for workflows and services semantically, using ontologies interpreted by description logics
• text extraction: allows the scientist to query information from semi-structured and unstructured data sources
• workflows: model in silico experiments that can evolve, be published and replayed over the evolving underlying resources
• personal repository: allows the scientist to save, search and share queries and data, workflows and outcomes, along with provenance data and notification needs
• distributed queries: allow the scientist to access, query and integrate heterogeneous, autonomous, distributed data sources
• provenance: workflow and query results can be annotated with provenance information to give some indication of value
• event notification: informs scientists and applications of database updates and changes in service status
myGrid phased releases

• kick-off meeting: Nov 01
• IF-1 (Apr 02): pre-prototype
• IF-2 (Oct 02): consolidation & architecture
• IF-3 (Feb 03): prototype
• IF-4 (June 03): demonstrator
• IF-5 (Oct 03): pre-release 1.0
• release 1.0 to follow
myGrid 0.0, 10.02

• client framework: portal, service selection client, repository client, workflow client, ontology client
• personal repository and workflow repository (metadata)
• ontology server with DAML+OIL reasoner (FaCT)
• service type directory and service instance directory (metadata)
• workflow enactment, with matcher and ranker
• bioservices

distributed querying (DQ) in myGrid

• as a scalable, loosely-coupled approach to data integration
• as a grid service
• as a declarative approach to service orchestration
• based on the ODMG-compliant Polar parallel query engine, whose distributed successor is called Polar*
myGrid’s Polar* distributed query compiler

• the compiler is a pipeline: OQL query → parser → logical optimiser → physical optimiser → partitioner → scheduler → evaluators → query results
• parser, logical optimiser and physical optimiser form the single-node optimiser; partitioner and scheduler form the multi-node optimiser
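To make the pipeline concrete, here is a minimal Python sketch of the compilation stages as a function chain; the stage names follow the slide, but the stub bodies and data encodings are invented for illustration and are not Polar* code.

```python
# Illustrative stubs standing in for the real compiler stages.
def parse(oql_text):            return ("ast", oql_text)
def logical_optimise(plan):     return ("logical", plan)
def physical_optimise(plan):    return ("physical", plan)
def partition(plan):            return ("exchange", plan)     # adds data exchange
def schedule(plan, grid_nodes): return [(n, plan) for n in grid_nodes]

def compile_query(oql_text, grid_nodes):
    """OQL text in, node-allocated parallel sub-plans out."""
    ast      = parse(oql_text)              # parser
    logical  = logical_optimise(ast)        # logical optimiser   } single-node
    physical = physical_optimise(logical)   # physical optimiser  } optimisation
    parallel = partition(physical)          # partitioner         } multi-node
    return schedule(parallel, grid_nodes)   # scheduler           } optimisation
```

The scheduled sub-plans are then shipped to the evaluators, and results flow back to the client.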
distributed query processing (DQP)

• DQP involves a single query referencing data stored at multiple sites: no replication as in warehousing, no tight coupling as in federating
• the locations of the data may be transparent to the author of the query
• queries are written in ODMG OQL, which allows external calls, elegant nesting and unnesting of collections, and clean handling of many-valued relationships
a distributed query integrating data and tools

select p.proteinId, Blast(p.sequence)
from protein p, proteinTerm t
where t.termId = 'S92' and p.proteinId = t.proteinId

• two distributed database subqueries: proteinTerm goes to the GO Gene Ontology running as a mySQL DB; protein goes to the GIMS Genome Warehouse running as Polar, a parallel ODMG-compliant DBMS
• and one external tool access: Blast
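As a rough illustration of how the query decomposes, the sketch below separates the two per-source subqueries from the external Blast call; the fetch and blast functions are hypothetical stubs, not myGrid APIs.

```python
def fetch_protein_terms():   # subquery shipped to the GO mySQL database
    return [{"termId": "S92", "proteinId": 7}]

def fetch_proteins():        # subquery shipped to the GIMS/Polar warehouse
    return [{"proteinId": 7, "sequence": "MSTAVLENPGLGRKLSD"}]

def blast(sequence):         # external tool access via an operation call
    return f"<hits for {sequence[:6]}...>"

# join the two subquery results on proteinId, then call Blast per row
matching = {t["proteinId"] for t in fetch_protein_terms() if t["termId"] == "S92"}
results = [(p["proteinId"], blast(p["sequence"]))
           for p in fetch_proteins() if p["proteinId"] in matching]
```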
logical optimisation

• the plan is expressed in a logical algebra
• multiple equivalent plans are generated by heuristic-based application of equivalence laws

reduce
  op_call (Blast)
    join (proteinId)
      reduce
        scan (protein)
      reduce
        scan [termID=S92] (proteinTerm)
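A minimal sketch of one such equivalence law, pushing a selection below a join when its predicate mentions only one input; the tuple-based plan encoding is an assumption for illustration, not the Polar* logical algebra.

```python
def mentions_only(pred, subplan):
    # pred = (relation, condition); subplan = ("scan", relation)
    return subplan[0] == "scan" and pred[0] == subplan[1]

def push_selection(plan):
    # select(p, join(l, r)) == join(select(p, l), r)  when p mentions only l
    if plan[0] == "select" and plan[2][0] == "join":
        pred, (_, left, right) = plan[1], plan[2]
        if mentions_only(pred, left):
            return ("join", ("select", pred, left), right)
        if mentions_only(pred, right):
            return ("join", left, ("select", pred, right))
    return plan

plan = ("select", ("proteinTerm", "termId = 'S92'"),
        ("join", ("scan", "proteinTerm"), ("scan", "protein")))
print(push_selection(plan))   # the selection moves onto the proteinTerm scan
```

The optimiser applies laws like this one repeatedly, keeping the set of equivalent plans for the next phase to rank.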
physical optimisation

• the plan is expressed in a physical algebra: logical operators are replaced with physical operators
• by cost-based ranking of the equivalent plans, a probably, acceptably efficient execution plan is chosen

reduce
  op_call (Blast)
    hash_join (proteinId)
      reduce
        table_scan (protein)
      reduce
        table_scan [termID=S92] (proteinTerm)
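A toy sketch of cost-based ranking: each equivalent physical plan gets an estimated cost and the cheapest wins. The per-operator cost figures are invented; a real optimiser would derive them from cardinality and resource statistics.

```python
COSTS = {"table_scan": 1.0, "hash_join": 2.0, "nested_loop_join": 8.0,
         "reduce": 0.1, "op_call": 5.0}

def estimate_cost(plan):
    # cost of an operator plus the cost of everything beneath it
    op, *children = plan
    return COSTS.get(op, 0.0) + sum(estimate_cost(c) for c in children)

candidates = [
    ("hash_join", ("table_scan",), ("table_scan",)),
    ("nested_loop_join", ("table_scan",), ("table_scan",)),
]
best = min(candidates, key=estimate_cost)   # hash_join wins here
```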
partitioning

• the plan is expressed in a parallel algebra: parallel algebra = physical algebra + data exchange
• exchange operators are placed where data movement between evaluator nodes is required

reduce
  op_call (Blast)
    exchange
      hash_join (proteinId)
        exchange
          reduce
            table_scan (protein)
        exchange
          reduce
            table_scan [termID=S92] (proteinTerm)
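The sketch below shows one way a partitioner might place exchanges: walk the physical plan and wrap any child whose evaluator site differs from its parent's. The site annotations and tuple encoding are invented for illustration.

```python
def add_exchanges(plan, parent_site):
    op, site, *children = plan
    rebuilt = (op, site) + tuple(add_exchanges(c, site) for c in children)
    if site != parent_site:               # data must cross node boundaries
        return ("exchange", parent_site, rebuilt)
    return rebuilt

physical = ("op_call_blast", "blast_nodes",
            ("hash_join", "join_nodes",
             ("table_scan", "gims_nodes"),
             ("table_scan", "go_node")))
parallel = add_exchanges(physical, "client")  # exchanges appear at each boundary
```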
scheduling

• the plan is expressed by decorating the parallel algebra expression with allocated Grid nodes
• using a heuristic algorithm based on memory use and network costs, an allocation of sub-plans to nodes is decided (see the sketch below)

reduce, op_call (Blast)                                 [nodes 4,5]
  exchange
    hash_join (proteinId)                               [nodes 3,6]
      exchange
        reduce, table_scan (protein)                    [nodes 2,3]
      exchange
        reduce, table_scan [termID=S92] (proteinTerm)   [node 6]
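A greedy sketch of such a heuristic: each exchange-delimited sub-plan is placed on the candidate node with enough free memory and the lowest network cost to its producers. The node capacities, memory figures and flat cost model are all invented for illustration.

```python
nodes = {2: 512, 3: 1024, 4: 2048, 5: 2048, 6: 512}   # free memory per node

def net_cost(a, b):
    return 0 if a == b else 1   # stub: real costs come from network metadata

def schedule(subplans):
    placement = {}
    for name, mem, producers in subplans:
        feasible = [n for n in nodes if nodes[n] >= mem]
        best = min(feasible,
                   key=lambda n: sum(net_cost(n, placement[p]) for p in producers))
        nodes[best] -= mem          # reserve the memory on the chosen node
        placement[name] = best
    return placement

print(schedule([("scan_protein", 256, []),
                ("scan_proteinTerm", 256, []),
                ("hash_join", 512, ["scan_protein", "scan_proteinTerm"]),
                ("op_call_blast", 1024, ["hash_join"])]))
```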
evaluation

• all algebraic operators are implemented using the iterator model
• the iterator model supports three standard operations: open(), next(), close()
• remote data sources and computational resources are accessed through iterator-based wrappers
• calls propagate through the tree, so many nodes may be active at any one time
• since the algebra includes an exchange operator, both partitioned and pipelined parallelism are supported
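A minimal sketch of the iterator model itself: every operator exposes the same open()/next()/close() interface, so plans compose as trees and tuples are pulled through on demand. The operator classes are illustrative, not the Polar* implementation.

```python
class TableScan:
    def __init__(self, rows): self.rows = rows
    def open(self):  self.it = iter(self.rows)
    def next(self):  return next(self.it, None)   # None signals end of stream
    def close(self): self.it = None

class Select:
    def __init__(self, pred, child): self.pred, self.child = pred, child
    def open(self):  self.child.open()
    def next(self):
        # pull from the child until a row satisfies the predicate
        row = self.child.next()
        while row is not None and not self.pred(row):
            row = self.child.next()
        return row
    def close(self): self.child.close()

plan = Select(lambda r: r["termId"] == "S92",
              TableScan([{"termId": "S92", "proteinId": 7},
                         {"termId": "S93", "proteinId": 8}]))
plan.open()
while (row := plan.next()) is not None:
    print(row)
plan.close()
```

Because each next() call pulls just one tuple from below, several operators in the tree can be working at once, which is what makes pipelined parallelism fall out of the model.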
parallelisms

• partitioned parallelism: the same operator runs on several nodes over different partitions of the data, e.g. op_call (Blast) on nodes 4 and 5
• pipelined parallelism: operators linked by exchanges run concurrently, e.g. the table_scans and reduces feed the hash_join (nodes 3,6) while it executes
evaluation approach

• subqueries are sent to nodes (communication)
• subqueries are installed on nodes (execution)
• subqueries are executed as iterators (execution)
• exchange moves data at runtime (communication)
• communication uses MPICH-G
evaluation on the grid

• the current version of the myGrid Polar* evaluator is implemented using MPICH-G
• a parallel query runs as a parallel MPI program over distributed machines
• MPICH-G shields the Polar* evaluator from direct use of low-level Grid (Globus) complexities
• the metadata repository knows where the resources named in queries reside and where the query evaluators are located
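For flavour, here is a minimal sketch of an exchange moving tuples between two evaluator processes over MPI, written with mpi4py for readability rather than the MPICH-G C bindings Polar* actually used; run it as, e.g., `mpiexec -n 2 python exchange.py`.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # producer side of the exchange: stream tuples, then an end-of-stream marker
    for row in [{"proteinId": 7}, {"proteinId": 8}, None]:
        comm.send(row, dest=1, tag=0)
else:
    # consumer side: feeds received tuples into its local iterator tree
    while (row := comm.recv(source=0, tag=0)) is not None:
        print("received", row)
```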
future internal features of the myGrid Polar* DQP • an extended set of supported operators • more extensive and sophisticated use of system environment information in query optimisation • better and more comprehensive empirically-validated cost models (e.g., economic costs will be considered)
future external features of the myGrid Polar* DQP • OGSA-compliance • access to data sources using Grid data service interfaces • the DQP also becomes a Grid service • use of adaptive query processing technologies to cope resiliently and robustly with violation of query optimisation assumptions
synergy between DQP and grids

• grids benefit from DQP: grids gain a declarative, high-level programming model with implicit parallelism, and DQP-based access and integration should in principle run faster than manually coded equivalents
• DQP benefits from grids: DQP gains systematic access to remote data and computational resources, with dynamic resource discovery and allocation
the myGrid/Polar* teams • Carole Goble • Norman W Paton • Brian Warboys • Stephen Pettifer • Alvaro A A Fernandes • Robert Stevens • Ian Horrocks • John Brooke • Luc Moreau • Dave De Roure • Chris Greenhalgh • Tom Rodden • Paul Watson • Anil Wipat • Rob Gaizauskas • Alan Robinson • Martin Senger • Tom Oinn • Matthew Addis • Simon Miles • (Vijay Dialani) • Xiaojian Liu • Milena Radenkovic • Kevin Glover • (Angus Roberts) • Chris Wroe • Mark Greenwood • Phil Lord • Nick Sharman • Rich Cawley • Simon Harper • Karon Mee • M Nedim Alpdemir • Darren Marvin • Justin Ferris • Peter Li • Neil Davis • Luca Toldo • Robin McEntire • Anne Westcott • Tony Storey • Bernard Horan • Paul Smart • Robert Haynes • Jim Smith • Arijit Mukherjee • Anastasios Gounaris • Rizos Sakellariou
further information • on myGrid • website:http://www.mygrid.org.uk/ • e-mail: [mygrid@cs.man.ac.uk] • Prof Carole Goble (project director) [Carole.Goble@cs.man.ac.uk] • Nick Sharman (project manager) [Nick.Sharman@cs.man.ac.uk] • on the myGrid Polar* distributed query engine • http://www.ncl.ac.uk/polarstar • Dr Paul Watson [Paul.Watson@ncl.ac.uk] • Prof Norman W Paton [norm@cs.man.ac.uk] • Dr Alvaro A A Fernandes [alvaro@cs.man.ac.uk]