Grids and Biology. Professor Carole Goble, University of Manchester, UK. BBSRC Bioinformatics and eScience Grant Holders Workshop, Warwick, UK, 28th October 2002.
Grids and Biology • A take on the Grid • Issues in Bioinformatics for Grid • Various BioGrids • Applicability of Grid to Biology • Reality check
What is the Grid? “Grid computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation ... we review the ‘Grid problem’, which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources - what we refer to as virtual organizations.” From “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” by Foster, Kesselman and Tuecke
What is the Grid? • Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations • On-demand, ubiquitous access to computing, data, and services • New capabilities constructed dynamically and transparently from distributed services • No central location, no central control, no existing trust relationships, little predetermination • Uniformity for pooling resources • Virtual pools of resources: databases, clusters …
Biology as a Grid Application • Informational Science • Large Scale • Distributed • No one organisation owns it all
Motivation [Figure: timeline from 1990 to 2010 comparing Computational Load and Genome Data with Moore's Law; labelled drivers: ESTs, Human Genome, Metabolic Pathways, Pharmacogenomics, Combinatorial Chemistry]
BioMedical Computation [Rick Stevens, Argonne Labs]
Biomedical Data: High Complexity and Large Scale [Rick Stevens, Argonne Labs] [Figure labels: DNA sequences (...atcgaattccaggcgtcacattctcaattcca...) and alignments; protein sequence (MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...), 2º structure, 3º structure, 4º structure; protein-protein interactions, metabolism, pathways, receptor-ligand; ESTs, expression patterns, large-scale screens; genetics and maps (linkage, cytogenetic, clone-based); polymorphism and variants (genetic variants, individual patients, epidemiology); physiology, cellular biology, biochemistry, neurobiology, endocrinology, etc. Scale annotations: hundred thousands, millions, billions]
BioGrid Projects • EUROGRID BioGRID • Asia Pacific BioGRID • North Carolina BioGrid • Bioinformatics Research Network • Osaka University BioGrid • Indiana University BioArchive BioGrid • myGrid • BioSim • e-Protein • ObiGrid
Today's Grid: A Single System Image • Transparent wide-area access to large data banks • Transparent wide-area access to applications on heterogeneous platforms • Transparent wide-area access to processing resources • Security, certification, single sign-on authentication, AAA: Grid Security Infrastructure • Data access, transfer & replication: GridFTP, Giggle • Computational resource discovery, allocation and process creation: GRAM, Unicore, Condor-G
Immediate benefits • Uniform file views of directories, regardless of platform • Grid-based data transfer libraries for faster access to large files, reducing the need for mirror-site servers • Replication to support mirroring • Grid APIs give a job manager metadata about services, so users can evaluate service providers on factors beyond server performance and availability • Grid-aware applications: split sequence reference libraries among several servers, where BLAST comparisons can be conducted in parallel • Shielding from a variety of low-level computing problems that users would otherwise have to address themselves
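The "split the reference library, search in parallel" idea can be sketched in a few lines. This is a minimal illustration only: the shared k-mer count is a toy stand-in for a real alignment score and is not the BLAST algorithm, the shards stand for hypothetical servers (simulated here with threads), and the function names and the sequence data are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(library, n_shards):
    """Split a dict of {seq_id: sequence} into n_shards roughly equal dicts."""
    shards = [{} for _ in range(n_shards)]
    for i, (seq_id, seq) in enumerate(sorted(library.items())):
        shards[i % n_shards][seq_id] = seq
    return shards


def kmer_score(query, subject, k=3):
    """Toy similarity: number of distinct k-mers present in both sequences.

    A stand-in for a real alignment score, NOT the BLAST algorithm.
    """
    q = {query[i:i + k] for i in range(len(query) - k + 1)}
    s = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return len(q & s)


def search_shard(query, shard_db):
    """Score the query against every sequence held by one (hypothetical) server."""
    return [(kmer_score(query, seq), seq_id) for seq_id, seq in shard_db.items()]


def parallel_search(query, library, n_shards=3, top=2):
    """Search all shards concurrently, then merge the per-shard hit lists."""
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partial = list(pool.map(lambda db: search_shard(query, db),
                                shard(library, n_shards)))
    hits = [hit for part in partial for hit in part]
    # Highest score first; ties broken by sequence id for a stable order.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits[:top]


# Invented toy library; a real reference library would hold many more entries.
LIBRARY = {
    "seqA": "MPMILGYWDIRGLAHAIRLL",
    "seqB": "MKVLAAGIVGLLLAQ",
    "seqC": "GYWDIRGLAHAI",
    "seqD": "TTTTTTTTTT",
}

best_hits = parallel_search("ILGYWDIRGLAH", LIBRARY)
```

The merge step is the part that matters: each shard returns its own hits, and only the combined, re-sorted list is meaningful to the user.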
Grid Landscape • Computationally Intensive • Collaborative Visualisation • Data Intensive • Knowledge Intensive
Classical Grids • Classical Grids emphasise sharing of physical resources. • Existing Grid middleware (e.g. Globus, Condor, Unicore) allows resource discovery, resource allocation, data movement, certification …
High Performance Bioinformatics Software [Jack da Silva, NCSC, Paracel]
Access portal for biomolecular modeling resources • Interfaces that let chemists and biologists submit work to HPC facilities • Visualization of the electrostatic field generated by a molecule [Dr Krzysztof Nowinski, ICM]
Biogrid system [Diagram labels: two SCORE Management Stations; Grid system 1: Express5800/ISS for PC-Cluster, Xeon 2.2G x 8 + management node 1, Flat Neighborhood networks, 1000Base-SX; Grid system 2: NEC Blade Server, 78 nodes (156 CPUs), 1000Base-T x 12; Myrinet-2000, connected to Grid system 3; Data Grid Disk: Express5800/140Ra-4 x 3]
Remote control of instruments • Sharing of the UHVEM (Ultra High Voltage Electron Microscopy) at Osaka University with NCMIR (National Center for Microscopy and Imaging Research) • 3 million electron volts • The most powerful microscope [Network diagram labels: UHVEM (Osaka, Japan); Osaka University; Tokyo XP; JGN; TransPAC; APAN; STAR TAP (Chicago); vBNS; SDSC (UC San Diego); NCMIR (San Diego)]
Home Computers Evaluate AIDS Drugs • Community = • 1000s of home computer users • Philanthropic computing vendor (Entropia) • Research group (Scripps) • Common goal = advance AIDS research [From Steve Tuecke, 12 Oct. 01]
Matlab • Geodise release in November 02 (sjc@soton.ac.uk) • Matlab and toolboxes for mathematical computation, analysis, visualization, and algorithm development: “MATLAB is an intuitive language and a technical computing environment. It provides core mathematics and advanced graphical tools for data analysis, visualization, and algorithm and application development. With more than 600 mathematical, statistical, and engineering functions, engineers and scientists rely on the MATLAB environment for their technical computing needs.” (www.mathworks.com) • Cross platform / OS
BioSim: Molecular simulations as a tool for protein structure analysis [Sansom] • Overall vision: simulation as an integral component of structural genomics • Needs both capacity (many systems) and capability (large systems - HPCx) • Molecular Dynamics database (distributed) [Diagram labels: synchrotron; compute GRID; MD database; novel biology …]
Grid Landscape • Computationally Intensive • Collaborative Visualisation • Data Intensive • Knowledge Intensive
Visualization + Bioinformatics [Rick Stevens, Argonne Labs] [Figure labels: Visualization Environment; Bioinformatic Analysis Tools; Microbiology & Biochemistry; Genome Visualization Tools; Function Assignment; Whole Genome Analysis; Metabolic Reconstruction; Enzymatic Constants; Metabolic Network Visualization Tools; Stoichiometric Representation & Flux Analysis; Proteomics; Interactive Stoichiometric Graphical Tools; Dynamic Simulation; Whole Cell Visualizations; Image/Spectra Augmentations; Laboratory Verification]
X-ray microtomography • Scientific discovery can be enhanced by closely coupling computation and experiment. Simulation, visualization and data gathering coupled • X-ray microtomography produces 3D X-ray attenuation maps of specimens at a microscopic level • Expensive synchrotron beam time resources optimally used to obtain sufficient resolution for simulation
Interactive Steering • User steers calculation from laptop • Controlled steering on supercomputers • Visualization and computation use large scale machines accessed via Grid. Enables controlled simulation using knowledge and skills of trained scientist.
Scalable molecular dynamics • Structure of a protein in a fluid medium • Calculation takes into account forces between the protein and the ambient medium (in this case water molecules) • Run on the world's largest academic computer, LeMieux at PSC (6 Tflops theoretical peak)
Grid Landscape • Computationally Intensive • Collaborative Visualisation • Data Intensive • Knowledge Intensive
UCSF, UIUC [From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign]
Grid Landscape: DATA!! • Computationally Intensive • Collaborative Visualisation • Data Intensive • Knowledge Intensive
Information Weaving and Question Answering • Large amounts of different kinds of data & many applications. • Highly heterogeneous. • Different types, algorithms, forms, implementations, communities, service providers • High autonomy. • Highly complex and inter-related, & volatile.
Annotation Pipeline [Mike Sternberg] [Flow diagram labels: proteome sequences; databases SCOP, CATH, PDB, NRPROT, INTERPRO; TM, CC, LC, SIG & MOTIFS; PSIBLAST & HMMs; PDB hit: 3D modelling x 2; no PDB hit: fold recognition x 2; structure-based function prediction; structural and functional annotation]
myGrid • Personalised extensible environments for data-intensive in silico experiments in biology • Straightforward discovery, interoperation, deployment & sharing of services • Service-oriented architecture • Integration and Information • Workflow & Databases • Experimentation • Provenance, propagating change, personalisation • For bioinformaticians who are building tools and using or providing services [Figure label: RASMOL]
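To make "workflow" and "provenance" concrete, here is a minimal sketch of a workflow runner that records, for each step, fingerprints of what went in and what came out. It is not myGrid's design or API: the function names, the record layout and the two toy "services" are assumptions made for illustration.

```python
import hashlib
import json


def digest(value):
    """Short, stable fingerprint of a JSON-serialisable value."""
    blob = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def run_workflow(steps, data):
    """Apply each (name, function) step in order and log provenance records."""
    provenance = []
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input": digest(data),
            "output": digest(result),
        })
        data = result
    return data, provenance


# Hypothetical services for an in silico experiment (toy stand-ins).
def transcribe(dna):
    """Upper-case a DNA string and rewrite T as U."""
    return dna.upper().replace("T", "U")


def count_bases(rna):
    """Count each of the four RNA bases."""
    return {base: rna.count(base) for base in "ACGU"}


STEPS = [("transcribe", transcribe), ("count_bases", count_bases)]
result, log = run_workflow(STEPS, "atcgaattcc")
```

Because each record's output fingerprint equals the next record's input fingerprint, the log forms a chain: if an upstream input changes, every downstream record that depended on it can be identified and re-run, which is one simple reading of "propagating change".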
DiscoveryNet (http://www.discovery-on-the.net/) • Bio Chip Applications: protein-folding chips (SNP chips, Diff. Gene chips using LFII); protein-based fluorescent micro arrays [scale labels: 1-1000, 10-1000, >10000] • High Throughput Sensing (HTS) Applications: large-scale dynamic real-time decision support; large-scale dynamic system knowledge discovery • Grid-based Knowledge Discovery (based on Kensington Discovery Platform): Grid-based data mining, collaborative visualisation • Information Structuring: information integration & composition, semantics & domain-based ontologies, sharing • Distributed Data Engineering: data registration, data normalisation, data quality • High Throughput Computing Services (based on Globus & ORB infrastructure): utilising Grid infrastructure for HT computing • Grid Basic Infrastructure: Globus/Condor/SRB [Other diagram labels: Data Quality; Visualisation; Structuring; Clustering; Distributed Dynamic Knowledge Management]
Grid Evolution • 1st Generation Grid: computationally intensive, file access/transfer; bag of various heterogeneous protocols & toolkits; recognises internet, ignores Web; academic teams • 2nd Generation Grid: data intensive -> knowledge intensive; services-based architecture; recognises Web and Web services; Global Grid Forum; industry participation. We are here!
A Grid vs The Grid • A Grid of resources: not just compute resources but databases, digital libraries, instruments, workflows, documents … • These configurations are dynamic: resources are discovered, combined, used and disbanded as and when needed or available. [Diagram: logical grids (NovartisGrid, BioSimGrid, MouseGrid) over Grid middleware, over physical nodes on a gigabit IP network, grouped geographically (e.g. UKGrid)]
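The "discovered, combined, used and disbanded" life cycle can be sketched as a small registry. This is an illustrative toy, not any Grid middleware's API: the class, its methods and the resource names are all hypothetical, and a real system would add security, scheduling and failure handling.

```python
class Registry:
    """Toy directory of resources that independent sites choose to advertise."""

    def __init__(self):
        self._resources = {}  # resource name -> (kind, owner)
        self._leases = {}     # resource name -> virtual grid currently using it

    def advertise(self, name, kind, owner):
        """A site makes one of its resources visible to others."""
        self._resources[name] = (kind, owner)

    def discover(self, kind):
        """Names of currently unleased resources of the requested kind."""
        return sorted(name for name, (k, _) in self._resources.items()
                      if k == kind and name not in self._leases)

    def form(self, grid_name, kinds):
        """Lease one free resource per requested kind for a virtual grid."""
        chosen = []
        for kind in kinds:
            free = [n for n in self.discover(kind) if n not in chosen]
            if not free:
                raise LookupError(f"no free resource of kind {kind!r}")
            chosen.append(free[0])
        # Lease only after every request is satisfied, so a failure leaves
        # nothing half-allocated.
        for name in chosen:
            self._leases[name] = grid_name
        return chosen

    def disband(self, grid_name):
        """Release everything leased by grid_name."""
        for name in [n for n, g in self._leases.items() if g == grid_name]:
            del self._leases[name]


# Hypothetical resources and owners, for illustration only.
registry = Registry()
registry.advertise("cluster-1", "compute", "site-a")
registry.advertise("md-db-1", "database", "site-b")
registry.advertise("microscope-1", "instrument", "site-c")
members = registry.form("BioSimGrid", ["compute", "database"])
```

Calling `registry.disband("BioSimGrid")` afterwards returns both resources to the free pool, so another logical grid can be formed from the same physical nodes.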
A configuration of resources/services • Not just compute services but databases, digital libraries, instruments, workflows, documents … [Diagram: Open Grid Service Architecture (OGSA): Grid Services bring together Web Services and Grid Technology]
Bio Services (layered service stack) • Domain Oriented Services: Drug Discovery, Microbial Engineering, Molecular Ecology, Oncology Research • Basic BioGrid Services: Integrated Databases, Sequence Analysis, Protein Interactions, Cell Simulation • Grid Resource Services: Compute Services, Pipeline Services, Data Archive Service, Database Hosting, Workflow Enactment, Event notification • Lower layers: Common Services, Base Services, Fabric Services
What We Need to Create • Grid Bio applications enablement software layer • Provides applications with access to Grid services • Provides OS-independent services • Grid-enabled versions of bioinformatics data management tools (e.g. DL, SRS, etc.) • Need to support virtual databases via Grid services • Grid support for commercial databases • Bioinformatics applications “plug-in” modules • End-user tools for a variety of domains • Support major existing Bio IT platforms
Requirements for the BioGrid • Open and extendable architecture • Enable tie-in to the service stack at appropriate points • Not just access via Portals • Leverage scripting tools in wide use for Bioinformatics • Create BioGrid services bindings for Perl and Python • Address data federation and integration • Leverage work of IBM, Lion BioSciences, DAS, BioMOBY, etc. • Match the biology workflow and tool chain • Create high-level BioGrid services to address critical stages in existing workflow • Support composability of new BioGrid tools with existing tool chain elements
Some BioGrid Challenges • Scalable human bioinformatics expertise • Best people working on the important problems • Exploit collaboration technology to create world class teams • Robust local bioinformatics computing environment • Best systems administrators and high-end technologies • Embed local resources into the Grid via portal technologies • Access to leading edge bioinformatics software and databases customized to user needs • Core content from top scientists and developers • Integrated access to biological databases • Worldwide access to robust computing and database infrastructure • Leverage Grid technology to provide worldwide access • Integrate purpose built systems and service providers
Reality Checks!! • “The Technology is Ready” • Not true: it's emerging • Building middleware, advancing standards, developing dependability • Building demonstrators • The computational grid is in advance of the data-intensive middleware • Integration and curation are probably the obstacles • But!! It doesn't have to be all there to be useful. • “We know how we will use grid services” • No: disruptive technology • Lower the barriers of entry.
Reality Checks!! • “It's the only game” • Not true: I3C, BioMOBY, bioDAS, OMG LSR • Grid and Web service merge makes integration likely. • “One Size Fits All” • Not true • Addressed by a minimum set of composable virtual services, but starting with Globus • “It's only for ‘big’ science” • No: “small” science collaborates too! • “Biology is unique” • Biology is not unique! • AstroGrid
Not a silver bullet! It's just middleware, not magic • Data quality • Content management of databases (controlled vocabularies) • Provenance and versioning policies • Appropriate use of tools • Computational inaccessibility of free-text annotation • Database accessibility through means other than point-and-click web interfaces • These issues are independent of the Grid!
Life Sciences Grid (LSG) http://people.cs.uchicago.edu/~dangulo/LSG/