
Grids and Biology

Grids and Biology. Professor Carole Goble, University of Manchester, UK. BBSRC Bioinformatics and eScience Grant Holders Workshop, Warwick, UK, 28th October 2002. Topics: a take on the Grid; issues in bioinformatics for the Grid; various BioGrids; applicability of the Grid to biology.


Presentation Transcript


  1. Grids and Biology Professor Carole Goble University of Manchester, UK BBSRC Bioinformatics and eScience Grant Holders Workshop, Warwick, UK 28th October 2002

  2. Grids and Biology A take on the Grid Issues in Bioinformatics for Grid Various BioGrids Applicability of Grid to Biology Reality check

  3. What is the Grid? “Grid computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation ... we review the "Grid problem", which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources - what we refer to as virtual organizations.” From "The Anatomy of the Grid: Enabling Scalable Virtual Organizations" by Foster, Kesselman and Tuecke

  4. What is the Grid? • Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations • On-demand, ubiquitous access to computing, data, and services • New capabilities constructed dynamically and transparently from distributed services • No central location, no central control, no existing trust relationships, little predetermination • Uniformity for pooling resources • Virtual pools of resources: databases, clusters…

  5. Biology as a Grid Application • Informational Science • Large Scale • Distributed • No one organisation owns it all

  6. Motivation [Figure with a 1990 / 2000 / 2010 time axis; labels: ESTs, Human Genome, Metabolic Pathways, Pharmacogenomics, Combinatorial Chemistry, Computational Load, Genome Data, Moore's Law]

  7. BioMedical Computation [Rick Stevens, Argonne Labs]

  8. Biomedical Data: High Complexity and Large Scale [Rick Stevens, Argonne Labs] [Figure labels: Proteins (sequence, 2º structure, 3º structure); DNA (sequences, alignments); Protein-Protein Interactions (metabolism, pathways, receptor-ligand, 4º structure); Physiology (Cellular biology, Biochemistry, Neurobiology, Endocrinology, etc.); Polymorphism and Variants (genetic variants, individual patients, epidemiology); ESTs (Expression patterns, Large-scale screens); Genetics and Maps (Linkage, Cytogenetic, Clone-based); example sequences MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT... and ...atcgaattccaggcgtcacattctcaattcca...; scale annotations ranging from hundred thousands through millions to billions]

  9. BioGrid Projects • EUROGRID BioGRID • Asia Pacific BioGRID • North Carolina BioGrid • Bioinformatics Research Network • Osaka University BioGrid • Indiana University BioArchive BioGrid • myGrid • BioSim • e-Protein • ObiGrid

  10. Today's Grid: A Single System Image • Transparent wide-area access to large data banks • Transparent wide-area access to applications on heterogeneous platforms • Transparent wide-area access to processing resources • Security, certification, single sign-on authentication, AAA: Grid Security Infrastructure • Data access, transfer & replication: GridFTP, Giggle • Computational resource discovery, allocation and process creation: GRAM, Unicore, Condor-G

  11. Immediate benefits • Uniform file views of directories, regardless of platform • Grid-based data transfer libraries for faster access to large files, reducing the need for mirror-site servers • Replication to support mirroring • Grid APIs provide a job manager with metadata about services, letting the user evaluate the quality of service providers based on factors that may include more than just server performance and availability • Grid-aware applications: split sequence reference libraries among several servers, where BLAST comparisons can be conducted in parallel • Shielding from a variety of low-level computing problems that users would otherwise have to address themselves
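
The "split the reference library, compare in parallel" idea from slide 11 can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the shards live in one process and are searched by threads rather than by separate Grid servers, and `search_shard` is a toy shared-k-mer count standing in for BLAST. The names `split_library`, `search_shard`, `parallel_search`, and `LIBRARY` are hypothetical and do not come from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def split_library(library, n_shards):
    """Distribute reference sequences round-robin into n_shards lists."""
    shards = [[] for _ in range(n_shards)]
    for i, entry in enumerate(library):
        shards[i % n_shards].append(entry)
    return shards


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_shard(query, shard, k=3):
    """Toy stand-in for BLAST (assumption): score = number of shared k-mers."""
    q = kmers(query, k)
    return [(name, len(q & kmers(seq, k))) for name, seq in shard]


def parallel_search(query, library, n_shards=3, top=2):
    """Search every shard concurrently, then merge and rank the hits."""
    shards = split_library(library, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partial = list(pool.map(lambda s: search_shard(query, s), shards))
    hits = [hit for part in partial for hit in part]
    # Highest score first; ties broken by name so the order is deterministic.
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits[:top]


# Made-up reference library for the example.
LIBRARY = [
    ("seqA", "ATCGAATTCCAGG"),
    ("seqB", "GGGGGGGGGGGGG"),
    ("seqC", "TTCCAGGCGTCAC"),
    ("seqD", "ATCGAATTAAAAA"),
]

if __name__ == "__main__":
    print(parallel_search("ATCGAATTCC", LIBRARY))
```

Because each shard is scored independently and the merge step only sorts, the ranking does not depend on how many shards are used; that independence is what makes the workload easy to spread across servers.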

  12. Grid Landscape Computationally Intensive Collaborative Visualisation Data Intensive Knowledge Intensive

  13. Grid Landscape Computationally Intensive Collaborative Visualisation Data Intensive Knowledge Intensive

  14. Classical Grids • Classical Grids emphasise sharing of physical resources. • Existing Grid middleware (e.g. Globus, Condor, Unicore) allows resource discovery, resource allocation, data movement, certification …

  15. High Performance Bioinformatics Software [Jack da Silva, NCSC, Paracel]

  16. European DataGrid

  17. Managed access to specialist remote resources

  18. Access portal for biomolecular modeling resources. • Interfaces that let chemists and biologists submit work to HPC facilities • Visualization of the electrostatic field generated by a molecule. Dr Krzysztof Nowinski (ICM)

  19. Biogrid system [Figure labels: SCORE Management Station (two); Myrinet-2000; Connected to Grid system 3; Grid system 1; Express5800/ISS for PC-Cluster; Xeon 2.2G x 8 + Management node 1; Flat Neighborhood networks; 1000Base-SX; Grid system 2; NEC Blade Server, 78 nodes (156 CPUs); 1000Base-T x 12; Data Grid Disk; Express5800/140Ra-4 x 3]

  20. Remote control of instruments • Sharing of the UHVEM (Ultra High Voltage Electron Microscopy) at Osaka University with NCMIR (National Center for Microscopy and Imaging Research) • 3 million electron volts • The most powerful microscope [Figure labels: UHVEM (Osaka, Japan); Osaka University; JGN; Tokyo XP; APAN; TransPAC; STAR TAP (Chicago); vBNS; SDSC (UC San Diego); NCMIR (San Diego)]

  21. Home Computers Evaluate AIDS Drugs • Community = • 1000s of home computer users • Philanthropic computing vendor (Entropia) • Research group (Scripps) • Common goal = advance AIDS research From Steve Tuecke 12 Oct. 01

  22. Matlab Geodise release in November 02 sjc@soton.ac.uk • Matlab and toolboxes for mathematical computation, analysis, visualization, and algorithm development: “MATLAB is an intuitive language and a technical computing environment. It provides core mathematics and advanced graphical tools for data analysis, visualization, and algorithm and application development. With more than 600 mathematical, statistical, and engineering functions, engineers and scientists rely on the MATLAB environment for their technical computing needs.” (www.mathworks.com) Cross platform/OS

  23. BioSim -- Molecular simulations as a tool for protein structure analysis [Sansom] • Overall vision: simulation as an integral component of structural genomics • Needs both capacity (many systems) and capability (large systems - HPCx) • Molecular Dynamics database (distributed) [Figure labels: synchrotron, compute GRID, MD database, novel biology…]

  24. Grid Landscape Computationally Intensive Collaborative Visualisation Data Intensive Knowledge Intensive

  25. [Rick Stevens Argonne Labs] Visualization + Bioinformatics Visualization Environment Bioinformatic Analysis Tools Microbiology & Biochemistry Genome Visualization Tools Function Assignment Whole Genome Analysis Metabolic Reconstruction Enzymatic Constants Metabolic *** Network Visualization Tools Stoichiometric Representation & Flux Analysis Proteomics Interactive Stoichiometric Graphical Tools Dynamic Simulation Whole Cell Visualizations Image/Spectra Augmentations Laboratory Verification

  26. X-ray microtomography • Scientific discovery can be enhanced by closely coupling computation and experiment. Simulation, visualization and data gathering coupled • X-ray microtomography produces 3D X-ray attenuation maps of specimens at a microscopic level • Expensive synchrotron beam time resources optimally used to obtain sufficient resolution for simulation

  27. Interactive Steering • User steers calculation from laptop • Controlled steering on supercomputers • Visualization and computation use large scale machines accessed via Grid. Enables controlled simulation using knowledge and skills of trained scientist.

  28. Scalable molecular dynamics • Structure of a protein in a fluid medium • Calculation takes into account forces between the protein and the ambient medium (in this case water molecules) • Run on the world's largest academic computer, LeMieux at PSC (6 Tflops theoretical peak)

  29. Grid Landscape Computationally Intensive Collaborative Visualisation Data Intensive Knowledge Intensive

  30. UCSF UIUC From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign

  31. http://www.ks.uiuc.edu/Research/biocore/

  32. Grid Landscape: DATA!! Computationally Intensive Collaborative Visualisation Data Intensive Knowledge Intensive

  33. Information Weaving and Question Answering • Large amounts of different kinds of data & many applications. • Highly heterogeneous. • Different types, algorithms, forms, implementations, communities, service providers • High autonomy. • Highly complex and inter-related, & volatile.

  34. Annotation Pipeline [Mike Sternberg] [Figure labels: proteome; sequences; SCOP, CATH, PDB, NRPROT, INTERPRO; TM, CC, LC, SIG & MOTIFS; PSIBLAST & HMMs; PDB hit / no PDB hit; 3D modelling x 2; fold recognition x 2; structure-based function prediction; structural and functional annotation]

  35. myGrid RASMOL • Personalised extensible environments for data-intensive in silico experiments in biology • Straightforward discovery, interoperation, deployment & sharing of services • Service-oriented architecture • Integration and Information • Workflow & Databases • Experimentation • Provenance, propagating change, personalisation For bioinformaticians who are building tools and using or providing services

  36. DiscoveryNet • Bio Chip Applications Protein-folding chips: SNP chips, Diff. Gene chips using LFII Protein-based fluorescent micro arrays 1-1000 10-1000 >10000 Data Quality Visualisation Structuring Clustering Distributed Dynamic Knowledge Management http://www.discovery-on-the.net/ High Throughput Sensing (HTS) Applications Large-scale Dynamic Real- time Decision support Large-scale Dynamic System Knowledge Discovery Based on Kensington Discovery Platform Grid-based Knowledge Discovery Grid-based Data Mining, Collaborative Visualisation Information Structuring Information Integration & Composition, Semantics & Domain-based Ontologies, Sharing Distributed Data Engineering Data Registration, Data Normalisation, Data Quality Based on Globus & ORB Infrastructure High Throughput Computing Services Utilising Grid Infrastructure for HT Computing Grid Basic Infrastructure Globus/Condor/SRB

  37. Grid Evolution • 1st Generation Grid • Computationally intensive, file access/transfer • Bag of various heterogeneous protocols & toolkits • Recognises internet, Ignores Web • Academic teams • 2nd Generation Grid • Data intensive -> knowledge intensive • Services-based architecture • Recognises Web and Web services • Global Grid Forum • Industry participation We are here!

  38. A Grid vs The Grid • A Grid of resources, not just compute resources but databases, digital libraries, instruments, workflows, documents … • These configurations are dynamic: resources are discovered, combined, used and disbanded as and when needed or available. [Figure labels: Logical (NovartisGrid, BioSimGrid, MouseGrid); Grid Middleware; Physical (Gigabit IP Network, Node, Node, Node, Node); Geographically (e.g. UKGrid)]
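
The discover, combine, use, and disband life cycle on slide 38 can be made concrete with a small Python sketch. It is a toy model under stated assumptions, not an account of any Grid middleware: `Resource`, `Registry`, `VirtualGrid`, and `assemble` are hypothetical names, and "claiming" a resource is reduced to flipping a flag in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    kind: str            # e.g. "compute", "database", "instrument"
    in_use: bool = False


class Registry:
    """Hypothetical directory of resources offered to the Grid."""

    def __init__(self, resources):
        self.resources = list(resources)

    def discover(self, kind):
        """Return the resources of the given kind that are currently free."""
        return [r for r in self.resources if r.kind == kind and not r.in_use]


@dataclass
class VirtualGrid:
    """A temporary, logical grouping of claimed resources."""
    name: str
    members: list = field(default_factory=list)

    def disband(self):
        """Release every member back to the shared pool."""
        for r in self.members:
            r.in_use = False
        self.members = []


def assemble(name, registry, kinds):
    """Discover and claim one free resource per requested kind."""
    grid = VirtualGrid(name)
    for kind in kinds:
        free = registry.discover(kind)
        if not free:
            grid.disband()  # do not leave partial claims behind
            raise LookupError(f"no free resource of kind {kind!r}")
        free[0].in_use = True
        grid.members.append(free[0])
    return grid


if __name__ == "__main__":
    registry = Registry([
        Resource("node1", "compute"),
        Resource("node2", "compute"),
        Resource("pdb-mirror", "database"),
    ])
    grid = assemble("BioSimGrid", registry, ["compute", "database"])
    print([r.name for r in grid.members])
    grid.disband()
```

The same physical pool can back several logical grids in turn, which is the distinction the slide draws between "a Grid" and "the Grid".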

  39. A configuration of resources services • Not just compute services but databases, digital libraries, instruments, workflows, documents … Open Grid Service Architecture OGSA Grid Services Web Services Grid Technology

  40. Bio Services • Drug Discovery • Microbial Engineering • Molecular Ecology • Oncology Research Domain Oriented Services • Integrated Databases • Sequence Analysis • Protein Interactions • Cell Simulation Basic BioGrid Services Grid Resource Services • Compute Services • Pipeline Services • Data Archive Service • Database Hosting • Workflow Enactment • Event notification Common Services Base Services Fabric Services

  41. What We Need to Create • Grid Bio applications enablement software layer • Provides applications with access to Grid services • Provides OS-independent services • Grid-enabled versions of bioinformatics data management tools (e.g. DL, SRS, etc.) • Need to support virtual databases via Grid services • Grid support for commercial databases • Bioinformatics applications “plug-in” modules • End user tools for a variety of domains • Support major existing Bio IT platforms

  42. Requirements for the BioGrid • Open and extendable architecture • Enable tie in to service stack at appropriate points • Not just access via Portals • Leverage scripting tools in wide use for Bioinformatics • Create BioGrid services bindings for Perl and Python • Address data federation and integration • Leverage work of IBM, Lion BioSciences, DAS, BioMOBY, etc. • Match the biology workflow and tool chain • Create high-level BioGrid services to address critical stages in existing workflow • Support composability of new BioGrid tools with existing tool chain elements

  43. Some BioGrid Challenges • Scalable human bioinformatics expertise • Best people working on the important problems • Exploit collaboration technology to create world class teams • Robust local bioinformatics computing environment • Best systems administrators and high-end technologies • Embed local resources into the Grid via portal technologies • Access to leading edge bioinformatics software and databases customized to user needs • Core content from top scientists and developers • Integrated access to biological databases • Worldwide access to robust computing and database infrastructure • Leverage Grid technology to provide worldwide access • Integrate purpose built systems and service providers

  44. Reality Checks!! • The Technology is Ready • Not true: it's emerging • Building middleware, Advancing Standards, Developing, Dependability • Building demonstrators. • The computational grid is in advance of the data intensive middleware • Integration and curation are probably the obstacles • But!! It doesn’t have to be all there to be useful. • We know how we will use grid services • No: Disruptive technology • Lower the barriers of entry.

  45. Reality Checks!! • It’s the only game • Not true — I3C, BioMOBY, bioDAS, OMG LSR • Grid and Web service merge makes integration likely. • One Size Fits All • Not true • Addressed by a minimum set of composable virtual services, But starting with Globus • It’s only for “big” science • No — “small” science collaborates too! • Biology is not unique! • AstroGrid

  46. Not a silver bullet! It's just middleware, not magic • Data quality • Content management of databases (controlled vocabularies) • Provenance and versioning policies • Appropriate use of tools • Computational inaccessibility of free text annotation • Database accessibility through means other than point and click web interfaces. Independent of the Grid!

  47. Life Sciences Grid (LSG) http://people.cs.uchicago.edu/~dangulo/LSG/
