Bioinformatics and Grid: Progress and Potential • Peter Rice, EBI (pmr@ebi.ac.uk) • ISGC, April 2005
European Bioinformatics Institute • Part of the European Molecular Biology Laboratory • International organisation • 18 member states • Headquarters (EMBL) in Heidelberg, Germany • 4 Specialist "Outstations" • Hamburg, Germany - DESY - Structural biology • Grenoble, France - ESRF - Structural biology • Monterotondo, Italy - Mouse genetics • Hinxton, United Kingdom - Bioinformatics
European Bioinformatics Institute • On the Hinxton Genome Campus near Cambridge • Public biological database provider • EMBLbank - DNA sequence (human genome, etc.) • UniProt - Protein sequence • MSD - Protein structure • InterPro - Protein function • ArrayExpress - Gene expression data • GO - Gene ontology • Ensembl, Integr8 - Integrated data resources • Scientific literature • Bioinformatics research • Bioinformatics services • Data retrieval (SRS, etc.) • Sequence searches (BLAST etc.) • Open source analysis tools
European Bioinformatics Institute RFCGR (HGMP) Sanger EBI
eScience at EBI • Tool and data integration • Creating services and service standards • Building services into workflows • Semantic web/grid technologies • Grid computing • Currently web service implementations • Databases produced by EBI, mirrored across Europe • Data additions and modifications 1/sec • Search services and analysis tools • Simple for laboratory biologists • Complex for expert bioinformaticians • Managing local and mirrored data and services • The bioinformatics view of the Grid is hard to define
Data flow in particle physics • Interest in 1-10,000 per billion events • 40 million collisions/sec • 10 Gbytes data/sec • 100 Kbytes per event • 10 Kbytes per event • Raw data 2 Mbytes/event • Data stored at 1.25 Gbytes/sec • Expecting: tape 20 Pbytes/year, disk 5 Pbytes/year
Data flow in bioinformatics - data sources • DNA sequence data • Major sequencing centres ... • ... and small laboratories • Three major databases (EMBL, GenBank, DDBJ) • Data volume doubles every year • Associated databases: • Gene expression • Genetics • ... • Protein data • Sequence translated from DNA sequence • Annotation automated and by experts • Associated databases: • Protein 3D structure • Protein families and domains • Protein function • Protein expression • ...
Data flow in bioinformatics - user communities • Bioinformatics research - software/database developers • Bioinformatics research - expert users • Academic biological research • Molecular biology • Biochemistry • Microbiology • Clinical medicine • Crop research • Physiology (Systems biology) • Industry • Pharmaceutical industry • Small biotechnology companies • Bioinformatics software/database providers • Integrated solutions providers • General public • Interest in human genome and other data
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa Filling a genomic gap in Silico Services published on the web, many without programmatic interfaces
Filling a genomic gap in silico • Services published on the web, many without programmatic interfaces: • Public and local databases and data sets • Protein-protein interaction algorithms • Sequence alignment algorithms • Visualisation tools • Ontology services • Stochastic models for clustering gene expression data • Protein folding simulations • Gene prediction algorithms • Literature searches
Analysis tools - EMBOSS • European Molecular Biology Open Software Suite • Analysis of DNA and protein sequence data • Open source project started in 1996 • With Rosalind Franklin Centre for Genomic Research • Over 20,000 unique downloads • Reads and writes many data formats • Command line driven • Over 50 alternative user interfaces • Web • GUI • Integrated applications • Web services (e.g. SoapLab) • Command definition language (ACD)
SoapLab Web Services • Web service wrappers for EMBOSS ... • ... and legacy applications • ... and CGI web pages • Implements OMG standard: Biomolecular Sequence Analysis • Application is defined in the EMBOSS ACD style • Definition converted into input and output port types • SoapLab server provides stateful and stateless job control • Inherits multiple data formats from EMBOSS
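The difference between stateful and stateless job control can be illustrated with a toy, in-memory sketch. This is not the SoapLab API: the class `ToyJobServer`, its method names, the status strings and the `seq_length` application are all hypothetical, chosen only to show that a stateful client holds a job identifier across several calls while a stateless client gets its results back from a single call.

```python
import itertools
from typing import Any, Callable, Dict

Ports = Dict[str, Any]  # named input or output ports (hypothetical representation)


def seq_length(inputs: Ports) -> Ports:
    """Toy application: one input port 'sequence', one output port 'length'."""
    return {"length": len(inputs["sequence"])}


class ToyJobServer:
    """In-memory stand-in for a service wrapper; all names are assumptions."""

    def __init__(self, application: Callable[[Ports], Ports]) -> None:
        self._application = application
        self._ids = itertools.count(1)
        self._jobs: Dict[str, Dict[str, Any]] = {}

    # Stateful control: the server remembers the job between calls.
    def create_job(self, inputs: Ports) -> str:
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"inputs": dict(inputs), "status": "CREATED", "results": None}
        return job_id

    def run(self, job_id: str) -> None:
        job = self._jobs[job_id]
        job["results"] = self._application(job["inputs"])
        job["status"] = "COMPLETED"

    def get_status(self, job_id: str) -> str:
        return self._jobs[job_id]["status"]

    def get_results(self, job_id: str) -> Ports:
        return self._jobs[job_id]["results"]

    def destroy(self, job_id: str) -> None:
        del self._jobs[job_id]

    # Stateless control: one call, nothing is kept on the server afterwards.
    def run_and_wait(self, inputs: Ports) -> Ports:
        return self._application(inputs)


server = ToyJobServer(seq_length)
job = server.create_job({"sequence": "acatttctac"})
server.run(job)
print(server.get_status(job), server.get_results(job))
print(server.run_and_wait({"sequence": "acgt"}))
```

The stateful form suits long-running analyses that a client polls; the stateless form suits short jobs inside a workflow step.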
Taverna workflows and the myGrid Project (architecture figure) • Taverna Workbench • Scufl language parser • Freefluo workflow enactor core • Processors: BioMoby, BioMart, SeqHound, plain web service, SoapLab, local application, enactor
Practical example: Williams-Beuren Syndrome • Genetic disease • 1/20,000 children affected • Multiple phenotypes: • Characteristic facial features • Muscle, nervous system and circulation • Mental retardation • Mapped to human chromosome 7 • Deletion (missing DNA) in all cases • Also missing in the draft human genome sequence (2000)
Williams-Beuren Syndrome Microdeletion (figure) • Chr 7, ~155 Mb • Region 7q11.23, ~1.4 Mb • Genes: FKBP6, FZD9, BAZ1B, BCL7B, TBL2, WBSCR14, STX1A, CLDN4, CLDN3, ELN, LIMK1, LAB, EIF4H, RFC2, CYCLN2, GTF2IRD1, GTF2I, NCF1, GTF2IRD2 • Physical contig: CTA-315H11, gap, CTB-51J22 • Patient deletions: SVAS, WBS
Workflow services (figure) • Legend: pink = outputs/inputs of a service; purple = tailor-made services; green = EMBOSS SoapLab services; yellow = Manchester SoapLab services • Inputs: query nucleotide sequence, GenBank accession number, GenBank entry, URL including GenBank identifier • seqret: nucleotide sequence (FASTA) • RepeatMasker: repetitive elements • BLAST wrappers (BLASTwrapper, ncbiBlastWrapper): blastn vs nr, est databases; tblastn vs nr, est, est_mouse, est_human databases; blastp vs nr; sort for appropriate sequences only • GenScan: coding sequence • sixpack: 6 ORFs • transeq: amino acid translation • prettyseq: translation/sequence file, good for records and publications • restrict: restriction enzyme map • cpgreport: CpG island locations and % • Promoter prediction, TF binding prediction, regulation element prediction: identify regulatory elements in genomic sequence • pepstats: MW, length, charge, pI, etc. • pepcoil: predicts coiled-coil regions • epestfind: identifies PEST sequences • pscan: identifies FingerPRINTS • InterPro: identifies functional and structural domains/motifs • SignalP, TargetP, PSORTII: predict cellular location • Hydrophobic regions: Pepwindow? Octanol?
Williams-Beuren Workflows Identification of overlapping sequence Characterisation of protein sequence Characterisation of nucleotide sequence
myGrid architecture (figure) • Applications: third-party tools, Taverna e-Science workbench, LSID Launchpad, Haystack, web portals, Utopia • e-Science process patterns, LSID support, myGrid information model • e-Science mediator, e-Science coordination, e-Science events • Core services: metadata management, data management, KAVE metadata store, service and workflow discovery, mIR (myGrid information repository), Feta semantic discovery, KAVE provenance capture, Pedro semantic publication, workflow enactment, Freefluo workflow engine, GRIMOIRES federated UDDI+ registry, notification service, myGrid ontology • Web Service (Grid Service) communication fabric • External services: Java applications, Soaplab, AMBIT text extraction service, OGSA-DAI DQP service, executable codes with an IDL, Gowlab, legacy applications, web services, OGSA-DAI databases, web sites
Results Management • The Taverna/Freefluo workflow enactment engine is agnostic about the data flowing through it. • As objects progress through the workflow, they are tagged with terms from ontologies, free-text descriptions and MIME types; the objects may contain arbitrary collection structures. • Using these metadata hints we can locate and launch pluggable view components. • One WBS workflow can produce ~130 files, so management and presentation of (intermediate) results can be a major problem.
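The idea of tagging results with metadata and using those tags to pick a view component can be sketched as follows. This is an illustration only, not Taverna code: `TaggedResult`, `register_viewer`, `render` and the example ontology term are hypothetical names, and the real engine's data objects and viewer plug-ins differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class TaggedResult:
    """A workflow result plus metadata hints (hypothetical structure)."""
    value: Any
    mime_types: List[str] = field(default_factory=list)
    ontology_terms: List[str] = field(default_factory=list)
    description: str = ""


# Registry of pluggable viewers, keyed by MIME type.
VIEWERS: Dict[str, Callable[[Any], str]] = {}


def register_viewer(mime_type: str) -> Callable:
    """Decorator that plugs a viewer function into the registry."""
    def decorator(func: Callable[[Any], str]) -> Callable[[Any], str]:
        VIEWERS[mime_type] = func
        return func
    return decorator


@register_viewer("text/plain")
def show_text(value: Any) -> str:
    return f"text: {value}"


@register_viewer("image/png")
def show_image(value: Any) -> str:
    return f"image of {len(value)} bytes"


def render(result: TaggedResult) -> str:
    """Use the first MIME-type hint that has a registered viewer."""
    for mime_type in result.mime_types:
        viewer = VIEWERS.get(mime_type)
        if viewer is not None:
            return viewer(result.value)
    return f"no viewer registered for {result.mime_types}"


report = TaggedResult(
    value="CpG island at 12181..12840",
    mime_types=["text/plain"],
    ontology_terms=["example:cpg_island_report"],  # illustrative term only
    description="cpgreport output",
)
print(render(report))
```

Because the engine itself never inspects the data, new result types only need a new viewer registered against their MIME type.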
Workflow environment • Taverna API acts as an intermediate layer between user level applications and workflow enactors such as FreeFluo. • Includes object models for both workflow definitions and data objects in a workflow • Implicit iteration and data flow • Data sets and nested flows • Configurable failure handling • Life Science ID resolution • Plug-in framework • Event notification • Provenance and status reporting • Permissive type management • Graphical display • Data entry wizard
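Implicit iteration means that a processor written for a single item can be handed a collection, and the workflow layer applies it to each element without the user writing a loop. The sketch below shows only that core idea, under the assumption that collections are plain Python lists; the function names are hypothetical, and strategies for combining several inputs are not covered.

```python
from typing import Any, Callable


def implicit_iterate(processor: Callable[[Any], Any], value: Any) -> Any:
    """Apply a single-item processor to a value or to every item of a nested list.

    The shape of the input collection is preserved in the output.
    """
    if isinstance(value, list):
        return [implicit_iterate(processor, item) for item in value]
    return processor(value)


def reverse_complement(sequence: str) -> str:
    """Single-item processor: reverse-complement a lower-case DNA string."""
    pairs = {"a": "t", "c": "g", "g": "c", "t": "a"}
    return "".join(pairs[base] for base in reversed(sequence))


# One sequence, a flat list and a nested list all go through the same call.
print(implicit_iterate(reverse_complement, "aacc"))
print(implicit_iterate(reverse_complement, ["aacc", "ggg"]))
print(implicit_iterate(reverse_complement, ["acgt", ["aacc", "ggg"]]))
```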
Bioinformatics standards: Life Science Identifiers • OMG standard proposal – IBM, EBI, I3C • Standard identifier for biological entities • Uniform Resource Name (URN) format • Example: • URN:LSID:ebi.ac.uk:SWISS-PROT.accession:P34355:3 • Authority: ebi.ac.uk (can be any string, e.g. emboss.org) • Namespace: SWISS-PROT.accession • Object: P34355 • (Optional) revision: 3 • Also used for internal objects in Taverna
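The field layout above can be made concrete with a small parser. This is a minimal sketch that only splits the colon-separated fields of the example on this slide; the names `Lsid` and `parse_lsid` are hypothetical, and the full LSID specification has rules (for example on permitted characters) that are not checked here.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # the revision field is optional


def parse_lsid(text: str) -> Lsid:
    """Split URN:LSID:authority:namespace:object[:revision] into its fields."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields, got {len(parts)}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("an LSID must start with URN:LSID:")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError("authority, namespace and object must be non-empty")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


example = parse_lsid("URN:LSID:ebi.ac.uk:SWISS-PROT.accession:P34355:3")
print(example.authority, example.namespace, example.object_id, example.revision)
```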
Bioinformatics standards: Life Science Analysis Engine • Reuses OMG Biomolecular Sequence Analysis components • Describes SoapLab2 ... Already partly implemented • Platform-independent model • Platform-specific models for: • Web services • Java • Defines metadata for analysis services • Input and outputs: • Syntactic type - data format • Semantic type - data type
Bioinformatics standards: ACD command definitions • Developed for the EMBOSS project • Applied to general command-line controlled applications • Written in a simple text format, convertible to XML etc. • Validation tools provided by EMBOSS • Easy to extend: • Syntactic types (wrappers can choose one or more) • Semantic type • Relating output to inputs • "... is an alignment of input1 and input2" • "... is sequence feature positions in input1" • Application metadata • Hints for service wrappers and GUIs • Potential split into multiple simpler services
EMBRACE: Putting this into practice • European Union "Network of Excellence" 5-year project • Coordinated by Graham Cameron at EBI • Test cases requiring integration of data content and tools • Application interface standards for data content: • DNA and protein sequence data • Structure and image data • Gene and protein expression • Literature and text mining • Analysis tools using data content standards • Sequence analysis tools (EMBOSS etc.) • Structure analysis tools • ... and tools for all the other data types • Taverna as an example user interface.
Scientific Content The User doesn’t care
Layers User interface Application
Databases User interface Application
Interconnectivity Application interface User interface Application
Communicate objects and their identities Application interface User interface Application
Using standard protocols Application interface User interface Application
ComparaGrid: What the biologist really needs • UK BBSRC-funded project • Integrating data across species • Vertebrates • Invertebrates • Plants • Fungi • Micro-organisms • Detailed knowledge is in the model organisms (genome projects etc.) • Biologists need to use this knowledge to understand other species. • This is difficult: • Need to understand the data resources • Need to understand the biology • There is a strong overlap with the EMBRACE project • EMBRACE: interface standards for data and tools • ComparaGrid: how to explore the data using these standards
Web vs Grid services: strengths and weaknesses now • Web services: information world • Grid services: infrastructure world • EMBRACEgrid requires: data management, data replication, service discovery, computing • Web services: data management OK, data replication ??, service discovery OK, computing KO; lack of infrastructure providing low-level services • Grid services: data management KO, data replication ??, service discovery KO, computing OK; instability and lack of robustness • Standards still evolving, and implementations lagging behind
Acknowledgements • myGrid: Carole Goble, Chris Wroe, Hannah Tipney (Manchester), Anil Wipat (Newcastle) ... and the rest of the myGrid team; Tom Oinn, Martin Senger (EBI) • Taverna: Tom Oinn (EBI) and his many collaborators • SoapLab, LSID [1], LSAE [2]: Martin Senger (EBI), Sean Martin [1] (IBM), Mike Niemi [1] (I3C), Richard Scott [2] (deNovo) • EMBOSS: Alan Bleasby, Jon Ison, Gary Williams, Claude Beezley, Hugh Morgan (RFCGR), Tim Carver (Sanger), Lisa Mullan (EBI) • EMBRACE: Graham Cameron, Kerstin Nyberg (EBI), Alan Bleasby (RFCGR), Vincent Breton (CNRS France), Erik Bongcam-Rudloff (LCB Sweden), Gert Vriend (CMBI, Netherlands) • ComparaGrid: Andy Law (Roslin), Anil Wipat (Newcastle) ... and the rest of the team