Pathway/Genome Databases and Software Tools Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International pkarp@ai.sri.com http://ecocyc.DoubleTwist.com/ecocyc/
Overview • Overview of bioinformatics • Motivations for the EcoCyc project • EcoCyc demo • Description of EcoCyc database and Pathway Tools software • Underlying technologies • Ocelot object database • GKB Editor • X-windows to WWW translator
Definition of Bioinformatics • Computational techniques for management and analysis of biological data and knowledge • Methods for disseminating, archiving, interpreting, and mining scientific information
Motivations for Bioinformatics • Growth in molecular-biology knowledge • Industrialization of biological experimentation • High-throughput biology • Genome sequences • Gene and protein expression data • Protein-protein interaction data • Protein 3-D structures • ….
Motivations for EcoCyc -- E. coli Encyclopedia • Integrate E. coli information dispersed in the literature • New paradigm of scientific publishing • Model the full metabolic network of an organism • Integrate genomic data with functional data • Develop algorithms for computing with function • Provide a challenging domain for computer-science research
Definitions • A chemical reaction interconverts chemical compounds, e.g. A + B = C + D • An enzyme is a protein that accelerates chemical reactions • A pathway is a linked set of reactions • A conceptual unit of the cell's biochemical machine
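The three definitions above can be written down as a small data model. This is a minimal Python sketch, not the EcoCyc schema: the class names, the `is_linked` check, and the toy compounds are all assumptions made for illustration, and "linked" is taken to mean that each reaction consumes at least one product of the reaction before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A reaction interconverts substrates and products (A + B = C + D)."""
    name: str
    substrates: frozenset
    products: frozenset


@dataclass
class Pathway:
    """A pathway is an ordered, linked set of reactions."""
    name: str
    reactions: list

    def is_linked(self) -> bool:
        # Assumed meaning of "linked": consecutive reactions share a compound.
        return all(
            prev.products & nxt.substrates
            for prev, nxt in zip(self.reactions, self.reactions[1:])
        )


# Toy data (hypothetical names): enzyme-1 accelerates r1, enzyme-2 accelerates r2.
r1 = Reaction("r1", frozenset({"A", "B"}), frozenset({"C", "D"}))
r2 = Reaction("r2", frozenset({"C"}), frozenset({"E"}))
catalyzes = {"enzyme-1": r1, "enzyme-2": r2}
pathway = Pathway("toy-pathway", [r1, r2])
```

Reversing the order of the two reactions breaks the link, because r2 produces nothing that r1 consumes.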
Organism-Specific Pathway/Genome Databases • Layer functional information above the genome • Rich ontology to encode biological information with high fidelity • Chromosomes, genes, operons, gene products, reactions, pathways • Curated by experts for that organism • Integrate literature and computational predictions
Pathway Tools Software • Pathway/Genome Navigator • WWW publishing of PGDBs • Graphic depictions of pathways, chromosomes, operons • Pathway visualization of gene-expression data • Pathway/Genome Editors • Distributed curation of genome annotations • Distributed object database system • Interactive editing tools • PathoLogic • Prediction of metabolic network from genome
EcoCyc = E. coli Dataset + Pathway/Genome Navigator • Genes: 4,393 • Gene Products: 4,393 • Operons: 375 • Metabolic Network • Pathways: 158 • Reactions: 1,117 • Compounds: 1,887 • http://ecocyc.DoubleTwist.com/ecocyc/
EcoCyc • Collaborative development via internet • Karp -- Bioinformatics architect • Riley -- Metabolic pathways, signal transduction • Saier and Paulsen -- Transport • Collado -- Regulation of gene expression • Ontology of 1000 biological classes • 14,000 instances • Over 2,600 registered users
Pathway Tools Software [Diagram: components are the Pathway/Genome Navigator, Pathway/Genome Databases, PathoLogic Pathway Predictor, and Pathway/Genome Editors]
Creation of the Overview Graph • Run layout algorithms on individual pathway graphs • Automatically determine topology of pathway graph • Apply associated layout algorithm (linear, circular, tidy tree) • Use superpathways to create hierarchical layouts • Treat each individual pathway as a single node • Pathway connections are edges • Run appropriate layout algorithm • Manually position the resulting pathway clusters
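The step "automatically determine topology of pathway graph" can be sketched as a degree check on the pathway's edges. This is an illustrative Python sketch under stated assumptions, not the Pathway Tools layout code: the function name and the rules are hypothetical, the graph is assumed to be connected, and any branching graph simply falls through to the tidy-tree layout.

```python
from collections import Counter


def classify_topology(edges):
    """Pick a layout family for a pathway graph given (source, target) edges."""
    if not edges:
        return "linear"  # a single isolated node needs no special layout
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = set(out_deg) | set(in_deg)
    # Unbranched: every node has at most one predecessor and one successor.
    unbranched = all(out_deg[n] <= 1 and in_deg[n] <= 1 for n in nodes)
    if unbranched:
        # A chain has a start node with no predecessor; a cycle does not.
        has_start = any(in_deg[n] == 0 for n in nodes)
        return "linear" if has_start else "circular"
    return "tree"
```

The same function could be reused at the superpathway level by passing pathway-to-pathway connections as the edges, with each pathway standing in as a single node.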
Inference of Metabolic Pathways [Diagram: an annotated genome (structured ASCII text file with a list of gene products, a list of genes/ORFs, and the DNA sequence) and MetaCyc feed PathoLogic, which produces a Pathway/Genome Database (metabolic network, pathways, reactions, compounds, gene products, genes, genomic map) and reports]
Summary of H. pylori Analysis • For 121 E. coli pathways, what is the evidence that each pathway occurs in H. pylori? • Strong evidence: 41 • Medium evidence: 29 • Little or no evidence: 51 • 31 reactions catalyzed by H. pylori but not by E. coli • H. pylori has partial abilities to synthesize cofactors and amino acids, extremely limited carbohydrate catabolism, some amino acid utilization, and a reductive citric-acid pathway
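The slides do not state how pathways were assigned to the three evidence grades. One plausible way to grade a reference pathway is by the fraction of its reactions for which the organism has an annotated enzyme. The Python sketch below assumes exactly that; the function name, the thresholds, and the toy reaction identifiers are hypothetical placeholders and are not PathoLogic's actual criteria.

```python
def pathway_evidence(pathway_reactions, catalyzed_reactions,
                     strong=0.75, medium=0.4):
    """Grade a reference pathway by the fraction of its reactions with a known enzyme.

    The thresholds are hypothetical placeholders, not PathoLogic's criteria.
    """
    if not pathway_reactions:
        return "little or no"
    present = sum(1 for rxn in pathway_reactions if rxn in catalyzed_reactions)
    fraction = present / len(pathway_reactions)
    if fraction >= strong:
        return "strong"
    if fraction >= medium:
        return "medium"
    return "little or no"


# Toy reference pathways and the reactions the target genome can catalyze.
reference = {"pathway-1": ["r1", "r2", "r3", "r4"], "pathway-2": ["r5", "r6"]}
catalyzed = {"r1", "r2", "r3", "r6"}
grades = {name: pathway_evidence(rxns, catalyzed)
          for name, rxns in reference.items()}
```

With these toy inputs, pathway-1 has three of four reactions covered and pathway-2 has one of two.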
Microbial Pathway/Genome DBs • Literature-based datasets: • MetaCyc • Escherichia coli • PathoLogic-based datasets: • Bacillus subtilis • Mycobacterium tuberculosis • Helicobacter pylori • Haemophilus influenzae • Mycoplasma pneumoniae • Treponema pallidum • Chlamydia trachomatis • Saccharomyces cerevisiae
Pathway Tools Software Architecture • Implemented in Common Lisp • WWW server runs as a single Unix process with a separate thread to service each query • Grasper-CL graph manager • Ocelot object database • GKB Editor schema-driven editor
Pathway Tools Architecture -- Development Configuration [Diagram: components are WWW Server, X-Windows Graphics, Object Editor, Pathway Editor, Reaction Editor, Pathway/Genome Navigator, GFP API, Ocelot DBMS, and Oracle]
Ocelot Database System • Object Database Manager • Persistence via filesystem or relational DBMS • Demand and background faulting of objects from RDBMS • Two-level object caching • Extensive bioinformatics schema • Stored transaction history • Inspect object history
Ocelot Knowledge Server Architecture • Frame data model • Persistent storage via • Disk files • Oracle DBMS • Optimistic concurrency-control protocol • Schema evolution • Logging facility
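The slide names an optimistic concurrency-control protocol without describing it. The general idea is that editors work without locks and a commit is rejected only if the frame changed since it was read. The Python sketch below shows that general technique with per-frame version numbers; the class and method names are hypothetical, and Ocelot's real protocol may differ in every detail.

```python
class ConflictError(Exception):
    """Raised when a frame changed between a client's read and its commit."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # frame id -> (version, value)

    def read(self, frame_id):
        # Unknown frames are reported as version 0 with no value.
        return self._data.get(frame_id, (0, None))

    def commit(self, frame_id, expected_version, new_value):
        current_version, _ = self.read(frame_id)
        if current_version != expected_version:
            # Someone else committed first; the caller must re-read and retry.
            raise ConflictError(f"{frame_id} is at version {current_version}")
        self._data[frame_id] = (current_version + 1, new_value)
        return current_version + 1
```

Two curators who read the same version can both edit freely; the first commit succeeds and the second raises `ConflictError`, which signals that the second curator should re-read and reapply the change.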
The Frame Data Model • Frames are of two types: classes, instances • Frames have slots that define their properties, attributes, relationships • A slot has one or more values • Each value can be any Lisp datatype • Slotunits define metadata about slots: • Domain, range, inverse • Collection type, number of values, value constraints
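The frame model above can be illustrated with a small Python sketch. It is not Ocelot, which is implemented in Common Lisp: the `Frame`, `SlotUnit`, and `check_slot` names are hypothetical, and only two pieces of slot metadata (a value type standing in for the range, and a maximum number of values) are modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SlotUnit:
    """Metadata about a slot (only part of what the slide lists)."""
    name: str
    domain: str                      # class whose frames may carry the slot
    value_type: type                 # stands in for the slot's range
    max_values: Optional[int] = None


@dataclass
class Frame:
    name: str
    kind: str                        # "class" or "instance"
    parents: list = field(default_factory=list)
    slots: dict = field(default_factory=dict)  # slot name -> list of values


def check_slot(frame, unit):
    """Return a list of constraint violations for one slot of one frame."""
    values = frame.slots.get(unit.name, [])
    problems = []
    if unit.max_values is not None and len(values) > unit.max_values:
        problems.append(f"{unit.name}: more than {unit.max_values} value(s)")
    for value in values:
        if not isinstance(value, unit.value_type):
            problems.append(f"{unit.name}: {value!r} is outside the range")
    return problems


gene = Frame("gene-1", "instance", ["Genes"], {"map-position": [12.5]})
map_position = SlotUnit("map-position", "Genes", float, max_values=1)
```

`check_slot` reports violations instead of raising, so a frame holding nonconformant values can still be loaded and inspected.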
Inference Capabilities • Inheritance of defaults • Slot values computed via attached procedures • Maintenance of inverse relationships • Constraint system • Deferred evaluation • Tolerant of nonconformant data
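Two of the capabilities listed above, inheritance of defaults and maintenance of inverse relationships, can be sketched in a few lines of Python. This is an illustration of the general frame-system behavior, not Ocelot's code: the `KB` class, its methods, and the slot names are assumptions made for the example.

```python
class KB:
    def __init__(self):
        self.parents = {}   # frame -> list of parent class names
        self.slots = {}     # frame -> {slot: [values]}
        self.inverses = {}  # slot -> inverse slot

    def add_frame(self, name, parents=()):
        self.parents[name] = list(parents)
        self.slots.setdefault(name, {})

    def declare_inverse(self, slot, inverse):
        self.inverses[slot] = inverse
        self.inverses[inverse] = slot

    def put(self, frame, slot, value):
        self.slots[frame].setdefault(slot, []).append(value)
        inverse = self.inverses.get(slot)
        # Maintain the inverse link when the value names another frame.
        if inverse is not None and isinstance(value, str) and value in self.slots:
            back = self.slots[value].setdefault(inverse, [])
            if frame not in back:
                back.append(frame)

    def get(self, frame, slot):
        """Return local values, else the nearest inherited default."""
        local = self.slots[frame].get(slot)
        if local:
            return local
        for parent in self.parents[frame]:
            inherited = self.get(parent, slot)
            if inherited:
                return inherited
        return []


kb = KB()
kb.add_frame("Genes")
kb.add_frame("Proteins")
kb.add_frame("gene-1", ["Genes"])
kb.add_frame("protein-1", ["Proteins"])
kb.put("Genes", "organism", "E. coli")   # default stored on the class
kb.declare_inverse("product", "gene")
kb.put("gene-1", "product", "protein-1")  # also records protein-1 -> gene-1
```

Here `gene-1` has no local `organism` value, so a lookup falls back to the default on its class, and asserting `product` on the gene fills in `gene` on the protein.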
Storage System Architecture • Oracle KBs • DBMS is submerged within FRS • Relational schema is domain independent, supports multiple KBs simultaneously • Frames transferred from DBMS to Ocelot • On demand • By background prefetcher • Memory cache • Persistent disk cache to speed performance via Internet
Frame Faulting • (get-slot-value gene 'map-position) • Gene present in in-memory object cache? • Gene present in cache on local disk? • Query Oracle DBMS
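The three-step lookup above can be sketched as a tiered cache. This Python sketch is only an illustration of the lookup order: plain dictionaries stand in for the local disk cache and the Oracle DBMS, and the `FaultingStore` class and its `log` attribute are hypothetical.

```python
class FaultingStore:
    def __init__(self, disk_cache, database):
        self.memory = {}          # in-memory object cache
        self.disk = disk_cache    # dict standing in for the local disk cache
        self.database = database  # dict standing in for the Oracle DBMS
        self.log = []             # records which tier answered each request

    def get_frame(self, frame_id):
        if frame_id in self.memory:
            self.log.append("memory")
            return self.memory[frame_id]
        if frame_id in self.disk:
            self.log.append("disk")
            frame = self.disk[frame_id]
        else:
            self.log.append("database")
            frame = self.database[frame_id]
            self.disk[frame_id] = frame  # populate the disk cache on the way back
        self.memory[frame_id] = frame
        return frame

    def get_slot_value(self, frame_id, slot):
        return self.get_frame(frame_id)[slot]


store = FaultingStore({}, {"gene-1": {"map-position": 12.5}})
first = store.get_slot_value("gene-1", "map-position")   # faults from the database
second = store.get_slot_value("gene-1", "map-position")  # served from memory
```

Only the first access pays for a database round trip; later accesses in the same session hit memory, and a fresh session on the same machine would hit the disk cache.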
Logging • Oracle DBMS stores: • The latest version of each frame • A history of all OKBC operations applied to KB • Reconstruct earlier versions of KB • View history of changes to an object • Update replicates • Concurrency control
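Because the full history of operations is stored, an earlier version of the KB can be rebuilt by replaying a prefix of the log. The Python sketch below illustrates that idea only: the tuple layout and the operation names `put` and `delete-frame` are hypothetical and are not OKBC operation names.

```python
def replay(operations, upto=None):
    """Rebuild KB state by replaying the first `upto` logged operations."""
    kb = {}
    for op, frame, slot, value in operations[:upto]:
        if op == "put":
            kb.setdefault(frame, {})[slot] = value
        elif op == "delete-frame":
            kb.pop(frame, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return kb


def frame_history(operations, frame):
    """Return every logged operation that touched one frame."""
    return [entry for entry in operations if entry[1] == frame]


log = [
    ("put", "gene-1", "map-position", 10.0),
    ("put", "gene-1", "map-position", 12.5),
    ("delete-frame", "gene-1", None, None),
]
version_1 = replay(log, upto=1)
latest = replay(log)
```

The same replay could be applied to a second copy of the KB to bring a replica up to date.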
Schema Management • FRSs store and process class and instance information similarly • Applications can query schema information as easily as they can query instances
GKB Editor • Browser and editor for KBs and ontologies • Four editing tools • GKB Editor reusable with multiple FRSs • All database queries via OKBC/GFP API • Interoperability achieved with Ocelot, LOOM, Ontolingua • All operations are schema driven • http://www.ai.sri.com/~gkb/overview.html
Editors • Taxonomy editor • Frame editor • Relationships editor • Spreadsheet editor
Results • Ocelot in use in the EcoCyc project for 5 years • Supports collaborative development of EcoCyc by four groups in North America • Distributed architecture • GKB Editor in active use • Supports development of 8 Pathway/Genome Databases
Summary • Pathway/Genome Databases • Pathway Tools software • Extract pathways from genomes • Distributed curation tools • Query, visualization, WWW publishing • Analysis algorithms
Computer Science Results • Extend scalability and multiuser access for knowledge representation systems • Reusable, schema-driven KB editor • Hierarchical graph layout algorithms • Dynamic translation from X-windows to HTML+GIF • Importance of ontologies and of content: • Discovery = Algorithm + Database
Problem Solving Depends on Algorithms and Content [Diagram labels: Compute Time, Algorithm Quality, Solution Quality, Database Size and Quality]
Bioinformatics Results: Content • The EcoCyc database describes the full metabolic map of an organism • The MetaCyc database describes over 300 metabolic pathways • Ontology spans genome to pathway information
Bioinformatics Results: Algorithms • Software environment for genome and pathway information • Query and visualization • Distributed database development • PathoLogic algorithm predicts the metabolic network of an organism from its genome • Algorithms under development for qualitative modeling of the cell
Acknowledgements • Funding sources: • NIH National Center for Research Resources • Collaborators: • Monica Riley, Marine Biological Laboratory • Milton Saier, UC San Diego • Julio Collado, UNAM • Christos Ouzounis, European Bioinformatics Institute Peter D. Karp, Ph.D. http://www.ai.sri.com/pkarp/ pkarp@ai.sri.com