
Pathway Tools / BioCyc Fundamentals

Pathway Tools is a bioinformatics software package for creating and maintaining organism databases that integrate genome, pathway, and regulatory information. It offers computational inference and interactive editing tools, query and visualization capabilities, and tools for metabolic network and comparative analysis. With Pathway Tools, users can interpret omics data, export metabolic networks to SBML, and speed the creation of flux-balance models.


Presentation Transcript


  1. Pathway Tools / BioCyc Fundamentals Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International pkarp@ai.sri.com BioCyc.org EcoCyc.org, MetaCyc.org, HumanCyc.org

  2. Pathway Tools Capabilities • Create and maintain an organism database integrating genome, pathway, regulatory information • Computational inference tools • Interactive editing tools • Query and visualize that database • Use the database to interpret omics data • Metabolic network analysis tools • Comparative analysis tools • Export the metabolic network to SBML • Speed creation of flux-balance models by order of magnitude

  3. BioCyc • Hundreds of microbial genomes • Inferred operons and metabolic networks • Couples curated data with computational predictions • Supports analysis of omics data • Comparative analysis tools • Microbial emphasis. Exceptions: • HumanCyc, MouseCyc, CattleCyc

  4. Model Organism Databases / Organism-Specific Databases • DBs that describe the genome and other information about an organism • Every sequenced organism with an active experimental community requires a MOD • Integrate genome data with information about the biochemical and genetic network of the organism • Integrate literature-based information with computational predictions • Curated by experts for that organism • No one group can curate all the world’s genomes • Distribute workload across a community of experts to create a community resource

  5. Rationale for MODs • Each “complete” genome is incomplete in several respects: • 40%-60% of genes have no assigned function • Roughly 7% of those assigned functions are incorrect • Many assigned functions are non-specific • Need continuous updating of annotations with respect to new experimental data and computational predictions • MODs are platforms for global analyses of an organism • Interpret omics data in a pathway context • In silico prediction of essential genes • Characterize systems properties of metabolic and genetic networks

  6. What is Curation? • Ongoing updating and refinement of a PGDB • Correcting false-positive and false-negative predictions • Incorporating information from experimental literature • Authoring of comments and citations • Updating database fields • Gene positions, names, synonyms • Protein functions, activators, inhibitors • Addition of new pathways, modification of existing pathways • Defining TF binding sites, promoters, regulation of transcription initiation and other processes

  7. Pathway/Genome Database Pathways Reactions Compounds Sequence Features Proteins RNAs Regulation Operons Promoters DNA Binding Sites Regulatory Interactions Genes Chromosomes Plasmids CELL

  8. BioCyc Collection of 507 Pathway/Genome Databases • Pathway/Genome Database (PGDB) – combines information about • Pathways, reactions, substrates • Enzymes, transporters • Genes, replicons • Transcription factors/sites, promoters, operons • Tier 1: Literature-Derived PGDBs • MetaCyc • EcoCyc -- Escherichia coli K-12 • Tier 2: Computationally-derived DBs, Some Curation -- 24 PGDBs • HumanCyc • Mycobacterium tuberculosis • Tier 3: Computationally-derived DBs, No Curation -- 481 DBs

  9. Pathway Tools Overview Annotated Genome MetaCyc Reference Pathway DB PathoLogic Pathway/Genome Database Pathway/Genome Navigator Pathway/Genome Editors

  10. Pathway Tools Software: PathoLogic • Computational creation of new Pathway/Genome Databases • Transforms genome into Pathway Tools schema and layers inferred information above the genome • Predicts operons • Predicts metabolic network • Predicts which genes code for missing enzymes in metabolic pathways • Infers transport reactions from transporter names Bioinformatics 18:S225 2002

  11. Pathway Tools Software: Pathway/Genome Editors • Interactively update PGDBs with graphical editors • Support geographically distributed teams of curators with object database system • Gene editor • Protein editor • Reaction editor • Compound editor • Pathway editor • Operon editor • Publication editor

  12. Pathway Tools Software: Pathway/Genome Navigator • Querying and visualization of: • Pathways • Reactions • Metabolites • Proteins • Genes • Chromosomes • Two modes of operation: • Web mode • Desktop mode • Most functionality shared, but each has unique functionality

  13. 1,700+ licensees: 75+ groups applying software to 300+ organisms Saccharomyces cerevisiae, SGD project, Stanford University 135 pathways / 565 publications Candida albicans, CGD project, Stanford University dictyBase, Northwestern University Mouse, MGD, Jackson Laboratory Under development: Drosophila, FlyBase C. elegans, WormBase Arabidopsis thaliana, TAIR, Carnegie Institution of Washington 288 pathways / 2282 publications PlantCyc, Carnegie Institution of Washington Six Solanaceae species, Cornell University GrameneDB, Cold Spring Harbor Laboratory Medicago truncatula, Samuel Roberts Noble Foundation Pathway Tools Software: PGDBs Created Outside SRI

  14. NIAID BRCs for Biodefense pathogens: BioHealthBase -- Mycobacterium tuberculosis, Francisella tularensis Pathema -- 80+ PGDBs PATRIC – Brucella suis, Coxiella burnetii, Rickettsia typhi EuPathDB – Cryptosporidium, Plasmodium G. Xie, Los Alamos Lab, Dental pathogens F. Brinkman, Simon Fraser Univ, Pseudomonas aeruginosa V. Schachter, Genoscope, Acinetobacter M. Bibb, John Innes Centre, Streptomyces coelicolor G. Church, Harvard, Prochlorococcus marinus, multiple strains E. Uberbacher, ORNL and G. Serres, MBL, Shewanella oneidensis R.J.S. Baerends, University of Groningen, Lactococcus lactis IL1403, Lactococcus lactis MG1363, Streptococcus pneumoniae TIGR4, Bacillus subtilis 168, Bacillus cereus ATCC14579 Matthew Berriman, Sanger Centre, Trypanosoma brucei, Leishmania major Sergio Encarnacion, UNAM, Sinorhizobium meliloti Mark van der Giezen, University of London, Entamoeba histolytica, Giardia intestinalis Michael Gottfert, Technische Universitat Dresden, Bradyrhizobium japonicum Artiva Maria Goudel, Universidade Federal de Santa Catarina, Brazil, Chromobacterium violaceum ATCC 12472 Pathway Tools Software: PGDBs Created Outside SRI

  15. Pathway Tools Software: PGDBs Created Outside SRI • Large scale users: • C. Medigue, Genoscope, 200+ PGDBs • G. Sutton, J. Craig Venter Institute, 80+ PGDBs • G. Burger, U Montreal, 60+ PGDBs • Bart Weimer, Utah State University, Lactococcus lactis, Brevibacterium linens, Lactobacillus acidophilus, Lactobacillus plantarum, Lactobacillus johnsonii, Listeria monocytogenes • Partial listing of outside PGDBs at BioCyc.org

  16. Obtaining a PGDB for Organism of Interest • Find existing curated PGDB • Find existing PGDB in BioCyc • Create your own

  17. EcoCyc Project – EcoCyc.org • E. coli Encyclopedia • Review-level Model-Organism Database for E. coli • Tracks evolving annotation of the E. coli genome and cellular networks • The two paradigms of EcoCyc • “Multi-dimensional annotation of the E. coli K-12 genome” • Positions of genes; functions of gene products – 76% / 66% exp • Gene Ontology terms; MultiFun terms • Gene product summaries and literature citations • Evidence codes • Multimeric complexes • Metabolic pathways • Cellular regulation Karp, Gunsalus, Collado-Vides, Paulsen Nuc. Acids Res. 35:7577 2007; ASM News 70:25 2004; Science 293:2040

  18. EcoCyc = E.coli Dataset + Pathway/Genome Navigator URL: EcoCyc.org Pathways: 246 Reactions: Metabolic: 1394 Transport: 246 Compounds: 1,830 EcoCyc v13.6 Citations: 19,000 Proteins: 4,479 Complexes: 895 RNAs: 285 Gene Regulation: Operons: 3,369 Trans Factors: 196 Promoters: 1,796 TF Binding Sites: 2,205 Genes: 4,492

  19. Paradigm 1: EcoCyc as Textual Review Article • All gene products for which experimental literature exists are curated with a minireview summary • Found on protein and RNA pages, not gene pages! • 3257 gene products contain summaries • Summaries cover function, interactions, mutant phenotypes, crystal structures, regulation, and more • Additional summaries found in pages for operons, pathways • EcoCyc cites 17,300 publications

  20. Paradigm 2: EcoCyc as Computational Symbolic Theory • Highly structured, high-fidelity knowledge representation provides computable information • Each molecular species defined as a DB object • Genes, proteins, small molecules • Each molecular interaction defined as a DB object • Metabolic reactions • Transport reactions • Transcriptional regulation of gene expression • 220 database fields capture extensive properties and relationships

  21. EcoCyc Procedures • DB updates performed by 5 staff curators • Information gathered from biomedical literature • Enter data into structured database fields • Author extensive summaries • Update evidence codes • Corrections submitted by E. coli researchers • Four releases per year • Quality assurance of data and software • Evaluate database consistency constraints • Perform element balancing of reactions • Run other checking programs
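
The "element balancing of reactions" check listed above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the checker Pathway Tools actually runs: formulas are assumed to be plain strings such as `C6H12O6` with no parentheses or charges, and the function names are hypothetical.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def side_totals(side):
    """Sum atom counts over the (coefficient, formula) pairs on one side of a reaction."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total

def is_balanced(reactants, products):
    """A reaction is element-balanced when both sides carry the same atoms."""
    return side_totals(reactants) == side_totals(products)

# Glucose oxidation with and without correct stoichiometry.
balanced = is_balanced([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (6, "H2O")])
unbalanced = is_balanced([(1, "C6H12O6"), (1, "O2")], [(1, "CO2"), (1, "H2O")])
```

A check like this flags reactions whose stored stoichiometry or compound formulas are inconsistent, which is the kind of error a curator would then review by hand.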

  22. EcoCyc Accelerates Science • Experimentalists • E. coli experimentalists • Experimentalists working with other microbes • Analysis of expression data • Computational biologists • Biological research using computational methods • Genome annotation • Study connectivity of E. coli metabolic network • Study phylogenetic extent of metabolic pathways and enzymes in all domains of life • Bioinformaticists • Training and validation of new bioinformatics algorithms – predict operons, promoters, protein functional linkages, protein-protein interactions • Metabolic engineers • “Design of organisms for the production of organic acids, amino acids, ethanol, hydrogen, and solvents” • Educators

  23. MetaCyc: Metabolic Encyclopedia • Describe a representative sample of every experimentally determined metabolic pathway • Describe properties of metabolic enzymes • Literature-based DB with extensive references and commentary • Pathways, reactions, enzymes, substrates • Jointly developed by • P. Karp, R. Caspi, C. Fulcher, SRI International • L. Mueller, A. Pujar, Boyce Thompson Institute • S. Rhee, P. Zhang, Carnegie Institution Nucleic Acids Research 2008

  24. Applications of MetaCyc • Reference source on metabolic pathways • Metabolic engineering • Find enzymes with desired activities, regulatory properties • Determine cofactor requirements • Predict pathways from genomes • Systematic studies of metabolism • Computer-aided education

  25. MetaCyc Data -- Version 13.6

  26. Taxonomic Distribution of MetaCyc Pathways – version 13.1

  27. Enzyme Data Available in MetaCyc • Reaction(s) catalyzed • Alternative substrates • Activators, inhibitors, cofactors, prosthetic groups • Subunit structure • Genes • Features on protein sequence • Cellular location • pI, molecular weight, Km, Vmax • Gene Ontology terms • Links to other bioinformatics databases

  28. MetaCyc Pathway Variants • Pathways that accomplish similar biochemical functions using different biochemical routes • Alanine biosynthesis I – E. coli • Alanine biosynthesis II – H. sapiens • Pathways that accomplish similar biochemical functions using similar sets of reactions • Several variants of TCA Cycle

  29. MetaCyc Super-Pathways • Groups of pathways linked by common substrates • Example: Super-pathway containing • Chorismate biosynthesis • Tryptophan biosynthesis • Phenylalanine biosynthesis • Tyrosine biosynthesis • Super-pathways defined by listing their component pathways • Multiple levels of super-pathways can be defined • Pathway layout algorithms accommodate super-pathways
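
Because super-pathways are defined by listing their component pathways, and can nest to several levels, resolving one to its base pathways is a recursive expansion. The Python sketch below illustrates that idea only; the dictionary layout, the function name, and the super-pathway names are hypothetical and do not reflect the MetaCyc schema.

```python
def expand_superpathway(name, components, _path=None):
    """Return the base pathways under `name`, expanding nested super-pathways.

    `components` maps each super-pathway to the list of pathways it contains.
    Any name that is not a key of `components` is treated as a base pathway.
    """
    path = set() if _path is None else _path
    if name in path:
        raise ValueError(f"super-pathway cycle detected at {name!r}")
    if name not in components:
        return [name]
    path.add(name)
    result = []
    for child in components[name]:
        for base in expand_superpathway(child, components, path):
            if base not in result:      # keep first occurrence, preserve order
                result.append(base)
    path.discard(name)
    return result

# Illustrative two-level definition (names are placeholders, not MetaCyc IDs).
components = {
    "aromatic amino acid super-pathway": [
        "chorismate biosynthesis",
        "tryptophan biosynthesis",
        "phenylalanine biosynthesis",
        "tyrosine biosynthesis",
    ],
    "amino acid super-pathway": [
        "aromatic amino acid super-pathway",
        "alanine biosynthesis I",
    ],
}
base_pathways = expand_superpathway("amino acid super-pathway", components)
```

Storing only the component list keeps each base pathway defined once, so an edit to a base pathway is automatically visible in every super-pathway that includes it.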

  30. Comparison of BioCyc to KEGG • KEGG approach: A static collection of reference pathway diagrams is color-coded to produce organism-specific views • KEGG vs MetaCyc: Resource on literature-derived pathways • KEGG maps are not pathways Nuc Acids Res 34:3687 2006 • KEGG maps contain multiple biological pathways • KEGG maps are composites of pathways in many organisms -- they do not identify which specific pathways were elucidated in which organisms • KEGG has no literature citations, no comments, less enzyme detail • KEGG vs BioCyc organism-specific PGDBs • KEGG does not curate or customize pathway networks for each organism • Highly curated PGDBs now exist for important organisms such as E. coli, yeast, mouse, Arabidopsis • KEGG re-annotates entire genome for each organism

  31. Comparison of Pathway Tools to KEGG • Inference tools • KEGG does not predict presence or absence of pathways • KEGG lacks pathway hole filler, operon predictor • Curation tools • KEGG does not distribute curation tools • No ability to customize pathways to the organism • Pathway Tools schema much more comprehensive • Visualization and analysis • KEGG does not perform automatic pathway layout • KEGG metabolic-map diagram extremely limited • No comparative pathway analysis

  32. Pathway Tools Implementation Details • Platforms: • Macintosh, PC/Linux, and PC/Windows platforms • Same binary can run as desktop app or Web server • Production-quality software • Version control • Two regular releases per year • Extensive quality assurance • Extensive documentation • Auto-patch • Automatic DB-upgrade • 420,000 lines of Lisp code

  33. Ptools-support@ai.sri.com

  34. Pathway Tools Architecture Pathway Genome Navigator Web Mode Desktop Mode Lisp Perl Java Protein Editor Pathway Editor Reaction Editor GFP API Oracle or MySQL Disk File Ocelot DBMS

  35. Ocelot Knowledge Server Architecture • Frame data model • Minimizes size of schema relative to semantic complexity • Schema is stored within the DB • Schema is self documenting • Slot units define metadata about slots • Domain, range, inverse • Collection type, number of values, value constraints • Comment • Schema evolution facilitated by • Easy addition/removal of slots, or alteration of slot datatypes • Flexible data formats that do not require dumping/reloading of data
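
The idea of slot units, where metadata such as domain, range, and inverse is stored in the database next to the data it describes, can be sketched in Python as follows. This is a toy model under stated assumptions, not Ocelot's API: the class names, method names, and the `product`/`gene` slots are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SlotUnit:
    """Metadata about one slot, kept in the knowledge base alongside the data."""
    name: str
    domain: str                       # class of frames allowed to carry this slot
    range: str                        # class required of every value
    inverse: Optional[str] = None     # slot kept in sync on the value frame
    max_values: Optional[int] = None  # None means unlimited
    comment: str = ""

@dataclass
class KnowledgeBase:
    slot_units: dict = field(default_factory=dict)   # the schema, stored as data
    frames: dict = field(default_factory=dict)       # frame id -> class and slots

    def add_slot_unit(self, unit):
        self.slot_units[unit.name] = unit

    def add_frame(self, frame_id, frame_class):
        self.frames[frame_id] = {"class": frame_class, "slots": {}}

    def put_value(self, frame_id, slot, value, _from_inverse=False):
        unit = self.slot_units[slot]
        if self.frames[frame_id]["class"] != unit.domain:
            raise TypeError(f"{frame_id} is outside the domain of {slot}")
        if self.frames[value]["class"] != unit.range:
            raise TypeError(f"{value} is outside the range of {slot}")
        values = self.frames[frame_id]["slots"].setdefault(slot, [])
        if value in values:
            return
        if unit.max_values is not None and len(values) >= unit.max_values:
            raise ValueError(f"{slot} already holds {unit.max_values} value(s)")
        values.append(value)
        if unit.inverse and not _from_inverse:
            # Maintain the inverse link on the value frame.
            self.put_value(value, unit.inverse, frame_id, _from_inverse=True)

kb = KnowledgeBase()
kb.add_slot_unit(SlotUnit("product", "Genes", "Proteins", inverse="gene"))
kb.add_slot_unit(SlotUnit("gene", "Proteins", "Genes", inverse="product", max_values=1))
kb.add_frame("trpA", "Genes")
kb.add_frame("TrpA-protein", "Proteins")
kb.put_value("trpA", "product", "TrpA-protein")
```

Adding a new slot here means adding one more `SlotUnit` record rather than altering a table definition, which is the sense in which a frame model eases schema evolution. The sketch does not roll back the forward write if the inverse write fails.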

  36. Ocelot Storage System Architecture • Persistent storage via disk files or Oracle or MySQL • Concurrent development: Oracle or MySQL • Single-user development: disk files • Oracle/MySQL DBMS storage • DBMS is submerged within Ocelot, invisible to users • Frames transferred from DBMS to Ocelot • On demand • By background prefetcher • Memory cache • Persistent disk cache to speed performance via Internet • Transaction logging facility
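
The on-demand transfer of frames through a memory cache can be sketched as below. This is a minimal illustration, assuming the DBMS is represented by a plain callable; the class and method names are hypothetical, the prefetch runs synchronously rather than in the background, and the persistent disk cache and transaction log are not modeled.

```python
class CachedFrameStore:
    """Fetch frames from a backing store on first use and keep them in memory."""

    def __init__(self, fetch_from_dbms):
        self._fetch = fetch_from_dbms   # callable: frame id -> frame data
        self._cache = {}
        self.dbms_reads = 0             # counts round trips to the backing store

    def get(self, frame_id):
        if frame_id not in self._cache:
            self._cache[frame_id] = self._fetch(frame_id)
            self.dbms_reads += 1
        return self._cache[frame_id]

    def prefetch(self, frame_ids):
        """Warm the cache; a real prefetcher would do this in the background."""
        for frame_id in frame_ids:
            self.get(frame_id)

# A dictionary stands in for the Oracle or MySQL table of frames.
backing_table = {"trpA": {"class": "Genes"}, "trpB": {"class": "Genes"}}
store = CachedFrameStore(backing_table.__getitem__)
store.prefetch(["trpA"])
first = store.get("trpA")    # served from memory
second = store.get("trpB")   # fetched on demand
```

Callers only ever see `get`, which is one way to read the slide's point that the DBMS is invisible to users.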

  37. Why Do We Code in Common Lisp? • Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000) • The average Lisp program ran 33 times faster than the average Java program • The average Lisp program was written 5 times faster than the average Java program • Roberts compared Java and Lisp implementations of a Domain Name Server (DNS) resolver • http://www.findinglisp.com/papers/case_study_java_lisp_dns.html • The Lisp version had half as many lines of code

  38. Common Lisp Programming Environment • Interpreted and/or compiled execution • Fabulous debugging environment • High-level language • Interactive data exploration • Extensive built-in libraries • Dynamic redefinition • Find out more! • See ALU.org or • http://www.international-lisp-conference.org/

  39. PathoLogic Processing • Translate source genome to PGDB form • Predict operons • Predict metabolic pathways • Predict pathway hole fillers • Transport inference parser • Build metabolic overview diagram

  40. PathoLogic Step 1: Translate Genome to PGDB Gene Products Genes/ORFs DNA Sequences Pathways Reactions Compounds Annotated Genomic Sequence Pathway/Genome Database Pathways Reactions PathoLogic Software Integrates genome and pathway data to identify putative metabolic networks Compounds Multi-organism Pathway Database (MetaCyc) Gene Products Genes Genomic Map

  41. PathoLogic Step 3: Prediction of Metabolic Pathways • Infer reaction complement of organism • Match enzymes in source genome to MetaCyc reactions they catalyze • Match enzyme names and EC numbers to MetaCyc • Support user in manually matching additional enzymes • Computationally predict which MetaCyc metabolic pathways are present • For each MetaCyc pathway, evaluate which of its reactions are catalyzed by the organism

  42. Match Enzymes to Reactions 5.1.3.2 Gene product MetaCyc UDP-glucose-4-epimerase 2057 proteins matched by EC# 314 matched by name Match yes no Probable enzyme -ase 1320 Assign UDP-D-glucose ↔ UDP-galactose no yes Manually search Not a metabolic enzyme no yes Assign Can’t Assign 625
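
The matching described on the two slides above (enzymes to reactions by EC number and by name, then a per-pathway check of which reactions are catalyzed) can be sketched in Python. This is an illustration under assumptions, not PathoLogic's implementation: trying EC numbers before names is a guess at the order, the reaction and gene identifiers are placeholders, and name matching here is exact after lowercasing, whereas real enzyme names need far more tolerant matching.

```python
def infer_reaction_complement(gene_products, reactions_by_ec, reactions_by_name):
    """Map reaction id -> set of gene ids, matching by EC number, then by name."""
    catalyzed = {}
    for gene_id, annotation in gene_products.items():
        matches = set()
        for ec in annotation.get("ec", []):
            matches.update(reactions_by_ec.get(ec, []))
        if not matches:
            name = annotation.get("name", "").strip().lower()
            matches.update(reactions_by_name.get(name, []))
        for reaction in matches:
            catalyzed.setdefault(reaction, set()).add(gene_id)
    return catalyzed

def pathway_coverage(pathways, catalyzed):
    """For each pathway, split its reactions into present and missing."""
    return {
        pathway: {
            "present": [r for r in reactions if r in catalyzed],
            "missing": [r for r in reactions if r not in catalyzed],
        }
        for pathway, reactions in pathways.items()
    }

# Placeholder data; only the EC number and enzyme names come from the slides.
gene_products = {
    "gene-1": {"name": "UDP-glucose-4-epimerase", "ec": ["5.1.3.2"]},
    "gene-2": {"name": "Quinolinate synthetase"},
    "gene-3": {"name": "hypothetical protein"},
}
reactions_by_ec = {"5.1.3.2": ["RXN-EPIMERASE"]}
reactions_by_name = {"quinolinate synthetase": ["RXN-QUINOLINATE"]}
pathways = {"demo-pathway": ["RXN-EPIMERASE", "RXN-OTHER"]}

catalyzed = infer_reaction_complement(gene_products, reactions_by_ec, reactions_by_name)
coverage = pathway_coverage(pathways, catalyzed)
```

Gene products that match nothing, such as `gene-3` here, correspond to the branch of the flowchart that is handed to manual search.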

  43. Import Pathways MetaCyc Containing pathways reactions Import All Prune? yes no Delete Manual Review yes no delete keep

  44. Pathway Prediction • Prediction is hard because • Enzyme naming is irregular • Some reactions present in multiple pathways • Pathway variants share many reactions in common • MetaCyc now has many pathways

  45. Pathway Scoring Criteria • Imported pathways must satisfy: • Pathways outside their taxonomic range must have enzymes for all reactions • If any reactions in a pathway are designated as “key,” an enzyme must be present for at least one • Pathway P is imported if any of these conditions is satisfied: • One unique enzyme present for P • P missing at most one reaction • More reactions present than absent for P • P is not a superset of another pathway with the same number of enzymes present
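
One reading of these criteria can be written as a short Python predicate. It is a sketch under stated assumptions rather than PathoLogic's scoring code: "unique enzyme" is interpreted as an enzyme found for a reaction that occurs in no other pathway, the data class and its field names are hypothetical, and the superset rule is left out because the slide does not make clear how it combines with the other conditions.

```python
from dataclasses import dataclass, field

@dataclass
class PathwayEvidence:
    reactions: set                      # all reactions in the pathway
    present: set                        # reactions with an identified enzyme
    key_reactions: set = field(default_factory=set)
    unique_reactions: set = field(default_factory=set)  # in no other pathway
    in_taxonomic_range: bool = True

def should_import(p):
    """Apply the required conditions, then accept if any sufficient one holds."""
    present = p.present & p.reactions
    missing = p.reactions - present

    # Required: out-of-range pathways need an enzyme for every reaction.
    if not p.in_taxonomic_range and missing:
        return False
    # Required: if key reactions are designated, at least one must be present.
    if p.key_reactions and not (p.key_reactions & present):
        return False

    # Sufficient: any one of the following.
    has_unique_enzyme = bool(p.unique_reactions & present)
    return has_unique_enzyme or len(missing) <= 1 or len(present) > len(missing)

weak = PathwayEvidence(reactions={"r1", "r2", "r3", "r4"}, present={"r1"})
rescued = PathwayEvidence(reactions={"r1", "r2", "r3", "r4"}, present={"r1"},
                          unique_reactions={"r1"})
decisions = (should_import(weak), should_import(rescued))
```

In the usage above the same sparse evidence is rejected for `weak` and accepted for `rescued`, because the single enzyme found belongs to a reaction unique to that pathway.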

  46. Pathway Evidence Report

  47. PathoLogic Step 4: Pathway Hole Filler • Definition: Pathway Holes are reactions in metabolic pathways for which no enzyme is identified 1.4.3.- quinolinate synthetase nadA iminoaspartate L-aspartate quinolinate holes n.n. pyrophosphorylase nadC NAD+ synthetase, NH3 -dependent CC3619 deamido-NAD nicotinate nucleotide 2.7.7.18 6.3.5.1 NAD

  48. Step 1: Query UniProt for all sequences having EC# of pathway hole Step 2: BLAST against target genome Step 3 & 4: Consolidate hits and evaluate evidence gene X organism 1 enzyme A organism 2 enzyme A organism 3 enzyme A organism 4 enzyme A 7 queries have high-scoring hits to sequence Y organism 5 enzyme A gene Y organism 6 enzyme A organism 7 enzyme A organism 8 enzyme A gene Z
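
The consolidation step, where many query sequences sharing the hole's EC number converge on the same candidate gene, can be sketched in Python. This covers only the grouping and a simple ranking; it does not implement the evidence evaluation step, does not call UniProt or BLAST, and the E-value cutoff and tuple layout are assumptions made for the example.

```python
from collections import defaultdict

def consolidate_hits(blast_hits, max_evalue=1e-10):
    """Group BLAST hits by target gene and rank the candidate genes.

    `blast_hits` is an iterable of (query_id, target_gene, evalue) tuples.
    Returns (gene, number_of_supporting_queries, best_evalue) tuples, with
    the most widely supported gene first and ties broken by best E-value.
    """
    supporters = defaultdict(set)
    best = {}
    for query, gene, evalue in blast_hits:
        if evalue > max_evalue:
            continue                     # drop weak hits (cutoff is an assumption)
        supporters[gene].add(query)
        best[gene] = min(evalue, best.get(gene, evalue))
    ranked = sorted(supporters, key=lambda g: (-len(supporters[g]), best[g]))
    return [(gene, len(supporters[gene]), best[gene]) for gene in ranked]

# Eight query sequences for "enzyme A": seven hit gene Y, one hits gene X,
# and the hit to gene Z is too weak to keep.
hits = [(f"organism {i} enzyme A", "gene Y", 1e-40) for i in range(2, 9)]
hits.append(("organism 1 enzyme A", "gene X", 1e-12))
hits.append(("organism 1 enzyme A", "gene Z", 0.5))
candidates = consolidate_hits(hits)
```

With this input the list starts with gene Y, supported by seven queries, which mirrors the situation drawn on the slide.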

  49. Pathway Hole Filler • Why should hole filler find things beyond the original genome annotation? • Reverse BLAST searches more sensitive • Reverse BLAST searches find second domains • Integration of multiple evidence types
