Pathway Tools is a bioinformatics software package for creating and maintaining organism databases that integrate genome, pathway, and regulatory information. It offers computational inference and interactive editing tools, query and visualization capabilities, and tools for metabolic network and comparative analysis. With Pathway Tools, users can interpret omics data, export metabolic networks to SBML, and accelerate the creation of flux-balance models.
Pathway Tools / BioCyc Fundamentals Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International pkarp@ai.sri.com BioCyc.org, EcoCyc.org, MetaCyc.org, HumanCyc.org
Pathway Tools Capabilities • Create and maintain an organism database integrating genome, pathway, regulatory information • Computational inference tools • Interactive editing tools • Query and visualize that database • Use the database to interpret omics data • Metabolic network analysis tools • Comparative analysis tools • Export the metabolic network to SBML • Speed creation of flux-balance models by an order of magnitude
BioCyc • Hundreds of microbial genomes • Inferred operons and metabolic networks • Couples curated data with computational predictions • Supports analysis of omics data • Comparative analysis tools • Microbial emphasis. Exceptions: • HumanCyc, MouseCyc, CattleCyc
Model Organism Databases / Organism-Specific Databases • DBs that describe the genome and other information about an organism • Every sequenced organism with an active experimental community requires a MOD • Integrate genome data with information about the biochemical and genetic network of the organism • Integrate literature-based information with computational predictions • Curated by experts for that organism • No one group can curate all the world’s genomes • Distribute workload across a community of experts to create a community resource
Rationale for MODs • Each “complete” genome is incomplete in several respects: • 40%-60% of genes have no assigned function • Roughly 7% of those assigned functions are incorrect • Many assigned functions are non-specific • Need continuous updating of annotations with respect to new experimental data and computational predictions • MODs are platforms for global analyses of an organism • Interpret omics data in a pathway context • In silico prediction of essential genes • Characterize systems properties of metabolic and genetic networks
What is Curation? • Ongoing updating and refinement of a PGDB • Correcting false-positive and false-negative predictions • Incorporating information from experimental literature • Authoring of comments and citations • Updating database fields • Gene positions, names, synonyms • Protein functions, activators, inhibitors • Addition of new pathways, modification of existing pathways • Defining TF binding sites, promoters, regulation of transcription initiation and other processes
Pathway/Genome Database [Diagram of PGDB contents for a cell: Pathways, Reactions, Compounds, Sequence Features, Proteins, RNAs, Regulation, Operons, Promoters, DNA Binding Sites, Regulatory Interactions, Genes, Chromosomes, Plasmids]
BioCyc Collection of 507 Pathway/Genome Databases • Pathway/Genome Database (PGDB) – combines information about • Pathways, reactions, substrates • Enzymes, transporters • Genes, replicons • Transcription factors/sites, promoters, operons • Tier 1: Literature-Derived PGDBs • MetaCyc • EcoCyc -- Escherichia coli K-12 • Tier 2: Computationally-derived DBs, Some Curation -- 24 PGDBs • HumanCyc • Mycobacterium tuberculosis • Tier 3: Computationally-derived DBs, No Curation -- 481 DBs
Pathway Tools Overview [Diagram: Annotated Genome and MetaCyc Reference Pathway DB → PathoLogic → Pathway/Genome Database, which is used through the Pathway/Genome Navigator and the Pathway/Genome Editors]
Pathway Tools Software: PathoLogic • Computational creation of new Pathway/Genome Databases • Transforms genome into Pathway Tools schema and layers inferred information above the genome • Predicts operons • Predicts metabolic network • Predicts which genes code for missing enzymes in metabolic pathways • Infers transport reactions from transporter names Bioinformatics 18:S225 2002
Pathway Tools Software: Pathway/Genome Editors • Interactively update PGDBs with graphical editors • Support geographically distributed teams of curators with object database system • Gene editor • Protein editor • Reaction editor • Compound editor • Pathway editor • Operon editor • Publication editor
Pathway Tools Software: Pathway/Genome Navigator • Querying and visualization of: • Pathways • Reactions • Metabolites • Proteins • Genes • Chromosomes • Two modes of operation: • Web mode • Desktop mode • Most functionality shared, but each has unique functionality
Pathway Tools Software: PGDBs Created Outside SRI • 1,700+ licensees: 75+ groups applying software to 300+ organisms • Saccharomyces cerevisiae, SGD project, Stanford University: 135 pathways / 565 publications • Candida albicans, CGD project, Stanford University • dictyBase, Northwestern University • Mouse, MGD, Jackson Laboratory • Under development: • Drosophila, FlyBase • C. elegans, WormBase • Arabidopsis thaliana, TAIR, Carnegie Institution of Washington: 288 pathways / 2282 publications • PlantCyc, Carnegie Institution of Washington • Six Solanaceae species, Cornell University • GrameneDB, Cold Spring Harbor Laboratory • Medicago truncatula, Samuel Roberts Noble Foundation
Pathway Tools Software: PGDBs Created Outside SRI • NIAID BRCs for Biodefense pathogens: • BioHealthBase -- Mycobacterium tuberculosis, Francisella tularensis • Pathema -- 80+ PGDBs • PATRIC – Brucella suis, Coxiella burnetii, Rickettsia typhi • EuPathDB – Cryptosporidium, Plasmodium • G. Xie, Los Alamos Lab, Dental pathogens • F. Brinkman, Simon Fraser Univ, Pseudomonas aeruginosa • V. Schachter, Genoscope, Acinetobacter • M. Bibb, John Innes Centre, Streptomyces coelicolor • G. Church, Harvard, Prochlorococcus marinus, multiple strains • E. Uberbacher, ORNL and G. Serres, MBL, Shewanella oneidensis • R.J.S. Baerends, University of Groningen, Lactococcus lactis IL1403, Lactococcus lactis MG1363, Streptococcus pneumoniae TIGR4, Bacillus subtilis 168, Bacillus cereus ATCC14579 • Matthew Berriman, Sanger Centre, Trypanosoma brucei, Leishmania major • Sergio Encarnacion, UNAM, Sinorhizobium meliloti • Mark van der Giezen, University of London, Entamoeba histolytica, Giardia intestinalis • Michael Gottfert, Technische Universitat Dresden, Bradyrhizobium japonicum • Artiva Maria Goudel, Universidade Federal de Santa Catarina, Brazil, Chromobacterium violaceum ATCC 12472
Pathway Tools Software: PGDBs Created Outside SRI • Large scale users: • C. Medigue, Genoscope, 200+ PGDBs • G. Sutton, J. Craig Venter Institute, 80+ PGDBs • G. Burger, U Montreal, 60+ PGDBs • Bart Weimer, Utah State University, Lactococcus lactis, Brevibacterium linens, Lactobacillus acidophilus, Lactobacillus plantarum, Lactobacillus johnsonii, Listeria monocytogenes • Partial listing of outside PGDBs at BioCyc.org
Obtaining a PGDB for Organism of Interest • Find existing curated PGDB • Find existing PGDB in BioCyc • Create your own
EcoCyc Project – EcoCyc.org • E. coli Encyclopedia • Review-level Model-Organism Database for E. coli • Tracks evolving annotation of the E. coli genome and cellular networks • The two paradigms of EcoCyc • “Multi-dimensional annotation of the E. coli K-12 genome” • Positions of genes; functions of gene products – 76% / 66% exp • Gene Ontology terms; MultiFun terms • Gene product summaries and literature citations • Evidence codes • Multimeric complexes • Metabolic pathways • Cellular regulation Karp, Gunsalus, Collado-Vides, Paulsen: Nuc. Acids Res. 35:7577 2007; ASM News 70:25 2004; Science 293:2040
EcoCyc = E. coli Dataset + Pathway/Genome Navigator URL: EcoCyc.org EcoCyc v13.6 • Pathways: 246 • Reactions: Metabolic: 1394; Transport: 246 • Compounds: 1,830 • Citations: 19,000 • Proteins: 4,479 • Complexes: 895 • RNAs: 285 • Gene Regulation: Operons: 3,369; Trans Factors: 196; Promoters: 1,796; TF Binding Sites: 2,205 • Genes: 4,492
Paradigm 1: EcoCyc as Textual Review Article • All gene products for which experimental literature exists are curated with a minireview summary • Found on protein and RNA pages, not gene pages! • 3257 gene products contain summaries • Summaries cover function, interactions, mutant phenotypes, crystal structures, regulation, and more • Additional summaries found in pages for operons, pathways • EcoCyc cites 17,300 publications
Paradigm 2: EcoCyc as Computational Symbolic Theory • Highly structured, high-fidelity knowledge representation provides computable information • Each molecular species defined as a DB object • Genes, proteins, small molecules • Each molecular interaction defined as a DB object • Metabolic reactions • Transport reactions • Transcriptional regulation of gene expression • 220 database fields capture extensive properties and relationships
EcoCyc Procedures • DB updates performed by 5 staff curators • Information gathered from biomedical literature • Enter data into structured database fields • Author extensive summaries • Update evidence codes • Corrections submitted by E. coli researchers • Four releases per year • Quality assurance of data and software • Evaluate database consistency constraints • Perform element balancing of reactions • Run other checking programs
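One of the quality-assurance checks listed above, element balancing of reactions, can be illustrated with a small sketch. This is not the EcoCyc checking program; it is a minimal Python illustration that assumes each compound is given as a hypothetical element-count dictionary and that a reaction is two lists of (coefficient, composition) pairs. The function names are invented for this example.

```python
from collections import Counter

def element_totals(side):
    """Sum atoms per element over one side of a reaction.

    `side` is a list of (stoichiometric coefficient, composition) pairs,
    where a composition maps element symbol -> atom count.
    """
    totals = Counter()
    for coefficient, composition in side:
        for element, count in composition.items():
            totals[element] += coefficient * count
    return totals

def unbalanced_elements(left, right):
    """Return {element: left_total - right_total} for every mismatch."""
    left_totals = element_totals(left)
    right_totals = element_totals(right)
    elements = set(left_totals) | set(right_totals)
    return {
        e: left_totals[e] - right_totals[e]
        for e in elements
        if left_totals[e] != right_totals[e]
    }

# Toy compounds (methane combustion), used only to exercise the check.
CH4 = {"C": 1, "H": 4}
O2 = {"O": 2}
CO2 = {"C": 1, "O": 2}
H2O = {"H": 2, "O": 1}

balanced = unbalanced_elements([(1, CH4), (2, O2)], [(1, CO2), (2, H2O)])
broken = unbalanced_elements([(1, CH4), (1, O2)], [(1, CO2), (2, H2O)])
print(balanced)  # {}
print(broken)    # {'O': -2}
```

An empty result means the reaction balances; a non-empty result names the elements a curator would need to inspect. A real check would presumably also account for charge and for compounds of unknown composition, which this sketch ignores.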
EcoCyc Accelerates Science • Experimentalists • E. coli experimentalists • Experimentalists working with other microbes • Analysis of expression data • Computational biologists • Biological research using computational methods • Genome annotation • Study connectivity of E. coli metabolic network • Study phylogenetic extent of metabolic pathways and enzymes in all domains of life • Bioinformaticists • Training and validation of new bioinformatics algorithms – predict operons, promoters, protein functional linkages, protein-protein interactions • Metabolic engineers • “Design of organisms for the production of organic acids, amino acids, ethanol, hydrogen, and solvents” • Educators
MetaCyc: Metabolic Encyclopedia • Describe a representative sample of every experimentally determined metabolic pathway • Describe properties of metabolic enzymes • Literature-based DB with extensive references and commentary • Pathways, reactions, enzymes, substrates • Jointly developed by • P. Karp, R. Caspi, C. Fulcher, SRI International • L. Mueller, A. Pujar, Boyce Thompson Institute • S. Rhee, P. Zhang, Carnegie Institution Nucleic Acids Research 2008
Applications of MetaCyc • Reference source on metabolic pathways • Metabolic engineering • Find enzymes with desired activities, regulatory properties • Determine cofactor requirements • Predict pathways from genomes • Systematic studies of metabolism • Computer-aided education
Enzyme Data Available in MetaCyc • Reaction(s) catalyzed • Alternative substrates • Activators, inhibitors, cofactors, prosthetic groups • Subunit structure • Genes • Features on protein sequence • Cellular location • pI, molecular weight, Km, Vmax • Gene Ontology terms • Links to other bioinformatics databases
MetaCyc Pathway Variants • Pathways that accomplish similar biochemical functions using different biochemical routes • Alanine biosynthesis I – E. coli • Alanine biosynthesis II – H. sapiens • Pathways that accomplish similar biochemical functions using similar sets of reactions • Several variants of TCA Cycle
MetaCyc Super-Pathways • Groups of pathways linked by common substrates • Example: Super-pathway containing • Chorismate biosynthesis • Tryptophan biosynthesis • Phenylalanine biosynthesis • Tyrosine biosynthesis • Super-pathways defined by listing their component pathways • Multiple levels of super-pathways can be defined • Pathway layout algorithms accommodate super-pathways
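Because super-pathways are defined by listing their component pathways, and can be nested, expanding one into its base pathways is a short recursion. The sketch below is illustrative only: the dictionary layout, the function name, and the two super-pathway names are assumptions made for this example, not MetaCyc identifiers or the Pathway Tools API. The four component pathways are the ones listed on the slide.

```python
# Hypothetical table: super-pathway name -> list of component pathways.
COMPONENTS = {
    "aromatic amino acid biosynthesis (illustrative)": [
        "chorismate biosynthesis",
        "tryptophan biosynthesis",
        "phenylalanine biosynthesis",
        "tyrosine biosynthesis",
    ],
    # A second level, to show that super-pathways can contain super-pathways.
    "amino acid biosynthesis (illustrative)": [
        "aromatic amino acid biosynthesis (illustrative)",
        "alanine biosynthesis I",
    ],
}

def base_pathways(name, components, _path=()):
    """Expand a pathway into its base pathways, preserving listing order."""
    if name in _path:
        raise ValueError(f"cyclic super-pathway definition at {name!r}")
    if name not in components:
        return [name]  # not a super-pathway, so it is its own expansion
    result = []
    for child in components[name]:
        for base in base_pathways(child, components, _path + (name,)):
            if base not in result:  # a base pathway may be shared
                result.append(base)
    return result

aromatic = base_pathways("aromatic amino acid biosynthesis (illustrative)", COMPONENTS)
everything = base_pathways("amino acid biosynthesis (illustrative)", COMPONENTS)
print(len(aromatic), len(everything))
```

Storing only the component list keeps each super-pathway definition small, which is consistent with the slide's point that layout algorithms work from the component pathways.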
Comparison of BioCyc to KEGG • KEGG approach: A static collection of reference pathway diagrams is color-coded to produce organism-specific views • KEGG vs MetaCyc: Resource on literature-derived pathways • KEGG maps are not pathways (Nuc Acids Res 34:3687 2006) • KEGG maps contain multiple biological pathways • KEGG maps are composites of pathways in many organisms -- they do not identify which specific pathways were elucidated in which organisms • KEGG has no literature citations, no comments, less enzyme detail • KEGG vs BioCyc organism-specific PGDBs • KEGG does not curate or customize pathway networks for each organism • Highly curated PGDBs now exist for important organisms such as E. coli, yeast, mouse, Arabidopsis • KEGG re-annotates entire genome for each organism
Comparison of Pathway Tools to KEGG • Inference tools • KEGG does not predict presence or absence of pathways • KEGG lacks pathway hole filler, operon predictor • Curation tools • KEGG does not distribute curation tools • No ability to customize pathways to the organism • Pathway Tools schema much more comprehensive • Visualization and analysis • KEGG does not perform automatic pathway layout • KEGG metabolic-map diagram extremely limited • No comparative pathway analysis
Pathway Tools Implementation Details • Platforms: • Macintosh, PC/Linux, and PC/Windows platforms • Same binary can run as desktop app or Web server • Production-quality software • Version control • Two regular releases per year • Extensive quality assurance • Extensive documentation • Auto-patch • Automatic DB-upgrade • 420,000 lines of Lisp code
Pathway Tools Architecture [Diagram components: Pathway/Genome Navigator (Web Mode, Desktop Mode); Lisp, Perl, Java; Protein Editor, Pathway Editor, Reaction Editor; GFP API; Ocelot DBMS; Oracle or MySQL; Disk File]
Ocelot Knowledge Server Architecture • Frame data model • Minimizes size of schema relative to semantic complexity • Schema is stored within the DB • Schema is self documenting • Slot units define metadata about slots • Domain, range, inverse • Collection type, number of values, value constraints • Comment • Schema evolution facilitated by • Easy addition/removal of slots, or alteration of slot datatypes • Flexible data formats that do not require dumping/reloading of data
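The idea that slot units carry metadata about slots (domain, range, number of values, inverse, comment) can be sketched as follows. This is not Ocelot code or its API; the class, field, and slot names are hypothetical, and the domain check is a plain string comparison that ignores any class hierarchy.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SlotUnit:
    """Metadata describing one slot (hypothetical field names)."""
    name: str
    domain: str                       # frame class allowed to carry the slot
    value_type: type                  # range of the slot's values
    max_values: Optional[int] = None  # None means any number of values
    inverse: Optional[str] = None     # name of the inverse slot, if any
    comment: str = ""

def validate(frame_class, slots, schema):
    """Check a frame's slots against the slot units; return error strings."""
    errors = []
    for slot_name, values in slots.items():
        unit = schema.get(slot_name)
        if unit is None:
            errors.append(f"{slot_name}: no slot unit defined")
            continue
        if unit.domain != frame_class:
            errors.append(f"{slot_name}: not valid on class {frame_class}")
        if unit.max_values is not None and len(values) > unit.max_values:
            errors.append(f"{slot_name}: more than {unit.max_values} value(s)")
        for value in values:
            if not isinstance(value, unit.value_type):
                errors.append(f"{slot_name}: bad value {value!r}")
    return errors

# The schema is itself data, so adding a slot means adding one entry here.
schema = {
    "left-end-position": SlotUnit("left-end-position", "Gene", int, max_values=1,
                                  comment="Start coordinate on the replicon"),
    "product": SlotUnit("product", "Gene", str, inverse="gene"),
}

good = validate("Gene", {"left-end-position": [190], "product": ["thrL"]}, schema)
bad = validate("Gene", {"left-end-position": [190, "x"]}, schema)
print(good)
print(bad)
```

Because the slot units live alongside the data, a new slot can be introduced by adding one more entry, which is the kind of schema evolution the slide describes.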
Ocelot Storage System Architecture • Persistent storage via disk files or Oracle or MySQL • Concurrent development: Oracle or MySQL • Single-user development: disk files • Oracle/MySQL DBMS storage • DBMS is submerged within Ocelot, invisible to users • Frames transferred from DBMS to Ocelot • On demand • By background prefetcher • Memory cache • Persistent disk cache to speed performance via Internet • Transaction logging facility
Why Do We Code in Common Lisp? • Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000) • The average Lisp program ran 33 times faster than the average Java program • The average Lisp program was written 5 times faster than the average Java program • Roberts compared Java and Lisp implementations of a Domain Name Server (DNS) resolver • http://www.findinglisp.com/papers/case_study_java_lisp_dns.html • The Lisp version had ½ as many lines of code
Common Lisp Programming Environment • Interpreted and/or compiled execution • Fabulous debugging environment • High-level language • Interactive data exploration • Extensive built-in libraries • Dynamic redefinition • Find out more! • See ALU.org or • http://www.international-lisp-conference.org/
PathoLogic Processing • Translate source genome to PGDB form • Predict operons • Predict metabolic pathways • Predict pathway hole fillers • Transport inference parser • Build metabolic overview diagram
PathoLogic Step 1: Translate Genome to PGDB [Diagram: two inputs, the Annotated Genomic Sequence (Gene Products, Genes/ORFs, DNA Sequences) and the Multi-organism Pathway Database (MetaCyc) (Pathways, Reactions, Compounds), feed the PathoLogic Software, which integrates genome and pathway data to identify putative metabolic networks. Output: Pathway/Genome Database (Pathways, Reactions, Compounds, Gene Products, Genes, Genomic Map)]
PathoLogic Step 3: Prediction of Metabolic Pathways • Infer reaction complement of organism • Match enzymes in source genome to MetaCyc reactions they catalyze • Match enzyme names and EC numbers to MetaCyc • Support user in manually matching additional enzymes • Computationally predict which MetaCyc metabolic pathways are present • For each MetaCyc pathway, evaluate which of its reactions are catalyzed by the organism
Match Enzymes to Reactions [Flowchart: each gene product is matched against MetaCyc by EC number (e.g., 5.1.3.2) or by name (e.g., UDP-glucose-4-epimerase, for the UDP-D-glucose / UDP-galactose reaction). Match? yes: Assign (2057 proteins matched by EC#, 314 matched by name). no: Probable enzyme (-ase)? (1320). no: Not a metabolic enzyme. yes: Manually search, then Assign or Can’t Assign (625)]
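The decision order in the enzyme-matching flowchart (EC number first, then name, then a check for a probable enzyme name) can be sketched as below. This is a simplified illustration, not PathoLogic's matcher: the function name, the lookup tables, the reaction label string, and the "-ase" test are all assumptions made for this example.

```python
def match_gene_product(name, ec_number, reactions_by_ec, reactions_by_name):
    """Return (outcome, reaction) following the flowchart's decision order."""
    # 1. An EC number on the annotation is the strongest evidence.
    if ec_number and ec_number in reactions_by_ec:
        return ("matched by EC#", reactions_by_ec[ec_number])
    # 2. Otherwise fall back to an exact, case-insensitive name match.
    key = name.strip().lower()
    if key in reactions_by_name:
        return ("matched by name", reactions_by_name[key])
    # 3. A word ending in "ase" suggests an enzyme worth a manual search.
    words = key.replace("-", " ").split()
    if any(word.endswith("ase") for word in words):
        return ("probable enzyme: manual search", None)
    return ("not a metabolic enzyme", None)

# Toy lookup tables standing in for MetaCyc (illustrative content only).
EPIMERASE = "UDP-D-glucose <=> UDP-galactose"
BY_EC = {"5.1.3.2": EPIMERASE}
BY_NAME = {"udp-glucose-4-epimerase": EPIMERASE}

for product, ec in [
    ("galE product", "5.1.3.2"),
    ("UDP-glucose-4-epimerase", None),
    ("putative aldolase", None),
    ("hypothetical protein", None),
]:
    print(product, "->", match_gene_product(product, ec, BY_EC, BY_NAME)[0])
```

Exact name matching is the weakest part of this sketch. As the next slide notes, enzyme naming is irregular, so a real matcher would need synonym handling and manual review.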
Import Pathways [Flowchart: MetaCyc reactions → containing pathways → Import All → Prune? (yes / no) → Delete, or Manual Review (yes / no) → delete or keep]
Pathway Prediction • Prediction is hard because • Enzyme naming is irregular • Some reactions present in multiple pathways • Pathway variants share many reactions in common • MetaCyc now has many pathways
Pathway Scoring Criteria • Imported pathways must satisfy: • Pathways outside their taxonomic range must have enzymes for all reactions • If any reactions in a pathway are designated as “key,” an enzyme must be present for at least one of them • Pathway P is imported if any of these conditions is satisfied: • One unique enzyme present for P • P missing at most one reaction • More reactions present than absent for P • P is not a superset of another pathway with the same number of enzymes present
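The scoring criteria above can be read as two required checks followed by an "any of" test. The sketch below encodes that reading in Python under stated assumptions: the `Pathway` class and its fields are hypothetical, "unique enzyme" is interpreted as an enzyme for a reaction that occurs in no other pathway, and the superset criterion is left out because it needs the full set of candidate pathways. It is not the PathoLogic implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Pathway:
    name: str
    reactions: set                      # all reaction IDs in the pathway
    key_reactions: set = field(default_factory=set)
    in_taxonomic_range: bool = True

def should_import(pathway, catalyzed, unique_reactions):
    """Decide whether to import a pathway.

    catalyzed: reactions for which the organism has an enzyme.
    unique_reactions: reactions that occur in only one pathway (assumed
    meaning of "unique enzyme" for this sketch).
    """
    present = pathway.reactions & catalyzed
    absent = pathway.reactions - catalyzed

    # Required: out-of-range pathways need an enzyme for every reaction.
    if not pathway.in_taxonomic_range and absent:
        return False
    # Required: at least one "key" reaction, if any are designated.
    if pathway.key_reactions and not (pathway.key_reactions & catalyzed):
        return False

    # Any one of the remaining conditions is sufficient.
    has_unique_enzyme = bool(present & unique_reactions)
    return has_unique_enzyme or len(absent) <= 1 or len(present) > len(absent)

p = Pathway("P", {"r1", "r2", "r3", "r4"})
print(should_import(p, {"r1"}, set()))    # False
print(should_import(p, {"r1"}, {"r1"}))   # True
```

In the two calls, only one of four reactions is catalyzed, so the pathway is rejected unless that reaction is unique to it.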
PathoLogic Step 4: Pathway Hole Filler • Definition: Pathway Holes are reactions in metabolic pathways for which no enzyme is identified [Diagram: NAD biosynthesis example with holes marked. L-aspartate → iminoaspartate (1.4.3.-) → quinolinate (quinolinate synthetase, nadA) → nicotinate nucleotide (n.n. pyrophosphorylase, nadC) → deamido-NAD (2.7.7.18) → NAD (6.3.5.1; NAD+ synthetase, NH3-dependent, CC3619)]
Step 1: Query UniProt for all sequences having the EC# of the pathway hole • Step 2: BLAST against target genome • Steps 3 & 4: Consolidate hits and evaluate evidence [Diagram: enzyme A sequences from organisms 1 through 8 are used as queries against target-genome genes X, Y, and Z; 7 queries have high-scoring hits to sequence Y]
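The consolidation step can be sketched as a grouping of BLAST hits by target gene. The code below does not run BLAST or query UniProt; it assumes hits are already available as hypothetical (query, target gene, E-value) tuples, and it ranks candidates by a simple count of supporting queries, which stands in for the evidence evaluation the hole filler actually performs. The function name and the E-value cutoff are illustrative assumptions.

```python
from collections import defaultdict

def consolidate_hits(hits, max_evalue=1e-10):
    """Group high-scoring hits by target gene.

    hits: iterable of (query_id, target_gene, evalue) tuples.
    Returns [(target_gene, number_of_distinct_supporting_queries), ...]
    sorted with the best-supported candidate first.
    """
    supporters = defaultdict(set)
    for query_id, target_gene, evalue in hits:
        if evalue <= max_evalue:  # keep only high-scoring hits
            supporters[target_gene].add(query_id)
    ranked = [(gene, len(queries)) for gene, queries in supporters.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

# Toy data shaped like the slide: 8 queries for enzyme A, 7 of which hit gene Y.
toy_hits = [(f"organism {i} enzyme A", "gene Y", 1e-40) for i in range(1, 8)]
toy_hits.append(("organism 8 enzyme A", "gene X", 1e-12))
toy_hits.append(("organism 8 enzyme A", "gene Z", 0.5))  # too weak, dropped

candidates = consolidate_hits(toy_hits)
print(candidates)  # [('gene Y', 7), ('gene X', 1)]
```

Counting distinct queries rather than raw hits avoids rewarding one query that matches the same gene several times, but it is only one possible consolidation rule.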
Pathway Hole Filler • Why should hole filler find things beyond the original genome annotation? • Reverse BLAST searches more sensitive • Reverse BLAST searches find second domains • Integration of multiple evidence types