350 likes | 453 Views
The EcoCyc and MetaCyc Pathway/Genome Databases. Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International pkarp@ai.sri.com http://www.ai.sri.com/pkarp/ http://EcoCyc.org/. Overview. Motivations and terminology Pathway/genome databases BioCyc collection EcoCyc, MetaCyc
E N D
The EcoCyc and MetaCyc Pathway/Genome Databases Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International pkarp@ai.sri.com http://www.ai.sri.com/pkarp/ http://EcoCyc.org/
Overview • Motivations and terminology • Pathway/genome databases • BioCyc collection • EcoCyc, MetaCyc • Pathway Tools software • Bioinformatics Database Warehouse project
A E
What to do When Theories BecomeLarger than Minds can Grasp? • Example: E. coli metabolic network • 160 pathways involving 744 reactions and 791 substrates • Example: E. coli genetic network • Control by 97 transcription factors of 1174 genes in 630 transcription units • Past solutions: • Partition theories across multiple minds • Encode theories in natural-language text • We cannot compute with theories in those forms • Evaluate theories for consistency with new data: microarrays • Refine theories with respect to new data • Compare theories describing different organisms
Solution: Biological Knowledge Bases • Store biological knowledge and theories in computers in a declarative form • Amenable to computational analysis and generative user interfaces • Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases • Accepted to store data in computers, but not knowledge • Such knowledge bases are an integral part of the scientific enterprise
Pathway Definition • Chemical reactions interconvert chemical compounds • An enzyme is a protein that accelerates chemical reactions • A pathway is a linked set of reactions • Often regulated as a unit • A conceptual unit of cell’s biochemical machine A + B C + D A C E
Model Organism Database (MOD) – DB describing genome and other information about an organism Pathway/Genome Database (PGDB) – MOD that combines information about Pathways, reactions, substrates Enzymes, transporters Genes, replicons Transcription factors, promoters, operons, DNA binding sites BioCyc – Collection of 15 PGDBs at BioCyc.org EcoCyc, AgroCyc, YeastCyc Terminology
BioCyc Collection ofPathway/Genome DBs Computationally Derived Datasets: • Agrobacterium tumefaciens • Caulobacter crescentus • Chlamydia trachomatis • Bacillus subtilis • Helicobacter pylori • Haemophilus influenzae • Mycobacterium tuberculosis RvH37 • Mycobacterium tuberculosis CDC1551 • Mycoplasma pneumonia • Pseudomonas aeruginosa • Saccharomyces cerevisiae • Treponema pallidum • Vibrio cholerae • Yellow = Open Database • Literature-based Datasets: • MetaCyc • Escherichia coli (EcoCyc) http://BioCyc.org/
Terminology –Pathway Tools Software • PathoLogic • Prediction of metabolic network from genome • Computational creation of new Pathway/Genome Databases • Pathway/Genome Editors • Distributed curation of PGDBs • Distributed object database system, interactive editing tools • Pathway/Genome Navigator • WWW publishing of PGDBs • Querying, visualization of pathways, chromosomes, operons • Analysis operations • Pathway visualization of gene-expression data • Global comparisons of metabolic networks • Bioinformatics 18:S225 2002
Query, visualization and editing tools for these datatypes: Full Metabolic Map Paint gene expression data on metabolic network; compare metabolic networks Pathways Pathway prediction Reactions Balance checker Compounds Chemical substructure comparison Enzymes, Transporters, Transcription Factors Genes: Blast search Chromosomes Operons Operon prediction Pathway Tools Algorithms
Model Organism Databases • DBs that describe the genome and other information about an organism • Every sequenced organism with an active experimental community requires a MOD • Integrate genome data with information about the biochemical and genetic network of the organism • MODs are platforms for global analyses of an organism • Interpret gene expression data in a pathway context • Characterize systems properties of metabolic and genetic networks • Determine consistency of metabolic and transport networks • In silico prediction of essential genes
EcoCyc Project – EcoCyc.org • E.coli Encyclopedia • Model-Organism Database for E. coli • Computational symbolic theory of E. coli • Electronic review article for E. coli – over 3500 literature citations • Tracks the evolving annotation of the E. coli genome • Collaborative development via Internet • Karp (SRI) -- Bioinformatics architect • John Ingraham -- Advisor • (SRI) Metabolic pathways • Saier (UCSD) and Paulsen (TIGR)-- Transport • Collado (UNAM)-- Regulation of gene expression • Database content: 18,000 objects
EcoCyc = E.coli Dataset + Pathway/Genome Navigator Pathways: 165 Reactions: 2,760 Compounds: 774 Enzymes: 914 Transporters: 162 Promoters: 812 TransFac Sites: 956 Citations: 3,508 Proteins: 4,273 Transcription Units: 724 Factors: 110 Genes: 4,393 http://EcoCyc.org/
EcoCyc Procedures • All DB updates by 5 staff curators • Information gathered from biomedical literature • Corrections solicited from E. coli researchers • Review-level database • Four releases per year • Available through WWW site, as data files, as downloadable application • Quality assurance of data and software: • Evaluate database consistency constraints • Perform element balancing of reactions • Run other checking programs • Display every DB object
MetaCyc: Metabolic Encyclopedia • Nonredundant metabolic pathway database • Describe a representative sample of every experimentally determined metabolic pathway • Literature-based DB with extensive references and commentary • Pathways, reactions, enzymes, substrates • 460 pathways, 1267 enzymes, 4294 reactions • 172 E. coli pathways, 2735 citations • Nucleic Acids Research 30:59-61 2002. • Jointly developed by SRI and Carnegie Institution • New focus on plant pathways
Pathway Tools Implementation Details • Allegro Common Lisp • Sun and PC platforms • Ocelot object database • 250,000 lines of code • Lisp-based WWW server at BioCyc.org • Manages 15 PGDBs
Pathway Tools Architecture WWW Server X-Windows Graphics Object Editor Pathway Editor Reaction Editor GFP API Oracle Pathway Genome Navigator Object DBMS
Ocelot Knowledge Server Architecture • Frame data model • Classes, instances, inheritance • Frames have slots that define their properties, attributes, relationships • A slot has one or more values • Each value can be any Lisp datatype • Slotunits define metadata about slots: • Domain, range, inverse • Collection type, number of values, value constraints • Transaction logging facility • Schema evolution
Ocelot Storage System Architecture • Persistent storage via disk files, Oracle DBMS • Concurrent development: Oracle • Single-user development: disk files • Read-only delivery: bundle data into binary program • Oracle storage • DBMS is submerged within Ocelot, invisible to users • Relational schema is domain independent, supports multiple KBs simultaneously • Frames transferred from DBMS to Ocelot • On demand • By background prefetcher • Memory cache • Persistent disk cache to speed performance via Internet
The Common Lisp ProgrammingEnvironment • Gatt studied Lisp and Java implementation of 16 programs by 14 programmers (Intelligence 11:21 2000)
Plasmodium falciparum, Stanford University plasmocyc.stanford.edu Mycobacterium tuberculosis, Stanford University BioCyc.org Arabidopsis thaliana and Synechosistis, Carnegie Institution of Washington Arabidopsis.org:1555 Methanococcus janaschii, EBI Maine.ebi.ac.uk:1555 Other PGDBs in progress by 24 other users Software freely available Each PGDB owned by its creator Pathway/Genome DBs Created byExternal Users
Global Consistency Checking of Biochemical Network • Given: • A PGDB for an organism • A set of initial metabolites • Infer: • What set of products can be synthesized by the small-molecule metabolism of the organism • Can known growth medium yield known essential compounds? • Pacific Symposium on Biocomputing p471 2001
Algorithm:Forward Propagation Nutrient set Products PGDB reaction pool Transport “Fire” reactions Metabolite set Reactants
Results • Phase I: Forward propagation • 21 initial compounds yielded only half of 38 essential compounds for E. coli • Phase II: Manually identify • Bugs in EcoCyc (e.g., two objects for tryptophan) • Missing initial protein substrates (e.g., ACP) • Missing pathways in EcoCyc • Phase III: Forward propagation with 11 more initial metabolites • Yielded all 38 essential compounds
Nutrient-Related Analysis:Validation of the EcoCyc Database Results on EcoCyc: • Phase I: • Essential compounds • produced 19 • not produced 19 • Total compounds • produced: (28%) • Reactions • Fired (31%)
Missing Essential Compounds Due To • Bugs in EcoCyc • Narrow conceptualization of the problem • Protein substrates • Incomplete biochemical knowledge
Nutrient-Related Analysis:Validation of the EcoCyc Database Results on EcoCyc: • Phase II (After adding 11 extra metabolites): • Essential compounds • produced 38 • not produced 0 • Total compounds • produced: (49%) • not produced: (51%) • Reactions • Fired (58%) • Not fired (42%)
Pathway Tools Misconceptions • PathoLogic • Does not re-annotate genomes • Pathway Tools does not handle quantitative information • Pathway/Genome Editors do not work through the web
HumanCyc: Human Metabolic PathwayDatabase Consortium • Construct DB of human metabolic pathways using PathoLogic • Link to human genome web sites • Hire one curator to refine and curate with respect to literature over a 2 year period • Remove false-positive predictions • Insert known pathways missed by PathoLogic • Add comments and citations from pathways and enzymes to the literature • Add enzyme activators, inhibitors, cofactors, tissue information • Available as flatfiles and with Pathway/Genome Navigator • New versions to be released every 6 months
Summary • Pathway/Genome Databases • MetaCyc non-redundant DB of literature-derived pathways • 14 organism-specific PGDBs available through SRI at BioCyc.org • Computational theories of biochemical machinery • Pathway Tools software • Extract pathways from genomes • Morph annotated genome into structured ontology • Distributed curation tools for MODs • Query, visualization, WWW publishing
BioCyc and Pathway Tools Availability • WWW BioCyc freely available to all • BioCyc.org • Six BioCyc DBs openly available to all • BioCyc DBs freely available to non-profits • Flatfiles downloadable from BioCyc.org • Binary executable: • Sun UltraSparc-170 w/ 64MB memory • PC, 400MHz CPU, 64MB memory, Windows-98 or newer • PerlCyc API • Pathway Tools freely available to non-profits
SRI Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud EcoCyc Project Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier MetaCyc Project Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville Stanford Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh Funding sources: NIH National Center for Research Resources NIH National Institute of General Medical Sciences NIH National Human Genome Research Institute Department of Energy Microbial Cell Project DARPA BioSpice, UPC Acknowledgements BioCyc.org