The Pathway Tools Ontology and Inferencing Layer. Peter D. Karp, Ph.D. SRI International. Overview. Definitions Ontologies ultimately exciting because of the inferences/computations they enable: Where are the ontology killer apps?
The Pathway Tools Ontology and Inferencing Layer Peter D. Karp, Ph.D. SRI International
Overview • Definitions • Ontologies ultimately exciting because of the inferences/computations they enable: Where are the ontology killer apps? • Adding more facets to an ontology increases inferences that can be made with it • Pathway Tools ontology and associated applications
Terminology • Model Organism Database (MOD) – DB describing genome and other information about an organism • Pathway/Genome Database (PGDB) – MOD that combines information about • Pathways, reactions, substrates • Enzymes, transporters • Genes, replicons • Transcription factors, promoters, operons, DNA binding sites • BioCyc – Collection of 15 PGDBs at BioCyc.org • EcoCyc, AgroCyc, YeastCyc
Terminology – Pathway Tools Software • PathoLogic • Prediction of metabolic network from genome • Computational creation of new Pathway/Genome Databases • Pathway/Genome Editors • Distributed curation of PGDBs • Distributed object database system, interactive editing tools • Pathway/Genome Navigator • WWW publishing of PGDBs • Querying, visualization of pathways, chromosomes, operons • Analysis operations • Pathway visualization of gene-expression data • Global comparisons of metabolic networks • Bioinformatics 18:S225 2002
Ontology • Ontology = Terms + Taxonomy + Slots + Constraints
Pathway Tools Ontology: Terms and Taxonomy • Pathway Tools ontology contains 916 classes • Define datatypes • Replicons, Genes, Operons, Promoters, Transcription Factor Binding Sites • Proteins: Enzymes, Transporters, Transcription Factors • Small molecule compounds • Reactions, pathways • Define taxonomies • Taxonomy of chemical compounds • Riley’s gene ontology • Taxonomy of metabolic pathways • EC system • Bioinformatics 16:269 2000
Operations Enabled by Controlled Vocabulary • Equality testing: • Is the function of gene X in organism A the same as the function of gene Y in organism B? • Is location L1 in organism A the same as location L2 in organism B?
Operations Enabled by Taxonomy • Counting / Pie charts • How many genes of category “small molecule metabolism” are in organism A? • Intersecting sets • How many of these up-regulated genes are in class “cell cycle”? • User search via drill down • Applying rules • If the substrate of X is an amino acid, then XXX
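The counting and set-intersection operations on this slide can be sketched in a few lines. This is a minimal illustration, not Pathway Tools code: the toy taxonomy, the class names, and the gene annotations below are all invented for the example.

```python
from collections import Counter

# Hypothetical toy taxonomy: each class maps to its parent (None for the root).
PARENT = {
    "gene-function": None,
    "small-molecule-metabolism": "gene-function",
    "amino-acid-biosynthesis": "small-molecule-metabolism",
    "energy-metabolism": "small-molecule-metabolism",
    "cell-cycle": "gene-function",
}

# Hypothetical annotations: gene -> most specific class it belongs to.
GENE_CLASS = {
    "trpA": "amino-acid-biosynthesis",
    "sdhA": "energy-metabolism",
    "ftsZ": "cell-cycle",
}


def ancestors(cls):
    """Return cls followed by all of its superclasses, most specific first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def count_by_class(gene_class):
    """Count genes per class, crediting each gene to every ancestor class."""
    counts = Counter()
    for cls in gene_class.values():
        counts.update(ancestors(cls))
    return counts


def genes_in_class(genes, target):
    """Intersect a gene set with a class, including all of its subclasses."""
    return {g for g in genes if target in ancestors(GENE_CLASS[g])}


counts = count_by_class(GENE_CLASS)
print(counts["small-molecule-metabolism"])             # genes at or below the category
print(genes_in_class({"sdhA", "ftsZ"}, "cell-cycle"))  # e.g. up-regulated genes in "cell cycle"
```

Because every gene is credited to all of its ancestor classes, the same counts serve both a pie chart at one level and a drill-down to the next.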
Ontology • Ontology = Terms + Taxonomy + Slots + Constraints
Pathway Tools Ontology: Slots • Pathway Tools ontology contains 199 slots • Categories of slots: • Meta-data: Creator, Creation-Date • Textual data: Common-Name, Synonyms, Comment, Citations • Attributes: Molecular-Weight, pI • Relationships: Gene, Catalyzes, In-Reaction • Give stats on how many slots in each of these classes
Pathway Tools Ontology: Slots • Slots introduced at appropriate place in taxonomy • Child classes inherit the slot; parent classes do not • Examples: • Proteins: pI, MolWt, Component-Of • Polypeptides: Gene • Protein-Complexes: Components • Reactions: Left, Right, Keq, In-Pathway • Pathways: Reaction-List, Predecessor-List • Transcription Units: Components • Genes: Product, Component-Of
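Slot inheritance can be illustrated with a small lookup that walks up the taxonomy. The slot names come from the examples on this slide; the parent links (Polypeptides and Protein-Complexes as children of Proteins) and the function name `slots_of` are assumptions made for this sketch.

```python
# Assumed fragment of the taxonomy: class -> parent class (None for the top).
PARENT = {
    "Proteins": None,
    "Polypeptides": "Proteins",
    "Protein-Complexes": "Proteins",
}

# Slots introduced at each class, as listed in the slide's examples.
LOCAL_SLOTS = {
    "Proteins": {"pI", "MolWt", "Component-Of"},
    "Polypeptides": {"Gene"},
    "Protein-Complexes": {"Components"},
}


def slots_of(cls):
    """Collect slots defined on cls plus those inherited from its ancestors."""
    slots = set()
    while cls is not None:
        slots |= LOCAL_SLOTS.get(cls, set())
        cls = PARENT[cls]
    return slots


print(sorted(slots_of("Polypeptides")))  # inherits the Proteins slots and adds Gene
print("Gene" in slots_of("Proteins"))    # the parent class does not receive child slots
```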
Operations Enabled by Slots • Store/retrieve attributes of an entity • Get pI of protein • Get citations associated with pathway • Traverse network of semantic relationships • Find all substrates of all reactions in pathway X • Find all genes that encode an enzyme that catalyzes a reaction in pathway X • Find all regulons encoding multiple metabolic pathways
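Two of the traversals listed above can be sketched against a toy frame store. The dictionary-of-dictionaries layout, the frame names, and the helper `get_slot` are hypothetical; the real schema also places an enzymatic-reaction frame between an enzyme and its reaction, which this sketch omits for brevity.

```python
# Hypothetical toy knowledge base: frame name -> {slot name: list of values}.
KB = {
    "TCA-cycle": {"Reaction-List": ["succ-dehyd-rxn", "fumarase-rxn"]},
    "succ-dehyd-rxn": {"Left": ["succinate", "FAD"], "Right": ["fumarate", "FADH2"]},
    "fumarase-rxn": {"Left": ["fumarate", "H2O"], "Right": ["malate"]},
    "SdhA": {"Catalyzes": ["succ-dehyd-rxn"], "Gene": ["sdhA"]},
    "FumA": {"Catalyzes": ["fumarase-rxn"], "Gene": ["fumA"]},
}


def get_slot(frame, slot):
    """Return the values of a slot, or an empty list if frame or slot is absent."""
    return KB.get(frame, {}).get(slot, [])


def substrates_of_pathway(pathway):
    """Find all substrates of all reactions in a pathway."""
    substrates = set()
    for rxn in get_slot(pathway, "Reaction-List"):
        substrates.update(get_slot(rxn, "Left"))
        substrates.update(get_slot(rxn, "Right"))
    return substrates


def genes_of_pathway(pathway):
    """Find all genes encoding an enzyme that catalyzes a reaction in a pathway."""
    rxns = set(get_slot(pathway, "Reaction-List"))
    genes = set()
    for frame in KB:
        if rxns & set(get_slot(frame, "Catalyzes")):
            genes.update(get_slot(frame, "Gene"))
    return genes


print(sorted(substrates_of_pathway("TCA-cycle")))
print(sorted(genes_of_pathway("TCA-cycle")))
```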
Ontology • Ontology = Terms + Taxonomy + Slots + Constraints
Pathway Tools Ontology: Constraints • Every Pathway Tools slot has associated metadata: • Class(es) to which it pertains • Keq pertains to Reactions • Data type (number, string, frame, etc.) • Keq data type is number • Collection type (list, bag) • Keq is not a collection • Documentation string • Cardinality constraints -- At most one Keq value • Range constraints • Taxonomy constraints • Values of Left slot of Reactions must be Chemicals
Operations Enabled by Constraints • Constraints make a system “intelligent” because they encode definitions in a machine-understandable fashion • Automated DB consistency checkers (batch or interactive) • Schema-driven data input tools • Subsumption – Compare two concept definitions
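A batch consistency checker driven by slot metadata might look roughly like the sketch below. The `SlotSpec` record, the `check_frame` function, and the sample frames are assumptions for illustration; only the Keq and Left constraints themselves come from the slides. For simplicity the class test uses exact equality rather than walking the taxonomy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SlotSpec:
    domain: str                        # class the slot pertains to
    value_type: Tuple[type, ...]       # accepted Python types for each value
    max_values: Optional[int] = None   # cardinality limit (None = unbounded)
    value_class: Optional[str] = None  # required class for frame-valued slots


# Constraints taken from the slide: Keq is a single number on Reactions;
# values of Left on Reactions must be Chemicals.
SLOT_SPECS = {
    "Keq": SlotSpec("Reactions", (int, float), max_values=1),
    "Left": SlotSpec("Reactions", (str,), value_class="Chemicals"),
}

# Hypothetical instance -> class assignments.
FRAME_CLASS = {
    "rxn-1": "Reactions",
    "succinate": "Chemicals",
    "FAD": "Chemicals",
    "sdhA": "Genes",
}


def check_frame(name, slots):
    """Return a list of constraint violations for one frame's slot values."""
    errors = []
    for slot, values in slots.items():
        spec = SLOT_SPECS.get(slot)
        if spec is None:
            errors.append(f"{name}.{slot}: unknown slot")
            continue
        if FRAME_CLASS.get(name) != spec.domain:
            errors.append(f"{name}.{slot}: slot pertains to {spec.domain}")
        if spec.max_values is not None and len(values) > spec.max_values:
            errors.append(f"{name}.{slot}: at most {spec.max_values} value(s) allowed")
        for value in values:
            if not isinstance(value, spec.value_type):
                errors.append(f"{name}.{slot}: {value!r} has the wrong data type")
            elif spec.value_class and FRAME_CLASS.get(value) != spec.value_class:
                errors.append(f"{name}.{slot}: {value!r} is not in class {spec.value_class}")
    return errors


good = check_frame("rxn-1", {"Keq": [0.8], "Left": ["succinate", "FAD"]})
bad = check_frame("rxn-1", {"Keq": [0.8, 1.5], "Left": ["sdhA"]})
print(good)
print(bad)
```

The same metadata table could also drive a schema-aware input form, since it states which slots apply to a class and what values each accepts.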
Pathway Tools Inference Layer • Commonly used queries implemented as stored procedures • Infer what is implicitly recorded in the KB
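Compute Transitive Relationships • [Figure: frame network for the reaction succinate + FAD = fumarate + FADH2. The reaction's left slot holds succinate and FAD, its right slot holds fumarate and FADH2, and it is in-pathway TCA Cycle. An Enzymatic-reaction frame links the reaction to the enzyme Succinate dehydrogenase (catalyzes), whose components Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1 and Sdh-membrane-2 (component-of) are the products of genes sdhA, sdhB, sdhC and sdhD on the chromosome]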
Pathway Tools Inference Layer • Enumerate reactions given alternative definitions of a reaction: all, enzyme, transport, small-mol, smm • All substrates, all cofactors, all transported chemicals • Protein tests: Is X a transcription factor, enzyme, transporter • Rather than force user to manually assign physiological roles, compute when possible from biochemical function • Transcription-unit-binding-sites • Compute in parts hierarchy: monomers-of-protein, components-of-protein, genes-of-protein, modified-forms • Complex: regulon-of-protein, regulator-proteins-of-transcription-unit
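The parts-hierarchy queries can be sketched as a recursive walk over component links. The function names mirror `monomers-of-protein` and `genes-of-protein` from the slide, but the implementation, the data layout, and the pairing of each subunit with a specific gene are assumptions made for this illustration.

```python
# Components slot of protein complexes (subunit names from the succinate
# dehydrogenase example).
COMPONENTS = {
    "Succinate-dehydrogenase": [
        "Sdh-flavo", "Sdh-Fe-S", "Sdh-membrane-1", "Sdh-membrane-2",
    ],
}

# Gene slot of each polypeptide; this subunit-to-gene pairing is assumed.
GENE = {
    "Sdh-flavo": "sdhA",
    "Sdh-Fe-S": "sdhB",
    "Sdh-membrane-1": "sdhC",
    "Sdh-membrane-2": "sdhD",
}


def monomers_of_protein(protein):
    """Recursively expand a complex into its monomers (a monomer yields itself)."""
    parts = COMPONENTS.get(protein)
    if not parts:
        return [protein]
    monomers = []
    for part in parts:
        monomers.extend(monomers_of_protein(part))
    return monomers


def genes_of_protein(protein):
    """Follow component links down to monomers, then the Gene slot of each."""
    return sorted({GENE[m] for m in monomers_of_protein(protein) if m in GENE})


print(genes_of_protein("Succinate-dehydrogenase"))
print(genes_of_protein("Sdh-flavo"))
```

The caller asks one question about a protein and never needs to know whether it is a monomer or a nested complex, which is the point of storing such queries once in an inference layer.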
What Killer Apps have Ontologies Enabled? • What comes after pie charts and drill-down interfaces?
Terminology – Pathway Tools Software • PathoLogic • Prediction of metabolic network from genome • Computational creation of new Pathway/Genome Databases • Pathway/Genome Editors • Distributed curation of PGDBs • Distributed object database system, interactive editing tools • Pathway/Genome Navigator • WWW publishing of PGDBs • Querying, visualization of pathways, chromosomes, operons • Analysis operations • Pathway visualization of gene-expression data • Global comparisons of metabolic networks
BioCyc Collection of Pathway/Genome DBs • Computationally Derived Datasets: • Agrobacterium tumefaciens • Caulobacter crescentus • Chlamydia trachomatis • Bacillus subtilis • Helicobacter pylori • Haemophilus influenzae • Mycobacterium tuberculosis H37Rv • Mycobacterium tuberculosis CDC1551 • Mycoplasma pneumoniae • Pseudomonas aeruginosa • Saccharomyces cerevisiae • Treponema pallidum • Vibrio cholerae • Yellow Underlined = Open Database • Literature-based Datasets: • MetaCyc • Escherichia coli (EcoCyc) • http://BioCyc.org/
Pathway/Genome DBs Created by External Users • Plasmodium falciparum, Stanford University • plasmocyc.stanford.edu • Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington • Arabidopsis.org:1555 • Methanococcus jannaschii, EBI • Maine.ebi.ac.uk:1555 • Other PGDBs in progress by 20 other users • Software freely available • Each PGDB owned by its creator
Ontology Reuse • A holy grail in AI since “ontology” became a buzz-word • Decrease knowledge acquisition bottleneck • GO qualifies as a large success in ontology reuse • Pathway Tools ontology reused across 18 PGDBs • Pathway Tools algorithms portable across all PGDBs
Pathway Tools Algorithms • Visualization and editing tools for following datatypes • Full Metabolic Map: paint gene expression data on metabolic network; compare metabolic networks • Pathways: pathway prediction • Reactions: balance checker • Compounds: chemical substructure comparison • Enzymes, Transporters, Transcription Factors • Genes • Chromosomes • Operons: operon prediction; visualize genetic network
Inference of Metabolic Pathways • PathoLogic software integrates genome and pathway data to identify putative metabolic networks • [Figure: an annotated genomic sequence (genes/ORFs, gene products, DNA sequences) and the multi-organism pathway database MetaCyc (pathways, reactions, compounds) feed into PathoLogic, which produces a Pathway/Genome Database (pathways, reactions, compounds, gene products, genes, genomic map)]
PathoLogic Analysis Phases • Trial parsing of input data files [few days] • Initialize schema of new PGDB [3 min] • Create DB objects for replicons, genes, proteins [5 min] • Assign enzymes to reactions they catalyze [10 min / 1 week] • ferrochelatase • glutamate 1-semialdehyde 2,1-aminomutase • porphobilinogen deaminase • [Figure: pathway diagram with compounds A to G and enzymes E1, E2]
PathoLogic Analysis Phases • From assigned reactions, infer what pathways are present [5 min / few days] • Define metabolic overview diagram [1 day] • Define protein complexes [few days]
Killer App: Global Consistency Checking of Biochemical Network • Given: • A PGDB for an organism • A set of initial metabolites • Infer: • What set of products can be synthesized by the small-molecule metabolism of the organism • Can known growth medium yield known essential compounds? • Pacific Symposium on Biocomputing p471 2001
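Algorithm: Forward Propagation • [Figure: the nutrient set enters the metabolite set via transport; reactions from the PGDB reaction pool whose reactants are all in the metabolite set "fire" and add their products to the metabolite set]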
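Forward propagation can be sketched as a fixed-point loop: keep firing any reaction whose reactants are all available until no new metabolite appears. This is a minimal illustration under simplifying assumptions (reactions are one-directional reactant/product set pairs, transport is ignored); the function name and the toy network are invented and are not the published implementation.

```python
def forward_propagate(nutrients, reactions):
    """Return every metabolite reachable from the nutrient set.

    reactions is an iterable of (reactants, products) pairs of sets.
    """
    metabolites = set(nutrients)
    pending = list(reactions)
    changed = True
    while changed:
        changed = False
        still_pending = []
        for reactants, products in pending:
            if set(reactants) <= metabolites:
                # The reaction fires: its products join the metabolite set.
                new = set(products) - metabolites
                if new:
                    metabolites |= new
                    changed = True
            else:
                still_pending.append((reactants, products))
        pending = still_pending
    return metabolites


# Hypothetical toy network: X is never available, so Y cannot be made.
toy_reactions = [
    ({"A"}, {"B"}),
    ({"B", "C"}, {"D"}),
    ({"B"}, {"C"}),
    ({"X"}, {"Y"}),
]
reachable = forward_propagate({"A"}, toy_reactions)
essential = {"D", "Y"}
print(sorted(reachable))
print(sorted(essential - reachable))  # essential compounds that cannot be produced
```

Comparing the reachable set with a list of essential compounds gives the consistency check: anything left over points to a missing nutrient, a missing reaction, or a bug in the database.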
Results • Phase I: Forward propagation • 21 initial compounds yielded only half of 38 essential compounds for E. coli • Phase II: Manually identify • Bugs in EcoCyc (e.g., two objects for tryptophan) • Missing initial protein substrates (e.g., ACP) • Missing pathways in EcoCyc • Phase III: Forward propagation with 11 more initial metabolites • Yielded all 38 essential compounds
Aggregate Properties of the E. coli Metabolic Network • EcoCyc is not a complete picture of E. coli metabolism • 30% of E. coli genes remain unidentified • Analysis pertains to pathways of small-molecule metabolism • Computed with respect to EcoCyc v4.5 (Sep-1998) • Joint work with Christos Ouzounis of EBI • Genome Research 10:268 2001
Enzymes • 4391 genes in E. coli genome • 4288 code for proteins • 676 (15%) gene products form 607 enzymes • Of the 607 enzymes, 296 are monomers, 311 are multimers • 90% of genes for heteromultimers are linked
Reactions • 744 reactions of small-molecule metabolism • 582 assigned to at least one pathway
Compounds • 791 substrates in the 744 reactions • Each reaction contains 4.0 substrates on average • Each substrate appears in 2.1 reactions
Enzyme Modulation • 805 enzymatic-reaction objects in EcoCyc • 80 have physiological inhibitors • 22 have physiological activators • 17 have both • 43% have a modulator • 327 require a cofactor or prosthetic group
Enzyme-Reaction Associations • 585 reactions catalyzed by 1 enzyme • 55 reactions catalyzed by 2 enzymes • 12 reactions catalyzed by 3 enzymes • 1 reaction catalyzed by 4 enzymes • 483 reactions belong to a single pathway • 99 reactions belong to multiple pathways • 100 of the 607 E. coli enzymes are multifunctional
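The counts on this slide can be cross-checked with simple arithmetic, and the same kind of histogram is easy to derive from a reaction-to-enzymes mapping. The reported figures are taken from the slides; the toy mapping and the function name are hypothetical.

```python
from collections import Counter

# Figures reported on the slide: number of enzymes -> number of reactions.
reported = {1: 585, 2: 55, 3: 12, 4: 1}
catalyzed_reactions = sum(reported.values())

# Single-pathway plus multi-pathway reactions; this should equal the
# "582 assigned to at least one pathway" figure on the Reactions slide.
pathway_reactions = 483 + 99


def enzymes_per_reaction_histogram(catalysts):
    """Count reactions by how many enzymes catalyze them (skipping those with none)."""
    return Counter(len(enzymes) for enzymes in catalysts.values() if enzymes)


# Hypothetical toy data: reaction -> set of enzymes that catalyze it.
toy = {"rxn-a": {"E1"}, "rxn-b": {"E1", "E2"}, "rxn-c": set()}
hist = enzymes_per_reaction_histogram(toy)

print(catalyzed_reactions, pathway_reactions)
print(dict(hist))
```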
Pathway Tools Implementation • Allegro Common Lisp • Sun and PC platforms • Run as window application or WWW server • Ocelot object database • 250,000 lines of code • Lisp-based WWW server at BioCyc.org • Lisp process reads URLs from the network and generates GIF+HTML from PGDBs • Manages 15 PGDBs
Ocelot Knowledge Server Architecture • Frame data model • Classes, instances, inheritance • Persistent storage via disk files, Oracle DBMS • Concurrent development: Oracle • Single-user development: disk files • Read-only delivery: bundle data into binary program • Transaction logging facility • Schema evolution • Local disk cache to improve Internet performance • J. Intelligent Information Systems 1:155-94 1999
GKB Editor • Browser and editor for KBs and ontologies • Three editing tools: • Taxonomy editor • Frame editor • Relationships editor • All operations are schema driven • http://www.ai.sri.com/~gkb/user-man.html
The Common Lisp Programming Environment • Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)
Peter Norvig’s Solution • “I wrote my version in Lisp. It took me about 2 hours (compared to a range of 2-8.5 hours for the other Lisp programmers in the study, 3-25 for C/C++ and 4-63 for Java) and I ended up with 45 non-comment non-blank lines (compared with a range of 51-182 for Lisp, and 107-614 for the other languages). (That means that some Java programmer was spending 13 lines and 84 minutes to provide the functionality of each line of my Lisp program.)” • http://www.norvig.com/java-lisp.html
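The closing figures in the quote appear to follow from dividing the worst-case numbers by the 45-line Lisp program. The sketch below reproduces that arithmetic, assuming the 614-line and 63-hour extremes are the ones being compared; the variable names are illustrative.

```python
norvig_lines = 45        # non-comment, non-blank lines in the Lisp version
max_other_lines = 614    # upper end of the line-count range for the other languages
max_java_hours = 63      # upper end of the development-time range for Java

lines_ratio = max_other_lines / norvig_lines            # lines per Lisp line
minutes_per_line = max_java_hours * 60 / norvig_lines   # minutes per Lisp line

print(int(lines_ratio), minutes_per_line)
```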
Common Lisp Programming Environment • Interpreted and/or compiled execution • Fabulous debugging environment • High-level language • Interactive data exploration • Extensive built-in libraries • Dynamic redefinition • Find out more! • ALU.org -- Association of Lisp Users • BioLisp.org
Pathway Exchange Ontology • BioPathways group developing ontology and format for exchange of pathway data • Metabolic pathways • Signaling pathways • Protein interactions • Moving upwards from chemicals, proteins, to reactions and pathways • Working to extend CML • Draft ontology at http://www.ai.sri.com/pkarp/misc/interactions.html
Summary • Pathway Tools apps: • Predict pathways and generate PGDBs • Visualization and editing tools • Paint gene expression data; compare entire pathway maps • Global consistency checking of metabolic network • Characterize metabolic and genetic networks • New killer apps: • Interoperability • Text mining • Bake-off for genome annotation pipelines
BioCyc and Pathway Tools Availability • WWW BioCyc freely available to all • BioCyc.org • Six BioCyc DBs openly available to all • BioCyc DBs freely available to non-profits • Flatfiles downloadable from BioCyc.org • Binary executable: • Sun UltraSparc-170 w/ 64MB memory • PC, 400MHz CPU, 64MB memory, Windows-98 or newer • PerlCyc API • Pathway Tools freely available to non-profits
Acknowledgements • SRI: Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud • EcoCyc Project: Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier • MetaCyc Project: Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville • Stanford: Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh • Funding sources: • NIH National Center for Research Resources • NIH National Institute of General Medical Sciences • NIH National Human Genome Research Institute • Department of Energy Microbial Cell Project • DARPA BioSpice, UPC • BioCyc.org