The Pathway Tools Ontology and Inferencing Layer



Presentation Transcript


  1. The Pathway Tools Ontology and Inferencing Layer Peter D. Karp, Ph.D. SRI International

  2. Overview • Definitions • Ontologies ultimately exciting because of the inferences/computations they enable: Where are the ontology killer apps? • Adding more facets to an ontology increases inferences that can be made with it • Pathway Tools ontology and associated applications

  3. Terminology • Model Organism Database (MOD) – DB describing genome and other information about an organism • Pathway/Genome Database (PGDB) – MOD that combines information about • Pathways, reactions, substrates • Enzymes, transporters • Genes, replicons • Transcription factors, promoters, operons, DNA binding sites • BioCyc – Collection of 15 PGDBs at BioCyc.org • EcoCyc, AgroCyc, YeastCyc

  4. Terminology – Pathway Tools Software • PathoLogic • Prediction of metabolic network from genome • Computational creation of new Pathway/Genome Databases • Pathway/Genome Editors • Distributed curation of PGDBs • Distributed object database system, interactive editing tools • Pathway/Genome Navigator • WWW publishing of PGDBs • Querying, visualization of pathways, chromosomes, operons • Analysis operations • Pathway visualization of gene-expression data • Global comparisons of metabolic networks • Bioinformatics 18:S225 2002

  5. Ontology • Ontology = Terms + Taxonomy + Slots + Constraints

  6. Pathway Tools Ontology: Terms and Taxonomy • Pathway Tools ontology contains 916 classes • Define datatypes • Replicons, Genes, Operons, Promoters, Transcription Factor Binding Sites • Proteins: Enzymes, Transporters, Transcription Factors • Small molecule compounds • Reactions, pathways • Define taxonomies • Taxonomy of chemical compounds • Riley’s gene ontology • Taxonomy of metabolic pathways • EC system • Bioinformatics 16:269 2000

  7. Operations Enabled by Controlled Vocabulary • Equality testing: • Is the function of gene X in organism A the same as the function of gene Y in organism B? • Is location L1 in organism A the same as location L2 in organism B?

  8. Operations Enabled by Taxonomy • Counting / Pie charts • How many genes of category “small molecule metabolism” are in organism A? • Intersecting sets • How many of these up-regulated genes are in class “cell cycle”? • User search via drill down • Applying rules • If the substrate of X is an amino acid, then XXX
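
The counting and set-intersection operations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Pathway Tools code: the class names, the gene-to-class assignments, and the helper names `is_a` and `genes_in_class` are all made up for the example.

```python
# Hypothetical toy taxonomy: child class -> parent class.
PARENT = {
    "amino-acid-biosynthesis": "small-molecule-metabolism",
    "carbon-utilization": "small-molecule-metabolism",
    "small-molecule-metabolism": "metabolism",
    "cell-cycle": "cell-processes",
}

# Illustrative annotations only: gene -> most specific class it belongs to.
GENE_CLASS = {
    "trpA": "amino-acid-biosynthesis",
    "lacZ": "carbon-utilization",
    "ftsZ": "cell-cycle",
    "minD": "cell-cycle",
}


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or lies below it in the taxonomy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def genes_in_class(ancestor, gene_class=GENE_CLASS):
    """All genes annotated with ancestor or with any of its subclasses."""
    return {gene for gene, cls in gene_class.items() if is_a(cls, ancestor)}


# Counting: how many genes fall under "small molecule metabolism"?
count = len(genes_in_class("small-molecule-metabolism"))

# Intersecting sets: which up-regulated genes are in class "cell cycle"?
up_regulated = {"lacZ", "ftsZ"}
in_cell_cycle = up_regulated & genes_in_class("cell-cycle")
```

The point of the taxonomy is visible in `genes_in_class`: a gene annotated with a specific subclass is counted under every ancestor class without being annotated again.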

  9. Ontology • Ontology = Terms + Taxonomy + Slots + Constraints

  10. Pathway Tools Ontology: Slots • Pathway Tools ontology contains 199 slots • Categories of slots: • Meta-data: Creator, Creation-Date • Textual data: Common-Name, Synonyms, Comment, Citations • Attributes: Molecular-Weight, pI • Relationships: Gene, Catalyzes, In-Reaction • Give stats on how many slots in each of these classes

  11. Pathway Tools Ontology: Slots • Slots introduced at appropriate place in taxonomy • Child classes inherit the slot; parent classes do not • Examples: • Proteins: pI, MolWt, Component-Of • Polypeptides: Gene • Protein-Complexes: Components • Reactions: Left, Right, Keq, In-Pathway • Pathways: Reaction-List, Predecessor-List • Transcription Units: Components • Genes: Product, Component-Of
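
Slot inheritance can be illustrated with the protein examples from this slide. The sketch below assumes that Polypeptides and Protein-Complexes are child classes of Proteins; the dictionaries and the function `slots_of` are hypothetical names for this example, not part of the Pathway Tools API.

```python
# Assumed fragment of the class taxonomy: child class -> parent class.
PARENT = {
    "Polypeptides": "Proteins",
    "Protein-Complexes": "Proteins",
    "Proteins": None,
}

# Slots listed at the class where they are introduced (taken from the slide).
SLOTS_INTRODUCED = {
    "Proteins": ["pI", "MolWt", "Component-Of"],
    "Polypeptides": ["Gene"],
    "Protein-Complexes": ["Components"],
}


def slots_of(cls):
    """Slots available on cls: its own plus those inherited from ancestors."""
    slots = []
    while cls is not None:
        # Prepend so that slots of more general classes come first.
        slots = SLOTS_INTRODUCED.get(cls, []) + slots
        cls = PARENT.get(cls)
    return slots


polypeptide_slots = slots_of("Polypeptides")
protein_slots = slots_of("Proteins")
```

Here `polypeptide_slots` contains pI, MolWt, and Component-Of in addition to Gene, while `protein_slots` does not contain Gene: the child inherits from the parent, never the reverse.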

  12. Operations Enabled by Slots • Store/retrieve attributes of an entity • Get pI of protein • Get citations associated with pathway • Traverse network of semantic relationships • Find all substrates of all reactions in pathway X • Find all genes that encode an enzyme that catalyzes a reaction in pathway X • Find all regulons encoding multiple metabolic pathways
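
Two of the traversal queries above can be sketched over a toy frame store. The frame names and the helper functions are invented for illustration, and the model is simplified: here the Catalyzes slot points directly at reactions, which may not match how the real schema links enzymes to reactions.

```python
# Hypothetical frames: frame name -> {slot name: list of values}.
FRAMES = {
    "pathway-X": {"Reaction-List": ["rxn-1", "rxn-2"]},
    "rxn-1": {"Left": ["A", "B"], "Right": ["C"]},
    "rxn-2": {"Left": ["C"], "Right": ["D", "B"]},
    "enzyme-1": {"Catalyzes": ["rxn-1"], "Gene": ["gene-1"]},
    "enzyme-2": {"Catalyzes": ["rxn-2"], "Gene": ["gene-2"]},
    "enzyme-3": {"Catalyzes": ["rxn-9"], "Gene": ["gene-3"]},
}


def get_slot(frame, slot):
    """Values of a slot, or an empty list if the frame or slot is absent."""
    return FRAMES.get(frame, {}).get(slot, [])


def substrates_of_pathway(pathway):
    """All substrates (both sides) of all reactions in the pathway."""
    substrates = set()
    for rxn in get_slot(pathway, "Reaction-List"):
        substrates.update(get_slot(rxn, "Left"))
        substrates.update(get_slot(rxn, "Right"))
    return substrates


def genes_of_pathway(pathway):
    """Genes encoding an enzyme that catalyzes a reaction in the pathway."""
    reactions = set(get_slot(pathway, "Reaction-List"))
    genes = set()
    for frame in FRAMES:
        if reactions & set(get_slot(frame, "Catalyzes")):
            genes.update(get_slot(frame, "Gene"))
    return genes


substrates = substrates_of_pathway("pathway-X")
genes = genes_of_pathway("pathway-X")
```

Each query is a short chain of slot lookups, which is what "traversing the network of semantic relationships" amounts to in a frame system.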

  13. Ontology • Ontology = Terms + Taxonomy + Slots + Constraints

  14. Pathway Tools Ontology: Constraints • Every Pathway Tools slot has associated metadata: • Class(es) to which it pertains • Keq pertains to Reactions • Data type (number, string, frame, etc.) • Keq data type is number • Collection type (list, bag) • Keq is not a collection • Documentation string • Cardinality constraints • At most one Keq value • Range constraints • Taxonomy constraints • Values of Left slot of Reactions must be Chemicals

  15. Operations Enabled by Constraints • Constraints make a system “intelligent” because they encode definitions in a machine-understandable fashion • Automated DB consistency checkers (batch or interactive) • Schema-driven data input tools • Subsumption – Compare two concept definitions
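
A batch consistency checker driven by slot metadata might look like the sketch below. It encodes only the Keq and Left constraints stated on the previous slide; the `SlotSpec` structure, the toy knowledge base, and the function `check` are assumptions made for this example and do not reflect the actual Pathway Tools checker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotSpec:
    """Hypothetical slot metadata record."""
    domain: str                        # class the slot pertains to
    value_type: Optional[type] = None  # data type for scalar values
    value_class: Optional[str] = None  # required class for frame values
    max_values: Optional[int] = None   # cardinality constraint


SCHEMA = {
    "Keq": SlotSpec(domain="Reactions", value_type=float, max_values=1),
    "Left": SlotSpec(domain="Reactions", value_class="Chemicals"),
}

# Toy knowledge base: one valid reaction and one that violates two constraints.
KB = {
    "succinate": {"class": "Chemicals", "slots": {}},
    "sdhA": {"class": "Genes", "slots": {}},
    "rxn-ok": {"class": "Reactions",
               "slots": {"Left": ["succinate"], "Keq": [1.5]}},
    "rxn-bad": {"class": "Reactions",
                "slots": {"Left": ["sdhA"], "Keq": [1.5, 2.0]}},
}


def check(kb, schema):
    """Return a list of (frame, slot, problem) constraint violations."""
    problems = []
    for name, frame in kb.items():
        for slot, values in frame["slots"].items():
            spec = schema.get(slot)
            if spec is None:
                problems.append((name, slot, "unknown slot"))
                continue
            # Simplification: exact class match, no subclass reasoning.
            if frame["class"] != spec.domain:
                problems.append((name, slot, "slot does not pertain to class"))
            if spec.max_values is not None and len(values) > spec.max_values:
                problems.append((name, slot, "too many values"))
            for value in values:
                if spec.value_type and not isinstance(value, spec.value_type):
                    problems.append((name, slot, "wrong data type"))
                if spec.value_class and \
                        kb.get(value, {}).get("class") != spec.value_class:
                    problems.append((name, slot, "wrong value class"))
    return problems


violations = check(KB, SCHEMA)
```

Because the checker reads everything it needs from `SCHEMA`, adding a new constraint means adding metadata rather than writing new checking code, which is the sense in which such tools are schema driven.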

  16. Pathway Tools Inference Layer • Commonly used queries implemented as stored procedures • Infer what is implicitly recorded in the KB

  17. Compute Transitive Relationships • [Figure: frame network around the reaction succinate + FAD = fumarate + FADH2. Frames shown: TCA Cycle, the reaction, an Enzymatic-reaction, Succinate dehydrogenase, the subunits Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2, the genes sdhA, sdhB, sdhC, sdhD, and Chrom. Edge labels: in-pathway, left, right, reaction, catalyzes, component-of, product]

  18. Pathway Tools Inference Layer • Enumerate reactions given alternative definitions of a reaction: all, enzyme, transport, small-mol, smm • All substrates, all cofactors, all transported chemicals • Protein tests: Is X a transcription factor, enzyme, transporter • Rather than force user to manually assign physiological roles, compute when possible from biochemical function • Transcription-unit-binding-sites • Compute in parts hierarchy: monomers-of-protein, components-of-protein, genes-of-protein, modified-forms • Complex: regulon-of-protein, regulator-proteins-of-transcription-unit
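
The parts-hierarchy computations can be sketched with the succinate dehydrogenase frames from the previous slide. The function names echo monomers-of-protein and genes-of-protein but are a hypothetical Python rendering, not the Lisp originals, and the pairing of Sdh-membrane-1 with sdhC and Sdh-membrane-2 with sdhD is an assumption.

```python
# Components slot: complex -> its direct parts (names from the slide).
COMPONENTS = {
    "Succinate dehydrogenase": [
        "Sdh-flavo", "Sdh-Fe-S", "Sdh-membrane-1", "Sdh-membrane-2",
    ],
}

# Gene slot: polypeptide -> encoding gene. The membrane subunit pairing
# is assumed for illustration.
GENE = {
    "Sdh-flavo": "sdhA",
    "Sdh-Fe-S": "sdhB",
    "Sdh-membrane-1": "sdhC",
    "Sdh-membrane-2": "sdhD",
}


def monomers_of_protein(protein):
    """Recursively expand a complex into its monomer subunits."""
    parts = COMPONENTS.get(protein)
    if not parts:
        return [protein]  # no components: the protein is itself a monomer
    monomers = []
    for part in parts:
        monomers.extend(monomers_of_protein(part))
    return monomers


def genes_of_protein(protein):
    """Genes encoding any monomer of the protein, sorted by name."""
    return sorted({GENE[m] for m in monomers_of_protein(protein) if m in GENE})


sdh_monomers = monomers_of_protein("Succinate dehydrogenase")
sdh_genes = genes_of_protein("Succinate dehydrogenase")
```

The recursion handles complexes nested inside complexes, so a caller never needs to know how deep the parts hierarchy goes for a given protein.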

  19. What Killer Apps have Ontologies Enabled? • What comes after pie charts and drill-down interfaces?

  20. Terminology – Pathway Tools Software • PathoLogic • Prediction of metabolic network from genome • Computational creation of new Pathway/Genome Databases • Pathway/Genome Editors • Distributed curation of PGDBs • Distributed object database system, interactive editing tools • Pathway/Genome Navigator • WWW publishing of PGDBs • Querying, visualization of pathways, chromosomes, operons • Analysis operations • Pathway visualization of gene-expression data • Global comparisons of metabolic networks

  21. BioCyc Collection of Pathway/Genome DBs • Computationally Derived Datasets: • Agrobacterium tumefaciens • Caulobacter crescentus • Chlamydia trachomatis • Bacillus subtilis • Helicobacter pylori • Haemophilus influenzae • Mycobacterium tuberculosis H37Rv • Mycobacterium tuberculosis CDC1551 • Mycoplasma pneumoniae • Pseudomonas aeruginosa • Saccharomyces cerevisiae • Treponema pallidum • Vibrio cholerae • Literature-based Datasets: • MetaCyc • Escherichia coli (EcoCyc) • Yellow Underlined = Open Database • http://BioCyc.org/

  22. Pathway/Genome DBs Created by External Users • Plasmodium falciparum, Stanford University • plasmocyc.stanford.edu • Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington • Arabidopsis.org:1555 • Methanococcus jannaschii, EBI • Maine.ebi.ac.uk:1555 • Other PGDBs in progress by 20 other users • Software freely available • Each PGDB owned by its creator

  23. Ontology Reuse • A holy grail in AI since “ontology” became a buzz-word • Decrease knowledge acquisition bottleneck • GO qualifies as a large success in ontology reuse • Pathway Tools ontology reused across 18 PGDBs • Pathway Tools algorithms portable across all PGDBs

  24. Pathway Tools Algorithms • Visualization and editing tools for following datatypes • Full Metabolic Map: Paint gene expression data on metabolic network; compare metabolic networks • Pathways: Pathway prediction • Reactions: Balance checker • Compounds: Chemical substructure comparison • Enzymes, Transporters, Transcription Factors • Genes • Chromosomes • Operons: Operon prediction; visualize genetic network

  25. Inference of Metabolic Pathways • PathoLogic Software integrates genome and pathway data to identify putative metabolic networks • [Figure: inputs are an Annotated Genomic Sequence (Genes/ORFs, Gene Products, DNA Sequences) and the Multi-organism Pathway Database (MetaCyc) (Pathways, Reactions, Compounds); output is a Pathway/Genome Database (Pathways, Reactions, Compounds, Gene Products, Genes, Genomic Map)]

  26. PathoLogic Analysis Phases • Trial parsing of input data files [few days] • Initialize schema of new PGDB [3 min] • Create DB objects for replicons, genes, proteins [5 min] • Assign enzymes to reactions they catalyze [10 min / 1 week] • ferrochelatase • glutamate 1-semialdehyde 2,1-aminomutase • porphobilinogen deaminase • [Figure: diagram with labels A through G, E1, E2]

  27. PathoLogic Analysis Phases • From assigned reactions, infer what pathways are present [5 min / few days] • Define metabolic overview diagram [1 day] • Define protein complexes [few days]

  28. Killer App: Global Consistency Checking of Biochemical Network • Given: • A PGDB for an organism • A set of initial metabolites • Infer: • What set of products can be synthesized by the small-molecule metabolism of the organism • Can known growth medium yield known essential compounds? • Pacific Symposium on Biocomputing p471 2001

  29. Algorithm: Forward Propagation • [Figure: flow diagram with labels Nutrient set, Transport, Metabolite set, Reactants, “Fire” reactions, PGDB reaction pool, Products]
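
Forward propagation can be sketched as a fixed-point loop: any reaction whose reactants are all in the metabolite set "fires" and adds its products to the set, until nothing new can fire. The version below is a simplified illustration with made-up compound names; it treats reactions as irreversible and ignores transport, and it is not the published implementation.

```python
def forward_propagate(reactions, nutrients):
    """Return every metabolite reachable from the nutrient set.

    reactions: iterable of (reactants, products) pairs of compound names.
    nutrients: iterable of compound names available at the start.
    """
    metabolites = set(nutrients)
    remaining = list(reactions)
    fired_any = True
    while fired_any:
        fired_any = False
        not_yet_fired = []
        for reactants, products in remaining:
            if set(reactants) <= metabolites:
                # All reactants are available, so the reaction fires.
                metabolites.update(products)
                fired_any = True
            else:
                not_yet_fired.append((reactants, products))
        # Fired reactions are dropped, so the loop must terminate.
        remaining = not_yet_fired
    return metabolites


# Hypothetical reaction pool and nutrient set.
reaction_pool = [
    (["A", "B"], ["C"]),
    (["C"], ["D"]),
    (["X"], ["E"]),  # can never fire: X is not reachable
]
reachable = forward_propagate(reaction_pool, {"A", "B"})

# Consistency check: which essential compounds cannot be produced?
essential = {"D", "E"}
missing = essential - reachable
```

A non-empty `missing` set is the signal of interest: it points either to an error in the database or to a nutrient that was left out of the initial set.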

  30. Results • Phase I: Forward propagation • 21 initial compounds yielded only half of 38 essential compounds for E. coli • Phase II: Manually identify • Bugs in EcoCyc (e.g., two objects for tryptophan) • Missing initial protein substrates (e.g., ACP) • Missing pathways in EcoCyc • Phase III: Forward propagation with 11 more initial metabolites • Yielded all 38 essential compounds

  31. How to Characterize the Metabolic Network of a Cell?

  32. Aggregate Properties of the E. coli Metabolic Network • EcoCyc is not a complete picture of E. coli metabolism • 30% of E. coli genes remain unidentified • Analysis pertains to pathways of small-molecule metabolism • Computed with respect to EcoCyc v4.5 (Sep-1998) • Joint work with Christos Ouzounis of EBI • Genome Research 10:268 2001

  33. Enzymes • 4391 genes in E. coli genome • 4288 code for proteins • 676 (15%) gene products form 607 enzymes • Of the 607 enzymes, 296 are monomers, 311 are multimers • 90% of genes for heteromultimers are linked

  34. Reactions • 744 reactions of small-molecule metabolism • 582 assigned to at least one pathway

  35. Compounds • 791 substrates in the 744 reactions • Each reaction contains 4.0 substrates on average • Each substrate appears in 2.1 reactions

  36. Enzyme Modulation • 805 enzymatic-reaction objects in EcoCyc • 80 have physiological inhibitors • 22 have physiological activators • 17 have both • 43% have a modulator • 327 require a cofactor or prosthetic group

  37. Enzyme-Reaction Associations • 585 reactions catalyzed by 1 enzyme • 55 reactions catalyzed by 2 enzymes • 12 reactions catalyzed by 3 enzymes • 1 reaction catalyzed by 4 enzymes • 483 reactions belong to a single pathway • 99 reactions belong to multiple pathways • 100 of the 607 E. coli enzymes are multifunctional

  38. Pathway Tools Implementation • Allegro Common Lisp • Sun and PC platforms • Run as window application or WWW server • Ocelot object database • 250,000 lines of code • Lisp-based WWW server at BioCyc.org • Lisp process reads URLs from the network and generates GIF+HTML from PGDBs • Manages 15 PGDBs

  39. Ocelot Knowledge Server Architecture • Frame data model • Classes, instances, inheritance • Persistent storage via disk files, Oracle DBMS • Concurrent development: Oracle • Single-user development: disk files • Read-only delivery: bundle data into binary program • Transaction logging facility • Schema evolution • Local disk cache to improve Internet performance • J. Intelligent Information Systems 1:155-94 1999

  40. GKB Editor • Browser and editor for KBs and ontologies • Three editing tools: • Taxonomy editor • Frame editor • Relationships editor • All operations are schema driven • http://www.ai.sri.com/~gkb/user-man.html

  41. The Common Lisp Programming Environment • Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)

  42. Peter Norvig’s Solution • “I wrote my version in Lisp. It took me about 2 hours (compared to a range of 2-8.5 hours for the other Lisp programmers in the study, 3-25 for C/C++ and 4-63 for Java) and I ended up with 45 non-comment non-blank lines (compared with a range of 51-182 for Lisp, and 107-614 for the other languages). (That means that some Java programmer was spending 13 lines and 84 minutes to provide the functionality of each line of my Lisp program.)” • http://www.norvig.com/java-lisp.html

  43. Common Lisp Programming Environment • Interpreted and/or compiled execution • Fabulous debugging environment • High-level language • Interactive data exploration • Extensive built-in libraries • Dynamic redefinition • Find out more! • ALU.org -- Association of Lisp Users • BioLisp.org

  44. Pathway Exchange Ontology • BioPathways group developing ontology and format for exchange of pathway data • Metabolic pathways • Signaling pathways • Protein interactions • Moving upwards from chemicals, proteins, to reactions and pathways • Working to extend CML • Draft ontology at http://www.ai.sri.com/pkarp/misc/interactions.html

  45. Summary • Pathway Tools apps: • Predict pathways and generate PGDBs • Visualization and editing tools • Paint gene expression data; compare entire pathway maps • Global consistency checking of metabolic network • Characterize metabolic and genetic networks • New killer apps: • Interoperability • Text mining • Bake-off for genome annotation pipelines

  46. BioCyc and Pathway Tools Availability • WWW BioCyc freely available to all • BioCyc.org • Six BioCyc DBs openly available to all • BioCyc DBs freely available to non-profits • Flatfiles downloadable from BioCyc.org • Binary executable: • Sun UltraSparc-170 w/ 64MB memory • PC, 400MHz CPU, 64MB memory, Windows-98 or newer • PerlCyc API • Pathway Tools freely available to non-profits

  47. Acknowledgements • SRI: Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud • EcoCyc Project: Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier • MetaCyc Project: Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville • Stanford: Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh • Funding sources: • NIH National Center for Research Resources • NIH National Institute of General Medical Sciences • NIH National Human Genome Research Institute • Department of Energy Microbial Cell Project • DARPA BioSpice, UPC • BioCyc.org