
Computing with Pathway/Genome Databases

This overview provides information on Pathway/Genome Databases, tools for data access and query mechanisms, schema motivations, and data exchange formats. Learn about the APIs, data import/export, and programmatic access to BioCyc databases.

presleyj


Presentation Transcript


  1. Computing with Pathway/Genome Databases

  2. Overview • Summary of Pathway Tools data access mechanisms and formats • Pathway Tools APIs • Overview of Pathway Tools schema

  3. Motivations for Understanding the Schema • Complex queries to PGDBs must refer to classes and slots within the schema • Queries using Lisp, Perl, Java APIs • Queries using Structured Advanced Query Form • Queries using BioVelo

  4. More Information • Pathway Tools Web Site, Tutorial Slides • http://bioinformatics.ai.sri.com/ptools/ • http://brg.ai.sri.com/ptools/ptools-resources.html • Pathway Tools User’s Guide • Appendix: Guide to the Pathway Tools Schema • Curator's Guide • http://bioinformatics.ai.sri.com/ptools/curatorsguide.pdf

  5. References • Ontology Papers section of http://biocyc.org/publications.shtml • "An Evidence Ontology for use in Pathway/Genome Databases" • "An ontology for biological function based on molecular interactions" • "Representations of metabolic knowledge: Pathways" • "Representations of metabolic knowledge"

  6. Data Exchange • APIs: Lisp API, Java API, and Perl API • Read and modify access • Cyclone • Export to files • BioPAX export (biopax.org) • Export PGDB genome to GenBank format • Export entire PGDB in column-delimited and attribute-value file formats • Export PGDB reactions as SBML (sbml.org) • Import/export of pathways between PGDBs • Import/export of selected frames, for spreadsheets • Import/export of compounds as Molfile, CML • BioWarehouse: loader for flat files, SQL access • http://bioinformatics.ai.sri.com/biowarehouse/ • BMC Bioinformatics 7:170 2006

  7. Pathway Tools Ontology / Schema • Ontology classes: 1621 • Datatype classes: Define objects from genomes to pathways • Classification systems for pathways, chemical compounds, enzymatic reactions (EC system) • Protein Feature ontology • Controlled vocabularies: • Cell Component Ontology • Evidence codes • Comprehensive set of 279 attributes and relationships

  8. Root Classes in the Pathway Tools Ontology • Chemicals -- All molecules • Polymer-Segments -- Regions of polymers • Protein-Features -- Features on proteins • Paralogous-Gene-Groups • Organisms • Generalized-Reactions -- Reactions and pathways • Enzymatic-Reactions -- Link enzymes to reactions they catalyze • Regulation -- Regulatory interactions • CCO -- Cell Component Ontology • Evidence -- Evidence ontology • Notes -- Timestamped, person-stamped notes • Organizations • People • Publications

  9. Learn to Learn About the Schema

  10. Use GKB Editor to Inspect the Pathway Tools Ontology • GKB Editor = Generic Knowledge Base Editor • Type in Navigator window: (GKB) or • [Right-Click] Edit->Ontology Editor • View->Browse Class Hierarchy • [Middle-Click] to expand hierarchy • To view classes or instances, select them and: • Frame -> List Frame Contents • Frame -> Edit Frame

  11. Use the SAQP to Inspect the Schema

  12. Pathway Tools Schema • Appendix of Pathway Tools User’s Guide • Schema overview diagram

  13. Principal Classes • Class names are capitalized, plural, separated by dashes • Genetic-Elements, with subclasses: • Chromosomes • Plasmids • Genes • Transcription-Units • RNAs • rRNAs, snRNAs, tRNAs, Charged-tRNAs • Proteins, with subclasses: • Polypeptides • Protein-Complexes

  14. Principal Classes • Reactions, with subclasses: • Transport-Reactions • Enzymatic-Reactions • Pathways • Compounds-And-Elements

  15. Principal Classes • Regulation

  16. Slot Links • [Diagram: frames connected by slots] • Frames: TCA Cycle; Succinate + FAD = fumarate + FADH2; Enzymatic-reaction; Succinate dehydrogenase; Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; sdhA, sdhB, sdhC, sdhD • Slots linking them: in-pathway, reaction, catalyzes, component-of, product
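The chain of slot links on this slide can be followed step by step. The sketch below is a toy in-memory model in Python, not the Pathway Tools API. The dictionary layout, the `follow` helper, the direction assigned to each slot, and the toy function named `pathways_of_gene` are all assumptions made for illustration. Only the frame and slot names come from the slide.

```python
# Toy frame store: frame ID -> {slot name -> list of values}.
# Slot directions are assumed for illustration; names are taken from the slide.
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "sdhB": {"product": ["Sdh-Fe-S"]},
    "sdhC": {"product": ["Sdh-membrane-1"]},
    "sdhD": {"product": ["Sdh-membrane-2"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-Fe-S": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-1": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-2": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["Enzymatic-reaction"]},
    "Enzymatic-reaction": {"reaction": ["Succinate + FAD = fumarate + FADH2"]},
    "Succinate + FAD = fumarate + FADH2": {"in-pathway": ["TCA Cycle"]},
}


def follow(frames, slot):
    """Collect the values of `slot` over all `frames`, without duplicates."""
    seen = []
    for frame in frames:
        for value in KB.get(frame, {}).get(slot, []):
            if value not in seen:
                seen.append(value)
    return seen


def pathways_of_gene(gene):
    """Walk gene -> product -> complex -> enzymatic reaction -> reaction -> pathway."""
    frames = [gene]
    for slot in ("product", "component-of", "catalyzes", "reaction", "in-pathway"):
        frames = follow(frames, slot)
    return frames


print(pathways_of_gene("sdhC"))  # ['TCA Cycle']
```

Each hop is a single slot lookup, which is why a query has to name the exact slots that connect two classes in the schema.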

  17. Programmatic Access to BioCyc • Common LISP • Native language of Pathway Tools • Interactive & Mature Environment • Full Access to the Data & Many Utility Functions • Source code is available for academics • PerlCyc • API of Functions, Exposed to Perl • Communication through UNIX Socket • JavaCyc • API of Functions, Exposed to Java • Communication through UNIX Socket • Cyclone

  18. Cyclone • Developed by Schachter and colleagues from Genoscope • http://nemo-cyclone.sourceforge.net/archi.php • Cyclone is a Java-based system that: • Extracts data from a Pathway Tools PGDB • Converts it to an XML schema • Maps the data to Java objects and to a relational database • Changes made to the data on the Java side can be committed back to a Pathway Tools PGDB

  19. Lisp API • Accessible whenever you start Pathway Tools with the -lisp argument • Lisp queries are evaluated inside the running Pathway Tools binary and execute very fast

  20. Generic Frame Protocol (GFP) • A library of procedures for accessing Ocelot DBs • GFP specification: • http://www.ai.sri.com/~gfp/spec/paper/paper.html • A small number of GFP functions are sufficient for most complex queries

  21. Example of a Single GFP Call • The general pattern: gfp-function(frame-ID slot-ID value ...) • In Lisp syntax: (gfp-function frame-ID slot-ID value ...) • Lisp example: (get-slot-values 'TRYPSYN-RXN 'LEFT) ==> (INDOLE-3-GLYCEROL-P SER)

  22. Generic Frame Protocol • get-class-all-instances (Class) • Returns the instances of Class • coercible-to-frame-p (Thing) • Is Thing a frame? Returns True if Thing is the name of a frame, or a frame object; else False

  23. Generic Frame Protocol • Notation Frame.Slot means a specified slot of a specified frame • get-slot-value(Frame Slot) • Returns first value of Frame.Slot • get-slot-values(Frame Slot) • Returns all values of Frame.Slot as a list • slot-has-value-p(Frame Slot) • Returns True if Frame.Slot has at least one value; else False • member-slot-value-p(Frame Slot Value) • Returns True if Value is one of the values of Frame.Slot; else False • print-frame(Frame) • Prints the contents of Frame • Note: Frame and Slot must be symbols!
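The read operations above can be pictured with a small stand-in. This is a minimal Python sketch of a frame store, not GFP or Ocelot. The `ToyKB` class and its storage layout are hypothetical. The method names mirror the GFP function names on the slide, and the sample data is the TRYPSYN-RXN example from slide 21.

```python
class ToyKB:
    """Hypothetical frame store: frame ID -> {slot name -> list of values}."""

    def __init__(self, frames):
        self.frames = frames

    def get_slot_values(self, frame, slot):
        """All values of Frame.Slot as a list (empty if there are none)."""
        return list(self.frames.get(frame, {}).get(slot, []))

    def get_slot_value(self, frame, slot):
        """First value of Frame.Slot, or None (standing in for Lisp NIL)."""
        values = self.get_slot_values(frame, slot)
        return values[0] if values else None

    def slot_has_value_p(self, frame, slot):
        """True if Frame.Slot has at least one value."""
        return bool(self.get_slot_values(frame, slot))

    def member_slot_value_p(self, frame, slot, value):
        """True if value is one of the values of Frame.Slot."""
        return value in self.get_slot_values(frame, slot)


kb = ToyKB({"TRYPSYN-RXN": {"LEFT": ["INDOLE-3-GLYCEROL-P", "SER"]}})
print(kb.get_slot_values("TRYPSYN-RXN", "LEFT"))  # ['INDOLE-3-GLYCEROL-P', 'SER']
print(kb.get_slot_value("TRYPSYN-RXN", "LEFT"))   # INDOLE-3-GLYCEROL-P
print(kb.slot_has_value_p("TRYPSYN-RXN", "RIGHT"))  # False
```

The difference between `get_slot_value` and `get_slot_values` matters for multi-valued slots such as LEFT, where the single-value form returns only the first value.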

  24. Generic Frame Protocol: Update Operations • put-slot-value(Frame Slot Value) • Replace the current value(s) of Frame.Slot with Value • put-slot-values(Frame Slot Value-List) • Replace the current value(s) of Frame.Slot with Value-List, which must be a list of values • add-slot-value(Frame Slot Value) • Add Value to the current value(s) of Frame.Slot, if any • remove-slot-value(Frame Slot Value) • Remove Value from the current value(s) of Frame.Slot • replace-slot-value(Frame Slot Old-Value New-Value) • In Frame.Slot, replace Old-Value with New-Value • remove-local-slot-values(Frame Slot) • Remove all of the values of Frame.Slot

  25. Generic Frame Protocol: Update Operations • save-kb • Saves the current KB
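The update operations differ mainly in whether they replace, extend, or shrink the list of values in a slot. The Python sketch below models that on a single hypothetical `ToyFrame` object. It is an illustration of the descriptions on the slides, not the GFP implementation. It does not model persistence (save-kb) or whether the real add-slot-value suppresses duplicates.

```python
class ToyFrame:
    """Hypothetical single frame: slot name -> list of values."""

    def __init__(self):
        self.slots = {}

    def put_slot_value(self, slot, value):
        # Replace whatever is there with one value.
        self.slots[slot] = [value]

    def put_slot_values(self, slot, values):
        # Replace whatever is there with a list of values.
        self.slots[slot] = list(values)

    def add_slot_value(self, slot, value):
        # Append to the current values, creating the slot entry if needed.
        self.slots.setdefault(slot, []).append(value)

    def remove_slot_value(self, slot, value):
        self.slots[slot] = [v for v in self.slots.get(slot, []) if v != value]

    def replace_slot_value(self, slot, old, new):
        self.slots[slot] = [new if v == old else v for v in self.slots.get(slot, [])]

    def remove_local_slot_values(self, slot):
        self.slots.pop(slot, None)


frame = ToyFrame()
frame.put_slot_values("SYNONYMS", ["a", "b"])
frame.add_slot_value("SYNONYMS", "c")
frame.replace_slot_value("SYNONYMS", "a", "z")
frame.remove_slot_value("SYNONYMS", "b")
print(frame.slots)  # {'SYNONYMS': ['z', 'c']}
```

In this toy, as in the slide descriptions, `put_slot_value` discards all existing values, so `add_slot_value` is the safer choice when extending a multi-valued slot.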

  26. Additional Pathway Tools Functions: Semantic Inference Layer • Semantic inference layer defines built-in functions to compute commonly required relationships in a PGDB • http://bioinformatics.ai.sri.com/ptools/ptools-fns.html

  27. PerlCyc and JavaCyc • Work on Unix (Solaris or Linux) only • Start up Pathway Tools with the -api argument • Pathway Tools listens on a Unix socket; the Perl or Java program communicates through this socket • Supports both querying and editing PGDBs • Must run the Perl or Java program on the same machine that runs Pathway Tools • This is a security measure, as the API server has no built-in security • Can only handle one connection at a time

  28. Obtaining PerlCyc and JavaCyc • Download from http://www.sgn.cornell.edu/downloads/ • PerlCyc written and maintained by Lukas Mueller at Boyce Thompson Institute for Plant Research • JavaCyc written by Thomas Yan at Carnegie Institute, maintained by Lukas Mueller • Easy to extend…

  29. Examples of PerlCyc, JavaCyc Functions • GFP functions (require knowledge of Pathway Tools schema): • PerlCyc: get_slot_values, get_class_all_instances, put_slot_values • JavaCyc: getSlotValues, getClassAllInstances, putSlotValues • Pathway Tools functions (described at http://bioinformatics.ai.sri.com/ptools/ptools-fns.html): • PerlCyc: genes_of_reaction, find_indexed_frame, pathways_of_gene, transport_p • JavaCyc: genesOfReaction, findIndexedFrame, pathwaysOfGene, transportP

  30. Writing a PerlCyc or JavaCyc program • Create a PerlCyc or JavaCyc object: perlcyc->new("ORGID") or new Javacyc("ORGID") • Call PerlCyc or JavaCyc functions on this object • Functions return object IDs, not objects • Must connect to the server again to retrieve attributes of an object

        # Perl
        my $cyc = perlcyc->new("ECOLI");
        my @pathways = $cyc->all_pathways();
        foreach my $p (@pathways) {
            print $cyc->get_slot_value($p, "COMMON-NAME") . "\n";
        }

        // Java
        Javacyc cyc = new Javacyc("ECOLI");
        ArrayList pathways = cyc.allPathways();
        for (int i = 0; i < pathways.size(); i++) {
            String pwy = (String) pathways.get(i);
            System.out.println(cyc.getSlotValue(pwy, "COMMON-NAME"));
        }

  31. Sample PerlCyc Query • Number of proteins in E. coli

        use perlcyc;
        my $cyc = perlcyc->new("ECOLI");
        my @proteins = $cyc->get_class_all_instances("|Proteins|");
        my $protein_count = scalar(@proteins);
        print "Protein count: $protein_count.\n";

  32. Sample PerlCyc Query • Print IDs of all proteins with molecular weight between 10 and 20 kD and pI between 4 and 5

        use perlcyc;
        my $cyc = perlcyc->new("ECOLI");
        foreach my $p ($cyc->get_class_all_instances("|Proteins|")) {
            my $mw = $cyc->get_slot_value($p, "molecular-weight-kd");
            my $pI = $cyc->get_slot_value($p, "pi");
            if ($mw <= 20 && $mw >= 10 && $pI <= 5 && $pI >= 4) {
                print "$p\n";
            }
        }

  33. Sample PerlCyc Query • List all the transcription factors in E. coli, and the genes that each regulates

        use perlcyc;
        my $cyc = perlcyc->new("ECOLI");
        foreach my $p ($cyc->get_class_all_instances("|Proteins|")) {
            if ($cyc->transcription_factor_p($p)) {
                my $name = $cyc->get_slot_value($p, "common-name");
                my %genes = ();
                foreach my $tu ($cyc->regulon_of_protein($p)) {
                    foreach my $g ($cyc->transcription_unit_genes($tu)) {
                        $genes{$g} = $cyc->get_slot_value($g, "common-name");
                    }
                }
                print "\n\n$name: ";
                print join " ", values %genes;
            }
        }

  34. Sample Editing Using PerlCyc • Add a link from each gene to the corresponding object in MY-DB (assume the ID is the same in both cases)

        use perlcyc;
        my $cyc = perlcyc->new("HPY");
        my @genes = $cyc->get_class_all_instances("|Genes|");
        foreach my $g (@genes) {
            $cyc->add_slot_value($g, "DBLINKS", "(MY-DB \"$g\")");
        }
        $cyc->save_kb();

  35. Sample JavaCyc Query: Enzymes for which ATP is a regulator

        import java.util.*;

        public class JavacycSample {
            public static void main(String[] args) {
                Javacyc cyc = new Javacyc("ECOLI");
                ArrayList regframes = cyc.getClassAllInstances("|Regulation-of-Enzyme-Activity|");
                for (int i = 0; i < regframes.size(); i++) {
                    String reg = (String) regframes.get(i);
                    // Is ATP one of the values of the Regulator slot?
                    boolean bool = cyc.memberSlotValueP(reg, "Regulator", "ATP");
                    if (bool) {
                        String enzrxn = cyc.getSlotValue(reg, "Regulated-Entity");
                        String enzyme = cyc.getSlotValue(enzrxn, "Enzyme");
                        System.out.println(enzyme);
                    }
                }
            }
        }

  36. Simple Lisp Query Example: Enzymes for which ATP is a regulator

        (defun atp-inhibits ()
          (loop for x in (get-class-all-instances '|Regulation-of-Enzyme-Activity|)
                ;; Does the Regulator slot contain the compound ATP, and is the mode
                ;; of regulation negative (inhibition)?
                when (and (member-slot-value-p x 'Regulator 'ATP)
                          (member-slot-value-p x 'Mode "-"))
                ;; Whenever the test is positive, we collect the value of the slot Enzyme
                ;; of the Regulated-Entity of the regulatory interaction frame.
                ;; The collected values are returned as a list, once the loop terminates.
                collect (get-slot-value (get-slot-value x 'Regulated-Entity) 'Enzyme)))

        ;;; Invoking the query:
        (select-organism :org-id 'ECOLI)
        (atp-inhibits)

        (get-slot-values 'TRYPSYN-RXN 'LEFT) ==> (INDOLE-3-GLYCEROL-P SER)

  37. Simple Perl Query Example: Enzymes for which ATP is a regulator

        use perlcyc;
        my $cyc = perlcyc->new("ECOLI");
        my @regs = $cyc->get_class_all_instances("|Regulation-of-Enzyme-Activity|");
        ## We check every instance of the class
        foreach my $reg (@regs) {
            ## We test whether the Regulator slot contains the compound frame ATP
            ## and whether the mode of regulation is negative (inhibition)
            my $bool1 = $cyc->member_slot_value_p($reg, "Regulator", "ATP");
            my $bool2 = $cyc->member_slot_value_p($reg, "Mode", "-");
            if ($bool1 && $bool2) {
                ## Whenever the test is positive, we collect the value of the slot Enzyme.
                ## The results are printed in the terminal.
                my $enzrxn = $cyc->get_slot_value($reg, "Regulated-Entity");
                my $enz = $cyc->get_slot_value($enzrxn, "Enzyme");
                print STDOUT "$enz\n";
            }
        }

  38. Getting started with Lisp • pathway-tools -lisp • (load "file") (compile-file "file.lisp") • Emacs is a useful editor • Pathway Tools source code is available on request • Lisp resources: http://bioinformatics.ai.sri.com/ptools/ptools-resources.html

  39. Viewing Results via the Answer List • (replace-answer-list (query))

  40. Query Gotchas • Study the schema carefully • :test #'fequal • Cascade of slot-values: check for NIL
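The last gotcha, checking for NIL in a cascade of slot lookups, comes up whenever one slot's value is used as the frame for the next lookup. The Python sketch below illustrates the pattern on a toy dictionary, with `None` standing in for NIL. The frame IDs (REG-1, ENZRXN-1, PROT-1, REG-2) and both helper functions are hypothetical. The slot names Regulated-Entity and Enzyme come from the ATP examples earlier in the deck.

```python
def get_slot_value(kb, frame, slot):
    """First value of frame.slot in the toy store, or None if there is none."""
    values = kb.get(frame, {}).get(slot, [])
    return values[0] if values else None


def enzyme_of_regulation(kb, reg):
    """Follow Regulated-Entity, then Enzyme, stopping early on a missing value."""
    enzrxn = get_slot_value(kb, reg, "Regulated-Entity")
    if enzrxn is None:
        # Without this check the second lookup would be run on a non-frame.
        return None
    return get_slot_value(kb, enzrxn, "Enzyme")


kb = {
    "REG-1": {"Regulated-Entity": ["ENZRXN-1"]},
    "ENZRXN-1": {"Enzyme": ["PROT-1"]},
    "REG-2": {},  # a regulation frame with no Regulated-Entity value
}

print(enzyme_of_regulation(kb, "REG-1"))  # PROT-1
print(enzyme_of_regulation(kb, "REG-2"))  # None
```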

  41. Semantic Inference Layer: relationships.lisp • Library of functions that encapsulate common query building blocks and intricacies of navigating the schema • enzymes-of-gene • reactions-of-gene • pathways-of-gene • genes-of-pathway • pathway-hole-p • reactions-of-compound • top-containers(protein) • all-rxns(type), where type is one of :metab-smm, :metab-all, :metab-pathways, :enzyme, :transport, etc. • Example: (all-rxns :metab-pathways)

  42. Pathway Tools Schema and Semantic Inference Layer: Genes, Operons, and Replicons

  43. Representing a Genome • Classes: • ORG is of class Organisms • CHROM1 is of class Chromosomes • PLASMID1 is of class Plasmids • Gene1 is of class Genes • Product1 is of class Polypeptides or RNA • [Diagram: frames ORG, CHROM1, CHROM2, PLASMID1, Gene1, Gene2, Gene3, Product1, linked by the slots genome, components, and product]

  44. (defun genes-of-chrom (chrom)
        (loop for x in (get-slot-values chrom 'components)
              when (instance-all-instance-of-p x '|Genes|)
              collect x))

  45. Polynucleotides Review slots of COLI and of COLI-K12

  46. Genetic-Elements • Sequence is stored in a separate file or database table

  47. Polymer-Segments Review slots of Genes

  48. Complexities of Gene / Gene-Product Relationships • The Product of a gene can be an instance of Polypeptides or RNAs • An instance of Polypeptides can have more than one gene encoding it • Sequence position: • Nucleotide positions of starting and ending codons specified in Left-End-Position and Right-End-Position (usually greater, except at origin) • Transcription-Direction + / - • Alternative splicing: • Nucleotide positions of starting and ending codons specified in Left-End-Position and Right-End-Position • Intron positions specified in Splice-Form-Introns of gene product • Example: (200 300) (350 400)
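The intron list on this slide is enough to recover the exon intervals of a spliced product. The Python sketch below works through the slide's example introns, (200 300) and (350 400). The gene end positions of 100 and 500 are made up for illustration. The `exon_intervals` helper is hypothetical, and it assumes 1-based, inclusive coordinates and non-overlapping introns, which the slide does not state.

```python
def exon_intervals(left_end, right_end, introns):
    """Return the (start, end) exon intervals left after removing the introns.

    Assumes 1-based inclusive coordinates and non-overlapping introns that lie
    within [left_end, right_end].
    """
    exons = []
    start = left_end
    for intron_start, intron_end in sorted(introns):
        if intron_start > start:
            exons.append((start, intron_start - 1))
        start = intron_end + 1
    if start <= right_end:
        exons.append((start, right_end))
    return exons


# Introns from the slide; gene end positions are hypothetical.
exons = exon_intervals(100, 500, [(200, 300), (350, 400)])
spliced_length = sum(end - start + 1 for start, end in exons)

print(exons)           # [(100, 199), (301, 349), (401, 500)]
print(spliced_length)  # 249
```

Storing introns on the gene product rather than on the gene lets several splice forms share one pair of Left-End-Position and Right-End-Position values.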

  49. Gene Reaction Schematic

  50. Proteins
