This overview provides information on Pathway/Genome Databases, tools for data access and query mechanisms, schema motivations, and data exchange formats. Learn about the APIs, data import/export, and programmatic access to BioCyc databases.
Overview • Summary of Pathway Tools data access mechanisms and formats • Pathway Tools APIs • Overview of Pathway Tools schema
Motivations for Understanding the Schema • Complex queries to PGDBs must refer to classes and slots within the schema • Queries using the Lisp, Perl, and Java APIs • Queries using the Structured Advanced Query Form • Queries using BioVelo
More Information • Pathway Tools Web Site, Tutorial Slides • http://bioinformatics.ai.sri.com/ptools/ • http://brg.ai.sri.com/ptools/ptools-resources.html • Pathway Tools User’s Guide • Appendix: Guide to the Pathway Tools Schema • Curator's Guide • http://bioinformatics.ai.sri.com/ptools/curatorsguide.pdf
References • Ontology Papers section of http://biocyc.org/publications.shtml • "An Evidence Ontology for use in Pathway/Genome Databases" • "An ontology for biological function based on molecular interactions" • "Representations of metabolic knowledge: Pathways" • "Representations of metabolic knowledge"
Data Exchange • APIs: Lisp API, Java API, and Perl API • Read and modify access • Cyclone • Export to files • BioPAX export (biopax.org) • Export PGDB genome to GenBank format • Export entire PGDB as column-delimited and attribute-value file formats • Export PGDB reactions as SBML (sbml.org) • Import/Export of Pathways between PGDBs • Import/Export of Selected Frames, for Spreadsheets • Import/Export of Compounds as Molfile, CML • BioWarehouse: Loader for Flatfiles, SQL access • http://bioinformatics.ai.sri.com/biowarehouse/ • BMC Bioinformatics 7:170 2006
Pathway Tools Ontology / Schema • Ontology classes: 1621 • Datatype classes: Define objects from genomes to pathways • Classification systems for pathways, chemical compounds, enzymatic reactions (EC system) • Protein Feature ontology • Controlled vocabularies: • Cell Component Ontology • Evidence codes • Comprehensive set of 279 attributes and relationships
Root Classes in the Pathway Tools Ontology • Chemicals -- All molecules • Polymer-Segments -- Regions of polymers • Protein-Features -- Features on proteins • Paralogous-Gene-Groups • Organisms • Generalized-Reactions -- Reactions and pathways • Enzymatic-Reactions -- Link enzymes to reactions they catalyze • Regulation -- Regulatory interactions • CCO -- Cell Component Ontology • Evidence -- Evidence ontology • Notes -- Timestamped, person-stamped notes • Organizations • People • Publications
Use GKB Editor to Inspect the Pathway Tools Ontology • GKB Editor = Generic Knowledge Base Editor • Type in Navigator window: (GKB) or • [Right-Click] Edit->Ontology Editor • View->Browse Class Hierarchy • [Middle-Click] to expand hierarchy • To view classes or instances, select them and: • Frame -> List Frame Contents • Frame -> Edit Frame
Pathway Tools Schema • Appendix of Pathway Tools User’s Guide • Schema overview diagram
Principal Classes • Class names are capitalized, plural, separated by dashes • Genetic-Elements, with subclasses: • Chromosomes • Plasmids • Genes • Transcription-Units • RNAs • rRNAs, snRNAs, tRNAs, Charged-tRNAs • Proteins, with subclasses: • Polypeptides • Protein-Complexes
Principal Classes • Reactions, with subclasses: • Transport-Reactions • Enzymatic-Reactions • Pathways • Compounds-And-Elements
Principal Classes • Regulation
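Slot Links (diagram) • The reaction "Succinate + FAD = fumarate + FADH2" links to the pathway TCA Cycle through slot in-pathway • An Enzymatic-reaction frame links to that reaction through slot reaction • The enzyme Succinate dehydrogenase links to the Enzymatic-reaction frame through slot catalyzes • The subunits Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, and Sdh-membrane-2 link to Succinate dehydrogenase through slot component-of • The genes sdhA, sdhB, sdhC, and sdhD link to the subunits through slot product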
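The chain of slot links on this slide can be walked step by step. The following is a minimal sketch, assuming a toy in-memory map in place of a real PGDB; the class `SlotLinkSketch`, its helpers `link`, `values`, and `follow`, the frame IDs `ENZRXN-SDH` and `SUCC-FAD-RXN`, and the pairing of each gene with a particular subunit are all illustrative assumptions rather than Pathway Tools or JavaCyc definitions. Only the slot names come from the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PGDB: frame ID -> slot name -> list of values.
class SlotLinkSketch {
    static final Map<String, Map<String, List<String>>> KB = new LinkedHashMap<>();

    static {
        // Gene-to-subunit pairing is assumed for illustration only.
        link("sdhA", "product", "Sdh-flavo");
        link("sdhB", "product", "Sdh-Fe-S");
        link("sdhC", "product", "Sdh-membrane-1");
        link("sdhD", "product", "Sdh-membrane-2");
        for (String subunit : List.of("Sdh-flavo", "Sdh-Fe-S", "Sdh-membrane-1", "Sdh-membrane-2")) {
            link(subunit, "component-of", "Succinate-dehydrogenase");
        }
        // ENZRXN-SDH and SUCC-FAD-RXN are made-up frame IDs.
        link("Succinate-dehydrogenase", "catalyzes", "ENZRXN-SDH");
        link("ENZRXN-SDH", "reaction", "SUCC-FAD-RXN");
        link("SUCC-FAD-RXN", "in-pathway", "TCA-Cycle");
    }

    // Record one slot value on a frame (illustrative helper, not a Pathway Tools call).
    static void link(String frame, String slot, String value) {
        KB.computeIfAbsent(frame, f -> new LinkedHashMap<>())
          .computeIfAbsent(slot, s -> new ArrayList<>())
          .add(value);
    }

    // All values of frame.slot, or an empty list when the slot is unset.
    static List<String> values(String frame, String slot) {
        return KB.getOrDefault(frame, Map.of()).getOrDefault(slot, List.of());
    }

    // Follow a sequence of slots from a starting frame and return the frames reached.
    static List<String> follow(String start, String... slots) {
        List<String> current = List.of(start);
        for (String slot : slots) {
            List<String> next = new ArrayList<>();
            for (String frame : current) {
                for (String value : values(frame, slot)) {
                    if (!next.contains(value)) {
                        next.add(value);
                    }
                }
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Gene -> polypeptide -> complex -> enzymatic reaction -> reaction -> pathway
        System.out.println(follow("sdhA", "product", "component-of", "catalyzes", "reaction", "in-pathway"));
    }
}
```

Against a real PGDB each step would be a slot lookup through one of the APIs; the sketch only shows why a query from gene to pathway has to name every intermediate slot.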
Programmatic Access to BioCyc • Common LISP • Native language of Pathway Tools • Interactive & Mature Environment • Full Access to the Data & Many Utility Functions • Source code is available for academics • PerlCyc • API of Functions, Exposed to Perl • Communication through UNIX Socket • JavaCyc • API of Functions, Exposed to Java • Communication through UNIX Socket • Cyclone
Cyclone • Developed by Schachter and colleagues from Genoscope • http://nemo-cyclone.sourceforge.net/archi.php • Cyclone is a Java-based system that: • Extracts data from a Pathway Tools PGDB • Converts it to an XML schema • Maps the data to Java objects and to a relational database • Changes made to the data on the Java side can be committed back to a Pathway Tools PGDB
Lisp API • Accessible whenever you start Pathway Tools with the -lisp argument • Lisp queries are evaluated inside the running Pathway Tools process, so they execute very fast
Generic Frame Protocol (GFP) • A library of procedures for accessing Ocelot DBs • GFP specification: • http://www.ai.sri.com/~gfp/spec/paper/paper.html • A small number of GFP functions are sufficient for most complex queries
Example of a Single GFP Call • General pattern: gfp-function(frame-ID slot-ID value ...) • In Lisp syntax: (gfp-function frame-ID slot-ID value ...) • Lisp example: (get-slot-values 'TRYPSYN-RXN 'LEFT) ==> (INDOLE-3-GLYCEROL-P SER)
Generic Frame Protocol • get-class-all-instances (Class) • Returns the instances of Class • coercible-to-frame-p (Thing) • Is Thing a frame? Returns True if Thing is the name of a frame, or a frame object; else False
Generic Frame Protocol • Notation Frame.Slot means a specified slot of a specified frame • get-slot-value(Frame Slot) • Returns first value of Frame.Slot • get-slot-values(Frame Slot) • Returns all values of Frame.Slot as a list • slot-has-value-p(Frame Slot) • Returns True if Frame.Slot has at least one value; else False • member-slot-value-p(Frame Slot Value) • Returns True if Value is one of the values of Frame.Slot; else False • print-frame(Frame) • Prints the contents of Frame • Note: Frame and Slot must be symbols!
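The difference between these read operations is easiest to see side by side. The sketch below models them on a toy in-memory store, under the assumption that an unset slot behaves like an empty value list and that `get-slot-value` then yields nothing (modelled as `null`). The class `GfpReadSketch` and its `define` and `sample` helpers are hypothetical; the method names are camel-case renderings of the GFP operations, and the TRYPSYN-RXN values are taken from the single-call example earlier in the deck.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of GFP read semantics; not the JavaCyc class.
class GfpReadSketch {
    private final Map<String, Map<String, List<String>>> frames = new HashMap<>();

    // Illustrative helper: set the values of frame.slot.
    void define(String frame, String slot, String... values) {
        frames.computeIfAbsent(frame, f -> new HashMap<>()).put(slot, List.of(values));
    }

    // get-slot-values: all values of Frame.Slot as a list (empty if unset).
    List<String> getSlotValues(String frame, String slot) {
        return frames.getOrDefault(frame, Map.of()).getOrDefault(slot, List.of());
    }

    // get-slot-value: first value of Frame.Slot, or null when there is none.
    String getSlotValue(String frame, String slot) {
        List<String> values = getSlotValues(frame, slot);
        return values.isEmpty() ? null : values.get(0);
    }

    // slot-has-value-p: true if Frame.Slot has at least one value.
    boolean slotHasValueP(String frame, String slot) {
        return !getSlotValues(frame, slot).isEmpty();
    }

    // member-slot-value-p: true if value is one of the values of Frame.Slot.
    boolean memberSlotValueP(String frame, String slot, String value) {
        return getSlotValues(frame, slot).contains(value);
    }

    // coercible-to-frame-p: true if the name denotes a known frame.
    boolean coercibleToFrameP(String thing) {
        return frames.containsKey(thing);
    }

    // Sample data mirroring the TRYPSYN-RXN example.
    static GfpReadSketch sample() {
        GfpReadSketch kb = new GfpReadSketch();
        kb.define("TRYPSYN-RXN", "LEFT", "INDOLE-3-GLYCEROL-P", "SER");
        return kb;
    }

    public static void main(String[] args) {
        GfpReadSketch kb = sample();
        System.out.println(kb.getSlotValues("TRYPSYN-RXN", "LEFT"));
        System.out.println(kb.getSlotValue("TRYPSYN-RXN", "LEFT"));
        System.out.println(kb.slotHasValueP("TRYPSYN-RXN", "RIGHT"));
        System.out.println(kb.memberSlotValueP("TRYPSYN-RXN", "LEFT", "SER"));
    }
}
```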
Generic Frame Protocol: Update Operations • put-slot-value(Frame Slot Value) • Replace the current value(s) of Frame.Slot with Value • put-slot-values(Frame Slot Value-List) • Replace the current value(s) of Frame.Slot with Value-List, which must be a list of values • add-slot-value(Frame Slot Value) • Add Value to the current value(s) of Frame.Slot, if any • remove-slot-value(Frame Slot Value) • Remove Value from the current value(s) of Frame.Slot • replace-slot-value(Frame Slot Old-Value New-Value) • In Frame.Slot, replace Old-Value with New-Value • remove-local-slot-values(Frame Slot) • Remove all of the values of Frame.Slot
Generic Frame Protocol: Update Operations • save-kb • Saves the current KB
Additional Pathway Tools Functions: Semantic Inference Layer • Semantic inference layer defines built-in functions to compute commonly required relationships in a PGDB • http://bioinformatics.ai.sri.com/ptools/ptools-fns.html
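How the update operations differ (replace versus add versus remove) can be shown on a small value list. This is a sketch on a toy in-memory store, not the JavaCyc API: the class `GfpUpdateSketch`, the frame `GENE-1`, and the slot `SYNONYMS` are hypothetical, and skipping a value that is already present in `addSlotValue` is an assumption of the sketch. Nothing is persisted here, so there is no counterpart to save-kb.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of GFP update semantics on an in-memory store.
class GfpUpdateSketch {
    private final Map<String, Map<String, List<String>>> frames = new HashMap<>();

    // Mutable value list for frame.slot, created on first use.
    private List<String> valuesOf(String frame, String slot) {
        return frames.computeIfAbsent(frame, f -> new HashMap<>())
                     .computeIfAbsent(slot, s -> new ArrayList<>());
    }

    // Read-only snapshot of the current values.
    List<String> getSlotValues(String frame, String slot) {
        return List.copyOf(valuesOf(frame, slot));
    }

    // put-slot-values: replace the current values with the given list.
    void putSlotValues(String frame, String slot, List<String> values) {
        List<String> current = valuesOf(frame, slot);
        current.clear();
        current.addAll(values);
    }

    // put-slot-value: replace the current values with a single value.
    void putSlotValue(String frame, String slot, String value) {
        putSlotValues(frame, slot, List.of(value));
    }

    // add-slot-value: append a value (assumption: duplicates are skipped).
    void addSlotValue(String frame, String slot, String value) {
        List<String> current = valuesOf(frame, slot);
        if (!current.contains(value)) {
            current.add(value);
        }
    }

    // remove-slot-value: remove one value, leaving the others in place.
    void removeSlotValue(String frame, String slot, String value) {
        valuesOf(frame, slot).remove(value);
    }

    // replace-slot-value: swap oldValue for newValue at the same position.
    void replaceSlotValue(String frame, String slot, String oldValue, String newValue) {
        List<String> current = valuesOf(frame, slot);
        int index = current.indexOf(oldValue);
        if (index >= 0) {
            current.set(index, newValue);
        }
    }

    // remove-local-slot-values: remove every value of frame.slot.
    void removeLocalSlotValues(String frame, String slot) {
        valuesOf(frame, slot).clear();
    }

    public static void main(String[] args) {
        GfpUpdateSketch kb = new GfpUpdateSketch();
        kb.putSlotValues("GENE-1", "SYNONYMS", List.of("a", "b"));
        kb.addSlotValue("GENE-1", "SYNONYMS", "c");
        kb.replaceSlotValue("GENE-1", "SYNONYMS", "b", "B");
        kb.removeSlotValue("GENE-1", "SYNONYMS", "a");
        System.out.println(kb.getSlotValues("GENE-1", "SYNONYMS")); // prints [B, c]
    }
}
```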
PerlCyc and JavaCyc • Work on Unix (Solaris or Linux) only • Start up Pathway Tools with the -api argument • Pathway Tools listens on a Unix socket; the Perl or Java program communicates through this socket • Supports both querying and editing PGDBs • Must run the Perl or Java program on the same machine that runs Pathway Tools • This is a security measure, as the API server has no built-in security • Can only handle one connection at a time
Obtaining PerlCyc and JavaCyc • Download from http://www.sgn.cornell.edu/downloads/ • PerlCyc written and maintained by Lukas Mueller at Boyce Thompson Institute for Plant Research • JavaCyc written by Thomas Yan at Carnegie Institute, maintained by Lukas Mueller • Easy to extend…
Examples of PerlCyc, JavaCyc Functions • GFP functions (require knowledge of Pathway Tools schema) • PerlCyc: get_slot_values, get_class_all_instances, put_slot_values • JavaCyc: getSlotValues, getClassAllInstances, putSlotValues • Pathway Tools functions (described at http://bioinformatics.ai.sri.com/ptools/ptools-fns.html) • PerlCyc: genes_of_reaction, find_indexed_frame, pathways_of_gene, transport_p • JavaCyc: genesOfReaction, findIndexedFrame, pathwaysOfGene, transportP
Writing a PerlCyc or JavaCyc program
• Create a PerlCyc or JavaCyc object:
    perlcyc -> new ("ORGID")
    new Javacyc ("ORGID")
• Call PerlCyc or JavaCyc functions on this object:
    my $cyc = perlcyc -> new ("ECOLI");
    my @pathways = $cyc -> all_pathways ();

    Javacyc cyc = new Javacyc("ECOLI");
    ArrayList pathways = cyc.allPathways ();
• Functions return object IDs, not objects.
• A further call to the server is needed to retrieve the attributes of an object:
    foreach my $p (@pathways) {
        print $cyc -> get_slot_value ($p, "COMMON-NAME");
    }

    for (int i = 0; i < pathways.size(); i++) {
        String pwy = (String) pathways.get(i);
        System.out.println (cyc.getSlotValue (pwy, "COMMON-NAME"));
    }
Sample PerlCyc Query • Number of proteins in E. coli
    use perlcyc;
    my $cyc = perlcyc -> new ("ECOLI");
    my @proteins = $cyc->get_class_all_instances("|Proteins|");
    my $protein_count = scalar(@proteins);
    print "Protein count: $protein_count.\n";
Sample PerlCyc Query • Print IDs of all proteins with molecular weight between 10 and 20 kD and pI between 4 and 5.
    use perlcyc;
    my $cyc = perlcyc -> new ("ECOLI");
    foreach my $p ($cyc->get_class_all_instances("|Proteins|")) {
        my $mw = $cyc->get_slot_value($p, "molecular-weight-kd");
        my $pI = $cyc->get_slot_value($p, "pi");
        if ($mw <= 20 && $mw >= 10 && $pI <= 5 && $pI >= 4) {
            print "$p\n";
        }
    }
Sample PerlCyc Query • List all the transcription factors in E. coli, and the list of genes that each regulates:
    use perlcyc;
    my $cyc = perlcyc -> new ("ECOLI");
    foreach my $p ($cyc->get_class_all_instances("|Proteins|")) {
        if ($cyc->transcription_factor_p($p)) {
            my $name = $cyc->get_slot_value($p, "common-name");
            my %genes = ();
            foreach my $tu ($cyc->regulon_of_protein($p)) {
                foreach my $g ($cyc->transcription_unit_genes($tu)) {
                    $genes{$g} = $cyc->get_slot_value($g, "common-name");
                }
            }
            print "\n\n$name: ";
            print join " ", values %genes;
        }
    }
Sample Editing Using PerlCyc • Add a link from each gene to the corresponding object in MY-DB (assume ID is same in both cases)
    use perlcyc;
    my $cyc = perlcyc -> new ("HPY");
    my @genes = $cyc->get_class_all_instances ("|Genes|");
    foreach my $g (@genes) {
        $cyc->add_slot_value ($g, "DBLINKS", "(MY-DB \"$g\")");
    }
    $cyc->save_kb();
Sample JavaCyc Query: Enzymes for which ATP is a regulator
    import java.util.*;

    public class JavacycSample {
        public static void main(String[] args) {
            Javacyc cyc = new Javacyc("ECOLI");
            ArrayList regframes = cyc.getClassAllInstances("|Regulation-of-Enzyme-Activity|");
            for (int i = 0; i < regframes.size(); i++) {
                String reg = (String) regframes.get(i);
                // Does the Regulator slot of this regulation frame contain ATP?
                boolean bool = cyc.memberSlotValueP(reg, "Regulator", "ATP");
                if (bool) {
                    // Follow Regulated-Entity to the enzymatic reaction, then Enzyme to the enzyme.
                    String enzrxn = cyc.getSlotValue(reg, "Regulated-Entity");
                    String enzyme = cyc.getSlotValue(enzrxn, "Enzyme");
                    System.out.println(enzyme);
                }
            }
        }
    }
Simple Lisp Query Example: Enzymes for which ATP is a regulator
    (defun atp-inhibits ()
      (loop for x in (get-class-all-instances '|Regulation-of-Enzyme-Activity|)
            ;; Does the Regulator slot contain the compound ATP, and is the mode
            ;; of regulation negative (inhibition)?
            when (and (member-slot-value-p x 'Regulator 'ATP)
                      (member-slot-value-p x 'Mode "-"))
            ;; Whenever the test is positive, we collect the value of the slot Enzyme
            ;; of the Regulated-Entity of the regulatory interaction frame.
            ;; The collected values are returned as a list once the loop terminates.
            collect (get-slot-value (get-slot-value x 'Regulated-Entity) 'Enzyme)))

    ;;; Invoking the query:
    (select-organism :org-id 'ECOLI)
    (atp-inhibits)
Simple Perl Query Example: Enzymes for which ATP is a regulator
    use perlcyc;
    my $cyc = perlcyc -> new("ECOLI");
    my @regs = $cyc -> get_class_all_instances("|Regulation-of-Enzyme-Activity|");
    ## We check every instance of the class
    foreach my $reg (@regs) {
        ## We test whether the Regulator slot contains the compound frame ATP
        ## and whether the Mode slot is "-" (inhibition)
        my $bool1 = $cyc -> member_slot_value_p($reg, "Regulator", "ATP");
        my $bool2 = $cyc -> member_slot_value_p($reg, "Mode", "-");
        if ($bool1 && $bool2) {
            ## Whenever the test is positive, we follow Regulated-Entity to the
            ## enzymatic reaction and collect the value of its Enzyme slot.
            ## The results are printed in the terminal.
            my $enzrxn = $cyc -> get_slot_value($reg, "Regulated-Entity");
            my $enz = $cyc -> get_slot_value($enzrxn, "Enzyme");
            print STDOUT "$enz\n";
        }
    }
Getting started with Lisp • pathway-tools -lisp • (load "file") (compile-file "file.lisp") • Emacs is a useful editor • Pathway Tools source code is available: ask • Lisp resources: http://bioinformatics.ai.sri.com/ptools/ptools-resources.html
Viewing Results via the Answer List • (replace-answer-list (query))
Query Gotchas • Study schema carefully • :test #'fequal • Cascade of slot-values: check for NIL at each step, since any slot in the chain may be empty
Semantic Inference Layer: relationships.lisp • Library of functions that encapsulate common query building blocks and intricacies of navigating the schema • enzymes-of-gene • reactions-of-gene • pathways-of-gene • genes-of-pathway • pathway-hole-p • reactions-of-compound • top-containers(protein) • all-rxns(type) (:metab-smm :metab-all :metab-pathways :enzyme :transport etc.) • (all-rxns :metab-pathways)
Pathway Tools Schema and Semantic Inference Layer: Genes, Operons, and Replicons
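The last gotcha, checking for NIL in a cascade of slot lookups, can be sketched as follows. The example assumes a toy lookup table in place of a real PGDB connection; the class `NilCascadeSketch`, the frame IDs `REG-1` to `REG-3`, `ENZRXN-1`, `ENZRXN-3`, and `ENZYME-1`, and the helper `enzymeOfRegulation` are hypothetical, and a missing slot value is modelled as `null`. The slot names match the Regulated-Entity and Enzyme lookups used in the sample queries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy lookup table standing in for a PGDB: "frame.slot" -> single value.
class NilCascadeSketch {
    static final Map<String, String> KB = new HashMap<>();

    static {
        KB.put("REG-1.Regulated-Entity", "ENZRXN-1");
        KB.put("ENZRXN-1.Enzyme", "ENZYME-1");
        // REG-2 has no Regulated-Entity value at all.
        // REG-3 points at an enzymatic reaction whose Enzyme slot is empty.
        KB.put("REG-3.Regulated-Entity", "ENZRXN-3");
    }

    // Returns the slot value, or null when the slot is empty (the NIL case).
    static String getSlotValue(String frame, String slot) {
        return KB.get(frame + "." + slot);
    }

    // Each step runs only if the previous one produced a value, so an empty
    // slot anywhere in the chain yields an empty result instead of a failure.
    static Optional<String> enzymeOfRegulation(String reg) {
        return Optional.ofNullable(getSlotValue(reg, "Regulated-Entity"))
                       .map(enzrxn -> getSlotValue(enzrxn, "Enzyme"));
    }

    public static void main(String[] args) {
        for (String reg : new String[] {"REG-1", "REG-2", "REG-3"}) {
            System.out.println(reg + " -> " + enzymeOfRegulation(reg).orElse("(no enzyme)"));
        }
    }
}
```

Without the guard, the second lookup would be issued with a null frame ID whenever the first slot is empty, which is the failure the gotcha warns about.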
Representing a Genome • Classes: • ORG is of class Organisms • CHROM1 is of class Chromosomes • PLASMID1 is of class Plasmids • Gene1 is of class Genes • Product1 is of class Polypeptides or RNAs • Slot links (diagram): ORG links to CHROM1, CHROM2, and PLASMID1 through slot genome; replicons link to their genes (Gene1, Gene2, Gene3) through slot components; Gene1 links to Product1 through slot product
(defun genes-of-chrom (chrom)
  (loop for x in (get-slot-values chrom 'components)
        when (instance-all-instance-of-p x '|Genes|)
        collect x))
Polynucleotides Review slots of COLI and of COLI-K12
Genetic-Elements • Sequence is stored in a separate file or database table
Polymer-Segments Review slots of Genes
Complexities of Gene / Gene-Product Relationships • The Product of a gene can be an instance of Polypeptides or RNAs • An instance of Polypeptides can have more than one gene encoding it • Sequence position: • Nucleotide positions of starting and ending codons specified in Left-End-Position and Right-End-Position (Right-End-Position is usually the greater of the two, except for a gene that spans the origin) • Transcription-Direction + / - • Alternative splicing: • Nucleotide positions of starting and ending codons specified in Left-End-Position and Right-End-Position • Intron positions specified in Splice-Form-Introns of gene product, e.g. (200 300) (350 400)
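Given the gene's end positions and the intron list, the exon intervals follow by subtraction. The sketch below works through the slide's intron example, assuming a gene with Left-End-Position 100 and Right-End-Position 500 (made-up values), and assuming that intron coordinates are inclusive, lie within the gene, and use the same coordinate system as the end positions. The slides do not state these conventions, so treat them as assumptions; the class `SpliceFormSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Derives exon intervals from gene end positions and an intron list.
class SpliceFormSketch {

    // Returns inclusive [start, end] exon intervals between leftEnd and rightEnd.
    static List<int[]> exons(int leftEnd, int rightEnd, int[][] introns) {
        int[][] sorted = introns.clone();
        Arrays.sort(sorted, Comparator.comparingInt(intron -> intron[0]));
        List<int[]> result = new ArrayList<>();
        int start = leftEnd;
        for (int[] intron : sorted) {
            if (intron[0] > start) {
                // Sequence before this intron is exonic.
                result.add(new int[] {start, intron[0] - 1});
            }
            // The next exon can begin only after the intron ends.
            start = Math.max(start, intron[1] + 1);
        }
        if (start <= rightEnd) {
            result.add(new int[] {start, rightEnd});
        }
        return result;
    }

    // Formats intervals in the same "(a b) (c d)" style the slide uses.
    static String format(List<int[]> intervals) {
        StringBuilder sb = new StringBuilder();
        for (int[] interval : intervals) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('(').append(interval[0]).append(' ').append(interval[1]).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] introns = {{200, 300}, {350, 400}};
        System.out.println(format(exons(100, 500, introns))); // prints (100 199) (301 349) (401 500)
    }
}
```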