PathoLogic Pathway Predictor

Inference of Metabolic Pathways • [Figure: the PathoLogic software integrates genome and pathway data to identify putative metabolic networks] • Inputs: annotated genomic sequence (gene products, genes/ORFs, DNA sequences) and the multi-organism pathway database MetaCyc (pathways, reactions, compounds) • Output: Pathway/Genome Database (pathways, reactions, compounds, gene products, genes, genomic map)

PathoLogic Functionality • Initialize schema for new PGDB • Transform existing genome to PGDB form • Infer metabolic pathways and store in PGDB • Infer operons and store in PGDB • Assemble Overview diagram • Assist user with manual tasks • Assign enzymes to reactions they catalyze • Identify false-positive pathway predictions • Build protein complexes from monomers • Infer transport reactions • Fill pathway holes

PathoLogic Input/Output • Inputs: • List of all genetic elements • Enter using GUI or provide a file • Files containing annotation for each genetic element • Files containing DNA sequence for each genetic element • MetaCyc database • Output: • Pathway/genome database for the subject organism • Reports that summarize: • Evidence in the input genome for the presence of reference pathways • Reactions missing from inferred pathways

File Naming Conventions • One pair of sequence and annotation files for each genetic element • Sequence files: FASTA format • suffix .fsa or .fna • Annotation file: • GenBank format: suffix .gbk • PathoLogic format: suffix .pf
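A minimal Python sketch of the suffix convention above, useful for checking an input directory before a build. The function name `classify_input_file` is hypothetical and is not part of Pathway Tools; the sketch only compares suffixes exactly as listed on the slide.

```python
from pathlib import Path

# Suffixes taken from the naming conventions above.
SEQUENCE_SUFFIXES = {".fsa", ".fna"}
ANNOTATION_SUFFIXES = {".gbk", ".pf"}


def classify_input_file(filename):
    """Return 'sequence', 'annotation', or 'unknown' based on the file suffix."""
    suffix = Path(filename).suffix
    if suffix in SEQUENCE_SUFFIXES:
        return "sequence"
    if suffix in ANNOTATION_SUFFIXES:
        return "annotation"
    return "unknown"
```

For example, `classify_input_file("chrom1.pf")` reports an annotation file, and a name with any other suffix is reported as unknown.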

Typical Problems Using GenBank Files With PathoLogic • Wrong qualifier names used: read the PathoLogic documentation! • Extraneous information in a given qualifier • Check results of trial parse carefully

GenBank File Format • Accepted feature types: • CDS, tRNA, rRNA, misc_RNA • Accepted qualifiers: • /locus_tag Unique ID [recm] • /gene Gene name [req] • /product [req] • /EC_number [recm] • /product_comment [opt] • /gene_comment [opt] • /alt_name Synonyms [opt] • /pseudo Gene is a pseudogene [opt] • /db_xref DB:AccessionID [opt] • /go_component, /go_function, /go_process GO terms [opt] • For multifunctional proteins, put each function in a separate /product line

PathoLogic File Format • Each record starts with line containing an ID attribute • Tab delimited • Each record ends with a line containing // • One attribute-value pair is allowed per line • Use multiple FUNCTION lines for multifunctional proteins • Lines starting with ‘;’ are comment lines • Valid attributes are: • ID, NAME, SYNONYM • STARTBASE, ENDBASE, GENE-COMMENT • FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT • DBLINK • GO • INTRON

PathoLogic File Format
    ID            TP0734
    NAME          deoD
    STARTBASE     799084
    ENDBASE       799785
    FUNCTION      purine nucleoside phosphorylase
    DBLINK        PID:g3323039
    PRODUCT-TYPE  P
    GENE-COMMENT  similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
    //
    ID            TP0735
    NAME          gltA
    STARTBASE     799867
    ENDBASE       801423
    FUNCTION      glutamate synthase
    DBLINK        PID:g3323040
    PRODUCT-TYPE  P
    GO            glutamate synthase (NADPH) activity [goid 0004355] [evidence IDA] [pmid 4565085]
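The record layout described above (one tab-delimited attribute-value pair per line, `//` closing each record, `;` starting a comment line) can be read with a few lines of Python. This is a sketch for inspecting your own files, not the parser PathoLogic uses; `parse_pf` is a hypothetical name, and accepting a final record without a closing `//` is a leniency chosen here.

```python
from collections import defaultdict


def parse_pf(text):
    """Parse PathoLogic-format text into a list of {attribute: [values]} dicts."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line == "//":
            if current is not None:
                records.append(dict(current))
                current = None
            continue
        # Attribute and value are separated by the first tab on the line.
        attribute, _, value = line.partition("\t")
        if current is None:
            current = defaultdict(list)
        # Attributes such as FUNCTION may repeat, so values are kept in lists.
        current[attribute.strip()].append(value.strip())
    if current is not None:
        records.append(dict(current))  # lenient: keep an unterminated last record
    return records


sample = (
    "; example record\n"
    "ID\tTP0734\n"
    "NAME\tdeoD\n"
    "STARTBASE\t799084\n"
    "ENDBASE\t799785\n"
    "FUNCTION\tpurine nucleoside phosphorylase\n"
    "PRODUCT-TYPE\tP\n"
    "//\n"
)
records = parse_pf(sample)
```

Keeping every value in a list means a multifunctional protein with several FUNCTION lines needs no special handling.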

Before you start: What to do when an error occurs • Most Navigator errors are automatically trapped; debugging information is saved to the error.tmp file. • All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger • Unix: the error message will show up in the original terminal window from which you started Pathway Tools. • Windows: the error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt • Two goals when an error occurs: • Try to continue working • Obtain enough information for a bug report to send to the Pathway Tools support team.

The Lisp Debugger
• Sample error (details and number of restart actions differ for each case)
    Error: Received signal number 2 (Keyboard interrupt)
    Restart actions (select using :continue):
     0: continue computation
     1: Return to command level
     2: Pathway Tools version 10.0 top level
     3: Exit Pathway Tools version 10.0
    [1c] EC(2):
• To generate debugging information (stack backtrace): :zoom :count :all
• To continue from error, find a restart that takes you to the top level (in this case, number 2): :cont 2
• To exit Pathway Tools: :exit

How to report an error • Determine if the problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed) • Send email to ptools-support@ai.sri.com containing: • Pathway Tools version number and platform • Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem • error.tmp file, if one was generated • If the software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on the previous slide)

Using the PPP GUI to Create a Pathway/Genome Database • Input Project Information • Organism -> Create New • Creates directory structure for new PGDB • Creates and saves empty PGDB, populated only with objects common to all PGDBs (schema classes, elements, etc.) and data you entered in the form. • Offers to invoke Replicon Editor

Input Project Information

Enter Replicon Information • For each replicon • Name • Type: chromosome, plasmid, etc. • Circular? • Annotation file • Sequence file (optional) • Contigs (optional) • Links to other DBs (optional) • GUI-Based entry • Build->Specify Replicons • File-Based Entry • Create genetic-elements.dat file using template provided

GUI-Based Replicon Entry

Batch Entry of Replicon Info
File /<orgid>cyc/<version>/input/genetic-elements.dat:
    ID          TEST-CHROM-1
    NAME        Chromosome 1
    TYPE        :CHRSM
    CIRCULAR?   N
    ANNOT-FILE  chrom1.pf
    SEQ-FILE    chrom1.fsa
    //
    ID          TEST-CHROM-2
    NAME        Chromosome 2
    CIRCULAR?   N
    ANNOT-FILE  /mydata/chrom2.gbk
    SEQ-FILE    /mydata/chrom2.fna
    //
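When there are many replicons, the records can be generated rather than typed. The Python sketch below writes one record in the layout shown above. It assumes attribute and value are separated by a tab, as in the PathoLogic annotation format; the slide does not state the delimiter for genetic-elements.dat, so check the template provided with the software. `format_replicon` is a hypothetical helper, not part of Pathway Tools.

```python
def format_replicon(attributes):
    """Render one genetic-elements.dat record from (attribute, value) pairs.

    Assumption: a tab separates attribute and value; '//' closes the record.
    """
    lines = [f"{attribute}\t{value}" for attribute, value in attributes]
    lines.append("//")
    return "\n".join(lines) + "\n"


record = format_replicon([
    ("ID", "TEST-CHROM-1"),
    ("NAME", "Chromosome 1"),
    ("TYPE", ":CHRSM"),
    ("CIRCULAR?", "N"),
    ("ANNOT-FILE", "chrom1.pf"),
    ("SEQ-FILE", "chrom1.fsa"),
])
```

A list of pairs is used instead of a dict so the attribute order in the file matches the order given.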

Specify Reference PGDB(s) • This step is optional, and most users will omit it • MetaCyc is always the primary reference PGDB • Specify an additional reference PGDB if you have your own curated PGDB which has: • Pathways and/or reactions that are not in MetaCyc • Manual functional assignments, with names similar to those in the current genome • There is no point in specifying any of our PGDBs as references; specify only your own curated PGDBs.

Building the PGDB • Trial Parse • Build -> Trial Parse • Check output to ensure numbers “look right” • Same number of gene start positions, end positions, names • Did my file contain EC numbers? Were they detected? • Did my file contain RNAs? Were they detected? • Fix any errors in input files • Build pathway/genome database • Build -> Automated Build
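Some of the Trial Parse sanity checks above can be approximated on a PathoLogic-format file before running the build. The Python sketch below simply counts how often each attribute occurs, so that the NAME, STARTBASE, and ENDBASE counts can be compared and the presence of EC lines confirmed. It is an independent pre-check under the tab-delimited assumption, not a reimplementation of Trial Parse, and `count_attributes` is a hypothetical name.

```python
from collections import Counter


def count_attributes(pf_text):
    """Count occurrences of each attribute in PathoLogic-format text."""
    counts = Counter()
    for raw in pf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";") or line == "//":
            continue  # skip blanks, comments, and record terminators
        attribute = line.split("\t", 1)[0].strip()
        counts[attribute] += 1
    return counts


example = (
    "ID\tg1\nNAME\ta\nSTARTBASE\t1\nENDBASE\t9\nEC\t2.4.2.1\n//\n"
    "ID\tg2\nNAME\tb\nSTARTBASE\t20\nENDBASE\t40\n//\n"
)
counts = count_attributes(example)
```

If the counts for gene names, start positions, and end positions differ, the input file is worth inspecting before the automated build.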

PathoLogic Parser Output

Automated Build • Parses input files • Creates objects for every gene and gene product • Uses EC numbers, GO annotations and name matcher to match enzymes to reactions in MetaCyc • Imports catalyzed reactions and compounds from MetaCyc • Generates list of likely enzymes that couldn’t be assigned • Infers pathways likely to be present • Generates Cellular Overview Diagram (first pass) • Generates reports

Matching Enzymes to Reactions • Matches on full EC number (partial ECs ignored) • Matches on Molecular Function GO terms • If definition of GO term includes cross-reference either to an EC number or to a MetaCyc reaction. • Matches on full enzyme name • Match is case-insensitive and removes the punctuation characters “ -_(){}',:” • Also matches after removal of prefixes and suffixes such as: • “Putative”, “Hypothetical”, etc • alpha|beta|…|catalytic|inducible chain|subunit|component • Parenthetical gene name
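The name-matching rules above can be illustrated with a small Python sketch: lowercase the name, drop the listed punctuation characters, and retry after removing a leading qualifier or a trailing chain/subunit/component phrase. The prefix and suffix lists contain only the examples named on the slide (the full lists used by PathoLogic are not given here), parenthetical gene names are not handled, and the function names are hypothetical.

```python
import re

# Punctuation characters removed before comparison (from the slide).
PUNCTUATION = " -_(){}',:"
# Only the two prefixes named on the slide; the real list is longer.
PREFIXES = ("putative", "hypothetical")
# Trailing "<qualifier> <chain|subunit|component>" phrase.
SUFFIX_RE = re.compile(
    r"\s+(alpha|beta|catalytic|inducible)\s+(chain|subunit|component)$"
)


def match_key(name):
    """Lowercase the name and remove the listed punctuation characters."""
    return "".join(ch for ch in name.lower() if ch not in PUNCTUATION)


def strip_affixes(name):
    """Remove known leading qualifiers and one trailing subunit phrase."""
    stripped = name.lower().strip()
    changed = True
    while changed:
        changed = False
        for prefix in PREFIXES:
            if stripped.startswith(prefix + " "):
                stripped = stripped[len(prefix) + 1:].strip()
                changed = True
    return SUFFIX_RE.sub("", stripped)


def names_match(a, b):
    """True if the names match directly or after affix removal."""
    if match_key(a) == match_key(b):
        return True
    return match_key(strip_affixes(a)) == match_key(strip_affixes(b))
```

With these rules, "putative glutamate synthase, alpha subunit" and "Glutamate synthase" reduce to the same key, while "glutamate synthase" and "glutamate dehydrogenase" do not.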

Enzyme Name Matcher • For names that do not match, software identifies probable metabolic enzymes as those • Containing “ase” • Not containing keywords such as • “sensor kinase” • “topoisomerase” • “protein kinase” • “peptidase” • Etc • User should research unknown enzymes • MetaCyc, Swiss-Prot, PubMed
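The probable-enzyme heuristic above amounts to a substring test with an exclusion list. A minimal Python sketch, using only the four excluded keywords named on the slide (the slide's "Etc" indicates the real list is longer) and a hypothetical function name:

```python
# Keywords from the slide; the list used by PathoLogic contains more entries.
EXCLUDED_KEYWORDS = (
    "sensor kinase",
    "topoisomerase",
    "protein kinase",
    "peptidase",
)


def is_probable_enzyme(name):
    """True if the name contains 'ase' and none of the excluded keywords."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)
```

A plain substring test is deliberately permissive, which is why the slide advises researching each flagged name by hand.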

Stored in ORGIDcyc/VERSION/reports/name-matching-report.txt

Automated Pathway Inference • All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion. • Algorithm errs on side of inclusivity – easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn’t.

Considerations taken into account when deciding whether or not a pathway should be inferred: • Is there a unique enzyme – an enzyme not involved in any other pathway? • Does the organism fall in the expected taxonomic domain of the pathway? • Is this pathway part of a variant set, and, if so, is there more evidence for some other variant? • If there is no unique enzyme: • Is there evidence for more than one enzyme? • If a biosynthetic pathway, is there evidence for final reaction(s)? • If a degradation pathway, is there evidence for initial reaction(s)? • If an energy metabolism pathway, is there evidence for more than half the reactions?
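The questions asked when a pathway has no unique enzyme can be written down as simple checks. The Python sketch below only answers those questions for one pathway; it does not decide whether the pathway is inferred, since the slide does not say how the answers are combined. It assumes the pathway is given as a non-empty, ordered list of reaction IDs, treats a reaction with evidence as a proxy for "an enzyme with evidence", and uses hypothetical names throughout.

```python
def no_unique_enzyme_checks(ordered_reactions, reactions_with_evidence, pathway_type):
    """Answer the 'no unique enzyme' questions for one pathway.

    ordered_reactions: reaction IDs from first to last step (non-empty).
    reactions_with_evidence: reaction IDs with an identified enzyme.
    pathway_type: 'biosynthesis', 'degradation', or 'energy' (assumed labels).
    """
    if not ordered_reactions:
        raise ValueError("pathway must contain at least one reaction")
    evidence = set(reactions_with_evidence)
    present = [r for r in ordered_reactions if r in evidence]
    checks = {"more_than_one_step_with_evidence": len(present) > 1}
    if pathway_type == "biosynthesis":
        checks["final_reaction_has_evidence"] = ordered_reactions[-1] in evidence
    elif pathway_type == "degradation":
        checks["initial_reaction_has_evidence"] = ordered_reactions[0] in evidence
    elif pathway_type == "energy":
        checks["more_than_half_have_evidence"] = (
            len(present) > len(ordered_reactions) / 2
        )
    return checks
```

Returning the individual answers rather than a verdict keeps the sketch faithful to the slide, which lists considerations rather than a scoring rule.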

Assigning Evidence Scores to Predicted Pathways • X|Y|Z denotes the score for pathway P in organism O, where: • X = total number of reactions in P • Y = number of reactions in P for which there is evidence (a catalyzing enzyme) in O • Z = number of the Y reactions that are also used in other pathways in O
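A worked example of the X|Y|Z score, as a Python sketch over sets of reaction IDs. The function name and the reaction IDs are illustrative only.

```python
def evidence_score(pathway_reactions, reactions_with_evidence, reactions_in_other_pathways):
    """Return the X|Y|Z evidence score for one pathway as a string.

    X: total number of reactions in the pathway.
    Y: pathway reactions that have evidence in the organism.
    Z: those Y reactions that are also used in other pathways.
    """
    pathway = set(pathway_reactions)
    with_evidence = pathway & set(reactions_with_evidence)
    shared = with_evidence & set(reactions_in_other_pathways)
    return f"{len(pathway)}|{len(with_evidence)}|{len(shared)}"


score = evidence_score(
    pathway_reactions={"r1", "r2", "r3", "r4"},
    reactions_with_evidence={"r1", "r2", "r9"},
    reactions_in_other_pathways={"r2"},
)
# score == "4|2|1": four reactions, two with evidence, one of those shared.
```

A pathway whose Y and Z are equal has no evidence unique to it, which is the situation the pruning step looks for.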

Pathway Evidence Report • On Organism Summary Page in Navigator, button “Generate Pathway Evidence Report” • Report saved as HTML file, view in browser • Hierarchical listing of all inferred pathways • “Pathway Glyph” shows evidence graphically • Steps with/without enzymes (green/black) • Steps that are unique to pathway (orange) • Steps filled by Pathway Hole Filler (blue) • Counts reactions in pathway, with evidence, in other pathways • Lists other pathways that share reactions • Link to pathway in MetaCyc

Manual Pruning of Pathways • Use pathway evidence report • Coloring scheme aids in assessing pathway evidence • Phase I: Prune extra variant pathways • Rescore pathways, re-generate pathway evidence report • Phase II: Prune pathways unlikely to be present • No/few unique enzymes • Most pathway steps present because they are used in another pathway • Pathway very unlikely to be present in this organism • Nonspecific enzyme name assigned to a pathway step

Caveats • Cannot predict pathways not present in MetaCyc • Evidence for short pathways is hard to interpret • Since many reactions occur in multiple pathways, some false positives are predicted • A next-generation pathway inference algorithm is currently in progress

Output from PPP • Pathway/genome database • Summary pages • Pathway evidence page • Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report” • Missing enzymes report • Directory tree containing sequence files, reports, etc.

Resulting Directory Structure
• ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
  • input
    • organism.dat
    • organism-init.dat
    • genetic-elements.dat
    • annotation files
    • sequence files
  • reports
    • name-matching-report.txt
    • trial-parse-report.txt
  • kb
    • ORGIDbase.ocelot
  • data
    • overview.graph
• released -> VERSION

Manual Polishing • Refine -> Assign Probable Enzymes (do this first) • Refine -> Rescore Pathways (redo after assigning enzymes) • Refine -> Create Protein Complexes (can be done at any time) • Refine -> Assign Modified Proteins (can be done at any time) • Refine -> Transport Identification Parser (can be done at any time) • Refine -> Pathway Hole Filler • Refine -> Predict Transcription Units • Refine -> Update Overview (do this last, and repeat after any material changes to the PGDB)

Assign Probable Enzymes

How to find reactions for probable enzymes • First, verify that enzyme name describes a specific, metabolic function • Search for fragment of name in MetaCyc – you may be able to find a match that PathoLogic missed • Look up protein in UniProt or other DBs • Search for gene name in PGDB for related organism (bear in mind that gene names are not reliable indicators of function, so check carefully) • Search for function name in PubMed • Other…
