PathoLogic Pathway Predictor
Inference of Metabolic Pathways • PathoLogic software integrates genome and pathway data to identify putative metabolic networks • [Figure: an annotated genomic sequence (DNA sequences, genes/ORFs, gene products) and the multi-organism pathway database MetaCyc (pathways, reactions, compounds) feed into the PathoLogic software, which produces a Pathway/Genome Database (pathways, reactions, compounds, gene products, genes, genomic map)]
PathoLogic Functionality • Initialize schema for new PGDB • Transform existing genome to PGDB form • Infer metabolic pathways and store in PGDB • Infer operons and store in PGDB • Assemble Overview diagram • Assist user with manual tasks • Assign enzymes to reactions they catalyze • Identify false-positive pathway predictions • Build protein complexes from monomers • Infer transport reactions • Fill pathway holes • Note PathoLogic can be run from command line
PathoLogic Step 3: Metabolic Reconstruction • Phase I: Qualitative metabolic reconstruction (PathoLogic) • Inference of the reactome from the annotated genome • Inference of metabolic pathways by selecting from MetaCyc pathways • Karp et al, Stand Genomic Sci, 2011 5:424 • Phase II: Quantitative model construction (MetaFlux) • Infer biomass metabolites, nutrients • Gap fill reaction network • Modify reaction complement until biomass metabolites producible from nutrients • Solve model, assess computed fluxes • Iterate • Karp et al, Briefings in Bioinformatics 2015 • Dale et al, BMC Bioinformatics 2010 11:15
MetaCyc: Curated Metabolic Database • “A Systematic Comparison of the MetaCyc and KEGG Pathway Databases,” BMC Bioinformatics 2013 14(1):112
Pathway Prediction • Pathway prediction is useful because • Pathways organize the metabolic network into tractable units • Pathways guide us to search for missing enzymes • Pathways can be used for analysis of high-throughput data • Visualization, enrichment analysis • Pathway inference fills gaps in metabolic network • Reduces computational demands of gap filling • Pathway prediction is hard because • Reactome inference is imperfect • Some reactions present in multiple pathways • Pathway variants share many reactions in common • Increasing size of MetaCyc
Reactome Inference • For each protein in the organism, infer reaction(s) it catalyzes • Build from existing genome annotation! • Match protein functions to MetaCyc reactions • Enzyme names (uncontrolled vocabulary) • EC numbers • Gene Ontology terms
PathoLogic Enzyme Name Matcher • Name matcher generates alternative variants of each name and matches each to MetaCyc • Strips extraneous information found in enzyme names • Putative carbamate kinase, alpha subunit • Flavin subunit of carbamate kinase • Cytoplasmic carbamate kinase • Carbamate kinase (abcD) • Carbamate kinase (3.2.1.4)
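The stripping step can be illustrated with a minimal Python sketch. The regular expressions below are assumptions derived only from the five example names on this slide; they are not PathoLogic's actual rules, which are more extensive, and the function name `strip_enzyme_name` is hypothetical.

```python
import re

# Hypothetical stripping rules inferred from the slide's five examples only;
# the real PathoLogic name matcher handles many more cases.
PARENTHETICAL = re.compile(r"\s*\([^)]*\)\s*$")                     # "(abcD)", "(3.2.1.4)"
SUBUNIT_SUFFIX = re.compile(r",\s*\w+\s+subunit$", re.IGNORECASE)   # ", alpha subunit"
SUBUNIT_PREFIX = re.compile(r"^\w+\s+subunit\s+of\s+", re.IGNORECASE)  # "Flavin subunit of "
QUALIFIER_PREFIX = re.compile(r"^(putative|cytoplasmic)\s+", re.IGNORECASE)


def strip_enzyme_name(name: str) -> str:
    """Return a lowercased enzyme name with extraneous qualifiers removed."""
    s = name.strip()
    s = PARENTHETICAL.sub("", s)
    s = SUBUNIT_SUFFIX.sub("", s)
    s = SUBUNIT_PREFIX.sub("", s)
    s = QUALIFIER_PREFIX.sub("", s)
    return s.lower()


examples = [
    "Putative carbamate kinase, alpha subunit",
    "Flavin subunit of carbamate kinase",
    "Cytoplasmic carbamate kinase",
    "Carbamate kinase (abcD)",
    "Carbamate kinase (3.2.1.4)",
]
variants = {strip_enzyme_name(n) for n in examples}
```

Under these assumed rules, all five example names reduce to the same candidate string, which could then be looked up against MetaCyc enzyme names.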
Algorithm for Inference of Metabolic Pathways • For each pathway in MetaCyc consider • For what fraction of its reactions are enzymes present in the organism? • Are enzymes present for reactions unique to the pathway? • Is a given pathway outside its designated taxonomic range? • Calvin cycle: green plants, green algae, etc • Are enzymes present for designated “key reactions” within MetaCyc pathways? • Calvin cycle / ribulose bisphosphate carboxylase • Reference: Standards in Genomic Sciences 5:424-429, 2011
New Addition: Pathway Score • PS : Pathway Score [0,1] • R : Set of reactions within pathway • Ignore spontaneous reactions • RS : Reaction Score for a given reaction • T : Boost if organism is within taxonomic range of pathway
Reaction Score • RS = P + U + K • P = presence score • 0.2 if enzyme catalyzing rxn is present • Else 0 • U = uniqueness score • Ranges from 0.6 (rxn present in single pathway) to 0 (many pathways) • K = key reaction score • 0.5 if rxn is a key reaction of the pathway • Else 0
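The reaction score can be sketched in Python as follows. The constants 0.2, 0.6 and 0.5 come from the slide; the shape of the uniqueness term between its two endpoints is not given, so the `0.6 / n_pathways` decay used here is purely an assumption for illustration, as is the function name.

```python
def reaction_score(enzyme_present: bool, n_pathways: int, is_key: bool) -> float:
    """Compute RS = P + U + K for one reaction of a pathway.

    n_pathways is the number of pathways that contain the reaction (>= 1).
    """
    if n_pathways < 1:
        raise ValueError("a reaction must belong to at least one pathway")
    # P: presence score, as stated on the slide.
    p = 0.2 if enzyme_present else 0.0
    # U: uniqueness score. ASSUMPTION: the slide fixes only the endpoints
    # (0.6 for a single pathway, 0 for many); 0.6 / n is one possible decay.
    u = 0.6 / n_pathways
    # K: key reaction score, as stated on the slide.
    k = 0.5 if is_key else 0.0
    return p + u + k
```

For example, a key reaction that has an enzyme and occurs in only one pathway scores 0.2 + 0.6 + 0.5 = 1.3 under this sketch.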
Pathway Decision Procedure for Pathway P • REJECT P if P is a transport, signaling, or synthetic (engineered) pathway • REJECT P if P is an electron transport pathway AND P lacks enzymes for any reaction • INCLUDE P if an enzyme is present for every reaction of P AND either P is within its taxonomic range or P contains more than 3 reactions • REJECT P if P is outside its taxonomic range • REJECT P if P is missing enzymes for all key reactions of P
Decision Procedure • REJECT P if the score of P is significantly less than the score of a variant pathway of P • INCLUDE P if the score of P exceeds the threshold PATHWAY-PREDICTION-SCORE-CUTOFF • Defined in ptools-init.dat • Default decision: REJECT
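The ordered rules of the decision procedure can be written as a first-match-wins function. This is a minimal sketch, not the PathoLogic implementation: the `PathwayEvidence` container, its field names and the `kind` labels are hypothetical, "lacks enzymes for any reaction" is read as "at least one reaction has no enzyme", and both the score cutoff and the margin that makes a score "significantly less" than a variant's are caller-supplied because the slides give no values. Reaction counts are assumed to already exclude spontaneous reactions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathwayEvidence:
    """Hypothetical container for the facts the decision rules consult."""
    kind: str                         # "metabolic", "transport", "signaling", "engineered", "electron-transport"
    n_reactions: int                  # reactions in the pathway
    n_reactions_with_enzyme: int      # reactions for which an enzyme is present
    in_taxonomic_range: bool
    n_key_reactions: int
    n_key_reactions_with_enzyme: int
    score: float                      # pathway score
    best_variant_score: Optional[float] = None  # highest score among variant pathways


def decide(p: PathwayEvidence, cutoff: float, variant_margin: float) -> str:
    """Apply the rules in slide order; the first rule that fires wins."""
    if p.kind in {"transport", "signaling", "engineered"}:
        return "REJECT"
    all_present = p.n_reactions_with_enzyme == p.n_reactions
    if p.kind == "electron-transport" and not all_present:
        return "REJECT"
    if all_present and (p.in_taxonomic_range or p.n_reactions > 3):
        return "INCLUDE"
    if not p.in_taxonomic_range:
        return "REJECT"
    if p.n_key_reactions > 0 and p.n_key_reactions_with_enzyme == 0:
        return "REJECT"
    # ASSUMPTION: "significantly less" is modelled as a fixed margin.
    if p.best_variant_score is not None and p.score < p.best_variant_score - variant_margin:
        return "REJECT"
    if p.score > cutoff:
        return "INCLUDE"
    return "REJECT"  # default decision
```

Keeping the rules as a flat sequence of early returns mirrors the slide order, which matters because an early INCLUDE (all reactions present) takes precedence over the later taxonomic-range rejection.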
PathoLogic Input/Output • Inputs: • List of all genetic elements • Enter using GUI or provide a file • Files containing annotation for each genetic element • Files containing DNA sequence for each genetic element • MetaCyc database • Output: • Pathway/genome database for the subject organism • Reports that summarize: • Evidence in the input genome for the presence of reference pathways • Reactions missing from inferred pathways
File Naming Conventions • One pair of sequence and annotation files for each genetic element • Sequence files: FASTA format • suffix .fsa or .fna • Annotation file: • GenBank format: suffix .gbk • PathoLogic format: suffix .pf
Typical Problems Using GenBank Files With PathoLogic • Wrong qualifier names used: read PathoLogic documentation! • Extraneous information in a given qualifier • Check results of trial parse carefully
GenBank File Format • Accepted feature types: • CDS, tRNA, rRNA, misc_RNA • Accepted qualifiers: • /locus_tag Unique ID [recm] • /gene Gene name [req] • /product [req] • /EC_number [recm] • /product_comment [opt] • /gene_comment [opt] • /alt_name Synonyms [opt] • /pseudo Gene is a pseudogene [opt] • /db_xref DB:AccessionID [opt] • /go_component, /go_function, /go_process GO terms [opt] • For multifunctional proteins, put each function in a separate /product line
PathoLogic File Format • Each record starts with line containing an ID attribute • Tab delimited • Each record ends with a line containing // • One attribute-value pair is allowed per line • Use multiple FUNCTION lines for multifunctional proteins • Lines starting with ‘;’ are comment lines • Valid attributes are: • ID, NAME, SYNONYM • STARTBASE, ENDBASE, GENE-COMMENT • FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT • DBLINK • GO • INTRON
PathoLogic File Format (example)
ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
PRODUCT-TYPE	P
GO	glutamate synthase (NADPH) activity [goid 0004355] [evidence IDA] [pmid 4565085]
//
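The format rules (tab-delimited attribute-value lines, `//` record terminators, `;` comment lines, repeated FUNCTION lines) are simple enough to check with a short script before running a trial parse. The following Python reader is an illustrative sketch written for this purpose, not part of Pathway Tools; the function name `parse_pf` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def parse_pf(text: str) -> List[Dict[str, List[str]]]:
    """Parse PathoLogic-format text into a list of records.

    Each record maps an attribute name to the list of its values, so that
    repeated attributes (e.g. several FUNCTION lines) are all preserved.
    """
    records: List[Dict[str, List[str]]] = []
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line.strip() == "//":
            if current is not None:
                records.append(dict(current))
            current = None
            continue
        # One tab-delimited attribute-value pair per line.
        attr, _, value = line.partition("\t")
        if current is None:
            current = defaultdict(list)
        current[attr.strip()].append(value.strip())
    if current is not None:
        records.append(dict(current))  # tolerate a missing final "//"
    return records
```

A usage example: `parse_pf(open("chrom1.pf").read())` returns one dictionary per gene, and comparing the counts of records with ID, STARTBASE and ENDBASE values gives a quick consistency check on the input file.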
Before you start: What to do when an error occurs • Most Navigator errors are automatically trapped – debugging information is saved to error.tmp file. • All other errors (including most PathoLogic errors) will cause software to drop into the Lisp debugger • Unix: error message will show up in the original terminal window from which you started Pathway Tools. • Windows: Error message will show up in the Lisp console. The Lisp console usually starts out iconified – its icon is a blue bust of Franz Liszt • 2 goals when an error occurs: • Try to continue working • Obtain enough information for a bug report to send to pathway-tools support team.
The Lisp Debugger
• Sample error (details and number of restart actions differ for each case)
    Error: Received signal number 2 (Keyboard interrupt)
    Restart actions (select using :continue):
     0: continue computation
     1: Return to command level
     2: Pathway Tools version 10.0 top level
     3: Exit Pathway Tools version 10.0
    [1c] EC(2):
• To generate debugging information (stack backtrace):
    :zoom :count :all
• To continue from error, find a restart that takes you to the top level – in this case, number 2
    :cont 2
• To exit Pathway Tools:
    :exit
How to report an error • Determine if problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed) • Send email to ptools-support@ai.sri.com containing: • Pathway Tools version number and platform • Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem • error.tmp file, if one was generated • If software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on previous slide)
Using the PPP GUI to Create a Pathway/Genome Database • Input Project Information • Organism -> Create New • Creates directory structure for new PGDB • Creates and saves empty PGDB, populated only with objects common to all PGDBs (schema classes, elements, etc.) and data you entered in the form. • Offers to invoke Replicon Editor
Enter Replicon Information • For each replicon • Name • Type: chromosome, plasmid, etc. • Circular? • Annotation file • Sequence file (optional) • Contigs (optional) • Links to other DBs (optional) • GUI-Based entry • Build->Specify Replicons • File-Based Entry • Create genetic-elements.dat file using template provided
Batch Entry of Replicon Info
File /<orgid>cyc/<version>/input/genetic-elements.dat:
ID	TEST-CHROM-1
NAME	Chromosome 1
TYPE	:CHRSM
CIRCULAR?	N
ANNOT-FILE	chrom1.pf
SEQ-FILE	chrom1.fsa
//
ID	TEST-CHROM-2
NAME	Chromosome 2
CIRCULAR?	N
ANNOT-FILE	/mydata/chrom2.gbk
SEQ-FILE	/mydata/chrom2.fna
//
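For genomes with many replicons it may be convenient to generate genetic-elements.dat from a script rather than by hand. The Python sketch below emits records using only the attributes shown in the example above; the tab separator between attribute and value is an assumption carried over from the PathoLogic annotation format, and the function name is hypothetical.

```python
from typing import Dict, Iterable

# Attribute order follows the example file on this slide.
ATTRIBUTE_ORDER = ("ID", "NAME", "TYPE", "CIRCULAR?", "ANNOT-FILE", "SEQ-FILE")


def format_genetic_elements(replicons: Iterable[Dict[str, str]]) -> str:
    """Render replicon descriptions as genetic-elements.dat text."""
    lines = []
    for rep in replicons:
        if "ID" not in rep:
            raise ValueError("each replicon record needs an ID")
        for key in ATTRIBUTE_ORDER:
            if key in rep:
                # ASSUMPTION: attribute and value are separated by a tab.
                lines.append(f"{key}\t{rep[key]}")
        lines.append("//")  # record terminator
    return "\n".join(lines) + "\n"


text = format_genetic_elements([
    {"ID": "TEST-CHROM-1", "NAME": "Chromosome 1", "TYPE": ":CHRSM",
     "CIRCULAR?": "N", "ANNOT-FILE": "chrom1.pf", "SEQ-FILE": "chrom1.fsa"},
])
```

The resulting string can be written to the `input` directory of the PGDB; optional attributes such as TYPE or SEQ-FILE are simply omitted when absent from a record.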
Building the PGDB • Trial Parse • Build -> Trial Parse • Check output to ensure numbers “look right” • Same number of gene start positions, end positions, names • Did my file contain EC numbers? Were they detected? • Did my file contain RNAs? Were they detected? • Fix any errors in input files • Build pathway/genome database • Build -> Automated Build
Automated Build • Parses input files • Creates objects for every gene and gene product • Uses EC numbers, GO annotations and name matcher to match enzymes to reactions in MetaCyc • Imports catalyzed reactions and compounds from MetaCyc • Generates list of likely enzymes that couldn’t be assigned • Infers pathways likely to be present • Generates Cellular Overview Diagram (first pass) • Generates reports
Enzyme Name Matcher • For names that do not match, software identifies probable metabolic enzymes as those • Containing “ase” • Not containing keywords such as • “sensor kinase” • “topoisomerase” • “protein kinase” • “peptidase” • Etc • User should research unknown enzymes • MetaCyc, Swiss-Prot, PubMed
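The "probable enzyme" heuristic described above amounts to a substring test plus an exclusion list. A minimal Python sketch follows; the exclusion list contains only the four keywords named on the slide (the real list is longer, as the "Etc" indicates), and the function name is hypothetical.

```python
# Only the keywords named on the slide; the actual list is longer.
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_metabolic_enzyme(name: str) -> bool:
    """Flag an unmatched protein name as a probable metabolic enzyme."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)


unmatched = ["carbamate kinase", "DNA topoisomerase I", "hypothetical protein"]
probable = [n for n in unmatched if is_probable_metabolic_enzyme(n)]
```

Because the test is a plain substring match, names such as "phase variation protein" would also be flagged, which is one reason the user should research each reported enzyme manually.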
Pathway Evidence Report • On Organism Summary Page in Navigator, button “Generate Pathway Evidence Report” • Report saved as HTML file, view in browser • Hierarchical listing of all inferred pathways • “Pathway Glyph” shows evidence graphically • Steps with/without enzymes (green/black) • Steps that are unique to pathway (orange) • Steps filled by Pathway Hole Filler (blue) • Counts reactions in pathway, with evidence, in other pathways • Lists other pathways that share reactions • Link to pathway in MetaCyc
Manual Pruning of Pathways • Use pathway evidence report • Coloring scheme aids in assessing pathway evidence • Phase I: Prune extra variant pathways • Rescore pathways, re-generate pathway evidence report • Phase II: Prune pathways unlikely to be present • No/few unique enzymes • Most pathway steps present because they are used in another pathway • Pathway very unlikely to be present in this organism • Nonspecific enzyme name assigned to a pathway step
Caveats • Cannot predict pathways not present in MetaCyc • Evidence for short pathways is hard to interpret • Since many reactions occur in multiple pathways, some false positives
Output from PPP • Pathway/genome database • Summary pages • Pathway evidence page • Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report” • Missing enzymes report • Directory tree containing sequence files, reports, etc.
Resulting Directory Structure
• ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
    • input
        • organism.dat
        • organism-init.dat
        • genetic-elements.dat
        • annotation files
        • sequence files
    • reports
        • name-matching-report.txt
        • trial-parse-report.txt
    • kb
        • ORGIDbase.ocelot
    • data
        • overview.graph
• released -> VERSION
Manual Polishing • Refine -> Assign Probable Enzymes (do this first) • Refine -> Rescore Pathways (redo after assigning enzymes) • Refine -> Create Protein Complexes (can be done at any time) • Refine -> Assign Modified Proteins (can be done at any time) • Refine -> Transport Identification Parser (can be done at any time) • Refine -> Pathway Hole Filler • Refine -> Predict Transcription Units • Refine -> Update Overview (do this last, and repeat after any material changes to PGDB)
How to find reactions for probable enzymes • First, verify that enzyme name describes a specific, metabolic function • Search for fragment of name in MetaCyc – you may be able to find a match that PathoLogic missed • Look up protein in UniProt or other DBs • Search for gene name in PGDB for related organism (bear in mind that gene names are not reliable indicators of function, so check carefully) • Search for function name in PubMed • Other…