IMG terms and pathways
Krishna Palaniappan, Amy Chen, Frank Korzeniewski, Yuri Grechkin, Ernest Szeto, Victor Markowitz
Natalia Ivanova, Iain Anderson, Thanos Lykidis, Nikos Kyrpides
MGM Workshop, February 1, 2012
Why so many? What’s the difference? Which one should I use?
• New: SEED subsystems, Transport DB, Phenotypes
Where it all comes from
Experimental data:
• gene A in genome X catalyzes a reaction
• interacts with other protein(s)
• gene knock-out causes a certain phenotype
• …
This information is recorded in a structured way:
• ontologies (e.g. Gene Ontology)
• pathway collections (metabolic and protein-protein interaction)
• other (reasoning rules, like TIGR Genome Properties)
Modeling the data properly: why nobody does that
• Genes are connected to phenotypes via a multi-step process with many parameters
• We have only very vague ideas about the steps and parameters for the majority of genes and phenotypes
• If we design a relational database for gene/phenotype connections, most tables will be empty
[Diagram nodes: phenotype, gene, pathway, transcript, reaction, protein, enzyme, compounds, evidence]
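The "mostly empty" point can be made concrete with a minimal sketch. It assumes, for illustration only, that the diagram nodes are flattened into a single record with one optional field per node; the class name, the field order and the example values are hypothetical and are not the IMG schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class GenePhenotypeLink:
    """One gene and whatever is known about its path to a phenotype.

    Hypothetical flattening of the diagram nodes; not an actual schema.
    """
    gene: str
    transcript: Optional[str] = None
    protein: Optional[str] = None
    enzyme: Optional[str] = None
    reaction: Optional[str] = None
    compounds: Optional[str] = None
    pathway: Optional[str] = None
    phenotype: Optional[str] = None
    evidence: Optional[str] = None


def filled_fraction(link):
    """Fraction of the record's fields that hold a value."""
    values = [getattr(link, f.name) for f in fields(link)]
    return sum(v is not None for v in values) / len(values)


# Illustrative record: only the gene and a protein name are known.
typical = GenePhenotypeLink(gene="gene B", protein="hypothetical protein")
fraction = filled_fraction(typical)
```

For a record like `typical`, seven of the nine fields stay empty, which is the situation the slide describes for most genes.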
What it looks like in real life: KEGG vs MetaCyc
• KEGG: http://www.genome.jp/kegg/
• MetaCyc: http://metacyc.org/
Even the MetaCyc record is still incomplete
• Which subunit has which cofactor?
• Type of Cu2+ cluster, type of Fe2+ cluster?
• One of the subunits is a cytochrome c, yet the enzyme is cytosolic?
• Does it require any help with maturation of metal clusters?
• Pseudomonas sp. PB16 was shown to have only 1 enzyme from the pathway, hydroxylamine reductase. Does it have the entire pathway?
Even bigger mess: bioinformatics inference
Experimental data:
• gene A in genome X catalyzes a reaction
• interacts with other protein(s)
• gene knock-out causes a certain phenotype
• …
What about gene B in genome Y, which is similar to gene A?
“True or false?” game
• If gene B was manually annotated, the annotation must be correct
• If gene B was manually annotated, and it has a bi-directional best BLAST hit to gene A with an e-value of 1.0e-5, the annotation must be correct
• If gene B was manually annotated, and it has >50% identity to gene A and is found in the same conserved chromosomal neighborhood as gene A, the annotation must be correct
• …
Poorly done inference: MetaCyc
• Software called PathoLogic
• Parses annotated files and tries to find matches between EC numbers/full product names/partial product names and reactions in the MetaCyc database
• Automatically infers pathway presence based on matches to MetaCyc reactions
• Tries to find candidate genes for “missing” enzymes by running BLAST with the genes assigned to this reaction in other organisms
• Generates a lot of false positives: it inferred the presence of the ammonia oxidation pathway in Staphylococcus based on the presence of 1 gene annotated as ammonia monooxygenase in the GenBank file
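Why a single matched gene can produce a false positive is easier to see in a toy version of match-based inference. This is a minimal sketch and not PathoLogic's actual algorithm: the function names, the `min_coverage` parameter, the three-step pathway and the reaction labels `EC-A`, `EC-B`, `EC-C` are all hypothetical placeholders.

```python
def pathway_coverage(pathway_reactions, genome_ecs):
    """Fraction of a pathway's reactions with a matching annotation in the genome."""
    matched = [ec for ec in pathway_reactions if ec in genome_ecs]
    return len(matched) / len(pathway_reactions)


def infer_pathways(pathways, genome_ecs, min_coverage=0.0):
    """Return pathways with at least one match and coverage >= min_coverage."""
    present = {}
    for name, reactions in pathways.items():
        cov = pathway_coverage(reactions, genome_ecs)
        if cov > 0 and cov >= min_coverage:
            present[name] = cov
    return present


# Hypothetical toy data: placeholder reaction labels, not real EC numbers.
pathways = {"ammonia oxidation": ["EC-A", "EC-B", "EC-C"]}
genome_ecs = {"EC-A"}  # a single gene annotated as ammonia monooxygenase

naive = infer_pathways(pathways, genome_ecs)                     # any match counts
strict = infer_pathways(pathways, genome_ecs, min_coverage=0.5)  # require half the steps
```

With any match counting as evidence, `naive` reports the pathway from one annotated gene; requiring a minimum fraction of matched steps, as in `strict`, drops it. The threshold value is arbitrary here.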
Better inference: KEGG
• Annotation is inferred based on orthology, defined as bi-directional best BLAST hits, and manually refined based on “Ortholog tables” and chromosomal clusters
• Poorly documented, but seems to generate far fewer false positives than PathoLogic
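The bi-directional best hit criterion itself is simple to state in code. The sketch below assumes hits are already available as (query, subject, bit score) tuples from two searches, one in each direction; the function names and the gene names and scores are illustrative, and the manual refinement step is not modeled.

```python
def best_hits(hits):
    """Map each query to its best-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Illustrative hit tables for genome X searched against genome Y and back.
x_vs_y = [("geneA", "geneB", 250.0), ("geneA", "geneC", 90.0), ("geneD", "geneB", 120.0)]
y_vs_x = [("geneB", "geneA", 248.0), ("geneC", "geneA", 88.0)]

pairs = bidirectional_best_hits(x_vs_y, y_vs_x)
```

Here geneD also hits geneB, but geneB's best hit back is geneA, so only the geneA/geneB pair survives. Note that the criterion says nothing about whether the underlying annotation of geneA was correct.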
Even the best structured inference is far from perfect
• Problem: neither BLAST nor Smith-Waterman knows which amino acids are more important for protein function than others
• Using a consensus sequence (either as a PSSM or an HMM) with family-specific bit score cutoffs would be much better
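What "family-specific bit score cutoffs" means in practice can be sketched as a filter over profile-search hits. This is a minimal illustration that assumes the hits and the per-family cutoffs are already available; the function name, the family names, the scores and the cutoff values are hypothetical. Picking the highest raw score among passing families is a simplification.

```python
def assign_families(hits, cutoffs):
    """Keep only hits whose bit score meets that family's own cutoff.

    hits: iterable of (gene, family, bit_score) from a profile search.
    cutoffs: dict mapping family -> minimum trusted bit score.
    Returns dict gene -> (family, bit_score) for the best passing hit.
    """
    assigned = {}
    for gene, family, score in hits:
        threshold = cutoffs.get(family)
        if threshold is None or score < threshold:
            continue  # no cutoff defined for this family, or score below it
        if gene not in assigned or score > assigned[gene][1]:
            assigned[gene] = (family, score)
    return assigned


# Hypothetical families with very different trusted cutoffs.
cutoffs = {"famX": 200.0, "famY": 40.0}
hits = [
    ("gene1", "famX", 150.0),  # higher score, but below famX's cutoff
    ("gene1", "famY", 55.0),   # lower score, above famY's cutoff
    ("gene2", "famX", 310.0),
]

result = assign_families(hits, cutoffs)
```

A single global threshold would either accept the weak famX hit for gene1 or reject its valid famY hit; per-family cutoffs avoid both.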
Pathway collections: KEGG, MetaCyc and others
• Which particular set of interactions is a pathway? (i.e. how do we define pathway boundaries within the network?)
Ideal solution: pathway NR
• All pathway collections share a common skeleton of reactions, which consist of reactants (compounds)
• All reactions share the common base of proteins annotated as catalysts
• Can we merge the information from different collections, using the best features of all of them?
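One way to picture merging on the shared reaction skeleton is to key every reaction by its compounds. The sketch below reads "NR" as non-redundant, which is an assumption, and it also assumes compound names have already been normalized across collections, which is the hard part in practice. The function names, collection names, reaction IDs and compounds are hypothetical.

```python
def reaction_key(substrates, products):
    """Direction-insensitive key built from the compounds on each side."""
    sides = sorted([tuple(sorted(substrates)), tuple(sorted(products))])
    return tuple(sides)


def merge_collections(collections):
    """Merge reactions from several collections into a non-redundant set.

    collections: dict source_name -> list of (reaction_id, substrates, products).
    Returns dict reaction key -> sorted list of (source_name, reaction_id).
    """
    merged = {}
    for source, reactions in collections.items():
        for rid, subs, prods in reactions:
            merged.setdefault(reaction_key(subs, prods), []).append((source, rid))
    return {key: sorted(refs) for key, refs in merged.items()}


# Hypothetical toy collections with placeholder compound names.
collections = {
    "collectionK": [("K1", ["A"], ["B"]), ("K2", ["B"], ["C"])],
    "collectionM": [("M1", ["B"], ["A"])],  # same reaction as K1, written in reverse
}

merged = merge_collections(collections)
```

In `merged`, K1 and M1 collapse into one entry that keeps both source references, so annotations from either collection can be attached to the same reaction.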
IMG terms: 3 types
• IMG terms of 3 types:
  1. gene product
  2. multi-subunit protein complex
  3. modified protein
[Diagram labels: A, B, C, D; R1; R2, spontaneous; R3, spontaneous; R4, chaperone; “Not an IMG term!”; Enzyme (EC x.x.x.x) monomeric, needs cofactor C; Enzyme (EC x.x.x.x) heterotrimeric, needs cofactor D; Enzyme (EC x.x.x.x) monomeric precursor; Enzyme (EC x.x.x.x) heterotrimeric, subunit A precursor; Enzyme (EC x.x.x.x) heterotrimeric, subunits A, B and C; IMG term of the type “Gene product” (twice); IMG term of the type “Modified protein”; IMG term of the type “Protein complex”]
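The three term types can be sketched as a small data structure. This is an illustration only, not the IMG data model: the class and function names and the `components` link are assumptions, and since the diagram layout does not survive here, the assignment of types to the individual subunits below is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TermType(Enum):
    GENE_PRODUCT = "gene product"
    PROTEIN_COMPLEX = "protein complex"
    MODIFIED_PROTEIN = "modified protein"


@dataclass
class Term:
    name: str
    term_type: TermType
    # Assumed link: subunits of a complex, or the precursor of a modified protein.
    components: List["Term"] = field(default_factory=list)


def gene_products(term):
    """Collect the names of gene-product terms a term is ultimately built from."""
    if term.term_type is TermType.GENE_PRODUCT:
        return [term.name]
    names = []
    for part in term.components:
        names.extend(gene_products(part))
    return names


# Hypothetical wiring of the heterotrimeric enzyme named on the slide.
precursor_a = Term("subunit A precursor", TermType.GENE_PRODUCT)
subunit_a = Term("subunit A", TermType.MODIFIED_PROTEIN, [precursor_a])
subunit_b = Term("subunit B", TermType.GENE_PRODUCT)
subunit_c = Term("subunit C", TermType.GENE_PRODUCT)
enzyme_complex = Term(
    "heterotrimeric enzyme", TermType.PROTEIN_COMPLEX, [subunit_a, subunit_b, subunit_c]
)

products = gene_products(enzyme_complex)
```

Walking down from the complex recovers the gene products that would need to be present in a genome for the complex to be assembled.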