Data Curation in IMG-ER
Natalia Ivanova
MGM Workshop, September 28, 2011
Tricky question
• What do you need to do data curation in IMG?
• iPhone
• PhD in Computer Science
• Supernatural powers
• Correct answer: you need an IMG account (http://img.jgi.doe.gov/er)
What can be curated in IMG-ER?
1. Gene models
• Add a gene
• Mark a gene as a pseudogene or as “obsolete” (= delete it)
2. Functional annotations
• Product names
• EC numbers
• Gene symbols
If you believe something else needs to be changed (genome name, taxonomy, etc.), please use the IMG Questions/Comments link.
What can’t be changed: automated assignments to protein families (Pfam, COG, TIGRfam, InterPro, SEED and KO assignments).
• Product Name is free text (but see the GenBank requirements: http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html)
• Prot Description is free text (goes to “note” in the GenBank submission)
• EC number and PUBMED ID: see explanation
• Notes are free text (go to “note” in the GenBank submission)
• Gene symbol is the “gene name”, a 4-letter abbreviation; goes to “gene” in the GenBank submission
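The field-to-qualifier routing described above can be sketched as a small lookup. This is only an illustration: the dictionary lists just the destinations named on the slide, and the helper `to_genbank_qualifiers` is a hypothetical name, not part of IMG or of any GenBank tool.

```python
# Destinations stated on the slide; fields whose GenBank qualifier the
# slide does not name (Product Name, EC number, PUBMED ID) are left out.
IMG_TO_GENBANK = {
    "Prot Description": "note",
    "Notes": "note",
    "Gene symbol": "gene",
}


def to_genbank_qualifiers(curation):
    """Group curated IMG field values under GenBank qualifier names.

    `curation` maps IMG field names to free-text values (hypothetical
    representation). Unmapped or empty fields are skipped.
    """
    qualifiers = {}
    for field, value in curation.items():
        qualifier = IMG_TO_GENBANK.get(field)
        if qualifier is None or not value:
            continue
        qualifiers.setdefault(qualifier, []).append(value)
    return qualifiers
```

Both Prot Description and Notes end up under the same “note” qualifier, so the sketch collects values in lists rather than overwriting them.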
How to find the genes that need curation?
Two possible scenarios:
• You have submitted a genome to IMG-ER and want the best possible annotations for it (e.g. for GenBank submission)
• You’re an expert and know everything about a certain protein family (or families) = “community service”
Curation of genome annotations
• “Hypothetical protein”, but with some evidence
• Non-hypothetical protein, but no evidence
• Workflow diagram (labels only): Compare Gene Annotations; add to Gene Cart; review Gene Pages; find genome; Genome Statistics; refine gene set
• Find Genomes: Genome Browser, Genome Search
• w/o enzymes but with candidate KO based enzymes
• Protein families
• Homologs/orthologs
• Gene Neighborhoods
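The two categories of genes worth reviewing can be expressed as a simple check. This is a minimal sketch under stated assumptions: the function name `needs_review` is hypothetical, and "evidence" is modeled as a plain list of family hits (for example Pfam, COG, TIGRfam or KO identifiers), which is not how IMG stores it.

```python
def needs_review(product_name, evidence):
    """Return the review category for a gene, or None if neither applies.

    product_name: the current product name of the gene.
    evidence: list of family hits for the gene (hypothetical representation).
    """
    is_hypothetical = "hypothetical protein" in product_name.lower()
    if is_hypothetical and evidence:
        return "hypothetical, but with some evidence"
    if not is_hypothetical and not evidence:
        return "non-hypothetical, but no evidence"
    return None
```

For example, `needs_review("hypothetical protein", ["pfam00154"])` falls into the first category, while a named product with an empty evidence list falls into the second.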
Why do you want to review annotations?
• Most IMG pipelines are optimized for specificity: they generate few false positives but are more likely to produce false negatives
• Compare Annotations: the product name is a consensus of multiple assignments (BLASTp, TIGRfam, COG, Pfam)
• Sources of false negatives are cutoffs: TIGRfam trusted cutoffs are quite stringent; COG doesn’t have trusted cutoffs; the BLASTp cutoff is 50% identity
• Candidate genes with KO annotations: the sources of false negatives are the cutoffs for % identity and alignment length
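How a strict cutoff produces false negatives can be illustrated with a toy filter. This is a sketch, not the IMG pipeline: the `Hit` class and `split_by_cutoff` are hypothetical names, the 50% identity default comes from the slide, and the 0.7 aligned-fraction default is an arbitrary assumption chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    gene_id: str
    percent_identity: float
    aligned_fraction: float  # fraction of the query covered by the alignment


def split_by_cutoff(hits, min_identity=50.0, min_aligned_fraction=0.7):
    """Split hits into those passing both cutoffs and those rejected.

    Rejected hits are the candidates for manual review: some of them are
    true homologs that fell just below a cutoff (false negatives).
    """
    accepted, rejected = [], []
    for hit in hits:
        passes = (hit.percent_identity >= min_identity
                  and hit.aligned_fraction >= min_aligned_fraction)
        (accepted if passes else rejected).append(hit)
    return accepted, rejected
```

A hit at 48% identity over the full length of the query lands in the rejected list even though it may well be a real homolog, which is the kind of case manual review is meant to catch.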
Curation of annotation in one genome (or a set of genomes)
• Your favorite genes (experimental verification, etc.): use Find Genes, Gene Search or BLAST
• “Compare Annotations” on the Organism Details page
• “Candidate genes with KO annotations” on the Organism Details page
• PhyloProfiler
A shortcut for product name/EC number assignments based on KO
Example of a missed gene
• Run PhyloProfiler with Deinococcus geothermalis as the query and Deinococcus hopiensis as the target (“with no homologs in”)
• Select Dgeo_0119 as the sequence to check whether a homolog of this gene was missed in Deinococcus hopiensis
Adding missed genes (contd.)
• Use the graphical viewer to check the translation
• Adjust the start if other start codons with a better RBS exist upstream
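Looking for alternative upstream starts can be sketched as an in-frame scan backwards from the current start. This is an illustration under assumptions, not what the IMG graphical viewer does: `upstream_starts` is a hypothetical helper, it handles the forward strand only, it uses the common bacterial start codons ATG, GTG and TTG, and it does not score the RBS, which still has to be judged by eye.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_starts(seq, start):
    """List in-frame start codons upstream of the current start.

    seq: nucleotide sequence of the forward strand.
    start: 0-based position of the first base of the current start codon.
    Returns 0-based positions, nearest first, stopping at the first
    in-frame stop codon or at the beginning of the sequence.
    """
    seq = seq.upper()
    candidates = []
    pos = start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # an upstream stop ends the open reading frame
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    return candidates
```

Each returned position is a candidate new start; whether it is better than the current one depends on the RBS upstream of it.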
Reviewing your annotations
• Organism Details page -> Genome Statistics
• MyIMG
IMG curation exercises
• Go to the link in the usual place: http://genomebiology.jgi-psf.org/Content/MGM-10.Sep2011/agenda.html
• The first 2 pages are questions without answers; the rest is a cheat sheet