Genome Annotation Continued • This week’s lab. • Genome annotation - web-based databases for assigning gene function.
Last week’s lab • E-value • Score • Blastx • Taxonomy
Lab • Sequence assembly and analysis • Assemble individual sequence reads • Phred = 30 - good or bad?
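The "Phred = 30" question can be answered with a little arithmetic. A Phred quality score Q is related to the base-calling error probability P by Q = -10 log10(P). The sketch below converts between the two; the function names are illustrative and not part of any assembler's API.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)

# Print the error rate implied by a few common quality scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g} (about 1 wrong call in {round(1 / p)})")
```

Under this relationship, Phred 30 means roughly one wrong base call in 1,000, which is generally treated as good quality.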
Linking Protein Sequence, Structure, and Function • Protein sequences (Protein) • CDD: conserved functional domains in proteins, represented by a PSSM (Domains) • PSI-BLAST, RPS-BLAST, CDART • 3D Domains • Source: NCBI Field Guide
Position Specific Substitution Rates • Weakly conserved serine • Active site serine
Position Specific Score Matrix (PSSM)
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
• Serine is scored differently in these two positions • Active site nucleophile
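To make the position-specific scoring concrete, here is a minimal sketch that scores a short peptide against rows 214-219 of the matrix above. It assumes a simple ungapped alignment and just sums one score per position; real profile searches such as PSI-BLAST also handle gaps and statistics. The function and variable names are illustrative.

```python
# Column order of the PSSM shown on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows 214-219 copied from the slide's matrix (position -> 20 scores).
PSSM_ROWS = {
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}

def score_segment(segment, start):
    """Sum the PSSM scores of an ungapped segment aligned from position `start`."""
    total = 0
    for offset, residue in enumerate(segment):
        row = PSSM_ROWS[start + offset]
        total += row[AMINO_ACIDS.index(residue)]
    return total

print(score_segment("GDSGGP", 214))  # 48: serine at position 216 scores +7
print(score_segment("GDAGGP", 214))  # 39: alanine at position 216 scores -2
```

Swapping the serine at position 216 for alanine drops the total by 9, whereas the same swap at position 211 would cost nothing (both score 4 there). That difference is what a single fixed substitution matrix cannot express.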
Hidden Markov Models • A statistical model that can be applied to any system that can be represented as a series of discrete states. • Applies to protein and nucleotide (nt) sequences. • Can be thought of as much like the PSSMs that PSI-BLAST builds after several iterations. • Are used in gene finding and protein profile analysis.
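As a rough illustration of how an HMM assigns a probability to a sequence, the sketch below implements the forward algorithm for a toy two-state model. The states ("coding" and "noncoding") and every probability are made-up assumptions chosen for readability; they are not taken from any real gene finder or profile database.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under the HMM, summed over all state paths."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical toy model: two hidden states that emit nucleotides.
states = ("coding", "noncoding")
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
emit_p = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

print(forward("GCGC", states, start_p, trans_p, emit_p))
```

Profile HMMs such as those behind TIGRFAMs and Pfam use the same idea with match, insert, and delete states for each alignment column, so a sequence can be scored against a whole family rather than a single query.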
Uses of HMMs in protein function analysis. • TIGRFAMs: strive to annotate the function of an entire protein. • Pfams: strive to annotate domains of proteins.
Homologs, orthologs, and paralogs. • Homologous genes are genes that share a common evolutionary ancestor. • Orthologs are genes found in different organisms that diverged from a common ancestral gene through speciation. • Paralogs are genes found in the same organism that arose from a common ancestral gene through duplication. The duplication could have occurred in that species or earlier, and paralogs have often diverged in function.
TIGRFAM • Curated such that proteins in a TIGRFAM should have the same function if they are equivalogs. • Proteins have identity over their entire length. • Equivalog family = all proteins that are conserved with respect to function since their last common ancestor. • Superfamily - all proteins that share homology but may have different biological functions. • Subfamily - an incomplete set of homologous proteins that may have diverse biological functions.
PFAM • More likely to describe a protein domain than a whole family. • Pfam families do not overlap. • Cross-listed on the TIGRFAM pages. • ~70% of proteins in Swiss-Prot have a Pfam match.
COGs • Clusters of Orthologous Groups • Built from pairwise comparisons of proteins from many bacterial genomes to identify orthologs. • Suggests function only (book example).
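The idea of finding candidate orthologs by pairwise comparison can be sketched with a reciprocal-best-hit check between two genomes. This is a simplified stand-in, not the procedure the COG database actually uses, which groups best hits across three or more lineages. The gene names and bit scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical bit scores between genes of genome A and genome B.
a_vs_b = {
    "A1": {"B1": 250.0, "B2": 90.0},
    "A2": {"B2": 180.0},
    "A3": {"B1": 120.0},
}
b_vs_a = {
    "B1": {"A1": 245.0, "A3": 118.0},
    "B2": {"A2": 175.0, "A1": 88.0},
}

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('A1', 'B1'), ('A2', 'B2')]
```

Gene A3 is left out because its best hit, B1, prefers A1. A3 may be a paralog of A1, and such a pairing only suggests shared function; it does not establish it.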
Gene Ontology (GO) • “The goal of the Gene Ontology project is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.” • Biological process, Molecular function, Cellular component
Literature Curation • Saccharomyces Genome Database (SGD), for example. • Manual curation of the literature for experimental evidence linking function to annotation.
Additional databases • SMART - Simple Modular Architecture Research Tool. • PROSITE - Protein motifs • PRODOM - A database based on PSI-BLAST PSSMs. • InterPro - A database that brings together many of the above databases so that you can search them all at once. • Others.
CDD • Conserved domain database - linking all of this information together. • Consists of SMART, Pfam, and COGs (KOGs). Searchable directly - automatically searched by BLAST. • Linked to CDART - allows the identification of proteins with a similar domain architecture.
Bottom line about databases • Databases are useful tools for assigning possible functions. • Be careful about annotations. • Example - proteins in the same COG can be orthologs that have evolved different functions. • Many annotations are not backed up by experimental data. • Some databases are automated and have not been checked for accuracy.
Annotation cannot be guaranteed without experimental evidence. • Functional genomics
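One practical way to act on this caution, assuming the annotations carry Gene Ontology evidence codes, is to separate experimentally supported annotations from those inferred by similarity or assigned automatically. The records below are hypothetical, and the function name is illustrative.

```python
# GO evidence codes for experimentally supported annotations.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotation records: (gene, GO term label, evidence code).
annotations = [
    ("geneA", "serine-type endopeptidase activity", "IDA"),
    ("geneB", "serine-type endopeptidase activity", "IEA"),
    ("geneC", "proteolysis", "IMP"),
    ("geneD", "proteolysis", "ISS"),
]

def split_by_evidence(records):
    """Split records into (experimentally supported, inferred only)."""
    experimental, inferred = [], []
    for record in records:
        target = experimental if record[2] in EXPERIMENTAL_CODES else inferred
        target.append(record)
    return experimental, inferred

experimental, inferred = split_by_evidence(annotations)
print("Experimental:", [gene for gene, _, _ in experimental])
print("Inferred only:", [gene for gene, _, _ in inferred])
```

Here IEA (inferred from electronic annotation) and ISS (inferred from sequence similarity) mark annotations that no experiment has confirmed, so they are the ones to treat as hypotheses.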