Genome Annotation Continued

salim

Presentation Transcript


  1. Genome Annotation Continued • This week’s lab. • Genome annotation - web-based databases for assigning gene function.

  2. Last week’s lab • E-value • Score • Blastx • Taxonomy
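The relationship between the BLAST score and the E-value from last week's lab can be sketched with the standard bit-score formula, E = m × n × 2^(−S'), where S' is the bit score and m and n are the effective lengths of the query and the database. The function name and the example lengths below are illustrative assumptions, not values from the lab.

```python
def expected_hits(bit_score, query_length, database_length):
    """E-value for a given bit score: E = m * n * 2**(-S').

    query_length (m) and database_length (n) stand in for the effective
    lengths that BLAST computes internally; here they are passed in directly.
    """
    return query_length * database_length * 2 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 300-residue query against a 1,000,000-residue database.
    for score in (20, 40, 60):
        print(score, expected_hits(score, 300, 1_000_000))
```

Each additional bit of score halves the E-value, and searching a larger database raises the E-value for the same score.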

  3. Lab • Sequence assembly and analysis • Assemble individual sequence reads • Phred = 30 - good or bad?
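To answer the "Phred = 30" question, recall that a Phred quality score Q is defined from the base-calling error probability P as Q = −10 · log10(P). The short sketch below converts in both directions; the function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

A score of 30 corresponds to an expected error of 1 in 1,000 base calls (99.9% accuracy), which is generally treated as good quality.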

  4. Linking Protein Sequence, Structure, and Function • Protein: protein sequences. • Domains: CDD, conserved functional domains in proteins represented by a PSSM. • Tools: PSI-BLAST, RPS-BLAST, CDART. • 3D Domains. • Source: NCBI Field Guide.

  5. Position Specific Substitution Rates • Weakly conserved serine • Active site serine

  6. Position Specific Score Matrix (PSSM)

               A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
     206 D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
     207 G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
     208 V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
     209 I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
     210 S    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
     211 S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
     212 C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
     213 N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
     214 G    -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
     215 D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
     216 S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
     217 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
     218 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
     219 P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
     220 L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
     221 N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
     222 C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
     223 Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
     224 A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

     • Serine is scored differently in these two positions • Active site nucleophile
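As a minimal sketch of how a PSSM is read, the code below looks up scores in the two serine rows of the matrix above (positions 211 and 216, values copied from the slide). Reading position 216 as the active-site serine and 211 as the weakly conserved one is an inference from the scores, not something the slide labels by number; the function name is illustrative.

```python
# Column order used by the PSSM on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows 211 (S) and 216 (S), copied from the slide's matrix.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Score for placing `residue` at alignment `position`."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]


if __name__ == "__main__":
    for position in (211, 216):
        scores = {aa: pssm_score(position, aa) for aa in "SAT"}
        print(position, scores)
```

At position 211, serine, alanine, and threonine score almost equally (4, 4, 3), so a substitution costs little. At position 216, serine scores 7 while alanine and threonine score −2, so the matrix penalizes any replacement. A general substitution matrix such as BLOSUM62 would give serine the same score at both positions.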

  7. Hidden Markov Models • A statistical model that can be applied to any system that can be represented as a series of discrete states. • Applies to protein and nucleotide sequences. • Can be thought of as much like the PSSMs that PSI-BLAST builds after several iterations. • Are used in gene finding and protein profile analysis.
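To make the idea of discrete hidden states concrete, here is a toy two-state HMM over nucleotides, scored with the forward algorithm. The states ("coding", "noncoding") and every probability are made-up illustrative values. Real gene finders and the profile HMMs behind Pfam and TIGRFAMs use richer architectures (for example match, insert, and delete states), so treat this only as a sketch of the principle.

```python
# Toy model: all numbers below are hypothetical.
STATES = ("coding", "noncoding")
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}
EMIT_P = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward_probability(sequence, states=STATES, start_p=START_P,
                        trans_p=TRANS_P, emit_p=EMIT_P):
    """Probability of `sequence` under the HMM, summed over all state paths."""
    # alpha[s] = P(symbols seen so far, current hidden state is s)
    alpha = {s: start_p[s] * emit_p[s][sequence[0]] for s in states}
    for symbol in sequence[1:]:
        alpha = {
            s: emit_p[s][symbol]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


if __name__ == "__main__":
    for seq in ("GCGC", "ATAT"):
        print(seq, forward_probability(seq))
```

The analogy with a PSSM is that each state carries its own emission probabilities, much as each PSSM row carries its own scores. The HMM additionally models how likely it is to move from one state to the next.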

  8. Uses of HMMs in protein function analysis • TIGRFAMs: strive to annotate the function of an entire protein. • Pfams: strive to annotate domains of proteins.

  9. Homologs, orthologs, and paralogs • Homologous genes are genes that share a common evolutionary ancestor. • Orthologs are genes found in different organisms that arose from a common ancestor by speciation. • Paralogs are genes found in the same organism that arose from a common ancestor by duplication. The duplication could have occurred in the species or earlier, and paralogs have often diverged in function.

  10. Orthologs may differ in function!

  11. TIGRFAM • Curated such that proteins in a TIGRFAM should have the same function if they are equivalogs. • Proteins share sequence similarity over their entire length. • Equivalog family = all proteins whose function has been conserved since their last common ancestor. • Superfamily = all proteins that share homology but may have different biological functions. • Subfamily = an incomplete set of homologous proteins that may have diverse biological functions.

  12. Pfam • More likely to describe a protein domain than a whole family. • Pfams will not overlap. • Cross-listed on the TIGRFAM page. • ~70% of proteins in Swiss-Prot have a Pfam match.

  13. COGs • Clusters of Orthologous Groups. • Built from pairwise comparisons of orthologs from many bacterial genomes. • Membership suggests function only (book example).

  14. Gene Ontology (GO) • “The goal of the Gene Ontology project is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.” • Biological process, Molecular function, Cellular component

  15. Literature Curation • Saccharomyces Genome Database (SGD), for example. • Manual curation of the literature for experimental evidence linking function to annotation.

  16. Additional databases • SMART - Simple Modular Architecture Research Tool. • PROSITE - Protein motifs • PRODOM - A database based on PSI-BLAST PSSMs. • InterPro - A database that brings together many of the above databases so that you can search them all at once. • Others.

  17. CDD • Conserved domain database - linking all of this information together. • Consists of SMART, Pfam, and COGs (KOGs). Searchable directly - automatically searched by BLAST. • Linked to CDART - allows the identification of proteins with a similar domain architecture.

  18. Bottom line about databases • They are useful tools for assigning possible functions. • Be careful about annotations. • Example: proteins in the same COG can be orthologs that have evolved different functions. • Many annotations are not backed up by experimental data. • Some databases are automated and have not been checked for accuracy.

  19. Annotation cannot be guaranteed without experimental evidence. • Functional genomics
