
Genome annotation and search for homologs



Presentation Transcript


  1. Genome annotation and search for homologs

  2. Genome of the week • Discuss the diversity and features of selected microbial genomes. • Link to the paper describing the genome on the MMG433 website.

  3. Bacillus subtilis • Gram-positive soil bacterium • Genetically tractable, well-studied • Developmental pathways (sporulation, genetic competence) • Industrial and agricultural importance • 4.2 Mb genome (sequence completed 1997)

  4. B. subtilis genome features • 4,106 protein-coding genes • 10 rRNA operons • Nearly 50% of the genome consists of paralogous genes. • 77 ABC transporter binding proteins • 10 phage-like regions - evidence of horizontal transfer. These appear as low-GC regions in the genome. • 18 sigma factors - initiate transcription. • 34 two-component regulatory systems.

  5. Sequencing of genomes • Hierarchical or contig-based sequencing • Clone smaller segments of the genome. • Labor intensive, slow • Not needed for sequencing microbial genomes • Shotgun method • Randomly clone and sequence 1.5-2 kb fragments of DNA to 5-10 fold coverage. • Computationally intensive.
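
The "5-10 fold coverage" figure on this slide can be turned into a rough read count. The sketch below is illustrative only: the 500 nt usable read length is an assumption (the slide gives the cloned fragment size, not the read length), and the fraction-covered estimate uses the standard Poisson (Lander-Waterman) approximation, which ignores cloning bias and repeats.

```python
import math


def reads_needed(genome_bp: int, read_bp: int, coverage: float) -> int:
    """Smallest number of reads N with N * read_bp / genome_bp >= coverage."""
    return math.ceil(coverage * genome_bp / read_bp)


def expected_fraction_covered(coverage: float) -> float:
    """Poisson estimate of the fraction of bases sequenced at least once."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    genome_bp = 4_200_000  # B. subtilis, about 4.2 Mb (from the slides)
    read_bp = 500          # ASSUMPTION: usable read length, not given in the slides
    for coverage in (5, 10):
        n = reads_needed(genome_bp, read_bp, coverage)
        frac = expected_fraction_covered(coverage)
        print(f"{coverage}x coverage: {n} reads, ~{frac:.4%} of bases covered")
```

Under these assumptions, even 5-fold coverage leaves a small fraction of the genome unsampled, which is one reason shotgun projects aim for the higher end of the range and then close the remaining gaps separately.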

  6. Sequence assembly • Focus of this week’s lab exercise • Algorithms to align and edit multiple sequences • Phrap and Consed • Sequencher (commercial) for lab.

  7. Finding functional features in a microbial genome. • Genes • rRNA operons, tRNAs - programs available • Origin of replication - oriC, near the dnaA gene • Promoters • Transcription terminators • Horizontally transferred DNA • GC content
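
GC content is the simplest of these features to compute, and regions whose GC content departs from the genome average are candidates for horizontally transferred DNA. The following is a minimal sketch, not a published method: the window size, step, and the 0.10 drop threshold are all arbitrary assumptions chosen for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int):
    """Return (start, gc_fraction) for every full window along seq."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def low_gc_windows(seq: str, window: int, step: int, drop: float = 0.10):
    """Windows whose GC fraction is more than `drop` below the overall value.

    ASSUMPTION: `drop` is an illustrative cutoff, not a calibrated one.
    """
    overall = gc_fraction(seq)
    return [(s, gc) for s, gc in windowed_gc(seq, window, step) if gc < overall - drop]


if __name__ == "__main__":
    # Toy sequence: a GC-rich background with an AT-rich island in the middle.
    toy = "GC" * 50 + "AT" * 50 + "GC" * 50
    print(low_gc_windows(toy, window=100, step=100))
```

On a real genome the window would be in the kilobase range, and a flagged window is only a hint: it still needs supporting evidence such as phage-like genes or unusual codon usage.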

  8. Gene finding • Easy relative to eukaryotic genomes • No introns • 80-90% of the DNA encodes genes, versus about 5% in eukaryotes. • Find open reading frames (ORF scanning). • Scan from start codons (mostly ATG, but not always) to stop codons. The minimum ORF length considered is usually about 300 nt. • Additional features. A good Shine-Dalgarno sequence (ribosome binding site, AGGAGG) supports a prediction but is not essential. • Similarity matches to genes in other genomes. • Effective way of searching for ORFs.
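
ORF scanning as described on this slide can be sketched in a few lines. This is a simplified illustration, not how Glimmer or the NCBI ORF finder work internally: it takes the first start codon after each stop in each frame, treats ATG, GTG, and TTG as possible starts (an assumption based on "mostly ATG, not always"), and checks for an exact AGGAGG match within an assumed 20 nt upstream window.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # ASSUMPTION: alternative starts allowed
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_len: int = 300):
    """Yield (start, end) for ORFs in the three frames of one strand.

    Coordinates are 0-based, half-open, and include the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # first start codon since the last stop
            elif codon in STOPS:
                if start is not None and i + 3 - start >= min_len:
                    yield (start, i + 3)
                start = None


def find_orfs(seq: str, min_len: int = 300):
    """Return (strand, start, end) for both strands, in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = [("+", s, e) for s, e in orfs_in_strand(seq, min_len)]
    found += [("-", n - e, n - s) for s, e in orfs_in_strand(reverse_complement(seq), min_len)]
    return found


def has_shine_dalgarno(seq: str, orf_start: int, upstream: int = 20) -> bool:
    """True if AGGAGG occurs within `upstream` nt before a forward-strand ORF start."""
    return "AGGAGG" in seq.upper()[max(0, orf_start - upstream):orf_start]


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "C"
    print(find_orfs(toy, min_len=30))
```

A scanner like this over-predicts, so in practice its output is filtered with the additional evidence listed on the slide, especially similarity matches to genes in other genomes.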

  9. Gene finding programs • Genefinder, Grail, Glimmer (TIGR), etc. • ORF finder from NCBI • Will use in a future lab exercise and in the final annotation project

  10. Annotating genes • How to assign preliminary functions to genes. • Automated programs. • Similarity searches • BLAST and PSI-BLAST • COGs, Pfam, CDD, other databases • Only 50-75% of genes will have a predicted function. Some have no known homologs in any other genome. • Functional characterization (individual genes) • Gene knockouts • Overexpression

  11. In most cases computer annotation will only be able to predict function - NOT assign function. • The biological function of many genes has not been determined, even in model systems. • As genomic characterization of gene function continues, more and more computer-generated annotations will be correct.

  12. Molecular function - activity of a protein at the molecular level. • Examples would be ATPase, metal binding, converting glucose-6-phosphate to fructose-6-phosphate. • Biological function - cellular role of the protein. • Examples would be translation initiation, adapting to environmental changes, glycolysis.

  13. Homologs, orthologs, and paralogs. • Homologous genes are genes that share a common evolutionary ancestor. • Orthologs are genes found in different organisms that arose from a common ancestor through speciation. • Paralogs are genes found in the same organism that arose from a common ancestor through gene duplication. The duplication could have occurred in that species or earlier.

  14. Using BLAST to predict gene function. • BLAST the predicted protein sequence against the non-redundant database. • Determine the best hits. • Automated annotation programs will often assign the function of the best hit to the gene being searched. • Automated annotations must be confirmed manually.

  15. Assessment of BLAST output • What is the level of identity and similarity of the best hits? • The higher the identity, the more likely the proteins have similar functions. • Does the area of similarity extend over the entire protein, or just part of it? (fig. 2.19) • Often you will find hits to only part of your protein, a GTP-binding domain for example. • Have any of the best hits been characterized experimentally? • With so many microbial genomes sequenced, chances are you will have to search extensively to find a hit that has been characterized experimentally.
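
The first two questions on this slide (how much identity, and over how much of the protein) can be checked mechanically from BLAST tabular output. The sketch below assumes the default 12-column tabular format of BLAST+ (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The query length must be supplied separately because it is not among those default columns. The 0.8 coverage cutoff and the example rows are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    pident: float
    qstart: int
    qend: int
    evalue: float
    bitscore: float


def parse_tabular(text: str):
    """Parse default 12-column BLAST tabular lines into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]),
                        float(f[10]), float(f[11])))
    return hits


def query_coverage(hit: Hit, query_len: int) -> float:
    """Fraction of the query spanned by this alignment."""
    return (abs(hit.qend - hit.qstart) + 1) / query_len


def classify(hit: Hit, query_len: int, full_cov: float = 0.8) -> str:
    """ASSUMPTION: 0.8 is an arbitrary cutoff for calling a hit full-length."""
    if query_coverage(hit, query_len) >= full_cov:
        return "full-length"
    return "partial (possible domain-only match)"


if __name__ == "__main__":
    # Made-up rows for a hypothetical 366-residue query.
    example = (
        "q1\tsubjA\t62.5\t360\t135\t0\t3\t362\t1\t360\t1e-150\t520\n"
        "q1\tsubjB\t45.0\t120\t66\t0\t10\t129\t5\t124\t2e-20\t95.5\n"
    )
    for hit in parse_tabular(example):
        print(hit.subject, hit.pident, round(query_coverage(hit, 366), 2), classify(hit, 366))
```

The third question, whether a hit has been characterized experimentally, cannot be answered from the table and still requires reading the database record and the literature behind it.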

  16. Databases used in protein function analysis. • COGs - Clusters of Orthologous Groups - proteins that are best hits against each other when comparing two genomes. • Pfam - Protein families - more likely to identify conserved domains than full-length proteins. • TIGRfam - strives to find equivalogs - “proteins that are conserved with respect to FUNCTION since their last common ancestor”
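
The "best hits against each other" idea behind COGs can be illustrated with a reciprocal-best-hit check between two genomes. This is a simplified sketch: the real COG construction combines best hits across at least three lineages, and the gene names and bit scores below are invented.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) -> bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit in genome B and a is b's best hit in genome A."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical genes a1, a2 (genome A) and b1, b2 (genome B).
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
    b_vs_a = {("b1", "a1"): 198, ("b2", "a1"): 140, ("b2", "a2"): 80}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```

In the toy data, a2's best hit is b2, but b2's best hit is a1, so the pair is not reciprocal and a2 is left without a putative ortholog.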

  17. Databases used in protein function analysis. • SMART - Simple Modular Architecture Research Tool. • PROSITE - Protein motifs • CDD - Conserved Domain Database - linked to BLAST; includes Pfam, SMART, and COGs. • InterPro - A database that brings together many of the above databases so that you can search them all at once.

  18. Bottom line on databases • Databases are useful tools for assigning possible functions. • Be careful about annotations. • Example - proteins in the same COG can be orthologs that have evolved different functions. • Many annotations are not backed up by experimental data. • Some databases are automated and have not been checked for accuracy.

  19. Examples YqeH and DnaA

  20. Protein function • Molecular function • YqeH - GTPase • DnaA - ATPase, DNA binding • Biological function • YqeH - Unknown • DnaA - DNA replication initiation
