
Genome analysis and annotation

Genome analysis and annotation. Which sequences code for proteins and structural RNAs? What is the function of the predicted gene products? Can we link genotype to phenotype? (e.g., Which genes are turned on, and when? Why do two strains of the same pathogen vary in their pathogenicity?)

sboozer

Presentation Transcript


  1. Genome analysis and annotation

  2. Genome Annotation
     • Which sequences code for proteins and structural RNAs?
     • What is the function of the predicted gene products?
     • Can we link genotype to phenotype? (e.g., Which genes are turned on, and when? Why do two strains of the same pathogen vary in their pathogenicity?)
     • Can we trace the evolutionary history of an organism from its genomic sequence and genome organization? The evolutionary history of a pathway?

  3. Gene finding
     • Begins with the prediction of gene models through:
       1) Identification of Open Reading Frames (ORFs)
       2) Examination of base composition differences between coding and non-coding regions
       3) Computational gene recognition (exons, introns, exon-intron boundaries) using a variety of gene-finding algorithms (GLIMMER, GRAIL, FGENEH, GENSCAN, GLIMMER-HMM, etc.)
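
The first two steps above (ORF identification and base composition) can be illustrated with a minimal Python sketch. It is not taken from the presentation and is not how GLIMMER, GENSCAN, or the other listed tools work internally; the function names, the `min_codons` cutoff, and the use of plain GC fraction as the composition measure are all assumptions made for illustration.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement so the other strand can be scanned too."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_orfs_one_strand(seq: str, min_codons: int = 30) -> List[Tuple[int, int, int]]:
    """Scan the three reading frames of one strand.

    Returns (frame, start, end) tuples; start is the index of the ATG and
    end is exclusive and includes the stop codon. min_codons is a
    hypothetical cutoff on the number of codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # close the candidate and keep scanning
    return orfs


def gc_content(seq: str) -> float:
    """Fraction of G and C bases, a crude base-composition measure."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    demo = "CCATGAAATTTGGGTAACC"
    for frame, start, end in find_orfs_one_strand(demo, min_codons=3):
        print(frame, start, end, round(gc_content(demo[start:end]), 2))
```

To cover both strands, the same scan can be run on `reverse_complement(seq)`. In a eukaryotic genome such a scan is only a starting point, because introns interrupt the reading frame.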

  4. Gene finding (cont'd)
     • Another gene finding/confirmation approach is based on experimental evidence and homology:
       1) Alignment of Expressed Sequence Tags (ESTs) and full-length cDNA sequences with genomic DNA (gDNA)
          Advantages: gene discovery, proof of expression, training data for gene finders
          Disadvantages: disproportionate representation (highly expressed genes are overrepresented)
       2) Examination of protein translation profiles: peptide sequencing, mass spectrometry, etc.

  5. Gene finding (cont'd)
     The gene finding task comes with various levels of difficulty in different organisms.
     • Much more difficult in eukaryotic genomes, where it can become a major focus of activity in the annotation phase:
       • Low gene density (1-200 kb per gene)
       • Presence of repeats
       • Introns and exons in most eukaryotic genes, plus alternative splicing
       • Inaccurate predictions and false positives are common
     • Relatively easy in bacterial and archaeal genomes, mostly due to:
       • High gene density (1 kb per gene on average)
       • Short intergenic regions
       • Lack of introns

  6. Repeats complicate genome assembly and gene finding (Example: Schistosoma mansoni genome)
     [Figure: repeat annotation of an S. mansoni genomic region. Labeled hits: SmR2A (92%, 91%, 89%, and 95% id.), SR2A (90% id.), SjR2-like (85% id.), and two unknown repeats. Reference elements shown: Sm SR2 sub-family A non-LTR retrotransposon (SmR2A) and Sm SR2 sub-family B non-LTR retrotransposon, with the labels "94% id." and "53% id."]

  7. Comparing genomes can help with gene finding
     [Figure: nucleotide sequence conservation between S. japonicum and S. mansoni, plotted with mVISTA]

  8. Sequence homology at exons
     • S. mansoni as reference. Conclusion: the S. japonicum sequence can be used to find exons in S. mansoni.
     • S. japonicum as reference. Conclusion: the S. mansoni sequence can be used to find exons in S. japonicum.
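
The idea behind a conservation plot like the ones on slides 7 and 8 can be sketched as a sliding-window percent identity over a pairwise alignment. This is only an illustration under stated assumptions, not mVISTA's actual implementation: the two input strings are assumed to be already aligned (equal length, gaps as "-"), and the function names, window size, step, and identity threshold are hypothetical choices.

```python
from typing import List, Tuple


def window_identity(aln_a: str, aln_b: str,
                    window: int = 100, step: int = 50) -> List[Tuple[int, float]]:
    """Percent identity per window over two pre-aligned sequences.

    A column counts as a match only if both sequences carry the same
    non-gap character. Returns (window_start, percent_identity) pairs.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for start in range(0, len(aln_a) - window + 1, step):
        columns = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for a, b in columns if a == b and a != "-")
        profile.append((start, 100.0 * matches / window))
    return profile


def conserved_windows(profile: List[Tuple[int, float]],
                      min_identity: float = 70.0) -> List[int]:
    """Start positions of windows at or above the (assumed) identity threshold."""
    return [start for start, pid in profile if pid >= min_identity]


if __name__ == "__main__":
    # Toy alignment: a conserved block followed by a diverged block.
    a = "ACGTACGT" + "AAAAAAAA"
    b = "ACGTACGT" + "CCCC-CCC"
    profile = window_identity(a, b, window=8, step=8)
    print(profile)
    print(conserved_windows(profile))
```

Windows that stay above the threshold are candidate exons, which is the reasoning behind the two conclusions on slide 8; conserved non-coding elements can also score highly, so such peaks are evidence rather than proof.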

  9. Case study: Gene finding in the eukaryotic parasite Schistosoma mansoni

  10. The TIGR Gene Modeling Pipeline
      Pipeline stages: Repeat Masking → Sequence Homology Searching → Ab-initio Gene Prediction → Combining Evidence → Final Gene Structures
      • Prior to gene discovery efforts, repeats must be identified and masked.
      • Repeats tend to confuse ab-initio gene finders.
      • Fragments of transposons are often mistaken for protein-coding exons of genes.
      • By masking repeats, we increase the signal-to-noise ratio.
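
As a rough illustration of the masking step, the sketch below replaces annotated repeat intervals with "N" (hard masking) or lowercase letters (soft masking). It assumes the repeat coordinates are already known, for example from a repeat-finding tool, and are given as 0-based, end-exclusive intervals; the function names and this coordinate convention are assumptions, not part of the TIGR pipeline as described.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based start, exclusive end (assumed convention)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching repeat hits into non-overlapping spans."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_sequence(seq: str, repeats: List[Interval], soft: bool = False) -> str:
    """Hard-mask repeats with 'N', or soft-mask them with lowercase letters."""
    chars = list(seq.upper())
    for start, end in merge_intervals(repeats):
        for i in range(max(0, start), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


def masked_fraction(seq: str, repeats: List[Interval]) -> float:
    """Fraction of the sequence covered by repeat intervals."""
    if not seq:
        return 0.0
    covered = sum(min(end, len(seq)) - max(0, start)
                  for start, end in merge_intervals(repeats)
                  if min(end, len(seq)) > max(0, start))
    return covered / len(seq)


if __name__ == "__main__":
    genome = "ACGTACGTAC"
    hits = [(2, 5), (4, 7)]  # two overlapping hypothetical repeat hits
    print(mask_sequence(genome, hits))             # hard-masked
    print(mask_sequence(genome, hits, soft=True))  # soft-masked
    print(masked_fraction(genome, hits))
```

Hard masking hides the repeat from every downstream tool, while soft masking keeps the bases available to tools that choose to read lowercase sequence.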

  11. Construction of an S. mansoni Repeat Library
      • Catalog known Schistosoma transposable elements (TEs), particularly retrotransposons: SR1, SR2, Sinbad, fugitive, salmonid, boudicca, saci, cercyon
      • De novo construction of a repeat library using RepeatScout (Price et al., 2005)
      • 1,125 repeat families found

  12. Genome Masking Statistics

  13. The TIGR Gene Modeling Pipeline
      • augustus:
        • provided by Mario Stanke
        • predicted 9,208 genes
      • glimmerHMM:
        • provided by Ela Pertea
        • predicted 25,890 genes

  14. The TIGR Gene Modeling Pipeline
      • Spliced protein alignments using AAT (Huang, 1997)
        • Searched:
          • TIGR's internal non-redundant protein db
          • Custom protein databases:
            • Caenorhabditis elegans and C. briggsae
            • Brugia malayi
        • Genewise predictions for best protein alignments
      • Spliced transcript alignments
        • Alignments (blat, sim4) of S. mansoni ESTs and cDNAs, followed by alignment assembly using the Program to Assemble Spliced Alignments (PASA)
        • AAT alignments of S. japonicum ESTs

  15. The TIGR Gene Modeling Pipeline
      • EVidenceModeler (EVM) combines predicted exons and alignments into weighted consensus gene structures.
      • Evidence types ranked by weight: PASA transcript alignment assemblies; Genewise protein alignments; gene predictions and AAT alignments.
      [Figure: example gene structure running from Start to End, with numeric scores on the candidate exons]
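
The notion of a weighted consensus can be illustrated with a deliberately simplified sketch: each evidence source votes for the bases its exons cover, votes are scaled by a per-source weight, and runs of bases above a threshold are reported as consensus exons. This is not EVM's algorithm, which is considerably more sophisticated. The weights, the threshold, the source labels, and the assumption that PASA ranks highest and ab-initio predictions lowest are all illustrative choices, not values from the presentation.

```python
from typing import Dict, List, Tuple

Evidence = Tuple[str, int, int]  # (source, start, exclusive end), 0-based


def consensus_exons(evidence: List[Evidence], weights: Dict[str, float],
                    length: int, min_score: float) -> List[Tuple[int, int]]:
    """Per-base weighted voting; returns runs of bases scoring >= min_score."""
    scores = [0.0] * length
    for source, start, end in evidence:
        weight = weights[source]
        for i in range(max(0, start), min(end, length)):
            scores[i] += weight

    exons = []
    run_start = None
    for i, score in enumerate(scores):
        if score >= min_score and run_start is None:
            run_start = i                      # a supported run begins
        elif score < min_score and run_start is not None:
            exons.append((run_start, i))       # the run ends before base i
            run_start = None
    if run_start is not None:
        exons.append((run_start, length))      # run extends to the sequence end
    return exons


if __name__ == "__main__":
    # Hypothetical weights: transcript evidence > protein evidence > predictions.
    weights = {"pasa": 10.0, "genewise": 5.0, "abinitio": 1.0}
    evidence = [
        ("pasa", 10, 20),      # transcript-supported exon
        ("abinitio", 15, 30),  # prediction overlapping and extending it
        ("abinitio", 40, 50),  # prediction with no other support
    ]
    print(consensus_exons(evidence, weights, length=60, min_score=5.0))
    print(consensus_exons(evidence, weights, length=60, min_score=1.0))
```

With the stricter threshold only the transcript-supported exon survives; with the looser one, unsupported ab-initio exons are accepted as well, which shows how the choice of weights shapes the final gene structures.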

  16. Evidence View
      [Figure: evidence tracks displayed for a genomic region]
      • S. mansoni PASA assemblies
      • S. japonicum EST alignments
      • Genewise alignments (predictions)
      • nr protein alignments
      • Caenorhabditis sp. protein alignments
      • Brugia malayi protein alignments
