
3. Genome Annotation: Gene Prediction (II)


aysel

Presentation Transcript


  1. 3. Genome Annotation: Gene Prediction (II)

  2. Gene Prediction: Computational Challenge • Gene: A sequence of nucleotides coding for protein • Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

  3. Eukaryotic gene finding • On average, a vertebrate gene is about 30 kb long • The coding region takes up about 1 kb • Exon sizes vary from a few tens of bases to kilobases • An average 5’ UTR is about 750 bp • An average 3’ UTR is about 450 bp, but both can be much longer

  4. Exons and Introns • In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns) • This makes computational gene prediction in eukaryotes even more difficult • Prokaryotes don’t have introns - Genes in prokaryotes are continuous

  5. Central Dogma and Splicing [Figure: a gene made of exon1, intron1, exon2, intron2, exon3 going through transcription, splicing, and translation; exon = coding, intron = non-coding]

  6. Gene Structure

  7. Splicing Signals Exons are interspersed with introns and typically flanked by GT and AG

  8. Splice site detection [Figure: donor site, nucleotide frequency (%) by position, 5’ to 3’]

  9. Consensus splice sites Donor: 7.9 bits Acceptor: 9.4 bits

  10. Promoters • Promoters are DNA segments upstream of transcripts that initiate transcription • The promoter attracts RNA polymerase to the transcription start site [Figure: promoter located 5’ of the transcribed region]

  11. Splicing mechanism (http://genes.mit.edu/chris/)

  12. Splicing mechanism • Adenine recognition site marks intron • snRNPs bind around adenine recognition site • The spliceosome thus forms • Spliceosome excises introns in the mRNA

  13. Two Approaches to Eukaryotic Gene Prediction • Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns). • Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.

  14. Similarity-Based Approach: Metaphor in Different Languages If you could compare the day’s news in English, side-by-side to the same news in a foreign language, some similarities may become apparent

  15. Distinguishing genes from non-coding regions
      [Figure: multiple alignment of twelve Drosophila species around a splice site]
      Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
      Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
      Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
      Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
      Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
      Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
      Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
      Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
      Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
      Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
      Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
      Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
      • Protein-coding genes have specific evolutionary constraints
        • Gaps are multiples of three (preserve amino acid translation)
        • Mutations are largely 3-periodic (silent codon substitutions)
        • Specific triplets exchanged more frequently (conservative substs.)
        • Conservation boundaries are sharp (pinpoint individual splicing signals)
      • Encode as ‘evolutionary signatures’
        • Computational test for each of them
        • Combine and score systematically
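      The first constraint above (gaps in coding regions come in multiples of three) can be turned into a simple computational test. The following is a minimal sketch, not the method used in the original study: the function names and the toy alignment rows are hypothetical, and a real test would look at gaps across all species in the alignment rather than one row at a time.

```python
import re


def gap_lengths(aligned_row: str) -> list[int]:
    """Return the length of every run of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def frame_preserving_fraction(aligned_row: str) -> float:
    """Fraction of gap runs whose length is a multiple of three.

    A row with no gaps is treated as fully frame-preserving (1.0).
    """
    lengths = gap_lengths(aligned_row)
    if not lengths:
        return 1.0
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)


# Toy rows (hypothetical, not taken from the alignment on the slide).
coding_like = "ATGGCT---AAAGGT------TGA"      # gaps of 3 and 6
noncoding_like = "ATGGCT--AAAGGT-----TGA"     # gaps of 2 and 5

print(gap_lengths(coding_like), frame_preserving_fraction(coding_like))
print(gap_lengths(noncoding_like), frame_preserving_fraction(noncoding_like))
```

      A high fraction is only weak evidence on its own; the slide's point is that several such signatures are scored and combined.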

  16. Signature 1: Reading frame conservation (RFC)
                     Genes    Intergenic    Separation
      Mutations      30%      58%           2-fold
      Gaps           1.3%     14%           10-fold
      Frameshifts    0.14%    10.2%         75-fold
      [Figure: RFC values in two columns, paired row by row: 100%/60%, 100%/55%, 100%/90%, 100%/40%, 100%/60%, 100%/100%, 100%/20%, 100%/30%, 100%/40%, 100%/60%]

  17. Results in yeast

  18. Signature 2: Distinct patterns of codon substitution • Codon substitution patterns specific to genes • Genetic code dictates substitution patterns • Amino acid properties dictate substitution patterns [Figure: two substitution matrices, Genes and Intergenic; x-axis: codon observed in species 1, y-axis: codon observed in species 2]

  19. Codon Substitution Matrix (CSM) [Figure: human vs. mouse codon substitution matrix, with codons grouped by amino acid class: aliphatic, polar, negative, positive, aromatic]

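  20. Gene structure in eukaryotes [Figure labels, 5’ to 3’: promoter, transcription start site, untranslated regions, start codon, initial exon, exons, GT and AG donor and acceptor sites, final exon, stop codon, transcription stop site; the transcribed region spans from the transcription start site to the transcription stop site]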

  21. Gene Prediction and Motifs • Upstream regions of genes often contain motifs that can be used for gene prediction [Figure: positions -35, -10, 0, 10 around the transcription start site; TTCCAA at -35; TATACT (Pribnow box) at -10; GGAGG ribosomal binding site; coding region from ATG to STOP]

  22. Ribosomal Binding Site

  23. Splicing Signals • Try to recognize the location of splicing signals at exon-intron junctions • This has yielded a weakly conserved donor splice site and acceptor splice site • Profiles for these sites are still weak, which lends the problem to Hidden Markov Model (HMM) approaches that capture the statistical dependencies between sites

  24. GenScan Model • States correspond to different functional units of a genome (promoter region, intron, exon, ...) • The states for introns and exons are subdivided according to “phase” (the three reading frames) • There are two symmetric submodules for the forward and backward strands • Performance: about 80% of exons are detected, but if a gene has more than one exon, the probability of detection decreases rapidly
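      To make the idea of "states as functional units" concrete, here is a minimal sketch of Viterbi decoding on a two-state toy HMM. It is far simpler than the GenScan model described above (no phases, no strands, no promoter states, no explicit length distributions), and every probability below is an invented assumption chosen only so that GC-rich stretches decode as "exon" and AT-rich stretches as "intron".

```python
import math


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty seq (log space)."""
    # Best log-probability of any path ending in each state at position 0.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    backpointers = []
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this position.
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters (assumptions for illustration only, not GenScan's values).
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print(viterbi("GCGCGCATATAT", STATES, START, TRANS, EMIT))
```

      Splitting each exon and intron state by phase, as the slide describes, would multiply the number of states but leave the decoding procedure itself unchanged.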

  25. Donor and Acceptor Sites: GT and AG dinucleotides • The beginning and end of exons are signaled by donor and acceptor sites that usually have GT and AG dinucleotides • Detecting these sites is difficult, because GT and AG appear very often [Figure: exon 1, donor site (GT), intron, acceptor site (AG), exon 2]
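      A short sketch shows why the dinucleotides alone are a weak signal: even in a tiny sequence, pairing every GT with every downstream AG produces several candidate introns. The function names, the example sequence, and the `min_len` threshold are hypothetical; real introns are far longer than the toy minimum used here.

```python
def find_dinucleotide(seq: str, dinuc: str) -> list[int]:
    """0-based start positions of every occurrence of dinuc in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]


def candidate_introns(seq: str, min_len: int = 4) -> list[tuple[int, int]]:
    """All half-open (start, end) spans that begin with GT and end with AG.

    min_len is an illustrative lower bound on the span length.
    """
    donors = find_dinucleotide(seq, "GT")
    acceptors = find_dinucleotide(seq, "AG")
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


seq = "CAGGTAAGTCTTTCAGGT"
print(find_dinucleotide(seq, "GT"))   # candidate donor positions
print(find_dinucleotide(seq, "AG"))   # candidate acceptor positions
print(candidate_introns(seq))         # every GT...AG span of length >= 4
```

      In an 18-base sequence this already yields three candidate spans, which is why gene finders score the surrounding sequence context rather than the dinucleotides alone.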

  26. Donor and Acceptor Sites: Motif Logos Donor: 7.9 bits Acceptor: 9.4 bits (Stephens & Schneider, 1996) (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
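      The bit values quoted for the logos measure how much information a set of aligned sites carries. As a rough sketch, the information at each position can be computed as 2 bits minus the Shannon entropy of the observed nucleotide frequencies, summed over positions. This version assumes equal-length DNA sites over A, C, G, T and omits the small-sample correction used in published sequence logos; the function name and example sites are hypothetical.

```python
import math
from collections import Counter


def information_content(sites: list[str]) -> float:
    """Total information (bits) of aligned, equal-length DNA sites.

    Each column contributes 2 - H, where H is the Shannon entropy of the
    observed base frequencies in that column.
    """
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


# Toy sites: positions 1 and 2 are invariant (GT), position 3 is uniform.
sites = ["GTA", "GTC", "GTG", "GTT"]
print(information_content(sites))
```

      An invariant position contributes the full 2 bits and a position with all four bases equally likely contributes 0, so a total of 7.9 or 9.4 bits corresponds to a signal spread weakly over many positions.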

  27. Popular Gene Prediction Algorithms • GENSCAN: uses Hidden Markov Models (HMMs) • TWINSCAN • Uses both HMM and similarity (e.g., between human and mouse genomes)

  28. Similarity-based gene finding • Alignment of • Genomic sequence and (assembled) EST sequences • Genomic sequence and known (similar) protein sequences • Two or more similar genomic sequences

  29. Expressed Sequence Tags [Figure: workflow] • Start from a cell or tissue • Isolate mRNA and reverse transcribe it into cDNA • Clone the cDNA into a vector to make a cDNA library • Pick a clone and sequence the 5’ and 3’ ends of the cDNA insert • Submit the resulting ESTs to dbEST

  30. Central Dogma and Splicing [Figure: a gene made of exon1, intron1, exon2, intron2, exon3 going through transcription, splicing, and translation; exon = coding, intron = non-coding]

  31. Splicing Sequence Alignment [Figure: potential splicing sites]

  32. Comparing Genomic DNA Against EST [Figure: an EST (codon sequence) aligned against a portion of the genome containing exon1, intron1, exon2, intron2, exon3]

  33. Using Similarities to Find the Exon Structure • Human EST (mRNA) sequence is aligned to different locations in the human genome • Find the “best” path to reveal the exon structure of the human gene [Figure: EST sequence vs. human genome]

  34. Spliced Alignment Problem: Formulation • Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence • Input: Genomic sequence G, target sequence T, and a set of candidate exons B • Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B, where Γ* is the concatenation of all exons from chain Γ
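      The formulation above can be checked on small inputs with a brute-force sketch: try every chain of non-overlapping candidate exons, concatenate them, and keep the chain whose concatenation aligns best to the target. This is deliberately the naive exponential search, not the efficient dynamic program. The scoring scheme (match +1, mismatch -1, gap -1), the half-open (start, end) block encoding, and all names are assumptions made for illustration.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def best_exon_chain(genome, target, blocks):
    """Exhaustively search chains of non-overlapping blocks.

    blocks are half-open (start, end) intervals in genome. Returns
    (best_score, best_chain); the chain is a tuple of blocks in genome order.
    """
    blocks = sorted(blocks)
    best_score, best_chain = float("-inf"), ()
    for k in range(1, len(blocks) + 1):
        for chain in combinations(blocks, k):
            # Skip chains in which consecutive blocks overlap.
            if any(chain[i][1] > chain[i + 1][0] for i in range(k - 1)):
                continue
            spliced = "".join(genome[s:e] for s, e in chain)
            score = global_alignment_score(spliced, target)
            if score > best_score:
                best_score, best_chain = score, chain
    return best_score, best_chain


genome = "AAACCCGGGTTT"
target = "AAAGGG"
blocks = [(0, 3), (0, 4), (3, 6), (6, 9), (9, 12)]
print(best_exon_chain(genome, target, blocks))
```

      With n candidate blocks this examines up to 2^n chains, which is only workable for toy inputs; that cost is the motivation for the dynamic-programming speedups on the later slides.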

  35. Lewis Carroll Example

  36. Spliced Alignment: Speedup

  37. Spliced Alignment: Speedup

  38. Spliced Alignment: Speedup • P(i, j) = max over all blocks B preceding position i of S(end(B), j, B)

  39. EST_genome • http://www.well.ox.ac.uk/~rmott/ESTGENOME/est_genome.shtml

  40. Gene finding based on multiple genomes • Twinscan • PhyloHMM
