
GTCAG ATGAGCAAAGTAGACACTCCAGTAACGCG GTGAGTACATTAA


nassor


Presentation Transcript


  1. Find Gene Structures in DNA. Sequence: GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA. Annotation track (figure): intron, exon, exon, intron, intergene, exon, intergene. States shown: First Exon State, Intron State, Intergene State.

  2. Hidden Markov Model for Gene Finding • Intron, Exon, Intergenic states • Exon frame is encoded in the architecture by defining more states • Exon states have explicit duration density • Intron states have geometric duration • Parameters are trained separately for different levels of GC content (GC content is correlated with the number of genes and with the lengths of exons and introns)
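
The "geometric duration" of an intron state follows from the state having a single self-transition. The sketch below illustrates that relationship; the function names and the self-transition value are hypothetical and are not taken from the slides.

```python
def geometric_duration_pmf(stay_prob: float, d: int) -> float:
    """P(state is occupied for exactly d steps) when the state loops back
    to itself with probability stay_prob and leaves otherwise."""
    if d < 1:
        return 0.0
    return stay_prob ** (d - 1) * (1.0 - stay_prob)


def expected_duration(stay_prob: float) -> float:
    """Mean of the geometric duration distribution: 1 / (1 - stay_prob)."""
    return 1.0 / (1.0 - stay_prob)


if __name__ == "__main__":
    stay = 0.9  # hypothetical self-transition probability
    for d in (1, 2, 5, 10):
        print(d, geometric_duration_pmf(stay, d))
    print("mean duration:", expected_duration(stay))
```

A geometric distribution always has its mode at d = 1, which is one reason a gene finder might give exon states an explicit duration density instead.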

  3. Comparison-based Methods

  4. Cross-species gene finding. Gene structure (figure, 5’ to 3’): Exon1, Intron1, Exon2, Intron2, Exon3. Aligned sequences:
     [human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
     [mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-

  5. Comparison of 1196 orthologous genes (Makalowski et al., 1996) • Sequence identity between genes in human/mouse: • exons: 84.6% • protein: 85.4% • introns: 35% • 5’ UTRs: 67% • 3’ UTRs: 69% • 27 proteins were 100% identical.

  6. Human-mouse homology (figure with panels labeled Human and Mouse)

  7. Not always: HoxA human-mouse

  8. Twinscan • Twinscan is an augmented version of the Genscan HMM. (Figure: intron (I) and exon (E) states annotated with transitions, duration, and emissions, over the example sequence ACUAUACAGACAUAUAUCAU.)

  9. Twinscan Algorithm • Align the two sequences (e.g. from human and mouse) • Mark each human base as gap ( - ), mismatch ( : ), or match ( | ) • New “alphabet”: 4 x 3 = 12 letters, Σ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
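
The marking step above can be sketched as a small function. This is an illustration only: the function name is hypothetical, the two inputs are assumed to be rows of an existing alignment of equal length, and skipping columns where the human row has a gap is an assumption, since the slide does not say how those columns are handled.

```python
def conservation_symbols(human: str, mouse: str) -> list[str]:
    """Turn two aligned rows into the 12-letter Twinscan-style alphabet.

    Each human base is paired with '|' (match), ':' (mismatch) or
    '-' (the mouse row has a gap at that column).
    """
    if len(human) != len(mouse):
        raise ValueError("aligned rows must have the same length")
    symbols = []
    for h, m in zip(human, mouse):
        if h == "-":
            # Assumption: a column with no human base contributes no symbol.
            continue
        if m == "-":
            mark = "-"
        elif m == h:
            mark = "|"
        else:
            mark = ":"
        symbols.append(h + mark)
    return symbols


if __name__ == "__main__":
    print(" ".join(conservation_symbols("ACGGCGACGUGCACGU",
                                        "ACUGUGACGUGCACUU")))
```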

  10. Twinscan Algorithm • Run Viterbi using emissions e_k(b), where b ∈ { A-, A:, A|, …, U| } • Note: emission distributions e_k(b) are estimated from real human/mouse genes • e_I(x|) < e_E(x|): matches are favored in exons • e_I(x-) > e_E(x-): gaps (and mismatches) are favored in introns
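
To make the Viterbi step concrete, here is a minimal two-state sketch (exon E, intron I) over symbols such as "A|" or "G-". Every number in it is a made-up placeholder, not a trained Twinscan parameter, and as a simplification the emission depends only on the conservation mark, not on the base.

```python
import math

STATES = ("E", "I")
# Hypothetical parameters, chosen only so that matches favor exons and
# gaps/mismatches favor introns, as the slide describes.
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {"E": {"|": 0.80, ":": 0.15, "-": 0.05},
        "I": {"|": 0.40, ":": 0.30, "-": 0.30}}


def viterbi(symbols):
    """Return the most probable state path for two-character symbols
    (base followed by conservation mark), computed in log space."""
    if not symbols:
        return []
    score = {k: math.log(START[k]) + math.log(EMIT[k][symbols[0][1]])
             for k in STATES}
    back = []  # back[t][k] = best predecessor of state k at step t + 1
    for sym in symbols[1:]:
        mark = sym[1]
        new_score, pointers = {}, {}
        for k in STATES:
            prev = max(STATES, key=lambda j: score[j] + math.log(TRANS[j][k]))
            new_score[k] = (score[prev] + math.log(TRANS[prev][k])
                            + math.log(EMIT[k][mark]))
            pointers[k] = prev
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):  # trace the pointers back to the start
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print("".join(viterbi(["A|"] * 8 + ["A-"] * 8)))
```

A real gene finder would use many more states (for exon frame, for example) and explicit exon durations; this sketch only shows how the conservation marks shift the decoded path.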

  11. Example
      Human:     ACGGCGACGUGCACGU
      Mouse:     ACUGUGACGUGCACUU
      Alignment: ||:|:|||||||||:|
      Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
      Recall: eE(A|) > eI(A|) and eE(A-) < eI(A-), so this region is a likely exon.

  12. HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

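  13. A Pair HMM for alignments. States: BEGIN, M (emits an aligned pair with probability P(xi, yj)), I (emits xi alone with probability P(xi)), J (emits yj alone with probability P(yj)), END. (Figure: state diagram; the transition-probability labels are not legible in this transcript.)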

  14. Generalized Pair HMMs

  15. Exon GPHMM 1. Choose exon lengths (d, e). 2. Generate alignment of length d + e.

  16. Cross-species gene finding. (Figure: [human] and [mouse] sequences, 5’ to 3’, showing Exon1, Intron1, Exon2, Intron2, Exon3 and three conserved non-coding regions labeled CNS.)

  17. The SLAM hidden Markov model

  18. Computational complexity. (Figure: the factors are the length of seq1, the length of seq2, the number of states, and the max duration.)

  19. Approximate alignment: reduces the TU factor to hT

  20. Measuring Performance

  21. Example: HoxA2 and HoxA3. (Figure tracks: TBLASTX, SLAM, SLAM CNS, SGP-2, VISTA, Twinscan, RefSeq, Genscan.)

  22. Suffix Trees (a short break from biology)

  23. Suffix Trees • Suffix trees are a method to find all maximal matches between two strings (and much more) • Example: x = dabdac (figure: the suffix tree of x, with leaves numbered 1 to 6)

  24. Definition of a Suffix Tree • Definition: for string x = x1…xm, a suffix tree is a rooted tree with m leaves, where leaf i corresponds to the suffix xi…xm • Each edge is labeled with a substring • No two edges out of a node start with the same letter • It follows that every substring corresponds to an initial part of a path from the root to a leaf

  25. Naïve Algorithm to Construct a Suffix Tree • Initialize tree T: a single root node r • Insert special symbol $ at end of x • For j = 1 to m: • Find longest match of xj…xm to T, starting from r • Split edge where match stops: new node w • Create edge (w, j), and label it with the unmatched portion of xj…xm
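
The naive construction can be sketched in a few lines. This is a quadratic-time illustration rather than a production implementation: the `Node` class and function names are hypothetical, edge labels are stored as plain strings instead of index pairs, and the input is assumed not to contain `$`.

```python
class Node:
    def __init__(self, suffix_index=None):
        # first letter of edge label -> (edge label, child Node)
        self.children = {}
        # 1-based start position of the suffix for leaves, None otherwise
        self.suffix_index = suffix_index


def build_suffix_tree(x: str) -> Node:
    """Insert the suffixes of x + '$' one at a time, longest first."""
    text = x + "$"
    root = Node()
    for j in range(len(text)):
        node, rest = root, text[j:]
        while True:
            first = rest[0]
            if first not in node.children:
                # Match stops at a node: hang a new leaf edge here.
                node.children[first] = (rest, Node(j + 1))
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue from the child node.
                node, rest = child, rest[k:]
                continue
            # Match stops inside the edge: split it at a new node w.
            w = Node()
            w.children[label[k]] = (label[k:], child)
            w.children[rest[k]] = (rest[k:], Node(j + 1))
            node.children[first] = (label[:k], w)
            break
    return root


def leaf_indices(node: Node) -> list:
    """Collect the suffix indices stored at the leaves below node."""
    if not node.children:
        return [node.suffix_index]
    found = []
    for _, child in node.children.values():
        found.extend(leaf_indices(child))
    return sorted(found)


if __name__ == "__main__":
    tree = build_suffix_tree("dabda")
    print(sorted(tree.children), leaf_indices(tree))
```

Because `$` occurs only at the end, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf.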

  26. Example of Suffix Tree Construction, x = dabda$. Suffixes are inserted in order: 1. Insert dabda$ 2. Insert abda$ 3. Insert bda$ 4. Insert da$ 5. Insert a$ 6. Insert $ (Figure: the tree after each insertion, with leaves numbered 1 to 6.)

  27. Memory to Store Suffix Tree • Can store in O(N) memory! • Every edge is labeled with (i, j): (i, j) denotes xi…xj • Tree has O(N) nodes. Proof: • # leaves = |x| • every internal node has at least two children, so # internal nodes ≤ # leaves - 1

  28. Faster Construction • Several algorithms run in O(N) time and O(N) memory, with a big constant (~15 bytes/char) • Technical but not deep; outside the scope of this course • Optional: Gusfield, chapter 6

  29. Application: find all matches between x, y • Build suffix tree for x, mark nodes with x • Insert y in suffix tree, mark all nodes y “passes through” with y • The path label of every node marked both x and y is a common substring
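
The marking idea can be illustrated with a simpler structure than a real suffix tree. The sketch below uses an uncompressed suffix trie (one character per edge), which takes quadratic time and space; that substitution, and all names in it, are assumptions made for brevity.

```python
class TrieNode:
    def __init__(self):
        self.kids = {}      # character -> TrieNode
        self.marks = set()  # which of the strings ("x", "y") pass through


def common_substrings(x: str, y: str) -> set:
    """Return every non-empty substring that occurs in both x and y."""
    root = TrieNode()
    for name, s in (("x", x), ("y", y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.kids.setdefault(ch, TrieNode())
                node.marks.add(name)  # mark every node this suffix visits
    found = set()
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, kid in node.kids.items():
            # A node marked by both strings spells a common substring.
            # Nodes below an unshared node cannot be shared, so we prune.
            if kid.marks == {"x", "y"}:
                found.add(label + ch)
                stack.append((kid, label + ch))
    return found


if __name__ == "__main__":
    shared = common_substrings("dabda", "abada")
    longest = max(len(s) for s in shared)
    print(sorted(s for s in shared if len(s) == longest))
```

The same marking generalizes to more than two strings by keeping one mark per input string at each node.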

  30. Example of Suffix Tree construction, x = dabda$, y = abada$. 1. Construct tree for x 2. Insert abada$ 3. Insert bada$ 4. Insert ada$ 5. Insert da$ 6. Insert a$ 7. Insert $ (Figure: the tree after each step, with every node marked x, y, or both.)

  31. Application: common substrings of k strings • To find the longest common substring of s1, s2, …, sn: • Build suffix tree for s1, …, sn • All nodes labeled {si1, …, sik} represent a match between si1, …, sik

  32. Suffix Arrays • Fast search for any specific string: binary search needs O(log n) suffix comparisons (O(m log n) time for a pattern of length m) • Used for data compression such as bzip2 • Can be built in O(n) time by first building a suffix tree and then reading off the suffixes with a depth-first traversal that visits children in lexicographic order • Too much memory: ~15n bytes • Difficult to implement • Theoretical build in O(n log n) using O(n / sqrt(log n)) extra memory • How to build one fast in practice is a hot topic
      Example: ABRACADABRA$ (start index, suffix)
      11  $
      10  A$
       7  ABRA$
       0  ABRACADABRA$
       3  ACADABRA$
       5  ADABRA$
       8  BRA$
       1  BRACADABRA$
       4  CADABRA$
       6  DABRA$
       9  RA$
       2  RACADABRA$
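
For illustration, a suffix array can be built by simply sorting the suffixes, and then searched by binary search. This is a naive sketch, not one of the fast construction methods mentioned above: the sort compares whole suffixes, so it is far from linear time, and the function names are hypothetical.

```python
def build_suffix_array(text: str) -> list:
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list, pattern: str) -> list:
    """All start positions of pattern in text, via two binary searches
    over the suffix array (each step compares one truncated suffix)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "ABRACADABRA$"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ABRA"))
```

The sort relies on `$` comparing lower than the letters, which holds for ASCII.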
