
In the search of motifs (and other hidden structures)

Esko Ukkonen, Department of Computer Science & Helsinki Institute of Information Technology HIIT, University of Helsinki. CPM 2005, Jeju, 21 June 2005.



  1. In the search of motifs (and other hidden structures) Esko Ukkonen Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki CPM 2005, Jeju, 21 June 2005

  2. Uncover a hidden structure(?)

  3. Motif? • a pattern that occurs unexpectedly often in (a set of) strings • pattern: substring, substring with gaps, string in generalized alphabet (e.g., IUPAC), HMMs, binding affinity matrix, cluster of binding affinity matrices, … (= the hidden structure to be learned from data) • (unexpectedly: statistical modelling) • occurrence: exact, approximate, with high probability, … • strings ↔ applications: bioinformatics …

  4. Plan of the talk • Gapped motifs in a string • Founder sequence reconstruction problem, with applications to haplotype analysis and genotype phasing (WABI 2002, ALT 2004, WABI 2005) • Uncovering gene enhancer elements

  5. 1. Gapped motifs

  6. [Figure: the pattern ATT and the gapped pattern I#A, each shown against the string HATTIVATTI]

  7. Substring motifs of a string S • string S = s1 … sn in alphabet A • Problem: what are the frequently occurring (ungapped) substrings of S? What is the longest substring that occurs at least q times? • Thm: The suffix tree T(S) of S gives complete occurrence counts of all substring motifs of S in O(n) time (although S may have O(n^2) substrings!)

  8. T(S) is a full-text index • a path for P exists in T(S) ↔ P occurs in S [Figure: suffix tree T(S) with the path for P; P occurs in S at locations 8, 31, …]

  9. Counting the substring motifs • internal nodes of T(S) ↔ repeating substrings of S • number of leaves of the subtree of a node for string P = number of occurrences of P in S

  10. T(hattivatti) [Figure: suffix tree of hattivatti, built from the suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i]

  11. Substring motifs of hattivatti [Figure: suffix tree of hattivatti with occurrence counts 4, 2, 2, 2, 2 at its internal nodes] • counts for the O(n) maximal motifs shown
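
The counts 4, 2, 2, 2, 2 for hattivatti can be reproduced with a small sketch. This is only an illustration under stated assumptions, not the linear-time suffix-tree method of the talk: it enumerates all substrings naively and keeps those that repeat and are followed by at least two different symbols, which is how the branching nodes of T(S) are interpreted here. The function name `right_maximal_repeats` and the end marker `$` (assumed not to occur in the string) are hypothetical.

```python
from collections import defaultdict

def right_maximal_repeats(s):
    """Naive stand-in for reading the branching nodes of a suffix tree.

    Returns {substring: occurrence count} for substrings that occur at
    least twice and have at least two distinct continuations.
    """
    counts = defaultdict(int)
    followers = defaultdict(set)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = s[i:j]
            counts[sub] += 1
            # "$" marks "end of string" (assumed absent from s)
            followers[sub].add(s[j] if j < n else "$")
    return {p: c for p, c in counts.items()
            if c >= 2 and len(followers[p]) >= 2}

print(right_maximal_repeats("hattivatti"))
# {'atti': 2, 'tti': 2, 't': 4, 'ti': 2, 'i': 2}
```

The naive enumeration takes quadratic space and worse time; it is meant only to make the node/count correspondence concrete on a tiny input.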

  12. Finding repeats in DNA • human chromosome 3 • the first 48 999 930 bases • 31 min CPU time (8 processors, 4 GB) • Human genome: 3 × 10^9 bases • T(HumanGenome) feasible

  13. Longest repeat? Occurrences at: 28395980, 28401554r Length: 2559 ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttg
ggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctcttttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg

  14. Ten occurrences? ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt Length: 277 Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125 In the reverse complement at: 17858493, 41463059, 42431718, 42580925
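
Some of the occurrences listed above lie on the reverse-complement strand. As a minimal sketch of what that means, assuming plain a/c/g/t sequences with no ambiguity codes, the helper below (the name `reverse_complement` is hypothetical) reverses a sequence and swaps each base for its complement. A repeat search over both strands could, for example, index the sequence together with its reverse complement.

```python
# Translation table: each base is replaced by its complement.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string over a, c, g, t."""
    return dna.translate(_COMPLEMENT)[::-1]

print(reverse_complement("ttagg"))  # cctaa
```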

  15. Gapped motifs of S • gapped pattern: P in (A ∪ {#})* • gap symbol # matches any symbol in A • example: aa##bb#b • L(P) = occurrences of P in S • P is called a motif of S if |L(P)| > 1 and a motif with quorum q if |L(P)| ≥ q • Problem: find the occurrence count |L(P)| for all gapped motifs P of S • a^n b a^n has exponentially many motifs (M.-F. Sagot)!
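
The definition of L(P) can be made concrete with a short sketch. It is a direct, quadratic-time scan written for illustration only; the function name `gapped_occurrences` and the 0-based positions are assumptions, not notation from the talk.

```python
def gapped_occurrences(pattern, s, gap="#"):
    """Return L(P): the 0-based start positions where `pattern` occurs in `s`.

    The gap symbol matches any single symbol of `s`.
    """
    m = len(pattern)
    return [i for i in range(len(s) - m + 1)
            if all(c == gap or c == s[i + k] for k, c in enumerate(pattern))]

print(gapped_occurrences("a###a", "aaaaabaaaaa"))  # [0, 2, 3, 4, 6]
print(gapped_occurrences("i#a", "hattivatti"))     # [4]
```

With these definitions, a pattern is a motif with quorum q when the returned list has at least q entries.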

  16. Motifs vs self-alignments • self-alignments of S => maximal motifs [Figure: copies of S shifted to align the occurrences]

  17. Motifs vs multiple self-alignments • self-alignments of S => maximal motifs • expand if possible

  18. Motifs vs self-alignments • S = aaaaabaaaaa, P = a###a [Figure: copies of aaaaabaaaaa shifted so that occurrences of a###a align]

  19. Motifs vs self-alignments • S = aaaaabaaaaa, P = a###a [Figure: copies of aaaaabaaaaa shifted so that occurrences of a###a align]

  20. Motifs vs self-alignments • S = aaaaabaaaaa, P = a###a • aaa#a#aaa is the maximal motif for this self-alignment [Figure: copies of aaaaabaaaaa shifted against each other, with aaa#a#aaa marking the columns that agree]

  21. Maximal motifs • multiple self-alignments of S ↔ maximal gapped motifs of S: the unanimous columns give the non-gap symbols of the motif • any motif P has a unique maximal motif M(P) (align the occurrences and maximize); L(M(P)) = L(P) + d • unfortunately: a^n b a^n has exponentially many maximal motifs
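
The "align the occurrences and maximize" step can be sketched as follows. This is an illustrative reading of the slide, not the speaker's implementation: the copies of S are stacked at the given start positions, every column on which all copies agree contributes its symbol, and every other column becomes a gap. The name `maximal_motif`, the 0-based positions, and the returned shift `d` are assumptions made for the example.

```python
def maximal_motif(s, positions, gap="#"):
    """Maximal motif of the self-alignment of `s` at the given positions.

    Returns (motif, d) where the motif occurs at p + d for each p in
    `positions`. Assumes `gap` does not occur in `s`.
    """
    lo = -min(positions)                  # leftmost column shared by all copies
    hi = len(s) - 1 - max(positions)      # rightmost column shared by all copies
    cols = []
    for k in range(lo, hi + 1):
        symbols = {s[p + k] for p in positions}
        cols.append(symbols.pop() if len(symbols) == 1 else gap)
    motif = "".join(cols)
    leading_gaps = len(motif) - len(motif.lstrip(gap))
    return motif.strip(gap), lo + leading_gaps

# Two occurrences of a###a in aaaaabaaaaa, at positions 0 and 2:
print(maximal_motif("aaaaabaaaaa", [0, 2]))   # ('aaa#a#aaa', 0)
# The two occurrences of tt in hattivatti extend to atti, one step left:
print(maximal_motif("hattivatti", [2, 7]))    # ('atti', -1)
```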

  22. Blocks of maximal motifs • aaa##b##ba has blocks aaa, b, ba • Lemma: Maximal substring motifs (1-block motifs) ↔ (branching) nodes of T(S) • Thm: Each block of a maximal motif of S is a maximal substring motif of S, hence there are O(n) different strings that can be used as a block of a maximal motif. • Cor: There are O(n^(2k-1)) different maximal motifs with k blocks [O(n^(2k)) unrestricted motifs].
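
As a tiny illustration of the block decomposition, the sketch below splits a gapped motif at its gap symbols. The helper name `blocks` is hypothetical.

```python
def blocks(motif, gap="#"):
    """Split a gapped motif into its maximal runs of non-gap symbols."""
    return [b for b in motif.split(gap) if b]

print(blocks("aaa##b##ba"))  # ['aaa', 'b', 'ba']
```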

  23. Counting 2-block maximal motifs • Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

  24. Algorithm (very simple) • 2-block motif (X, d, Y) [Figure: block X, distance d, block Y] • for each maximal substring motif X: for each distance d = 1, 2, …: mark the leaves of T(S) that correspond to locations L(X) + d; for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in T(S); the occurrence count of motif (X, d, Y) is h(Y)

  25. Algorithm (very simple) • 2-block motif (X, d, Y) [Figure: block X, distance d, block Y] • for each maximal substring motif X: for each distance d = 1, 2, …: mark the leaves of T(S) that correspond to locations L(X) + d; for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in T(S); the occurrence count of motif (X, d, Y) is h(Y) • cost annotations: O(n) · O(n) · O(n)
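
The loop structure of the algorithm can be sketched without a suffix tree. In the sketch below, sets of start positions stand in for marked leaves, so it shows the logic but not the stated time bound. Everything named here is an assumption made for illustration: the functions `occurrences` and `two_block_counts`, the 0-based positions, the reading of d as the distance from the start of X to the start of Y, and the restriction d > |X| so that the two blocks are separated by at least one gap symbol.

```python
def occurrences(s, block):
    """0-based start positions of `block` in `s`."""
    return [i for i in range(len(s)) if s.startswith(block, i)]

def two_block_counts(s, blocks, quorum=2):
    """Occurrence counts h(Y) of 2-block motifs (X, d, Y) built from `blocks`.

    Only motifs with at least `quorum` occurrences are reported.
    """
    starts = {b: set(occurrences(s, b)) for b in blocks}
    counts = {}
    for x in blocks:
        for d in range(len(x) + 1, len(s)):
            marked = {p + d for p in starts[x]}   # the locations L(X) + d
            for y in blocks:
                h = len(marked & starts[y])       # marked positions where Y starts
                if h >= quorum:
                    counts[(x, d, y)] = h
    return counts

counts = two_block_counts("hattivatti", ["t", "i", "ti", "tti", "atti"])
print(counts[("t", 2, "i")])  # 2, the motif t#i
```

Here the candidate blocks are passed in explicitly; in the talk they are the maximal substring motifs read from the branching nodes of T(S).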

  26. Counting 2-block maximal motifs (cont) • Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3). • flexible gaps: x*y, where * = gap of any length • Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).

  27. General case • Q1: Given q and W, does S have a motif with at least W non-gap symbols and at least q occurrences? • In the k-block case, is O(n^(2k-1)) (or even better) time possible? • related work: A. Apostolico, M.-F. Sagot, L. Parida, N. Pisanti, …

  28. 2. Founder reconstruction and applications

  29. Haplotype evolution: founders and iterated recombinations • WABI 2002

  30. [Figure: founder haplotypes and current (observed) haplotypes] • only recombinations; mutations not shown

  31. statistical models of recombination: average fragment length ~ 1/#generations

  32. Uncovering founder sequences • Problem: Given current sequences C (haplotypes), construct their ‘founders’ that produce the sequences by iterated recombinations using minimum possible total number of cross-overs (i.e., current sequences have a parse into smallest possible number of fragments taken from the founders)

  33. Example • current sequences (4 sequences of length 12):
      0 0 1 0 0 0 0 1 0 0 1 1
      1 1 1 1 1 1 1 0 0 1 1 0
      0 0 1 0 1 1 1 1 0 1 1 0
      1 1 1 1 0 0 1 0 0 0 1 1

  34. Example • current sequences:
      0 0 1 0 0 0 0 1 0 0 1 1
      1 1 1 1 1 1 1 0 0 1 1 0
      0 0 1 0 1 1 1 1 0 1 1 0
      1 1 1 1 0 0 1 0 0 0 1 1

  35. Example • current sequences:
      0 0 1 0 0 0 0 1 0 0 1 1
      1 1 1 1 1 1 1 0 0 1 1 0
      0 0 1 0 1 1 1 1 0 1 1 0
      1 1 1 1 0 0 1 0 0 0 1 1
      • founders:
      0 0 1 0 1 1 1 0 0 0 1 1
      1 1 1 1 0 0 0 1 0 1 1 0
      • 6 cross-overs

  36. Example • current sequences:
      0 0 1 0 0 0 0 1 0 0 1 1
      1 1 1 1 1 1 1 0 0 1 1 0
      0 0 1 0 1 1 1 1 0 1 1 0
      1 1 1 1 0 0 1 0 0 0 1 1

  37. Example • current sequences:
      0 0 1 0 0 0 0 1 0 0 1 1
      1 1 1 1 1 1 1 0 0 1 1 0
      0 0 1 0 1 1 1 1 0 1 1 0
      1 1 1 1 0 0 1 0 0 0 1 1
      • founders:
      0 0 0 0 0 0 0 0 0 0 0 0
      1 1 1 1 1 1 1 1 1 1 1 1
      • 18 cross-overs • OBS: two founders (colors) always suffice if no restrictions
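
The cross-over counts in the examples can be checked with a short sketch. It covers only the easy direction, scoring a given founder set by parsing each current sequence into the fewest founder fragments; it does not search for the founders themselves. It reads the example as four current sequences of length 12 and two founders of length 12, which is an interpretation of the flattened slide text. The function names `min_crossovers` and `total_crossovers` are hypothetical.

```python
def min_crossovers(seq, founders):
    """Fewest cross-overs needed to spell `seq` from fragments of `founders`.

    cost[i] is the fewest cross-overs for a parse of the prefix read so far
    whose last symbol is copied from founder i (infinite if impossible).
    """
    inf = float("inf")
    cost = [0 if f[0] == seq[0] else inf for f in founders]
    for j in range(1, len(seq)):
        best = min(cost)
        # Either stay on the same founder, or switch from the best one (+1).
        cost = [min(c, best + 1) if f[j] == seq[j] else inf
                for c, f in zip(cost, founders)]
    return min(cost)

def total_crossovers(current, founders):
    return sum(min_crossovers(seq, founders) for seq in current)

current = ["001000010011", "111111100110", "001011110110", "111100100011"]
print(total_crossovers(current, ["001011100011", "111100010110"]))  # 6
print(total_crossovers(current, ["0" * 12, "1" * 12]))              # 18
```

The second call illustrates the observation that the all-0 and all-1 founders can always produce binary sequences, at the price of one cross-over per change of symbol.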

  38. Founder reconstruction problem • given a set D of m sequences, construct M founder sequences that give D in minimum number of cross-overs • solution by dynamic programming, exponential time in m (WABI 2002) • Q2: NP-hard?

  39. Modeling a set of haplotypes by an HMM • ’motif’ = Hidden Markov Model • minimum description length (MDL) modeling • ALT 2004

  40. Hidden Markov Model (HMM) • states i with emission alphabet Hi • emission probabilities P(H), H ∈ Hi • state transition probabilities wij [Figure: states i and j, emission distribution {P(H)}, transition weight wij]

  41. Conserved fragments and parses • haplotypes 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 • parse 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 • conserved fragments 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 2 1 2 2 2 2 1 • fragmentation model (HMM) 2 1 2 1 2 1 1 1 1 2 2 1 2 2 2 2 1 1 2 1 1 1 1

  42. Lactose tolerance • recent finding in Finnish population: an SNP C/T-13910, 14 kb upstream from the lactase gene, associates completely with lactose intolerance • two datasets over 23 SNPs in the vicinity of this SNP • lactose intolerant persons: 21 haplotypes • lactose tolerant persons: 38 haplotypes

  43. Case/control study by HMM • Lactose tolerant (2 fragments per haplotype => young) • Lactose intolerant (~6 fragments per haplotype)

  44. Genotype phasing via founders using an HMM • the genotype phasing problem: given a set of genotypes, find their resolving haplotype pairs • find at most M founders that produce resolving haplotype pairs in the minimum possible number of cross-overs => a relatively good haplotyping method • improved results with a related HMM, trained with the Expectation Maximization algorithm • WABI 2005

  45. HMM for haplotyping [Figure: HMM states, each with an emission probability distribution, linked by transition probability distributions]

  46. Example HMM
