In the search of motifs (and other hidden structures). Esko Ukkonen Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki CPM 2005, Jeju, 21 June 2005. Uncover a hidden structure(?). Motif?.
Motif? • a pattern that occurs unexpectedly often in (a set of) strings • pattern: substring, substring with gaps, string in generalized alphabet (e.g., IUPAC), HMMs, binding affinity matrix, cluster of binding affinity matrices, … (= the hidden structure to be learned from data) • (unexpectedly: statistical modelling) • occurrence: exact, approximate, with high probability, … • strings ↔ applications: bioinformatics …
Plan of the talk • Gapped motifs in a string • Founder sequence reconstruction problem, with applications to haplotype analysis and genotype phasing (WABI 2002, ALT 2004, WABI 2005) • Uncovering gene enhancer elements
[Figure: the substring pattern ATT and the gapped pattern I#A located in HATTIVATTI]
Substring motifs of a string S • string S = s_1 … s_n in alphabet A. • Problem: what are the frequently occurring (ungapped) substrings of S? Longest substring that occurs at least q times? • Thm: Suffix tree T(S) of S gives complete occurrence counts of all substring motifs of S in O(n) time (although S may have O(n^2) substrings!)
T(S) is a full-text index • Path for P exists in T(S) ↔ P occurs in S • [Figure: the path for P in T(S) leads to a subtree whose leaves give the locations 8, 31, … at which P occurs in S]
Counting the substring motifs • internal nodes of T(S) ↔ repeating substrings of S • number of leaves of the subtree of a node for string P = number of occurrences of P in S
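The counting rule above can be tried out on a small string. The sketch below is only an illustration and makes a loud substitution: instead of a linear-size suffix tree it builds a naive suffix trie in O(n^2) time and space, and it stores in every node the number of suffixes passing through it, which equals the number of leaves below that node. The function names are hypothetical.

```python
def build_suffix_trie(s):
    """Naive O(n^2) suffix trie (NOT a linear-time suffix tree).

    Each node stores how many suffixes of s pass through it, which is the
    number of leaves in its subtree.
    """
    root = {"count": 0, "children": {}}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def count_occurrences(trie, p):
    """Number of occurrences of the non-empty string p in the indexed string."""
    node = trie
    for ch in p:
        node = node["children"].get(ch)
        if node is None:
            return 0  # no path for p, so p does not occur
    return node["count"]


trie = build_suffix_trie("hattivatti")
for p in ["t", "atti", "hat", "x"]:
    print(p, count_occurrences(trie, p))
```

A real suffix tree gives the same counts from a structure of size O(n); the trie is used here only because it fits in a few lines.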
T(hattivatti) • [Figure: suffix tree of hattivatti, one leaf per suffix: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i]
Substring motifs of hattivatti • [Figure: suffix tree of hattivatti with occurrence counts at the branching nodes: t 4; i, ti, tti and atti 2 each] • Counts for the O(n) maximal motifs shown
Finding repeats in DNA • human chromosome 3 • the first 48 999 930 bases • 31 min CPU time (8 processors, 4 GB) • Human genome: 3 × 10^9 bases • T(HumanGenome) feasible
Longest repeat? Occurrences at: 28395980, 28401554r Length: 2559 ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctga
gacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctcttttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg
Ten occurrences? ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt Length: 277 Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125. In the reversed complement at: 17858493, 41463059, 42431718, 42580925
Gapped motifs of S • gapped pattern: P ∈ (A ∪ {#})* • gap symbol # matches any symbol in A • example: aa##bb#b • L(P) = occurrences of P in S • P is called a motif of S if |L(P)| > 1 and a motif with quorum q if |L(P)| ≥ q. • Problem: find occurrence count |L(P)| for all gapped motifs P of S • a^n b a^n has exponentially many motifs (M-F. Sagot)!
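To make the definitions concrete, here is a minimal sketch that computes L(P) for a gapped pattern by direct scanning. It assumes 0-based start positions and takes O(|S|·|P|) time; the function names are hypothetical and no indexing structure is used.

```python
GAP = "#"


def occurrences(pattern, s):
    """Return L(P): the 0-based start positions where the gapped pattern matches s."""
    m = len(pattern)
    return [
        i
        for i in range(len(s) - m + 1)
        if all(p == GAP or p == c for p, c in zip(pattern, s[i:i + m]))
    ]


def is_motif(pattern, s, quorum=2):
    """P is a motif of s if |L(P)| > 1; other quorum values q can be passed in."""
    return len(occurrences(pattern, s)) >= quorum


print(occurrences("a###a", "aaaaabaaaaa"))
print(is_motif("i#a", "hattivatti"))
```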
Motifs vs self-alignments • self-alignments of S => maximal motifs • align the occurrences • [Figure: S aligned against a shifted copy of itself]
Motifs vs multiple self-alignments • self-alignments of S => maximal motifs • expand if possible
Motifs vs self-alignments • S = aaaaabaaaaa, P = a###a • [Figure: two copies of aaaaabaaaaa aligned so that two occurrences of a###a coincide]
Motifs vs self-alignments • S = aaaaabaaaaa, P = a###a • aaa#a#aaa is the maximal motif for this self-alignment • [Figure: two copies of aaaaabaaaaa aligned so that two occurrences of a###a coincide, with aaa#a#aaa marking the columns where the copies agree]
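The motif of a pairwise self-alignment can be computed directly: slide a copy of S by some shift, keep the columns where both copies agree, and put a gap symbol elsewhere. The sketch below assumes the alignment on the slide is the one with shift 2 (this shift reproduces the motif aaa#a#aaa); the function name is hypothetical.

```python
def motif_from_self_alignment(s, shift):
    """Align s against itself shifted by `shift` (>= 1) and keep unanimous columns.

    Columns where the two copies disagree become the gap symbol '#';
    leading and trailing gaps are trimmed.
    """
    columns = "".join(a if a == b else "#" for a, b in zip(s, s[shift:]))
    return columns.strip("#")


print(motif_from_self_alignment("aaaaabaaaaa", 2))
```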
Maximal motifs • multiple self-alignments of S ↔ maximal gapped motifs of S: the unanimous columns give the non-gap symbols of the motif • any motif P has a unique maximal motif M(P) (align the occurrences and maximize); L(M(P)) = L(P) + d • unfortunately: a^n b a^n has exponentially many maximal motifs
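The "align the occurrences and maximize" step can be sketched as follows: stack all occurrences of P, look at every column that lies inside S for all of them (also to the left and right of P), and keep a symbol wherever the column is unanimous. This is a naive illustration under the assumption of 0-based positions; the returned offset d is the shift in L(M(P)) = L(P) + d, and the function names are hypothetical.

```python
def occurrences(pattern, s):
    """0-based start positions where the gapped pattern ('#' = gap) matches s."""
    m = len(pattern)
    return [
        i
        for i in range(len(s) - m + 1)
        if all(p == "#" or p == c for p, c in zip(pattern, s[i:i + m]))
    ]


def maximal_motif(pattern, s):
    """Return (M(P), d, L(M(P))) for a motif P of s with at least two occurrences."""
    occ = occurrences(pattern, s)
    if len(occ) < 2:
        raise ValueError("pattern is not a motif of s")
    lo = -min(occ)                 # leftmost column available to every occurrence
    hi = len(s) - 1 - max(occ)     # rightmost column available to every occurrence
    cols = []
    for j in range(lo, hi + 1):
        symbols = {s[o + j] for o in occ}
        cols.append(symbols.pop() if len(symbols) == 1 else "#")
    text = "".join(cols)
    leading_gaps = len(text) - len(text.lstrip("#"))
    d = lo + leading_gaps          # offset of M(P) relative to P
    return text.strip("#"), d, [o + d for o in occ]


print(maximal_motif("tt", "hattivatti"))
```

For P = tt in hattivatti the two occurrences extend one column to the left and one to the right, so the maximal motif is atti with d = -1.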
Blocks of maximal motifs • aaa##b##ba has blocks aaa, b, ba • Lemma: Maximal substring motifs (1-block motifs) ↔ (branching) nodes of T(S) • Thm: Each block of a maximal motif of S is a maximal substring motif of S, hence there are O(n) different strings that can be used as a block of a maximal motif. • Cor: There are O(n^(2k-1)) different maximal motifs with k blocks [O(n^(2k)) unrestricted motifs].
Counting 2-block maximal motifs • Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).
Algorithm (very simple)
[Figure: 2-block motif (X,d,Y), blocks X and Y at distance d]
for each maximal substring motif X
    for each distance d = 1, 2, …
        mark the leaves of T(S) that correspond to locations L(X) + d
        for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in T(S)
        the occurrence count of motif (X,d,Y) is h(Y)
Cost: O(n) choices of X × O(n) distances d × O(n) for marking and counting
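The loop structure of this algorithm can be mimicked without a suffix tree. The sketch below is a naive stand-in, not the O(n^3) method itself: occurrence lists of all repeated substrings (not only the maximal ones) are kept in a dictionary, "marking leaves" becomes building the set L(X) + d, and h(Y) is the size of the intersection with L(Y). It assumes that d is measured from the start of X to the start of Y and that d > |X|, so that at least one gap symbol separates the blocks; all names are hypothetical.

```python
from collections import defaultdict


def occurrence_lists(s):
    """Map every substring of s to the list of its 0-based start positions (naive)."""
    occ = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            occ[s[i:j]].append(i)
    return occ


def two_block_counts(s, quorum=2):
    """Occurrence counts of 2-block motifs (X, d, Y) that reach the quorum."""
    blocks = {x: set(p) for x, p in occurrence_lists(s).items() if len(p) >= quorum}
    counts = {}
    for x, lx in blocks.items():
        for d in range(len(x) + 1, len(s)):
            marked = {p + d for p in lx}      # locations L(X) + d
            for y, ly in blocks.items():
                h = len(marked & ly)          # marked occurrences of Y
                if h >= quorum:
                    counts[(x, d, y)] = h
    return counts


counts = two_block_counts("hattivatti")
print(counts[("a", 3, "i")])  # the gapped motif a##i
```

Replacing the dictionary by T(S) and restricting X and Y to branching nodes is what brings the work per (X, d) pair down to O(n).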
Counting 2-block maximal motifs (cont) • Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3). • flexible gaps: x*y, where * = gap of any length • Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).
General case • Q1: Given q and W, does S have a motif with at least W non-gap symbols and at least q occurrences? • In the k-block case, is O(n^(2k-1)) (or even better) time possible? • related work: A. Apostolico, M-F. Sagot, L. Parida, N. Pisanti, …
Haplotype evolution: founders and iterated recombinations • WABI 2002
[Figure: founder haplotypes and the current (observed) haplotypes derived from them; only recombinations, mutations not shown]
statistical models of recombination: average fragment length ~ 1/#generations
Uncovering founder sequences • Problem: Given current sequences C (haplotypes), construct their ‘founders’ that produce the sequences by iterated recombinations using the minimum possible total number of cross-overs (i.e., the current sequences have a parse into the smallest possible number of fragments taken from the founders)
Example
current sequences:
0 0 1 0 0 0 0 1 0 0 1 1
1 1 1 1 1 1 1 0 0 1 1 0
0 0 1 0 1 1 1 1 0 1 1 0
1 1 1 1 0 0 1 0 0 0 1 1
Example
current sequences:
0 0 1 0 0 0 0 1 0 0 1 1
1 1 1 1 1 1 1 0 0 1 1 0
0 0 1 0 1 1 1 1 0 1 1 0
1 1 1 1 0 0 1 0 0 0 1 1
founders:
0 0 1 0 1 1 1 0 0 0 1 1
1 1 1 1 0 0 0 1 0 1 1 0
6 cross-overs
Example
current sequences:
0 0 1 0 0 0 0 1 0 0 1 1
1 1 1 1 1 1 1 0 0 1 1 0
0 0 1 0 1 1 1 1 0 1 1 0
1 1 1 1 0 0 1 0 0 0 1 1
founders:
0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1
18 cross-overs
OBS: two founders (colors) always suffice if no restrictions
Founder reconstruction problem • given a set D of m sequences, construct M founder sequences that give D in minimum number of cross-overs • solution by dynamic programming, exponential time in m (WABI 2002) • Q2: NP-hard?
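The objective can be evaluated for a fixed candidate founder set with a short dynamic program. The sketch below covers only this easier subproblem, scoring given founders; it is not the WABI 2002 reconstruction algorithm. The data are the talk's example read row by row as four sequences of length 12 and two founders, which is an assumption about how the flattened figure was laid out; the function names are hypothetical.

```python
def min_crossovers(seq, founders):
    """Minimum number of cross-overs needed to parse seq into founder fragments.

    cost[f] = fewest cross-overs for the prefix read so far when its last
    symbol is copied from founder f (infinite if that is impossible).
    """
    inf = float("inf")
    cost = [0 if f[0] == seq[0] else inf for f in founders]
    for j in range(1, len(seq)):
        best = min(cost)
        cost = [
            min(c, best + 1) if f[j] == seq[j] else inf
            for c, f in zip(cost, founders)
        ]
    return min(cost)


def total_crossovers(sequences, founders):
    return sum(min_crossovers(s, founders) for s in sequences)


# Example data (assumed row-wise reading of the slide).
current = ["001000010011", "111111100110", "001011110110", "111100100011"]
founders = ["001011100011", "111100010110"]
trivial = ["000000000000", "111111111111"]

print(total_crossovers(current, founders))
print(total_crossovers(current, trivial))
```

The reconstruction problem itself asks for the founder set that minimizes this total, which is the hard part.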
Modeling a set of haplotypes by an HMM • ’motif’ = Hidden Markov Model • minimum description length (MDL) modeling • ALT 2004
Hidden Markov Model (HMM) • states i with emission alphabet H_i • emission probabilities P(H), H ∈ H_i • state transition probabilities w_ij • [Figure: states i and j, each with its emission distribution {P(H)}, joined by a transition with probability w_ij]
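As a reminder of how these ingredients combine, here is a minimal sketch of the standard forward algorithm, which sums over all state paths to give the probability that an HMM emits a given sequence. It is a generic textbook HMM with one symbol per state visit, not the fragment-emitting model of the talk, and the two-state model with its probabilities is made up purely for illustration.

```python
def sequence_probability(obs, start, trans, emit):
    """Forward algorithm: probability that the HMM emits the sequence obs.

    start[i]    = probability of starting in state i
    trans[i][j] = transition probability w_ij
    emit[i][h]  = emission probability P(h) in state i
    """
    alpha = {i: start[i] * emit[i].get(obs[0], 0.0) for i in start}
    for symbol in obs[1:]:
        alpha = {
            j: sum(alpha[i] * trans[i].get(j, 0.0) for i in alpha)
            * emit[j].get(symbol, 0.0)
            for j in start
        }
    return sum(alpha.values())


# Hypothetical two-state model over the alleles "1" and "2".
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"1": 0.8, "2": 0.2}, "B": {"1": 0.3, "2": 0.7}}

print(sequence_probability("1121", start, trans, emit))
```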
Conserved fragments and parses • [Figure: a set of haplotypes over the alleles 1 and 2, their parse, the conserved fragments, and the resulting fragmentation model (HMM)]
Lactose tolerance • recent finding in Finnish population: an SNP C/T-13910, 14 kb upstream from the lactase gene, associates completely with lactose intolerance • two datasets over 23 SNPs in the vicinity of this SNP • lactose intolerant persons: 21 haplotypes • lactose tolerant persons: 38 haplotypes
Case/control study by HMM • Lactose tolerant (2 fragments per haplotype => young) • Lactose intolerant (~6 fragments per haplotype)
Genotype phasing via founders using an HMM • the genotype phasing problem: given a set of genotypes, find their resolving haplotype pairs • find at most M founders that produce resolving haplotype pairs with the minimum possible number of cross-overs => relatively good haplotyping method • improved results with a related HMM, trained with the Expectation Maximization algorithm • WABI 2005
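What "resolving haplotype pairs" means can be shown by brute-force enumeration for a single genotype. The sketch assumes a common encoding that is not stated in the talk: 0 and 1 mark sites homozygous for allele 0 or 1, and 2 marks a heterozygous site. A genotype with k heterozygous sites then has 2^(k-1) unordered resolving pairs, which is why phasing needs a model (such as founders or an HMM) to choose among them. The function name is hypothetical.

```python
from itertools import product


def resolving_pairs(genotype):
    """List all unordered haplotype pairs that resolve a genotype string.

    Assumed encoding: '0'/'1' = homozygous site, '2' = heterozygous site.
    """
    het = [i for i, g in enumerate(genotype) if g == "2"]
    if not het:
        return [(genotype, genotype)]
    pairs = []
    # Fix allele 0 on the first haplotype at the first heterozygous site
    # so that each unordered pair is produced exactly once.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(resolving_pairs("0212"))
```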
HMM for haplotyping • [Figure: chain of states, each with an emission probability distribution, linked by transition probability distributions]