710 likes | 854 Views
Part II Algorithms for string motif finding. Jaime Seguel, PhD Electrical and Computer Engineering Department University of Puerto Rico at Mayaguez. Disclaimer.
E N D
Part IIAlgorithms for string motif finding Jaime Seguel, PhD Electrical and Computer Engineering Department University of Puerto Rico at Mayaguez Summer Institute in Bioinformatics PSC - 2008
Disclaimer Some slides presented in this talk have been taken with minor or without modifications from power point presentations published in the Website http://www.bioalgorithms.info/ Summer Institute in Bioinformatics PSC - 2008
Outline • The problem of finding small common patterns in a set of DNA sequences • Brute force approach: • consensus maximization • Hamming distance minimization • Branch-and-Bound approach: • Consensus maximization • Hamming distance minimization • Consensus and Pattern Branching: • Greedy Motif Search • Summary Summer Institute in Bioinformatics PSC - 2008
Problem: Given the following 10 DNA sequences, each with 82 characters: Is there a 15-character common pattern? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaagggggggatgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttataggtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga Summer Institute in Bioinformatics PSC - 2008
YES of course!It is AAAAAAAGGGGGGG atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa Summer Institute in Bioinformatics PSC - 2008
That was an easy one! A general algorithm for finding an l-character common pattern is Common l-character Pattern Detection Algorithm: Input: t DNA sequences, each of length n; and l < n, the length of the pattern Procedure: • Compare the first two strings using a pattern matching algorithm • If no l-character common pattern is found, return “NO” • Otherwise, save the l-character common pattern • For j = 2,…,t • Check if the pattern appears in the jth sequence • If it does not, return “NO” • End For • Return “Yes, of course! It is {pattern}” Bioinformatics Algorithms
Complexity of the Common l-character Pattern Detection Algorithm The time complexity of the previously discussed Common l-character Pattern Detection Algorithm can be estimated as follows: • Step 1 is computed in time • Steps 4 – 7 are computed in • The whole algorithm takes time • Therefore, the algorithm is polynomial (indeed, quadratic) Summer Institute in Bioinformatics PSC - 2008
Unfortunately… Real-lifeproblems are not that simple: • The pattern is not exactly the same in each array because random point mutations may occur in the sequences • The length of the pattern is usually unknown • It is not know where it is located relative to the genes start These facts-of-life make the motif (i.e. pattern) finding problem much more complex Summer Institute in Bioinformatics PSC - 2008
Same sequences except by a few point mutations: Is there a motif? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcgggatgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttataggtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga Summer Institute in Bioinformatics PSC - 2008
Well, there are 15-character patterns that look pretty much alike. Indeed, they differ for at most 4 characters. Is that what you are asking for? atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa AgAAgAAAGGttGGG ..|..|||.|..||| cAAtAAAAcGGcGGG Summer Institute in Bioinformatics PSC - 2008
Instead of a Pattern, what we get is a Motif Logo • Motifs can mutate on non important bases • The illustration shows five motifs in five different genes that have mutations in position 3 and 5 • Representations called motif logosillustrate the conserved and variable regions of a motif TGGGGGA TGAGAGA TGGGGGA TGAGAGA TGAGGGA Summer Institute in Bioinformatics PSC - 2008
A larger motif logo Summer Institute in Bioinformatics PSC - 2008
Consensus strings • The largest characters in a motif logo represent a consensus string, this is, a string containing the most frequently repeated characters • The “quality” of a motif logo as a “generalized common pattern” in a family of DNA sequences can be assessed by scoring all consensus strings as follows BIOINFORMATI
Selecting a t x l “window” Parameters: t=3, l=9,n=18 Sequences: S1: TTGAGGTACACCTATAAC S2: TAGCTCCACTCATATCAG S3: TATCGCATGTACAATCAC Selected window: s=(4, 2, 7) (initial positions) ***AGGTACACC****** *AGCTCCACT******** ******ATGTACAAT*** BIOINFORMATI
Alignment, profile, consensus an scoring the selected window A G G T A C A C C A(4, 2, 7) =A G C T C C A C T A T G T A C A A T P(4, 2, 7)=A: 3 0 0 0 2 0 3 1 0 C: 0 0 1 0 1 3 0 2 1 G: 0 22 0 0 0 0 0 0 T: 0 1 0 3 0 0 0 0 2 Consensus: A G G T A C A C T Score A(4,2,7): 3+2+2+3+2+3+2+2 = 19 BIOINFORMATI
Motif finding problem as a maximization problem Given a set of t DNA sequences of length n and a segment length l < n; find a set of t subsequences of length l , one from each of the given DNA sequences, whose consensus score is maximal BIOINFORMATI
Brute force approach to maximum consensus score Input: t DNA sequences of length n, and the pattern’s length l • Initialize bestScore 0; • For s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1) • Compute Score score of alignment matrix A(s) • If Score> bestScore • bestScore Score • bestMotif (s1,s2 , . . . , st) • Return bestMotif BIOINFORMATI
Complexity • Count the windows: Varying (n - l + 1)positions in each of tsequences produces (n - l + 1)twindows (or sets of starting positions). The order is • For each set of starting positions, the scoring function makes O(l) operations, so complexity is O(l nt) • That means that for t = 8, n = 1000, l = 10 we must perform approximately 1025computations!!! • Even in a supercomputer this will take a few billions years!!! BIOINFORMATI
A different approach Instead of finding all windows, why not comparing each of the possible l-character patterns over the alphabet {A, G, T, C} with each of the l-mers (subsequences of l characters) in each of the tDNAsequences and find the pattern that appears in all t sequences with the minimum number of mutations Question is: Will this approach yield a better brute force algorithm ? Summer Institute in Bioinformatics PSC - 2008
Hamming distances The Hamming distance dH(v,w) is the number of nucleotide pairs that do not match when v and w are aligned. For example: dH(AAAAAA,ACAAAC) = 2 The Hamming distance between a patternV and a DNA sequenceS is the minimum of all distances d(X, V) taken over all possible substrings X over S Summer Institute in Bioinformatics PSC - 2008
Illustration of a total distance: some computations Parameters: t=3, l=9, n=15 Sequences: S1: TTGAGGTACACCTAT S2: TAGCTCCACTCATAT S3: TATCGCATGTACAAT Proposed Pattern: V=AGGTATACG BIOINFORMATI
The distance form pattern AGGTATACG to sequence S1 is 2 • TTGAGGTACACCTAT First sequence (S1) and chosen subsequence X d(TTGAGGTAC,AGGTATACG)=8 • TTGAGGTACACCTAT Second choice of X d(TGAGGTACA,AGGTATACG)=5 • TTGAGGTACACCTAT Third choice of X d(GAGGTACAC,AGGTATACG)=8 • TTGAGGTACACCTAT Forth choice of X d(AGGTACACC,AGGTATACG)=2 Minimum • TTGAGGTACACCTAT Fifth choice of X d(GGTACACCT,AGGTATACG)=7 • TTGAGGTACACCTAT Sixth choice of X d(GTACACCTA,AGGTATACG)=8 • TTGAGGTACACCTAT Seventh choice of X d(TACACCTAT,AGGTATACG)=9 BIOINFORMATI
The total distance • In the previous example: d(TTGAGGTACACCTAT,AGGTATACG) d(S1, AGGTATACG) =2 achieved when X is the 9-letter segment starting at position 4 in the DNA string • Similarly, we get d(S2, AGGTATACG) = d(S3, AGGTATACG) = 4 • The total Hamming distance over the set of DNA sequences {S1, S2, S3} is defined to be TotalDistance( AGGTATACG, {S1, S2, S3}) = d(S1, AGGTATACG)+ d(S2, AGGTATACG)+ d(S3, AGGTATACG)} = 2 + 4 + 4 = 10 BIOINFORMATI
Motif finding problem as a Hamming distance minimization problem Given a set of t DNA sequences of length n and a segment length l < n; find a string v in the DNA alphabet (this is, a string of nucleotides) with length l which minimizesTotalDistance(v, Set of DNA sequences) This is finding: min {TotalDistance(v, {S1,…,St}): v DNA sequence of length l } BIOINFORMATI
Brute force implementation of the total-distance minimization method Input: t DNA sequences of length n, and the pattern length l • Initialize bestWord AAA…A (l characters) • Initialize bestDistance highest integer in your system • For each l-mer v from AAA…A to TTT…T • Compute TotalDistance(v, DNA set) • If TotalDistance(v, DNA) < bestDistance • bestDistanceTotalDistance(v, DNA set) • bestWord v • Return bestWord BIOINFORMATI
Complexity • Minimizing the total Hamming distance requires examining all 4l combinations for the pattern v, and each pattern choice is followed by O(t(n-l+1)) operations. This is, the method’s complexity is O(4l t(n-l+1)) • Conclusion, the complexity of the brute-force total distance minimization is dominated by an exponential factor, as well. But the actual count is much less in this case. BIOINFORMATI
It’s all in what affects the exponential growth!!! • In most practical situations n is significantly larger than l. Recall that l is usually a number between 7 and 15. • The advantage of 4l over (n -l+ 1)t is that the former expression does not depend exponentially neither on the number of sequences in the set (t) nor in the sequence lengths (n) • The latter parameters (t and n) are less likely to be bounded in practice BIOINFORMATI
Mathematical Equivalence • The Motif Finding is a maximization problem while Median String is a minimization problem. Computationally, Median String allows searches over much larger data sets. Are the results comparable? • Indeed, the Motif Finding problem and Median String problem are mathematically equivalent. Next we show that minimizing TotalDistance is equivalent to maximizing Score Summer Institute in Bioinformatics PSC - 2008
Proof of the mathematical equivalence l a G g t a c T t C c A t a c g t Alignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 311 0 Profile C 24 0 0 14 0 0 G 0 14 0 0 0 31 T 0 0 0 51 0 14 _________________ Consensus a c g t a c g t Score 3+4+4+5+3+4+3+4 TotalDistance 2+1+1+0+2+1+2+1 Sum 5 5 5 5 5 5 5 5 At any column I Scorei+ Hamming Distancei= t Because there are lcolumns Score+ TotalDistance= l * t Rearranging: Score= l * t - TotalDistance Since l* t is constant the minimization of the right side is equivalent to the maximization of the left side t Summer Institute in Bioinformatics PSC - 2008
Structuring the Search Let’s take a closer look to the pseudo-code line For each l-merv from AAA…A to TTT…T • There is more than one way to navigate over all possible l-mers • We need a navigation method able to exhibit intermediate approximations so potentially “low scoring or highly distant” l-mers can be eliminated as earlier as possible in the search Summer Institute in Bioinformatics PSC - 2008
Structuring the Search • For the Median String Problem we need to consider all 4l possible l-mers: aa… aa aa… ac aa… ag aa… at . . tt… tt How to organize this search? l Summer Institute in Bioinformatics PSC - 2008
Alternative Representation of the Search Space • Let A = 1, C = 2, G = 3, T = 4 • Then the sequences from AA…A to TT…T become: 11…11 11…12 11…13 11…14 . . 44…44 • Notice that the sequences above simply list all numbers as if we were counting on base 4 without using 0 as a digit l Summer Institute in Bioinformatics PSC - 2008
Linked lists don’t exhibit intermediate approximations • Suppose l = 2 aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt • Need to visit all the predecessors of a sequence before visiting the sequence itself Start Summer Institute in Bioinformatics PSC - 2008
Trees do !!! • Linked lists organize the patterns. A tree, instead, may show the patterns and their prefixes aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt Summer Institute in Bioinformatics PSC - 2008
Search Tree a- c- g- t- aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt root -- Summer Institute in Bioinformatics PSC - 2008
Moving through a Search Tree • Four common moves in a search tree that we are about to explore: • Move to the next leaf • Visit all the leaves • Visit the next node • Bypass the children of a node Summer Institute in Bioinformatics PSC - 2008
Visit the Next Leaf Given a current leaf a, we need to compute the “next” leaf: • NextLeaf( a,L, k ) // a : the array of digits • foriL to 1 //L: length of the array • ifai < k // k : max digit value • aiai + 1 • returna • ai 1 • returna Summer Institute in Bioinformatics PSC - 2008
NextLeaf (cont’d) • The algorithm is common addition in radix k: • Increment the least significant digit • “Carry the one” to the next digit position when the digit is at maximal value Summer Institute in Bioinformatics PSC - 2008
NextLeaf: Example • Moving to the next leaf: 1- 2- 3- 4- 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 -- Current Location Summer Institute in Bioinformatics PSC - 2008
NextLeaf: Example (cont’d) • Moving to the next leaf: 1- 2- 3- 4- 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 -- Next Location Summer Institute in Bioinformatics PSC - 2008
Visit All Leaves • Printing all permutations in ascending order: • AllLeaves(L,k) // L: length of the sequence • a (1,...,1) // k : max digit value • while forever // a: array of digits • output a • a NextLeaf(a,L,k) • ifa = (1,...,1) • return Summer Institute in Bioinformatics PSC - 2008
Visit All Leaves: Example • Moving through all the leaves in order: 1- 2- 3- 4- 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 -- Order of steps Summer Institute in Bioinformatics PSC - 2008
Depth First Search • So we can search all leaves • How about searching all vertices of the tree? • We can do this with a depth first search Summer Institute in Bioinformatics PSC - 2008
Visit the Next Vertex • NextVertex(a,i,L,k) // a : the array of digits • ifi < L // i : prefix length • a i+1 1 // L: max length • return ( a,i+1) // k : max digit value • else • forjl to 1 • ifaj < k • ajaj +1 • return( a,j ) • return(a,0) Summer Institute in Bioinformatics PSC - 2008
Example • Moving to the next vertex: 1- 2- 3- 4- 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 Current Location -- Summer Institute in Bioinformatics PSC - 2008
Example • Moving to the next vertices: 1- 2- 3- 4- 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 Location after 5 next vertex moves -- Summer Institute in Bioinformatics PSC - 2008
Bypass Move • Given a prefix (internal vertex), find next vertex after skipping all its children • Bypass(a,i,L,k) // a: array of digits • forji to 1 // i : prefix length • ifaj < k// L: maximum length • ajaj +1// k : max digit value • return(a,j) • return(a,0) Summer Institute in Bioinformatics PSC - 2008
Bypass Move: Example • Bypassing the descendants of “2-”: 1- 2- 3- 4- 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 Current Location -- Summer Institute in Bioinformatics PSC - 2008
Example • Bypassing the descendants of “2-”: 1- 2- 3- 4- 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 Next Location -- Summer Institute in Bioinformatics PSC - 2008
Improving brute force search: The Branch and Bound approach • Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si) • Every row of alignment may add at most lto Score • Optimism: if all subsequent (t-i) positions (si+1, …st) add (t – i ) * ltoScore(s,i,DNA) • If Score(s,i,DNA) + (t – i) * l < BestScore, it makes no sense to search in the descendents of the current vertex • Use ByPass() Summer Institute in Bioinformatics PSC - 2008