Part II Algorithms for string motif finding

Part II: Algorithms for string motif finding
Jaime Seguel, PhD
Electrical and Computer Engineering Department, University of Puerto Rico at Mayaguez
Summer Institute in Bioinformatics PSC - 2008

Disclaimer: Some slides presented in this talk have been taken, with minor or no modifications, from PowerPoint presentations published on the website http://www.bioalgorithms.info/

Outline
• The problem of finding small common patterns in a set of DNA sequences
• Brute force approach:
  • Consensus maximization
  • Hamming distance minimization
• Branch-and-Bound approach:
  • Consensus maximization
  • Hamming distance minimization
• Consensus and Pattern Branching:
  • Greedy Motif Search
• Summary

Problem: Given the following 10 DNA sequences, each with 82 characters, is there a 15-character common pattern?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaagggggggatgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttataggtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Yes, of course! It is AAAAAAAAGGGGGGG (shown in uppercase):
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

That was an easy one! A general algorithm for finding an l-character common pattern is:
Common l-character Pattern Detection Algorithm
Input: t DNA sequences, each of length n; and l < n, the length of the pattern
Procedure:
1. Compare the first two strings using a pattern matching algorithm
2. If no l-character common pattern is found, return “NO”
3. Otherwise, save the l-character common pattern
4. For j = 3,…,t
5.   Check if the pattern appears in the jth sequence
6.   If it does not, return “NO”
7. End For
8. Return “Yes, of course! It is {pattern}”
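The exact-match case can be sketched in a few lines of Python. This is a variation on the procedure above, not a transcription of it: instead of saving a single pattern from the first two strings, it keeps every l-mer that is still common to all sequences seen so far, so a shared pattern is not missed when the first two strings have several l-mers in common. The function name `common_lmer` is made up for this illustration.

```python
def common_lmer(seqs, l):
    """Return one l-mer that occurs in every sequence, or None if there is none."""
    def lmers(s):
        # All substrings of length l in s
        return {s[i:i + l] for i in range(len(s) - l + 1)}

    candidates = lmers(seqs[0])
    for s in seqs[1:]:
        candidates &= lmers(s)      # keep only l-mers also present in s
        if not candidates:
            return None             # the "NO" answer
    return min(candidates)          # deterministic choice among the survivors


print(common_lmer(["TTACGTAA", "CCACGTGG", "ACGTTTTT"], 4))
```

The comparison is exact, so this sketch only applies to patterns that have not been mutated.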

Complexity of the Common l-character Pattern Detection Algorithm
The time complexity of the Common l-character Pattern Detection Algorithm discussed previously can be estimated as follows:
• Step 1 is computed in time [expression missing]
• Steps 4 to 7 are computed in [expression missing]
• The whole algorithm takes time [expression missing]
• Therefore, the algorithm is polynomial (indeed, quadratic)

Unfortunately… Real-life problems are not that simple:
• The pattern is not exactly the same in each sequence because random point mutations may occur in the sequences
• The length of the pattern is usually unknown
• It is not known where the pattern is located relative to the start of the gene
These facts of life make the motif (i.e. pattern) finding problem much more complex

The same sequences, except for a few point mutations: Is there a motif?
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcgggatgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttataggtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Well, there are 15-character patterns that look pretty much alike. Indeed, each differs from AAAAAAAAGGGGGGG in at most 4 characters (the lowercase letters inside each uppercase pattern). Is that what you are asking for?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
Two mutated instances can differ from each other in more than 4 positions. The first two instances match in only 8 of 15 positions:
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG

Instead of a Pattern, what we get is a Motif Logo
• Motifs can mutate at unimportant bases
• The illustration shows five motifs in five different genes that have mutations in positions 3 and 5
• Representations called motif logos illustrate the conserved and variable regions of a motif
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA

[Figure: a larger motif logo]

Consensus strings
• The largest characters in a motif logo represent a consensus string, that is, a string containing the most frequent character at each position
• The “quality” of a motif logo as a “generalized common pattern” in a family of DNA sequences can be assessed by scoring consensus strings as follows

Selecting a t x l “window”
Parameters: t = 3, l = 9, n = 18
Sequences:
S1: TTGAGGTACACCTATAAC
S2: TAGCTCCACTCATATCAG
S3: TATCGCATGTACAATCAC
Selected window: s = (4, 2, 7) (initial positions)
***AGGTACACC******
*AGCTCCACT********
******ATGTACAAT***

Alignment, profile, consensus and scoring of the selected window
Alignment A(4, 2, 7):
   A G G T A C A C C
   A G C T C C A C T
   A T G T A C A A T
Profile P(4, 2, 7):
A: 3 0 0 0 2 0 3 1 0
C: 0 0 1 0 1 3 0 2 1
G: 0 2 2 0 0 0 0 0 0
T: 0 1 0 3 0 0 0 0 2
Consensus: A G G T A C A C T
Score A(4, 2, 7): 3+2+2+3+2+3+3+2+2 = 22
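The profile, consensus and score of a window can be computed column by column. The following is a minimal Python sketch; the function name `consensus_and_score` is an assumption of this illustration, and start positions are 1-based to match the slides.

```python
from collections import Counter


def consensus_and_score(seqs, starts, l):
    """Consensus string and consensus score of the l-wide windows that begin
    at the given 1-based start positions (one per sequence)."""
    windows = [s[p - 1:p - 1 + l] for s, p in zip(seqs, starts)]
    consensus, score = [], 0
    for column in zip(*windows):
        # Most frequent nucleotide in this column and how often it occurs
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        score += count
    return "".join(consensus), score


seqs = ["TTGAGGTACACCTATAAC", "TAGCTCCACTCATATCAG", "TATCGCATGTACAATCAC"]
print(consensus_and_score(seqs, (4, 2, 7), 9))  # ('AGGTACACT', 22)
```

When a column has a tie for the most frequent nucleotide, the score is unaffected but the consensus letter depends on how the tie is broken; the example window has no ties.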

Motif finding problem as a maximization problem
Given a set of t DNA sequences of length n and a segment length l < n, find a set of t substrings of length l (l-mers), one from each of the given DNA sequences, whose consensus score is maximal

Brute force approach to maximum consensus score
Input: t DNA sequences of length n, and the pattern’s length l
• Initialize bestScore ← 0
• For s = (s1, s2, …, st) from (1, 1, …, 1) to (n-l+1, …, n-l+1)
  • Compute Score ← score of alignment matrix A(s)
  • If Score > bestScore
    • bestScore ← Score
    • bestMotif ← (s1, s2, …, st)
• Return bestMotif
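A direct Python rendering of this brute force search might look as follows. It is only a sketch for tiny inputs: the function names are made up for this illustration, and start positions are 0-based here (Python slicing) rather than 1-based as on the slides.

```python
from collections import Counter
from itertools import product


def score(seqs, starts, l):
    """Consensus score of the windows at the given 0-based start positions."""
    windows = [s[p:p + l] for s, p in zip(seqs, starts)]
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*windows))


def brute_force_motif_search(seqs, l):
    """Try every combination of start positions and keep the best one."""
    best_score, best_motif = 0, None
    positions = [range(len(s) - l + 1) for s in seqs]
    for starts in product(*positions):      # (n - l + 1)^t combinations
        current = score(seqs, starts, l)
        if current > best_score:
            best_score, best_motif = current, starts
    return best_motif, best_score


print(brute_force_motif_search(["CCACGTCC", "ACGTTTTT", "GGGGACGT"], 4))
```

In the toy input the only 4-mer shared by all three sequences is ACGT, so the best alignment scores l * t = 12.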

Complexity
• Count the windows: varying the (n - l + 1) positions in each of the t sequences produces (n - l + 1)^t windows (or sets of starting positions). The order is O(n^t)
• For each set of starting positions, the scoring function makes O(l) operations, so the complexity is O(l n^t)
• That means that for t = 8, n = 1000, l = 10 we must perform approximately 10^25 computations!!!
• Even at a billion operations per second, this would take hundreds of millions of years!!!

A different approach
Instead of examining all windows, why not compare each of the possible l-character patterns over the alphabet {A, G, T, C} with each of the l-mers (substrings of l characters) in each of the t DNA sequences, and find the pattern that appears in all t sequences with the minimum number of mutations?
The question is: will this approach yield a better brute force algorithm?

Hamming distances
The Hamming distance dH(v, w) is the number of nucleotide pairs that do not match when v and w are aligned. For example: dH(AAAAAA, ACAAAC) = 2
The Hamming distance between a pattern V and a DNA sequence S is the minimum of the distances dH(X, V) taken over all substrings X of S that have the same length as V

Illustration of a total distance: some computations
Parameters: t = 3, l = 9, n = 15
Sequences:
S1: TTGAGGTACACCTAT
S2: TAGCTCCACTCATAT
S3: TATCGCATGTACAAT
Proposed pattern: V = AGGTATACG

The distance from pattern AGGTATACG to sequence S1 = TTGAGGTACACCTAT is 2
Each 9-character substring X of S1 is compared with the pattern:
• First choice of X: d(TTGAGGTAC, AGGTATACG) = 8
• Second choice of X: d(TGAGGTACA, AGGTATACG) = 5
• Third choice of X: d(GAGGTACAC, AGGTATACG) = 8
• Fourth choice of X: d(AGGTACACC, AGGTATACG) = 2 (minimum)
• Fifth choice of X: d(GGTACACCT, AGGTATACG) = 7
• Sixth choice of X: d(GTACACCTA, AGGTATACG) = 8
• Seventh choice of X: d(TACACCTAT, AGGTATACG) = 9

The total distance
• In the previous example: d(TTGAGGTACACCTAT, AGGTATACG) = d(S1, AGGTATACG) = 2, achieved when X is the 9-letter segment starting at position 4 in the DNA string
• Similarly, we get d(S2, AGGTATACG) = d(S3, AGGTATACG) = 4
• The total Hamming distance over the set of DNA sequences {S1, S2, S3} is defined to be
TotalDistance(AGGTATACG, {S1, S2, S3}) = d(S1, AGGTATACG) + d(S2, AGGTATACG) + d(S3, AGGTATACG) = 2 + 4 + 4 = 10
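The three distance notions used so far (between two l-mers, between a pattern and a sequence, and the total over a set of sequences) can be sketched in Python as follows. The function names are chosen for this illustration only.

```python
def hamming(v, w):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(v, w))


def distance(v, seq):
    """Minimum Hamming distance between pattern v and any len(v)-mer of seq."""
    l = len(v)
    return min(hamming(v, seq[i:i + l]) for i in range(len(seq) - l + 1))


def total_distance(v, seqs):
    """Sum of the pattern-to-sequence distances over all sequences."""
    return sum(distance(v, s) for s in seqs)


seqs = ["TTGAGGTACACCTAT", "TAGCTCCACTCATAT", "TATCGCATGTACAAT"]
v = "AGGTATACG"
print(hamming("AAAAAA", "ACAAAC"))          # 2
print([distance(v, s) for s in seqs])       # [2, 4, 4]
print(total_distance(v, seqs))              # 10
```

The printed values reproduce the numbers worked out by hand on the slides.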

Motif finding problem as a Hamming distance minimization problem
Given a set of t DNA sequences of length n and a segment length l < n, find a string v over the DNA alphabet (that is, a string of nucleotides) of length l which minimizes TotalDistance(v, Set of DNA sequences)
That is, find: min {TotalDistance(v, {S1, …, St}) : v a DNA sequence of length l}

Brute force implementation of the total-distance minimization method
Input: t DNA sequences of length n, and the pattern length l
• Initialize bestWord ← AAA…A (l characters)
• Initialize bestDistance ← highest integer in your system
• For each l-mer v from AAA…A to TTT…T
  • Compute TotalDistance(v, DNA set)
  • If TotalDistance(v, DNA set) < bestDistance
    • bestDistance ← TotalDistance(v, DNA set)
    • bestWord ← v
• Return bestWord
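A small Python sketch of this brute force search is given below. It enumerates the 4^l candidate patterns with `itertools.product`, which yields them in the same AAA…A to TTT…T order when the alphabet is given as "ACGT". The function names are assumptions of this illustration, and the helper functions are repeated so the block runs on its own.

```python
from itertools import product


def hamming(v, w):
    return sum(a != b for a, b in zip(v, w))


def total_distance(v, seqs):
    """Sum over sequences of the best (minimum) distance from v to any l-mer."""
    l = len(v)
    return sum(
        min(hamming(v, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )


def median_string(seqs, l):
    """Brute force over all 4^l patterns; returns (best word, its distance)."""
    best_word, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=l):
        v = "".join(letters)
        d = total_distance(v, seqs)
        if d < best_distance:
            best_word, best_distance = v, d
    return best_word, best_distance


print(median_string(["CCACGTCC", "ACGTTTTT", "GGGGACGT"], 4))  # ('ACGT', 0)
```

Using `float("inf")` as the initial distance plays the role of the "highest integer in your system".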

Complexity
• Minimizing the total Hamming distance requires examining all 4^l combinations for the pattern v, and each pattern choice is followed by O(t(n-l+1)) l-mer comparisons. That is, the method’s complexity is O(4^l t(n-l+1))
• Conclusion: the complexity of the brute-force total distance minimization is dominated by an exponential factor as well, but the actual count is much smaller in this case

It’s all in what affects the exponential growth!!!
• In most practical situations n is significantly larger than l. Recall that l is usually a number between 7 and 15
• The advantage of 4^l over (n - l + 1)^t is that the former expression does not depend exponentially on either the number of sequences in the set (t) or the sequence length (n)
• The latter parameters (t and n) are less likely to be bounded in practice
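To make the comparison concrete, the two operation counts can be evaluated for the parameters used earlier in the talk (t = 8, n = 1000, l = 10). The short Python sketch below simply plugs the numbers into the two complexity expressions; constant factors hidden by the O-notation are ignored, and the variable names are made up for this illustration.

```python
t, n, l = 8, 1000, 10

# Consensus maximization: l operations for each of the (n - l + 1)^t windows
consensus_ops = l * (n - l + 1) ** t

# Total distance minimization: t * (n - l + 1) comparisons for each of 4^l patterns
median_ops = 4 ** l * t * (n - l + 1)

print(f"consensus maximization:      {consensus_ops:.1e}")
print(f"total distance minimization: {median_ops:.1e}")
```

The first count is on the order of 10^25, while the second is 8,313,110,528, below 10^10.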

Mathematical Equivalence
• Motif Finding is a maximization problem, while Median String (the total distance minimization described previously) is a minimization problem. Computationally, Median String allows searches over much larger data sets. Are the results comparable?
• Indeed, the Motif Finding problem and the Median String problem are mathematically equivalent. Next we show that minimizing TotalDistance is equivalent to maximizing Score

Proof of the mathematical equivalence
Alignment (t = 5 rows, l = 8 columns):
   a G g t a c T t
   C c A t a c g t
   a c g t T A g t
   a c g t C c A t
   C c g t a c g G
Profile:
A  3 0 1 0 3 1 1 0
C  2 4 0 0 1 4 0 0
G  0 1 4 0 0 0 3 1
T  0 0 0 5 1 0 1 4
Consensus      a c g t a c g t
Score          3+4+4+5+3+4+3+4 = 30
TotalDistance  2+1+1+0+2+1+2+1 = 10
Sum            5 5 5 5 5 5 5 5
• At any column i: Score_i + HammingDistance_i = t
• Because there are l columns: Score + TotalDistance = l * t
• Rearranging: Score = l * t - TotalDistance
• Since l * t is constant, minimizing TotalDistance is equivalent to maximizing Score
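The identity can be checked numerically on the alignment above. The Python sketch below computes the consensus score and the total distance of the rows to the consensus, then confirms that they add up to l * t; the variable names are chosen for this illustration, and the rows are lowercased so that case does not affect the comparison.

```python
from collections import Counter

rows = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
rows = [r.lower() for r in rows]          # case only marks mutations on the slide
t, l = len(rows), len(rows[0])

columns = list(zip(*rows))
consensus = "".join(Counter(col).most_common(1)[0][0] for col in columns)
score = sum(Counter(col).most_common(1)[0][1] for col in columns)

# Distance of every row to the consensus, summed over all rows
total_distance = sum(a != b for r in rows for a, b in zip(r, consensus))

print(consensus, score, total_distance)   # acgtacgt 30 10
assert score + total_distance == l * t    # 30 + 10 == 8 * 5
```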

Structuring the Search
Let’s take a closer look at the pseudo-code line
For each l-mer v from AAA…A to TTT…T
• There is more than one way to navigate over all possible l-mers
• We need a navigation method able to exhibit intermediate approximations, so that potentially “low scoring or highly distant” l-mers can be eliminated as early as possible in the search

Structuring the Search
• For the Median String Problem we need to consider all 4^l possible l-mers (each of length l):
aa…aa
aa…ac
aa…ag
aa…at
 . . .
tt…tt
How to organize this search?

Alternative Representation of the Search Space
• Let A = 1, C = 2, G = 3, T = 4
• Then the sequences from AA…A to TT…T (each of length l) become:
11…11
11…12
11…13
11…14
 . . .
44…44
• Notice that the sequences above simply list all numbers as if we were counting in base 4 without using 0 as a digit

Linked lists don’t exhibit intermediate approximations
• Suppose l = 2. Starting from aa, the list is:
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
• Need to visit all the predecessors of a sequence before visiting the sequence itself

Trees do!!!
• Linked lists organize the patterns. A tree, instead, may show the patterns and their prefixes
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

Search Tree
Root:    --
Level 1: a- c- g- t-
Leaves:  aa ac ag at | ca cc cg ct | ga gc gg gt | ta tc tg tt

Moving through a Search Tree
• Four common moves in a search tree that we are about to explore:
  • Move to the next leaf
  • Visit all the leaves
  • Visit the next node
  • Bypass the children of a node

Visit the Next Leaf
Given a current leaf a, we need to compute the “next” leaf:
NextLeaf(a, L, k)          // a: the array of digits
  for i ← L to 1           // L: length of the array
    if a_i < k             // k: max digit value
      a_i ← a_i + 1
      return a
    a_i ← 1
  return a

NextLeaf (cont’d)
• The algorithm is common addition in radix k:
  • Increment the least significant digit
  • “Carry the one” to the next digit position when the digit is at maximal value
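The NextLeaf move translates almost line by line into Python. The sketch below keeps the digits 1..k of the slides but uses a 0-based list index; unlike the pseudocode, it returns a new list instead of modifying its argument, which is a choice made for this illustration.

```python
def next_leaf(a, k):
    """Successor of leaf a (digits in 1..k); wraps around to (1, ..., 1)."""
    a = list(a)                          # work on a copy
    for i in reversed(range(len(a))):    # least significant digit first
        if a[i] < k:
            a[i] += 1
            return a
        a[i] = 1                         # digit was at its maximum: carry
    return a                             # all digits were k: wrapped around


print(next_leaf([1, 1], 4))  # [1, 2]
print(next_leaf([1, 4], 4))  # [2, 1]
print(next_leaf([4, 4], 4))  # [1, 1]
```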

NextLeaf: Example
• Moving to the next leaf
[Figure: the search tree (root --, level 1: 1- 2- 3- 4-, leaves 11 … 44) with the current leaf highlighted, followed by the same tree with the next leaf highlighted]

Visit All Leaves
• Printing all length-L digit sequences in ascending order:
AllLeaves(L, k)            // L: length of the sequence
  a ← (1, ..., 1)          // k: max digit value
  while forever            // a: array of digits
    output a
    a ← NextLeaf(a, L, k)
    if a = (1, ..., 1)
      return
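In Python the AllLeaves loop fits naturally into a generator. The sketch below repeats a NextLeaf helper so that it runs on its own; the function names and the translation of digits back to nucleotides (1 → A, 2 → C, 3 → G, 4 → T) are written for this illustration.

```python
def next_leaf(a, k):
    """Radix-k increment on digits 1..k, wrapping around to (1, ..., 1)."""
    a = list(a)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1
            return a
        a[i] = 1
    return a


def all_leaves(L, k):
    """Yield every length-L digit tuple over 1..k in ascending order."""
    first = [1] * L
    a = first
    while True:
        yield tuple(a)
        a = next_leaf(a, k)
        if a == first:       # wrapped around: every leaf has been visited
            return


# Translate digits back to nucleotides to list all 2-mers
words = ["".join("ACGT"[d - 1] for d in leaf) for leaf in all_leaves(2, 4)]
print(words)
```

For L = 2 and k = 4 this produces the 16 leaves from AA to TT.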

Visit All Leaves: Example
• Moving through all the leaves in order
[Figure: the search tree with the leaves 11, 12, …, 44 visited from left to right; the 15 moves between consecutive leaves are numbered 1 to 15]

Depth First Search
• So we can search all leaves
• How about searching all vertices of the tree?
• We can do this with a depth first search

Visit the Next Vertex
NextVertex(a, i, L, k)     // a: the array of digits
  if i < L                 // i: prefix length
    a_(i+1) ← 1            // L: max length
    return (a, i+1)        // k: max digit value
  else
    for j ← L to 1
      if a_j < k
        a_j ← a_j + 1
        return (a, j)
  return (a, 0)
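A Python sketch of NextVertex, together with a loop that uses it to list every vertex of the tree in depth first order, is shown below. A vertex is represented as in the pseudocode by the pair (a, i), where only the first i digits of a are meaningful. The helper `dfs_order` and the convention of starting from the root with i = 0 are assumptions of this illustration.

```python
def next_vertex(a, i, L, k):
    """Next vertex in depth first order; prefix length 0 means 'done'."""
    a = list(a)
    if i < L:                        # internal vertex: go to its first child
        a[i] = 1                     # a_(i+1) in the 1-based pseudocode
        return a, i + 1
    for j in range(L, 0, -1):        # leaf: climb until a digit can grow
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0


def dfs_order(L, k):
    """All vertices below the root, as prefixes, in depth first order."""
    a, i = [1] * L, 0                # start at the root (empty prefix)
    visited = []
    while True:
        a, i = next_vertex(a, i, L, k)
        if i == 0:
            return visited
        visited.append(tuple(a[:i]))


order = dfs_order(2, 4)
print(order[:6])   # [(1,), (1, 1), (1, 2), (1, 3), (1, 4), (2,)]
print(len(order))  # 20 vertices: 4 internal + 16 leaves
```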

NextVertex: Example
• Moving to the next vertices
[Figure: the search tree (root --, level 1: 1- 2- 3- 4-, leaves 11 … 44) with the current vertex highlighted, followed by the same tree showing the location after 5 NextVertex moves]

Bypass Move
• Given a prefix (internal vertex), find the next vertex after skipping all its children
Bypass(a, i, L, k)         // a: array of digits
  for j ← i to 1           // i: prefix length
    if a_j < k             // L: maximum length
      a_j ← a_j + 1        // k: max digit value
      return (a, j)
  return (a, 0)
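The Bypass move can be sketched in Python in the same style. As in the pseudocode, a vertex is the pair (a, i) with a meaningful prefix of length i, and a returned prefix length of 0 means there is nothing left to visit. Returning a copy of the list rather than modifying it in place is a choice made for this illustration.

```python
def bypass(a, i, L, k):
    """Skip the subtree below the prefix a[:i] and return the next vertex."""
    a = list(a)
    for j in range(i, 0, -1):        # try to advance the prefix itself
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0                      # no vertex left after this subtree


print(bypass([2, 1], 1, 2, 4))  # ([3, 1], 1): from "2-" straight to "3-"
print(bypass([4, 1], 1, 2, 4))  # ([4, 1], 0): "4-" was the last subtree
```

L is unused in the body but kept in the signature so that the call matches the pseudocode.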

Bypass Move: Example
• Bypassing the descendants of “2-”
[Figure: the search tree (root --, level 1: 1- 2- 3- 4-, leaves 11 … 44) with the current location “2-” highlighted, followed by the same tree with the next location “3-” highlighted; the leaves 21 to 24 are skipped]

Improving brute force search: The Branch and Bound approach
• A set of starting positions s = (s1, s2, …, st) may already have a weak profile for its first i positions (s1, s2, …, si)
• Every row of the alignment can add at most l to Score
• Optimistic bound: the remaining (t - i) positions (si+1, …, st) can add at most (t - i) * l to Score(s, i, DNA)
• If Score(s, i, DNA) + (t - i) * l < BestScore, it makes no sense to search the descendants of the current vertex
• Use Bypass()
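Putting the pieces together, a branch and bound motif search can be sketched in Python as follows. This is one possible arrangement of the moves described above, not necessarily the one used in the original talk: the tree has depth t, each digit is a 1-based start position in 1..(n - l + 1), and a subtree is bypassed whenever the optimistic bound falls below the best score found so far. All function names are made up for this illustration, and the sequences are assumed to have equal length.

```python
from collections import Counter


def next_vertex(a, i, L, k):
    a = list(a)
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0


def bypass(a, i, L, k):
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0


def partial_score(seqs, s, i, l):
    """Consensus score of the first i rows (1-based start positions in s)."""
    windows = [seqs[r][s[r] - 1:s[r] - 1 + l] for r in range(i)]
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*windows))


def branch_and_bound_motif_search(seqs, l):
    t, k = len(seqs), len(seqs[0]) - l + 1
    s, i = [1] * t, 1
    best_score, best_motif = 0, None
    while i > 0:
        if i < t:
            # Remaining t - i rows can add at most l each
            optimistic = partial_score(seqs, s, i, l) + (t - i) * l
            if optimistic < best_score:
                s, i = bypass(s, i, t, k)       # prune this subtree
            else:
                s, i = next_vertex(s, i, t, k)
        else:
            current = partial_score(seqs, s, t, l)
            if current > best_score:
                best_score, best_motif = current, tuple(s)
            s, i = next_vertex(s, i, t, k)
    return best_motif, best_score


print(branch_and_bound_motif_search(["CCACGTCC", "ACGTTTTT", "GGGGACGT"], 4))
```

Because a subtree is pruned only when its optimistic bound is strictly below the best score, the result has the same score as the exhaustive search; the pruning only saves work.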
