Exhaustive search

Exhaustive search CS 498 SS Saurabh Sinha

A new motif model
• To define a motif, let's say we know where the motif starts in each sequence
• The motif start positions in their sequences can be represented as $s = (s_1, s_2, s_3, \ldots, s_t)$

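Scoring motifs
• Given $s = (s_1, \ldots, s_t)$ and DNA, construct the profile matrix
• Score(s, DNA) = $\sum_{j=1}^{l} \max_{b \in \{A,C,G,T\}} \mathrm{count}(b, j)$, i.e., the count of the most frequent base in each column, summed over the l columns

Alignment (t = 5 rows, l = 8 columns):
    a G g t a c T t
    C c A t a c g t
    a c g t T A g t
    a c g t C c A t
    C c g t a c g G

Profile:
    A    3 0 1 0 3 1 1 0
    C    2 4 0 0 1 4 0 0
    G    0 1 4 0 0 0 3 1
    T    0 0 0 5 1 0 1 4

Consensus    a c g t a c g t
Score        3+4+4+5+3+4+3+4 = 30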

Good profile matrices
• The goal is to find the starting positions $s = (s_1, \ldots, s_t)$ that maximize Score(s, DNA) of the resulting profile matrix
• This is the “motif finding problem”

A different formulation
• The Hamming distance between two strings v and w is $d_H(v, w)$ = number of mismatches between v and w
• Given an array of starting positions $s = (s_1, \ldots, s_t)$, we define $d_H(v, s) = \sum_i d_H(v, s_i)$, where $s_i$ here stands for the l-mer that starts at position $s_i$ of sequence i
• Define: TotalDist(v, DNA) = $\min_s d_H(v, s)$
• Computing TotalDist is easy: find the closest l-mer to v in each input sequence and add up the distances
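Because the minimum over s splits into an independent minimum per sequence, TotalDist needs only one scan of each sequence. A minimal sketch, assuming lower-case strings and the hypothetical helper names `hamming` and `total_dist`:

```python
def hamming(v, w):
    """Number of positions at which equal-length strings v and w differ."""
    return sum(a != b for a, b in zip(v, w))

def total_dist(v, dna):
    """Sum over sequences of the distance from v to its closest l-mer."""
    l = len(v)
    return sum(min(hamming(v, seq[i:i + l])
                   for i in range(len(seq) - l + 1))
               for seq in dna)

# Toy input, invented for illustration only.
dna = ["acgtacgt", "ttacgaaa", "ccccacga"]
print(total_dist("acg", dna))  # 0: "acg" occurs in every sequence
print(total_dist("aaa", dna))  # 4: best matches are 2, 0 and 2 mismatches away
```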

The median string problem
• Find the l-mer v that minimizes TotalDist(v, DNA)
• A double minimization ($\min_s$, $\min_v$)
• Equivalent to the motif finding problem
• Show this
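One way to see the equivalence, sketched under the assumption that $\mathrm{count}_s(b, j)$ (shorthand introduced here, not slide notation) is the number of rows of the alignment defined by s that have base b in column j: each column contributes one mismatch for every row that disagrees with $v_j$, and the best v picks the most frequent base in every column.

```latex
\begin{align*}
d_H(v, s) &= \sum_{j=1}^{l} \bigl(t - \mathrm{count}_s(v_j, j)\bigr) \\
\min_v d_H(v, s) &= l\,t - \sum_{j=1}^{l} \max_b \mathrm{count}_s(b, j)
                  = l\,t - \mathrm{Score}(s, \mathrm{DNA}) \\
\min_v \mathrm{TotalDist}(v, \mathrm{DNA}) &= \min_v \min_s d_H(v, s)
                  = \min_s \min_v d_H(v, s)
                  = l\,t - \max_s \mathrm{Score}(s, \mathrm{DNA})
\end{align*}
```

So the s that maximizes Score and the v that minimizes TotalDist describe the same optimum: v is the consensus string of the best alignment.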

Naïve time complexity
• Motif finding problem: consider every $(s_1, \ldots, s_t)$: $O((n-l+1)^t)$
• Median string problem: consider every l-mer: $O(4^l)$. Relatively fast!
• Common form of both expressions: find a vector of L variables, each of which can take k values: $O(k^L)$

An algorithm to enumerate!
• Want to generate all strings in $\{1,2,3,4\}^L$
• Increment the least significant digit, and “carry over” to the next position if necessary

NEXTLEAF(a, L, k)
    for i := L to 1
        if a_i < k
            a_i := a_i + 1
            return a
        a_i := 1
    return a

ALLLEAVES(L, k)
    a := (1,..,1)
    while true
        output a
        a := NEXTLEAF(a, L, k)
        if a = (1,..,1)
            return

Output order: 11…11, 11…12, 11…13, 11…14, …, 44…44
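The two routines translate almost line for line into Python. This is a sketch under a couple of assumptions: digits stay 1-based as on the slide, lists are 0-indexed, and `all_leaves` is written as a generator rather than printing each leaf.

```python
def next_leaf(a, k):
    """Return the leaf after a, treating a as a base-k counter with digits 1..k."""
    a = list(a)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1      # no carry needed, done
            return a
        a[i] = 1           # digit overflows: reset it and carry to the left
    return a               # wrapped around to (1, ..., 1)

def all_leaves(L, k):
    """Yield every length-L vector over 1..k, least significant digit last."""
    a = [1] * L
    while True:
        yield tuple(a)
        a = next_leaf(a, k)
        if a == [1] * L:
            return

leaves = list(all_leaves(2, 4))
print(len(leaves), leaves[0], leaves[1], leaves[-1])  # 16 (1, 1) (1, 2) (4, 4)
```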

“Search tree” for enumeration: order of steps
[Figure: search tree for L = 2, k = 4. The root has children 1-, 2-, 3-, 4-, and each of these has four leaves (11 … 14, 21 … 24, 31 … 34, 41 … 44). Numbers on the vertices mark the order of steps.]

Visiting all vertices in the tree
• Not just the leaves, but the internal nodes as well
• PreOrder traversal of a tree
• PreOrder(node):
    • Visit node
    • PreOrder(left child of node)
    • PreOrder(right child of node)
• This is a recursive method
• Can be rewritten without recursion

Visit the next vertex
NextVertex(a, i, L, k)            // a: the array of digits
    if i < L                      // i: prefix length
        a_{i+1} := 1              // L: max length
        return (a, i+1)           // k: max digit value
    else
        for j := L to 1
            if a_j < k
                a_j := a_j + 1
                return (a, j)
    return (a, 0)
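A small Python sketch of NextVertex, together with a driver loop that lists the vertices in the order they are visited. The driver `preorder` is an illustrative helper, not part of the slides, and it starts at the first child of the root (prefix length 1), as the later branch-and-bound pseudocode does.

```python
def next_vertex(a, i, L, k):
    """Move from the vertex (prefix a[:i]) to the next vertex in preorder."""
    a = list(a)
    if i < L:                       # internal vertex: go one level deeper
        a[i] = 1                    # a[i] is digit i+1 in the slide's 1-based terms
        return a, i + 1
    for j in range(L, 0, -1):       # leaf: find the rightmost digit that can grow
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j             # j < L means we jumped up to a shorter prefix
    return a, 0                     # traversal finished

def preorder(L, k):
    """Collect every visited prefix, starting from the prefix (1,)."""
    a, i = [1] * L, 1
    visited = []
    while i > 0:
        visited.append(tuple(a[:i]))
        a, i = next_vertex(a, i, L, k)
    return visited

print(preorder(2, 2))
# [(1,), (1, 1), (1, 2), (2,), (2, 1), (2, 2)]
```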

In words
• If at an internal node, just go one level deeper.
• If at a leaf node,
    • go to the next leaf
    • if this moves to a non-sibling, jump up

Bypassing
• What if we wish to skip an entire subtree (rooted at some internal node) during the tree traversal?

BYPASS(a, i, L, k)
    for j := i to 1
        if a_j < k
            a_j := a_j + 1
            return (a, j)
    return (a, 0)
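In Python the bypass step is the leaf case of NextVertex applied at prefix length i instead of L. A minimal sketch; the unused parameter `L` is kept only to mirror the slide's signature.

```python
def bypass(a, i, L, k):
    """Skip the subtree under prefix a[:i]; return the next vertex outside it."""
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1           # next sibling of the prefix of length j
            return a, j
    return a, 0                     # nothing left to visit

print(bypass([1, 1, 1], 1, 3, 4))   # ([2, 1, 1], 1): skip everything under "1"
print(bypass([1, 4, 1], 2, 3, 4))   # ([2, 4, 1], 1): "14" has no right sibling
print(bypass([4, 4, 2], 2, 3, 4))   # ([4, 4, 2], 0): traversal is over
```

Digits to the right of position j are left stale on purpose; NextVertex resets them to 1 as it descends again.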

Brute force solution for the motif finding problem
BruteForceMotifSearchAgain(DNA, t, n, l)
    s := (1,1,…,1)
    bestScore := Score(s, DNA)
    bestMotif := s
    while forever
        s := NextLeaf(s, t, n-l+1)
        if s = (1,1,…,1)
            return bestMotif
        if Score(s, DNA) > bestScore
            bestScore := Score(s, DNA)
            bestMotif := (s_1, s_2, …, s_t)

Running time: $O(l\,(n-l+1)^t)$
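A runnable sketch of the brute-force search. To keep the block self-contained it swaps in `itertools.product` for the NextLeaf loop (same set of leaves, same order), uses 0-based start positions, and assumes all sequences have the same length; the toy input is invented for illustration.

```python
from collections import Counter
from itertools import product

def score(starts, dna, l):
    """Profile score of the alignment picked out by the 0-based starts."""
    rows = [seq[s:s + l] for s, seq in zip(starts, dna)]
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*rows))

def brute_force_motif_search(dna, l):
    """Try every vector of start positions and keep the best-scoring one."""
    n = len(dna[0])
    best_score, best_motif = -1, None
    for starts in product(range(n - l + 1), repeat=len(dna)):
        current = score(starts, dna, l)
        if current > best_score:
            best_score, best_motif = current, starts
    return best_motif, best_score

# "acgt" is planted at offsets 2, 0 and 1.
dna = ["ttacgt", "acgtcc", "gacgta"]
print(brute_force_motif_search(dna, 4))  # ((2, 0, 1), 12)
```

With t sequences the loop body runs (n-l+1)^t times, which is why this only works on toy inputs.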

Can we do better?
• Sets of $s = (s_1, s_2, \ldots, s_t)$ may have a weak profile for the first i positions $(s_1, s_2, \ldots, s_i)$
• Every row of the alignment can add at most l to Score
• Optimism: suppose all subsequent (t − i) positions $(s_{i+1}, \ldots, s_t)$ add the maximum possible $(t - i) \cdot l$ to Score(s, i, DNA)
• If Score(s, i, DNA) + $(t - i) \cdot l$ < BestScore, it makes no sense to search the vertices of the current subtree
• Use ByPass()
• “Branch and bound” strategy
• This saves us from looking at $(n - l + 1)^{t-i}$ leaves

Pseudocode for branch and bound motif search
BranchAndBoundMotifSearch(DNA, t, n, l)
    s := (1,…,1)
    bestScore := 0
    i := 1
    while i > 0
        if i < t
            optimisticScore := Score(s, i, DNA) + (t – i) * l
            if optimisticScore < bestScore
                (s, i) := Bypass(s, i, t, n-l+1)
            else
                (s, i) := NextVertex(s, i, t, n-l+1)
        else
            if Score(s, DNA) > bestScore
                bestScore := Score(s, DNA)
                bestMotif := (s_1, s_2, s_3, …, s_t)
            (s, i) := NextVertex(s, i, t, n-l+1)
    return bestMotif
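The pseudocode can be run as follows. This is a sketch, not a tuned implementation: positions in `s` are 1-based to match the slides, `score(s, i, dna, l)` is the partial score over the first i rows, all sequences are assumed to have the same length, and the toy input is invented.

```python
from collections import Counter

def score(s, i, dna, l):
    """Partial score: profile score of the first i rows (s is 1-based)."""
    rows = [dna[r][s[r] - 1:s[r] - 1 + l] for r in range(i)]
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*rows))

def next_vertex(a, i, L, k):
    a = list(a)
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def bypass(a, i, L, k):
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def branch_and_bound_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    k = n - l + 1                      # number of possible start positions
    s, i = [1] * t, 1
    best_score, best_motif = 0, None
    while i > 0:
        if i < t:
            # Each of the remaining t - i rows can add at most l to the score.
            optimistic = score(s, i, dna, l) + (t - i) * l
            if optimistic < best_score:
                s, i = bypass(s, i, t, k)       # prune the whole subtree
            else:
                s, i = next_vertex(s, i, t, k)
        else:
            current = score(s, t, dna, l)
            if current > best_score:
                best_score, best_motif = current, tuple(s)
            s, i = next_vertex(s, i, t, k)
    return best_motif, best_score

dna = ["ttacgt", "acgtcc", "gacgta"]
print(branch_and_bound_motif_search(dna, 4))  # ((3, 1, 2), 12)
```

The bound is optimistic, so pruning never discards a strictly better alignment; in the worst case, though, nothing is pruned and the running time matches brute force.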

The median string problem
• Enumerate the $4^l$ strings v
• For each v, compute TotalDist(v, DNA)
• This requires a linear scan of DNA, i.e., O(nt)
• Overall: $O(nt\,4^l)$
• Improvement by branch and bound?
• During enumeration of l-mers, suppose we are at some prefix v′ and find that TotalDist(v′, DNA) > BestDistanceSoFar
• Why enumerate further? Every l-mer that extends v′ is at least as far from DNA as v′ itself

Bounded median string search
BranchAndBoundMedianStringSearch(DNA, t, n, l)
    s := (1,…,1)
    bestDistance := ∞
    i := 1
    while i > 0
        if i < l
            prefix := string corresponding to the first i nucleotides of s
            optimisticDistance := TotalDistance(prefix, DNA)
            if optimisticDistance > bestDistance
                (s, i) := Bypass(s, i, l, 4)
            else
                (s, i) := NextVertex(s, i, l, 4)
        else
            word := nucleotide string corresponding to s
            if TotalDistance(word, DNA) < bestDistance
                bestDistance := TotalDistance(word, DNA)
                bestWord := word
            (s, i) := NextVertex(s, i, l, 4)
    return bestWord
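A runnable sketch of the bounded median string search. The digit-to-base mapping (1 → a, 2 → c, 3 → g, 4 → t) and the helper `to_word` are assumptions made for this example, as is the toy input.

```python
ALPHABET = "acgt"                      # digit d in 1..4 maps to ALPHABET[d - 1]

def hamming(v, w):
    return sum(a != b for a, b in zip(v, w))

def total_dist(v, dna):
    l = len(v)
    return sum(min(hamming(v, seq[p:p + l]) for p in range(len(seq) - l + 1))
               for seq in dna)

def next_vertex(a, i, L, k):
    a = list(a)
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def bypass(a, i, L, k):
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def to_word(s, i):
    """Nucleotide string for the first i digits of s."""
    return "".join(ALPHABET[d - 1] for d in s[:i])

def branch_and_bound_median_string(dna, l):
    s, i = [1] * l, 1
    best_distance, best_word = float("inf"), None
    while i > 0:
        if i < l:
            # A prefix is never farther from DNA than any word extending it.
            if total_dist(to_word(s, i), dna) > best_distance:
                s, i = bypass(s, i, l, 4)
            else:
                s, i = next_vertex(s, i, l, 4)
        else:
            word = to_word(s, l)
            distance = total_dist(word, dna)
            if distance < best_distance:
                best_distance, best_word = distance, word
            s, i = next_vertex(s, i, l, 4)
    return best_word, best_distance

dna = ["ttacgt", "acgtcc", "gacgta"]
print(branch_and_bound_median_string(dna, 4))  # ('acgt', 0)
```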

Greedy Algorithms

A greedy approach to the motif finding problem
• Given t sequences of length n each, find a profile matrix of length l
• Enumerative approach: $O(l\,n^t)$
• Impractical
• Instead, consider a more practical algorithm called “GREEDYMOTIFSEARCH”

Greedy Motif Search
• Find the two closest l-mers in sequences 1 and 2 and form a 2 × l alignment matrix with Score(s, 2, DNA)
• At each of the following t − 2 iterations, find a “best” l-mer in sequence i from the perspective of the already constructed (i − 1) × l alignment matrix for the first (i − 1) sequences
• In other words, find an l-mer in sequence i maximizing Score(s, i, DNA) under the assumption that the first (i − 1) l-mers have already been chosen
• Sacrifices the optimal solution for speed: in fact, the bulk of the time is spent locating the first 2 l-mers

Greedy Motif Search pseudocode
GREEDYMOTIFSEARCH(DNA, t, n, l)
    bestMotif := (1,…,1)
    s := (1,…,1)
    for s_1 := 1 to n-l+1
        for s_2 := 1 to n-l+1
            if Score(s, 2, DNA) > Score(bestMotif, 2, DNA)
                bestMotif_1 := s_1
                bestMotif_2 := s_2
    s_1 := bestMotif_1; s_2 := bestMotif_2
    for i := 3 to t
        for s_i := 1 to n-l+1
            if Score(s, i, DNA) > Score(bestMotif, i, DNA)
                bestMotif_i := s_i
        s_i := bestMotif_i
    return bestMotif
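The greedy procedure can be sketched in Python as follows. Assumptions: 0-based start positions, equal-length sequences, and an invented toy input; `max` keeps the first best candidate, which matches the strict `>` comparisons in the pseudocode.

```python
from collections import Counter

def score(starts, dna, l):
    """Profile score of the first len(starts) rows (0-based starts)."""
    rows = [seq[s:s + l] for s, seq in zip(starts, dna)]
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*rows))

def greedy_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    k = n - l + 1
    # Step 1: exhaustive search over all pairs of starts in sequences 1 and 2.
    pairs = [(s1, s2) for s1 in range(k) for s2 in range(k)]
    motif = list(max(pairs, key=lambda pair: score(pair, dna, l)))
    # Step 2: add one sequence at a time, keeping earlier choices fixed.
    for i in range(2, t):
        best_start = max(range(k), key=lambda p: score(motif + [p], dna, l))
        motif.append(best_start)
    return motif

dna = ["ttacgt", "acgtcc", "gacgta"]
motif = greedy_motif_search(dna, 4)
print(motif, score(motif, dna, 4))  # [2, 0, 1] 12
```

On this toy input the greedy answer happens to be optimal; in general it is not guaranteed to be, since a poor first pair is never revisited.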

A digression
• The score of a profile matrix looks only at the “majority” base in each column, not at the entire distribution
• There is also the issue of non-uniform “background” frequencies of bases in the genome
• A better “score” for a profile matrix?

Information Content
• First convert a “profile matrix” to a “position weight matrix” or PWM
• Convert frequencies to probabilities
• PWM W: $W_{\beta k}$ = frequency of base $\beta$ at position k
• $q_\beta$ = frequency of base $\beta$ by chance
• Information content of W: $\sum_k \sum_{\beta \in \{A,C,G,T\}} W_{\beta k} \log \dfrac{W_{\beta k}}{q_\beta}$

Information Content
• If $W_{\beta k}$ is always equal to $q_\beta$, i.e., if W is similar to random sequence, the information content of W is 0.
• If W is different from q, the information content is high.
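Both bullet points can be checked numerically. The sketch below assumes the information content is the sum over positions and bases of W log(W / q), takes the logarithm base 2 (so the result is in bits; the slides do not fix the base), and treats terms with W = 0 as contributing 0. The function names are illustrative.

```python
import math

def profile_to_pwm(profile):
    """Turn per-column base counts into per-column probabilities."""
    pwm = []
    for column in profile:
        total = sum(column.values())
        pwm.append({base: count / total for base, count in column.items()})
    return pwm

def information_content(pwm, background):
    """Sum over columns and bases of W * log2(W / q); 0 * log 0 counts as 0."""
    total = 0.0
    for column in pwm:
        for base, w in column.items():
            if w > 0:
                total += w * math.log2(w / background[base])
    return total

background = {base: 0.25 for base in "acgt"}        # uniform background (assumed)
sharp_profile = [{"a": 5, "c": 0, "g": 0, "t": 0},   # two fully conserved columns
                 {"a": 0, "c": 5, "g": 0, "t": 0}]
ic_sharp = information_content(profile_to_pwm(sharp_profile), background)
ic_flat = information_content([dict(background)], background)
print(ic_sharp, ic_flat)  # 4.0 0.0
```

A fully conserved column contributes 2 bits under a uniform background, and a column that matches the background contributes nothing.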

Greedy Motif Search
• Can be trivially modified to use “Information Content” as the score
• At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs
• “Beam search”
• Use statistical criteria to evaluate the significance of Information Content
• The program “CONSENSUS” from the Stormo lab.
