Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut
Problem 1 Definition Input: n sequences of length m each, and integers l and d such that l << m and d < l. Each input sequence contains an occurrence of a motif M of length l, i.e., an l-mer at a Hamming distance of d from M. Output: M. This problem is known as the Planted (l, d) Motif Problem.
Problem 2 Definition Input: a database DB of n sequences, and integers l, d, and q. Output: all the patterns of length l that occur in at least q of the n sequences of DB. A pattern u is considered an occurrence of another pattern v if the edit distance between u and v is at most d.
Problem 1: State of the Art Two kinds of algorithms are known: approximate and exact. WINNOWER (Pevzner and Sze [2000]) and PROJECTION (Buhler and Tompa [2001]) are approximate algorithms. MITRA (Eskin and Pevzner [2002]) is an exact algorithm.
A Probabilistic Analysis Problem 1 is complicated by the fact that, for a given value of l, the higher the value of d, the higher the expected number of motifs that occur by random chance. For instance, when n=20, m=600, l=9, d=2, the expected number of spurious motifs is 1.6. On the other hand, for n=20, m=600, l=10, d=2, the expected number of spurious motifs is only $6.1 \times 10^{-8}$.
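The two figures above can be reproduced with a commonly used estimate for random sequences. This is a minimal sketch under assumptions not stated on the slide: the sequences are i.i.d. with uniformly distributed letters over a 4-letter alphabet, and the overlapping windows of a sequence are treated as independent. The function name `expected_spurious_motifs` is hypothetical.

```python
from math import comb

def expected_spurious_motifs(n, m, l, d, alphabet_size=4):
    # Probability that one random l-mer lies within Hamming distance d
    # of a fixed l-mer (uniform i.i.d. letters assumed).
    p = sum(comb(l, i) * (alphabet_size - 1) ** i
            for i in range(d + 1)) / alphabet_size ** l
    # Probability that a random sequence of length m contains at least one
    # such l-mer; its m - l + 1 windows are treated as independent.
    q = 1.0 - (1.0 - p) ** (m - l + 1)
    # Sum over all alphabet_size**l candidate motifs, each of which must
    # occur in all n sequences.
    return alphabet_size ** l * q ** n

print(expected_spurious_motifs(20, 600, 9, 2))
print(expected_spurious_motifs(20, 600, 10, 2))
```

Under these assumptions the two calls evaluate to roughly 1.6 and roughly 6.1 × 10^-8, in line with the numbers quoted above.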
WINNOWER • Generate all l-mers from the input sequences. The number of such l-mers is O(nm). • Construct a graph G(V,E). Each l-mer is a node in G. Two nodes are connected if the Hamming distance between them is at most 2d. • Find all cliques in the graph. Process these cliques to identify M.
WINNOWER Details Pevzner and Sze observe that the graph G constructed above is 'almost random' and is multipartite. They use the notion of an extendable clique. If Q is any clique, node u is called a neighbor of Q if the nodes in Q together with u also form a clique. A clique is called extendable if it has at least one neighbor in every part of the multipartite graph G. The algorithm WINNOWER is based on the observation that every edge in a maximal n-clique belongs to at least $\binom{n-2}{k-2}$ extendable cliques of size k. This observation is used to eliminate edges.
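The graph construction step of WINNOWER can be sketched as follows. This is an illustrative sketch only: it covers graph construction, not the clique search or the edge elimination, and it assumes that the multipartite structure means no edges are placed between l-mers of the same sequence. The name `build_winnower_graph` and the (sequence index, position) node encoding are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def build_winnower_graph(seqs, l, d):
    # One node per l-mer, identified by (sequence index, start position).
    nodes = [(i, p) for i, s in enumerate(seqs)
             for p in range(len(s) - l + 1)]
    edges = set()
    for (i, p), (j, q) in combinations(nodes, 2):
        if i == j:
            continue  # assumption: one part per sequence, no edges inside a part
        if hamming(seqs[i][p:p + l], seqs[j][q:q + l]) <= 2 * d:
            edges.add(((i, p), (j, q)))
    return nodes, edges

nodes, edges = build_winnower_graph(["AAAA", "AAAT", "TTTT"], l=4, d=1)
print(edges)
```

The threshold 2d reflects that two instances of M, each within distance d of M, are within distance 2d of each other, so the planted instances form a clique with one node per sequence.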
PROJECTION Let C be the collection of all l-mers in the input. Project these l-mers along k randomly chosen columns. (k is typically 7). Group the k-mers such that equal k-mers are in the same group. If a group is of size greater than a threshold s (s is typically 3), then M is likely to have this k-mer. The rest of M is computed using maximum likelihood estimates.
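The projection and grouping step can be sketched as below. It is a rough illustration, not the PROJECTION implementation: it performs a single random projection, uses a fixed random seed by default for repeatability, and omits the maximum likelihood refinement entirely. The function name `projection_buckets` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def projection_buckets(seqs, l, k, s, rng=None):
    if rng is None:
        rng = random.Random(0)  # fixed seed, an assumption for repeatability
    # Choose k of the l columns at random.
    cols = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for i, seq in enumerate(seqs):
        for p in range(len(seq) - l + 1):
            lmer = seq[p:p + l]
            # Project the l-mer onto the chosen columns to get a k-mer.
            key = "".join(lmer[c] for c in cols)
            buckets[key].append((i, p))
    # Keep only the groups whose size exceeds the threshold s.
    large = {key: occ for key, occ in buckets.items() if len(occ) > s}
    return cols, large

cols, large = projection_buckets(["ACGTACGT"] * 4, l=4, k=2, s=3)
print(cols, sorted(large))
```

Each retained group lists the (sequence index, position) pairs of the l-mers that agree on the chosen columns; these are the candidates that a refinement step would start from.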
MITRA MITRA is based on WINNOWER and uses pairwise similarity information. It uses a mismatch tree data structure and splits the space of all possible patterns into disjoint subspaces, each consisting of the patterns that start with a given prefix. Pruning is applied in each subspace.
Pattern Branching One way of solving the planted motif search problem is to start from each l-mer in the input, search the neighbors of this l-mer, score them appropriately, and output the best scoring neighbor. Pattern Branching examines only a selected subset of the neighbors of any l-mer u of the input and hence is more efficient. For any l-mer u, let $D_i(u)$ stand for the set of neighbors of u that are at a Hamming distance of i. For any input sequence $S_j$, let $d(u,S_j)$ denote the minimum Hamming distance between u and any l-mer of $S_j$. Let $d(u,S)=\sum_{j=1}^{n} d(u,S_j)$.
Pattern Branching Contd… For any l-mer u in the input, let BestNeighbor(u) stand for the neighbor v in $D_1(u)$ whose distance d(v,S) is minimum among all the elements of $D_1(u)$. The PatternBranching algorithm starts from a u and identifies $u_1$ = BestNeighbor(u); then it identifies $u_2$ = BestNeighbor($u_1$); and so on. It finally outputs $u_d$. The best $u_d$ from among all possible u's is output.
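The definitions above translate fairly directly into code. The following is a minimal sketch of the procedure as described on these two slides, assuming a DNA alphabet and breaking ties by first occurrence; the function names are hypothetical and no attempt is made at efficiency.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dist_to_seq(u, seq):
    # d(u, S_j): minimum Hamming distance between u and any l-mer of seq.
    l = len(u)
    return min(hamming(u, seq[p:p + l]) for p in range(len(seq) - l + 1))

def total_dist(u, seqs):
    # d(u, S): sum of d(u, S_j) over all input sequences.
    return sum(dist_to_seq(u, s) for s in seqs)

def best_neighbor(u, seqs, alphabet="ACGT"):
    # D_1(u): all l-mers at Hamming distance exactly 1 from u.
    d1 = [u[:i] + c + u[i + 1:]
          for i in range(len(u)) for c in alphabet if c != u[i]]
    return min(d1, key=lambda v: total_dist(v, seqs))

def pattern_branching(seqs, l, d, alphabet="ACGT"):
    best = None
    for seq in seqs:
        for p in range(len(seq) - l + 1):
            u = seq[p:p + l]
            for _ in range(d):          # u -> u_1 -> ... -> u_d
                u = best_neighbor(u, seqs, alphabet)
            if best is None or total_dist(u, seqs) < total_dist(best, seqs):
                best = u
    return best

print(pattern_branching(["CCGT", "AGGT", "ACTT"], l=4, d=1))
```

In this toy input every sequence is a single 4-mer at distance 1 from ACGT, so each starting point branches to ACGT in one step.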
A Simple Algorithm 1) Form all possible l-mers from the input sequences. Let C be this collection. Let C’ be the collection of l-mers in the first input sequence. 2) For every u in C’, generate all l-mers that are at a Hamming distance of d from u. Let C’’ be the collection of these l-mers. Note that C’’ contains M. 3) For every pair of l-mers (u, v) with u in C and v in C’’, compute the Hamming distance between u and v. Output that l-mer of C’’ that has a neighbor (i.e., an l-mer at a Hamming distance of d) in each one of the n input sequences.
A Simple Algorithm Contd… The run time of the above algorithm is $O\left(\binom{l}{d}\,|\Sigma|^{d}\, n m^{2} l\right)$.
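The three steps of the simple algorithm can be sketched as follows. This is a direct, unoptimized transcription under the assumption of a DNA alphabet; the helper names are hypothetical, and all qualifying l-mers are returned rather than a single one.

```python
from itertools import combinations, product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def neighbors_at_distance(u, d, alphabet="ACGT"):
    """All strings at Hamming distance exactly d from u."""
    out = set()
    for positions in combinations(range(len(u)), d):
        # At each chosen position, substitute any letter other than the original.
        choices = [[c for c in alphabet if c != u[p]] for p in positions]
        for repl in product(*choices):
            v = list(u)
            for p, c in zip(positions, repl):
                v[p] = c
            out.add("".join(v))
    return out

def lmers(seq, l):
    return [seq[p:p + l] for p in range(len(seq) - l + 1)]

def simple_motif_search(seqs, l, d, alphabet="ACGT"):
    # Step 2: C'' = all l-mers at distance d from an l-mer of the first sequence.
    candidates = set()
    for u in lmers(seqs[0], l):
        candidates |= neighbors_at_distance(u, d, alphabet)
    # Step 3: keep candidates with a distance-d neighbor in every sequence.
    return [v for v in sorted(candidates)
            if all(any(hamming(u, v) == d for u in lmers(s, l)) for s in seqs)]

print(simple_motif_search(["CCGT", "AGGT", "ACTT"], l=4, d=1))
```

The nested loops in step 3 are where the $n m^{2} l$ factor of the run time comes from: every candidate generated from the first sequence is compared against every l-mer of the input.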
PMS1 1) Generate all possible l-mers from each of the n input sequences. Let $C_i$ be the collection of l-mers from the i-th sequence. 2) For each $C_i$ and each u in $C_i$, generate all l-mers v such that u and v are at a Hamming distance of d. Let $C_i'$ be the collection of these neighbors of $C_i$. 3) Sort all the l-mers in every $C_i'$. Let $L_i$ be the sorted list corresponding to $C_i'$. 4) Merge all the $L_i$'s and output the generated (in step 2) l-mers that occur in all the $L_i$'s.
PMS1 Contd… The run time of PMS1 is $O\left(nm\binom{l}{d}\,|\Sigma|^{d}\,\frac{l}{w}\right)$. (Here w is the word length of the computer. Radix sort is used.)
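The four steps of PMS1 can be sketched as below. This is an illustrative sketch rather than the authors' implementation: it assumes a DNA alphabet, keeps l-mers as Python strings, and uses Python's built-in sort in place of radix sort on word-packed l-mers. The function names are hypothetical.

```python
from itertools import combinations, product

def neighbors_at_distance(u, d, alphabet="ACGT"):
    """All strings at Hamming distance exactly d from u."""
    out = set()
    for positions in combinations(range(len(u)), d):
        choices = [[c for c in alphabet if c != u[p]] for p in positions]
        for repl in product(*choices):
            v = list(u)
            for p, c in zip(positions, repl):
                v[p] = c
            out.add("".join(v))
    return out

def merge_intersect(a, b):
    """Intersect two sorted, duplicate-free lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def pms1(seqs, l, d, alphabet="ACGT"):
    sorted_lists = []
    for s in seqs:
        neigh = set()                      # neighbors of the l-mers of s
        for p in range(len(s) - l + 1):
            neigh |= neighbors_at_distance(s[p:p + l], d, alphabet)
        sorted_lists.append(sorted(neigh))  # L_i (built-in sort, not radix sort)
    common = sorted_lists[0]
    for lst in sorted_lists[1:]:
        common = merge_intersect(common, lst)
    return common

print(pms1(["CCGT", "AGGT", "ACTT"], l=4, d=1))
```

Compared with the simple algorithm, no pairwise distance computations are needed: the work is dominated by generating and sorting the neighbor lists.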
PMS2 Note that if M occurs in every input sequence, then every substring of M also occurs in every input sequence. In particular, there are at least l - k + 1 k-mers (for d <= k <= l) such that each of these occurs in every input sequence at a Hamming distance of at most d. Let Q be the collection of k-mers that can be formed out of M. There are l - k + 1 k-mers in Q. Each one of these k-mers will be present in each input sequence at a Hamming distance of at most d.
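The observation can be checked on a toy instance. The sketch below is only a sanity check of the property, not the PMS2 algorithm itself; the motif, the sequences, and the function names are made up for illustration, with each sequence containing a variant of M at Hamming distance 1.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dist_to_seq(u, seq):
    # Minimum Hamming distance between u and any substring of seq of length |u|.
    k = len(u)
    return min(hamming(u, seq[p:p + k]) for p in range(len(seq) - k + 1))

def kmers_of(motif, k):
    # Q: the l - k + 1 k-mers that can be formed out of the motif.
    return [motif[p:p + k] for p in range(len(motif) - k + 1)]

M, d, k = "ACGTAC", 1, 4
# Planted variants (hypothetical data): ACCTAC, ACGTAA, ACGGAC.
seqs = ["GGACCTACGG", "TACGTAAT", "ACGGACTT"]

Q = kmers_of(M, k)
print(Q)
print(all(dist_to_seq(q, s) <= d for q in Q for s in seqs))
```

The check succeeds because the k-mer of an occurrence that is aligned with a k-mer of M can differ from it in no more positions than the whole occurrence differs from M.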
PMS3 This algorithm enables one to handle large values of d. Let d’=d/2. Let M be the motif of interest with |M|=l=2l’ for some integer l’. Let M’ refer to the first half of M and M’’ to the second half. We know that M occurs in every input sequence. Let S be an arbitrary input sequence and let p be the occurrence of M in S. If p’ and p’’ are the two halves of p, then, either (1) the hamming distance between M’ and p’ is at most d’ or (2) the hamming distance between M’’ and p’’ is at most d’.
PMS3 Contd… Also, note that in every input sequence either M’ occurs with a Hamming distance of at most d’ or M’’ occurs with a Hamming distance of at most d’. As a result, at least one of M’ and M’’ occurs with a Hamming distance of at most d’ in at least n/2 of the input sequences. PMS3 exploits these observations.
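The reasoning behind the first observation is a pigeonhole argument, written out here as a short derivation. It assumes d is even, so that d = 2d’, and it assumes the occurrence p is within Hamming distance d of M; the notation $d_H$ for Hamming distance is introduced only for this sketch.

```latex
\begin{align*}
d_H(M, p) &= d_H(M', p') + d_H(M'', p'') \le d = 2d' \\
\text{If } d_H(M', p') > d' \text{ and } d_H(M'', p'') > d'
  &\text{, then } d_H(M, p) > 2d' = d \text{, a contradiction.} \\
\text{Hence } \min\{\, d_H(M', p'),\; d_H(M'', p'') \,\} &\le d'.
\end{align*}
```

The n/2 bound follows the same way: each of the n sequences satisfies at least one of the two conditions, so one of the two conditions must hold for at least n/2 of them.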
A Comparison with MITRA For l=11 and d=2, MITRA takes one minute whereas PMS2 takes around a second. For l=12 and d=3, two versions of MITRA take one minute and four minutes, respectively. PMS2 takes 15.53 seconds. For l=14 and d=4, two versions of MITRA take 4 minutes and 10 minutes, respectively. PMS2 takes 226.83 seconds.
Known Algorithms for Problem 2 The algorithm of Sagot [1998] runs in time $O(n^2 m l^d |\Sigma|^d)$ and is based on generalized suffix trees. The space used is $O(n^2 m / w)$, where w is the word length of the computer. This algorithm builds a suffix tree on the given sequences in O(nm) time using O(nm) space. If u is any l-mer present in the input, there are $O(l^d (|\Sigma|-1)^d)$ possible neighbors for u. Any of these neighbors could potentially be a motif of interest. Since there are O(nm) l-mers in the input, the number of such neighbors is $O(nm\, l^d (|\Sigma|-1)^d)$.
Sagot’s Algorithm Contd… This algorithm, for each such neighbor v, walks through the tree to check if v is a possible answer. This walking step is referred to as 'spelling'. The spelling operation takes a total of $O(n^2 m l^d (|\Sigma|-1)^d)$ time using an additional O(nm) space. When employed for solving Problem 2, the same algorithm takes $O(n^2 m l^d |\Sigma|^d)$ time. The algorithm of Adebiyi and Kaufmann [2002] takes an expected $O(nm + d(nm)^{1.9} \log nm)$ time.
An Algorithm Similar to PMS1 The basic idea behind the algorithm is: We generate all possible l-mers in the database. There are at most mn such l-mers and these are the patterns of interest. For each such l-mer we want to determine if it occurs in at least q of the input sequences. Let u be one of the above l-mers. If v is a string such that the edit distance between u and v is at most d, then we say v is a neighbor of u. We generate all the neighbors of u. For each neighbor v of u we determine a list of input sequences in which v is present. These lists (over all possible neighbors of u) are then merged to obtain a list of input sequences in which u occurs (within an edit distance of d).
New Algorithm Contd… The above algorithm runs in time $O(n^2 m l^d |\Sigma|^d)$. The space used is $O(nmd + l^d |\Sigma|^d)$, which is less than that of prior algorithms. Only arrays are used in the new algorithm. The underlying constant is small, and hence the algorithm will potentially perform better in practice than Sagot’s algorithm.
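The basic idea of the algorithm for Problem 2 can be sketched as follows. This is a brute-force illustration under assumptions: a DNA alphabet, neighbors generated by repeatedly applying single insertions, deletions, and substitutions, and Python sets in place of the array-based lists the slides mention. The function names are hypothetical, and the sketch makes no attempt to meet the stated time or space bounds.

```python
def edit_neighbors(u, d, alphabet="ACGT"):
    """All strings within edit distance d of u (breadth-first over single edits)."""
    seen = {u}
    frontier = {u}
    for _ in range(d):
        nxt = set()
        for w in frontier:
            for i in range(len(w) + 1):
                for c in alphabet:
                    nxt.add(w[:i] + c + w[i:])            # insertion
                if i < len(w):
                    nxt.add(w[:i] + w[i + 1:])            # deletion
                    for c in alphabet:
                        nxt.add(w[:i] + c + w[i + 1:])    # substitution
        frontier = nxt - seen
        seen |= frontier
    return seen

def frequent_patterns(db, l, d, q, alphabet="ACGT"):
    # Candidate patterns: every l-mer present in the database.
    patterns = {s[p:p + l] for s in db for p in range(len(s) - l + 1)}
    result = []
    for u in sorted(patterns):
        neigh = edit_neighbors(u, d, alphabet)
        # Merge the per-neighbor occurrence lists into one set of sequences.
        # Empty strings are skipped; they cannot arise when d < l.
        hits = {i for i, s in enumerate(db) if any(v and v in s for v in neigh)}
        if len(hits) >= q:
            result.append(u)
    return result

print(frequent_patterns(["ACGT", "ACT", "GGGG"], l=3, d=1, q=2))
```

In this toy database the patterns ACG, ACT, and CGT each occur, within edit distance 1, in the first two sequences, while GGG occurs only in the third and is therefore not reported for q = 2.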