170 likes | 254 Views
Implementation of Planted Motif Search Algorithms PMS1 and PMS2. Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering University of Connecticut, Storrs, CT. Introduction. General Problem: Multiple Sequence Comparison Biological Basis DNA structure/function
E N D
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering University of Connecticut, Storrs, CT
Introduction • General Problem: Multiple Sequence Comparison • Biological Basis • DNA structure/function • Sequence of nucleotides • Modeled as strings • Genes code for proteins • Structure Function • Evolution – Result of DNA mutations and selective pressures Image credit: www.britannica.com
Introduction • Goals of Multiple Sequence Comparison • Deduce evolutionary relationships. • Protein and gene function studies. • Find transcription factor/ regulatory protein binding sites. • Approaches: • Find common subpatterns and deduce a biological relationship. • Find common subpatterns between DNA sequences with a known biological relationship.
Planted Motif Search • Motifs- Common functional subsequences in a set of biological sequences • Planted (l,d) motif search problem: • Input are n strings (S1, S2, … , Sn) of length m and two integers l and d. Find all strings x such that |x| = l and every input string contains at least one variant of x at a Hamming distance of at most d. • Primary applications: Finding transcription factor binding sites; drug target identification
Algorithm PMS1 • Generate the set of all l-mers in each input sequence. Let Ci correspond to the l-mers of Si. • For each l-mer u in Ci (1 <i<n), generate all l-mers v such that v is at a Hamming distance of at most d from u (v is a “neighbor” of u). Let Li correspond to all l-mers u and v from input sequence Si. • Alphabetically sort each set of neighbors Li and eliminate any duplicates. • Merge and intersect all sets Li to find the l-mer that appears in each neighborhood. Such l-mers constitute the motifs in the input sequences.
Algorithm PMS2 • Algorithm PMS2 exploits these observations • If M occurs in each input sequence, then at least l-k+1 length-k substrings of M occur in each input sequence. • In each input sequence there must be at least one position ij such that a k-mer of M occurs at each position ij – ij+l – k .
Algorithm PMS2 • Use a modified PMS1 to solve the planted (d+c, d)-motif problem. Let R contain the (d+c)-motifs. • Find all of the occurrences of R in an arbitrary input sequence Sj,. Let Li contain the (d+c)-motifs of R with variants starting at position i of Sj. • For each position i in Sj • A is the l-mer of Sj starting at position i • M1 and M2 are members of Li and Li+l – (d+c). • If the last 2(d+c) – l characters of M1 are equal to the first 2(d+c) – l characters of M2, form an l-mer B by appending the last l – (d+c) characters of M2 to M1. • If dH(A,B) <d, add B to a list of candidates C. Once the list of candidates is complete, check if each candidate is a motif.
Results • n =20and m = 600; arbitrary motif inserted in each input sequence • Each implementation gave the correct planted motif for each (l,d) case • PMS1 was faster than PMS2 for the challenging instances (9,2) and (11,3) • Otherwise, PMS2 could be faster, depending on the value of c • Low values of c lead to a high number of (d+c,d)-motifs, which leads to a high number of candidate strings • Conslusions • PMS1 better-suited for challenge problems • PMS2 better suited for larger l Runtimes, in seconds, of algorithms PMS1 and PMS2
Minimization of Consensus Sequences • Consensus Sequence • An expression that can be used to describe two or more sequences • Two forms: • {c1, c2, … ,cn} – Presence of one of the given characters c in the list • {i1, i2, … in}c – Character c may occur any number of times ik • Examples: • Merging abcde, abccde, abcdee, and abccdee gives ab{1,2}cd{1,2}e • Merging agtgc and actgc gives a{c,g}tgc • Problem Statement: Output a minimum number of consensus sequences for a given set of input sequences.
Minimization of Consensus Sequences • Algorithm • To start, all input sequences are “alive” • An arbitrary alive sequence S is chosen and compared with every other alive sequence T to check if they can be merged. • Dynamic programming is used to optimally align S and T • The optimal alignments of S and T will have loops corresponding to insertions, deletions, and replacements. • Merging may occur only if all loops can be resolved • All mismatches can be resolved • Insertions and deletions can be resolved only if there is a match of the inserted/deleted character to the left or right of the loop. • If S and T can be merged, a consensus sequence is generated and added to the list of “alive” sequences. S and T are killed. • This process continues until no two alive sequences can be merged. At that point, all remaining alive input and consensus sequences are output.
Summary • Planted Motif Search Problem: Find an l-mer that differs in d or less places from at least one l-mer in each input sequence • Algorithms PMS1 and PMS2 are based on a model that generates the neighborhood of every input sequence and intersects them to find the motifs • PMS1 is best suited for challenge problems; use PMS2 for larger l • Future work will include the minimization of consensus sequences (regular expressions)
Acknowledgements • Special thanks to: • Sanguthevar Rajasekaran • National Science Foundation • University of Connecticut Department of Computer Science and Engineering
Levenshtein Distance • Formal definition: The lowest number of edit operations, consisting of insertion (I), deletion (D), and replacement (R), necessary to convert one string to another. • Algorithm • Let Di,j be the edit distance of S1(1…i) and S2(1…j). • Add a blank space to the beginning of each string and align the strings along the edges of a matrix. • By definition, Di,0= i and D0,j= j. • Recurrence relation: Di,j= min(Di-1,j+ 1, Di,j-1+ 1, Di-1,j-1 + ti,j ) • ti,j = 0 if S1[i] = S2[j] , 1 otherwise (substitution) • By definition of Di,j, Dn,m, where n = |S1| and m = |S2|, is the edit distance of S1 and S2
Example • S1 = vintner, S2 = writers Adapted from Algorithms on Strings, Trees, and Sequences by Dan Gusfield, 1999. • Value in bottom-right cell gives Levenshtein distance (5)
Optimal Alignment from Levenshtein Distance • Working from the bottom right of the matrix, insert pointers • Set a pointer from cell (i,j) to • Cell (i-1, j) if Di,j = Di-1,j + 1 • Corresponds to a deletion of S1(i) from S1 • Cell (i, j-1) if Di,j = Di,j-1 + 1 • Corresponds to an insertion of S2(j) into S1 • Cell (i-1, j-1) if Di,j = Di-1,j-1 + ti,j • Corresponds to match (t=0) or replacement (t=1) • Follow the pointers from Dn,m to D(0,0) to get optimal alignment • Some cells may have two pointers, in which case more than one optimal alignment exists • 3 optimal alignments in the example: