Identification of Distinguishing Motifs. Zhanyong WANG (Master Degree Student), Dept. of Computer Science, City University of Hong Kong. E-mail: zhyong@cs.cityu.edu.hk. Joint work with WangSen FENG and Lusheng WANG.
Outline • The Definitions of Problems • Applications • Previous work • Our work • Algorithm for Single Group • Algorithm for Two Groups • Simulation Results for Single Group • Simulation Results for Two Groups
Motif Identification • Two versions 1. Single Group 2. Two Groups
Single Group • Instance: a group of n sequences. • Objective: find a length-L motif that occurs in every given sequence such that its occurrences are similar to one another.
Two Groups • Instance: two groups of sequences, B (Bad) and G (Good). • Objective: find a length-L motif that appears in every sequence in group B and does not appear anywhere in the sequences in G. The occurrences of the motif may contain errors.
Applications 1. Finding Targets for Potential Drugs (T. Jiang, C. Trendall, S. Wang, T. Wareham, X. Zhang, 1998) (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999) -- bad strings in B are from bacteria -- good strings in G are from humans -- find a substring s of length L that is conserved in all bad strings but not conserved in good strings -- use s to screen chemicals -- the selected chemicals can then be tested as potential broad-range antibiotics.
Applications 2. Creating Diagnostic Probes for Bacterial Infection (T. Brown, G.A. Leonard, E.D. Booth, G. Kneale, 1990) -- a group of closely related pathogenic bacteria -- find a substring that occurs in each of the bacterial sequences (with as few substitutions as possible) and does not occur in the human sequences
Applications 3. Locating binding sites and regulatory signals 4. Creating Universal PCR Primers 5. Creating Unbiased Consensus Sequences 6. Anti-sense Drug Design
Previous work • The closest substring problem was proved to be NP-hard, and so were the single-group and two-groups problems (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang 1999) • Polynomial-time approximation schemes exist, but they are theoretical results and too slow for practical instances
Previous Programs • Bailey and Elkan: MEME (1994) uses a modified EM algorithm and allows the motif to be absent in some of the given sequences • Waterman: extended sample-driven approach (1984) • Keich and Pevzner: two programs (2002) • Buhler and Tompa: Projection (2002) combines EM and random projection • Price, Ramabhadran and Pevzner: PatternBranching (2003) branches from sample strings and is faster than Projection, the previously best known program
Previous Programs (continued) • Do not allow indels • Only for the one group problem • Some algorithms can handle one gap
Our work • An extension of the EM approach • A randomized algorithm for the single group problem which can handle indels • We give an algorithm for the two groups problem
Representation of motifs
• Consensus pattern: choose the letter that appears most often in each of the L columns (Figure a)
• Profile: a 4×L matrix W (rows A, C, G, T); each cell W(i,j) is the occurrence rate of letter i in column j (Figure b)
• Use the profile representation in the early stage of the EM algorithm
• Use the consensus pattern representation to improve the accuracy

(a) Aligned occurrences
caaccca
caacccc
catcccg
catccct
cacccca
--------------------
consensus pattern: caaccca
another consensus pattern: catccca

(b) Profile
    1    2    3    4    5    6    7
A   0    1    0.4  0    0    0    0.4
C   1    0    0.2  1    1    1    0.2
G   0    0    0.0  0    0    0    0.2
T   0    0    0.4  0    0    0    0.2
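The two representations can be derived from the same aligned occurrences. The following is a minimal Python sketch, not the authors' code: the function names `profile` and `consensus` are hypothetical, and ties between equally frequent letters are broken in favour of the earlier letter in A, C, G, T order (an assumption; the slide shows that either tied letter yields a valid consensus).

```python
ALPHABET = "ACGT"

def profile(occurrences):
    """Return the 4 x L profile as a dict: letter -> list of column frequencies."""
    n = len(occurrences)
    length = len(occurrences[0])
    return {
        b: [sum(s[x].upper() == b for s in occurrences) / n for x in range(length)]
        for b in ALPHABET
    }

def consensus(occurrences):
    """Most frequent letter per column; ties go to the earlier letter in ALPHABET."""
    w = profile(occurrences)
    length = len(occurrences[0])
    return "".join(max(ALPHABET, key=lambda b: w[b][x]) for x in range(length)).lower()

occ = ["caaccca", "caacccc", "catcccg", "catccct", "cacccca"]
W = profile(occ)
print(W["A"])          # column frequencies of A, matching row A of Figure (b)
print(consensus(occ))  # caaccca
```

With the other tie-breaking order, column 3 would yield t and the consensus would be catccca, the second pattern in Figure (a).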
Computing the single group problem
The EM (Expectation Maximization) Algorithm (Wang, L., Dong, L. and Fan, H. 2004)
Input:
• n sequences S1, S2, ..., Sn
• a 4×L matrix W (the initial guess of the motif)
Output:
• a new matrix W that is a locally maximal solution

Example W (L = 3):
    1     2    3
A   0.25  0.0  1.0
C   0.25  1.0  0.0
G   0.25  0.0  0.0
T   0.25  0.0  0.0
Step 1: An L-mer S_ij is a length-L substring (the one starting at position j of sequence S_i). For each L-mer S_ij, calculate the likelihood that S_ij is an occurrence of the motif:
$P(i,j) = \prod_{x=1}^{L} W(S_{ij}(x), x)$
To avoid zero weights, a fixed small number (0.1) is added to W(i,j).
Step 2: Normalize the likelihood:
$P'(i,j) = P(i,j) / \sum_{x=1}^{m-L+1} P(i,x)$, so that $\sum_{j=1}^{m-L+1} P'(i,j) = 1$

Example: S_ij = c a a
W =
a   0.25  0  1
c   0.25  1  0
g   0.25  0  0
t   0.25  0  0
P(i,j) = 0.25 × 0.1 × 1 = 0.025
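Steps 1 and 2 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the names `likelihoods` and `normalize` are hypothetical, and the small number 0.1 is applied by replacing zero entries only, which is an assumption chosen because it reproduces the slide's worked value 0.25 × 0.1 × 1 = 0.025.

```python
def likelihoods(seq, W, L, pseudo=0.1):
    """Step 1: P(i, j) for every start j of one sequence S_i."""
    scores = []
    for j in range(len(seq) - L + 1):
        p = 1.0
        for x in range(L):
            w = W[seq[j + x]][x]
            # Assumption: only zero weights are replaced by the small number.
            p *= w if w > 0 else pseudo
        scores.append(p)
    return scores

def normalize(scores):
    """Step 2: divide by the row sum so the values of one sequence add up to 1."""
    total = sum(scores)
    return [p / total for p in scores]

# The matrix W from the slide (L = 3).
W = {
    "a": [0.25, 0.0, 1.0],
    "c": [0.25, 1.0, 0.0],
    "g": [0.25, 0.0, 0.0],
    "t": [0.25, 0.0, 0.0],
}
print(likelihoods("caa", W, L=3))             # one L-mer, likelihood about 0.025
print(normalize(likelihoods("gcaat", W, 3)))  # three L-mers, values sum to 1
```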
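Step 3: Re-estimate the motif matrix W:
$W = \sum_{i=1}^{n} \sum_{j=1}^{m-L+1} W_{ij}$
where W_ij is constructed from S_ij: in column x, the row of letter S_ij(x) holds the likelihood of S_ij and all other rows hold 0.

Example: S_ij = c a a, that is, S_ij(1) = c, S_ij(2) = a, S_ij(3) = a
W =
a   0.25  0  1
c   0.25  1  0
g   0.25  0  0
t   0.25  0  0
P(i,j) = 0.25 × 0.1 × 1 = 0.025
W_ij =
a   0      0.025  0.025
c   0.025  0      0
g   0      0      0
t   0      0      0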
Step 4: Normalize W:
$W'(b,x) = W(b,x) / \sum_{b' \in \{A,C,G,T\}} W(b',x)$
Replace W with W'.
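Steps 3 and 4 amount to a weighted count followed by a column-wise normalization. The sketch below is one possible reading, with hypothetical function names; `weights[i][j]` stands for the likelihood of the L-mer starting at position j of sequence i, and the uniform fallback for an all-zero column is an assumption added only to avoid division by zero.

```python
def reestimate(seqs, weights, L, alphabet="acgt"):
    """Step 3: W is the sum of all W_ij, where W_ij holds weights[i][j]
    at (S_ij(x), x) for each column x and 0 elsewhere."""
    W = {b: [0.0] * L for b in alphabet}
    for seq, row in zip(seqs, weights):
        for j, p in enumerate(row):
            for x in range(L):
                W[seq[j + x]][x] += p
    return W

def normalize_columns(W):
    """Step 4: divide every entry by its column sum."""
    L = len(next(iter(W.values())))
    out = {b: [0.0] * L for b in W}
    for x in range(L):
        total = sum(W[b][x] for b in W)
        for b in W:
            # Assumption: an all-zero column falls back to the uniform distribution.
            out[b][x] = W[b][x] / total if total > 0 else 1.0 / len(W)
    return out

# The slide's example: a single L-mer "caa" with likelihood 0.025.
W_sum = reestimate(["caa"], [[0.025]], L=3)
print(W_sum["a"], W_sum["c"])
print(normalize_columns(W_sum))
```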
Step 5: Steps 1 to 4 are called a cycle. If W changes very little from the last cycle, EM has converged and the algorithm ends; otherwise, go to Step 1 and start the next cycle.
Determine the amount of change: $\max_{b,x} |W_q(b,x) - W_{q-1}(b,x)| < \varepsilon$, where W_q is the matrix after cycle q. Set ε = 0.05 so that the algorithm stops within a few cycles.
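The stopping rule can be sketched independently of the EM details. In the Python sketch below, `one_cycle` is a hypothetical callable that performs Steps 1 to 4 and returns the new matrix, and `max_cycles` is a safety cap that does not appear on the slide.

```python
def max_change(W_new, W_old):
    """Largest absolute entry-wise difference between two matrices."""
    return max(
        abs(a - b)
        for key in W_new
        for a, b in zip(W_new[key], W_old[key])
    )

def run_until_converged(W, one_cycle, eps=0.05, max_cycles=100):
    """Repeat cycles until the change drops below eps; return (W, cycles used)."""
    for q in range(1, max_cycles + 1):
        W_next = one_cycle(W)
        if max_change(W_next, W) < eps:
            return W_next, q
        W = W_next
    return W, max_cycles

# Toy cycle for illustration only: halves every entry, so the change shrinks each cycle.
halve = lambda W: {b: [v / 2 for v in col] for b, col in W.items()}
print(run_until_converged({"a": [1.0]}, halve))
```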
Our Algorithm for Single Group (with indels) The general framework is the same as in the previous algorithm: 1. Get an initial guess of the motif W 2. With W as the initial value, use the new EM algorithm to update W 3. Repeat 1–2 several (Maxtrials) times and choose the best result.
Incorporating Indels • We add the “space” as a letter, so the matrix for the EM algorithm becomes 5×L • k: the maximum total number of indels • For each starting position, consider all substrings of length L+h, where h = 0, 1, -1, …, k, -k is the number of indels • Align each length-(L+h) substring with the matrix
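Align a length L+h string with a 5×L matrix
• Dynamic programming, similar to pairwise string alignment
• d[i, j] is the score of aligning the first i columns in the matrix with the first j letters in the string:
$d[i,j] = \max\{\, d[i-1,j-1] \times W[x,i],\; d[i-1,j] \times W[\triangle,i],\; d[i,j-1] \times e \,\}$
where x is the j-th letter of the string, △ is the space, and e is the factor applied when a letter of the string is matched to no column
• Bottom-up order; d[L, L+h] gives the best alignment (with indels)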
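The recurrence can be filled in bottom-up as in ordinary pairwise alignment. The Python sketch below is an illustration under stated assumptions: the space △ is represented by the key "-", the base case is d[0, 0] = 1, and the default value of `e` is hypothetical because the slide does not give one.

```python
GAP = "-"  # stands for the space letter (the fifth row of the 5 x L matrix)

def align_score(s, W, e=0.5):
    """Return d[L][len(s)], the best product score of aligning s with W."""
    L = len(W[GAP])
    m = len(s)
    d = [[0.0] * (m + 1) for _ in range(L + 1)]
    d[0][0] = 1.0  # assumption: the empty alignment scores 1
    for i in range(L + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = 0.0
            if i > 0 and j > 0:   # letter s[j-1] matched to column i
                best = max(best, d[i - 1][j - 1] * W[s[j - 1]][i - 1])
            if i > 0:             # column i matched to a space (deletion)
                best = max(best, d[i - 1][j] * W[GAP][i - 1])
            if j > 0:             # letter s[j-1] matched to no column (insertion)
                best = max(best, d[i][j - 1] * e)
            d[i][j] = best
    return d[L][m]

# Hypothetical 5 x 3 matrix for the motif "caa".
W = {
    "a": [0.0, 1.0, 1.0],
    "c": [1.0, 0.0, 0.0],
    "g": [0.0, 0.0, 0.0],
    "t": [0.0, 0.0, 0.0],
    GAP: [0.0, 0.25, 0.25],
}
for candidate in ("caa", "cgaa", "ca"):   # h = 0, +1, -1
    print(candidate, align_score(candidate, W))
```

Running this for every start position and every h between -k and k gives the per-substring likelihoods that replace P(i,j) in the indel-free algorithm.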
Continued After computing the motif W (profile representation: a matrix), we use W to find the occurrence of the motif in each sequence
Find the motif occurrences • Find the occurrence of the motif in each string, scoring each candidate by $\sum_{i=1}^{L} W(a_i, i)$, where a_1 a_2 a_3 … a_L is a length-L substring (L-mer) and W is the matrix for the motif
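A scan over all L-mers of one string is enough to apply this score. The Python sketch below assumes that the reported occurrence is the L-mer with the largest sum and that the first such position wins on ties; the slide gives only the score itself, and the name `best_occurrence` is hypothetical.

```python
def best_occurrence(seq, W, L):
    """Return (start, L-mer) maximizing the sum of W(a_i, i) over the L-mer."""
    best_j = max(
        range(len(seq) - L + 1),
        key=lambda j: sum(W[seq[j + x]][x] for x in range(L)),
    )
    return best_j, seq[best_j:best_j + L]

# Hypothetical 4 x 3 matrix whose consensus is "caa".
W = {
    "a": [0.0, 1.0, 1.0],
    "c": [1.0, 0.0, 0.0],
    "g": [0.0, 0.0, 0.0],
    "t": [0.0, 0.0, 0.0],
}
print(best_occurrence("ttcaatt", W, L=3))  # (2, 'caa')
```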
Algorithm for the two Groups (no indels) • We follow the basic steps of EM method • Modify the formula to re-construct W • Re-estimate the matrix W from both group B and G
Main idea When the motif represented by the matrix W is too close to some L-mer S_ij from group G (p(i,j) > ave), we scoop that pattern out of the matrix by subtracting the corresponding matrix W_ij
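One way to read this idea is as a modified re-estimation step: L-mers from B add their matrices W_ij, while L-mers from G whose likelihood exceeds ave subtract theirs. The Python sketch below is a guess at that shape, not the authors' formula. How ave is computed is not stated on the slide, so it is passed in as a parameter, and clamping negative entries to zero is an added assumption.

```python
def reestimate_two_groups(bad, good, p_bad, p_good, L, ave, alphabet="acgt"):
    """p_bad[i][j] and p_good[i][j] are the likelihoods of the L-mers starting
    at position j of the i-th sequence in groups B and G respectively."""
    W = {b: [0.0] * L for b in alphabet}
    for seq, row in zip(bad, p_bad):
        for j, p in enumerate(row):
            for x in range(L):
                W[seq[j + x]][x] += p          # add W_ij for group B
    for seq, row in zip(good, p_good):
        for j, p in enumerate(row):
            if p > ave:                        # motif too close to this G L-mer
                for x in range(L):
                    W[seq[j + x]][x] -= p      # scoop the pattern out
    # Assumption: negative weights are clamped to zero before normalization.
    return {b: [max(v, 0.0) for v in col] for b, col in W.items()}

W = reestimate_two_groups(["caa"], ["cat"], [[1.0]], [[0.5]], L=3, ave=0.25)
print(W)
```

In this toy call the shared prefix "ca" loses half of its weight, while the third column keeps full weight on a, the letter that separates the two groups.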
Experiment Results (Single Group)
• Input:
  (1) randomly generate n = 20 sequences of length m = 600
  (2) insert the motif into the sequences:
      -- choose a center string s (length L)
      -- mutate d positions (insertion, deletion, mutation)
      -- implant the mutated copy into the sequences
• Output: use our program to find the implanted pattern.
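A test instance of this kind can be generated as follows. The Python sketch is a simplified illustration: it uses substitutions only (no insertions or deletions), overwrites a window of each background sequence with the mutated copy, and the values L = 15, d = 3 and the fixed seed are hypothetical; only n = 20 and m = 600 come from the slide.

```python
import random

def mutate(center, d, rng):
    """Substitute d distinct positions of center with a different letter."""
    s = list(center)
    for pos in rng.sample(range(len(s)), d):
        s[pos] = rng.choice([b for b in "acgt" if b != s[pos]])
    return "".join(s)

def make_instance(n=20, m=600, L=15, d=3, seed=0):
    """Return (center, sequences, implant positions)."""
    rng = random.Random(seed)
    center = "".join(rng.choice("acgt") for _ in range(L))
    seqs, positions = [], []
    for _ in range(n):
        background = [rng.choice("acgt") for _ in range(m)]
        start = rng.randrange(m - L + 1)
        # Overwrite a window of the random background with the mutated copy.
        background[start:start + L] = mutate(center, d, rng)
        seqs.append("".join(background))
        positions.append(start)
    return center, seqs, positions

center, seqs, positions = make_instance()
print(center, positions[:5])
```

Keeping the implant positions makes it possible to check afterwards whether a motif finder recovered the planted occurrences.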
Experiment Results (Single Group)
• Table 1: 15 sequences: no indel; 5 sequences: one deletion
• Table 2: 10 sequences: no indel; 5 sequences: one deletion; 5 sequences: one insertion
In Table 2, the running time increases significantly and the accuracy is in many cases slightly worse than in Table 1.
Experiment Results (Single Group)
• Table 3: 5 sequences: one deletion; 5 sequences: two deletions; 10 sequences: no indel
• Table 4: 5 sequences: one insertion; 5 sequences: two insertions; 10 sequences: no indel
The results in Table 4 are slightly better than those in Table 3. The reason might be that the case in Table 4 needs to insert two columns in the matrix for the motif, whereas the case in Table 3 needs to insert two spaces in the motif sequences.
Experiment Results (Single Group) • Table 5, the mixed case: • Probability: • one insertion : 1/8 one deletion : 1/8 • two insertions : 1/8 two deletions: 1/8 • one insertion and one deletion: 1/8 • no indel: 3/8
Experiment Results (Two Groups) • Centers (m=600): c1, the center for group B, is a random sequence; c2, the center for group G, is obtained by randomly mutating 200 positions of c1 • Generate the two groups (n=10): randomly mutate 200 positions from the center
Experiment Results (Two Groups) Table 6 shows the results when the average Hamming distance between c1 and c2 is about 128. Table 7 shows the results when the average Hamming distance between c1 and c2 is about 175. • From Table 6, we can see that it is easier to find a motif that distinguishes the two groups when L is large • Comparing Table 7 with Table 6, we can see that it is easier to find a distinguishing motif when the distance between the two centers is large
Summary • An algorithm for the single group problem that can handle indels • An algorithm for the two groups problem