A New Model to Solve the Swap Matching Problem and Efficient Algorithms for Short Patterns
Costas Iliopoulos, M. Sohel Rahman
SOFSEM 2008
Classic Pattern Matching
• Input (strings from an alphabet Σ):
  • A string T of length n (the text)
  • A string P of length m (the pattern)
• Output:
  • Existence query: whether P occurs in T
  • Computation of the occurrence set: Occ = {i | P = T[i..i + m − 1]}
Example
P = GAC
• We have GAC at positions 3 and 12
• Occ = {3, 12}. Occ = {5, 14}.
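The occurrence set can be computed directly from its definition. The sketch below is a naive O(nm) scan, not an efficient matcher. The text of the example slide is not preserved, so the string used here is hypothetical, chosen only so that GAC starts at positions 3 and 12; the function name `occurrences` is likewise an illustration.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions i with pattern == text[i..i+m-1]."""
    n, m = len(text), len(pattern)
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


# Hypothetical text (not the one from the slide): GAC starts at 3 and 12.
text = "TTGACTTTTTTGACT"
occ = occurrences(text, "GAC")
print(occ)        # [3, 12]  (computation of the occurrence set)
print(bool(occ))  # True     (existence query)
```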
Swap Matching
P = ACGCT, Occ = {1, 5, 6}

Position: 1  2  3  4  5  6  7  8  9  10 11 12 13
Text:     A  G  C  T  C  A  C  G  T  C  C  T  T

• Position 1: T[1..5] = AGCTC (P with positions 2,3 and 4,5 swapped)
• Position 5: T[5..9] = CACGT (P with positions 1,2 and 3,4 swapped)
• Position 6: T[6..10] = ACGTC (P with positions 4,5 swapped)
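The example above can be checked with a naive swap matcher. This is only a baseline sketch under the usual definition (each character takes part in at most one swap, and swaps exchange adjacent characters), not the algorithm of the talk. It scans each window greedily: if the current pattern character matches, move on; otherwise the only option is a swap with the next position. The function names are illustrative.

```python
def swap_match_at(text: str, pattern: str, start: int) -> bool:
    """Check whether pattern swap-matches text at 0-based offset start."""
    m = len(pattern)
    j = 0
    while j < m:
        if pattern[j] == text[start + j]:
            j += 1                      # position j is not swapped
        elif (j + 1 < m
              and pattern[j] == text[start + j + 1]
              and pattern[j + 1] == text[start + j]):
            j += 2                      # positions j and j+1 are swapped
        else:
            return False
    return True


def swap_occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based positions where pattern swap-matches text."""
    n, m = len(text), len(pattern)
    return [s + 1 for s in range(n - m + 1) if swap_match_at(text, pattern, s)]


print(swap_occurrences("AGCTCACGTCCTT", "ACGCT"))  # [1, 5, 6]
```

The greedy choice is safe because a swap at a position that already matches would only exchange two equal characters.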
Motivation
• Swap errors are common in typing.
• Swaps also occur in gene mutations and duplications.
Existing results
• 1998: Amir, Landau, Lewenstein, Lewenstein: O(n log^2 m) (some very special cases)
• 2000: Amir, Aumann, Landau, Lewenstein, Lewenstein: O(n m^(1/3) log m log σ)
• 2003: Amir, Cole, Hariharan, Lewenstein, Porat: O(n log m log σ)
• All results use FFT
• σ = min(m, |Σ|)
Existing results
• Some related variants are also investigated in the literature:
  • Approximate version: Amir, Lewenstein, Porat (2002)
  • Weighted version: Zhang, Guo, Iliopoulos (2004)
Our Contribution
• A new graph-theoretic model
• O((m/w) n log m) time, where w is the machine word size
• For word-size patterns: O(n log m)
• The first efficient non-FFT algorithm for swap matching
The new Model
T-Graph
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
T =       a c a c b a c c b a  c  a  c  b  a
[Figure: T-Graph of T]
P-Graph
Position: 1 2 3 4 5
P =       a c b a b
[Figure: P-Graph of P]
P-Graph
Position: 1 2 3 4 5
P =       a c c a b
[Figure: P-Graph of P]
So…
P swap matches T ⇔ P-Graph swap matches T-Graph
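The graph view can be sketched as a search for a valid path through the pattern's columns. The figures are not preserved, so the structure below is an assumption: each pattern position j has up to three vertices, labelled P[j] (no swap), P[j+1] (swapped with the right neighbour) and P[j−1] (swapped with the left neighbour), and a right-swap vertex may only be followed by the left-swap vertex of the next column. The row numbering and the `NEXT` table are hypothetical names for this illustration.

```python
# Rows: 0 = unswapped, +1 = takes the right neighbour, -1 = takes the left one.
# Allowed moves from column j to column j+1 (assumed edge set, see lead-in).
NEXT = {0: (0, 1), 1: (-1,), -1: (0, 1)}


def label(pattern: str, row: int, j: int):
    """Character shown at (row, column j), or None if the vertex does not exist."""
    k = j + row
    return pattern[k] if 0 <= k < len(pattern) else None


def graph_match_at(text: str, pattern: str, start: int) -> bool:
    """Follow all valid paths whose labels spell text[start..start+m-1]."""
    m = len(pattern)
    # Column 0 has no left-swap vertex, so only rows 0 and +1 can start a path.
    alive = {r for r in (0, 1) if label(pattern, r, 0) == text[start]}
    for j in range(1, m):
        alive = {r2 for r in alive for r2 in NEXT[r]
                 if label(pattern, r2, j) == text[start + j]}
        if not alive:
            return False
    # Row +1 does not exist in the last column, so no swap is left open.
    return bool(alive)


def graph_occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based positions where a valid path exists."""
    n, m = len(text), len(pattern)
    return [s + 1 for s in range(n - m + 1) if graph_match_at(text, pattern, s)]


print(graph_occurrences("AGCTCACGTCCTT", "ACGCT"))  # [1, 5, 6]
```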
An Efficient Algorithm
Degenerate strings
• Let Σ = {A, C, G, T}
• Then we can get 2^4 − 1 = 15 non-empty sets of letters.
• At each position of a degenerate string we have one of those sets.
Degenerate strings…
[Figure: the 15 non-empty subsets of {A, C, G, T}, from the single letters up to {A, C, G, T}]
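The count of 15 can be confirmed by enumerating the non-empty subsets of Σ. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from itertools import combinations

sigma = "ACGT"
# All non-empty subsets, written as strings of their letters.
subsets = ["".join(c)
           for k in range(1, len(sigma) + 1)
           for c in combinations(sigma, k)]

print(len(subsets))  # 15, i.e. 2**4 - 1
print(subsets)
```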
Degenerate strings…
[Figure: a degenerate string X over positions 1 to 7, with a set of letters at each position]
Degenerate strings: Equality/Match
• X[3] =d Y[1]. Why? Because X[3] ∩ Y[1] ≠ ∅.
• Y =d X[1..3]
• Y =d X[3..5]
• Y =d X[4..6]
[Figure: degenerate strings X (positions 1 to 7) and Y]
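Degenerate matching reduces to a non-empty intersection test at every position. The sketch below represents a degenerate string as a list of Python sets. The strings X and Y of the slide are not preserved, so the ones used here are hypothetical and give different match positions.

```python
def deg_equal(xs: list[set], ys: list[set]) -> bool:
    """xs =d ys: same length and a shared letter at every position."""
    return len(xs) == len(ys) and all(x & y for x, y in zip(xs, ys))


def deg_occurrences(xs: list[set], ys: list[set]) -> list[int]:
    """1-based positions i with ys =d xs[i..i+len(ys)-1]."""
    m = len(ys)
    return [i + 1 for i in range(len(xs) - m + 1) if deg_equal(xs[i:i + m], ys)]


# Hypothetical degenerate strings (not the ones from the slide).
X = [set("A"), set("C"), set("AT"), set("C"), set("AC"), set("CT"), set("A")]
Y = [set("A"), set("C"), set("AT")]

print(deg_occurrences(X, Y))  # [1, 3, 5]
```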
P-Graph => Degenerate String
Position:          1     2       3       4     5
P =                a     c       b       a     b
Degenerate string: {a,c} {a,b,c} {a,b,c} {a,b} {a,b}
[Figure: the P-Graph of P collapsed column by column into sets of letters]
Swap Match vs Deg. Match
P => {a,c} {a,b,c} {a,b,c} {a,b} {a,b}
Position: 1 2 3 4 5 6 7 8 9 10
T =       b c b a a a b c b a
• According to Deg. Match: OK!
• According to Swap Match: NOT OK!
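The gap between the two notions can be reproduced in a few lines. The sketch assumes that the pattern on this slide is P = acbab and that its degenerate string collects, at each position, the character itself and its two neighbours; both are reconstructions from the garbled slide text. Under those assumptions the degenerate match reports a position (2) that the swap match rejects.

```python
def degenerate_from_pattern(pattern: str) -> list[set]:
    """Position j may hold P[j-1], P[j] or P[j+1] (assumed construction)."""
    m = len(pattern)
    return [{pattern[k] for k in (j - 1, j, j + 1) if 0 <= k < m}
            for j in range(m)]


def deg_occurrences(text: str, sets: list[set]) -> list[int]:
    """1-based positions where every text character lies in its set."""
    m = len(sets)
    return [s + 1 for s in range(len(text) - m + 1)
            if all(text[s + j] in sets[j] for j in range(m))]


def swap_match_at(text: str, pattern: str, start: int) -> bool:
    """Naive check: each character is used in at most one adjacent swap."""
    m, j = len(pattern), 0
    while j < m:
        if pattern[j] == text[start + j]:
            j += 1
        elif (j + 1 < m and pattern[j] == text[start + j + 1]
              and pattern[j + 1] == text[start + j]):
            j += 2
        else:
            return False
    return True


P, T = "acbab", "bcbaaabcba"
deg = deg_occurrences(T, degenerate_from_pattern(P))
swap = [s for s in deg if swap_match_at(T, P, s - 1)]
print(deg)   # [2, 6]
print(swap)  # [6]
```

The window T[2..6] = cbaaa passes the set test at every position, but the leading c would require a swap of positions 1 and 2, which forces an a at position 2 where the text has b.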
Why Doesn’t It Work?
Position: 1 2 3 4 5 6 7 8 9 10
T =       b c b a a a b c b a
[Figure: T aligned against the P-Graph]
Forbidden Graph
[Figure: the forbidden graph]
Our Algorithm
• Shift-Or Algorithm
• The concept of the Forbidden Graph
D-Mask
P = a c c a b => {a,c} {a,c} {a,c} {a,b,c} {a,b}

Position  Set   D_a  D_b  D_c  D_X
1         ac    0    1    0    1
2         ac    0    1    0    1
3         ac    0    1    0    1
4         abc   0    0    0    1
5         ab    0    0    1    1
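The D-mask table can be rebuilt programmatically. The sketch below assumes the Shift-Or convention read off the table (bit 0 means the character is allowed at that position) and treats X as a stand-in for a character outside every set; both readings are assumptions, as is the function name `d_masks`. Bit j−1 of each integer corresponds to pattern position j.

```python
def d_masks(pattern: str, alphabet: str) -> dict[str, int]:
    """Bit j of masks[ch] is 0 iff ch is one of P[j-1], P[j], P[j+1] (0-based j)."""
    m = len(pattern)
    full = (1 << m) - 1
    masks = {ch: full for ch in set(alphabet) | set(pattern)}
    for j in range(m):
        for k in (j - 1, j, j + 1):
            if 0 <= k < m:
                masks[pattern[k]] &= ~(1 << j)  # clear bit j: character allowed here
    return masks


pattern = "accab"
masks = d_masks(pattern, "abcX")
for j in range(len(pattern)):
    row = [(masks[ch] >> j) & 1 for ch in "abcX"]
    print(j + 1, row)
# 1 [0, 1, 0, 1]
# 2 [0, 1, 0, 1]
# 3 [0, 1, 0, 1]
# 4 [0, 0, 0, 1]
# 5 [0, 0, 1, 1]
```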
F-Mask
[Table: F-mask bit vectors for character pairs, with columns (a,a), (a,b), (b,b), (c,c), (c,a), (X,X) and one row per pattern position 1 to 5]
Computing R matrix
Column i is computed from column i − 1 by a Shift-Or step with the masks for T[i]:
• Column 1: D_a, F(X,a)
• Column 2: D_c, F(a,c)
• Column 3: D_a, F(c,a)
• Column 5: D_b, F(c,b)

i:      0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
T[i]:   X  a  c  a  c  b  a  c  c  b  a  c  a  c  b  a
1 (a)   1  0  0  0  0  1  0  0  0  1  0  0  0  0  1  0
2 (c)   1  1  0  0  0  1  1  0  1  1  1  0  0  0  1  1
3 (c)   1  1  1  0  0  1  1  1  0  1  1  1  0  0  1  1
4 (a)   1  1  1  1  0  0  1  1  1  0  1  1  1  0  0  1
5 (b)   1  1  1  1  1  0  0  1  1  1  0  1  1  1  0  0

A 0 in row 5 at column i reports a match ending at position i: here i = 5, 6, 10, 14, 15.
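The Shift-Or recurrence behind the R matrix can be sketched with the D-masks alone. This is not the F-mask step of the talk: the F-mask contents are not recoverable here, so a plain verification of each candidate is swapped in to reject degenerate matches that are not swap matches. Intermediate states may therefore differ from the matrix above, while the reported positions should agree. Python integers are unbounded, so the m/w word splitting does not appear. All names are illustrative.

```python
def swap_match_at(text: str, pattern: str, start: int) -> bool:
    """Naive verification: each character is used in at most one adjacent swap."""
    m, j = len(pattern), 0
    while j < m:
        if pattern[j] == text[start + j]:
            j += 1
        elif (j + 1 < m and pattern[j] == text[start + j + 1]
              and pattern[j + 1] == text[start + j]):
            j += 2
        else:
            return False
    return True


def shift_or_swap_search(text: str, pattern: str) -> list[int]:
    """Shift-Or over the degenerate pattern, then verify (assumed substitute for F)."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    full = (1 << m) - 1
    d = {}  # D-masks: bit j is 0 iff the character may appear at position j
    for j in range(m):
        for k in (j - 1, j, j + 1):
            if 0 <= k < m:
                d[pattern[k]] = d.get(pattern[k], full) & ~(1 << j)
    state = full  # column 0 of R: all ones
    hits = []
    for i, ch in enumerate(text):
        state = ((state << 1) | d.get(ch, full)) & full
        if (state & (1 << (m - 1))) == 0:       # degenerate match ends at i
            start = i - m + 1
            if swap_match_at(text, pattern, start):
                hits.append(start + 1)          # report 1-based start position
    return hits


print(shift_or_swap_search("acacbaccbacacba", "accab"))  # [1, 2, 6, 10, 11]
```

The reported start positions 1, 2, 6, 10, 11 correspond to the end positions 5, 6, 10, 14, 15 marked by the zeros in the last row of the R matrix.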
Running Time
• Computing D-Masks: O((m/w)(m + |Σ|))
• Computing F-Masks: O((m/w) m log m)
• Computing R values: O((m/w) n log m)
• Overall: O((m/w) n log m)
• Short patterns (m ~ w): O(n log m)
Future Works
• Explore the possibilities of using graph pattern matching
• Experimental work
• A forthcoming paper contains experimental work using biological examples.
The End
Thank you very much