Function Matching
Amihood Amir, Yonatan Aumann, Moshe Lewenstein, Ely Porat
Bar Ilan University
Baker’s Parameterized Matching
Pattern: c=1; c = g(c)*5+f(c);
Prog.c: int a,b; a=1; a = g(a)*5+f(a); b=2; a = func(a,b); a = a*g(b); b=1; b = g(b)*5+f(b); ….
Baker’s work: pdup, dupstat, psearch. SICOMP 1997, JCSS 1996
Two dimensional parameterized matching
pattern
‘A horse is a horse, it ain’t make a difference what color it is’ John Wayne
Parameterized Matching
Input: $P = p_1 \dots p_m$ over alphabet $\Sigma_P$; $T = t_1 \dots t_n$ over alphabet $\Sigma_T$
Output: locations $i$ of $T$ for which a bijection $\pi: \Sigma_P \to \Sigma_T$ exists s.t. $\pi(P) = \pi(p_1)\pi(p_2)\dots\pi(p_m) = t_i \dots t_{i+m-1}$
Parameterized Matching
• One dimensional
  • Baker 1996, JCSS - Suffix Trees
  • Baker 1997, SICOMP - Boyer-Moore
  • Amir, Farach, Muthu 1995, IPL - Knuth-Morris-Pratt
• Two dimensional
  Regular methods fail!
Function Matching
Input: $P = p_1 \dots p_m$ over alphabet $\Sigma_P$; $T = t_1 \dots t_n$ over alphabet $\Sigma_T$
Output: locations $i$ of $T$ where a function $f: \Sigma_P \to \Sigma_T$ exists s.t. $f(P) = f(p_1)f(p_2)\dots f(p_m) = t_i \dots t_{i+m-1}$
Function Matching: example
P = h e h a e h
T = a b c b a c b a d a b d a d d a d
• Location 2 (b c b a c b): f(h) = b, f(e) = c, f(a) = a
• Location 8 (a d a b d a): f(h) = a, f(e) = d, f(a) = b
• Location 12 (d a d d a d): f(h) = d, f(e) = a, f(a) = d
• Any other location: no match! f(h) = ?? (at location 1, for instance, h is aligned with both a and c)
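Function Matching vs. Parameterized Matching
P p-matches $t_i \dots t_{i+m-1}$ iff
1. P f-matches $t_i \dots t_{i+m-1}$, and
2. # of distinct symbols in $t_i \dots t_{i+m-1}$ = # of distinct symbols in P
Example: P = hehaeh, T = a b c b a c b a d a b d a d d a d
• b c b a c b: f(h) = b, f(e) = c, f(a) = a (an f-match and a p-match: 3 distinct symbols on both sides)
• d a d d a d: f(h) = d, f(e) = a, f(a) = d (an f-match only: 2 distinct symbols against 3 in P)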
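The two conditions above can be checked directly on a single window. The following is a minimal sketch, not taken from the slides; the helper names `f_match` and `p_match` are hypothetical.

```python
def f_match(pattern, window):
    """Return the function f (as a dict) if pattern f-matches window, else None."""
    if len(pattern) != len(window):
        return None
    f = {}
    for p, t in zip(pattern, window):
        # setdefault records the first image of p; later images must agree.
        if f.setdefault(p, t) != t:
            return None
    return f


def p_match(pattern, window):
    """p-match = f-match plus equal numbers of distinct symbols on both sides."""
    return (f_match(pattern, window) is not None
            and len(set(window)) == len(set(pattern)))


print(f_match("hehaeh", "bcbacb"), p_match("hehaeh", "bcbacb"))
print(f_match("hehaeh", "daddad"), p_match("hehaeh", "daddad"))
```

Equal counts of distinct symbols force the function f to be injective, which is why the second window f-matches but does not p-match.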
Naïve Algorithm
At each location i of text T, check whether the pattern f-matches.
Check: for each letter ‘a’ in the pattern, are the text elements aligned with the pattern’s ‘a’s all the same?
• No: declare ‘no match’
• All letters “OK”: declare ‘match’
Running time: O(nm), where m = |P| and n = |T|
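A minimal sketch of the naïve check, assuming 1-based locations as on the slides. The function name `naive_function_matching` is hypothetical.

```python
def naive_function_matching(pattern, text):
    """Return all 1-based locations i at which pattern f-matches text. O(nm) time."""
    m, n = len(pattern), len(text)
    matches = []
    for i in range(n - m + 1):
        f = {}          # partial function built for this alignment
        ok = True
        for j in range(m):
            p, t = pattern[j], text[i + j]
            if f.setdefault(p, t) != t:
                ok = False   # the same pattern letter sits above two different text letters
                break
        if ok:
            matches.append(i + 1)
    return matches


print(naive_function_matching("hehaeh", "abcbacbadabdaddad"))
```

On the running example this reports locations 2, 8 and 12, the three matches shown on the earlier slides.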
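Function Matching with Don’t Cares
Input: $P = p_1 \dots p_m$ over alphabet $\Sigma_P \cup \{?\}$; $T = t_1 \dots t_n$ over alphabet $\Sigma_T$
Output: locations $i$ of $T$ where a function $f: \Sigma_P \to \Sigma_T$ exists s.t. $f(P) = f(p_1)f(p_2)\dots f(p_m) = t_i \dots t_{i+m-1}$, where ? is a wildcard that matches any text symbol
Example: P = h e ? ? e h, T = a b c b a c b c d b c d a d d a d
Match at location 6 (c b c d b c) with f(h) = c, f(e) = b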
Why do we need don’t cares?
[Figure: a two-dimensional pattern and a two-dimensional text]
Linearize Text and Pattern
[Figure: an n × n text and an m × m pattern, each written out row by row (Line 1, Line 2, …). T is the concatenation of the text rows; P is the concatenation of the pattern rows with n-m don’t cares (?) following each pattern row.]
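The linearization can be sketched as follows. This is an illustration under stated assumptions: the text is given as n rows of length n, the pattern as m rows of length m, and padding is placed only between pattern rows (trailing don't cares would not change the result). The function names are hypothetical.

```python
def linearize_text(text_rows):
    """Concatenate the rows of the 2D text into one string."""
    return "".join(text_rows)


def linearize_pattern(pattern_rows, n, wildcard="?"):
    """Concatenate pattern rows, separated by n - m wildcards.

    The padding keeps consecutive pattern rows exactly n positions apart,
    which is the distance between vertically adjacent cells of the text.
    """
    m = len(pattern_rows[0])
    return (wildcard * (n - m)).join(pattern_rows)


text_rows = ["abcd", "efgh", "ijkl", "mnop"]
pattern_rows = ["xy", "zw"]
print(linearize_text(text_rows))
print(linearize_pattern(pattern_rows, n=4))
```

A 1D match at 0-based offset r*n + c corresponds to the 2D position (r, c); offsets with c + m > n wrap around a row boundary and would have to be discarded.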
Polynomial Multiplication - Convolutions
Multiply $t_1 t_2 \dots t_n$ by the reversed pattern $p_m p_{m-1} \dots p_1$ as in long multiplication: row $j$ holds the partial products $p_j t_1, p_j t_2, \dots, p_j t_n$, shifted one column per row, so the column for alignment $i$ sums to $\sum_{j=1}^{m} p_j t_{i+j-1}$.
Running time: O(n log m)
Convolutions: Fischer-Paterson [1974]
Slide the pattern $p_1 p_2 \dots p_m$ along the text $t_1 t_2 \dots t_n$. In the product of T with $p_m p_{m-1} \dots p_1$, the column that contains $p_1 t_i, p_2 t_{i+1}, \dots, p_m t_{i+m-1}$ gives the value for the alignment of P at location $i$.
How does this help for Function Matching? The property that needs to be checked is: beneath each symbol from the pattern alphabet all text characters must be the same
Example - T = a b c b a c b a c a b d a d d a d e a P = h e h a e h ? e PR = e ? h e a h e h
Example - T = a b c b a c b a c a b d a d d a d e a P = h e h a e h ? e PR = e ? h e a h e h h in P vs. a in T Ta= 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 PRh = 0 0 1 0 0 1 0 1
Example - T = a b c b a c b a c a b d a d d a d e a P = h e h a e h ? e PR = e ? h e a h e h
h - a
Ta = 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1
PRh = 0 0 1 0 0 1 0 1
Long multiplication of Ta by PRh: each 1 in PRh contributes a copy of Ta shifted to that digit’s position, and each 0 contributes a row of zeros. Summing the columns gives
Ta * PRh = 0 0 1 0 0 1 1 1 0 2 0 2 1 0 3 0 1 2 0 1 2 0 1 1 0 1
Example - T = a b c b a c b a c a b d a d d a d e a P = h e h a e h ? e PR = e ? h e a h e h h - a Ta = 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 PRh = 0 0 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 2 0 2 1 0 3 0 1 2 0 1 2 0 1 1 0 1 => in O(n log m) time!!
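The product on this slide can be reproduced with a few lines of code. This sketch uses a plain quadratic-time convolution for readability; the O(n log m) bound on the slide relies on a fast (FFT-based) convolution, which is not shown here. The helper names are hypothetical.

```python
def indicator(seq, symbol):
    """0/1 vector marking the positions of symbol in seq."""
    return [1 if s == symbol else 0 for s in seq]


def convolve(a, b):
    """Plain polynomial multiplication of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


T = "abcbacbacabdaddadea"
P = "hehaeh?e"
Ta = indicator(T, "a")
PRh = indicator(P[::-1], "h")      # indicator of h in the reversed pattern

product = convolve(Ta, PRh)
# Entries m-1 .. n-1 correspond to the n-m+1 full alignments of P against T.
aligned = product[len(P) - 1:len(T)]
print(product)
print(aligned)
```

Entry k of `aligned` counts how many h's of the pattern sit above an `a` when the pattern starts at text location k+1.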
Example - T = a b c b a c b a c a b d a d d a d e a P = h e h a e h ? e PR = e ? h e a h e h
Number of h’s aligned with each text symbol, per location:
h - a: 1 0 2 0 2 1 0 3 0 1 2 0
h - b: 0 3 0 1 1 1 1 0 1 0 1 0
h - c: 2 0 1 2 0 1 1 0 1 0 0 0
h - d: 0 0 0 0 0 0 1 0 1 2 0 3
Match(h): 0 1 0 0 0 0 0 1 0 0 0 1
=> in O($|\Sigma_T|$ n log m) time!!
In general - the Algorithm
• For each character ‘a’ in $\Sigma_P$ create $P_a$
• For each character ‘b’ in $\Sigma_T$ create $T_b$
• For all $P_a$ and $T_b$, multiply them and construct Match(a) for each ‘a’ in $\Sigma_P$
• Announce each location i of T as a ‘match’ if Match(a)[i] = 1 for all a’s in P
=> in O($|\Sigma_P||\Sigma_T|$ n log m) time.
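The steps above can be sketched as follows. This is an illustration, not the authors' implementation: `correlate` stands in for the convolution and runs in quadratic time, so the sketch shows the logic rather than the stated running time. All function names are hypothetical.

```python
def correlate(text_vec, pat_vec):
    """For each alignment i, return sum_j pat_vec[j] * text_vec[i + j]."""
    m, n = len(pat_vec), len(text_vec)
    return [sum(pat_vec[j] * text_vec[i + j] for j in range(m))
            for i in range(n - m + 1)]


def convolution_function_matching(pattern, text, wildcard="?"):
    """Return the 1-based locations where pattern f-matches text."""
    m, n = len(pattern), len(text)
    ok = [True] * (n - m + 1)
    for a in sorted(set(pattern) - {wildcard}):
        Pa = [1 if p == a else 0 for p in pattern]
        occurrences = sum(Pa)
        match_a = [False] * (n - m + 1)
        for b in sorted(set(text)):
            Tb = [1 if t == b else 0 for t in text]
            for i, count in enumerate(correlate(Tb, Pa)):
                # Match(a)[i] = 1 when every a in P sits above the same text symbol b.
                if count == occurrences:
                    match_a[i] = True
        ok = [x and y for x, y in zip(ok, match_a)]
    return [i + 1 for i, good in enumerate(ok) if good]


print(convolution_function_matching("hehaeh?e", "abcbacbacabdaddadea"))
```

For the running example Match(h) holds at locations 2, 8 and 12, but location 8 fails for `e`, so only locations 2 and 12 are announced.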
Improvement
Lemma: Let $a_1, \dots, a_k \in \mathbb{N}$. Then $k \sum_{i=1}^{k} a_i^2 = \left(\sum_{i=1}^{k} a_i\right)^2$ iff $a_i = a_j$ for all $i, j$.
Idea: Encode the text symbols as numbers. For each pattern symbol, compute at every location the sum of the text numbers beneath it and, separately, the sum of their squares.
Example: compute the sum of the text characters beneath “e”, and the sum of their squares
T#2 = 1 4 9 4 1 9 4 1 9 1 4 16 1 16 16 1 16 25 1
T# = 1 2 3 2 1 3 2 1 3 1 2 4 1 4 4 1 4 5 1
T = a b c b a c b a c a b d a d d a d e a
P = h e h a e h ? e
Pe = 0 1 0 0 1 0 0 1
Running Time: two convolutions for each pattern character, O($|\Sigma_P|$ n log m)
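A minimal sketch of the sum and sum-of-squares test from the lemma. The numeric encoding (symbols numbered 1, 2, 3, … in order of first appearance) reproduces T# on the slide but is otherwise an assumption, and `correlate` is a quadratic-time stand-in for the two convolutions per pattern character. Function names are hypothetical.

```python
def correlate(text_vec, pat_vec):
    """For each alignment i, return sum_j pat_vec[j] * text_vec[i + j]."""
    m, n = len(pat_vec), len(text_vec)
    return [sum(pat_vec[j] * text_vec[i + j] for j in range(m))
            for i in range(n - m + 1)]


def lemma_function_matching(pattern, text, wildcard="?"):
    """Return 1-based f-match locations using the k * sum(a^2) == (sum a)^2 test."""
    code = {}
    nums = [code.setdefault(t, len(code) + 1) for t in text]   # T#
    squares = [x * x for x in nums]                             # T#2
    ok = [True] * (len(text) - len(pattern) + 1)
    for a in sorted(set(pattern) - {wildcard}):
        Pa = [1 if p == a else 0 for p in pattern]
        k = sum(Pa)
        sums = correlate(nums, Pa)          # first convolution
        sums_sq = correlate(squares, Pa)    # second convolution
        for i in range(len(ok)):
            # Equality holds iff all text numbers beneath the a's are equal.
            if k * sums_sq[i] != sums[i] ** 2:
                ok[i] = False
    return [i + 1 for i, good in enumerate(ok) if good]


print(lemma_function_matching("hehaeh?e", "abcbacbacabdaddadea"))
```

Compared with the previous approach, the loop over text symbols disappears, which is where the factor $|\Sigma_T|$ is saved.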
We have seen – 2 algorithms for Function Matching
• O(nm) - naïve algorithm
• O($|\Sigma_P|$ n log m) - convolution based
We will see:
• O(n log² m) - randomized convolutions based
• Lower bound of Ω(nm) for deterministic convolutions based methods
Can we do better for big alphabets?
Def: A pattern is 2-charactered if every character appears at most twice in the pattern.
Lemma: Let P be a pattern and T a text. There exist 2-charactered patterns P1 and P2 s.t. at location i of T, P f-matches iff P1 and P2 both f-match.
Example: P = a b c b c c b b
P1 = a1 b1 c1 b1 c1 c2 b2 b2 (even pairs)
P2 = a1 b1 c1 b2 c2 c2 b2 b3 (odd pairs)
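One way to build P1 and P2 that reproduces the example above: number the occurrences of each symbol 1, 2, 3, … and pair them up as (1,2), (3,4), … for P1 and as (2,3), (4,5), … for P2. This is a sketch consistent with the slide's example; the function name and the string labels are hypothetical.

```python
from collections import defaultdict


def split_two_charactered(pattern, wildcard="?"):
    """Split pattern into two 2-charactered patterns, returned as lists of labels."""
    seen = defaultdict(int)
    p1, p2 = [], []
    for ch in pattern:
        if ch == wildcard:
            p1.append(ch)
            p2.append(ch)
            continue
        seen[ch] += 1
        j = seen[ch]                          # j-th occurrence of ch, 1-based
        p1.append(f"{ch}{(j + 1) // 2}")      # pairs (1,2), (3,4), ...
        p2.append(f"{ch}{j // 2 + 1}")        # pairs (2,3), (4,5), ...
    return p1, p2


p1, p2 = split_two_charactered("abcbccbb")
print(" ".join(p1))
print(" ".join(p2))
```

Together the two patterns chain every occurrence of a symbol to the next one, so requiring both to f-match forces all occurrences to sit above the same text symbol.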
Situation: An algorithm for Function Matching with 2-charactered patterns ⇒ a general algorithm for Function Matching.
So, all that needs to be checked is that each pair in P has equal text symbols beneath it.
New Randomized Algorithm
• For each character:
  - a in T: randomly choose $r_a$ in {0, 1}; replace all a’s in T with $r_a$; get T’
  - b in P: randomly choose $s_b$ in {1,2}; set the first b to be $s_b$ and the second b to be $-s_b$; get P’
• Convolve T’ and P’R
• For each location i for which T’*P’R[i] equals 0, declare a ‘match’
Example: P = v q v u q u ? s
T = a b a a b a b a c a b d a b c b d b a
g(v) = 2, g(q) = 6, g(u) = 8
f(a) = 1, f(b) = 0, f(c) = 0, f(d) = 1
g(P) = 2 6 –2 8 –6 –8 0 0
f(T) = 1 0 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 1 0 1
• Location 1 (a b a a b a b a): 2+0–2+8+0–8+0+0 = 0, a match with h(v) = a, h(q) = b, h(u) = a, h(s) = a
• Location 2 (b a a b a b a c): 0+6–2+0–6+0+0+0 = -2, no match
• Last eight text symbols (d a b c b d b a): 2+6+0+0+0–8+0+0 = 0, although P does not f-match there
Correctness:
• If P f-matches at location i of T, then f(T)*g(P)R [i+m-1] is trivially always equal to 0.
• If P does not f-match at location i of T, then for each convolution <f,g>, f(T)*g(P)R [i+m-1] equals 0 with probability ½; with k rounds of amplification the probability is $(½)^k$.
Running Time:
• O(nk log m) with error probability $2^{-k}$
• O(n log² m) with error probability 1/m
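A minimal sketch of the randomized algorithm with k rounds of amplification, for a 2-charactered pattern. Several details are assumptions and not from the slides: the weights are drawn from a hypothetical range `1..max_weight`, symbols that occur only once are encoded as 0 (as `s` is in the example), and the correlation is computed directly rather than by a fast convolution.

```python
import random


def encode_pattern(pattern, rng, wildcard="?", max_weight=1000):
    """First occurrence of b gets +s_b, second gets -s_b; singletons and ? get 0."""
    counts = {}
    for ch in pattern:
        counts[ch] = counts.get(ch, 0) + 1
    weight, encoded = {}, []
    for ch in pattern:
        if ch == wildcard or counts[ch] == 1:
            encoded.append(0)
        elif counts[ch] > 2:
            raise ValueError("pattern is not 2-charactered")
        elif ch not in weight:
            weight[ch] = rng.randint(1, max_weight)   # assumed range for s_b
            encoded.append(weight[ch])
        else:
            encoded.append(-weight[ch])
    return encoded


def randomized_function_matching(pattern, text, rounds=20, seed=None, wildcard="?"):
    """Return 1-based locations that survive every round (one-sided error)."""
    rng = random.Random(seed)
    m, n = len(pattern), len(text)
    alive = [True] * (n - m + 1)
    for _ in range(rounds):
        r = {t: rng.randint(0, 1) for t in sorted(set(text))}   # r_a in {0, 1}
        t_num = [r[t] for t in text]
        p_num = encode_pattern(pattern, rng, wildcard)
        for i in range(n - m + 1):
            if alive[i] and sum(p_num[j] * t_num[i + j] for j in range(m)) != 0:
                alive[i] = False     # a nonzero value certifies 'no match'
    return [i + 1 for i, a in enumerate(alive) if a]


print(randomized_function_matching("vqvuqu?s", "abaababacabdabcbdba", rounds=20, seed=1))
```

A true match always survives, because each pair contributes $s_b(r_x - r_x) = 0$; a reported location may still be a false positive, with probability shrinking as the number of rounds grows.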
Limitation of the Convolutions Model
Can we do the same deterministically? No!
To show this we use the model of communication complexity.
[Figure: Alice holds x, Bob holds y; together they compute f(x,y)]
Limitation of the Convolutions Model
Known: for $x, y \in \{0,1\}^k$ the communication complexity of equals(x,y) is $\Omega(k)$.
Take pattern $P = a_1 a_2 a_3 \dots a_m\, a_1 a_2 a_3 \dots a_m$, where $i \neq j \Rightarrow a_i \neq a_j$.
Given a collection of convolutions {<g(P), f(T)>}, the convolution at location i is
$(g(P) * f(T))[i+m-1] = \sum_{j=1}^{m} g(a_j) f(t_{i+j-1}) + \sum_{j=1}^{m} g(a_j) f(t_{i+j+m-1})$.
Since we are in essence comparing $t_i \dots t_{i+m-1}$ to $t_{i+m} \dots t_{i+2m-1}$, we get the equality information from the convolutions. This is lower bounded by $\Omega(m)$ for each location; in general, $\Omega(nm)$.
Another Application for Function Matching
Protein Folding detection:
[Figure: a folded chain with positions labeled 1 to 10]
P = 1 2 3 4 5 6 7 8 9 10 10 9 8 7 6 5 4 11 12 … 12 11 3 2 1
Questions
• Can Function Matching be solved deterministically in o(nm) time for big alphabets?
• Are there special cases of Function Matching that are easier (other than Parameterized Matching and other trivial ones)?
• Does 2-dimensional Parameterized Matching need to be solved with function matching?