320 likes | 424 Views
String Edit Distance Matching Problem With Moves. Graham Cormode S. Muthukrishna November 2001. Pattern Matching. Text T of length n Pattern P of length m Our goal : find “good” matches of P in T as measured by some edit distance function d(_,_) For every i= 1,2… n find: D[i] = back.
E N D
String Edit Distance MatchingProblem With Moves • Graham Cormode • S. Muthukrishna November 2001
Pattern Matching • Text T of length n • Pattern P of length m Our goal: find “good” matches of P in T as measured by some edit distance function d(_,_) For every i=1,2…n find: D[i] = back
Pattern Matching example • T = “assasin” P = “ssi” -D[1]=2 assasin Or assasin _ssi ss i -D[3]=1 assasin s_si -naively it will take O(mn^3)
The main idea • d(X,Y) = smallest number of operations to turn X into Y. • The operations are insertion, deletion, replacement of a character and moves of a substring. • The idea is to approximate D[i] up to a factor of O(LognLog*n)
General algorithm Embed the string distance into L1 vector distance, up to a O(LognLog*n) factor : • compute the vector with a single pass over the string. • Find the vector representation of O(n) substrings. Do all this in O(nLogn) time. (a deterministic algorithm with an aproximate result)
Edit Sensitive Parsing (ESP)for the embedding • We want to parse the string so that edit operations will have a limit effect on the parsing (an edit operation on char i changes the parsing only of the “neighborhood” of i ) For example: “abcbbabcbdfgj” “xyzbbabcbdfgj” “abcbb abc b dfgj” “xyzbb abc b dfgj”
Edit Sensitive Parsing (ESP)for the embedding • In practice, find landmarks in the strings, based only on their locality and parse by them. • Local maxima are good landmarks, “abcegiklmrtabc” but may be far apart in large alphabets, so we will reduce the alphabet.
Edit Sensitive Parsing (ESP)choosing the landmarks • We use a technique called Alphabet reduction: Text: c a b a g e f a c e d Binary:010 000 001 000 110 100 101 000 010 100 011 Label: 00 2 1 0 3 2 1 0 3 2 3 Label(A[i]) = 2l + bit(l,A[i]) l = location of first bit that is different between A[i] and A[i-1] The value of the l bit in A[i]
Edit Sensitive Parsing (ESP)Alphabet reduction (cont.) - In each iteration the alphabet is reduced fromΣ to 2Log|Σ|. - After Log*|Σ| iterations we get |Σ| < 7. - Then we reduce from 6 to 3, ensuring no adjacent pairs are identical (start from 6 then 5 then 3): Final iteration: 2 1 0 3 2 1 0 3 2 3 After reduction: 2 1 0 1 2 1 0 1 2 0
Edit Sensitive Parsing (ESP)Alphabet reduction (cont.) • Properties of final labels: • Final alphabet is {0,1,2}. • No adjacent pair is identical. • Takes Log*|Σ| iterations. • Each label depends on O(Log*|Σ|) characters to its left.
Edit Sensitive Parsing (ESP)So how do we choose the landmarks? • For repeats, parse in a regular way: aaaaaaa -> (aaa)(aa)(aa) • For varying substrings, use alphabet reduction then mark landmarks as follows:
Edit Sensitive Parsing (ESP)So how do we choose the landmarks? • Consider the final labels, Mark any character that is a local maxima (greater than left & right): Text: c a b a g e f a c e d Final: 010 2 1 0 1 2 1 0 1 2 0
Edit Sensitive Parsing (ESP)So how do we choose the landmarks? • Consider the final labels, Mark any character that is a local maxima (greater than left & right): Text: c a b a g e f a c e d Final: 0102 1 0 1 2 1 0 1 2 0 • Then mark any local minima if not adjacent to a marked char.
Edit Sensitive Parsing (ESP)So how do we choose the landmarks? • Consider the final labels, Mark any character that is a local maxima (greater than left & right): Text: c a b a g e f a c e d Final: 0102 1 0 1 2 1 0 1 2 0 • Then mark any local minima if not adjacent to a marked char. • Clearly, distance between marked labels is 2 or 3.
Edit Sensitive Parsing (ESP)What did we achieve so far? • By now the whole string has been arranged in pairs and triples. • The important outcome is that 2 strings with small edit distance will be parsed to a “very close” arrangement. (the parsing of each character depends on a Log*n neighborhood)
Edit Sensitive Parsing (ESP)constructing the ESP tree • We will now re-label each pair or triple – can be done by building a dictionary (Karp-Miller-Rosenberg) or by hashing (Karp-Rabin 1987). Hash(w[0…m-1]) =(integer value of w[0…m-1]) mod q For some large q
Edit Sensitive Parsing (ESP)constructing the ESP tree In O(nLogn) construction time we get a 2-3 tree:
How do we represent a 2-3 tree as a vector? • The vector is the frequency of occurrence of each (level,label) pair:
Proof of correctness • Theorem: 1/2d(X,Y)V(X) – V(Y) 1O(lognlog*n)d(X,Y)
Upper bound proof: • V(X) – V(Y) 1 O(lognlog*n)d(X,Y) • Insert/change/delete a character: at most log*n + c nodes change per tree level. • Move a substring: the only changes are at the fringes, that is 4(log*n +c) nodes change per tree level. since there are lognlevels in the tree: • Conclusion: each operation changes V byO(lognlog*n) and there are d(X,Y) changes.
Lower bound proof: • d(X,Y) 2 V(X) – V(Y) 1 • The Idea is to transform X into Y using at most 2 V(X) – V(Y) 1 operations: We want to keep hold of large pieces of the string that are common to both X and Y. So we will go through and protect enough pieces of X that will be needed in Y.
Lower bound proof: (cont.) • We avoid doing any changes in the protected pieces. • At the first level of the tree we add or remove characters as needed. (if a character appears in Y and not in X, we add it to the end of X). So we get V0(X)-V0(Y) 1 = 0. • We proceed inductively up on the tree, then to make any node in level i, we need to move at most 2 nodes from level i-1. (we know that on level i-1 we have enough nodes i.e: Vi-1(X)-Vi-1(Y) 1 = 0)
Lower bound proof: example • Y: • X: I Counter=0 G H D E F E A B A B C A B B C Mark protected pieces: L K M J F E E C B A A B B C B C
Lower bound proof: example • Y: • X: I Counter=1 (deletion) Counter=0 Counter=2 (insertion) G H D E F E A B A B C A B B C Remove and add characters as needed: L K M J F E E A C B A A B B C B C
Lower bound proof: example • Y: • X: I Counter=2 (insertion) Counter=3 (1 move) G H D E F E A B A B C A B B C Move to level 2, move nodes in level 1 as needed: L K M J F E E A B A A B B C B C
Lower bound proof: example • Y: • X: I Counter=5 (2 moves) Counter=3 (1 move) G H D E F E A B A B C A B B C Move to level 3, move nodes in level 2 as needed: L This node will not move K M D F E E A B A A B B C B C
Lower bound proof: example • Y: • X: I Counter=5 (2 moves) G H D E F E A B A B C A B B C That’s it!! I G H D E F E A B A B C A B B C
I G H D E F E A B A B C A B B C L K M J F E E C B A A B B C B C Lower bound proof: (example with d(X,Y)=1) • Did we achieve what we wanted? Y X 2 V(Y)-V(X) 1 = 2(#red nodes + # green nodes) = 18 And we surely created X from Y in less then 18 moves
Application to string matching • To find D[i] , we need to compare P against all possible substrings of T. we can reduce this to O(n): • d(T[1,m],P) d(T[1,m], T[1,r]) + d(T[1,r], P) = |r–m| + d(T[1,r], P) 2d(T[1,r], P) So we only need to consider O(n) substrings of length m And we get a 2-approximation of the optimal matching !! Since we need at least |r–m| operations to make T[1,r] the same length as P.
Application to string matching – Final algorithm • Create ESP trees for T and P. • Find V(T[1:m])-V(P)1 • Iteratively compute D[i] V(T[i:i+m-1])-V(P)1 Overall we get O(nLogn) time cost for the whole algorithm, and compute every D[i] up to a factor of O(lognlog*n)