Approximate Matching of Run-Length Compressed Strings. Algorithmica (2003). Veli Mäkinen, Gonzalo Navarro, and Esko Ukkonen.
Run-length encoding: aaabb → (a,3),(b,2)
• Edit Distance on Run-Length Compressed Strings
• Extending to Weighted Edit Distance
• Approximate Searching
• Improving a Greedy Algorithm for LCS
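The encoding in the example above can be sketched in a few lines of Python. The function names `rle_encode` and `rle_decode` are illustrative and do not come from the slides.

```python
from itertools import groupby


def rle_encode(s):
    """Return the run-length encoding of s as a list of (letter, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(runs):
    """Expand a list of (letter, run_length) pairs back into a plain string."""
    return "".join(ch * length for ch, length in runs)


print(rle_encode("aaabb"))  # [('a', 3), ('b', 2)]
```

The number of pairs in the result is the compressed length, written m' and n' in the rest of the deck.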
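Part 1: An O(mn' + m'n) Algorithm for the Levenshtein Distance
• String $A = a_1 a_2 \cdots a_m$ has compressed length $m'$; string $B = b_1 b_2 \cdots b_n$ has compressed length $n'$.
• Levenshtein distance $D_L(A, B)$: $d_{i,j} = \min(d_{i-1,j} + 1,\ d_{i,j-1} + 1,\ d_{i-1,j-1} + \delta_L)$, where $\delta_L = 0$ if $a_i = b_j$ and $1$ otherwise.
• $D_{ID}(A, B)$: $d_{i,j} = \min(d_{i-1,j} + 1,\ d_{i,j-1} + 1,\ d_{i-1,j-1} + \delta_{ID})$, where $\delta_{ID} = 0$ if $a_i = b_j$ and $\infty$ otherwise, so only insertions and deletions are allowed.
• Both are computed with dynamic programming.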
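As a baseline, the two recurrences above can be written as the plain O(mn) dynamic program on the uncompressed strings. This is a minimal sketch, not the O(mn' + m'n) algorithm of the paper. The function names and the `mismatch_cost` parameter are illustrative assumptions.

```python
import math


def edit_distance(a, b, mismatch_cost=1):
    """Fill the full (m+1) x (n+1) matrix d and return d[m][n].

    mismatch_cost is the cost added on the diagonal when a[i-1] != b[j-1]:
    1 gives the Levenshtein distance, math.inf gives D_ID.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete the first i characters of a
    for j in range(n + 1):
        d[0][j] = j  # insert the first j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch_cost)
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, diagonal)
    return d[m][n]


def levenshtein(a, b):
    return edit_distance(a, b, mismatch_cost=1)


def indel_distance(a, b):
    return edit_distance(a, b, mismatch_cost=math.inf)
```

For example, `levenshtein("kitten", "sitting")` is 3, while `indel_distance("kitten", "sitting")` is 5 because each substitution has to be replaced by a deletion plus an insertion.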
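Relationship between $D_{ID}$ and LCS
• $2 \times |LCS(A, B)| = m + n - D_{ID}(A, B)$
• Let $x$ be the number of characters of A outside the LCS and $y$ the number of characters of B outside it, so that $m + n = 2 \times |LCS(A, B)| + x + y$.
• $D_{ID}(A, B) = x + y$: the $x$ characters are deleted and the $y$ characters are inserted.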
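The identity above gives a second way to obtain D_ID from the LCS length alone. The sketch below uses the textbook LCS dynamic program; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (textbook O(mn) DP)."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


def indel_distance_from_lcs(a, b):
    """D_ID(A, B) = m + n - 2 * |LCS(A, B)|."""
    return len(a) + len(b) - 2 * lcs_length(a, b)
```

With A = "kitten" and B = "sitting", the LCS "ittn" has length 4, so x = 2, y = 3, and D_ID = 6 + 7 - 8 = 5.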
Known: top and left borders. Goal: right and bottom borders.
• Equal letter box: [figure]
• Different letter box: [figure]
• Observation: consecutive cells in the $(d_{i,j})$ matrix differ by at most one.
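The observation can be checked empirically on a full Levenshtein matrix. The sketch below assumes that "consecutive" means horizontally or vertically adjacent cells. Both function names are illustrative.

```python
def levenshtein_matrix(a, b):
    """Return the full (m+1) x (n+1) Levenshtein matrix for a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, diagonal)
    return d


def adjacent_cells_differ_by_at_most_one(d):
    """Check every horizontal and vertical neighbor pair in matrix d."""
    rows, cols = len(d), len(d[0])
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows and abs(d[i + 1][j] - d[i][j]) > 1:
                return False
            if j + 1 < cols and abs(d[i][j + 1] - d[i][j]) > 1:
                return False
    return True


print(adjacent_cells_differ_by_at_most_one(levenshtein_matrix("aaabb", "abbbbaa")))
```

This is only a check on examples, not a proof. It shows the property that lets border values be propagated across a box without filling its interior.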
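• $path(d, r) = C_s \min(d, r) + C_d \max(d - r, 0) + C_i \max(r - d, 0)$
[Figure: a box with cells (q, 0) and (s, t) and the offsets s − q, s − t, and t marked; three path cases: d = r ($C_s$ steps only), d < r ($C_s$ and $C_i$ steps), d > r ($C_s$ and $C_d$ steps).]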
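The path formula can be evaluated directly. In the sketch below, d and r are assumed to be the number of steps down and to the right inside a different letter box, and C_s, C_d, C_i are assumed to be the substitution, deletion, and insertion costs. The parameter names are illustrative.

```python
def path_cost(d, r, c_s, c_d, c_i):
    """Evaluate path(d, r) = C_s*min(d, r) + C_d*max(d - r, 0) + C_i*max(r - d, 0).

    min(d, r) diagonal steps are substitutions. The remaining d - r steps
    (if d > r) are deletions, or the remaining r - d steps (if r > d) are
    insertions.
    """
    return c_s * min(d, r) + c_d * max(d - r, 0) + c_i * max(r - d, 0)


print(path_cost(3, 3, c_s=2, c_d=1, c_i=3))  # d = r: 3 substitutions, cost 6
print(path_cost(2, 5, c_s=2, c_d=1, c_i=3))  # d < r: 2 substitutions + 3 insertions, cost 13
print(path_cost(5, 2, c_s=2, c_d=1, c_i=3))  # d > r: 2 substitutions + 3 deletions, cost 7
```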
How to evaluate the min value in constant time
• The problem is that path is no longer a constant.
[Figure: cells (s1, t1), (s2, t2), (s3, t3), and (s4, t4), with steps labeled $+C_s - C_i$ and $+C_s - C_d$.]
Part 3: Approximate Searching
• Find all approximate occurrences of A (short pattern) in B (long string).
• Set $d_{0,j} = 0$ for all $j$ and report every $j$ with $d_{m,j} \le k$.
• More efficient approach: evaluate only the first m columns in each long run.
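The first two bullets describe the classical column-wise search on the uncompressed text, which can be sketched as follows. This is the plain O(mn) baseline with unit costs, not the run-length-aware algorithm. The function name is illustrative, and only end positions j ≥ 1 are reported.

```python
def approximate_occurrences(pattern, text, k):
    """Return all 1-based end positions j in text with d[m][j] <= k.

    Row 0 is fixed to 0 so that an occurrence may start anywhere in text.
    Only one column of the matrix is kept in memory.
    """
    m = len(pattern)
    column = list(range(m + 1))  # column j = 0: d[i][0] = i
    end_positions = []
    for j, ch in enumerate(text, start=1):
        previous_diagonal = column[0]  # d[0][j-1], always 0
        column[0] = 0                  # d[0][j] = 0
        for i in range(1, m + 1):
            current = min(
                column[i] + 1,          # d[i][j-1] + 1
                column[i - 1] + 1,      # d[i-1][j] + 1
                previous_diagonal + (0 if pattern[i - 1] == ch else 1),
            )
            previous_diagonal = column[i]
            column[i] = current
        if column[m] <= k:
            end_positions.append(j)
    return end_positions


print(approximate_occurrences("abc", "xxabcxx", 1))  # [4, 5, 6]
```

With k = 0 the same call reports only position 5, the end of the exact occurrence.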
Time Complexity
• Short run in B with length r ≤ m: O(m'r + m)
• Long run: O(m'm + m + m)
• Total time complexity is O(n'm'm + R), where R is the number of occurrences.
Part 4: Improving a Greedy Algorithm for LCS
• Basic idea: fill only the corners of the boxes.
• Different letter box: [figure: a box of width x with the offsets +s and +t marked]
Equal letter box:
• Recursively trace an optimal path.
• Tracing a path takes O(m' + n') time.
• The algorithm takes O(m'n'(m' + n')) time.
Analysis of Time Complexity
• Observation: each cell on the borders of the boxes is visited at most once.
• The algorithm therefore also achieves the O(m'n + n'm) bound.
• Time complexity is O(min(m'n'(m' + n'), m'n + n'm)).
• Space complexity is O(m'n').