
Approximate Matching of Run-Length Compressed Strings

Approximate Matching of Run-Length Compressed Strings. Algorithmica (2003). Veli Mäkinen, Gonzalo Navarro, and Esko Ukkonen. Run-length encoding: aaabb → (a,3),(b,2). Topics: edit distance on run-length compressed strings, extending to weighted edit distance, and approximate searching.

franz

Presentation Transcript


  1. Approximate Matching of Run-Length Compressed Strings • Algorithmica (2003) • Veli Mäkinen, Gonzalo Navarro, and Esko Ukkonen

  2. Run-length encoding: aaabb → (a,3),(b,2) • Edit Distance on Run-Length Compressed Strings • Extending to Weighted Edit Distance • Approximate Searching • Improving a Greedy Algorithm for LCS
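The encoding on slide 2 can be illustrated with a minimal sketch. The function names `rle_encode` and `rle_decode` are hypothetical and not taken from the paper; the pairs follow the (letter, run length) form shown on the slide.

```python
from itertools import groupby


def rle_encode(text):
    """Return the run-length encoding of text as (letter, run_length) pairs."""
    return [(letter, len(list(run))) for letter, run in groupby(text)]


def rle_decode(runs):
    """Expand (letter, run_length) pairs back into the original string."""
    return "".join(letter * length for letter, length in runs)


print(rle_encode("aaabb"))  # [('a', 3), ('b', 2)]
```

In the notation of the following slides, the number of pairs returned is the compressed length (m' or n'), while the length of the decoded string is m or n.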

  3. Part 1: An $O(mn' + m'n)$ Algorithm for the Levenshtein Distance • String $A = a_1 a_2 \cdots a_m$, compressed length $m'$; string $B = b_1 b_2 \cdots b_n$, compressed length $n'$ • Levenshtein distance $D_L(A, B)$: $d_{i,j} = \min(d_{i-1,j} + 1,\ d_{i,j-1} + 1,\ d_{i-1,j-1} + (\text{if } a_i = b_j \text{ then } 0 \text{ else } 1))$ • Indel distance $D_{ID}(A, B)$: $d_{i,j} = \min(d_{i-1,j} + 1,\ d_{i,j-1} + 1,\ d_{i-1,j-1} + (\text{if } a_i = b_j \text{ then } 0 \text{ else } \infty))$ • Use dynamic programming
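As a baseline for the two recurrences on slide 3, here is a minimal sketch of the plain O(mn) dynamic program on the uncompressed strings. It is not the run-length algorithm of the paper. The boundary values d[i][0] = i and d[0][j] = j are the standard ones and are an assumption here, since the slide does not state them; the function names are hypothetical.

```python
import math


def edit_distance(a, b, mismatch_cost=1):
    """Fill the full (m+1) x (n+1) matrix d and return d[m][n]."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # assumed boundary: delete the first i letters of a
    for j in range(1, n + 1):
        d[0][j] = j  # assumed boundary: insert the first j letters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = 0 if a[i - 1] == b[j - 1] else mismatch_cost
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + diagonal)
    return d[m][n]


def levenshtein(a, b):
    """D_L: a mismatch on the diagonal costs 1."""
    return edit_distance(a, b, mismatch_cost=1)


def indel_distance(a, b):
    """D_ID: a mismatch on the diagonal costs infinity, so only indels remain."""
    return edit_distance(a, b, mismatch_cost=math.inf)


print(levenshtein("aaabb", "aabbb"), indel_distance("aaabb", "aabbb"))  # 1 2
```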

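  4. Relationship between $D_{ID}$ and LCS • $2 \times |LCS(A, B)| = m + n - D_{ID}(A, B)$ • $m + n = 2 \times |LCS(A, B)| + x + y$, where $x$ and $y$ are the numbers of characters of $A$ and $B$ outside the LCS • $D_{ID}(A, B) = x + y$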
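The identity on slide 4 can be checked numerically. The sketch below computes the LCS length and the indel distance independently with textbook row-by-row dynamic programs (not the paper's algorithm) and compares both sides; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a, b):
    """Edit distance with insertions and deletions only (no substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            best = min(prev[j] + 1, cur[j - 1] + 1)
            if ch_a == ch_b:
                best = min(best, prev[j - 1])  # free diagonal step on a match
            cur.append(best)
        prev = cur
    return prev[-1]


for a, b in [("aaabb", "aabbb"), ("kitten", "sitting"), ("abc", "")]:
    assert 2 * lcs_length(a, b) == len(a) + len(b) - indel_distance(a, b)
```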

  5. Notations

  6. Known: top and left borders. Goal: right and bottom borders • Equal letter box:

  7. Different letter box: • Observation: consecutive cells in the $(d_{i,j})$ matrix differ by at most one

  8. Algorithm:

  9. Time Complexity of the Algorithm

  10. Part 2: Extending to Weighted Edit Distance

  11. Which one is correct? or

  12. $\mathrm{path}(d, r) = C_s \min(d, r) + C_d \max(d - r, 0) + C_i \max(r - d, 0)$ • [Figure: the three cases $d = r$, $d < r$, and $d > r$, with edges labeled $C_s$, $C_i$, $C_d$ and points $(q, 0)$ and $(s, t)$]
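The path formula on slide 12 translates directly into a small function. In this sketch, d and r are read as the number of rows and columns the path has to cover, which is an assumption because the slide's figure is lost; the parameter names `c_s`, `c_d`, and `c_i` stand for the substitution, deletion, and insertion costs.

```python
def path_cost(d, r, c_s, c_d, c_i):
    """Cost of a path covering d rows and r columns (assumed reading of the slide).

    min(d, r) diagonal steps are paid as substitutions; the surplus rows are
    paid as deletions and the surplus columns as insertions.
    """
    return c_s * min(d, r) + c_d * max(d - r, 0) + c_i * max(r - d, 0)


print(path_cost(3, 3, 1, 2, 3))  # d = r: 3 substitutions, cost 3
print(path_cost(5, 2, 1, 2, 3))  # d > r: 2 substitutions + 3 deletions, cost 8
print(path_cost(2, 5, 1, 2, 3))  # d < r: 2 substitutions + 3 insertions, cost 11
```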

  13. How to evaluate the min value in constant time • The problem is that the path cost is no longer a constant • [Figure: points $(s_1, t_1)$, $(s_2, t_2)$, $(s_3, t_3)$, $(s_4, t_4)$ with offsets $+C_s - C_i$ and $+C_s - C_d$]

  14. Part 3: Approximate Searching • Find all approximate occurrences of $A$ (short pattern) in $B$ (long string) • Set $d_{0,j} = 0$ for all $j$ and report every $j$ with $d_{m,j} \le k$ • More efficient approach: evaluate only the first $m$ columns in each long run
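The search formulation on slide 14 can be sketched on uncompressed strings as follows. This is the plain column-by-column dynamic program with d[0][j] = 0, using the Levenshtein costs from Part 1; it does not include the long-run shortcut mentioned on the slide. The function name and the choice to report 1-based end positions are assumptions made for the example.

```python
def approximate_occurrences(pattern, text, k):
    """Return the 1-based end positions j in text where d[m][j] <= k."""
    m = len(pattern)
    column = list(range(m + 1))  # column j = 0: d[i][0] = i
    end_positions = []
    for j, ch in enumerate(text, 1):
        new_column = [0]  # d[0][j] = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            mismatch = 0 if pattern[i - 1] == ch else 1
            new_column.append(min(column[i] + 1,          # d[i][j-1] + 1
                                  new_column[i - 1] + 1,  # d[i-1][j] + 1
                                  column[i - 1] + mismatch))
        column = new_column
        if column[m] <= k:
            end_positions.append(j)
    return end_positions


print(approximate_occurrences("abc", "xxabcxx", 0))  # [5]
print(approximate_occurrences("abc", "xxabcxx", 1))  # [4, 5, 6]
```

Only one column is kept at a time, so the sketch uses O(m) space and O(mn) time.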

  15. Time Complexity • Short run in $B$ with length $r \le m$: $O(m'r + m)$ • Long run: $O(m'm + m + m)$ • Total time complexity is $O(n'm'm + R)$, where $R$ is the number of occurrences

  16. Part 4: Improving a Greedy Algorithm for LCS • Basic idea: fill only the corners of the boxes • Different letter box: [Figure labels: ←x→, +s, +t]

  17. Equal letter box: • Recursively trace an optimal path • Tracing a path takes $O(m' + n')$ time • The algorithm takes $O(m'n'(m' + n'))$ time

  18. Analysis of Time Complexity • Observation: each cell on the borders of the boxes can be visited only once • The algorithm therefore also achieves the $O(m'n + n'm)$ bound • Time complexity is $O(\min(m'n'(m' + n'),\ m'n + n'm))$ • Space complexity is $O(m'n')$
