Dynamic Edit Distance Table under a General Weighted Cost Function • Heikki Hyyrö (University of Tampere, Finland), Kazuyuki Narisawa (Kyushu University, Japan), and Shunsuke Inenaga (Kyushu University, Japan)
Contents • Edit Distance • Left Increment/Decrement Edit Distance Problem • Related Work • Our Algorithm • Experiments • Summary
Edit Distance • minimum total cost d for transforming string x[1:n] to y[1:m] • Example : x = prague, y = passage, Ins. = Del. = Sub. = 1 • Edit Distance = Sub. + Ins. + Ins. + Del. = 1 + 1 + 1 + 1 = 4
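The definition above corresponds to the standard dynamic-programming table. A minimal sketch in Python, assuming one substitution cost shared by every mismatching character pair; the names `edit_table` and `edit_distance` and their parameters are illustrative and do not come from the slides:

```python
def edit_table(x, y, ins=1, dele=1, sub=1):
    """Return D, where D[i][j] is the minimum cost of transforming x[:i] into y[:j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: delete every character of x[:i]
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, m + 1):          # first row: insert every character of y[:j]
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def edit_distance(x, y, ins=1, dele=1, sub=1):
    """Bottom-right entry of the table, i.e. the distance between the full strings."""
    return edit_table(x, y, ins, dele, sub)[len(x)][len(y)]


print(edit_distance("prague", "passage"))  # 4, as in the slide's example
```

Filling the whole table takes O(nm) time, which is the cost the later slides try to avoid paying on every update.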
Right Increment/Decrement • Right I/D of Edit Distance • input : D of strings A and B • output : D’ of strings A and B’ (B = B’a for decrement, or Ba = B’ for increment) • easy to compute • insert or delete the rightmost column of D → D’ : O(m)
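The right-hand case is easy because a new last column depends only on the column before it. A minimal sketch, assuming the table is stored as a list of rows and a single substitution cost; `right_increment` and `right_decrement` are hypothetical helper names:

```python
def edit_table(x, y, ins=1, dele=1, sub=1):
    """Full table, used here only to check the incremental result."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def right_increment(D, x, a, ins=1, dele=1, sub=1):
    """Append, in place, the column for B' = Ba. One new cell per row."""
    D[0].append(D[0][-1] + ins)
    for i in range(1, len(x) + 1):
        # Row i-1 already has its new cell, so [-2] is its old last column.
        diag = D[i - 1][-2] + (0 if x[i - 1] == a else sub)
        up = D[i - 1][-1] + dele
        left = D[i][-1] + ins          # row i has not been extended yet
        D[i].append(min(diag, up, left))


def right_decrement(D):
    """Drop, in place, the last column (B = B'a)."""
    for row in D:
        row.pop()


D = edit_table("prague", "passag")
right_increment(D, "prague", "e")
print(D == edit_table("prague", "passage"))  # True
```

No comparable shortcut exists on the left side, because the first column feeds into every column to its right.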
Left Increment/Decrement • Left I/D of Edit Distance • input : D of strings A and B • output : D’ of strings A and B’ (B = aB’ for decrement, or aB = B’ for increment) • difficult to compute • values on the left side affect the values on the right side
Contribution • Propose an efficient algorithm for the Left I/D problem with any nonnegative integer costs • Left I/D problem • input : ED table D of strings A and B • output : ED table D’ of strings A and B’ • B = aB’ (decrement) • B’ = aB (increment) • costs of operations are nonnegative integers
Applications • Cyclic String Comparison [Landau et al. 1998] • Computing Approximate Periods [Schmidt 1998] • Edit distance for a sliding window • String kernel based on edit distance • a kernel is a mapping to a high-dimensional feature space • used in Support Vector Machines (classifiers)
Related Work • naïve method • compute D’ from scratch • O(nm) time • Kim & Park algorithm [2004] • each operation has cost 1 • compute the difference representation DR of table D • using the Change Table Ch • O(n+m) time
Definition • Left Increment/Decrement Problem • input : DR table of strings A and B • output : DR’ table of strings A and B’ • B = aB’ (decrement) • B’ = aB (increment) • each cost (Ins., Del., Sub.) is a nonnegative integer • Kim & Park algorithm : each cost is 1
Difference Representation • DR[i, j].U = D[i, j] – D[i–1, j] (under minus upper) • DR[i, j].L = D[i, j] – D[i, j–1] (right minus left)
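A small sketch of the difference representation, assuming that "under minus upper" means DR[i, j].U = D[i, j] – D[i–1, j] and "right minus left" means DR[i, j].L = D[i, j] – D[i, j–1]. The function names are illustrative. It also shows how the edit distance can be read back from the differences in O(n+m) steps by walking down the first column and then along the last row:

```python
def edit_table(x, y, ins=1, dele=1, sub=1):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def difference_representation(D):
    """Return (U, L); entries with no upper or left neighbour stay None."""
    n, m = len(D) - 1, len(D[0]) - 1
    U = [[None] * (m + 1) for _ in range(n + 1)]
    L = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                U[i][j] = D[i][j] - D[i - 1][j]   # under minus upper
            if j > 0:
                L[i][j] = D[i][j] - D[i][j - 1]   # right minus left
    return U, L


def distance_from_dr(U, L):
    """Sum U down column 0, then L along the last row: n + m additions."""
    n, m = len(U) - 1, len(U[0]) - 1
    down = sum(U[i][0] for i in range(1, n + 1))
    right = sum(L[n][j] for j in range(1, m + 1))
    return down + right


U, L = difference_representation(edit_table("prague", "passage"))
print(distance_from_dr(U, L))  # 4
```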
DR’ – DR • not all cells need to be updated
Change Table • Ch[i, j] = D’[i, j] – D[i, j] • when every cost = 1 • values in Ch : –1, 0, 1 • Ch is separated into three areas
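To make the change table concrete, the sketch below builds Ch for a left increment B’ = aB by brute force. The column alignment is an assumption: column j+1 of D’ is compared with column j of D, so that both cells refer to the same suffix of B. The helper name `change_table` is illustrative:

```python
def edit_table(x, y, ins=1, dele=1, sub=1):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def change_table(x, y, a, ins=1, dele=1, sub=1):
    """Ch[i][j] = D'[i][j+1] - D[i][j] for the left increment B' = aB (assumed alignment)."""
    D = edit_table(x, y, ins, dele, sub)
    Dp = edit_table(x, a + y, ins, dele, sub)
    return [[Dp[i][j + 1] - D[i][j] for j in range(len(y) + 1)]
            for i in range(len(x) + 1)]


ch = change_table("prague", "assage", "p")   # B = assage, B' = passage
values = {v for row in ch for v in row}
print(values <= {-1, 0, 1})  # True with unit costs
for row in ch:
    print(row)
```

Printing the rows shows the table as blocks of equal values, which is what the slide calls areas.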
Affected Entries • entries where DR’[i, j] ≠ DR[i, j] • they must be updated • affected entries are along the borders of the three areas in Ch
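The link between affected entries and area borders can be checked by brute force. The sketch below assumes the same column alignment as for Ch (column j+1 of D’ against column j of D) and the usual U/L differences; `affected_entries` is a hypothetical helper, and it recomputes both tables in O(nm), so it illustrates the property rather than any fast algorithm:

```python
def edit_table(x, y, ins=1, dele=1, sub=1):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def affected_entries(x, y, a, ins=1, dele=1, sub=1):
    """Return (Ch, affected) for B' = aB; affected holds (i, j, 'U' or 'L') triples."""
    D = edit_table(x, y, ins, dele, sub)
    Dp = edit_table(x, a + y, ins, dele, sub)
    n, m = len(x), len(y)
    ch = [[Dp[i][j + 1] - D[i][j] for j in range(m + 1)] for i in range(n + 1)]
    affected = set()
    for i in range(n + 1):
        for j in range(m + 1):
            # Vertical difference changed between DR and DR'.
            if i > 0 and D[i][j] - D[i - 1][j] != Dp[i][j + 1] - Dp[i - 1][j + 1]:
                affected.add((i, j, "U"))
            # Horizontal difference changed between DR and DR'.
            if j > 0 and D[i][j] - D[i][j - 1] != Dp[i][j + 1] - Dp[i][j]:
                affected.add((i, j, "L"))
    return ch, affected


ch, affected = affected_entries("prague", "assage", "p")
total = 2 * 7 * 7 - 7 - 7            # number of U and L entries that exist
print(len(affected), "of", total, "difference entries change")
```

An entry's U (or L) difference changes exactly when Ch differs from its upper (or left) neighbour, since DR’ – DR equals the difference of two adjacent Ch values.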
Sketch of Kim & Park Algorithm • Update affected entries • scan borders in Ch, computing Ch and DR’ • Time Complexity : O(n+m)
General Costs • Ch can be separated into more than three areas • the number of areas depends on the costs • the values are not limited to –1, 0, 1 • Kim & Park algorithm • is specialized to the three-area case • cannot be applied with general costs • Example : Ins. = 2, Del. = 2, Sub. = 1
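The example costs can be tried directly. The sketch below compares the range of Ch values under unit costs and under Ins. = 2, Del. = 2, Sub. = 1, again assuming that column j+1 of D’ is aligned with column j of D for a left increment; `change_value_range` is an illustrative name:

```python
def edit_table(x, y, ins=1, dele=1, sub=1):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def change_value_range(x, y, a, ins, dele, sub):
    """Smallest and largest value of D'[i][j+1] - D[i][j] for B' = aB."""
    D = edit_table(x, y, ins, dele, sub)
    Dp = edit_table(x, a + y, ins, dele, sub)
    diffs = [Dp[i][j + 1] - D[i][j]
             for i in range(len(x) + 1) for j in range(len(y) + 1)]
    return min(diffs), max(diffs)


unit = change_value_range("prague", "assage", "p", 1, 1, 1)
weighted = change_value_range("prague", "assage", "p", 2, 2, 1)
print("unit costs:", unit)
print("Ins=2, Del=2, Sub=1:", weighted)
```

Row 0 alone already leaves the –1, 0, 1 range: every entry there changes by exactly the insertion cost, which is 2 in this example.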
Our Algorithm • Update only affected entries • without Ch • compute only DR’.U and DR’.L • Time complexity : O(min{c(n+m), nm}) • c is the maximum cost • (Figure panels : DR’.L – DR.L, DR’.U – DR.U, D’ – D)
Affected Entry • DR’[i, j] ≠ DR[i, j] • Kim & Park algorithm • computes DR’ and Ch to find the affected entries • Our algorithm • finds the affected entries using only the DR table • uses the following lemma
Comparison of Behaviors • (Figure panels : our algorithm, Kim & Park algorithm)
Experiments • strings A[1:m] and B[1:m] • total time of computing representations of edit distance between A and B[j:m] for j = m, m–1, …, 1 • left incremental computation • Machine specifications • CentOS Linux • Xeon 3.0 GHz • 16 GB memory
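The measured task can be sketched for the naïve baseline: recompute the table from scratch for each suffix B[j:m], from the shortest suffix to the full string, and time the whole loop. This is an assumed reading of the setup, not the authors' benchmark code; `naive_left_incremental` is a hypothetical name and Python indices are 0-based:

```python
import time


def edit_table(x, y, ins=1, dele=1, sub=1):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def naive_left_incremental(a, b, ins=1, dele=1, sub=1):
    """Distances between a and every suffix of b, shortest suffix first, plus elapsed seconds."""
    distances = []
    start = time.perf_counter()
    for j in range(len(b) - 1, -1, -1):      # suffixes b[m-1:], ..., b[0:]
        suffix = b[j:]
        D = edit_table(a, suffix, ins, dele, sub)
        distances.append(D[len(a)][len(suffix)])
    elapsed = time.perf_counter() - start
    return distances, elapsed


distances, elapsed = naive_left_incremental("prague", "passage")
print(distances[-1])  # 4: the last step covers the full string
```

Each step costs O(nm) here, so the total grows cubically when both strings have length m; an incremental algorithm replaces the inner recomputation with an update of the previous table.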
Experiment 1 • Time comparison with naïve algorithm • costs • chosen randomly • Insertion = 137, Deletion = 116, Substitution = 242 • Random data • alphabet size 2, 3, …, 52 • string length 100, 200, …, 5000
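For these costs the maximum cost is c = 242, so the stated bound O(min{c(n+m), nm}) switches between its two terms within the tested length range. The sketch below only evaluates that expression with constant factors ignored, so it is a rough guide to the bound and not a prediction of running time; `work_bound` is an illustrative name:

```python
def work_bound(n, m, ins, dele, sub):
    """Evaluate min{c(n+m), nm} with c the largest of the three costs."""
    c = max(ins, dele, sub)
    return min(c * (n + m), n * m)


# Shortest tested length: nm = 10,000 is smaller than c(n+m) = 48,400.
print(work_bound(100, 100, 137, 116, 242))    # 10000
# Longest tested length: c(n+m) = 2,420,000 is smaller than nm = 25,000,000.
print(work_bound(5000, 5000, 137, 116, 242))  # 2420000
```

For equal lengths the two terms meet at n = 2c = 484, so the c(n+m) term is the smaller one for most of the tested lengths.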
Experiment 2 • Time comparison with Kim & Park algorithm • costs • Insertion = Deletion = Substitution = 1 • Random data • alphabet size 2, 3, …, 52 • string length 100, 200, …, 5000
Experiment 3 • Time comparison with naïve algorithm • Corpus • English (Reuters news) • costs • Insertion = 137, Deletion = 116, Substitution = 242 • string length : 1000, 2000, 3000, 4000, 5000 • Protein data (Canterbury corpus: E.coli) • costs proposed in [Kurtz 1996] • string length : 1000, 2000, 3000, 4000, 5000
Result 3 • (Figure panels : English News, Protein Data)
Summary • Algorithm for Left I/D problem • nonnegative integer costs • O(min{c(n+m), nm}) time • c is the maximum cost • experimentally fast
Related Work • naïve method • compute D’ from scratch • O(nm) time • Kim & Park algorithm [2004] • each operation has cost 1 • compute difference representation DR → DR’ • using the Change Table Ch • O(n+m) time • (Diagram : building the initial table takes O(nm) • naïve : D → D’ in O(nm), then Edit Distance in O(1) • Kim & Park : DR, Ch → DR’, Ch in O(n+m), then Edit Distance in O(n+m))