Edit Distance and Large Data Sets Ravi Kumar Robert Krauthgamer Ziv Bar-Yossef T.S. Jayram IBM Almaden Technion
Motivating Example: Near-Duplicate Elimination
Pipeline: Web → Crawler → Page Repository → Duplicate elimination → Page Repository
• Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]
• Group pages into clusters of “similar” pages
• Keep one “representative” from each cluster
Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]
Challenges
• Corpus is huge (billions of pages, 10K/page)
• Streaming access
• Limited main memory
• Linear running time
Sketch: each page $p$ is mapped to a short sketch $h(p)$.
Locality Sensitive Hashes [Indyk, Motwani 98]: $\Pr_h[h(p) = h(q)] = \mathrm{sim}(p,q)$
• Can compute sketches in one pass
• Sketches can be stored and processed on a single machine
Cluster: Collection of pages that have a common sketch
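Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
w-shingling: $S_w(p)$ = all substrings of $p$ of length $w$
$\mathrm{resemblance}_w(p,q) = \Pr_\pi[\min(\pi(S_w(p))) = \min(\pi(S_w(q)))] = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$, where $\pi$ is a random permutation of the shingle universe.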
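The identity between the min-wise collision probability and the intersection-over-union ratio can be checked empirically. The following is a minimal sketch, not the authors' implementation: the function names, the use of random real-valued ranks as a stand-in for a random permutation, and the example strings are all assumptions made for illustration.

```python
import random

def shingles(p, w):
    """Set of all length-w substrings (w-shingles) of p."""
    return {p[i:i + w] for i in range(len(p) - w + 1)}

def resemblance(p, q, w):
    """Exact resemblance: |S_w(p) & S_w(q)| / |S_w(p) | S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)

def minhash_estimate(p, q, w, trials=2000, seed=0):
    """Fraction of random orderings under which both shingle sets
    have the same minimum element."""
    rng = random.Random(seed)
    sp, sq = shingles(p, w), shingles(q, w)
    universe = sorted(sp | sq)
    hits = 0
    for _ in range(trials):
        # Random ranks induce a random permutation of the universe.
        rank = {s: rng.random() for s in universe}
        if min(sp, key=rank.get) == min(sq, key=rank.get):
            hits += 1
    return hits / trials

p, q = "abcdefgh", "abcdxfgh"
print(resemblance(p, q, 2))        # 5 shared shingles out of 9 in the union
print(minhash_estimate(p, q, 2))   # should be close to 5/9
```

In practice a single sketch stores only a few minima, one per permutation; the many trials here serve only to make the probability visible.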
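The Sketching Model
• Alice holds $x$, Bob holds $y$; they have Shared Randomness.
• Alice sends a sketch of $x$ and Bob sends a sketch of $y$ to the Referee.
$k$ vs. $r$ Gap Problem
• Promise: $d(x,y) \le k$ or $d(x,y) \ge r$
• Goal: Decide which of the two holds.
• Referee: outputs “$d(x,y) \le k$” or “$d(x,y) \ge r$”.
• Approximation: determined by the gap between $k$ and $r$.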
Applications of Sketching
Large data sets
• Clustering
• Nearest Neighbor schemes
• Data streams
Management of Files over the Network
• Differential backup
• Synchronization
Theory
• Low distortion embeddings
• Simultaneous messages communication complexity
Known Sketching Schemes
• Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
• Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
• Cosine similarity [Charikar 02]
• Earth mover distance [Charikar 02]
In this talk: Edit Distance
Edit Distance
$x \in \Sigma^n$, $y \in \Sigma^m$
ED(x,y): Minimum number of character insertions, deletions and substitutions that transform $x$ to $y$.
Examples:
• ED(00000, 1111) = 5
• ED(01010, 10101) = 2
Applications
• Genomics
• Text processing
• Web search
For simplicity: $m = n$, $\Sigma = \{0,1\}$.
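The definition can be made concrete with the textbook quadratic-time dynamic program. This is a minimal sketch with a hypothetical function name, shown only to reproduce the two examples on the slide; it is not the approximation scheme the talk develops.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions
    turning x into y (row-by-row dynamic programming)."""
    # prev[j] = distance between the processed prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match for free
            ))
        prev = cur
    return prev[-1]

print(edit_distance("00000", "1111"))   # 5
print(edit_distance("01010", "10101"))  # 2
```

The second example costs only 2 because deleting the leading 0 and appending a 1 shifts the whole string, which is exactly the kind of alignment that Hamming distance cannot capture.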
Computing Edit Distance
Exact Computation
• Dynamic programming (1970): $O(n^2)$
• Masek and Paterson (1980): $O(n^2/\log n)$
• Impractical for comparing two very long strings.
  Natural question 1: can we do it in linear time?
• Impractical for handling massive document repositories.
  Natural question 2: are there constant size sketches of edit distance?
Focus of this talk: Can we solve the above problems if we settle for approximation?
Sketching Schemes for Edit Distance
Negative Indications
• No known embeddings of edit distance into a normed space.
• Every embedding of edit distance into $L_1$ incurs $\ge 3/2$ distortion [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03]
• Weak nearest neighbor schemes [Indyk 04]
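Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]
Ham(x,y) = # of positions in which $x, y$ differ
Gap: $k$ vs. $2k$
Sketch size: $O(1)$
Shared randomness: $r_1, \ldots, r_n \in \{0,1\}$ are independent and $\Pr[r_i = 1] = 1/(2k)$
Sketch: $h(x) = (\sum_i x_i r_i) \bmod 2$, $h(y) = (\sum_i y_i r_i) \bmod 2$
Full sketch of $x$: $(h_1(x), \ldots, h_t(x))$; full sketch of $y$: $(h_1(y), \ldots, h_t(y))$; $t = O(1)$
Analysis: $\Pr[h(x) \ne h(y)] = \Pr[h(x) + h(y) \equiv 1 \pmod 2] = \Pr[\sum_{i: x_i \ne y_i} r_i \equiv 1 \pmod 2] = \frac{1}{2}\left(1 - (1 - 1/k)^{\mathrm{Ham}(x,y)}\right)$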
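The parity sketch and its collision analysis can be simulated directly. The code below is an illustrative sketch under stated assumptions: the bias $\Pr[r_i = 1] = 1/(2k)$ is chosen because it is the value that yields the slide's formula $\frac{1}{2}(1 - (1 - 1/k)^{\mathrm{Ham}(x,y)})$, all function names are hypothetical, and the large number of repetitions is used only to make the mismatch frequency measurable (the scheme itself needs only $t = O(1)$ bits).

```python
import random

def make_masks(n, k, t, rng):
    """t independent masks r in {0,1}^n with Pr[r_i = 1] = 1/(2k) (assumed bias)."""
    p = 1.0 / (2 * k)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(t)]

def sketch(x, masks):
    """One parity bit per mask: h(x) = (sum_i x_i r_i) mod 2."""
    return [sum(xi & ri for xi, ri in zip(x, r)) % 2 for r in masks]

def mismatch_fraction(x, y, masks):
    """Fraction of sketch bits on which x and y disagree."""
    hx, hy = sketch(x, masks), sketch(y, masks)
    return sum(a != b for a, b in zip(hx, hy)) / len(masks)

def predicted(k, ham):
    """Mismatch probability from the analysis: 1/2 * (1 - (1 - 1/k)^ham)."""
    return 0.5 * (1 - (1 - 1 / k) ** ham)

rng = random.Random(1)
n, k, t = 200, 10, 4000
masks = make_masks(n, k, t, rng)
x = [0] * n
y_close = [1] * k + [0] * (n - k)            # Ham(x, y_close) = k
y_far = [1] * (2 * k) + [0] * (n - 2 * k)    # Ham(x, y_far) = 2k
print(mismatch_fraction(x, y_close, masks), predicted(k, k))
print(mismatch_fraction(x, y_far, masks), predicted(k, 2 * k))
```

For $k = 10$ the two predicted mismatch rates are roughly 0.33 and 0.44, a constant separation, which is why a constant number of bits suffices to distinguish the two sides of the gap with constant probability.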
Edit Distance Sketches: Basic Framework
Underlying Principle: ED(x,y) is small iff $x$ and $y$ share many common substrings at nearby positions.
$S_x$ = set of pairs of the form $(s, h(i))$
• $s$: a substring of $x$
• $h(i)$: a “locality sensitive” encoding of the substring’s position
[Figure: $x$ and $y$ with common substrings at nearby positions; sets $S_x$ and $S_y$ overlapping.]
ED(x,y) small iff intersection $S_x \cap S_y$ large
Basic Framework (cont.)
ED(x,y) small iff symmetric difference $S_x \,\Delta\, S_y$ small
• Need to estimate size of symmetric difference
• Hamming distance computation of characteristic vectors
• Use constant size sketches [KOR]
Reduced Edit Distance to Hamming Distance
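General Case: Encoding Scheme
Gap: $k$ vs. $O((kn)^{2/3})$
$B = n^{2/3}/k^{1/3}$, $W = n/B$
$B$ windows of size $W$ each.
[Figure: $x$ and $y$ drawn over positions 1–14 and divided into windows 1, 2, 3; substrings $b_1, b_2, b_3$ marked on $y$.]
$S_x = \{(a_1,1), (a_2,1), (a_3,2), \ldots, (a_i, \mathrm{win}(i)), \ldots\}$
$S_y = \{(b_1,1), (b_2,1), (b_3,2), \ldots, (b_i, \mathrm{win}(i)), \ldots\}$
Here $a_i$ and $b_i$ denote the $i$-th substrings of $x$ and $y$, and $\mathrm{win}(i)$ is the window that contains position $i$.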
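A toy version of this encoding helps show how edits translate into set differences. The slide does not spell out every detail, so the following rests on labeled assumptions: substrings are taken at every position with length $B$, $\mathrm{win}(i)$ is computed as integer division of a 0-based position by $W$, and the names `encode_general` and `set_hamming` are hypothetical. Letters are used instead of bits purely for readability.

```python
def encode_general(x, B, W):
    """Set of (substring, window index) pairs.
    Assumption: every length-B substring is used, and the window of a
    0-based position i is i // W."""
    return {(x[i:i + B], i // W) for i in range(len(x) - B + 1)}

def set_hamming(sx, sy):
    """Hamming distance between characteristic vectors of two sets,
    which equals the size of their symmetric difference."""
    return len(sx ^ sy)

x = "abcdefgh"
y = "abcXefgh"   # one substitution: ED(x, y) = 1
sx = encode_general(x, B=2, W=4)
sy = encode_general(y, B=2, W=4)
print(sorted(sx - sy))      # the B = 2 substrings of x touching the edit
print(set_hamming(sx, sy))  # 4: two affected pairs on each side
```

A single edit touches at most $B$ substrings on each side, which is where the $kB$ term in the analysis comes from; the window index tolerates shifts smaller than a window while still ruling out matches between distant positions.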
Analysis
[Figure: substring $a_i$ of $x$ and its companion $b_j$ in $y$, over positions 1–14.]
• Case 1: $\mathrm{ED}(x,y) \le k$
• If $a_i$ is “unmarked”, it has a matching “companion” $b_j$
• $(a_i, \mathrm{win}(i)) \in S_x \setminus S_y$ only if:
  • either $a_i$ is “marked”
  • or $a_i$ is unmarked, but $\mathrm{win}(i) \ne \mathrm{win}(j)$
• At most $kB$ marked substrings
• At most $k \cdot n/W = kB$ companions with mismatched windows
• Therefore, $\mathrm{Ham}(S_x,S_y) \le 4kB$
Analysis (cont.)
[Figure: substrings of $x$ starting at positions $1, B+1, 2B+1, \ldots$ and the corresponding substrings of $y$, over positions 1–14.]
• Case 2: $\mathrm{Ham}(S_x,S_y) \le 8kB$
• If $a_i$ has a “companion” $b_j$ and $\mathrm{win}(i) = \mathrm{win}(j)$, can align $a_i$ with $b_j$ using at most $W$ operations
• Otherwise, substitute first character of $a_i$
• At most $8kB$ substrings of $x$ have no companion
• Therefore, $\mathrm{ED}(x,y) \le 8kB + W \cdot n/B = O((kn)^{2/3})$
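The final bound follows by substituting the parameter choices from the encoding scheme, $B = n^{2/3}/k^{1/3}$ and $W = n/B$. A short worked version of that arithmetic:

```latex
\begin{align*}
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = (kn)^{2/3},\\
W \cdot \frac{n}{B} &= \frac{n^2}{B^2} = \frac{n^2\, k^{2/3}}{n^{4/3}} = (kn)^{2/3},\\
\mathrm{ED}(x,y) &\le 8kB + W \cdot \frac{n}{B} = 9\,(kn)^{2/3} = O\!\left((kn)^{2/3}\right).
\end{align*}
```

The choice of $B$ is exactly the one that balances the two terms: a larger $B$ inflates the $kB$ term, a smaller $B$ inflates the alignment cost $n^2/B^2$.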
Non-repetitive Case: Encoding Scheme
$t \ge 1$: “non-repetitiveness” parameter, $W = O(k \cdot t)$
No substring of length $t$ repeats within a window of size $W$
Gap: $k$ vs. $O(kW)$
Alice and Bob choose a sequence of “anchors” in a coordinated way
[Figure: strings $x$ and $y$, each with a window of size $W$, anchors, and pieces $x_1, x_2$ and $y_1, y_2$.]
• $\pi_1$: a random permutation on $\{0,1\}^t$
• $a_1$: minimal length-$t$ substring of $x_1$ (under $\pi_1$)
• $b_1$: minimal length-$t$ substring of $y_1$ (under $\pi_1$)
Encoding scheme (cont.)
[Figure: $x$ and $y$ each cut at their anchors into eight pieces, numbered 1–8.]
$S_x = \{(x_1,1), \ldots, (x_8,8)\}$
$S_y = \{(y_1,1), \ldots, (y_8,8)\}$
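The coordinated choice of anchors can be imitated with a shared hash. This is a loose sketch under several assumptions that the slides do not confirm: a seeded SHA-256 digest stands in for the shared random permutations, each window starts right after the previous anchor, each piece ends right after its anchor, and the names `rank`, `split_at_anchors` and `encode_pieces` are hypothetical.

```python
import hashlib

def rank(sub, j, seed):
    """Stand-in for the j-th shared random permutation on {0,1}^t:
    orders substrings by a seeded hash (an assumption, not the slide's method)."""
    return hashlib.sha256(f"{seed}|{j}|{sub}".encode()).hexdigest()

def split_at_anchors(x, t, W, seed=0):
    """Cut x into pieces; in each window of size W the anchor is the
    length-t substring with minimal rank, and the piece ends after it."""
    pieces, start, j = [], 0, 1
    while len(x) - start >= W:
        window = x[start:start + W]
        best = min(range(W - t + 1), key=lambda p: rank(window[p:p + t], j, seed))
        end = start + best + t
        pieces.append(x[start:end])
        start, j = end, j + 1
    if start < len(x):
        pieces.append(x[start:])   # leftover shorter than one window
    return pieces

def encode_pieces(x, t, W, seed=0):
    """Set of (piece, index) pairs, mirroring S_x = {(x_1,1), (x_2,2), ...}."""
    return {(p, i) for i, p in enumerate(split_at_anchors(x, t, W, seed), 1)}

x = "0110100111010001011100101101"
print(split_at_anchors(x, t=3, W=8))
print(len(encode_pieces(x, t=3, W=8)))
```

Because both parties use the same seed, they pick the same anchor whenever their windows contain the same minimal substring, which is what lets pieces on both sides line up without communication.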
Analysis
[Figure: $x$ and $y$ cut at their anchors into numbered pieces.]
• Case 1: $\mathrm{ED}(x,y) \le k$.
• All anchors are “unmarked” with probability $1 - kt/W = \Omega(1)$
• If $a_i, b_i$ are unmarked, they are aligned
• # of mismatching substrings $\le 2k$
• $\mathrm{Ham}(S_x,S_y) \le 2k$
Analysis (cont.)
[Figure: $x$ and $y$ cut at their anchors into numbered pieces.]
• Case 2: $\mathrm{Ham}(S_x,S_y) \le 4k$
• # of mismatching substrings $\le 4k$
• $\mathrm{ED}(x,y) \le 2 \cdot W \cdot 4k = O(kW)$.
Approximation in Linear Time Arbitrary Strings Non-repetitive Strings
Summary and Open Problems
• Designed efficient approximation schemes for edit distance.
• Best sketching and linear-time approximations to date
• Subsequent work:
  • $O(n^{2/3})$ distortion embedding of edit distance into $L_1$ [Indyk 04] [Rabani 04]
  • Better embeddings of edit distance into $L_1$ [Ostrovsky, Rabani 05]
  • Embeddings of the Ulam metric into $L_1$ [Charikar, Krauthgamer 05]
• Open Problems
  • Sketch size lower bounds
  • Constant factor approximations in linear time
  • Better embeddings of edit distance
  • Sketching schemes for other distance measures