Edit Distance and Large Data Sets

Ravi Kumar, Robert Krauthgamer, Ziv Bar-Yossef, T.S. Jayram
IBM Almaden, Technion

Motivating Example: Near-Duplicate Elimination
Pipeline: Web → Crawler → Page Repository → Duplicate elimination → Page Repository
Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]
• Group pages into clusters of “similar” pages
• Keep one “representative” from each cluster

Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]
Challenges
• Corpus is huge (billions of pages, 10K/page)
• Streaming access
• Limited main memory
• Linear running time
Sketch: p → h(p)
Locality Sensitive Hashes [Indyk, Motwani 98]: Pr_h[h(p) = h(q)] = sim(p,q)
• Can compute sketches in one pass
• Sketches can be stored and processed on a single machine
Cluster: Collection of pages that have a common sketch

Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
w-shingling: S_w(p) = all substrings of p of length w
resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)| = Pr_π[min(π(S_w(p))) = min(π(S_w(q)))]
π: a random permutation of the length-w strings
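The identity between resemblance and the min-wise collision probability can be checked numerically. The following is a minimal sketch, not the cited authors' implementation: the function names are hypothetical, the random permutation is simulated by assigning fresh random ranks to the shingles in each trial, and both strings are assumed to have length at least w.

```python
import random

def shingles(p, w):
    """Return S_w(p): the set of all length-w substrings of p."""
    return {p[i:i + w] for i in range(len(p) - w + 1)}

def resemblance(p, q, w):
    """Exact resemblance: |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)

def minhash_estimate(p, q, w, trials=2000, seed=0):
    """Fraction of trials in which both shingle sets have the same minimum
    element under a random ranking (a stand-in for a random permutation)."""
    sp, sq = shingles(p, w), shingles(q, w)
    universe = sorted(sp | sq)          # sorted for reproducibility
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        rank = {s: rng.random() for s in universe}
        if min(sp, key=rank.get) == min(sq, key=rank.get):
            hits += 1
    return hits / trials
```

For example, "abcd" and "abce" with w = 2 share two of four distinct shingles, so the exact resemblance is 0.5 and the estimate should land close to it.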

The Sketching Model
Alice holds x, Bob holds y; they have Shared Randomness
Alice sends σ(x) and Bob sends σ(y) to the Referee
k vs. r Gap Problem
Promise: d(x,y) ≤ k or d(x,y) ≥ r
Goal: Decide which of the two holds.
Approximation: the Referee outputs “d(x,y) ≤ k” or “d(x,y) ≥ r”

Applications of Sketching
Large data sets
• Clustering
• Nearest Neighbor schemes
• Data streams
Management of Files over the Network
• Differential backup
• Synchronization
Theory
• Low distortion embeddings
• Simultaneous messages communication complexity

Known Sketching Schemes
• Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
• Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
• Cosine similarity [Charikar 02]
• Earth mover distance [Charikar 02]
In this talk: Edit Distance

Edit Distance
x ∈ Σ^n, y ∈ Σ^m
ED(x,y): Minimum number of character insertions, deletions and substitutions that transform x to y.
Examples:
• ED(00000, 1111) = 5
• ED(01010, 10101) = 2
Applications
• Genomics
• Text processing
• Web search
For simplicity: m = n, Σ = {0,1}.
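The definition can be made concrete with the textbook dynamic program. This is a minimal sketch (the function name is an assumption, not from the talk) that keeps only one row of the table at a time and reproduces the two examples on the slide.

```python
def edit_distance(x, y):
    """Quadratic-time dynamic program for ED(x, y), one table row at a time."""
    prev = list(range(len(y) + 1))                  # ED("", y[:j]) = j
    for i, cx in enumerate(x, start=1):
        cur = [i]                                   # ED(x[:i], "") = i
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,             # delete cx
                           cur[j - 1] + 1,          # insert cy
                           prev[j - 1] + (cx != cy)))  # keep or substitute
        prev = cur
    return prev[-1]
```

Calling `edit_distance("00000", "1111")` and `edit_distance("01010", "10101")` gives the two values listed above.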

Computing Edit Distance
Exact Computation
• Dynamic programming (1970): O(n^2)
• Masek and Paterson (1980): O(n^2/log n)
Impractical for comparing two very long strings.
• Natural question 1: can we do it in linear time?
Impractical for handling massive document repositories.
• Natural question 2: are there constant size sketches of edit distance?
Focus of this talk: Can we solve the above problems if we settle for approximation?

Sketching Schemes for Edit Distance
Negative Indications
• No known embeddings of edit distance into a normed space.
• Every embedding of edit distance into L1 incurs distortion ≥ 3/2 [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03]
• Weak nearest neighbor schemes [Indyk 04]

Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]
Ham(x,y) = # of positions in which x,y differ
Gap: k vs. 2k
Sketch size: O(1)
Shared randomness: r_1,…,r_n ∈ {0,1} are independent and Pr[r_i = 1] = 1/(2k) (the bias consistent with the analysis below)
Sketch: h(x) = (Σ_i x_i r_i) mod 2, h(y) = (Σ_i y_i r_i) mod 2
σ(x) = (h_1(x),…,h_t(x)), σ(y) = (h_1(y),…,h_t(y)), t = O(1)
Analysis: Pr[h(x) ≠ h(y)] = Pr[h(x) + h(y) ≡ 1 (mod 2)] = Pr[Σ_{i: x_i ≠ y_i} r_i ≡ 1 (mod 2)] = ½(1 − (1 − 1/k)^Ham(x,y))
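The analysis formula can be checked empirically. The sketch below is illustrative only: the function names are hypothetical, and the bias Pr[r_i = 1] = 1/(2k) is an assumption inferred from the stated formula ½(1 − (1 − 1/k)^Ham(x,y)), since the parity of d independent bits with bias p is odd with probability ½(1 − (1 − 2p)^d).

```python
import random

def parity_hash(x, r):
    """h(x) = (sum_i x_i * r_i) mod 2 for equal-length bit lists x and r."""
    return sum(xi & ri for xi, ri in zip(x, r)) % 2

def mismatch_rate(x, y, k, trials=20000, seed=0):
    """Empirical Pr[h(x) != h(y)] over shared r with Pr[r_i = 1] = 1/(2k) (assumed bias)."""
    rng = random.Random(seed)
    p = 1.0 / (2 * k)
    differ = 0
    for _ in range(trials):
        r = [1 if rng.random() < p else 0 for _ in x]
        differ += parity_hash(x, r) != parity_hash(y, r)
    return differ / trials

def predicted_rate(d, k):
    """The closed form from the analysis: 1/2 * (1 - (1 - 1/k)^d)."""
    return 0.5 * (1 - (1 - 1 / k) ** d)
```

With k = 4, the predicted mismatch probability is about 0.34 at distance k and about 0.45 at distance 2k, so a constant number t of independent hashes suffices to separate the two cases with constant probability.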

Edit Distance Sketches: Basic Framework
Underlying Principle: ED(x,y) is small iff x and y share many common substrings at nearby positions.
S_x = set of pairs of the form (α, h(i))
• α: a substring of x
• h(i): a “locality sensitive” encoding of the substring’s position
ED(x,y) small iff intersection S_x ∩ S_y large

Basic Framework (cont.)
ED(x,y) small iff symmetric difference S_x Δ S_y small
• Need to estimate size of symmetric difference
• Hamming distance computation of characteristic vectors
• Use constant size sketches [KOR]
Reduced Edit Distance to Hamming Distance

General Case: Encoding Scheme
Gap: k vs. O((kn)^(2/3))
B = n^(2/3)/k^(1/3), W = n/B
B windows of size W each.
S_x = { …, (α_i, win(i)), … }
S_y = { …, (β_i, win(i)), … }
(Figure: x and y over positions 1 to 14, split into windows 1, 2, 3; example pairs (α_2,1), (α_3,2), (α_1,1) for x and (β_1,1), (β_2,1), (β_3,2) for y.)
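A toy version of this encoding may help. It is a sketch under stated assumptions, not the authors' exact scheme: the substring at position i is taken to be the length-B substring starting at i, win(i) is taken to be the index of the size-W window containing i, and the function names are hypothetical.

```python
def encode(x, B, W):
    """S_x: pairs (length-B substring starting at i, index of the window containing i).
    Both the substring length B and win(i) = i // W are assumptions of this sketch."""
    return {(x[i:i + B], i // W) for i in range(len(x) - B + 1)}

def set_hamming(sx, sy):
    """Hamming distance of the characteristic vectors = size of the symmetric difference."""
    return len(sx ^ sy)
```

For instance, with B = 2 and W = 3, `encode("abcdef", 2, 3)` and `encode("xabcde", 2, 3)` differ in four pairs: two substrings are lost or created by the edits, and one surviving substring ("cd") crosses a window boundary.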

Analysis
Case 1: ED(x,y) ≤ k
• If α_i is “unmarked”, it has a matching “companion” β_j
• (α_i, win(i)) ∈ S_x \ S_y only if:
  • either α_i is “marked”
  • or α_i is unmarked, but win(i) ≠ win(j)
• At most kB marked substrings
• At most k · n/W = kB companions with mismatched windows
• Therefore, Ham(S_x,S_y) ≤ 4kB

Analysis (cont.)
Case 2: Ham(S_x,S_y) ≤ 8kB
• If α_i has a “companion” β_j and win(i) = win(j), can align α_i with β_j using at most W operations
• Otherwise, substitute first character of α_i
• At most 8kB substrings of x have no companion
• Therefore, ED(x,y) ≤ 8kB + W · n/B = O((kn)^(2/3))
(Figure: x with substrings starting at positions 1, B+1, 2B+1, … and y with substrings β_2, …, β_{B−1}.)
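The final bound follows by substituting the parameter choices B = n^(2/3)/k^(1/3) and W = n/B from the encoding scheme. Spelled out:

```latex
\begin{align*}
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = (kn)^{2/3}, \\
W \cdot \frac{n}{B} &= \frac{n^2}{B^2} = \frac{n^2\, k^{2/3}}{n^{4/3}} = (kn)^{2/3}, \\
8kB + W \cdot \frac{n}{B} &= 9\,(kn)^{2/3} = O\big((kn)^{2/3}\big).
\end{align*}
```

This choice of B is the one that balances the two terms kB and n^2/B^2.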

Non-repetitive Case: Encoding Scheme
t ≥ 1: “non-repetitiveness” parameter, W = O(k · t)
No substring of length t repeats within a window of size W
Gap: k vs. O(k W)
Alice and Bob choose a sequence of “anchors” in a coordinated way
• π_1: a random permutation on {0,1}^t
• α_1: minimal length-t substring of x_1 (under π_1)
• β_1: minimal length-t substring of y_1 (under π_1)
(Figure: x and y with windows x_1, x_2 and y_1, y_2 of size W and the anchors chosen in them.)
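The coordinated choice of a single anchor can be illustrated as follows. This is a minimal sketch with hypothetical names: the shared random permutation is simulated by shuffling all t-bit words with a seed both parties know, which is only practical for small t, and the handling of subsequent windows is omitted.

```python
import random
from itertools import product

def shared_permutation(t, seed):
    """A random ranking of {0,1}^t derived from a seed both parties share."""
    words = ["".join(bits) for bits in product("01", repeat=t)]
    random.Random(seed).shuffle(words)
    return {w: rank for rank, w in enumerate(words)}

def anchor(window, t, rank):
    """Start index of the length-t substring of `window` that is minimal under `rank`."""
    return min(range(len(window) - t + 1), key=lambda i: rank[window[i:i + t]])

# Two windows holding the same length-3 substrings at shifted positions.
rank = shared_permutation(3, seed=1)
x1 = "0001011100"
y1 = "1000101110"
ax, ay = anchor(x1, 3, rank), anchor(y1, 3, rank)
```

Because both windows contain the same set of length-3 substrings and none repeats, Alice and Bob pick the same substring as their anchor without communicating, even though it sits at different offsets in x1 and y1.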

Encoding scheme (cont.)
S_x = { (x_1,1),…,(x_8,8) }
S_y = { (y_1,1),…,(y_8,8) }
(Figure: x and y each cut at their anchors into eight numbered substrings.)

Analysis 2 3 7 1 4 5 6 1 3 4 7 2 5 6 8 x 3 7 7 1 6 1 2 4 5 5 4 8 6 2 3 y • Case 1: ED(x,y) · k. • All anchors are “unmarked” with probability 1 - kt/W = (1) • If i,i are unmarked, they are aligned • # of mismatching substrings · 2k • Ham(Sx,Sy) · 2k

Analysis (cont.) 2 3 7 1 4 5 6 1 3 4 7 2 5 6 8 x 3 7 7 1 6 1 2 4 5 5 4 8 6 2 3 y • Case 2: Ham(Sx,Sy) · 4k • # of mismatching substrings · 4k • ED(x,y) · 2 ¢ W ¢ 4k = O(k W).

Approximation in Linear Time
(Table: results for Arbitrary Strings and for Non-repetitive Strings; entries not preserved in the transcript.)

Summary and Open Problems
• Designed efficient approximation schemes for edit distance.
• Best sketching and linear-time approximations to date
Subsequent work:
• O(n^(2/3)) distortion embedding of edit distance into L1 [Indyk 04] [Rabani 04]
• Better embeddings of edit distance into L1 [Ostrovsky, Rabani 05]
• Embeddings of the Ulam metric into L1 [Charikar, Krauthgamer 05]
Open Problems
• Sketch size lower bounds
• Constant factor approximations in linear time
• Better embeddings of edit distance
• Sketching schemes for other distance measures

Thank You
