Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 14: Sketching July 2, 2006 http://www.ee.technion.ac.il/courses/049011
Outline • Syntactic clustering of the web • Locality sensitive hash functions • Resemblance and shingling • Min-wise independent permutations • The sketching model • Hamming distance • Edit distance
Motivation: Near-Duplicate Elimination • Many web pages are duplicates or near-duplicates of other pages • Mirror sites • FAQs, manuals, legal documents • Different versions of the same document • Plagiarism • Duplicates are bad for search engines • Increase index size • Harm quality of search results • Question: How to efficiently process the repository of crawled pages and eliminate (near-)duplicates?
Syntactic Clustering of the Web [Broder, Glassman, Manasse, Zweig 97] • U: space of all possible documents • S ⊆ U: collection of documents • sim: U × U → [0,1]: a similarity measure on documents • If p,q are very similar, sim(p,q) is close to 1 • If p,q are very dissimilar, sim(p,q) is close to 0 • Usually sim(p,q) = 1 – d(p,q), where d(p,q) is a normalized distance between p and q • G: a graph on S: p,q are connected by an edge iff sim(p,q) ≥ t (t = threshold) • Goal: find the connected components of G
Challenges • S is huge • Web has 10 billion pages • Documents are not compressed • Needs many disks to store S • Each sim computation is costly • Documents in S should be processed in a stream • Main memory is tiny relative to |S| • Cannot afford more than O(|S|) time • How to create the graph G? • Naively, requires |S| passes and |S|² similarity computations
Sketching Schemes • T = a small set (|S| < |T| << |U|) • A sketching scheme for sim: • Compression function: a randomized mapping ρ: U → T • Reconstruction function: σ: T × T → [0,1] • For every pair p,q, with high probability σ(ρ(p),ρ(q)) ≈ sim(p,q)
Syntactic Clustering by Sketching • P ← empty table of size |S| • G ← empty graph on |S| nodes • for i = 1,…,|S|: • read document pi from the stream • P[i] ← ρ(pi) • for i = 1,…,|S|: • for j = 1,…,|S|: • if σ(P[i],P[j]) ≥ t: • add edge (i,j) to G • output connected components of G
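As a concrete illustration, here is a minimal Python sketch of the clustering loop above. The names syntactic_clustering, rho, and sigma are hypothetical: rho stands for the compression function and sigma for the reconstruction function of whatever sketching scheme is plugged in; the connected-components step is a plain iterative DFS.

```python
from itertools import combinations


def syntactic_clustering(stream, rho, sigma, t):
    """Cluster documents whose estimated similarity is at least t."""
    sketches = [rho(doc) for doc in stream]      # one pass over the stream

    # Build the similarity graph G on |S| nodes (quadratic in |S|).
    n = len(sketches)
    adj = {i: [] for i in range(n)}
    for i, j in combinations(range(n), 2):
        if sigma(sketches[i], sketches[j]) >= t:
            adj[i].append(j)
            adj[j].append(i)

    # Output the connected components of G (iterative DFS).
    seen, components = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(comp)
    return components
```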
Analysis • Can compute sketches in one pass • Table P can be stored in a single file on a single machine • Creating G requires |S|² applications of σ • Easier than full-fledged computations of sim • Quadratic time is still a problem • Connected components algorithm is heavy but feasible
Locality Sensitive Hashing (LSH) [Indyk, Motwani, 98] • A special kind of sketching scheme • H = { h | h: U → T }: a family of hash functions • H is locality sensitive w.r.t. sim if for all p,q ∈ U, Pr[h(p) = h(q)] = sim(p,q) • Probability is over the random choice of h from H • Probability of collision = similarity between p and q
Syntactic Clustering by LSH • P ← empty table of size |S| • for i = 1,…,|S|: • read document pi from the stream • P[i] ← h(pi) • sort P and group by value • output groups
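A minimal Python rendering of the same idea, assuming h is a locality-sensitive hash whose values are hashable and comparable; lsh_clustering is a hypothetical helper name.

```python
from collections import defaultdict


def lsh_clustering(stream, h):
    """Group documents by the value of a locality-sensitive hash h."""
    table = [(h(doc), i) for i, doc in enumerate(stream)]  # one pass, one hash per doc
    table.sort()                                           # O(|S| log |S|) comparisons

    groups = defaultdict(list)                             # group by hash value
    for value, i in table:
        groups[value].append(i)
    return list(groups.values())
```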
Analysis • Can compute hash values in one pass • Table P can be stored in a single file on a single machine • Sorting and grouping takes O(|S| log |S|) simple comparisons • Each group A consists of pages whose hash value is the same • By LSH property, they are likely to be similar to each other
Shingling and Resemblance [Broder et al 97] • tokens: words, numbers, HTML tags, etc. • tokenization(p): string of tokens produced from document p • w: a small integer • Sw(p) = w-shingling of p = set of all distinct substrings of tokenization(p) of length w • Ex: p = “a rose is a rose is a rose”, w = 4 • Sw(p) = { (a rose is a), (rose is a rose), (is a rose is) } • resemblancew(p,q) = |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)|
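A small Python sketch of shingling and resemblance, assuming a toy whitespace tokenizer (real tokenization would also handle numbers, HTML tags, etc.); shingles and resemblance are hypothetical helper names.

```python
def shingles(p, w):
    """Set of all distinct length-w token sequences (w-shingles) of p."""
    tokens = p.split()  # toy tokenization: whitespace-separated words only
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(p, q, w):
    """resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)


# The slide's example yields exactly the three shingles listed above.
print(shingles("a rose is a rose is a rose", 4))
```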
LSH for Resemblance • resemblancew(p,q) = |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)| • π = a random permutation on Σ^w (the space of all w-shingles) • π induces a random order on Σ^w • π also induces a random order on any subset X ⊆ Σ^w • For each such subset and for each x ∈ X, Pr(min(π(X)) = x) = 1/|X| • LSH for resemblance: h(p) = min(π(Sw(p)))
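A minimal min-hash sketch in Python, reusing the shingles helper from the block above. Drawing a truly random permutation of the whole shingle space Σ^w is impractical, so this toy version only orders the shingles that actually occur in the (hypothetical) collection, which is enough to illustrate the collision property.

```python
import random


def minhash(shingle_set, pi):
    """h(p) = the minimum shingle of S_w(p) under the random order pi."""
    return min(shingle_set, key=lambda s: pi[s])


docs = ["a rose is a rose is a rose", "a rose is a flower is a rose"]
universe = set().union(*(shingles(d, 4) for d in docs))

order = list(universe)
random.shuffle(order)                      # a random order on the observed shingles
pi = {s: rank for rank, s in enumerate(order)}

sketch = [minhash(shingles(d, 4), pi) for d in docs]
# Over the choice of pi, Pr[sketch[0] == sketch[1]] = resemblance_4(docs[0], docs[1]).
```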
LSH for Resemblance (cont.) • Lemma: Pr[min(π(Sw(p))) = min(π(Sw(q)))] = resemblancew(p,q) • Proof: the minimum of π over Sw(p) ∪ Sw(q) is equally likely to be any element of the union; the two minima coincide exactly when this element lies in Sw(p) ∩ Sw(q), which happens with probability |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)|
Min-Wise Independent Permutations [Broder, Charikar, Frieze, Mitzenmacher, 98] • Usual problem: storing π takes too much space • O(|Σ|^w · log |Σ|^w) bits to represent a truly random permutation • Use small families of permutations • A family F = { π | π is a permutation on Σ^w } is min-wise independent if, for all subsets X ⊆ Σ^w and for all x ∈ X, Pr(min(π(X)) = x) = 1/|X| over the choice of π from F • Explicit constructions of small families of “approximately” min-wise independent permutations [Indyk 98]
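In practice, the random permutation is often replaced by several seeded hash functions, keeping the minimum hash value per function; this is only a common stand-in, not the construction of [Indyk 98]. A hypothetical sketch:

```python
import hashlib


def minhash_signature(shingle_set, num_hashes=20, seed=0):
    """One min-hash coordinate per seeded hash function."""
    signature = []
    for j in range(num_hashes):
        def h(s):
            return hashlib.sha1(f"{seed}:{j}:{s}".encode()).hexdigest()
        signature.append(min(h(s) for s in shingle_set))
    return signature


def estimated_resemblance(sig_p, sig_q):
    """Fraction of coordinates on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_p, sig_q)) / len(sig_p)
```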
The Sketching Model • Alice holds x, Bob holds y, and both have access to shared randomness • Each sends a short sketch, ρ(x) and ρ(y), to a referee, who sees only the sketches • k vs. r gap problem: promise that either d(x,y) ≤ k or d(x,y) ≥ r • Goal: the referee decides which of the two holds
Applications • Large data sets: clustering, nearest-neighbor schemes, data streams • Management of files over the network: differential backup, synchronization • Theory: low-distortion embeddings, simultaneous messages communication complexity
Known Sketching Schemes • Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98] • Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98] [Feigenbaum,Ishai,Malkin,Nissim,Strauss,Wright 01] • Cosine similarity [Charikar 02] • Earth mover distance [Charikar 02] • Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]
Sketching Algorithm for Hamming Distance [Kushilevitz, Ostrovsky, Rabani 98] • x,y: binary strings of length n • HD(x,y) = # of positions in which x,y differ • HD(x,y) = | { i | xi ≠ yi } | • Ex: x = 10101, y = 01010, HD(x,y) = 5 • Goal: • If HD(x,y) ≤ k, output “accept” w.p. 1 - δ • If HD(x,y) ≥ 2k, output “reject” w.p. 1 - δ • KOR algorithm: O(log(1/δ)) size sketch.
The KOR Algorithm • Shared randomness: n i.i.d. random bits r1,…,rn, where Pr[ri = 1] = 1/(2k) • Basic sketch: h(x) = (Σi xi ri) mod 2 • Full sketch: ρ(x) = (h1(x),…,ht(x)) • t = O(log(1/δ)) • h1,…,ht are generated independently like h • Reconstruction: • for j = 1,…,t: • if hj(x) = hj(y) then zj ← 1 else zj ← 0 • if avg(z1,…,zt) > 11/18 output “accept”, else output “reject”
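A minimal Python version of the sketch under the parameter choice above (each shared bit is 1 with probability 1/(2k)); kor_sketch and kor_decide are hypothetical names, and Alice and Bob are assumed to use the same seed, modelling the shared randomness.

```python
import random


def kor_sketch(x, k, t, seed=0):
    """t sketch bits h_1(x),...,h_t(x); x is a 0/1 list, t = O(log(1/delta))."""
    rng = random.Random(seed)                 # shared randomness via a common seed
    bits = []
    for _ in range(t):
        r = [1 if rng.random() < 1 / (2 * k) else 0 for _ in x]
        bits.append(sum(xi * ri for xi, ri in zip(x, r)) % 2)
    return bits


def kor_decide(sketch_x, sketch_y):
    """Accept (claim HD(x,y) <= k) iff the sketches agree on > 11/18 of the bits."""
    agree = sum(a == b for a, b in zip(sketch_x, sketch_y)) / len(sketch_x)
    return "accept" if agree > 11 / 18 else "reject"
```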
KOR: Analysis • h(x) ⊕ h(y) = (Σi (xi ⊕ yi) ri) mod 2, so h(x) = h(y) iff the parity of the bits { ri : xi ≠ yi } is 0 • Note: # of terms in the sum = HD(x,y) • Given HD(x,y) independent random bits, each equal to 1 with probability 1/(2k), what is the probability that their parity is 0?
KOR: Analysis (cont.) • r1,…,rm: m independent random bits • For each j, Pr(rj = 1) = ε • What is Pr[⊕j rj = 0]? • Can view the distribution of each bit as a mixture of two distributions: • Dist A (with probability 1 - 2ε): the bit is 0 w.p. 1 • Dist B (with probability 2ε): a uniformly random bit • Note: • If all bits “choose” Dist A, then the parity is 0 w.p. 1 • If at least one of the m bits “chooses” Dist B, then the parity is 0 w.p. ½ • Hence, Pr[⊕j rj = 0] = (1 - 2ε)^m + (1 - (1 - 2ε)^m)·(1/2) = 1/2 + (1/2)(1 - 2ε)^m
KOR Analysis (cont.) • With ε = 1/(2k): Pr[h(x) = h(y)] = 1/2 + (1/2)(1 - 1/k)^HD(x,y) • Therefore, up to lower-order terms: • If HD(x,y) ≤ k, then Pr[h(x) = h(y)] ≥ 1/2 + 1/(2e) ≥ 4/6 = 12/18 • If HD(x,y) ≥ 2k, then Pr[h(x) = h(y)] ≤ 1/2 + 1/(2e²) ≈ 10/18 • Define Z = avg(z1,…,zt): • If HD(x,y) ≤ k, then E[Z] ≥ 12/18 • If HD(x,y) ≥ 2k, then E[Z] ≤ 10/18 • By Chernoff, t = O(log(1/δ)) is enough to guarantee: • If HD(x,y) ≤ k, then Z > 11/18 w.p. 1 - δ • If HD(x,y) ≥ 2k, then Z ≤ 11/18 w.p. 1 - δ
Edit Distance • x ∈ Σ^n, y ∈ Σ^m • ED(x,y): minimum number of character insertions, deletions, and substitutions that transform x into y • Examples: ED(00000, 1111) = 5, ED(01010, 10101) = 2 • Applications • Genomics • Text processing • Web search • For simplicity: m = n, Σ = {0,1}.
Sketching Algorithm for Edit Distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04] • x,y: binary strings of length n • Goal: • If ED(x,y) ≤ k, output “accept” w.p. 1 - δ • If ED(x,y) ≥ Ω((kn)^(2/3)), output “reject” w.p. ≥ 1 - δ • BJKK algorithm: O(log(1/δ)) size sketch.
Basic Framework • Underlying principle: ED(x,y) is small iff x and y share many common substrings at nearby positions • Sx = set of pairs of the form (α, h(i)) • α: a substring of x, starting at position i • h(i): a “locality sensitive” encoding of the substring’s position • ED(x,y) small iff the intersection Sx ∩ Sy is large
Basic Framework (cont.) • Equivalently, ED(x,y) small iff the symmetric difference Sx Δ Sy is small • Need to estimate the size of the symmetric difference • |Sx Δ Sy| = Hamming distance between the characteristic vectors of Sx and Sy • Use O(log(1/δ)) size sketches [KOR] • Reduced edit distance to Hamming distance
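The characteristic vectors live over a huge implicit universe, but a KOR-style sketch bit can still be computed directly from a set: give every possible element a pseudorandom biased bit and take the parity over the set. Elements common to Sx and Sy cancel mod 2, so the sketches behave like KOR applied to the symmetric difference. A hypothetical sketch, with the biased bit derived from a seeded hash:

```python
import hashlib


def set_parity_sketch(S, k, t, seed=0):
    """KOR-style sketch of the characteristic vector of the set S, computed
    without materializing the vector: each element e gets a pseudorandom bit
    that is 1 with probability 1/(2k), and the j-th sketch bit is the parity
    of the selected elements of S."""
    def biased_bit(e, j):
        digest = hashlib.sha1(f"{seed}:{j}:{e}".encode()).digest()
        u = int.from_bytes(digest[:8], "big") / 2 ** 64   # pseudo-uniform in [0,1)
        return 1 if u < 1 / (2 * k) else 0

    return [sum(biased_bit(e, j) for e in S) % 2 for j in range(t)]
```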
Encoding Scheme • Gap: k vs. O((kn)^(2/3)) • B = n^(2/3)/k^(1/3), W = n/B • x and y are divided into B windows of size W each; win(i) = index of the window containing position i • Sx = { (αi, win(i)) : i = 1,…,n }, where αi is the substring of x of length B starting at position i • Sy is defined in the same way from the substrings bi of y
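A toy Python version of the encoding, following the slide's parameters (substring length B, window width W = n/B); bjkk_encode is a hypothetical name and rounding issues are ignored.

```python
def bjkk_encode(x, k):
    """S_x: all pairs (length-B substring of x, window index of its start position)."""
    n = len(x)
    B = max(1, round(n ** (2 / 3) / k ** (1 / 3)))   # substring length
    W = max(1, n // B)                               # window width, about n/B
    return {(x[i:i + B], i // W) for i in range(n - B + 1)}


# Feeding the two sets to set_parity_sketch (previous block), with the KOR
# threshold set to roughly 4kB (the gap in the analysis below), completes the
# reduction from edit distance to Hamming distance.
```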
Analysis • Case 1: ED(x,y) ≤ k • If position i is “unmarked” (none of the ≤ k edit operations touches the substring starting at i), it has a matching “companion” position j in y • (αi, win(i)) ∈ Sx \ Sy only if: • either i is “marked”, • or i is unmarked, but win(i) ≠ win(j) • At most kB marked substrings • At most k · n/W = kB companions with mismatched windows • Therefore, Ham(Sx, Sy) ≤ 4kB
Analysis (cont.) • Case 2: Ham(Sx, Sy) ≤ 8kB • Consider the substrings of x starting at positions 1, B+1, 2B+1, … • If such a position i has a “companion” j with win(i) = win(j), we can align i with j using at most W operations • Otherwise, substitute the first character of i • At most 8kB of these substrings have no companion • Therefore, ED(x,y) ≤ 8kB + W · (n/B) = O((kn)^(2/3))