Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 14: Sketching July 2, 2006 http://www.ee.technion.ac.il/courses/049011
Outline • Syntactic clustering of the web • Locality sensitive hash functions • Resemblance and shingling • Min-wise independent permutations • The sketching model • Hamming distance • Edit distance
Motivation: Near-Duplicate Elimination • Many web pages are duplicates or near-duplicates of other pages • Mirror sites • FAQs, manuals, legal documents • Different versions of the same document • Plagiarism • Duplicates are bad for search engines • Increase index size • Harm quality of search results • Question: How to efficiently process the repository of crawled pages and eliminate (near-)duplicates?
Syntactic Clustering of the Web [Broder, Glassman, Manasse, Zweig 97] • U: space of all possible documents • S ⊆ U: collection of documents • sim: U × U → [0,1]: a similarity measure on documents • If p,q are very similar, sim(p,q) is close to 1 • If p,q are very dissimilar, sim(p,q) is close to 0 • Usually sim(p,q) = 1 – d(p,q), where d(p,q) is a normalized distance between p and q • G: a graph on S: p,q are connected by an edge iff sim(p,q) ≥ t (t = threshold) • Goal: find the connected components of G
Challenges • S is huge • Web has 10 billion pages • Documents are not compressed • Needs many disks to store S • Each sim computation is costly • Documents in S should be processed in a stream • Main memory is tiny relative to |S| • Cannot afford more than O(|S|) time • How to create the graph G? • Naively, requires |S| passes and |S|² similarity computations
Sketching Schemes • T = a small set (|S| < |T| << |U|) • A sketching scheme for sim: • Compression function: a randomized mapping ρ: U → T • Reconstruction function: σ: T × T → [0,1] • For every pair p,q, with high probability σ(ρ(p),ρ(q)) ≈ sim(p,q)
Syntactic Clustering by Sketching • P ← empty table of size |S| • G ← empty graph on |S| nodes • for i = 1,…,|S|: • read document pi from the stream • P[i] ← ρ(pi) • for i = 1,…,|S|: • for j = 1,…,|S|: • if σ(P[i],P[j]) ≥ t: • add edge (i,j) to G • output connected components of G
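As a concrete illustration, here is a minimal Python sketch of the clustering loop above. The names syntactic_clustering, rho, and sigma are hypothetical: rho stands for the compression function and sigma for the reconstruction function of whatever sketching scheme is plugged in; the connected-components step is a plain iterative DFS.

```python
from itertools import combinations


def syntactic_clustering(stream, rho, sigma, t):
    """Cluster documents whose estimated similarity is at least t."""
    sketches = [rho(doc) for doc in stream]      # one pass over the stream

    # Build the similarity graph G on |S| nodes (quadratic in |S|).
    n = len(sketches)
    adj = {i: [] for i in range(n)}
    for i, j in combinations(range(n), 2):
        if sigma(sketches[i], sketches[j]) >= t:
            adj[i].append(j)
            adj[j].append(i)

    # Output the connected components of G (iterative DFS).
    seen, components = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(comp)
    return components
```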
Analysis • Can compute sketches in one pass • Table P can be stored in a single file on a single machine • Creating G requires |S|² applications of σ • Easier than full-fledged computations of sim • Quadratic time is still a problem • Connected components algorithm is heavy but feasible
Locality Sensitive Hashing (LSH) [Indyk, Motwani, 98] • A special kind of sketching scheme • H = { h | h: U → T }: a family of hash functions • H is locality sensitive w.r.t. sim if for all p,q ∈ U, Pr[h(p) = h(q)] = sim(p,q) • Probability is over the random choice of h from H • Probability of collision = similarity between p and q
Syntactic Clustering by LSH • P ← empty table of size |S| • for i = 1,…,|S|: • read document pi from the stream • P[i] ← h(pi) • sort P and group by value • output groups
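A minimal Python rendering of the same idea, assuming h is a locality-sensitive hash whose values are hashable and comparable; lsh_clustering is a hypothetical helper name.

```python
from collections import defaultdict


def lsh_clustering(stream, h):
    """Group documents by the value of a locality-sensitive hash h."""
    table = [(h(doc), i) for i, doc in enumerate(stream)]  # one pass, one hash per doc
    table.sort()                                           # O(|S| log |S|) comparisons

    groups = defaultdict(list)                             # group by hash value
    for value, i in table:
        groups[value].append(i)
    return list(groups.values())
```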
Analysis • Can compute hash values in one pass • Table P can be stored in a single file on a single machine • Sorting and grouping takes O(|S| log |S|) simple comparisons • Each group A consists of pages whose hash value is the same • By LSH property, they are likely to be similar to each other
Shingling and Resemblance [Broder et al 97] • tokens: words, numbers, HTML tags, etc. • tokenization(p): string of tokens produced from document p • w: a small integer • Sw(p) = w-shingling of p = set of all distinct substrings of tokenization(p) of length w • Ex: p = “a rose is a rose is a rose”, w = 4 • Sw(p) = { (a rose is a), (rose is a rose), (is a rose is) } • resemblancew(p,q) = |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)|
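A small Python sketch of shingling and resemblance, assuming a toy whitespace tokenizer (real tokenization would also handle numbers, HTML tags, etc.); shingles and resemblance are hypothetical helper names.

```python
def shingles(p, w):
    """Set of all distinct length-w token sequences (w-shingles) of p."""
    tokens = p.split()  # toy tokenization: whitespace-separated words only
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(p, q, w):
    """resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)


# The slide's example yields exactly the three shingles listed above.
print(shingles("a rose is a rose is a rose", 4))
```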
LSH for Resemblance • resemblancew(p,q) = |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)| • π = a random permutation on Σ^w (the space of all w-shingles) • π induces a random order on Σ^w • π also induces a random order on any subset X ⊆ Σ^w • For each such subset and for each x ∈ X, Pr(min(π(X)) = x) = 1/|X| • LSH for resemblance: h(p) = min(π(Sw(p)))
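A minimal min-hash sketch in Python, reusing the shingles helper from the block above. Drawing a truly random permutation of the whole shingle space Σ^w is impractical, so this toy version only orders the shingles that actually occur in the (hypothetical) collection, which is enough to illustrate the collision property.

```python
import random


def minhash(shingle_set, pi):
    """h(p) = the minimum shingle of S_w(p) under the random order pi."""
    return min(shingle_set, key=lambda s: pi[s])


docs = ["a rose is a rose is a rose", "a rose is a flower is a rose"]
universe = set().union(*(shingles(d, 4) for d in docs))

order = list(universe)
random.shuffle(order)                      # a random order on the observed shingles
pi = {s: rank for rank, s in enumerate(order)}

sketch = [minhash(shingles(d, 4), pi) for d in docs]
# Over the choice of pi, Pr[sketch[0] == sketch[1]] = resemblance_4(docs[0], docs[1]).
```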
LSH for Resemblance (cont.) • Lemma: Pr[min(π(Sw(p))) = min(π(Sw(q)))] = resemblancew(p,q) • Proof: the minimum of π over Sw(p) ∪ Sw(q) is equally likely to be any element of the union; the two minima coincide exactly when this element lies in Sw(p) ∩ Sw(q), which happens with probability |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)|
Min-Wise Independent Permutations [Broder, Charikar, Frieze, Mitzenmacher, 98] • Usual problem: storing π takes too much space • O(|Σ|^w · log |Σ|^w) bits to represent a truly random permutation • Use small families of permutations • A family F = { π | π is a permutation on Σ^w } is min-wise independent if, for all subsets X ⊆ Σ^w and for all x ∈ X, Pr(min(π(X)) = x) = 1/|X| over the choice of π from F • Explicit constructions of small families of “approximately” min-wise independent permutations [Indyk 98]
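In practice, the random permutation is often replaced by several seeded hash functions, keeping the minimum hash value per function; this is only a common stand-in, not the construction of [Indyk 98]. A hypothetical sketch:

```python
import hashlib


def minhash_signature(shingle_set, num_hashes=20, seed=0):
    """One min-hash coordinate per seeded hash function."""
    signature = []
    for j in range(num_hashes):
        def h(s):
            return hashlib.sha1(f"{seed}:{j}:{s}".encode()).hexdigest()
        signature.append(min(h(s) for s in shingle_set))
    return signature


def estimated_resemblance(sig_p, sig_q):
    """Fraction of coordinates on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_p, sig_q)) / len(sig_p)
```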
The Sketching Model • Alice holds x, Bob holds y, and both have access to shared randomness • Each sends a short sketch, ρ(x) and ρ(y), to a referee, who sees only the sketches • k vs. r gap problem: promise that either d(x,y) ≤ k or d(x,y) ≥ r • Goal: the referee decides which of the two holds
Applications • Large data sets: clustering, nearest-neighbor schemes, data streams • Management of files over the network: differential backup, synchronization • Theory: low-distortion embeddings, simultaneous messages communication complexity
Known Sketching Schemes • Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98] • Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98] [Feigenbaum,Ishai,Malkin,Nissim,Strauss,Wright 01] • Cosine similarity [Charikar 02] • Earth mover distance [Charikar 02] • Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]
Sketching Algorithm for Hamming Distance [Kushilevitz, Ostrovsky, Rabani 98] • x,y: binary strings of length n • HD(x,y) = # of positions in which x,y differ • HD(x,y) = | { i | xi ≠ yi } | • Ex: x = 10101, y = 01010, HD(x,y) = 5 • Goal: • If HD(x,y) ≤ k, output “accept” w.p. 1 - δ • If HD(x,y) ≥ 2k, output “reject” w.p. 1 - δ • KOR algorithm: O(log(1/δ)) size sketch.
The KOR Algorithm • Shared randomness: n i.i.d. random bits r1,…,rn, where Pr[ri = 1] = 1/(2k) • Basic sketch: h(x) = (Σi xi ri) mod 2 • Full sketch: ρ(x) = (h1(x),…,ht(x)) • t = O(log(1/δ)) • h1,…,ht are generated independently like h • Reconstruction: • for j = 1,…,t: • if hj(x) = hj(y) then zj ← 1 else zj ← 0 • if avg(z1,…,zt) > 11/18 output “accept”, else output “reject”
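A minimal Python version of the sketch under the parameter choice above (each shared bit is 1 with probability 1/(2k)); kor_sketch and kor_decide are hypothetical names, and Alice and Bob are assumed to use the same seed, modelling the shared randomness.

```python
import random


def kor_sketch(x, k, t, seed=0):
    """t sketch bits h_1(x),...,h_t(x); x is a 0/1 list, t = O(log(1/delta))."""
    rng = random.Random(seed)                 # shared randomness via a common seed
    bits = []
    for _ in range(t):
        r = [1 if rng.random() < 1 / (2 * k) else 0 for _ in x]
        bits.append(sum(xi * ri for xi, ri in zip(x, r)) % 2)
    return bits


def kor_decide(sketch_x, sketch_y):
    """Accept (claim HD(x,y) <= k) iff the sketches agree on > 11/18 of the bits."""
    agree = sum(a == b for a, b in zip(sketch_x, sketch_y)) / len(sketch_x)
    return "accept" if agree > 11 / 18 else "reject"
```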
KOR: Analysis • h(x) ⊕ h(y) = (Σi (xi ⊕ yi) ri) mod 2, so h(x) = h(y) iff the parity of the bits { ri : xi ≠ yi } is 0 • Note: # of terms in the sum = HD(x,y) • Given HD(x,y) independent random bits, each equal to 1 with probability 1/(2k), what is the probability that their parity is 0?
KOR: Analysis (cont.) • r1,…,rm: m independent random bits • For each j, Pr(rj = 1) = ε • What is Pr[⊕j rj = 0]? • Can view the distribution of each bit as a mixture of two distributions: • Dist A (with probability 1 - 2ε): the bit is 0 w.p. 1 • Dist B (with probability 2ε): a uniformly random bit • Note: • If all bits “choose” Dist A, then the parity is 0 w.p. 1 • If at least one of the m bits “chooses” Dist B, then the parity is 0 w.p. ½ • Hence, Pr[⊕j rj = 0] = (1 - 2ε)^m + (1 - (1 - 2ε)^m)·(1/2) = 1/2 + (1/2)(1 - 2ε)^m
KOR Analysis (cont.) • With ε = 1/(2k): Pr[h(x) = h(y)] = 1/2 + (1/2)(1 - 1/k)^HD(x,y) • Therefore, up to lower-order terms: • If HD(x,y) ≤ k, then Pr[h(x) = h(y)] ≥ 1/2 + 1/(2e) ≥ 4/6 = 12/18 • If HD(x,y) ≥ 2k, then Pr[h(x) = h(y)] ≤ 1/2 + 1/(2e²) ≈ 10/18 • Define Z = avg(z1,…,zt): • If HD(x,y) ≤ k, then E[Z] ≥ 12/18 • If HD(x,y) ≥ 2k, then E[Z] ≤ 10/18 • By Chernoff, t = O(log(1/δ)) is enough to guarantee: • If HD(x,y) ≤ k, then Z > 11/18 w.p. 1 - δ • If HD(x,y) ≥ 2k, then Z ≤ 11/18 w.p. 1 - δ
Edit Distance • x ∈ Σ^n, y ∈ Σ^m • ED(x,y): minimum number of character insertions, deletions, and substitutions that transform x into y • Examples: ED(00000, 1111) = 5, ED(01010, 10101) = 2 • Applications • Genomics • Text processing • Web search • For simplicity: m = n, Σ = {0,1}.
Sketching Algorithm for Edit Distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04] • x,y: binary strings of length n • Goal: • If ED(x,y) ≤ k, output “accept” w.p. 1 - δ • If ED(x,y) ≥ Ω((kn)^(2/3)), output “reject” w.p. ≥ 1 - δ • BJKK algorithm: O(log(1/δ)) size sketch.
Basic Framework • Underlying principle: ED(x,y) is small iff x and y share many common substrings at nearby positions • Sx = set of pairs of the form (α, h(i)) • α: a substring of x, starting at position i • h(i): a “locality sensitive” encoding of the substring’s position • ED(x,y) small iff the intersection Sx ∩ Sy is large
Basic Framework (cont.) • Equivalently, ED(x,y) small iff the symmetric difference Sx Δ Sy is small • Need to estimate the size of the symmetric difference • |Sx Δ Sy| = Hamming distance between the characteristic vectors of Sx and Sy • Use O(log(1/δ)) size sketches [KOR] • Reduced edit distance to Hamming distance
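The characteristic vectors live over a huge implicit universe, but a KOR-style sketch bit can still be computed directly from a set: give every possible element a pseudorandom biased bit and take the parity over the set. Elements common to Sx and Sy cancel mod 2, so the sketches behave like KOR applied to the symmetric difference. A hypothetical sketch, with the biased bit derived from a seeded hash:

```python
import hashlib


def set_parity_sketch(S, k, t, seed=0):
    """KOR-style sketch of the characteristic vector of the set S, computed
    without materializing the vector: each element e gets a pseudorandom bit
    that is 1 with probability 1/(2k), and the j-th sketch bit is the parity
    of the selected elements of S."""
    def biased_bit(e, j):
        digest = hashlib.sha1(f"{seed}:{j}:{e}".encode()).digest()
        u = int.from_bytes(digest[:8], "big") / 2 ** 64   # pseudo-uniform in [0,1)
        return 1 if u < 1 / (2 * k) else 0

    return [sum(biased_bit(e, j) for e in S) % 2 for j in range(t)]
```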
Encoding Scheme • Gap: k vs. O((kn)^(2/3)) • B = n^(2/3)/k^(1/3), W = n/B • x and y are divided into B windows of size W each; win(i) = index of the window containing position i • Sx = { (αi, win(i)) : i = 1,…,n }, where αi is the substring of x of length B starting at position i • Sy is defined in the same way from the substrings bi of y
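A toy Python version of the encoding, following the slide's parameters (substring length B, window width W = n/B); bjkk_encode is a hypothetical name and rounding issues are ignored.

```python
def bjkk_encode(x, k):
    """S_x: all pairs (length-B substring of x, window index of its start position)."""
    n = len(x)
    B = max(1, round(n ** (2 / 3) / k ** (1 / 3)))   # substring length
    W = max(1, n // B)                               # window width, about n/B
    return {(x[i:i + B], i // W) for i in range(n - B + 1)}


# Feeding the two sets to set_parity_sketch (previous block), with the KOR
# threshold set to roughly 4kB (the gap in the analysis below), completes the
# reduction from edit distance to Hamming distance.
```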
Analysis • Case 1: ED(x,y) ≤ k • If position i is “unmarked” (none of the ≤ k edit operations touches the substring starting at i), it has a matching “companion” position j in y • (αi, win(i)) ∈ Sx \ Sy only if: • either i is “marked”, • or i is unmarked, but win(i) ≠ win(j) • At most kB marked substrings • At most k · n/W = kB companions with mismatched windows • Therefore, Ham(Sx, Sy) ≤ 4kB
Analysis (cont.) • Case 2: Ham(Sx, Sy) ≤ 8kB • Consider the substrings of x starting at positions 1, B+1, 2B+1, … • If such a position i has a “companion” j with win(i) = win(j), we can align i with j using at most W operations • Otherwise, substitute the first character of i • At most 8kB of these substrings have no companion • Therefore, ED(x,y) ≤ 8kB + W · (n/B) = O((kn)^(2/3))