
Algorithms for Large Data Sets

This lecture discusses algorithms such as syntactic clustering, locality-sensitive hashing, and min-wise independent permutations for efficiently eliminating near-duplicates in large data sets.



Presentation Transcript

  1. Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 11 June 1, 2005 http://www.ee.technion.ac.il/courses/049011

  2. Sketching

  3. Outline • Syntactic clustering of the web • Locality sensitive hash functions • Resemblance and shingling • Min-wise independent permutations • The sketching model • Hamming distance • Edit distance

  4. Motivation: Near-Duplicate Elimination • Many web pages are duplicates or near-duplicates of other pages • Mirror sites • FAQs, manuals, legal documents • Different versions of the same document • Plagiarism • Duplicates are bad for search engines • Increase index size • Harm quality of search results • Question: How to efficiently process the repository of crawled pages and eliminate (near)-duplicates?

  5. Syntactic Clustering of the Web [Broder, Glassman, Manasse, Zweig 97] • U: space of all possible documents • S ⊆ U: collection of documents • sim: U × U → [0,1]: a similarity measure among documents • If p,q are very similar, sim(p,q) is close to 1 • If p,q are very dissimilar, sim(p,q) is close to 0 • Usually: sim(p,q) = 1 - d(p,q), where d(p,q) is a normalized distance between p and q • G: a graph on S: • p,q are connected by an edge iff sim(p,q) ≥ t (t = threshold) • Goal: find the connected components of G

  6. Challenges • S is huge • Web has 10 billion pages • Documents are not compressed • Needs many disks to store S • Each sim computation is costly • Documents in S should be processed in a stream • Main memory is tiny relative to |S| • Cannot afford more than O(|S|) time • How to create the graph G? • Naively, requires |S| passes and |S|^2 similarity computations

  7. Sketching Schemes • T = a small set (|S| < |T| << |U|) • A sketching scheme for sim: • Compression function: a randomized mapping φ: U → T • Reconstruction function: ρ: T × T → [0,1] • For every pair p,q, with high probability ρ(φ(p),φ(q)) ≈ sim(p,q)

  8. Syntactic Clustering by Sketching • P ← empty table of size |S| • G ← empty graph on |S| nodes • for i = 1,…,|S| • read document p_i from the stream • P[i] ← φ(p_i) • for i = 1,…,|S| • for j = 1,…,|S| • if (ρ(P[i],P[j]) ≥ t) • add edge (i,j) to G • output connected components of G

  9. Analysis • Can compute sketches in one pass • Table P can be stored in a single file on a single machine • Creating G requires |S|^2 applications of ρ • Easier than full-fledged computations of sim • Quadratic time is still a problem • Connected components algorithm is heavy but feasible
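The procedure on slides 8 and 9 can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the names `cluster_by_sketches`, `compress`, and `reconstruct` are hypothetical, a union-find structure stands in for the explicit graph G, and the toy "sketch" used in the usage lines is an exact word set rather than a randomized compression.

```python
from typing import Callable, Iterable, List, TypeVar

Doc = TypeVar("Doc")
Sketch = TypeVar("Sketch")


def cluster_by_sketches(
    stream: Iterable[Doc],
    compress: Callable[[Doc], Sketch],
    reconstruct: Callable[[Sketch, Sketch], float],
    t: float,
) -> List[List[int]]:
    """Return connected components as sorted lists of stream indices."""
    # Pass 1: read each document once and keep only its sketch.
    P = [compress(p) for p in stream]
    n = len(P)

    # Union-find replaces the explicit graph G: merging i and j has the
    # same effect as adding edge (i, j) and taking components later.
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Pass 2: a quadratic number of reconstruction calls. Only j > i is
    # visited because the similarity estimate is assumed symmetric.
    for i in range(n):
        for j in range(i + 1, n):
            if reconstruct(P[i], P[j]) >= t:
                parent[find(i)] = find(j)

    groups: dict = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy stand-ins (assumptions): exact word sets and exact Jaccard
# similarity, so nothing here is randomized or compressed.
def word_set(doc: str) -> frozenset:
    return frozenset(doc.split())


def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0


docs = ["a b c d", "a b c e", "x y z"]
clusters = cluster_by_sketches(docs, word_set, jaccard, t=0.5)
```

The nested loop makes the quadratic cost noted in the analysis visible: the sketches are cheap to compare, but every pair is still compared.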

  10. Locality Sensitive Hashing (LSH) [Indyk, Motwani, 98] • A special kind of sketching scheme • H = { h | h: U → T }: a family of hash functions • H is locality sensitive w.r.t. sim if for all p,q ∈ U, Pr[h(p) = h(q)] = sim(p,q) • Probability is over random choice of h from H • Probability of collision = similarity between p and q

  11. Syntactic Clustering by LSH • P ← empty table of size |S| • G ← empty graph on |S| nodes • for i = 1,…,|S| • read document p_i from the stream • P[i] ← h(p_i) • sort P and group by value • output groups

  12. Analysis • Can compute hash values in one pass • Table P can be stored in a single file on a single machine • Sorting and grouping takes O(|S| log |S|) simple comparisons • Each group A consists of pages whose hash value is the same • By LSH property, they are likely to be similar to each other
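The sort-and-group step of slide 11 can be sketched as below. The function name `cluster_by_lsh` is hypothetical, and the hash used in the usage lines (the lexicographically smallest word) is only a deterministic stand-in chosen for illustration; it is not drawn at random from a locality-sensitive family.

```python
from itertools import groupby
from typing import Callable, Iterable, List


def cluster_by_lsh(stream: Iterable[str], h: Callable[[str], str]) -> List[List[int]]:
    """Group stream indices by hash value using one sort."""
    # One pass over the stream: store (hash value, document index).
    P = [(h(p), i) for i, p in enumerate(stream)]
    # O(|S| log |S|) comparisons; equal hash values become adjacent.
    P.sort()
    # Each run of equal hash values is one output group.
    return [[i for _, i in run] for _, run in groupby(P, key=lambda e: e[0])]


# Stand-in hash (assumption): smallest word of the document.
def smallest_word(doc: str) -> str:
    return min(doc.split())


docs = ["b a c", "a d", "z y"]
groups = cluster_by_lsh(docs, smallest_word)
```

Compared with the sketch-based procedure, no pairwise loop is needed: documents land in the same group exactly when their hash values collide.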

  13. Shingling and Resemblance [Broder et al 97] • tokens: words, numbers, HTML tags, etc. • tokenization(p): sequence of tokens produced from document p • w: a small integer • S_w(p) = w-shingling of p = set of all distinct contiguous subsequences of tokenization(p) of length w • Ex: p = “a rose is a rose is a rose”, w = 4 • S_w(p) = { (a rose is a), (rose is a rose), (is a rose is) } • resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|
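A small sketch of shingling and resemblance, assuming whitespace tokenization (the slide leaves the tokenizer open) and taking resemblance to be the ratio of intersection size to union size of the two shingle sets. The function names are illustrative.

```python
from typing import List, Set, Tuple


def shingles(tokens: List[str], w: int) -> Set[Tuple[str, ...]]:
    """Set of all distinct contiguous length-w subsequences of tokens."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(p: str, q: str, w: int) -> float:
    """|S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)| with whitespace tokens."""
    sp, sq = shingles(p.split(), w), shingles(q.split(), w)
    if not sp and not sq:
        return 1.0  # convention chosen here for two empty shingle sets
    return len(sp & sq) / len(sp | sq)


rose = shingles("a rose is a rose is a rose".split(), 4)
```

For the slide's example, `rose` contains exactly the three shingles listed, because the five length-4 windows of the eight tokens repeat two of them.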

  14. LSH for Resemblance • resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)| • π = a random permutation on Σ^w (Σ = set of all tokens) • π induces a random order on all length w sequences of tokens • π also induces a random order on any subset X ⊆ Σ^w • For each such subset and for each x ∈ X, Pr(min(π(X)) = π(x)) = 1/|X| • LSH for resemblance: h(p) = min(π(S_w(p)))

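  15. LSH for Resemblance (cont.) • Lemma: Pr[h(p) = h(q)] = resemblance_w(p,q) • Proof: h(p) = h(q) iff the minimum of π over S_w(p) ∪ S_w(q) is attained at an element of S_w(p) ∩ S_w(q). Each element of the union is equally likely to attain the minimum, so this happens with probability |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|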
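The collision property of the min-hash can be checked empirically. The sketch below is an illustration under simplifying assumptions: instead of a permutation of all length-w token sequences, it shuffles only the shingles that actually occur in the two sets (the order induced on a subset by a uniform permutation is itself uniform), and the function name and the seeded generator are choices made here.

```python
import random
from typing import Set


def minhash_collision_rate(sp: Set[int], sq: Set[int], trials: int, seed: int = 0) -> float:
    """Fraction of random orders under which both sets share their minimum."""
    rng = random.Random(seed)
    universe = sorted(sp | sq)
    hits = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)  # a fresh random order per trial
        rank = {s: r for r, s in enumerate(order)}
        # h(p) = element of the set with the smallest rank
        if min(sp, key=rank.__getitem__) == min(sq, key=rank.__getitem__):
            hits += 1
    return hits / trials


sp, sq = {1, 2, 3, 4}, {3, 4, 5, 6}
exact = len(sp & sq) / len(sp | sq)  # 2 shared out of 6 total
estimate = minhash_collision_rate(sp, sq, trials=20000)
```

With many trials the estimate should settle near the exact resemblance, which is what the LSH property for resemblance asserts.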

  16. Min-Wise Independent Permutations [Broder, Charikar, Frieze, Mitzenmacher, 98] • Usual problem: storing π takes too much space • O(|Σ|^w log |Σ|^w) bits to represent π • Use small families of permutations • A family Π = { π | π is a permutation on Σ^w } is min-wise independent if • For all subsets X ⊆ Σ^w and for all x ∈ X, Pr(min(π(X)) = π(x)) = 1/|X| • Explicit constructions of small families of “approximately” min-wise independent permutations [Indyk 98]

  17. The Sketching Model • Alice holds x, Bob holds y; they have shared randomness • Alice sends a sketch φ(x) and Bob sends a sketch φ(y) to a Referee • k vs. r Gap Problem (approximation) • Promise: d(x,y) ≤ k or d(x,y) ≥ r • Goal: decide which of the two holds

  18. Applications • Large data sets: clustering, nearest neighbor schemes, data streams • Management of files over the network: differential backup, synchronization • Theory: low distortion embeddings, simultaneous messages communication complexity

  19. Known Sketching Schemes • Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98] • Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98] [Feigenbaum,Ishai,Malkin,Nissim,Strauss,Wright 01] • Cosine similarity [Charikar 02] • Earth mover distance [Charikar 02] • Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]

  20. Sketching Algorithm for Hamming Distance [Kushilevitz, Ostrovsky, Rabani 98] • x,y: binary strings of length n • HD(x,y) = # of positions in which x,y differ • HD(x,y) = | { i | x_i ≠ y_i } | • Ex: x = 10101, y = 01010, HD(x,y) = 5 • Goal: • If HD(x,y) ≤ k, output “accept” w.p. ≥ 1 - δ • If HD(x,y) ≥ 2k, output “reject” w.p. ≥ 1 - δ • KOR algorithm: O(log(1/δ)) size sketch

  21. The KOR Algorithm • Shared randomness: n i.i.d. random bits r_1,…,r_n, where Pr[r_i = 1] = 1/(2k) • Basic sketch: h(x) = (Σ_i x_i r_i) mod 2 • Full sketch: φ(x) = (h_1(x),…,h_t(x)) • t = O(log(1/δ)) • h_1,…,h_t are generated independently like h • Reconstruction: • for j = 1,…,t do • if (h_j(x) = h_j(y)) then • z_j ← 1 • else • z_j ← 0 • if avg(z_1,…,z_t) > 11/18 output “accept”, else output “reject”
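The KOR sketch and referee can be written out as below. This is a sketch under stated assumptions: each shared random bit is 1 with probability 1/(2k) (the value used in the analysis slides), the shared randomness is modeled by a seeded generator, and `make_kor` and `referee` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence, Tuple


def make_kor(n: int, k: int, t: int, seed: int = 0) -> Callable[[Sequence[int]], Tuple[int, ...]]:
    """Build a sketch function from t shared random vectors of length n."""
    rng = random.Random(seed)  # stands in for the shared randomness
    # t independent vectors; each bit is 1 with probability 1/(2k).
    R: List[List[int]] = [
        [1 if rng.random() < 1 / (2 * k) else 0 for _ in range(n)]
        for _ in range(t)
    ]

    def sketch(x: Sequence[int]) -> Tuple[int, ...]:
        # h_j(x) = (sum_i x_i * r_i) mod 2, one bit per random vector
        return tuple(sum(xi & ri for xi, ri in zip(x, r)) % 2 for r in R)

    return sketch


def referee(sx: Tuple[int, ...], sy: Tuple[int, ...]) -> str:
    """Accept iff the fraction of agreeing sketch bits exceeds 11/18."""
    z = [1 if a == b else 0 for a, b in zip(sx, sy)]
    return "accept" if sum(z) / len(z) > 11 / 18 else "reject"


n, k, t = 200, 5, 2000
sketch = make_kor(n, k, t, seed=1)
x = [0] * n
near = [1] * 3 + [0] * (n - 3)    # Hamming distance 3 <= k
far = [1] * 40 + [0] * (n - 40)   # Hamming distance 40 >= 2k
```

Alice and Bob must build `sketch` from the same seed; the referee sees only the two t-bit tuples, never x or y.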

  22. KOR: Analysis • Pr[h(x) = h(y)] = Pr[(Σ_{i: x_i ≠ y_i} r_i) mod 2 = 0] • Note: # of terms in the sum = HD(x,y) • Given HD(x,y) independent random bits, each with probability 1/(2k) to be 1, what is the probability that their parity is 0?

  23. KOR: Analysis (cont.) • r_1,…,r_m: m independent random bits • For each j, Pr(r_j = 1) = α • What is Pr[(Σ_j r_j) mod 2 = 0]? • Can view distribution of each bit as a mixture of two distributions: • Dist A (with probability 1 - 2α): the bit is 0 w.p. 1 • Dist B (with probability 2α): a uniformly chosen bit • Note: • If all bits “choose” Dist A, then the parity is 0 w.p. 1 • If at least one of the m bits “chooses” Dist B, then the parity is 0 w.p. ½ • Hence, Pr[(Σ_j r_j) mod 2 = 0] = (1 - 2α)^m + ½ (1 - (1 - 2α)^m) = ½ + ½ (1 - 2α)^m
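The mixture argument yields the closed form 1/2 + (1/2)(1 - 2α)^m for the probability that m independent bits have parity 0, writing α for Pr(r_j = 1) (a symbol name assumed here). A brute-force check over all bit patterns, with illustrative function names, can confirm it for small m:

```python
from itertools import product


def parity_zero_exact(m: int, alpha: float) -> float:
    """Sum the probabilities of all m-bit patterns with an even number of ones."""
    total = 0.0
    for bits in product((0, 1), repeat=m):
        ones = sum(bits)
        if ones % 2 == 0:
            total += alpha ** ones * (1 - alpha) ** (m - ones)
    return total


def parity_zero_formula(m: int, alpha: float) -> float:
    """Closed form obtained from the Dist A / Dist B mixture argument."""
    return 0.5 + 0.5 * (1 - 2 * alpha) ** m


max_gap = max(
    abs(parity_zero_exact(m, a) - parity_zero_formula(m, a))
    for m in range(1, 9)
    for a in (0.1, 0.25, 0.5)
)
```

The enumeration is exponential in m and is only meant as a sanity check; the closed form is what the analysis uses.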

  24. KOR Analysis (cont.) • Pr[h(x) = h(y)] = 1/2 + 1/2 (1 - 1/k)^HD(x,y) • Therefore, • If HD(x,y) ≤ k, then Pr[h(x) = h(y)] ≥ 1/2 + 1/(2e) ≈ 4/6 = 12/18 • If HD(x,y) ≥ 2k, then Pr[h(x) = h(y)] ≤ 1/2 + 1/(2e^2) ≈ 10/18 • Define: Z = avg(z_1,…,z_t) • If HD(x,y) ≤ k, then E[Z] ≥ 12/18 • If HD(x,y) ≥ 2k, then E[Z] ≤ 10/18 • By Chernoff, t = O(log(1/δ)) is enough to guarantee: • If HD(x,y) ≤ k, then Z > 11/18 w.p. ≥ 1 - δ • If HD(x,y) ≥ 2k, then Z ≤ 11/18 w.p. ≥ 1 - δ
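To see how the 11/18 threshold separates the two cases, the collision probability of a single sketch bit can be evaluated directly. The sketch below assumes each random bit is 1 with probability 1/(2k), which gives 1/2 + (1/2)(1 - 1/k)^d at Hamming distance d; the function name is illustrative, and the check is run for k ≥ 2 only.

```python
def collision_probability(k: int, d: int) -> float:
    """Pr[h(x) = h(y)] when HD(x, y) = d and Pr[r_i = 1] = 1/(2k)."""
    return 0.5 + 0.5 * (1 - 1 / k) ** d


THRESHOLD = 11 / 18

# Tabulate the two boundary cases d = k and d = 2k for a few values of k.
for k in (2, 10, 1000):
    at_k = collision_probability(k, k)
    at_2k = collision_probability(k, 2 * k)
    print(k, round(at_k, 4), round(at_2k, 4), at_2k < THRESHOLD < at_k)
```

Since the probability only decreases as d grows, distances below k lie further above the threshold and distances above 2k lie further below it.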

  25. Edit Distance • x ∈ Σ^n, y ∈ Σ^m • ED(x,y): minimum number of character insertions, deletions and substitutions that transform x to y • Examples: • ED(00000, 1111) = 5 • ED(01010, 10101) = 2 • Applications: • Genomics • Text processing • Web search • For simplicity: m = n, Σ = {0,1}
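For reference, edit distance as defined on this slide can be computed exactly with the textbook dynamic program. This is not the sketching algorithm of the following slides, only a baseline for checking the two examples; the function name is illustrative.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] = distance between the current prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


example_1 = edit_distance("00000", "1111")
example_2 = edit_distance("01010", "10101")
```

The table has (n+1)(m+1) entries, so this baseline needs both full strings in one place, which is exactly what the sketching model avoids.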

  26. Sketching Algorithm for Edit Distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04] • x,y: binary strings of length n • Goal: • If ED(x,y) ≤ k, output “accept” w.p. ≥ 1 - δ • If ED(x,y) ≥ Ω((kn)^(2/3)), output “reject” w.p. ≥ 1 - δ • BJKK algorithm: O(log(1/δ)) size sketch

  27. Basic Framework • Underlying principle: ED(x,y) is small iff x and y share many common substrings at nearby positions • S_x = set of pairs of the form (α, h(i)) • α: a substring of x • h(i): a “locality sensitive” encoding of the substring’s position • ED(x,y) small iff intersection S_x ∩ S_y large

  28. Basic Framework (cont.) • ED(x,y) small iff symmetric difference S_x Δ S_y small • Need to estimate size of symmetric difference • Hamming distance computation of characteristic vectors • Use O(log(1/δ)) size sketches [KOR] • Reduced Edit Distance to Hamming Distance

  29. Encoding Scheme • Gap: k vs. O((kn)^(2/3)) • B = n^(2/3)/k^(1/3), W = n/B • B windows of size W each • S_x = { …, (α_i, win(i)), … }, e.g. (α_1,1), (α_2,1), (α_3,2), … • S_y = { …, (b_i, win(i)), … }, e.g. (b_1,1), (b_2,1), (b_3,2), … • [Figure: x and y with positions 1 to 14, windows 1, 2, 3 marked on x and substrings b_1, b_2, b_3 marked on y]

  30. Analysis • Case 1: ED(x,y) ≤ k • If i is “unmarked”, it has a matching “companion” j • (α_i, win(i)) ∈ S_x \ S_y only if: • either i is “marked” • or i is unmarked, but win(i) ≠ win(j) • At most kB marked substrings • At most k * n/W = kB companions with mismatched windows • Therefore, Ham(S_x,S_y) ≤ 4kB

  31. Analysis (cont.) • Case 2: Ham(S_x,S_y) ≤ 8kB • If i has a “companion” j and win(i) = win(j), can align i with j using at most W operations • Otherwise, substitute first character of i • At most 8kB substrings of x have no companion • Therefore, ED(x,y) ≤ 8kB + W * n/B = O((kn)^(2/3)) • [Figure: substrings of x starting at positions 1, B+1, 2B+1, … and their companions b_2, …, b_{B-1} in y]
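The final bound follows by substituting the parameter choices from slide 29, B = n^(2/3)/k^(1/3) and W = n/B. A worked version of that arithmetic, written out here as a check rather than taken from the slides:

```latex
\begin{align*}
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = k^{2/3}\, n^{2/3} = (kn)^{2/3}, \\
W \cdot \frac{n}{B} &= \frac{n}{B} \cdot \frac{n}{B} = \frac{n^{2}}{B^{2}}
  = \frac{n^{2}\, k^{2/3}}{n^{4/3}} = (kn)^{2/3}, \\
8kB + W \cdot \frac{n}{B} &= 9\,(kn)^{2/3} = O\big((kn)^{2/3}\big).
\end{align*}
```

As a numeric instance, n = 1000 and k = 8 give B = 100/2 = 50 and W = 20, so kB = 400 and W * n/B = 400, both equal to (8000)^(2/3) = 400. The choice of B balances the two terms.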

  32. End of Lecture 11
