Learn about Locality Sensitive Hashing for identifying near-duplicate documents in large collections such as web search engines, via shingling, min-hashing, and Hamming-distance-based sketch comparison.
Locality Sensitive Hashing: basics and applications
A well-known problem
• Given a large collection of documents
• Identify the near-duplicate documents
• Web search engines:
  • Proliferation of near-duplicate documents
  • Legitimate – mirrors, local copies, updates, …
  • Malicious – spam, spider-traps, dynamic URLs, …
  • 30% of web pages are near-duplicates [1997]
Natural approaches
• Fingerprinting:
  • only works for exact matches
  • Karp-Rabin (rolling hash) – collision-probability guarantees
  • MD5 – cryptographically secure string hashes
• Edit distance:
  • a metric for approximate string matching
  • expensive – even for one pair of documents
  • impossible – for billions of web documents
• Random sampling:
  • sample substrings (phrases, sentences, etc.)
  • hope: similar documents yield similar samples
  • but even samples of the same document will differ
Basic Idea: Shingling [Broder 1997]
• Dissect the document into q-grams (shingles)
  T = "I live and study in Pisa, …". If we set q=3, the 3-grams are:
  <I live and> <live and study> <and study in> <study in Pisa> …
• Represent each document by the set of hash[shingle] values
  The problem reduces to set intersection among sets of integers
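A minimal sketch of the shingling step, not from the original slides: the function name `shingles` and the use of Python's built-in `hash` are illustrative choices (a production system would use a fixed 64-bit hash such as Karp-Rabin).

```python
def shingles(text, q=3):
    """Return the set of hashed q-grams (shingles) of a whitespace-tokenized text."""
    tokens = text.split()
    grams = [" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]
    # Hash each q-gram so that a document becomes a set of integers.
    return {hash(g) for g in grams}

doc = "I live and study in Pisa and I like it"
print(shingles(doc, q=3))
```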
Basic Idea: Shingling [Broder 1997]
• Set intersection ⇒ Jaccard similarity: sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|, where S_A and S_B are the shingle sets of Doc A and Doc B
• Claim: A & B are near-duplicates if sim(S_A, S_B) is high
Sketching of a document [Sec. 19.6]
• From each shingle set we build a "sketch vector" (of size ~200)
• Postulate: documents that share ≥ t components of their sketch vectors are claimed to be near-duplicates
Sketching by Min-Hashing
• Consider S_A, S_B ⊆ P = {0, …, p−1}
• Pick a random permutation π of the whole set P (such as π(x) = ax + b mod p)
• Pick the minimal element of S_A: a = min{π(S_A)}
• Pick the minimal element of S_B: b = min{π(S_B)}
• Lemma: Pr[a = b] = |S_A ∩ S_B| / |S_A ∪ S_B| = sim(S_A, S_B)
Strengthening it…
• Similarity sketch sk(A): the d minimal elements under π(S_A)
• Or take d permutations and the min of each
• Note: we can reduce the variance by using a larger d
• Typically d is a few hundred mins (~200)
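A minimal sketch of min-hashing with d random linear permutations, not from the original slides: the prime `P`, the seed, and the function names are illustrative assumptions.

```python
import random

P = (1 << 61) - 1  # a large prime; hashed shingles are assumed to live in [0, P)

def make_permutations(d, seed=0):
    """Draw d random linear permutations x -> (a*x + b) mod P."""
    rnd = random.Random(seed)
    return [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(d)]

def sketch(shingle_set, perms):
    """Min-hash sketch: for each permutation, keep the minimum image of the set."""
    return [min((a * s + b) % P for s in shingle_set) for (a, b) in perms]
```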
Computing Sketch[i] for Doc 1 [Sec. 19.6]
• Start with 64-bit hashes f(shingles), i.e. values in [0, 2^64)
• Permute them with π_i
• Pick the min value as Sketch[i]
(figure: the universe [0, 2^64) of Document 1 before and after the permutation)
Test if Doc1.Sketch[i] = Doc2.Sketch[i] [Sec. 19.6]
• Claim: this happens with probability Size_of_intersection / Size_of_union of the two shingle sets
• Use 200 random permutations (one min-hash each), thus creating one 200-dim vector per document, and evaluate the fraction of shared components
(figure: the 64-bit universes of Document 1 and Document 2, with minima A and B — are these equal?)
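Continuing the min-hash sketch above (again an illustrative snippet, not from the slides), the fraction of shared components estimates the Jaccard similarity:

```python
def estimated_jaccard(sk_a, sk_b):
    """Fraction of equal components of two min-hash sketches (~ Jaccard similarity)."""
    assert len(sk_a) == len(sk_b)
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)

# perms = make_permutations(200); estimated_jaccard(sketch(S_A, perms), sketch(S_B, perms))
```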
It's even more difficult…
• So we have squeezed a few KBs of data (a web page) into a few hundred bytes
• But you still need a brute-force comparison (quadratic time) to find all nearly-duplicate documents
• This is too much, even if it is executed in RAM
Locality Sensitive Hashing: the case of the Hamming distance
• How to quickly compute the fraction of differing components between d-dim vectors, i.e. their Hamming distance
• Fraction of different components = HammingDist / d
A warm-up
• Consider the case of binary (sketch) vectors, thus living in the hypercube {0,1}^d
• Hamming distance D(p,q) = # coordinates on which p and q differ
• Define the hash function h by choosing a set I of k random coordinates: h(p) = p|_I = projection of p on I
• Example: if p = 01011 (d=5) and we pick I = {1,4} (k=2), then h(p) = 01
• Note the similarity with the Bloom filter
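A minimal sketch of this bit-sampling hash (illustrative code, not from the slides; 0-based indices are used, so the slide's I={1,4} becomes [0, 3]):

```python
import random

def make_projection(d, k, seed=None):
    """Pick a set I of k random coordinates out of d."""
    return sorted(random.Random(seed).sample(range(d), k))

def project(p, coords):
    """h(p) = p restricted to the chosen coordinates (p is a 0/1 string)."""
    return "".join(p[i] for i in coords)

# Slide example: p = 01011, I = {1, 4} (1-based)  ->  h(p) = "01"
p = "01011"
I = [0, 3]            # 0-based indices of coordinates 1 and 4
print(project(p, I))  # "01"
```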
A key property
• Pr[a sampled coordinate is equal in p and q] = (d − D(p,q)) / d
• With k sampled coordinates, Pr[h(p) = h(q)] = ((d − D(p,q)) / d)^k, so we can vary the probability by changing k
• What about false negatives?
(plot: Pr of a match as a function of the distance, for k=1 and k=2)
Reiterate
• Repeat L times the k-projections h_i
• Declare a «match» if at least one h_i matches
• Example: d=5, k=2, p = 01011 and q = 00101
  • I1 = {2,4}: h1(p) = 11 and h1(q) = 00
  • I2 = {1,4}: h2(p) = 01 and h2(q) = 00
  • I3 = {1,5}: h3(p) = 01 and h3(q) = 01
• We set g(·) = <h1(·), h2(·), h3(·)>: p and q match!
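A short sketch of the "match if at least one projection agrees" rule on the slide's own example (illustrative code, not from the slides; projections are written with 0-based indices):

```python
def lsh_match(p, q, projections):
    """Declare a match if at least one k-coordinate projection agrees on p and q."""
    return any(all(p[i] == q[i] for i in I) for I in projections)

# Slide example: d=5, k=2, L=3
p, q = "01011", "00101"
projections = [[1, 3], [0, 3], [0, 4]]   # I1={2,4}, I2={1,4}, I3={1,5}, 0-based
print(lsh_match(p, q, projections))      # True: h3 matches
```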
Measuring the error probability
• g(·) consists of L independent hashes h_i
• Pr[g(p) matches g(q)] = 1 − Pr[h_i(p) ≠ h_i(q) for all i = 1, …, L] = 1 − (1 − s^k)^L, where s = (d − D(p,q))/d is the per-coordinate equality probability
• As a function of s this is an S-shaped curve, whose threshold sits around s ≈ (1/L)^(1/k)
(plot: match probability versus s)
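A quick way to see the S-curve numerically (illustrative snippet, not from the slides; the values of k and L are arbitrary):

```python
def match_probability(s, k, L):
    """Pr[g(p) matches g(q)] when a random coordinate of p and q agrees with prob. s."""
    return 1 - (1 - s ** k) ** L

k, L = 5, 20
print(round((1 / L) ** (1 / k), 3))   # approximate threshold of the S-curve
print([round(match_probability(s / 10, k, L), 3) for s in range(11)])
```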
Find groups of similar items
• SOL 1: buckets provide the candidate similar items; points in the same bucket are possibly similar objects
• «Merge» similar sets if they share items
(figure: a point p hashed by h1(p), h2(p), …, hL(p) into tables T1, T2, …, TL)
Find groups of similar items
• SOL 1: buckets provide the candidate similar items
• SOL 2: sort items by each h_i(·) and pick as similar candidates the equal ones; repeat L times, for all h_i(·)
• «Merge» candidate sets if they share items
• Check the candidates!
• What about clustering?
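A minimal sketch of SOL 1 (illustrative code, not from the slides): hash every point into L tables and report the pairs that share a bucket; these are only candidates and must still be checked explicitly.

```python
from collections import defaultdict

def candidate_pairs(points, projections):
    """SOL 1: points falling into the same bucket of some table are candidate pairs."""
    candidates = set()
    for I in projections:                      # one table per projection h_i
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            buckets[tuple(p[i] for i in I)].append(idx)   # key = h_i(p)
        for bucket in buckets.values():        # all pairs inside a bucket are candidates
            for a in range(len(bucket)):
                for b in range(a + 1, len(bucket)):
                    candidates.add((bucket[a], bucket[b]))
    return candidates
```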
LSH versus K-means
• What about optimality? K-means is only locally optimal [recently, some researchers showed how to introduce some guarantees]
• What about the similarity cost? K-means compares items in Θ(d) time and space [notice that d may be millions or billions]
• What about the cost per iteration and their number? Typically K-means requires few iterations, each costing K·n distance computations, i.e. I·K·n·d time overall
• What about K? In principle one has to try K = 1, …, n
• LSH needs sort(n) time, hence, on disk, a few passes over the data, and with guaranteed error bounds
Also on-line queries
• Given a query q, check the buckets of h_j(q) for j = 1, …, L
(figure: q hashed by h1(q), h2(q), …, hL(q) into tables T1, T2, …, TL)
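A minimal sketch of the on-line query (illustrative code, not from the slides; `build_tables` and the example data are assumptions):

```python
from collections import defaultdict

def build_tables(points, projections):
    """One dictionary per projection h_j, mapping h_j(p) -> list of point ids."""
    tables = []
    for I in projections:
        T = defaultdict(list)
        for idx, p in enumerate(points):
            T[tuple(p[i] for i in I)].append(idx)
        tables.append(T)
    return tables

def query(q, tables, projections):
    """Collect the candidates found in the buckets h_1(q), ..., h_L(q)."""
    found = set()
    for T, I in zip(tables, projections):
        found.update(T.get(tuple(q[i] for i in I), []))
    return found

points = ["01011", "00101", "11100"]
projections = [[1, 3], [0, 3], [0, 4]]
tables = build_tables(points, projections)
print(query("01011", tables, projections))   # ids of the candidate near points
```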
Locality Sensitive Hashing and its applications
More problems, indeed
Another classic problem
• The problem: given U users, the goal is to find groups of similar users (or users similar to a query user Q)
• Features = personal data, preferences, purchases, navigational behavior, followers/following or +1, …
• A feature is typically a numerical value: binary or real
• Hamming distance: # different components
More than Hamming distance
• Example: q is the query point, p* is its nearest neighbor
(figure: q and its nearest neighbor p*)
Approximation helps
(figure: query q, its nearest neighbor p*, and a ball of radius r around q)
A slightly different problem: Approximate Nearest Neighbor
• Given an error parameter ε > 0
• For a query q with nearest neighbor p', return a point p such that D(q,p) ≤ (1+ε)·D(q,p')
• Justification:
  • Mapping objects to a metric space is heuristic anyway
  • We get a tremendous performance improvement
A workable approach
• Given an error parameter ε > 0 and a distance threshold t > 0
• (t,ε)-Approximate NN Query:
  • If there is no point p with D(q,p) < t, return FAILURE
  • Else, return any p' with D(q,p') < (1+ε)·t
• Application: Approximate Nearest Neighbor
  • Assume the maximum distance is T
  • Run in parallel for t = 1, (1+ε), (1+ε)^2, …, T
  • Time/space: O(log_{1+ε} T) overhead
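A rough sketch of this reduction, not from the slides: it scans the radii sequentially rather than in parallel, and `t_eps_query(q, t)` stands for an assumed (t,ε)-NN structure (e.g. one LSH index per radius) that returns a point or None on FAILURE.

```python
def approx_nn(q, t_eps_query, eps, T):
    """Try t = 1, (1+eps), (1+eps)^2, ..., T and return the first answer found,
    giving an O(log_{1+eps} T) overhead in the number of structures probed."""
    t = 1.0
    while t <= T:
        p = t_eps_query(q, t)   # hypothetical (t, eps)-approximate NN query
        if p is not None:
            return p            # guaranteed D(q, p) < (1 + eps) * t
        t *= (1 + eps)
    return None
```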
Locality Sensitive Hashing and its applications
The analysis
LSH Analysis
• For a fixed threshold r, we distinguish between
  • Near: D(p,q) < r
  • Far: D(p,q) > (1+ε)·r
• A locality-sensitive hash h should guarantee that
  • Near points are hashed together with Pr[h(a) = h(b)] ≥ P1
  • Far points may be mapped together, but Pr[h(a) = h(c)] ≤ P2
  where, of course, we have that P1 > P2
(figure: a near point b and a far point c around a)
What about the Hamming distance?
Family: h_i(p) = p|_{c1,…,ck}, where the coordinates c_i are chosen at random
• If D(a,b) ≤ r, then Pr[h_i(a) = h_i(b)] = (1 − D(a,b)/d)^k ≥ (1 − r/d)^k = (p1)^k = P1
• If D(a,c) > (1+ε)·r, then Pr[h_i(a) = h_i(c)] = (1 − D(a,c)/d)^k < (1 − r(1+ε)/d)^k = (p2)^k = P2
where, of course, we have that p1 > p2 (and thus P1 > P2)
LSH Analysis
The LSH algorithm with the L mappings h_i(·) correctly solves the (r,ε)-NN problem on a query point q if the following hold:
• The total number of points FAR from q and belonging to the visited buckets h_i(q) is at most 3L (a constant per bucket, on average).
• If p* is NEAR to q, then h_i(p*) = h_i(q) for some i (p* lies in a visited bucket).
Theorem. Take k = log_{1/p2} n and L = n^ρ with ρ = ln p1 / ln p2; then the two properties above hold with probability at least 0.298. Repeating the process Θ(1/δ) times, we ensure a probability of success of at least 1−δ.
• Space ≈ n·L = n^{1+ρ}, where ρ = ln p1 / ln p2 < 1
• Query time ≈ L = n^ρ buckets accessed
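A small illustrative calculation of these parameters for the Hamming family (not from the slides; the numbers n, r, ε, d are arbitrary examples):

```python
import math

def lsh_parameters(n, r, eps, d):
    """k = log_{1/p2} n and L = n^rho with rho = ln p1 / ln p2, for bit sampling."""
    p1 = 1 - r / d                 # collision prob. of a near pair on one coordinate set
    p2 = 1 - r * (1 + eps) / d     # collision prob. bound for a far pair
    k = math.ceil(math.log(n) / math.log(1 / p2))
    rho = math.log(p1) / math.log(p2)
    L = math.ceil(n ** rho)
    return k, L, rho

print(lsh_parameters(n=1_000_000, r=10, eps=1.0, d=64))
```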
Proof
• p* is a point near to q: D(q,p*) < r
• FAR(q) = set of points p s.t. D(q,p) > (1+ε)·r
• BUCKET_i(q) = set of points p s.t. h_i(p) = h_i(q)
• Let us define the following events:
  • E1 = the number of far points in the visited buckets is ≤ 3L
  • E2 = p* occurs in some visited bucket, i.e. ∃ j s.t. h_j(q) = h_j(p*)
Bad collisions: more than 3L
• Let p be a point in FAR(q): Pr[for a fixed j, h_j(p) = h_j(q)] < P2 = (p2)^k
• Given that k = log_{1/p2} n, a far point p satisfies h_j(p) = h_j(q), for a fixed j, with probability < 1/n
• Hence the expected number of far points in the L visited buckets is < L·n·(1/n) = L
• By Markov's inequality Pr[X > 3·E[X]] ≤ 1/3, so Pr[not E1] = Pr[more than 3L far points in the visited buckets] ≤ 1/3
Good collision: p* occurs
• For any h_j, Pr[h_j(p*) = h_j(q)] ≥ P1 = (p1)^k = (p1)^{log_{1/p2} n} = n^{−ln p1 / ln p2} = n^{−ρ}
• Given that L = n^ρ, this is exactly 1/L
• So we have Pr[not E2] = Pr[not finding p* in q's buckets] = (1 − Pr[h_j(p*) = h_j(q)])^L ≤ (1 − 1/L)^L ≤ 1/e
• Finally, Pr[E1 and E2] ≥ 1 − Pr[not E1 or not E2] ≥ 1 − (Pr[not E1] + Pr[not E2]) ≥ 1 − 1/3 − 1/e ≈ 0.298