CS 361A (Advanced Data Structures and Algorithms)

CS 361A (Advanced Data Structures and Algorithms). Lecture 18 (Nov 30, 2005) Fingerprints, Min-Hashing, and Document Similarity Rajeev Motwani. Game Plan for Week. Fingerprints Document Similarity Shingling Min-Hashing


Presentation Transcript


  1. CS 361A (Advanced Data Structures and Algorithms) Lecture 18 (Nov 30, 2005) Fingerprints, Min-Hashing, and Document Similarity Rajeev Motwani

  2. Game Plan for Week • Fingerprints • Document Similarity • Shingling • Min-Hashing • Min-Wise Independent Permutations

  3. Fingerprints • W – set of large objects (e.g., URLs) • Goal • avoid storing large objects explicitly • quick-and-dirty equality-testing • Fingerprints? • Short tags for objects • Distinct fingerprints ⇒ distinct objects • Distinct objects ⇒ probably distinct fingerprints

  4. Formalization • Fingerprint length k ⇒ fingerprint space size $N = 2^k$ • Fingerprint function family $F = \{ f : W \to \{0,1\}^k \}$ • Random $f \in_R F$ ⇒ • $f(A) \ne f(B) \Rightarrow A \ne B$ • Collisions: $P[ f(A) = f(B) \mid A \ne B ] \approx 0$ (ideally $2^{O(-k)}$) • Typical Application • Adversarial object-set S with $|S| = n \ll 2^k$ • Goal – $|f(S)| = |S|$ with high probability • $n^2$ pair-wise collisions possible ⇒ need $2^k > n^2$ (to avoid Birthday Paradox)
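The sizing rule on this slide can be checked numerically. The sketch below is only an illustration under an idealized assumption that is not in the slides: every pair of distinct objects collides with probability exactly $2^{-k}$, and the failure probability is bounded with a union bound over pairs. The function names are hypothetical.

```python
import math


def collision_probability_bound(n: int, k: int) -> float:
    """Union bound on P[some pair among n objects collides] with k-bit fingerprints.

    Assumption (idealized): each pair collides with probability 2**-k.
    """
    pairs = n * (n - 1) // 2
    return min(1.0, pairs / 2**k)


def min_bits(n: int, failure_prob: float) -> int:
    """Smallest k for which the union bound is at most failure_prob."""
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 1
    return max(1, math.ceil(math.log2(pairs / failure_prob)))


if __name__ == "__main__":
    n = 2**32  # roughly 4 billion objects
    print(collision_probability_bound(n, 64))
    print(min_bits(n, 2**-10))
```

Under this model, $2^k > n^2$ only pushes the bound to about one half, so a smaller failure probability needs a few extra bits beyond $2 \log_2 n$.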

  5. Example – URL Fingerprints • Search Engines • Manage large numbers of URL strings • Long, variable strings (embedded objects/database-queries) • Desiderata • small/fixed-length encodings – hopefully, unique • Some scenarios • Exact string irrelevant • Only need ability to distinguish distinct URLs • Even otherwise, unique IDs useful for indexing • Numbers? • 4 billion webpages ⇒ $n = 2^{32}$ • $N \approx n^2 \Rightarrow k = 64$ • Fingerprints: 8-byte representation

  6. Fingerprinting vs Hashing • Hashing $h : W \to \{0,1\}^k$ • Set Membership testing for set S of size n • Desire uniform distribution over bin addresses $\{0,1\}^k$ • Minimize collisions per bin – reduce lookup time • Minimize hash table size ⇒ $n \approx N = 2^k$ • Fingerprinting $f : W \to \{0,1\}^k$ • Object Equality testing over set S of size n • Distribution over $\{0,1\}^k$ is irrelevant • Avoid collisions altogether • Tolerate larger k – typically $N > n^2$

  7. Fingerprinting Strings • Typical Application – but techniques extend to combinatorial objects (database tuples, trees/graphs) • Obvious techniques • Checksum – no worst-case collision probability guarantees • MD5 – cryptographically-secure string hashes • relatively slow • avoids leaking information about original string • Rabin’s Scheme • Algebraic technique – polynomial arithmetic • Efficient – need (1 table lookup + 1 xor + 1 shift) per byte • other nice properties…

  8. Rabin Fingerprints • Consider – m-bit string $A = a_1 a_2 \ldots a_m$ • Assume – $a_1 = 1$ and fixed-length strings (wlog) • Encoding Strings • Polynomials of degree $m-1$ over $Z_2$ • $A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_{m-1} x + a_m$ • Fingerprints • $P(x)$: random, irreducible degree-k polynomial over $Z_2$ (easy to sample such polynomials) • Irreducible: $x^2+x+1$ cannot be factored, unlike $x^2+1 = (x+1)^2$ • $f(A) = A(x) \bmod P(x)$
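A minimal sketch of $f(A) = A(x) \bmod P(x)$, storing a polynomial over $Z_2$ as a Python integer whose bit i is the coefficient of $x^i$. Two things here are assumptions for illustration and not part of the slides: the modulus is fixed to $x^8 + x^4 + x^3 + x + 1$ (a well-known irreducible polynomial) instead of being sampled at random, and a 1 bit is prepended to the input so that the leading coefficient $a_1$ is always 1. The function names are hypothetical.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2 (bit i = coefficient of x^i)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # Cancel the leading term of a(x) with a shifted copy of p(x).
        shift = a.bit_length() - 1 - deg_p
        a ^= p << shift
    return a


def rabin_fingerprint(data: bytes, p: int) -> int:
    """Fingerprint of a byte string: A(x) mod P(x), with a leading 1 bit prepended."""
    a = int.from_bytes(b"\x01" + data, "big")
    return poly_mod(a, p)


if __name__ == "__main__":
    P = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over Z_2 (illustrative choice)
    print(hex(rabin_fingerprint(b"http://example.com/", P)))
```

A production scheme would use a much larger degree k (the slides suggest 64) and the table-driven byte-at-a-time update mentioned on the previous slide; the bit-by-bit loop above only shows the arithmetic.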

  9. Analysis • Fix S – n strings of length m • Consider $Q_S(x) = \prod_{A \ne B \in S} (A(x) - B(x))$ • Collision $f(A) = f(B)$ ⇔ $A(x) \equiv B(x) \pmod{P(x)}$ ⇒ $Q_S(x) \equiv 0 \pmod{P(x)}$ • Therefore – $P(x)$ is a factor of $Q_S(x)$ • Collision Probability? • $\deg(Q_S) \le n^2 m$ • number of irreducible degree-k factors of $Q_S(x)$ is $< n^2 m / k$ • Fact: number of irreducible degree-k polynomials $> (2^k - 2^{k/2})/k$ • $\Pr[\text{random } P(x) \text{ divides } Q_S(x)] < n^2 m / 2^k$ • $\Pr[\text{fingerprints not distinct}] < n^2 m / 2^k$
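To get a feel for the bound $n^2 m / 2^k$, the short sketch below evaluates it in log scale. The example values (about 4 billion strings, 1 KB each) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def rabin_bound_log2(n: int, m: int, k: int) -> float:
    """log2 of the bound n^2 * m / 2^k on P[fingerprints not distinct]."""
    return 2 * math.log2(n) + math.log2(m) - k


def bits_needed(n: int, m: int, log2_failure: float) -> int:
    """Smallest k for which n^2 * m / 2^k <= 2**log2_failure."""
    return math.ceil(2 * math.log2(n) + math.log2(m) - log2_failure)


if __name__ == "__main__":
    n = 2**32  # number of strings (assumed)
    m = 2**13  # string length in bits, i.e. 1 KB (assumed)
    print(rabin_bound_log2(n, m, 64))
    print(bits_needed(n, m, -20))
```

Because the string length m enters the worst-case bound, an adversarial guarantee needs somewhat more than $2 \log_2 n$ bits; the bound is pessimistic, so shorter fingerprints may still behave well on non-adversarial inputs.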

  10. Beneficial Properties • Hardware-level implementation • $Z_2$-polynomials same as bit strings • simple shift-register operations • Distributivity – $f(A+B) = f(A) + f(B)$ over $Z_2$ • Let $\circ$ denote concatenation • $f(A \circ B) = f(f(A) \circ B)$ • $f(A \circ B) = (A(x) \cdot x^m + B(x)) \bmod P(x)$, where m is the length of B • Fingerprint sliding windows over strings – low incremental cost
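The two algebraic properties can be checked directly on small inputs. This is a sketch under the same illustrative assumptions as before: polynomials over $Z_2$ are stored as integers (bit i is the coefficient of $x^i$), addition is XOR, and the modulus is fixed to the irreducible polynomial $x^8 + x^4 + x^3 + x + 1$ instead of a random one. The helper names are hypothetical.

```python
P = 0x11B  # x^8 + x^4 + x^3 + x + 1 (illustrative fixed modulus)


def poly_mod(a: int, p: int = P) -> int:
    """Remainder of a(x) divided by p(x) over Z_2."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def concat(a: int, b: int, b_bits: int) -> int:
    """A concatenated with B as a polynomial: A(x) * x^|B| + B(x)."""
    return (a << b_bits) | b


def fingerprint_stream(data: bytes) -> int:
    """Fingerprint a byte stream incrementally, one concatenation step per byte."""
    fp = 0
    for byte in data:
        fp = poly_mod(concat(fp, byte, 8))
    return fp


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0xCAFE
    print(poly_mod(a ^ b) == poly_mod(a) ^ poly_mod(b))
    print(poly_mod(concat(a, b, 16)) == poly_mod(concat(poly_mod(a), b, 16)))
```

The streaming version only ever keeps the current fingerprint, which is what makes the per-byte cost low; the table lookup mentioned earlier in the lecture replaces the inner reduction loop.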

  11. Duplicate Document Detection • Problem • Given – large collection of arbitrary documents • Identify – near-duplicate documents • Web search engines • Proliferation of near-duplicate documents • Legitimate – mirrors, local copies, updates, … • Malicious – spam, spider-traps, dynamic URLs, … • Mistaken – spider errors • 30% of web-pages are near-duplicates [Broder et al 1997] • Cost – RAM/disk, search quality, unhappy users • Enterprise search – even larger amount of duplication • SCAM – plagiarism detection [Shivakumar et al 1998]

  12. Natural Approaches • Fingerprinting? • only works for exact matches • here – must identify even near-duplicates • Random Sampling? • sample substrings (phrases, sentences, etc) • hope: similar documents ⇒ similar samples • No – even samples of same document will differ • Edit-distance? • metric for approximate string-matching • expensive – even for one pair of strings • impossible – for $10^{32}$ web documents

  13. Desiderata • Storage • only small sketches of each document • Computation • O(n log n) time on n documents • Stream Processing • once sketch computed, source is unavailable • Error Guarantees • problem scale ⇒ small biases have large impact • need formal guarantees – heuristics will not do

  14. Basic Idea [Broder 1997] • Shingling • dissect document into q-grams (shingles) • represent documents by shingle-sets • near-duplicates ⇒ shingle-sets intersection is large • reduce problem to set intersection • Set Intersection • fingerprints of shingles • min-hash to estimate intersection sizes

  15. Shingling • Shingle – q contiguous tokens/words (q-gram) • Consider following “document”: a rose is a rose is a rose • Choose q=4, get multi-set of shingles: “a rose is a”, “rose is a rose”, “is a rose is”, “a rose is a”, “rose is a rose”
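The shingling step can be sketched in a few lines. This version assumes whitespace tokenization, which the slides do not specify, and the function name is hypothetical.

```python
from collections import Counter


def shingles(text: str, q: int) -> list:
    """All q-grams of consecutive whitespace-separated tokens, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]


if __name__ == "__main__":
    doc = "a rose is a rose is a rose"
    for shingle, count in Counter(shingles(doc, 4)).items():
        print(count, shingle)
```

A document with t tokens yields t - q + 1 shingles, so the 8-token example produces 5 shingles, two of which repeat.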

  16. Documents → Sets of 64-bit fingerprints • [Figure: Doc → (shingling) → Multiset of Shingles → (fingerprint) → Multiset of Fingerprints] • Fingerprints? • Use Rabin fingerprints • Fingerprint space U = [0, …, N-1] • In practice, use 64-bit fingerprints, i.e., $N = 2^{64}$ • Result – uniformity in length of strings

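  17. Similarity of Documents • [Figure: Doc A and Doc B with their shingle sets $S_A$ and $S_B$] • Jaccard measure – similarity of $S_A, S_B \subseteq U = [0 \ldots N-1]$: $\mathrm{sim}(S_A,S_B) = |S_A \cap S_B| / |S_A \cup S_B|$ • Claim: A & B are near-duplicates if $\mathrm{sim}(S_A,S_B)$ is high • Containment: $\mathrm{con}(S_A,S_B) = |S_A \cap S_B| / |S_A|$ • Claim: A is contained in B if $\mathrm{con}(S_A,S_B)$ is high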

  18. Remarks • Multiplicities of q-grams – could retain or ignore • trade-off efficiency with precision • Shingle Size $q \in [3 \ldots 10]$ • Short shingles increase similarity of unrelated documents • With q=1, $\mathrm{sim}(S_A,S_B) = 1$ whenever A is a permutation of B • Need larger q to sensitize to permutation changes • Long shingles ⇒ small random changes have larger impact • Similarity Measure • Similarity is non-transitive, non-metric • But – dissimilarity $1 - \mathrm{sim}(S_A,S_B)$ is a metric [Charikar 02] • [Ukkonen 92] – relate q-gram & edit-distance

  19. Example • A = “a rose is a rose is a rose” • B = “a rose is a flower which is a rose” • Preserving multiplicity • q=1 ⇒ $\mathrm{sim}(S_A,S_B) = 0.7$ • $S_A$ = {a, a, a, is, is, rose, rose, rose} • $S_B$ = {a, a, a, is, is, rose, rose, flower, which} • q=2 ⇒ $\mathrm{sim}(S_A,S_B) = 0.5$ • q=3 ⇒ $\mathrm{sim}(S_A,S_B) = 0.3$ • Disregarding multiplicity • q=1 ⇒ $\mathrm{sim}(S_A,S_B) = 0.6$ • q=2 ⇒ $\mathrm{sim}(S_A,S_B) = 0.5$ • q=3 ⇒ $\mathrm{sim}(S_A,S_B) = 0.4285$
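The six similarity values in this example can be reproduced with a short script. It assumes whitespace tokenization and, for the multiset case, takes the intersection as the element-wise minimum of counts and the union as the element-wise maximum; the slides do not spell out these conventions, but they match the stated numbers. The function names are hypothetical.

```python
from collections import Counter


def shingles(text, q):
    """All q-grams of consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]


def jaccard(a: Counter, b: Counter) -> float:
    """|A ∩ B| / |A ∪ B| for multisets (min and max of counts)."""
    intersection = sum((a & b).values())
    union = sum((a | b).values())
    return intersection / union if union else 0.0


def sim(doc_a, doc_b, q, keep_multiplicity=True):
    """Jaccard similarity of the q-shingle (multi)sets of two documents."""
    sa, sb = Counter(shingles(doc_a, q)), Counter(shingles(doc_b, q))
    if not keep_multiplicity:
        sa, sb = Counter(set(sa)), Counter(set(sb))  # every count becomes 1
    return jaccard(sa, sb)


if __name__ == "__main__":
    A = "a rose is a rose is a rose"
    B = "a rose is a flower which is a rose"
    for keep in (True, False):
        for q in (1, 2, 3):
            print(keep, q, round(sim(A, B, q, keep), 4))
```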

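  20. Min-Hashing • Consider • $S_A, S_B \subseteq U$ • Pick – random permutation π of U • Define $\alpha = \pi^{-1}(\min\{\pi(S_A)\})$ and $\beta = \pi^{-1}(\min\{\pi(S_B)\})$ • Meaning? – minimal element under permutation π • Lemma: $\Pr[\alpha = \beta] = |S_A \cap S_B| / |S_A \cup S_B| = \mathrm{sim}(S_A,S_B)$ • Let $\delta = \min\{\pi(S_A \cup S_B)\}$ • Claim: $\alpha = \beta \Leftrightarrow \pi^{-1}(\delta) \in S_A \cap S_B$ • Clearly $\pi^{-1}(\delta)$ is uniformly distributed over $S_A \cup S_B$, so $\Pr[\pi^{-1}(\delta) \in S_A \cap S_B] = |S_A \cap S_B| / |S_A \cup S_B|$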
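For a tiny universe the min-hash collision probability can be computed exactly by enumerating every permutation, which makes the lemma easy to check. The sets and universe below are made-up illustrative values, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(s, pi):
    """Element of s with the smallest image under the permutation pi (a dict)."""
    return min(s, key=pi.__getitem__)


def exact_match_probability(sa, sb, universe):
    """P[min-hash of sa equals min-hash of sb] over all permutations of universe."""
    universe = list(universe)
    matches = total = 0
    for image in permutations(range(len(universe))):
        pi = dict(zip(universe, image))
        matches += min_hash(sa, pi) == min_hash(sb, pi)
        total += 1
    return Fraction(matches, total)


if __name__ == "__main__":
    U = range(6)
    SA, SB = {0, 1, 2, 3}, {2, 3, 4}
    jaccard = Fraction(len(SA & SB), len(SA | SB))
    print(exact_match_probability(SA, SB, U), jaccard)
```

Enumeration is only feasible for toy universes (6 elements give 720 permutations), which is the practical problem that motivates restricted permutation families.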

  21. Min-Hashing • Similarity Sketches • Succinct representation of fingerprint sets $S_A$ • Allows efficient estimation of $\mathrm{sim}(S_A,S_B)$ • Basic idea – use min-hash of fingerprints • sk(A) = k minimal elements under $\pi(S_A)$ • Claim: $E[\mathrm{sim}(sk(A), sk(B))] = \mathrm{sim}(S_A,S_B)$ • For each element of $sk(A) \cup sk(B)$ … • Observe • sketch-similarity is unbiased estimator of similarity • reducing variance – use larger k
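A minimal sketch of a k-minimum sketch. Two choices here are assumptions and not taken from the slides: a 64-bit BLAKE2b digest stands in for the random permutation π, and the estimator is the common bottom-k variant (take the k smallest values of the combined sketches and count how many appear in both), which differs from the plain Jaccard similarity of the two sketches written on the slide. Names are hypothetical.

```python
import hashlib


def hash64(item: str) -> int:
    """64-bit hash standing in for the image of item under a random permutation."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(items, k):
    """The k smallest hash values of the set."""
    return sorted({hash64(x) for x in items})[:k]


def estimate_sim(sk_a, sk_b, k):
    """Fraction of the k smallest values of the combined sketches that are in both."""
    smallest = sorted(set(sk_a) | set(sk_b))[:k]
    if not smallest:
        return 0.0
    both = set(sk_a) & set(sk_b)
    return sum(v in both for v in smallest) / len(smallest)


if __name__ == "__main__":
    A = {str(i) for i in range(0, 1000)}
    B = {str(i) for i in range(500, 1500)}  # true similarity is 1/3
    k = 400
    print(estimate_sim(sketch(A, k), sketch(B, k), k))
```

Only one pass over each document is needed and the sketch has fixed size k, which fits the storage and stream-processing desiderata listed earlier.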

  22. Remarks • Implementation • shingle/fingerprint/sketch document in streams • Issue – cost of pairwise comparison of sketches? • cluster sketch-streams [Broder et al, Guha et al] • Open? – hashing sketches to identify similarity • [Broder-Mitzenmacher 99] – Min-Hash is only unbiased estimator • [Indyk-Motwani 99] – Locality-Sensitive Hash • collisions more likely for similar items • Min-Hash is special case

  23. Multiple Permutations • Better Variance Reduction • Instead of larger k, stick with k=1 • Multiple, independent permutations • Sketch Construction • Pick p random permutations of U – $\pi_1, \pi_2, \ldots, \pi_p$ • sk(A) = minimal elements under $\pi_1(S_A), \ldots, \pi_p(S_A)$ • Claim: $E[\mathrm{sim}(sk(A), sk(B))] = \mathrm{sim}(S_A,S_B)$ • Earlier lemma true for p=1 • Linearity of expectations • Variance reduction – independence of $\pi_1, \ldots, \pi_p$
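The construction with p permutations can be sketched as follows. As an assumption for illustration, each random permutation $\pi_i$ is replaced by a salted 64-bit BLAKE2b hash, and sketch similarity is taken to be the fraction of the p coordinates on which the two sketches agree. Names are hypothetical.

```python
import hashlib


def hash_with_seed(item: str, seed: int) -> int:
    """64-bit salted hash; seed i stands in for the i-th random permutation."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def sketch(items, p):
    """One minimum per hash function: a vector of p min-hash values."""
    return [min(hash_with_seed(x, i) for x in items) for i in range(p)]


def estimate_sim(sk_a, sk_b):
    """Fraction of coordinates on which the two sketches agree."""
    return sum(u == v for u, v in zip(sk_a, sk_b)) / len(sk_a)


if __name__ == "__main__":
    A = {str(i) for i in range(0, 200)}
    B = {str(i) for i in range(100, 300)}  # true similarity is 1/3
    print(estimate_sim(sketch(A, 400), sketch(B, 400)))
```

Each coordinate agrees with probability equal to the similarity, so the estimate is an average of p indicator variables and its standard deviation shrinks like $1/\sqrt{p}$. The cost is p hash evaluations per shingle, compared with one for the k-minimum sketch.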

  24. Min-Wise Indep Permutations • Problem • Truly-random π over U = [0 … N-1] is infeasible • But – do we really need true randomness? • Solution • Poly-size family of permutations $F \subseteq S_N$ over U • Choosing/representing random $\pi \in F$ is easy • Min-Wise Independence (MWI) Property: for all sets $X \subseteq U$, for all $x \in X$, $\Pr_{\pi \in F}[\min\{\pi(X)\} = \pi(x)] = 1/|X|$
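The MWI property can be verified by brute force on very small universes. The sketch below checks it for two families chosen purely for illustration: the full symmetric group on 4 elements, and the linear maps $x \mapsto (ax + b) \bmod 5$ with $a \ne 0$. All names are hypothetical.

```python
from itertools import combinations, permutations


def is_min_wise_independent(family, universe):
    """True if every x in every subset X is the minimum for exactly |F|/|X| permutations."""
    universe = list(universe)
    for size in range(1, len(universe) + 1):
        for subset in combinations(universe, size):
            for x in subset:
                hits = sum(
                    1 for pi in family if min(subset, key=pi.__getitem__) == x
                )
                if hits * size != len(family):
                    return False
    return True


def symmetric_group(n):
    """All n! permutations of range(n), each as a dict element -> image."""
    return [dict(enumerate(image)) for image in permutations(range(n))]


def linear_family(p):
    """The p*(p-1) permutations x -> (a*x + b) mod p of Z_p, with a != 0."""
    return [
        {x: (a * x + b) % p for x in range(p)}
        for a in range(1, p)
        for b in range(p)
    ]


if __name__ == "__main__":
    print(is_min_wise_independent(symmetric_group(4), range(4)))
    print(is_min_wise_independent(linear_family(5), range(5)))
```

The linear family has 20 members, and 20 is not divisible by 3, so no element of a 3-element set can be the minimum for exactly one third of the family. The same divisibility argument shows that an exactly min-wise independent family must have a size divisible by every possible subset size.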

  25. Minimum-Size MWI Families • [Broder et al 98] • Upper/lower bounds of lcm(1,2,…,N) • Problem – exponential in N • Approximate MWI Families • Relax to $\left| \Pr_{\pi \in F}[\min\{\pi(X)\} = \pi(x)] - 1/|X| \right| \le \epsilon / |X|$ • Non-constructive – polynomial-size • Constructive – size $N^{O(\log 1/\epsilon)}$ [Indyk 99] • In practice – 2-universal hashes work well!
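The closing remark about 2-universal hashes can be sketched with the standard linear family $h(x) = (ax + b) \bmod p$. Everything specific below is an assumption for illustration: the prime $p = 2^{61} - 1$, the seeded `random.Random` generator, and the requirement that fingerprints are non-empty sets of integers in $[0, p)$. Such hashes are only approximately min-wise independent, so the estimate carries a small bias in addition to sampling noise. Names are hypothetical.

```python
import random

MERSENNE_P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1


def make_hashes(count, seed=0):
    """Draw `count` pairs (a, b) defining h(x) = (a*x + b) mod p, with a != 0."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_P), rng.randrange(MERSENNE_P))
        for _ in range(count)
    ]


def min_hash_sketch(fingerprints, hashes):
    """One minimum per hash function; fingerprints must be non-empty ints in [0, p)."""
    return [min((a * x + b) % MERSENNE_P for x in fingerprints) for a, b in hashes]


def estimate_sim(sk_a, sk_b):
    """Fraction of coordinates on which the two sketches agree."""
    return sum(u == v for u, v in zip(sk_a, sk_b)) / len(sk_a)


if __name__ == "__main__":
    hashes = make_hashes(200, seed=1)
    A = set(range(0, 1000))
    B = set(range(500, 1500))
    print(estimate_sim(min_hash_sketch(A, hashes), min_hash_sketch(B, hashes)))
```

Each hash function needs only two stored integers, in contrast to an explicit random permutation of a 64-bit universe.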

  26. References I • Fingerprinting by random polynomials. M. Rabin. Technical Report TR-15-81, Harvard University (1981). • Some applications of Rabin's fingerprinting method. A. Broder. Sequences II (1993). • On the Resemblance and Containment of Documents. A. Broder. SEQUENCES 1997. • Syntactic Clustering of the Web. A. Broder, S. Glassman, M. Manasse, and G. Zweig. WWW 1997. • Finding near-replicas of documents on the web. N. Shivakumar and H. Garcia-Molina. WebDB 1998. • Identifying and Filtering Near-Duplicate Documents. A. Broder. CPM 2000.

  27. References II • Approximate String Matching with q-grams and Maximal Matches. E. Ukkonen. Theoretical Computer Science (1992). • Completeness and Robustness Properties of Min-Wise Independent Permutations. A. Broder and M. Mitzenmacher. • Min-Wise Independent Permutations, A. Broder, M. Charikar, A. Frieze and M. Mitzenmacher, JCSS (2000). • A Small Approximately min-wise Independent Family of Hash Functions. P. Indyk. SODA 1999. • Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality, P. Indyk and R. Motwani. STOC 1998. • Similarity Search in High Dimensions via Hashing, A. Gionis, P. Indyk, and R. Motwani. VLDB 1999. • Similarity Estimation Techniques from Rounding Algorithms, M. Charikar, STOC 2002.
