CS 361A (Advanced Data Structures and Algorithms)
Lecture 18 (Nov 30, 2005)
Fingerprints, Min-Hashing, and Document Similarity
Rajeev Motwani
Game Plan for Week
• Fingerprints
• Document Similarity
• Shingling
• Min-Hashing
• Min-Wise Independent Permutations
Fingerprints
• W – set of large objects (e.g., URLs)
• Goal
  • avoid storing large objects explicitly
  • quick-and-dirty equality testing
• Fingerprints?
  • Short tags for objects
  • Distinct fingerprints ⇒ distinct objects
  • Distinct objects ⇒ probably distinct fingerprints
Formalization
• Fingerprint length k ⇒ fingerprint space size N = 2^k
• Fingerprint function family F = { f : W → {0,1}^k }
• Random f ∈_R F
  • f(A) ≠ f(B) ⇒ A ≠ B
  • Collisions: P[ f(A) = f(B) | A ≠ B ] ≈ 0 (ideally 2^{-O(k)})
• Typical Application
  • Adversarial object-set S with |S| = n << 2^k
  • Goal – |f(S)| = |S| with high probability
  • ~n^2 pair-wise collisions possible ⇒ need 2^k > n^2 (to avoid the Birthday Paradox)
Example – URL Fingerprints
• Search Engines
  • Manage large numbers of URL strings
  • Long, variable-length strings (embedded objects/database queries)
• Desiderata
  • small/fixed-length encodings – hopefully unique
• Some scenarios
  • Exact string irrelevant – only need the ability to distinguish distinct URLs
  • Even otherwise, unique IDs useful for indexing
• Numbers?
  • 4 billion webpages ⇒ n = 2^32
  • N ≈ n^2 ⇒ k = 64
  • Fingerprints ⇒ 8-byte representation
Fingerprinting vs Hashing
• Hashing h : W → {0,1}^k
  • Set-membership testing for a set S of size n
  • Desire uniform distribution over bin addresses {0,1}^k
  • Minimize collisions per bin – reduce lookup time
  • Minimize hash table size ⇒ n ≈ N = 2^k
• Fingerprinting f : W → {0,1}^k
  • Object-equality testing over a set S of size n
  • Distribution over {0,1}^k is irrelevant
  • Avoid collisions altogether
  • Tolerate larger k – typically N > n^2
Fingerprinting Strings
• Typical application – but techniques extend to combinatorial objects (database tuples, trees/graphs)
• Obvious techniques
  • Checksum – no worst-case collision-probability guarantees
  • MD5 – cryptographically secure string hashes
    • relatively slow
    • avoids leaking information about the original string
• Rabin's Scheme
  • Algebraic technique – polynomial arithmetic
  • Efficient – needs only (1 table lookup + 1 xor + 1 shift) per byte
  • other nice properties …
Rabin Fingerprints
• Consider – m-bit string A = a_1 a_2 … a_m
• Assume – a_1 = 1 and fixed-length strings (wlog)
• Encoding Strings
  • Degree-(m−1) polynomials over Z_2
  • A(x) = a_1 x^{m−1} + a_2 x^{m−2} + … + a_{m−1} x + a_m
• Fingerprints
  • P(x): random, irreducible degree-k polynomial over Z_2 (easy to sample such polynomials)
  • irreducible – e.g., x^2 + x + 1; unlike x^2 + 1 = (x+1)^2, which factors over Z_2
  • f(A) = A(x) mod P(x)
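To make the arithmetic concrete, here is a minimal Python sketch (my illustration, not the lecture's code) that computes f(A) = A(x) mod P(x) by streaming the bits of A and reducing by P(x) whenever the running remainder reaches degree k. The function name, the integer-bitmask encoding of polynomials, and the example polynomial x^3 + x + 1 are illustrative assumptions.

```python
def rabin_fingerprint(bits, P, k):
    """Compute A(x) mod P(x) over Z_2.

    bits: message bits a_1 ... a_m (most significant first)
    P:    irreducible degree-k polynomial, encoded as an int whose binary
          digits are its coefficients (bit k = coefficient of x^k)
    k:    degree of P
    """
    fp = 0
    for b in bits:
        fp = (fp << 1) | b        # multiply the remainder by x, then add the new bit
        if (fp >> k) & 1:         # degree reached k: subtract (= XOR) P(x) to reduce
            fp ^= P
    return fp                     # k-bit fingerprint f(A)

# Example with the irreducible polynomial P(x) = x^3 + x + 1 (k = 3):
P, k = 0b1011, 3
A = [1, 0, 1, 1, 0, 1]            # A(x) = x^5 + x^3 + x^2 + 1
print(rabin_fingerprint(A, P, k)) # -> 1, i.e., A(x) ≡ 1 (mod P(x))
```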
Analysis
• Fix S – n strings of length m
• Consider Q_S(x) = ∏_{A, B ∈ S, A ≠ B} ( A(x) − B(x) ) over Z_2
• Collision f(A) = f(B) ⇒ A(x) = B(x) mod P(x) ⇒ Q_S(x) = 0 mod P(x)
• Therefore – P(x) is a factor of Q_S(x)
• Collision Probability?
  • degree(Q_S) ≤ n^2 m
  • number of irreducible degree-k factors of Q_S(x) is < n^2 m / k
  • Fact: number of irreducible degree-k polynomials over Z_2 is > (2^k − 2^{k/2}) / k
  • Prob[ random P(x) divides Q_S(x) ] < n^2 m / 2^k (approximately)
  • Prob[ fingerprints not distinct ] < n^2 m / 2^k
Beneficial Properties
• Hardware-level implementation
  • Z_2-polynomials are the same as bit-strings
  • simple shift-register operations
• Distributivity – f(A + B) = f(A) + f(B) over Z_2
• Let ∘ denote concatenation
  • f(A ∘ B) = f( f(A) ∘ B )
  • f(A ∘ B) = A(x)·x^{|B|} + B(x) mod P(x)
• ⇒ can fingerprint sliding windows over strings at low incremental cost
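A sketch of how the concatenation identity can be used in code (again my own illustration under the same bitmask encoding as above): given f(A), f(B), and |B|, the fingerprint of A ∘ B follows from one multiplication by x^{|B|} mod P(x). The helper names are assumptions.

```python
def poly_mulmod(a, b, P, k):
    """Multiply two Z_2 polynomials (as bitmasks of degree < k) modulo P(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a                # add a * (current power of x)
        b >>= 1
        a <<= 1                   # a := a * x
        if (a >> k) & 1:
            a ^= P                # keep a reduced modulo P(x)
    return r

def x_power_mod(e, P, k):
    """Compute x^e mod P(x) by repeated squaring."""
    result, base = 1, 0b10        # the polynomials 1 and x
    while e:
        if e & 1:
            result = poly_mulmod(result, base, P, k)
        base = poly_mulmod(base, base, P, k)
        e >>= 1
    return result

def combine(fpA, fpB, lenB, P, k):
    """f(A ∘ B) from f(A), f(B), |B|, via f(A ∘ B) = A(x)·x^|B| + B(x) mod P(x)."""
    return poly_mulmod(fpA, x_power_mod(lenB, P, k), P, k) ^ fpB

# Check against direct computation (reuses rabin_fingerprint, P, k from the sketch above):
A, B = [1, 0, 1, 1], [1, 1, 0]
assert combine(rabin_fingerprint(A, P, k), rabin_fingerprint(B, P, k), len(B), P, k) \
       == rabin_fingerprint(A + B, P, k)
```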
Duplicate Document Detection
• Problem
  • Given – a large collection of arbitrary documents
  • Identify – near-duplicate documents
• Web search engines
  • Proliferation of near-duplicate documents
    • Legitimate – mirrors, local copies, updates, …
    • Malicious – spam, spider-traps, dynamic URLs, …
    • Mistaken – spider errors
  • 30% of web-pages are near-duplicates [Broder et al 1997]
  • Cost – RAM/disk, search quality, unhappy users
• Enterprise search – even larger amount of duplication
• SCAM – plagiarism detection [Shivakumar et al 1998]
Natural Approaches
• Fingerprinting?
  • only works for exact matches
  • here – must identify even near-duplicates
• Random Sampling?
  • sample substrings (phrases, sentences, etc.)
  • hope: similar documents ⇒ similar samples
  • No – even samples of the same document will differ
• Edit-distance?
  • metric for approximate string-matching
  • expensive – even for one pair of strings
  • impossible – for 10^{32} web documents
Desiderata
• Storage – only small sketches of each document
• Computation – O(n log n) time on n documents
• Stream Processing – once a sketch is computed, the source is unavailable
• Error Guarantees
  • problem scale ⇒ small biases have large impact
  • need formal guarantees – heuristics will not do
Basic Idea [Broder 1997]
• Shingling
  • dissect document into q-grams (shingles)
  • represent documents by shingle-sets
  • near-duplicates ⇒ shingle-sets with large intersection
  • reduces the problem to set intersection
• Set Intersection
  • fingerprints of shingles
  • min-hash to estimate intersection sizes
Shingling
• Shingle – q contiguous tokens/words (q-gram)
• Consider the following "document":
  a rose is a rose is a rose
• Choose q = 4 ⇒ get this multi-set of shingles:
  a rose is a
  rose is a rose
  is a rose is
  a rose is a
  rose is a rose
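As a concrete illustration (not from the slides), here is a small Python sketch that produces the multi-set of word-level q-gram shingles of a document; the whitespace tokenization and the function name are assumptions made for simplicity.

```python
from collections import Counter

def shingles(text, q):
    """Return the multi-set of q-gram shingles (q contiguous words) of text."""
    words = text.split()                                   # simple whitespace tokenization
    grams = [" ".join(words[i:i + q]) for i in range(len(words) - q + 1)]
    return Counter(grams)                                  # Counter = multi-set with multiplicities

print(shingles("a rose is a rose is a rose", 4))
# Counter({'a rose is a': 2, 'rose is a rose': 2, 'is a rose is': 1})
```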
Doc → shingling → Multiset of Shingles → fingerprint → Multiset of Fingerprints; Documents → Sets of 64-bit fingerprints
• Fingerprints?
  • Use Rabin fingerprints
  • Fingerprint space U = [0, …, N−1]
  • In practice, use 64-bit fingerprints, i.e., N = 2^64
• Result – uniform, fixed-length representation regardless of shingle length
Similarity of Documents
• Doc A → shingle-set S_A ⊆ U; Doc B → shingle-set S_B ⊆ U; U = [0 … N−1]
• Jaccard measure – similarity of S_A, S_B: sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
• Containment of A in B: con(S_A, S_B) = |S_A ∩ S_B| / |S_A|
• Claim: A & B are near-duplicates if sim(S_A, S_B) is high
• Claim: A is contained in B if con(S_A, S_B) is high
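To make the two measures concrete, a small Python sketch (my own illustration) using multiset semantics via Counter, so it also covers the "preserving multiplicity" variant discussed in the next slides:

```python
from collections import Counter

def jaccard(SA, SB):
    """sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB| for multisets given as Counters."""
    inter = sum((SA & SB).values())      # Counter &: element-wise minimum of counts
    union = sum((SA | SB).values())      # Counter |: element-wise maximum of counts
    return inter / union

def containment(SA, SB):
    """con(SA, SB) = |SA ∩ SB| / |SA| – how much of A appears in B."""
    return sum((SA & SB).values()) / sum(SA.values())
```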
Remarks
• Multiplicities of q-grams – could retain or ignore
  • trade-off: efficiency vs. precision
• Shingle size q ∈ [3 … 10]
  • Short shingles increase similarity of unrelated documents
    • With q = 1, sim(S_A, S_B) = 1 whenever A is a permutation of B
    • Need larger q to become sensitive to such reorderings
  • Long shingles ⇒ small random changes have a larger impact
• Similarity Measure
  • Similarity is non-transitive and not a metric
  • But – dissimilarity 1 − sim(S_A, S_B) is a metric [Charikar 02]
  • [Ukkonen 92] – relates q-grams & edit-distance
Example
• A = "a rose is a rose is a rose"
• B = "a rose is a flower which is a rose"
• Preserving multiplicity
  • q = 1 ⇒ sim(S_A, S_B) = 0.7
    • S_A = {a, a, a, is, is, rose, rose, rose}
    • S_B = {a, a, a, is, is, rose, rose, flower, which}
  • q = 2 ⇒ sim(S_A, S_B) = 0.5
  • q = 3 ⇒ sim(S_A, S_B) = 0.3
• Disregarding multiplicity
  • q = 1 ⇒ sim(S_A, S_B) = 0.6
  • q = 2 ⇒ sim(S_A, S_B) = 0.5
  • q = 3 ⇒ sim(S_A, S_B) = 3/7 ≈ 0.4285
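The numbers above can be reproduced with the shingles and jaccard sketches given earlier (again an illustration, not the lecture's code):

```python
A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
for q in (1, 2, 3):
    SA, SB = shingles(A, q), shingles(B, q)
    with_mult = jaccard(SA, SB)                       # preserving multiplicity
    setA, setB = set(SA), set(SB)                     # drop multiplicities
    without = len(setA & setB) / len(setA | setB)
    print(q, round(with_mult, 4), round(without, 4))
# 1 0.7 0.6
# 2 0.5 0.5
# 3 0.3 0.4286
```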
Min-Hashing
• Consider S_A, S_B ⊆ U
• Pick – a random permutation π of U
• Define α = π^{-1}( min{ π(S_A) } ) and β = π^{-1}( min{ π(S_B) } )
• Meaning? – α (resp. β) is the minimal element of S_A (resp. S_B) under the permutation π
• Lemma: P[ α = β ] = sim(S_A, S_B)
  • Let δ = min{ π(S_A ∪ S_B) }
  • Claim: α = β ⇔ π^{-1}(δ) ∈ S_A ∩ S_B
  • Clearly, π^{-1}(δ) is uniform over S_A ∪ S_B, so P[ π^{-1}(δ) ∈ S_A ∩ S_B ] = |S_A ∩ S_B| / |S_A ∪ S_B| = sim(S_A, S_B)
Min-Hashing
• Similarity Sketches
  • Succinct representation of the fingerprint sets S_A
  • Allows efficient estimation of sim(S_A, S_B)
  • Basic idea – use min-hash of fingerprints
• Sketch: sk(A) = the k minimal elements under π(S_A)
• Claim: E[ sim(sk(A), sk(B)) ] = sim(S_A, S_B)
  • For each pair, compare sk(A) and sk(B)
• Observe
  • sketch-similarity is an unbiased estimator of similarity
  • reducing variance – use larger k
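A minimal Python sketch (my illustration) of the bottom-k construction under a single permutation; the toy universe, the explicitly stored permutation, and the function names are assumptions made so the snippet stays self-contained.

```python
import random

def minhash_sketch(S, pi, k):
    """sk(S): the k minimal elements of S under the permutation pi (a dict U -> U)."""
    return set(sorted(S, key=lambda x: pi[x])[:k])

def estimate_sim(skA, skB):
    """Estimate sim(S_A, S_B) by the Jaccard similarity of the two sketches."""
    return len(skA & skB) / len(skA | skB)

# Toy universe of "fingerprints" and one random permutation of it
U = list(range(1000))
pi = dict(zip(U, random.sample(U, len(U))))

SA = set(range(0, 600))          # two overlapping sets with true sim = 200/1000 = 0.2
SB = set(range(400, 1000))
print(estimate_sim(minhash_sketch(SA, pi, 100), minhash_sketch(SB, pi, 100)))  # ≈ 0.2
```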
Remarks
• Implementation – shingle/fingerprint/sketch documents in streaming fashion
• Issue – cost of pairwise comparison of sketches?
  • cluster sketch-streams [Broder et al, Guha et al]
  • Open? – hashing sketches to identify similarity
• [Broder-Mitzenmacher 99] – Min-Hash is the only unbiased estimator
• [Indyk-Motwani 99] – Locality-Sensitive Hashing
  • collisions more likely for similar items
  • Min-Hash is a special case
Multiple Permutations
• Better Variance Reduction
  • Instead of larger k, stick with k = 1
  • Use multiple, independent permutations
• Sketch Construction
  • Pick p random permutations of U – π_1, π_2, …, π_p
  • sk(A) = the minimal elements under π_1(S_A), …, π_p(S_A)
• Claim: E[ sim(sk(A), sk(B)) ] = sim(S_A, S_B)
  • Earlier lemma holds for p = 1
  • Linearity of expectation
• Variance reduction – from independence of π_1, …, π_p
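A sketch of the p-permutation signature and its estimator (my illustration, using explicitly stored random permutations over a toy universe; real systems replace these with hash functions, as the next slides discuss):

```python
import random

def minhash_signature(S, perms):
    """One minimum per permutation: sk(S) = (argmin under pi_1, ..., argmin under pi_p)."""
    return [min(S, key=lambda pi_x, pi=pi: pi[pi_x]) for pi in perms]

def signature_similarity(sigA, sigB):
    """Fraction of coordinates where the two signatures agree: estimates sim(S_A, S_B)."""
    return sum(a == b for a, b in zip(sigA, sigB)) / len(sigA)

U = list(range(1000))
perms = [dict(zip(U, random.sample(U, len(U)))) for _ in range(200)]   # p = 200 permutations

SA, SB = set(range(0, 600)), set(range(400, 1000))                     # true sim = 0.2
print(signature_similarity(minhash_signature(SA, perms),
                           minhash_signature(SB, perms)))              # ≈ 0.2
```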
Min-Wise Independent Permutations
• Problem
  • A truly random π over U = [0 … N−1] is infeasible (describing it takes roughly N log N bits)
  • But – do we really need true randomness?
• Solution
  • Poly-size family of permutations F ⊆ S_N over U
  • Choosing/representing a random π ∈ F is easy
• Min-Wise Independence (MWI) Property: for all sets X ⊆ U and all x ∈ X,
  P[ min{ π(X) } = π(x) ] = 1 / |X|
Minimum-Size MWI Families
• [Broder et al 98]
  • Upper/lower bounds: family size ≈ lcm(1, 2, …, N)
  • Problem – exponential in N
• Approximate MWI Families
  • Relax to P[ min{ π(X) } = π(x) ] = (1 ± ε) / |X|
  • Non-constructive – polynomial-size families exist
  • Constructive – size N^{O(log 1/ε)} [Indyk 99]
• In practice – 2-universal hashes work well!
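As a sketch of the practical shortcut in the last bullet (my illustration, not the lecture's code): draw the "permutations" from the 2-universal family h(x) = (a·x + b) mod prime and take one minimum per hash function. The prime, set sizes, and function names are assumptions.

```python
import random

PRIME = (1 << 61) - 1             # a large prime; assumes fingerprints are < PRIME (illustrative)

def make_hash():
    """One member of the 2-universal family h(x) = (a*x + b) mod PRIME, with a != 0."""
    a, b = random.randrange(1, PRIME), random.randrange(PRIME)
    return lambda x: (a * x + b) % PRIME

def minhash_signature(S, hashes):
    """Min-hash signature with hash functions standing in for random permutations."""
    return [min(h(x) for x in S) for h in hashes]

hashes = [make_hash() for _ in range(200)]
SA, SB = set(range(0, 600)), set(range(400, 1000))      # true sim = 0.2
sigA, sigB = minhash_signature(SA, hashes), minhash_signature(SB, hashes)
print(sum(a == b for a, b in zip(sigA, sigB)) / len(hashes))   # ≈ 0.2
```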
References I
• Fingerprinting by Random Polynomials. M. Rabin. Technical Report TR-15-81, Harvard University (1981).
• Some Applications of Rabin's Fingerprinting Method. A. Broder. Sequences II (1993).
• On the Resemblance and Containment of Documents. A. Broder. SEQUENCES 1997.
• Syntactic Clustering of the Web. A. Broder, S. Glassman, M. Manasse, and G. Zweig. WWW 1997.
• Finding Near-Replicas of Documents on the Web. N. Shivakumar and H. Garcia-Molina. WebDB 1998.
• Identifying and Filtering Near-Duplicate Documents. A. Broder. CPM 2000.
References II
• Approximate String Matching with q-grams and Maximal Matches. E. Ukkonen. Theoretical Computer Science (1992).
• Completeness and Robustness Properties of Min-Wise Independent Permutations. A. Broder and M. Mitzenmacher.
• Min-Wise Independent Permutations. A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. JCSS (2000).
• A Small Approximately Min-Wise Independent Family of Hash Functions. P. Indyk. SODA 1999.
• Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. P. Indyk and R. Motwani. STOC 1998.
• Similarity Search in High Dimensions via Hashing. A. Gionis, P. Indyk, and R. Motwani. VLDB 1999.
• Similarity Estimation Techniques from Rounding Algorithms. M. Charikar. STOC 2002.