Document duplication (exact or approximate)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
Sec. 19.6 Duplicate documents
• The web is full of duplicated content
• Few cases of exact duplicates
• Many cases of near-duplicates
• E.g., the last-modified date is the only difference between two copies of a page
Near-Duplicate Detection
• Problem: given a large collection of documents, identify the near-duplicate documents
• Web search engines face a proliferation of near-duplicate documents:
• Legitimate – mirrors, local copies, updates, …
• Malicious – spam, spider-traps, dynamic URLs, …
• Mistaken – spider errors
• 30% of web pages are near-duplicates [1997]
Desiderata
• Storage: only small sketches of each document
• Computation: the fastest possible
• Stream processing: once the sketch is computed, the source is unavailable
• Error guarantees: at this problem scale, small biases have large impact, so we need formal guarantees – heuristics will not do
Natural Approaches
• Fingerprinting:
• only works for exact matches
• Karp-Rabin (rolling hash) – collision-probability guarantees
• MD5 – cryptographically-secure string hashes
• Edit distance:
• a metric for approximate string matching
• expensive – even for one pair of documents
• impossible – for billions of web documents
• Random sampling:
• sample substrings (phrases, sentences, etc.)
• hope: similar documents yield similar samples
• but even samples of the same document will differ
Karp-Rabin Fingerprints
• Consider an m-bit string A = 1 a_1 a_2 … a_m (the leading 1 avoids ambiguity from leading zeros)
• Basic values: choose a prime p in the universe U ≈ 2^64
• Fingerprint: f(A) = A mod p
• Rolling hash: given B = 1 a_2 … a_m a_{m+1},
f(B) = [2 · (A − 2^m − a_1 · 2^{m−1}) + 2^m + a_{m+1}] mod p
• Prob[false hit] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≤ log(A + B) / #primes(U) ≈ (m log U) / U
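To make the recurrence concrete, here is a minimal rolling-hash sketch in Python; the slide works over bit strings, while this version treats characters as base-256 digits for readability, and the Mersenne prime is an illustrative choice:

```python
def karp_rabin_rolls(s, m, p):
    """Yield f(window) = window mod p for every m-character window of s."""
    base = 256                              # characters as base-256 digits
    top = pow(base, m - 1, p)               # weight of the leading character
    h = 0
    for c in s[:m]:                         # fingerprint of the first window
        h = (h * base + ord(c)) % p
    yield h
    for i in range(m, len(s)):
        h = (h - ord(s[i - m]) * top) % p   # drop the leading character
        h = (h * base + ord(s[i])) % p      # shift and append the new one
        yield h

# Equal windows always collide; distinct ones collide with prob ≈ (m log U)/U
print(list(karp_rabin_rolls("abracadabra", 4, (1 << 61) - 1)))
```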
Basic Idea [Broder 1997]
Shingling:
• dissect each document into q-grams (shingles)
• represent documents by their shingle-sets
• reduce the problem to set intersection [Jaccard]
• two documents are near-duplicates if their shingle-sets intersect enough
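A minimal shingling sketch in Python (the character-level q-grams and the function name are my choices):

```python
def shingles(text, q=8):
    """Return the set of character-level q-grams (shingles) of a document."""
    text = " ".join(text.split())      # normalize whitespace
    return {text[i:i + q] for i in range(len(text) - q + 1)}
```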
#1. Doc Similarity ⇒ Set Intersection
• Represent Doc A and Doc B by their shingle-sets SA and SB
• Jaccard measure – similarity of SA, SB: sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|
• Claim: A & B are near-duplicates if sim(SA, SB) is high
We need to cope with "Set Intersection":
• fingerprints of shingles (for space/time efficiency)
• min-hash to estimate intersection sizes (further efficiency)
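In code, the Jaccard measure over two shingle-sets is simply (a sketch, reusing the shingles helper above):

```python
def jaccard(sa, sb):
    """Jaccard similarity |SA ∩ SB| / |SA ∪ SB| of two sets."""
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

# A & B are near-duplicates if this value is high
sim = jaccard(shingles("the quick brown fox"), shingles("the quick brown cat"))
```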
#2. Sets of 64-bit fingerprints
[Pipeline: Doc → shingling → multiset of shingles → fingerprint → multiset of fingerprints]
Fingerprints:
• use Karp-Rabin fingerprints over q-gram shingles (of 8q bits each)
• in practice, use 64-bit fingerprints, i.e., U = 2^64
• Prob[collision] ≈ (8q · 64) / 2^64 << 1
This reduces the space for storing the multi-sets and the time to intersect them, but…
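A sketch of the shingle → 64-bit fingerprint step; Python's built-in hash stands in for a true Karp-Rabin fingerprint here (an illustrative simplification: hash() is salted per process, so a deployed system would use a stable 64-bit hash):

```python
MASK64 = (1 << 64) - 1

def fingerprints(shingle_set):
    """Map each shingle to a 64-bit value (U = 2^64); stand-in for Karp-Rabin."""
    return {hash(s) & MASK64 for s in shingle_set}
```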
Sec. 19.6 #3. Sketch of a document
• Sets are large, so their intersection is still too costly
• Create a "sketch vector" (of size ~200) for each shingle-set
• Documents that share ≥ t (say 80%) of the sketch elements are claimed to be near-duplicates
Sketching by Min-Hashing
• Consider SA, SB ⊆ {0, …, p−1}
• Pick a random permutation π of the whole set {0, …, p−1} (such as π(x) = (a·x + b) mod p)
• Define α = min{π(SA)}, β = min{π(SB)} – the minimal elements under permutation π
• Lemma: Prob[α = β] = |SA ∩ SB| / |SA ∪ SB| = sim(SA, SB)
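A minimal min-hash sketch in Python, drawing one π from the slide's (a·x + b) mod p family; the prime and the two small fingerprint sets are illustrative:

```python
import random

P = (1 << 61) - 1                     # Mersenne prime; fingerprints assumed < P

def make_pi():
    """One map pi(x) = (a*x + b) mod P from the slide's family."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: (a * x + b) % P

fp_a = {3, 14, 15, 65, 92}            # toy fingerprint sets SA, SB
fp_b = {3, 14, 15, 79, 89}

pi = make_pi()
alpha = min(pi(x) for x in fp_a)      # min{pi(SA)}
beta = min(pi(x) for x in fp_b)       # min{pi(SB)}
# Lemma: Prob[alpha == beta] = sim(SA, SB)
```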
Strengthening it…
• Similarity sketch sk(A) = the k minimal elements under π(SA)
• Alternatively, take k permutations and keep the min under each of them
• Note: we can reduce the variance of the estimate by using a larger k
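A sketch-vector builder for the k-permutations variant (reusing make_pi and the toy sets above; k = 200 echoes the sketch size quoted earlier):

```python
def minhash_sketch(fps, pis):
    """Sketch vector: for each permutation, the minimal permuted fingerprint."""
    return [min(pi(x) for x in fps) for pi in pis]

pis = [make_pi() for _ in range(200)]   # k = 200 permutations
sk_a = minhash_sketch(fp_a, pis)
sk_b = minhash_sketch(fp_b, pis)
# Fraction of agreeing positions estimates sim(SA, SB)
est = sum(x == y for x, y in zip(sk_a, sk_b)) / len(pis)
```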
Sec. 19.6 Computing Sketch[i] for Doc 1
[Figure: Document 1's shingles on the line 0 … 2^64 – start with the 64-bit f(shingles), permute with π_i, pick the min value]
Sec. 19.6 Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: the permuted fingerprints of Document 1 and Document 2 on 0 … 2^64, with minima A and B – are these equal?]
Test for 200 random permutations: π1, π2, …, π200
Sec. 19.6 However…
[Figure: the permuted fingerprints of Document 1 and Document 2 on 0 … 2^64, with minima A and B]
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
Claim: this happens with probability size_of_intersection / size_of_union, since a random permutation makes every element of the union equally likely to be the minimum
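The claim can be checked empirically with a quick Monte Carlo sketch (reusing make_pi and the toy sets above; with a truly random permutation the two printed values converge):

```python
trials = 10_000
hits = sum(min(fp_a | fp_b, key=make_pi()) in (fp_a & fp_b)
           for _ in range(trials))
print(hits / trials, len(fp_a & fp_b) / len(fp_a | fp_b))   # ≈ equal
```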
#4. Detecting all duplicates
• Brute force (quadratic time):
• compare sk(A) vs. sk(B) for all pairs of docs A and B
• still, (numdocs)^2 is too much computing, even if executed in internal memory
• Locality-sensitive hashing (LSH) for sk(A) vs. sk(B):
• sample h elements of sk(A) as an ID (may induce false positives)
• create t IDs per document (to reduce the false negatives)
• if at least one ID matches with another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare them) – see the sketch below
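A minimal ID-construction sketch; taking each ID as h consecutive sketch positions (the usual "banding" scheme) is my assumption, since the slide only specifies an h-sample:

```python
def lsh_ids(sketch, h, t):
    """t IDs per document, each one hashing h sketch elements (one band each)."""
    return [hash(tuple(sketch[j * h:(j + 1) * h])) for j in range(t)]

ids_a = lsh_ids(sk_a, h=25, t=8)   # 8 bands of 25 positions over a 200-sketch
```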
#4. How do you implement this?
GOAL: if at least one ID matches with another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare them).
SOL 1:
• create t hashtables (with chaining), one per ID [recall that each ID is an h-sample of sk()]
• insert the docID in the slot of each of its IDs (using some hash)
• scan every bucket and check which docIDs share ≥ 1 ID
SOL 2:
• sort by each ID, and then check the consecutive equal ones
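A sketch of SOL 1 in Python, with dicts playing the role of the hashtables with chaining (reusing lsh_ids above; names are mine):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sketches, h, t):
    """SOL 1: t hashtables, one per ID position; docs sharing >= 1 ID become candidates."""
    tables = [defaultdict(list) for _ in range(t)]    # one hashtable per ID
    for doc_id, sk in sketches.items():
        for j, band_id in enumerate(lsh_ids(sk, h, t)):
            tables[j][band_id].append(doc_id)         # insert docID in the ID's slot
    pairs = set()
    for table in tables:                              # scan every bucket
        for bucket in table.values():
            pairs.update(combinations(sorted(bucket), 2))
    return pairs                                      # candidates: verify on full sketches
```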