Detecting Near-Duplicates for Web Crawling Gurmeet Singh Manku (Google) Arvind Jain (Google) Anish Das Sarma (Stanford University)
What are Near-Duplicates? • Identical content, but differ in small portion of document • Advertisements • Counters • Timestamps
Near-Duplicates: Why and How? • Why do we want to detect near-duplicates? • Save storage • Search quality • How to determine whether a pair of documents is a near-duplicate? • Lots of past work (survey in the paper) • Our work: detect near-duplicate webpages during crawl
Simplified Crawl Architecture [diagram] The crawler fetches an HTML document from the Web and traverses its links; each newly-crawled document is checked against the entire index: if a near-duplicate exists it is trashed, otherwise it is inserted into the index.
Near-Duplicate Detection • Why is it hard in a crawl setting? • Scale • Tens of billions of documents indexed • Millions of pages crawled every day • Need to decide quickly!
Single and Batch Modes [diagram] The near-duplicate check against the entire index runs in one of two modes: on a single newly-crawled document, or on a batch of documents.
Rest of Talk • Simhash overview • Formal definition of the problem • Single and Batch algorithms • Experiments • Conclusions
Simhash [Charikar 02] • Dimensionality-reduction technique • used for near-duplicate detection • Obtain an f-bit fingerprint for each document • A pair of documents is a near-duplicate if and only if the fingerprints are at most k bits apart • We experimentally show that f=64, k=3 works well.
Simhash [diagram] For each (feature, weight) pair, hash the feature to an f-bit value; for every bit position, add the weight if that hash bit is 1 and subtract it otherwise. Summing over all features gives a component vector (e.g. 13, 108, -22, -5, -32, 55); taking the sign of each component yields the fingerprint (here 110001).
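The simhash computation on the slide can be sketched in Python. This is a minimal sketch: the choice of MD5 as the per-feature hash and string features are illustrative assumptions, not the paper's implementation.

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash fingerprint from (feature, weight) pairs.

    Each feature is hashed to f bits; for every bit position we add the
    feature's weight if the hash bit is 1 and subtract it if it is 0.
    The sign of each running sum gives one bit of the fingerprint.
    """
    v = [0] * f
    for feature, weight in features:
        h = int.from_bytes(hashlib.md5(feature.encode()).digest(), "big")
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:  # sign of the component decides the bit
            fingerprint |= 1 << i
    return fingerprint
```

Because similar feature sets produce mostly-identical component vectors, near-duplicate documents end up with fingerprints that differ in only a few bit positions.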
Problem Definition • Input: • Set S of f-bit fingerprints (document index) • Query fingerprint Q (new document) • Output: • Exists near-duplicate, or not • Batch Mode Input: Set of query fingerprints • Running example: f=64, k=3 • (Q1, Q2) are near-duplicates iff hamming-distance(simhash(Q1), simhash(Q2)) ≤ k
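The near-duplicate predicate in the problem definition reduces to a Hamming-distance test on fingerprints, which in Python is a one-line XOR-and-popcount:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(fp1: int, fp2: int, k: int = 3) -> bool:
    """Near-duplicate iff the fingerprints are at most k bits apart."""
    return hamming_distance(fp1, fp2) <= k
```

The test itself is trivial; the hard part, addressed in the rest of the talk, is answering it against billions of stored fingerprints quickly.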
Attempt One [diagram] Keep the fingerprints in S pre-sorted; for a 64-bit query Q, do an exact probe for every Q' with hd(Q, Q') ≤ k = 3. That is C(64,3) probes!
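Attempt One can be made concrete: enumerate every fingerprint within Hamming distance k of Q and probe the sorted list once per variant. A sketch (the probe itself, omitted here, would be a binary search over the sorted fingerprints):

```python
from itertools import combinations

def variants_within_k(q: int, f: int = 64, k: int = 3):
    """Yield every f-bit fingerprint at Hamming distance <= k from q."""
    for d in range(k + 1):
        for positions in combinations(range(f), d):
            v = q
            for p in positions:
                v ^= 1 << p  # flip one chosen bit position
            yield v
```

For f=64, k=3 this yields 1 + C(64,1) + C(64,2) + C(64,3) = 43,745 variants, i.e. tens of thousands of probes per query, which is why this attempt is impractical at crawl scale.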
Attempt Two [diagram] Pre-compute and sort S': all fingerprints at most k bits away from some fingerprint in S; a single exact probe for Q then suffices, but |S'| ≈ |S| · C(64,3).
Intuition for Our Approach • Observation 1: Consider 2^d f-bit fingerprints in sorted order • Most of the 2^d combinations of the d most significant bits exist • Can quickly do an exact probe on the first d' (≤ d) bits • Observation 2: [diagram] if hd(Q, Q') = 3, then Q' still matches Q exactly on all but 3 bit positions
Example [diagram] Split the 64-bit fingerprints into four 16-bit pieces (A, B, C, D); build four copies of the fingerprints in S, each permuted so that a different piece leads, and do an exact search on those leading 16 bits.
Example: Analysis • 64 bits split into 4 pieces • 4 tables with permuted fingerprints • Exact search on 16 bits • If 2^34 (≈10 billion) fingerprints • Each probe gives 2^(34-16) fingerprints
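The 4-table example can be sketched as a toy in-memory version (the real system stores sorted, compressed tables rather than hash maps). The correctness argument is pigeonhole: a fingerprint within k = 3 bits of Q must agree with Q exactly on at least one of the four 16-bit pieces.

```python
from collections import defaultdict

F, PIECES, K = 64, 4, 3
PIECE_BITS = F // PIECES  # 16

def piece(fp: int, i: int) -> int:
    """Extract the i-th 16-bit piece of a 64-bit fingerprint."""
    return (fp >> (i * PIECE_BITS)) & ((1 << PIECE_BITS) - 1)

def build_tables(fingerprints):
    """One table per piece, keyed by that piece's exact value."""
    tables = [defaultdict(list) for _ in range(PIECES)]
    for fp in fingerprints:
        for i in range(PIECES):
            tables[i][piece(fp, i)].append(fp)
    return tables

def query(tables, q: int, k: int = K):
    """Find all stored fingerprints within Hamming distance k of q."""
    found = set()
    for i in range(PIECES):
        for cand in tables[i].get(piece(q, i), ()):
            if cand not in found and bin(cand ^ q).count("1") <= k:
                found.add(cand)
    return found
```

Each of the four probes narrows the candidates by an exact 16-bit match; only those candidates are checked against the full Hamming-distance predicate.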
Analysis (contd.) • f bits split into r pieces • C(r,k) tables with permuted fingerprints • Exact search on f(1-k/r) bits • With 2^d existing fingerprints, each probe yields 2^(d - f(1-k/r)) fingerprints • Running example: f=64, k=3, d=34
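These counts can be checked numerically; the helper below simply evaluates the slide's three quantities for a given split.

```python
from math import comb

def tradeoff(f: int, k: int, r: int, d: int):
    """For f bits split into r > k pieces: number of tables, bits
    probed exactly, and expected matches per probe given 2**d
    stored fingerprints."""
    tables = comb(r, k)            # choose the k pieces allowed to differ
    exact_bits = f * (r - k) // r  # = f * (1 - k/r)
    matches = 2 ** (d - exact_bits)
    return tables, exact_bits, matches
```

For the running example f=64, k=3, d=34 with r=4 pieces this gives 4 tables, exact search on 16 bits, and 2^18 matches per probe, matching the example on the previous slide.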
Same Idea Recursively [diagram] Split the 64 bits into a 16-bit piece plus a 48-bit remainder, then split the 48 bits again into 12-bit pieces; with 16 tables, each probe yields 2^(34-28) matches.
General Solution • Space (#tables) / Time (#matches) tradeoff • Minimum number of tables, with at most 2^x matches per probe? • General solution: X(f, k, d) = 1 if d < x, and min over r > k of C(r,k) · X(fk/r, k, d - f(r-k)/r) otherwise
Compression of Tables • We can efficiently compress tables • In expectation, first d bits are common in successive fingerprints • Exploit this to compress each of the tables • Details in the paper • Brings down space requirements by nearly 50%
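The shared-prefix observation can be exploited roughly as follows; this is a simplified sketch (the paper's actual encoding is more refined than storing a plain (prefix-length, suffix) pair per fingerprint).

```python
def compress(sorted_fps, f=64):
    """Encode each fingerprint as (bits shared with its predecessor,
    the remaining suffix bits)."""
    out, prev = [], 0
    for fp in sorted_fps:
        diff = prev ^ fp
        shared = f - diff.bit_length()          # leading bits in common
        out.append((shared, fp & ((1 << (f - shared)) - 1)))
        prev = fp
    return out

def decompress(encoded, f=64):
    """Invert compress(): rebuild each fingerprint from its
    predecessor's prefix plus the stored suffix."""
    fps, prev = [], 0
    for shared, suffix in encoded:
        rem = f - shared
        fp = (prev >> rem << rem) | suffix      # copy prefix, append suffix
        fps.append(fp)
        prev = fp
    return fps
```

Because the fingerprints are sorted and densely populate the high-order bits, successive entries share long prefixes in expectation, so most of each 64-bit fingerprint need not be stored.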
Rest of Talk • Simhash overview • Formal definition of the problem • Single algorithm • Batch algorithm • Experiments • Conclusions
Reminder: Batch Problem • Tens of billions of pages indexed • Crawl millions of pages each day • Quickly find all new pages having a near-duplicate in the index
MapReduce Framework • MapReduce framework used within Google • massively parallel • Map phase: • operate individually on a set of objects • Reduce phase • aggregate results of the mapped objects
Batch Algorithm • Suppose 8B existing fingerprints (~32GB after compression): File F • 1M batch query fingerprints (~8MB): File B • F stored in a GFS file system • chunked into roughly 64MB pieces • replicated at 3 random nodes • B stored with much higher replication factor
Batch Algorithm (continued) • Map Phase: • Duplicate detection within each chunk Fi and whole of B • Build multiple tables for B (in memory) • Scan Fi and probe into B • Output near-duplicates in B • Reduce phase • Merge outputs
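The map and reduce phases above can be mimicked sequentially. A sketch: each map_chunk call would run on a separate machine against one chunk Fi, and for brevity it compares fingerprints by brute force where the real mapper probes in-memory permuted tables built over B.

```python
def hd(a: int, b: int) -> int:
    """Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")

def map_chunk(chunk, batch, k=3):
    """Map: scan one chunk Fi of the index and report every batch
    fingerprint that has a near-duplicate in that chunk."""
    return {q for q in batch for fp in chunk if hd(q, fp) <= k}

def reduce_merge(partial_outputs):
    """Reduce: merge the per-chunk outputs into the final answer."""
    return set().union(*partial_outputs)
```

Since every mapper touches the whole (small, highly replicated) batch B but only its own chunk of the huge file F, the work parallelizes cleanly across chunks, and the reduce step is a simple set union.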
Batch Algorithm (continued) [diagram] Chunks F1, F2, …, Fn are processed in parallel, each producing a partial output over B; the partial outputs are then merged.
Experimental Analysis • Promising preliminary results! • Studied: • Choice of simhash parameters • Distribution of fingerprints
Summary • Addressed near-duplicate detection in a web-crawling system • Proposed algorithms for single and batch cases • Preliminary experiments to validate our techniques and suitability of simhash • Mini-survey of near-duplicate detection in the paper