Comparison of the two algorithms seen in class: near-duplicates detection
Romain Colle
Description of algorithms • 1st pass through the data: both algorithms compute a signature for each document and perform LSH on these signatures. • 2nd pass through the data: verify that the candidate duplicate pairs found are genuine duplicates, using exact Jaccard similarity. • Algorithm SH uses shingles + MinHashing to compute the signatures. • Algorithm SK uses sketches of projections onto random hyperplanes to compute the signatures. (A code sketch of both schemes follows below.)
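The slides name the techniques but give no implementation, so the following is only a minimal Python sketch of the two signature schemes plus the LSH banding and Jaccard steps. All parameters (k = 5 word shingles, 100 hash functions, 64 hyperplanes, 20 bands of 5 rows) are illustrative assumptions, not values from the project.

    import random
    import numpy as np

    def shingles(text, k=5):
        # Set of k-word shingles, hashed to ints (k = 5 is an assumed value).
        words = text.split()
        return {hash(" ".join(words[i:i + k]))
                for i in range(max(1, len(words) - k + 1))}

    def minhash_signature(shingle_set, num_hashes=100, prime=(1 << 61) - 1, seed=0):
        # Algorithm SH: for each random hash function, keep the minimum
        # hashed shingle; agreement rate estimates Jaccard similarity.
        rng = random.Random(seed)
        coeffs = [(rng.randrange(1, prime), rng.randrange(prime))
                  for _ in range(num_hashes)]
        return [min((a * s + b) % prime for s in shingle_set) for a, b in coeffs]

    def hyperplane_sketch(vector, num_bits=64, seed=0):
        # Algorithm SK: one bit per random hyperplane, the sign of the
        # projection of the document's term vector onto that hyperplane.
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((num_bits, len(vector)))
        return (planes @ np.asarray(vector) >= 0).astype(np.uint8)

    def lsh_bands(signature, bands=20, rows=5):
        # LSH: split the signature into bands; documents that collide on
        # any band hash become candidate duplicate pairs.
        assert bands * rows == len(signature)
        return [hash(tuple(signature[i * rows:(i + 1) * rows]))
                for i in range(bands)]

    def jaccard(a, b):
        # 2nd pass: exact Jaccard similarity on the shingle sets of a pair.
        return len(a & b) / len(a | b)

With these assumed parameters, two documents become a candidate pair if their signatures agree on any one of the 20 bands, and the 2nd pass keeps the pair only if jaccard() on their shingle sets exceeds a chosen threshold.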
Experimentation method • Run both algorithms on the data set (WebBase) and compute precision. • Remove the duplicate pairs found from the data set. • Generate and insert a large number of (near-)duplicate documents (~10% of the data set). • Run both algorithms on the new data set and compute precision and recall (a sketch of the computation follows below).
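The slides do not show how precision and recall were computed; here is a minimal sketch, assuming the reported pairs and the planted ground-truth pairs are available as sets of frozensets of document ids (the example values are hypothetical).

    def precision_recall(found_pairs, true_pairs):
        # Precision: fraction of reported pairs that are genuine duplicates.
        # Recall: fraction of genuine duplicate pairs that were reported.
        true_positives = len(found_pairs & true_pairs)
        precision = true_positives / len(found_pairs) if found_pairs else 0.0
        recall = true_positives / len(true_pairs) if true_pairs else 0.0
        return precision, recall

    # Hypothetical usage:
    found = {frozenset((1, 2)), frozenset((3, 4))}
    truth = {frozenset((1, 2)), frozenset((5, 6))}
    p, r = precision_recall(found, truth)  # p = 0.5, r = 0.5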
Conclusion • Algorithm SK rocks! • However, it is computationally more expensive. • There is a tradeoff between speed and precision/recall, given that the cheaper algorithm SH already performs quite well.