Comparison of the two algorithms seen in class: near-duplicates detection
Romain Colle
Description of algorithms • 1st pass through the data: both algorithms compute a signature for each document and perform LSH on these signatures. • 2nd pass through the data: verify that the candidate duplicate pairs found are genuine duplicates, using exact Jaccard similarity. • Algorithm SH uses shingles + MinHashing to compute the signatures. • Algorithm SK uses sketches of projections onto random hyperplanes to compute the signatures. (A code sketch of both schemes follows below.)
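The slides name the techniques but give no implementation, so the following is only a minimal Python sketch of the two signature schemes plus the LSH banding and Jaccard steps. All parameters (k = 5 word shingles, 100 hash functions, 64 hyperplanes, 20 bands of 5 rows) are illustrative assumptions, not values from the project.

    import random
    import numpy as np

    def shingles(text, k=5):
        # Set of k-word shingles, hashed to ints (k = 5 is an assumed value).
        words = text.split()
        return {hash(" ".join(words[i:i + k]))
                for i in range(max(1, len(words) - k + 1))}

    def minhash_signature(shingle_set, num_hashes=100, prime=(1 << 61) - 1, seed=0):
        # Algorithm SH: for each random hash function, keep the minimum
        # hashed shingle; agreement rate estimates Jaccard similarity.
        rng = random.Random(seed)
        coeffs = [(rng.randrange(1, prime), rng.randrange(prime))
                  for _ in range(num_hashes)]
        return [min((a * s + b) % prime for s in shingle_set) for a, b in coeffs]

    def hyperplane_sketch(vector, num_bits=64, seed=0):
        # Algorithm SK: one bit per random hyperplane, the sign of the
        # projection of the document's term vector onto that hyperplane.
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((num_bits, len(vector)))
        return (planes @ np.asarray(vector) >= 0).astype(np.uint8)

    def lsh_bands(signature, bands=20, rows=5):
        # LSH: split the signature into bands; documents that collide on
        # any band hash become candidate duplicate pairs.
        assert bands * rows == len(signature)
        return [hash(tuple(signature[i * rows:(i + 1) * rows]))
                for i in range(bands)]

    def jaccard(a, b):
        # 2nd pass: exact Jaccard similarity on the shingle sets of a pair.
        return len(a & b) / len(a | b)

With these assumed parameters, two documents become a candidate pair if their signatures agree on any one of the 20 bands, and the 2nd pass keeps the pair only if jaccard() on their shingle sets exceeds a chosen threshold.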
Experimentation method • Run both algorithms on the data set (WebBase) and compute precision. • Remove the duplicate pairs found from the data set. • Generate and insert a large number of (near-)duplicate documents (~10% of the data set). • Run both algorithms on the new data set and compute precision and recall (a sketch of the computation follows below).
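The slides do not show how precision and recall were computed; here is a minimal sketch, assuming the reported pairs and the planted ground-truth pairs are available as sets of frozensets of document ids (the example values are hypothetical).

    def precision_recall(found_pairs, true_pairs):
        # Precision: fraction of reported pairs that are genuine duplicates.
        # Recall: fraction of genuine duplicate pairs that were reported.
        true_positives = len(found_pairs & true_pairs)
        precision = true_positives / len(found_pairs) if found_pairs else 0.0
        recall = true_positives / len(true_pairs) if true_pairs else 0.0
        return precision, recall

    # Hypothetical usage:
    found = {frozenset((1, 2)), frozenset((3, 4))}
    truth = {frozenset((1, 2)), frozenset((5, 6))}
    p, r = precision_recall(found, truth)  # p = 0.5, r = 0.5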
Conclusion • Algorithm SK rocks! • However, it is computationally more expensive. • There is a tradeoff between speed and precision/recall, given that the cheaper algorithm SH already performs quite well.