SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Presenter: Tsai TzungRuei • Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke • National Yunlin University of Science and Technology (國立雲林科技大學) • SIGIR 2008

Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments

Motivation • Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.

Objective • Techniques exist to avoid exact duplicates during the collection of Web archives, but near duplicates still frequently slip into the corpus. The goal is to detect these near duplicates robustly and efficiently.

Methodology • SPOT SIGNATURE • EXTRACTION • MATCHING [Figure labels: document, Web, Database]

Methodology • SPOT SIGNATURE EXTRACTION • Spot set: A = {aj(dj, cj)}, where each antecedent aj has a spot distance dj and a chain length cj • Example spots: a(1,2), an(1,2), the(1,2) and is(1,2) • Input: “At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.” • Result: S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
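The extraction step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `spot_signatures`, the regex tokenizer, the small `STOPWORDS` list, and the reading of the distance d as "take every d-th non-stopword token after the antecedent" are all hypothetical choices made so that the slide's example can be reproduced.

```python
import re

# Hypothetical spot set from the slide: antecedent -> (distance d, chain length c).
SPOTS = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}

# Assumed stopword list (not given on the slide); it must contain the antecedents.
STOPWORDS = {"a", "an", "the", "is", "to", "that", "at", "off", "for",
             "from", "on", "into", "and", "of", "against"}

def spot_signatures(text, spots=SPOTS, stopwords=STOPWORDS):
    """Return spot signatures such as 'a:rally:kick' in order of occurrence."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = []
    for i, token in enumerate(tokens):
        if token not in spots:
            continue
        d, c = spots[token]
        chain, seen = [], 0
        for nxt in tokens[i + 1:]:
            if nxt in stopwords:
                continue  # stopwords never become part of a chain
            seen += 1
            if seen % d == 0:  # keep every d-th non-stopword token
                chain.append(nxt)
            if len(chain) == c:
                break
        if len(chain) == c:  # drop incomplete chains at the end of the text
            signatures.append(":".join([token] + chain))
    return signatures

SENTENCE = ("At a rally to kick off a weeklong campaign for the South Carolina "
            "primary, Obama tried to set the record straight from an attack "
            "circulating widely on the Internet that is designed to play into "
            "prejudices against Muslims and fears of terrorism.")

print(spot_signatures(SENTENCE))
```

With the assumed stopword list, the printed signatures match the seven entries of S on the slide, in the same order.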

Methodology • SPOT SIGNATURE MATCHING • Jaccard Similarity for Sets • Generalization for Multi-Sets
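The two measures named on this slide have standard definitions: for sets, sim(A, B) = |A ∩ B| / |A ∪ B|; for multi-sets, the sum of the per-element minimum counts divided by the sum of the per-element maximum counts. The sketch below shows both using Python's `collections.Counter`. The function names and the choice to return 1.0 for two empty inputs are assumptions made for illustration.

```python
from collections import Counter

def jaccard_set(a, b):
    """Set Jaccard: |A ∩ B| / |A ∪ B| (multiplicities are ignored)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

def jaccard_multiset(a, b):
    """Multi-set Jaccard: sum of min counts / sum of max counts."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())  # Counter | Counter keeps the max count
    if union == 0:
        return 1.0  # assumed convention for two empty multi-sets
    return sum((a & b).values()) / union  # Counter & Counter keeps the min count

A = Counter({"x": 2, "y": 1})
B = Counter({"x": 1, "y": 1, "z": 1})
print(jaccard_set(A, B))       # 2 shared keys out of 3 distinct keys
print(jaccard_multiset(A, B))  # (1 + 1 + 0) / (2 + 1 + 1)
```

The multi-set version matters for spot signatures because the same signature can occur several times in one document.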

Methodology • SPOT SIGNATURE MATCHING • [Figure: matching pipeline with panels labeled Spot Signature, partition, Inverted Index Pruning, and Jaccard Similarity for Sets]

Methodology • Optimal Partitioning
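Partitioning by signature-set size works because size alone bounds the Jaccard similarity: the intersection can be no larger than the smaller multi-set and the union no smaller than the larger one, so sim(A, B) ≤ min(|A|, |B|) / max(|A|, |B|). The sketch below uses that bound to build simple geometric length ranges such that any pair that could still reach the threshold τ lies in the same or an adjacent range. The scheme and the names `length_bound`, `partition_ranges`, and `partition_index` are illustrative assumptions; this is not claimed to be the optimal partitioning the slide refers to.

```python
import math

def length_bound(len_a, len_b):
    """Upper bound on Jaccard similarity derived from the two sizes alone."""
    return min(len_a, len_b) / max(len_a, len_b)

def partition_ranges(max_len, tau):
    """Split lengths 1..max_len into ranges [lo, hi] with hi = floor(lo / tau).

    Assumes 0 < tau <= 1. A pair whose length bound is at least tau then
    always falls into the same range or into two neighboring ranges.
    """
    ranges, lo = [], 1
    while lo <= max_len:
        hi = min(max_len, max(lo, math.floor(lo / tau)))
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

def partition_index(length, ranges):
    """Return the index of the range that contains the given length."""
    for k, (lo, hi) in enumerate(ranges):
        if lo <= length <= hi:
            return k
    raise ValueError("length outside the partitioned range")

print(partition_ranges(10, 0.5))
```

Under this scheme, only documents in the same or neighboring ranges need to be compared; all other pairs are skipped without computing their similarity.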

Methodology • Inverted Index Pruning • Example: d1 = {s1:5, s2:4, s3:4} with |d1| = 13; d2 = {s1:8, s2:4} with |d2| = 12; d3 = {s1:4, s2:5, s3:5} with |d3| = 14; τ = 0.8; δ1 = 0; δ2 = |d1| − |d3| = −1
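To make the example concrete, the sketch below runs the three documents through a plain inverted index, a size-based filter, and an exact multi-set Jaccard check. It is a simplified illustration only: it does not implement the δ-based pruning shown on the slide, and the function name `near_duplicates` and the data layout are assumptions.

```python
from collections import Counter, defaultdict

def near_duplicates(docs, tau):
    """Return {(id_a, id_b): similarity} for all pairs with similarity >= tau."""
    # Inverted index: signature -> ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, sigs in docs.items():
        for s in sigs:
            index[s].add(doc_id)

    results = {}
    for doc_id, sigs in docs.items():
        # Only documents sharing at least one signature can be similar.
        candidates = set()
        for s in sigs:
            candidates |= index[s]
        for other in candidates:
            if other <= doc_id:
                continue  # report each unordered pair once
            len_a, len_b = sum(sigs.values()), sum(docs[other].values())
            if min(len_a, len_b) / max(len_a, len_b) < tau:
                continue  # the size bound already rules this pair out
            a, b = Counter(sigs), Counter(docs[other])
            sim = sum((a & b).values()) / sum((a | b).values())
            if sim >= tau:
                results[(doc_id, other)] = sim
    return results

DOCS = {
    "d1": {"s1": 5, "s2": 4, "s3": 4},  # |d1| = 13
    "d2": {"s1": 8, "s2": 4},           # |d2| = 12
    "d3": {"s1": 4, "s2": 5, "s3": 5},  # |d3| = 14
}
print(near_duplicates(DOCS, tau=0.8))
```

For these documents, all three pairs pass the size filter (for instance 12/13 and 13/14 both exceed 0.8), so the exact check decides: d1 and d3 reach 12/15 = 0.8, while d1 and d2 reach only 9/16 and d2 and d3 only 8/18.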

Experiments • Gold Set of Near Duplicate News Articles: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing • TREC WT10g: SpotSigs vs. Hashing

Experiments • Gold Set of Near Duplicate News Articles • [Figures: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing]

Experiments • TREC WT10g • SpotSigs vs. Hashing

Conclusion • MAJOR CONTRIBUTION • SpotSigs provides both more robust signatures and highly efficient deduplication compared to various state-of-the-art approaches. • FUTURE WORK • Future work will focus on efficient access to disk-based index structures, as well as on generalizing the bounding approach to other metrics such as Cosine.

Comments • Advantage • The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient. • Drawback • ….. • Application • Information retrieval
