SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections. Presenter: Tsai Tzung Ruei. Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. 國立雲林科技大學 National Yunlin University of Science and Technology. SIGIR 2008.
Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments
Motivation • Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.
Objective • Even when exact duplicates are avoided during the collection of Web archives, near duplicates frequently slip into the corpus.
Methodology • SPOT SIGNATURE • EXTRACTION • MATCHING [Figure labels: document, Web, Database]
Methodology • SPOT SIGNATURE EXTRACTION • A = {a_j(d_j, c_j)}: each antecedent a_j has a spot distance d_j and a chain length c_j • Example antecedents: a(1,2), an(1,2), the(1,2) and is(1,2) • Example text: “At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.” • Result: S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
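The extraction step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `spot_signatures`, the regex tokenizer, and the small `STOPWORDS` list are assumptions made for this example. It assumes that stopwords are skipped when building a chain and that, for an antecedent with parameters (d, c), every d-th following non-stopword token is taken until c tokens are collected.

```python
import re

# Assumed stopword list, just large enough for the slide's example sentence.
STOPWORDS = {"a", "an", "the", "is", "at", "to", "off", "for", "from",
             "on", "that", "into", "against", "and", "of"}

# Antecedents from the slide, each with (spot distance d, chain length c).
ANTECEDENTS = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}


def spot_signatures(text, antecedents=ANTECEDENTS, stopwords=STOPWORDS):
    """Return spot signatures in document order (hypothetical helper)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    signatures = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        d, c = antecedents[token]
        # Non-stopword tokens that follow the antecedent.
        content = [t for t in tokens[i + 1:] if t not in stopwords]
        # Take every d-th content token, at most c of them.
        chain = content[d - 1::d][:c]
        if len(chain) == c:  # drop incomplete chains at the end of the text
            signatures.append(":".join([token] + chain))
    return signatures


text = ("At a rally to kick off a weeklong campaign for the South Carolina "
        "primary, Obama tried to set the record straight from an attack "
        "circulating widely on the Internet that is designed to play into "
        "prejudices against Muslims and fears of terrorism.")
signatures = spot_signatures(text)
```

With these assumptions, `signatures` holds the seven entries listed in the slide's result set, in the same order.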
Methodology • SPOT SIGNATURE MATCHING • Jaccard Similarity for Sets • Generalization for Multi-Sets
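The slide's formulas did not survive extraction. As a reminder of the standard definitions, Jaccard similarity for sets is |A ∩ B| / |A ∪ B|, and the usual multi-set generalization sums the per-element minimum counts and divides by the sum of the per-element maximum counts. The sketch below assumes these standard definitions; the function names are illustrative, and returning 1.0 for two empty inputs is a convention chosen here, not something stated in the slides.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def multiset_jaccard(a, b):
    """Jaccard similarity of two multi-sets (signature -> count)."""
    a, b = Counter(a), Counter(b)
    # Counter '&' keeps the minimum count, '|' keeps the maximum count.
    union_size = sum((a | b).values())
    return sum((a & b).values()) / union_size if union_size else 1.0


set_sim = jaccard({"x", "y"}, {"y", "z"})        # 1 shared of 3 distinct
multi_sim = multiset_jaccard({"s1": 5, "s2": 4, "s3": 4},
                             {"s1": 4, "s2": 5, "s3": 5})  # 12 / 15
```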
Methodology • SPOT SIGNATURE MATCHING [Figure labels: Spot Signature, partition (×3), Inverted Index Pruning, Jaccard Similarity for Sets]
Methodology • Optimal Partitioning
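The slide body for this step was lost. One idea behind length-based partitioning can be illustrated as follows: the (multi-)set Jaccard similarity of two signature sets can never exceed min(|A|, |B|) / max(|A|, |B|), so pairs whose sizes differ too much cannot reach the threshold τ and need not be compared. The geometric bucketing in `partition_index` is a hypothetical scheme chosen for this sketch (it assumes lengths ≥ 1 and 0 < τ < 1); it is not claimed to be the paper's optimal partitioning.

```python
import math


def length_bound(len_a, len_b):
    """Upper bound on Jaccard similarity derived from sizes alone."""
    return min(len_a, len_b) / max(len_a, len_b)


def partition_index(length, tau):
    """Hypothetical geometric bucket: k such that tau**-k <= length < tau**-(k+1).

    Up to floating-point rounding, two lengths whose length_bound is at
    least tau fall into the same bucket or into adjacent buckets.
    """
    return math.floor(math.log(length) / math.log(1.0 / tau))


tau = 0.8
buckets = {n: partition_index(n, tau) for n in (5, 10, 12, 13, 14)}
can_skip = length_bound(5, 10) < tau  # sizes 5 and 10 can never reach tau
```

Under this scheme a document only has to be compared against documents in its own bucket and the neighboring one.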
Methodology • Inverted Index Pruning • Example: d1 = {s1:5, s2:4, s3:4}, |d1| = 13; d2 = {s1:8, s2:4}, |d2| = 12; d3 = {s1:4, s2:5, s3:5}, |d3| = 14; τ = 0.8; δ1 = 0; δ2 = |d1| − |d3| = −1 [Figure labels: Spot Signature, partition (×3), Inverted Index Pruning, Jaccard Similarity for Sets]
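To make the example concrete, the sketch below runs the three documents through a simple inverted index with a size-based pruning check before computing multi-set Jaccard similarity. It is a simplified illustration under stated assumptions: the function name `find_near_duplicates` is hypothetical, a pair is accepted when its similarity is at least τ, and the δ bookkeeping shown on the slide is not reproduced.

```python
from collections import defaultdict


def find_near_duplicates(docs, tau):
    """Return (doc_a, doc_b, similarity) for pairs with similarity >= tau."""
    # Inverted index: signature -> names of documents containing it.
    index = defaultdict(set)
    for name, sigs in docs.items():
        for sig in sigs:
            index[sig].add(name)
    lengths = {name: sum(sigs.values()) for name, sigs in docs.items()}

    pairs = []
    for a in sorted(docs):
        # Candidates share at least one signature with document a.
        candidates = set().union(*(index[sig] for sig in docs[a]))
        for b in sorted(candidates):
            if b <= a:
                continue  # report each unordered pair once
            # Prune: sizes alone already rule the pair out.
            if min(lengths[a], lengths[b]) / max(lengths[a], lengths[b]) < tau:
                continue
            keys = set(docs[a]) | set(docs[b])
            inter = sum(min(docs[a].get(k, 0), docs[b].get(k, 0)) for k in keys)
            union = sum(max(docs[a].get(k, 0), docs[b].get(k, 0)) for k in keys)
            sim = inter / union
            if sim >= tau:
                pairs.append((a, b, sim))
    return pairs


docs = {
    "d1": {"s1": 5, "s2": 4, "s3": 4},
    "d2": {"s1": 8, "s2": 4},
    "d3": {"s1": 4, "s2": 5, "s3": 5},
}
matches = find_near_duplicates(docs, tau=0.8)
```

For the slide's data only d1 and d3 match: their similarity is 12/15 = 0.8, while d1 and d2 reach 9/16 and d2 and d3 reach 8/18.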
Experiments • Gold Set of Near Duplicate News Articles • SpotSigs vs. Shingling • Choice of Spot Signatures • SpotSigs vs. Hashing • TREC WT10g • SpotSigs vs. Hashing
Experiments • Gold Set of Near Duplicate News Articles [Figure panels: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing]
Experiments • TREC WT10g • SpotSigs vs. Hashing
Conclusion • MAJOR CONTRIBUTION • SpotSigs provides both more robust signatures and highly efficient deduplication compared to various state-of-the-art approaches. • FUTURE WORK • Future work will focus on efficient access to disk-based index structures, as well as generalizing the bounding approach to other metrics such as Cosine.
Comments • Advantage • The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient. • Drawback • ….. • Application • Information retrieval