Detecting Phrase-Level Duplication on the World Wide Web

Detecting Phrase-Level Duplication on the World Wide Web • Fetterly, Manasse, Najork • Paper Presentation by: Vinay Goel

Introduction • Problem • Identify instances of “slice and dice” generation • Example • German spammer • 1 million URLs originating from a single IP address (but using many host names) • Pages changed completely on every download • Pages consisted of grammatically well-formed sentences stitched together at random

Goal • Find instances of sentence-level synthesis of web pages • More generally, find pages with an unusually large number of popular phrases

The Data • Datasets • DS1 • BFS crawl starting at www.yahoo.com • 151 million HTML pages • DS2 • Large crawl conducted by MSN search • 96 million HTML pages chosen at random

Finding Phrase Replication • Sampling • Reduce each document to a feature vector • Employ a variant of the shingling algorithm of Broder et al. • Significantly reduces the data volume

Sampling method • Replace all HTML markup with white-space • k-phrases of a document: all sequences of k consecutive words • Treat the document as a circle: the last word is followed by the first word • An n-word document has exactly n k-phrases
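
The circular k-phrase definition above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name k_phrases and the simple regular expression used to strip markup are assumptions, not taken from the paper.

```python
import re


def k_phrases(html: str, k: int) -> list[tuple[str, ...]]:
    """Return the k-phrases of a document, treating it as a circle."""
    # Replace anything that looks like an HTML tag with white-space
    # (a crude stand-in for a real HTML parser).
    text = re.sub(r"<[^>]*>", " ", html)
    words = text.split()
    n = len(words)
    if n == 0:
        return []
    # One phrase starts at every word; indices wrap around the end,
    # so an n-word document yields exactly n phrases.
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]
```

For a three-word document and k = 2, this yields three phrases, the last of which wraps from the final word back to the first.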

Sampling method • Exploit properties of Rabin fingerprints • Rabin fingerprints support efficient extension and prefix deletion • Fingerprints of distinct bit patterns are distinct with high probability
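
To illustrate why extension and prefix deletion matter, here is a minimal sketch of a sliding-window hash. It uses a polynomial rolling hash modulo a prime as a stand-in, not true Rabin fingerprints (which work with polynomials over GF(2)). The constants MOD and BASE and all function names are assumptions chosen for illustration.

```python
MOD = (1 << 61) - 1   # a large prime modulus (assumed for this sketch)
BASE = 1_000_003      # arbitrary base (assumed for this sketch)


def extend(h: int, token: int) -> int:
    """Append one token to the hashed sequence."""
    return (h * BASE + token) % MOD


def drop_prefix(h: int, token: int, length: int) -> int:
    """Remove the first token from a hashed sequence of the given length."""
    return (h - token * pow(BASE, length - 1, MOD)) % MOD


def window_hashes(tokens: list[int], k: int) -> list[int]:
    """Hash every window of k consecutive tokens in O(n) hash updates."""
    if k <= 0 or len(tokens) < k:
        return []
    h = 0
    for t in tokens[:k]:
        h = extend(h, t)
    out = [h]
    for i in range(k, len(tokens)):
        h = drop_prefix(h, tokens[i - k], k)  # slide: drop the oldest token
        h = extend(h, tokens[i])              # slide: add the newest token
        out.append(h)
    return out
```

Each step reuses the previous window's hash, so the cost per phrase does not grow with k.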

Computing feature vectors • Fingerprint each word in the document - gives n tokens • Compute fingerprint of each k-token phrase - gives n phrase fingerprints • Apply m different fingerprint functions • Retain the smallest of the n resulting values for each function • Vector of m fingerprints representative of document (elements referred to as shingles)
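
The min-value sampling step above can be sketched as follows. This is a simplified illustration under stated assumptions: it hashes each phrase's text directly instead of first fingerprinting individual words, and it uses salted BLAKE2b from Python's hashlib as a stand-in for the m fingerprint functions. The names fingerprint and feature_vector are hypothetical.

```python
import hashlib


def fingerprint(data: bytes, i: int) -> int:
    """i-th hash function: BLAKE2b salted with i (a stand-in, not Rabin)."""
    digest = hashlib.blake2b(
        data, digest_size=8, salt=i.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def feature_vector(words: list[str], k: int, m: int) -> list[int]:
    """Return m shingles: the minimum phrase hash under each of m functions."""
    n = len(words)
    if n == 0:
        return []
    # Circular k-phrases: one phrase per starting word.
    phrases = [
        " ".join(words[(p + j) % n] for j in range(k)).encode("utf-8")
        for p in range(n)
    ]
    # For each function, keep only the smallest of the n values.
    return [min(fingerprint(ph, i) for ph in phrases) for i in range(m)]
```

Two documents that share many phrases are likely to agree in many vector positions, which is what makes the short vector a useful summary of the full document.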

Duplicate Suppression • Replication rampant on the web • Clustered all pages in data set into equivalence classes • Each class contains all pages that are exact or near duplicates of one another

Popular phrases • Occur in more documents than would be expected by chance • Assumptions: • “Normal” web pages characterized by a generative model • Sought web pages - copying model (need to consider number of phrases, length of typical documents…)

Popular Phrases • Limit attention to the shingles chosen by sampling functions • Phrase is popular if selected as shingle in sufficiently many documents • To determine popular phrases, consider triplets (i,s,d)
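
One way to count popularity from such triplets is sketched below. It assumes that i is the index of the sampling function, s the shingle it selected, and d the document; the slide does not spell this out. The function name popular_shingles and the threshold parameter are hypothetical.

```python
from collections import defaultdict


def popular_shingles(
    vectors: dict[str, list[int]], threshold: int
) -> dict[tuple[int, int], int]:
    """Map (i, s) to its document count, keeping only popular pairs."""
    docs_by_key: dict[tuple[int, int], set[str]] = defaultdict(set)
    for d, vec in vectors.items():
        for i, s in enumerate(vec):
            # Each (i, s, d) triplet records that function i chose
            # shingle s in document d.
            docs_by_key[(i, s)].add(d)
    return {
        key: len(docs)
        for key, docs in docs_by_key.items()
        if len(docs) >= threshold
    }
```

At web scale the same grouping would be done by sorting the triplets on (i, s) rather than holding them in memory; the dictionary here is only for readability.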

Popular Phrases • First 24 most popular phrases not very interesting • Starting from the 36th phrase, discover phrases caused by machine generated content • Templatic form: common text, “fill in the blank” slots and optional • 60th phrase - instance of idiomatic phrase

Zipfian Distribution

Histogram of popular shingles per doc

Covering set • Covering sets for shingles of each page • Approximate a minimum covering set using a greedy heuristic
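
The greedy heuristic can be sketched as the standard greedy set-cover approximation: repeatedly pick the page that covers the most still-uncovered shingles. The data layout (a set of shingles per page) and the name greedy_cover are assumptions for illustration, not the paper's implementation.

```python
def greedy_cover(
    target: set[int], candidates: dict[str, set[int]]
) -> tuple[list[str], set[int]]:
    """Approximate a minimum set of pages whose shingles cover `target`.

    Returns the chosen pages and any shingles that no candidate covers.
    """
    uncovered = set(target)
    chosen: list[str] = []
    while uncovered:
        # Pick the candidate page covering the most uncovered shingles.
        best = max(
            candidates,
            key=lambda d: len(candidates[d] & uncovered),
            default=None,
        )
        if best is None or not (candidates[best] & uncovered):
            break  # nothing left can cover the remaining shingles
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered
```

A page whose shingles are covered by a small number of other pages shares large blocks with them, while a page that needs many other pages to cover it is more likely stitched together from many sources.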

Distribution of covering set sizes

German spammer

Looking for likely sources

Conclusion • Power law distribution • Popular phrases • Often limited by design choices • Legal disclaimers • Navigational phrases • “fill in the blanks” • More replicated than original content
