Detecting Phrase-Level Duplication on the World Wide Web Fetterly, Manasse, Najork Paper Presentation by: Vinay Goel
Introduction • Problem • Identify instances of “slice and dice” generation • Example • German spammer • 1 million URLs originating from a single IP address (but using many host names) • Pages changed completely on every download • Pages consisted of grammatically well-formed sentences stitched together at random
Goal • Find instances of sentence-level synthesis of web pages • More generally, find pages with an unusually large number of popular phrases
The Data • Datasets • DS1 • BFS crawl starting at www.yahoo.com • 151 million HTML pages • DS2 • Large crawl conducted by MSN search • 96 million HTML pages chosen at random
Finding Phrase Replication • Sampling • Reduce each document to a feature vector • Employ a variant of the shingling algorithm of Broder et al. • Significantly reduces the data volume
Sampling method • Replace all HTML markup by white-space • k-phrases of a document: all sequences of k consecutive words • Treat the document as a circle: the last word is followed by the first word • An n-word document has exactly n k-phrases
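The circular k-phrase definition above can be sketched in a few lines of Python. This is only an illustration: the function name `k_phrases` and the regular expression used to strip markup are assumptions, not taken from the paper.

```python
import re

def k_phrases(text, k):
    """Return the k-word phrases of a document treated as a circle of words."""
    # Crude stand-in for "replace all HTML markup by white-space".
    words = re.sub(r"<[^>]*>", " ", text).split()
    n = len(words)
    # Phrase i starts at word i and wraps around past the last word,
    # so an n-word document yields exactly n phrases.
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]

phrases = k_phrases("<p>the quick brown fox</p>", 3)
for p in phrases:
    print(" ".join(p))
```

The wrap-around is what makes the phrase count equal to the word count: the last two phrases here start at "brown" and "fox" and continue with "the".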
Sampling method • Exploit properties of Rabin fingerprints • Rabin fingerprints support efficient extension and prefix deletion • Fingerprints of distinct bit patterns are distinct with high probability
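To make "efficient extension and prefix deletion" concrete, here is a minimal sketch using a polynomial rolling hash modulo a prime. This is a stand-in with the same two operations, not a true Rabin fingerprint (which works with polynomials over GF(2)); `MOD`, `BASE`, and all function names are assumptions chosen for illustration.

```python
MOD = (1 << 61) - 1   # assumed modulus (a Mersenne prime)
BASE = 1_000_003      # assumed base

def fp(tokens):
    """Fingerprint a whole token sequence from scratch."""
    h = 0
    for t in tokens:
        h = (h * BASE + t) % MOD
    return h

def extend(h, t):
    """Fingerprint of the sequence with token t appended."""
    return (h * BASE + t) % MOD

def delete_prefix(h, first, length):
    """Fingerprint of a `length`-token sequence with its first token removed."""
    return (h - first * pow(BASE, length - 1, MOD)) % MOD

def sliding_fingerprints(tokens, k):
    """Fingerprints of every window of k consecutive tokens (no wrap-around)."""
    h = fp(tokens[:k])
    out = [h]
    for i in range(k, len(tokens)):
        h = extend(delete_prefix(h, tokens[i - k], k), tokens[i])
        out.append(h)
    return out

tokens = [5, 17, 3, 42, 8, 23]
print(sliding_fingerprints(tokens, 3))
```

Each window after the first costs one deletion and one extension rather than a full recomputation, which is the property that makes fingerprinting every k-phrase of a large crawl affordable.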
Computing feature vectors • Fingerprint each word in the document - gives n tokens • Compute fingerprint of each k-token phrase - gives n phrase fingerprints • Apply m different fingerprint functions • Retain the smallest of the n resulting values for each function • Vector of m fingerprints representative of document (elements referred to as shingles)
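The steps above can be sketched as follows. Salted BLAKE2b from Python's `hashlib` stands in for the fingerprint functions, which is an assumption made for this sketch; the helper `h64`, the function `feature_vector`, and the values `k=3`, `m=4` in the usage line are likewise illustrative choices.

```python
import hashlib

def h64(data: bytes, salt: int = 0) -> int:
    """64-bit hash; different salts act as different fingerprint functions."""
    digest = hashlib.blake2b(data, digest_size=8,
                             salt=salt.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def feature_vector(words, k, m):
    """Reduce a non-empty word list to a vector of m shingles."""
    n = len(words)
    # Step 1: fingerprint each word, giving n tokens.
    tokens = [h64(w.encode("utf-8")) for w in words]
    # Step 2: fingerprint each k-token phrase (circular), giving n values.
    phrase_fps = []
    for i in range(n):
        phrase = b"".join(tokens[(i + j) % n].to_bytes(8, "big") for j in range(k))
        phrase_fps.append(h64(phrase))
    # Steps 3-4: apply m functions and keep the smallest value under each.
    return [min(h64(p.to_bytes(8, "big"), salt=f + 1) for p in phrase_fps)
            for f in range(m)]

words = "pages consisted of well formed sentences stitched together".split()
vec = feature_vector(words, k=3, m=4)
print(vec)
```

Because each element is a minimum over all phrases, two documents that share most of their phrases are likely to agree in many vector positions, while the vector itself stays at a fixed size of m values regardless of document length.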
Duplicate Suppression • Replication rampant on the web • Clustered all pages in data set into equivalence classes • Each class contains all pages that are exact or near duplicates of one another
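One way to form such equivalence classes is union-find over pairs of feature vectors. The sketch below is an assumption-laden illustration: the rule "near duplicate if at least `threshold` vector positions agree" and the quadratic pairwise comparison are placeholders chosen for clarity, not the criterion or the scalable procedure used in the paper.

```python
def cluster(vectors, threshold):
    """Group page indices into classes of exact or near duplicates."""
    parent = list(range(len(vectors)))

    def find(x):
        # Follow parent links to the class representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(len(vectors)):
        for b in range(a + 1, len(vectors)):
            agree = sum(x == y for x, y in zip(vectors[a], vectors[b]))
            if agree >= threshold:          # assumed near-duplicate rule
                parent[find(a)] = find(b)   # merge the two classes

    classes = {}
    for i in range(len(vectors)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())

vectors = [[1, 2, 3, 4], [1, 2, 3, 9], [7, 8, 9, 9], [1, 2, 8, 9]]
print(cluster(vectors, threshold=3))
```

Pages 0 and 3 agree in only two positions, yet they land in the same class because both are near duplicates of page 1; treating the relation as transitive is what turns pairwise matches into equivalence classes.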
Popular phrases • Occur in more documents than would be expected by chance • Assumptions: • “Normal” web pages characterized by a generative model • Sought web pages - copying model (need to consider number of phrases, length of typical documents…)
Popular Phrases • Limit attention to the shingles chosen by the sampling functions • A phrase is popular if it is selected as a shingle in sufficiently many documents • To determine popular phrases, consider triplets (i,s,d): sampling function i selects shingle s in document d
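A small sketch of the counting step, assuming each triplet (i, s, d) records that sampling function i selected shingle s in document d. The cut-off `min_docs` and the function name are hypothetical; the slides do not say what "sufficiently many" means.

```python
from collections import defaultdict

def popular_shingles(triplets, min_docs):
    """Return ((i, s), document_count) pairs, most popular first."""
    docs = defaultdict(set)
    for i, s, d in triplets:
        docs[(i, s)].add(d)   # a set, so repeated triplets count once
    counts = {key: len(ds) for key, ds in docs.items() if len(ds) >= min_docs}
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

triplets = [
    (0, "a", 1), (0, "a", 2), (0, "a", 3), (0, "a", 3),
    (1, "b", 1), (1, "b", 2),
    (0, "c", 3),
]
print(popular_shingles(triplets, min_docs=2))
```

On a real crawl the same grouping would more plausibly be done by sorting the triplets on (i, s) and scanning runs, since the data would not fit in an in-memory dictionary.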
Popular Phrases • The first 24 most popular phrases are not very interesting • Starting from the 36th phrase, discover phrases caused by machine-generated content • Templatic form: common text, “fill in the blank” slots and optional • 60th phrase - an instance of an idiomatic phrase
Covering set • Covering sets for shingles of each page • Approximate a minimum covering set using a greedy heuristic
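The greedy heuristic can be sketched as the standard greedy set-cover loop: repeatedly pick the candidate that covers the most still-uncovered shingles. The data layout (a dict from a hypothetical page name to its set of shingles) and the function name are assumptions for illustration.

```python
def greedy_cover(target, candidates):
    """Approximate a minimum set of candidate pages covering `target` shingles.

    Returns (chosen page names, shingles that no candidate covers).
    """
    uncovered = set(target)
    chosen = []
    while uncovered:
        # Candidate with the largest gain; ties go to the first name in sorted order.
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & uncovered),
                   default=None)
        if best is None or not (candidates[best] & uncovered):
            break   # nothing left that helps
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered

page_shingles = {1, 2, 3, 4, 5}
others = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {9}}
print(greedy_cover(page_shingles, others))
```

Finding a true minimum cover is NP-hard, which is why an approximation is used; the greedy choice is simple and comes with a logarithmic approximation guarantee.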
Conclusion • Power law distribution • Popular phrases • Often limited by design choices • Legal disclaimers • Navigational phrases • “fill in the blanks” • More replicated than original content