Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
Author: Monika Henzinger
Presenter: Chao Yan
Overview
• Two near-duplicate detection algorithms (Broder’s and Charikar’s) are compared at very large scale (1.6 billion distinct web pages)
• Goal: understand the pros and cons of each algorithm in different situations
• Goal: find a new approach that yields better near-duplicate detection results
Relation to course material
• Discusses in more detail the two algorithms introduced in lecture, and draws conclusions by comparing their experimental results
• Broder’s algorithm is essentially the minhashing algorithm discussed in lecture; the paper goes further and computes supershingles from the minvalue vector
• Both algorithms follow the general paradigm for finding near-duplicates: generate and compare a signature for each file
Broder’s Algorithm
• Begin by preprocessing away HTML tags and URLs in each document (the same preprocessing is used for Charikar’s algorithm)
• Apply m fingerprint functions to the shingle sequence and take the minimum value under each, yielding m minvalues
Broder’s Algorithm
• Divide the m minvalues into m’ groups of l elements each, e.g. m = 84, m’ = 6, l = 14
• Concatenate the minvalues in each group, reducing the vector from m entries to m’ entries
• Fingerprint each of the m’ entries to obtain an m’-dimensional vector, the supershingle vector (sketched below)
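A minimal Python sketch of this pipeline, assuming k-token shingles and a salted hash (`fingerprint` below) as a stand-in for the Rabin fingerprints used in the paper; the helper names are hypothetical:

```python
import hashlib

def fingerprint(data: str, salt: int) -> int:
    """64-bit fingerprint; a stand-in for the paper's Rabin fingerprints."""
    h = hashlib.blake2b(data.encode(), digest_size=8,
                        salt=salt.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")

def supershingles(tokens, k=8, m=84, m_prime=6):
    """Compute the m'-dimensional supershingle vector (m = m' * l minvalues).
    Assumes len(tokens) >= k so the shingle list is non-empty."""
    l = m // m_prime                              # group size, e.g. 84 / 6 = 14
    shingles = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    # One minvalue per fingerprint function: smallest value over all shingles.
    minvalues = [min(fingerprint(s, j) for s in shingles) for j in range(m)]
    # Concatenate each group of l minvalues and fingerprint the concatenation.
    return [fingerprint("".join(str(v) for v in minvalues[g * l:(g + 1) * l]), m)
            for g in range(m_prime)]
```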
B-Similarity
• Definition: the number of identical entries in the supershingle vectors of two pages
• Two pages are near-duplicates iff their B-similarity is at least 2, e.g. with m’ = 6, pairs with at least 2 agreeing entries are near-duplicates
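Continuing the sketch above, B-similarity is a simple count of agreeing entries:

```python
def b_similarity(ss_a, ss_b):
    """Number of identical entries in two supershingle vectors."""
    return sum(a == b for a, b in zip(ss_a, ss_b))

def is_b_near_duplicate(ss_a, ss_b, threshold=2):
    # Near-duplicates iff at least 2 of the m' = 6 entries agree.
    return b_similarity(ss_a, ss_b) >= threshold
```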
Charikar’s Algorithm
• Extract a set of features (meaningful tokens) from a web page; each feature is tagged with a weight
• Project each feature (token) to a b-bit vector whose entries take values in {-1, 1}
Charikar’s Algorithm
• Sum the b-bit projections of all tokens, each multiplied by its weight, to form a new b-dimensional vector
• Generate the final b-dimensional vector by setting each positive entry to 1 and each non-positive entry to 0 (see the sketch below)
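A compact sketch of this projection, assuming b = 384 and deriving each token's {-1, 1} vector from a salted SHA-256 bit stream; the paper does not specify the projection hash, so that choice is an assumption:

```python
import hashlib

def token_projection(token: str, b: int = 384):
    """Deterministically map a token to a b-dimensional {-1, +1} vector."""
    bits, counter = [], 0
    while len(bits) < b:
        digest = hashlib.sha256(f"{token}:{counter}".encode()).digest()
        bits.extend((byte >> i) & 1 for byte in digest for i in range(8))
        counter += 1
    return [1 if bit else -1 for bit in bits[:b]]

def simhash(weighted_tokens, b: int = 384):
    """weighted_tokens: iterable of (token, weight). Returns a b-bit 0/1 vector."""
    acc = [0.0] * b
    for token, weight in weighted_tokens:
        for i, v in enumerate(token_projection(token, b)):
            acc[i] += weight * v              # weighted sum of +/-1 projections
    return [1 if x > 0 else 0 for x in acc]   # positive -> 1, non-positive -> 0
```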
C-Similarity
• Definition: the C-similarity of two pages is the number of bits on which their final projections agree
• Two pages are near-duplicates iff the number of agreeing bits in their projections lies above a fixed threshold, e.g. b = 384, threshold = 372
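The corresponding check, a minimal sketch using the slide's example values (b = 384, threshold = 372):

```python
def c_similarity(v_a, v_b):
    """Number of bit positions on which two final projections agree."""
    return sum(a == b for a, b in zip(v_a, v_b))

def is_c_near_duplicate(v_a, v_b, threshold=372):
    # With b = 384, pages are near-duplicates iff at least 372 bits agree.
    return c_similarity(v_a, v_b) >= threshold
```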
Comparison of the two algorithms
• [Table comparing the time and space costs of the two algorithms; the table itself did not survive the transcript]
• Note: T is the total number of tokens in all web pages, D is the number of web pages
Comparison of experiment results
• Construct a similarity graph in which every page is a node and every edge denotes a near-duplicate pair (see the sketch below)
• A node is considered a near-duplicate page iff it is incident to at least one edge
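A small sketch of this bookkeeping, assuming the near-duplicate pairs have already been produced by one of the algorithms above:

```python
from collections import defaultdict

def near_duplicate_pages(near_dup_pairs):
    """Build the similarity graph; every node present is incident to at least
    one edge, so the graph's vertex set is exactly the near-duplicate pages."""
    graph = defaultdict(set)
    for a, b in near_dup_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return set(graph)
```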
Comparison of experiment results
• [Figure: degree distributions of the similarity graph on a log-log scale, one panel for B-similarity and one for C-similarity]
Comparison of experiment results
• Precision measurement
• Precision on pairs from the same site is low because pages on the same site very often share the same boilerplate text and differ only in the main item in the center of the page
Comparison of experiment results
• Term differences in the two algorithms
Comparison of experiment results
• [Figure: distribution of term differences between near-duplicate pairs, one panel for Broder’s algorithm and one for Charikar’s algorithm]
Comparison of experiment results
• Error cases: [examples not recoverable from the transcript]
A combined algorithm
• First use Broder’s algorithm to compute all B-similar pairs, then use Charikar’s algorithm to filter out pairs whose C-similarity falls below a certain threshold (sketched below)
• Rationale: false positives of Broder’s algorithm (pairs with consecutive term differences inside large shared boilerplate) can be filtered out by Charikar’s algorithm
• Overall precision improves to 0.79
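A sketch of the combination, reusing the hypothetical helpers from the earlier sketches; the brute-force candidate loop stands in for a real sorting-based pair generation, and the threshold value is illustrative:

```python
def combined_near_duplicates(pages, c_threshold=372):
    """pages: dict page_id -> (tokens, weighted_tokens). Returns filtered pairs."""
    # Step 1: Broder's algorithm proposes candidates (all B-similar pairs).
    # A production version would sort supershingles instead of testing all pairs.
    ss = {pid: supershingles(tokens) for pid, (tokens, _) in pages.items()}
    ids = sorted(ss)
    candidates = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
                  if is_b_near_duplicate(ss[a], ss[b])]
    # Step 2: compute Charikar fingerprints only for pages that occur in a
    # candidate pair ("on the fly"), then filter by the C-similarity threshold.
    needed = {p for pair in candidates for p in pair}
    sh = {pid: simhash(pages[pid][1]) for pid in needed}
    return [(a, b) for a, b in candidates
            if c_similarity(sh[a], sh[b]) >= c_threshold]
```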
Pros
• The experiment is persuasive and reliable enough to support conclusions about the pros and cons of the two algorithms, e.g. large data samples, human evaluation, error-case analysis
• The combined approach inherits the advantages of both algorithms and avoids large numbers of false positives
• In the combined approach, Charikar’s algorithm is computed on the fly, which saves much space
Cons
• The experiment focuses on the precision of the two algorithms but does not gather statistics on recall
• The combined algorithm adds time overhead, because confirming a near-duplicate pair requires running both algorithms
Improvement
• Make Charikar’s algorithm sensitive to token order by using shingling (see the sketch below)
• Make Broder’s algorithm sensitive to token frequency by weighting shingles by their frequency
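As an illustration of the first idea only, a hedged sketch that feeds k-token shingles instead of single tokens into the simhash projection from the earlier sketch, so that token order affects the final bits; the uniform weight is an assumption:

```python
def simhash_over_shingles(tokens, k=4, b=384):
    """Charikar's projection computed over k-token shingles instead of tokens,
    so reordering tokens changes the shingle set and hence the fingerprint."""
    shingles = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    # Assumed uniform weight 1.0 per shingle; frequency weights would also work.
    return simhash((s, 1.0) for s in shingles), b
```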