Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
Author: Monika Henzinger
Presenter: Chao Yan
Overview
• Two near-duplicate detection algorithms (Broder’s and Charikar’s) are compared at very large scale (1.6 billion distinct web pages)
• Goal: understand the pros and cons of each algorithm in different situations
• Goal: find a new approach that yields better near-duplicate detection results
Relation to course material
• Discusses in more detail the two algorithms introduced in lecture, and draws conclusions by comparing their experimental results
• Broder’s algorithm is essentially the minhashing algorithm discussed in lecture; the paper goes further and computes supershingles from the minvalue vector
• Both algorithms follow the general paradigm for finding near-duplicates: generate and compare a signature for each file
Broder’s Algorithm
• Begin by preprocessing away HTML tags and URLs in each document (the same preprocessing is used for Charikar’s algorithm)
• Apply m fingerprint functions to the shingle sequence and take the minimum value under each, yielding m minvalues
Broder’s Algorithm
• Divide the m minvalues into m’ groups of l elements each, e.g. m = 84, m’ = 6, l = 14
• Concatenate the minvalues in each group, reducing the vector from m entries to m’ entries
• Fingerprint each of the m’ entries to obtain an m’-dimensional vector, the supershingle vector (sketched below)
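A minimal Python sketch of this pipeline, assuming k-token shingles and a salted hash (`fingerprint` below) as a stand-in for the Rabin fingerprints used in the paper; the helper names are hypothetical:

```python
import hashlib

def fingerprint(data: str, salt: int) -> int:
    """64-bit fingerprint; a stand-in for the paper's Rabin fingerprints."""
    h = hashlib.blake2b(data.encode(), digest_size=8,
                        salt=salt.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")

def supershingles(tokens, k=8, m=84, m_prime=6):
    """Compute the m'-dimensional supershingle vector (m = m' * l minvalues).
    Assumes len(tokens) >= k so the shingle list is non-empty."""
    l = m // m_prime                              # group size, e.g. 84 / 6 = 14
    shingles = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    # One minvalue per fingerprint function: smallest value over all shingles.
    minvalues = [min(fingerprint(s, j) for s in shingles) for j in range(m)]
    # Concatenate each group of l minvalues and fingerprint the concatenation.
    return [fingerprint("".join(str(v) for v in minvalues[g * l:(g + 1) * l]), m)
            for g in range(m_prime)]
```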
B-Similarity
• Definition: the number of identical entries in the supershingle vectors of two pages
• Two pages are near-duplicates iff their B-similarity is at least 2, e.g. with m’ = 6, pairs with at least 2 agreeing entries are near-duplicates
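Continuing the sketch above, B-similarity is a simple count of agreeing entries:

```python
def b_similarity(ss_a, ss_b):
    """Number of identical entries in two supershingle vectors."""
    return sum(a == b for a, b in zip(ss_a, ss_b))

def is_b_near_duplicate(ss_a, ss_b, threshold=2):
    # Near-duplicates iff at least 2 of the m' = 6 entries agree.
    return b_similarity(ss_a, ss_b) >= threshold
```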
Charikar’s Algorithm
• Extract a set of features (meaningful tokens) from a web page; each feature is tagged with a weight
• Project each feature (token) to a b-bit vector whose entries take values in {-1, 1}
Charikar’s Algorithm
• Sum the b-bit projections of all tokens, each multiplied by its weight, to form a new b-dimensional vector
• Generate the final b-dimensional vector by setting each positive entry to 1 and each non-positive entry to 0 (see the sketch below)
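A compact sketch of this projection, assuming b = 384 and deriving each token's {-1, 1} vector from a salted SHA-256 bit stream; the paper does not specify the projection hash, so that choice is an assumption:

```python
import hashlib

def token_projection(token: str, b: int = 384):
    """Deterministically map a token to a b-dimensional {-1, +1} vector."""
    bits, counter = [], 0
    while len(bits) < b:
        digest = hashlib.sha256(f"{token}:{counter}".encode()).digest()
        bits.extend((byte >> i) & 1 for byte in digest for i in range(8))
        counter += 1
    return [1 if bit else -1 for bit in bits[:b]]

def simhash(weighted_tokens, b: int = 384):
    """weighted_tokens: iterable of (token, weight). Returns a b-bit 0/1 vector."""
    acc = [0.0] * b
    for token, weight in weighted_tokens:
        for i, v in enumerate(token_projection(token, b)):
            acc[i] += weight * v              # weighted sum of +/-1 projections
    return [1 if x > 0 else 0 for x in acc]   # positive -> 1, non-positive -> 0
```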
C-Similarity
• Definition: the C-similarity of two pages is the number of bits on which their final projections agree
• Two pages are near-duplicates iff the number of agreeing bits in their projections lies above a fixed threshold, e.g. b = 384, threshold = 372
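The corresponding check, a minimal sketch using the slide's example values (b = 384, threshold = 372):

```python
def c_similarity(v_a, v_b):
    """Number of bit positions on which two final projections agree."""
    return sum(a == b for a, b in zip(v_a, v_b))

def is_c_near_duplicate(v_a, v_b, threshold=372):
    # With b = 384, pages are near-duplicates iff at least 372 bits agree.
    return c_similarity(v_a, v_b) >= threshold
```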
Comparison of the two algorithms
• [Table comparing the time and space costs of the two algorithms; the table itself did not survive the transcript]
• Note: T is the total number of tokens in all web pages, D is the number of web pages
Comparison of experiment results
• Construct a similarity graph in which every page is a node and every edge denotes a near-duplicate pair (see the sketch below)
• A node is considered a near-duplicate page iff it is incident to at least one edge
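A small sketch of this bookkeeping, assuming the near-duplicate pairs have already been produced by one of the algorithms above:

```python
from collections import defaultdict

def near_duplicate_pages(near_dup_pairs):
    """Build the similarity graph; every node present is incident to at least
    one edge, so the graph's vertex set is exactly the near-duplicate pages."""
    graph = defaultdict(set)
    for a, b in near_dup_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return set(graph)
```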
Comparison of experiment results
• [Figure: degree distributions of the similarity graph on a log-log scale, one panel for B-similarity and one for C-similarity]
Comparison of experiment results
• Precision measurement
• Precision on pairs from the same site is low because pages on the same site very often share the same boilerplate text and differ only in the main item in the center of the page
Comparison of experiment results
• Term differences in the two algorithms
Comparison of experiment results
• [Figure: distribution of term differences between near-duplicate pairs, one panel for Broder’s algorithm and one for Charikar’s algorithm]
Comparison of experiment results
• Error cases: [examples not recoverable from the transcript]
A combined algorithm
• First use Broder’s algorithm to compute all B-similar pairs, then use Charikar’s algorithm to filter out pairs whose C-similarity falls below a certain threshold (sketched below)
• Rationale: false positives of Broder’s algorithm (pairs with consecutive term differences inside large shared boilerplate) can be filtered out by Charikar’s algorithm
• Overall precision improves to 0.79
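A sketch of the combination, reusing the hypothetical helpers from the earlier sketches; the brute-force candidate loop stands in for a real sorting-based pair generation, and the threshold value is illustrative:

```python
def combined_near_duplicates(pages, c_threshold=372):
    """pages: dict page_id -> (tokens, weighted_tokens). Returns filtered pairs."""
    # Step 1: Broder's algorithm proposes candidates (all B-similar pairs).
    # A production version would sort supershingles instead of testing all pairs.
    ss = {pid: supershingles(tokens) for pid, (tokens, _) in pages.items()}
    ids = sorted(ss)
    candidates = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
                  if is_b_near_duplicate(ss[a], ss[b])]
    # Step 2: compute Charikar fingerprints only for pages that occur in a
    # candidate pair ("on the fly"), then filter by the C-similarity threshold.
    needed = {p for pair in candidates for p in pair}
    sh = {pid: simhash(pages[pid][1]) for pid in needed}
    return [(a, b) for a, b in candidates
            if c_similarity(sh[a], sh[b]) >= c_threshold]
```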
Pros
• The experiment is persuasive and reliable enough to support conclusions about the pros and cons of the two algorithms, e.g. large data samples, human evaluation, error-case analysis
• The combined approach inherits the advantages of both algorithms and avoids large numbers of false positives
• In the combined approach, Charikar’s algorithm is computed on the fly, which saves much space
Cons
• The experiment focuses on the precision of the two algorithms but does not gather statistics on recall
• The combined algorithm adds time overhead, because confirming a near-duplicate pair requires running both algorithms
Improvement
• Make Charikar’s algorithm sensitive to token order by using shingling (see the sketch below)
• Make Broder’s algorithm sensitive to token frequency by weighting shingles by their frequency
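As an illustration of the first idea only, a hedged sketch that feeds k-token shingles instead of single tokens into the simhash projection from the earlier sketch, so that token order affects the final bits; the uniform weight is an assumption:

```python
def simhash_over_shingles(tokens, k=4, b=384):
    """Charikar's projection computed over k-token shingles instead of tokens,
    so reordering tokens changes the shingle set and hence the fingerprint."""
    shingles = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    # Assumed uniform weight 1.0 per shingle; frequency weights would also work.
    return simhash((s, 1.0) for s in shingles), b
```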