Detecting Near Duplicates for Web Crawling • Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma • Presented by Chintan Udeshi Udeshi-CS572
Introduction • There are many duplicate documents on the web. • Many pages differ only in a small portion, for example because of the advertisements displayed. • Such differences are irrelevant from a crawling point of view. • This paper uses Charikar's fingerprinting technique (simhash) to find near-duplicate documents. • The technique is useful for both online queries and batch queries.
Advantages of duplicate detection • Saves bandwidth • Reduces storage cost • Improves search engine quality • Reduces load on remote hosts
Challenges of duplicate detection • Scale • Speed • Using few resources
FINGERPRINTING WITH SIMHASH • Extract a set of features from a document, along with a weight for each feature. • We use simhash to generate an f-bit fingerprint from the weighted features of a given document. • With simhash, 64-bit fingerprints are good enough for 8B web pages.
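A minimal sketch of how a simhash fingerprint could be computed from weighted features. The feature hash (SHA-256 truncated to f bits) and the name `simhash` are assumptions made for illustration; the slides do not say which hash function is used.

```python
import hashlib


def simhash(features, f=64):
    """Return an f-bit fingerprint for a dict mapping feature -> weight.

    Assumption: each feature is hashed with SHA-256 and truncated to f bits
    (so f must be at most 256); any well-mixed hash would do.
    """
    v = [0.0] * f  # one running total per bit position
    for feature, weight in features.items():
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << f) - 1)
        for i in range(f):
            # A 1-bit adds the feature's weight, a 0-bit subtracts it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:  # the sign of each total decides the fingerprint bit
            fingerprint |= 1 << i
    return fingerprint
```

Because every bit is decided by a weighted vote, changing a few low-weight features flips only a few bits, which is why similar documents end up with similar fingerprints.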
Idea behind using the simhash algorithm • Simhash has 2 properties: • A: The fingerprint of a document is a hash of its features. • B: Similar documents have similar hash values. • Our algorithms are designed assuming that Property A holds, and we experimentally measure the impact of the non-uniformity introduced by Property B on real datasets.
Hamming distance problem • Consider a collection of 8B 64-bit fingerprints, occupying 64GB. • Given a query fingerprint F, we have to decide whether any of the existing 8B fingerprints differs from F in at most k = 3 bit positions. • The algorithm is different for online queries and batch queries.
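The problem statement can be made concrete with a brute-force baseline. This is only an illustrative sketch (the function names are hypothetical): a linear scan like this is far too slow for 8B fingerprints, which is the motivation for the specialised online and batch algorithms.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def has_near_duplicate(fingerprints, query, k=3):
    """True if any stored fingerprint differs from query in at most k bits."""
    return any(hamming_distance(fp, query) <= k for fp in fingerprints)
```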
Algorithm for online queries • We build t tables: T1, T2, ..., Tt. • Table Ti has an associated permutation and an integer Pi; it is constructed by applying the permutation to each existing fingerprint. • A query for fingerprint F takes 2 steps per table: • Identify all permuted fingerprints in Ti whose top Pi bit positions match the top Pi bit positions of the permuted F. • For each fingerprint identified in the first step, check whether it differs from the permuted F in at most k bit positions.
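The two steps can be sketched for one simple configuration: f = 64, four 16-bit blocks, and one table per block, where the permutation is a rotation that moves that block to the top. With k = 3, at most three blocks can contain a differing bit, so at least one block matches exactly and the corresponding table finds the fingerprint. The dictionary lookup and all names below are assumptions chosen for brevity, not the paper's data layout.

```python
F = 64                      # fingerprint width in bits
BLOCKS = 4                  # number of tables (one per block)
BLOCK_BITS = F // BLOCKS    # top bits matched exactly in each table
MASK = (1 << F) - 1


def rotate_left(x, r):
    """Rotate an F-bit value left by r bits (the per-table permutation)."""
    r %= F
    return ((x << r) | (x >> (F - r))) & MASK


def top_bits(x):
    return x >> (F - BLOCK_BITS)


def build_tables(fingerprints):
    """Table i maps the top bits of each rotated fingerprint to candidates."""
    tables = []
    for i in range(BLOCKS):
        table = {}
        for fp in fingerprints:
            permuted = rotate_left(fp, i * BLOCK_BITS)
            table.setdefault(top_bits(permuted), []).append(permuted)
        tables.append(table)
    return tables


def query(tables, q, k=3):
    """Return stored fingerprints within Hamming distance k of q (k < BLOCKS)."""
    matches = set()
    for i, table in enumerate(tables):
        pq = rotate_left(q, i * BLOCK_BITS)
        # Step 1: candidates whose top bits equal those of the permuted query.
        for cand in table.get(top_bits(pq), []):
            # Step 2: full Hamming-distance check on each candidate.
            if bin(cand ^ pq).count("1") <= k:
                matches.add(rotate_left(cand, F - i * BLOCK_BITS))  # undo rotation
    return matches
```

A rotation preserves Hamming distance, so the check in step 2 can be done directly on the permuted values.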
Design parameters for the algorithm • There is a trade-off between the number of tables and the value of Pi (the number of top bit positions matched in table Ti) chosen for each table. • Increasing the number of tables allows a larger Pi and hence reduces the query time, at the cost of more storage. • Decreasing the number of tables reduces storage requirements, but reduces Pi and thus increases the query time.
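The trade-off can be put into numbers with a small calculation. This sketch assumes one particular family of designs that the slides do not spell out: the fingerprint is split into equal blocks, each table matches all but k of the blocks exactly, and fingerprints are uniformly distributed. The function name and parameters are hypothetical.

```python
from math import comb


def table_design(n_fingerprints, f=64, k=3, blocks=4):
    """Return (number of tables, matched bits p, expected candidates per probe).

    Assumes f bits are split into equal blocks and each table matches the
    blocks left over after setting aside k of them.
    """
    if blocks <= k:
        raise ValueError("need more blocks than k")
    tables = comb(blocks, k)              # ways to choose the k unmatched blocks
    p = f * (blocks - k) // blocks        # bits matched exactly per table
    expected = n_fingerprints / 2 ** p    # candidates left for the distance check
    return tables, p, expected
```

For 2**34 fingerprints, `table_design(2**34, blocks=4)` gives 4 tables with p = 16 and about 262144 candidates per probe, while `table_design(2**34, blocks=6)` gives 20 tables with p = 32 and about 4 candidates per probe: more tables cost more storage but leave far less work per query.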
Algorithm for batch queries • Files are first broken into 64 MB chunks. • Each chunk is replicated at three randomly chosen machines in a cluster. • Each chunk is stored as a file in the local file system. • First, we solve the Hamming distance problem for each 64 MB chunk. • Later, we combine the outputs from all the chunks to produce the final output.
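The chunk-then-combine structure can be sketched on a single machine. Everything here is a simplification made for illustration: chunks are measured in number of fingerprints rather than 64 MB, the chunks are processed one after another instead of on separate machines, and each chunk is scanned by brute force.

```python
def chunked(seq, size):
    """Yield consecutive slices of seq holding at most size items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def solve_chunk(chunk, queries, k=3):
    """First phase: solve the Hamming distance problem for one chunk."""
    return [(q, fp) for fp in chunk for q in queries
            if bin(fp ^ q).count("1") <= k]


def batch_near_duplicates(existing, queries, chunk_size, k=3):
    """Second phase: combine the per-chunk outputs into one result."""
    results = {}
    for chunk in chunked(existing, chunk_size):
        # In a cluster, each call below could run on a different machine.
        for q, fp in solve_chunk(chunk, queries, k):
            results.setdefault(q, set()).add(fp)
    return results
```

Since the chunks are independent, the first phase parallelises trivially; only the final merge needs to see all the outputs.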
Broder's shingle-based fingerprints • Broder's shingle-based fingerprints use Rabin fingerprints. • Given an n-bit message m0, ..., mn-1, view it as the polynomial f(x) = m0 + m1 x + ... + mn-1 x^(n-1) over GF(2). For an irreducible polynomial p(x), the fingerprint of m is the remainder r(x) after division of f(x) by p(x).
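The division step can be sketched with integers used as bit vectors, where bit i holds the coefficient of x^i. The default modulus x^8 + x^4 + x^3 + x + 1 (0x11B) is a known irreducible polynomial picked only to keep the example small; a practical scheme would choose a random irreducible polynomial of much higher degree. The little-endian mapping from bytes to bits is also an assumption.

```python
def gf2_mod(f, p):
    """Remainder of f(x) divided by p(x), with coefficients in GF(2)."""
    if p < 2:
        raise ValueError("p(x) must have degree at least 1")
    deg_p = p.bit_length() - 1
    while f.bit_length() - 1 >= deg_p:
        shift = (f.bit_length() - 1) - deg_p
        f ^= p << shift  # subtracting in GF(2) is XOR; clears the leading term
    return f


def rabin_fingerprint(message, p=0x11B):
    """Fingerprint of a bytes message: f(x) mod p(x).

    Assumption: bit i of the little-endian integer is message bit m_i.
    """
    f = int.from_bytes(message, "little")
    return gf2_mod(f, p)
```

The remainder always has degree below that of p(x), so a degree-d modulus yields a d-bit fingerprint.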
Comparison with Broder's shingle-based fingerprints • For the comparison, 6 Rabin fingerprints are calculated per document. • Two documents are then checked to see whether 2 or more of these fingerprints match. • The shingle-based fingerprints take approximately 24 bytes per document. • Simhash, on the other hand, needs only 64 bits (8 bytes) per document for 8B web pages.
Experimental results • There is a trade-off between f and k when detecting duplicate web pages using simhash. • Topics include: • Choice of parameters • Distribution of fingerprints • Scalability
Choice of parameters • Vary k from 1 to 10. • Divide the results into categories: • False positive • True positive • Unknown • There is a trade-off: a small k misses near-duplicates, while a large k reports pages that are not near-duplicates. • k = 3 gives reasonable results for 64-bit fingerprints.
Distribution of fingerprints (1) • The left side of the distribution does not drop off as rapidly as the right side. • This is because some pages are similar to each other. • So, their fingerprints differ by a moderate number of bits.
Distribution of fingerprints (2) • More or less uniform, with spikes in some places. • Reasons for the spikes: • Empty pages. • "File not found" pages. • Multiple websites using a similar login page.
Nature of the corpus • Near-duplicate detection applies mainly to 4 types of documents: • Web pages • Files in a file system • E-mail • Domain-specific corpora • This paper mainly involves finding near duplicates for web pages.
Scalability • For batch mode, a compressed version of file Q occupies almost 32GB. • Chunks can be scanned at a combined rate of approximately 1 GBps. • So, the computation usually finishes within about 100 seconds.
Need to detect duplicates • Web mirrors • Clustering for related-documents queries • Data extraction • Plagiarism • Spam detection • Duplicates in domain-specific corpora
Feature sets per document • Shingles from page content • Document vector from page content • Connectivity information • Anchor text and anchor window • Phrases
Future research • Can we classify web pages into categories and search for near duplicates only within the relevant categories? • Is it feasible to devise algorithms for detecting the portions of web pages that contain ads or timestamps? • How sensitive is the simhash algorithm to changes in feature selection and in the assignment of weights to features? • Algorithms for clustering the documents. • Can we categorize documents based on language?
Thank you. Q & A?