
Detecting Near-Duplicates for Web Crawling

Detecting Near-Duplicates for Web Crawling. Presentation By: Fernando Arreola. Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Outline. De-duplication Goal of the Paper Why is De-duplication Important? Algorithm Experiment Related Work Tying it Back to Lecture


Presentation Transcript


  1. Detecting Near-Duplicates for Web Crawling Presentation By: Fernando Arreola Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma

  2. Outline • De-duplication • Goal of the Paper • Why is De-duplication Important? • Algorithm • Experiment • Related Work • Tying it Back to Lecture • Paper Evaluation • Questions

  3. De-duplication • The process of eliminating near-duplicate web documents in a generic crawl • Challenge of near-duplicates: • Identifying exact duplicates is easy • Use checksums • How to identify near-duplicates? • Near-duplicates have essentially the same content but differ in small areas • Ads, counters, and timestamps
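To illustrate why checksums catch exact duplicates but not near-duplicates, here is a minimal sketch. The choice of SHA-256 and the sample page strings are assumptions made for illustration; the slides only say "use checksums".

```python
import hashlib


def checksum(page: str) -> str:
    """Return a hex digest of the page text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()


# Hypothetical pages: b is an exact copy of a, c differs only in a counter.
page_a = "Breaking news: storm hits coast. Visitors: 1041"
page_b = "Breaking news: storm hits coast. Visitors: 1041"
page_c = "Breaking news: storm hits coast. Visitors: 1042"

same_ab = checksum(page_a) == checksum(page_b)  # exact duplicate is caught
same_ac = checksum(page_a) == checksum(page_c)  # near-duplicate is missed
```

A one-character change in a counter yields a completely different checksum, which is why a similarity-preserving fingerprint is needed for near-duplicates.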

  4. Goal of the Paper • Present a near-duplicate detection system that improves web crawling • Near-duplicate detection system includes: • Simhash technique • Technique used to transform a web-page to an f-bit fingerprint • Solution to Hamming Distance Problem • Given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions
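The Hamming Distance Problem stated above can be sketched as a brute-force scan. The function names and the sample fingerprints are hypothetical; this linear scan only defines what has to be found and is not the efficient table-based solution described later in the slides.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query: int, collection: list, k: int) -> list:
    """Return every fingerprint that differs from `query` in at most k bits."""
    return [fp for fp in collection if hamming_distance(query, fp) <= k]


# Hypothetical 4-bit fingerprints.
found = near_duplicates(0b0110, [0b0110, 0b0111, 0b1001], k=1)
```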

  5. Why is De-duplication Important? • Elimination of near duplicates: • Saves network bandwidth • Do not have to crawl content if similar to previously crawled content • Reduces storage cost • Do not have to store in local repository if similar to previously crawled content • Improves quality of search indexes • Local repository used for building search indexes not polluted by near-duplicates

  6. Algorithm: Simhash Technique • Convert web-page to set of features • Using Information Retrieval techniques • e.g. tokenization, phrase detection • Give a weight to each feature • Hash each feature into an f-bit value • Have an f-dimensional vector • Dimension values start at 0 • Update f-dimensional vector with weight of feature • If i-th bit of hash value is zero -> subtract the weight of the feature from the i-th vector value • If i-th bit of hash value is one -> add the weight of the feature to the i-th vector value • Vector will have positive and negative components • Signs (+/-) of the components determine the bits of the fingerprint (positive -> 1, negative -> 0)
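The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: features are lowercase whitespace tokens weighted by frequency, and the per-feature hash is the top f bits of MD5. The slides do not prescribe a tokenizer, weighting scheme, or hash function, and the helper names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Mapping


def feature_hash(feature: str, f: int) -> int:
    """Hash a feature to an f-bit value (top f bits of MD5; assumed choice, f <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - f)


def simhash(weighted_features: Mapping[str, float], f: int = 64) -> int:
    """Compute an f-bit simhash fingerprint from weighted features."""
    vector = [0.0] * f  # every dimension starts at 0
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1  # i-th bit, most significant first
            vector[i] += weight if bit else -weight
    fingerprint = 0
    for component in vector:
        # positive component -> bit 1, otherwise bit 0
        fingerprint = (fingerprint << 1) | (1 if component > 0 else 0)
    return fingerprint


def features_of(text: str) -> dict:
    """Hypothetical feature extraction: token -> term frequency."""
    return dict(Counter(text.lower().split()))


fp = simhash(features_of("detecting near duplicates for web crawling"))
```

With a single feature the fingerprint equals that feature's hash, which is a quick sanity check of the sign rule.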

  7. Algorithm: Simhash Technique (cont.) • Very simple example • One web-page • Web-page text: “Simhash Technique” • Reduced to two features • “Simhash” -> weight = 2 • “Technique” -> weight = 4 • Hash features to 4 bits • “Simhash” -> 1101 • “Technique” -> 0110

  8. Algorithm: Simhash Technique (cont.) • Start vector with all zeroes: 0 0 0 0

  9. Algorithm: Simhash Technique (cont.) • Apply “Simhash” feature (weight = 2, hash = 1101) • Bit 1 is 1 -> 0 + 2 = 2 • Bit 2 is 1 -> 0 + 2 = 2 • Bit 3 is 0 -> 0 - 2 = -2 • Bit 4 is 1 -> 0 + 2 = 2 • Vector after update: 2 2 -2 2

  10. Algorithm: Simhash Technique (cont.) • Apply “Technique” feature (weight = 4, hash = 0110) • Bit 1 is 0 -> 2 - 4 = -2 • Bit 2 is 1 -> 2 + 4 = 6 • Bit 3 is 1 -> -2 + 4 = 2 • Bit 4 is 0 -> 2 - 4 = -2 • Vector after update: -2 6 2 -2

  11. Algorithm: Simhash Technique (cont.) • Final vector: -2 6 2 -2 • Sign of vector values is -,+,+,- • Final 4-bit fingerprint = 0110
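The worked example can be replayed in a few lines. The sketch below takes the 4-bit hashes and weights directly from the slides ("Simhash" -> 1101 with weight 2, "Technique" -> 0110 with weight 4); the function name is hypothetical.

```python
def simhash_from_hashes(features, f):
    """Apply the simhash update rule to (bit_string, weight) pairs."""
    vector = [0] * f  # start with all zeroes
    for bits, weight in features:
        for i, bit in enumerate(bits):
            # bit 1 adds the weight, bit 0 subtracts it
            vector[i] += weight if bit == "1" else -weight
    fingerprint = "".join("1" if v > 0 else "0" for v in vector)
    return vector, fingerprint


vector, fingerprint = simhash_from_hashes([("1101", 2), ("0110", 4)], f=4)
# vector is [-2, 6, 2, -2] and fingerprint is "0110", matching the slides
```

Because "Technique" carries the larger weight, its hash dominates every component and the fingerprint equals its hash.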

  12. Algorithm: Solution to Hamming Distance Problem • Problem: Given an f-bit fingerprint (F), find all fingerprints in a given collection that differ from F in at most k bit positions • Solution: • Create tables containing the fingerprints • Each table i has a permutation (π_i) and a small integer (p_i) associated with it • Apply the permutation associated with the table to its fingerprints • Sort each table • Store the tables in the main memory of a set of machines • Iterate through the tables in parallel • Find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F) • For the fingerprints that matched, check whether they differ from π_i(F) in at most k bits
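A single-machine sketch of this table lookup is shown below. Several choices are assumptions made for illustration: 8-bit fingerprints, bit rotation as the permutation, two tables with p = 4, a hypothetical collection, and a sequential loop in place of parallel machines.

```python
from bisect import bisect_left

F_BITS = 8  # assumed fingerprint width for this sketch


def rotate_left(x: int, r: int, f: int = F_BITS) -> int:
    """Bit rotation, used here as a simple permutation of bit positions."""
    r %= f
    return ((x << r) | (x >> (f - r))) & ((1 << f) - 1)


def build_table(fingerprints, r: int, p: int) -> dict:
    """Permute every fingerprint and sort the table."""
    return {"r": r, "p": p, "rows": sorted(rotate_left(fp, r) for fp in fingerprints)}


def query(tables, fingerprint: int, k: int, f: int = F_BITS) -> set:
    matches = set()
    for t in tables:
        q = rotate_left(fingerprint, t["r"], f)
        shift = f - t["p"]
        low = (q >> shift) << shift          # smallest value sharing the top p bits
        high = low + (1 << shift)            # first value with a larger prefix
        start = bisect_left(t["rows"], low)
        end = bisect_left(t["rows"], high)
        for cand in t["rows"][start:end]:    # candidates whose top p bits match
            if bin(cand ^ q).count("1") <= k:
                matches.add(rotate_left(cand, f - t["r"], f))  # undo the rotation
    return matches


collection = [0b01001101, 0b01001100, 0b11001101, 0b10110010, 0b01000010]
tables = [build_table(collection, r=0, p=4), build_table(collection, r=4, p=4)]
found = query(tables, 0b01001101, k=1)
```

With k = 1 and two tables that each pin one half of the bits, any fingerprint within distance 1 must agree exactly with the query on at least one half, so one of the two tables is guaranteed to surface it.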

  13. Algorithm: Solution to Hamming Distance Problem (cont.) • Simple example • F = 0100 1101 • k = 3 • Have a collection of 8 fingerprints • Create two tables

  14. Algorithm: Solution to Hamming Distance Problem (cont.)

  15. Algorithm: Solution to Hamming Distance Problem (cont.) • Sort both tables

  16. Algorithm: Solution to Hamming Distance Problem (cont.) • F = 0100 1101 • First table: π(F) = 1101 0100 • Second table: π(F) = 0101 0011 • Match!

  17. Algorithm: Solution to Hamming Distance Problem (cont.) • With k = 3, only the matching fingerprint in the first table is a near-duplicate of the fingerprint F

  18. Algorithm: Compression of Tables • Step 1: Store the first fingerprint in a block (1024 bytes) • Step 2: XOR the current fingerprint with the previous one • Step 3: Append to the block the Huffman code for the position of the most significant 1 bit of the XOR result • Step 4: Append to the block the bits after the most significant 1 bit • Step 5: Repeat steps 2-4 until the block is full • Comparing to the query fingerprint • Use the last fingerprint in each block as its key and perform interpolation search on the keys to find and decompress the appropriate block
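The delta-encoding idea can be sketched as below. This is a simplified illustration, not the scheme from the slides: it writes the position of the most significant 1 bit as a fixed 6-bit number instead of a Huffman code, represents the block as a string of '0'/'1' characters, ignores the 1024-byte block limit, and omits the per-block key and interpolation search. It also assumes the fingerprints are sorted and distinct. All names are hypothetical.

```python
F = 64         # fingerprint width in bits
POS_BITS = 6   # fixed-width stand-in for the Huffman code (encodes 0..63)


def encode_block(sorted_fps):
    """Delta-encode sorted, distinct fingerprints into a bit string."""
    prev = sorted_fps[0]
    out = [format(prev, f"0{F}b")]                 # first fingerprint stored raw
    for fp in sorted_fps[1:]:
        delta = fp ^ prev                          # XOR with the previous one
        h = delta.bit_length() - 1                 # position of most significant 1 bit
        out.append(format(h, f"0{POS_BITS}b"))
        if h:                                      # bits after the most significant 1
            out.append(format(delta & ((1 << h) - 1), f"0{h}b"))
        prev = fp
    return "".join(out)


def decode_block(bits, count):
    """Invert encode_block for a block holding `count` fingerprints."""
    prev = int(bits[:F], 2)
    pos = F
    fps = [prev]
    for _ in range(count - 1):
        h = int(bits[pos:pos + POS_BITS], 2)
        pos += POS_BITS
        low = int(bits[pos:pos + h], 2) if h else 0
        pos += h
        prev ^= (1 << h) | low                     # rebuild delta, then the fingerprint
        fps.append(prev)
    return fps


fps = [0x0F0F0F0F0F0F0F00, 0x0F0F0F0F0F0F0F01, 0x0F0F0F0F0F0F0F07, 0x0F0F0F0F0F0F1000]
encoded = encode_block(fps)
decoded = decode_block(encoded, len(fps))
```

Because neighbours in a sorted table share long prefixes, the XOR has many leading zeroes and only the short tail needs to be stored.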

  19. Algorithm: Extending to Batch Queries • Problem: Want to get near-duplicates for batch of query fingerprints – not just one • Solution: • Use Google File System (GFS) and MapReduce • Create two files • File F has the collection of fingerprints • File Q has the query fingerprints • Store the files in GFS • GFS breaks up the files into chunks • Use MapReduce to solve the Hamming Distance Problem for each chunk of F for all queries in Q • MapReduce allows for a task to be created per chunk • Iterate through chunks in parallel • Each task produces output of near-duplicates found • Produce sorted file from output of each task • Remove duplicates if necessary
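The batch flow can be mimicked locally without GFS or MapReduce. In the sketch below, plain Python lists stand in for files F and Q, a generator stands in for GFS chunking, and a sequential loop stands in for parallel tasks; each task uses a brute-force comparison rather than the table lookup. All names and sample values are hypothetical.

```python
from itertools import islice


def chunks(items, size):
    """Split the fingerprint collection into fixed-size chunks (stand-in for GFS)."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def map_task(chunk, queries, k):
    """One task per chunk: report (query, fingerprint) pairs within k bits."""
    return [(q, fp) for q in queries for fp in chunk
            if bin(q ^ fp).count("1") <= k]


def batch_near_duplicates(file_f, file_q, k, chunk_size):
    results = set()                      # a set removes duplicate pairs
    for chunk in chunks(file_f, chunk_size):
        results.update(map_task(chunk, file_q, k))
    return sorted(results)               # sorted output, as in the final step


pairs = batch_near_duplicates([0b0000, 0b0001, 0b1111, 0b0001],
                              [0b0000, 0b1110], k=1, chunk_size=2)
```

Since each chunk is processed independently against all of Q, the loop body could be handed to separate workers without changing the result.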

  20. Experiment: Parameters • 8 billion web pages used • k = 1 … 10 • Manually tagged pairs as follows: • True positives • Differ slightly • False positives • Radically different pairs • Unknown • Could not be evaluated

  21. Experiment: Results • Accuracy • Low k value -> many false negatives • High k value -> many false positives • Best value -> k = 3 • 75% of near-duplicates are reported (recall) • 75% of reported cases are true positives (precision) • Running Time • Solution to Hamming Distance Problem: O(log(p)) • Batch Query + Compression: • 32GB file & 200 tasks -> runs in under 100 seconds

  22. Related Work • Clustering related documents • Detect near-duplicates to show related pages • Data extraction • Determine schema of similar pages to obtain information • Plagiarism • Detect pages that have borrowed from each other • Spam • Detect spam before user receives it

  23. Tying it Back to Lecture • Similarities • Indicated importance of de-duplication to save crawler resources • Brief summary of several uses for near-duplicate detection • Differences • Lecture focus: • Breadth-first look at algorithms for near-duplicate detection • Paper focus: • In-depth look at the simhash and Hamming Distance algorithms • Includes how to implement them and how effective they are

  24. Paper Evaluation: Pros • Thorough step-by-step explanation of the algorithm implementation • Thorough explanation of how the conclusions were reached • Included brief description of how to improve the simhash + Hamming Distance algorithm • Categorize web-pages before running simhash, create an algorithm to remove ads or timestamps, etc.

  25. Paper Evaluation: Cons • No comparison • How much more effective or faster is it than other algorithms? • By how much did it improve the crawler? • Limited batch queries to a specific technology • Implementation required use of GFS • An approach not restricted to a specific technology might be more widely applicable

  26. Any Questions?
