Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
Presented by Fan Deng
Joint work with Davood Rafiei
University of Alberta
Outline
• Motivation
• The problem
  - Approximate duplicate detection
• Existing solutions
  - Caching
  - Bloom filters
• Our approach
  - Stable Bloom filters
  - Results
• Related work
• Conclusions
The Motivating Application
• Duplicate URL detection in Web crawling [Broder et al. WWW03]
  - Web search engines fetch web pages continuously
  - Extract the URLs within each downloaded page
  - Check each URL (duplicate detection): if never seen before, download it; else skip it
• Problem
  - Huge number of distinct URLs
  - Memory is usually not large enough, and disks are too slow
The Motivating Application
• Errors are usually acceptable
  - False positives (false alarms)
    -- A distinct URL is wrongly reported as a duplicate
    -- That URL will not be crawled
  - False negatives (misses)
    -- A duplicate URL is wrongly reported as distinct
    -- That URL will be crawled redundantly or looked up on disk
The Problem: Approximate Duplicate Detection
• A sequence of elements with order, e.g. the stream … d g a f b e a d c b a
• Storage space M (not large enough to store all distinct elements)
• Continual membership queries: has this element appeared before? Yes or No
• Our goal
  - Minimize the number of errors
  - Be fast
Existing Solutions – Caching
• Store as many distinct elements as possible in a buffer
• Duplicate detection process (see the sketch below)
  - On seeing an element, search the buffer
  - If found, report "duplicate"; else report "distinct"
• Update the buffer using some replacement policy
  - LRU, FIFO, Random, …
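A minimal sketch of this caching approach with an LRU policy; the class and parameter names, and the OrderedDict-based structure, are illustrative choices of ours, not from the slides:

```python
from collections import OrderedDict

class LRUDuplicateDetector:
    """Caching-based duplicate detection with LRU replacement (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity      # buffer size, bounded by memory M
        self.buffer = OrderedDict()   # keys ordered least- to most-recently used

    def seen(self, element):
        if element in self.buffer:
            self.buffer.move_to_end(element)   # refresh recency on a hit
            return True                        # report "duplicate"
        # A miss may be a false negative: the element may simply have been evicted.
        self.buffer[element] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)    # evict the least recently used
        return False                           # report "distinct"
```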
Existing Solutions – Caching
• False negatives
  - Lead to redundant crawling or disk lookups
• Needs extra space
  - To speed up the search
  - To maintain the replacement policy (e.g. LRU)
  - Amount of extra space is proportional to the buffer size
Existing Solutions – Bloom Filters
• A bitmap, originally all "0"
• Duplicate detection process
  - Hash each incoming element into some bits
  - If any bit is "0", report "distinct"; else report "duplicate"
• Update process
  - Set the corresponding bits to "1"
• Example (bit positions 1–6, two hash functions):

  x | h1(x) | h2(x)
  a |   1   |   2
  b |   1   |   3
  c |   2   |   4
  a |   1   |   2
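A compact sketch of this classic Bloom filter as a duplicate detector; the hashing scheme (salted SHA-256) and names are illustrative assumptions, not the paper's:

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter used as a duplicate detector (illustrative sketch)."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, element):
        # Derive k bit positions from salted hashes of the element.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def seen(self, element):
        positions = list(self._positions(element))
        duplicate = all(self.bits[p] for p in positions)  # any 0 => "distinct"
        for p in positions:
            self.bits[p] = 1                              # update: set bits to 1
        return duplicate  # may be a false positive, never a false negative
```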
Existing Solutions – Bloom Filters
• False positives (false alarms)
• On an unbounded stream, the Bloom filter eventually becomes "full" (every bit set to 1)
  - All distinct URLs will then be reported as duplicates, and thus skipped!
Our Approach – Stable Bloom Filters (SBF)
• Kick "elements" out of the Bloom filter
• Change bits to "cells" (a "cellmap"), e.g. 1 3 0 0 1 2 0 1 1 3 1 0
Stable Bloom Filters (SBF)
• A "cellmap", originally all "0"
• Duplicate detection
  - Hash each element into some cells and check those cells
  - If any cell is "0", report "distinct"; else report "duplicate"
• Kick "elements" out
  - Randomly choose some cells and decrement each by 1
• Update the "cellmap"
  - Set the hashed cells to a predefined value Max > 0
  - Use the same hash functions as in the detection stage
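A sketch following the three steps on this slide; the parameter names (num_cells, num_hashes, kick, max_value) and the hashing scheme are our illustrative choices (the paper denotes the parameters m, K, P, and Max):

```python
import hashlib
import random

class StableBloomFilter:
    """Stable Bloom filter sketch: detect, kick out, then update."""

    def __init__(self, num_cells, num_hashes, kick, max_value):
        self.m = num_cells
        self.k = num_hashes
        self.p = kick            # cells decremented per incoming element
        self.max = max_value     # predefined cell ceiling, Max > 0
        self.cells = [0] * num_cells

    def _positions(self, element):
        # Same hash functions for detection and update.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def seen(self, element):
        positions = list(self._positions(element))
        # Detection: any 0-cell means "distinct".
        duplicate = all(self.cells[pos] for pos in positions)
        # Kick-out: decrement p randomly chosen cells (floored at 0).
        for pos in random.sample(range(self.m), self.p):
            if self.cells[pos] > 0:
                self.cells[pos] -= 1
        # Update: set the hashed cells to Max.
        for pos in positions:
            self.cells[pos] = self.max
        return duplicate
```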
Analytical Results
• SBF becomes stable
  - The expected number of "0"s becomes a constant after a number of updates
  - Convergence is at an exponential rate
  - Convergence is monotonic
  - A lower bound on the expected number of "0"s (a function of the SBF size, the number of hash functions, the max cell value, and the kick-out rate)
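A back-of-envelope version of why a stable fraction of zeros exists; this heuristic is our own sketch consistent with the four parameters above, not the paper's exact bound. With m cells, K hash functions, kick-out rate P, and ceiling Max, each update decrements a fixed cell with probability about P/m and resets it to Max with probability about K/m; a cell reads 0 exactly when its most recent Max "touches" were all decrements:

```latex
% Conditioned on the cell being touched at all, the touch is a decrement
% with probability P/(P+K) (large-m approximation, touches independent):
\Pr[\text{cell} = 0] \;\approx\; \left(\frac{P}{P+K}\right)^{\mathrm{Max}}
```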
Analytical Results
• Two-sided errors
  - The false positive rate becomes constant
  - An upper bound on the false positive rate (a function of the same 4 parameters)
  - Given a false positive rate and an SBF size, find the optimal parameters minimizing the number of false negatives (combining empirical results on setting the max cell value)
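The constant false positive rate follows from the stable fraction of zeros: a never-seen element is misreported as a duplicate exactly when all K of its hashed cells are nonzero. Writing Z for the stable fraction of zero cells and treating cells as independent (our simplifying assumption, not the paper's exact analysis):

```latex
% A distinct element is a false positive iff all K of its cells are nonzero:
\mathrm{FPR} \;\approx\; (1 - Z)^{K}
```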
Experiments
• Experimental comparison between SBF and the caching/buffering method (LRU)
• URL fingerprint data set, originally obtained from the Internet Archive (~700M URLs)
• Synthetic data simulating network traffic using Poisson and b-model distributions
• For a fair comparison, we introduce FPBuffering
  - Let caching generate some false positives: if an element is not found in the buffer, report "duplicate" with a certain probability
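A minimal sketch of the FPBuffering idea, reusing the illustrative LRUDuplicateDetector from the earlier caching sketch; the class and parameter names are ours, not the paper's:

```python
import random

class FPBuffering:
    """LRU caching that deliberately trades false positives for fewer misses."""

    def __init__(self, capacity, fp_prob):
        self.cache = LRUDuplicateDetector(capacity)  # defined in the earlier sketch
        self.fp_prob = fp_prob  # probability of reporting "duplicate" on a miss

    def seen(self, element):
        if self.cache.seen(element):
            return True
        # On a miss, claim "duplicate" with some probability, introducing
        # false positives so the error profile can be matched against SBF's.
        return random.random() < self.fp_prob
```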
Experimental Results
• SBF generates 3–13% fewer false negatives than FPBuffering, while having exactly the same number of false positives (<10%)
Experimental Results
• MIN [Broder et al. WWW03], theoretically optimal
  - Assumes "the entire sequence of requests is known in advance"
  - Beats LRU caching by <5% in most cases
• The more false positives allowed, the more SBF gains
Related Work
• Duplicate detection in click streams [Metwally et al. WWW05]
• URL caching [Broder et al. WWW03]
• Other variations of Bloom filters
  - Counting Bloom filters [Fan et al. SIGCOMM98]
  - Spectral Bloom filters [Cohen & Matias SIGMOD03]
  - …
• Fuzzy duplicate detection [Ananthakrishna et al. VLDB02], [Chaudhuri et al. ICDE05], [Weis et al. SIGMOD05]
Conclusions
• SBF provides a false positive/false negative trade-off when space is limited
• SBF is fast and simple
• The more false positives allowed, the more SBF gains
Questions/Comments? Thanks!