Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
Presented by Fan Deng
Joint work with Davood Rafiei
University of Alberta
Outline
• Motivation
• The problem
  - Approximate duplicate detection
• Existing solutions
  - Caching
  - Bloom filters
• Our approach
  - Stable Bloom filters
  - Results
• Related work
• Conclusions
The Motivating Application
• Duplicate URL detection in Web crawling [Broder et al. WWW03]
  - Web search engines fetch web pages continuously
  - Extract the URLs within each downloaded page
  - Check each URL (duplicate detection): if never seen before, download it; else skip it
• Problem
  - Huge number of distinct URLs
  - Memory is usually not large enough, and disks are too slow
The Motivating Application
• Errors are usually acceptable
  - False positives (false alarms)
    -- A distinct URL is wrongly reported as a duplicate
    -- That URL will not be crawled
  - False negatives (misses)
    -- A duplicate URL is wrongly reported as distinct
    -- That URL will be crawled redundantly or looked up on disk
The Problem: Approximate Duplicate Detection
• A sequence of elements with order, e.g. the stream … d g a f b e a d c b a
• Storage space M (not large enough to store all distinct elements)
• Continual membership queries: has this element appeared before? Yes or No
• Our goal
  - Minimize the number of errors
  - Be fast
Existing Solutions – Caching
• Store as many distinct elements as possible in a buffer
• Duplicate detection process (see the sketch below)
  - On seeing an element, search the buffer
  - If found, report "duplicate"; else report "distinct"
• Update the buffer using some replacement policy
  - LRU, FIFO, Random, …
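A minimal sketch of this caching approach with an LRU policy; the class and parameter names, and the OrderedDict-based structure, are illustrative choices of ours, not from the slides:

```python
from collections import OrderedDict

class LRUDuplicateDetector:
    """Caching-based duplicate detection with LRU replacement (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity      # buffer size, bounded by memory M
        self.buffer = OrderedDict()   # keys ordered least- to most-recently used

    def seen(self, element):
        if element in self.buffer:
            self.buffer.move_to_end(element)   # refresh recency on a hit
            return True                        # report "duplicate"
        # A miss may be a false negative: the element may simply have been evicted.
        self.buffer[element] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)    # evict the least recently used
        return False                           # report "distinct"
```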
Existing Solutions – Caching
• False negatives
  - Lead to redundant crawling or disk lookups
• Needs extra space
  - To speed up the search
  - To maintain the replacement policy (e.g. LRU)
  - Amount of extra space is proportional to the buffer size
Existing Solutions – Bloom Filters
• A bitmap, originally all "0"
• Duplicate detection process
  - Hash each incoming element into some bits
  - If any bit is "0", report "distinct"; else report "duplicate"
• Update process
  - Set the corresponding bits to "1"
• Example (bit positions 1–6, two hash functions):

  x | h1(x) | h2(x)
  a |   1   |   2
  b |   1   |   3
  c |   2   |   4
  a |   1   |   2
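A compact sketch of this classic Bloom filter as a duplicate detector; the hashing scheme (salted SHA-256) and names are illustrative assumptions, not the paper's:

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter used as a duplicate detector (illustrative sketch)."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, element):
        # Derive k bit positions from salted hashes of the element.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def seen(self, element):
        positions = list(self._positions(element))
        duplicate = all(self.bits[p] for p in positions)  # any 0 => "distinct"
        for p in positions:
            self.bits[p] = 1                              # update: set bits to 1
        return duplicate  # may be a false positive, never a false negative
```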
Existing Solutions – Bloom Filters
• False positives (false alarms)
• On an unbounded stream, the Bloom filter eventually becomes "full" (every bit set to 1)
  - All distinct URLs will then be reported as duplicates, and thus skipped!
Our Approach – Stable Bloom Filters (SBF)
• Kick "elements" out of the Bloom filter
• Change bits to "cells" (a "cellmap"), e.g. 1 3 0 0 1 2 0 1 1 3 1 0
Stable Bloom Filters (SBF)
• A "cellmap", originally all "0"
• Duplicate detection
  - Hash each element into some cells and check those cells
  - If any cell is "0", report "distinct"; else report "duplicate"
• Kick "elements" out
  - Randomly choose some cells and decrement each by 1
• Update the "cellmap"
  - Set the hashed cells to a predefined value Max > 0
  - Use the same hash functions as in the detection stage
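A sketch following the three steps on this slide; the parameter names (num_cells, num_hashes, kick, max_value) and the hashing scheme are our illustrative choices (the paper denotes the parameters m, K, P, and Max):

```python
import hashlib
import random

class StableBloomFilter:
    """Stable Bloom filter sketch: detect, kick out, then update."""

    def __init__(self, num_cells, num_hashes, kick, max_value):
        self.m = num_cells
        self.k = num_hashes
        self.p = kick            # cells decremented per incoming element
        self.max = max_value     # predefined cell ceiling, Max > 0
        self.cells = [0] * num_cells

    def _positions(self, element):
        # Same hash functions for detection and update.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def seen(self, element):
        positions = list(self._positions(element))
        # Detection: any 0-cell means "distinct".
        duplicate = all(self.cells[pos] for pos in positions)
        # Kick-out: decrement p randomly chosen cells (floored at 0).
        for pos in random.sample(range(self.m), self.p):
            if self.cells[pos] > 0:
                self.cells[pos] -= 1
        # Update: set the hashed cells to Max.
        for pos in positions:
            self.cells[pos] = self.max
        return duplicate
```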
Analytical Results
• SBF becomes stable
  - The expected number of "0"s becomes a constant after a number of updates
  - Convergence is at an exponential rate
  - Convergence is monotonic
  - A lower bound on the expected number of "0"s (a function of the SBF size, the number of hash functions, the max cell value, and the kick-out rate)
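A back-of-envelope version of why a stable fraction of zeros exists; this heuristic is our own sketch consistent with the four parameters above, not the paper's exact bound. With m cells, K hash functions, kick-out rate P, and ceiling Max, each update decrements a fixed cell with probability about P/m and resets it to Max with probability about K/m; a cell reads 0 exactly when its most recent Max "touches" were all decrements:

```latex
% Conditioned on the cell being touched at all, the touch is a decrement
% with probability P/(P+K) (large-m approximation, touches independent):
\Pr[\text{cell} = 0] \;\approx\; \left(\frac{P}{P+K}\right)^{\mathrm{Max}}
```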
Analytical Results
• Two-sided errors
  - The false positive rate becomes constant
  - An upper bound on the false positive rate (a function of the same 4 parameters)
  - Given a false positive rate and an SBF size, find the optimal parameters minimizing the number of false negatives (combining empirical results on setting the max cell value)
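The constant false positive rate follows from the stable fraction of zeros: a never-seen element is misreported as a duplicate exactly when all K of its hashed cells are nonzero. Writing Z for the stable fraction of zero cells and treating cells as independent (our simplifying assumption, not the paper's exact analysis):

```latex
% A distinct element is a false positive iff all K of its cells are nonzero:
\mathrm{FPR} \;\approx\; (1 - Z)^{K}
```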
Experiments
• Experimental comparison between SBF and the caching/buffering method (LRU)
• URL fingerprint data set, originally obtained from the Internet Archive (~700M URLs)
• Synthetic data simulating network traffic using Poisson and b-model distributions
• For a fair comparison, we introduce FPBuffering
  - Let caching generate some false positives: if an element is not found in the buffer, report "duplicate" with a certain probability
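A minimal sketch of the FPBuffering idea, reusing the illustrative LRUDuplicateDetector from the earlier caching sketch; the class and parameter names are ours, not the paper's:

```python
import random

class FPBuffering:
    """LRU caching that deliberately trades false positives for fewer misses."""

    def __init__(self, capacity, fp_prob):
        self.cache = LRUDuplicateDetector(capacity)  # defined in the earlier sketch
        self.fp_prob = fp_prob  # probability of reporting "duplicate" on a miss

    def seen(self, element):
        if self.cache.seen(element):
            return True
        # On a miss, claim "duplicate" with some probability, introducing
        # false positives so the error profile can be matched against SBF's.
        return random.random() < self.fp_prob
```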
Experimental Results
• SBF generates 3–13% fewer false negatives than FPBuffering, while having exactly the same number of false positives (<10%)
Experimental Results
• MIN [Broder et al. WWW03], theoretically optimal
  - Assumes "the entire sequence of requests is known in advance"
  - Beats LRU caching by <5% in most cases
• The more false positives allowed, the more SBF gains
Related Work
• Duplicate detection in click streams [Metwally et al. WWW05]
• URL caching [Broder et al. WWW03]
• Other variations of Bloom filters
  - Counting Bloom filters [Fan et al. SIGCOMM98]
  - Spectral Bloom filters [Cohen & Matias SIGMOD03]
  - …
• Fuzzy duplicate detection [Ananthakrishna et al. VLDB02], [Chaudhuri et al. ICDE05], [Weis et al. SIGMOD05]
Conclusions
• SBF provides a false positive/false negative trade-off when space is limited
• SBF is fast and simple
• The more false positives allowed, the more SBF gains
Questions/Comments? Thanks!