Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor: Dr. Davood Rafiei May 24, 2007
Data stream • A sequence of data records • Examples • Document/URL streams from a Web crawler • IP packet streams • Web advertisement click streams • Sensor reading streams • ...
Processing in one pass • One pass processing • Online stream (one scan required) • Massive offline stream (one scan preferred) • Challenges • Huge data volume • Fast processing requirement • Relatively small fast storage space
Approximation algorithms • Exact query answers • can be slow to obtain • may need large storage space • sometimes are not necessary • Approximate query answers • can take much less time • may need less space • with acceptable errors
Frequency related queries • Frequency • # of occurrences • Continuous membership query • Point query • Similarity self-join size estimation
Outline • Introduction • Continuous membership query • Motivating application • Problem statement • Existing solutions and our solution • Theoretical and experimental results • Point query • Similarity self-join size estimation • Conclusions and future work
A Motivating Application • Duplicate URL detection in Web crawling • Search engines [Broder et al. WWW03] • Fetch web pages continuously • Extract URLs within each downloaded page • Check each URL (duplicate detection) • If never seen before • Then fetch it • Else skip it
A Motivating Application (cont.) • Problems • Huge number of distinct URLs • Memory is usually not large enough • Disks are slow • Errors are usually acceptable • A false positive (false alarm) • A distinct URL is wrongly reported as a duplicate • Consequence: this URL will not be crawled • A false negative (miss) • A duplicate URL is wrongly reported as distinct • Consequence: this URL will be crawled redundantly or searched for on disk
Problem statement • A sequence of elements arriving in order (e.g. … d g a f b e a d c b a) • Storage space M, not large enough to store all distinct elements • Continuous membership query: has each arriving element appeared before? (Yes or No) • Our goal • Minimize the # of errors • Fast
An existing solution (caching) • Store as many distinct elements as possible in a buffer • Duplicate detection process • Upon element arrival, search the buffer • if found then report “duplicate” else “distinct” • Update the buffer using some replacement policies • LRU, FIFO, Random, …
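To make the caching baseline concrete, the following is a minimal sketch of duplicate detection with a bounded buffer and LRU replacement; the buffer capacity and the use of Python's OrderedDict are illustrative choices, not part of the original setup.

```python
from collections import OrderedDict

class LRUDuplicateDetector:
    """Caching-based duplicate detection: keep as many recently seen
    distinct elements as fit in a bounded buffer, evicting the least
    recently used entry when the buffer is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = OrderedDict()  # element -> None, ordered by recency

    def is_duplicate(self, element):
        if element in self.buffer:
            self.buffer.move_to_end(element)   # refresh recency
            return True                        # report "duplicate"
        if len(self.buffer) >= self.capacity:  # not found: evict LRU entry if full
            self.buffer.popitem(last=False)
        self.buffer[element] = None            # remember it, report "distinct"
        return False

detector = LRUDuplicateDetector(capacity=3)
print([detector.is_duplicate(x) for x in "abcaba"])
# -> [False, False, False, True, True, True]
```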
Another solution (Bloom filters) • A bitmap, originally all "0" • Duplicate detection process • Hash each incoming element into some bits • If any bit is "0" then report "distinct" else "duplicate" • Update process: set the corresponding bits to "1" • (Figure: a 6-bit bitmap; example elements a, b, c hashed by h1, h2 to bits 1-2, 1-3, 2-4; when a arrives again, both of its bits are already set)
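A minimal Bloom filter sketch following the process above; the bitmap size, the number of hash functions, and the salted SHA-1 hashing are illustrative choices rather than the settings used in the talk.

```python
import hashlib

class BloomFilter:
    """Bitmap of m bits with k hash functions: report "duplicate" only
    when all k bits for an element are already set, then set the bits."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, element):
        # derive k bit positions from salted digests (illustrative hashing)
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def seen_before(self, element):
        positions = list(self._positions(element))
        duplicate = all(self.bits[p] for p in positions)  # any "0" bit -> distinct
        for p in positions:                               # update: set bits to "1"
            self.bits[p] = 1
        return duplicate

bf = BloomFilter(m=64, k=2)
print([bf.seen_before(x) for x in ["a", "b", "c", "a"]])  # typically [False, False, False, True]
```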
Another solution (Bloom filters, cont.) • False positives (false alarms) • Bloom filters will eventually become "full" (all bits set to "1"): all distinct URLs will be reported as duplicates, and thus skipped!
Our solution (Stable Bloom Filters) • Kick "elements" out of the Bloom filter • Change bits to "cells" ("cellmap") • (Figure: a bitmap of 0/1 bits becomes a "cellmap" of small counters)
Stable Bloom Filters (SBF, cont.) • A "cellmap", originally all "0" • Duplicate detection • Hash each element into some cells and check those cells • If any cell is "0", report "distinct"; else report "duplicate" • Kick "elements" out • Randomly choose some cells and decrement them by 1 • Update the "cellmap" • Set the element's cells to a predefined value Max > 0 • Use the same hash functions as in the detection stage
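The sketch below illustrates the SBF cycle described above (check the element's cells, decrement a few randomly chosen cells, then set the element's cells to Max). The parameter names, the hash function, and the number of decrements per arrival are illustrative; the bound and parameter settings on the next slide are what actually guide these choices.

```python
import hashlib
import random

class StableBloomFilter:
    """Sketch of a Stable Bloom Filter: an array of small counters ("cells").
    An arriving element is hashed to k cells; if any of them is 0 it is
    reported "distinct". Before inserting, p randomly chosen cells are
    decremented by 1 (the kick-out step), then the element's k cells are
    set to max_value."""

    def __init__(self, num_cells, k, max_value, p):
        self.num_cells, self.k = num_cells, k
        self.max_value, self.p = max_value, p
        self.cells = [0] * num_cells

    def _cells_for(self, element):
        for i in range(self.k):  # same hash functions for detection and update
            digest = hashlib.sha1(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.num_cells

    def seen_before(self, element):
        positions = list(self._cells_for(element))
        duplicate = all(self.cells[c] > 0 for c in positions)  # any 0 cell -> distinct
        for _ in range(self.p):                                # kick-out step
            c = random.randrange(self.num_cells)
            if self.cells[c] > 0:
                self.cells[c] -= 1
        for c in positions:                                    # update step
            self.cells[c] = self.max_value
        return duplicate
```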
SBF theoretical results • SBF will be stable • The expected # of “0”s will become a constant after a number of updates • Converge at an exponential rate • Monotonic • False positive rates become constant • An upper bound of false positive rates • (a function of 4 parameters: SBF size, # of hash functions, max cell values, and kick-out rates) • Setting the optimal parameters (partially empirical)
SBF experimental results • Experimental comparison between SBF and the caching/buffering method (LRU) • URL fingerprint data set, originally obtained from the Internet Archive (~700M URLs) • For a fair comparison, we introduce FPBuffering • Let caching generate some false positives • FPBuffering • If an element is not found in the buffer, report "duplicate" with a certain probability
SBF experimental results (cont.) • SBF generates 3-13% fewer false negatives than FPBuffering, while having exactly the same # of false positives (<10%)
SBF experimental results (cont.) • MIN [Broder et al. WWW03], theoretically optimal • assumes "the entire sequence of requests is known in advance" • beats LRU caching by <5% in most cases • The more false positives are allowed, the more SBF gains
Outline • Introduction • Continuous membership query • Point query • Motivating application • Problem statement • Existing solutions and our solution • Theoretical and experimental results • Similarity self-join size estimation • Conclusions and future work
Motivating application • Internet traffic monitoring • Query the # of IP packets sent by a particular IP address in the past hour • Phone call record analysis • Query the # of calls to a given phone number yesterday
Problem statement • Point query • Summarize a stream of elements • Estimate the frequency of a given element • Goal: minimize the space cost and answer the query fast
Existing solutions • Fast-AGMS sketch [AMS97, Charikar et al. 2002] • Count-min sketch (counting Bloom filters) • e.g. an element is hashed to 4 counters, one per row of a counter table • Take the min counter value as the estimate • (Figure: a 4 × 6 table of counters; the element hashes to one counter in each row)
Our solution • Count-median-mean (CMM) • Count-min based • Take the value of the counter the element is hashed to • Deduct the median/mean value of all the other counters in that row • Deducting the mean yields an unbiased estimate • Basic idea: all counters are expected to have the same value • Example: counter value = 3, mean value of all other counters = 2 (median = 2, more robust), so frequency estimate = 3 - 2 = 1
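The sketch below combines the count-min estimate from the previous slide with a CMM-style estimate: per row, subtract the mean of all the other counters in that row, then take the median of the per-row values. Class and parameter names are illustrative and the details are simplified.

```python
import hashlib
import statistics

class CountMinCMM:
    """A width x depth table of counters; each element is hashed to one
    counter per row. estimate_count_min returns the classical minimum;
    estimate_cmm subtracts the expected noise (mean of the other counters
    in the row) and takes the median across rows."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # total count of all insertions

    def _col(self, row, element):
        digest = hashlib.sha1(f"{row}:{element}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, element, count=1):
        self.total += count
        for r in range(self.depth):
            self.table[r][self._col(r, element)] += count

    def estimate_count_min(self, element):
        return min(self.table[r][self._col(r, element)] for r in range(self.depth))

    def estimate_cmm(self, element):
        residues = []
        for r in range(self.depth):
            c = self.table[r][self._col(r, element)]
            noise = (self.total - c) / (self.width - 1)  # mean of the other counters
            residues.append(c - noise)
        return statistics.median(residues)
```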
Theoretical results • Unbiased estimate (when deducting the mean) • Estimate variance is the same as that of Fast-AGMS (when deducting the mean) • For less skewed data sets • the estimation accuracies of CMM and Fast-AGMS are exactly the same
Experimental results and analysis • For skewed data sets • Accuracy (given the same space): CMM-median = Fast-AGMS > CMM-mean • Time cost analysis • CMM-mean = Fast-AGMS < CMM-median • but the difference is small • Advantages of CMM • More flexible (also provides an estimate upper bound) • More powerful (Count-min can be more accurate for very skewed data sets)
Outline • Introduction • Continuous membership query • Point query • Similarity self-join size estimation • Motivating application • Problem statement • Existing solutions and our solution • Theoretical and experimental results • Conclusions and future work
Motivating application • Near-duplicate document detection for search engines [Broder 99, Henzinger 06] • Very slow (30M pages took 10 days in 1997; 2006?) • It would be useful to predict the running time • How? Estimate the number of similar pairs • Data cleaning in general (similarity self-join) • To find a better query plan (query optimization) • An estimate of the similarity self-join size is needed
Problem statement • Similarity self-join size • Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar • An s-similar pair • A pair of records with s attributes in common • E.g. <Davood, Rafiei, CS, UofA, Canada> and <Fan, Deng, CS, UofA, Canada> are 3-similar
Existing solutions • A straightforward solution • Compare each record with all other records • Count the number of pairs that are at least s-similar • Time cost: O(n²) for n records • Random sampling • Take a sample of size m uniformly at random • Count the number of sampled pairs that are at least s-similar • Scale it by a factor of c = n(n-1)/(m(m-1))
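A sketch of the two baselines above: a position-wise similarity function (as in the 3-similar example on the previous slide) and the uniform random-sampling estimator with scaling factor c = n(n-1)/(m(m-1)). Function names and the sampling-without-replacement choice are illustrative.

```python
import random
from itertools import combinations

def similarity(r1, r2):
    """Number of attributes, compared position by position, that two records share."""
    return sum(a == b for a, b in zip(r1, r2))

def sampling_estimate(records, s, m, seed=None):
    """Estimate the number of record pairs that are at least s-similar:
    draw m records uniformly at random, count qualifying pairs inside the
    sample, and scale up by c = n(n-1)/(m(m-1))."""
    n = len(records)
    sample = random.Random(seed).sample(records, m)
    pairs_in_sample = sum(
        1 for r1, r2 in combinations(sample, 2) if similarity(r1, r2) >= s
    )
    return pairs_in_sample * (n * (n - 1)) / (m * (m - 1))

print(similarity(("Davood", "Rafiei", "CS", "UofA", "Canada"),
                 ("Fan", "Deng", "CS", "UofA", "Canada")))  # -> 3
```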
Our solution • Offline SimParCount (Step 1: data processing) • Linearly scan all records once • For each record and each k = s…d • Randomly pick k different attribute values, and concatenate them into one k-super-value • Repeat this process l_k times • Treat all k-super-values as a stream • Store the (d-s+1) super-value streams on disk
Our solution (cont.) • Offline SimParCount (Step 2: result generation) • Obtain the self-join size of those 1-dimensional super-value streams • Based on the d-s+1 self-join sizes, estimate the similarity self-join size • Online SimParCount • Use small sketches to estimate stream self-join sizes rather than expensive external sorting
Our solution (cont.) • Key idea • Convert similarity self-join size estimation into stream self-join size estimation • A similar record pair has a certain chance of producing a match in the super-value stream • Example (records → 2-super-values): <1a,2c,3b,4v> → <2c,3b>; <1e,2c,3b,4v> → <2c,3b>; <1e,2f,3d,4e> → <1e,3d>; …
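The sketch below illustrates the first step of offline SimParCount: turning records into (position, value) super-value streams, plus the exact self-join size of one such stream, taken here as its second frequency moment (the quantity the online version approximates with Fast-AGMS sketches). The repetition counts l_k and the way the d-s+1 stream self-join sizes are combined into the final estimate follow the paper and are not reproduced here; names and defaults are illustrative.

```python
import random
from collections import Counter

def super_value_streams(records, s, reps=None, seed=None):
    """For each record and each k = s..d, draw a random set of k attribute
    positions l_k times and concatenate the chosen (position, value) pairs
    into a k-super-value; return one super-value stream per k.
    reps maps k to l_k (a single repetition by default, for illustration)."""
    d = len(records[0])
    rng = random.Random(seed)
    reps = reps or {k: 1 for k in range(s, d + 1)}
    streams = {k: [] for k in range(s, d + 1)}
    for record in records:
        for k in range(s, d + 1):
            for _ in range(reps[k]):
                positions = sorted(rng.sample(range(d), k))
                streams[k].append(tuple((p + 1, record[p]) for p in positions))
    return streams

def stream_self_join_size(stream):
    """Exact self-join size of a one-dimensional stream: the second frequency
    moment, i.e. the sum of squared super-value frequencies."""
    freqs = Counter(stream)
    return sum(f * f for f in freqs.values())

streams = super_value_streams(
    [("a", "c", "b", "v"), ("e", "c", "b", "v"), ("e", "f", "d", "e")], s=2)
print({k: stream_self_join_size(v) for k, v in streams.items()})
```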
Theoretical results • Unbiased estimate • Standard deviation bound of the estimate • Time and space cost (For both offline and online SimParCount)
Experimental results • Online SimParCount vs. random sampling • Given the same amount of space • Error = (estimate – trueValue) / trueValue • Dataset: • DBLP paper titles • Each converted into a record with 6 attributes • Using min-wise independent hashing
Similarity self-join size estimation – Experimental results (cont.)
Conclusions and future work • Streaming algorithms • found real applications (important) • can lead to theoretical results (fun) • More work to be done • Current direction: multi-dimensional streaming algorithms • E.g. estimating the # of outliers in one pass
Thanks! Questions/Comments?