Approximation Algorithms for Frequency Related QueryProcessing on Streaming Data Presented by Fan Deng Supervisor: Dr. Davood Rafiei May 24, 2007
Data stream • A sequence of data records • Examples • Document/URL streams from a Web crawler • IP packet streams • Web advertisement click streams • Sensor reading streams • ...
Processing in one pass • One pass processing • Online stream (one scan required) • Massive offline stream (one scan preferred) • Challenges • Huge data volume • Fast processing requirement • Relatively small fast storage space
Approximation algorithms • Exact query answers • can be slow to obtain • may need large storage space • sometimes are not necessary • Approximate query answers • can take much less time • may need less space • with acceptable errors
Frequency related queries • Frequency • # of occurrences • Continuous membership query • Point query • Similarity self-join size estimation
Outline • Introduction • Continuous membership query • Motivating application • Problem statement • Existing solutions and our solution • Theoretical and experimental results • Point query • Similarity self-join size estimation • Conclusions and future work
A Motivating Application • Duplicate URL detection in Web crawling • Search engines [Broder et al. WWW03] • Fetch web pages continuously • Extract URLs within each downloaded page • Check each URL (duplicate detection) • If never seen before • Then fetch it • Else skip it
A Motivating Application (cont.) • Problems • Huge number of distinct URLs • Memory is usually not large enough • Disks are slow • Errors are usually acceptable • A false positive (false alarm) • A distinct URL is wrongly reported as a duplicate • Consequence: this URL will not be crawled • A false negative (miss) • A duplicate URL is wrongly reported as distinct • Consequence: this URL will be crawled redundantly or searched for on disk
Problem statement • A sequence of elements with order • Storage space M • Not large enough to store all distinct elements • Continuous membership query: for each arriving element (e.g., in the stream … d g a f b e a d c b a), has it appeared before? Yes or No • Our goal • Minimize the # of errors • Fast
An existing solution (caching) • Store as many distinct elements as possible in a buffer • Duplicate detection process • Upon element arrival, search the buffer • if found then report “duplicate” else “distinct” • Update the buffer using some replacement policies • LRU, FIFO, Random, …
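To make the caching approach concrete, here is a minimal Python sketch of duplicate detection with an LRU buffer; the function name, generator interface, and buffer size parameter are illustrative rather than taken from the thesis.

```python
from collections import OrderedDict

def lru_duplicate_detection(stream, buffer_size):
    """Report 'duplicate' if the element is currently buffered, else 'distinct'.
    Illustrative sketch of the caching approach with an LRU replacement policy."""
    buffer = OrderedDict()                  # element -> None, ordered by recency
    for x in stream:
        if x in buffer:
            buffer.move_to_end(x)           # refresh recency
            yield x, "duplicate"
        else:
            yield x, "distinct"             # false negative if x was seen but evicted
            buffer[x] = None
            if len(buffer) > buffer_size:
                buffer.popitem(last=False)  # evict the least recently used element
```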
Another solution (Bloom filters) • A bitmap, originally all "0" • Duplicate detection process • Hash each incoming element into some bits • If any bit is "0" then report "distinct" else "duplicate" • Update process: set the corresponding bits to "1" • Example (6-bit bitmap, two hash functions h1, h2): a → bits 1, 2; b → bits 1, 3; c → bits 2, 4; a again → bits 1, 2 already set, so "duplicate"
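A minimal Bloom filter sketch matching the description above; the SHA-1-based hash functions and the class interface are stand-ins, not the implementation used in the work.

```python
import hashlib

class BloomFilter:
    """Bitmap-based duplicate detection sketch (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.bits = bytearray(num_bits)     # one byte per bit, all initially 0
        self.m, self.k = num_bits, num_hashes

    def _positions(self, element):
        # Derive k hash positions from a salted SHA-1 digest (a simple stand-in).
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def seen_and_add(self, element):
        positions = list(self._positions(element))
        answer = "duplicate" if all(self.bits[p] for p in positions) else "distinct"
        for p in positions:
            self.bits[p] = 1                # update: set the corresponding bits
        return answer
```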
Another solution (Bloom filters, cont.) • False positives (false alarms) • Bloom filters will eventually be "full" (every bit set to "1") - all distinct URLs will then be reported as duplicates, and thus skipped!
Our solution (Stable Bloom Filters) • Kick "elements" out of the Bloom filters • Change bits to "cells" ("cellmap"), e.g., bitmap 1 0 1 0 1 1 becomes cellmap 3 0 2 1 3 0
Stable Bloom Filters (SBF, cont.) • A "cellmap", originally all "0" • Duplicate detection • Hash each element into some cells, check those cells • If any cell is "0", report "distinct" else "duplicate" • Kick "elements" out • Randomly choose some cells and decrement them by 1 • Update the "cellmap" • Set the element's cells to a predefined value Max > 0 • Use the same hash functions as in the detection stage
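A rough Python sketch of the SBF steps described above, reusing the same SHA-1 stand-in hashing as the Bloom filter sketch; the parameter names (max_value for Max, kick_cells for the kick-out rate) are illustrative.

```python
import hashlib
import random

class StableBloomFilter:
    """Cellmap-based duplicate detection sketch (illustrative only)."""

    def __init__(self, num_cells, num_hashes, max_value, kick_cells):
        self.cells = [0] * num_cells        # the "cellmap", all initially 0
        self.m, self.k = num_cells, num_hashes
        self.max_value = max_value          # Max > 0, written on update
        self.kick_cells = kick_cells        # cells decremented per arrival

    def _positions(self, element):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def seen_and_add(self, element):
        positions = list(self._positions(element))
        # Detection: "duplicate" only if every hashed cell is non-zero.
        answer = "duplicate" if all(self.cells[p] > 0 for p in positions) else "distinct"
        # Kick "elements" out: decrement randomly chosen cells by 1.
        for _ in range(self.kick_cells):
            j = random.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Update: set the element's own cells to Max.
        for p in positions:
            self.cells[p] = self.max_value
        return answer
```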
SBF theoretical results • SBF will be stable • The expected # of "0"s converges to a constant after a number of updates • Convergence is exponentially fast and monotonic • The false positive rate becomes constant • An upper bound on the false positive rate • (a function of 4 parameters: SBF size, # of hash functions, max cell value, and kick-out rate) • Setting the optimal parameters (partially empirical)
SBF experimental results • Experimental comparison between SBF and the caching/buffering method (LRU) • URL fingerprint data set, originally obtained from the Internet Archive (~700M URLs) • For a fair comparison, we introduce FPBuffering • Let caching generate some false positives • FPBuffering • If an element is not found in the buffer, report "duplicate" with a certain probability
SBF experimental results (cont.) • SBF generates 3-13% fewer false negatives than FPBuffering, while having exactly the same # of false positives (<10%)
SBF experimental results (cont.) • MIN [Broder et al. WWW03], theoretically optimal • assumes "the entire sequence of requests is known in advance" • beats LRU caching by <5% in most cases • The more false positives allowed, the more SBF gains
Outline • Introduction • Continuous membership query • Point query • Motivating application • Problem statement • Existing solutions and our solution • Theoretical and experimental results • Similarity self-join size estimation • Conclusions and future work
Motivating application • Internet traffic monitoring • Query the # of IP packets sent by a particular IP address in the past hour • Phone call record analysis • Query the # of calls to a given phone # yesterday
Problem statement • Point query • Summarize a stream of elements • Estimate the frequency of a given element • Goal: minimize the space cost and answer the query fast
Existing solutions • Fast-AGMS sketch [AMS97, Charikar et al. 2002] • Count-min sketch (counting Bloom filters) • e.g., an element is hashed to 4 counters, one per row of a counter array • Take the min counter value as the estimate
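A small Count-min sketch in Python for reference; the depth, width, and salted SHA-1 hashing are illustrative choices, not parameters from the thesis.

```python
import hashlib

class CountMinSketch:
    """Count-min point-query sketch (illustrative only)."""

    def __init__(self, depth, width):
        self.depth, self.width = depth, width
        self.counters = [[0] * width for _ in range(depth)]

    def _pos(self, row, element):
        digest = hashlib.sha1(f"{row}:{element}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, element, count=1):
        for r in range(self.depth):
            self.counters[r][self._pos(r, element)] += count

    def point_query(self, element):
        # Every counter overestimates the true frequency, so take the minimum.
        return min(self.counters[r][self._pos(r, element)]
                   for r in range(self.depth))
```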
Our solution • Count-median-mean (CMM) • Count-min based • Take the value of the counter the element is hashed to • Deduct the median/mean value of all the other counters in that row • Deducting the mean gives an unbiased estimate • Basic idea: all counters are expected to have the same value Example: • counter value = 3 • mean value of all other counters = 2 (median = 2, more robust) • frequency estimate = 3 - 2 = 1
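Building on the Count-min sketch above, the following is one way the CMM estimate could be computed; combining the per-row estimates with a median is an illustrative choice, and details may differ from the thesis.

```python
import statistics

def cmm_estimate(cms, element, use_median=True):
    """CMM-style point query on a CountMinSketch-like object (sketch only):
    per row, take the element's counter and deduct the median (or mean) of
    the other counters, then combine the per-row values with a median."""
    row_estimates = []
    total = sum(cms.counters[0])            # each arrival hits one counter per row
    for r in range(cms.depth):
        p = cms._pos(r, element)
        c = cms.counters[r][p]
        if use_median:
            others = [v for j, v in enumerate(cms.counters[r]) if j != p]
            noise = statistics.median(others)
        else:
            noise = (total - c) / (cms.width - 1)   # mean of the other counters
        row_estimates.append(c - noise)
    return statistics.median(row_estimates)
```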
Theoretical results • Unbiased estimate (when deducting the mean) • Estimate variance is the same as that of Fast-AGMS (when deducting the mean) • For less skewed data sets • the estimation accuracies of CMM and Fast-AGMS are exactly the same
Experimental results and analysis • For skewed data sets • Accuracy (given the same space): CMM-median = Fast-AGMS > CMM-mean • Time cost analysis • CMM-mean = Fast-AGMS < CMM-median • but the difference is small • Advantages of CMM • More flexible (comes with an estimate upper bound) • More powerful (Count-min can be more accurate for very skewed data sets)
Outline • Introduction • Continuous membership query • Point query • Similarity self-join size estimation • Motivating application • Problem statement • Existing solutions and our solution • Theoretical and experimental results • Conclusions and future work
Motivating application • Near-duplicate document detection for search engines [Broder 99, Henzinger 06] • Very slow (30M pages, 10 days in 1997; 2006?) • Good to predict the time • How? Estimate the number of similar pairs • Data cleaning in general (similarity self-join) • To find a better query plan (query optimization) • An estimate of the similarity self-join size is needed
Problem statement • Similarity self-join size • Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar • An s-similar pair • A pair of records with s attributes in common • E.g., <Davood, Rafiei, CS, UofA, Canada> & <Fan, Deng, CS, UofA, Canada> are 3-similar
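The s-similarity test itself is simple; a small Python helper (comparing attributes position by position is an assumption) reproduces the slide's example.

```python
def is_s_similar(record_a, record_b, s):
    """Two records with d attributes (compared position by position)
    are s-similar if they agree on at least s attributes."""
    matches = sum(1 for a, b in zip(record_a, record_b) if a == b)
    return matches >= s

# Example from the slide: these two 5-attribute records are 3-similar.
r1 = ("Davood", "Rafiei", "CS", "UofA", "Canada")
r2 = ("Fan", "Deng", "CS", "UofA", "Canada")
assert is_s_similar(r1, r2, 3)
```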
Existing solutions • A straightforward solution • Compare each record with all other records • Count the number of pairs that are at least s-similar • Time cost O(n^2) for n records • Random sampling • Take a sample of size m uniformly at random • Count the number of pairs that are at least s-similar • Scale it by a factor of c = n(n-1)/(m(m-1))
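A quick sketch of the random-sampling estimator, reusing the is_s_similar helper from the previous sketch; the function signature is illustrative.

```python
import random

def sampled_selfjoin_size(records, m, s):
    """Estimate the number of at-least-s-similar pairs from a uniform random
    sample of m records, scaled by n(n-1)/(m(m-1)). Illustrative sketch."""
    n = len(records)
    sample = random.sample(records, m)
    similar_pairs = sum(
        1
        for i in range(m)
        for j in range(i + 1, m)
        if is_s_similar(sample[i], sample[j], s)
    )
    return similar_pairs * (n * (n - 1)) / (m * (m - 1))
```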
Our solution • Offline SimParCount (Step 1 - data processing) • Linearly scan all records once • For each record and each k = s, …, d • Randomly pick k different attribute values and concatenate them into one k-super-value • Repeat this process l_k times • Treat all k-super-values as a stream • Store the (d-s+1) super-value streams on disk
Our solution (cont.) • Offline SimParCount (Step 2 - result generation) • Obtain the self-join size of each 1-dimensional super-value stream • Based on the d-s+1 self-join sizes, estimate the similarity self-join size • Online SimParCount • Use small sketches to estimate the stream self-join sizes rather than expensive external sorting
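For the online variant, one standard way to estimate a stream's self-join size without sorting is an AMS/Fast-AGMS-style sketch; the version below is a generic textbook sketch with stand-in hash functions, not necessarily the exact estimator used in the thesis.

```python
import hashlib
import statistics

def selfjoin_size_estimate(stream, rows=5, buckets=256):
    """AMS/Fast-AGMS-style self-join size (second frequency moment) estimate:
    hash each element to one bucket per row with a +/-1 sign, then take the
    median over rows of the sum of squared bucket values. Sketch only."""
    table = [[0] * buckets for _ in range(rows)]
    for x in stream:
        for r in range(rows):
            h = int(hashlib.sha1(f"{r}:{x}".encode()).hexdigest(), 16)
            bucket = h % buckets
            sign = 1 if (h >> 40) & 1 else -1   # use a separate hash bit for the sign
            table[r][bucket] += sign
    return statistics.median(sum(v * v for v in row) for row in table)
```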
Our solution (cont.) • Key idea • Convert similarity self-join size estimation to stream self-join size estimation • A similar record pair has a certain chance of producing a match in the super-value stream • Example (records → 2-super-values): <1a,2c,3b,4v> → <2c,3b>; <1e,2c,3b,4v> → <2c,3b>; <1e,2f,3d,4e> → <1e,3d>; …
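A rough sketch of the super-value construction from Step 1, showing how records become (d-s+1) super-value streams; the repeats argument plays the role of the l_k on the slides, and the position-tagged string encoding is just one possible representation.

```python
import random

def super_value_streams(records, s, d, repeats):
    """For each record and each k in s..d, randomly pick k attribute positions,
    concatenate their (position, value) pairs into one k-super-value, and repeat
    repeats[k] times; return one super-value stream per k. Illustrative sketch."""
    streams = {k: [] for k in range(s, d + 1)}
    for record in records:
        for k in range(s, d + 1):
            for _ in range(repeats[k]):
                positions = sorted(random.sample(range(d), k))
                super_value = "|".join(f"{p}:{record[p]}" for p in positions)
                streams[k].append(super_value)
    return streams
```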
Theoretical results • Unbiased estimate • Standard deviation bound of the estimate • Time and space cost (For both offline and online SimParCount)
Experimental results • Online SimParCount vs. random sampling • Given the same amount of space • Error = (estimate - trueValue) / trueValue • Dataset: • DBLP paper titles • Each converted into a record with 6 attributes • Using min-wise independent hashing
Similarity self-join size estimation – Experimental results (cont.)
Conclusions and future work • Streaming algorithms • found real applications (important) • can lead to theoretical results (fun) • More work to be done • Current direction: multi-dimensional streaming algorithms • E.g., estimating the # of outliers in one pass
Thanks! Questions/Comments?