Approximation Algorithms for Frequency Related Query Processing on Streaming Data. Presented by Fan Deng Supervisor: Dr. Davood Rafiei May 30, 2007. Outline. Introduction Continuous membership query Point query Similarity self-join size estimation Conclusions and future work.
Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor: Dr. Davood Rafiei May 30, 2007
Outline • Introduction • Continuous membership query • Point query • Similarity self-join size estimation • Conclusions and future work
Data stream • A sequence of data records • Examples • Document/URL streams from a Web crawler • IP packet streams • Web advertisement click streams • Sensor reading streams • ...
Processing in one pass • One pass processing • Online stream (one scan required) • Massive offline stream (one scan preferred) • Challenges • Huge data volume • Fast processing requirement • Relatively small fast storage space
Approximation algorithms • Exact query answers • can be slow to obtain • may need large storage space • sometimes are not necessary • Approximate query answers • can take much less time • may need less space • with acceptable errors
Frequency related queries • Frequency • # of occurrences • Continuous membership query • Point query • Similarity self-join size estimation
Outline • Introduction • Continuous membership query [SIGMOD’06] • Motivating application • Problem statement • Our theoretical and experimental results • Point query • Similarity self-join size estimation • Conclusions and future work
A Motivating Application • Duplicate URL detection in Web crawling • Search engines [Broder et al. WWW03] • Fetch web pages continuously • Extract URLs within each downloaded page • Check each URL (duplicate detection) • If never seen before • Then fetch it • Else skip it
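The crawl loop above can be sketched as follows. This is a minimal illustration of the exact (non-approximate) baseline, which keeps every URL ever seen in memory; the names `crawl`, `fetch_links` and `max_pages` are hypothetical and not taken from the slides.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=1000):
    """Exact duplicate detection: remember every URL ever seen in a set.

    `fetch_links(url)` is a hypothetical callback that downloads a page
    and returns the URLs found in it.
    """
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)              # fetch the page
        for link in fetch_links(url):    # extract URLs within the page
            if link not in seen:         # never seen before: fetch it later
                seen.add(link)
                frontier.append(link)
            # else: duplicate, skip it
    return fetched
```

The `seen` set grows with the number of distinct URLs, which is exactly the memory problem the following slides address.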
A Motivating Application (cont.) • Problems • Huge number of distinct URLs • Memory is usually not large enough • Disks are slow • Errors are usually acceptable • A false positive (missed URLs) • A false negative (redundant crawls or disk search)
Problem statement • A sequence of elements with order • Storage space M • Not large enough to store all distinct elements • Continuous membership query • For each arriving element: appeared before? Yes or No • Example stream: … d g a f b e a d c b a • Our goal • Minimize the # of errors • Fast
SBF theoretical results • The SBF (Stable Bloom Filter) will be stable • The expected # of “0”s becomes a constant after a number of updates • Converges at an exponential rate • Monotonically decreasing • False positive rates become constant • An upper bound on the false positive rate • (a function of 4 parameters: SBF size, # of hash functions, max cell value, and kick-out rate) • Setting the optimal parameters (partially empirical)
SBF experimental results (cont.) • Comparison of SBF and the FPBuffering method (LRU) • ~ 700M real URL fingerprints • SBF generates 3-13% fewer false negatives with the same # of false positives (<10%) • MIN [Broder et al. WWW03], theoretically optimal • assumes “the entire sequence of requests is known in advance” • beats LRU caching by <5% in most cases • The more false positives are allowed, the more SBF gains
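To make the four parameters concrete, here is a rough sketch of a Stable Bloom Filter update. It is an illustration under assumptions, not the authors' implementation: the class and parameter names are hypothetical, SHA-256 stands in for the hash family, and the kick-out step decrements independently chosen random cells for simplicity.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative SBF: small counters that are randomly decremented."""

    def __init__(self, num_cells, num_hashes, max_value, num_decrements, seed=0):
        self.cells = [0] * num_cells       # SBF size
        self.k = num_hashes                # number of hash functions
        self.max_value = max_value         # max cell value
        self.p = num_decrements            # cells decremented per update (kick-out)
        self.rng = random.Random(seed)

    def _positions(self, item):
        # Assumption: k hash functions simulated by salting SHA-256.
        m = len(self.cells)
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
            for i in range(self.k)
        ]

    def seen_before(self, item):
        """Answer the membership query, then record the item."""
        pos = self._positions(item)
        duplicate = all(self.cells[j] > 0 for j in pos)
        # Kick-out: decrement p random cells so stale elements fade away.
        for _ in range(self.p):
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Set the item's own cells to the maximum value.
        for j in pos:
            self.cells[j] = self.max_value
        return duplicate
```

Because cells are decremented over time, an old element can be forgotten (a false negative), while hash collisions can still produce false positives.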
Outline • Introduction • Continuous membership query • Point query [to be submitted] • Motivating application • Problem statement • Theoretical and experimental results • Similarity self-join size estimation • Conclusions and future work
Motivating application • Internet traffic monitoring • Query the # of IP packets sent by a particular IP address in the past one hour • Phone call record analysis • Query the # of calls to a given phone # yesterday
Problem statement • Point query • Summarize a stream of elements • Estimate the frequency of a given element • Goal: minimize the space cost and answer the query fast
CMM theoretical results • Unbiased estimate (deduct mean) • The estimate variance is the same as that of Fast-AGMS, a well-known method (when the mean is deducted) • For less skewed data sets • the estimation accuracies of CMM and Fast-AGMS are exactly the same
CMM experimental results and analysis • For skewed data sets • Accuracy (given the same space): CMM-median = Fast-AGMS > CMM-mean • Advantage of CMM: 2 estimates from 1 sketch • More flexible (with an upper bound on the estimate) • More powerful (Count-min can be more accurate for very skewed data sets)
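The "2 estimates from 1 sketch" idea can be illustrated with a Count-Min style counter array that answers a point query in two ways: the usual minimum (an upper bound) and a mean-deducted median. This is a sketch under assumptions: the class and method names are hypothetical, SHA-256 stands in for the hash family, and the noise term (total count minus the cell, averaged over the other cells in the row) is one reading of "deduct mean".

```python
import hashlib
import statistics


class CountMinMeanSketch:
    """Illustrative sketch: depth rows of width counters (width must be > 1)."""

    def __init__(self, depth, width):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]
        self.total = 0  # total count of all elements seen

    def _col(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        self.total += count
        for r in range(self.depth):
            self.rows[r][self._col(r, item)] += count

    def estimate_min(self, item):
        # Count-Min estimate: never below the true frequency.
        return min(self.rows[r][self._col(r, item)] for r in range(self.depth))

    def estimate_cmm(self, item):
        # Deduct the estimated noise (mean of the other cells) per row,
        # then take the median of the residues.
        residues = []
        for r in range(self.depth):
            c = self.rows[r][self._col(r, item)]
            noise = (self.total - c) / (self.width - 1)
            residues.append(c - noise)
        return statistics.median(residues)
```

Both estimates read the same counters, so the choice between them can be made at query time, for example by preferring `estimate_min` when the data is known to be very skewed.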
Outline • Introduction • Continuous membership query • Point query • Similarity self-join size estimation [submitted to VLDB’07] • Motivating application • Problem statement • Theoretical and experimental results • Conclusions and future work
Motivating application • Near-duplicate document detection for search engines [Broder 99, Henzinger 06] • Very slow (30M pages, 10 days in 1997; 2006?) • To predict the processing time, it is necessary to estimate the number of similar pairs • Data cleaning in general (similarity self-join) • To find a better query plan (query optimization) • Estimates of the similarity self-join size are needed
Problem statement • Similarity self-join size • Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar • An s-similar pair • A pair of records with s attributes in common • E.g. <Davood, Rafiei, CS, UofA, Canada> & <Fan, Deng, CS, UofA, Canada> are 3-similar
Theoretical results • Unbiased estimate • Standard deviation bound of the estimate • Time and space cost (for both offline and online SimPairCount)
Experimental results • Online SimPairCount vs. random sampling • Given the same amount of space • Error = (estimate - trueValue) / trueValue • Dataset: • DBLP paper titles • Each converted into a record with 6 attributes • Using min-wise independent hashing
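The definition can be checked with a brute-force count over all pairs. This is only the exact quadratic baseline, not the estimation algorithm; it assumes attributes are compared position by position, and the function names and the third record are made up for illustration.

```python
from itertools import combinations


def similarity(a, b):
    """Number of attribute positions on which two records agree."""
    return sum(x == y for x, y in zip(a, b))


def self_join_size(records, s):
    """Exact number of record pairs that are at least s-similar."""
    return sum(1 for a, b in combinations(records, 2) if similarity(a, b) >= s)


records = [
    ("Davood", "Rafiei", "CS", "UofA", "Canada"),
    ("Fan", "Deng", "CS", "UofA", "Canada"),
    ("Fan", "Deng", "EE", "MIT", "USA"),  # hypothetical record for illustration
]
print(similarity(records[0], records[1]))  # 3
print(self_join_size(records, 3))          # 1
```

The pairwise loop costs O(n^2) comparisons, which is why an estimate of the answer is useful before running the join itself.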
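One plausible way to turn a title into a 6-attribute record and to compute the error metric is sketched below. The slides do not give the details, so the word-level tokenization, the salted SHA-256 stand-in for a min-wise independent hash family, and the function names are all assumptions.

```python
import hashlib


def _hash(seed, token):
    # Assumption: salted SHA-256 stands in for one min-wise hash function.
    digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_record(title, num_attributes=6):
    """Map a non-empty title to a record of min-hash values, one per hash."""
    tokens = set(title.lower().split())
    return tuple(min(_hash(i, t) for t in tokens) for i in range(num_attributes))


def relative_error(estimate, true_value):
    """Error = (estimate - trueValue) / trueValue, as defined on the slide."""
    return (estimate - true_value) / true_value
```

With min-wise hashing, two titles agree on an attribute with probability equal to the Jaccard similarity of their token sets, so titles with many shared words tend to yield records with many attributes in common.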
Similarity self-join size estimation – Experimental results (cont.)
Conclusions and future work • Streaming algorithms • have found real applications (important) • can lead to theoretical results (fun) • More work to be done • Current direction: multi-dimensional streaming algorithms • E.g., estimating the # of outliers in one pass
Thanks! Questions/Comments?