Faculty of Computer Science, Institute of System Architecture, Database Technology Group. Sampling Time-Based Sliding Windows in Bounded Space. Rainer Gemulla, Wolfgang Lehner. Technische Universität Dresden.
Motivation: Ad-hoc Queries • Query a data stream • SELECT SUM(size) AS num_bytes • FROM packets [Range 60 Minutes] • Window width (fixed) vs. window size (varying) • Figure: synthetic sine curve (24h) plus peak
Sampling Time-Based Windows • Approaches • Exact: Store entire window • Approximate • Use specialized synopses • Random sampling • Challenges • Preserve uniform sampling characteristics • Ensure statistical correctness • Consider space bounds • Effective resource management • Maximize sample size • Achieve best possible estimates
Outline • Introduction • Available Schemes • Bounded Priority Sampling • Analysis & Experimental Results • Conclusion
Existing Techniques • Bernoulli sampling (coin-flip sample) • Each item is included with probability q (= sampling rate) • Sample size is qN in expectation, where N is the window size • Not a bounded-space scheme • Example: 40-byte items, 32 kbyte space → max 819 items → q = 0.0276
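A minimal sketch of Bernoulli sampling over a time-based window, assuming items arrive in timestamp order as (timestamp, value) pairs. The class name, its methods, and the helper `max_items` are hypothetical and not taken from the talk. The sketch illustrates why the scheme is not space-bounded: the stored sample grows with the number of items in the window.

```python
import random
from collections import deque


class BernoulliWindowSample:
    """Coin-flip sample of the items that arrived in the last `width` time units."""

    def __init__(self, q, width, rng=None):
        self.q = q              # sampling rate
        self.width = width      # window width in time units
        self.rng = rng or random.Random()
        self.items = deque()    # sampled (timestamp, value) pairs, oldest first

    def insert(self, timestamp, value):
        # Each arriving item is kept independently with probability q.
        if self.rng.random() < self.q:
            self.items.append((timestamp, value))

    def sample(self, now):
        # Drop sampled items that have left the window (timestamp <= now - width).
        while self.items and self.items[0][0] <= now - self.width:
            self.items.popleft()
        return [value for _, value in self.items]


def max_items(space_bytes, item_bytes):
    """Number of whole items that fit into a space budget (no per-item overhead assumed)."""
    return space_bytes // item_bytes
```

With 40-byte items and 32 kbyte of space, `max_items(32 * 1024, 40)` gives the 819 items quoted on the slide.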
Existing Techniques • Priority sampling • Assigns a random priority to each arriving item • Item with the highest priority = random sample of size 1 • Larger samples → multiple copies • O(log N) items in expectation → unbounded • Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In Proc. SODA, pages 633–634, 2002.
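A minimal sketch of priority sampling for a sample of size 1, assuming items arrive in timestamp order. It keeps only items whose priority exceeds that of every later arrival, since no other item can ever become the highest-priority item in the window. The class name and the optional `priority` argument (used to make runs reproducible) are illustrative assumptions, not part of the cited scheme's interface.

```python
import random
from collections import deque


class PrioritySample:
    """Uniform sample of size 1 from the last `width` time units."""

    def __init__(self, width, rng=None):
        self.width = width
        self.rng = rng or random.Random()
        # (timestamp, priority, value), oldest first, priorities decreasing
        self.chain = deque()

    def insert(self, timestamp, value, priority=None):
        p = self.rng.random() if priority is None else priority
        # Older items with a lower priority can never be the window maximum again.
        while self.chain and self.chain[-1][1] < p:
            self.chain.pop()
        self.chain.append((timestamp, p, value))

    def sample(self, now):
        # Remove expired items; the front is then the highest-priority item in the window.
        while self.chain and self.chain[0][0] <= now - self.width:
            self.chain.popleft()
        return self.chain[0][2] if self.chain else None
```

The length of `chain` is not bounded in advance, which is the space problem the slide points out.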
Example: Priority Sampling • Figure: sample size, sample space • k = 113 items
Sample Synopsis • Sample size • Fixed • Bounded • Unbounded • Sample space • Bounded • Unbounded • Figure labels: Space, Sample, Overhead
Outline • Introduction • Existing Techniques • Bounded Priority Sampling • Analysis & Experimental Results • Conclusion
A Negative Result • Fixed sample size in bounded space impossible • Sample size 1 • $I_j$ = item $j$ reported at time $j$ • Different items: at least $\sum_j I_j$ • Expected: $E[\sum_j I_j] = \sum_j E[I_j] = 1 + 1/2 + \dots + 1/N = O(\log N)$ • Worst case ≥ average case ... • Event and probability: $I_1$: $1/N$ • $I_2$: $1/(N-1)$ • ... • $I_{N-1}$: $1/2$ • $I_N$: $1$
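The sum in the bound above is the N-th harmonic number. A small sketch, with a hypothetical function name, evaluates it exactly so its logarithmic growth can be checked numerically:

```python
from fractions import Fraction
import math


def expected_distinct_items(n):
    """Exact value of 1 + 1/2 + ... + 1/n, the expectation from the slide."""
    return sum(Fraction(1, j) for j in range(1, n + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        h = float(expected_distinct_items(n))
        # The harmonic number stays within 1 of the natural logarithm of n.
        print(n, round(h, 3), round(math.log(n), 3))
```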
Bounded Priority Sampling • Data structure • Candidate = highest-priority item since last expiration • Test item = expired candidate • Sample extraction • No test item: REPORT • Candidate < Test: DO NOT REPORT • Candidate > Test: REPORT
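A minimal sketch of the candidate/test rule above for a sample of size 1, assuming items arrive in timestamp order. The slide does not say when the test item is discarded; this sketch drops it once its timestamp is two window widths old, because from then on the window contains only items that arrived after the candidate's predecessor expired. That rule, the class name, and the optional `priority` argument are assumptions made for illustration.

```python
import random


class BoundedPrioritySample:
    """Stores at most two items: the candidate and the test item."""

    def __init__(self, width, rng=None):
        self.width = width
        self.rng = rng or random.Random()
        self.candidate = None   # (timestamp, priority, value)
        self.test = None        # last expired candidate

    def _expire(self, now):
        # An expired candidate becomes the test item.
        if self.candidate and self.candidate[0] <= now - self.width:
            self.test, self.candidate = self.candidate, None
        # Assumption: the test item is no longer needed after two window widths.
        if self.test and self.test[0] <= now - 2 * self.width:
            self.test = None

    def insert(self, timestamp, value, priority=None):
        self._expire(timestamp)
        p = self.rng.random() if priority is None else priority
        # Candidate = highest-priority item since the last expiration.
        if self.candidate is None or p > self.candidate[1]:
            self.candidate = (timestamp, p, value)

    def sample(self, now):
        self._expire(now)
        if self.candidate is None:
            return None                      # nothing to report
        if self.test is not None and self.candidate[1] < self.test[1]:
            return None                      # candidate < test: do not report
        return self.candidate[2]             # no test item, or candidate > test
```

The price of the fixed two-item footprint is that `sample` sometimes returns nothing, which is the best-effort behaviour analysed later in the talk.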
Proof of Correctness • Outline • $e_{\max}$: the highest-priority item in the window (random) • e: candidate at start of current window (now expired) • It can be shown that • Does not depend on position of item in stream • Thus: $P(S=\{e_j\} \mid |S|=1) = P(e_j = e_{\max}) = 1/N$
Example: Bounded Priority Sampling • Figure: sample size, sample space • k = 585 items
Outline • Introduction • Existing Techniques • Bounded Priority Sampling • Analysis & Experimental Results • Conclusion
Analysis of Sample Size • Setting • $e_{\max}$: highest-priority item in current window (size $N$) • $e'_{\max}$: highest-priority item in previous window (size $N'$) • Observation • $e_{\max}$ is reported if its priority is higher than that of $e'_{\max}$ • Success probability (lower bound) • $P(|S|=1) = P(S=\{e_{\max}\}) \ge P(p_{\max} > p'_{\max}) = N/(N+N')$ • Example • $N'=2$, $N=4$ • 66% • Figure axis: window size ratio
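A small sketch checking the lower bound by brute force, assuming the bound is the probability that the highest of all priorities falls into the current window rather than the previous one. The function names are hypothetical. The example uses a current window of 4 items and a previous window of 2 items, the reading under which the slide's 66% comes out.

```python
from fractions import Fraction
from itertools import permutations


def report_probability_lower_bound(n_current, n_previous):
    """Current window size divided by the combined size of both windows."""
    return Fraction(n_current, n_current + n_previous)


def brute_force(n_current, n_previous):
    """Enumerate all priority orderings and count those whose maximum is current."""
    total = n_current + n_previous
    hits = count = 0
    for ranks in permutations(range(total)):
        count += 1
        # Positions 0..n_previous-1 belong to the previous window.
        position_of_max = ranks.index(total - 1)
        if position_of_max >= n_previous:
            hits += 1
    return Fraction(hits, count)


if __name__ == "__main__":
    print(report_probability_lower_bound(4, 2), brute_force(4, 2))
```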
Example: Bounded Priority Sampling • Figure: expected size
Experiments: Sample Size • NETWORK • Network traffic data, bursty • Window size: min 289; avg 11,724; max 1,180,077 • 22-byte items: 32 kbyte correspond to k = 862
Experiments: Sample Size • SEARCH • Usage statistics of search engine, slowly changing • Window size: min 0; avg 16,482; max 37,947 • 12-byte items: 32 kbyte correspond to k = 1,170
Sampling Multiple Items • Maintain k copies of the BPS data structure • Slow: O(kN) time for window of size N • Maintain the k highest-priority items • Fast: O(N + k log k log N) in expectation • Figure: NETWORK data set
Outline • Introduction • Existing Techniques • Bounded Priority Sampling • Analysis & Experimental Results • Conclusion
Conclusion • Sampling time-based windows • Challenging because window size fluctuates • Existing schemes do not provide space guarantees • Impossible to guarantee fixed sample size • Bounded priority sampling • Proceed in a best-effort manner • Probabilistic sample size guarantees • What else is in the paper? • Estimation of window size • Stratified sampling scheme
Thank you! Questions?
Backup: Stratified Sampling
Existing techniques • Stratified sampling • Partition the stream into consecutive strata (partitions) • Store stratum size, expiry timestamp and uniform sample • When applicable, higher statistical efficiency possible • Equi-Width Stratification • Start new stratum every Δt time units • Example (figure): $N_1=2$, $N_2=1$, $N_3=6$, $N_4=0$ with sampling fractions 50%, 100%, 16%
Effect of stratum sizes • Example: Window Average • Attribute is normally distributed, mean $\mu$, variance $\sigma^2$ • Estimator variance for per-stratum samples of size n • Minimized when all strata have the same size
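A sketch of why equal stratum sizes minimize the variance, under assumptions not stated on the slide: the window average is estimated by weighting each stratum's sample mean $\bar{y}_i$ by its relative size, each of the $L$ strata contributes a sample of size $n$, and the finite-population correction is ignored. The symbols $L$, $N_i$ and $\bar{y}_i$ are introduced here for illustration.

```latex
\hat{\mu} = \sum_{i=1}^{L} \frac{N_i}{N}\,\bar{y}_i ,
\qquad
\operatorname{Var}(\hat{\mu})
  = \sum_{i=1}^{L} \frac{N_i^2}{N^2}\cdot\frac{\sigma^2}{n}
  = \frac{\sigma^2}{n N^2} \sum_{i=1}^{L} N_i^2 ,
\qquad
N = \sum_{i=1}^{L} N_i .
```

For a fixed total $N$, the sum of squares $\sum_i N_i^2$ is smallest when every $N_i = N/L$, which gives $\operatorname{Var}(\hat{\mu}) = \sigma^2/(nL)$.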
Solution • Optimum Stratification • Strata have equal size • Not possible because we cannot move boundaries arbitrarily • But: we can merge strata • Merge-Based Stratification • Idea: Apply merges so as to minimize $Q_S$ at time of expiration of first stratum • Example (figure): after merge, $N_1=3$, $N_2=3$, $N_3=3$, $N_4=0$ with sampling fractions 33%, 33%, 33%
Algorithm • Assumption (preliminary) • Number $N^+$ of arrivals till next stratum expiration known • Goal • Partition the set into $l-1$ partitions so that sum of squares is minimized • Dynamic programming • Known algorithms: $O(l(l+N^+)^2)$ time • Here: $O(l^3)$ time • Details in the paper • Example (figure): 2, 1, 3, 1, 1, 1 with $N^+=3$
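A minimal sketch of the underlying optimization: split a sequence of sizes into a given number of contiguous groups so that the sum of squared group totals is minimal. This is a textbook dynamic program, not the paper's $O(l^3)$ algorithm, and the function name is hypothetical. The example sizes mirror the numbers shown on the slide; the choice of three groups is made here for illustration.

```python
def min_sum_of_squares_partition(sizes, groups):
    """Return (cost, partition) for the best split of `sizes` into `groups` contiguous groups."""
    n = len(sizes)
    if not 1 <= groups <= n:
        raise ValueError("groups must be between 1 and len(sizes)")

    # prefix[i] = total of the first i sizes, so a group total is a difference.
    prefix = [0]
    for s in sizes:
        prefix.append(prefix[-1] + s)

    inf = float("inf")
    # best[g][i]: minimal cost of splitting the first i sizes into g groups.
    best = [[inf] * (n + 1) for _ in range(groups + 1)]
    cut = [[0] * (n + 1) for _ in range(groups + 1)]
    best[0][0] = 0
    for g in range(1, groups + 1):
        for i in range(g, n + 1):
            for j in range(g - 1, i):
                cost = best[g - 1][j] + (prefix[i] - prefix[j]) ** 2
                if cost < best[g][i]:
                    best[g][i] = cost
                    cut[g][i] = j

    # Walk the recorded cut points backwards to recover the groups.
    partition = []
    i = n
    for g in range(groups, 0, -1):
        j = cut[g][i]
        partition.append(sizes[j:i])
        i = j
    partition.reverse()
    return best[groups][n], partition
```

For `min_sum_of_squares_partition([2, 1, 3, 1, 1, 1], 3)` the best split has three groups of total size 3 each.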
$N^+$ • Estimation • Time span till expiration of $R_1$: $\Delta$ • Idea: estimate $N^+$ as the number of arrivals in the last $\Delta$ time units • Find $j$ such that $t - t_j > \Delta$ and $t - t_{j+1} \le \Delta$ • Estimate $N^+$ as $\Delta \cdot N_{j+1,l}/(t - t_j)$ • Robustness • Estimates may be wrong • But we observe wrong estimates • Algorithm • Estimate $N^+$ and expected time of next merge • If $N^+$ items arrive before that time: recompute • If $N^+$ items arrive around that time: merge • If fewer than $N^+$ items have arrived: recompute
Stratified sampling • Results
Stratified sampling • Time per item
Backup: Sampling Multiple Items
Sampling Multiple Items • So far: With replacement • Maintain k copies of the BPS data structure • k priorities per item • Slow: O(kN) time for window of size N • Figure: NETWORK data set
Sampling Multiple Items • Without replacement • Maintain the k highest-priority items • k candidates, k test items • 1 priority per item • Sample extraction • Generalization of single-item case • Report: $S_{cand} \cap \text{top-}k(S_{cand} \cup S_{test})$
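A minimal sketch of the extraction step, reading the rule as "report those candidates that are among the k highest priorities of candidates and test items combined". That reading, the function name, and the (priority, value) tuple layout are assumptions made for illustration. Priorities are assumed to be distinct.

```python
def report_sample(candidates, tests, k):
    """Return the values of candidates ranked in the top k of candidates plus test items.

    candidates, tests: lists of (priority, value) pairs.
    """
    # Tag each item so candidates can be told apart after sorting.
    tagged = [(p, v, True) for p, v in candidates]
    tagged += [(p, v, False) for p, v in tests]
    tagged.sort(key=lambda item: item[0], reverse=True)
    return [v for _, v, is_candidate in tagged[:k] if is_candidate]
```

With k = 1 this reduces to the single-item rule: the candidate is reported only if there is no test item or its priority exceeds that of the test item.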
Sampling Multiple Items • Cost • Naive: O(kN) time as well • With treaps: expected O(N + k log k log N) • Figure: NETWORK data set
Backup: Older slides
Data streams • Data stream • High speed • Processed on the fly • Recent items more important • Statistics of interest • Arrival rates • Selectivities • Quantiles • Heavy hitters • Subset sums • Distinct counts • Clustering • For a recent time interval (e.g., 4 hours)
Sampling data streams • Approximation • Required to cope with (worst-case) load • Many specialized techniques exist • Random sampling • Approach: Maintain a sample of the recent items • Less accurate but versatile • Problem • Given a memory budget, maintain a sample of the items that arrived in a recent time interval
Sampling from sliding windows • Method 1: Sequence-based sampling • Sample from window of fixed size, then select recent items • Method 2: Time-based sampling • Sample directly from window of fixed width • Figure annotations: Outdated • Not representative • How to maintain?