Approximation and Load Shedding for QoS in DSMS*

This document provides notes and tutorials on approximation and load shedding techniques for maintaining quality of service in data stream management systems (DSMS). It covers topics such as synopses and approximation, sampling, histograms, wavelets, and approximate algorithms.

Presentation Transcript


  1. Approximation and Load Shedding for QoS in DSMS* • CS240B Notes by Carlo Zaniolo, CSD, UCLA • * Notes based on a VLDB’02 tutorial by Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

  2. Synopses and Approximation • Synopsis: bounded-memory history approximation • Succinct summary of old stream tuples • Like indexes/materialized views, but base data is unavailable • Examples • Sliding Windows • Samples • Histograms • Wavelet representation • Sketching techniques • Approximate Algorithms: e.g., median, quantiles, … • Fast and light Data Mining algorithms

  3. Overview of Stream Synopses • Windows: logical, physical (covered) • Samples: Answering queries using samples • Histograms: Equi-depth histograms, On-line quantile computation • Wavelets: Haar-wavelet histogram construction & maintenance

  4. Sampling: Basics • Idea: A small random sample S of the data often represents all the data well • For a fast approximate answer, apply a “modified” query to S • Example: select agg from R where odd(R.e) (n=12) • Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 • Sample S: 9 5 1 8 • If agg is avg, return the average of the odd elements in S (answer: 5) • If agg is count, return n times the average over all elements e in S of: 1 if e is odd, 0 if e is even (answer: 12*3/4 = 9) • Unbiased: For expressions involving count, sum, avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer
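The two estimators on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the original notes: the function name `estimate_from_sample` and its parameters are hypothetical, and the sample is assumed to have been drawn uniformly at random.

```python
def estimate_from_sample(sample, n, agg, predicate):
    """Estimate `select agg from R where predicate(R.e)` from a uniform sample.

    `sample` is the sample S, `n` is the number of stream elements seen.
    (Hypothetical helper for illustration only.)
    """
    matches = [e for e in sample if predicate(e)]
    if agg == "avg":
        # Average of the qualifying elements found in the sample.
        return sum(matches) / len(matches)
    if agg == "count":
        # Fraction of the sample that qualifies, scaled up to the stream size.
        return n * len(matches) / len(sample)
    raise ValueError(f"unsupported aggregate: {agg}")


def is_odd(e):
    return e % 2 == 1


S = [9, 5, 1, 8]
avg_answer = estimate_from_sample(S, 12, "avg", is_odd)      # 5.0
count_answer = estimate_from_sample(S, 12, "count", is_odd)  # 9.0
```

On the slide's data the estimates are 5 and 9, as on the slide; for comparison, the full stream contains 8 odd elements.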

  5. Probabilistic Guarantees • Example: Actual answer is within 5 ± 1 with prob ≥ 0.9 • Use Tail Inequalities to give probabilistic bounds on returned answer • Markov Inequality • Chebyshev’s Inequality • Hoeffding’s Inequality • Chernoff Bound
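As one worked instance of such a bound (an illustration added here, not taken from the slides), Hoeffding's inequality for a sample mean can be written as follows. The symbols are assumptions of this sketch: X_1, ..., X_m are independent samples with values in [a, b], \bar{X} is their mean, \mu is its expectation, \varepsilon is the tolerated error and \delta the tolerated failure probability.

```latex
\Pr\left[\,\lvert \bar{X} - \mu \rvert \ge \varepsilon\,\right]
  \;\le\; 2\exp\!\left(-\frac{2 m \varepsilon^{2}}{(b-a)^{2}}\right)
\qquad\Longrightarrow\qquad
m \;\ge\; \frac{(b-a)^{2}}{2\varepsilon^{2}}\,\ln\frac{2}{\delta}
\;\text{ suffices for }\;
\Pr\left[\,\lvert \bar{X} - \mu \rvert \ge \varepsilon\,\right] \le \delta .
```

Read this way, a guarantee of the form "within ± ε with probability ≥ 0.9" corresponds to choosing δ = 0.1 and a sample size m that satisfies the right-hand condition.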

  6. Sampling: some background • Reservoir Sampling [Vit85]: Maintains a sample S having a pre-assigned size M on a stream of arbitrary size • Add each new element to S with probability M/n, where n is the current number of stream elements • If an element is added, evict a random element from S • Instead of flipping a coin for each element, determine the number of elements to skip before the next one to be added to S • Concise Sampling [GM98]: Duplicates in sample S stored as <value, count> pairs (thus potentially boosting actual sample size) • Counting Samples [GM98]: For answering hot list queries (k most frequent values) • Window Sampling [BDM02, BOZ08]: Maintains a sample S having a pre-assigned size M on a window on a stream (reservoir sampling with expiring tuples)
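The basic reservoir scheme described above can be sketched as follows. This is a minimal Python illustration with hypothetical names (`reservoir_sample`, `m`, `rng`); it flips a coin for every element and does not implement the skip-count optimization mentioned on the slide.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Maintain a uniform random sample of size m over a stream of unknown length.

    The first m elements fill the reservoir; afterwards the n-th element is
    added with probability m/n and replaces a randomly chosen resident.
    (Illustrative sketch; names are not from the original notes.)
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= m:
            sample.append(item)                # reservoir not yet full
        elif rng.random() < m / n:
            sample[rng.randrange(m)] = item    # evict a random element
    return sample


sample = reservoir_sample(range(1000), 10, random.Random(42))
```

Passing a seeded `random.Random` makes a run reproducible, which is convenient when comparing estimates across experiments.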

  7. Load Shedding Using Samples • Given a complex query graph, how should the sampling process be used and managed? [BDM04] • More about this later [LawZ08]

  8. Overview • Windows: logical, physical (covered) • Samples: Answering queries using samples • Histograms: Equi-depth histograms, On-line quantile computation • Wavelets: Haar-wavelet histogram construction & maintenance • Sketches

  9. Histograms • Histograms approximate the frequency distribution of element values in a stream • A histogram (typically) consists of • A partitioning of element domain values into buckets • A count per bucket B (of the number of elements in B) • Widely used in DBMS query optimization • Many types proposed: • Equi-Depth Histograms: select buckets such that counts per bucket are equal • V-Optimal Histograms: select buckets to minimize frequency variance within buckets • Wavelet-based Histograms

  10. Types of Histograms • Equi-Depth Histograms • Idea: Select buckets such that counts per bucket are equal • V-Optimal Histograms [IP95] [JKM98] • Idea: Select buckets to minimize frequency variance within buckets (two figures, one per histogram type: count for bucket vs. domain values 1 to 20)

  11. Equi-Depth Histogram Construction • For a histogram with b buckets, compute elements with rank n/b, 2n/b, ..., (b-1)n/b • Example: (n=12, b=4) • Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 • After sort: 1 1 2 3 4 5 5 6 7 8 9 9 • rank = 3 (.25-quantile) • rank = 6 (.5-quantile) • rank = 9 (.75-quantile)
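The rank computation in this example can be sketched in Python as below. This is an offline illustration that sorts the whole input, not the one-pass quantile algorithms the notes allude to; the function name `equi_depth_boundaries` is hypothetical, ranks are assumed to be 1-based, and n is assumed to be a multiple of b to keep the arithmetic exact.

```python
def equi_depth_boundaries(values, b):
    """Return the elements of rank n/b, 2n/b, ..., (b-1)n/b (1-based ranks).

    Simplifying assumption of this sketch: len(values) is a multiple of b.
    """
    ordered = sorted(values)
    n = len(ordered)
    if b < 1 or n % b != 0:
        raise ValueError("this sketch assumes n is a positive multiple of b")
    step = n // b
    # Rank k*step is stored at index k*step - 1 in the sorted list.
    return [ordered[k * step - 1] for k in range(1, b)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
boundaries = equi_depth_boundaries(stream, 4)  # [2, 5, 7]
```

For the slide's stream, the elements at ranks 3, 6 and 9 of the sorted data are 2, 5 and 7.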

  12. Answering Queries Using Histograms [IP99] • (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation • Example: select count(*) from R where 4 <= R.e <= 15 • For equi-depth histograms, maximum error: (figure: bucket counts spread evenly among the bucket values over domain values 1 to 20; query range 4 ≤ R.e ≤ 15; answer: 3.5 *)
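The "spread the count evenly among bucket values" idea can be sketched as follows. The bucket layout used here is hypothetical (the boundaries in the slide's figure are not recoverable from the transcript), the domain is assumed to be integer-valued, and `estimate_range_count` is an illustrative name.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) for lo <= e <= hi from a histogram.

    `buckets` is a list of (bucket_lo, bucket_hi, count) over an integer
    domain. Each bucket's count is assumed to be spread evenly over the
    values it covers, so a partially covered bucket contributes in
    proportion to the overlap.
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total


# Hypothetical equi-depth histogram: 4 buckets of 3 elements each.
hypothetical_buckets = [(1, 5, 3), (6, 10, 3), (11, 15, 3), (16, 20, 3)]
estimate = estimate_range_count(hypothetical_buckets, 4, 15)
```

With these made-up buckets the query range covers two of the five values in the first bucket and all of the second and third, so the estimate is 3 * 2/5 + 3 + 3 = 7.2. Only the partially covered end buckets contribute error, which is why the error of an equi-depth histogram is bounded in terms of the per-bucket count.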

  13. Approximate Algorithms • Quantiles Using Samples • Quantiles from Synopses • One pass algorithms for approximate samples … • Much work in this area … omitted

  14. Overview • Windows: logical, physical (covered) • Samples: Answering queries using samples • Histograms: Equi-depth histograms, On-line quantile computation • Wavelets: Haar-wavelet histogram construction & maintenance • Sketches

  15. One-Dimensional Haar Wavelets • Wavelets: Mathematical tool for hierarchical decomposition of functions/signals • Haar wavelets: Simplest wavelet basis, easy to understand and implement • Recursive pairwise averaging and differencing at different resolutions
      Resolution   Averages                    Detail Coefficients
      3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
      2            [2, 1, 4, 4]                [0, -1, -1, 0]
      1            [1.5, 4]                    [0.5, 0]
      0            [2.75]                      [-1.25]
      Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
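The recursive pairwise averaging and differencing can be sketched in Python as below. This is a minimal illustration with a hypothetical function name (`haar_decompose`); it uses the unnormalized convention implied by the slide's numbers (average = (a + b) / 2, detail = (a - b) / 2) and assumes the input length is a power of two.

```python
def haar_decompose(values):
    """Return [overall average, detail coefficients from coarsest to finest].

    Assumes len(values) is a power of two (assumption of this sketch).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones computed earlier.
        details = level_details + details
    return averages + details


coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Each pass of the loop produces one row of the resolution table: [2, 1, 4, 4] with details [0, -1, -1, 0], then [1.5, 4] with [0.5, 0], then [2.75] with [-1.25].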

  16. Haar Wavelet Coefficients • Hierarchical decomposition structure (a.k.a. “error tree”) (figure: error tree with the overall average 2.75 at the root and the detail coefficients -1.25, 0.5, 0, 0, -1, -1, 0 below it; coefficient “supports” shown as +/- signs over the original frequency distribution 2 2 0 2 3 5 4 4)

  17. Compressed Wavelet Representations Key idea: Use a compact subset of Haar/linear wavelet coefficients for approximating frequency distribution Steps • Compute cumulative frequency distribution C • Compute linear wavelet transform of C • Greedy heuristic methods • Retain coefficients leading to large error reduction • Throw away coefficients that give small increase in error

  18. Overview • Windows: logical, physical (covered) • Samples: Answering queries using samples • Histograms: Equi-depth histograms, On-line quantile computation • Wavelets: Haar-wavelet histogram construction & maintenance • Sketches

  19. Sketches • Conventional data summaries fall short: • Quantiles and 1-d histograms: Cannot capture attribute correlations • Samples (e.g., using Reservoir Sampling) perform poorly for joins • Multi-d histograms/wavelets: Construction requires multiple passes over the data • Different approach: Randomized sketch synopses • Only logarithmic space • Probabilistic guarantees on the quality of the approximate answer • Can handle extreme cases.

  20. Overview • Windows: logical, physical (covered) • Samples: Answering queries using samples • Histograms: Equi-depth histograms, On-line quantile computation • Wavelets: Haar-wavelet histogram construction & maintenance • Sketches • QoS by load shedding.

  21. QoS and Load Shedding • When the input stream rate exceeds system capacity, a stream manager can shed load (tuples) • Load shedding affects queries and their answers: drop the tasks and the tuples that will cause the least loss • Introducing load shedding in a data stream manager is a challenging problem • Random load shedding or semantic load shedding

  22. Load Shedding in Aurora • QoS for each application as a function relating output to its utility – Delay based, drop based, value based • Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least – Determining when, where and how much load to shed

  23. Load Shedding in STREAM • Formulate load shedding as an optimization problem for multiple sliding window aggregate queries – Minimize inaccuracy in answers subject to output rate matching or exceeding arrival rate • Consider placement of load shedding operators in query plan – Each operator sheds load uniformly with probability p_i
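A single random load-shedding operator feeding a sum aggregate might look like the sketch below. It is an illustration under stated assumptions, not the STREAM implementation: `p` is taken to be the probability of dropping a tuple, the function name `shed_and_sum` is hypothetical, and the surviving sum is scaled by 1 / (1 - p) to compensate for the dropped tuples.

```python
import random


def shed_and_sum(stream, p, rng=None):
    """Drop each tuple independently with probability p, then estimate the sum.

    Returns (estimated_sum, tuples_kept). Scaling the kept sum by 1 / (1 - p)
    gives an unbiased estimate of the true sum.
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    kept_sum = 0.0
    kept = 0
    for value in stream:
        if rng.random() >= p:      # tuple survives the shedding operator
            kept_sum += value
            kept += 1
    return kept_sum / (1 - p), kept


exact, kept_all = shed_and_sum([1, 2, 3, 4], 0.0)
approx, kept_half = shed_and_sum(range(10000), 0.5, random.Random(7))
```

With p = 0 nothing is shed and the result is exact; with p = 0.5 roughly half of the tuples are processed, and the answer becomes an estimate whose accuracy depends on the variance of the values.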

  24. References
      [BDM02] B. Babcock, M. Datar, R. Motwani. “Sampling from a moving window over streaming data”. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634, 2002.
      [BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. “Succinct Sampling on Streams”. Submitted for publication.
      [Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.
      [GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving Approximate Query Answers”. ACM SIGMOD 1998.
      [BDM04] B. Babcock, M. Datar, R. Motwani. “Load Shedding for Aggregation Queries over Data Streams”. ICDE 2004, pp. 350–361.
      [LawZ08] Y.-N. Law and C. Zaniolo. “Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding”. International Journal of Business Intelligence and Data Mining, 2008.
