Data Stream Management Techniques
Dimitris Sacharidis
Supervisor: Timos Sellis
outline
• data streams
  • introduction
• wavelet synopses
  • Haar synopses
  • error metrics
• streaming synopses
  • time-series streams (shift-split)
  • update streams (GCS)
• hierarchically compressed synopses
• future work
  • range-sum wavelet synopses
data streams: introduction
modern applications require real-time (on-line) analysis of data
• … but aren't real-time systems supposed to do that?
  • not exactly: we need general-purpose systems, offering DBMS-like functionality
• example: a telecommunication company's network
  • traffic monitoring systems execute a few predefined tasks
  • need for ad-hoc complex queries, expressed in real time, with results monitored continuously
  • see AT&T's Gigascope
data streams: large volumes of data arriving at high, unpredictable rates that require continuous on-line processing
• continuous (persistent) queries
• data stream management systems (DSMS)
data streams: requirements
continuous monitoring imposes strong requirements
• small processing time for each incoming streaming tuple
  • ability to consume tuples at a high rate
• small memory footprint
  • computation usually takes place in main memory (disk I/Os are expensive)
• fast query evaluation
  • update query results to reflect new data
when it is not possible to satisfy the above:
• restrict the computation scope
  • windows, load shedding
• approximate query answering
  • sampling, synopses, sketches
wavelet synopses: introduction
• wavelet decomposition (transformation) is a mathematical tool for the hierarchical decomposition of functions
  • applications in signal/image processing
• used extensively as a data reduction tool in db scenarios:
  • selectivity estimation for large aggregate queries
  • fast approximate query answers
  • general-purpose streaming synopsis
• features
  • efficient: computed in linear time and space (vs. histograms, ~N²)
  • high compression ratio: a few terms suffice
  • generalizes to multiple dimensions
wavelet synopses: Haar transformation
assume a data vector d of 8 values
• iteratively perform pair-wise averaging and semi-differencing
• every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree
• the intermediate averages are not needed for reconstruction
• e.g., d4 = c0 − c1 + c3 + c6
wavelet tree (a.k.a. error tree)
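The pair-wise averaging and semi-differencing recursion above can be sketched as follows (a minimal, unnormalized illustration; the 8-value data vector is made up here, since the slide's actual figure is not reproduced):

```python
def haar_transform(d):
    """Unnormalized Haar wavelet decomposition of a vector whose length
    is a power of two, via repeated pairwise averaging and
    semi-differencing. Returns [overall average, c1, ..., c_{N-1}]
    in error-tree order."""
    coeffs = []            # detail coefficients, finest level last
    level = list(d)
    while len(level) > 1:
        averages, details = [], []
        for i in range(0, len(level), 2):
            averages.append((level[i] + level[i + 1]) / 2)
            details.append((level[i] - level[i + 1]) / 2)
        coeffs = details + coeffs   # coarser details go in front
        level = averages            # recurse on the averages
    return level + coeffs
```

Reconstruction then follows the error-tree signs; e.g., for `d = [2, 2, 0, 2, 3, 5, 4, 4]`, the coefficients `c` satisfy `c[0] - c[1] + c[3] + c[6] == d[4]`, matching the `d4 = c0 − c1 + c3 + c6` example.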
wavelet synopses: Haar synopsis
• any set of B coefficients constitutes a B-term wavelet synopsis
  • stored as <index, value> pairs
  • implicitly, all non-stored coefficients are set to zero
• introduces a reconstruction error per point estimate: eᵢ = |dᵢ − d̂ᵢ|, where d̂ᵢ is the value reconstructed from the synopsis
wavelet synopses: measuring error
• use some norm to aggregate the individual point errors eᵢ
• L2 norm: Σ eᵢ² is the sum squared error (sse)
  • e.g., sse = 224
• L∞ norm: max eᵢ is the maximum absolute error
  • e.g., max-abs-error = 10
• generalizes to any weighted Lp norm: Σ wᵢ eᵢᵖ
  • e.g., max-rel-error = max (1/dᵢ) eᵢ = 10/4 = 250%
(the slide's figure shows the vector of point errors e and the vector of data values d)
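The three error metrics can be sketched directly from their definitions (the slide's numbers refer to a data vector shown only in its figure, so the toy values below are made up):

```python
def sse(d, d_hat):
    """L2-style aggregate: sum of squared point errors."""
    return sum((a - b) ** 2 for a, b in zip(d, d_hat))

def max_abs_error(d, d_hat):
    """L-infinity norm: largest absolute point error."""
    return max(abs(a - b) for a, b in zip(d, d_hat))

def max_rel_error(d, d_hat):
    """Weighted L-infinity norm with weights w_i = 1/|d_i|
    (zero entries skipped to avoid division by zero)."""
    return max(abs(a - b) / abs(a) for a, b in zip(d, d_hat) if a != 0)
```

For example, `max_rel_error([4, 8], [3, 8])` is |4−3|/4 = 0.25, i.e., a 25% maximum relative error.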
wavelet synopses: optimizing for error metrics
a B-term wavelet synopsis can be optimized for any error metric
• sse-optimal synopses are straightforward
  • the wavelet transformation is orthonormal (after normalization)
  • by Parseval's theorem, the L2 norm is preserved
  • choose the B coefficients with highest absolute (normalized) value
• general (weighted or unweighted) Lp-norm-optimal synopses require superlinear (at least quadratic) time in N ← our focus
  • dynamic programming over the wavelet tree
• other decomposition methods
  • unrestricted wavelet synopses
  • Haar+ synopses
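The sse-optimal selection can be sketched as follows, assuming the unnormalized error-tree coefficient layout [c0, c1, ..., c_{N-1}]; dropping c_i adds c_i² times its leaf support to the sse, so by Parseval it suffices to keep the B largest normalized values (the function name is ours, not the thesis's):

```python
import heapq
import math

def sse_optimal_synopsis(coeffs, B):
    """Return a B-term synopsis, as {index: value}, that minimizes sse:
    keep the B coefficients with largest absolute normalized value."""
    N = len(coeffs)

    def support(i):
        # number of leaves coefficient c_i contributes to in the error tree:
        # c0 and c1 reach all N leaves, deeper coefficients half as many per level
        return N if i == 0 else N >> (i.bit_length() - 1)

    scored = [(abs(c) * math.sqrt(support(i)), i, c)
              for i, c in enumerate(coeffs)]
    top = heapq.nlargest(B, scored)          # B largest normalized magnitudes
    return {i: c for _, i, c in top}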
streaming synopses
problem: maintain a wavelet synopsis over a data stream
• investigate the space-time trade-off
consider a one-dimensional data vector a; there are two models, depending on how a is rendered:
• update model: stream elements are updates of type (i, ±u), which imply a[i] ← a[i] ± u, and do not arrive ordered in i; a has fixed size
• time-series model: stream elements are vector values of type (i, a[i]) and arrive ordered in i (e.g., time); a grows with the stream
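The two arrival models can be illustrated with a toy snippet (the concrete items are made up):

```python
# update model: items (i, +/-u) arrive in arbitrary order of index i;
# the data vector a has a fixed size and is modified in place
a = [0] * 8
for i, u in [(3, +5), (1, +2), (3, -1)]:
    a[i] += u                      # a[i] <- a[i] +/- u

# time-series model: items (i, a[i]) arrive ordered by i (e.g., time);
# the vector grows with the stream and each entry is seen once, finalized
ts = []
for i, v in enumerate([4.0, 7.0, 3.5]):
    ts.append(v)
```

The distinction matters for synopsis maintenance: in the time-series model coefficients become finalized as the stream advances, while in the update model any coefficient can change at any time.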
streaming synopses: time-series model (shift-split)
goal: maintain a B-term sse-optimal wavelet synopsis in the time-series model
• sse-optimal means we must choose the B highest absolute (normalized) value coefficients
• there are three sets of coefficients, depending on the current stream item:
  1. coeffs whose value is finalized (solid white)
  2. coeffs whose value is affected by the current stream item (grey)
  3. coeffs whose value is not yet known (dashed white)
• intuition:
  • store coeffs of type 2
  • keep in a heap the B highest coeffs of type 1
streaming synopses: time-series model (shift-split)
investigate the space-time tradeoff
• wait for M values
• transform them in O(M) time
• calculate contributions to the path in O(log(N/M)) time

                    for M=1       general M
per-item cost       O(logN)       O((1/M)·log(N/M)) (amortized)
space required      O(B+logN)     O(B+M+log(N/M))
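The "keep the B highest finalized coefficients in a heap" step from the intuition above can be sketched with a min-heap keyed on absolute normalized value (a hypothetical helper, not the thesis's actual code):

```python
import heapq

def keep_top_B(heap, B, index, normalized_value, value):
    """Offer a newly finalized coefficient to a min-heap that retains
    only the B entries with the largest |normalized value|.
    Each heap entry is (|normalized value|, index, value)."""
    entry = (abs(normalized_value), index, value)
    if len(heap) < B:
        heapq.heappush(heap, entry)
    elif entry > heap[0]:
        # new coefficient beats the current smallest kept one: swap it in
        heapq.heapreplace(heap, entry)
```

Each finalized coefficient is offered once and either displaces the smallest kept entry or is discarded, giving O(log B) work per finalization.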
streaming synopses: time-series model (shift-split)
additional contributions in [JSS05]
• introduction of the general-purpose operators shift and split
• first known results for multi-dimensional streams
  • standard and non-standard forms of the transformation
• results for maintenance of wavelet-transformed data
  • optimal allocation of coefficients into disk blocks
  • I/O-efficient algorithms for transformation and updating of wavelets
    • for any given memory and disk block size
  • improvements over state-of-the-art transformation algorithms
streaming synopses: update model (GCS)
• problem: maintain an sse-optimal wavelet synopsis over update streams
• algorithmic requirements:
  • small memory footprint (sublinear in data size)
  • fast per-stream-item processing time (sublinear in required memory)
  • fast query time (sublinear in data size)
  • quality guarantees on query answers
stream processing model
• assume a data vector a of fixed size N
• stream items are of the form (i, ±u), denoting a net change of ±u in the a[i] entry: a[i] := a[i] ± u
  • interpretation: u insertions/deletions of the i-th entry (we also allow entries to take negative values)
• important: items are seen only once, in the fixed order of arrival, and do not come ordered in i
streaming synopses: update model (GCS)
• typically B < main memory << N
  • cannot fit the entire vector (or its coefficients) in memory
  • updates arrive at arbitrary locations
• cannot solve the problem exactly => resort to approximation
  • use sketches, i.e., randomized projections
• improve two shortcomings of the earlier approach (GKMS)
  • updating the sketch requires O(|sketch|) operations per streaming item
  • querying for the largest coefficients requires Ω(NlogN) time
streaming synopses: update model (GCS)
• we introduce [CGS06] the GCS algorithm, which relies on two ideas:
  • (1) sketch the wavelet domain
  • (2) quickly identify large coefficients
• (1) is easy to accomplish: translate updates in the original domain to updates in the wavelet domain
  • only polylogarithmically more updates are required, even for multi-d
streaming synopses: update model (GCS)
• for (2) we would like to perform a binary-search-like procedure
  • enforce a hierarchical grouping on coefficients
  • prune groups that are not L2-heavy, as they cannot contain L2-heavy coefficients
  • only the remaining groups need to be examined more closely
  • iteratively keep pruning until singleton groups are reached
• but how do we estimate the L2 norm (energy) of a group of coefficients?
  • this is a difficult task, requiring a novel technical result
  • use the group count sketch [CGS06]
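The flavor of group-energy estimation can be conveyed with a toy AMS-style sketch: one ±1-signed counter per (group, repetition), whose squared value is an unbiased estimate of the group's energy. This is only an illustration of the principle; the actual Group-Count Sketch of [CGS06] hashes items into buckets and takes medians to obtain probabilistic guarantees:

```python
import random

class GroupEnergySketch:
    """Toy AMS-style estimator of a group's energy (sum of squared
    coefficient values) under streaming updates (group, item, delta)."""

    def __init__(self, groups, reps=8, seed=1):
        self.reps = reps
        self.counters = {g: [0.0] * reps for g in groups}
        self.rnd = random.Random(seed)
        self._signs = {}             # (rep, item) -> +/-1, drawn lazily

    def _sign(self, rep, item):
        key = (rep, item)
        if key not in self._signs:
            self._signs[key] = self.rnd.choice((-1, 1))
        return self._signs[key]

    def update(self, group, item, delta):
        # each repetition accumulates a +/-1-signed running sum
        for r in range(self.reps):
            self.counters[group][r] += self._sign(r, item) * delta

    def energy(self, group):
        # E[counter^2] = sum of squared item values (cross terms cancel
        # in expectation), so average the squared counters
        cs = self.counters[group]
        return sum(c * c for c in cs) / self.reps
```

Space and update time depend only on the number of groups and repetitions, not on N, which is what makes sketch-based pruning of coefficient groups feasible.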
hierarchically compressed synopses
aim: improve the indexing of wavelet coefficients => space efficiency
• wavelet coeffs are inefficiently stored as <index, value> pairs
  • for D-dimensional data, the index consumes a D/(D+1) fraction of the total space required per coefficient
• idea: compress the index
  • look for redundancies in the index
  • exploit access patterns and the binary tree structure
• solution: store sets of coefficients lying on paths
  • but when do significant coefficients lie on paths?
  • spikes, sudden changes
  • sparse data, or dense regions among sparse areas
hierarchically compressed synopses: definition
a Hierarchically Compressed Wavelet Coefficient (HCC) is the triplet <Bit, C, V>
• C is the index of the bottommost coeff of the path
• Bit is the unary (in 1s) representation of the number of coeffs in the path (the last bit is 0 and acts as a stop bit)
• V is the set of coefficient values
example path (shown in green on the slide): c7, c3, c1, c0
• HCC representation: <1110, 7, {-11, -5, -5.5, 35.5}>
a Hierarchically Compressed Wavelet Synopsis (HCWS) consists of a set of HCCs
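The triplet encoding can be sketched as a small helper that reproduces the slide's example (the function name is ours; it simply packs a bottom-to-top path of (index, value) pairs into <Bit, C, V>):

```python
def encode_hcc(path):
    """Encode a root-ward path of wavelet coefficients as an HCC triplet
    (Bit, C, V): C is the index of the bottommost coefficient, Bit is a
    unary run of 1s terminated by a 0 stop bit encoding the path length,
    and V lists the coefficient values bottom-to-top.
    path: list of (index, value), bottommost coefficient first."""
    bottom_index = path[0][0]
    bits = "1" * (len(path) - 1) + "0"    # e.g., 4 coefficients -> "1110"
    values = [v for _, v in path]
    return bits, bottom_index, values
```

Since the coefficient indices along a path are determined by the bottommost index C (each parent index is half the child's), only C and the path length need to be stored, which is where the space saving over per-coefficient <index, value> pairs comes from.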
hierarchically compressed synopses: example
consider an example with N=16 values; assume a budget B of 41 bytes
• a conventional synopsis stores 5 coeffs (shown in orange)
• an HCWS stores 8 coeffs as 2 HCCs (shown in green)
over 60% reduction in sum squared error (sse)
• the conventional synopsis has sse 752
• the HCWS has sse 294
hierarchically compressed synopses: algorithms
• optimal dynamic programming algorithm
  • O(NB) time, O(NlogB) space
• ε-approximate algorithm based on sparse dynamic programming
  • tunable guarantees (ε) and time/space requirements
• greedy heuristic algorithm (no guarantees)
  • very fast and efficient: O(N+BlogN) time, O(N) space
• various extensions in [SDS07]
  • streaming variants (time-series model) for all algorithms
  • multi-dimensional data
  • other error metrics, e.g., sum squared relative error
future work: range-sum wavelets
• devise algorithms for optimizing errors of range-sum queries
  • little previous work
  • optimize for the sse of range queries
  • hierarchical queries
• why is it difficult?
  • there are ~N² possible range queries
  • organize them according to the wavelet tree
  • multiple cases to consider
  • hard to apply dynamic programming; resort to PODP