Approximate Query Processing using Wavelets. Kaushik Chakrabarti (Univ. of Illinois), Minos Garofalakis (Bell Labs), Rajeev Rastogi (Bell Labs), Kyuseok Shim (KAIST and AITrc). Presented at the 26th VLDB Conference, Cairo, Egypt. Presented by Supriya Sudheendra.
Introduction • Approximate Query Processing is a viable solution for: • Huge amounts of data • High query complexities • Stringent response-time requirements • Decision Support Systems • Support business and organizational decision-making activities • Helps decision makers compile useful information from raw data, solve problems and make decisions
Introduction… • DSS users pose very complex queries to the DBMS • These queries require complex operations over gigabytes or terabytes of disk-resident data • Exact answers take a very long time to compute • In a number of scenarios, users prefer fast, approximate answers
Prior Work • Previous approximate query processing techniques • Focused on specific forms of aggregate queries • Data reduction mechanism: how to obtain the synopses of data • Sampling-Based Techniques • A join of two uniform random samples yields a non-uniform sample with very few tuples • For non-aggregate queries, sampling produces a small subset of the exact answer, which may be empty when joins are involved
Prior Work… • Histogram-Based Techniques • Problematic for high-dimensional data • Storage overhead • High construction cost • Wavelet-Based Techniques • Mathematical tool for hierarchical decomposition of functions • Apply wavelet decomposition to the input data collection to obtain a data synopsis • Avoids high construction costs and storage overhead
Contribution of the Paper • Viability and effectiveness of wavelets as a generic tool for high-dimensional DSS • New, I/O-efficient wavelet decomposition algorithm for relational tables • Novel query processing algebra for wavelet-coefficient data synopses • Extensive experiments
Background • Mathematical tool to hierarchically decompose functions • Coarse overall approximation together with detail coefficients that influence function at various scales • Haar wavelets are conceptually simple, fast to compute • Variety of applications like image editing and querying
One-Dimensional Haar Wavelets • How to compute, given a data array: • Average the values together pairwise to get a “lower-resolution” representation of the data • Detail coefficients: the difference between each pairwise average and the second value of the pair • Why detail coefficients? They capture the information lost by averaging, so the original data array can be reconstructed
One-dimensional Haar Wavelets • Wavelet transform: the overall average followed by the detail coefficients in increasing order of resolution. Each entry is called a wavelet coefficient • WA = [4, -2, 0, -1] • For vectors containing similar values, most detail coefficients have small values that can be eliminated • Eliminating them introduces only small errors
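A minimal Python sketch of the pairwise averaging and differencing described on this slide. The input array [2, 2, 5, 7] is an assumption: it was chosen because its transform matches the WA = [4, -2, 0, -1] shown above. The function names are illustrative, not taken from the paper.

```python
def haar_1d(values):
    """Unnormalized 1-D Haar transform; len(values) must be a power of two."""
    current = list(values)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        avgs = [(current[i] + current[i + 1]) / 2 for i in pairs]
        # Detail = pairwise average minus the second value of the pair.
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details  # coarser details go in front
        current = avgs
    return current + details


def inverse_haar_1d(coeffs):
    """Rebuild the data array from the overall average and detail coefficients."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(details)
        nxt = []
        for avg, det in zip(current, details):
            nxt.extend([avg + det, avg - det])
        current = nxt
    return current


wa = haar_1d([2, 2, 5, 7])
print(wa)                               # [4.0, -2.0, 0.0, -1.0]
print(inverse_haar_1d(wa))              # [2.0, 2.0, 5.0, 7.0]
# Dropping the small detail coefficient -1 gives an approximation of the data.
print(inverse_haar_1d([4, -2, 0, 0]))   # [2, 2, 6, 6]
```

Setting a detail coefficient to zero only perturbs the cells inside that coefficient's support, which is why eliminating small coefficients introduces only small, localized errors.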
One-dimensional Haar Wavelets • The overall average is more important than any detail coefficient • To normalize the final entries of WA, each wavelet coefficient is divided by √(2^l) • l: level of resolution at which the coefficient appears (l = 0 is the coarsest level) • WA = [4, -2, 0, -1/√2]
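The normalization step can be sketched as follows, assuming the usual Haar convention in which a coefficient at resolution level l is divided by √(2^l), with l = 0 for the overall average and the coarsest detail. The `normalize` helper and the index-to-level mapping are illustrative assumptions about the array layout (overall average first, then details from coarse to fine).

```python
import math


def normalize(coeffs):
    """Divide each coefficient by sqrt(2**l), where l is its resolution level.

    Assumed layout: index 0 is the overall average, index 1 the coarsest
    detail (both level 0), indices 2-3 are level 1, indices 4-7 level 2, ...
    """
    out = []
    for idx, c in enumerate(coeffs):
        level = max(idx, 1).bit_length() - 1  # floor(log2(idx)), 0 for idx 0
        out.append(c / math.sqrt(2 ** level))
    return out


print(normalize([4, -2, 0, -1]))  # last entry is -1/sqrt(2), about -0.7071
```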
Multi-dimensional Haar Wavelets • Haar wavelets can be extended to multi-dimensional arrays • Standard Decomposition • Fix an ordering for the data dimensions (1, 2, …, d) • For each dimension k in that order, apply the complete 1-D wavelet transform to every 1-D row of array cells along dimension k • Nonstandard Decomposition • Alternates between dimensions: one step of pairwise averaging and differencing is applied to each 1-D row of array cells along dimension k, for every k = 1, …, d • The process is then repeated recursively on the quadrant containing the averages across all dimensions
Non-standard Decomposition • Pairwise averaging and differencing for one positioning of the 2x2 box with root [2i_1, 2i_2] • Distribution of the results in the wavelet transform array WA • The process recurses on the lower-left quadrant of WA, which holds the averages
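As a rough illustration of the standard decomposition for d = 2, the sketch below applies the complete 1-D transform along one dimension and then along the other. The nested-list layout `a[i1][i2]`, the dimension ordering, and the function names are assumptions made for this example.

```python
def haar_1d(values):
    """Unnormalized 1-D Haar transform; len(values) must be a power of two."""
    current, details = list(values), []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        avgs = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details
        current = avgs
    return current + details


def standard_haar_2d(a):
    """Standard decomposition of a 2-D array stored as a[i1][i2]."""
    # Complete 1-D transform of every row along the second dimension.
    rows = [haar_1d(row) for row in a]
    # Complete 1-D transform of every row along the first dimension.
    cols = [haar_1d([rows[i1][i2] for i1 in range(len(rows))])
            for i2 in range(len(rows[0]))]
    # cols is indexed [i2][i1]; transpose back to [i1][i2].
    return [[cols[i2][i1] for i2 in range(len(cols))]
            for i1 in range(len(cols[0]))]


print(standard_haar_2d([[1, 2], [3, 4]]))  # [[2.5, -0.5], [-1.0, 0.0]]
```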
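The nonstandard procedure for a square 2-D array can be sketched as below. The placement of the three detail coefficients in the other quadrants and the sign conventions are assumptions of this sketch; the slide's figure fixes its own layout, in which the averages occupy the lower-left quadrant, represented here by the low-index quadrant.

```python
def nonstandard_haar_2d(a):
    """Nonstandard Haar decomposition of an n x n array (n a power of two)."""
    n = len(a)
    w = [list(map(float, row)) for row in a]
    size = n
    while size > 1:
        half = size // 2
        src = [row[:] for row in w]  # snapshot of the current averages
        for i1 in range(half):
            for i2 in range(half):
                # The 2x2 box rooted at [2*i1, 2*i2].
                a00 = src[2 * i1][2 * i2]
                a10 = src[2 * i1 + 1][2 * i2]
                a01 = src[2 * i1][2 * i2 + 1]
                a11 = src[2 * i1 + 1][2 * i2 + 1]
                # Average of the box goes to the quadrant that is recursed on.
                w[i1][i2] = (a00 + a10 + a01 + a11) / 4
                # Three detail coefficients go to the other three quadrants.
                w[half + i1][i2] = (a00 - a10 + a01 - a11) / 4
                w[i1][half + i2] = (a00 + a10 - a01 - a11) / 4
                w[half + i1][half + i2] = (a00 - a10 - a01 + a11) / 4
        size = half  # recurse on the quadrant of averages
    return w


print(nonstandard_haar_2d([[1, 2], [3, 4]]))  # [[2.5, -0.5], [-1.0, 0.0]]
```

Each pass touches only the quadrant of averages produced by the previous pass, so detail coefficients written earlier are never overwritten.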
Multi-dimensional Haar coefficients: Semantics and Representation • D-dimensional Haar basis function corresponding to w is defined by: • D-dimensional rectangular support region • Quadrant sign information
Support Regions for the 16 Nonstandard 2-D Haar Basis Functions • Blank areas: regions of A whose reconstruction is independent of the coefficient • WA[0,0]: overall average • WA[3,3]: contributes only to the upper-right quadrant
Haar Coefficients: Semantics and Representation • W = <R, S, v> • W.R: d-dimensional support hyper-rectangle of W; it encloses all cells in A to which W contributes • The hyper-rectangle is represented by its low and high boundaries across each dimension j, 1 <= j <= d • W.R.boundary[j].lo and W.R.boundary[j].hi • W contributes to each data cell A[i_1, …, i_d] where • W.R.boundary[j].lo <= i_j <= W.R.boundary[j].hi for all j
W.S – sign information for all d-dimensional quadrants of W.R • Denoted by W.S.sign[j].lo and W.S.sign[j].hi, corresponding to the lower and upper half of W.R’s extent along dimension j • The sign of a quadrant is computed as the product of the d sign-vector entries that map to that quadrant • W.v – scalar magnitude of W • The quantity that W contributes (with the appropriate sign) to all data array cells enclosed in W.R
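One possible way to model the triple W = <R, S, v> is sketched below. The class and field names are illustrative, and signs are stored as +1 / -1 integers, which is an assumption of this example. The usage reconstructs the hypothetical 1-D array [2, 2, 5, 7] from its four unnormalized coefficients [4, -2, 0, -1].

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Coefficient:
    lo: List[int]       # W.R.boundary[j].lo for each dimension j
    hi: List[int]       # W.R.boundary[j].hi for each dimension j
    sign_lo: List[int]  # W.S.sign[j].lo, stored as +1 or -1
    sign_hi: List[int]  # W.S.sign[j].hi, stored as +1 or -1
    v: float            # W.v, the scalar magnitude

    def contribution(self, cell: Sequence[int]) -> float:
        """Signed amount this coefficient adds to one data cell."""
        sign = 1
        for j, i_j in enumerate(cell):
            if not (self.lo[j] <= i_j <= self.hi[j]):
                return 0.0  # cell lies outside the support hyper-rectangle
            mid = (self.lo[j] + self.hi[j]) // 2
            # Product of the per-dimension sign entries for the cell's quadrant.
            sign *= self.sign_lo[j] if i_j <= mid else self.sign_hi[j]
        return sign * self.v


coeffs = [
    Coefficient([0], [3], [1], [1], 4.0),    # overall average
    Coefficient([0], [3], [1], [-1], -2.0),  # coarsest detail
    Coefficient([0], [1], [1], [-1], 0.0),   # detail for cells 0..1
    Coefficient([2], [3], [1], [-1], -1.0),  # detail for cells 2..3
]
data = [sum(c.contribution([i]) for c in coeffs) for i in range(4)]
print(data)  # [2.0, 2.0, 5.0, 7.0]
```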
Building Wavelet Coefficient Synopses • Relation R with d attributes X_1, X_2, …, X_d • R can be represented as a d-dimensional array A_R • The j-th dimension is indexed by the values of attribute X_j • Each cell contains the count of tuples in R having the corresponding combination of attribute values • A_R is the joint frequency distribution of all attributes of R
Chunk-based organization of relational tables • Joint frequency array A_R is split into d-dimensional chunks • Tuples of R that belong to the same chunk are stored contiguously on disk • If R is not chunked, one extra pre-processing step is needed to reorganize R on disk
ComputeWavelet Algorithm • When a chunk is loaded for the first time, ComputeWavelet can perform the entire computation required for decomposing it • Pairwise averaging and differencing is performed as soon as 2^d averages are accumulated • Memory efficient: no more than one active sub-array at a time for each level of resolution
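A small sketch of building the joint frequency array for a relation with d = 2 attributes. It assumes attribute values are already mapped to integers in 0..domain_size-1; the function name and the sample tuples are made up for illustration.

```python
from collections import Counter


def joint_frequency_array(tuples, domain_sizes):
    """Build A_R for a two-attribute relation: cell [x1][x2] = tuple count."""
    counts = Counter(tuples)
    d1, d2 = domain_sizes
    return [[counts.get((x1, x2), 0) for x2 in range(d2)] for x1 in range(d1)]


# Hypothetical relation R(X1, X2) with both domains of size 4.
r = [(0, 1), (0, 1), (2, 3), (3, 0), (2, 3), (2, 3)]
a_r = joint_frequency_array(r, (4, 4))
print(a_r[0][1], a_r[2][3], a_r[3][0], a_r[1][1])  # 2 3 1 0
```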
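The sketch below is not ComputeWavelet itself. It is a simplified 1-D analogue (d = 1, so 2^d = 2) meant only to illustrate the idea of averaging and differencing as soon as enough averages have accumulated, while keeping at most one pending value per resolution level. Here levels are counted from the finest resolution upward, and all names are illustrative.

```python
def streaming_haar_1d(values):
    """One-pass 1-D Haar transform with one pending average per level.

    Assumes the number of values is a power of two. Returns the overall
    average and a list of (level, detail) pairs, level 0 being the finest.
    """
    pending = []  # pending[l] = average waiting for its partner, or None
    details = []
    for v in values:
        carry, level = float(v), 0
        while True:
            if level == len(pending):
                pending.append(None)
            if pending[level] is None:
                pending[level] = carry  # wait for the second value of the pair
                break
            # Two averages are available at this level: combine them now.
            left, pending[level] = pending[level], None
            details.append((level, (left - carry) / 2))
            carry = (left + carry) / 2
            level += 1
    return pending[-1], details


print(streaming_haar_1d([2, 2, 5, 7]))
# (4.0, [(0, 0.0), (0, -1.0), (1, -2.0)])
```

Memory use grows with the number of resolution levels rather than with the size of the input, which mirrors the memory-efficiency point made on the slide.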
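Processing Relational Queries in Wavelet Coefficient Domain • [Figure: two equivalent paths from the synopses to the result] • Wavelet-domain path: wavelet-coefficient synopses W_T1, W_T2, …, W_Tk → Op(W_T1, …, W_Tk) → RS of wavelet coefficients W_S → Render(W_S) → approximate result relation S • Relational-domain path: wavelet-coefficient synopses W_T1, W_T2, …, W_Tk → Render(W_T1, …, W_Tk) → approximate relations T1, T2, …, Tk → Op(T1, T2, …, Tk) → approximate result relation S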
Selection Operator • Our selection operator has the general form select_pred(W_T), where pred represents a generic conjunctive predicate on a subset of the d attributes in T; that is, pred = (l_i1 ≤ X_i1 ≤ h_i1) ∧ … ∧ (l_ik ≤ X_ik ≤ h_ik), where l_ij and h_ij denote the low and high boundaries of the selected range along each selection dimension D_ij, j = 1, 2, …, k, k ≤ d.
Relation Selection - Relational Domain • [Figure: joint data distribution array over dimensions D1 and D2, with the query range marked] • In the relational domain, we are interested in only those cells inside the query range • In the wavelet domain, we are interested in only the coefficients that contribute to those cells
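A minimal sketch of the filtering idea: a coefficient is retained only if its support hyper-rectangle overlaps the query range in every selection dimension. Coefficients are modeled here as plain `(lo, hi, value)` tuples, which is an assumption of this example. Any adjustment of boundaries or sign information that a full selection operator would perform on the retained coefficients is deliberately left out.

```python
def select(coeffs, ranges):
    """Keep the coefficients that contribute to at least one selected cell.

    coeffs: list of (lo, hi, value), with lo and hi the per-dimension
            boundaries of the coefficient's support hyper-rectangle.
    ranges: dict mapping a selection dimension j to its (l_j, h_j) range;
            dimensions that are absent are unconstrained.
    """
    kept = []
    for lo, hi, value in coeffs:
        # Two intervals overlap iff each one starts before the other ends.
        if all(lo[j] <= h and l <= hi[j] for j, (l, h) in ranges.items()):
            kept.append((lo, hi, value))
    return kept


# Hypothetical 2-D synopsis over a 4 x 4 joint frequency array.
synopsis = [
    ((0, 0), (3, 3), 5.0),   # overall average: supports every cell
    ((0, 0), (1, 1), 1.5),   # low-index quadrant only
    ((2, 2), (3, 3), -0.5),  # high-index quadrant only
]
print(select(synopsis, {0: (2, 3), 1: (2, 3)}))
# [((0, 0), (3, 3), 5.0), ((2, 2), (3, 3), -0.5)]
```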
Experimental Study • Improved answer quality • Low synopsis construction costs • Fast query execution
Conclusion • Multidimensional wavelets as an effective tool for general purpose approximate query processing in modern, high dimensional applications • The query processing algorithms operate directly on the wavelet-coefficient synopses of relational data, thus allowing for very fast processing of arbitrarily complex queries entirely in the wavelet-coefficient domain • Extensive experimental study with synthetic as well as real-life data sets that verifies the effectiveness of the wavelet-based approach compared to both sampling and histograms