Learn an efficient wavelet-based approach to OLAP aggregation queries using a compact data cube, improving the speed and accuracy of range-sum queries on high-dimensional data sets.
Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets
Based on the work of Jeffrey Scott Vitter and Min Wang
Guidelines • Overview • Preliminaries • The New Approach • Construction of the Algorithm • Experiments and Results • Summary
The Problem
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many On-Line Analytical Processing (OLAP) applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is therefore advantageous to have fast, approximate answers to OLAP aggregation queries.
Processing Methods
There are two classes of methods for processing OLAP queries:
• Exact methods: focus on how to compute the exact data cube.
• Approximate methods: becoming attractive in OLAP applications; they have been used in DBMSs for a long time. In choosing a proper approximation technique, there are two concerns:
• Efficiency
• Accuracy
Histograms and Sampling Methods
Histograms and sampling are used in a variety of important applications where quick approximations of an array of values are needed.
Advantages: simple and natural; the construction procedure is very efficient.
Disadvantages: inefficient to construct in high dimensions; cannot fit in internal memory.
Using wavelet-based techniques to construct analogs of histograms in databases has shown substantial improvements in accuracy over random sampling and other histogram-based approaches.
The Intended Solution
The proposed method provides approximate answers to high-dimensional OLAP aggregation queries over MASSIVE SPARSE DATA SETS in a time-efficient and space-efficient manner.
• Traditional histograms are infeasible for massive high-dimensional data sets.
• Previously developed wavelet techniques are efficient only for dense data.
• Previous approximation techniques do not give sufficiently accurate results for typical queries.
The Compact Data Cube
The performance of this method depends on the compact data cube, an approximate and space-efficient representation of the underlying multidimensional array, based upon multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.
The Data Set
A particular characteristic of the data sets is that they are MASSIVE AND SPARSE.
• D = {D_1, ..., D_d} denotes the set of dimensions (functional attributes).
• S is the d-dimensional array that represents the underlying data.
• N = |D_1| x |D_2| x ... x |D_d| denotes the total size of the array S, where |D_i| is the size of dimension D_i.
• N_z is defined to be the number of populated (nonzero) entries in S.
• Each cell of S contains the value of the measure attribute for the corresponding combination of the functional attributes.
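As an illustration (not from the original slides), such a sparse d-dimensional array is typically held as a coordinate-to-value map; the variable names and toy values below are hypothetical:

```python
# Hypothetical sketch: a sparse d-dimensional array S kept as a
# coordinate -> measure-value map, with N = prod(|D_i|) total cells
# and Nz = number of populated (nonzero) cells.
from math import prod

dim_sizes = [4, 4, 2]            # |D_1|, |D_2|, |D_3| (toy example)
S = {
    (0, 1, 0): 12.0,             # functional attributes -> measure value
    (3, 2, 1): 7.5,
}

N = prod(dim_sizes)              # total array size
Nz = len(S)                      # number of populated entries
print(N, Nz, Nz / N)             # size, populated count, density
```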
Range-Sum Queries
An important class of aggregation queries are the so-called range-sum queries, which are defined by applying the sum operation over a selected contiguous range in the domain of some of the attributes. A range-sum query can generally be formulated as follows:
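The formula itself did not survive extraction; the following is the standard form implied by the surrounding text, where [l_i, h_i] is the range selected along dimension D_i:

$$\mathrm{sum}(l_1{:}h_1,\; l_2{:}h_2,\;\dots,\; l_d{:}h_d) \;=\; \sum_{i_1=l_1}^{h_1}\sum_{i_2=l_2}^{h_2}\cdots\sum_{i_d=l_d}^{h_d} S[i_1, i_2, \dots, i_d]$$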
The d'-Dimensional Range-Sum Queries
An interesting subset of the general range-sum queries are the d'-dimensional range-sum queries, in which d' << d. In this case ranges are specified for only d' dimensions, and the ranges for the other d - d' dimensions are implicitly set to the entire domain.
Traditional vs. New Approach
In traditional approaches to answering range-sum queries using the data cube, all the subcubes of the data cube need to be precomputed and stored. When a query is given, a search is conducted in the data cube and the relevant information is fetched. In the new approach, some preprocessing work is still done on the original array, but instead of computing and storing all the subcubes, only one, much smaller compact data cube is stored. The compact data cube usually fits in one or a small number of disk blocks.
Approximation Advantages
This approach is preferable to the traditional approaches in two important respects:
• It saves storage space, both for the precomputation and for storing the precomputed data cube.
• Even when a huge amount of storage space is available and all the data cubes can be stored comfortably, it may take too long to answer a range-sum query, since all cells covered by the range need to be accessed.
I/O Model
The conventional parallel disk model is used, restricted here to a single disk.
The Method Outline
The method can be divided into three sequential phases:
1. Decomposition
2. Thresholding
3. Reconstruction
Decomposition
• Compute the wavelet decomposition of the multidimensional array S.
• Obtain a set of C' wavelet coefficients (C' ~ Nz), since, as is typical in practice, the array is assumed to be very sparse.
Thresholding and Ranking
• Keep only C (C << C') wavelet coefficients, corresponding to the desired storage usage and accuracy.
• Rank the C wavelet coefficients according to their importance in the context of accurately answering typical aggregation queries.
• The C ordered coefficients compose the compact data cube.
Reconstruction
• In the on-line phase, an aggregation query is processed by using the K most significant coefficients to reconstruct an approximate answer.
• The choice of K depends upon how much time the user is willing to spend.
Notes:
• More accurate answers can be provided upon request.
• Efficiency is crucial, since it affects the query response time directly.
Wavelet Decomposition
Wavelets are a mathematical tool for the hierarchical decomposition of functions in a space-efficient manner.
Haar wavelets are:
• conceptually the simplest wavelet basis functions
• fast to compute
• easy to implement
Haar Wavelet - Example
Suppose we have a one-dimensional signal of N = 8 data items: S = [2, 2, 0, 2, 3, 5, 4, 4].
One step of pairwise averaging and differencing gives [2, 1, 4, 4, 0, -1, -1, 0] (the four averages followed by the four detail coefficients). By repeating this process recursively on the averages, we get the full decomposition [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].
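A minimal Python sketch of this averaging-and-differencing recursion (an illustration, not the authors' code); it reproduces the decomposition above:

```python
def haar_decompose(signal):
    """Full 1-D Haar decomposition using pairwise averages and
    half-differences; the signal length must be a power of two."""
    coeffs = list(signal)
    length = len(coeffs)
    while length > 1:
        half = length // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:length] = averages + details   # recurse on the averages only
        length = half
    return coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```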
Wavelet Transform
The wavelet transform is a single coefficient representing the overall average of the original signal, followed by the detail coefficients in order of increasing resolution.
• The individual entries are called the wavelet coefficients.
• Coefficients at lower resolutions are weighted more than those at higher resolutions.
• The decomposition is very efficient: O(N) CPU time and O(N/B) I/Os.
Building the Compact Data Cube
The goal of this step is to compute the wavelet decomposition of the multidimensional array S, obtaining a set of C' wavelet coefficients.
1. Partition the d dimensions into g groups G_1, ..., G_g, for some 1 <= g <= d, where G_j = {D_{i_{j-1}+1}, ..., D_{i_j}} with i_0 = 0 and i_g = d. Each group G_j must satisfy a memory constraint (the M/(2B) condition referred to in the experiments), so that its partial decomposition fits in internal memory. A toy illustration of this grouping is sketched below.
2. The algorithm for constructing the compact data cube consists of g passes; in pass j:
• the data for the dimension group G_j are read into memory
• a multidimensional wavelet decomposition is performed along the dimensions of G_j
• the results are written out to be used for the next pass
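As a toy illustration of step 1 (not the authors' algorithm; the actual constraint involves the memory and block sizes M and B), consecutive dimensions could be grouped greedily under a size budget:

```python
def partition_dimensions(dim_sizes, budget):
    """Greedy sketch: split dimensions D_1..D_d (in order) into
    consecutive groups G_1..G_g so that the product of the dimension
    sizes in each group stays within `budget` (a stand-in for the
    paper's memory condition)."""
    groups, current, product = [], [], 1
    for size in dim_sizes:
        if current and product * size > budget:
            groups.append(current)
            current, product = [], 1
        current.append(size)
        product *= size
    groups.append(current)
    return groups

print(partition_dimensions([8, 4, 16, 2, 32], budget=64))
# -> [[8, 4], [16, 2], [32]]
```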
Eliminating Intermediate Results
One problem is that the density of the intermediate results will increase from pass to pass, since performing wavelet decomposition on sparse data usually produces more nonzero data. The natural solution is truncation, keeping roughly only Nz entries.
Learning process:
• During each pass, on-line statistics of the wavelet coefficients are kept to maintain a cutoff value.
• Any entry whose absolute value is below the cutoff value is thrown away on the fly.
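One simple way to realize such an on-the-fly cutoff (a sketch under assumptions, not the authors' statistics-based scheme) is to keep only the roughly Nz largest entries by absolute value with a bounded min-heap, whose minimum plays the role of the cutoff:

```python
import heapq

def truncate_stream(entries, keep):
    """Keep roughly the `keep` entries with the largest absolute value
    from a stream of (coordinates, value) pairs, discarding the rest
    on the fly.  The running heap minimum acts as the cutoff value."""
    heap = []                                   # min-heap keyed on |value|
    for coords, value in entries:
        item = (abs(value), coords, value)
        if len(heap) < keep:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:              # above the current cutoff
            heapq.heapreplace(heap, item)
    return [(coords, value) for _, coords, value in heap]
```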
Thresholding and Ranking
Given the storage limitation for the compact data cube, it is possible to keep only a limited number of wavelet coefficients. Let
C' = the number of wavelet coefficients computed,
C = the number of wavelet coefficients that can be stored.
Since C << C', the goal is to determine which are the best C coefficients to keep, so as to minimize the error of approximation.
P-norm
Once the error measure is decided for individual queries, it is meaningful to choose a norm by which to measure the error of a collection of queries. Let e = (e_1, ..., e_Q) be the vector of errors over a sequence of Q queries.
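The norm itself was lost in extraction; the standard p-norm of this error vector, which the slide presumably showed, is:

$$\|e\|_p = \left(\sum_{i=1}^{Q} |e_i|^p\right)^{1/p}, \qquad \|e\|_\infty = \max_{1 \le i \le Q} |e_i|$$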
Choosing the Coefficients
Choosing the C largest (in absolute value) wavelet coefficients after normalization is provably optimal for minimizing the 2-norm. But if a coefficient c_i is more likely to contribute to typical queries than another, it is assigned a larger weight w(c_i). Finally:
1. Pick the C'' (C < C'' < C') largest wavelet coefficients.
2. Among the C'' coefficients, choose the C with the largest weights.
3. Order the C chosen coefficients in decreasing order to get the compact data cube.
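A sketch of this two-stage selection in Python; the coefficient representation and the weight function are stand-ins (the paper's actual weights depend on the expected query workload), so treat them as assumptions:

```python
def build_compact_cube(coeffs, C, C2, weight):
    """coeffs: list of (index, normalized_value) wavelet coefficients.
    C2 (the C'' of the slides) and C are the intermediate and final
    numbers of coefficients to keep; `weight` scores a coefficient's
    likely contribution to typical queries."""
    # Stage 1: keep the C'' coefficients largest in absolute value.
    stage1 = sorted(coeffs, key=lambda c: abs(c[1]), reverse=True)[:C2]
    # Stage 2: among those, keep the C with the largest weights,
    # ordered by decreasing weight -- this forms the compact data cube.
    return sorted(stage1, key=weight, reverse=True)[:C]
```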
Answering On-Line Queries
The error tree is built based upon the wavelet transform procedure.
• It mirrors the wavelet transform.
• It is a bottom-up process.
• s(l:h) denotes the range sum of the values s(l), ..., s(h).
Constructing the Original Signal
The original signal S can be reconstructed from the tree nodes: each value s(i) is obtained from the root average plus the signed detail coefficients on the path from the root down to leaf i. Not all terms are always evaluated; only the true contributors are quickly evaluated when answering a query.
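With the average/half-difference convention used in the example above, the reconstruction of a single value can be written as follows (a reconstruction of the lost formula under that convention):

$$s(i) \;=\; c_{\text{root}} \;+\; \sum_{u \,\in\, \mathrm{path}(i)} \delta_u \, c_u, \qquad \delta_u = \begin{cases} +1 & \text{if } i \text{ lies in the left half of node } u \\ -1 & \text{if } i \text{ lies in the right half of node } u \end{cases}$$

where the sum ranges over the internal detail nodes u on the path from the root to leaf i.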
Answering a Query
To answer a query of the form sum(l_1:h_1, ..., l_{d'}:h_{d'}) using the k most significant coefficients of the compact data cube R, the following algorithm is used:

AnswerQuery(R, k, l_1, h_1, ..., l_{d'}, h_{d'})
  answer = 0
  for i = 1, 2, ..., k do
    if Contribute(R[i], l_1, h_1, ..., l_{d'}, h_{d'}) then
      answer = answer + Compute_Contribute(R[i], l_1, h_1, ..., l_{d'}, h_{d'})
  for j = d'+1, ..., d do
    answer = answer x |D_j|
  return answer
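For intuition, here is a self-contained one-dimensional sketch of answering a range sum from retained Haar coefficients (it follows the averaging/half-difference convention above; it is not the authors' multidimensional implementation, and names such as range_sum are illustrative):

```python
def range_sum(coeffs, n, lo, hi):
    """Approximate sum of s[lo..hi] (inclusive) from a dict of retained
    Haar coefficients {index: value} in error-tree order, where index 0
    is the overall average; n is the (power-of-two) signal length."""
    def overlap(a, b, c, d):                  # length of [a,b] intersect [c,d]
        return max(0, min(b, d) - max(a, c) + 1)

    total = coeffs.get(0, 0.0) * (hi - lo + 1)   # contribution of the average
    for idx, value in coeffs.items():
        if idx == 0:
            continue
        level = idx.bit_length() - 1          # 1 -> level 0; 2,3 -> level 1; ...
        span = n >> level                     # length of the interval this node covers
        start = (idx - (1 << level)) * span   # left end of that interval
        mid = start + span // 2
        plus = overlap(lo, hi, start, mid - 1)          # cells getting +value
        minus = overlap(lo, hi, mid, start + span - 1)  # cells getting -value
        total += value * (plus - minus)
    return total

# Using the example signal's full decomposition as the "compact cube":
full = dict(enumerate([2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]))
print(range_sum(full, 8, 2, 5))   # exact sum of [0, 2, 3, 5] -> 10.0
```

Dropping the smallest coefficients from `full` yields progressively coarser approximate answers, mirroring the role of k in AnswerQuery.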
Experiments Description
The experiments were performed using real-world data from the U.S. Census Bureau.
• The data file contains 372 attributes. The measure attribute is income. Functional attributes include, among others: age, sex, education, race, origin.
• Although the dimension sizes are generally small, the high dimensionality results in a 10-dimensional array with more than 16,000,000 cells, density ~ 0.001, Nz = 15,985.
• Platform: Digital Alpha workstation running Digital UNIX 4.0, 512 MB internal memory (only 1-10 MB are used by the program), logical block transfer size 2 x 4 KB.
Experiment Sets - Variable Density
• Dimension groups were partitioned to satisfy the M/(2B) condition.
• For all data sets, g = 2.
• The small differences in running time were mainly caused by the on-line cutoff effect.
Experiment Sets - Fixed Density
Running time scales almost linearly with respect to the input data size.
Accuracy of the Approximate Answers
• A comparison with traditional histograms is not meaningful, because they are too inefficient to construct for high-dimensional data.
• The comparison with random sampling algorithms depends on the distribution of the nonzero entries (random sampling performs better for uniform distributions).
Summary
A new wavelet technique for approximate answers to OLAP range-sum queries was presented. Four important issues were discussed and resolved:
• I/O efficiency of the data cube construction, especially when the underlying multidimensional array is very sparse.
• Response time in answering an on-line query.
• Accuracy in answering typical OLAP queries.
• Progressive refinement.