Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals Gray et al. Presented by: Priya Rajan
Introduction • The growing popularity of Data Warehouses has increased the demand for querying, reporting and OLAP tools • We need to extract maximum information from data and be able to analyze and aggregate it across multiple dimensions • Cross tabs over various dimensions are particularly useful because aggregates convey a lot of information when the amount of data is large
Traditional Representation of N-Dimensional Data • N-dimensional data is represented as a relation with N attribute columns • Example: here Sales Quantity is a function of Supplier, Product, Date and Location

Sales Table:
Supplier  Product  Date_ID  Loc_ID  Sales Quantity
S1        P1       1        4       10
S1        P3       2        2       18
S2        P2       3        3       39
S2        P2       3        1       11

(e.g., the first row represents a sales quantity of 10 by supplier S1 of product P1 on Date_ID 1 at Loc_ID 4)
Aggregation Queries • Standard SQL provides COUNT(), SUM(), MIN(), MAX(), AVG() • Example: SELECT AVG(Sales Quantity) FROM SALES • SQL also provides a GROUP BY operator that produces 0- and 1-dimensional aggregates • Example: SELECT SUPPLIER, PRODUCT, AVG(SALES QUANTITY) AS A FROM SALES GROUP BY SUPPLIER, PRODUCT • Result: one average per (Supplier, Product) group
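A runnable sketch of the slide's GROUP BY example, using Python's sqlite3 module with a made-up miniature Sales table (column names and values are illustrative, not from the paper):

```python
import sqlite3

# Tiny in-memory Sales table mirroring the slide (illustrative data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (supplier TEXT, product TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?,?,?)",
                 [("S1", "P1", 10), ("S1", "P1", 20), ("S2", "P2", 30)])

# 1-D aggregate per (supplier, product) group, as in the slide's query
rows = conn.execute(
    "SELECT supplier, product, AVG(qty) AS a "
    "FROM sales GROUP BY supplier, product ORDER BY supplier, product"
).fetchall()
print(rows)  # [('S1', 'P1', 15.0), ('S2', 'P2', 30.0)]
```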
Inadequacy of GROUP BY • Cannot perform aggregation over computed categories (e.g. histograms) directly • Example: suppose suppliers and locations together define a 'factory', and a Factory() function maps (supplier, Loc_ID) pairs to a factory. It would be nice to write the query for the maximum daily sales of each factory as:

SELECT Date, factory, MAX(Sales)
FROM Sales
GROUP BY Date, Factory(Supplier, Loc_ID) AS factory;

• Instead we have to compute the table indirectly and then perform the aggregation (a nested query):

SELECT Date, factory, MAX(Sales)
FROM (SELECT Date, Factory(Supplier, Loc_ID) AS factory, Sales
      FROM Sales) AS Foo
GROUP BY Date, factory;
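The nested-query workaround can be run directly. Below is a sketch in Python/sqlite3 where the hypothetical Factory() mapping is replaced by a simple string concatenation, and the table contents are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (supplier TEXT, loc_id INTEGER, day TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("S1", 4, "d1", 10), ("S1", 4, "d1", 25), ("S2", 3, "d1", 7)])

# Stand-in for Factory(supplier, loc_id): concatenate the two keys.
# The derived table computes the category; the outer query groups on it.
rows = conn.execute("""
    SELECT day, factory, MAX(qty)
    FROM (SELECT day, supplier || '-' || loc_id AS factory, qty
          FROM sales) AS foo
    GROUP BY day, factory ORDER BY factory
""").fetchall()
print(rows)  # [('d1', 'S1-4', 25), ('d1', 'S2-3', 7)]
```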
Inadequacy of GROUP BY • In analyzing data we often need to "roll up" (example: go from sales/city to sales/state) or "drill down" (vice versa) • Problem with GROUP BY for roll-ups and drill-downs: we have to store the subtotal at each level of the aggregation
Rollup Representations • Need 2N columns (N grouping columns plus N subtotal columns) to represent a rollup of N elements! • Introduce the 'ALL' value • Problem: need to union many GROUP BYs
Rollup Representations • To arrive at the previous table we would have to do the following complex query:

SELECT Model, ALL, ALL, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model
UNION
SELECT Model, Year, ALL, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year
UNION
SELECT Model, Year, Color, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year, Color;

• Note that aggregation across years (the Model, ALL, Color rows) is not included. To add that in we would have to union in:

SELECT Model, ALL, Color, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Color;

• The query is really complex: for a cross tab in 6 dimensions we need 2^6 = 64 unions of GROUP BY statements! This is the motivation for a new operator.
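The union-of-GROUP-BYs construction can be generated mechanically. A sketch in Python/sqlite3 (table contents are made up) that emulates the full cube this way, using the string 'ALL' for the aggregated-away columns:

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, year INTEGER, color TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("Chevy", 1994, "black", 50), ("Chevy", 1994, "white", 40),
    ("Chevy", 1995, "black", 85)])

cols = ["model", "year", "color"]
parts = []
for k in range(len(cols) + 1):
    for subset in combinations(cols, k):  # each subset = one GROUP BY
        select = ", ".join(c if c in subset else "'ALL'" for c in cols)
        group = ("GROUP BY " + ", ".join(subset)) if subset else ""
        parts.append(f"SELECT {select}, SUM(units) FROM sales {group}")
cube = conn.execute(" UNION ALL ".join(parts)).fetchall()
print(len(cube))  # 16 rows across all 2^3 = 8 groupings
```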
The CUBE Operator • The CUBE operator builds a table with all possible aggregated values constructed by grouping: it is the generalization of a simple aggregate function to N dimensions • The 0-D cube is the aggregate of all values • Ex: total sales over all models, years and colors • The 1-D cube is a GROUP BY on one dimension • Ex: GROUP BY Color • The 2-D cube is a cross tab • The 3-D cube is an intersection of three 2-D cubes • This generalizes to N dimensions
Intuition behind the CUBE • (figure: the 0-D, 1-D, 2-D and 3-D cubes)
Creating a CUBE • The syntax for a cube is: SELECT <select-list> FROM <relation> GROUP BY CUBE <select-list> • To create the global cube we need to generate the power set of the aggregation columns:

{Model, Year, Color}
{Model, Year}  {Model, Color}  {Year, Color}
{Model}  {Year}  {Color}
{ }
Creating a CUBE • The CUBE operator first aggregates over the <select-list> just as GROUP BY would. It then unions in the 2^N − 1 super-aggregates shown in the previous picture. • 'ALL' is substituted for the aggregated column; every domain now has an extra 'ALL' value. The cube has (C1+1) × (C2+1) × … × (CN+1) values, where Ci is the size of each domain. • The idea is that since each Ci is usually large, the CUBE is only a little bigger than the corresponding GROUP BY but provides a lot more information (not really true; see Conclusion)
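The size claim can be sanity-checked with the formula above. The domain sizes below are made up for illustration:

```python
from math import prod

# Cube size prod(Ci + 1) versus the prod(Ci) cells of the finest
# GROUP BY, for three illustrative domain sizes C1, C2, C3
domains = [10, 12, 1000]
cube_cells = prod(c + 1 for c in domains)      # (C1+1)(C2+1)(C3+1)
groupby_cells = prod(domains)                  # C1 * C2 * C3
print(cube_cells, groupby_cells)               # 143143 120000
print(round(cube_cells / groupby_cells, 2))    # 1.19: ~19% larger here
```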
ROLLUP Operator • If the user only needs a roll-up or drill-down report, computing the entire cube is wasteful • The ROLLUP operator is used to compute only the super-aggregates along the grouping order • In the previous example, if ROLLUP were used in place of CUBE only the following would be returned: • the regular aggregation rows that GROUP BY would produce without ROLLUP • first-level subtotals aggregating across Color for each combination of Model and Year • second-level subtotals aggregating across Year and Color for each Model value • a grand total row
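A sketch contrasting ROLLUP with CUBE: only the prefix groupings (Model, Year, Color), (Model, Year), (Model) and () are computed, again emulated in Python/sqlite3 on made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, year INTEGER, color TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("Chevy", 1994, "black", 50), ("Chevy", 1994, "white", 40),
    ("Chevy", 1995, "black", 85)])

cols = ["model", "year", "color"]
rollup = []
for n in range(len(cols), -1, -1):   # prefixes of decreasing length
    prefix = cols[:n]
    select = ", ".join(c if c in prefix else "'ALL'" for c in cols)
    group = ("GROUP BY " + ", ".join(prefix)) if prefix else ""
    rollup += conn.execute(
        f"SELECT {select}, SUM(units) FROM sales {group}").fetchall()
print(len(rollup))  # 3 + 2 + 1 + 1 = 7 rows, versus 16 for the full cube
```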
Integrating CUBE into SQL • GROUP BY <select-list> | ROLLUP <select-list> | CUBE <select-list> • Could support histograms by extending GROUP BY • Use of 'ALL' to indicate 'aggregate' could cause problems • ALL is set-valued; other values are not • Could use NULL instead of ALL • Use a GROUPING() function to distinguish between NULL and ALL
Computing Cubes and Rollups • Aggregation Computation Techniques • Compute aggregates at lowest possible system level to minimize processing cost • Use hashing to organize aggregation columns in memory • If aggregates don’t fit in memory use hybrid-hashing to organize data by value • If data is spread over several disks exploit parallelism for each partition and then combine results
Computing Cubes and Rollups • Simplest algorithm: the 2^N-algorithm • Allocate a handle for every cell in the cube • When a tuple is passed in, a function Iter() is called on the handle and the tuple value; each tuple contributes to 2^N cells, so Iter() is called 2^N times per tuple • After all tuples have been processed, a function Final() is called to complete the aggregation • If the base table has T tuples, Iter() is called T × 2^N times; we can improve!
Types of Aggregate Functions • Distributive Functions: F({Xij}) = G({F({Xij : i = 1,…,I}) : j = 1,…,J}), i.e. the aggregate of the whole can be computed from aggregates of the parts • Example: MIN(), COUNT(), SUM(), MAX() • Algorithm idea: the (N−1)-dimensional aggregates can be computed by aggregating away the Nth dimension • Example: cross tab with SUM() as the distributive function: in the 2-D cube the 1st dimension is the sum of the 2nd-dimension rows, and the 0th dimension is the sum of the 1st dimension
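A minimal sketch of the distributive property: the 1-D and 0-D cells of a cross tab computed from the 2-D cells alone, never touching base data (the cross-tab values are made up):

```python
# 2-D cross-tab cells: (year, color) -> SUM(units), illustrative data
cross_tab = {
    (1994, "black"): 50, (1994, "white"): 40,
    (1995, "black"): 85, (1995, "white"): 15,
}

# 1-D cells: aggregate away the color dimension by summing its cells
by_year = {}
for (year, color), s in cross_tab.items():
    by_year[year] = by_year.get(year, 0) + s

# 0-D cell: aggregate away the year dimension from the 1-D cells
grand_total = sum(by_year.values())
print(by_year, grand_total)  # {1994: 90, 1995: 100} 190
```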
Types of Aggregation Functions • Algebraic Functions: F(X) = H(G(X)), where G returns a fixed-size intermediate result • Example: AVG(), standard deviation; for AVG(), G would compute the sum and the count, and H would be sum/count • Algorithm idea: a handle is maintained for each cube cell and is passed on to the (N−1)-dimensional super-aggregates. A function Iter(&handle, &handle) combines the intermediate aggregates; at the end Final() is called. • Example: cross tab where AVG() is the algebraic function
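A minimal sketch of the algebraic case for AVG(), with (sum, count) pairs standing in for the handle mechanism described above (cell values are made up):

```python
def iter_merge(h1, h2):
    """Combine two (sum, count) handles into a super-aggregate handle."""
    return (h1[0] + h2[0], h1[1] + h2[1])

def final(h):
    """Final(): turn a (sum, count) handle into the average."""
    return h[0] / h[1]

# Handles for two sub-cells of a super-aggregate (illustrative data)
cell_a = (30, 3)   # 3 tuples summing to 30 -> avg 10
cell_b = (50, 2)   # 2 tuples summing to 50 -> avg 25
super_cell = iter_merge(cell_a, cell_b)
print(final(super_cell))  # 16.0, not the naive average of averages (17.5)
```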
Aggregation Functions • Holistic Functions: functions whose sub-aggregates cannot be described by a fixed storage bound • Example: MEDIAN(), RANK() • No known algorithm better than the 2^N-algorithm
Comments & Conclusion • The CUBE operator is useful in multi-dimensional data analysis and is popular in OLAP tools • We can now answer queries that were previously impossible or very complicated in SQL (using GROUP BY) • The claim that the CUBE is only slightly larger than the corresponding GROUP BY for large domains may not always be valid: often a large number of data values are null, so the CUBE is much larger than the GROUP BY • The next logical issue is maintaining cubes • If the cube is materialized we need to consider updates: for some functions only a small scan is needed, for others re-computation is required • Example: MAX() under SELECT, INSERT and DELETE
Implementing Data Cubes Efficiently Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman
Introduction • Decision Support Systems (DSS) rely a great deal on aggregations over data in data warehouses • These queries can be very complex and can take a substantial amount of time • DSS need queries to be answered very quickly • Idea: we can materialize (pre-compute) certain queries • Pre-compute frequently asked queries • Pre-compute infrequent queries so that the time saved can be used to compute other queries
Materializing the Cube • Data warehouses represent data in a multi-dimensional cube to answer aggregation queries and enable analysis in multiple dimensions • Computing every cell on request increases the response time, making it unacceptable for DSS • Need to materialize the cube. Options: • Materialize the entire cube: pre-computing every cell gives excellent response times, but the space consumption makes it infeasible • Materialize parts of the cube: this is the problem explored in this paper, picking the optimal (or close to optimal) cells to materialize
Dependencies in the Cube • Cells in the data cube that have the value "ALL" can be computed from other cells in the cube. Those that don't are independent: we need to query the raw data to compute them • We can materialize the dependent cells ('views') • Interesting questions: • How many views should we materialize for good performance? • Which views should we materialize to minimize average query cost?
Example

{Part, Supplier, Customer}
{Part, Customer}  {Part, Supplier}  {Supplier, Customer}
{Part}  {Supplier}  {Customer}
{ }

• We must materialize {Part, Supplier, Customer} because we cannot derive it from any other view. Say its cost is 6M rows • Suppose we want to answer a query on (Supplier, Customer). If the cost of this view is also 6M rows, there is no point in materializing it since we can derive it from the {Part, Supplier, Customer} view. On the other hand, say the {Part} view costs 0.2M rows and we need to query parts: there may be good reason to materialize it, because answering from the {Part, Supplier, Customer} view would mean processing 6M rows.
Lattice Structure & Hierarchies • We say Query1 "is dependent on" Query2 if the result set of Query2 can be used to answer Query1 • For example, (Part) "is dependent on" (Part, Customer) • We can form a lattice diagram from these dependencies • In addition, each dimension of the data cube can have a set of associated values forming a hierarchy • Example: Day → Week → Month → Year → none • Week is "dependent" on Day
Combining Dimension Hierarchies with Query Dependencies • We need to combine these two concepts when deriving query dependencies: the result of a query can contain any element of the hierarchy for each dimension • Given two groupings T1 and T2, T1 is dependent on T2 iff every element of T1 is at or below the corresponding element of T2 in its dimension hierarchy.
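This dependency test can be sketched directly. The hierarchy orderings below are illustrative and treat time as a single chain for simplicity:

```python
# Position of each level in its hierarchy, 0 = finest, larger = coarser
TIME = {"day": 0, "week": 1, "month": 2, "year": 3, "none": 4}
GEO = {"city": 0, "state": 1, "none": 2}

def depends_on(t1, t2):
    """True iff grouping t1 can be answered from the result of t2,
    i.e. every component of t1 is at least as coarse as t2's."""
    (time1, geo1), (time2, geo2) = t1, t2
    return TIME[time1] >= TIME[time2] and GEO[geo1] >= GEO[geo2]

print(depends_on(("month", "state"), ("day", "city")))  # True
print(depends_on(("day", "state"), ("month", "city")))  # False: day is finer
```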
Cost Model • If view V has been materialized and V can be used to answer a query V', then the cost of answering V' is assumed to be the number of rows in V • In reality more or less than one full scan of V may be required: V may be clustered on a particular attribute or have an index on one, eliminating the need for a full scan • This uniform cost model is adopted to simplify analysis • Experiments showed an approximately linear relationship between view size and query running time
Optimizing Data Cube Lattices • There is a space-time tradeoff that we need to optimize • Problem: • minimize the time taken to evaluate queries • subject to materializing k views (this can easily be converted to a space constraint, and the algorithm is the same) • This problem is NP-complete • A greedy algorithm is used to approximate the optimal solution
Algorithm • Let C(v) be the cost (size) of a view v • If S is the set of views chosen so far, the benefit of adding a view v to S is B(v, S): • For each view w that depends on v, define Bw as follows: let u be the least-cost view in S that w depends on; if C(v) < C(u) then Bw = C(u) − C(v), otherwise Bw = 0 • B(v, S) = Σ Bw over all views w dependent on v • Intuitively, B(v, S) measures the total saving that other views gain if v is included in S
Algorithm • S = {top view} • Repeat k times: select the view v not in S that maximizes B(v, S) and add it to S • S is the resulting set of 'optimal' views
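A sketch of the greedy selection on a small lattice. The view sizes loosely follow the slides' 6M and 0.2M figures but are otherwise made up; views are frozensets over p = part, s = supplier, c = customer:

```python
# Illustrative view sizes (rows); frozenset() is the empty grouping
SIZE = {
    frozenset("psc"): 6_000_000, frozenset("pc"): 6_000_000,
    frozenset("sc"): 6_000_000, frozenset("ps"): 800_000,
    frozenset("p"): 200_000, frozenset("c"): 100_000,
    frozenset("s"): 10_000, frozenset(): 1,
}
VIEWS = list(SIZE)

def answer_cost(w, S):
    # Cost of answering w = size of the cheapest materialized view
    # that w depends on (w's attributes are a subset of that view's)
    return min(SIZE[u] for u in S if w <= u)

def benefit(v, S):
    # B(v, S): total saving across all views w computable from v
    return sum(max(0, answer_cost(w, S) - SIZE[v])
               for w in VIEWS if w <= v)

S = {frozenset("psc")}        # the top view must always be materialized
for _ in range(3):            # greedily add k = 3 more views
    v = max((u for u in VIEWS if u not in S), key=lambda u: benefit(u, S))
    S.add(v)
print(sorted("".join(sorted(v)) for v in S))  # ['c', 'cps', 'ps', 's']
```

With these sizes the greedy picks {ps} first (it cheapens four views at once), then {c}, then {s}.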
Performance of Greedy Algorithm • Greedy guarantees at least (e − 1)/e ≈ 63% of the benefit of the optimal choice • The greedy algorithm is exactly optimal when the benefit of the first picked view is much greater than that of the subsequent views • If the benefits of all the picked views are equal, the greedy algorithm is also optimal
Performance of Greedy Algorithm & Extensions to the Algorithm • The views in the lattice should have probabilities associated with them, so that the algorithm can use these probabilities as weights • We may have to restrict space, not the number of views • Then we need to consider benefit per unit space • Problem: a small view may have a high benefit/space ratio compared to a large view, and the large view may be excluded for lack of space, affecting the views picked later. Treat this as a boundary case and ignore it!
Hypercube Lattice • In a hypercube, typically, grouping occurs on n attributes and aggregation on the (n+1)th attribute • Example:

SELECT Model, Year, Color, SUM(Sales) AS Sales
FROM Sales
GROUP BY CUBE Model, Year, Color

• We can exploit this regularity in selecting views • Let r be the size of each domain and m the number of filled cells in the top-most view. If we group on i attributes, then: • if r^i ≥ m, use m as the size estimate for that view • if r^i < m, use r^i as the size estimate
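The size estimate above amounts to min(r^i, m). A minimal sketch with made-up values of r and m:

```python
def view_size(i, r, m):
    """Estimated size of a hypercube view grouping on i attributes:
    at most r**i distinct cells, capped by the m filled top-view cells."""
    return min(r ** i, m)

r, m = 10, 5000   # illustrative: domain size 10, 5000 filled cells
print([view_size(i, r, m) for i in range(5)])  # [1, 10, 100, 1000, 5000]
```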
Time and Space Optimality • Space is minimized when only the top view is materialized • Then each query takes time m, and the total cost over all 2^n views is m × 2^n • Time is minimized if we materialize every view
Conclusion • Materializing parts of the cube is very important for reducing response times in DSS • Pick the 'optimal' views that strike a balance between the space consumed by materialization and the response time saved by not computing on the fly • The greedy algorithm gives a close approximation to the optimal solution • Future work: dynamic materialization strategies