Online Analytical Processing (OLAP)

Online Analytical Processing (OLAP) An Overview Kian Win Ong, Nicola Onose Mar 3rd 2006

Overview • Motivation • Multi-Dimensional Data Model • Research Areas • Optimizations • Materializing multiple aggregates simultaneously • Materialization strategy

Motivation • Aggregation, summarization and exploration • Of historical data • To help management make informed decisions

Different Goal • Aggregation, summarization and exploration • Of historical data • To help management make informed decisions • Find the total sales for each product and month • Find the percentage change in the total monthly sales for each product

Different Requirements • OLTP – On-Line Transaction Processing • OLAP – On-Line Analytical Processing

Query Language Extensions • In the real world, data is stored in RDBs.

Query Language Extensions • In the real world, data is stored in RDBs. • How to express N-dimensional problems using 2D tables?

Query Language Extensions • In the real world, data is stored in RDBs. • How to express N-dimensional problems using 2D tables? • Can we combine OLAP and SQL queries? • Jim Gray et al:Data Cube: A Relational Aggregation Operator1997

Query Language Extensions • histograms Problems with GROUP BY SELECT sales, prod_name, population FROM sales_history GROUP BY Population(City, State) as population

Query Language Extensions • histograms • rollup/drilldown Problems with GROUP BY non relational representation

Query Language Extensions • histograms • rollup/drilldown Problems with GROUP BY relational, but the rollup is huge

Query Language Extensions • histograms • rollup/drilldown • cross tabulations Problems with GROUP BY 2-D aggregation is more compact and more natural:

Query Language Extensions Reducing the number of attributes

Query Language Extensions • introduce a new value: “ALL” Reducing the number of attributes “ALL” = the set over which we aggregate

Query Language Extensions • GROUP BY (1D) General approach

Query Language Extensions • GROUP BY (1D) • Cross Tab (2D) General approach the corresponding relation:

Query Language Extensions • GROUP BY (1D) • Cross Tab (2D) • Cube (3D) General approach By cat. andname (does it make sense?) By cat.andmonth By monthand name

Query Language Extensions • GROUP BY (1D) • Cross Tab (2D) • Cube (3D) • Any hypercube can be represented as a relation! General approach

Query Language Extensions • a CUBE relation, with aggregation function f(.) (x1, x2, …, xn-1, xn, f() ) …………………………… (x1, xn-1, …, xn, ALL, f() ) …………………………… (x1, x2, …, ALL, xn, f() ) …………………………… • after ROLLUP , reduce to a linear # of tuples (x1, x2, …, xn-1, xn, f() ) ………………………………… (x1, xn-1, …, xn, ALL, f() ) ………………………………… (x1, x2, …, ALL, ALL, f() ) ………………………………… (ALL, ALL, …, ALL, ALL, f() ) General approach

Query Language Extensions The new operators: CUBE, ROLLUP SELECT prod_category, prod_name, month, SUM(sales) AS sales FROM sales_history GROUP BY CUBE prod_category, prod_name, month Idea: Group by the CUBE list. Union the aggregates. Introduce the ALL values.

Query Language Extensions The new operators: CUBE, ROLLUP SELECT prod_category, month, day, state, prod_name, SUM(sales) AS sales FROM sales_history GROUP BY prod_category ROLLUP month, day CUBE city, state

Research Areas • SQL language extensions • Server architecture • Parallel processing • Index structures • Materialized views

Simultaneous Multi-Dimensional • Optimization to calculate multiple aggregates simultaneously • Useful for materialization of aggregate views Aggregates • Y. Zhao, P. Deshpande, J. NaughtonAn Array-Based Algorithm for Simultaneous Multidimensional AggregatesSIGMOD 1997

Multiple Aggregates Aggregate on…

Multiple Aggregates Aggregate on… • Sales by Product / City • Sales by Product / Month • Sales by Month / City • Sales by Product • Sales by City • Sales by Month • Sales (Total) Is it possible to • make a single pass over the transactional table? • calculate multiple aggregates simultaneously?

Chunking 64 Partition transactional data into array chunks 14 15 16 13 12 9 10 11 42 Dimension B City 36 5 8 7 6 Array Chunk 20 1 2 3 4 1 Dimension C 12 Month Dimension A Product

Naïve Algorithm 64 14 15 16 13 Dimension A 12 9 10 11 42 Dimension B 36 5 8 7 6 20 4 1 2 3 Dimension C Pivot on ABaggregate on all C Dimension A

Naïve Algorithm 64 13 14 15 16 9 12 10 11 42 Dimension B 36 5 8 7 6 20 1 2 3 4 Dimension C Pivot on ABaggregate on all C Dimension A Pivot on ACaggregate on all B Pivot on BCaggregate on all A

Single Pass Algorithm 64 AB 14 15 16 13 AC 12 9 10 11 42 B 36 5 8 7 6 20 1 2 3 4 Dimension C BC Dimension A Make a single pass over data

Single Pass Algorithm 64 AB 14 15 16 13 AC 12 9 10 11 42 B 36 5 8 7 6 20 1 2 3 4 Dimension C BC Dimension A Simultaneously maintain multiple aggregates

Single Pass Algorithm 64 AB 14 15 16 13 AC 12 9 10 11 42 B 36 5 8 7 6 20 1 2 3 4 Dimension C BC Dimension A Write out completed aggregates

Single Pass Algorithm 64 AB 14 15 16 13 AC 12 9 10 11 42 B 36 5 8 7 6 20 1 2 3 4 Dimension C BC Dimension A Only allocate memory that is necessary

Single Pass Algorithm AB Array Chunk ABC 4 x 4 x 4 AC AB 16 x 4 x 4 BC 4 x 4 AC 4 x 4 x 4 B 4 C 4 A 4 x 4 BC Minimummemory spanning tree all 1

Multi Pass Algorithm Recursively aggregate ABCD ABC ACD ABD BCD AB AC BC BD CD AD D A C B all

Implementing Data Cubes • Biggest problem for data warehouses: the size • Space / time trade-off:accelerate queries by materializing the cube

Implementing Data Cubes • Biggest problem for data warehouses: the size • Space / time trade-off:accelerate queries by materializing the cube • The size of the relations gets even bigger!

Implementing Data Cubes • Biggest problem for data warehouses: the size • Space / time trade-off:accelerate queries by materializing the cube • The size of the relations gets even bigger! • M(ultidimensional)OLAP: good query performance, but bad scalability • R(elational)OLAP: very scalable; query performance improved by materializing (partial) results

Implementing Data Cubes • V. Harinarayan, A. Rajaraman, J.D. Ullman:Implementing Data Cubes EfficientlySIGMOD 1996Presents a materialization strategy for the cells of the cube.

Implementing Data Cubes

Implementing Data Cubes • casted as particular case of the rewriting using views problem • what cells to materialize  what SQL views to materialize

Implementing Data Cubes ptc pc pt tc t p c none • casted as particular case of the rewriting using views problem • what cells to materialize  what SQL views to materialize p = product t = time c = city • simple idea: Q1 depends on Q2(Q1≤Q2) if Q1 can be fully answered using the results of Q2

Implementing Data Cubes city product_name day state product_category week month none year none none ptc pt tc pc … pwc pmc pts pcatt … … … ps pyc … • but cube dimensions are usually hierarchical X X p = product t = time c = city • direct-product lattice

Implementing Data Cubes • Def. cost of answering Q = # of rows in the table of ancestor(Q) • It can be estimated w/o materializing the views • Assume that all queries are identical to some view in the lattice

Implementing Data Cubes • For a set S and a view vB(v,S) = ∑w≤v, (w not in S) max{cost(w)-cost(v), 0} • Greedy algorithm for selecting k views to materialize from the lattice: • S := {top view} • For i=1 to k, add v to S s.t. B(v,S) is maximized • The greedy algorithm is an (e-1)/e ≈ 0.63 approx. of the optimum.

Discussion • Questions from the audience…

Online Analytical Processing (OLAP)