Parallel Data Cube Mining: OLAP Views for Decision Support

Parallel Data Cube
• Data mining
• OLAP (on-line analytical processing)
• The cube / group-by operator in SQL

Data Warehousing for Decision Support
• Operational data is collected into a data warehouse (DW)
• The DW is used to support multi-dimensional views
• Views form the basis of OLAP processing
• Our focus: the OLAP server

Multi-dimensional views
• Collection of feature attributes
• Aggregate along one or more measure attributes
• Reduce the granularity by collapsing dimensions

Data Cube Generation
• Proposed by Gray et al. (Microsoft) in 1995
• Exploits the relationship between cuboids to compute all 2^d cuboids of a d-dimensional data set
• In OLAP, views are typically pre-computed to improve query response time
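To make the 2^d count concrete, here is a minimal Python sketch that enumerates every cuboid and computes each one directly from the raw rows. It is the naive baseline that the cube algorithms on the following slides improve on, not a method from the talk. The function names, the sample table and the choice of sum as the aggregate are all assumptions made for illustration.

```python
from itertools import combinations
from collections import defaultdict

def all_cuboids(dims):
    """Return every subset of the dimension list: 2^d group-bys."""
    return [combo for r in range(len(dims), -1, -1)
            for combo in combinations(dims, r)]

def compute_cuboid(rows, group_by, measure):
    """Sum the measure attribute over the chosen dimensions."""
    out = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in group_by)
        out[key] += row[measure]
    return dict(out)

def naive_cube(rows, dims, measure):
    """Compute each cuboid independently from the raw data (no sharing)."""
    return {g: compute_cuboid(rows, g, measure) for g in all_cuboids(dims)}

# Hypothetical sample data: three dimensions, one measure.
rows = [
    {"store": "S1", "item": "pen", "year": 2001, "sales": 5},
    {"store": "S1", "item": "ink", "year": 2001, "sales": 3},
    {"store": "S2", "item": "pen", "year": 2002, "sales": 7},
]
cube = naive_cube(rows, ["store", "item", "year"], "sales")
print(len(cube))          # 8 cuboids for d = 3
print(cube[("store",)])   # {('S1',): 8, ('S2',): 7}
print(cube[()])           # {(): 15}
```

Every cuboid here costs a full scan of the input, which is exactly the redundancy that exploiting the relationship between cuboids avoids.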

Sequential Solutions
• Top Down Cube - compute high dimension views first and exploit shared dimensions (Pipesort, PipeHash)
• Bottom Up Cube - minimizes external memory sorting by partitioning first on single attributes
• ArrayCube

Sequential Solutions
• ROLAP - relational data representation; harder to build and query; smaller storage; no translation from/to relational model
• MOLAP - array representation; easy to build and query; large storage; needs translation from/to relational model

Top Down Cube (Pipesort)
• Construct the data cube lattice
• Estimate the edge costs
• Find a least cost spanning tree
• Compute the views by following the “pipes”
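The first three steps above can be sketched in Python as follows. This is a deliberately simplified illustration, not Pipesort itself: it uses a single edge cost (the estimated size of the parent) and picks the cheapest parent for each cuboid, which yields a spanning tree of the lattice. The size estimate, the cardinalities and all function names are assumptions.

```python
from itertools import combinations

def lattice_parents(dims):
    """Map each cuboid to its parents (cuboids with exactly one more dimension)."""
    cuboids = [frozenset(c) for r in range(len(dims) + 1)
               for c in combinations(dims, r)]
    return {c: [p for p in cuboids if len(p) == len(c) + 1 and c < p]
            for c in cuboids}

def estimated_size(cuboid, cardinality, n_rows):
    """Crude estimate: product of dimension cardinalities, capped at the row count."""
    size = 1
    for d in cuboid:
        size *= cardinality[d]
    return min(size, n_rows)

def smallest_parent_tree(dims, cardinality, n_rows):
    """Pick, for every cuboid, the parent with the lowest estimated cost."""
    tree = {}
    for cuboid, candidates in lattice_parents(dims).items():
        if candidates:  # the top cuboid has no parent; it is built from raw data
            tree[cuboid] = min(
                candidates,
                key=lambda p: (estimated_size(p, cardinality, n_rows), sorted(p)),
            )
    return tree

# Hypothetical cardinalities for three dimensions A, B, C.
card = {"A": 100, "B": 10, "C": 2}
tree = smallest_parent_tree(["A", "B", "C"], card, n_rows=1000)
for child, parent in sorted(tree.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(child), "<-", sorted(parent))
```

With these numbers the cuboid A is computed from AC (estimated 200 rows) rather than AB (1000 rows), and the empty group-by is computed from C.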

Optimizations
• Share-sorts - sharing sorting cost across multiple group-bys.
• Smallest parent - computing a cuboid from the smallest previously computed parent.
• Cache results - reduce I/O by caching (in memory) parent views from which other cuboids are computed.
• Amortize disk-scans - compute as many child views as possible when scanning each parent.
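As a rough illustration of how one sort and one scan can serve several group-bys, the Python sketch below computes every prefix group-by (for example ABC, AB, A and the grand total) in a single pass over rows sorted on A, B, C. It keeps only one open group per level, so memory use does not grow with the input. The function name, the tuple layout and the use of sum are assumptions.

```python
def pipeline_scan(sorted_rows, n_dims):
    """Rows are tuples (d1, ..., dn, measure), sorted on the dimension columns.

    Returns out[k] = list of (key, total) for the group-by on the first k dims.
    """
    out = [[] for _ in range(n_dims + 1)]
    current = [None] * (n_dims + 1)   # open group key at each level
    totals = [0] * (n_dims + 1)       # running sum at each level
    for row in sorted_rows:
        key, value = row[:n_dims], row[n_dims]
        for k in range(n_dims + 1):
            prefix = key[:k]
            if current[k] is not None and prefix != current[k]:
                out[k].append((current[k], totals[k]))  # prefix changed: flush group
                totals[k] = 0
            current[k] = prefix
            totals[k] += value
    for k in range(n_dims + 1):       # flush the last open group at each level
        if current[k] is not None:
            out[k].append((current[k], totals[k]))
    return out

# Hypothetical input, sorted on (A, B, C).
rows = sorted([
    ("a1", "b1", "c1", 1),
    ("a1", "b1", "c2", 2),
    ("a1", "b2", "c1", 3),
    ("a2", "b1", "c1", 4),
])
views = pipeline_scan(rows, 3)
print(views[2])  # [(('a1', 'b1'), 3), (('a1', 'b2'), 3), (('a2', 'b1'), 4)]
print(views[1])  # [(('a1',), 6), (('a2',), 4)]
print(views[0])  # [((), 10)]
```

Group-bys that are not a prefix of the sort order (BC, for instance) need a different sort, which is why the choice of sort orders matters for the overall plan.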

Bottom Up Cube

Bottom Up Cube
• Partition large view into memory-sized units
• Perform sorting operations in memory
• May significantly reduce external memory processing
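The partition-first idea can be sketched as a short recursive Python function: aggregate the current partition, then split it on one attribute at a time and recurse into each part. Partitions shrink at every level, so deeper levels work on data that fits in memory. This is an in-memory illustration only; the function name, the cell-key layout and the sample data are assumptions.

```python
from collections import defaultdict

def buc(rows, dims, measure, start=0, prefix=(), out=None):
    """Bottom-up cube sketch. Keys of `out` are tuples of (dimension, value) pairs."""
    if out is None:
        out = {}
    out[prefix] = sum(r[measure] for r in rows)    # aggregate the current partition
    for i in range(start, len(dims)):
        parts = defaultdict(list)
        for r in rows:                             # partition on a single attribute
            parts[r[dims[i]]].append(r)
        for value, part in parts.items():          # recurse into each smaller partition
            buc(part, dims, measure, i + 1, prefix + ((dims[i], value),), out)
    return out

# Hypothetical sample data.
rows = [
    {"store": "S1", "item": "pen", "year": 2001, "sales": 5},
    {"store": "S1", "item": "ink", "year": 2001, "sales": 3},
    {"store": "S2", "item": "pen", "year": 2002, "sales": 7},
]
cells = buc(rows, ["store", "item", "year"], "sales")
print(cells[()])                                      # 15
print(cells[(("store", "S1"),)])                      # 8
print(cells[(("item", "pen"),)])                      # 12
print(cells[(("store", "S1"), ("item", "pen"))])      # 5
```

The recursion visits the low-dimensional group-bys first (the grand total, then store, then store and item), which is the opposite order to the top-down methods.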

Our Results
• Parallel top-down ROLAP cube construction for shared disks (Distributed and Parallel Databases, 2002)
• Parallel top-down ROLAP cube construction for distributed disks (IPDPS 2002)
• Parallel bottom-up ROLAP cube construction for shared and distributed disks (Distributed and Parallel Databases, 2002)
• Parallel ROLAP cube indexing for distributed disks (CCGrid 2003)

Parallel top-down ROLAP cube
Our approach:
• Partition the load in advance and assign cuboids to individual processors
• Local computation exploits existing optimized sequential algorithms (ROLAP)
• Communication is reduced to a single phase in which work lists are distributed

Parallel top-down ROLAP cube
• Cut the process tree into p “equal weight” sub-trees
• Each processor independently generates cuboids from its own sub-tree
• Load balance/stripe the output

Tree Partitioning
• Optimal tree partitioning is NP-complete
• Min-max tree k-partitioning: Given a tree T with n vertices and a positive weight assigned to each vertex, delete k edges in the tree to obtain k+1 connected components T_1, T_2, ..., T_{k+1} such that the largest total weight of a resulting sub-tree is minimized.
• O(n) time - Frederickson 1990
• O(Rk(k + log d) + n) time - Becker, Perl and Schach ’82
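For intuition, the min-max objective can be computed with a simple bound-and-test approach, sketched below in Python. This is neither Frederickson's linear-time algorithm nor the Becker, Perl and Schach algorithm cited above. It assumes integer vertex weights, binary-searches the bound, and for each bound greedily cuts off the heaviest child sub-trees bottom-up to count how many cuts are needed. The tree encoding (a `children` dict), the function names and the example weights are assumptions.

```python
def min_cuts_for_bound(children, weight, root, bound):
    """Fewest edge cuts so that every component weighs <= bound.

    Assumes bound >= the largest single vertex weight.
    """
    order, stack = [], [root]
    while stack:                       # parents are listed before their children
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    cuts, residual = 0, {}
    for v in reversed(order):          # so reversed order visits children first
        loads = sorted((residual[c] for c in children.get(v, [])), reverse=True)
        total = weight[v] + sum(loads)
        for load in loads:             # cut off the heaviest child sub-trees first
            if total <= bound:
                break
            total -= load
            cuts += 1
        residual[v] = total            # weight still attached to v's component
    return cuts

def min_max_partition(children, weight, root, k):
    """Smallest achievable maximum component weight using at most k cuts."""
    lo, hi = max(weight.values()), sum(weight.values())
    while lo < hi:
        mid = (lo + hi) // 2
        if min_cuts_for_bound(children, weight, root, mid) <= k:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical process tree: vertex weights stand in for estimated cuboid costs.
children = {0: [1, 2], 1: [3, 4]}
weight = {0: 4, 1: 3, 2: 6, 3: 5, 4: 2}
best = min_max_partition(children, weight, root=0, k=2)   # p = 3 processors
print(best)  # 9
```

In this example the best split with two cuts gives components {2}, {3} and {0, 1, 4} with weights 6, 5 and 9, so no processor receives more than 9 units of estimated work.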

Over Sampling

Time vs. #Proc

For more information ... http://cgm.dehne.net
