PrefixCube: Prefix-sharing Condensed Data Cube

Jianlin Feng, Qiong Fang, Hulin Ding
Huazhong Univ. of Sci. & Tech.
fengjl@mail.hust.edu.cn
Nov 12, 2004

Outline
• Introduction
• Related Work
• ODM: Ordered Datacube Model
• BST-Condensed Cube
• Prefix-sharing Condensed Cube
• Comparisons
• Conclusions

Introduction
• Data Cube (ICDE’96)
• N-dimensional cube(A1, A2, …, AN)
• 2^N cuboids, i.e. GROUP-BYs
• The Huge Size Problem
• When the base relation R is sparse, the size of a cuboid is possibly close to the size of R.
• The I/O cost of merely storing the cube result tuples becomes dominant.
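To make the 2^N count concrete, here is a minimal Python sketch that enumerates every GROUP-BY of a cube. The function name `cuboids` and the dimension names are illustrative assumptions, not part of the slides.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every subset of `dimensions`; each subset is one GROUP-BY (cuboid)."""
    dims = list(dimensions)
    return [combo
            for k in range(len(dims) + 1)
            for combo in combinations(dims, k)]

all_cuboids = cuboids(["A", "B", "C"])
print(len(all_cuboids))  # 8, i.e. 2^3
```

The empty tuple `()` stands for the grand total, and `("A", "B", "C")` for the base cuboid.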

Related Work
• Condensed Cube (ICDE’02)
• Dwarf (SIGMOD’02)
• Quotient Cube (VLDB’02)
• QC-Tree (SIGMOD’03)
• Basic idea: remove redundancies existing among cube tuples.
• prefix redundancy
• suffix redundancy

Prefix Redundancy
• Given an example cube(A, B, C)
• Each value of dimension A occurs in 4 cuboids: cuboid(A), (AB), (AC) and (ABC)
• It possibly occurs many times in each of these cuboids except cuboid(A)
• Hence there is both inter-cuboid and intra-cuboid prefix redundancy

Suffix Redundancy
• Occurs when cube tuples belonging to different cuboids are actually aggregated from the same group of base relation tuples.
• An extreme case
• Let the source relation R have only a single tuple r(a1, a2, …, an, m);
• 2^n cube tuples can be condensed into one physical tuple: (a1, a2, …, an, V), where V = aggr(r);
• together with some information indicating that it is a representative tuple.
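The extreme case can be checked with a naive cube computation. The sketch below assumes SUM as the aggregate and uses a hypothetical helper `full_cube`; it is only an illustration of the redundancy, not any of the cited algorithms.

```python
from itertools import product

def full_cube(rows, n_dims):
    """Naive SUM cube: map each cube tuple ('*' means ALL) to its aggregate."""
    cube = {}
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Every subset of dimensions kept (the rest set to ALL) gives one cube tuple.
        for mask in product([False, True], repeat=n_dims):
            key = tuple(d if keep else "*" for d, keep in zip(dims, mask))
            cube[key] = cube.get(key, 0) + measure
    return cube

# A base relation with one single tuple r(a1, a2, a3, m = 10).
cube = full_cube([("a1", "a2", "a3", 10)], 3)
print(len(cube))           # 8 cube tuples, i.e. 2^3
print(set(cube.values()))  # {10}: all of them carry the same aggregate
```

Since all 2^n cube tuples are aggregated from the same base tuple, storing one representative tuple loses no information.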

Thinking…
• Condensed cube
• It condenses the cube tuples that are aggregated from one single base tuple into one physical tuple in order to reduce the cube’s size.
• Dwarf
• Besides suffix coalescing, i.e. multi-base-tuple condensing, it also realizes full prefix-sharing, which makes its cube size reduction highly effective.

Motivation
• How can the condensed cube’s size be reduced further while taking into account the kind of query we intend to answer, namely range queries?
• By augmenting BST-condensing with the removal of intra-cuboid prefix redundancy!

Ordered Datacube Model
• Value ALL (or *) is encoded as 0.
• For a dimension D with cardinality C, each dimension value is mapped one-to-one to an integer between 1 and C inclusive.
• N dimensions form an N-dimensional space.
• The origin O(0, 0, …, 0) represents the grand total.
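The encoding can be sketched in a few lines of Python. The helper names `build_encoder` and `encode`, the sample dimension values, and the choice of sorted order for assigning codes are all assumptions made for illustration; the slides only require the mapping to be one-to-one onto 1..C.

```python
ALL = 0  # the code reserved for ALL (*)

def build_encoder(values):
    """Map the distinct values of one dimension one-to-one onto 1..C."""
    return {v: code for code, v in enumerate(sorted(set(values)), start=1)}

def encode(cube_tuple, encoders):
    """Turn a cube tuple into a point of the N-dimensional integer space."""
    return tuple(ALL if v == "*" else enc[v]
                 for v, enc in zip(cube_tuple, encoders))

city = build_encoder(["Wuhan", "Beijing", "Shanghai", "Wuhan"])
year = build_encoder([2003, 2004])

print(encode(("Wuhan", "*"), [city, year]))  # (3, 0)
print(encode(("*", "*"), [city, year]))      # (0, 0), the grand total
```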

Ordered Datacube Model
• Under ODM, a range query against a data cube can be reduced to a sub-query against only one particular cuboid in the cube, or to a union of such sub-queries.
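One possible reading of this reduction, sketched in Python: a query gives an inclusive integer range per dimension, a range of (0, 0) means the dimension is aggregated away, and a range that starts at 0 but extends past it is split into the ALL part and the part from 1 upward. The function `split_range_query` and this splitting rule are assumptions for illustration, not the authors' stated procedure.

```python
from itertools import product

ALL = 0

def split_range_query(ranges):
    """Split a query (one inclusive (lo, hi) range per dimension) into
    sub-queries that each touch exactly one cuboid.

    Returns a list of (cuboid, choice): `cuboid` holds the indices of the
    grouped dimensions, `choice` holds per dimension either None (ALL)
    or an inclusive range within 1..C.
    """
    per_dim = []
    for lo, hi in ranges:
        options = []
        if lo == ALL:
            options.append(None)              # dimension aggregated away
        if hi >= 1:
            options.append((max(lo, 1), hi))  # proper values only
        per_dim.append(options)
    sub_queries = []
    for choice in product(*per_dim):
        cuboid = tuple(i for i, r in enumerate(choice) if r is not None)
        sub_queries.append((cuboid, choice))
    return sub_queries

# Ranges on dimensions 0 and 2, dimension 1 is ALL: a single cuboid.
print(split_range_query([(2, 5), (0, 0), (1, 3)]))
# A range that includes 0 spans two cuboids, hence a union of sub-queries.
print(split_range_query([(0, 2), (4, 4)]))
```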

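BST-Condensed Cube
• Base Single Tuple (BST): a base tuple that is the only tuple in its partition when the base relation is partitioned on a set of dimensions SD (its single dimensions)
• t1 is a BST on SD {A} and {B}
• t2 is a BST on SD {B}
• Fully exploiting each BST with all of its SDs yields a unique minimal BST-condensed cube: MinCube.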
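A BST check can be sketched as a partition count. The relation below is hypothetical (the slide's own example table is not reproduced here); it is merely chosen so that t1 is a BST on {A} and on {B}, and t2 is a BST on {B} but not on {A}. The helper name `is_bst` is likewise an assumption.

```python
from collections import Counter

def is_bst(rows, target, sd):
    """True if `target` is the only row in its partition when `rows`
    is partitioned on the dimension indices in `sd`."""
    def project(row):
        return tuple(row[i] for i in sd)
    counts = Counter(project(r) for r in rows)
    return counts[project(target)] == 1

A, B, C = 0, 1, 2      # dimension positions; the last field is the measure
t1 = (1, 1, 1, 50)
t2 = (2, 2, 1, 20)
t3 = (2, 3, 1, 30)
R = [t1, t2, t3]       # hypothetical base relation

print(is_bst(R, t1, [A]))  # True:  no other tuple has A = 1
print(is_bst(R, t2, [A]))  # False: t3 also has A = 2
print(is_bst(R, t2, [B]))  # True:  no other tuple has B = 2
```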

BU-BST Condensed Cube
• BottomUpBST algorithm (ICDE’02)
• Each BST corresponds to only one SD.
• Compared with MinCube, it is easier to compute, and normal cube tuples are easier to restore from the condensed cube.
Note: BST condensing is a special kind of prefix-sharing! A group of cube tuples sharing a prefix is represented by one BST!

A BU-BST Condensed Cube Example
Note:
• Intra-cuboid prefix redundancy: ct3 and ct4
• Inter-cuboid prefix redundancy: ct2, ct3 and ct5

Prefix-sharing Condensed Cube - PrefixCube
• BST condensing + intra-cuboid prefix-sharing = PrefixCube
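The benefit of intra-cuboid prefix-sharing can be illustrated by counting stored dimension values. The sketch below compares a flat layout with a trie-like layout in which tuples of one cuboid share common prefixes. It is only an assumed illustration with hypothetical helper names and sample data, not the actual physical structure of PrefixCube.

```python
def flat_cells(tuples):
    """Dimension values stored when every tuple is written out in full."""
    return sum(len(t) for t in tuples)

def shared_cells(tuples):
    """Dimension values stored when common prefixes are kept only once:
    one cell per distinct non-empty prefix (one trie node each)."""
    prefixes = {t[:k] for t in tuples for k in range(1, len(t) + 1)}
    return len(prefixes)

# Tuples of a hypothetical cuboid(A, B) under ODM encoding, measures omitted.
cuboid_ab = [(1, 1), (1, 2), (1, 3), (2, 1)]

print(flat_cells(cuboid_ab))    # 8
print(shared_cells(cuboid_ab))  # 6: the value A = 1 is stored once, not three times
```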

A PrefixCube Example

Corresponding Dwarf

PrefixCube vs. Dwarf

Effectiveness of Size Reduction
• Datasets
• synthetic datasets with uniform distribution
• # of tuples: 1,000,000
• Figure panels: (a) Cardinality = 100, (b) Cardinality = 1000

Effectiveness of Size Reduction
• PrefixBUC
• Full Cube (computed by BUC)
• Prefix-sharing

Impact of Data Density
• Datasets
• Uniform distribution
• # of dimensions: 6
• Cardinality of dimensions: 100
• # of tuples: ranging from 1,000 to 1,000,000

Impact of Data Skewness
• Datasets
• Zipf distribution
• # of tuples: 1,000,000
• Cardinality of dimensions: ranging from 1,000 down to 500 in steps of 100
• Zipf factor: ranging from 0 to 0.8 in steps of 0.2

Real-world Dataset
• Datasets
• Weather Datasets
• # of tuples: 1,015,367

Conclusion
• A new cube structure, PrefixCube, was proposed by augmenting BU-BST condensing with intra-cuboid prefix-sharing.
• It greatly reduces the data cube’s size compared with the BU-BST condensed cube.
• It also reduces the impact of data skew on BU-BST condensing.
• It achieves a fairly stable size reduction on both dense and sparse datasets.

The End
Thank you! Any questions?
