290 likes | 451 Views
Cube Computation and Indexes for Data Warehouses CPS 196.03 Notes 7. Client. Client. Query & Analysis. Metadata. Warehouse. Integration. Source. Source. Source. Processing. ROLAP servers vs. MOLAP servers Index Structures Cube computation What to Materialize? Algorithms. ROLAP
E N D
Cube Computation and Indexes for Data Warehouses CPS 196.03Notes 7
Client Client Query & Analysis Metadata Warehouse Integration Source Source Source Processing • ROLAP servers vs. MOLAP servers • Index Structures • Cube computation • What to Materialize? • Algorithms
ROLAP server utilities relational DBMS ROLAP Server • Relational OLAP Server tools Special indices, tuning; Schema is “denormalized”
Sales City B A milk soda eggs soap Product 1 2 3 4 Date utilities MOLAP Server • Multi-Dimensional OLAP Server M.D. tools multi-dimensional server could also sit on relational DBMS
Date 2Qtr 1Qtr sum 3Qtr 4Qtr TV Product U.S.A PC VCR sum Canada Country Mexico sum All, All, All MOLAP Total annual sales of TV in U.S.A.
C c3 61 62 63 64 c2 45 46 47 48 c1 29 30 31 32 c 0 B 60 13 14 15 16 b3 44 28 56 9 b2 40 24 52 5 b1 36 20 1 2 3 4 b0 a0 a1 a2 a3 A MOLAP B
Challenges in MOLAP • Storing large arrays for efficient access • Row-major, column major • Chunking • Compressing sparse arrays • Creating array data from data in tables • Efficient techniques for Cube computation Topics are discussed in the paper for reading
Index Structures • Traditional Access Methods • B-trees, hash tables, R-trees, grids, … • Popular in Warehouses • inverted lists • bit map indexes • join indexes • text indexes
Inverted Lists . . . data records inverted lists age index
Using Inverted Lists • Query: • Get people with age = 20 and name = “fred” • List for age = 20: r4, r18, r34, r35 • List for name = “fred”: r18, r52 • Answer is intersection: r18
Bit Maps . . . age index data records bit maps
Bitmap Index • Index on a particular column • Each value in the column has a bit vector: bit-op is fast • The length of the bit vector: # of records in the base table • The i-th bit is set if the i-th row of the base table has the value for the indexed column • not suitable for high cardinality domains Base table Index on Region Index on Type
Using Bit Maps • Query: • Get people with age = 20 and name = “fred” • List for age = 20: 1101100000 • List for name = “fred”: 0100000001 • Answer is intersection: 010000000000 • Good if domain cardinality small • Bit vectors can be compressed
Join • “Combine” SALE, PRODUCT relations • In SQL: SELECT * FROM SALE, PRODUCT WHERE ...
Join Indexes join index
Counting Exercise • How many cuboids are there in a cube? • The full or nothing case • When dimension hierarchies are present • What is the size of each cuboid?
day 2 day 1 Lattice of Cuboids 129 all city product date city, product city, date product, date city, product, date
Dimension Hierarchies all state city
Dimension Hierarchies all product city date product, date city, product city, date state city, product, date state, date state, product state, product, date not all arcs shown...
Efficient Data Cube Computation • Data cube can be viewed as a lattice of cuboids • The bottom-most cuboid is the base cuboid • The top-most cuboid (apex) contains only one cell • How many cuboids in an n-dimensional cube with L levels? • Materialization of data cube • Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) • Selection of which cuboids to materialize • Based on size, sharing, access frequency, etc.
Derived Data • Derived Warehouse Data • indexes • aggregates • materialized views (next slide) • When to update derived data? • Incremental vs. refresh
does not exist at any source Idea of Materialized Views • Define new warehouse tables/arrays
Efficient OLAP Processing • Determine which operations should be performed on available cuboids • Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection • Determine which materialized cuboid(s) should be selected for OLAP: • Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004 Which should be selected to process the query? • Explore indexing structures & compressed vs. dense arrays in MOLAP
What to Materialize? • Store in warehouse results useful for common queries • Example: total sales day 2 . . . day 1 129 materialize
Materialization Factors • Type/frequency of queries • Query response time • Storage cost • Update cost Will study a concrete algorithm later
Iceberg Cube • Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= minsup • Motivation • Only a small portion of cube cells may be “above the water’’ in a sparse cube • Only calculate “interesting” cells—data above certain threshold
Challenges in MOLAP • Storing large arrays for efficient access • Row-major, column major • Chunking • Compressing sparse arrays • Creating array data from data in tables • Efficient techniques for Cube computation Topics are discussed in the paper for reading