240 likes | 337 Views
ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis). Physical Data Warehouse Design. Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of Business. It’s all about trading storage for speed!. Fundamentals Aggregates (Ch. 16, pp. 356 - 357)
E N D
ACCTG 6910Building Enterprise & Business Intelligence Systems(e.bis) Physical Data Warehouse Design Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business
It’s all about trading storage for speed! • Fundamentals • Aggregates (Ch. 16, pp. 356 - 357) • Indexes (Ch. 16, p. 357)
Fundamentals: the Storage Hierarchy CPU 500-1000 MIPS Small Fast Cache 10-8 second 512 KB Access Speed Storage Capacity Memory 10-7 second 512 MB 512 GB Disk 10-2 second Large Slow
Fundamentals: the Storage Hierarchy Cache CPU Bus Disk Drive (I/O Channel) Memory Disk How long does it take to query sales by city? How large is the Sales Fact table? How long does it take to access the Sales Fact table?
Fundamentals How large is the fact table? e.g., 1 million records/day, 0.2KB/record 0.2 GB/day
Fundamentals How long does it take to access all the fact records? E.g., the small fact table is 1 Terabyte in size! • 0.01s*1012=325 years LONG!!!!!!!!!!!!!
Fundamentals: the Storage Hierarchy Cache CPU Bus Disk Drive (I/O Channel) Memory Disk • The logical unit of data transferred between disk and memory is block (e.g., 4k bytes)
Fundamentals How long does it take to access all the fact records? E.g., the small fact table is 1 Terabyte in size! • Number of blocks: 2.5 millions • Access time = 0.01s*2500000= < 7 hrs!!!
Aggregate • In data warehouse design, we choose the gain of fact table to be the possible lowest level. Grain: orderline
Aggregate • The reasons to choose the lowest level of fact: • (X) Analysts want to query on single record • (O) Analysts want to flexibly cut and group records.
Aggregate • However, keeping the most detailed fact records could result in • huge-size fact table: TeraBytes?! (1 million records/day, 256 Bytes/record -> 0.2 GB/day) • slow query
Aggregate • To keep s data warehouse flexible, fact tables need to store facts in their lowest levels of detail. • To improve query performance, another type of fact table which stores pre-computed summaries of detailed facts helps. • Reduced to a logical design solution
Aggregate • An aggregate fact table is a fact table that summarizes base-level fact table records along one or several dimensions. • An aggregate dimension table is a dimension table that summarizes base-level dimension table records. • E.g., marketing managers check daily product sales by city --- aggregate by city in customer dimension
Aggregate Aggregate fact table Aggregate dimension table
Indexes How long does it take to find out the total purchase Amt by Tom Jones?
Indexes • Customer table • 1M records, each record 0.200 Kbytes long • Block is 4K size, block access time is 0.01s • Number of records/block: 4/0.2=20 • Number of blocks: 1M/20=50K • Sequential search • Time: 25K*0.01s=250s=4min.
Indexes • Binary search • Time: log(50K)*0.01s=16*0.01s=0.16s • B+ tree index • Create index pn on customer(cname); • If each node (block) in B+ tree has 117 keys, then • # of access to indexes: log117(1M)=3 (i.e.height of the tree) • # of access to Customer Dimension: 1 • Total time = 4*0.01 = 0.04s
B+-trees - P=12 Indexes to indexes (11 key values, 12 pointers) ... ... ………. Indexes to customer records
Indexes How long does it take to find out the total sales of Desktop computers?
Performance Improvement • Suppose there are only 4 product categories for 1M products • Create a B+ tree index??? • Suppose the size of product category and block ID is 10 bytes • Size of index = 1M * 10 = 10 M bytes
Performance Improvement • A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.
Bitmaps Product record 1 record 2 record 3 A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.
Performance Improvement • Bitmap index is suitable for low cardinality attribute. • Cardinality(A) = # of possible values for A/#of records • Compared with B+ tree index, bitmap index has the following advantages for low cardinality attributes • Storage space saving (1M*4/8=500K bytes) • Efficient for boolean operations • CREATE BITMAP INDEX bitpc ON PRODUCT (PCNAME);