Query Optimization In Compressed Database Systems
Zhiyuan Chen and Johannes Gehrke, Cornell University; Flip Korn, AT&T Labs
Why Compression?
• CPU speed outpaces disk speed exponentially: x10 / decade (bandwidth), x100 / decade (latency)
• Trade CPU for I/O to improve query performance
  + Save bandwidth for sequential I/O
  + Improve buffer pool hit ratio
  - Pay decompression cost
• Environment: decision support queries, lossless compression
Issues
• Database compression methods
• Efficient query processing
Database Compression Methods
• General-purpose compression: only compression ratio matters; large decompression unit (the whole file)
• Database compression: both compression ratio and decompression cost matter; small decompression unit (an attribute or a tuple)
• Our setting: a single attribute can be decompressed on its own (a small coding sketch follows)
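A minimal sketch of what attribute-level compression can look like, assuming simple dictionary coding; the class and names are illustrative, not the paper's scheme. It only shows why one attribute can be decoded without touching the rest of the tuple.

```python
# Illustrative sketch only (assumed dictionary coding, not the paper's scheme):
# each column gets its own code table, so a single attribute can be
# decompressed without touching the rest of the tuple.

class AttributeDictionary:
    """Maps each distinct value of one column to a small integer code."""
    def __init__(self, values):
        self.code_to_value = sorted(set(values))
        self.value_to_code = {v: c for c, v in enumerate(self.code_to_value)}

    def compress(self, value):
        return self.value_to_code[value]

    def decompress(self, code):
        return self.code_to_value[code]

# A compressed tuple just stores codes; decoding S_ADDRESS does not require
# decoding S_NAME or any other attribute.
addr = AttributeDictionary(["5 Main St", "12 Oak Ave", "5 Main St"])
row = {"S_NAME": 3, "S_ADDRESS": addr.compress("12 Oak Ave")}
print(addr.decompress(row["S_ADDRESS"]))  # 12 Oak Ave
```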
Efficient Query Processing
• Compared to an uncompressed DB: when to decompress?
• Assumption: no compression in query processing
• Our story
  • Different strategies of when to decompress; none of them is always optimal
  • Combined optimization problem: query plan + decompression placement
  • Solutions
  • Experiments
Different Decompression Strategies (join R.A = S.B)
• Eager: decompress whole relations as they are read from disk into memory (D(R), D(S)); everything is uncompressed from then on
• Lazy: decompress only the attributes the operator needs (D(R.A), D(S.B)); A and B are uncompressed, the rest stays compressed
• Transient: decompress values only inside the operator (d(R.A) = d(S.B)); the output stays all compressed (a code sketch of lazy vs. transient follows)
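A minimal sketch, assuming dictionary-coded columns as above (function names and data are made up for illustration), of how lazy and transient decompression differ inside a nested-loop join:

```python
# Illustrative sketch (assumed code, not the paper's engine): lazy vs. transient
# decompression in a nested-loop join R.A = S.B over dictionary-coded columns.
# decode_r / decode_s map integer codes back to the original string values.

def lazy_nlj(r_rows, s_rows, decode_r, decode_s):
    # Lazy: decode the join attributes once; the decoded values stay in the
    # output, so R.A and S.B flow uncompressed to the operators above.
    r_dec = [{**r, "A": decode_r[r["A"]]} for r in r_rows]
    s_dec = [{**s, "B": decode_s[s["B"]]} for s in s_rows]
    return [(r, s) for r in r_dec for s in s_dec if r["A"] == s["B"]]

def transient_nlj(r_rows, s_rows, decode_r, decode_s):
    # Transient: decode only inside the comparison; output tuples stay
    # compressed (smaller intermediates, but repeated decompression work).
    return [(r, s) for r in r_rows for s in s_rows
            if decode_r[r["A"]] == decode_s[s["B"]]]

# Tiny usage example with dictionary-coded string attributes.
decode = ["12 Oak Ave", "5 Main St"]
R = [{"A": 0, "name": "r1"}, {"A": 1, "name": "r2"}]
S = [{"B": 1, "name": "s1"}]
assert lazy_nlj(R, S, decode, decode)[0][0]["name"] == "r2"
assert transient_nlj(R, S, decode, decode)[0][0]["name"] == "r2"
```

The transient variant never materializes decompressed attributes, which is exactly the I/O-versus-decompression trade-off discussed on the next slide.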
Which Strategy Is Optimal?
• Lazy vs. eager: lazy is always better
• Transient vs. lazy
  • Transient: more I/O savings
  • Lazy: lower decompression cost
• In practice
  • Numerical attributes: transient is always better
  • String attributes: no clear winner (expensive to decompress, but high I/O savings when kept compressed)
An Example With TPC-H Data

Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
From Supplier, Customer
Where S_ADDRESS = C_ADDRESS
Order by S_NAME, C_NAME

Plan: Sort(S_N, C_N) over a join S_A = C_A of Supplier and Customer (S_N, C_N, S_A, C_A abbreviate S_NAME, C_NAME, S_ADDRESS, C_ADDRESS).
Transient vs. Lazy
Three decompression placements for the example plan (join, then sort):
• Transient BNL join (42s), all attributes compressed, transient sort (0.5s)
• Lazy BNL join (2s), 3 attributes compressed, transient sort (3s)
• Lazy BNL join (2s), lazy sort (7s), only 1 attribute left compressed
Choosing when to decompress is an optimization problem! (A quick arithmetic check follows.)
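A quick check using the per-operator timings above (as reconstructed from the slide): summing the costs of each (join, sort) combination shows why mixing strategies wins.

```python
# Summing the per-operator timings from the slide: the mixed plan
# (lazy join, transient sort) is the cheapest combination.
plans = {
    "transient BNL + transient sort": 42 + 0.5,
    "lazy BNL + transient sort": 2 + 3,
    "lazy BNL + lazy sort": 2 + 7,
}
print(min(plans, key=plans.get))  # lazy BNL + transient sort
```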
Interactions With Traditional Optimization
• Two-step algorithm: run System R first, then decide when to decompress
• System R prunes the SM join, so the cheaper plan (Transient SM join 2.5s, all attributes compressed, transient sort 0.5s) is never considered
• The two-step result: Lazy BNL join (2s), 3 attributes compressed, transient sort (3s)
• The optimal plan returned by System R is no longer optimal!
Compression-Aware Optimization
• Given a query and a compressed DB, find the optimal query plan
• New operators
  • Explicit decompression operators
  • Transient versions of existing relational operators
• Search space: an O(n^m) factor over the old search space (a counting sketch follows)
  • n is the depth of the plan, m is the number of attributes
  • Each attribute is explicitly decompressed at most once
  • For each attribute, there are n places to decompress explicitly
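A small sketch of the counting argument (names and structure assumed for illustration): each of the m attributes can be explicitly decompressed at any of the n operators, or handled only transiently, so each join order fans out into on the order of n^m extended plans.

```python
from itertools import product

# Illustrative sketch of the search-space blowup: for each compressed attribute,
# pick the plan level where it is explicitly decompressed, or None to rely on
# transient decompression only. With m attributes and depth n this enumerates
# (n + 1) ** m placements per join order, i.e. an O(n^m) factor.
def decompression_placements(attrs, depth):
    choices = list(range(depth)) + [None]
    return [dict(zip(attrs, pick)) for pick in product(choices, repeat=len(attrs))]

placements = decompression_placements(["S_A", "C_A"], depth=3)
print(len(placements))  # (3 + 1) ** 2 = 16
```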
Dynamic Programming - OPT
• Extend the System R optimizer
  • Bottom up, one minimal plan per interesting property
  • New interesting property: which attributes remain compressed
• Example: over Supplier joined with Customer, keep both the Lazy BNL join (2s; property: S_A, C_A uncompressed) and the Transient SM join (2.5s; property: all compressed); a pruning sketch follows
• Blowup reduced from n^m to 2^m
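A minimal sketch, with assumed data structures, of the pruning rule OPT adds to the System R enumeration: keep one cheapest plan per compression property.

```python
# Illustrative sketch (assumed structures) of the OPT pruning rule: during the
# bottom-up System R style enumeration, keep one cheapest plan per interesting
# property, where the property is the set of attributes still compressed.

def prune_by_property(candidate_plans):
    """candidate_plans: iterable of (cost, compressed_attrs_frozenset, plan)."""
    best = {}
    for cost, compressed, plan in candidate_plans:
        if compressed not in best or cost < best[compressed][0]:
            best[compressed] = (cost, plan)
    return best  # at most 2**m entries for m compressed attributes

# Both plans from the slide survive, because their properties differ.
plans = [
    (2.0, frozenset(), "Lazy BNL join: S_A, C_A uncompressed"),
    (2.5, frozenset({"S_A", "C_A"}), "Transient SM join: all compressed"),
]
print(prune_by_property(plans))
```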
Min-K Heuristic Algorithm
• Store plans for k rather than 2^m properties: the k properties whose plans are cheapest (a pruning sketch follows)
• Storage blowup reduced from 2^m to k
• Time: still an exponential blowup in the worst case
• Example: for a join on S_A, C_A, the stored plans are Lazy: S_A, C_A; Transient: S_A, C_A; Lazy: S_A, Transient: C_A; Transient: S_A, Lazy: C_A
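A minimal sketch (assumed structures) of the Min-K pruning step applied on top of the per-property plans from the previous sketch:

```python
import heapq

# Illustrative sketch: instead of one plan per compression property (up to
# 2**m of them), keep only the plans for the k properties whose cheapest plan
# is cheapest overall.
def min_k_prune(best_per_property, k):
    """best_per_property: dict {compressed_attrs_frozenset: (cost, plan)}."""
    kept = heapq.nsmallest(k, best_per_property.items(), key=lambda item: item[1][0])
    return dict(kept)

# With k = 2, only the two cheapest properties are carried upward.
best = {
    frozenset(): (2.0, "Lazy BNL join"),
    frozenset({"S_A", "C_A"}): (2.5, "Transient SM join"),
    frozenset({"S_A"}): (4.0, "Mixed plan"),
}
print(min_k_prune(best, k=2))
```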
Min-K Heuristic (2)
• If transient decompression is bad for one join attribute, it is often bad for the other
  • BNL join: both S_A and C_A are decompressed N^2 times
• So only consider two cases per join: for a join on S_A, C_A, store only Lazy: S_A, C_A and Transient: S_A, C_A
• Time blowup is 2k
Experiments
• Setup: modified Predator query engine & optimizer
• Algorithms: Uncompressed, Eager, Lazy, Transient-Only, Two-Step, OPT, Min-1, Min-2
• 100 MB TPC-H data, 50% compression ratio
• Pentium III 550 MHz, varying buffer pool size
Experimental Setup (2)
• Randomly add join conditions on string attributes
• Divide queries into workloads by number of string join conditions and number of join tables
• Metrics for an algorithm X (a sketch of both follows)
  • Average relative cost: Average(cost of plan returned by X / cost of the optimal plan)
  • Average blowup factor: Average(# plans searched by X / # plans searched by System R)
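For concreteness, a tiny sketch of the two metrics; the numbers are made up for illustration.

```python
# Made-up example of the two evaluation metrics defined on the slide.
def avg_relative_cost(x_plan_costs, opt_plan_costs):
    return sum(x / o for x, o in zip(x_plan_costs, opt_plan_costs)) / len(x_plan_costs)

def avg_blowup_factor(x_plans_searched, system_r_plans_searched):
    return sum(x / r for x, r in zip(x_plans_searched, system_r_plans_searched)) / len(x_plans_searched)

print(avg_relative_cost([5.0, 9.0], [5.0, 4.5]))  # 1.5
print(avg_blowup_factor([120, 300], [100, 100]))  # 2.1
```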
Average Relative Cost (figure): queries with 3-4 join tables, buffer pool 10% of the compressed DB
Distribution of Query Performance (figure): percentage of good plans (cost within twice that of OPT) over all queries
Optimization Cost (figure): queries with 3-4 join tables
Related Work
• How to compress: Roth&Horn93, Iyer&Wilhite94, Goldstein98
• How to query: Graefe&Shapiro91, Westmann00, Greer99
• Query optimization
  • Compressed MOLAP aggregates: Li99
  • Compressed bitmap indices: Amer-Yahia&Johnson00
  • Expensive predicates: Chaudhuri&Shim99, Hellerstein93
Conclusions & Future Work
• Novel optimization problem: search for the regular query plan plus when to decompress
  • Searching them separately is sub-optimal
• OPT and the Min-K heuristic: up to an order of magnitude improvement in the experiments
• Future work: caching decompressed values, updates
Search Space: 3 places to place D(S_A)
• Example plan: Sort(S_A) over a join S_A = C_A
• D(S_A) can be inserted at any of the 3 levels (3 is the depth), giving 3 extended plans
• Operators before the explicit D(S_A) are converted to their transient versions (e.g., a transient join); operators after it stay regular (e.g., a regular sort)
• Overall: n^m blowup over the old search space (n: depth of the plan, m: number of attributes)
Relative Cost, Varying Buffer Pool Size (figure): queries with 3-4 join tables, 2 additional string joins
Relative Performance (2) (figure): queries with more than 5 join tables