280 likes | 420 Views
Compression Aware Physical Database Design. Microsoft Research . Brown University . Hideaki Kimura *. Vivek Narasayya Manoj Syamala. hkimura@cs.brown.edu. { viveknar,manojsy }@ microsoft.com. (*) Graduates soon. On Job Market. Background: Compression in DB. Every Major DBMS Supports
E N D
Compression Aware Physical Database Design Microsoft Research Brown University Hideaki Kimura* Vivek Narasayya Manoj Syamala hkimura@cs.brown.edu {viveknar,manojsy}@microsoft.com (*) Graduates soon. On Job Market.
Background: Compression in DB • Every Major DBMS Supports • Saves Storage Consumption • Saves I/O Bandwidth DBMS A: 4x! DBMS B: 10x! DBMS C: 12x! Query Process Engine Tables, Indexes SELECT Decompress Compressed Data Compress INSERT
Compression Schemes in DB Dictionary Encoding NULL Suppression • Local dict. (Oracle, SQL Server) • Global dict. (DB2) + Prefix Suppression, LZO, RLE…
Two Types of Compression in DB Order Independent Order Dependent • NULL-Supp. • Global dict. • … IAB IBA IAB IBA IAB IBA page fragmented = ≠ • Run Length Enc. • Local dict. • …
Benefits and Overheads • Saves Storage Space, I/O • CPU Overhead to Compress & Decompress • Different Compression Scheme= Different Saving ↔ Overhead How Do We Use It? DBA
Issue 1: To Compress or not.. • Depends on Data • Depends on Workload • SELECTs/INSERTs Frequency • CPU bottleneck? IO bottleneck? 10GB 10GB 1GB 9GB -10% -90% Low Compression Ratio High Compression Ratio
Q1 Q2 Syntactically Relevant Indexes Issue 2: What Index to Create I4 I3 I5 I2 I1 Select Candidate Configurations I3 I5 I1 Physical DB Design Tool DBMS Configuration Enumerate BestConfiguration I1 I5 Hypothetical Indexes Prune What-if Analysis Query Optimizer Estimate Runtime
Naïve Solution: Staged Design • Run Design Tool to Select Indexes • Compress them, then Repeat. Stage 2 Stage 1 Compress! MV MV MV Workload Idx Idx Idx 100 MB Budget 100 MB 50 MB 100 MB
Problem in tight space budget SELECT SUM(Price*Discount) FROM Sales WHERE State='CA' and Jul 01 < Shipdate < Sep 01 • Misses an index that makes sense only with compression Sales Choice for 100 MB? I1 (State, Shipdate): 95 MB → 50 MB I2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB
Example: Tight Space Budget Good design: 175MB CREATE COMPRESSED INDEX (L_PARTKEY,L_ORDERKEY,L_SUPPKEY) INCLUDE (L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT) Staged: 155MB CREATE INDEX (L_ORDERKEY) INCLUDE(L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE) ?
Problem in plenty space budget Choice for 200 MB? • Result in too high CPU overheads for compression/decompression. I1 (State, Shipdate): 95 MB → 50 MB INSERT INTO Sales … UPDATE Sales SET Price=.. I2 (State, Shipdate) Include (Price, Discount): 170 MB→ 90 MB CPU Overheads
Example: Plenty Space Budget Worse with More Budget!
IntegratedSolution Needed! • How to Estimate Index-size after compression? • How to Evaluate benefits/overheads of compression? • How Compression affects Candidate Selection/Enumeration?
Size Estimation • Essential Metric of Indexes • To Fit Space Budget • To Estimate I/O cost • Need Compression Fraction Table #tuple=1M Col-A Width=8 Col-B Width=4 Col-C Width=10 Clust. Key Width=4 Stats Size (IABC) = (8 + 4 + 10 + 4) * 1M = 26 MB Comp. Size (IABC) = 26 MB * CF (IABC)
Prior work SampleCF Overheads • SampleCF[Idreoset al. ICDE'10] Table 1GB Sample 10MB CREATE COMPRESSED INDEX • Sample Size: Cost ↔ Accuracy • Still Expensive for 1,000s of indexes
Index Size Deduction Local dict. (ORD-DEP) NULL supp. (ORD-IND) IBA IAB Ia Ib Ia,b Sum-up Savings SampleCF Ia Ib Col-Set Deduction Col-Ext Deduction Ia,b Ib,a Estimate From Run-Length More Details in paper
Optimize Accuracy-Cost Trade-off • Size-Estimation Strategy • Sample Size? • Deduction Path? • Expected Errors? • Formulate as Graph Problem • Greedy algorithm to solve(details in the paper)
Issues in Design Tool • Query Cost model to consider (De)Compression CPU cost • Candidate Selection/Enumeration Key Challenge:Space-Performance Trade-off
Candidate Selection:Space-Performance Trade-off Q1 Q2 • Add CompressedIndexes IB ID IA IC Compressed Versions Most of them are Ignored! ID IB IC IA Select Fastest Compressed Indexes are often Slower-but-Smaller IA IC (exception: very highcompression ratio)
Skyline Candidate Selection • Construct Skyline of Configurations • Pick Both Fast-Indexes and Small-Indexes
Comp. IC IB 10MB ICB 10MB 5MB Enumeration: Problem ICB IB IA IC IA Seed IB IC IA IA IB IA Optimal Design 15MB Room IA ICB IC IA ICB • Greedy picks un-compressed indexes too early
IC IA ICB IB IA ICC … Local Backtrack in Enumeartion Recover If Oversized IB IB IC IA IA IA • Recover oversized configurations • Compress indexes in the config.
Experimental Results • Implemented on SQL Server 2008 • Modified Database Tuning Advisor (DTA) "DTAc" • Modified Query Cost Model • TPC-H Scale-1 (more results in paper) • SELECT-intensive/UPDATE-intensive • Compared Estimated Runtime
Candidate Selection/Enumeration • Both Skyline & Backtrack are required esp. for tight budget Clustered/2ndary Indexes
DTAc vs. DTA • Especially better in tight budget • Choose lightly compressed designs in UPDATE-intensive Clustered/2ndary/MV Indexes
Overhead in DTA • Reduce Size Estimation Overheads for a factor of 3 • Mostly <10% Estimation Error
Conclusion • Opportunities and Challenges • Integrated Approach to exploit compression in physical design • Space-Performance Tradeoff • Size Estimation • Open Issues • Column-Store