190 likes | 330 Views
MIDDLEWARE SYSTEMS. RESEARCH GROUP. Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance. MSRG .ORG. ICPE 2013, Prague, 24/04/2013. Tilmann Rabl , Meikel Poess , Hans-Arno Jacobsen, Patrick and Elizabeth O’Neil.
E N D
MIDDLEWARE SYSTEMS RESEARCH GROUP Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance MSRG.ORG ICPE 2013, Prague, 24/04/2013 Tilmann Rabl, Meikel Poess, Hans-Arno Jacobsen, Patrick and Elizabeth O’Neil
Real Life Data is Distributed Uniformly… Well, Not Really • Customers zip codes typically clustered around metropolitan areas • Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific periods • US retail sales: • peak during Holiday Season • December sales are 2x of January sales Source: US Census Data Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Student Seminar Signup Distribution Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
How Can Skew Effect Database Systems? • Data placement • Partitioning • Indexing • Data structures • Tree balance • Bucket fill ratio • Histograms • Optimizer finding the optimal query plan • Index vs. non-index driven plans • Hash join vs. merge join • Hash group by vs. sort group by Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Agenda • Data Skew in Current Benchmarks • Star Schema Benchmark (SSB) • Parallel Data Generation Framework (PDGF) • Introducing Skew in SSB Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Data Skew in Benchmarks • TPC-D (1994-1999): only uniform data • SIGMOD 1997 - “Successor of TPC-D should include data skew” • No effect until … • TPC-DS (released 2012) • Contains comparability zones • Not fully utilized • TPC-D/H variations • Chaudhuri and Narayasa: Zipfian distribution on all columns • Crolotte and Ghazal: comparability zones • Still lots of open potential Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Star Schema BenchmarkI • Star schema version of TPC-H • Merged Order and Lineitem • Date dimension • Dropped Partsupp • Selectivity hierarchies • C_City C_Nation C_Region • … Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Star Schema Benchmark II • Completely new set of queries • 4 flights of 3-4 queries • Designed for functional coverage and selectivity coverage • Drill down in dimension hierarchies • Predefined selectivity select sum(lo_extendedprice*lo_discount) as revenue fromlineorder, date wherelo_orderdate = d_datekey andd_year = 1993 andlo_discountbetween 1 and 3 andlo_quantity < 25; Drilldown Q1.1 select sum(lo_extendedprice*lo_discount) as revenue fromlineorder, date wherelo_orderdate = d_datekey andd_yearmonthnum= 199301 andlo_discountbetween 1 and 3 andlo_quantitybetween 26 and 35; Q1.2 Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Parallel Data Generation Framework • Generic data generation framework • Relational model • Schema specified in configuration file • Post-processing stage for alternative representations • Repeatable computation • Based on XORSHIFT random number generators • Hierarchical seeding strategy Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12. Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest'11. Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11. Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10. Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Configuring PDGF PDGF DB XML • Schema configuration • Relational model • Tables, fields • Properties • Table size, characters, … • Generators • Simple generators • Metagenerators • Update definition • Insert, update, delete • Generated as change data capture <table name="SUPPLIER"> <size>${S}</size> <field name="S_SUPPKEY" size="" type="NUMERIC“ primary="true" unique="true"> <gen_IdGenerator /> </field> <field name="S_NAME" size="25" type="VARCHAR"> <gen_PrePostfixGenerator> <gen_PaddingGenerator> <gen_OtherFieldValueGenerator> <reference field="S_SUPPKEY" /> </gen_OtherFieldValueGenerator > <character>0</character> <padToLeft>true</padToLeft> <size>9</size> </gen_PaddingGenerator> <prefix>Supplier </prefix> </gen_PrePostfixGenerator> </field> [..] Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Opportunities to Inject Data Skew in • Foreign key relations • E.g., L_PARTKEY • One fact table measures • E.g., L_Quantity • Single dimension hierarchy • E.g., P_Brand→ P_Category → P_Mfgr • Multiple dimension hierarchies • E.g., City → Nation in Supplier and Customer • Experimental methodology • One experiment series for each of the above • Comparison to original SSB • Comparison of index-forced, non-index, and automatic optimizer mode • SSB scale factor 100 (100 GB), x86 server Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Skew in Foreign Key Relations • Very realistic • Easy to implement in PDGF • Just add a distribution to the reference • But! • Dimension attributes uniformly distributed • Dimension keys uncorrelated to dimension attributes • Very limited effect on selectivity • Focus on attributes in selectivity predicates <distribution name="Exponential“ lambda="0.26235" /> Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Skew in Fact Table Measure – Lo_Quantity • Lo_Quantity distribution • Values range between 0 and 50 • Originally uniform distribution with: • P(X=x)=0.02 • Coefficient of variation of 0.00000557 • Proposed skewed distribution with: • Query 1.1 • lo_quantity < x, x ∈ [2, 51] • Results • Switches too early to non-index plan • Switches too late to non-index plan • Optimizer agnostic to distribution Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Skew in Single Dimension Hierarchy - Part • P_Category distribution • Uniform P(X=x)=0.04 • Skewed P(X=x)= 0.01 - 48.36 • Probabilities explicitly defined • Query 2.1 • Restrictions on two dimensions • Results uniform case • Index driven superior • Optimizer chooses non-index driven • Results skewed case • Switches too early to non-index plan Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Skew in Multiple Dimension Hierarchies – S_City & C_City • Skewed S_City & C_City • Probabilites exponentially distributed • Query 3.3 • Restrictions on 3 dimensions • Variation on Supplier and Customer city • Results uniform and skewed cases • Automatic plan performs best • Cross over between automatic uniform and skewed too late Join Cardinality Elapsed Time Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Conclusion & Future Work • PDGF implementation of SSB • Introduction of skew in SSB • Extensive performance analysis • Several interesting optimizer effects • Performance impact of skew • Future Work • Further analysis on impact of skew • Skew in query generation • Complete suite for testing skew effects Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Thanks • Questions? • Download and try PDGF: • http://www.paralleldatageneration.org • (scripts used in the study available on website above) Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Back-up Slides Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations
Configuring PDGF Generation • Generation configuration • Defines the output • Scheduling • Data format • Sorting • File name and location • Post processing • Filtering of values • Merging of tables • Splitting of tables • Templates (e.g. XML / queries) <table name="QUERY_PARAMETERS" exclude="false" > <output name="CompiledTemplateOutput" > [..] <template ><!-- inty = (fields [0]. getPlainValue ()).intValue (); intd = (fields [1]. getPlainValue ()).intValue (); intq = (fields [2]. getPlainValue ()).intValue (); String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR; buffer.append("-- Q1.1" + n); buffer.append("select sum(lo_extendedprice *"); buffer.append(" lo_discount) as revenue" + n); buffer.append(“ from lineorder , date" + n); buffer.append(“ where lo_orderdate = d_datekey" + n); buffer.append(“ and d_year = " + y + n); buffer.append(“ and lo_disc between " + (d - 1)); buffer.append(“ and " + (d + 1) + n); buffer.append(“ and lo_quantity < " + q + ";" + n); --></template > </output > </table > Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations