1 / 19

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

MIDDLEWARE SYSTEMS. RESEARCH GROUP. Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance. MSRG .ORG. ICPE 2013, Prague, 24/04/2013. Tilmann Rabl , Meikel Poess , Hans-Arno Jacobsen, Patrick and Elizabeth O’Neil.

serena
Download Presentation

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MIDDLEWARE SYSTEMS RESEARCH GROUP Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance MSRG.ORG ICPE 2013, Prague, 24/04/2013 Tilmann Rabl, Meikel Poess, Hans-Arno Jacobsen, Patrick and Elizabeth O’Neil

  2. Real Life Data is Distributed Uniformly… Well, Not Really • Customers zip codes typically clustered around metropolitan areas • Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific periods • US retail sales: • peak during Holiday Season • December sales are 2x of January sales Source: US Census Data Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  3. Student Seminar Signup Distribution Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  4. How Can Skew Effect Database Systems? • Data placement • Partitioning • Indexing • Data structures • Tree balance • Bucket fill ratio • Histograms • Optimizer  finding the optimal query plan • Index vs. non-index driven plans • Hash join vs. merge join • Hash group by vs. sort group by Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  5. Agenda • Data Skew in Current Benchmarks • Star Schema Benchmark (SSB) • Parallel Data Generation Framework (PDGF) • Introducing Skew in SSB Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  6. Data Skew in Benchmarks • TPC-D (1994-1999): only uniform data • SIGMOD 1997 - “Successor of TPC-D should include data skew” • No effect until … • TPC-DS (released 2012) • Contains comparability zones • Not fully utilized • TPC-D/H variations • Chaudhuri and Narayasa: Zipfian distribution on all columns • Crolotte and Ghazal: comparability zones • Still lots of open potential Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  7. Star Schema BenchmarkI • Star schema version of TPC-H • Merged Order and Lineitem • Date dimension • Dropped Partsupp • Selectivity hierarchies • C_City C_Nation C_Region • … Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  8. Star Schema Benchmark II • Completely new set of queries • 4 flights of 3-4 queries • Designed for functional coverage and selectivity coverage • Drill down in dimension hierarchies • Predefined selectivity select sum(lo_extendedprice*lo_discount) as revenue fromlineorder, date wherelo_orderdate = d_datekey andd_year = 1993 andlo_discountbetween 1 and 3 andlo_quantity < 25; Drilldown Q1.1 select sum(lo_extendedprice*lo_discount) as revenue fromlineorder, date wherelo_orderdate = d_datekey andd_yearmonthnum= 199301 andlo_discountbetween 1 and 3 andlo_quantitybetween 26 and 35; Q1.2 Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  9. Parallel Data Generation Framework • Generic data generation framework • Relational model • Schema specified in configuration file • Post-processing stage for alternative representations • Repeatable computation • Based on XORSHIFT random number generators • Hierarchical seeding strategy Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12. Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest'11. Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11. Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10. Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  10. Configuring PDGF PDGF DB XML • Schema configuration • Relational model • Tables, fields • Properties • Table size, characters, … • Generators • Simple generators • Metagenerators • Update definition • Insert, update, delete • Generated as change data capture <table name="SUPPLIER"> <size>${S}</size> <field name="S_SUPPKEY" size="" type="NUMERIC“ primary="true" unique="true"> <gen_IdGenerator /> </field> <field name="S_NAME" size="25" type="VARCHAR"> <gen_PrePostfixGenerator> <gen_PaddingGenerator> <gen_OtherFieldValueGenerator> <reference field="S_SUPPKEY" /> </gen_OtherFieldValueGenerator > <character>0</character> <padToLeft>true</padToLeft> <size>9</size> </gen_PaddingGenerator> <prefix>Supplier </prefix> </gen_PrePostfixGenerator> </field> [..] Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  11. Opportunities to Inject Data Skew in • Foreign key relations • E.g., L_PARTKEY • One fact table measures • E.g., L_Quantity • Single dimension hierarchy • E.g., P_Brand→ P_Category → P_Mfgr • Multiple dimension hierarchies • E.g., City → Nation in Supplier and Customer • Experimental methodology • One experiment series for each of the above • Comparison to original SSB • Comparison of index-forced, non-index, and automatic optimizer mode • SSB scale factor 100 (100 GB), x86 server Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  12. Skew in Foreign Key Relations • Very realistic • Easy to implement in PDGF • Just add a distribution to the reference • But! • Dimension attributes uniformly distributed • Dimension keys uncorrelated to dimension attributes •  Very limited effect on selectivity • Focus on attributes in selectivity predicates <distribution name="Exponential“ lambda="0.26235" /> Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  13. Skew in Fact Table Measure – Lo_Quantity • Lo_Quantity distribution • Values range between 0 and 50 • Originally uniform distribution with: • P(X=x)=0.02 • Coefficient of variation of 0.00000557 • Proposed skewed distribution with: • Query 1.1 • lo_quantity < x, x ∈ [2, 51] • Results • Switches too early to non-index plan • Switches too late to non-index plan • Optimizer agnostic to distribution Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  14. Skew in Single Dimension Hierarchy - Part • P_Category distribution • Uniform P(X=x)=0.04 • Skewed P(X=x)= 0.01 - 48.36 • Probabilities explicitly defined • Query 2.1 • Restrictions on two dimensions • Results uniform case • Index driven superior • Optimizer chooses non-index driven • Results skewed case • Switches too early to non-index plan Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  15. Skew in Multiple Dimension Hierarchies – S_City & C_City • Skewed S_City & C_City • Probabilites exponentially distributed • Query 3.3 • Restrictions on 3 dimensions • Variation on Supplier and Customer city • Results uniform and skewed cases • Automatic plan performs best • Cross over between automatic uniform and skewed too late Join Cardinality Elapsed Time Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  16. Conclusion & Future Work • PDGF implementation of SSB • Introduction of skew in SSB • Extensive performance analysis • Several interesting optimizer effects • Performance impact of skew • Future Work • Further analysis on impact of skew • Skew in query generation • Complete suite for testing skew effects Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  17. Thanks • Questions? • Download and try PDGF: • http://www.paralleldatageneration.org • (scripts used in the study available on website above) Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  18. Back-up Slides Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

  19. Configuring PDGF Generation • Generation configuration • Defines the output • Scheduling • Data format • Sorting • File name and location • Post processing • Filtering of values • Merging of tables • Splitting of tables • Templates (e.g. XML / queries) <table name="QUERY_PARAMETERS" exclude="false" > <output name="CompiledTemplateOutput" > [..] <template ><!-- inty = (fields [0]. getPlainValue ()).intValue (); intd = (fields [1]. getPlainValue ()).intValue (); intq = (fields [2]. getPlainValue ()).intValue (); String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR; buffer.append("-- Q1.1" + n); buffer.append("select sum(lo_extendedprice *"); buffer.append(" lo_discount) as revenue" + n); buffer.append(“ from lineorder , date" + n); buffer.append(“ where lo_orderdate = d_datekey" + n); buffer.append(“ and d_year = " + y + n); buffer.append(“ and lo_disc between " + (d - 1)); buffer.append(“ and " + (d + 1) + n); buffer.append(“ and lo_quantity < " + q + ";" + n); --></template > </output > </table > Rabl, Poess, Jacobsen, O'Neil, O'Neil - SSB Skew Variations

More Related