Best Practices for Query Performance In a Data Warehouse

Best Practices for Query Performance In a Data Warehouse Calisto Zuzarte IBM calisto@ca.ibm.com Session Code: D09 May 13, 2010 8:30AM–9:30AM Platform: Linux, Unix and Windows

Data Warehouse Life Cycle • Database design / Application design • The Warehouse Application architects and Database Administrators work together to design the queries and schema before they put the application in production • Database performance layer implementation • In order to meet SLAs, DBAs usual go through some iterations augmenting the database with performance layer objects and set up the initial configuration to get good performance • Database tuning operations • During production, with changing requirements and change in data, there is on-going tuning required to keep operations smooth.

Motivation • Data warehouse environments characteristics: • Large volumes of data • Millions/Billions of rows involved in some tables • Large amounts of data rolled-in and rolled-out • Complex queries • Large Joins • Large Sorts, • Large amounts of Aggregations • Many tables involved • Ad Hoc Queries • It is important to pay attention to query performance

Objectives • Provide recommendations so that you can improve data warehouse query performance • Database Design considerations • Application Design considerations • Performance Layer Considerations • Ongoing Tuning Considerations

Agenda • Best Practices – Database Design • Best Practices – Application Design • Best Practices – Performance Layer • Best Practices – Configuration and Operations

Best Practices – Database Design • Best Practices - Parallelism • Inter-partition Shared nothing parallelism • Intra-Query Parallelism (SMP) • Best Practices - Partitioning • Database Partitioning • Table Partitioning • Multi-Dimension Clustering • UNION ALL Views • Best Practices – Schema • Best Practices - Compression

Best Practices - Parallelism • Database partition feature (DPF) is recommended • To achieve parallelism in a data warehouse • For scalability and query performance • SMP (Intra-Query Parallelism) not recommended • In concurrent multi-user environments with heavy CPU usage • SMP recommended • When CPUs are highly under utilized • When DPF is not an option

Partitioning (Complimentary Strategies in DB2) • Database Partitioning (DPF) … DISTRIBUTE BY HASH • Key Benefit : Better scalability and performance through parallelism • Multidimensional Clustering (MDC) … ORGANIZE BY DIMENSION • Key Benefit : Better query performance through data clustering • Table (Range) Partitioning … PARTITION BY RANGE • Table Partitioning • Key Benefit : Better data management (roll-in and roll-out of data) • UNION ALL Views • Key Benefit : Independent branch optimization

Divide And Conquer ! Distribute, Partition, Organize ! Organize By Partition By Distribute By

Best Practices – Database Partitioning • Collocate the fact and largest dimension • Choose to avoid significant skew on some partitions • Avoid DATE dimension where active transactions for current date all fall on one database partition (TIMESTAMP is good) • Possibilities for workload isolation for data marts • Different partition groups but common dimension tables • Needs replicated tables (discussed later)

Best Practices – Table Partitioning • Recommend partitioning the fact tables • Typically based on DATE dimension • Works better with application key predicates applied directly • Table or Range Partitioning • Recommend table or range partitioning (V9.7 :partitioned indexes) • Choose partitioning based on roll-in / roll-out granularity • UNION ALL Views • Each branch optimized independently • Use with well designed applications (Dangers of materialization) • Large number of branches require time and memory to optimize • Needs predicates with constants for branch elimination

Best Practices – Multidimensional Clustering (MDC) • Recommend defining MDC on the fact table • Guaranteed clustering (Avoids the need to REORG for clustering) • I/O optimization • Compact indexes (compact, coexists with regular indexes) • Choose dimensions based on query predicates • Recommend the use of 1 to 4 dimensions • Need to ensure dimensions are chosen such that they do not waste storage • Could choose a finer granularity of Table partitioning range • For example: Table partition range by month, MDC by date

Star Schema STORE PRODUCT Store_id Region_id … Product_id Class_id Group_id Family_id Line_id Division_id … SALES Product_id Store_id Channel_id Date_id Amount Quantity … TIME CHANNEL Date_id Month_id Quarter_id Year_id Channel_id …

Dimension Hierarchy Product Dimension Division Level 5 Line Level 4 Time Dimension Year Family Level 3 Quarter Group Level 2 Store Dimension Month Retailer Class Level 1 Channel Dimension Date Product Channel Store Level 0 Sales Fact

Best Practices - Schema • Surrogate Keys • As far as possible use application keys themselves • allows predicates to be applied/transferred directly on the fact table • DATE is a good candidate (easier to roll-in/roll-out and for MDC ) • Star Schema / Snowflakes • Separate tables for each dimension hierarchy (snowflake) may result in a large number of joins • Flattened dimensions may contain a lot of redundancy (space) • Define Columns NOT NULL when appropriate • Many optimizations that are done based on NOT NULL • Define Uniqueness when appropriate • Primary Keys / Unique Constraints / Unique Indexes

Compression • Table, Index and Temp Table compression • Huge benefits with storage savings • With table and TEMP compression 30-70% • With Index compression 30-40% • Performance gains because • Less I/O and better use of bufferpools • TEMP table compression helps operators like Hash Join, Merge Join, Sorts and Table Queues if they spill

Best Practices - Compression • Consider compression particularly with the fact table • Strongly recommend compression on the fact table when not CPU bound

Best Practices – Application Considerations • Use constants instead of expressions in the query • Example • WHERE DateCol <= CURRENT DATE – 5 • Use VALUES(CURRENT DATE – 5) to get the resulting constant first and use it in the query • Avoid expressions on indexed columns • Example • WHERE DATECOL – 2 DAYS > ‘2009-10-22’ • WHERE DATECOL > ‘2009-10-22’ + 2 DAYS

Best Practices – Application Considerations • Avoid mixing data types in join predicates • Example • WHERE IntegerCol = DecimalCol • Use Global Temporary Tables to split a query if it contains more than 10-15 tables • Reduces optimization time

Best Practices – Performance Layer • Indexes • Statistics • Distribution Statistics • Column Group Statistics • Statistical Views • Constraints • Referential Integrity • Materialized Query Tables • Replicated Tables

Indexes • Indexes are a vertical subset of the data in the table • Indexes provide ORDER • Indexes may allow for clustered access to the table

Index Considerations • To get Index Only Access instead of more expensive ISCAN-FETCH or TSCAN (Table Scan) • To avoid SORTs particularly those that spill • To promote index-ORing and index-ANDing • To promote Star Joins • When you have range join predicates • Better possibilities with Nested Loop Join • Indexes for clustering (MDC)

Cardinality Estimation • Estimating the size of intermediate results is critical to getting good query execution plans • Without sufficient information, the optimizer can only guess based on some assumptions • Data skew and statistical correlation between multiple column values introduce uncertainty • Pay attention to DATE columns

Best Practices - Statistics • Collect distribution Statistics when there is skew and predicates use constants • Consider a high number of quantile statistics on columns with DATE range predicates and character string columns

Column Group Statistics 1 2 3 • Example: COUNTRY = ‘Germany’ And CITY = ‘Frankfurt’ • No CGS: Selectivity = ½ * 1/3 =1/6 …Estimate 1 row • With CGS: Selectivity = 1/3 …Estimate 2 rows

Problem Scenario - Skew 10000000 rows CUST Table 100 rows, 100 custids Frequency Statistics – SALES Table SELECT … FROM SALES, CUST WHERE CUST.CNAME = ‘IBM’ AND CUST.CUSTID = SALES.CUSTID Cardinality Estimate with Uniformity = 100,000 Actual Cardinality : 2,000,000 !!!!!!!!!!!!!!!!!!!!

Best Practices - Statistics • Collect Column Group Statistics with multiple predicates on the same table • WHERE Country = ‘CANADA’ and City = ‘TORONTO’ RUNSTATS … ON ALL COLUMNS AND ON COLUMNS ((country, city) … ) … • Consider Statistical Views when • There is skew on the join column • There is a significant difference in the range of values in the fact and the dimension CREATE VIEW SV1 AS (SELECT C.* FROM CUST C, FACT F WHERE C.CUST_ID = F.CUST_ID) ALTER VIEW cust_fact ENABLE QUERY OPTIMIZATION RUNSTATS ON TABLE dba.cust_fact WITH DISTRIBUTION

Referential Integrity (RI) • Facilitates aggregation push down • Example in the appendix section • Eliminates redundant joins in views • RI helps determine that queries that do not require data from a primary key table need not do that join even if it is in the view • Helps with Materialized Query Table matching • Allows Queries to match MQTs with more dimension table joins

GB GB GB GB JOIN JOIN JOIN JOIN JOIN JOIN JOIN JOIN Dim2 Dim2 Dim2 Dim2 Fact Fact Fact Fact Dim1 Dim1 Dim1 Dim1 Consider Materialized Query Tables Joe’s Query Q9 Q9 Sue’s Joe’s Q Bob’s Q GB Sue’s Query JOIN JOIN Dim2 Bob’s Query MQT Fact Dim1

Best Practices - Defining Materialized Query Tables • What MQTs should I define ? • Estimate the size of the candidate MQTs by executing COUNT queries against base tables. • Try to achieve at least a 10X reduction in size between fact and the MQT • Build MQTs with a reasonable number of GROUP BY columns (3 to 6 dimension keys) at a time based on query patterns • As far as possible build the MQT from the fact table alone • Use Table Partitioning for the fact table and the MQTs

Best Practices - MQT Matching • Define Referential Integrity to help with matching MQTs that contain more tables than the queries • Define Functional Dependencies for thinner MQTs • Use COUNT_BIG instead of COUNT for DPF MQTs • Define indexes on MQTs • Keep statistics up-to-date • Define base table columns NOT NULL as far as possible • For example we can match SUM(A + B) with SUM(A) + SUM(B)

Best Practices – MQT Maintenance • REFRESH IMMEDIATE • Create an index on the GROUP BY columns • Create the index on the set of columns that form a unique key • Always keep the base table and MQT statistics up-to-date • REFRESH DEFERRED • If log space is an issue, consider NOT LOGGED INITIALLY or LOAD from cursor • An MQT can be temporarily toggled into a regular table by using • ALTER TABLE … DROP MATERIALIZED QUERY • ALTER TABLE … ADD MATERIALIZED QUERY • Use ATTACH / DETACH if fact table and MQT are range partitioned tables

Replicated Tables • Replicate dimension tables (unless collocated with fact ) • Benefit : Avoids data movement • Important : Define suitable indexes • If too large, replicate a subset of frequently used columns JOIN JOIN JOIN BTQ BTQ BTQ CUST CUST COPY SALES CUST COPY SALES CUST COPY SALES

Best Practices – Configuration • Optimization Level 5 • Registry Variables • DB2_ANTIJOIN=EXTEND • If slow queries have NOT EXISTS, NOT IN predicates • DB2_REDUCED_OPTIMIZATION=YES • If compile time is an issue • Configuration thumb rules • BUFFPOOL ~= SHEAPTHRES • SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)

Best Practices - Statistics • The DB2 Query Optimizer relies on reasonably accurate statistics to get a good query plans • User runs RUNSTATS when data changes (part of ETL) • Statistics Fabrication (unreliable) • DB2 keeps UPDATE / DELETE / INSERT counters • Fabrication limited to a few statistics – Not enough • Consider configuring Automatic Statistics • Automatically collects statistics on tables in need • Runs in the background as a low priority job • Consider configuring Real Time Statistics • Collects statistics on-the-fly

Summary – Best Practices • Database Design : • Parallelism, Partitioning, Schema, Compression • Application Design • SQL Tips • Performance Layer • Indexes, Statistics, Referential Integrity, Materialized Query Tables, Replicated Tables • Configuration and Operations • Configuration, Collecting Statistics

Calisto Zuzartecalisto@ca.ibm.com

Best Practices for Query Performance In a Data Warehouse

Best Practices for Query Performance In a Data Warehouse

Presentation Transcript

Design a Data Center: Best Practices for Designing a Data Ce

Best Practices in Fast Track, Parallel Data Warehouse, and Traditional Data Warehouse Design

Fast Track Data Warehouse V3.0 New Features and Best Practices

Building a Data Warehouse

Designing a Data Warehouse

DEV411 ASP.NET: Best Practices For Performance

Best Options for Expense Reporting in the Data Warehouse

Best Practices in Performance Testing

Best Practices in Presenting Data

Best Practices for Building a High Performance Board

BEST PRACTICES FOR A COLLABORATIVE APPROACH TO DATA

SQL Server Query Tuning Best Practices

Best Options for Expense Reporting in the Data Warehouse

Web Performance Best Practices

A Data Warehouse for Canadian Literature

Performance – Some Best Practices

SQL Server Query Tuning Best Practices

Populating a Data Warehouse

Best Practices for Ensuring LAN Performance

Sales Performance Best Practices

Best Practices for Data Warehousing

Best Practices for Data Warehousing