Stats Gathering in an OLAP (Data Warehousing) Environment • Steve Catmull • Intermountain Healthcare • September 2009
Class “Norms” • Last Call • Feel free to interrupt • I try to assume I know less than you • But we can learn together.
Agenda • Background • Messy Statistics • DBMS_STATS in Action • The Need • Our Custom Stats Gathering Procedure • Why It’s Still In Use • Changes We’d Like to Make • 11gR1 Features of Interest
Background • A 90 minute case study? • 4 year journey • A lot of lessons learned. • Some risky behavior. No risk. No reward. • A lot more opportunities for improvement. • Dan Stober
System Background • Focus: Data Warehouse (10gR2) • Hardware • 16 processors • 64 GB memory • 6TB of disk (extents allocated) • Usage • 450 TB of buffer read I/O per day
Shocking Statement • Collect histograms on all columns and all tables • Well, we don’t do that…
Statistically Speaking…. • 3 Volunteers • “The statistics on sanity are that one out of every four Americans is suffering from some form of mental illness. Think of your three best friends. If they're okay, then it's you.” -- Rita Mae Brown
Not that far from reality… • Our example may feel contrived or pointless • This is how 9i saw your data. • 10g still sees your data this way. • 11g offers some hope.
Quick Optimizer Summary • (Diagram) SQL goes to the optimizer, which uses dictionary statistics to produce the explain plan • Table stats: # rows: 39222, # blocks: 2832 • Column stats (EMP_ID): # distinct: 39222, # null: 0, NOT NULL: TRUE
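To make the diagram concrete, here is a minimal sketch of how those dictionary numbers drive a cardinality estimate. The EMP table and EMP_ID column are the ones from the diagram; the query itself is illustrative.

```sql
-- Illustrative: with num_rows = 39222 and num_distinct = 39222 on EMP_ID,
-- the optimizer estimates 39222 / 39222 = 1 row for an equality predicate.
EXPLAIN PLAN FOR
SELECT *
  FROM emp
 WHERE emp_id = 12345;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```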
Dictionary Review • DBA_TABLES
Dictionary Review • DBA_TAB_COLUMNS
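A quick way to look at these dictionary statistics yourself. The column names below are the standard 10g ones; the schema and table names are just placeholders.

```sql
-- Table-level statistics
SELECT table_name, num_rows, blocks, avg_row_len, sample_size, last_analyzed
  FROM dba_tables
 WHERE owner = 'DWH'              -- placeholder schema
   AND table_name = 'EMP';        -- placeholder table

-- Column-level statistics
SELECT column_name, num_distinct, num_nulls, density, histogram, last_analyzed
  FROM dba_tab_columns
 WHERE owner = 'DWH'
   AND table_name = 'EMP';
```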
Notes on Partitioned Tables • Logical versus Physical • Physical table = non-partitioned table = 1 segment. • Logical table = partitioned table = 1:many segments.
Note about Partitioned Tables • All of these could hold “table” statistics: • DBA_TABLES – global • DBA_TAB_PARTITIONS • DBA_TAB_SUBPARTITIONS • When sub-partitioned • Partitions and global stats are just logical groupings • All segments are sub-partitions • When partitioned only • Segments are partitions.
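For a partitioned table, the “same” statistics live at up to three levels. A sketch of where to look (the table name is a placeholder):

```sql
-- Global (logical) statistics
SELECT num_rows, last_analyzed
  FROM dba_tables
 WHERE table_name = 'SALES_FACT';        -- placeholder

-- Partition-level statistics
SELECT partition_name, num_rows, last_analyzed
  FROM dba_tab_partitions
 WHERE table_name = 'SALES_FACT';

-- Sub-partition-level statistics (only when the table is sub-partitioned)
SELECT subpartition_name, num_rows, last_analyzed
  FROM dba_tab_subpartitions
 WHERE table_name = 'SALES_FACT';
```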
Column Stats Gathered By Version • 8i • Min/max values for each column • NDV (number of distinct values) • 9i • Min/max values for each column • NDV (number of distinct values) • 10g • Everything gathered in 9i, plus a histogram profile on all “workload” columns
The New “Stats” • With the introduction of the cost-based optimizer, a new method of stats gathering was also introduced. • By now it feels “old” • But what does it actually do?
Gathering Stats - Simplified • Probes the # of rows in the table with a small, crude sample. • Then samples the table and columns • Null & not-null counts • NDV (Number of Distinct Values) • Evaluates columns with sufficient “workload” to see if the data is skewed. If so, collects histograms on those columns; all other columns just store min and max values.
Oracle’s Defaults • Sample size: DBMS_STATS.AUTO_SAMPLE_SIZE • Gathering method (METHOD_OPT): FOR ALL COLUMNS SIZE AUTO • Sample method: row sample
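For reference, here is a sketch of the 10g defaults written out as an explicit call. The schema and table names are placeholders; the parameter values shown are what the defaults amount to.

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'DWH',                         -- placeholder schema
    tabname          => 'SALES_FACT',                  -- placeholder table
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,   -- default sample size
    method_opt       => 'FOR ALL COLUMNS SIZE AUTO',   -- default gathering method
    block_sample     => FALSE,                         -- default: row sampling
    cascade          => DBMS_STATS.AUTO_CASCADE);      -- Oracle decides on index stats
END;
/
```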
Commentary on some parameters • Sample Size • Block or row sample • Discussion on advantages/disadvantages • Performance time • Data clustering • METHOD_OPT • Emphasis on workload.
A Useful Workload View • To help us understand the “workload” factor, we built a view that we expose to all developers. • COLUMN_PREDICATE_USAGE
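The slides don’t show the view’s definition. A minimal sketch of what such a view might look like, assuming it is built over Oracle’s internal column-usage tracking table sys.col_usage$ (the name COLUMN_PREDICATE_USAGE comes from the slide; the joins and column list below are illustrative):

```sql
CREATE OR REPLACE VIEW column_predicate_usage AS
SELECT o.owner,
       o.object_name               AS table_name,
       c.column_name,
       u.equality_preds,
       u.equijoin_preds,
       u.range_preds,
       u.like_preds,
       u.null_preds,
       u.timestamp                 AS last_used
  FROM sys.col_usage$  u
  JOIN dba_objects     o ON o.object_id  = u.obj#
  JOIN dba_tab_columns c ON c.owner      = o.owner
                        AND c.table_name = o.object_name
                        AND c.column_id  = u.intcol#
 WHERE o.object_type = 'TABLE';
```

Exposing this to developers lets them see which columns the optimizer has actually observed in predicates, i.e. the “workload” that SIZE AUTO reacts to.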
Open Environment • Analysts are free to run any query that parses and is not governed by resource manager. • We have over 800 distinct ad hoc query users in any given month. • We cannot gatekeep all their queries • It’s a balance between information progression and performance perfection.
Complex Queries • We have people who write queries that are hundreds or even thousands of lines long. • Combining many tables, probably in ways never intended. • We maintain an open and inviting relationship with our customers, so that they come to us with performance problems. • This helps us recognize opportunities for improvement.
Our Custom Procedure • Gathers at the table level • Version for running against the whole schema • STATS_TABLE(<table_name>) • Current user inferred • Tons of optional parameters to override or customize the experience.
Time Machine: 2003 • Tremendous growth in our data warehouse • No systemic stat gathering • 2% sample size for all tables • Gather optimizer statistics only on indexed columns (min/max only) • Row Sample
Time Machine: 2004-2005 • Small changes • Moved from row sampling to block sampling • Started collecting MIN/MAX on all columns.
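A sketch of what that 2004-2005 configuration looks like as a DBMS_STATS call. The schema and table names are placeholders; 'FOR ALL COLUMNS SIZE 1' is the standard way to get min/max on every column without histograms.

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'DWH',                       -- placeholder schema
    tabname          => 'SALES_FACT',                -- placeholder table
    estimate_percent => 2,                           -- 2% sample
    block_sample     => TRUE,                        -- block sample instead of row sample
    method_opt       => 'FOR ALL COLUMNS SIZE 1');   -- min/max on every column, no histograms
END;
/
```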
+/- with Block Sampling • Faster: a 2% sample can be gathered by grabbing a bunch of rows in one block read instead of scattered single-row reads. • But data is usually not randomly distributed across blocks, which leads to undercounting NDV. • (Diagram) Example: if the only block sampled happened to hold rows from just two order dates, Oracle would conclude there were only 2 distinct order dates, with min and max one day apart.
+/- Min/Max on All Columns • Oracle now had end-points to constrain its analysis. • Great for out-of-bounds predicates. • However, within that range, Oracle assumes an even distribution of the distinct values. • (Diagram) Example: NDV = 8; real data range shown with dates 1/1/2002, 12/31/2082, and 1/1/2012; query: WHERE dt BETWEEN 01-JAN-2007 AND 31-DEC-2007
Time Machine: 2006 • MAJOR overhaul. • Some Drivers • Multi-billion row table (2%?) • Inaccurate explain plans • Index stats going stale.
What Our Procedure Does • Merges GATHER_TABLE_STATS and GATHER_INDEX_STATS. • Had to support replacement of a previous table stats routine with the signature ANALYZE_TABLE(<table_name>)
What Our Procedure Does • Default Behavior • Gather stats on the table if its stats are more than 35 days old. • Gather stats on an index if its stats are more than 35 days old. • Collect histograms on all columns of larger tables (row threshold). • For partitioned tables, only the global stats get histograms, because of the gathering cost.
Flow Diagram-Table Stats • Validate parameters, including object existence • Adjust sample size based on table size • Check age of table stats in days (stale?) • Call DBMS_STATS • ~720 lines of code; 400+ are error checking
Flow Diagram-Index Stats • Do we collect index stats? (default) • Get the list of indexes associated with the table • Check age of index stats in days (stale?) • Call DBMS_STATS • ~240 lines of code
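The slides describe the flow but not the code. A heavily simplified, illustrative sketch of the table-stats branch: the 35-day staleness check and the "histograms on larger tables" rule come from the earlier slide, while the specific sample sizes, row thresholds, and helper names here are made up (the real procedure is ~720 lines and assumes DBA-level privileges).

```sql
CREATE OR REPLACE PROCEDURE stats_table (p_table_name IN VARCHAR2) IS
  v_owner         VARCHAR2(30) := USER;      -- current user inferred
  v_num_rows      NUMBER;
  v_last_analyzed DATE;
  v_sample_pct    NUMBER;
  v_method_opt    VARCHAR2(100);
BEGIN
  -- 1. Validate parameters, including object existence
  --    (raises NO_DATA_FOUND if the table does not exist)
  SELECT num_rows, last_analyzed
    INTO v_num_rows, v_last_analyzed
    FROM dba_tables
   WHERE owner = v_owner
     AND table_name = UPPER(p_table_name);

  -- 2. Skip if stats are less than 35 days old
  IF v_last_analyzed IS NOT NULL AND v_last_analyzed > SYSDATE - 35 THEN
    RETURN;
  END IF;

  -- 3. Adjust sample size based on table size (thresholds are illustrative)
  v_sample_pct := CASE WHEN NVL(v_num_rows, 0) > 1e9 THEN 0.1
                       WHEN NVL(v_num_rows, 0) > 1e6 THEN 1
                       ELSE 10 END;

  -- 4. Histograms on all columns of larger tables (row threshold is illustrative)
  v_method_opt := CASE WHEN NVL(v_num_rows, 0) > 1e6
                       THEN 'FOR ALL COLUMNS SIZE 254'
                       ELSE 'FOR ALL COLUMNS SIZE 1' END;

  -- 5. Call DBMS_STATS
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => v_owner,
    tabname          => UPPER(p_table_name),
    estimate_percent => v_sample_pct,
    block_sample     => TRUE,
    method_opt       => v_method_opt,
    cascade          => FALSE);   -- index stats handled by the separate index branch
END stats_table;
/
```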
Why Not 10g? • Are we just holding on to a relic? • Always a good question to ask. • 10g’s default stats method is good for many systems and in particular OLTP systems where changes to the data happen all day long. • OLAP is about bulk changes. • OLAP presents other challenges.
Why It’s Still In Use • Oracle’s “workload” defaults • Column-usage tracking is based on object ID • Lots of background loads: create a new background table and swap tables, so the workload history doesn’t carry over • ETL principle: gather stats at the time of modification • Not applicable to OLTP. • We have loads that use other tables just loaded; if we let Oracle do the work after the fact, ETL could be inefficient.
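A sketch of the ETL principle, using hypothetical staging table names and the STATS_TABLE wrapper described earlier:

```sql
BEGIN
  EXECUTE IMMEDIATE 'TRUNCATE TABLE stage_orders';   -- placeholder staging table

  INSERT /*+ APPEND */ INTO stage_orders
  SELECT * FROM orders_src;                          -- placeholder source
  COMMIT;

  -- Gather stats now, while the load still "owns" the table, so the next
  -- ETL step that joins to STAGE_ORDERS gets accurate estimates instead of
  -- waiting for Oracle's after-the-fact maintenance job.
  stats_table('STAGE_ORDERS');
END;
/
```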
Performance • Ran the default configurations against 4 of our most heavily used tables.
Performance Continued • How many rows did each method estimate?
Improve On • Locking partition statistics, and actually using it • Gathering histograms on partitions/sub-partitions • If we can get compliance with locking finalized partitions, we could switch the sample size to AUTO_SAMPLE_SIZE • For smaller tables, switch the histogram method_opt to AUTO
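A sketch of the locking idea, with placeholder names. LOCK_TABLE_STATS is available in 10g; partition-level locking via LOCK_PARTITION_STATS is an 11g addition.

```sql
-- 10g: lock everything for a table whose history is finalized
BEGIN
  DBMS_STATS.LOCK_TABLE_STATS(ownname => 'DWH', tabname => 'SALES_FACT');
END;
/

-- 11g: lock just a finalized partition, leaving current partitions open
BEGIN
  DBMS_STATS.LOCK_PARTITION_STATS(
    ownname  => 'DWH',
    tabname  => 'SALES_FACT',
    partname => 'SALES_2009_08');   -- placeholder partition name
END;
/
```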
Feature Add • Sub-partition histograms • Make use of “stale”-based stats gathering by default, instead of the 1% default • A list of tables to let Oracle manage (OLTP-style tables)