
Stats Gathering in an OLAP (Data Warehousing) Environment

Presentation Transcript


  1. Stats Gathering in an OLAP (Data Warehousing) Environment • Steve Catmull • Intermountain Healthcare • September 2009

  2. Class “Norms” • Last Call • Feel free to interrupt • I try to assume I know less than you • But we can learn together.

  3. Agenda • Background • Messy Statistics • DBMS_STATS in Action • The Need • Our Custom Stats Gathering Procedure • Why It’s Still In Use • Changes We’d Like to Make • 11gR1 Features of Interest

  4. Background

  5. Background • A 90-minute case study? • A 4-year journey • A lot of lessons learned. • Some risky behavior: no risk, no reward. • A lot more opportunities for improvement. • Dan Stober

  6. System Background • Focus: Data Warehouse (10gR2) • Hardware • 16 processors • 64 GB memory • 6TB of disk (extents allocated) • Usage • 450 TB of buffer read I/O per day

  7. Shocking Statement • Collect histograms on all columns and all tables • Well we don’t do that….

  8. Statistically Speaking…. • 3 Volunteers • “The statistics on sanity are that one out of every four Americans is suffering from some form of mental illness. Think of your three best friends. If they're okay, then it's you.” -- Rita Mae Brown

  9. Not that far from reality… • Our example may feel contrived or pointless • This is how 9i saw your data. • 10g still sees your data this way. • 11g offers some hope.

  10. Quick Optimizer Summary • Slide diagram: SQL feeds the Optimizer, which uses Statistics to produce an Explain Plan • Table stats: # Rows: 39222, # Blocks: 2832 • Column stats (Column Name: EMP_ID): # Distinct: 39222, # Null: 0, Not Null: TRUE

  11. Dictionary Review • DBA_TABLES

  13. Dictionary Review • DBA_TAB_COLUMNS
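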

  13. Notes on Partitioned Tables • Logical versus Physical • Physical table = non-partitioned table = 1 segment. • Logical table = partitioned table = 1:many segments.

  14. Note about Partitioned Tables • All of these could represent “table” statistics: • DBA_TABLES – Global • DBA_TAB_PARTITIONS • DBA_TAB_SUBPARTITIONS • When sub-partitioned • Partition and global stats are just logical groupings • All segments are sub-partitions • When partitioned only • Segments are partitions.
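
  As a rough illustration of how these levels map onto DBMS_STATS (schema, table, and partition names below are placeholders): the GRANULARITY parameter controls whether global, partition, or sub-partition statistics are gathered.

    BEGIN
      -- global stats only: results land in DBA_TABLES
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname => 'DW', tabname => 'SALES_FACT', granularity => 'GLOBAL');

      -- one partition: results land in DBA_TAB_PARTITIONS
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname => 'DW', tabname => 'SALES_FACT',
        partname => 'P_2009_09', granularity => 'PARTITION');
    END;
    /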

  15. Column Stats Gathered By Version • 8i • Min/max values for each column • NDV (number of distinct values) • 9i • Min/max values for each column • NDV (number of distinct values) • 10g • Histogram profile on all “workload” columns

  16. Quick Note on Histograms
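
  The slide itself was a graphic; as a stand-in, these dictionary queries show where 10g histogram metadata lands (owner, table, and column names are placeholders):

    -- which columns have histograms, and of what type
    SELECT column_name, histogram, num_buckets
      FROM dba_tab_columns
     WHERE owner = 'DW' AND table_name = 'SALES_FACT'
       AND histogram <> 'NONE';

    -- bucket endpoints for one column
    SELECT endpoint_number, endpoint_value
      FROM dba_tab_histograms
     WHERE owner = 'DW' AND table_name = 'SALES_FACT'
       AND column_name = 'ORDER_DT'
     ORDER BY endpoint_number;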

  17. DBMS_STATS in Action

  18. The New “Stats” • With the introduction of the cost-based optimizer, a new method of stats gathering (DBMS_STATS) was also introduced. • Now it feels “old” • But what does it do?

  19. Gathering Stats - Simplified • Probes the # of rows in the table with a small, crude sample. • Then samples the table and columns • Null & not-null counts • NDV (Number of Distinct Values) • Evaluates columns with sufficient “workload” to see if the data is skewed. If so, collects histograms on those columns. All other columns store only min and max values.

  20. Oracle’s Defaults • Sample Size • DBMS_STATS.AUTO_SAMPLE_SIZE • Gathering Method (METHOD_OPT) • ALL COLUMNS SIZE AUTO • Sample Method • Row Sample
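
  Spelled out as an explicit call, the defaults on this slide look roughly like the following; the table name is a placeholder, and the parameter values are simply the 10g defaults made visible:

    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => USER,
        tabname          => 'SALES_FACT',                  -- placeholder
        estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,   -- default sample size
        method_opt       => 'FOR ALL COLUMNS SIZE AUTO',   -- workload-driven histograms
        block_sample     => FALSE);                        -- row sampling
    END;
    /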

  21. Commentary on some parameters • Sample Size • Block or row sample • Discussion on advantages/disadvantages • Performance time • Data clustering • METHOD_OPT • Emphasis on workload.

  22. A Useful Workload View • To help us understand the “workload” factor, we built a view that we expose to all developers. • COLUMN_PREDICATE_USAGE

  23. Predicate Usage View Result

  24. Predicate Usage View - Source
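
  The actual view source was shown as an image. A minimal sketch of such a view, assuming it is built on the SYS.COL_USAGE$ table that records predicate usage per column; the view name matches the slide, but the join details are an approximation, and reading SYS objects requires extra privileges:

    CREATE OR REPLACE VIEW column_predicate_usage AS
    SELECT u.name        AS owner,
           o.name        AS table_name,
           c.name        AS column_name,
           cu.equality_preds,
           cu.equijoin_preds,
           cu.nonequijoin_preds,
           cu.range_preds,
           cu.like_preds,
           cu.null_preds,
           cu.timestamp  AS last_used
      FROM sys.col_usage$ cu
      JOIN sys.obj$  o ON o.obj#  = cu.obj#
      JOIN sys.col$  c ON c.obj#  = cu.obj# AND c.intcol# = cu.intcol#
      JOIN sys.user$ u ON u.user# = o.owner#;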

  25. The Need

  26. Open Environment • Analysts are free to run any query that parses; they are constrained only by Resource Manager. • We have over 800 distinct ad hoc query users in any given month. • We cannot gatekeep all their queries. • It’s a balance between information progression and performance perfection.

  27. Complex Queries • We have people who write queries that are hundreds, even thousands, of lines long • Combining many tables, probably in ways never intended. • We maintain an open and inviting relationship with our customers, so they come to us with performance problems. • This helps us recognize opportunities for improvement.

  28. Our Custom Stats Gathering Procedure

  29. Our Custom Procedure • Gathers at the table level • Version for running against the whole schema • STATS_TABLE(<table_name>) • Current user inferred • Tons of optional parameters to override or customize the experience.
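
  Purely illustrative call shapes: only STATS_TABLE(<table_name>) comes from the slide, and the override parameter below is hypothetical.

    EXEC STATS_TABLE('SALES_FACT')                       -- defaults; current user inferred
    EXEC STATS_TABLE('SALES_FACT', p_sample_pct => 5)    -- hypothetical override parameter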

  30. Time Machine: 2003 • Tremendous growth in our data warehouse • No systematic stats gathering • 2% sample size for all tables • Gather optimizer statistics only on indexed columns (min/max only) • Row sample

  31. Time Machine: 2003

  32. Time Machine: 2004-2005 • Little changes • Move from Row Sample to Block Sample • Start collecting on all columns with MIN/MAX.

  33. +/- with Block Sampling • Faster: a 2% sample of a table can be gathered more quickly by grabbing a bunch of rows in one block read. • But data is not usually randomly distributed across blocks, which leads to undercounting NDV. In the slide’s illustration, if this were the only block sampled, Oracle would assume there were just 2 order dates, with the min and max just one day apart.
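
  For reference, block sampling is a single boolean on the standard API; a sketch of the 2% block-sample gather described above (table name is a placeholder):

    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => USER,
        tabname          => 'ORDER_FACT',   -- placeholder
        estimate_percent => 2,
        block_sample     => TRUE);          -- sample whole blocks instead of rows
    END;
    /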

  34. +/- Min/Max on All Columns • Oracle now had end-points to constrain its analysis. • Great for out-of-bound queries. • However, within that range Oracle assumes an even distribution of distinct values. • Slide example: NDV = 8; stored range 1/1/2002 to 12/31/2082, real data only through 1/1/2012; the query WHERE dt BETWEEN 01-JAN-2007 AND 31-DEC-2007 gets a distorted estimate.

  35. Time Machine: 2006 • MAJOR overhaul. • Some Drivers • Multi-billion row table (2%?) • Inaccurate explain plans • Index stats going stale.

  36. What Our Procedure Does • Merges GATHER_TABLE_STATS and GATHER_INDEX_STATS. • Had to support replacement of a previous table stats routine with the signature ANALYZE_TABLE(<table_name>)

  37. What Our Procedure Does • Default behavior: • Gather stats on a table if its stats are more than 35 days old • Gather stats on an index if its stats are more than 35 days old • Collect histograms on all columns of larger tables (row threshold) • For partitioned tables, only the global stats get histograms, because of the gathering cost
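
  A sketch of the kind of staleness check behind the 35-day rule, assuming it reads LAST_ANALYZED from the dictionary (the schema name is a placeholder):

    SELECT table_name, partition_name, last_analyzed,
           ROUND(SYSDATE - last_analyzed) AS age_days
      FROM dba_tab_statistics
     WHERE owner = 'DW'                    -- placeholder schema
       AND (last_analyzed IS NULL OR last_analyzed < SYSDATE - 35);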

  38. Great Documentation

  39. Easily Configurable

  40. Flow Diagram - Table Stats • Validate parameters, including object existence • Adjust sample size based on table size • Check age of table stats in days (stale?) • Call DBMS_STATS • 720 lines of code; 400+ are error checking
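
  A compressed sketch of that flow, not the real 720-line procedure; the size thresholds, METHOD_OPT, and error handling here are invented for illustration:

    CREATE OR REPLACE PROCEDURE stats_table_sketch (p_table_name IN VARCHAR2) AS
      v_num_rows NUMBER;
      v_last     DATE;
      v_pct      NUMBER;
    BEGIN
      -- 1. validate: does the table exist for the current user?
      SELECT num_rows, last_analyzed INTO v_num_rows, v_last
        FROM user_tables WHERE table_name = UPPER(p_table_name);

      -- 2. staleness: skip if gathered within 35 days
      IF v_last IS NOT NULL AND v_last > SYSDATE - 35 THEN
        RETURN;
      END IF;

      -- 3. adjust sample size by table size (thresholds hypothetical)
      v_pct := CASE WHEN v_num_rows IS NULL OR v_num_rows < 1e6 THEN 100
                    WHEN v_num_rows < 1e8 THEN 10
                    ELSE 1 END;

      -- 4. gather
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname => USER, tabname => UPPER(p_table_name),
        estimate_percent => v_pct, block_sample => TRUE,
        method_opt => 'FOR ALL COLUMNS SIZE 254');
    EXCEPTION
      WHEN NO_DATA_FOUND THEN
        RAISE_APPLICATION_ERROR(-20001, 'Table not found: ' || p_table_name);
    END;
    /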

  41. Flow Diagram - Index Stats • Do we collect index stats? (default) • Get the list of indexes associated with the table • Check age of index stats in days (stale?) • Call DBMS_STATS • 240 lines of code
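
  And the index side, similarly compressed; the 35-day rule comes from the slides, while the loop body and table name are an approximation:

    BEGIN
      FOR ix IN (SELECT index_name
                   FROM user_indexes
                  WHERE table_name = 'SALES_FACT'          -- placeholder
                    AND (last_analyzed IS NULL
                         OR last_analyzed < SYSDATE - 35))
      LOOP
        DBMS_STATS.GATHER_INDEX_STATS(ownname => USER, indname => ix.index_name);
      END LOOP;
    END;
    /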

  42. Why It’s Still In Use

  43. Why Not 10g? • Are we just holding on to a relic? • Always a good question to ask. • 10g’s default stats method is good for many systems and in particular OLTP systems where changes to the data happen all day long. • OLAP is about bulk changes. • OLAP presents other challenges.

  44. Why It’s Still In Use • Oracle’s “workload” defaults are tied to the object ID • Lots of background loads: we create a new background table and swap tables, so the workload history doesn’t follow • ETL principle: gather stats at the time of modification • Not applicable to OLTP • We have loads that use other tables just loaded; if we let Oracle do the work after the fact, ETL could be inefficient.
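
  A minimal sketch of the ETL principle: gather stats as part of the load step, so downstream loads see accurate numbers (table names are placeholders):

    BEGIN
      -- load step (stand-in for the real ETL insert)
      INSERT /*+ APPEND */ INTO sales_fact_new SELECT * FROM sales_stage;
      COMMIT;

      -- gather immediately, before later loads read this table
      DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'SALES_FACT_NEW');
    END;
    /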

  45. Performance • Ran the default configurations against four of our most heavily used tables.

  46. Performance Continued • How many rows did each method estimate?

  47. Additions/Changes We’d Like to Make

  48. Improve On • Locking partition statistics, and actually using it • Gathering histograms on partitions/sub-partitions • If we can get compliance with locking finalized partitions, we could switch the sample size to AUTO_SAMPLE_SIZE • For smaller tables, switch the histogram METHOD_OPT to AUTO
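
  For context, 10g exposes stats locking at the table and schema level; a sketch with placeholder names (finer-grained partition-level locking arrives in later releases):

    BEGIN
      DBMS_STATS.LOCK_TABLE_STATS(ownname => 'DW', tabname => 'SALES_FACT');
    END;
    /
    -- to release later: EXEC DBMS_STATS.UNLOCK_TABLE_STATS('DW', 'SALES_FACT')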

  49. Feature Add • Sub-partition histograms • Gather only “stale” statistics by default, instead of the current 1% sample default • List of tables to let Oracle manage (OLTP)

  50. On the Horizon
