580 likes | 708 Views
Using Statspack in Oracle8i and 9i to Identify Problems. Ian Jones Database Specialists, Inc. www.dbspecialists.com. Session Topics. Statspack introduction and features Mechanics installing generating snapshots producing reports Discussion of the generic report Examples.
E N D
Using Statspack in Oracle8i and 9i to Identify Problems Ian Jones Database Specialists, Inc. www.dbspecialists.com
Session Topics Statspack introduction and features Mechanics installing generating snapshots producing reports Discussion of the generic report Examples
Session Topics Statspack introduction and features Mechanics installing generating snapshots producing reports Discussion of the generic report Examples
What is Statspack? • An Oracle provided set of SQL*Plus scripts and a PL/SQL package that allows the convenient collection, automation, storage and reporting of performance and diagnostic data • A PERFSTAT schema containing 42 ‘stats$’ tables and a PL/SQL package ‘statspack’ • Replacement for utlbstat/utlestat
Overview of How Statspack Works • Oracle instances constantly update lots of internal statistics, most visible through the v$ views e.g. system statistics, wait events and SQL activity, etc (timed_statistics, resource_limit, 9i statistics_level) • Using ‘statspack.snap’ we save away these values from 34 v$ views into stats$ tables when desired • Then we run the statspack report script ‘spreport.sql’ which calculates and displays the differences between any two sets of statistics • Straightforward and effective
What Questions Can Statspack Answer? What work load is the database under now? What activities/events are we waiting for? Which SQL is consuming most resources? Which segments are most problematic? Where is the I/O, and are we CPU bound? How does all this compare with earlier data? Statspack provides diagnostic data to solve problems.
Why Use Statspack? Simple and quick to install and use Provided with all editions version 8.1.6+ Written by Oracle - in sync with RDBMS Small system overhead (varies with level) Source code is available for review Snapshot data held in tables and available for historical or custom analysis
Replacement For (utl)bstat/estat • Statspack has an improved design over bstat/estat • Flexible reporting because data held in tables • Different levels of data collection • User defined thresholds • Wider range of data • SQL statements • Wait events • Segment statistics (9.2) • Bstat/estat not updated with new features
Statspack Main Files • Set of 19 files named sp* (stat* in 8.1.6) located in $ORACLE_HOME/rdbms/admin • spdoc.txt – Good description of mechanics • spcreate.sql – Sqlplus installation script • spreport.sql – Generic reporting script • sprepsql.sql – Explain plan report script • spauto.sql – Creates dbms_job to automate data collection (job_queue_processes>0)
Session Topics Statspack introduction and features Mechanics installing generating snapshots producing reports Discussion of the generic report Examples
Installation • Run the ‘spcreate.sql’ script using SQL*Plus as user SYS. User PERFSTAT is created by this script, owning all objects needed by the statspack package. • E.g. On Unix: cd $ORACLE_HOME/rdbms/admin sqlplus “/ as sysdba” @spcreate.sql • To set up automatic collection of data every hour: cd $ORACLE_HOME/rdbms/admin sqlplus perfstat/<pwd> @spauto.sql
Snapshots A single set of performance data captured using the statspack PL/SQL package: Begin perfstat.statspack.snap(i_snap_level=>6); End; Different snapshot levels determine data captured: Level = 0 General performance statistics (8i,9i) Level = 5 SQL Statements (default) (8i,9i) Level = 6 SQL Plans (9i) Level = 7 Segment statistics (9.2) Level = 10 Parent and Child latches (8i,9i)
Generic Report (spreport.sql) Generates a report between any two snapshots as long as the instance was not restarted between the snapshots sqlplus perfstat/<pwd> @spreport.sql Enter the start and end snapshot id’s and optionally enter the output file name (or accept the default sp_<b>_<e>.lst)
Session Topics Statspack introduction and features Mechanics installing generating snapshots producing reports Discussion of the generic report Examples
Sections of the Generic Report • Context 0 • Cache Sizes 0 • Load Profile 0 • Instance Efficiency 0 • Timed/Wait Events (renamed now includes CPU time) 0 • SQL (Buffer Gets/Disk Reads/Executions/Parses) 5 • Instance Statistics 0 • Tablespace and Datafile IO 0 • Buffer Pool Statistics 0 • Rollback Activity 0 • Latch Statistics 0,10 • Segment Statistics (introduced in 9.2) 7 • Library Cache Statistics 0 • SGA Pool Breakdown 0 • Instance Parameters 0
Context/Cache Sizes DB Name DB Id Instance Inst Num Release Cluster Host ------- -------- -------- -------- --------- ------- ------- HAW1 39997887 haw1 1 9.2.0.1.0 NO HAWKING Snap Id Snap Time Sessions Curs/Sess ------- ------------------ -------- --------- Begin Snap: 32 24-Oct-02 16:45:54 10 5.2 End Snap: 33 24-Oct-02 16:46:32 10 5.2 Elapsed: 0.63 (mins) Cache Sizes (end) ~~~~~~~~~~~~~~~~~ Buffer Cache: 36M Std Block Size: 8K Shared Pool Size: 12M Log Buffer: 512K
Load Profile Per Second Per Transaction Redo size: 77,138.42 2,931,260.00 Logical reads: 765.50 29,089.00 Block changes: 565.58 21,492.00 Physical reads: 6.39 243.00 Physical writes: 11.76 447.00 User calls: 0.11 4.00 Parses: 280.47 10,658.00 Hard parses: 266.45 10,125.00 Sorts: 15.29 581.00 Logons: 0.00 0.00 Executes: 293.29 11,145.00 Transactions: 0.03 % Blocks changed per Read: 73.88 Recursive Call %: 99.99 Rollback per trans %: 0.00 Rows per Sort: 18.96
Load Profile - Comments • Excellent summary of instance workload based on selected v$sysstat statistics • Problems easier to see if data from a previous baseline is available - are we performing more IO? • Difficult to set upper limits due to hardware and system variation – rough guidelines • Logical reads > 10,000 per 100MHz CPU per second • Physical reads > 100 per disk per second • Hard parses, soft parses > 100, 300 per second • Focus on parse (consider cursor_sharing and session_cached_cursors) and IO rates
Cursor_sharing = force Per Second Per Transaction Redo size: 189,173.33 2,837,600.00 Logical reads: 1,572.40 23,571.00 Block changes: 1,384.87 20,773.00 Physical reads: 15.73 236.00 Physical writes: 30.53 458.00 User calls: 0.27 4.00 Parses: 667.20 10,008.00 Hard parses: 19.27 289.00 Sorts: 3.73 56.00 Logons: 0.00 0.00 Executes: 669.79 10,046.00 Transactions: 0.07 % Blocks changed per Read: 88.13 Recursive Call %: 99.96 Rollback per trans %: 0.00 Rows per Sort: 20.13
Instance Efficiency Instance Efficiency Percentages (Target 100%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Buffer Nowait %:100.00Redo NoWait %:100.00 Buffer Hit %: 99.98 In-memory Sort %: 99.48 Library Hit %: 76.14 Soft Parse %: 5.00 Execute to Parse %: 4.37 Latch Hit %:100.00 Parse CPU to Parse Elapsd %: 97.73 % Non-Parse CPU: 23.35 Underlined items have good corresponding wait events Shared Pool Statistics Begin End ----- ----- Memory Usage %: 94.08 93.54 % SQL with executions>1: 76.37 54.90 % Memory for SQL w/exec>1: 62.10 61.01
Instance Efficiency - Comments • Pre-computed ratios can highlight problems but may be misleading when using small intervals or after restarts – check actual values for significance • Seeming good ratios can still cause problems. Practical range of ratios differ greatly. 90-100% Buffer/redo nowaits, Latch, Sorts 50-100% Library Cache 0-100% Parse, Buffer Hit • Correlate ratios with wait events where possible • Shared pool usage should settle down to 80-90% if >90% check binds and reloads
Top 5 Timed Events Most valuable section of generic report 9.2 includes ‘CPU Time’ besides waits events (issues if resource_limit=false) Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Ela Time ------------------------------ ----- -------- -------- CPU time 30 91.43 direct path read 95 1 3.53 control file sequential read 54 1 2.33 log file parallel write 62 0 .95 db file parallel write 20 0 .68 ---------------------------------------------------------
Incorporation of CPU Time • Pre 9.2 Top 5 wait events = wait time * 100 --------------------------- Sum of all wait times • 9.2 Top 5 timed events = (wait or CPU time) * 100 ------------------------------------------ Sum of all wait times + CPU time
Wait Events - Comments A very important diagnostic provided by Oracle. The major ‘jumping off point’ if the elapsed times are a significant proportion of the interval time (i.e. if most of the time is not spent in idle waits) See Reference Guide for details of each wait Common I/O related waits:- Db file sequential read – Index reads or scans Db file scattered read – Full table scans Direct path read/write – Temp IO Log related waits - IO, switches, buffer
Wait Events – Where to Jump? • Db file * read ->SQL by buffer gets/disk reads, File IO stats • CPU Time -> Parse rates, Sorts, SQL executions, SQL buffer gets/disk reads, SMP processes(bugs) • Direct path reads/writes -> Sorts, Hash joins, hash/sort_area_size, File IO Stats • Buffer busy waits -> Buffer pool, Buffer waits, File IO stats, Segment statistics • Other important wait events (e.g. latches, enqueues) have corresponding statspack sections to themselves
SQL Section • Four sections of “worst SQL” ranked by buffer gets, disk reads, executions, parse counts. SQL ordered by Gets for DB: HAW1 Instance: haw1 Snaps: 117 -118 CPU Elapsd Buffer Gets Execs Gets per Exec %Total Time(s) Time(s) Hash Value ----------- ----- ------------- ------ ------- ------- ---------- 13,192 1 13,192 74.2 1.83 8.76 3097336866 Module: SQL*Plus SELECT * FROM policies WHERE policy_type = :b1
SQL Section - Comments Sub optimal SQL is the most common source of database problems. “Can we get the same results by consuming fewer resources?” SQL ranked by total numbers, often the ‘number per execution’ is more useful What is our current execution plan and has it changed recently? Second statspack report available (9i, level >= 6) sqlplus perfstat/<pwd> @sprepsql.sql This report provides breakdown across snapshots based on SQL hash value. Reveals changing execution plans (see later example)
Segment Statistics • Historically difficult to isolate segment specific data, new 9.2 view v$segstat greatly simplifies this Top 5 Logical Reads per Segment for DB -> End Segment Logical Reads Threshold: 10000 Obj. Logical Owner Tablespace Object Name Type Reads %Total ----- ---------- --------------------- ----- ------- ------ TB TAB1 ANALYSIS_COMMON_RESU TABLE 106,416 24.35 TB TAB1 ANALYSIS_TESTS TABLE 103,744 23.74 TB TAB1 SAMPLES TABLE 40,736 9.32 TB IND1 SAMPLES_UK1 INDEX 18,688 4.28 TB TAB1 ANALYSIS_RESULTS_PK INDEX 18,032 4.13 -------------------------------------------------------------
Instance/Session Statistics • Instance Statistics always included in report, we can also include session statistics for a single session if desired (i_session_id=>10) • Useful for validating ratios & obscure stats Instance Activity Stats for DB: HAW1 Instance: haw1 Statistic Total per Second per Trans ------------------------ ----- ----------- --------- CPU used by this session 1,605 16.2 1,605.0 parse time cpu 8 0.1 8.0 parse time elapsed 8 0.1 8.0
Tablespace and Datafile IO Tablespace IO Stats for DB: Instance: PAYROLL ->ordered by IOs (Reads + Writes) desc Tablespace Filename ---------- ------------------------------------------------- Av Av Av Av Buffer Av Buf Reads Reads/s Rd(ms) Blks/Rd Writes Writes/s Waits Wt(ms) ------- ------- ------ ------- ------ -------- ------ ------ PAY_6 /u01/oradata/payroll/PAY_6_1.dbf 438,860 638 4.8 7.4 10 0 5,750 9.7
Buffer Pool and Buffer Waits Buffer Pool Statistics for DB: NETMON Instance: netmon -> Pools D: default pool, K: keep pool, R: recycle pool Free Write Buffer Buffer Consistent Physical Physical Buffer Complete Busy P Gets Gets Reads Writes Waits Waits Waits - --------- ---------- --------- -------- ------ -------- -------- D 4,859,734 4,765,667 4,755,716 1,740 0 4 8,333 ------------------------------------------------------------------ Buffer wait Statistics for DB: NETMON Instance: netmon -> ordered by wait time desc, waits desc Tot Wait Avg Class Waits Time (cs) Time (cs) ------------ ----- --------- --------- data block 8,375 8,000 1 undo block 4 1 0 --------------------------------------
Buffer Pool and Buffer Waits • 9i report includes hit ratio per pool in 8i we have to calculate it manually 100*(1-physical/buffer gets) • If significant free buffer waits or write buffer waits it implies that db writer is not keeping up with the buffer pool throughput. • Busy buffer waits indicate multi process contention for a block. Check data class and reduce contention (e.g. reverse key indexes, fewer rows per block, freelists, initrans, more rollbacks, etc.)
Latches Latch Activity for DB: Pct Avg Pct Get Get Slps NoWait NoWait Latch Name Requests Miss /Miss Requests Miss ----------------------- --------- ---- ----- --------- ------ cache buffers lru chain 4,925,313 4.3 0.2 4,749,919 4.4 ------------------------------------------------------------- Latch Sleep breakdown for DB -> ordered by misses desc Get Spin & Latch Name Requests Misses Sleeps Sleeps 1->4 ----------------------- --------- ------- ------ ------------ cache buffers lru chain 4,925,313 211,245 35,178 179031/29608 /2337/269/0 -------------------------------------------------------------
Library Cache • Reloads indicate we are aging out code and reparsing. If bind variables used increase shared_pool size, keep objects Library Cache Activity for DB: PROD Instance: PROD ->"Pct Misses" should be very low Get Pct Pin Pct Invali- Namespace Requests Miss Requests Miss Reloads dations --------------- --------- ---- --------- ---- ------- ------- BODY 1,074 0.1 559 92.8 518 0 CLUSTER 2,736 0.0 4,056 0.0 0 0 PIPE 0 0 0 0 SQL AREA 1,146,358 84.0 3,434,570 56.4 14,339 0 TABLE/PROCEDURE 1,988,138 0.0 4,940,442 0.9 27,943 0 TRIGGER 0 0 0 0
Session Topics Statspack introduction and features Mechanics installing generating snapshots producing reports Discussion of the generic report Examples
Examples Monitoring Madness Out of Sorts Distributed SQL Changing Plans Freelists and 9i Auto Managed Segment
Example #1: Monitoring Madness • A previously stable system, a third party monitoring package, is suddenly consuming large amounts of CPU time. The Unix administrators want to know if they should kill these ‘out of control’ Oracle processes Snap Id Snap Time Sessions ------- ------------------ -------- Begin Snap: 2503 07-Aug-02 12:20:24 33 End Snap: 2512 07-Aug-02 12:31:52 33 Elapsed: 11.47 (mins)
Example #1: Load Profile Per Second Per Transaction Redo size: 7,734.59 8,737.93 Logical reads: 7,168.02 8,097.87 Block changes: 31.11 35.15 Physical reads: 6,916.97 7,814.25 Physical writes: 3.04 3.43 User calls: 21.54 24.34 Parses: 2.72 3.07 Hard parses: 1.08 1.22 Sorts: 1.75 1.98 Logons: 0.04 0.04 Executes: 16.15 18.24 Transactions: 0.89 % Blocks changed per Read: 0.43 Recursive Call %: 60.07 Rollback per trans %: 0.49 Rows per Sort: 13.08
Example #1: Wait Events Instance Efficiency Percentages (Target 100%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Buffer Nowait %: 99.83 Redo NoWait %: 99.99 Buffer Hit %: 3.50 In-memory Sort %: 99.33 Library Hit %: 91.25 Soft Parse %: 60.11 Execute to Parse %: 83.17 Latch Hit %: 98.74 Parse CPU/Parse Elapsd %: 73.37 % Non-Parse CPU: 100.00 Top 5 Wait Events ~~~~~~~~~~~~~~~~~ Wait % Total Event Waits Time (cs) Wt Time ----------------------- ------- --------- ------- db file scattered read 620,235 282,598 63.13 latch free 76,093 145,960 32.61 -------------------------------------------------
Example #1: Physical Reads SQL ordered by Reads for DB: NETMON Instance: netmon -> End Disk Reads Threshold: 1000 Physical Reads Executions Reads per Exec % Total Hash Value -------------- ---------- -------------- ------- ---------- 4,723,495 614 7,693.0 99.3 1725419450 select distinct message_number from ntw_act_messages where message_number=:b0 union select message_number from ntw_act_messages where original_msgid=:b0 union select message_number from ntw_hist_messages where message_number=:b0 union select message_number from ntw_hist_messages where original_msgid=:b0
Example #1: Comments • Execution plan of offending statement SELECT STATEMENT Hint=CHOOSE SORT UNIQUE UNION-ALL INDEX UNIQUE SCAN SYS_C001289 TABLE ACCESS FULL NWT_ACT_MESSAGES INDEX UNIQUE SCAN SYS_C001322 TABLE ACCESS FULL NWT _HIST_MESSAGES • New networking equipment and network problems introduced over the weekend caused major flood of messages
Example #2: Out of Sorts Load Profile Per Second Per Transaction ---------- --------------- Logical reads: 101.25 14,884.00 Physical reads: 51.7 7,610.00 Physical writes: 51.7 7,610.00 Parses: 0.7 113.00 Instance Efficiency Percentages (Target 100%) Buffer Nowait %: 100.00 Redo NoWait %: 100.00 Buffer Hit %: 100.00 In-memory Sort %: 96.64 Library Hit %: 99.62 Soft Parse %: 99.12 Execute to Parse %: 56.70 Latch Hit %: 100.00 Parse CPU / Parse Elapsd %: 100.00 % Non-Parse CPU: 91.43
Example #2: Out of Sorts Top 5 Timed Events Wait % Total Event Waits Time (s) Wt Time ----------------- ---------- -------- ------- direct path write 1,919 78 98.35 SQL ordered by Reads Physical Reads Execs Reads per Exec %Total -------------- ----- -------------- ------ 7,430 1 7,430 97.6 select * from mod where course=:b1 order by nam Tablespace IO Stats Tablespace Reads Reads/s Writes Writes/s ---------- ------ ------- ------ -------- TEMP 3,155 21 7,610 52
Example #2: Conclusions sort_area_size parameter was set to 8i default value of 64k Virtually all the I/O to TEMP tablespace due to disk sorting, even though in memory sorts were 96.65% Increasing sort_area_size produced over 90% improvement in benchmark performance
Example #3: Distributed SQL Users complaining of poor performance Nothing strange in report (e.g. no bad SQL) except Top 5 Wait Events ~~~~~~~~~~~~~~~ Wait % Total Event Waits Time (cs) Wt Time ----------------------------- ------- --------- ------- SQL*Net message from dblink 197,764 12,281 94.20 SQL*Net more data from dblink 1,415 383 2.94 SQL*Net message to dblink 197,765 273 2.09 db file sequential read 53 50 .38 control file parallel write 50 34 .26 --------------------------------------------------------
Example #3: Distributed SQL Stats from the remote database Load Profile Per Second Per Transaction ---------- --------------- Logical reads: 688.33 58,196.50 User calls: 1,283.33 110,366.00 Executes: 647.76 55,707.00 Instance Efficiency Percentages (Target 100%) Buffer Nowait %: 100.00 Redo NoWait %: 100.00 Buffer Hit %: 99.10 In-memory Sort %: 99.39 Library Hit %: 99.80 Soft Parse %: 92.06 Execute to Parse %: 99.39 Latch Hit %: 100.00 Parse CPU/Elapsd %: 50.87 % Non-Parse CPU: 98.29
Example #3: Distributed SQL Top 5 Timed Events Wait % Total Event Waits Time (s) Wt Time ----------------- ----- -------- ------- CPU time 52 98.32 SQL ordered by Executions Executions Rows Processed Rows per Exec Hash Value ---------- -------------- ------------- ---------- 110,703 110,703 1.0 3946697925 select "RESOURCE_ID" from “RESOURCES" "D" WHERE :1="RESOURCE_ID"
Example #3: Conclusions • Search of (v$sql) based on previous fragment identified the following statement on the primary SELECT DISTINCT b.auth_role_code FROM person@paw a, person_auth_roles@paw b, access_roles@paw c, resources@paw d WHERE upper(a.user_login) = upper(‘G243311') AND a.person_id = b.person_id AND b.auth_role_code = c.auth_role_code AND c.resource_id = d.resource_id AND upper(d.resource_id) IN (SELECT upper(ga_resource_id) FROM apps_mapping)
Example #3: Conclusions • Third party package (in remote database) is not analyzed. This results in a poor distributed execution plan Rows Execution Plan ---------- --------------------------------------------------------- 0 SELECT STATEMENT GOAL: CHOOSE 110703 REMOTE [PAW.WORLD] SELECT "RESOURCE_ID" FROM "RESOURCES" "D" WHERE :1= "RESOURCE_ID"
Example #4: Changing Plans • A batch job that had previously performed well was now taking much longer to run. A conventional statspack report showed that a particular statement was dominating the resource usage. What has changed? Begin statspack.snap(I_snap_level=>6); End; sqlplus perfstat/<pwd> @sprepsql.sql