530 likes | 665 Views
Oracle Wait Events That Everyone Should Know Kerry Osborne Senior Oracle Guy. What Are Wait Events?. Basically a section of code Uses gettimeofday All time is stuck in a bucket with a name Multiple things can be stuck in the same bucket
E N D
Oracle Wait Events That Everyone Should Know Kerry Osborne Senior Oracle Guy
What Are Wait Events? • Basically a section of code • Uses gettimeofday • All time is stuck in a bucket with a name • Multiple things can be stuck in the same bucket • Oracle is very well instrumented (but It’s not 100%) • There are 41 classes of wait events in 10.2.0.3 • There are 878 wait events in 10.2.0.3 • 209 enqueue events • 29 latch events • 41 I/O events
What do you do with Wait Events? • See where Oracle is spending it’s time • Create a Resource Profile • Aggregate the time by Wait Event • Order the profile by time • Include # of occurrences and avg. time/event • Shows where to focus • Shows how much improvement can be made
Profiles • Basic Tool – maybe the most important • Shows up in Statspack / AWR reports • Shows up in tkprof output (10g) • Can be pulled out of trace files • Can be pulled from V$ tables (v$system_event) • Can be pulled from ash tables Oracle is very well instrumented
Profiles Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Ela Time -------------------------------------------- ------------ ----------- -------- CPU time 37,297 52.27 SQL*Net message from dblink 7,065,802 11,312 15.85 async disk IO 943,852 5,265 7.38 db file sequential read 13,042,432 3,493 4.90 direct path write 108,862 3,410 4.78 -------------------------------------------------------------
Why should developers know this stuff? • Because you need to know where your code is spending it’s time • Scalability is Important
Top 10 List of Wait Events • 1. These are the events you should know by heart • 2. The other 260 or so you can figure out • 3. You need to know what causes them • 4. You need to know what values are reasonable • 5. You need to know what you can do to fix them select name, parameter1, parameter2, parameter3 from v$event_name where name like nvl('&event_name',name) order by name
CPU • Not a wait event – but often included in a profile • Time spent but not accounted for in other buckets • Primarily time spent doing lio (we hope) • We want it to be on top • It may be an indicator of inefficient plans
DB File Sequential Read • Single block read • Usually index block or data block via rowid • Can be done by fts to get chained rows • P1=file, P2=block, P3=# of blocks (always 1) • Always a user’s server process reading into buffer cache • Reasonable e values: <10ms WAIT #14: nam='db file sequential read' ela= 36745 file#=1 block#=67745 blocks=1 obj#=84920
DB File Sequential Read – “fixes” • Do Less Work • tune SQL • reduce lio’s 7 / obj rule of thumb • Explain plan lies (see notes on this slide) • bigger buffer cache • Do the Work Faster • Speed up I/O • Call Jack
Digression – Explain Plan Lies • Explain plan appears to executes a separate code path • Even if it didn’t there is no guarantee TEST matches PROD • Best bet is to execute and look at V$SQL_PLAN • This is easy in 10g with dbms_xplan select * from table(dbms_xplan.display_cursor('&sql_id','&child_no','')) • See an example in the notes section for this slide
DB File Scattered Read • Multi block read • Usually full table scan or index fast full scan • P1=file, P2=block, P3=# of blocks (always < MBRC) • Always a user’s server process reading into buffer cache • Reasonable ela values: <10ms • Relevant parameters: • DBFMBRC, System Stats: MBRC, • _db_file_exec_read_count, _db_file_optimizer_read_count WAIT #14: nam='db file scattered read' ela= 19389 file#=139 block#=44 blocks=5 obj#=83580
DB File Scattered Read – “fixes” • Do Less Work (fewer multi-block reads) • MBRC • Better Indexes • System Stats • bigger buffer cache • Tune SQL • Do the Work Faster • Speed up I/O
Direct Path Read/Write • Usually sorting to Temp • Can also be parallel query • Could also be insert append, etc… • Reasonable ela values: <20ms • Relevant parameters: • DBFMBRC • _db_file_direct_io_count (1M) WAIT #10: nam='direct path write' ela= 4475 p1=401 p2=1518353 p3=57 WAIT #10: nam='direct path read' ela= 16770 p1=401 p2=1482031 p3=63
Direct Path Read/Write – “fixes” • Adjust PGA_AGGREGATE_TARGET • Turn off PX query • Call Randy
Log File Sync • User process wait for LGWR to flush dirty buffers on commit • Synchronous write event • P1=log buffer block • Reasonable ela values: <4ms WAIT #16: nam='log file sync' ela= 1286 buffer#=11910 p2=0 p3=0 obj#=84920
Log File Sync – “fixes” • Do Less Work (fewer commits) • Autocommit? • Row at a time processing with commits? • Do the Work Faster • Speed Up I/O • No RAID 5 • No Contention (other db, arch, ASM) • SS Disk • Improve LGWR Scheduling • Renice
Log File Parallel Write • Wait for LGWR to write to current log files • Not necessarily synchronous write event • System I/O event • Reasonable ela values: <4ms • Occurs as follows: • Every 3 seconds • Log_buffer is 1/3 full or redo = 1M ( default _log_io_size) • Commit or rollback by user session • RAC needs to transfer a dirty block
Log File Parallel Write – “fixes” • Often fixed by fixing Log File Sync (i.e. speed up i/o, etc) • Force LGWR to write more often
Log file switch … • Happens when database can’t switch to next log file • Freezes the whole database • Several Flavors • Checkpoint incomplete • Archiving needed • Reasonable ela values: 0ms WAIT #9: nam='log file switch (checkpoint incomplete)' ela= 986212 p1=0 p2=0 p3=0
Log file switch … – “fixes” • Do less work • Make total online log space bigger • Force more frequent incremental checkpoints • Speed up i/o
Buffer Busy Waits • Contention event for the same block • Read by other session (new event in 10g) • Held by other session in incompatible mode • New event in 10g – read by other session • Multiple sections of code throw time in this bucket • P1=file, P2=block, P3=reason code • See metalink note (34405.1) for details WAIT #7: nam='read by other session' ela= 3251 file#=4 block#=188557 class#=1 obj#=53707 tim=1183096831514887
Buffer Busy Waits - “fixes” • It’s complicated – but basically - eliminate the contention • For Read by Other Session • Do Less Work (tune SQL to do less lio) • Eliminate i/o – bigger buffer cache, etc… • Speed up i/o • For Others • Find type of hot objects and statements suffering • Address issue - ASSM, Freelist groups (RAC), etc..
Free Buffer Waits • No place to put a new block
Free Buffer Waits - “fixes” • Do Less Work • Reduce the amount of lio • Make more buffer space available • Add memory to the buffer cache • Speed up DBWR to flush blocks more quickly • Async_io • More dbwr processes • Renice
SQL*Net message from client • Database is idle – waiting on next request from client • Often marked as an idle event which it may be • Time distribution can be skewed • Common technique to ignore anything over 1 sec • Reasonable e values: <2ms • Protocol should be tcp (or beq for local processes) WAIT #1: nam='SQL*Net message from client' ela= 336 driver id=1650815232 #bytes=1 p3=0 obj#=83660
SQL*Net message from client – “fixes” • Speed up app server • Reduce the number of calls • Use the right protocol (tcp to beq) • Call network guy • Get the users to type faster
SQL*Net more data from client • More than 1 packet required to send SQL statement
SQL*Net more data from client – “fixes” • Don’t send 50K SQL statements • Use packages
SQL*Net message to Client • Time to send a SQL*Net message to a client (sort of) • Actually time it took to write the return data from Oracle’s userland SDU buffer to OS kernel-land TCP socket buffer (Tanel Poder) • Often marked as an idle event which it may be • Reasonable e values: <1ms (micro seconds actually) • P1=bytes (usually 1 ???) WAIT #1: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=83660
SQL*Net message to Client – “fixes” • Use correct protocol • Fix network issue • Call Andy
SQL*Net more data to client • When more than one packet required • Rows spanning block • LOB’s • Array fetches • Often marked as an idle event which it may be • Reasonable e values: <1ms • P1=bytes (usually 1 ???) WAIT #1: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=83660
SQL*Net more data to client – “fixes” • Fix chained/migrated rows • Don’t use LOBs • Nothing to fix
Enqueues • Locks for objects • Que up in order • 9i lumped them all together • 10g has 209 separate events
Enqueue in 9i • P1 has encoded type and mode
Enqueue in 10g • Several Major Categories • TM – table modification • TX – Transaction locks • UL – user lock • CI – Cross Instance • CU – Cursor Bind • HW – High Water • RO – Reusable Object • ST – Space Transaction • TS – Temporary Space • Parameters depend on type
Enqueue: TX row lock • User can’t modify block until other user commits
Enqueue: TX row lock – “fixes” • Use ash tables to find statements waiting • Fix the code • Or do what one of our clients did: write some code to kill any processes blocking for more than 60 seconds – lot’s easier than changing the app
Enqueue: SQ • Waiting on exclusive access to a sequence • Seems minor but can actually be a big issue on some systems
Enqueue: SQ – “fixes” • Cache the sequence • Use NOORDER in RAC
Latches • Locks internal memory structures • Not ordered • Very light weight • Kid Asking for Candy Model • Spins vs. Sleeps (_spin_count, _latch_class_X) • Wait events do not include time spent spinning • 9i lumped them all together • 10g has 28 separate events
Latches in 9i • All dumped in the Latch Free event • P2 has latch#
Latches in 10g • Primarily Deal with major SGA areas • Shared Pool • Latch: shared pool • Latch: library cache • Latch row cache • Buffer Cache • Latch: cache buffer chains • Latch: cache buffers lru chain • Log Buffer • Latch: redo … • Parameters depend on type (see v$event_name)
Latch: cache buffers chains • Each block hashes to a chain • Several chains per latch • Several latches per instance (default based on cache size)
Latch: cache buffers chains –”fixes” • Find hot blocks • Fix bad SQL • Maybe Increase number of latches • Maybe decrease rows per block
Latch: shared pool • Shared Pool Contention • Due primarily to parsing • See also latch: library cache … WAIT #28: nam='latch: shared pool' ela= 751 address=1611562656 number=213 tries=1 j#=46024
Latch: shared pool – “fixes” • Use Bind Variables • CURSOR_SHARING = FORCE • Increase SHARED_POOL – this is where Darcy came in and sat in my office so I didn’t get much more done – You can thank Darcy later!
Unaccounted for • Some tools show this (Hotsos Profiler) • Indicates lost time due to rounding (small) • Or lost time due to CPU contention (large)
Tracing • alter session set events '10046 trace name context forever, level 8'; • Spits out wait events, plans (real), stats • Level 12 includes bind variables – makes file very big • Puts file in user_dump_dest (ls –altr) • Can be executed from logon trigger, in-line, etc… • Scoping is important (i.e. when it’s turned on and off) • Note: bug in early versions of 9.2, fixed in 9.2.0.6