440 likes | 588 Views
Data Guard Best Practices & Tuning. Thomas E. Canty ServerCare, Inc. Session #126. Thomas E. Canty, Senior Oracle DBA, ServerCare, Inc. 19 years of Oracle experience, starting with version 5 Has presented at IOUG, OpenWorld, NoCOUG, IASA, Has been a DBA, Developer, Architect, and IT Manager.
E N D
Data Guard Best Practices & Tuning Thomas E. CantyServerCare, Inc.Session #126
Thomas E. Canty, Senior Oracle DBA, ServerCare, Inc. 19 years of Oracle experience, starting with version 5 Has presented at IOUG, OpenWorld, NoCOUG, IASA, Has been a DBA, Developer, Architect, and IT Manager Speaker Qualifications • Has worked with Fortune 100 companies in Healthcare, Technology, Pharmaceuticals, and Telecom, as well as Major Universities 888-918-6309 http://www.ServerCare.com
Outline • Overview • Network Optimization • ARCn & LGWR Redo Transport • Checkpoint, Redo Read/Apply & Recovery • Wait Events • 10g R2 & 11g Improvements • Best Practices
Data Guard Modes • Maximum Performance Mode • Least performance impact • Default mode • Maximum Protection Mode • Emphasis on data safety • Requires at least one secondary • Maximum Availability Mode • Emphasis on uptime • Continues if secondary unavailable
Outline • Overview • Network Optimization • ARCn & LGWR Redo Transport • Checkpoint, Redo Read/Apply & Recovery • Wait Events • 10g R2 & 11g Improvements • Best Practices
Session Data Unit (SDU) • In Oracle Net connect descriptor: sales.servercare.com= (DESCRIPTION= (SDU=32767) (ADDRESS=(PROTOCOL=tcp) (HOST=sales-server)(PORT=1521)) (CONNECT_DATA= (SID=sales.servercare.com))) • Globally in sqlnet.ora: • DEFAULT_SDU_SIZE=32767
Session Data Unit (SDU) (Cont.) • On standby DB, set in listener.ora: SID_LIST_listener_name= (SID_LIST= (SID_DESC= (SDU=32767) (GLOBAL_DBNAME=sales.servercare.com) (SID_NAME=sales) (ORACLE_HOME=/u01/app/oracle/product/10.2.0/db_1)))
TCP Socket Buffer Size • Set TCP socket buffer size = 3 * BDP • Data Guard broker config. – Set in sqlnet.ora • Non Data Guard broker – set in connect descriptor • BDP - Bandwidth Delay Product • RTT- Round Trip Time
TCP Socket Buffer Size • Assume gigabit network with RTT 25 ms BDP= 1,000 Mbps * 25msec (.025 sec) 1,000,000,000 * .025 25,000,000 Megabits / 8 = 3,125,000 bytes • In this example: socket buffer size = 3 * bandwidth * delay = 3,125,000 * 3 = 9,375,000 bytes • sqlnet.ora: RECV_BUF_SIZE=9375000 SEND_BUF_SIZE=9375000
Network Queue Sizes • Between kernel net. subsystems & NIC driver • txqueuelen - transmit queue size • netdev_max_backlog - receive queue size • Assumes gigabit network with 100ms latency • Set queues: • ifconfig eth0 txqueuelen 10000 • sysctl.conf: • net.core.netdev_max_backlog=20000
Overall Network • Ensure sufficient bandwidth to standby • Verify TCP_NODELAY set to YES (default) • RHEL3 - increase /proc/sys/fs/aio-max-size on standby • From 131072(default) to 1048576 • Set RECV_BUF_SIZE & SEND_BUF_SIZE = 3 * Bandwidth Delay Product (BDP) • Use Session Data Unit (SDU) size of 32767 • Increase send & receive queue sizes • TXQUEUELENGTH • NET_DEV_MAX_BACKLOG
Outline • Overview • Network Optimization • ARCn & LGWR Redo Transport • Checkpoint, Redo Read/Apply & Recovery • Wait Events • 10g R2 & 11g Improvements • Best Practices
ARCn Redo Transport 1) Read from local arch 2) Receive redo 3) Ack - know -ledge
ASYNC LGWR Redo Transport 1) Write local redo 2) ASYNC send redo 3) Receive redo 4) Ack - know -ledge 5) Write stdby redo
SYNC LGWR Redo Transport 1) Write local redo 2) SYNC send redo 3) Receive redo 4) Ack - know -ledge 5) Post receipt to LGWR
Optimize ARCn Transport • Increase MAX_CONNECTIONS to 5 on standby (if possible) • default (2), maximum (5) • Increase LOG_ARCHIVE_MAX_PROCESSES • Larger than MAX_CONNECTIONS • Up to network bandwidth • default (2), maximum (30)
Optimize LGWR Transport • Decrease NET_TIMEOUT (default 180 secs.) • Be careful! - Not too low • New COMMITS • COMMIT IMMEDIATE WAIT (default) • COMMIT NOWAIT • COMMIT NOWAIT BATCH
All Redo Transport • Standby redo logs • Use fastest disks • No RAID5 • Don’t multiplex • Use the recommended number of SRLs • (maximum# of online logfiles + 1) * maximum# of threads
Outline • Overview • Network Optimization • ARCn & LGWR Redo Transport • Checkpoint, Redo Read/Apply & Recovery • Wait Events • 10g R2 & 11g Improvements • Best Practices
Checkpoint Phase • Checkpoint occurs • During log switch • LOG_CHECK_TIMEOUT expiration • LOG_CHECKOUT_INTERVAL reached • Reduce log switch interval • Resize redo log to 1GB - primary and secondary • Recommended - checkpoint every 15 minutes
Checkpoint Phase (Cont.) • Determine checkpoint frequency COL NAME FOR A35; SELECT NAME, VALUE, TO_CHAR(SYSDATE, ‘HH:MI:SS’) TIME FROM V$SYSSTAT WHERE NAME = 'DBWR checkpoints'; NAME VALUE TIME ----------------------------------- ---------- -------- DBWR checkpoints 264 08:15:43 SQL> / NAME VALUE TIME ----------------------------------- ---------- -------- DBWR checkpoints 267 08:34:06
Redo Read (Secondary) • Obtain read rate for the standby redo log SQL> ALTER SYSTEM DUMP LOGFILE '/u01/oradata/docprd/sredo01.log’ validate; System altered. $vi docprd_ora_3560.trc Mon Mar 12 08:59:52 2007 ……………… ----- Redo read statistics for thread 1 ----- Read rate (ASYNC): 4527Kb in 0.58s => 6.90 Mb/sec Longest record: 19Kb, moves: 0/7586 (0%) Change moves: 4340/18026 (24%), moved: 2Mb Longest LWN: 92Kb, moves: 1/1365 (0%), moved: 0Mb Last redo scn: 0x0000.01272351 (19342161)
Redo Apply (Secondary) • Goal • Redo apply rate (secondary) > Redo create rate (primary) • Carefully consider enabling DB_BLOCK_CHECKING • LOW, MEDIUM and FULL options • Possible performance impact
Redo Apply (Cont.) • Determine Log Block Size (LEBSZ) SELECT LEBSZ FROM X$KCCLE WHERE ROWNUM=1; • Get recovery blocks - at least two snapshots • Managed Recovery Case SELECT PROCESS, SEQUENCE#, THREAD#, block#, BLOCKS, TO_CHAR(SYSDATE, 'DD-MON-YYYY HH:MI:SS') time from v$MANAGED_STANDBY WHERE PROCESS='MRP0'; • Determine the recovery rate (MB/sec) for a specific archive sequence number • Managed Recovery Case: ((BLOCK#_END - BLOCK#_BEG) * LOG_BLOCK_SIZE) / (TIME_END - TIME_BEG) * 1024 * 1024
Redo Apply (Cont.) • Oracle Recommends:
Recovery • Parallel Recovery (before 10.1.0.5) • Set to number of CPUs recover managed standby database parallel <#>; • PARALLEL_EXECUTION_MESSAGE_SIZE • Can increase to 4096 or 8192 • Uses additional shared pool memory • Problems if set too high • DB_CACHE_SIZE • Can set secondary DB_CACHE_SIZE >= primary • Must set to primary before changing roles
Outline • Overview • Network Optimization • ARCn & LGWR Redo Transport • Checkpoint, Redo Read/Apply & Recovery • Wait Events • 10g R2 & 11g Improvements • Best Practices
Arch Wait Events - Primary • ARCH wait on ATTACH • Time for all arch processes to spawn RFS connection • ARCH wait on SENDREQ • Time for all arch processes to write received redo to disk + open & close remote archived redo logs • ARCH wait on DETACH • Time for all arch processes to delete RFS connection
LGWR SYNC Wait Events - Primary • LGWR wait on ATTACH • Time for all log writer processes to spawn RFS connection • LGWR wait on SENDREQ • Time for all log writer processes to write received redo to disk + open & close the remote archived redo logs • LGWR wait on DETACH • Time for all log writer processes to delete RFS conn.
LGWR ASYNC Wait Events - Primary • LNS wait on ATTACH • Time for all network servers to spawn RFS connection • LNS wait on SENDREQ • Time for all network servers to write received redo to disk + open & close the remote archived redo logs • LNS wait on DETACH • Time for all network servers to delete RFS conn. • LGWR wait on full LNS buffer • Time for log writer (LGWR) process awaiting for network server (LNS) to free ASYNC buffer space
Wait Events on Secondary • RFS Write • Time to write to standby redo log or archive log + non I/O work like redo block checksum validation • RFS Random I/O • Time to write to a standby redo log to occur • RFS Sequential I/O • Time to write to an archive log to occur
Outline • Overview • Network Optimization • ARCn & LGWR Redo Transport • Checkpoint, Redo Read/Apply & Recovery • Wait Events • 10g R2 & 11gImprovements • Best Practices
10g R2 Improvements • Multiple archive processes can transmit a redo log in parallel to the standby database • MAX_CONNECTIONS attribute of the LOG_ARCHIVE_DEST_n controls the number of these processes • Parallel Recovery for Redo apply is automatically set equal to number of CPUs • 10.1.0.5 and 10.2.0.1 • Fast-Start Failover • Automatically fails over to a previously chosen physical standby database
10g R2 Improvements (Cont.) • LGWR ASYNC • Uses a new process (LNSn) to transmit the redo data directly from the online redo log to the standby database • Physical standby database flashback • Can flash back temporarily for reporting • Logical standby database • Automatically deletes applied archived log • RMAN • Automatically creates temp datafiles after recovery
11g Improvements • Physical standby database open read/write for test or other purposes with zero compromise in data protection using new Snapshot Standby • Automatic failover configurable for immediate response to designated events or errors • More flexibility in primary/standby configurations • e.g. Windows primary and Linux standby • Rolling upgrade options now in physical standby with Transient Logical Standby • ASYNC transport enhanced to eliminate the impact of latency on network throughput
11g Improvements (Cont.) • Fast detection of corruptions caused by lost writes in the storage layer • SQL Apply supports XML data type (CLOB) • Many performance, manageability, and security enhancements • Support for new Oracle Database 11g Options – Oracle Active Data Guard and Oracle Advanced Compression • Fast Start Failover now available for Maximum Performance mode
Outline • Overview • Network Optimization • ARCn & LGWR Redo Transport • Checkpoint, Redo Read/Apply & Recovery • Wait Events • 10g R2 & 11g Improvements • Best Practices
Best Practices • Geographically separate primary & standby DB • Ensure standby hardware configuration same as the primary • Tune standby for write intensive operations • Test Data Guard before deploy in production • Set standard OS and DB parameters to recommended values • Perform switchover testing • Fully document a failover procedure • Use FORCE LOGGING mode
Best Practices (Cont.) • Use real-time apply • Use the Data Guard Broker • Enable Flashback Database on both primary and secondary databases • Evaluate using AFFIRM attribute • Possible performance issues on primary • Verify Asynchronous I/O enabled • Carefully consider DB_BLOCK_CHECKING
Best Practices (Cont.) • Don’t multiplex standby redo logs (SRLs) • Correctly set number of SRLs • Increase PARALLEL_EXECUTION_MESSAGE_SIZE • Place SRLs in fast disk group or disks • Use at lease two standby DBs with Maximum Protection Mode • Utilize COMMIT NOWAIT if appropriate
Best Practices (Cont.) • Ensure appropriate bandwidth between primary and secondary • Increase default send & receive queue sizes • TXQUEUELENGTH • NET_DEV_MAX_BACKLOG • Session Data Unit • Adjust value to 32767 • Improvement during large data transmissions
Questions? • Lots of things we didn’t cover • If we don’t cover something you wanted to hear, please contact me.
Session #126:Data Guard Best Practices & Tuning THANK YOU! Please fill out evaluations! Email Tom Canty: tom.canty@servercare.com Or Call: 888-918-6309 http://www.ServerCare.com