390 likes | 418 Views
Adventures in Dataguard Dr. Jason Arneil. Motivation. Why Dataguard. AGENDA. Introduction The Motivation Dataguard Architecture & Features Creating a Physical Standby Maintaining your standby Using your Standby Performing a Switchover. Introduction. Health Warning. Introduction.
E N D
Motivation Why Dataguard
AGENDA • Introduction • The Motivation • Dataguard Architecture & Features • Creating a Physical Standby • Maintaining your standby • Using your Standby • Performing a Switchover
Introduction Health Warning
Introduction About Me • Jason Arneil • System Administrator/DBA • Using Oracle since 1998 • At Nominet since 2001
Introduction About Nominet • Nominet is the internet registry for .uk domain names • Nominet has been in existence for over 11 years • Nominet is run as a not-for-profit company • Nominet is owned by its members • There are over 6 Million .uk domain names
Motivation Why Dataguard • Big push on a Nominet Business Continuity Plan • Dataguard is the Oracle solution for disaster recovery • Physical Standby was the obvious option • Maximum Availability Architecture (MAA)
Motivation Business Continuity Site
Oracle Net LGWR MRP/ LSP LNS RFS Transform Redo to SQL for SQL Apply FAL Online Redo Logs Standby Redo Logs ARCH Archived Redo Logs ARCH Archived Redo Logs Architecture & Features Dataguard Processes Physical/Logical Standby Database Transactions Primary Database Backup / Reports
Architecture & Features Dataguard Features • Several Protection Modes • Maximum Protection • Maximum Availability • Maximum Performance • Several Transport Modes • LGWR SYNC • LGWR ASYNC • ARCH
Creating a Standby Prepare Primary & Standby • Prepare Primary Database • Enable Force Logging SQL> alter database force logging; • Modify initialization parameters • Prepare Standby Database • Setup directory structure • Create spfile with correct parameters • Start database in nomount
Creating a Standby Log Transport Parameters • LOG_ARCHIVE_CONFIG='DG_CONFIG=(PRIMARY, STANDBY)' • LOG_ARCHIVE_DEST_1='LOCATION=/var/oracle/PRIMARY/arch' • LOG_ARCHIVE_DEST_2='SERVICE=PRIMARC DB_UNIQUE_NAME=PRIMARY' • LOG_ARCHIVE_DEST_3='SERVICE=STANDBY LGWR ASYNC REOPEN=15 MAX_FAILURE=10 OPTIONAL VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=STANDBY'
Creating a Standby ssh tunnels • You may not wish your redo data being sent unencrypted across the internet to your standby. You can use ssh tunnels to avoid this • ssh -N -L 3333:standby:1521 oracle@standby • Now the tnsnames entry points to the localhost STANDBYARC = (DESCRIPTION = (SDU = 32767) (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT=3333))) (CONNECT_DATA = (SERVICE_NAME = STANDBY)))
Creating a Standby Some Other Parameters • FAL_SERVER • FAL_CLIENT • ARCHIVE_LAG_TARGET • STANDBY_FILE_MANAGEMENT • DB_FILE_NAME_CONVERT • LOG_FILE_NAME_CONVERT
Creating a Standby backup your primary • Backup primary - rman is good • rman> backup format '/backup/%U' database plus archivelog; • rman> backup format '/backup/%U' current controlfile for standby; • Recover backup on standby node • I like using rman duplicate to create standby: • (oracle$) rman target sys/password@PRIMARY auxiliary / • rman> duplicate target database for standby;
Creating a Standby Start applying redo • Create standby redo log files on both primary and standby: • sql> alter database add standby logfile thread 2 group 42 (’PATH_TO_DATA/standbyredo01.log') size 512M; • Now you can start the physical standby recovering logs: • sql>alter database recover managed standby database disconnect from session; • Or if you prefer real time apply: • sql>alter database recover managed standby database using current logfile disconnect from session;
Maintaining your standby Monitoring the Standby • You have to ensure your standby is keeping up with your primary • You can check which was the last log to have been applied to your standby is • sql> SELECT MAX(SEQUENCE#), THREAD# FROM V$ARCHIVED_LOG where APPLIED='YES' GROUP BY THREAD#; MAX(SEQUENCE#) THREAD# -------------- ---------- 2976 1 1888 2
Maintaining your standby Monitoring Standby Progress • A good way of checking what the background processes of your standby are up to is using v$managed_standby • SQL> select process, sequence#, status from V$managed_standby; PROCESS SEQUENCE# STATUS -------- ---------- ------------ ARCH 2967 CLOSING ARCH 2974 CLOSING RFS 2977 IDLE MRP0 1889 APPLYING_LOG RFS 1889 IDLE RFS 2977 IDLE
Maintaining your standby Monitoring Your Standby • You have to ensure your standby is keeping up with your primary • V$DATAGUARD_STATS provides useful information • SQL> select name, value from v$dataguard_stats; NAME VALUE -------------------------------- ------------------------------------ apply finish time +00 00:00:00 apply lag +00 00:00:11 estimated startup time 41 standby has been open N transport lag +00 00:00:03
Maintaining your standby Monitoring Your Standby • A way of finding out what has been happening to your standby over a period time is to look at the v$dataguard_status view • Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 1 sequence 2977 (in transit) • Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 1 sequence 2977 (in transit) • Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 2 sequence 1889 (in transit) • Remote File Server 01-AUG-07 Primary database is in MAXIMUM PERFORMANCE mode • Remote File Server 01-AUG-07 RFS[53]: Successfully opened standby log 14: '+DATA2/standby/standbyredo02.log'
Maintaining your standby Oracle can’t divide by 0 • Standby was happily working away • ORA-07445: exception encountered: core dump [kcrarmb()+152] [SIGFPE] [Integer divide by zero] [0x00085C300 • MRP process crashes • No redo gets applied from this point • Logs after the one that caused the ORA-07445 still being shipped • A simple restart of the managed recovery process does a FAL and the standby is back up-to-date
Maintaining your standby kcrfr_resize2 • Lots of problems after upgrade to 10.2.0.3 • Recovery of Online Redo Log: Thread 2 Group 23 Seq 999 Reading mem 0 Mem# 0: +DATA3/standby/standbyredo11.log ORA-00600: internal error code, arguments: [kcrfr_resize2], [652614828032], [268423168], [], [], [], [], [] • Perhaps caused by the following: • Bug 3306010 OERI[kcrfr_resize2] possible in MEDIA recovery Media recovery may fail with ORA-600 [kcrfr_resize2] when the number of redo strands is set to a high value using log_parallelism.
Maintaining your standby kcrfr_resize2 • This issue has recently been published as Note:453259.1 • Triggered by having a large log_buffer • This bug affects 10.2.0.3 and potentially 9.2.0.8 • It is related to the size of the log_buffer parameter • Fix is included in 10.2.0.4
Maintaining your standby kcrrupirfs • ARC processes died on primary: ORA-00600: [kcrrupirfs.20] [4] [368] • Trace file showed the following: Corrupt redo block 479421 detected: bad block number Flag: 0x0 Format: 0x0 Block: 0x00000000 Seq: 0x00000000 Beg: 0x0 Cks:0x0 <<<<<<<-- ----- Dump of Corrupt Redo Buffer -----000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Maintaining your standby kcrrupirfs • Oracle think initially think this ORA-600 error was hardware related • There are NO indications of any hardware fault - the primary keeps running • After a couple of weeks it was decided this was a “bug situation” • This was bug 4767278 which talked about FAL not being able to read from multiple mirror sides when encountering invalid/stale redo in a file. Apparently required for ASM configurations because ASM does not guarantee all mirror sides contain same data after writing. • We were using ASM, but external redundancy • Oracle then said “The ASM group is not 100% sure if the patch 4767278 will fix the problem”
Maintaining your standby log corruption • The Managed Recovery process crashed complaining about log corruption MRP0: Background Media Recovery terminated with error 355 ORA-00355: change numbers out of order ORA-00353: log corruption near block 2 change 1273622545 time 03/06/2007 08:32:46 ORA-00312: online log 13 thread 1: '+DATA2/standby/standbyredo01.log' • Oracle blame the upgrade process at first. They suggest rebuilding the standby • Then I notice that trying managed recovery rather than real time apply seems to allow the standby to progress
Maintaining your standby log corruption • At this point Oracle say “it looks like a bug” • Lots of time spent diagnosing the issue • ALTER SYSTEM DUMP LOGFILE '+DATA2/nom/standby33.log' scn min 865465290 scn max 865465300; • Eventually Oracle produced a patch 5746174 • MRP HANGS WITH ASYNC LNS AND PARALLEL ARCHIVAL
Using Your Standby Utilize those cpu cycles • A Standby can be considered an insurance policy • Several ways to utilize your standby • Run your backups from your standby • Open your standby read only for reporting • Flashback standby to look at old data • Open your standby read write for testing purposes
Using Your Standby Open for Reports • You need to cancel managed recovery • sql> alter database recover managed standby database cancel; • Then simply open the standby • sql> alter database open; • Redo is still transported to your standby • To transition back to applying redo shutdown the open standby, startup mount and restart the recovery process
Using Your Standby Open for read write • You must have flashback database enabled for this • Stop redo apply on standby • Create a restore point • Activate the Standby & perform read/write testing • Flashback to restore point • Start the redo on the Standby again
Using Your Standby Open for read write Restore Point Physical Standby Physical Standby Flashback Database Activate standby read write
Using Your Standby Flashback Database in a Nutshell • Set up Flashback Database • alter system set db_recovery_file_dest_size = 8G; • alter system set db_recovery_file_dest = 'your flashback destination'; • alter system set db_flashback_retention_target = 1440 ; • alter database flashback on; • Once you have cancelled the standby recovery create a guaranteed restore point • create guaranteed restore point before_activate;
Using Your Standby Open for read write • Activate your Standby • SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE; • You can open the Standby for business • SQL> ALTER DATABASE OPEN; • To become a Standby again shutdown and startup in mount • SQL> FLASHBACK DATABASE TO RESTORE POINT BEFORE_ACTIVATE; • SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
Using Your Standby Open for read write • However things never go according to plan • ORA-00600: internal error code, arguments: [3705], [1], [8], [3], [8], [], [] • This was bug 4479323 which is a bug with recovery (not standby specific) and only occurs in a RAC environment • This is fixed in 10.2.0.3
Doing a Switchover It’s good to test • A business continuity plan is no good unless it’s been tested • It’s not all about the database • Good to think in terms of services
Doing a Switchover Database Switchover • Make sure your standby is up-to-date • Check your primary database switchover status: • primary> SELECT SWITCHOVER_STATUS FROM V$DATABASE; • Switchover primary database • primary> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY with session shutdown; • Switchover the standby • standby> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY with session shutdown;
Doing a switchover DNS Primer • DNS allows translation from hostname to IP address • example.co.uk IN A 162.0.0.1 • Our principle is all services are accessed through a CNAME • anexample.co.uk 5M IN CNAME example.co.uk • relocation of the service is just a case of changing where the CNAME points
Conclusion Conclusion • Dataguard is an efficient DR solution for your primary database • Dataguard is mostly reliable but is not without it’s blips • There are opportunities for gaining added value from your standby • You can’t test your Business continuity plan enough
Adventures in Dataguard Questions? Contact: • jason@nominet.org.uk • http://blog.nominet.org.uk