RAC Troubleshooting
Julian Dyke, Independent Consultant
Web Version - May 2008
© 2008 Julian Dyke
juliandyke.com
Agenda • Installation and Configuration • Oracle Clusterware • ASM and RDBMS
Cluster Verification Utility Overview
• Introduced in Oracle 10.2
• Checks cluster configuration
  • stages - verifies all steps for the specified stage have been completed
  • components - verifies the specified component has been correctly installed
• Supplied with Oracle Clusterware
• Can be downloaded from OTN (Linux and Windows)
• Also works with 10.1 (specify -10gR1 option)
• For earlier versions see Metalink Note 135714.1, Script to Collect RAC Diagnostic Information (racdiag.sql)
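• As an illustrative sketch, an individual component can also be checked on its own, for example node connectivity between the two nodes used in the later examples:
sh runcluvfy.sh comp nodecon -n london1,london2 -verbose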
Cluster Verification Utility CVUQDISK Package
• On the Red Hat 4 and Enterprise Linux platforms, the following additional RPM is required for CLUVFY:
cvuqdisk-1.0.1-1.rpm
• This package is supplied in the clusterware/cluvfy/rpm directory on the clusterware CD-ROM
• It can also be downloaded from OTN
• On each node as the root user install the RPM using:
rpm -ivh cvuqdisk-1.0.1-1.rpm
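• Before installing the package, the group that will own cvuqdisk can reportedly be set with the CVUQDISK_GRP environment variable (oinstall is assumed here; treat this as an illustrative sketch):
# export CVUQDISK_GRP=oinstall
# rpm -ivh cvuqdisk-1.0.1-1.rpm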
Cluster Verification Utility Stages
• CLUVFY stages include:
Cluster Verification Utility Components
• CLUVFY components include:
Cluster Verification Utility Example
• For example, to check the configuration before installing Oracle Clusterware on london1 and london2 use:
sh runcluvfy.sh stage -pre crsinst -n london1,london2
• Checks:
  • node reachability
  • user equivalence
  • administrative privileges
  • node connectivity
  • shared storage accessibility
• If any checks fail, append -verbose to display more information
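• A corresponding post-installation check can be run once Oracle Clusterware has been installed; a sketch using the same node names:
sh runcluvfy.sh stage -post crsinst -n london1,london2 -verbose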
Cluster Verification Utility Trace & Diagnostics
• To enable trace in CLUVFY use:
export SRVM_TRACE=true
• Trace files are written to the $CV_HOME/cv/log directory
• By default this directory is removed immediately after CLUVFY execution completes
• On Linux/Unix, comment out the following line in runcluvfy.sh:
# $RM -rf $CV_HOME
• The pathname of the CV_HOME directory is based on the operating system process ID, e.g. /tmp/18124
• It can be useful to echo the value of CV_HOME in runcluvfy.sh:
echo CV_HOME=$CV_HOME
Oracle Universal Installer (OUI) Trace & Diagnostics
• On Unix/Linux, to launch the OUI with tracing enabled use:
./runInstaller -J-DTRACING.ENABLED=true -J-DTRACING.LEVEL=2
• Log files will be written to $ORACLE_BASE/oraInventory/logs
• To trace root.sh, execute it using:
sh -x root.sh
• Note that it may be necessary to clean up the CRS installation before executing root.sh again
DBCA Trace & Diagnostics
• To enable trace for the DBCA in Oracle 9.0.1 and above
• Edit $ORACLE_HOME/bin/dbca and change
# Run DBCA
$JRE_DIR/bin/jre -DORACLE_HOME=$OH -DJDBC_PROTOCOL=thin -mx64m -classpath $CLASSPATH oracle.sysman.assistants.dbca.Dbca $ARGUMENTS
• to
# Run DBCA
$JRE_DIR/bin/jre -DORACLE_HOME=$OH -DJDBC_PROTOCOL=thin -mx64m -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath $CLASSPATH oracle.sysman.assistants.dbca.Dbca $ARGUMENTS
• Redirect standard output to a file e.g.
$ dbca > dbca.out &
Oracle Clusterware Overview
• Provides
  • Node membership services (CSS)
  • Resource management services (CRS)
  • Event management services (EVM)
• In Oracle 10.1 and above resources include
  • Node applications
  • ASM Instances
  • Databases
  • Instances
  • Services
• Node applications include:
  • Virtual IP (VIP)
  • Listeners
  • Oracle Notification Service (ONS)
  • Global Services Daemon (GSD)
Oracle Clusterware Virtual IP (VIP)
• Node application introduced in Oracle 10.1
• Allows a Virtual IP address to be defined for each node
• All applications connect using Virtual IP addresses
• If a node fails, its Virtual IP address is automatically relocated to another node
  • Only applies to newly connecting sessions
Oracle Clusterware VIP (Virtual IP) Node Application
• [Diagram: before and after node failure. Before: VIP1, Listener1 and Instance1 on Node 1; VIP2, Listener2 and Instance2 on Node 2. After: Node 1 has failed and VIP1 has relocated to Node 2, which now hosts VIP1 and VIP2 alongside Listener2 and Instance2]
Oracle Clusterware VIP (Virtual IP) Node Application
• On Linux during normal operation, each node will have one VIP address. For example:
[root@server3]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:11:D8:58:05:99
          inet addr:192.168.2.103  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::211:d8ff:fe58:599/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6814 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10326 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:684579 (668.5 KiB)  TX bytes:1449071 (1.3 MiB)
          Interrupt:217 Base address:0x8800
eth0:1    Link encap:Ethernet  HWaddr 00:11:D8:58:05:99
          inet addr:192.168.2.203  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:217 Base address:0x8800
• The resource for the VIP address 192.168.2.203 is initially running on server3
Oracle Clusterware VIP (Virtual IP) Node Application
• If Oracle Clusterware on server3 is shut down, the VIP resource is transferred to another node (in this case server11):
[root@server11]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1D:7D:A3:0A:55
          inet addr:192.168.2.111  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:7dff:fea3:a55/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2792 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4097 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:329891 (322.1 KiB)  TX bytes:593615 (579.7 KiB)
          Interrupt:177 Base address:0x2000
eth0:1    Link encap:Ethernet  HWaddr 00:1D:7D:A3:0A:55
          inet addr:192.168.2.211  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:177 Base address:0x2000
eth0:2    Link encap:Ethernet  HWaddr 00:1D:7D:A3:0A:55
          inet addr:192.168.2.203  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:177 Base address:0x2000
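• The placement of the node applications can also be confirmed with SRVCTL; a sketch for the surviving node in the example above (the exact output format varies by release):
[oracle@server11]$ srvctl status nodeapps -n server11
VIP is running on node: server11
GSD is running on node: server11
Listener is running on node: server11
ONS daemon is running on node: server11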
Oracle Clusterware VIP Failover
• VIP addresses can occasionally remain on the wrong node after a failover. In this example ora.server3.vip is running on server11 and is manually relocated back to server3:
HA Resource          Target       State
-----------          ------       -----
ora.server11.vip     application  ONLINE on server11
ora.server12.vip     application  ONLINE on server12
ora.server3.vip      application  ONLINE on server11
ora.server4.vip      application  ONLINE on server4

[root@server3]# ./crs_relocate ora.server3.vip -c server3
Attempting to stop `ora.server3.vip` on member `server11`
Stop of `ora.server3.vip` on member `server11` succeeded.
Attempting to start `ora.server3.vip` on member `server3`
Start of `ora.server3.vip` on member `server3` succeeded.

HA Resource          Target       State
-----------          ------       -----
ora.server11.vip     application  ONLINE on server11
ora.server12.vip     application  ONLINE on server12
ora.server3.vip      application  ONLINE on server3
ora.server4.vip      application  ONLINE on server4
Oracle Clusterware Logging
• In Oracle 10.2, Oracle Clusterware log files are created in the $CRS_HOME/log directory
  • can be located on shared storage
• The $CRS_HOME/log directory contains a subdirectory for each node e.g. $CRS_HOME/log/server6
• The $CRS_HOME/log/<node> directory contains:
  • Oracle Clusterware alert log e.g. alertserver6.log
  • client - logfiles for OCR applications including CLSCFG, CSS, OCRCHECK, OCRCONFIG, OCRDUMP and OIFCFG
  • crsd - logfiles for CRS daemon including crsd.log
  • cssd - logfiles for CSS daemon including ocssd.log
  • evmd - logfiles for EVM daemon including evmd.log
  • racg - logfiles for node applications including VIP and ONS
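• These logs can be inspected with ordinary operating system tools; a sketch assuming node server6 as above and $ORA_CRS_HOME pointing to the Clusterware home:
• List the most recently modified CRS daemon logs:
ls -lrt $ORA_CRS_HOME/log/server6/crsd
• Follow the CRS daemon log while reproducing a problem:
tail -f $ORA_CRS_HOME/log/server6/crsd/crsd.log
• Scan the Clusterware alert log for recent errors:
grep -i error $ORA_CRS_HOME/log/server6/alertserver6.log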
Oracle Clusterware Log Files
• Log file locations in $ORA_CRS_HOME:
$ORA_CRS_HOME/log/<nodename>/
    alert<nodename>.log
    client/
    crsd/
    cssd/
    evmd/
    racg/
        racgeut/
        racgimon/
        racgmain/
Oracle Clusterware Log Files
• Log file locations in $ORACLE_HOME (RDBMS and ASM):
$ORACLE_HOME/log/<nodename>/
    client/
    racg/
        racgeut/
        racgimon/
        racgmain/
        racgmdb/
Oracle Clusterware Troubleshooting
• If the OCR or voting disk is not available, error files may be created in /tmp e.g. /tmp/crsctl.4038
• For example, if the OCR cannot be found:
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage
Operating System error [No such file or directory] [2]
• OCR is inaccessible - no CRS daemons will start
• No errors written to log files
• If the Voting Disk has incorrect ownership:
clsscfg_vhinit: unable(1) to open disk (/dev/raw/raw2)
Internal Error Information:
  Category: 1234
  Operation: scls_block_open
  Location: statfs
  Other: statfs failed /dev/raw/raw2
  Dep: 2
Failure 1 checking the Cluster Synchronization Services voting disk '/dev/raw/raw2'.
Not able to read adequate number of voting disks
Oracle Clusterware racgwrap
• Script called on each node by SRVCTL to control resources
• Copy of script in each Oracle home:
  • $ORA_CRS_HOME/bin/racgwrap
  • $ORA_ASM_HOME/bin/racgwrap
  • $ORACLE_HOME/bin/racgwrap
• Sets environment variables
• Invokes racgmain executable
• Generated from racgwrap.sbs
• Differs in each home
  • Sets $ORACLE_HOME and $ORACLE_BASE environment variables for racgmain
  • Also sets $LD_LIBRARY_PATH
• Enable trace by setting _USR_ORA_DEBUG to 1
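• A sketch of enabling this trace by editing racgwrap (the exact position of the surrounding lines varies between releases, so treat this as illustrative); add the following before racgmain is invoked:
_USR_ORA_DEBUG=1
export _USR_ORA_DEBUG
• The additional trace typically appears in the racg log directories described earlier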
Oracle Clusterware racgwrap
• On Unix systems the Oracle SGA is located in one or more operating system shared memory segments
• Each segment is identified by a shared memory key
  • Shared memory key is generated by the application
• Each shared memory key maps to a shared memory ID
  • Shared memory ID is generated by the operating system
• Shared memory segments can be displayed using ipcs -m:
[root@server3]# ipcs -m
------ Shared Memory Segments --------
key        shmid   owner   perms  bytes       nattch  status
0x8a48ff44 131072  oracle  640    94371840    20
0x17d04568 163841  oracle  660    2099249152  246
• Oracle generates the shared memory key from the values of:
  • $ORACLE_HOME
  • $ORACLE_SID
Oracle Clusterware racgwrap
• If the instance is currently running e.g.
[oracle@server3]$ ps -ef | grep pmon_PROD1
oracle 8653 1 0 16:13 ? 00:00:00 ora_pmon_PROD1
• But SQL*Plus cannot connect to the instance:
[oracle@server3]$ export ORACLE_SID=PROD1
[oracle@server3]$ sqlplus / as sysdba
...
Connected to idle instance
• Compare the $ORACLE_HOME environment variable to the ORACLE_HOME variable in $ORACLE_HOME/bin/racgwrap:
[oracle@server3]$ echo $ORACLE_HOME
/u01/app/oracle/product/10.2.0/db_1
[oracle@server3]$ grep "^ORACLE_HOME" $ORACLE_HOME/bin/racgwrap
ORACLE_HOME=/u01/app/oracle/product/10.2.0/db_1/
• Here the trailing slash in racgwrap means the two settings generate different shared memory keys, so the manual session attaches to a different (idle) SGA
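• A minimal sketch of a fix, assuming the trailing slash is the only difference (the database name PROD is inferred from the instance name and is illustrative): make the value in racgwrap identical to the environment setting, then restart the instance through Clusterware
• In $ORACLE_HOME/bin/racgwrap change:
ORACLE_HOME=/u01/app/oracle/product/10.2.0/db_1/
• to:
ORACLE_HOME=/u01/app/oracle/product/10.2.0/db_1
• Then restart the instance so racgmain picks up the corrected value:
srvctl stop instance -d PROD -i PROD1
srvctl start instance -d PROD -i PROD1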
Oracle Clusterware Process Monitor (OPROCD)
• Process Monitor Daemon
  • Provides Cluster I/O Fencing
  • Implemented on Unix systems
  • Not required with third-party clusterware
  • Implemented in Linux in 10.2.0.4 and above
    • In 10.2.0.3 and below the hangcheck timer module is used
  • Provides hangcheck timer functionality to maintain cluster integrity
• Behaviour similar to hangcheck timer:
  • Runs as root
  • Locked in memory
  • Failure causes reboot of system
  • See /etc/init.d/init.cssd for operating system reboot commands
Oracle Clusterware Process Monitor (OPROCD)
• OPROCD takes two parameters:
  • -t - Timeout value
    • Length of time between executions (milliseconds)
    • Normally defaults to 1000
  • -m - Margin
    • Acceptable margin before rebooting (milliseconds)
    • Normally defaults to 500
• Parameters are specified in /etc/init.d/init.cssd:
  • OPROCD_DEFAULT_TIMEOUT=1000
  • OPROCD_DEFAULT_MARGIN=500
• Contact Oracle Support before changing these values
Oracle Clusterware Process Monitor (OPROCD)
• /etc/init.d/init.cssd can increase OPROCD_DEFAULT_MARGIN based on two CSS variables:
  • reboottime (mandatory)
  • diagwait (optional)
• Values for these can be obtained using:
[root@server3]# crsctl get css reboottime
[root@server3]# crsctl get css diagwait
• Both values are reported in seconds
• The algorithm is:
If diagwait > reboottime then OPROCD_DEFAULT_MARGIN := (diagwait - reboottime) * 1000
• Therefore increasing diagwait will reduce the frequency of reboots e.g.
[root@server3]# crsctl set css diagwait 13
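• A sketch of how the diagwait change is typically applied; Oracle Support documents the exact procedure for each release (including whether the -force option is needed while the stack is down), so treat this as illustrative:
• Stop Oracle Clusterware on every node as root:
crsctl stop crs
• On one node, set and verify the new value:
crsctl set css diagwait 13 -force
crsctl get css diagwait
• Restart Oracle Clusterware on every node:
crsctl start crs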
Oracle Clusterware Heartbeats
• CSS maintains two heartbeats:
  • Network heartbeat across interconnect
  • Disk heartbeat to voting device
• Disk heartbeat has an internal I/O timeout (in seconds)
  • Varies between releases
• In Oracle 10.2.0.2 and above the disk heartbeat timeout can be specified by the CSS disktimeout parameter:
  • Maximum time allowed for a voting file I/O to complete
  • If exceeded, the file is marked offline
  • Defaults to 200 seconds
crsctl get css disktimeout
crsctl set css disktimeout <value>
Oracle Clusterware Heartbeats
• Network heartbeat timeout can be specified by the CSS misscount parameter
• Default values (Oracle Clusterware 10.1 and 10.2) are:
• Default value for vendor clusterware is 600 seconds
crsctl get css misscount
crsctl set css misscount <value>
Oracle Clusterware Heartbeats
• Relationship between internal I/O timeout (IOT), MISSCOUNT and DISKTIMEOUT varies between releases
Oracle Clusterware Heartbeats
• If disktimeout is supported, CSS will not evict a node from the cluster when I/O to the voting disk takes more than MISSCOUNT seconds, except:
  • during initial cluster formation
  • shortly before a reconfiguration
• Nodes will not be evicted as long as voting disk operations are completed within DISKTIMEOUT seconds
Oracle Clusterware CRSCTL
• CRSCTL can also be used to enable and disable Oracle Clusterware
• To enable Clusterware use:
# crsctl enable crs
• To disable Clusterware use:
# crsctl disable crs
• These commands update the following file:
/etc/oracle/scls_scr/<node>/root/crsstart
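• The effect can be verified by inspecting the file directly; a sketch for node server3 (the exact contents may differ slightly between releases):
# cat /etc/oracle/scls_scr/server3/root/crsstart
enable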
Oracle Clusterware CRSCTL
• In Oracle 10.2, CRSCTL can be used to check the current state of the Oracle Clusterware daemons
• To check the current state of all Oracle Clusterware daemons:
# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
• To check the current state of individual Oracle Clusterware daemons:
# crsctl check cssd
CSS appears healthy
# crsctl check crsd
CRS appears healthy
# crsctl check evmd
EVM appears healthy
Oracle Clusterware CRSCTL
• CRSCTL can be used to manage the CSS voting disk
• To check the current location of the voting disk use:
# crsctl query css votedisk
0.  0  /dev/raw/raw3
1.  0  /dev/raw/raw4
2.  0  /dev/raw/raw5
• To add a new voting disk use:
# crsctl add css votedisk <path_name>
• To delete an existing voting disk use:
# crsctl delete css votedisk <path_name>
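• In Oracle 10.2 the voting disk configuration can normally only be changed while Oracle Clusterware is stopped on all nodes, in which case the -force option is required; an illustrative sketch with a hypothetical device name:
# crsctl add css votedisk /dev/raw/raw6 -force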
Oracle Clusterware Debugging
• In Oracle 10.2 and above, Oracle Clusterware debugging can be enabled and disabled for:
  • CRS
  • CSS
  • EVM
  • Resources
  • Subcomponents
• Debugging can be controlled:
  • statically using environment variables
  • dynamically using CRSCTL
• Debug settings can be persisted in the OCR for use in subsequent restarts
Oracle Clusterware Debugging
• To list modules available for debugging use:
# crsctl lsmodules crs
# crsctl lsmodules css
# crsctl lsmodules evm
• In Oracle 11.1 modules include:
Oracle Clusterware Debugging
• To debug individual modules use:
# crsctl debug log crs <module>:<level>[,<module>:<level>]
• For example:
# crsctl debug log crs "CRSCOMM:2,COMMCRS:2,COMMNS:2"
Set CRSD Debug Module: CRSCOMM  Level: 2
Set CRSD Debug Module: COMMCRS  Level: 2
Set CRSD Debug Module: COMMNS  Level: 2
• Values only apply for the current node
• Stored within the OCR in SYSTEM.crs.debug.<node>.<module>
• For example:
# ocrdump -stdout -keyname SYSTEM.crs.debug.vm1.CRSCOMM
• Log will be written to:
$ORA_CRS_HOME/log/<node>/crsd/crsd.log
Oracle Clusterware Debugging
• To debug an individual resource use:
# crsctl debug log res <resource>:<level>
• For example:
# crsctl debug log res ora.vm1.vip:5
Set Resource Debug Module: ora.vm1.vip  Level: 5
• To disable debugging again set level 0 e.g.:
# crsctl debug log res ora.vm1.vip:0
Set Resource Debug Module: ora.vm1.vip  Level: 0
• OCR debug value is stored in USR_ORA_DEBUG
• To check the current debug value set in the OCR for ora.vm1.vip use:
# ocrdump -stdout -keyname \
  CRS.CUR.ora\!vm1\!vip.USR_ORA_DEBUG
• Log will be written to:
$ORA_CRS_HOME/log/<node>/racg/<resource>.log
Oracle Clusterware Debugging
• Debugging for CRSD and EVMD can also be configured using environment variables
• To enable tracing for all modules use ORA_CRSDEBUG_ALL
• For example:
# export ORA_CRSDEBUG_ALL=5
• To enable tracing for individual modules use ORA_CRSDEBUG_<module>
• For example:
# export ORA_CRSDEBUG_CRSOCR=5
• Note that these environment variables have not been implemented in OCSSD or OPROCD
Oracle Clusterware Debugging
• In Oracle 10.1 and above debugging can also be configured in:
$ORA_CRS_HOME/srvm/admin/ocrlog.ini
• By default this file contains:
# "mesg_logging_level" is the only supported parameter currently.
# level 0 means minimum logging. Only error conditions are logged
mesg_logging_level = 0

# The last appearance of a parameter will override the previous value.
# For example, log level will become 3 when the following value is uncommented.
# Change to log level 3 for detailed logging from Oracle Cluster Registry
# mesg_logging_level = 3

# Component log and trace level specification template
#comploglvl="comp1:3;comp2:4"
#comptrclvl="comp1:2;comp2:1"
• Component level logging can be configured in this file e.g.:
comploglvl="OCRAPI:5;OCRCLI:5;OCRSRV:5;OCRMAS:5;OCRCAC:5"
Oracle Clusterware Debugging
• Component level logging can also be configured in the OCR
• For example:
crsctl debug log crs "OCRAPI:5,OCRCLI:5,OCRSRV:5,OCRMAS:5,OCRCAC:5"
• Components include:
  • OCRAPI - OCR Abstraction Component
  • OCRCAC - OCR Cache Component
  • OCRCLI - OCR Client Component
  • OCRMAS - OCR Master Thread Component
  • OCRMSG - OCR Message Component
  • OCRSRV - OCR Server Component
  • OCRUTL - OCR Util Component
Oracle Clusterware Debugging
• CRSCTL can also generate state dumps:
crsctl debug statedump crs
crsctl debug statedump css
crsctl debug statedump evm
• CSS dump is written to:
$ORA_CRS_HOME/log/<node>/cssd/ocssd.log
• Dump contents can be made more readable e.g.:
cut -c58- < ocssd.log > ocssd.dmp
Oracle Clusterware OLSNODES
• The olsnodes utility lists the nodes in the cluster
• With no arguments olsnodes lists the node names e.g.
$ olsnodes
london1
london2
• In Oracle 10.2 and above, with the -p argument olsnodes lists node names and private interconnect names:
$ olsnodes -p
london1 london1-priv
london2 london2-priv
• In Oracle 10.2 and above, with the -i argument olsnodes lists node names and VIP addresses:
$ olsnodes -i
london1 london1-vip
london2 london2-vip
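• The -n argument lists node names together with their node numbers; a sketch of the expected output:
$ olsnodes -n
london1 1
london2 2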
Oracle Clusterware OCRCONFIG
• In Oracle 10.1 and above the OCRCONFIG utility performs various administrative operations on the OCR including:
  • displaying backup history
  • configuring backup location
  • restoring OCR from backup
  • exporting OCR
  • importing OCR
  • upgrading OCR
  • downgrading OCR
• In Oracle 10.2 and above OCRCONFIG can also:
  • manage OCR mirrors
  • overwrite OCR files
  • repair OCR files
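• A sketch of the export and import operations listed above (the file name is illustrative; an import requires Oracle Clusterware to be stopped on all nodes):
# ocrconfig -export /tmp/ocr_backup.exp
# ocrconfig -import /tmp/ocr_backup.exp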
Oracle Clusterware OCRCONFIG
• Options include:
Oracle Clusterware OCRCONFIG
• In Oracle 10.1 and above:
  • OCR is automatically backed up every four hours
  • Previous three backup copies are retained
  • Backup copy retained from end of previous day
  • Backup copy retained from end of previous week
• Check node, times and location of previous backups using the showbackup option of OCRCONFIG e.g.
# ocrconfig -showbackup
london1  2005/08/04 11:15:29  /u01/app/oracle/product/10.2.0/crs/cdata/crs
london1  2005/08/03 22:24:32  /u01/app/oracle/product/10.2.0/crs/cdata/crs
london1  2005/08/03 18:24:32  /u01/app/oracle/product/10.2.0/crs/cdata/crs
london1  2005/08/02 18:24:32  /u01/app/oracle/product/10.2.0/crs/cdata/crs
london1  2005/07/31 18:24:32  /u01/app/oracle/product/10.2.0/crs/cdata/crs
• ENSURE THAT YOU COPY THE PHYSICAL BACKUPS TO TAPE AND/OR REDUNDANT STORAGE
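• A sketch of one way to do this (the destination path is illustrative): copy the contents of the backup directory to redundant storage, and optionally relocate the automatic backups with -backuploc:
# cp /u01/app/oracle/product/10.2.0/crs/cdata/crs/*.ocr /backup/ocr/
# ocrconfig -backuploc /u02/oradata/ocr_backup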
Oracle Clusterware OCRCONFIG
• In Oracle 11.1 and above the OCR can be backed up manually using:
# ocrconfig -manualbackup
• Backups will be written to the location specified by:
# ocrconfig -backuploc <directory_name>
• Manual backups can be listed using:
# ocrconfig -showbackup manual
• Automatic backups can be listed using:
# ocrconfig -showbackup auto
Oracle Clusterware OCRCONFIG
• To restore the OCR from a physical backup copy:
  • Check you have a suitable backup using:
# ocrconfig -showbackup
  • Stop Oracle Clusterware on each node using:
# crsctl stop crs
  • Restore the backup file using:
# ocrconfig -restore <filename>
  • For example:
# ocrconfig -restore $ORA_CRS_HOME/cdata/crs/backup00.ocr
  • Start Oracle Clusterware on each node using:
# crsctl start crs
Oracle Clusterware OCRCHECK
• In Oracle 10.1 and above, you can verify the configuration of the OCR using the OCRCHECK utility:
# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     262144
         Used space (kbytes)      :       7752
         Available space (kbytes) :     254392
         ID                       : 1093363319
         Device/File Name         : /dev/raw/raw1
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/raw/raw2
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded
• In Oracle 10.1 this utility does not print the ID and Device/File Name information