VMware vCenter Site Recovery Manager 5.x with EMC VNX Arrays & MirrorView
By Dave O’Sullivan David.C.OSullivan@emc.com
Intended Audience:
• VNX Block
• CLARiiON Block
• This training gives an overview of SRM and explains how the relevant software plugins and hardware all interact with each other.
• It will cover:
  • Pre-requisites
  • Design
  • Test Failover / Failover / Recovery [DEMO]
  • Required Logs
  • Troubleshooting
Assumptions:
• You are familiar [not expert] with:
  • VMs!
  • vCenter
  • MirrorView/A and MirrorView/S
  • VNX / CLARiiON arrays
  • SRA(s) ???
• Before the customer does any work with SRM:
  • MirrorView is working (check zoning)
  • All the appropriate software is installed [including enablers]
What is SRM?
• SRM ensures the simplest and most reliable disaster protection for all virtualized applications.
• Site recovery plans can be tested non-disruptively as frequently as required to ensure that they meet business objectives.
• At the time of a site failover or migration, Site Recovery Manager automates both failover and failback processes, ensuring fast and highly predictable recovery point objectives (RPOs) and recovery time objectives (RTOs).
Pre-requisites
• SRM is heavily reliant on DNS, so DNS must be fully set up and all hosts must be resolvable in both directions.
• IP connectivity between all SPs / VC / ESX on both sites.
• SRM is also reliant on databases; in this setup there are 4 in total:
  • 1 DB for VC
  • 1 DB for SRM
  • This applies to both sites.
• This doc covers the DB setup in full detail:
  • Virtual How to Install and Configure SQL Express 2005 For use with Site Recovery Manager V4
  • Rob Nourse, Sr. Consultant, VMware Consulting Services
  • http://communities.vmware.com/servlet/JiveServlet/download/11547-1-32136/Install%20%26%20Configure%20SQL%20Express%20for%20use%20with%20SRM4%20v1.3.pdf
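The DNS pre-requisite can be spot-checked before any SRM work starts. The following is only a sketch: it assumes that "both directions" means a forward lookup followed by a reverse lookup, and the name `check_dns` is made up for illustration. The resolver functions are passed in as parameters so the logic can be exercised without a live DNS server.

```python
import socket


def check_dns(hostname, resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward and then a reverse lookup of hostname."""
    try:
        ip = resolve(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup failed: {exc}"
    try:
        name, aliases, _addresses = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {ip} failed: {exc}"
    known = {name.lower()} | {alias.lower() for alias in aliases}
    if hostname.lower() not in known:
        return False, f"{ip} resolves back to {name}, not {hostname}"
    return True, f"{hostname} <-> {ip}"
```

Call `check_dns("your-host-name")` for each vCenter, SRM, ESX and SP name on both sites. A short name that reverse-resolves to a fully qualified name is reported as a mismatch, so pass the same form of the name that DNS returns.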
Design: • ….
Design considerations: MirrorView
• For SRM failover to work, the “protected” VMs must be located on a LUN that is replicated from the production site to the DR site.
• This is handled by MirrorView/A or MirrorView/S (we are using A in this setup).
• Below is the LUN info for my setup:
Design considerations: MirrorView Zoning
• For MirrorView to work, we need to ensure that the appropriate ports are zoned together.
• So in this setup, the FC ports used for MirrorView are zoned to the opposite array.
Design considerations: MirrorView
• The LUN is created first on the Prod (Athena) side.
• Then use the MirrorView option “create secondary mirror” and follow the wizard through.
• I used the Mirror Wizard to complete this task.
Design considerations: MirrorView
• When it’s working, it should look like this:
Design considerations: Reserved LUN Pool
• You must add LUNs with adequate capacity to the Reserved LUN Pool before you proceed.
• This will be used when the SRA calls a snapshot for the SRM failover test (only!)
Design considerations:
• That’s pretty much it on the VNX side.
• Once MirrorView is up and running, you should be good to go with the SRM Windows / VMware side of the setup.
• Next: what is the SRA?
SRA [Storage Replication Adapters]
• What is the SRA?
• The SRA is a Windows .exe installed on the same Windows box as SRM as part of the SRM setup.
• vCenter “talks” to the SRA -> the SRA sends navi commands to the array.
• This is why naviseccli must be installed on the same box as SRM (check the path!).
• Each array vendor has its own SRAs.
• The SRAs are EMC code, so we support them!
SRAs for SRM 5.x
For the full list of storage replication adapters supported by SRM 5.x, see http://www.vmware.com/resources/compatibility/search.php?deviceCategory=sra.
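The "check path!" reminder can be turned into a quick test on the SRM server. This is a minimal sketch using Python's standard `shutil.which`; the helper names `find_cli` and `path_report` are hypothetical, and it assumes the SRA locates naviseccli through the normal executable search path.

```python
import shutil


def find_cli(name="naviseccli", path=None):
    """Return the full path of the executable, or None if it is not on the search path."""
    return shutil.which(name, path=path)


def path_report(name="naviseccli", path=None):
    """Build a one-line, human-readable result for the lookup."""
    location = find_cli(name, path)
    if location is None:
        return f"{name} not found on the search path"
    return f"{name} found at {location}"


print(path_report())
```

Run it as the same Windows account that the SRM service uses, since the search path can differ between accounts.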
SRA [Storage Replication Adapters]
• These are the most current supported EMC SRAs:
SRA [Storage Replication Adapters]
• On both sites, the following should be installed:
• Note that there are 2 SRAs:
  • VNX SRA for vCenter
  • MirrorView enabler for VNX SRA
• As we are only doing block replication, we only need the MirrorView enabler.
• NFS replication is also possible using the EMC_VNX_Replicator_Enabler_for_VNX_SRA_v5.0.xx
Test Failover – sequence of events
We will look at what happens on both sites concerning:
• SRM
• VNX
• ESX
Test Failover – sequence of events [PROD]
• Be sure to check the output from:
• Recovery Plan History Report, VMware Site Recovery Manager 5.0
Test Failover – sequence of events [PROD]
- This is a cosmetic issue which *should* be fixed in later versions of SRM:
Warning: Failed to update embedded paths in virtual machine file '/vmfs/volumes/507432e1-3a92a2a4-027e-b8ac6f866cc6/2008-1/2008-1_1.vmdk'. A general system error occurred: No such device
Failed to update embedded paths in virtual machine file '/vmfs/volumes/507432e1-3a92a2a4-027e-b8ac6f866cc6/2008-2/2008-2_1.vmdk'. A general system error occurred: No such device
Test Failover – sequence of events [PROD ESX]
- Some fairly serious-looking errors appear in the vmkernel log on the prod ESX host; these can be ignored.
)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338:Path "vmhba2:C0:T1:L3" Peer SP is hung.
)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338:Path "vmhba3:C0:T0:L3" Peer SP is hung.
)ALERT: NMP: vmk_NmpVerifyPathUID:1166:The physical media represented by device naa.600601609da02e0012ce6a8f930de211 (path vmhba3:C0:T1:L9) has changed. If this is a data LUN, this is a critical error. Detecte[0$
)ALERT: NMP: vmk_NmpVerifyPathUID:1166:The physical media represented by device naa.600601609da02e0012ce6a8f930de211 (path vmhba2:C0:T0:L9) has changed. If this is a data LUN, this is a critical error. Detecte[0$
)NMP: nmp_DeviceUpdatePathStates:547: Activated path "vmhba2:C0:T1:L9" for NMP device "naa.600601609da02e0012ce6a8f930de211".
- Watch out for messages like these; customers could open cases based on these errors alone…
Test Failover – sequence of events [DR VNX]
• A 10/09/12 14:20:23 SnapCopy 7100000a Snapshot Logical Unit device CopyDisk0000 has been created.
• B 10/09/12 14:20:23 SnapCopy 7100000a Snapshot Logical Unit device CopyDisk0000 has been created.
• A 10/09/12 14:20:24 4600 'Create a SnapShot LU' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Success (Successfully created SnapShot LU.)
• A 10/09/12 14:20:27 4600 '' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Success (Started SnapView session successfully. Session name - async-25_SRM-TEST-FAILOVER_session)
• A 10/09/12 14:20:27 SnapCopy 71000003 SnapView persistent session async-25_SRM-TEST-FAILOVER_session has been started on LUN 25.
• B 10/09/12 14:20:27 SnapCopy 71000003 SnapView persistent session async-25_SRM-TEST-FAILOVER_session has been started on LUN 25.
• A 10/09/12 14:20:30 4600 'Activate' called by 'admin' (10.64.29.93) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11' with result: Success (Successfully activated snapshot LU: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11 (async-25_SRM-TEST-FAILOVER_session))
• A 10/09/12 14:20:35 4600 'ExecuteClientRequest' called by ' Navi User admin' (10.64.29.93) on 'CLIFeature' (Result: Success). snapview -storagegroup -addsnapshot -gname SG_dellpr710-g.emcvmw.ctc -hlu 9 -snapshotname async-25_SRM-TEST-FAILOVER -compatibilitymode called by 'admin'
• A 10/09/12 14:20:39 RemoteMirror 71050115 MirrorView quiesce LU request.
• A 10/09/12 14:20:39 RemoteMirror 71050139 RM_ADMIN_INFO_WILL_REBIND the object.
• A 10/09/12 14:20:39 RemoteMirror 71050111 MirrorView rebind request for LUN 60060160b9502e00:a4cc7fc81507e211.
• A 10/09/12 14:20:39 RemoteMirror 71050115 MirrorView quiesce LU request.
• A 10/09/12 14:20:39 SnapCopy 71000006 SnapView has been bound to device 00000017.
• B 10/09/12 14:20:39 RemoteMirror 71050136 Quiesce request from peer SP.
• B 10/09/12 14:20:39 RemoteMirror 71050136 Quiesce request from peer SP.
• B 10/09/12 14:20:39 RemoteMirror 71050120 Rebind request from peer SP for LUN 60060160b9502e00:a4cc7fc81507e211.
• B 10/09/12 14:20:39 SnapCopy 71000006 SnapView has been bound to device Disk0001.
• B 10/09/12 14:20:39 RemoteMirror 71050120 Rebind request from peer SP for LUN 60060160b9502e00:a4cc7fc81507e211.
• A 10/09/12 14:20:40 4600 'Create a SnapShot LU' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Success (Successfully created SnapShot LU.)
• A 10/09/12 14:20:40 SnapCopy 7100000a Snapshot Logical Unit device CopyDisk0001 has been created.
• B 10/09/12 14:20:40 SnapCopy 7100000a Snapshot Logical Unit device CopyDisk0001 has been created.
• A 10/09/12 14:20:43 4600 '' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Success (Started SnapView session successfully. Session name - sync-24_SRM-TEST-FAILOVER_session)
• A 10/09/12 14:20:43 Bus1 Enc0 DskE 60a A logical unit has been enabled [ALU 2] 0 ffff0000 2100e
• A 10/09/12 14:20:43 SnapCopy 71000003 SnapView persistent session sync-24_SRM-TEST-FAILOVER_session has been started on LUN 24.
Test Failover – sequence of events [DR VNX]
• B 10/09/12 14:20:43 Bus1 Enc0 DskE 606 Unit Shutdown for Trespass [ALU 2] 0 ffff0000 2100e
• B 10/09/12 14:20:43 SnapCopy 71000003 SnapView persistent session sync-24_SRM-TEST-FAILOVER_session has been started on LUN 24.
• A 10/09/12 14:20:46 4600 'Activate' called by 'admin' (10.64.29.93) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11' with result: Success (Successfully activated snapshot LU: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11 (sync-24_SRM-TEST-FAILOVER_session))
• A 10/09/12 14:20:51 4600 'ExecuteClientRequest' called by ' Navi User admin' (10.64.29.93) on 'CLIFeature' (Result: Success). snapview -storagegroup -addsnapshot -gname SG_dellpr710-g.emcvmw.ctc -hlu 10 -snapshotname sync-24_SRM-TEST-FAILOVER -compatibilitymode called by 'admin'
• A 10/09/12 14:20:55 RemoteMirror 71050115 MirrorView quiesce LU request.
• A 10/09/12 14:20:55 RemoteMirror 71050111 MirrorView rebind request for LUN 60060160b9502e00:fe606cb30d03e211.
• A 10/09/12 14:20:55 RemoteMirror 71050115 MirrorView quiesce LU request.
• A 10/09/12 14:20:55 SnapCopy 71000006 SnapView has been bound to device 0000000f.
• A 10/09/12 14:20:55 RemoteMirror 71050139 RM_ADMIN_INFO_WILL_REBIND the object.
• B 10/09/12 14:20:55 RemoteMirror 71050136 Quiesce request from peer SP.
• B 10/09/12 14:20:55 RemoteMirror 71050120 Rebind request from peer SP for LUN 60060160b9502e00:fe606cb30d03e211.
• B 10/09/12 14:20:55 RemoteMirror 71050136 Quiesce request from peer SP.
• B 10/09/12 14:20:55 SnapCopy 71000006 SnapView has been bound to device Disk0002.
• B 10/09/12 14:20:55 RemoteMirror 71050120 Rebind request from peer SP for LUN 60060160b9502e00:fe606cb30d03e211.
• A 10/09/12 14:20:56 4600 'Create a SnapShot LU' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Success (Successfully created SnapShot LU.)
• A 10/09/12 14:20:56 SnapCopy 7100000a Snapshot Logical Unit device CopyDisk0002 has been created.
• B 10/09/12 14:20:56 SnapCopy 7100000a Snapshot Logical Unit device CopyDisk0002 has been created.
• A 10/09/12 14:20:59 4600 '' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Success (Started SnapView session successfully. Session name - sync-0_SRM-TEST-FAILOVER_session)
• A 10/09/12 14:20:59 SnapCopy 71000003 SnapView persistent session sync-0_SRM-TEST-FAILOVER_session has been started on LUN 0.
• A 10/09/12 14:20:59 Bus1 Enc0 DskE 60a A logical unit has been enabled [ALU 3] 0 ffff0001 3100e
• B 10/09/12 14:20:59 Bus1 Enc0 DskE 606 Unit Shutdown for Trespass [ALU 3] 0 ffff0001 3100e
• B 10/09/12 14:20:59 SnapCopy 71000003 SnapView persistent session sync-0_SRM-TEST-FAILOVER_session has been started on LUN 0.
• A 10/09/12 14:21:02 4600 'Activate' called by 'admin' (10.64.29.93) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11' with result: Success (Successfully activated snapshot LU: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11 (sync-0_SRM-TEST-FAILOVER_session))
• A 10/09/12 14:21:06 4600 'ExecuteClientRequest' called by ' Navi User admin' (10.64.29.93) on 'CLIFeature' (Result: Success). snapview -storagegroup -addsnapshot -gname SG_dellpr710-g.emcvmw.ctc -hlu 11 -snapshotname sync-0_SRM-TEST-FAILOVER -compatibilitymode called by 'admin'
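The DR-side log above repeats the same pattern once per replicated LUN: a snapshot LU is created, a SnapView session is started, the snapshot is activated, and it is added to the storage group. A rough sketch of pulling the "session started" entries out of such a log follows. The regular expressions are assumptions based only on the entries shown here, the function name `sessions_started` is made up, and only SP A entries are counted because SP B logs the same event.

```python
import re

# Layout assumed from the entries above: "<SP> <MM/DD/YY> <HH:MM:SS> <text>"
ENTRY = re.compile(r"^(?P<sp>[AB]) (?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$")
STARTED = re.compile(r"SnapView persistent session (\S+) has been started on LUN (\d+)\.")


def sessions_started(lines):
    """Return [(time, session_name, lun)] taken from SP A entries only."""
    found = []
    for line in lines:
        entry = ENTRY.match(line.lstrip("• ").rstrip())
        if not entry or entry.group("sp") != "A":
            continue
        hit = STARTED.search(entry.group("text"))
        if hit:
            found.append((entry.group("time"), hit.group(1), int(hit.group(2))))
    return found


# Entries copied from the DR VNX log above.
sample = [
    "• A 10/09/12 14:20:27 SnapCopy 71000003 SnapView persistent session async-25_SRM-TEST-FAILOVER_session has been started on LUN 25.",
    "• B 10/09/12 14:20:27 SnapCopy 71000003 SnapView persistent session async-25_SRM-TEST-FAILOVER_session has been started on LUN 25.",
    "• A 10/09/12 14:20:43 SnapCopy 71000003 SnapView persistent session sync-24_SRM-TEST-FAILOVER_session has been started on LUN 24.",
    "• A 10/09/12 14:20:59 SnapCopy 71000003 SnapView persistent session sync-0_SRM-TEST-FAILOVER_session has been started on LUN 0.",
]

for row in sessions_started(sample):
    print(row)
```

One row per protected LUN is what a clean test failover should produce in this setup; a missing row points at the LUN whose snapshot step failed.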
Cleanup – sequence of events [PROD ESX]
- Some fairly serious-looking errors appear in the vmkernel log on the prod ESX host; these can be ignored.
)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338:Path "vmhba2:C0:T1:L3" Peer SP is hung.
)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338:Path "vmhba3:C0:T0:L3" Peer SP is hung.
)ALERT: NMP: vmk_NmpVerifyPathUID:1166:The physical media represented by device naa.600601609da02e0012ce6a8f930de211 (path vmhba3:C0:T1:L9) has changed. If this is a data LUN, this is a critical error. Detecte[0$
)ALERT: NMP: vmk_NmpVerifyPathUID:1166:The physical media represented by device naa.600601609da02e0012ce6a8f930de211 (path vmhba2:C0:T0:L9) has changed. If this is a data LUN, this is a critical error. Detecte[0$
)NMP: nmp_DeviceUpdatePathStates:547: Activated path "vmhba2:C0:T1:L9" for NMP device "naa.600601609da02e0012ce6a8f930de211".
- Watch out for messages like these; customers could open cases based on these errors alone…
Cleanup – sequence of events [DR VNX]
A 10/09/12 14:46:49 4600 'storagegroup' called by ' Navi User admin' (10.64.29.93) with result: Success (Navisphere CLI command: ' storagegroup -removesnapshot -o -gname SG_dellpr710-g.emcvmw.ctc -snapshotname async-25_SRM-TEST-FAILOVER ')
A 10/09/12 14:46:50 4600 'Stop' called by 'admin' (10.64.29.93) on 'Session Name: async-25_SRM-TEST-FAILOVER_session' with result: Success (Deactivated snapshot LU successfully: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11 (async-25_SRM-TEST-FAILOVER_session)Stopped session su
A 10/09/12 14:46:50 SnapCopy 71000004 SnapView session async-25_SRM-TEST-FAILOVER_session has been stopped on LUN 25 with status of 0.
B 10/09/12 14:46:50 SnapCopy 71000004 SnapView session async-25_SRM-TEST-FAILOVER_session has been stopped on LUN 25 with status of 0.
A 10/09/12 14:46:52 SnapCopy 7100000b Snapshot Logical Unit device CopyDisk0000 has been removed.
A 10/09/12 14:46:52 NaviCimom 71288021 Failing Command: Set LUN.
B 10/09/12 14:46:52 SnapCopy 7100000b Snapshot Logical Unit device CopyDisk0000 has been removed.
A 10/09/12 14:46:53 4600 'Destroy a SnapShot' called by 'admin' (10.64.29.93) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11' with result: Success (Destroy snapshot successfully: 60:06:01:60:54:50:2E:00:92:7D:58:76:1C:12:E2:11)
B 10/09/12 14:46:53 NaviCimom 71288021 Failing Command: Set LUN.
A 10/09/12 14:47:00 4600 'storagegroup' called by ' Navi User admin' (10.64.29.93) with result: Success (Navisphere CLI command: ' storagegroup -removesnapshot -o -gname SG_dellpr710-g.emcvmw.ctc -snapshotname sync-24_SRM-TEST-FAILOVER ')
A 10/09/12 14:47:01 4600 'Stop' called by 'admin' (10.64.29.93) on 'Session Name: sync-24_SRM-TEST-FAILOVER_session' with result: Success (Deactivated snapshot LU successfully: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11 (sync-24_SRM-TEST-FAILOVER_session)Stopped session succ
A 10/09/12 14:47:01 Bus1 Enc0 DskE 606 Unit Shutdown for Trespass [ALU 2] 0 ffff0000 2100e
A 10/09/12 14:47:01 SnapCopy 71000004 SnapView session sync-24_SRM-TEST-FAILOVER_session has been stopped on LUN 24 with status of 0.
B 10/09/12 14:47:01 SnapCopy 71000004 SnapView session sync-24_SRM-TEST-FAILOVER_session has been stopped on LUN 24 with status of 0.
B 10/09/12 14:47:01 Bus1 Enc0 DskE 60a A logical unit has been enabled [ALU 2] 0 ffff0000 2100e
A 10/09/12 14:47:02 SnapCopy 7100000b Snapshot Logical Unit device CopyDisk0001 has been removed.
B 10/09/12 14:47:02 SnapCopy 7100000b Snapshot Logical Unit device CopyDisk0001 has been removed.
A 10/09/12 14:47:03 4600 'Destroy a SnapShot' called by 'admin' (10.64.29.93) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11' with result: Success (Destroy snapshot successfully: 60:06:01:60:54:50:2E:00:32:0B:08:80:1C:12:E2:11)
A 10/09/12 14:47:03 RemoteMirror 71050115 MirrorView quiesce LU request.
A 10/09/12 14:47:03 RemoteMirror 71050115 MirrorView quiesce LU request.
A 10/09/12 14:47:03 RemoteMirror 71050139 RM_ADMIN_INFO_WILL_REBIND the object.
A 10/09/12 14:47:03 RemoteMirror 71050111 MirrorView rebind request for LUN 60060160b9502e00:a4cc7fc81507e211.
Cleanup – sequence of events [DR VNX]
B 10/09/12 14:47:03 RemoteMirror 71050136 Quiesce request from peer SP.
B 10/09/12 14:47:03 RemoteMirror 71050120 Rebind request from peer SP for LUN 60060160b9502e00:a4cc7fc81507e211.
B 10/09/12 14:47:03 RemoteMirror 71050136 Quiesce request from peer SP.
B 10/09/12 14:47:03 SnapCopy 71000007 SnapView has been unbound from device Disk0001.
B 10/09/12 14:47:03 RemoteMirror 71050120 Rebind request from peer SP for LUN 60060160b9502e00:a4cc7fc81507e211.
A 10/09/12 14:47:09 4600 'storagegroup' called by ' Navi User admin' (10.64.29.93) with result: Success (Navisphere CLI command: ' storagegroup -removesnapshot -o -gname SG_dellpr710-g.emcvmw.ctc -snapshotname sync-0_SRM-TEST-FAILOVER ')
A 10/09/12 14:47:10 4600 'Stop' called by 'admin' (10.64.29.93) on 'Session Name: sync-0_SRM-TEST-FAILOVER_session' with result: Success (Deactivated snapshot LU successfully: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11 (sync-0_SRM-TEST-FAILOVER_session)Stopped session succes
A 10/09/12 14:47:10 Bus1 Enc0 DskE 606 Unit Shutdown for Trespass [ALU 3] 0 ffff0001 3100e
A 10/09/12 14:47:10 SnapCopy 71000004 SnapView session sync-0_SRM-TEST-FAILOVER_session has been stopped on LUN 0 with status of 0.
B 10/09/12 14:47:10 SnapCopy 71000004 SnapView session sync-0_SRM-TEST-FAILOVER_session has been stopped on LUN 0 with status of 0.
B 10/09/12 14:47:10 Bus1 Enc0 DskE 60a A logical unit has been enabled [ALU 3] 0 ffff0001 3100e
A 10/09/12 14:47:12 4600 'Destroy a SnapShot' called by 'admin' (10.64.29.93) on 'SnapShot WWN: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11' with result: Success (Destroy snapshot successfully: 60:06:01:60:54:50:2E:00:54:39:77:89:1C:12:E2:11)
A 10/09/12 14:47:12 RemoteMirror 71050115 MirrorView quiesce LU request.
A 10/09/12 14:47:12 RemoteMirror 71050139 RM_ADMIN_INFO_WILL_REBIND the object.
A 10/09/12 14:47:12 RemoteMirror 71050111 MirrorView rebind request for LUN 60060160b9502e00:fe606cb30d03e211.
A 10/09/12 14:47:12 RemoteMirror 71050115 MirrorView quiesce LU request.
A 10/09/12 14:47:12 SnapCopy 7100000b Snapshot Logical Unit device CopyDisk0002 has been removed.
B 10/09/12 14:47:12 SnapCopy 7100000b Snapshot Logical Unit device CopyDisk0002 has been removed.
B 10/09/12 14:47:12 RemoteMirror 71050120 Rebind request from peer SP for LUN 60060160b9502e00:fe606cb30d03e211.
B 10/09/12 14:47:12 RemoteMirror 71050136 Quiesce request from peer SP.
B 10/09/12 14:47:12 SnapCopy 71000007 SnapView has been unbound from device Disk0002.
B 10/09/12 14:47:12 RemoteMirror 71050120 Rebind request from peer SP for LUN 60060160b9502e00:fe606cb30d03e211.
B 10/09/12 14:47:12 RemoteMirror 71050136 Quiesce request from peer SP.
Troubleshooting: Obtaining the correct logs
• Be sure to capture the SRM and SRA logs.
• Please use VMware KB 1009253, “Export system Logs”.
• Please complete these actions on both sites!
• http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1009253
Troubleshooting: Obtaining the correct logs
• If the issue is related to a Test Failover or an actual Failover, the exported log for the failed Recovery Plan will also be invaluable in troubleshooting. To generate the log export for the failed Recovery Plan:
• In the left pane, click Recovery Plans and select the Recovery Plan which had the issue.
• Select the Plan Name which is showing an Error in the Result column.
• On the Plan Name with the error, click the Export action to generate the report for the failed Test Failover or actual Failover.
• Save the file to your desktop and upload it with the SRM system logs.
Troubleshooting: Obtaining the correct logs
• The exported information will look like this. Take note of the timestamps, as these are what we will use to search through the SPCOLLECT.
• The errors listed here are extremely useful in the actual diagnosis of the issue.
Troubleshooting: log files of interest:
• There are 2 main folders of interest within the exported log bundles:
• The Logs folder, surprisingly enough:
• This contains all the activity from the SRM application on that particular site.
• Extract all .gz archives in case the errors you are searching for occurred a while back.
• The main file of interest in this folder is called “vmware-dr-XX”.
• Sort by date and review the most recent, or search for the timestamp obtained from the html page described on slide 20.
Troubleshooting: log files of interest:
• Please note that the dates in the SRM logs are in the following format:
  • 2012-09-02T09:10:48.508+01:00
• The dates in the exported .html page are in:
  • 2012-09-02 09:10:29 (UTC 0)
• So adjust accordingly when searching for errors across logs.
• In my example I’ll search through the SRM logs with:
  • 2012-09-02T09
• This will be a good starting point.
• For the Linux heads, this is what I’m using to make the logs more human readable:
  • grep "2012-09-02T09" vmware-dr-3*|grep -v "<" |less
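The "adjust accordingly" step is easy to get wrong by hand, so here is a small sketch that puts both timestamp styles on the same UTC footing before comparing them. It assumes the SRM log stamp carries its own offset (as in the +01:00 example) and that the .html stamp really is UTC; both function names are invented for this illustration.

```python
from datetime import datetime, timezone


def srm_log_time_utc(stamp):
    """Parse an SRM log stamp such as 2012-09-02T09:10:48.508+01:00 and convert it to UTC."""
    return datetime.fromisoformat(stamp).astimezone(timezone.utc)


def html_report_time_utc(stamp):
    """Parse a report stamp such as '2012-09-02 09:10:29', assuming it is already UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


print(srm_log_time_utc("2012-09-02T09:10:48.508+01:00"))
print(html_report_time_utc("2012-09-02 09:10:29"))
```

Both functions return timezone-aware values, so they can be subtracted or compared directly when lining up entries from the two sources.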
Troubleshooting: log files of interest:
• Most of the information in these vmware-dr-XX logs is really of more interest to VMware than EMC, as it’s just very verbose logging of the SRM application and database interaction.
• No harm in having a peek to see if there is anything jumping out, though.
• The log files we want to focus on next are the SRA logs, and there are a few different ones.
• Location = srm-support\Logs\SRAs\EMC VNX SRA
  • sra_discoverArrays_08-30-2012_11-12-08.359
  • sra_discoverDevices_08-30-2012_12-46-07.042
  • sra_failover_08-29-2012_14-34-05.253
  • sra_prepareFailover_09-02-2012_22-21-46.604
  • sra_prepareReverseReplication_09-02-2012_22-26-51.711
  • sra_queryCapabilities_02-24-2012_14-02-20.738
  • sra_queryConnectionParameters_02-24-2012_14-02-23.035
  • sra_queryErrorDefinitions_02-24-2012_14-02-25.676
  • sra_queryInfo_02-24-2012_14-02-09.816
  • sra_queryReplicationSettings_08-30-2012_17-47-54.968
  • sra_queryStrings_02-24-2012_14-02-24.160
  • sra_querySyncStatus_09-02-2012_22-19-09.277
  • sra_reverseReplication_09-02-2012_22-17-17.440
  • sra_syncOnce_09-02-2012_22-20-26.700
  • sra_testFailoverStart_08-29-2012_12-52-23.461
  • sra_testFailoverStop_09-02-2012_10-21-37.204
Troubleshooting: log files of interest:
• The timestamp format is again similar here, so just take note of it.
• Get the timestamp from the .html page as before:
• The SRA log folder will be full of logs, so I’m just going to focus on the logs from Sept 02 2012.
• The above error was received when trying to do a Test Failover.
• We need to check the following log:
  • sra_testFailoverStart_09-02-2012_08-42-09.811.log
• Note the UTC time adjustment.
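Picking the right SRA log out of a crowded folder comes down to reading the operation and timestamp out of the file name. The sketch below assumes the naming pattern seen in this deck, sra_<operation>_<MM-DD-YYYY>_<HH-MM-SS.mmm> with an optional .log suffix; the helper names are hypothetical.

```python
import re
from datetime import datetime

# Pattern assumed from names such as sra_testFailoverStart_09-02-2012_08-42-09.811.log
NAME = re.compile(
    r"^sra_(?P<op>[A-Za-z]+)_(?P<stamp>\d{2}-\d{2}-\d{4}_\d{2}-\d{2}-\d{2}\.\d{3})(?:\.log)?$"
)


def parse_sra_log_name(name):
    """Return (operation, datetime) for an SRA log file name, or None if it does not match."""
    match = NAME.match(name)
    if not match:
        return None
    when = datetime.strptime(match.group("stamp"), "%m-%d-%Y_%H-%M-%S.%f")
    return match.group("op"), when


def logs_for(names, operation, day):
    """Return the names for one operation on one calendar date, oldest first."""
    hits = []
    for name in names:
        parsed = parse_sra_log_name(name)
        if parsed and parsed[0] == operation and parsed[1].date() == day:
            hits.append((parsed[1], name))
    return [name for _when, name in sorted(hits)]
```

For example, `logs_for(os.listdir(folder), "testFailoverStart", date(2012, 9, 2))` would narrow the folder to the test failover attempts from that day (with `os` and `date` imported by the caller).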
Troubleshooting: log files of interest:
• Looking at the log file, we can pull some very useful information:
  • [sra_testFailoverStart_09-02-2012_08-42-09.811.log]
• grep / search for:
  • com.emc.mirrorview.platform.naviseccli.NaviseccliConnection
• This will show the actual navi commands that are being issued by the SRA to the SP.
Troubleshooting: log files of interest:
• Within the same log file, search / grep for:
  • Command result:
• This should give a clear indication of where the error lies.
• In this case, it looks like we have an issue with SnapView.
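For anyone without grep to hand, the two searches can be combined so that each navi command appears next to its result. This is a minimal sketch: the two marker strings come from the slides, while the function name `navi_activity` is an assumption made for the example.

```python
MARKERS = (
    "com.emc.mirrorview.platform.naviseccli.NaviseccliConnection",
    "Command result:",
)


def navi_activity(lines, markers=MARKERS):
    """Yield (line_number, text) for every line that contains one of the markers."""
    for number, text in enumerate(lines, start=1):
        if any(marker in text for marker in markers):
            yield number, text.rstrip("\n")
```

Typical use would be `with open(path, errors="replace") as log: hits = list(navi_activity(log))`, where `path` points at the sra_testFailoverStart log being examined.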
Troubleshooting: log files of interest:
• Switch over to the SPCOLLECTs for both sites, and grep out any messages related to SnapCopy.
• On the DR site, we can see:
• Dave@QQWWQQWW /cygdrive/c/Users/Dave/Documents.backup/Logs/SRM_LOGS_PPTX/Pandora
• $ grep SnapCopy "TRiiAGE_full_SPlogs.txt"
• B 09/02/12 07:54:35 NaviCimom 7100808b Failing Command: K10SnapCopyAdmin DBid 0 Op 1046.
• A 09/02/12 07:54:43 NaviCimom 7100808b Failing Command: K10SnapCopyAdmin DBid 0 Op 1046.
• A 09/02/12 07:54:44 4600 'Create a SnapShot LU' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Failure (Could not create SnapShot LU.. [0x7100808B] A SnapView snapshot already exists with the specified name (0x7100808b))
• A 09/02/12 07:54:45 SnapCopy 71008031 You must add LUNs with adequate capacity to the Reserved LUN Pool before you can use this feature.
• A 09/02/12 07:54:46 4600 '' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Failure (Could not start SnapView session. Session name - sync-0_SRM-TEST-FAILOVER_session. [0x71008031] You must add LUNs with adequate capacity to the Reserved LUN Pool before you
• A 09/02/12 07:54:46 NaviCimom 71008031 Failing Command: K10SnapCopyAdmin DBid 0 Op 1038.
• This indicates that the Reserved LUN Pool has not been set up, as described on slide 13.
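When the SPCOLLECT contains many hits, counting the bracketed error codes on the failure lines gives a quick summary of what went wrong. The sketch below is an illustration only: the `[0x........]` pattern and the "Failure" keyword are taken from the entries shown above, and `failure_codes` is a made-up name.

```python
import re
from collections import Counter

# Error codes appear in square brackets on the failure lines, e.g. [0x71008031]
CODE = re.compile(r"\[0x([0-9A-Fa-f]{8})\]")


def failure_codes(lines, keyword="SnapCopy"):
    """Count bracketed error codes on lines that mention keyword and report a Failure."""
    counts = Counter()
    for line in lines:
        if keyword in line and "Failure" in line:
            for code in CODE.findall(line):
                counts[code.lower()] += 1
    return counts


# Entries copied from the DR-site grep output above.
sample = [
    "A 09/02/12 07:54:44 4600 'Create a SnapShot LU' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Failure (Could not create SnapShot LU.. [0x7100808B] A SnapView snapshot already exists with the specified name (0x7100808b))",
    "A 09/02/12 07:54:45 SnapCopy 71008031 You must add LUNs with adequate capacity to the Reserved LUN Pool before you can use this feature.",
    "A 09/02/12 07:54:46 4600 '' called by 'admin' (10.64.29.93) on 'Navi_SnapCopyFeature' with result: Failure (Could not start SnapView session. Session name - sync-0_SRM-TEST-FAILOVER_session. [0x71008031] You must add LUNs with adequate capacity to the Reserved LUN Pool before you",
]

print(dict(failure_codes(sample)))
```

The codes can then be matched against the plain-text messages in the same log, such as the Reserved LUN Pool message that accompanies 71008031 here.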
Troubleshooting: Logs needed - Recap.
• So for all SRM / SRA cases you will need the following logs:
  • SPCOLLECT from both sites
  • SRM logs from both sites
  • SRA logs from both sites
  • “Recovery Plan Export Log.html” as explained on slide 19
• Seriously, don’t proceed until you have everything listed above.
• http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1009253
Troubleshooting: Workflow
OK, so for every SRM / SRA case that comes in, the following workflow should be applied towards resolution of the case.
• Collect logs.
• Check MirrorView and confirm it is actually working.
• Have the customer reconfirm that DNS and IP connectivity are OK:
  • All hosts should be DNS resolvable on both sites.
  • All hosts / SPs should have IP connectivity on the same VLAN; all hosts / SPs should be able to ping each other…
• Confirm the software requirements listed on slide 7.
• Check the error that is reported in Recovery Plan Export Log.html.
• Search in TRiiAGE_full_SPlogs for any errors at the time reported in the Recovery Plan Export Log.html.
Log error / message examples • In this section we will provide examples of errors and informative messages that may assist in troubleshooting your issue.