RAS - Reliability, Availability, Serviceability

Product Support Engineering
VMware Confidential

Module 2 Lessons
• Lesson 1 – vCenter Server High Availability
• Lesson 2 – vCenter Server Distributed Resource Scheduler
• Lesson 3 – Fault Tolerance Virtual Machines
• Lesson 4 – Enhanced vMotion Compatibility
• Lesson 5 – DPM - IPMI
• Lesson 6 – vApps
• Lesson 7 – Host Profiles
• Lesson 8 – Reliability, Availability, Serviceability (RAS)
• Lesson 9 – Web Access
• Lesson 10 – vCenter Server Update Manager
• Lesson 11 – Guided Consolidation
• Lesson 12 – Health Status
VI4 - Mod 2-8 - Slide

Module 2-8 Lessons
• Lesson 1 – Overview of RAS
• Lesson 2 – RAS objectives
• Lesson 3 – Networking vProbs
• Lesson 4 – Storage vProbs
• Lesson 5 – VMFS vProbs
• Lesson 6 – Migration vProb

Introduction
• The long-term goal of the ESX RAS project is to make ESX more reliable, available and serviceable.
• To do so, the VMkernel needs to detect, report, recover from, diagnose and repair or react to hardware and software problems that occur in the system.
• ESX RAS 1.0 will focus on detecting and reporting asynchronous hardware observations and synchronous software observations.

RAS Objectives
• The ESX RAS team's objective is to increase the reliability, availability and serviceability of the VMkernel. This includes:
  • Hardening VMkernel drivers (hardware errors): CPU, Memory, PCI(-X/Express), SCSI, Networking.
  • Hardening VMkernel facilities (software errors): SCSI, Networking, VMotion, DMotion, etc.
  • Developing a standardized method of reporting observations from software and hardware error handlers.
  • Developing a method to diagnose a given stream of observations down to one or more problems which may have caused them.
  • Developing a method for predicting failure of a given (sub-)system and feeding the analysis to consumers (DRS, DPM, FT, HA).
  • Gathering and writing service actions which correspond to the problem or set of problems that may be present.
  • Developing automated policies for certain problems which can be handled without user action.
  • Maintaining and improving the logging, coredump, and PSOD infrastructure in the VMkernel.

RAS Terms
• RAS: Reliability, Availability, Serviceability.
• Reliability: The ability of a system to perform and maintain its functions in the face of hostile or unexpected circumstances.
• Availability: The proportion of time a system is in a functioning condition.
• Serviceability: The ability to debug or perform root cause analysis in pursuit of solving a problem with a product.
• Hardening: Enhancing a (sub-)system so that it can detect, report and handle errors it may encounter, whether hardware or software related. Handling may involve panicking and/or attempting recovery from a given error or stream of errors.
• VProb: An automatically generated problem report.
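The availability definition above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the sample uptime and downtime figures are assumptions, not values from the RAS project.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return the proportion of observed time the system was functioning."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical example: 8 hours of downtime in a 30-day month (720 hours).
monthly = availability(uptime_hours=712, downtime_hours=8)
print(f"Availability: {monthly:.2%}")
```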

RAS Categories
• The framework defines the following use cases for vSphere 4.0. Each use case links to a KB article that describes where the error happened (e.g. the affected vmnic#, port group, vSwitch or storage path) and provides troubleshooting tips to fix the issue.
• Networking:
  • vprob.net.connectivity.lost
  • vprob.net.redundancy.lost
  • vprob.net.redundancy.degraded
  • vprob.net.e1000.tso6.notsupported
• Storage:
  • vprob.storage.connectivity.lost
  • vprob.storage.redundancy.lost
  • vprob.storage.redundancy.degraded

RAS Categories
• VMFS specific:
  • vprob.vmfs.nfs.server.disconnect
  • vprob.vmfs.nfs.server.restored
  • vprob.vmfs.heartbeat.timedout
  • vprob.vmfs.heartbeat.recovered
  • vprob.vmfs.heartbeat.unrecoverable
  • vprob.vmfs.lock.corruptiondisk
  • vprob.vmfs.resource.corruptiondisk
  • vprob.vmfs.volume.locked
• Migration specific:
  • vprob.net.migrate.vmknic
The public KBs will be available at GA time.
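The vProb identifiers follow a dotted naming scheme, so a consumer could group them by subsystem. The following is a minimal sketch under that assumption. The helper names are hypothetical, only a few of the identifiers from the slides are included, and this is not how the framework itself categorizes them.

```python
from collections import defaultdict

# A few of the vProb identifiers listed on the RAS Categories slides.
VPROB_IDS = [
    "vprob.net.connectivity.lost",
    "vprob.net.redundancy.degraded",
    "vprob.storage.redundancy.lost",
    "vprob.vmfs.heartbeat.timedout",
    "vprob.vmfs.volume.locked",
    "vprob.net.migrate.vmknic",
]


def subsystem(vprob_id: str) -> str:
    """Return the second dotted component, e.g. 'net', 'storage' or 'vmfs'."""
    parts = vprob_id.split(".")
    if len(parts) < 3 or parts[0] != "vprob":
        raise ValueError(f"not a vProb identifier: {vprob_id!r}")
    return parts[1]


def group_by_subsystem(ids):
    """Group identifiers by subsystem, preserving input order."""
    groups = defaultdict(list)
    for vprob_id in ids:
        groups[subsystem(vprob_id)].append(vprob_id)
    return dict(groups)


for name, members in group_by_subsystem(VPROB_IDS).items():
    print(name, members)
```

Note that grouping by name places vprob.net.migrate.vmknic under "net", although the slides list it as migration specific.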

Networking VProb
• vprob.net.connectivity.lost
  http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6122&communityID=2701
• Connectivity to a physical network has been lost. All affected port groups are listed in the message, for example:
  Lost network connectivity on virtual switch "system". Physical NIC vmnic1 is down. Affected port groups: "cos", "VM Network".

Networking VProb
• vprob.net.redundancy.lost
  http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6097&communityID=2701
• Only one physical NIC is currently connected; one more failure will result in a loss of connectivity. For example:
  Lost uplink redundancy on virtual switch "system". Physical NIC vmnic0 is down. Affected port groups: "cos", "VM Network".

Networking VProb
• vprob.net.redundancy.degraded
  http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6098&communityID=2701
• One of the physical NICs in your NIC team has gone down, but you still have n-1 NICs available. For example:
  Uplink redundancy degraded on virtual switch "vSwitch0". Physical NIC vmnic1 is down. 2 uplinks still up. Affected portgroups: "VM Network".
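When troubleshooting, it can help to pull the virtual switch, the failed NIC and the affected port groups out of these three network messages. The sketch below does that with a regular expression written against the example messages on the slides only. The function name is hypothetical and the pattern is an assumption about the message layout, not a documented format.

```python
import re

# Matches the three example messages; note the slides show both
# "port groups" and "portgroups" spellings.
NET_VPROB_RE = re.compile(
    r'virtual switch "(?P<vswitch>[^"]+)"\. '
    r'Physical NIC (?P<nic>\S+) is down\.'
    r'.*?Affected port ?groups: (?P<groups>.*)$'
)


def parse_net_vprob(message: str):
    """Return vSwitch, NIC and port groups from a message, or None."""
    match = NET_VPROB_RE.search(message)
    if match is None:
        return None
    return {
        "vswitch": match.group("vswitch"),
        "nic": match.group("nic"),
        "portgroups": re.findall(r'"([^"]+)"', match.group("groups")),
    }


example = ('Lost network connectivity on virtual switch "system". '
           'Physical NIC vmnic1 is down. '
           'Affected port groups: "cos", "VM Network".')
print(parse_net_vprob(example))
```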

Networking VProb
• vprob.net.e1000.tso6.notsupported
  KB article: http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-7393
• The guest e1000 driver is misbehaving and sending TSO IPv6 packets, which will be dropped. The vProb specifies the affected VM, and the KB article discusses ways to fix this. For example:
  Guest-initiated IPv6 TCP Segmentation Offload (TSO) packets ignored. Manually disable TSO inside the guest operating system in virtual machine "XYZ", or use a different virtual adapter.

Storage VProb
• vprob.storage.connectivity.lost
  http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6099&communityID=2701
• Connectivity to a specific device has been lost. For example:
  Lost connectivity to storage device naa.60a9800043346534645a433967325334. Path vmhba35:C1:T0:L7 is down

Storage VProb
• vprob.storage.redundancy.lost
  http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6120&communityID=2701
• Only one path to a device remains and you no longer have any redundancy. For example:
  Lost path redundancy to storage device naa.60a9800043346534645a433967325334. Path vmhba35:C1:T0:L7 is down.

Storage VProb
• vprob.storage.redundancy.degraded
  http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6099&communityID=2701
• One of your paths to a device has been lost, but you still have n-1 paths remaining. For example:
  Path redundancy to storage device naa.60a9800043346534645a433967325334 degraded. Path vmhba35:C1:T0:L7 is down. 3 remaining active paths.
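The storage messages identify the device by its naa identifier and the failed path by a runtime name of the form adapter:C(channel):T(target):L(LUN). A minimal sketch of extracting both from a message follows. The names are hypothetical and the patterns are written against the example messages on the slides only.

```python
import re
from typing import NamedTuple, Optional, Tuple


class StoragePath(NamedTuple):
    adapter: str   # e.g. "vmhba35"
    channel: int
    target: int
    lun: int


DEVICE_RE = re.compile(r"storage device (naa\.[0-9a-f]+)")
PATH_RE = re.compile(r"(vmhba\d+):C(\d+):T(\d+):L(\d+)")


def parse_storage_vprob(message: str) -> Optional[Tuple[str, StoragePath]]:
    """Return (device id, failed path) from a storage message, or None."""
    device = DEVICE_RE.search(message)
    path = PATH_RE.search(message)
    if device is None or path is None:
        return None
    adapter, channel, target, lun = path.groups()
    return device.group(1), StoragePath(adapter, int(channel), int(target), int(lun))


example = ("Lost connectivity to storage device "
           "naa.60a9800043346534645a433967325334. "
           "Path vmhba35:C1:T0:L7 is down")
print(parse_storage_vprob(example))
```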

VMFS vProb
• vprob.vmfs.nfs.server.disconnect
• vprob.vmfs.nfs.server.restored
  http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.volume.locked.htm
• Lost connection to server nfs-server mount point /share, mounted as 1264e433-5854ee53-0000-000000000000 ("nfs-share")

VMFS vProb
• vprob.vmfs.heartbeat.timedout
  http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm
• VMFS Volume Connectivity Degraded 496befed-1c79c817-6beb-001ec9b60619 san-lun-100

VMFS vProb
• vprob.vmfs.heartbeat.recovered
  http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm
• VMFS Volume Connectivity Restored 496befed-1c79c817-6beb-001ec9b60619 san-lun-100

VMFS vProb
• vprob.vmfs.heartbeat.unrecoverable
  http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm
• VMFS Volume Connectivity Lost 496befed-1c79c817-6beb-001ec9b60619 san-lun-100
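The three heartbeat vProbs describe a volume moving between degraded, restored and lost connectivity. As a rough illustration, the sketch below replays a stream of such events and keeps the most recent state per volume. The function name and the (identifier, volume) event shape are assumptions made for this example. The state labels are taken from the example messages on the slides.

```python
# State implied by each heartbeat vProb, per the example messages.
STATE_AFTER = {
    "vprob.vmfs.heartbeat.timedout": "degraded",
    "vprob.vmfs.heartbeat.recovered": "restored",
    "vprob.vmfs.heartbeat.unrecoverable": "lost",
}


def latest_volume_state(events):
    """Return {volume: state} from (vprob_id, volume) pairs in arrival order.

    Events with identifiers other than the heartbeat vProbs are ignored.
    """
    state = {}
    for vprob_id, volume in events:
        if vprob_id in STATE_AFTER:
            state[volume] = STATE_AFTER[vprob_id]
    return state


events = [
    ("vprob.vmfs.heartbeat.timedout", "496befed-1c79c817-6beb-001ec9b60619"),
    ("vprob.vmfs.heartbeat.recovered", "496befed-1c79c817-6beb-001ec9b60619"),
]
print(latest_volume_state(events))
```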

VMFS vProb
• vprob.vmfs.lock.corruptiondisk
• vprob.vmfs.resource.corruptiondisk
  http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.corruptioncombined.htm
• Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock detected at offset O
• Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Resource cluster metadata corruption detected

VMFS vProb
• vprob.vmfs.volume.locked
  http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.volume.locked.htm
• Volume on device naa.60060160b3c018009bd1e02f725fdd11:1 locked, possibly because remote host 10.17.211.73 encountered an error during a volume operation and couldn’t recover.

Migration Specific
• vprob.net.migrate.vmknic
  http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.net.migrate.vmkernel.htm
• The ESX advanced config option /Migrate/Vmknic is set to an invalid vmknic: vmk0. /Migrate/Vmknic specifies a vmknic that VMotion binds to for improved performance. Please update the config option with a valid vmknic or, if you don't want VMotion to bind to a specific vmknic, remove the invalid vmknic and leave the option blank.
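The rule in this message is simple: /Migrate/Vmknic is acceptable when it is blank or names a vmknic that exists on the host. The sketch below expresses that check. The function is hypothetical and takes the option value and the list of vmknics as plain arguments. It does not query an ESX host.

```python
from typing import Iterable


def is_valid_migrate_vmknic(option_value: str, existing_vmknics: Iterable[str]) -> bool:
    """Return True if the /Migrate/Vmknic value would not raise this vProb.

    A blank value means VMotion is not bound to a specific vmknic, which
    the message describes as acceptable.
    """
    if option_value == "":
        return True
    return option_value in set(existing_vmknics)


# Hypothetical host with a single vmknic, vmk1; vmk0 does not exist.
print(is_valid_migrate_vmknic("vmk0", ["vmk1"]))
print(is_valid_migrate_vmknic("", ["vmk1"]))
```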

Lesson 2-8 Summary
• Understand what vProbs are
• Learn how to troubleshoot vProbs

Lesson 2-8 – Optional Lab 1
• OPTIONAL
• Lab 1 involves generating vProb scenarios
