Singapore, June 2013

Network Troubleshooting in the virtual world Singapore, June 2013

Co-authors & Reviewers • Reviewers • Lim Wei Chiang • Huang Ya Jian, SE Manager, Arista Networks VCAP-DCD, TOGAF Certified, vExpert 2013 • Iwan ‘e1’ Rahabok • Staff SE, Strategic Accounts, VMware • e1@vmware.com| Linkedin.com/in/e1ang VCP, CCDP, CCNP

Network Troubleshooting
• Our example scenario: you are responsible for the following environment:
  • 1500 VMs on 100 ESXi 5.1 hosts. All VMs are server VMs, not desktop VMs, running a mix of Windows and Linux.
  • The majority of VMs use VADP-based backup. A few use LAN-based backup.
  • All of the above reside in 1 physical datacenter.
  • The physical network is 10 GE on Arista switches.
  • Each ESXi host is a 1U rackmount with 2x 10 GE ports and the following networks:
    • vmkernel networks: IP Storage, vMotion, Management
    • VM networks: DMZ, Zone 1, Zone 2, Veritas Heartbeat
  • You use dvSwitch 5.1 with Network QoS enabled.

Network Troubleshooting: Approach
• Possible reasons for a demand spike:
  • Storage vMotion, vMotion, broadcast storm
• Possible reasons for a supply drop:
  • Misconfiguration
  • Hardware fault

Available counters
• Cluster counters
• VM counters
• ESXi counters
• Datacenter counters

Network Troubleshooting
• What type of info do you need?
  • Dropped packets and errors
  • Throughput (bandwidth)
  • Latency
  • Special packets: broadcast, error, multicast
• How do you need to show the info?
  • A line chart is useful for showing a few objects across time.
    • Great at showing the time or period. Not scalable in terms of the number of objects.
  • A heat map is useful for showing many objects, but at a given point in time.
    • Normally just the current time.
    • It can also present 2-dimensional information, making it useful for comparison.
    • It gives good relative information, comparing many objects against one another.
  • A weather map adds a time dimension, allowing you to go back in time.
    • Not as good as a line chart.
  • A top-N chart shows the top N objects (e.g. the top 25 VMs in terms of network utilisation).
  • A data-distribution chart shows how the data is distributed during a period of time.
    • See next screen for an example.
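As a rough illustration of what the top-N and data-distribution charts compute, here is a minimal Python sketch. It is not how any particular monitoring product builds these charts; the function names, the VM names, and the sample values are all hypothetical.

```python
from collections import Counter

def top_n(usage_by_object, n=25):
    """Return the n objects with the highest value, largest first."""
    ranked = sorted(usage_by_object.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

def distribution(samples, bucket_width=10):
    """Count samples per bucket; keys are the lower bound of each bucket."""
    counts = Counter(int(s // bucket_width) * bucket_width for s in samples)
    return dict(sorted(counts.items()))

# Hypothetical sample data: VM name -> network utilisation in Mb/s.
vm_usage = {"vm-a": 120.0, "vm-b": 850.0, "vm-c": 15.5, "vm-d": 430.0}
print(top_n(vm_usage, n=2))

# Hypothetical utilisation samples (%) for one object over a period.
print(distribution([3, 7, 12, 18, 55, 91]))
```

The top-N view answers "who", while the distribution view answers "how often": a VM that sits in the 90% bucket for most of the period is a different problem from one that touched it once.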

Charts example

To prove that the network is performing well
• Errors
  • Not a single ESXi host is experiencing packet drops on any of its NICs (vmnics).
    • If any are, show the ESXi names.
  • Not a single VM is experiencing packet drops.
• Utilisation
  • Not a single VM is hitting its limit, be it 1 GE or 10 GE.
  • Not a single ESXi vmnic is hitting its limit.
  • Total bandwidth hitting the physical switches is below capacity.
  • The top 25 talkers show utilisation below the limit.
    • 4 charts required: VM TX, VM RX, ESXi TX, ESXi RX
• Special network
  • Broadcast traffic is minimal, for both ESXi and VMs.
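The error and utilisation criteria above can be expressed as a simple check. The sketch below is only an illustration under stated assumptions: the input dictionaries, the object names, and the 90% threshold standing in for "hitting its limit" are hypothetical and do not come from any product.

```python
def network_findings(host_drops, vm_drops, vmnic_util_pct, vm_util_pct, limit_pct=90.0):
    """Return a list of findings; an empty list means every check passed.

    limit_pct is an assumed threshold for "hitting its limit".
    """
    findings = []
    # Errors: any dropped packet on a host or VM is reported by name.
    for kind, drops in (("ESXi", host_drops), ("VM", vm_drops)):
        for name, dropped in sorted(drops.items()):
            if dropped > 0:
                findings.append(f"{kind} {name}: {dropped} dropped packets")
    # Utilisation: report any vmnic or VM at or above the assumed limit.
    for kind, util in (("vmnic", vmnic_util_pct), ("VM", vm_util_pct)):
        for name, pct in sorted(util.items()):
            if pct >= limit_pct:
                findings.append(
                    f"{kind} {name}: utilisation {pct:.0f}% is at or above {limit_pct:.0f}%"
                )
    return findings

# Hypothetical sample data.
findings = network_findings(
    host_drops={"esx01": 0, "esx02": 12},
    vm_drops={"web01": 0},
    vmnic_util_pct={"esx01/vmnic0": 35.0, "esx02/vmnic1": 95.0},
    vm_util_pct={"web01": 10.0},
)
for line in findings:
    print(line)
```

Returning the offending names rather than a single pass/fail flag matches the requirement to show which ESXi hosts are affected.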

Approach
• Dashboard #1: Do we have any errors in our networks?
  • A multi-datacenter view.
  • A single error in a VM or ESXi will show up in this overall dashboard, as it takes the Max across all objects.
• Dashboard #2: If yes, which VMs and ESXi are affected?
  • Lists the top 25 VMs and top 25 ESXi.
• Dashboard #3: Is any VM or ESXi near its peak?
  • A peak in any VM or ESXi will show up in this super-metric-based line chart.
• Dashboard #4: Is our network near its peak?
• Dashboard #5: Who are the top consumers for each physical datacenter?
• Dashboard #6: How is the workload distributed?
  • This uses a heat map to show relative info.
• Dashboard #7: What’s the detail for a particular VM?
  • Used when we have identified a specific VM and want to know all its network details.
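Why take the Max across all objects rather than an average? A short Python sketch with hypothetical data makes the difference visible. The function names and sample series are illustrative only and are not the super metric syntax of any product.

```python
def max_across_objects(series_by_object):
    """For each time slot, take the maximum value over all objects."""
    return [max(values) for values in zip(*series_by_object.values())]

def mean_across_objects(series_by_object):
    """For each time slot, take the average value over all objects."""
    return [sum(values) / len(values) for values in zip(*series_by_object.values())]

# Hypothetical packet-drop % per VM at three points in time.
drop_pct = {
    "vm1": [0.0, 0.0, 0.0],
    "vm2": [0.0, 4.5, 0.0],
    "vm3": [0.0, 0.0, 1.0],
}

print(max_across_objects(drop_pct))   # one line to watch: any single VM's drops surface here
print(mean_across_objects(drop_pct))  # the same events are diluted by the healthy VMs
```

With 1500 VMs the dilution is far stronger than with three, which is why a single Max line is a reasonable "is anything wrong anywhere" indicator.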

Dashboard #1: Do we have any errors in our networks?
Physical Datacenter 1
• Maximum packet drop for all VMs in the entire DC (%)
• Maximum packet drop for all ESXi in the entire DC (%)
• Maximum “bad” packets for all ESXi in the entire DC (Gb/s)
Physical Datacenter 2
• Same set of charts as Datacenter 1. We should display all datacenters that have heavy connections with each other.
An explanation of how this dashboard is built will be given later.

Dashboard #2: If yes, which VMs and ESXi are affected?
Physical Datacenter 1
• Top 25 VMs by packet drop
• Top 25 ESXi by packet drop
Physical Datacenter 2
• Same set of charts as Datacenter 1. We should display all datacenters that have heavy connections with each other.
Each chart above consists of 2 parts: a bar chart and a line chart. The line chart is not really visible here, so we will zoom into it later on. An actual dashboard will be shown later.

Dashboard #3: Is any VM or ESXi near its peak?
Physical Datacenter 1
• Maximum TX for all VMs in the entire DC (%)
• Maximum TX for all ESXi in the entire DC (%)
• Maximum RX for all VMs in the entire DC (%)
• Maximum RX for all ESXi in the entire DC (%)
Physical Datacenter 2
• Same set of charts as Datacenter 1. We should display all datacenters that have heavy connections with each other.

Dashboard #4: Is our network near its peak?
Physical Datacenter 1
• Total TX from all VMs in the entire DC (Gb/s)
• Total TX from all ESXi in the entire DC (Gb/s)
• Total RX from all VMs in the entire DC (Gb/s)
• Total RX from all ESXi in the entire DC (Gb/s)
Physical Datacenter 2
• Same set of charts as Datacenter 1. We should display all datacenters that have heavy connections with each other.
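The "total" charts are a Sum across objects, converted to Gb/s. The sketch below shows the arithmetic only. It assumes per-host counters in KBps with decimal units (1 Gb/s = 125,000 KB/s), and it compares the total against the aggregate host uplink capacity (2x 10 GE per host, as in the example scenario), which is an assumed stand-in for the real capacity of the physical switches. Host names and values are hypothetical.

```python
KBPS_PER_GBPS = 1_000_000 / 8  # assumption: decimal units, 1 Gb/s = 125,000 KB/s

def total_gbps(kbps_by_host):
    """Sum per-host throughput (KBps) and convert the total to Gb/s."""
    return sum(kbps_by_host.values()) / KBPS_PER_GBPS

def utilisation_pct(total, hosts, ports_per_host=2, port_gbps=10):
    """Total throughput as a percentage of aggregate host uplink capacity."""
    capacity_gbps = hosts * ports_per_host * port_gbps
    return 100.0 * total / capacity_gbps

# Hypothetical TX counters for three hosts, in KBps.
tx_kbps = {"esx01": 250_000, "esx02": 125_000, "esx03": 0}

tx_total = total_gbps(tx_kbps)
print(tx_total)                                       # 3.0 (Gb/s)
print(utilisation_pct(tx_total, hosts=len(tx_kbps)))  # 5.0 (% of 60 Gb/s)
```

For the full scenario of 100 hosts, the same formula gives an aggregate uplink capacity of 100 x 2 x 10 = 2000 Gb/s, so the switch fabric, not the host ports, is likely to be the tighter limit to compare against.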

Dashboard #5: Who are the top consumers for each physical datacenter?
Physical Datacenter 1
• Top 25 ESXi by RX in Mb/s
• Top 25 ESXi by TX in Mb/s
• Top 25 VMs by RX in Mb/s
• Top 25 VMs by TX in Mb/s

Dashboard #6: How is the workload distributed?

Dashboard #7: What’s the detail for a particular VM?

Dashboard #1

Dashboard #1: Cluster packet drop

Dashboard #2: VM with packet drop
• Screenshot showing the top 25 VMs in terms of % packet drops.
  • In this example, it’s clear there is a problem, as the % is high.
• The chart can be complemented with a line chart showing the details of the selected VM.
  • The line chart can be adjusted to display historical data.

Dashboard #3: Peak of any VM or ESXi
• We are using Workload (%), a derived metric.

Dashboard #4: Total network utilisation
• Sample super metric that provides granularity.

Dashboard #5
• This example shows the top 25 VMs in terms of packets sent.
  • Data is shown in KBps.
  • Utilisation is very low.
• The bar chart is complemented with a simple line chart.
  • It gives historical data.
  • It can go back 1 year.

Thank you

Super Metric: main screen

Super Metric: applying to a type of resource

Super Metric package: group of super metrics

Super Metric: naming tips
[Calculation] [Object] [Resource] in a [Container] (units)
• Calculation: Sum, Min, Max, etc.
• Object: VM or ESXi
• Resource: CPU or RAM or Disk or Network
• Container: Cluster or Datacenter or vCenter
• Units: % or Mbps or packets, etc.
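The naming pattern can be applied mechanically. The helper below is a hypothetical sketch that simply fills in the template; it is not part of any product API.

```python
def super_metric_name(calculation, obj, resource, container, units):
    """Build a name following: [Calculation] [Object] [Resource] in a [Container] (units)."""
    return f"{calculation} {obj} {resource} in a {container} ({units})"

name = super_metric_name("Max", "VM", "Network", "Datacenter", "%")
print(name)  # Max VM Network in a Datacenter (%)
```

Keeping the calculation first means related super metrics (all the Max ones, all the Sum ones) sort together in a list.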
