Learn about the performance improvements and best practices in ESX 3.5, including NFS and iSCSI optimizations. Discover troubleshooting techniques and real-life case studies.
AP02 – NFS & iSCSI: Performance Characterization and Best Practices in ESX 3.5 • Priti Mishra, MTS, VMware • Bing Tsai, Sr. R&D Manager, VMware
Housekeeping • Please turn off your mobile phones, BlackBerries and laptops • Your feedback is valued: please fill in the session evaluation form (specific to this session) and hand it to the room monitor or the materials pickup area at registration • Each delegate who returns a completed event evaluation form to the materials pickup area will be eligible for a free evaluation copy of VMware ESX 3i • Please leave the room between sessions, even if your next session is in the same room, as you will need to be rescanned
Topics • General Performance Data and Comparison • Improvements in ESX 3.5 over ESX 3.0.x • Performance Best Practices • Troubleshooting Techniques • Basic methodology • Tools • Case studies
Key performance improvements since ESX 3.0.x (1 of 3) • NFS • Accurate CPU accounting further improves load balancing among multiple VMs • Optimized buffer and heap sizes • Improvements in TSO support • TSO (TCP segmentation offload) improves large writes • H/W iSCSI (with QLogic 405x HBA) • Improvements in PAE (large memory) support • Results in better multi-VM performance on large systems • Minimized NUMA performance overhead • This overhead exists in physical systems as well • Improved CPU cost per I/O
Key performance improvements since ESX 3.0.x (2 of 3) • S/W iSCSI (S/W-based initiator in ESX) • Improvements in CPU cost per I/O • Accurate CPU accounting further improves load balancing among multiple VMs • Increased maximum transfer size • Minimizes iSCSI protocol processing cost • Reduces network overhead for large I/Os • Ability to handle more concurrent I/Os • Improved multi-VM performance
Key performance improvements since ESX 3.0.x (3 of 3) • S/W iSCSI (continued) • Improvements in PAE (large memory) support • CPU efficiency much improved for systems with >4GB memory • Minimized NUMA performance overhead
Performance Experiment Setup (1 of 3) • Workload: Iometer • Standard set based on • Request size • 1KB, 4KB, 8KB, 16KB, 32KB, 64KB, 72KB, 128KB, 256KB, 512KB • Access mode • 50% read / 50% write • Access pattern • 100% sequential • 1 worker, 16 outstanding I/Os • Cached runs • 100MB data disks to minimize array/server disk activity • All I/Os served from server/array cache • Gives an upper bound on performance
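For readers who want to script a comparable sweep outside Iometer, the following minimal Python sketch simply enumerates the same request-size/access-mode/access-pattern grid; it is an illustration of the test matrix, not the configuration actually used in these experiments.

```python
from itertools import product

# Test matrix from the slide above; every name below is illustrative.
request_sizes_kb = [1, 4, 8, 16, 32, 64, 72, 128, 256, 512]
access_modes = ["50% read / 50% write"]
access_patterns = ["100% sequential"]
workers, outstanding_ios = 1, 16

tests = [
    {"size_kb": size, "mode": mode, "pattern": pattern,
     "workers": workers, "outstanding_ios": outstanding_ios}
    for size, mode, pattern in product(request_sizes_kb, access_modes, access_patterns)
]
print(len(tests), "test cases")  # 10 cached-run configurations
```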
Performance Experiment Setup (2 of 3) • VM information • Windows 2003 Enterprise Edition • 1 VCPU; 256 MB memory • No file system used in VM (Iometer sees disk as physical drive) • No caching done in VM • Virtual disks located on RDM device configured in physical mode • Note: VMFS-formatted volumes are used in some tests where noted
Performance Experiment Setup (3 of 3) • ESX Server • 4-socket, 8 x 2.4GHz cores • 32GB DRAM • 2 x Gigabit NICs • One for vmkernel networking: used for NFS and software iSCSI protocols • One for general VM connectivity • Networking Configuration • Dedicated VLANs for data traffic isolated from general networking
How to read the performance comparison charts • Throughput • Higher is better • A positive change means higher throughput • Latency • Lower is better • A negative change means lower response time • CPU cost • Lower is better • A negative change means reduced CPU cost • Why does each metric matter?
CPU Costs • Why is CPU cost data useful? • Determines how much I/O traffic the system CPUs can handle • How many I/O-intensive VMs can be consolidated on a host • How to compute CPU cost • Measure total physical CPU usage in ESX • esxtop counter: Physical Cpu(_Total) • Normalize to per I/O or per MBps • Example: MHz/MBps = {(physical CPU usage as a fraction of 100%) X (# of physical CPUs) X (CPU MHz rating)} / (throughput in MBps)
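As a concrete illustration of the normalization above, here is a minimal Python sketch; the sample numbers are made up and are not measurements from these experiments.

```python
def cpu_cost(cpu_usage_pct, num_physical_cpus, cpu_mhz_rating, rate):
    """Normalize total physical CPU usage (esxtop: Physical Cpu(_Total)) to a rate.

    Pass throughput in MBps to get MHz/MBps, or IOPS to get MHz per I/O.
    """
    total_mhz_used = (cpu_usage_pct / 100.0) * num_physical_cpus * cpu_mhz_rating
    return total_mhz_used / rate

# Made-up example: 25% total CPU on 8 x 2400MHz cores while sustaining 110 MBps.
print(cpu_cost(25.0, 8, 2400, 110.0))  # ~43.6 MHz/MBps
```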
Performance Data • First set: Relative to baselines in ESX 3.0.x • Second set: Comparison of storage options using Fibre Channel data as the baseline • Last: VMFS vs. RDM-physical
Software iSCSI – Throughput Comparison to 3.0.x: higher is better
Software iSCSI – Latency Comparison to 3.0.x: lower is better
Software iSCSI – CPU Cost Comparison to 3.0.x: lower is better
Software iSCSI – Performance Summary • Lower CPU costs • Can lead to higher throughput for small I/O sizes when the CPU is pegged • CPU cost per I/O is also greatly improved for larger block sizes • Latency is lower • Especially for smaller data sizes • Read operations benefit most • Throughput levels • Dependent on workload • Mixed read-write patterns show the most gain • Read I/Os show gains for small data sizes
Hardware iSCSI – Throughput Comparison to 3.0.x: higher is better
Hardware iSCSI – Latency Comparison to 3.0.x: lower is better
Hardware iSCSI – CPU Cost Comparison to 3.0.x: lower is better
Hardware iSCSI – Performance Summary • Lower CPU costs • Result in higher throughput levels for small I/O sizes • CPU cost per I/O is especially improved for larger data sizes • Latency is better • Smaller data sizes show the most gain • Mixed read-write and read I/Os benefit more • Throughput levels • Dependent on workload • Mixed read-write patterns show the most gain for all block sizes • Pure read and write I/Os show gains for small block sizes
NFS – Performance Summary • Performance is also significantly improved in ESX 3.5 • Data not shown here in the interest of time
Protocol Comparison • Which storage option to choose? • IP storage vs. Fibre Channel • How to read the charts • All data is presented as a ratio to the corresponding 2Gb FC (Fibre Channel) data • If the ratio is 1, the FC and IP protocol results are identical; if < 1, the FC value is larger
Comparison with FC: Throughput (if the ratio is < 1, the FC value is larger)
Comparison with FC: Latency lower is better
VMFS vs. RDM • Which one has better performance? • Data shown as a ratio to RDM-physical
VMFS vs. RDM-physical: Throughput higher is better
VMFS vs. RDM-physical: Latency lower is better
VMFS vs. RDM-physical: CPU Cost lower is better
Topics • General Performance Data and Comparison • Improvements in ESX 3.5 over ESX 3.0.x • Performance Best Practices • Troubleshooting Techniques • Basic methodology • Tools • Case studies
Pre-Deployment Best Practices: Overview • Understand the performance capability of your • Storage server/array • Networking hardware and configurations • ESX host platform • Know your workloads • Establish performance baselines
Pre-Deployment Best Practices (1 of 4) • Storage server/array: a complex system by itself • Total spindle count • Number of spindles allocated for use • RAID level and stripe size • Storage processor specifications • Read/write cache sizes and caching policy settings • Read-ahead, write-behind, etc. • Useful sources of information: • Vendor documentation: manuals, best practice guides, white papers, etc. • Third-party benchmarking reports • NFS-specific tuning information: SPEC SFS disclosures at http://www.spec.org
Pre-Deployment Best Practices (2 of 4) • Networking • Routing topology and path configurations: # of links in between, etc. • Switch type, speed and capacity • NIC brand/model, speed and features • H/W iSCSI HBAs • ESX host • CPU: revision, speed and core count • Architecture basics • SMP or NUMA? • Disabling NUMA is not recommended • Bus speed, I/O subsystems, etc. • Memory configuration and size • Note: NUMA nodes may not have equal amounts of memory
Pre-Deployment Best Practices (3 of 4) • Workload characteristics • What are the smallest, largest and most common I/O sizes? • What are the read and write percentages? • Is the access pattern sequential, random or mixed? • Is response time or aggregate throughput more important? • Is response-time variance an issue? • Important: know the peak resource usage, not just the average
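As a hypothetical illustration of how these characteristics could be derived from an I/O trace, here is a short Python sketch; the trace format (a list of (size_bytes, is_read, latency_ms) tuples) and all field names are assumptions, not part of the original material.

```python
from collections import Counter
from statistics import mean, quantiles

def characterize(trace):
    """trace: list of (size_bytes, is_read, latency_ms) tuples from any tracing tool."""
    sizes = [s for s, _, _ in trace]
    reads = [r for _, r, _ in trace]
    lats  = [l for _, _, l in trace]
    most_common_size, _ = Counter(sizes).most_common(1)[0]
    return {
        "smallest_io_bytes": min(sizes),
        "largest_io_bytes": max(sizes),
        "most_common_io_bytes": most_common_size,
        "read_pct": 100.0 * sum(reads) / len(reads),
        "avg_latency_ms": mean(lats),
        # The 99th percentile captures peak/tail behavior, not just the average.
        "p99_latency_ms": quantiles(lats, n=100)[98],
    }
```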
Pre-Deployment Best Practices (4 of 4) • Establish performance baselines by running standardized benchmarks • What is the upper-bound IOPS for small I/Os? • What is the upper-bound MBps? • What is the average/worst-case response time? • What is the CPU cost of doing I/O?
Additional Considerations (1 of 3) • NFS parameters • # of NFS mount points • Multiple VMs using multiple mount points may give higher aggregate throughput at slightly higher CPU cost • Export options on the NFS server affect performance • iSCSI protocol parameters • Header digest processing: slight impact on performance • Data digest processing: turning it off may result in • Improved CPU utilization • Slightly lower latencies • Minor throughput improvement • Actual outcome is highly dependent on workload
Additional Considerations (2 of 3) • NUMA specific • If only one VM is doing heavy I/O, it may be beneficial to pin the VM and its memory to node 0 • If CPU usage is not a concern, no pinning is necessary • On each VM reboot, ESX Server will place it on the next adjacent NUMA node • Minor performance implications for certain workloads • To avoid this movement, affinitize the VM using the VI Client • SMP VMs • For I/O workloads within an SMP VM that migrate frequently between VCPUs • Pin the guest thread/process to a specific VCPU • Some versions of Linux have a kHz timer rate and may incur high overhead
Additional Considerations (3 of 3) • CPU headroom • Software-initiated iSCSI and NFS protocols can consume a significant amount of CPU for certain I/O patterns • Small I/O workloads require a large amount of CPU; ensure that CPU saturation does not restrict the I/O rate • Networking • Avoid link over-subscription • Ensure that all networking parameters, down to the basic gigabit connection, are consistent across the full network path • Use VLANs or zoning intelligently to minimize traffic interference
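To make the CPU-headroom point concrete, here is a rough back-of-the-envelope sketch; all inputs (core count, clock rating, per-I/O cost, reserved fraction) are illustrative assumptions rather than numbers from these tests.

```python
def max_iops_before_cpu_saturation(num_cores, core_mhz, mhz_per_io,
                                   reserved_fraction=0.3):
    """Estimate the I/O rate at which host CPUs, not the storage path, saturate.

    reserved_fraction: CPU headroom kept for the VMs' own work and other ESX
    activity (assumed value).
    """
    available_mhz = num_cores * core_mhz * (1.0 - reserved_fraction)
    return available_mhz / mhz_per_io

# Illustrative numbers: 8 x 2400MHz cores and 1.5 MHz per small I/O.
print(max_iops_before_cpu_saturation(8, 2400, 1.5))  # ~8960 IOPS
```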
General Troubleshooting Tips (1 of 3) • Identify • Components in the whole I/O path • Possible issues at each layer in the path • Check all hardware and software configuration parameters, in particular • Disk configurations and cache-management policies on the storage server/array • Network settings and routing topology • Design experiments to isolate problems, such as: • Cached runs • Use a small file or logical device, or a physical host configured with RAM disks, to minimize physical disk effects • Indicate the upper-bound throughput and I/O rate achievable
General Troubleshooting Tips (2 of 3) • Run tests with a single outstanding I/O • Makes packet-trace analysis easier • Throughput is entirely dependent on I/O response times • Micro-benchmark each layer in the I/O path • Compare to non-virtualized, native performance results • Collect data • Guest OS data: but don't trust the CPU% • esxtop data • Storage server/array data: cache hit ratio, storage processor busy%, etc. • Packet tracing with tools like tcpdump, Ethereal, Wireshark, etc.
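With a single outstanding I/O, throughput follows directly from response time; the sketch below shows that relationship and its queue-depth generalization (Little's law), using made-up numbers and assuming no overlap or queuing inside the array.

```python
def throughput_mbps(io_size_kb, latency_ms, outstanding_ios=1):
    """Approximate throughput when each of `outstanding_ios` requests completes
    in `latency_ms`; ignores queuing and overlap effects inside the array."""
    ios_per_sec = outstanding_ios * (1000.0 / latency_ms)  # Little's law: N = X * R
    return ios_per_sec * io_size_kb / 1024.0

print(throughput_mbps(64, 0.5))      # 64KB I/Os at 0.5ms, QD=1 -> ~125 MBps
print(throughput_mbps(64, 0.5, 16))  # upper bound at QD=16 (ignoring contention)
```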
General Troubleshooting Tips (3 of 3) • Analyze performance data • Do any stats, e.g., throughput or latency, change drastically over time? • Check esxtop data for anomalies, e.g., CPU spikes or excessive queueing • Server/array stats • Compare array stats with ESX stats • Is the cache hit ratio reasonable? Is a storage processor overloaded? • Network trace analysis • Inspect packet traces to see whether • NFS and iSCSI requests are processed in a timely manner • I/O sizes issued by the guest match the transfer sizes over the wire • Block addresses are aligned to appropriate boundaries
Isolating Performance Problems: Case Study #1 (1 of 3) • Symptoms • Throughput can reach gigabit wire speed doing 128KB sequential reads from a 20GB LUN on an iSCSI array with a 2GB cache • Throughput degrades for data sizes beyond 128KB • From esxtop data • CPU utilization is also lower for I/O sizes larger than 128KB • CPU cost per I/O is in the expected range for all I/O sizes
Isolating Performance Problems: Case Study #1 (2 of 3) • From esxtop or benchmark output • I/O response times are in the 10 to 20ms range for the problematic I/Os • Indicates constant physical disk activity is required to serve the reads • From network packet traces • No retransmissions or packet loss observed, indicating no networking issue • Packet timestamps indicate the array takes 10ms to 20ms to respond to a read request, with no delay in the ESX host • From cached run results • No throughput degradation above 128KB! • The problem exists only for file sizes exceeding cache capacity • The array appears to have cache-management issues with large sequential reads
Isolating Performance Problems: Case Study #1 (3 of 3) • From native tests to the same array • Same problem observed • From the administration GUI of the array • Read-ahead policy is set to highly aggressive • Is the policy appropriate for the workload? • Solution • Understand the performance characteristics of the array • Experiment with different read-ahead policies • Try turning off read-ahead entirely to get the baseline behavior
Isolating Performance Problems: Case Study #2 (1 of 4) • Symptoms • 1KB random write throughput is much lower (< 10%) than sequential writes to a 4GB vmdk file located on an NFS server • Even after an extensive warm-up period • But very little difference in performance between random and sequential reads • From the NFS server spec • 3GB read/write cache • Most data should be in cache after warming up
Isolating Performance Problems: Case Study #2 (2 of 4) • From esxtop and application/benchmark data • CPU utilization is lower, but CPU cost per I/O is mostly the same regardless of randomness • Not likely a client-side (i.e., ESX host) issue • Random write latency is in the 20ms range • Sequential write latency is < 1ms • From NFS server stats • Cache hit % is much lower for random writes, even after warm-up
Isolating Performance Problems: Case Study #2 (3 of 4) • From cached runs to a 100MB vmdk • Random write latency almost matches sequential write latency • Again, suggests the issue is not in the ESX host • From native tests • Random and sequential write performance is almost the same • From network packet traces • The server responds to random writes in 10 to 20ms, sequential writes in < 1ms • The offset in NFS WRITE requests is not aligned to a power-of-2 boundary • Packet traces from native runs show correct alignment
Isolating Performance Problems: Case Study #2 (4 of 4) • Question • Why are sequential writes not affected? • NFS server file system idiosyncrasies • Manages cache memory at 4KB granularity • Old blocks are not updated in place; writes go to new blocks • Each < 4KB write incurs a read from the old block • Aggressive read-ahead masks the read latency associated with sequential writes • Solution • Use a disk alignment tool in the guest OS to align the disk partition • Alternatively, use an unformatted partition inside the guest OS
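The partial-block penalty described above can be illustrated with a small sketch that counts how many server-side blocks a write touches and how many of them must be read first because they are only partially overwritten; the 4KB block size follows the case study, everything else is illustrative.

```python
BLOCK = 4096  # the server manages cache at 4KB granularity (per the case study)

def extra_reads_for_write(offset, length, block=BLOCK):
    """Count old-block reads a write incurs when partially covered blocks
    cannot be updated in place and must be read before being rewritten."""
    first_block = offset // block
    last_block = (offset + length - 1) // block
    reads = 0
    for b in range(first_block, last_block + 1):
        block_start, block_end = b * block, (b + 1) * block
        fully_covered = offset <= block_start and offset + length >= block_end
        if not fully_covered:
            reads += 1
    return reads

print(extra_reads_for_write(0, 1024))     # aligned 1KB write: 1 old-block read
print(extra_reads_for_write(512, 4096))   # misaligned 4KB write straddles 2 blocks: 2 reads
print(extra_reads_for_write(4096, 4096))  # aligned 4KB write fully covers a block: 0 reads
```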
Summary and Takeaways • IP-based storage performance in ESX is constantly being improved; key enhancements in ESX 3.5: • Overall storage subsystem • Networking • Resource scheduling and management • Optimized NUMA, multi-core and large-memory support • IP-based network storage technologies are maturing • Price/performance can be excellent • Deployment and troubleshooting can be challenging • Knowledge is key: server/array, networking, host, etc. • Stay tuned for further updates from VMware
Questions? NFS & iSCSI – Performance Characterization and Best Practices in ESX 3.5 Priti Mishra & Bing Tsai VMware