Additional Info (some are still draft). Tech notes that you may find useful as input to the design. A lot more material can be found at the Design Workshop.
Internal Cloud: Gartner model and VMware model • Gartner take: • Virtual infrastructure • On-demand, elastic, automated/dynamic • Improves agility and business continuity • Layers/components shown in the model (top to bottom): Self-service provisioning portal, Service catalog, Chargeback system, Capacity management, External cloud connector, Life-cycle management, Service governor/infrastructure authority, Identity and access management, Configuration and change management, Enterprise service management, Orchestrator, Performance management, Virtual infrastructure management, Virtual infrastructure, Physical infrastructure
Cluster: Settings • For the 3 sample sizes, here is my personal recommendation • DRS fully automated. Sensitivity: Moderate • Use anti-affinity or affinity rules only when needed. • More things for you to remember. • Gives DRS less room to maneuver • DPM enabled. Choose hosts that support DPM • Do not use WOL; use IPMI or iLO • VM Monitoring enabled. • VM monitoring sensitivity: Medium • HA will restart the VM if the heartbeat between the host and the VM has not been received within a 60-second interval • EVC enabled. Enables you to upgrade in the future. • Prevent VMs from being powered on if they violate availability constraints, for better availability • Host isolation response: Shut down VM • See http://www.yellow-bricks.com/vmware-high-availability-deepdiv/ • Compared with “Leave VM Powered on”, this prevents data/transaction integrity risk. The risk is rather low as the VM itself holds a lock • Compared with “Power off VM”, this allows a graceful shutdown. Some applications need to run a consistency check after a sudden power off.
DRS, DPM, EVC In our 3 sizes, here are the settings: • DRS: Fully Automated • DRS sensitivity: leave it at the default (middle, 3-star migration) • EVC: turn on. • It does not reduce performance. • It is a simple mask. • DPM: turn on, unless the HW vendor shows otherwise • VM affinity: use sparingly. It adds complexity as we are using group affinity. • Group affinity: use (as per diagram in design) Why turn on DPM • Power cost is a real concern. Singapore example: S$0.24 per kWh x (600 W + 600 W) x 24 hours x 365 days x 3 years / 1000 ≈ S$7,600 (see the sketch below). This is quite close to the cost of buying 1 server. • For every 1 W of power consumed by the server, we need a minimum of another 1 W for air-conditioning + UPS + lighting
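A minimal Python sketch of the power-cost arithmetic above; the wattage, tariff and 1:1 overhead factor are the slide's example figures, not measurements.

```python
# Rough 3-year power cost for one 600 W server, assuming every watt of IT load
# needs another watt for air-conditioning, UPS and lighting
# (figures from the Singapore example above).
tariff_sgd_per_kwh = 0.24
server_watts = 600
overhead_watts = 600          # cooling + UPS + lighting
hours = 24 * 365 * 3          # 3 years

kwh = (server_watts + overhead_watts) * hours / 1000
cost = kwh * tariff_sgd_per_kwh
print(f"{kwh:.0f} kWh over 3 years = ~S${cost:,.0f}")   # ~S$7,569
```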
VMware VMmark • Use VMmark as the basis for CPU selection only, not entire box selection. • It is the official benchmark for VMware, and it uses multiple workloads • Other benchmarks are not run on vSphere, and typically test 1 workload • VMmark does not include TCO. Consider the entire cost when choosing a HW platform • Use it as a guide only • Your environment is not the same. • You need headroom and HA. • How it’s done • VMmark 2.0 uses 1 - 4 vCPU • MS Exchange, MySQL, Apache, J2EE, File Server, Idle VM • Result page: • VMmark 2.0 is not compatible with 1.x results • www.vmware.com/products/vmmark/results.html This slide needs update
VMmark: sample benchmark result (HP only) I’m only showing results from 1 vendor as vendor comparison is more than just the VMmark result. IBM, Dell, HP, Fujitsu, Cisco, Oracle and NEC all have VMmark results. Notes on reading the chart: 20 tiles = 100 active VMs. Only compare scores at the same number of tiles. ±10% is ok for real-life sizing; this is a benchmark. CPUs shown: Opteron 8439 (24 cores), Xeon 5570 (8 cores), Opteron 2435 (12 cores), Xeon 5470 (8 cores). This tells us that a Xeon 5500 box can run 17 tiles at 100% utilisation. Each tile has 6 VMs, but 1 is idle: 17 x 5 VM = 85 active VMs in 1 box. At 80% peak utilisation, that’s ~65 VMs (see the sketch below).
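A small Python sketch of the tile-to-VM sizing logic above; the tile count, VMs per tile and 80% peak-utilisation target are the slide's example numbers.

```python
# Translate a VMmark result into a rough consolidation estimate
# (example figures from the slide: 17 tiles on a Xeon 5500 host).
tiles = 17
active_vms_per_tile = 5         # each tile has 6 VMs, 1 of them idle
peak_utilisation_target = 0.80  # leave headroom; the benchmark runs at 100%

vms_at_100pct = tiles * active_vms_per_tile
vms_with_headroom = int(vms_at_100pct * peak_utilisation_target)
print(vms_at_100pct, vms_with_headroom)   # 85, 68 (roughly the slide's ~65)
```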
MS Clustering ESX Port Group properties • Notify Switches = NO • Forged Transmits = Accept. Win08 does not support NFS Storage Design • Virtual SCSI adapter • LSI Logic Parallel for Windows Server 2003 • LSI Logic SAS for Windows Server 2008 ESXi changes • ESXi 5.0 uses a different technique to determine if RDM LUNs are used for MSCS cluster devices: it introduces a configuration flag to mark each device that participates in an MSCS cluster as "perennially reserved". Unicast mode reassigns the station (MAC) address of the network adapter for which it is enabled, and all cluster hosts are assigned the same MAC address. Because of this, you cannot have ESX send ARP or RARP to update the physical switch port with the actual MAC address of the NICs, as this breaks unicast NLB communication.
Symantec ApplicationHA Can install the agent to multiple VMs simultaneously. Additional roles for security. It does not cover Oracle yet. Presales contact for ASEAN: Vic
VMware HA and DRS Read Duncan’s Yellow Bricks article first. • Done? Read it again. This time, try to internalise it. See speaker notes below for an example. vSphere 4.1 • Primary Nodes • Primary nodes hold cluster settings and all “node states”, which are synchronized between primaries. Node states hold, for instance, resource usage information. If vCenter is not available, the primary nodes will have a rough estimate of the resource occupation and can take this into account when a fail-over needs to occur. • Primary nodes send heartbeats to primary nodes and secondary nodes. • HA needs at least 1 primary because the “fail-over coordinator” role will be assigned to this primary; this role is also described as “active primary”. • If all primary hosts fail simultaneously, no HA-initiated restart of the VMs will take place. HA needs at least one primary host to restart VMs. This is why you can only take four host failures into account when configuring the “host failures” HA admission control policy. (Remember: 5 primaries…) • The first 5 hosts that join the VMware HA cluster are automatically selected as primary nodes. All the others are automatically selected as secondary nodes. A cluster of 5 will be all primaries. • When you do a reconfigure for HA, the primary and secondary nodes are selected again, at random. The vCenter client does not show which host is a primary and which is not. • Secondary Nodes • Secondary nodes send their state info & heartbeats to the primary nodes only. • From a missing heartbeat alone, HA does not know whether the host is isolated or completely unavailable (down). • The VM lock file is the safety net. In VMFS, the file is not visible. In NFS, it is the .lck file. Nodes send a heartbeat every 1 second; this is the mechanism to detect possible outages.
vSphere 4.1: HA and DRS Best Practices • Avoid using advanced settings to decrease slot size as it might lead to longer down time. Admission control does not take fragmentation of slots into account when slot sizes are manually defined with advanced settings. What can go wrong in HA • VM network lost • HA network lost • Storage network lost
VMware HA and DRS Split Brain vs Partitioned Cluster • A large cluster that spans racks might experience partitioning. Each partition will think it is the full cluster. As long as there is no loss of the storage network, each partition will happily run its own VMs. • Split brain is when 2 hosts want to run the same VM. • Partitioning can happen when the cluster is separated by multiple switches. The diagram below shows a cluster of 4 ESX hosts.
HA: Admission Control Policy (% of Cluster) Specify a percentage of capacity that needs to be reserved for failover • You need to manually set it so it is at least equal to 1 host failure. • E.g. you have an 8-node cluster and want to handle 2 node failures: set the percentage to 25% (see the sketch below). Complexity arises when nodes are not equal • Different RAM or CPU • But this also impacts the other admission control options. So always keep node sizes equal, especially in Tier 1. Total amount of reserved resources < (available resources – reserved resources). If no reservation is set, a default of 256 MHz is used for CPU and 0 MB + memory overhead is used for RAM. Monitor the thresholds with vCenter on the cluster’s “Summary” tab
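A quick Python sketch of the percentage-based admission control arithmetic above; the 8-host / 2-failure example is from the slide and assumes equally sized hosts.

```python
# Percentage of cluster resources to reserve so the cluster can tolerate
# `failures_to_tolerate` host failures, assuming all hosts are equal.
def failover_capacity_pct(num_hosts: int, failures_to_tolerate: int) -> float:
    return 100.0 * failures_to_tolerate / num_hosts

print(failover_capacity_pct(8, 2))   # 25.0 -> set "% of cluster resources" to 25%
print(failover_capacity_pct(8, 1))   # 12.5 -> at least one host's worth
```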
Snapshot Only keep for a maximum of 1-3 days. • Delete or commit as soon as you are done. • A large snapshot may cause issues when committing/deleting. For high-transaction VMs, delete/commit as soon as you are done verifying • E.g. databases, emails. 3rd party tools • Snapshots taken by third party software (called via API) may not show up in the vCenter Snapshot Manager. Routinely check for snapshots via the command-line or a script (see the sketch below). Increasing the size of a disk with snapshots present can lead to corruption of the snapshots and potential data loss. • Check for snapshots via CLI before you increase
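As one illustration of a scripted snapshot check, here is a hedged pyVmomi (Python) sketch that lists VMs which still carry snapshots; the vCenter host name and credentials are placeholders, not values from the slide.

```python
# List VMs that still have snapshots (pyVmomi sketch).
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="readonly", pwd="secret",
                  sslContext=ssl._create_unverified_context())  # lab use; validate certs in production
content = si.RetrieveContent()
vm_view = content.viewManager.CreateContainerView(content.rootFolder,
                                                  [vim.VirtualMachine], True)
for vm in vm_view.view:
    if vm.snapshot:                                  # None when the VM has no snapshots
        print(vm.name, "-", len(vm.snapshot.rootSnapshotList), "root snapshot(s)")
Disconnect(si)
```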
vMotion Can be encrypted, at a cost certainly. If the vMotion network is isolated, then there is no need. May lose 1 ping. Inter-cluster vMotion is not the same as intra-cluster vMotion • Involves additional calls into vCenter, so it is subject to a hard limit • The VM loses cluster-specific properties (HA restart priority, DRS settings, etc.)
ESXi: Network configuration with UCS • If you are using Cisco UCS blades • 2x 10G or 4x 10G depending on blade model and mezzanine card • All mezzanine card models support FCoE • Unified I/O • Low latency • The Cisco Virtual Interface Card (VIC) supports • Multiple virtual adapters per physical adapter • Ethernet & FC on the same adapter • Up to 128 virtual adapters (vNICs) • High performance: 500K IOPS • Ideal for FC, iSCSI and NFS Once you decide it’s Cisco, discuss the details with Cisco.
Storage DRS and DRS Interactions: • Storage DRS placement may impact VM-host compatibility for DRS • DRS placement may impact VM-datastore compatibility for Storage DRS Solution: datastore and host co-placement • Done at provisioning time by Storage DRS • Based on an integrated metric for space, I/O, CPU and memory resources • Overcommitted resources get more weight in the integrated metric • DRS placement proceeds as usual But it is easier to architect it properly: map the ESX cluster to the datastore cluster manually.
End of Row deployment vs Unified Fabric with Fabric Extender (diagram) • End of Row deployment: multiple points of management, separate FC and Ethernet, blade switches, high cable count • Unified fabric with Fabric Extender: single point of management, reduced cables, fiber between racks, copper in racks
Storage IO Control: suggested Congestion Threshold values. Rule one: avoid different settings for datastores sharing underlying resources (diagram: SIOC enabled on Datastore A and Datastore B, both backed by the same physical drives) • Use the same congestion threshold on A and B • Use comparable share values (e.g. use Low/Normal/High everywhere)
NAS & NFS • Two key NAS protocols: • NFS (the “Network File System”). This is what we support. • SMB (Windows networking, also known as “CIFS”) • Things to know about NFS • “Simpler” for people who are not familiar with SAN complexity • Removing a VM lock is simpler as the lock is visible. • When ESX Server accesses a VM disk file on an NFS-based datastore, a special .lck-XXX lock file is generated in the same directory where the disk file resides to prevent other ESX Server hosts from accessing this virtual disk file. • Don’t remove the .lck-XXX lock file, otherwise the running VM will not be able to access its virtual disk file. • No SCSI reservations. This is a minor issue • 1 datastore will only use 1 path • Does Load Based Teaming work with it? • For 1 GE, throughput will peak at around 100-120 MB/s. At a 16 KB block size, that’s roughly 6,000-7,500 IOPS (see the sketch below). • The VMkernel in vSphere 5 only supports NFS v3, not v4. Over TCP only, no support for UDP. • MSCS (Microsoft Clustering) is not supported with NAS. • NFS traffic by default is sent in clear text since ESX does not encrypt it. • Use NAS storage only over trusted networks. Layer 2 VLANs are another good choice here. • 10 Gb NFS is supported. So are jumbo frames; configure them end to end. • Deduplication can save a sizeable amount. See speaker notes
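A tiny Python sketch of the line-rate arithmetic above; the 100-120 MB/s effective throughput for 1 GbE and the 16 KB block size are the slide's assumptions.

```python
# Rough IOPS ceiling for an NFS datastore on a single 1 GbE link,
# since one datastore only uses one path.
def iops_ceiling(throughput_mb_s: float, block_kb: float) -> float:
    return throughput_mb_s * 1024 / block_kb

print(iops_ceiling(100, 16))   # ~6,400 IOPS at 100 MB/s
print(iops_ceiling(120, 16))   # ~7,680 IOPS at 120 MB/s (the slide's ~7,500)
```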
iSCSI • Use a virtual port storage system instead of plain active/active • I’m not sure if they cost much more. • Has 1 additional array type over traditional FC: the virtual port storage system • Allows access to all available LUNs through a single virtual port. • These are active-active arrays, but they hide their multiple connections behind a single port. ESXi multipathing cannot detect the multiple connections to the storage. ESXi does not see multiple ports on the storage and cannot choose the storage port it connects to. These arrays handle port failover and connection balancing transparently. This is often referred to as transparent failover • The storage system uses this technique to spread the load across available ports.
iSCSI • Limitations • ESX/ESXi does not support iSCSI-connected tape devices. • You cannot use virtual-machine multipathing software to perform I/O load balancing to a single physical LUN. • A host cannot access the same LUN when it uses dependent and independent hardware iSCSI adapters simultaneously. • Broadcom iSCSI adapters do not support IPv6 and jumbo frames. [e1: still true in vSphere 5??] • Some storage systems do not support multiple sessions from the same initiator name or endpoint. Multiple sessions to such targets can result in unpredictable behavior. • Dependent and Independent • A dependent hardware iSCSI adapter is a third-party adapter that depends on VMware networking and on the iSCSI configuration and management interfaces provided by VMware. This type of adapter can be a card, such as a Broadcom 5709 NIC, that presents a standard network adapter and iSCSI offload functionality for the same port. The iSCSI offload functionality appears on the list of storage adapters as an iSCSI adapter • Error correction • To protect the integrity of iSCSI headers and data, the iSCSI protocol defines error correction methods known as header digests and data digests. These digests pertain to the header and SCSI data being transferred between iSCSI initiators and targets, in both directions. • Both parameters are disabled by default, but you can enable them. They impact CPU. Nehalem processors offload the iSCSI digest calculations, thus reducing the impact on performance • Hardware iSCSI • When you use a dependent hardware iSCSI adapter, performance reporting for a NIC associated with the adapter might show little or no activity, even when iSCSI traffic is heavy. This behavior occurs because the iSCSI traffic bypasses the regular networking stack • Best practice • Configure jumbo frames end to end. • Use NICs with TCP segmentation offload (TSO)
iSCSI & NFS: caveat when used together Avoid using them together. iSCSI and NFS have different HA models. • iSCSI uses vmknics with no Ethernet failover – it uses MPIO instead • The NFS client relies on vmknics using link aggregation/Ethernet failover • NFS relies on the host routing table. • NFS traffic may end up using the iSCSI vmknic, resulting in links without redundancy • Use of multiple-session iSCSI with NFS is not supported by NetApp • EMC supports it, but the best practice is to have separate subnets and virtual interfaces
NPIV What it is • Allows a single Fibre Channel HBA port to register with the Fibre Channel fabric using several worldwide port names (WWPNs). This ability makes the HBA port appear as multiple virtual ports, each having its own ID and virtual port name. Virtual machines can then claim each of these virtual ports and use them for all RDM traffic. • Note that it is the WWPN, not the WWNN • WWPN – World Wide Port Name • WWNN – World Wide Node Name • A single-port HBA typically has a single WWNN and a single WWPN (which may be the same). • Dual-port HBAs may have a single WWNN to identify the HBA, but each port will typically have its own WWPN. • However they could also have an independent WWNN per port too. Design consideration • Only applicable to RDM • The VM does not get its own HBA, and no FC driver is required in the guest. It just gets an N_Port, so it is visible from the fabric. • HBA and SAN switch must support NPIV • Cannot perform Storage vMotion or vMotion between datastores when NPIV is enabled. All RDM files must be in the same datastore. • Still in place in v5 (In the WWN assignments shown, the first value is the WW Node Name and the second is the WW Port Name.)
2 TB VMDK barrier You need to have a > 2 TB disk within a VM. • There are some solutions, each with pros and cons. • Say you need a 5 TB disk in 1 Windows VM. • RDM (even with physical compatibility) and DirectPath I/O do not increase the virtual disk limit. Solution 1: VMFS or NFS • Create a datastore of 5 TB. • Create 3 VMDKs. Present them to Windows • Windows then combines the 3 disks into 1 volume. • Limitation • Certain low-level storage software may not work, as it needs 1 disk (not one combined by the OS) Solution 3: iSCSI within the Guest • Configure the iSCSI initiator in Windows • Configure a 5 TB LUN. Present the LUN directly to Windows, bypassing the ESX layer. You can’t monitor it from vSphere. • By default, it will only have 1 GE. NIC teaming requires a driver from Intel. Not sure if this is supported.
Storage: Queue Depth When should you adjust the queue depth? • If a VM generates more commands to a LUN than the LUN queue depth, adjust the device/LUN queue. • Generally, with fewer very high-IO VMs on a host, larger queues at the device driver will improve performance. • If the VM’s queue depth is lower than the HBA’s, adjust the VMkernel setting. Be cautious when setting queue depths • With too large a device queue, the storage array can easily be overwhelmed and its performance may suffer with high latencies. • The device driver queue depth is global and set per LUN. • Change the device queue depth on all ESX hosts in the cluster. Calculating the queue depth: • To verify that you are not exceeding the queue depth of an HBA, use the following check (see the sketch below): device queue setting x number of LUNs on the HBA ≤ max queue depth of the HBA. Queues exist at multiple levels • LUN queue for each LUN at the ESXi host. • If that queue is full, the kernel queue fills up next • LUN queue at the array level for each LUN • If this queue does not exist, the array writes straight to disk. • Disk queue • The queue at the disk level, if there is no LUN queue
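A small Python sketch of the queue-depth check above; the HBA limit and per-LUN setting used here are illustrative values, not vendor defaults.

```python
# Verify that the per-LUN device queue depth times the number of LUNs
# presented through an HBA does not exceed the HBA's own queue depth.
def hba_queue_ok(device_queue_depth: int, luns_on_hba: int,
                 hba_max_queue_depth: int) -> bool:
    return device_queue_depth * luns_on_hba <= hba_max_queue_depth

# Illustrative numbers: 32 outstanding commands per LUN, 30 LUNs, HBA limit 1024.
print(hba_queue_ok(32, 30, 1024))   # True  (960 <= 1024)
print(hba_queue_ok(64, 30, 1024))   # False (1920 > 1024) -> lower the LUN queue or spread LUNs
```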
Sizing the Storage Array • Example for RAID 1 (it has an IO penalty of 2): a 7,000 IOPS workload with 30% writes and 150 IOPS per drive needs ((7000 x 2 x 30%) + (7000 x 70%)) / 150 ≈ 61 drives (see the sketch below). • Why does RAID 5 have an IO penalty of 4? Each random write must read the old data block and the old parity, then write the new data and the new parity: 4 back-end IOs per front-end write.
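A Python sketch of the drive-count arithmetic above; the 7,000 IOPS workload, 30% write mix and 150 IOPS per drive are the slide's example figures.

```python
# Back-end drive count needed for a given front-end workload and RAID level.
# Write penalty: RAID 1/10 = 2, RAID 5 = 4 (read data + read parity + write data + write parity).
def drives_needed(frontend_iops: float, write_pct: float, write_penalty: int,
                  iops_per_drive: float) -> float:
    backend_iops = (frontend_iops * write_pct * write_penalty
                    + frontend_iops * (1 - write_pct))
    return backend_iops / iops_per_drive

print(drives_needed(7000, 0.30, 2, 150))   # ~60.7 drives on RAID 1
print(drives_needed(7000, 0.30, 4, 150))   # ~88.7 drives on RAID 5
```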
Storage: Performance Monitoring Get a baseline of your environment during a “normal” IO time frame. • Capture as many data points as possible for analysis. • Capture data from the SAN fabric, the storage array, and the hosts. Which statistics should be captured (see the sketch below) • Max and average read/write IOPS • Max and average read/write latency (ms) • Max and average throughput (MB/sec) • Read and write percentages • Random vs. sequential • Capacity – total and used
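A minimal Python sketch of summarising captured samples into the statistics listed above; the field names (read_iops, write_latency_ms, ...) are hypothetical and not tied to any particular monitoring tool.

```python
# Summarise a baseline capture into max/average IOPS, latency and read/write mix.
# Each sample is a dict with made-up field names, e.g.
# {"read_iops": ..., "write_iops": ..., "write_latency_ms": ...}
def summarise(samples):
    n = len(samples)
    reads = sum(s["read_iops"] for s in samples)
    writes = sum(s["write_iops"] for s in samples)
    return {
        "max_read_iops": max(s["read_iops"] for s in samples),
        "avg_read_iops": reads / n,
        "max_write_latency_ms": max(s["write_latency_ms"] for s in samples),
        "avg_write_latency_ms": sum(s["write_latency_ms"] for s in samples) / n,
        "read_pct": 100 * reads / (reads + writes),
    }

print(summarise([
    {"read_iops": 800, "write_iops": 200, "write_latency_ms": 6},
    {"read_iops": 1200, "write_iops": 400, "write_latency_ms": 9},
]))
```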
Diagram: Fibre Channel multi-switch fabric. Nodes A-H attach their N_Ports to F_Ports on Fabric Switch 1 and Fabric Switch 2; the two switches are joined through E_Ports, and every link has a transmitter (TR) and receiver (RC) pair.
Backup: VADP vs agent-based The ESX host has 23 VMs, each around 40 GB. • All VMs are idle, so this CPU/disk load is purely from the backup. • CPU peak is >10 GHz (just above 4 cores) • Disk peak is >1.4 Gbps of IO, almost 50% of a 4 Gb HBA. After VADP, both CPU and disk load drop to negligible levels
VADP: Adoption Status This is as of June 2010. Always check with the vendor for the most accurate data
Partition alignment Affects every protocol, and every storage array • VMFS on iSCSI, FC, & FCoE LUNs • NFS • VMDKs & RDMs with NTFS, EXT3, etc VMware VMFS partitions that align to 64KB track boundaries give reduced latency and increased throughput • Check with the storage vendor if there are any recommendations to follow. • If no recommendations are made, use a starting block that is a multiple of 8 KB (see the sketch below). Responsibility of the Storage Team. • Not the vSphere Team On NetApp: • VMFS partitions are automatically aligned. Starting block in multiples of 4k • MBRscan and MBRalign tools are available to detect and correct misalignment (Diagram: guest file system clusters of 4 KB-1 MB sit on VMFS blocks of 1-8 MB, which sit on array chunks of 4-64 KB.)
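A tiny Python sketch of the alignment check described above; the 8 KB fallback and the 64 KB track boundary are the slide's figures, and the partition offsets are example values.

```python
# Check whether a partition's starting offset is aligned to a given boundary.
def is_aligned(start_offset_bytes: int, boundary_kb: int) -> bool:
    return start_offset_bytes % (boundary_kb * 1024) == 0

# Example: a partition starting at a 1 MB offset vs the old 63-sector MBR offset.
print(is_aligned(1048576, 64))   # True  -> aligned to 64 KB track boundaries
print(is_aligned(32256, 8))      # False -> 63 x 512 bytes, misaligned
```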
Tools: Array-specific integration • The example below is from NetApp. Other storage partners have integration capabilities too. • Always check with the respective product vendor for the latest information.
Tools: Array-specific integration • Management of the array can be done from the vSphere client. Below is from NetApp • Ensure storage access is not accidentally given to the vSphere admin, by using RBAC
Data Recovery No integration with tape • Can be done manually If a third-party solution is being used to back up the deduplication store, those backups must not run while the Data Recovery service is running. Do not back up the deduplication store without first powering off the Data Recovery backup appliance or stopping the datarecovery service using the command service datarecovery stop. Some limits • 8 concurrent jobs on the appliance at any time (backup & restore). • An appliance can have at most 2 dedupe store destinations, due to the overhead involved in deduping. • VMDK or RDM based deduplication stores of up to 1 TB, or CIFS based deduplication stores of up to 500 GB. • No IPv6 addresses • No multiple backup appliances on a single host. VDR cannot back up VMs • that are protected by VMware Fault Tolerance. • with 3rd party multi-pathing enabled where shared SCSI buses are in use. • with raw device mapped (RDM) disks in physical compatibility mode. • Data Recovery can back up VMware View linked clones, but they are restored as unlinked clones. Using Data Recovery to back up Data Recovery backup appliances is not supported. • This should not be an issue. The backup appliance is a stateless device, so there is not the same need to back it up as other types of VMs.
VMware Data Recovery We assume the following requirements • Back up to an external array, not the same array. • The external array can be used for other purposes too, so the 2 arrays back each other up. • How to ensure write performance as the array is shared? • 1x a day backup. No need for multiple backups per day of the same VM. Consideration • Bandwidth: need a dedicated NIC to the Data Recovery VM • Performance: need to reserve CPU/RAM for the VM? • Group like VMs together. It maximises dedupe • Destination: RDM LUN presented via iSCSI to the appliance. See picture below (hard disk 2) • Not using the VMDK format, to enable LUN-level operations • Not using CIFS/SMB, as the deduplication store limit is 0.5 TB vs 1 TB on RDM/VMDK • Space calculation: need to find a tool to help estimate the disk requirements.
Mapping: Datastore – VM Criteria to use when placing a VM into a tier: • How critical is the VM? Importance to the business. • What are its performance and availability requirements? • What are its point-in-time restoration requirements? • What are its backup requirements? • What are its replication requirements? Have a document that lists which VM resides on which datastore group • Content can be generated using PowerCLI or Orchestrator, which shows datastores and their VMs (see the sketch below for one way to do it). • Example tool: Quest PowerGUI • While it rarely happens, you can’t rule out datastore metadata corruption. • When that happens, you want to know which VMs are affected. A VM normally changes tiers throughout its life cycle • Criticality is relative and might change for a variety of reasons, including changes in the organization, operational processes, regulatory requirements, disaster planning, and so on. • Be prepared to do Storage vMotion. • Always test it first so you know how long it takes in your specific environment • VAAI is critical, or else the traffic will impact your other VMs.
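The slide mentions PowerCLI or Orchestrator; as an alternative illustration, here is a hedged pyVmomi (Python) sketch that prints each datastore and the VMs on it. The vCenter host name and credentials are placeholders.

```python
# List each datastore and the VMs that live on it (pyVmomi sketch).
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="readonly", pwd="secret",
                  sslContext=ssl._create_unverified_context())  # lab use only
content = si.RetrieveContent()
ds_view = content.viewManager.CreateContainerView(content.rootFolder,
                                                  [vim.Datastore], True)
for ds in ds_view.view:
    print(ds.name)
    for vm in ds.vm:                     # VMs with files on this datastore
        print("   ", vm.name)
Disconnect(si)
```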
RDM • Use sparingly. • VMDK is more portable, easier to manage, and easier to resize. • VMDK and RDM have similar performance. • Physical RDM • Can’t take snapshots. • No Storage vMotion, but vMotion works. • Physical mode specifies minimal SCSI virtualization of the mapped device, allowing the greatest flexibility for SAN management software. • The VMkernel passes all SCSI commands to the device, with one exception: the REPORT LUNs command is virtualized so that the VMkernel can isolate the LUN to the owning virtual machine. • Virtual RDM • Specifies full virtualization of the mapped device. Features like snapshots etc. work • The VMkernel sends only READ and WRITE to the mapped device. The mapped device appears to the guest operating system exactly the same as a virtual disk file in a VMFS volume. The real hardware characteristics are hidden.
Human Experts vs Storage DRS 2 VMware performance engineers competed against Storage DRS to balance the following: • 13 VMs: 3 DVD Store, 2 Swingbench, 4 mail servers, 2 OLTP, 2 web servers • 2 ESX hosts and 3 storage devices (different FC LUNs) Storage DRS provides the lowest average latency while maintaining similar throughput. Why did the human experts lose? • Too many numbers to crunch, too many dimensions to the analysis. The humans took a couple of hours to think this through; why bother anyway? (Charts: IOPS and latency in ms for each placement, with the Storage DRS results marked; the green line is average latency.)
Alternative Backup Method The VMware ecosystem may provide new ways of doing backup. • The example below is from NetApp NetApp SnapManager for Virtual Infrastructure (SMVI) • In a large cloud, the SMVI server should sit on a separate VM from vCenter. • While it has no performance requirement, this is best from a segregation-of-duties point of view. • Best practice is to keep vCenter clean & simple. vCenter plays a much more critical role in larger environments, where plug-ins rely on vCenter uptime. • Allows for consistent array snapshots & replication. • Combine with other SnapManager products (SM for Exchange, SM for Oracle, etc) for application consistency • Exchange and SQL work with VMDK • Oracle, SharePoint, SAP require RDM • Can be combined with SnapVault for vaulting to disk. • 3 levels of data protection: • On-disk array snapshots for fast backup (seconds) & recovery (up to 255 snapshot copies of any datastore can be kept with no performance impact) • Vaulting to a separate array for better protection, slightly slower recovery • SnapMirror to offsite for DR purposes • Serves to minimize the backup window (and the frozen vmdk while changes are applied) • Option to skip the VM snapshot and create crash-consistent array snapshots
NFS networking decision (flowchart): does the switch support multi-switch link aggregation? • Yes: use one VMkernel port & IP subnet, and use multiple links with IP hash load balancing on the NFS client (ESX). Storage needs multiple sequential IP addresses. • No: use multiple VMkernel ports & IP subnets, use multiple links with IP hash load balancing on the NFS server (array), and use the ESX routing table. Storage needs multiple sequential IP addresses.
vMotion Performance on 1 GbE vs 10 GbE (chart: duration of vMotion, lower is better) • Idle/moderately loaded VM scenarios • Reductions in duration when using 10 GbE vs 1 GbE on both vSphere 4.1 and vSphere 5. Consider switching the vMotion network from 1 GbE to 10 GbE • Heavily loaded VM scenario • Reductions in duration when using 10 GbE vs 1 GbE • 1 GbE on vSphere 4.1: memory copy convergence issues lead to network connection drops • 1 GbE on vSphere 5: SDPS kicked in, resulting in zero connection drops. vMotion in vSphere 5 never fails due to memory copy convergence issues
Impact on database server performance during vMotion (charts: throughput over time in seconds, with the guest trace period and switch-over period marked; vMotion durations of 23 sec and 15 sec are annotated) • Performance impact is minimal during the memory trace phase in vSphere 5 • Throughput was never zero in vSphere 5 (switch-over time < half a second) • Time to resume the normal level of performance is about 2 seconds better in vSphere 5