mClock: Handling Throughput Variability for Hypervisor IO Scheduling. Ajay Gulati (VMware Inc.), Arif Merchant (HP Labs), Peter Varman (Rice University). In USENIX conference on Operating Systems Design and Implementation (OSDI) 2010.
Outline
• Introduction
• Scheduling goals of mClock
• mClock Algorithm
• Distributed mClock
• Performance Evaluation
• Conclusion
Introduction
• Hypervisors are responsible for multiplexing the underlying hardware resources among VMs.
  • CPU, memory, network, and storage IO
• The amount of CPU and memory resources on a host is fixed and time-invariant.
• The IO throughput available to a host is not under its own control.
[Figure: two hosts, each with CPU, RAM, several VMs, and a storage IO scheduler, connected to a shared storage array.]
Introduction (cont’d)
• Existing methods provide many knobs for allocating CPU and memory to VMs.
• The current state of the art in IO resource allocation is much more rudimentary.
  • It is limited to providing proportional shares to different VMs.
• Lack of QoS support for IO resources can have widespread effects.
  • Existing CPU and memory controls become ineffective when applications block on IO requests.
Introduction (cont’d)
• The amount of IO throughput available to any particular host can fluctuate widely based on the behavior of other hosts accessing the shared device.
[Figure: timeline annotated with the events VM5 starts, VM1 starts, VM2 and VM3 start, VM1 stops, VM4 starts, VM2 and VM3 stop, VM4 stops.]
Introduction (cont’d)
• Three main controls in resource allocation:
  • Shares (a.k.a. weights)
    • Proportional resource allocation
  • Reservations
    • Minimum amount of resource allocation
    • Used to provide a latency guarantee
  • Limits
    • Maximum allowed resource allocation
    • Prevent competing IO-intensive applications from consuming all the spare bandwidth in the system
Scheduling goals of mClock
• When reservations cannot be met: allocate throughput in proportion to reservations.
• When reservations can be met: satisfy reservations first, then allocate the remainder in proportion to weights.
• Limits cap the maximum throughput of a VM (in the example, the VM labeled DM).
Scheduling goals of mClock (cont’d)
• Each VM i has three parameters:
  • Reservation (r_i), limit (l_i), and weight (w_i)
• VMs are partitioned into three sets:
  • Reservation-clamped (R), limit-clamped (L), or proportional (P), based on whether their current allocation is clamped at the lower bound, clamped at the upper bound, or in between.
• Define: [equation image not preserved]
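The goals on the two slides above can be sketched as a small allocation routine. This is an illustrative reading, not the authors' definition: the function name `target_allocation`, the total capacity `total`, and the bisection over a per-weight rate are assumptions, and the sketch assumes finite limits, positive weights, and r_i <= l_i.

```python
from typing import List


def target_allocation(total: float, r: List[float], l: List[float],
                      w: List[float]) -> List[float]:
    """Hypothetical helper: per-VM throughput for a capacity of `total` IOPS."""
    if sum(r) > total:
        # Reservations cannot all be met: split in proportion to reservations.
        scale = total / sum(r)
        return [ri * scale for ri in r]

    # Reservations can be met: each VM gets lam * w_i, clamped to [r_i, l_i].
    def alloc(lam: float) -> List[float]:
        return [min(max(lam * wi, ri), li) for ri, li, wi in zip(r, l, w)]

    budget = min(total, sum(l))  # limits may leave some capacity unused
    lo, hi = 0.0, max(li / wi for li, wi in zip(l, w))
    for _ in range(100):  # bisection on the per-weight rate lam
        mid = (lo + hi) / 2
        if sum(alloc(mid)) < budget:
            lo = mid
        else:
            hi = mid
    return alloc(hi)


# Three VMs: the first ends up reservation-clamped (R), the second
# proportional (P), the third limit-clamped (L).
shares = target_allocation(1000, r=[250, 250, 0], l=[1000, 1000, 300], w=[1, 2, 2])
```

With these made-up numbers the three VMs land in the sets R, P, and L respectively, which is the partition the slide describes.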
mClock Algorithm
• mClock uses two main ideas:
  • Multiple real-time clocks
    • Reservation-based, limit-based, and weight-based clocks
  • Dynamic clock selection
    • Dynamically select one of the real-time clocks for scheduling.
• The tag assignment method is similar to Virtual Clock scheduling.
mClock Algorithm (cont’d)
• Tag Adjustment
  • To calibrate the proportional-share tags against real time
  • To prevent starvation
• In virtual-time-based scheduling, this synchronization is done using global virtual time: S_{i,k} = max{F_{i,k-1}, V(a_{i,k})}.
• In mClock, the reservation and limit tags must be based on real time, so the origin of the existing P tags is adjusted to the real time.
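A minimal sketch of the tag adjustment step, assuming it shifts every active VM's P tag by the same amount so that the smallest P tag coincides with the current real time. The function name `adjust_p_tags` and the dictionary representation are hypothetical.

```python
from typing import Dict


def adjust_p_tags(p_tags: Dict[str, float], now: float) -> Dict[str, float]:
    """Shift all P tags so that the smallest one equals `now`.

    Relative spacing between VMs is preserved; only the origin moves.
    """
    if not p_tags:
        return {}
    shift = min(p_tags.values()) - now
    return {vm: tag - shift for vm, tag in p_tags.items()}


adjusted = adjust_p_tags({"vm_a": 10.0, "vm_b": 12.5}, now=100.0)
```

Because every tag moves by the same offset, the proportional ordering among already-active VMs is unchanged, while a newly active VM whose tag starts at the current time is neither starved nor unfairly favored.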
mClock Algorithm (cont’d)
Annotations on the algorithm listing:
• Tag adjustment
• Reservation first
• Otherwise, select the request from the VMs that are within their limits.
• Active_IOs: counts the queue length.
mClock Algorithm (cont’d)
• This maintains the condition that R tags are always spaced 1/r_i apart, so that reserved service is not affected by the service provided in the weight-based phase.
[Figure: timeline of R tags R_k^1 to R_k^5 spaced 1/r_k apart, with the current time t marked. Request r_k^3 is served; the waiting time of r_k^4 may be longer than 1/r_k.]
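The pieces described on the algorithm slides can be put together in a compact sketch. It assumes each tag advances by 1/r_i, 1/l_i, or 1/w_i from the previous tag (or jumps to the current time), that requests whose R tag is due are served first, and that a weight-based dispatch pulls the VM's remaining R tags back by 1/r_i. Class and method names are hypothetical, reservations are assumed positive, and tag adjustment and Active_IOs accounting are omitted.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple


@dataclass
class Request:
    vm: str
    r_tag: float
    l_tag: float
    p_tag: float


@dataclass
class VM:
    reservation: float  # IOPS, assumed > 0 in this sketch
    limit: float        # IOPS, may be math.inf
    weight: float
    last: Tuple[float, float, float] = (-math.inf, -math.inf, -math.inf)
    queue: Deque[Request] = field(default_factory=deque)


class MClockSketch:
    def __init__(self, vms: Dict[str, VM]) -> None:
        self.vms = vms

    def arrive(self, name: str, now: float) -> None:
        """Tag a new request: previous tag plus spacing, or `now` if later."""
        vm = self.vms[name]
        r_prev, l_prev, p_prev = vm.last
        vm.last = (max(r_prev + 1 / vm.reservation, now),
                   max(l_prev + 1 / vm.limit, now),
                   max(p_prev + 1 / vm.weight, now))
        vm.queue.append(Request(name, *vm.last))

    def dispatch(self, now: float) -> Optional[Tuple[str, str]]:
        heads = [vm.queue[0] for vm in self.vms.values() if vm.queue]
        # Constraint-based phase: reservation first.
        due = [q for q in heads if q.r_tag <= now]
        if due:
            req = min(due, key=lambda q: q.r_tag)
            self.vms[req.vm].queue.popleft()
            return req.vm, "reservation"
        # Weight-based phase: only VMs within their limit are eligible.
        eligible = [q for q in heads if q.l_tag <= now]
        if not eligible:
            return None
        req = min(eligible, key=lambda q: q.p_tag)
        vm = self.vms[req.vm]
        vm.queue.popleft()
        # Pull pending R tags back so they stay 1/r_i apart.
        for q in vm.queue:
            q.r_tag -= 1 / vm.reservation
        r_prev, l_prev, p_prev = vm.last
        vm.last = (r_prev - 1 / vm.reservation, l_prev, p_prev)
        return req.vm, "weight"
```

A linear scan over queue heads keeps the sketch short; a production scheduler would more likely keep per-tag priority queues.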
Storage-specific Issues
• Burst Handling
  • Storage workloads are known to be bursty.
  • Requests from the same VM often have high spatial locality.
  • Bursty workloads that were idle gain a limited preference in scheduling when the system next has spare capacity.
  • To accomplish this, VMs are allowed to accumulate idle credits.
[Figure: timeline of P tags (P_k^1, P_k^2, P_k^2 + 1/w_i, P_k^3) showing an idle period, an interval of length σ_i/w_i, and the arrival of request r_k^3 at the current time t.]
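One way to read the idle-credit mechanism is as a change to the P tag floor: instead of resetting an idle VM's tag to the current time t, the tag may fall as far back as t − σ_i/w_i. The sketch below assumes that reading; the function name and the parameter name `sigma` are hypothetical.

```python
def p_tag_with_idle_credit(prev_p_tag: float, now: float,
                           weight: float, sigma: float) -> float:
    """P tag for a new request when the VM may hold `sigma` idle credits.

    With sigma = 0 this reduces to max(prev + 1/w, now).
    """
    return max(prev_p_tag + 1.0 / weight, now - sigma / weight)


# A VM that was idle from t=10 to t=20 with weight 2 and 4 credits:
tag = p_tag_with_idle_credit(prev_p_tag=10.0, now=20.0, weight=2.0, sigma=4.0)
```

Only the P tag is affected, so the preference applies in the weight-based phase when spare capacity exists; reservations and limits are tagged as before.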
Storage-specific Issues (cont’d)
• IO size
  • Since larger IOs take longer to complete, differently sized IOs should not be treated equally by the IO scheduler.
  • The IO latency with n random outstanding IOs, each of size S, can be written as: [equation image not preserved]
  • Converting the latency observed for an IO of size S1 to that of an IO of a reference size S2: [equation image not preserved]
  • For a small reference size S2, the term S2/(Tm×Bpeak) is negligible.
  • A single request of IO size S is treated as equivalent to (1 + S/(Tm×Bpeak)) IO requests.
• Tm: mechanical delay due to seek and disk rotation.
• Bpeak: the peak transfer bandwidth of a disk.
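The size-based cost on this slide is easy to compute directly. In the sketch below, `io_cost` implements the stated factor 1 + S/(Tm×Bpeak). The helper `convert_latency` is an assumption: it supposes latency scales as n × (Tm + S/Bpeak), which is consistent with the definitions of Tm and Bpeak but is not recovered from the slide. The numeric values (5 ms, 60 MB/s) are illustrative only.

```python
def io_cost(size_bytes: float, t_mech_s: float, b_peak_bytes_per_s: float) -> float:
    """Number of unit IO requests that one IO of `size_bytes` counts as."""
    return 1.0 + size_bytes / (t_mech_s * b_peak_bytes_per_s)


def convert_latency(latency_s1: float, s1: float, s2: float,
                    t_mech_s: float, b_peak_bytes_per_s: float) -> float:
    """Assumed model: rescale latency seen at size s1 to reference size s2."""
    return (latency_s1 * io_cost(s2, t_mech_s, b_peak_bytes_per_s)
            / io_cost(s1, t_mech_s, b_peak_bytes_per_s))


T_MECH = 0.005   # 5 ms mechanical delay (illustrative value)
B_PEAK = 60e6    # 60 MB/s peak bandwidth (illustrative value)

small = io_cost(8_000, T_MECH, B_PEAK)    # close to 1: size term is negligible
large = io_cost(300_000, T_MECH, B_PEAK)  # about 2: counts as two requests
```

In a tag-based scheduler, this factor would multiply the tag spacing, so a large IO advances a VM's tags further than a small one.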
Storage-specific Issues (cont’d)
• Request Location
  • mClock improves the overall efficiency of the system by scheduling IOs with high locality as a batch.
  • A VM is allowed to issue IO requests in a batch as long as the requests are close in logical block number space.
• Reservation Setting
  • IOPS = Outstanding IOs / Latency
  • Example: an application that keeps 8 IOs outstanding and requires 25 ms latency needs a reservation of 8 / 0.025 = 320 IOPS.
Distributed mClock
• Targets cluster-based storage systems.
• dmClock runs a modified version of mClock on each storage server.
• Two integers, ρ_i and δ_i, are piggybacked on each request from VM v_i to a storage server s_j.
  • δ_i: the number of IO requests from VM v_i that have completed service at all the servers between the previous request (from v_i) to server s_j and the current request.
  • ρ_i: the number of IO requests from v_i that have been served as part of the constraint-satisfying phase between the previous request to s_j and the current request.
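A sketch of how a server might use the two piggybacked integers when tagging a request, assuming the R tag advances by ρ_i/r_i and the L and P tags advance by δ_i/l_i and δ_i/w_i. The function name and tuple layout are hypothetical; with ρ_i = δ_i = 1 the update reduces to the single-server spacing of 1/r_i, 1/l_i, and 1/w_i.

```python
from typing import Tuple

Tags = Tuple[float, float, float]  # (R tag, L tag, P tag)


def dmclock_tags(prev: Tags, now: float, rho: int, delta: int,
                 reservation: float, limit: float, weight: float) -> Tags:
    """Tags for a request carrying rho and delta, given the VM's previous tags."""
    r_prev, l_prev, p_prev = prev
    return (max(r_prev + rho / reservation, now),
            max(l_prev + delta / limit, now),
            max(p_prev + delta / weight, now))


# The VM completed 3 requests cluster-wide since its last request to this
# server, 2 of them in the constraint-satisfying phase.
tags = dmclock_tags((10.0, 10.0, 10.0), now=10.5, rho=2, delta=3,
                    reservation=2.0, limit=4.0, weight=1.0)
```

The effect is that service received at other servers counts against the VM's tags here, without the servers exchanging any state among themselves.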
Performance Evaluation
• Implemented in the VMware ESX server hypervisor
  • By modifying the SCSI scheduling layer in its I/O stack
• The host is a Dell PowerEdge 2950 server
  • Two QLogic HBAs connected to an EMC CLARiiON CX3-40 storage array over an FC SAN
• Two different storage volumes were used:
  • A 10-disk RAID 0 disk group
  • A 10-disk RAID 5 disk group
Performance Evaluation
• Two kinds of VMs
  • Linux VMs with a 10GB virtual disk, one VCPU, and 512MB memory
  • Windows Server 2003 VMs with a 16GB virtual disk, one VCPU, and 1GB memory
• Workload generators
  • Iometer in the Windows Server VMs
    • http://www.iometer.org/
  • A self-designed workload generator in the Linux VMs
Performance Evaluation (cont’d)
• Limit Enforcement
  • At t = 140, the limit for DM is set to 300 IOPS.
Performance Evaluation (cont’d)
• Reservations Enforcement
  • Five VMs with weights in the ratio 1:1:2:2:2.
  • VMs are started at 60-second intervals.
  • mClock enforces reservations.
  • SFQ only does proportional allocation.
[Figure annotations: 300 IOPS, 250 IOPS.]
Performance Evaluation (cont’d)
• Bursty VM Workloads
  • VM1: 128 IOs every 400 ms, all 4KB reads, 80% random.
  • VM2: 16KB reads, 20% random and the rest sequential, with 32 outstanding IOs.
• Idle credits do not impact the overall bandwidth allocation over time.
• The latency seen by the bursty VM1 decreases as the idle credits are increased.
Performance Evaluation (cont’d)
• Filebench Workloads
  • Emulate the workload of OLTP VMs [25]
[25] R. McDougall. Filebench: Application level file system benchmark. http://www.solarisinternals.com/si/tools/filebench/index.php
Performance Evaluation (cont’d)
• dmClock Evaluation
  • Implemented in a distributed storage system that consists of multiple storage servers (nodes).
  • Each node is implemented as a virtual machine running RHEL Linux with a 10GB OS disk and a 10GB experimental disk.
Conclusion
• mClock provides per-VM quality of service. The QoS requirements are expressed as:
  • Minimum reservation
  • Maximum limit
  • Proportional shares (weight)
• The controls provided by mClock allow stronger isolation among VMs.
• The techniques are quite generic and can also be applied to array-level scheduling and to other resources such as network bandwidth allocation.
Comments
• Existing VM services only provision resources in terms of CPU, memory, and storage, but I/O throughput may be the largest factor in QoS provisioning.
  • In terms of response time or delay time.
• Combining reservation, limit, and proportional share in a scheduling algorithm is a good idea.
  • WF2Q-M considers limits but not reservations.
• Open question: how should reservation, limit, and proportional share be handled between VMs on different hosts?
Comments (cont’d)
• The experiments only validate the correctness of mClock.
  • What about short-term fairness, latency distribution, and computation overhead?
• The experiments use only one host machine.
  • They cannot reflect throughput variability when there are multiple hosts.