Controlling resource shares to provide fair scheduling in virtualization for server consolidation, with a focus on a generic mechanism and the virtual-time formalism. Implemented in Linux.
Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher
Tal Ben-Nun, School of Engineering & CS, Hebrew University
Yoav Etsion, CS Dept, Barcelona Supercomputing Center
Dror Feitelson, School of Engineering & CS, Hebrew University
Supported by the Israel Science Foundation, grant no. 28/09
Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher
• Goal is to control the share of resources, not to optimize performance – important in virtualization
• Same module used for diverse resources
• Mechanism used: dispatch the most deserving client at each instant
• Selection of the deserving client using the virtual-time formalism
• Implemented and measured in Linux
Motivation
• Context: VMM for server consolidation
• Multiple legacy servers share physical platform
• Improved utilization and easier maintenance
• Flexibility in allocating resources to virtual machines
• Virtual machines typically run a single application (“appliances”)
Motivation
• Assumed goal: enforce predefined allocation of resources to different virtual machines (“fair share” scheduling)
  • Based on importance / SLA
  • Can change with time or due to external events
• Problem: what is “30% of the resources” when there are many different resources and diverse requirements?
Global Scheduling
• “Fair share” usually applied to a single resource
• But what if this resource is not a bottleneck?
• Global scheduling idea:
  • Identify the system bottleneck resource
  • Apply fair share scheduling on this resource
  • This induces appropriate allocations on other resources
• This paper: how to apply fair-share scheduling on any resource in the system
Previous Work I: Virtual Time
• Accounting is inversely proportional to allocation
• Schedule the client that is farthest behind
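A standard way to write this virtual-time rule (a generic textbook formulation, not the slides' own notation): if client i has relative allocation ri and has so far consumed ci units of resource time, then

```latex
% Generic virtual-time accounting (textbook formulation, not the authors' notation)
v_i = \frac{c_i}{r_i}
  \qquad \text{(accounting inversely proportional to the allocation)} \\
\text{dispatch } \arg\min_i v_i
  \qquad \text{(the client that is farthest behind)}
```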
Previous Work II: Traffic Shaping
• Leaky bucket
  • Variable requests
  • Constant-rate transmission
  • Bucket represents a buffer
• Token bucket
  • Variable requests
  • Constant allocations
  • Bucket represents stored capacity
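As a point of reference, a minimal token-bucket sketch in C, not taken from the paper: tokens (the stored capacity) accumulate at a constant rate up to the bucket size, and a variable-size request is admitted only if enough tokens are available. All names and the structure are illustrative.

```c
/* Minimal, generic token-bucket sketch (illustration only, not the authors' code). */
#include <stdbool.h>

struct token_bucket {
    double tokens;    /* currently stored capacity */
    double capacity;  /* bucket size: bound on stored capacity */
    double rate;      /* constant allocation, tokens per second */
};

/* Tokens accumulate at a constant rate, capped at the bucket capacity. */
static void tb_refill(struct token_bucket *tb, double elapsed_sec)
{
    tb->tokens += tb->rate * elapsed_sec;
    if (tb->tokens > tb->capacity)
        tb->tokens = tb->capacity;
}

/* A variable-size request is admitted only if enough tokens are stored. */
static bool tb_admit(struct token_bucket *tb, double request_size)
{
    if (tb->tokens < request_size)
        return false;
    tb->tokens -= request_size;
    return true;
}
```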
Putting them Together: RSVT
• “Resource sharing”: all clients make progress continuously
  • Generalization of processor sharing
• Each job has its ideal resource-sharing progress
  • This is considered to be the allocation ai
  • Grows at a constant rate
• Each job has its actual consumption ci
  • Grows only when the job runs
• Scheduling priority is the difference: pi = ai – ci
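A compact sketch of this priority rule, assuming clients are simply compared by pi = ai - ci; the struct and function names are illustrative, not the kernel module's identifiers:

```c
/* Illustrative RSVT client state and dispatch rule (hypothetical names). */
struct rsvt_client {
    double alloc;    /* a_i: ideal resource-sharing progress, grows at a constant rate */
    double consumed; /* c_i: actual consumption, grows only while the client runs */
};

static double rsvt_priority(const struct rsvt_client *c)
{
    return c->alloc - c->consumed;   /* p_i = a_i - c_i */
}

/* Pick the most deserving client among the n active ones. */
static int rsvt_pick(const struct rsvt_client *clients, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (rsvt_priority(&clients[i]) > rsvt_priority(&clients[best]))
            best = i;
    return best;
}
```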
Example
• Three clients
• Allocations roughly 50%, 30%, 20%
• Consumption always occurs in resource time
[Figure: consumed resource time (y-axis) vs. wallclock time (x-axis) for the three clients]
Bookkeeping
• The set of active jobs is A
• The relative allocation of job i is ri
• During an interval T in which job k has run:
  • Update allocations: ai ← ai + T · ri / Σj∈A rj  (for every active job i)
  • Update consumptions: ck ← ck + T
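A sketch of this per-interval bookkeeping, reusing the illustrative struct rsvt_client from the previous sketch; normalizing by the sum of relative allocations of active clients is an assumption that follows from the next slide's statement that allocations are relative to the active set:

```c
/* Per-interval bookkeeping sketch (illustrative, not the module's actual code).
 * Over an interval of length T during which client `ran` used the resource:
 *   a_i += T * r_i / (sum of r_j over active clients)   for every active i
 *   c_ran += T                                           for the running client only */
static void rsvt_account(struct rsvt_client *clients, const double *rel_alloc,
                         const int *active, int n_active, int ran, double T)
{
    double total = 0.0;
    for (int j = 0; j < n_active; j++)
        total += rel_alloc[active[j]];

    for (int j = 0; j < n_active; j++) {
        int i = active[j];
        clients[i].alloc += T * rel_alloc[i] / total;  /* ideal progress */
    }
    clients[ran].consumed += T;                        /* actual progress */
}
```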
The Active Set
• Active jobs (the set A) are those that can use the resource now
• Allocations are relative to the active set
• The active set may change:
  • New job arrives
  • Job terminates
  • Job stops using resource temporarily
  • Job resumes use of resource
Grace Period
• Intermittent activity: process data / send packet
  • Should retain allocations even when inactive
• Thus ai continues to grow during a grace period after the job becomes inactive
• The grace period reflects a notion of continuity
  • Sub-second time scale
Rebirth
• Resumption after very long inactive periods should be treated as a new arrival
• Due to the grace period, a job that becomes inactive accrues extra allocation
• Forget this extra allocation after a rebirth period (set ai = ci)
• Two orders of magnitude larger than the grace period
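A possible way to apply the grace period and rebirth when a client is found inactive at accounting time; the constants are placeholders (the slides only say the grace period is sub-second and the rebirth period about two orders of magnitude longer):

```c
/* Grace-period / rebirth sketch (illustrative constants and names).
 * Within the grace period the client keeps accruing allocation; once the
 * rebirth period has passed, the accrued surplus is forgotten (a_i = c_i),
 * so resumption is treated like a new arrival. */
#define GRACE_USEC    (100 * 1000)        /* sub-second grace period (assumed value) */
#define REBIRTH_USEC  (100 * GRACE_USEC)  /* ~two orders of magnitude larger */

static void rsvt_handle_inactivity(struct rsvt_client *c, long inactive_usec,
                                   int *still_accruing)
{
    if (inactive_usec < GRACE_USEC) {
        *still_accruing = 1;          /* within grace: keep accruing allocation */
        return;
    }
    *still_accruing = 0;              /* grace expired: mark inactive, stop accruing */
    if (inactive_usec >= REBIRTH_USEC)
        c->alloc = c->consumed;       /* rebirth: forget surplus allocation */
}
```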
Implementation
• Kernel module with generic functionality:
  • Create / destroy module
  • Create / destroy client
  • Make request / set active / set inactive
  • Make allocations
  • Dispatch
  • Check-in (note resource usage)
• Glue code for specific subsystems
  • Currently networking and CPU
  • Plan to add disk I/O
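The operations listed above suggest a generic interface along the following lines; these prototypes are hypothetical, invented here for illustration, and are not the module's actual exported symbols:

```c
/* Hypothetical sketch of the generic RSVT interface listed on the slide.
 * Glue code for each subsystem (networking, CPU) would call into it. */
struct rsvt_domain;   /* one instance per managed resource */
struct rsvt_client;   /* one per competing entity (VM, flow, task) */

struct rsvt_domain *rsvt_create(void);                    /* create module instance */
void rsvt_destroy(struct rsvt_domain *dom);               /* destroy module instance */

struct rsvt_client *rsvt_client_create(struct rsvt_domain *dom, unsigned int rel_alloc);
void rsvt_client_destroy(struct rsvt_client *cl);

void rsvt_request(struct rsvt_client *cl);                /* client has pending work */
void rsvt_set_active(struct rsvt_client *cl);
void rsvt_set_inactive(struct rsvt_client *cl);

void rsvt_allocate(struct rsvt_domain *dom, unsigned long elapsed_usec);  /* make allocations */
struct rsvt_client *rsvt_dispatch(struct rsvt_domain *dom);               /* most deserving client */
void rsvt_checkin(struct rsvt_client *cl, unsigned long used_usec);       /* note resource usage */
```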
Networking Glue Code
• Use the Linux QoS framework: create an RSVT queueing discipline
[Diagram: App → TCP → IP → RSVT queueing discipline (QoS layer) → NIC]
Networking Glue Code
• Non-RSVT traffic (e.g. NFS traffic) has priority: it is sent immediately and counted as dead time
• RSVT traffic is enqueued, and the dispatcher selects which packet to send
[Diagram: App → TCP → IP → RSVT check: non-RSVT packets sent immediately, RSVT packets enqueued then selected and sent → NIC]
CPU Scheduling Glue Code
• Use the Linux modular scheduling core
• Add an RSVT scheduling policy
  • The RSVT module essentially replaces the policy runqueue
  • Initial implementation only for uniprocessors
• CFS and possibly other policies also exist and have higher priority
  • When they run, this is considered dead time
Timer Interrupts
• Linux employs timer interrupts (250 Hz)
• Allocations are done at these times:
  • Translate time into microseconds
  • Subtract known dead time (unavailable to us)
  • Divide among active clients according to relative allocations
  • Bound divergence of allocation from consumption
• Also handling of grace period (mark as inactive)
• Also handling of rebirth (set ai = ci)
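One way to picture the per-tick allocation step, assuming a 250 Hz tick (4000 µs) and an arbitrary divergence bound; names and constants are illustrative and the real module's internals may differ:

```c
/* Per-tick allocation sketch (illustrative; actual module internals may differ).
 * At each timer interrupt, known dead time is subtracted and the remainder is
 * divided among active clients by relative allocation; the gap a_i - c_i is
 * clamped to bound the divergence of allocation from consumption. */
#define TICK_USEC        4000L     /* 250 Hz timer */
#define MAX_GAP_USEC   200000L     /* assumed bound on a_i - c_i divergence */

static void rsvt_tick(struct rsvt_client **active, const double *rel_alloc,
                      int n_active, long dead_usec)
{
    long avail = TICK_USEC - dead_usec;    /* time actually available to RSVT clients */
    double total = 0.0;

    if (avail <= 0 || n_active == 0)
        return;

    for (int j = 0; j < n_active; j++)
        total += rel_alloc[j];

    for (int j = 0; j < n_active; j++) {
        struct rsvt_client *c = active[j];
        c->alloc += avail * rel_alloc[j] / total;     /* divide by relative allocation */
        if (c->alloc - c->consumed > MAX_GAP_USEC)    /* bound divergence */
            c->alloc = c->consumed + MAX_GAP_USEC;
    }
}
```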
Multi-Queue
• At dispatch, need to find the client with the highest priority
• But priorities change at different rates
• Solution: allow only a limited discrete set of relative priorities
  • Each priority has a separate queue
  • Maintain all clients in each queue in priority order
  • Only need to check the first in each queue to find the maximum
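A sketch of the multi-queue dispatch, reusing rsvt_priority and struct rsvt_client from the earlier sketch; the number of discrete levels and the queue representation are illustrative:

```c
/* Multi-queue dispatch sketch (illustrative). Each discrete allocation level has
 * its own queue, kept sorted by p_i = a_i - c_i. Since all clients at one level
 * gain allocation at the same rate, their relative order is stable, so the
 * overall maximum must sit at the head of one of the queues. */
#define NUM_LEVELS 8   /* assumed number of discrete allocation levels */

struct rsvt_queue {
    struct rsvt_client *head;   /* highest-priority client at this level, or NULL */
};

static struct rsvt_client *rsvt_dispatch_mq(struct rsvt_queue qs[NUM_LEVELS])
{
    struct rsvt_client *best = NULL;

    for (int l = 0; l < NUM_LEVELS; l++) {
        struct rsvt_client *c = qs[l].head;
        if (c && (!best || rsvt_priority(c) > rsvt_priority(best)))
            best = c;           /* only queue heads need to be compared */
    }
    return best;
}
```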
Experiment – Throttling
• Two competing MPlayers
• The one with the higher allocation does not need all of it
• Allocation tracks consumption
Conclusions
• Demonstrated generic virtual-time based resource sharing dispatcher
• Need to complete implementation:
  • Support for I/O scheduling
  • More details, e.g. SMP support
• Building block of global scheduling vision