Controlling resource shares to provide fair scheduling in virtualization for server consolidation, with a focus on a generic mechanism and the virtual-time formalism. Implemented in Linux.
Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher
Tal Ben-Nun, School of Engineering & CS, Hebrew University
Yoav Etsion, CS Dept, Barcelona Supercomputing Center
Dror Feitelson, School of Engineering & CS, Hebrew University
Supported by the Israel Science Foundation, grant no. 28/09
Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher
• Goal is to control the share of resources, not to optimize performance – important in virtualization
• Same module used for diverse resources
• Mechanism used: dispatch the most deserving client at each instant
• Selection of the deserving client using the virtual-time formalism
• Implemented and measured in Linux
Motivation
• Context: VMM for server consolidation
• Multiple legacy servers share physical platform
• Improved utilization and easier maintenance
• Flexibility in allocating resources to virtual machines
• Virtual machines typically run a single application (“appliances”)
Motivation
• Assumed goal: enforce predefined allocation of resources to different virtual machines (“fair share” scheduling)
  • Based on importance / SLA
  • Can change with time or due to external events
• Problem: what is “30% of the resources” when there are many different resources and diverse requirements?
Global Scheduling
• “Fair share” usually applied to a single resource
• But what if this resource is not a bottleneck?
• Global scheduling idea:
  • Identify the system bottleneck resource
  • Apply fair share scheduling on this resource
  • This induces appropriate allocations on other resources
• This paper: how to apply fair-share scheduling on any resource in the system
Previous Work I: Virtual Time
• Accounting is inversely proportional to allocation
• Schedule the client that is farthest behind
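A standard way to write this virtual-time rule (a generic textbook formulation, not the slides' own notation): if client i has relative allocation ri and has so far consumed ci units of resource time, then

```latex
% Generic virtual-time accounting (textbook formulation, not the authors' notation)
v_i = \frac{c_i}{r_i}
  \qquad \text{(accounting inversely proportional to the allocation)} \\
\text{dispatch } \arg\min_i v_i
  \qquad \text{(the client that is farthest behind)}
```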
Previous Work II: Traffic Shaping
• Leaky bucket
  • Variable requests
  • Constant-rate transmission
  • Bucket represents a buffer
• Token bucket
  • Variable requests
  • Constant allocations
  • Bucket represents stored capacity
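As a point of reference, a minimal token-bucket sketch in C, not taken from the paper: tokens (the stored capacity) accumulate at a constant rate up to the bucket size, and a variable-size request is admitted only if enough tokens are available. All names and the structure are illustrative.

```c
/* Minimal, generic token-bucket sketch (illustration only, not the authors' code). */
#include <stdbool.h>

struct token_bucket {
    double tokens;    /* currently stored capacity */
    double capacity;  /* bucket size: bound on stored capacity */
    double rate;      /* constant allocation, tokens per second */
};

/* Tokens accumulate at a constant rate, capped at the bucket capacity. */
static void tb_refill(struct token_bucket *tb, double elapsed_sec)
{
    tb->tokens += tb->rate * elapsed_sec;
    if (tb->tokens > tb->capacity)
        tb->tokens = tb->capacity;
}

/* A variable-size request is admitted only if enough tokens are stored. */
static bool tb_admit(struct token_bucket *tb, double request_size)
{
    if (tb->tokens < request_size)
        return false;
    tb->tokens -= request_size;
    return true;
}
```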
Putting them Together: RSVT
• “Resource sharing”: all clients make progress continuously
  • Generalization of processor sharing
• Each job has its ideal resource-sharing progress
  • This is considered to be the allocation ai
  • Grows at a constant rate
• Each job has its actual consumption ci
  • Grows only when the job runs
• Scheduling priority is the difference: pi = ai – ci
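A compact sketch of this priority rule, assuming clients are simply compared by pi = ai - ci; the struct and function names are illustrative, not the kernel module's identifiers:

```c
/* Illustrative RSVT client state and dispatch rule (hypothetical names). */
struct rsvt_client {
    double alloc;    /* a_i: ideal resource-sharing progress, grows at a constant rate */
    double consumed; /* c_i: actual consumption, grows only while the client runs */
};

static double rsvt_priority(const struct rsvt_client *c)
{
    return c->alloc - c->consumed;   /* p_i = a_i - c_i */
}

/* Pick the most deserving client among the n active ones. */
static int rsvt_pick(const struct rsvt_client *clients, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (rsvt_priority(&clients[i]) > rsvt_priority(&clients[best]))
            best = i;
    return best;
}
```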
Example
• Three clients
• Allocations roughly 50%, 30%, 20%
• Consumption always occurs in resource time
[Figure: consumed resource time (y-axis) vs. wallclock time (x-axis) for the three clients]
Bookkeeping
• The set of active jobs is A
• The relative allocation of job i is ri
• During an interval T in which job k has run:
  • Update allocations: ai ← ai + T · ri / Σj∈A rj  (for every active job i)
  • Update consumptions: ck ← ck + T
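A sketch of this per-interval bookkeeping, reusing the illustrative struct rsvt_client from the previous sketch; normalizing by the sum of relative allocations of active clients is an assumption that follows from the next slide's statement that allocations are relative to the active set:

```c
/* Per-interval bookkeeping sketch (illustrative, not the module's actual code).
 * Over an interval of length T during which client `ran` used the resource:
 *   a_i += T * r_i / (sum of r_j over active clients)   for every active i
 *   c_ran += T                                           for the running client only */
static void rsvt_account(struct rsvt_client *clients, const double *rel_alloc,
                         const int *active, int n_active, int ran, double T)
{
    double total = 0.0;
    for (int j = 0; j < n_active; j++)
        total += rel_alloc[active[j]];

    for (int j = 0; j < n_active; j++) {
        int i = active[j];
        clients[i].alloc += T * rel_alloc[i] / total;  /* ideal progress */
    }
    clients[ran].consumed += T;                        /* actual progress */
}
```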
The Active Set
• Active jobs (the set A) are those that can use the resource now
• Allocations are relative to the active set
• The active set may change:
  • New job arrives
  • Job terminates
  • Job stops using resource temporarily
  • Job resumes use of resource
Grace Period
• Intermittent activity: process data / send packet
  • Should retain allocations even when inactive
• Thus ai continues to grow during a grace period after the job becomes inactive
• The grace period reflects a notion of continuity
  • Sub-second time scale
Rebirth
• Resumption after very long inactive periods should be treated as a new arrival
• Due to the grace period, a job that becomes inactive accrues extra allocation
• Forget this extra allocation after a rebirth period (set ai = ci)
• Two orders of magnitude larger than the grace period
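A possible way to apply the grace period and rebirth when a client is found inactive at accounting time; the constants are placeholders (the slides only say the grace period is sub-second and the rebirth period about two orders of magnitude longer):

```c
/* Grace-period / rebirth sketch (illustrative constants and names).
 * Within the grace period the client keeps accruing allocation; once the
 * rebirth period has passed, the accrued surplus is forgotten (a_i = c_i),
 * so resumption is treated like a new arrival. */
#define GRACE_USEC    (100 * 1000)        /* sub-second grace period (assumed value) */
#define REBIRTH_USEC  (100 * GRACE_USEC)  /* ~two orders of magnitude larger */

static void rsvt_handle_inactivity(struct rsvt_client *c, long inactive_usec,
                                   int *still_accruing)
{
    if (inactive_usec < GRACE_USEC) {
        *still_accruing = 1;          /* within grace: keep accruing allocation */
        return;
    }
    *still_accruing = 0;              /* grace expired: mark inactive, stop accruing */
    if (inactive_usec >= REBIRTH_USEC)
        c->alloc = c->consumed;       /* rebirth: forget surplus allocation */
}
```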
Implementation
• Kernel module with generic functionality:
  • Create / destroy module
  • Create / destroy client
  • Make request / set active / set inactive
  • Make allocations
  • Dispatch
  • Check-in (note resource usage)
• Glue code for specific subsystems
  • Currently networking and CPU
  • Plan to add disk I/O
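The operations listed above suggest a generic interface along the following lines; these prototypes are hypothetical, invented here for illustration, and are not the module's actual exported symbols:

```c
/* Hypothetical sketch of the generic RSVT interface listed on the slide.
 * Glue code for each subsystem (networking, CPU) would call into it. */
struct rsvt_domain;   /* one instance per managed resource */
struct rsvt_client;   /* one per competing entity (VM, flow, task) */

struct rsvt_domain *rsvt_create(void);                    /* create module instance */
void rsvt_destroy(struct rsvt_domain *dom);               /* destroy module instance */

struct rsvt_client *rsvt_client_create(struct rsvt_domain *dom, unsigned int rel_alloc);
void rsvt_client_destroy(struct rsvt_client *cl);

void rsvt_request(struct rsvt_client *cl);                /* client has pending work */
void rsvt_set_active(struct rsvt_client *cl);
void rsvt_set_inactive(struct rsvt_client *cl);

void rsvt_allocate(struct rsvt_domain *dom, unsigned long elapsed_usec);  /* make allocations */
struct rsvt_client *rsvt_dispatch(struct rsvt_domain *dom);               /* most deserving client */
void rsvt_checkin(struct rsvt_client *cl, unsigned long used_usec);       /* note resource usage */
```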
Networking Glue Code
• Use the Linux QoS framework: create an RSVT queueing discipline
[Diagram: App → TCP → IP → RSVT queueing discipline (QoS layer) → NIC]
Networking Glue Code
• Non-RSVT traffic (e.g. NFS traffic) has priority: it is sent immediately and counted as dead time
• RSVT traffic is enqueued, and the dispatcher selects which packet to send
[Diagram: App → TCP → IP → RSVT check: non-RSVT packets sent immediately, RSVT packets enqueued then selected and sent → NIC]
CPU Scheduling Glue Code
• Use the Linux modular scheduling core
• Add an RSVT scheduling policy
  • The RSVT module essentially replaces the policy runqueue
  • Initial implementation only for uniprocessors
• CFS and possibly other policies also exist and have higher priority
  • When they run, this is considered dead time
Timer Interrupts
• Linux employs timer interrupts (250 Hz)
• Allocations are done at these times:
  • Translate time into microseconds
  • Subtract known dead time (unavailable to us)
  • Divide among active clients according to relative allocations
  • Bound divergence of allocation from consumption
• Also handling of grace period (mark as inactive)
• Also handling of rebirth (set ai = ci)
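One way to picture the per-tick allocation step, assuming a 250 Hz tick (4000 µs) and an arbitrary divergence bound; names and constants are illustrative and the real module's internals may differ:

```c
/* Per-tick allocation sketch (illustrative; actual module internals may differ).
 * At each timer interrupt, known dead time is subtracted and the remainder is
 * divided among active clients by relative allocation; the gap a_i - c_i is
 * clamped to bound the divergence of allocation from consumption. */
#define TICK_USEC        4000L     /* 250 Hz timer */
#define MAX_GAP_USEC   200000L     /* assumed bound on a_i - c_i divergence */

static void rsvt_tick(struct rsvt_client **active, const double *rel_alloc,
                      int n_active, long dead_usec)
{
    long avail = TICK_USEC - dead_usec;    /* time actually available to RSVT clients */
    double total = 0.0;

    if (avail <= 0 || n_active == 0)
        return;

    for (int j = 0; j < n_active; j++)
        total += rel_alloc[j];

    for (int j = 0; j < n_active; j++) {
        struct rsvt_client *c = active[j];
        c->alloc += avail * rel_alloc[j] / total;     /* divide by relative allocation */
        if (c->alloc - c->consumed > MAX_GAP_USEC)    /* bound divergence */
            c->alloc = c->consumed + MAX_GAP_USEC;
    }
}
```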
Multi-Queue
• At dispatch, need to find the client with the highest priority
• But priorities change at different rates
• Solution: allow only a limited discrete set of relative priorities
  • Each priority has a separate queue
  • Maintain all clients in each queue in priority order
  • Only need to check the first in each queue to find the maximum
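A sketch of the multi-queue dispatch, reusing rsvt_priority and struct rsvt_client from the earlier sketch; the number of discrete levels and the queue representation are illustrative:

```c
/* Multi-queue dispatch sketch (illustrative). Each discrete allocation level has
 * its own queue, kept sorted by p_i = a_i - c_i. Since all clients at one level
 * gain allocation at the same rate, their relative order is stable, so the
 * overall maximum must sit at the head of one of the queues. */
#define NUM_LEVELS 8   /* assumed number of discrete allocation levels */

struct rsvt_queue {
    struct rsvt_client *head;   /* highest-priority client at this level, or NULL */
};

static struct rsvt_client *rsvt_dispatch_mq(struct rsvt_queue qs[NUM_LEVELS])
{
    struct rsvt_client *best = NULL;

    for (int l = 0; l < NUM_LEVELS; l++) {
        struct rsvt_client *c = qs[l].head;
        if (c && (!best || rsvt_priority(c) > rsvt_priority(best)))
            best = c;           /* only queue heads need to be compared */
    }
    return best;
}
```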
Experiment – Throttling
• Two competing MPlayers
• The one with the higher allocation does not need all of it
• Allocation tracks consumption
Conclusions
• Demonstrated generic virtual-time based resource sharing dispatcher
• Need to complete implementation:
  • Support for I/O scheduling
  • More details, e.g. SMP support
• Building block of global scheduling vision