Cellular Disco: resource management using virtual clusters on shared memory multiprocessors. Published in ACM, 1999, by K. Govil, D. Teodosiu, Y. Huang, and M. Rosenblum. Presenter: Soumya Eachempati
Motivation • Large-scale shared-memory multiprocessors • Large number of CPUs (32-128) • NUMA architectures • Off-the-shelf OSs are not scalable • Cannot handle a large number of resources • Memory management not optimized for NUMA • No fault containment
Existing Solutions • Hardware partitioning • Provides fault containment • Rigid resource allocation • Low resource utilization • Cannot dynamically adapt to the workload • New operating system • Provides flexibility and efficient resource management • Requires considerable effort and time. Goal: exploit the hardware resources to the fullest with minimal effort, while improving flexibility and fault tolerance.
Solution: DISCO (VMM) • Virtual machine monitor • Addresses NUMA-awareness and scalability issues. Issues not dealt with by DISCO: • Hardware fault tolerance/containment • Resource management policies
Cellular DISCO • Approach: convert a multiprocessor machine into a virtual cluster • Advantages: • Inherits the benefits of DISCO • Can support legacy OSs transparently • Combines the strengths of hardware partitioning and a new OS • Provides fault containment • Fine-grained resource sharing • Less effort than developing a new OS
Cellular DISCO • Internally structured into semi-independent cells • Much less development effort compared to Hive • No performance loss even with fault containment. Warranted design decision: the code of Cellular DISCO is trusted to be correct.
Resource Management • Over-commits resources • Gives flexibility to adjust the fraction of resources assigned to each VM • Restrictions on resource allocation due to fault containment • Both CPU and memory load balancing, under three constraints: • Scalability • Fault containment • Avoiding contention • First-touch allocation, dynamic migration, and replication of hot memory pages
Hardware Virtualization • The VM's interface mimics the underlying hardware • Virtual machine resources (user-defined): VCPUs, physical memory, I/O devices • Physical vs. machine resources (machine resources are allocated dynamically, by VM priority): VCPUs are mapped to CPUs, physical pages to machine pages • The VMM intercepts privileged instructions • Three modes: user and supervisor (guest OS), kernel (VMM) • In supervisor mode all memory accesses are mapped • The VMM allocates machine memory to back the physical memory • pmap and memmap data structures (see the sketch below) • Second-level software TLB (L2TLB)
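A minimal C sketch of how a pmap and memmap as described above could fit together, with first-touch allocation from the local node; all names below, including alloc_machine_page_local_node, are illustrative assumptions rather than the actual Cellular Disco sources.

    /* pmap: per-VM physical page -> backing machine page.
       memmap: machine page -> owning VM and physical page (used for migration/reclaim). */
    #include <stdint.h>

    typedef uint64_t machine_pfn_t;   /* machine (real) page frame number    */
    typedef uint64_t phys_pfn_t;      /* per-VM "physical" page frame number */

    typedef struct { machine_pfn_t backing; int valid; } pmap_entry_t;
    typedef struct { int owner_vm; phys_pfn_t phys; } memmap_entry_t;

    /* hypothetical allocator that prefers the requesting node (NUMA first touch) */
    extern machine_pfn_t alloc_machine_page_local_node(void);

    /* Translate a VM physical page to the machine page backing it,
       allocating on first touch and recording the reverse mapping. */
    machine_pfn_t phys_to_machine(pmap_entry_t *pmap, memmap_entry_t *memmap,
                                  int vm, phys_pfn_t ppn)
    {
        if (!pmap[ppn].valid) {
            machine_pfn_t m = alloc_machine_page_local_node();
            pmap[ppn].backing = m;
            pmap[ppn].valid   = 1;
            memmap[m].owner_vm = vm;
            memmap[m].phys     = ppn;
        }
        return pmap[ppn].backing;
    }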
Hardware fault containment • The VMM adds software fault containment • Cell: the unit of fault containment • Inter-cell communication: inter-processor RPCs • Messages need no locking since they are serialized (see the sketch below) • Shared memory is still used for some data structures (pmap, memmap) • Low latency, exactly-once semantics • The trusted system software layer is what enables the use of shared memory across cells
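A hedged sketch of an inter-cell RPC stub with the properties listed above (serialized messages, failing fast instead of blocking on a dead cell); the transport primitives send_to_cell, wait_for_reply, and cell_is_live are assumptions made up for illustration.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define MAX_CELLS 8

    struct rpc_msg {
        uint32_t seq;                 /* sequence number for duplicate detection */
        uint32_t op;                  /* requested operation                     */
        char     payload[64];
    };

    static uint32_t next_seq[MAX_CELLS];   /* per-destination sequence counter */

    /* hypothetical transport primitives provided by the monitor */
    extern int send_to_cell(int cell, const struct rpc_msg *m);
    extern int wait_for_reply(int cell, uint32_t seq, struct rpc_msg *reply);
    extern int cell_is_live(int cell);

    /* Issue an RPC to another cell.  Messages to a given cell are serialized,
       so the remote handler needs no locking; a dead destination makes the
       call fail quickly rather than hang (fault containment). */
    int cell_rpc(int dst, uint32_t op, const void *arg, size_t len,
                 struct rpc_msg *reply)
    {
        struct rpc_msg m = { .seq = next_seq[dst]++, .op = op };
        if (len > sizeof m.payload)
            return -1;
        memcpy(m.payload, arg, len);

        if (!cell_is_live(dst) || send_to_cell(dst, &m) != 0)
            return -1;
        return wait_for_reply(dst, m.seq, reply);
    }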
Implementation 1: MIPS R10000 • 32-processor SGI Origin 2000 • Piggybacked on IRIX 6.4 (host OS) • Guest OS: IRIX 6.2 • Spawns Cellular DISCO (CD) as a multi-threaded kernel process • Additional overhead < 2% (time spent in the host IRIX) • No fault isolation: the IRIX kernel is monolithic • Solution: some host OS support is needed, with one copy of the host OS per cell
I/O Request execution • Figure: I/O request path, with Cellular Disco piggybacked on the IRIX kernel
Characteristics of workloads • Database: decision-support workload • Pmake: I/O-intensive workload • Raytrace: CPU-intensive workload • Web: kernel-intensive web-server workload
Fault-containment Overheads • Figure: the left bar is the single-cell configuration, the right bar the 8-cell system
CPU Management • Load-balancing mechanisms: • Three types of VCPU migration: intra-node, inter-node, inter-cell • Intra-node: loss of CPU cache affinity • Inter-node: cost of copying the L2TLB, plus a higher long-term cost • Inter-cell: loss of both cache and node affinity, and increased fault vulnerability • The penalty is alleviated by replicating pages • Load-balancing policies: an idle balancer (local load stealing) and a periodic balancer (global redistribution) • Each CPU has a local run queue of VCPUs • Gang scheduling: run all VCPUs of a VM simultaneously (see the sketch below)
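A minimal sketch of the gang-scheduling step, assuming a hypothetical rpc_schedule_vcpu primitive that asks a remote CPU to dispatch a sibling VCPU; it illustrates the idea only, not the actual Cellular Disco scheduler.

    #define MAX_VCPUS_PER_VM 64

    struct vcpu {
        int vm_id;
        int assigned_cpu;   /* CPU whose run queue currently holds this VCPU */
        int runnable;
    };

    struct vm {
        int          nvcpus;
        struct vcpu *vcpus[MAX_VCPUS_PER_VM];
    };

    /* hypothetical primitive: RPC a remote CPU, asking it to run a VCPU now */
    extern void rpc_schedule_vcpu(int cpu, struct vcpu *v);

    /* Dispatch v on this CPU and nudge the CPUs holding its runnable
       siblings, so all VCPUs of the VM run at the same time. */
    void gang_schedule(struct vm *owner, struct vcpu *v, int this_cpu)
    {
        for (int i = 0; i < owner->nvcpus; i++) {
            struct vcpu *sib = owner->vcpus[i];
            if (sib == v || !sib->runnable)
                continue;
            if (sib->assigned_cpu != this_cpu)
                rpc_schedule_vcpu(sib->assigned_cpu, sib);
        }
        /* ... then context-switch to v on this_cpu ... */
    }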
Load Balancing • Low-contention distributed data structure: the load tree (see the sketch below) • Higher-level tree nodes are the potential contention points • Each VCPU carries the list of cells it is vulnerable to (depends on) • Under heavy load the idle balancer alone is not enough • A local periodic balancer redistributes load within each 8-CPU region
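A sketch of a load tree over the CPUs: leaves hold per-CPU run-queue lengths, inner nodes cache subtree sums, and an idle CPU walks down toward the heaviest subtree to find a victim. This is an illustrative single-threaded version; a real implementation would need atomic updates and would touch the contended upper levels less often.

    #define NCPU 32

    /* Implicit binary tree in an array: node i has children 2*i+1 and 2*i+2;
       the last NCPU entries are the leaves, one per CPU. */
    static int load_tree[2 * NCPU - 1];

    static int leaf_index(int cpu) { return (NCPU - 1) + cpu; }

    /* A CPU updates its own leaf and propagates the change up to the root. */
    void update_load(int cpu, int runq_len)
    {
        int i = leaf_index(cpu);
        int delta = runq_len - load_tree[i];
        for (;;) {
            load_tree[i] += delta;
            if (i == 0)
                break;
            i = (i - 1) / 2;
        }
    }

    /* An idle CPU (the idle balancer) walks toward the most loaded subtree
       to pick a CPU to steal a VCPU from. */
    int find_busiest_cpu(void)
    {
        int i = 0;
        while (i < NCPU - 1) {          /* while i is an internal node */
            int l = 2 * i + 1, r = 2 * i + 2;
            i = (load_tree[l] >= load_tree[r]) ? l : r;
        }
        return i - (NCPU - 1);
    }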
CPU Scheduling and Results • Scheduling picks the highest-priority gang-runnable VCPU that has been waiting, then sends out RPCs • Three configurations on 32 processors: • One VM with 8 VCPUs running an 8-process raytrace • 4 VMs • 8 VMs (64 VCPUs total) • The pmap is migrated only when all VCPUs have been migrated out of a cell • Data pages are also migrated for independence
Memory Management • Each cell has its own freelist of pages, indexed by home node • Page allocation requests (see the sketch below): • Satisfied from the local node • Else satisfied from within the same cell • Else borrowed from another cell • Memory balancing • Low-memory thresholds for borrowing and lending • Each VM has a priority list of lender cells
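A hedged C sketch of the allocation fallback order above (local node, then same cell, then a lender cell from the VM's priority list); the freelist_pop, nodes_in_cell, and lender_cells helpers are assumptions made up for this example.

    #include <stdint.h>

    typedef uint64_t machine_pfn_t;
    #define PFN_INVALID ((machine_pfn_t)-1)

    /* hypothetical helpers over the per-cell freelists indexed by home node */
    extern machine_pfn_t freelist_pop(int cell, int node);
    extern int nodes_in_cell(int cell);
    extern int lender_cells(int vm, int *out, int max);  /* VM's priority list */

    machine_pfn_t alloc_page(int vm, int my_cell, int my_node)
    {
        machine_pfn_t m;

        /* 1. local node first, for NUMA locality */
        if ((m = freelist_pop(my_cell, my_node)) != PFN_INVALID)
            return m;

        /* 2. any other node within the same cell */
        for (int n = 0; n < nodes_in_cell(my_cell); n++)
            if (n != my_node && (m = freelist_pop(my_cell, n)) != PFN_INVALID)
                return m;

        /* 3. borrow from a lender cell (widens the VM's fault exposure) */
        int lenders[8];
        int k = lender_cells(vm, lenders, 8);
        for (int i = 0; i < k; i++)
            for (int n = 0; n < nodes_in_cell(lenders[i]); n++)
                if ((m = freelist_pop(lenders[i], n)) != PFN_INVALID)
                    return m;

        return PFN_INVALID;   /* nothing available: fall back to paging */
    }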
Memory Paging • Page replacement • Second-chance FIFO (see the sketch below) • Avoids double-paging overheads • Tracking used pages • Uses annotated OS routines • Page sharing • Explicit marking of shared pages • Redundant paging • Avoided by trapping every access to the virtual paging disk
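A minimal clock-style sketch of second-chance FIFO replacement as named above: pages are considered in arrival order, but a page whose reference bit is set gets one more pass before it can be evicted. Illustrative only; it assumes a non-empty ring of resident pages.

    #include <stdbool.h>

    #define NPAGES 1024

    struct page {
        bool referenced;   /* set when the page is accessed */
    };

    static struct page pages[NPAGES];
    static int ring[NPAGES];   /* resident page numbers in FIFO (arrival) order */
    static int hand, count;    /* clock hand and number of resident pages       */

    /* Pick a victim page to write out to the paging disk. */
    int pick_victim(void)
    {
        for (;;) {
            int pn = ring[hand];
            hand = (hand + 1) % count;
            if (pages[pn].referenced)
                pages[pn].referenced = false;   /* second chance: skip this pass */
            else
                return pn;                      /* unreferenced: evict it        */
        }
    }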
Implementation 2: FLASH Simulation • FLASH has hardware fault-recovery support • Simulation of the FLASH architecture on SimOS • A fault injector is used to inject: • Power failure • Link failure • Firmware failure (?) • Results: 100% fault containment
Fault Recovery • Hardware support is needed to: • Determine which resources are still operational • Reconfigure the machine to use the good resources • Cellular Disco recovery (see the sketch below): • Step 1: all cells agree on a liveset of nodes • Step 2: abort RPCs/messages to dead cells • Step 3: kill the VMs dependent on failed cells
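A sketch of the three recovery steps above, assuming hypothetical primitives for the liveset agreement and per-cell bookkeeping; it shows only the control flow, not the real recovery code.

    #include <stdbool.h>

    #define MAX_CELLS 8
    #define MAX_VMS   64

    /* hypothetical primitives */
    extern void agree_on_liveset(bool live[MAX_CELLS]);   /* step 1 */
    extern void abort_pending_rpcs_to(int cell);          /* step 2 */
    extern bool vm_depends_on_cell(int vm, int cell);
    extern void kill_vm(int vm);                          /* step 3 */

    void recover_from_fault(void)
    {
        bool live[MAX_CELLS];

        /* 1. surviving cells agree on which nodes are still operational */
        agree_on_liveset(live);

        for (int c = 0; c < MAX_CELLS; c++) {
            if (live[c])
                continue;

            /* 2. abort RPCs and messages queued for the dead cell */
            abort_pending_rpcs_to(c);

            /* 3. kill only the VMs that depend on resources of the failed cell */
            for (int vm = 0; vm < MAX_VMS; vm++)
                if (vm_depends_on_cell(vm, c))
                    kill_vm(vm);
        }
    }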
Fault-recovery Times • Recovery times are higher for larger memories • Recovery requires scanning memory for fault detection
Summary • Virtual machine monitor • Flexible resource management • Legacy OS support • Cellular Disco • Cells provide fault containment • Creates a virtual cluster • Needs hardware support