Disco: Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine, Mendel Rosenblum, Stanford University, 1997 Presented by Divya Parekh
Outline • Virtualization • Disco description • Disco performance • Discussion
Virtualization • “a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications, or end users interact with those resources. This includes making a single physical resource appear to function as multiple logical resources; or it can include making multiple physical resources appear as a single logical resource”
Old idea from the 1960s • IBM VM/370 – a VMM for IBM mainframes • Multiple OS environments on expensive hardware • Desirable when machines were few and expensive • Popular research idea in the 1960s and 1970s • Entire conferences on virtual machine monitors • Hardware/VMM/OS designed together • Interest died out in the 1980s and 1990s • Hardware got cheaper • Operating systems got more powerful (e.g. multi-user)
A Return to Virtual Machines • Disco: Stanford research project (SOSP ’97) • Run commodity OSes on scalable multiprocessors • Focus on high-end: NUMA, MIPS, IRIX • Commercial virtual machines for the x86 architecture • VMware Workstation (now EMC) (1999–) • Connectix VirtualPC (now Microsoft) • Research virtual machines for the x86 architecture • Xen (SOSP ’03) • plex86 • OS-level virtualization • FreeBSD Jails, User-mode Linux, UMLinux
Overview • Virtual Machine • “A fully protected and isolated copy of the underlying physical machine’s hardware” (definition by IBM) • Virtual Machine Monitor • A thin layer of software between the hardware and the operating system that virtualizes and manages all hardware resources • Also known as a “hypervisor”
Classification of Virtual Machines • Type I • VMM is implemented directly on the physical hardware. • VMM performs the scheduling and allocation of the system’s resources. • IBM VM/370, Disco, VMware’s ESX Server, Xen • Type II • VMMs are built completely on top of a host OS. • The host OS provides resource allocation and standard execution environment to each “guest OS.” • User-mode Linux (UML), UMLinux
Non-Virtualizable Architectures • According to Popek and Goldberg, “an architecture is virtualizable if the set of sensitive instructions is a subset of the set of privileged instructions” • x86 • Several instructions can read system state at CPL 3 (user mode) without trapping (see the sketch below) • MIPS • KSEG0 bypasses the TLB and reads physical memory directly
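A minimal user-mode probe, assuming an x86 machine where CR4.UMIP is not enabled, illustrating why classic x86 fails the Popek/Goldberg test: SMSW exposes privileged CR0 state to code running at CPL 3 without trapping, so a trap-and-emulate VMM never gets a chance to intervene.

```c
/* Hedged sketch: compiles with GCC/Clang on x86; on CPUs/OSes that enable
 * CR4.UMIP this instruction would fault instead of silently succeeding. */
#include <stdio.h>

int main(void) {
    unsigned long msw = 0;
    /* SMSW is "sensitive" (it reveals CR0 bits such as PE and TS) but not
     * privileged, so user-mode guest code can observe state the VMM meant
     * to virtualize -- no trap is generated. */
    __asm__ volatile("smsw %0" : "=r"(msw));
    printf("CR0 low bits observed from user mode: 0x%lx\n", msw);
    return 0;
}
```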
Type I (contd.) • Hardware support for virtualization • Figure: the hardware-assisted approach to x86 virtualization • E.g. Intel Vanderpool/VT and AMD-V/SVM
Type I (contd.) • Full virtualization • Figure: the binary translation approach to x86 virtualization • E.g. VMware ESX Server
Type I (contd.) • Paravirtualization • Figure: the paravirtualization approach to x86 virtualization • E.g. Xen
Type II • Hosted VM architecture • E.g. VMware Workstation, Connectix VirtualPC
Disco: VMM Prototype • Goals • Extend modern OSes to run efficiently on shared-memory multiprocessors without large changes to the OS • A VMM built to run multiple copies of the Silicon Graphics IRIX operating system on the Stanford FLASH shared-memory multiprocessor
Problem Description • Multiprocessors in the market (1990s) • Innovative hardware • Hardware evolving faster than system software • Customized OSes are late, incompatible, and possibly buggy • Commodity OSes not suited for multiprocessors • Do not scale because of lock contention and memory architecture • Do not isolate/contain faults • More processors mean more failures
Solution to the Problems • Resource-intensive modification of the OS (hard and time-consuming, increases OS size, etc.) • Instead, insert a virtual machine monitor (a software layer) between the OS and the hardware to address these problems
Two Opposite Ways to Build System Software • Address these challenges in the operating system: OS-intensive • Hive, Hurricane, Cellular IRIX, etc. • Innovative, single system image • But a large implementation effort • Hard-partition the machine into independent failure units: OS-light • Sun Enterprise 10000 machine • Partial single system image • Cannot dynamically adapt the partitioning
Return to Virtual Machine Monitors • A compromise between OS-intensive and OS-light: the VMM • Virtual machine monitors, in combination with commodity and specialized operating systems, form a flexible system software solution for these machines • Disco was introduced to allow a trade-off between performance overhead and development cost
Advantages of this approach • Scalability • Flexibility • Hides NUMA effects • Fault containment • Compatibility with legacy applications
Challenges Facing Virtual Machines • Overheads • Trapping and emulating privileged instructions of the guest OS • Access to I/O devices • Replication of memory in each VM • Resource management • Lack of information to make good policy decisions • Communication and sharing • Stand-alone VMs cannot easily communicate or share resources
Disco’s Interface • Processors • MIPS R10000 processor • Emulates all instructions, the MMU, and the trap architecture • Extensions to support common processor operations • Enabling/disabling interrupts, accessing privileged registers • Physical memory • Contiguous, starting at address 0 • I/O devices • Virtualizes devices such as disks and network interfaces, exclusive to each VM • Physical devices multiplexed by Disco • Special abstractions for SCSI disks and network interfaces • Virtual disks for VMs • Virtual subnet across all virtual machines
Disco Implementation • Multi-threaded shared-memory program • Attention to NUMA memory placement, cache-aware data structures, and interprocessor communication patterns • The Disco code segment is replicated to each FLASH node for locality • Communication through shared memory
Virtual CPUs • Direct execution • Executes the virtual CPU on the real CPU • Sets the real machine’s registers to the virtual CPU’s • Jumps to the current PC of the virtual CPU and runs directly on the real CPU • Challenges • Detection and fast emulation of operations that cannot be safely exported to the virtual machine: privileged instructions such as TLB modification, and direct access to physical memory and I/O devices • Maintains a data structure for each virtual CPU for trap emulation (see the sketch below) • The scheduler multiplexes virtual CPUs on real processors
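A minimal sketch of the trap-and-emulate idea, with hypothetical structure and field names (not Disco's actual code): the monitor keeps shadow copies of each virtual CPU's privileged state and applies the effect of a trapped privileged instruction to that shadow state, then resumes the guest. The bit-field macros follow the standard MIPS MFC0/MTC0 encoding.

```c
#include <stdint.h>

/* Shadow state kept by the monitor for each virtual CPU (fields assumed). */
struct vcpu {
    uint64_t regs[32];       /* guest general-purpose registers          */
    uint64_t pc;             /* guest program counter                    */
    uint64_t cp0_status;     /* shadow copy of the privileged Status reg */
    int      interrupts_enabled;
};

/* Standard MIPS instruction fields for coprocessor-0 moves. */
#define OPCODE(i)  (((i) >> 26) & 0x3f)
#define RS(i)      (((i) >> 21) & 0x1f)
#define RT(i)      (((i) >> 16) & 0x1f)
#define RD(i)      (((i) >> 11) & 0x1f)
#define OP_COP0    0x10
#define COP0_MFC0  0x00      /* move from CP0 */
#define COP0_MTC0  0x04      /* move to CP0   */
#define CP0_STATUS 12

/* Called when direct execution traps on a privileged instruction: emulate its
 * effect on the vcpu's shadow registers instead of the real hardware. */
void emulate_privileged(struct vcpu *v, uint32_t insn)
{
    if (OPCODE(insn) == OP_COP0 && RD(insn) == CP0_STATUS) {
        if (RS(insn) == COP0_MFC0) {            /* guest reads Status  */
            v->regs[RT(insn)] = v->cp0_status;
        } else if (RS(insn) == COP0_MTC0) {     /* guest writes Status */
            v->cp0_status = v->regs[RT(insn)];
            v->interrupts_enabled = (int)(v->cp0_status & 0x1);
        }
    }
    v->pc += 4;   /* skip the emulated instruction and resume the guest */
}
```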
Virtual Physical Memory • Adds a level of address translation and maintains physical-to-machine address (40-bit) mappings • Virtual machines use physical addresses • Uses the software-reloaded translation lookaside buffer (TLB) of the MIPS processor • Maintains a pmap data structure for each VM, with one entry per physical page giving its physical-to-machine translation • Each pmap entry also has a back pointer to its virtual address to help invalidate mappings in the TLB
Contd. • Kernel-mode references on MIPS processors access memory and I/O directly (bypassing the TLB), so the OS code and data must be relinked into a mapped address space • MIPS tags each TLB entry with an address space identifier (ASID) • ASIDs are not virtualized, so the TLB must be flushed on VM context switches • Increased TLB misses in workloads • Additional operating system references • VM context switches • TLB misses become more expensive, so Disco adds a second-level software TLB, which acts much like a cache of translations (see the sketch below)
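A minimal sketch of the miss-handling path under assumed names and a direct-mapped layout (the paper does not spell out these details): on a TLB miss the monitor first probes a large second-level software TLB, and only on a miss there does it walk the per-VM pmap to turn a guest-physical frame into a machine frame.

```c
#include <stdint.h>
#include <stddef.h>

#define L2_TLB_SIZE 4096        /* much larger than the small hardware TLB */

struct tlb_entry { uint64_t vpn, mfn; int valid; };

struct vm {
    uint64_t *pmap;                       /* guest-physical page -> machine page */
    size_t    pmap_len;
    struct tlb_entry l2_tlb[L2_TLB_SIZE]; /* second-level software TLB           */
};

/* Returns the machine frame for a guest-physical frame, or (uint64_t)-1. */
static uint64_t pmap_lookup(struct vm *vm, uint64_t gpfn)
{
    return (gpfn < vm->pmap_len) ? vm->pmap[gpfn] : (uint64_t)-1;
}

/* Handle a hardware TLB miss for guest virtual page 'vpn' that the guest's
 * own page tables map to guest-physical frame 'gpfn'. */
uint64_t handle_tlb_miss(struct vm *vm, uint64_t vpn, uint64_t gpfn)
{
    struct tlb_entry *e = &vm->l2_tlb[vpn % L2_TLB_SIZE];
    if (e->valid && e->vpn == vpn)
        return e->mfn;                    /* fast path: hit in the software TLB */

    uint64_t mfn = pmap_lookup(vm, gpfn); /* slow path: walk the pmap           */
    e->vpn = vpn; e->mfn = mfn; e->valid = 1;
    /* the caller writes (vpn -> mfn) into the hardware TLB, tagged with this
     * virtual CPU's ASID, before resuming the guest */
    return mfn;
}
```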
NUMA Memory Management • Cache misses should be satisfied from local memory (fast) rather than remote memory (slow) • Dynamic page migration and replication • Pages frequently accessed by one node are migrated there • Read-shared pages are replicated among all nodes • Write-shared pages are not moved, since maintaining consistency requires remote access anyway • The migration and replication policy is driven by the cache-miss-counting facility provided by the FLASH hardware (see the sketch below)
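A minimal sketch of a miss-counter-driven placement policy in the spirit of the slide above; the thresholds, counters, and decision rules are assumptions for illustration, not the values used by Disco or FLASH.

```c
#include <stdint.h>

#define HOT_THRESHOLD 64        /* assumed: misses before we consider moving a page */
#define MAX_NODES     32

struct page_stats {
    uint32_t miss_from_node[MAX_NODES]; /* remote-miss counters, one per node   */
    int      writers;                   /* nonzero if any node writes the page  */
    int      home_node;
};

enum placement { KEEP, MIGRATE, REPLICATE };

enum placement decide_placement(const struct page_stats *p, int *target_node)
{
    uint32_t total = 0, best = 0;
    int hot_node = p->home_node;

    for (int n = 0; n < MAX_NODES; n++) {
        total += p->miss_from_node[n];
        if (p->miss_from_node[n] > best) { best = p->miss_from_node[n]; hot_node = n; }
    }
    if (total < HOT_THRESHOLD)
        return KEEP;                 /* not hot enough to be worth moving        */
    if (p->writers)
        return KEEP;                 /* write-shared: consistency needs remote
                                        access anyway, so moving does not help   */
    if (best > total / 2) {          /* mostly one remote node: migrate the page */
        *target_node = hot_node;
        return MIGRATE;
    }
    return REPLICATE;                /* read-shared by many nodes: replicate it  */
}
```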
Transparent Page Replication • 1. Two different virtual processors of the same virtual machine logically read-share the same physical page, but each virtual processor accesses a local copy • 2. memmap tracks which virtual pages reference each physical page; used during TLB shootdown (see the sketch below)
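A minimal sketch of the memmap idea under assumed structure and helper names: for each machine page, record every (VM, virtual CPU, virtual page) that maps it, so a shootdown can find and purge all stale translations before a page is migrated, replicated, or reclaimed.

```c
#include <stdint.h>
#include <stddef.h>

struct mapping {
    int vm_id;
    int vcpu_id;
    uint64_t vpn;             /* virtual page in that VM that maps this page */
    struct mapping *next;
};

struct memmap_entry {
    struct mapping *mappings; /* all references to this machine page */
};

extern struct memmap_entry memmap[];                                /* one per machine page */
extern void purge_tlb_entry(int vm_id, int vcpu_id, uint64_t vpn);  /* assumed helper       */

/* Invalidate every mapping of machine page 'mfn' (hardware TLB entries and
 * second-level software TLB entries) before the page is moved or reclaimed. */
void tlb_shootdown(uint64_t mfn)
{
    for (struct mapping *m = memmap[mfn].mappings; m != NULL; m = m->next)
        purge_tlb_entry(m->vm_id, m->vcpu_id, m->vpn);
}
```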
Virtual I/O Devices • Disco intercepts all device accesses from the virtual machine and forwards them to the physical devices • Special device drivers are added to the guest OS • Disco devices provide a monitor-call interface that passes all the arguments of a request in a single trap (see the sketch below) • A single VM with exclusive access to a device does not require virtualizing the I/O, only ensuring exclusivity
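A minimal sketch of the single-trap monitor-call idea from the slide above; the call number, structure layout, and disco_monitor_call stub are hypothetical names, not Disco's real interface. The point is that the guest's special driver hands the monitor a whole request at once instead of trapping on every emulated device-register access.

```c
#include <stdint.h>

/* One complete I/O request, packed by the guest's special driver. */
struct disco_io_req {
    int      device;        /* which virtual device (virtual disk, NIC, ...) */
    int      write;         /* 0 = read, 1 = write                           */
    uint64_t disk_offset;   /* byte offset on the virtual disk               */
    uint64_t guest_paddr;   /* guest-physical buffer address (DMA target)    */
    uint32_t length;        /* transfer length in bytes                      */
};

/* Hypothetical monitor-call stub: a single trap into the monitor carrying a
 * request number and a pointer to the argument block. */
extern long disco_monitor_call(int callno, void *arg);
#define DISCO_CALL_IO 3     /* assumed call number, for illustration only */

long virtual_disk_rw(struct disco_io_req *req)
{
    return disco_monitor_call(DISCO_CALL_IO, req);
}
```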
Copy-on-write Disks • Intercept DMA requests to translate physical addresses into machine addresses • If the requested data is already in machine memory, map the machine page read-only into the destination page of the DMA, sharing machine memory • Attempts to modify a shared page result in a copy-on-write fault handled internally by the monitor (see the sketch below) • Modifications are logged for each VM and kept in main memory • Non-persistent disks are copy-on-write shared • E.g. kernel text and the buffer cache • E.g. file system root disks
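A minimal sketch of the copy-on-write fault path under assumed helper names: the shared, read-only machine page stays intact for the other VMs, while the writing VM receives a private copy.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* assumed helpers provided elsewhere in the monitor */
extern void *machine_page_addr(uint64_t mfn);
extern uint64_t alloc_machine_page(void);
extern void map_page(int vm_id, uint64_t gpfn, uint64_t mfn, int writable);

/* Write fault on a page the VM shares copy-on-write with other VMs. */
void cow_fault(int vm_id, uint64_t gpfn, uint64_t shared_mfn)
{
    uint64_t copy = alloc_machine_page();
    memcpy(machine_page_addr(copy), machine_page_addr(shared_mfn), PAGE_SIZE);
    /* the faulting VM now gets a private, writable mapping; the other VMs keep
     * sharing the original read-only page */
    map_page(vm_id, gpfn, copy, /*writable=*/1);
}
```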
Transparent Sharing of Pages • Creates a global buffer cache shared across VMs and reduces the memory footprint of the system
Virtual Network Interface • The virtual subnet and network interface use copy-on-write mappings to share read-only pages • Persistent disks can be accessed using a standard distributed file system protocol such as NFS • Provides a global buffer cache that is transparently shared by independent VMs
Transparent Sharing of Pages over NFS • 1. The monitor’s networking device remaps the data page from the source’s machine address space to the destination’s • 2. The monitor remaps the data page from the driver’s mbuf to the client’s buffer cache • The remap idea is sketched below
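A minimal sketch of the zero-copy remap under assumed helper names: instead of copying the payload page between the two virtual machines, the monitor points both VMs' physical-to-machine mappings at the same machine page, read-only, so a later write by either side falls back to the copy-on-write fault path.

```c
#include <stdint.h>

/* assumed helpers, shared with the copy-on-write sketch above */
extern void map_page(int vm_id, uint64_t gpfn, uint64_t mfn, int writable);
extern uint64_t lookup_machine_page(int vm_id, uint64_t gpfn);

/* "Transmit" a page over the virtual subnet from one VM to another. */
void vnet_transfer_page(int src_vm, uint64_t src_gpfn,
                        int dst_vm, uint64_t dst_gpfn)
{
    uint64_t mfn = lookup_machine_page(src_vm, src_gpfn);
    /* both VMs now read-share the same machine page; no data is copied, and a
     * write by either side triggers a copy-on-write fault in the monitor */
    map_page(src_vm, src_gpfn, mfn, /*writable=*/0);
    map_page(dst_vm, dst_gpfn, mfn, /*writable=*/0);
}
```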
Modifications to the IRIX 5.3 OS • Minor changes to the kernel code and data segment, specific to MIPS • Relocated the unmapped segment of the virtual machine into the mapped supervisor segment of the processor (kernel relocation) • Disco device drivers are the same as the original IRIX device drivers • Patched the HAL to use memory loads/stores instead of privileged instructions
Modifications to the IRIX 5.3 OS (contd.) • Added code to the HAL to pass hints to the monitor for resource management (see the sketch below) • New monitor calls to request zeroed pages and to reclaim unused memory • Changed mbuf management to be page-aligned • Changed bcopy to use remap (with copy-on-write)
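A minimal sketch of such resource-management hints, reusing the hypothetical disco_monitor_call stub from the earlier I/O sketch; the call numbers and HAL function names are assumptions. Telling the monitor when pages are freed or need to be zero-filled lets it reclaim machine memory and avoid zeroing pages twice.

```c
#include <stdint.h>

/* Hypothetical single-trap entry into the monitor (same stub as the I/O sketch). */
extern long disco_monitor_call(int callno, void *arg);

#define DISCO_HINT_PAGE_FREE   7   /* assumed call numbers, illustration only */
#define DISCO_REQ_ZEROED_PAGE  8

/* Hint: the guest kernel has freed this physical page; the monitor may reclaim
 * the backing machine page or hand it to another VM. */
void hal_page_freed(uint64_t gpfn)
{
    disco_monitor_call(DISCO_HINT_PAGE_FREE, &gpfn);
}

/* Request a page the monitor has already zero-filled, so the guest kernel does
 * not zero it a second time. */
uint64_t hal_get_zeroed_page(void)
{
    uint64_t gpfn = 0;
    disco_monitor_call(DISCO_REQ_ZEROED_PAGE, &gpfn);
    return gpfn;
}
```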
SPLASHOS: A Specialized OS • A thin, specialized library OS supported directly by Disco • No need for a virtual memory subsystem, since the application shares the address space with the library OS • Used for parallel scientific applications that can span the entire machine
Disco: Performance • Experimental setup • Disco targets the FLASH machine, which was not yet available at the time • Used SimOS, a machine simulator that models the hardware of MIPS-based multiprocessors, to run the Disco monitor • The simulator was too slow to allow long workloads to be studied
Disco: Performance • Workloads
Disco: Performance • Execution overhead • Pmake overhead is due to I/O virtualization; the others are due to TLB mapping • Reduction of kernel time • On average, a virtualization overhead of 3% to 16%
Disco: Performance • Memory Overheads • V: Pmake memory used if there is no sharing • M: Pmake memory used if there is sharing
Disco: Performance • Scalability • Partitioning the problem into different VMs increases scalability • Kernel synchronization time becomes smaller
Disco: Performance • Dynamic Page Migration and replication
Conclusion • Disco VMM hides NUMA-ness from non-NUMA aware OS • Disco VMM is low(er) effort • Moderate overhead due to virtualization
Discussion • Was the Disco VMM done right? • Virtual physical memory on architectures other than MIPS • The MIPS TLB is software-managed • Not sure how well other OSes would perform on Disco, since IRIX was designed for MIPS • Not sure how Hive or Hurricane perform comparatively • Performance of long workloads on the system • Performance of heterogeneous VMs, e.g. the Pmake case
Discussion • Are virtual machine monitors microkernels done right?