Kit Cischke 09/09/08 CS 5090 Disco: Running Commodity Operating Systems on Scalable Multiprocessors
Overview • Background • What are we doing here? • A Return to Virtual Machine Monitors • What does Disco do? • Disco: A Return to VMMs • How does Disco do it? • Experimental Results • How well does Disco dance?
The Basic Problem • With the explosion of multiprocessor machines, especially of the NUMA variety, the problem of effectively using the machines becomes more immediate. • NUMA = Non-Uniform Memory Access – shows up a lot in clusters. • The authors point out that the problem applies to any major hardware innovation, not just multiprocessors.
Potential Solution • Solution: Rewrite the operating system to address fault-tolerance and scalability. • Flaws: • Rewriting will introduce bugs. • Bugs can disrupt the system or the applications. • Instabilities are usually less-tolerated on these kinds of systems because of their application space. • You may not have access to the OS.
Not So Good • Okay. So that wasn’t so good. What else do we have? • How about Virtual Machine Monitors? • A new twist on an old idea, which may work better now that we have faster processors.
Enter Disco Disco is a virtual machine monitor that presents the same fundamental machine abstraction to each of the OS’s running on top of it. These can be commodity OS’s (uniprocessor or multiprocessor) or specialty systems.
Disco VMM • Fundamentally, the hardware is a cluster, but Disco introduces some global policies to manage all of the resources, which makes for better usage of the hardware. • We’ll use commodity operating systems and write the VMM. Rather than millions of lines of code, we’ll write a few thousand. • What if an application’s resource needs exceed what a single commodity OS can handle?
Scalability • Very simple changes to the commodity OS (maybe on the driver level or kernel extension) can allow virtual machines to share resources. • E.g., a parallel database could have a cache in shared memory and multiple virtual processors running on virtual machines. • Support for specialized OS’s that need the power of multiple processors but not all of the features offered by a commodity OS.
Further Benefits • Multiple copies of an OS naturally address scalability and fault containment. • Need greater scaling? Add a VM. • Only the monitor and the system protocols (NFS, etc.) need to scale. • OS or application crashes? No problem. The rest of the system is isolated. • NUMA memory management issues are addressed. • Multiple versions of different OS’s provide legacy support and convenient upgrade paths.
Not All Sunshine & Roses • VMM Overhead • Additional exception processing, instruction execution and memory to virtualize hardware. • Privileged instructions aren’t directly executed on the hardware, so we need to fake it. I/O requests need to be intercepted and remapped. • Memory overhead is rough too. • Consider having 6 copies of Vista in memory simultaneously. • Resource Management • VMM can’t make intelligent decisions about code streams without info from OS.
One Last Disadvantage • Communication • Sometimes resources simply can’t be shared the way we want. • Most of these can be mitigated though. • For example, most operating systems have good NFS support. So use it. • But… We can make it even better! (Details forthcoming.)
Introducing Disco • VMM designed for the FLASH multiprocessor machine • FLASH is an academic machine designed at Stanford University • It is a collection of nodes, each containing a processor, memory, and I/O. The nodes use directory-based cache coherence, which makes the machine look like a CC-NUMA machine. • Disco has also been ported to a number of other machines.
Disco’s Interface • The virtual CPU of Disco is an abstraction of a MIPS R10000. • Not only emulates but extends the processor (e.g., reduces some kernel operations to simple load/store instructions). • Presents an abstraction of physical memory that is contiguous and starts at address 0 (zero). • I/O Devices • Disks, network interfaces, interrupts, clocks, etc. • Special interfaces for network and disks.
Disco’s Implementation • Implemented as a multi-threaded shared-memory program. • Careful attention paid to memory placement, cache-aware data structures and processor communication patterns. • Disco is only 13,000 lines of code. • Windows Server 2003 - ~50,000,000 • Red Hat 7.1 - ~ 30,000,000 • Mac OS X 10.4 - ~86,000,000
Disco’s Implementation • The execution of a virtual processor is mapped one-for-one to a real processor. • At each context switch, the state of a processor is made to be that of a VP. • On MIPS, Disco runs in kernel mode and puts the processor in appropriate modes for what’s being run • Supervisor mode for OS, user mode for apps • Simple scheduler allows VP’s to be time-shared across the physical processors.
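To make the one-for-one mapping concrete, here is a rough C sketch of the kind of per-virtual-CPU state a monitor like Disco has to save and restore at each context switch. The structure, field names, and the hw_gpr staging array are my own illustration, not Disco's actual data structures; the real save/restore would be hand-written MIPS assembly.

```c
/* Illustrative only: a guess at the shape of per-VCPU state, not Disco's code. */
#include <stdint.h>
#include <string.h>

#define NUM_GPRS 32

enum vcpu_mode { VCPU_USER, VCPU_SUPERVISOR };  /* privilege the guest believes it has */

struct vcpu {
    uint64_t gpr[NUM_GPRS];   /* guest general-purpose registers           */
    uint64_t pc;              /* guest program counter                     */
    uint64_t cp0_status;      /* virtualized MIPS CP0 privileged registers */
    uint64_t cp0_cause;
    enum vcpu_mode mode;
};

/* Stand-ins for the physical register file; the real thing is saved and
 * restored in assembly, not with memcpy. */
static uint64_t hw_gpr[NUM_GPRS];
static uint64_t hw_pc;

/* Save the outgoing virtual processor's state... */
void vcpu_save(struct vcpu *v)
{
    memcpy(v->gpr, hw_gpr, sizeof hw_gpr);
    v->pc = hw_pc;
}

/* ...and load the incoming one onto the real processor.  Disco itself
 * keeps running in kernel mode; the guest OS gets supervisor mode and
 * guest applications get user mode. */
void vcpu_restore(const struct vcpu *v)
{
    memcpy(hw_gpr, v->gpr, sizeof hw_gpr);
    hw_pc = v->pc;
}
```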
Disco’s Implementation • Virtual Physical Memory • This discussion goes on for 1.5 pages. To sum up: • The OS makes requests to physical addresses, and Disco translates them to machine addresses. • Disco uses the hardware TLB for this. • Switching a VP onto a new processor requires a hardware TLB flush, so Disco maintains a 2nd-level software TLB to offset the performance hit. • One wrinkle that threw them for a loop: the IRIX kernel normally lives in the MIPS unmapped KSEG0 segment, which bypasses the TLB entirely, so the kernel had to be relinked into mapped address space.
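A minimal sketch, assuming made-up names (pmap, l2tlb, install_mapping), of the physical-to-machine translation and the software second-level TLB described above. Real entries would also carry protection bits and address-space IDs.

```c
#include <stdint.h>

#define L2TLB_SIZE 2048                 /* size chosen arbitrarily for the sketch */

typedef struct {
    uint64_t guest_vpn;     /* guest virtual page number          */
    uint64_t machine_ppn;   /* final machine page it maps to      */
    int      valid;
} l2tlb_entry_t;

struct vcpu_mmu {
    l2tlb_entry_t l2tlb[L2TLB_SIZE];          /* software second-level TLB       */
    uint64_t (*pmap)(uint64_t guest_ppn);     /* guest "physical" -> machine page */
};

/* Called when the guest OS installs a (virtual -> "physical") TLB entry:
 * the monitor rewrites it into a (virtual -> machine) entry and also
 * remembers it in the software TLB. */
uint64_t install_mapping(struct vcpu_mmu *mmu, uint64_t guest_vpn, uint64_t guest_ppn)
{
    uint64_t machine_ppn = mmu->pmap(guest_ppn);
    l2tlb_entry_t *e = &mmu->l2tlb[guest_vpn % L2TLB_SIZE];
    e->guest_vpn   = guest_vpn;
    e->machine_ppn = machine_ppn;
    e->valid       = 1;
    return machine_ppn;            /* this value goes into the hardware TLB entry */
}

/* After the hardware TLB is flushed (e.g. a VP migrates to a new processor),
 * misses can often be satisfied from the software TLB without re-running
 * the guest's own refill code. */
int l2tlb_lookup(struct vcpu_mmu *mmu, uint64_t guest_vpn, uint64_t *machine_ppn)
{
    l2tlb_entry_t *e = &mmu->l2tlb[guest_vpn % L2TLB_SIZE];
    if (e->valid && e->guest_vpn == guest_vpn) {
        *machine_ppn = e->machine_ppn;
        return 1;                  /* hit: refill the hardware TLB cheaply      */
    }
    return 0;                      /* miss: fall back to the guest's handler    */
}
```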
NUMA Memory Management In an effort to mitigate the non-uniform effects of a NUMA machine, Disco does a bunch of stuff: Allocates memory with “affinity” to the processor that uses it whenever possible. Migrates or replicates hot pages between nodes, transparently to the virtual machines, to reduce long remote memory accesses.
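Here is a hedged sketch of what such a placement decision could look like. The threshold, counter layout, and the migrate_page / replicate_page helpers are invented for illustration; the real policy in the paper is driven by FLASH's hardware cache-miss counters.

```c
#include <stdint.h>

#define HOT_THRESHOLD 64   /* remote misses before we act (made-up value) */
#define MAX_NODES 8

struct page_stats {
    uint32_t remote_misses[MAX_NODES];  /* per-node miss counts for this page   */
    int      home_node;                 /* node whose memory currently holds it */
    int      written;                   /* has any VM written the page?         */
};

/* Hypothetical helpers; declared but not defined in this sketch. */
extern void migrate_page(uint64_t machine_ppn, int to_node);
extern void replicate_page(uint64_t machine_ppn, int to_node);

/* Called periodically for pages that the hardware counters flag as
 * suffering lots of remote misses from some node. */
void consider_placement(uint64_t machine_ppn, struct page_stats *st, int hot_node)
{
    if (hot_node == st->home_node ||
        st->remote_misses[hot_node] < HOT_THRESHOLD)
        return;                                 /* not worth the trouble           */

    if (st->written)
        migrate_page(machine_ppn, hot_node);    /* written page: move the one copy */
    else
        replicate_page(machine_ppn, hot_node);  /* read-shared: give node a copy   */

    st->home_node = hot_node;
}
```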
Virtual I/O Devices • Obviously Disco needs to intercept I/O requests and direct them to the actual device. • Primarily handled by installing drivers for Disco I/O in the guest OS. • DMA provides an interesting challenge, in that the DMA addresses need the same translation as regular accesses. • However, we can do some especially cool things with DMA requests to disk.
Copy-on-Write Disks • All disk DMA requests are caught and analyzed. If the data is already in memory, we don’t have to go to disk for it. • If the request is for a full page, we just update a pointer in the requesting virtual machine. • So what? • Multiple VM’s can share data without being aware of it. Only modifying the data causes a copy to be made. • Awesome for scaling up apps by using multiple copies of an OS. Only really need one copy of the OS kernel, libraries, etc.
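A sketch of how that interception might look, assuming hypothetical helpers (disk_cache_lookup, map_page_cow, issue_real_disk_read); this captures the idea on the slide, not Disco's actual implementation.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

struct dma_request {
    uint64_t disk_block;    /* block the guest wants to read          */
    uint64_t guest_ppn;     /* guest physical page it should land in  */
    uint32_t length;        /* request length in bytes                */
};

/* Hypothetical helpers; declared but not defined in this sketch. */
extern uint64_t disk_cache_lookup(uint64_t disk_block);   /* 0 if not resident */
extern void map_page_cow(int vm_id, uint64_t guest_ppn, uint64_t machine_ppn);
extern void issue_real_disk_read(int vm_id, const struct dma_request *req);

void handle_disk_dma(int vm_id, const struct dma_request *req)
{
    uint64_t machine_ppn;

    if (req->length == PAGE_SIZE &&
        (machine_ppn = disk_cache_lookup(req->disk_block)) != 0) {
        /* Full-page request and the data is already in memory: just remap
         * the page read-only, so a later write faults and triggers a copy. */
        map_page_cow(vm_id, req->guest_ppn, machine_ppn);
        return;
    }
    /* Otherwise fall back to a real transfer from the (virtual) disk. */
    issue_real_disk_read(vm_id, req);
}
```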
My Favorite – Networking • The Copy-on-write disk stuff is great for non-persistent disks. But what about persistent ones? Let’s just use NFS. • But here’s a dumb thing: A VM has a copy of information it wants to send to another VM on the same physical machine. In a naïve approach, we’d let that data be duplicated, taking up extra memory pointlessly. • So, let’s use copy-on-write for our network interface too!
Virtual Network Interface • Disco provides a virtual subnet for VM’s to talk to each other. • This virtual device is Ethernet-like, but with no maximum transfer size. • Transfers are accomplished by updating pointers rather than actually copying data (until absolutely necessary). • The OS sends out the requests as NFS requests. • “Ah,” but you say. “What about the data locality as a VM starts accessing those files and memory?” • Page replication and migration!
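A rough sketch of the remap-instead-of-copy idea, with invented names (vnet_msg, map_page_readonly, copy_to_guest): only the small packet header is actually copied, while the data pages end up shared read-only between sender and receiver.

```c
#include <stdint.h>
#include <stddef.h>

struct vnet_msg {
    int       src_vm, dst_vm;
    uint64_t *machine_pages;   /* machine pages backing the payload     */
    size_t    npages;
    void     *header;          /* small header that really is copied    */
    size_t    header_len;
};

/* Hypothetical helpers; declared but not defined in this sketch. */
extern void map_page_readonly(int vm_id, uint64_t machine_ppn);
extern void copy_to_guest(int vm_id, const void *buf, size_t len);

/* The virtual device has no maximum transfer size, so a whole NFS reply
 * can move in one "packet". */
void vnet_deliver(const struct vnet_msg *m)
{
    copy_to_guest(m->dst_vm, m->header, m->header_len);    /* headers: copy */

    for (size_t i = 0; i < m->npages; i++)
        map_page_readonly(m->dst_vm, m->machine_pages[i]); /* data: remap   */

    /* Both sender and receiver now share the same machine pages; a later
     * write by either side triggers the usual copy-on-write fault. */
}
```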
About those Commodity OS’s • So what do we really need to do to get these commodity operating systems running on Disco? • Surprisingly, both a lot and a little. • Minor changes were needed to IRIX’s HAL, amounting to 2 header files and 15 lines of assembly code. This did lead to a full kernel recompile though. • Disco needs device drivers. Let’s just steal them from IRIX! • Don’t trap on every privileged register access. Convert those accesses into normal loads/stores to a special address space linked to the privileged registers (sketched below).
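As a sketch of that last change (all addresses and names here are made up, not IRIX's or Disco's): the patched HAL reads and writes a page of ordinary memory that the monitor keeps in sync, instead of executing a privileged instruction that would trap into the monitor every time.

```c
#include <stdint.h>

/* One page, mapped into the guest kernel's address space by the monitor,
 * holding the virtualized CP0 registers. */
struct priv_regs {
    volatile uint64_t status;
    volatile uint64_t cause;
    volatile uint64_t entryhi;
    /* ... remaining virtualized privileged registers ... */
};

#define PRIV_REGS_ADDR 0xFFFFA000ul   /* illustrative special address */
#define PRIV_REGS ((struct priv_regs *)PRIV_REGS_ADDR)

/* Patched HAL routine: reading the status register is now a plain load,
 * with no trap and no monitor entry. */
static inline uint64_t hal_read_status(void)
{
    return PRIV_REGS->status;
}

static inline void hal_write_status(uint64_t val)
{
    /* Plain store; writes that must take effect immediately (e.g. enabling
     * interrupts) would still go through an explicit trap into the monitor. */
    PRIV_REGS->status = val;
}
```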
More Patching • “Hinting” added to HAL to help the VMM not do dumb things (or at least do fewer dumb things). • When the OS goes idle, the MIPS (usually) defaults to a low power mode. Disco just stops scheduling the VM until something interesting happens. • Other minor things were done, but those required patching the kernel.
SPLASHOS • Some high-performance apps might need most or all of the machine. The authors wrote a “thin” operating system to run SPLASH-2 applications. • Mostly proof-of-concept.
Experimental Results • Bad Idea: Target your software for a machine that doesn’t physically exist. • Like, I don’t know, FLASH? • Disco was validated using two alternatives: • SimOS • SGI Origin2000 board that will form the basis of FLASH
Experimental Design • Use 4 representative workloads for parallel applications: • Software Development (Pmake of a large app) • Hardware Development (Verilog simulator) • Scientific Computing (Raytracing and a sorting algorithm) • Commercial Database (Sybase) • Not only are they representative, but they each have characteristics that are interesting to study • For example, Pmake is multiprogrammed, lots of short-lived processes, OS & I/O intensive.
Simplest Results Graph Overhead of Disco is pretty modest compared to the uniprocessor results. Raytrace is the lowest, at only 3%. Pmake is the highest, at 16%. The main hits come from additional traps and TLB misses (from all the flushing Disco does). Interestingly, less time is spent in the kernel in Raytrace, Engineering and Database. Running a 64-bit system mitigates the impact of TLB misses.
Memory Utilization Key thing here is how 8 VM’s doesn’t require 8x the memory of 1 VM. Interestingly, we have 8 copies of IRIX running in less than 256 MB of physical RAM!
Scalability Page migration and replication were disabled for these runs. All use 8 processors and 256 MB of memory. IRIX has a terrible bottleneck in synchronizing the system’s memory management code It also has a “lazy” evaluation policy in the virtual memory system that drags “normal” RADIX down. Overall though, check out those performance gains!
Page Migration Benefits The 100% UMA results give a lower bound on performance gains from page migration and replication. But in short, the policies work great.
Real Hardware • Experience on the real SGI hardware pretty much confirms the simulations, at least at the uniprocessor level. • Overheads tend to be in the range of 3-8% on Pmake and the Engineering simulation.
Summing Up • Disco works pretty well. • Memory usage scales well, processor utilization scales well. • Performance overheads are relatively small for most loads. • Lots of engineering challenges, but most seem to have been overcome.
Final Thoughts • Everything in this paper seems, in retrospect, to be totally obvious. However, the combination of all of these factors seems like it would have taken just a ton of work. • Plus, I don’t think I could have done it half as well, to be honest. • Targeting a non-existent machine seems a little silly. • Overall, interesting paper.