This paper discusses the challenges and implementation of the Disco virtual machine monitor for scalable multiprocessors, including resource sharing across virtual machines. The implementation involves emulating the MMU, virtualizing I/O devices, managing physical memory, and optimizing performance. Disco also hides NUMA from commodity operating systems, showcasing its ability to run unmodified OSes efficiently.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors
FLASH • cache-coherent non-uniform memory access (NUMA) multiprocessor • developed at Stanford • not yet available at the time the paper was written
Problems with other approaches • many other experimental systems require large changes to uniprocessor OSes • development lags behind delivery of the hardware • high costs mean the system will likely introduce instabilities, which may break legacy applications • hardware vendors are often not the software vendors (think Intel/Microsoft)
Virtual Machine Monitors: nanokernel • software layer between the hardware and heavyweight OSes • by running multiple copies of traditional OSes, scalability issues are confined to the much smaller VM monitor • VMs are natural software fault boundaries, and the small size of the monitor makes hardware fault tolerance easier to implement
Virtual Machine Monitors: nanokernel • the monitor handles all NUMA-related issues, so UMA OSes do not need to be made aware of non-uniformity • multiple OSes allow legacy applications to continue to run while newer versions are phased in, and could allow more experimentation with new technologies
Virtual Machine Challenges • overhead • runtime • privileged instructions must be emulated inside the monitor • I/O requests must be intercepted and re-virtualized by the monitor • memory • code/data must be replicated for the different copies of the OS • each OS may keep its own file system buffer cache, for instance
Virtual Machine Challenges • resource management • the monitor does not have all the information the OS does, so it may make poor decisions. Think of spin-locking: the monitor may deschedule a virtual CPU that is holding a lock. • sound familiar? This is the same argument made against kernel-level threads.
Virtual Machine Challenges • communication and sharing • if OSes are separated by virtual machine boundaries, how do they share resources, and how does information cross those boundaries? VMs are not aware they are actually on the same machine. • sound familiar? This issue motivated LRPC and URPC, which specialize communication protocols for the case where server and client reside on the same physical machine
Disco implementation • Disco emulates the MMU and the trap architecture, allowing unmodified applications and OSes to run in a VM • frequently used kernel operations can be optimized. For instance, the OSes disable and enable interrupts by loading and storing to special addresses
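The special-address idea above can be sketched as follows. This is a minimal illustration, not Disco's actual code: the class, method names, and the address constant are all invented. The point is that a store to a known address updates per-virtual-CPU interrupt state kept by the monitor, avoiding a trap-and-emulate cycle for a privileged instruction.

```python
INTR_ENABLE_ADDR = 0xFFFF0000  # hypothetical special address for illustration

class VCpu:
    """Per-virtual-CPU interrupt state kept inside the monitor (sketch)."""
    def __init__(self):
        self.interrupts_enabled = True
        self.pending = []    # interrupts queued while disabled
        self.delivered = []  # interrupts actually delivered to the guest

    def store(self, addr, value):
        # A store to the special address toggles virtual interrupt state
        # directly; no privileged instruction needs to be emulated.
        if addr == INTR_ENABLE_ADDR:
            self.interrupts_enabled = bool(value)
            if self.interrupts_enabled:
                while self.pending:
                    self.delivered.append(self.pending.pop(0))

    def post_interrupt(self, irq):
        # The monitor delivers an interrupt now, or queues it until the
        # guest re-enables interrupts.
        if self.interrupts_enabled:
            self.delivered.append(irq)
        else:
            self.pending.append(irq)
```

With interrupts "disabled" the monitor simply queues pending interrupts and flushes the queue when the guest stores a non-zero value back to the special address.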
Disco implementation • all I/O devices are virtualized, including network interfaces and disks; all access to them must pass through Disco to be translated or emulated
Disco implementation • at only 13,000 lines of code, there is more opportunity to hand-tune the code • the small image size, only 72KB, also means a copy of Disco can reside in every node's local memory, so Disco text never has to be fetched at slower remote-access latencies • machine-wide data structures are partitioned so the parts currently in use by a processor can reside in that processor's local memory
Disco implementation • scheduling VMs is similar to a traditional kernel scheduling processes, e.g. quantum-size considerations, saving state in data structures, processor affinity, etc.
Disco implementation: Virtual Physical Memory • Disco maintains a physical-to-machine address mapping • machine addresses are FLASH's 40-bit addresses
Disco implementation: Virtual Physical Memory • when a heavyweight OS tries to update the TLB, Disco steps in and applies the physical-to-machine translation. Subsequent memory accesses then go straight through the TLB • each VM has an associated pmap in the monitor • the pmap also keeps a back pointer to the corresponding virtual address to help invalidate mappings in the TLB
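The translation step above can be sketched roughly as follows. This is a simplified illustration with invented names, not Disco's data structures: when the guest writes a TLB entry mapping a virtual page to what it believes is a physical page, the monitor substitutes the machine page before the entry reaches the hardware TLB.

```python
class Pmap:
    """Per-VM map from guest 'physical' pages to machine pages (sketch)."""
    def __init__(self):
        self.phys_to_machine = {}
        # Back map: machine page -> guest virtual page, used when the
        # monitor must invalidate stale TLB entries.
        self.back_map = {}

    def add(self, phys_page, machine_page, virt_page):
        self.phys_to_machine[phys_page] = machine_page
        self.back_map[machine_page] = virt_page

def emulate_tlb_write(pmap, virt_page, phys_page):
    """Intercepted guest TLB update: return the entry the hardware sees,
    with the guest physical page replaced by the machine page."""
    machine_page = pmap.phys_to_machine[phys_page]
    return (virt_page, machine_page)
```

Because the installed entry maps the guest virtual page directly to the machine page, later accesses hit the hardware TLB with no monitor involvement; the monitor is only on the path for TLB misses and updates.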
Disco implementation: Virtual Physical Memory • MIPS has a tagged TLB; the tag is called an address space identifier (ASID) • ASIDs are not virtualized, so the TLB must be flushed on VM context switches • a second-level software TLB caches recent translations to lessen the cost of these flushes
Disco implementation: Hiding NUMA • cache misses are served faster from local memory than from remote memory • pages heavily accessed by one node are migrated to it; read-shared pages are replicated to the nodes that frequently access them • write-shared pages are not, since maintaining consistency requires remote access anyway • the migration and replication policy is driven by cache-miss counting
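The policy above can be condensed into a small decision function. The threshold value and the exact rules here are invented for illustration; the real policy has more machinery, but the shape is the same: count misses per page per node, migrate when one node dominates, replicate read-shared pages, and leave write-shared pages alone.

```python
HOT_THRESHOLD = 64  # hypothetical per-node miss count that triggers action

def placement_decision(miss_counts, writable):
    """miss_counts: dict mapping node -> cache misses on this page from
    that node. Returns 'leave', ('migrate', node), or ('replicate', nodes)."""
    hot_nodes = [n for n, c in miss_counts.items() if c >= HOT_THRESHOLD]
    if not hot_nodes:
        return "leave"
    if len(hot_nodes) == 1:
        # One node dominates: move the page next to it.
        return ("migrate", hot_nodes[0])
    if writable:
        # Replicating a written page would need remote coherence traffic
        # anyway, so write-shared pages stay put.
        return "leave"
    # Read-shared by several nodes: give each hot node a local copy.
    return ("replicate", hot_nodes)
```

A usage example: a page with 100 misses from node 0 migrates there; the same miss pattern split across two nodes replicates if read-only and stays put if writable.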
Disco implementation: Hiding NUMA • memmap tracks, for each machine page, the virtual pages that reference it; it is used during TLB shootdown
Disco implementation: Virtualizing I/O • all device access is intercepted by the monitor • disk reads can be serviced by the monitor, and if the request size is a multiple of the machine's page size, the monitor only has to remap machine pages into the VM's physical memory address space • pages are mapped read-only and generate a copy-on-write fault if written to
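The copy-on-write sharing of disk pages can be sketched like this. All names here are invented and the page "allocator" is just a counter: page-aligned reads of the same disk block by different VMs map one shared machine page read-only, and the first write gives the writing VM a private copy.

```python
import itertools

_next_page = itertools.count(100)  # stand-in machine page allocator

class SharedDiskCache:
    """Monitor-level cache mapping disk blocks to machine pages (sketch)."""
    def __init__(self):
        self.block_to_page = {}   # disk block -> shared machine page
        self.vm_mappings = {}     # (vm, block) -> (machine page, writable)

    def read(self, vm, block):
        # Serve the read by remapping: every VM reading this block shares
        # one machine page, mapped read-only.
        page = self.block_to_page.setdefault(block, next(_next_page))
        self.vm_mappings[(vm, block)] = (page, False)
        return page

    def write_fault(self, vm, block):
        # Copy-on-write: the faulting VM gets a private, writable copy;
        # other VMs keep the shared read-only page.
        private = next(_next_page)
        self.vm_mappings[(vm, block)] = (private, True)
        return private
```

This is why running many VMs costs far less than a full copy of memory per VM: identical disk blocks (kernel text, shared binaries) occupy one machine page until somebody writes to them.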
IRIX • small changes were required to the IRIX kernel, due to a MIPS peculiarity • new device drivers were not needed • the hardware abstraction layer is where the trap, zeroed-page, unused-page, and VM descheduling optimizations were implemented
SPLASHOS • a thin OS, supported by Disco • used for parallel scientific applications
Experimental Results • since FLASH was not available, experiments were run on SimOS, a machine simulator • the simulator was too slow, compared to the actual machine, for long workloads to be studied
Single VM • ran the four workloads in plain IRIX inside the simulator and with a single VM running IRIX; 3%–16% slowdown
Memory Overhead • ran pmake with 8 physical processors under six configurations: plain IRIX, 1 VM, 2 VMs, 4 VMs, 8 VMs, and 8 VMs communicating over NFS • demonstrates • 8 VMs required less than twice the physical memory of plain IRIX • the physical-to-machine mapping enables useful memory sharing
Scalability tests • compared the performance of pmake under the previously described configurations • summary: while 1 VM showed a significant slowdown (36%), using 8 VMs showed a significant speedup (40%) • also ran a radix sorting algorithm on plain IRIX and on SPLASHOS/Disco, reducing run time by two thirds
Page Migration and Replication • the engineering workload ran on 8 processors, raytrace on 16 • the UMA machine is the theoretical lower bound
Conclusion • a nanokernel is several orders of magnitude smaller than a heavyweight OS, yet can run virtually unmodified OSes in virtual machine monitors • the problems of execution overhead and memory footprint were addressed