This presentation discusses the implementation details and benefits of the DISCO architecture for supporting multi-processors, including virtual machine monitors; the virtualization of processors, memory, and I/O; and dynamic page migration and replication. It also covers Tornado, an object-oriented OS built around locality and independence, and explores the challenges of scalability and performance in multi-processor systems.
Supporting Multi-Processors
Bernard Wong
February 17, 2003
Uni-processor systems • Computing began with uni-processor systems • A uni-processor OS is simple to implement and allows for many assumptions • UMA, efficient locks (small impact on throughput), straightforward cache coherency • Hard to make faster
Small SMP systems • Multiple symmetric processors • Requires some modifications to the OS • Still allows for UMA • The system/memory bus becomes a contended resource • Locks have a larger impact on throughput • e.g. a lock held by one process can block another process (running on another processor) from making progress • Must introduce finer-grained locks to improve scalability • The system bus limits system size
Large Shared Memory Multi-processors • Consist of many nodes, each of which may be a uni-processor or an SMP • Memory access is often NUMA, and sometimes cache coherency is not even provided • Performance is very poor with an off-the-shelf SMP OS • Requirements for good performance: • Locality of service to request • Independence between services
DISCO • Uses a virtual machine monitor to run multiple commodity OSes on a scalable multi-processor • Virtual machine monitor • An additional layer between the OS and the hardware • Virtualizes processor, memory, and I/O • The OS is unaware of the virtualization (ideally) • Exports a simple, general interface to the commodity OS
DISCO Architecture • [Figure: several OS instances (commodity OSes, an SMP-OS, a thin specialized OS) run on top of DISCO, which runs on the processing elements (PEs) of a ccNUMA multiprocessor joined by an interconnect]
Implementation Details • Virtual CPUs • Use direct execution on the real CPU • Fast: most instructions run at native speed • Must detect and emulate operations that cannot be safely exported to the VM • Primarily privileged instructions: TLB modification, direct physical memory or I/O operations • Must also keep a data structure to save registers and other state • Used when a virtual CPU is not scheduled on a real CPU • Virtual CPUs use affinity scheduling to maintain cache locality (sketched below)
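Below is a minimal sketch, in C, of how trap-and-emulate for a virtual CPU could be organized. The struct layout, the `handle_privileged_trap` entry point, and the toy instruction decoders are all invented for illustration; they are not DISCO's actual data structures or MIPS decoding logic.

```c
/* Hypothetical sketch of trap-and-emulate for a virtual CPU.
 * All names are invented for illustration; this is not DISCO's code. */
#include <stdint.h>

#define NUM_GP_REGS 32                  /* MIPS general-purpose registers */

typedef struct vcpu {
    uint64_t gp_regs[NUM_GP_REGS];      /* saved while not scheduled      */
    uint64_t pc;                        /* program counter                */
    uint64_t cp0_status;                /* shadow of privileged CP0 state */
    int      last_real_cpu;             /* hint for affinity scheduling   */
} vcpu_t;

/* Toy decoders standing in for real MIPS instruction decoding. */
static int      is_tlb_write(uint32_t insn)    { return (insn >> 26) == 0x10; }
static int      is_status_write(uint32_t insn) { return (insn >> 26) == 0x11; }
static uint64_t trapped_operand(vcpu_t *v, uint32_t insn)
                { return v->gp_regs[(insn >> 16) & 0x1f]; }
static void     emulate_tlb_write(vcpu_t *v)   { (void)v; /* next slide */ }
static void     emulate_unsafe_op(vcpu_t *v, uint32_t insn) { (void)v; (void)insn; }

/* Most instructions run natively; privileged ones trap here and are
 * emulated against the vCPU's shadow state, not the real hardware. */
void handle_privileged_trap(vcpu_t *v, uint32_t insn) {
    if (is_tlb_write(insn))
        emulate_tlb_write(v);
    else if (is_status_write(insn))
        v->cp0_status = trapped_operand(v, insn);
    else
        emulate_unsafe_op(v, insn);     /* phys. memory / I/O accesses  */
    v->pc += 4;                         /* advance past the trapped insn */
}
```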
Implementation Details • Virtual Physical Memory • Adds a level of address translation • Maintains physical-to-machine address mappings • Needed because each VM uses physical addresses that start at 0 and continue up to the size of the VM's memory • Performed by emulating TLB instructions • When the OS tries to insert an entry into the TLB, DISCO intercepts it and inserts the translated version • The TLB is flushed on virtual CPU switches • TLB misses are also more expensive due to the required trap • A second-level software TLB is added to improve performance (sketched below)
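The extra translation level can be pictured as below: a per-VM pmap array rewrites the guest's "physical" frame number into a machine frame number on every intercepted TLB insert, with a second-level software TLB caching the result. All names (`vm_t`, `insert_guest_tlb_entry`, the sizes) are hypothetical.

```c
#include <stdint.h>

#define VM_PAGES      4096   /* pages in this VM's "physical" memory */
#define SOFT_TLB_SIZE 256    /* entries in the 2nd-level software TLB */

typedef struct vm {
    uint64_t pmap[VM_PAGES];           /* physical page -> machine page */
} vm_t;

typedef struct tlb_entry {
    uint64_t vpn;                      /* virtual page number  */
    uint64_t pfn;                      /* page frame number    */
    int      writable;
} tlb_entry_t;

static tlb_entry_t soft_tlb[SOFT_TLB_SIZE];

static void write_hw_tlb(const tlb_entry_t *e) { (void)e; /* MMU write */ }

/* Intercepted guest TLB write: the guest supplied a virtual-to-physical
 * entry; rewrite it as virtual-to-machine before it reaches hardware. */
int insert_guest_tlb_entry(vm_t *vm, tlb_entry_t *e) {
    if (e->pfn >= VM_PAGES)
        return -1;                         /* outside the VM's memory   */
    e->pfn = vm->pmap[e->pfn];             /* physical -> machine       */
    soft_tlb[e->vpn % SOFT_TLB_SIZE] = *e; /* cache for cheaper refills */
    write_hw_tlb(e);
    return 0;
}
```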
Implementation Details • Virtual I/O • Intercepts all device accesses from the VM through special OS device drivers • Virtualizes both disk and network I/O • DISCO provides both persistent and non-persistent disks • Persistent disks cannot be shared • Non-persistent disks are implemented via copy-on-write (sketched below)
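A non-persistent copy-on-write disk can be sketched as follows: reads fall through to a shared, read-only base image until a sector is written, at which point that sector gets a private copy. This is an illustrative toy, not DISCO's disk code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NUM_SECTORS 1024

typedef struct cow_disk {
    const uint8_t *base;                 /* read-only base image (shared) */
    uint8_t *private_copy[NUM_SECTORS];  /* NULL until first write        */
} cow_disk_t;

/* Reads come from the private copy if one exists, else the base image. */
void cow_read(cow_disk_t *d, uint32_t sector, uint8_t *buf) {
    const uint8_t *src = d->private_copy[sector]
        ? d->private_copy[sector]
        : d->base + (size_t)sector * SECTOR_SIZE;
    memcpy(buf, src, SECTOR_SIZE);
}

/* The first write to a sector breaks sharing by allocating a copy. */
void cow_write(cow_disk_t *d, uint32_t sector, const uint8_t *buf) {
    if (!d->private_copy[sector]) {
        d->private_copy[sector] = malloc(SECTOR_SIZE);
        if (!d->private_copy[sector]) return;  /* allocation failed */
    }
    memcpy(d->private_copy[sector], buf, SECTOR_SIZE);
}
```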
Why use a VMM? • DISCO is aware of the NUMA-ness • Hides the NUMA-ness from the commodity OS • Requires less work than engineering a NUMA-aware OS • Performs better than a NUMA-unaware OS • A good middle ground • How? • Dynamic page migration and page replication • Maintains locality between a virtual CPU's cache misses and the memory pages on which those misses occur
Memory Management • Pages heavily accessed by only one node are migrated to that node • Change the physical-to-machine address mapping • Invalidate TLB entries that point to the old location • Copy the page to the local node's memory • Pages that are heavily read-shared are replicated to the nodes accessing them most heavily • Downgrade TLB entries pointing to the page to read-only • Copy the page • Update the relevant TLB entries to point at the local replica (entries stay read-only so later writes can be caught) • The migration path is sketched below
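The migration path might look roughly like this in C; the shootdown, page-array, and allocator helpers are trivial stand-ins for real machine operations, and all names are invented.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE     4096
#define MACHINE_PAGES 64

/* Toy stand-ins for the real machine: a flat page array, a no-op TLB
 * shootdown, and a trivial per-node allocator. */
static uint8_t  machine_mem[MACHINE_PAGES][PAGE_SIZE];
static uint64_t next_free = MACHINE_PAGES / 2;

static void     invalidate_tlb_entries_for(uint64_t mpage) { (void)mpage; }
static uint8_t *page_addr(uint64_t mpage) { return machine_mem[mpage]; }
static uint64_t alloc_page_on_node(int node) { (void)node; return next_free++; }

/* Migrate one of a VM's "physical" pages to the node that uses it most:
 * shoot down stale mappings, copy, then update the pmap entry so later
 * TLB refills install the local copy. */
void migrate_page(uint64_t *pmap_entry, int hot_node) {
    uint64_t old_mpage = *pmap_entry;
    uint64_t new_mpage = alloc_page_on_node(hot_node);

    invalidate_tlb_entries_for(old_mpage);  /* no one may touch old copy */
    memcpy(page_addr(new_mpage), page_addr(old_mpage), PAGE_SIZE);
    *pmap_entry = new_mpage;                /* physical -> new machine   */
}
```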
Aren’t VMs memory inefficient? • Traditionally, VMs tend to replicate the memory used for each system image • Additionally, structures such as the disk cache are not shared • DISCO uses the notion of a global buffer cache to reduce the memory footprint
Page sharing • DISCO keeps a data structure that maps disk sectors to memory pages • If two VMs request the same disk sector, both are given the same read-only buffer page • Modifications to pages are performed via copy-on-write • Only works for non-persistent copy-on-write disks (see the sketch below)
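One plausible shape for that sector-to-page map is a small hash table: the first VM to read a sector pulls it into a fresh page, and every later request for the same sector gets the existing read-only page. Names and sizes are invented.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_BUCKETS 1024
#define PAGE_SIZE     4096

typedef struct buf_entry {
    uint64_t          sector;
    void             *page;        /* shared page, mapped read-only */
    struct buf_entry *next;
} buf_entry_t;

static buf_entry_t *cache[CACHE_BUCKETS];

/* Stand-in for actually reading the sector from the backing disk. */
static void *read_sector_into_new_page(uint64_t sector) {
    (void)sector;
    return calloc(1, PAGE_SIZE);
}

/* Return the shared page for a sector, loading it on first request.
 * A second VM asking for the same sector gets the existing page. */
void *lookup_shared_page(uint64_t sector) {
    buf_entry_t **bucket = &cache[sector % CACHE_BUCKETS];
    for (buf_entry_t *e = *bucket; e; e = e->next)
        if (e->sector == sector)
            return e->page;        /* reuse: no second copy in memory */

    buf_entry_t *e = malloc(sizeof *e);
    if (!e) return NULL;
    e->sector = sector;
    e->page   = read_sector_into_new_page(sector);
    e->next   = *bucket;
    *bucket   = e;
    return e->page;                /* later writes go copy-on-write */
}
```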
Page sharing • Sharing is effective even when data moves between VMs in network packets, e.g. over NFS: the transferred pages are shared rather than copied
Tornado • An OS designed to take advantage of shared-memory multi-processors • Object-oriented structure • Every virtual and physical resource is represented by an independent object • Ensures natural locality and independence • A resource's lock and data structures are stored on the same node as the resource • Resources are managed independently and at a fine grain • No global source of contention
OO structure • Example: page fault • A separate File Cache Manager (FCM) object exists for each region of memory • COR -> Cached Object Representative • All objects are specific to either the faulting process or the file(s) backing the process • Problem: hard to implement global policies • (the fault-dispatch path is sketched below)
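A hypothetical sketch of the per-region dispatch: page faults are routed to the FCM owning the faulting address, so faults on different regions contend on different objects. The `region_t`/`fcm_t` types are illustrative only, not Tornado's interfaces.

```c
#include <stdint.h>
#include <stddef.h>

/* Each region of a process's address space has its own File Cache
 * Manager, so faults on different regions touch different objects. */
typedef struct fcm {
    int (*handle_fault)(struct fcm *self, uintptr_t addr);
} fcm_t;

typedef struct region {
    uintptr_t start, end;              /* [start, end) address range  */
    fcm_t    *fcm;                     /* the FCM backing this region */
} region_t;

/* Route a fault to the owning region's FCM; no global lock is taken,
 * which gives locality, but is also why global policy is hard. */
int page_fault(region_t *regions, size_t n, uintptr_t addr) {
    for (size_t i = 0; i < n; i++)
        if (addr >= regions[i].start && addr < regions[i].end)
            return regions[i].fcm->handle_fault(regions[i].fcm, addr);
    return -1;                         /* unmapped: segmentation fault */
}
```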
Clustered objects • Even with an OO structure, widely shared objects can be expensive due to contention • Need replication, distribution, and partitioning to reduce contention • Clustered objects are a systematic way to do this • Give the illusion of a single object, but are actually composed of multiple component (rep) objects • Each component handles a subset of the processors • Must handle consistency across reps
Clustered object implementation • Per-processor translation table • Contains a pointer to the local rep of each clustered object • Entries are created on demand via a combination of a global miss-handling object and a clustered-object-specific miss handler (see the sketch below)
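The translation-table mechanism could look like the sketch below: dereferencing a clustered object consults the current processor's table and, on a miss, asks the object's own miss handler to produce (or pick) a rep for this CPU. Everything here is invented for illustration.

```c
#define MAX_CPUS  64
#define MAX_COBJS 256

typedef struct rep rep_t;              /* one per-processor representative */

typedef struct clustered_obj {
    /* object-specific miss handler: create a rep for this CPU, or point
     * several CPUs at one shared rep */
    rep_t *(*miss_handler)(struct clustered_obj *self, int cpu);
} cobj_t;

/* Per-processor translation table: one pointer per clustered object. */
static rep_t *xlate[MAX_CPUS][MAX_COBJS];

/* Dereference a clustered object on a given CPU; the caller sees the
 * illusion of a single object. */
rep_t *cobj_deref(cobj_t *objs[], int obj_id, int cpu) {
    rep_t *r = xlate[cpu][obj_id];
    if (!r) {                          /* global miss-handling path */
        r = objs[obj_id]->miss_handler(objs[obj_id], cpu);
        xlate[cpu][obj_id] = r;        /* slot filled on demand     */
    }
    return r;
}
```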
Memory Allocation • Need an efficient, highly concurrent allocator that maximizes locality • Uses local pools of memory • However, small block allocations still suffer from false sharing • An additional small pool of strictly local memory is used (sketched below)
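A minimal sketch of the two ideas combined, assuming per-processor free lists and blocks padded to a cache line so that allocations handed out on different CPUs can never share a line. The block size and pool shape are assumptions, not Tornado's allocator.

```c
#include <stdlib.h>

#define CACHE_LINE 128                 /* padded block size (bytes) */
#define NUM_CPUS   64

/* Each block is padded to a full cache line, so two blocks handed to
 * different CPUs can never occupy the same line (no false sharing). */
typedef union block { union block *next; char pad[CACHE_LINE]; } block_t;

static block_t *free_list[NUM_CPUS];   /* one strictly local pool per CPU */

void *small_alloc(int cpu) {
    block_t *b = free_list[cpu];
    if (b) {                           /* fast path: no locks, no sharing */
        free_list[cpu] = b->next;
        return b;
    }
    /* refill from this CPU's local memory; aligned_alloc is a stand-in */
    return aligned_alloc(CACHE_LINE, sizeof(block_t));
}

void small_free(int cpu, void *p) {    /* return block to the owner pool */
    block_t *b = p;
    b->next = free_list[cpu];
    free_list[cpu] = b;
}
```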
Synchronization • The use of objects, and further of clustered objects, reduces the scope of each lock and limits lock contention to a single rep • Existence guarantees are hard • A thread must determine whether an object is currently being de-allocated by another thread • This often requires a lock hierarchy whose root is a global lock • Tornado instead uses a semi-automatic garbage collector • A thread never needs to test for existence, so no locking is required (sketched below)
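One way to picture the existence guarantee is per-CPU reference counting with deferred destruction, as sketched below. This toy spins while draining references and still has the classic reclamation race that a real system avoids by deferring the free until every processor passes a quiescent point; it illustrates only the bookkeeping, not Tornado's actual garbage collector.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NUM_CPUS 64

typedef struct gc_obj {
    atomic_int  refs[NUM_CPUS];        /* per-CPU temporary references  */
    atomic_bool dying;                 /* set once unlinked from tables */
} gc_obj_t;

/* Callers bracket every use; nobody ever takes an existence lock. */
bool obj_enter(gc_obj_t *o, int cpu) {
    atomic_fetch_add(&o->refs[cpu], 1);
    if (atomic_load(&o->dying)) {      /* raced with destruction: back off */
        atomic_fetch_sub(&o->refs[cpu], 1);
        return false;
    }
    return true;
}

void obj_exit(gc_obj_t *o, int cpu) {
    atomic_fetch_sub(&o->refs[cpu], 1);
}

/* Destruction: unlink the object from every lookup table first (so no
 * new references can start), then free once all references drain. */
void obj_destroy(gc_obj_t *o) {
    atomic_store(&o->dying, true);
    for (int c = 0; c < NUM_CPUS; c++)
        while (atomic_load(&o->refs[c]) != 0)
            ;                          /* a real system defers instead */
    free(o);
}
```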
Protected Procedure Calls • Since Tornado is a microkernel, IPC traffic is significant • Need a fast IPC mechanism that maintains locality • Protected Procedure Calls (PPCs) maintain locality by: • Spawning a new server thread on the same processor as the client to service the client's request • Keeping all client-specific data in data structures stored on the client's side (sketched below)
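A toy model of a PPC's locality properties: the call runs on the caller's own processor (here, simply inline) and per-client bookkeeping stays in client-side state. The interface is invented for illustration, not Tornado's API.

```c
#include <stdio.h>

/* Per-client state, kept on the client's side of the call. */
typedef struct client_state { int calls_made; } client_state_t;

typedef long (*ppc_handler_t)(client_state_t *cs, long arg);

/* A PPC behaves like a kernel call: a server thread is spawned on the
 * *current* processor, so neither code nor data migrate across CPUs.
 * This toy version just runs the handler inline to model that. */
long ppc_call(ppc_handler_t handler, client_state_t *cs, long arg) {
    cs->calls_made++;                  /* client-local bookkeeping    */
    return handler(cs, arg);           /* "server thread" on same CPU */
}

static long echo_service(client_state_t *cs, long arg) {
    (void)cs;
    return arg;                        /* trivial stand-in for a server */
}

int main(void) {
    client_state_t cs = {0};
    printf("%ld (%d calls)\n", ppc_call(echo_service, &cs, 42), cs.calls_made);
    return 0;
}
```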
Performance • Comparison to other large shared-memory multi-processors
Conclusion • Illustrated two different approaches to making efficient use of shared-memory multi-processors • DISCO adds an extra layer between the hardware and the OS • Less engineering effort, more overhead • Tornado redesigns the OS to take advantage of locality and independence • More engineering effort and less overhead, but local, independent algorithms may work poorly with real-world loads