
Supporting Multi-Processors

This article discusses the implementation details and benefits of using the DISCO architecture to support multiprocessors, including virtual machine monitors, virtualization of processors, memory, and I/O, and dynamic page migration and replication. It also explores the challenges of scalability and performance in multiprocessor systems.


Presentation Transcript


  1. Supporting Multi-Processors Bernard Wong February 17, 2003

  2. Uni-processor systems • Began with uni-processor systems • Simple to implement a uni-processor OS; allows for many assumptions • UMA, efficient locks (small impact on throughput), straightforward cache coherency • Hard to make faster

  3. Small SMP systems • Multiple symmetric processors • Requires some modifications to the OS • Still allows for UMA • System/memory bus becomes a contended resource • Locks have a larger impact on throughput • e.g. a lock held by one process can block another process (running on another processor) from making progress • Must introduce finer-grained locks to improve scalability (see the sketch below) • System bus limits system size
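
A small, generic illustration of that difference (pthreads C, not from the slides; the structures and function names are invented): a single kernel-wide lock serializes unrelated operations, while per-structure locks let two processors work on different structures concurrently.

```c
#include <pthread.h>
#include <stdio.h>

/* Coarse-grained: one lock guards everything, so unrelated operations
   from different processors still serialize behind it. */
pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

/* Finer-grained: each structure carries its own lock, so operations on
   different structures can proceed in parallel. */
struct inode_table { pthread_mutex_t lock; int entries; };
struct run_queue   { pthread_mutex_t lock; int length;  };

struct inode_table itab = { PTHREAD_MUTEX_INITIALIZER, 0 };
struct run_queue   rq   = { PTHREAD_MUTEX_INITIALIZER, 0 };

void coarse_add_inode(void) {           /* serializes with everything */
    pthread_mutex_lock(&kernel_lock);
    itab.entries++;
    pthread_mutex_unlock(&kernel_lock);
}

void add_inode(void) {                  /* contends only with inode ops */
    pthread_mutex_lock(&itab.lock);
    itab.entries++;
    pthread_mutex_unlock(&itab.lock);
}

void enqueue_task(void) {               /* independent of the inode lock */
    pthread_mutex_lock(&rq.lock);
    rq.length++;
    pthread_mutex_unlock(&rq.lock);
}

int main(void) {
    coarse_add_inode();
    add_inode();
    enqueue_task();
    printf("inodes=%d runq=%d\n", itab.entries, rq.length);
    return 0;
}
```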

  4. Large Shared Memory Multi-processor • Consists of many nodes, each of which may be a uni-processor or an SMP • Access to memory is often NUMA and sometimes does not even provide cache coherency • Performance is very poor with an off-the-shelf SMP OS • Requirements for good performance: • Locality of service to request • Independence between services

  5. DISCO • Uses Virtual Machine Monitors to run multiple commodity OSes on a scalable multi-processor • Virtual Machine Monitor • Additional layer between OS and hardware • Virtualizes processor, memory, I/O • OS unaware of virtualization (ideally) • Exports a simple general interface to the commodity OS

  6. DISCO Architecture • Figure: several commodity OSes (OS, SMP-OS, Thin OS) run on top of the DISCO virtual machine monitor, which runs on the processing elements (PEs) of a ccNUMA multiprocessor joined by an interconnect

  7. Implementation Details • Virtual CPUs • Uses direct execution on the real CPU • Fast: most instructions run at native speed • Must detect and emulate operations that cannot be safely exported to the VM • Primarily privileged instructions: TLB modification, direct physical memory or I/O operations • Must also keep a data structure to save registers and other state for when a virtual CPU is not scheduled on a real CPU • Virtual CPUs use affinity scheduling to maintain cache locality
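
A minimal C sketch of that trap-and-emulate path. All names here (vcpu_state, emulate_tlb_write, and so on) are hypothetical, not DISCO's code; the point is that most instructions run natively, and only the unsafe ones trap to the monitor, which emulates them against saved per-virtual-CPU state.

```c
#include <stdint.h>
#include <stdio.h>

/* Saved state for a virtual CPU while it is not scheduled on a real CPU
   (hypothetical layout; DISCO keeps registers, privileged state, etc.). */
struct vcpu_state {
    uint64_t regs[32];      /* general-purpose registers           */
    uint64_t pc;            /* program counter at the trap         */
    uint64_t tlb_entryhi;   /* privileged CP0-style registers      */
    uint64_t tlb_entrylo;
    int      home_node;     /* node used for affinity scheduling   */
};

enum trap_kind { TRAP_TLB_WRITE, TRAP_PHYS_IO, TRAP_OTHER_PRIV };

static void emulate_tlb_write(struct vcpu_state *v) {
    /* The monitor would translate the guest's physical address to a
       machine address here before installing the real TLB entry. */
    printf("emulating TLB write: hi=%llx lo=%llx\n",
           (unsigned long long)v->tlb_entryhi,
           (unsigned long long)v->tlb_entrylo);
}

static void emulate_phys_io(struct vcpu_state *v) {
    printf("emulating device access at pc=%llx\n",
           (unsigned long long)v->pc);
}

/* Trap handler: most instructions run natively; only the unsafe ones
   reach this dispatcher. */
void handle_privileged_trap(struct vcpu_state *v, enum trap_kind k) {
    switch (k) {
    case TRAP_TLB_WRITE: emulate_tlb_write(v); break;
    case TRAP_PHYS_IO:   emulate_phys_io(v);   break;
    default:             /* emulate remaining privileged ops */ break;
    }
    v->pc += 4;  /* skip the emulated instruction and resume the guest */
}

int main(void) {
    struct vcpu_state v = { .pc = 0x1000, .tlb_entryhi = 0x2000, .tlb_entrylo = 0x3 };
    handle_privileged_trap(&v, TRAP_TLB_WRITE);
    return 0;
}
```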

  8. Implementation Details • Virtual Physical Memory • Adds a level of address translation • Maintains physical-to-machine address mappings, because VMs use physical addresses that start at 0 and continue for the size of the VM's memory • Performed by emulating TLB instructions • When the OS tries to insert an entry into the TLB, DISCO intercepts it and inserts the translated version • TLB flushed on virtual-CPU switches • TLB misses also more expensive due to the required trap • A second-level software TLB is added to improve performance
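
A sketch of that extra translation level, again with made-up names and layout: the monitor keeps a per-VM physical-to-machine map and, when the guest OS tries to install a TLB entry keyed by a guest physical page, installs the translated machine page instead.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define VM_PAGES 1024        /* pages of "physical" memory given to the VM */

/* Per-VM physical-to-machine map: guest physical page -> machine page.
   Guest physical addresses start at 0 and run to the VM's memory size. */
struct pmap { uint64_t machine_page[VM_PAGES]; };

/* The guest OS attempted a TLB insert of vaddr -> guest_ppn; the monitor
   intercepts it and installs vaddr -> machine_ppn instead. */
int intercept_tlb_insert(struct pmap *pm, uint64_t vaddr, uint64_t guest_ppn,
                         uint64_t *out_machine_ppn) {
    if (guest_ppn >= VM_PAGES)
        return -1;                      /* outside the VM's address range */
    *out_machine_ppn = pm->machine_page[guest_ppn];
    printf("TLB insert: va=%llx guest_ppn=%llu -> machine_ppn=%llu\n",
           (unsigned long long)vaddr,
           (unsigned long long)guest_ppn,
           (unsigned long long)*out_machine_ppn);
    /* A real monitor would now write the hardware TLB entry and could also
       cache the translation in a second-level software TLB. */
    return 0;
}

int main(void) {
    static struct pmap pm;
    for (size_t i = 0; i < VM_PAGES; i++)
        pm.machine_page[i] = 5000 + i;  /* arbitrary machine pages for the demo */
    uint64_t mppn;
    intercept_tlb_insert(&pm, 0x7f0000, 42, &mppn);
    return 0;
}
```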

  9. Implementation Details • Virtual I/O • Intercepts all device accesses from the VM through special OS device drivers • Virtualizes both disk and network I/O • DISCO allows persistent disks and non-persistent disks • Persistent disks cannot be shared • Non-persistent disks implemented via copy-on-write
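
A rough sketch of how a non-persistent, copy-on-write disk could be represented (illustrative only; DISCO's actual structures differ): reads fall through to the shared base image until a VM writes a block, at which point that VM gets a private copy.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NUM_BLOCKS 256

/* One VM's view of a non-persistent disk: blocks default to the shared,
   read-only base image and are copied privately on the first write. */
struct cow_disk {
    const uint8_t *base;                        /* shared base image        */
    uint8_t       *private_block[NUM_BLOCKS];   /* NULL until first write   */
};

const uint8_t *cow_read(struct cow_disk *d, int blk) {
    return d->private_block[blk] ? d->private_block[blk]
                                 : d->base + (size_t)blk * BLOCK_SIZE;
}

void cow_write(struct cow_disk *d, int blk, const uint8_t *data) {
    if (!d->private_block[blk]) {
        /* First write: copy the shared block so other VMs are unaffected. */
        d->private_block[blk] = malloc(BLOCK_SIZE);
        memcpy(d->private_block[blk], d->base + (size_t)blk * BLOCK_SIZE,
               BLOCK_SIZE);
    }
    memcpy(d->private_block[blk], data, BLOCK_SIZE);
}

int main(void) {
    static uint8_t base_image[NUM_BLOCKS * BLOCK_SIZE];
    struct cow_disk vm1 = { .base = base_image };
    uint8_t buf[BLOCK_SIZE] = { 0xAB };
    cow_write(&vm1, 3, buf);    /* VM1 now owns a private copy of block 3 */
    (void)cow_read(&vm1, 3);
    return 0;
}
```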

  10. Why use a VMM? • DISCO is aware of NUMA-ness • Hides NUMA-ness from the commodity OS • Requires less work than engineering a NUMA-aware OS • Performs better than a NUMA-unaware OS • Good middle ground • How? • Dynamic page migration and page replication • Maintains locality between a virtual CPU and the memory pages on which its cache misses occur

  11. Memory Management • Pages heavily accessed by only one node are migrated to that node • Change the physical-to-machine address mapping • Invalidate TLB entries that point to the old location • Copy the page to the local node's memory • Pages that are heavily read-shared are replicated to the nodes heavily accessing them • Downgrade TLB entries pointing to the page to read-only • Copy the page • Update the relevant TLB entries to point to the local copy
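
The migration step might look roughly like the sketch below (toy allocator and TLB-shootdown stub; the real policy and shootdown mechanism are more involved): invalidate stale TLB entries, copy the data to memory local to the hot node, then repoint the physical-to-machine mapping.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define MAX_PAGES 64

static uint8_t machine_memory[MAX_PAGES][PAGE_SIZE];  /* toy machine memory */
static uint64_t next_free = 8;                         /* toy page allocator */

/* Stubs standing in for the monitor's real mechanisms. */
static void tlb_shootdown(uint64_t guest_ppn) {
    printf("invalidate TLB entries for guest page %llu\n",
           (unsigned long long)guest_ppn);
}
static uint64_t alloc_page_on_node(int node) {
    (void)node;            /* a real allocator would pick node-local memory */
    return next_free++;
}

struct pmap_entry { uint64_t machine_ppn; };

/* Migrate a page that one node accesses heavily: invalidate stale TLB
   entries, copy the data to node-local memory, repoint the mapping. */
static void migrate_page(struct pmap_entry *e, uint64_t guest_ppn, int hot_node) {
    uint64_t new_ppn = alloc_page_on_node(hot_node);
    tlb_shootdown(guest_ppn);
    memcpy(machine_memory[new_ppn], machine_memory[e->machine_ppn], PAGE_SIZE);
    e->machine_ppn = new_ppn;
}

int main(void) {
    struct pmap_entry e = { .machine_ppn = 2 };
    migrate_page(&e, 42, 1);
    printf("page now at machine page %llu\n", (unsigned long long)e.machine_ppn);
    return 0;
}
```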

  12. Page Replication

  13. Aren’t VMs memory inefficient? • Traditionally, VMs tend to replicate the memory used for each system image • Additionally, structures such as the disk cache are not shared • DISCO uses the notion of a global buffer cache to reduce the memory footprint

  14. Page sharing • DISCO keeps a data structure that maps disk sectors to memory addresses • If two VMs request the same disk sector, both are assigned the same read-only buffer page • Modifications to pages are performed via copy-on-write • Only works for non-persistent copy-on-write disks
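
A sketch of that mapping under assumed names: a global table from disk sector to machine page is consulted before issuing a disk read, so a second VM requesting the same sector is handed the page the first VM already brought in, mapped read-only.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_SECTORS 1024
#define NO_PAGE     UINT64_MAX

/* Global map: disk sector -> machine page already holding that sector. */
static uint64_t sector_to_page[NUM_SECTORS];

static uint64_t read_sector_from_disk(uint64_t sector) {
    /* Stub: pretend we allocated a fresh machine page and did the DMA. */
    static uint64_t next_page = 100;
    printf("disk read of sector %llu into page %llu\n",
           (unsigned long long)sector, (unsigned long long)next_page);
    return next_page++;
}

/* Called when a VM requests a disk sector: share if cached, else read. */
uint64_t request_sector(uint64_t sector) {
    if (sector_to_page[sector] != NO_PAGE) {
        /* Another VM already has this sector in memory: map the same page
           read-only; later writes are handled by copy-on-write. */
        return sector_to_page[sector];
    }
    uint64_t page = read_sector_from_disk(sector);
    sector_to_page[sector] = page;
    return page;
}

int main(void) {
    for (int i = 0; i < NUM_SECTORS; i++) sector_to_page[i] = NO_PAGE;
    uint64_t a = request_sector(7);    /* VM 1: real disk read        */
    uint64_t b = request_sector(7);    /* VM 2: shares the same page  */
    printf("VM1 page %llu, VM2 page %llu (shared)\n",
           (unsigned long long)a, (unsigned long long)b);
    return 0;
}
```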

  15. Page sharing

  16. Page sharing • Sharing is effective even for data transferred in network packets, e.g. when sharing data over NFS

  17. Virtualization overhead

  18. Data sharing

  19. Workload scalability

  20. Performance Benefits of Page Migration/Replication

  21. Tornado • OS designed to take advantage of shared-memory multi-processors • Object-oriented structure • Every virtual and physical resource is represented by an independent object • Ensures natural locality and independence • A resource's lock and data structures are stored on the same node as the resource • Resources are managed independently and at a fine grain • No global source of contention

  22. OO structure • Example: page fault • Separate File Cache Manager (FCM) objects for different regions of memory • COR = Cached Object Representative • All objects are specific to either the faulting process or the file(s) backing the process • Problem: hard to make global policies

  23. Clustered objects • Even with OO, widely shared objects can be expensive due to contention • Need replication, distribution, and partitioning to reduce contention • Clustered objects are a systematic way to do this • Gives the illusion of a single object, but is actually composed of multiple component (rep) objects • Each rep handles a subset of processors • Must handle consistency across reps (see the sketch below)
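
A plain-C sketch of the idea (Tornado itself is object-oriented; these names are invented): callers see one logical counter, but each processor's updates go to a local rep, and only the occasional combining read touches all reps.

```c
#include <stdio.h>

#define NUM_CPUS 4

/* One representative ("rep") per processor; each handles calls made from
   that processor, so updates stay local and uncontended. */
struct counter_rep { long local_count; };

/* The clustered object: one logical counter, NUM_CPUS reps behind it. */
struct clustered_counter { struct counter_rep rep[NUM_CPUS]; };

/* Calls are routed to the caller's local rep (no shared cache line). */
void counter_inc(struct clustered_counter *c, int cpu) {
    c->rep[cpu].local_count++;
}

/* Less frequent operations may need to combine state across reps. */
long counter_read(const struct clustered_counter *c) {
    long total = 0;
    for (int i = 0; i < NUM_CPUS; i++)
        total += c->rep[i].local_count;
    return total;
}

int main(void) {
    struct clustered_counter c = {0};
    counter_inc(&c, 0);
    counter_inc(&c, 2);
    counter_inc(&c, 2);
    printf("logical value = %ld\n", counter_read(&c));  /* prints 3 */
    return 0;
}
```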

  24. Clustered objects

  25. Clustered object implementation • Per-processor translation table • Contains a pointer to the local rep of each clustered object • Reps are created on demand via a combination of a global miss-handling object and a clustered-object-specific miss-handling object
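
A sketch of that lookup path with hypothetical structures: the per-processor table maps an object id to the local rep, and a miss calls the object's miss handler, which creates the rep on demand. Tornado splits miss handling between a global and an object-specific handler; the sketch collapses them into one for brevity.

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_CPUS 4
#define MAX_OBJS 16

struct rep { int owner_cpu; };            /* a trivially small rep        */

/* A clustered object carries its own miss handler so it can decide how
   reps are created (one per CPU, one per node, one shared, ...). */
struct clustered_obj {
    struct rep *(*miss_handler)(struct clustered_obj *self, int cpu);
};

/* Per-processor translation table: clustered object id -> local rep. */
static struct rep *xlate[NUM_CPUS][MAX_OBJS];

static struct rep *default_miss_handler(struct clustered_obj *self, int cpu) {
    (void)self;
    struct rep *r = malloc(sizeof *r);    /* create the rep on demand     */
    r->owner_cpu = cpu;
    return r;
}

/* Lookup path: hit in the local table, or call the miss handler once. */
struct rep *lookup_rep(struct clustered_obj *obj, int obj_id, int cpu) {
    if (!xlate[cpu][obj_id])
        xlate[cpu][obj_id] = obj->miss_handler(obj, cpu);
    return xlate[cpu][obj_id];
}

int main(void) {
    struct clustered_obj counter = { .miss_handler = default_miss_handler };
    struct rep *r0 = lookup_rep(&counter, 3, 0);   /* miss: rep created  */
    struct rep *r1 = lookup_rep(&counter, 3, 0);   /* hit: same rep      */
    printf("same rep on repeat lookup: %s\n", r0 == r1 ? "yes" : "no");
    return 0;
}
```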

  26. Memory Allocation • Need an efficient, highly concurrent allocator that maximizes locality • Uses local pools of memory • However, small-block allocation still has the problem of false sharing • An additional small pool of strictly local memory is used
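
A sketch of the idea under assumed names: each processor allocates from its own bump-pointer pool, and small blocks are rounded up to a cache line so blocks handed to different processors never share a line.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_CPUS    4
#define POOL_BYTES  (64 * 1024)
#define CACHE_LINE  64

/* One bump-pointer pool per processor; allocations from different CPUs
   come from different pools, preserving locality and avoiding contention
   on a global free list. */
struct cpu_pool {
    uint8_t buf[POOL_BYTES];
    size_t  used;
} pools[NUM_CPUS];

/* Small allocations are rounded up to a cache line so that blocks handed
   out from different pools never share a line (no false sharing). */
void *local_alloc(int cpu, size_t size) {
    size_t rounded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    struct cpu_pool *p = &pools[cpu];
    if (p->used + rounded > POOL_BYTES)
        return NULL;                    /* a real allocator would refill */
    void *block = &p->buf[p->used];
    p->used += rounded;
    return block;
}

int main(void) {
    void *a = local_alloc(0, 8);   /* tiny block, still gets its own line */
    void *b = local_alloc(1, 8);   /* different CPU, different pool       */
    printf("cpu0 block %p, cpu1 block %p\n", a, b);
    return 0;
}
```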

  27. Synchronization • Use of objects, and clustered objects in particular, reduces the scope of each lock and limits lock contention to a single rep • Existence guarantees are hard • A thread must determine whether an object is currently being de-allocated by another thread • This often requires a lock hierarchy whose root is a global lock • Tornado instead uses a semi-automatic garbage collector • A thread never needs to test for existence, so no locking is required
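
The flavor of that existence guarantee can be shown with a toy quiescence scheme (a generic sketch, not Tornado's actual collector): an object removed from all tables is freed only after every processor has passed a point where it can hold no temporary reference to it.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_CPUS 4

/* Toy deferred reclamation: an object unlinked from all tables is only
   freed once every processor has quiesced, i.e. all short-lived threads
   that started before the unlink have finished. */
struct deferred_obj {
    bool unlinked;
    bool cpu_quiesced[NUM_CPUS];
};

void mark_quiesced(struct deferred_obj *o, int cpu) {
    o->cpu_quiesced[cpu] = true;
}

bool safe_to_free(const struct deferred_obj *o) {
    if (!o->unlinked) return false;
    for (int i = 0; i < NUM_CPUS; i++)
        if (!o->cpu_quiesced[i]) return false;
    return true;              /* no thread can still hold a reference */
}

int main(void) {
    struct deferred_obj o = { .unlinked = true };
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) mark_quiesced(&o, cpu);
    printf("safe to free: %s\n", safe_to_free(&o) ? "yes" : "no");
    return 0;
}
```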

  28. Protected Procedure Calls • Since Tornado is a microkernel, IPC traffic is significant • Need a fast IPC mechanism that maintains locality • Protected Procedure Calls (PPC) maintain locality by: • Spawning a new server thread on the same processor as the client to service the client's request • Keeping all client-specific data in data structures stored on the client
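
A schematic sketch of the PPC shape (no real threads or protection domains, invented names): the request is serviced on the caller's own processor and the server keeps its state in per-client, per-processor structures, so calls from different processors never contend.

```c
#include <stdio.h>

#define NUM_CPUS 4

/* Per-client, per-processor state kept by the server: the handler never
   touches data belonging to other clients, so requests from different
   processors proceed in parallel with no shared locks. */
struct client_state {
    int client_id;
    int requests_handled;
};

static struct client_state per_cpu_client[NUM_CPUS];

/* "Server" side of the call: runs on the same processor as the client.
   In Tornado this would be a freshly created server thread crossing into
   the server's protection domain; here it is just a function call. */
static int server_handle(int cpu, int request) {
    struct client_state *cs = &per_cpu_client[cpu];
    cs->requests_handled++;
    return request * 2;                 /* stand-in for real service work */
}

/* Client side of a protected procedure call. */
int ppc_call(int cpu, int request) {
    /* Crossing into the server happens locally: the request is not shipped
       to a remote processor, so caches stay warm and no queues are shared. */
    return server_handle(cpu, request);
}

int main(void) {
    int r = ppc_call(1, 21);
    printf("reply = %d, handled on cpu 1: %d request(s)\n",
           r, per_cpu_client[1].requests_handled);
    return 0;
}
```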

  29. Protected Procedure Calls

  30. Performance • Comparison to other large shared-memory multi-processors

  31. Performance (n threads in 1 process)

  32. Performance (n threads in n process)

  33. Conclusion • Illustrated two different approaches to making efficient use of shared-memory multi-processors • DISCO adds an extra layer between the hardware and the OS • Less engineering effort, more overhead • Tornado redesigns the OS to take advantage of locality and independence • More engineering effort, less overhead, but local and independent algorithms may work poorly with real-world workloads
