This presentation discusses the implementation details and benefits of the DISCO architecture for supporting multi-processors, including virtual machine monitors; the virtualization of processors, memory, and I/O; and dynamic page migration and replication. It also covers Tornado, an object-oriented OS built around locality and independence, and explores the challenges of scalability and performance in multi-processor systems.
Supporting Multi-Processors
Bernard Wong
February 17, 2003
Uni-processor systems • Computing began with uni-processor systems • A uni-processor OS is simple to implement and allows for many assumptions • UMA, efficient locks (small impact on throughput), straightforward cache coherency • Hard to make faster
Small SMP systems • Multiple symmetric processors • Requires some modifications to the OS • Still allows for UMA • The system/memory bus becomes a contended resource • Locks have a larger impact on throughput • e.g. a lock held by one process can block another process (running on another processor) from making progress • Must introduce finer-grained locks to improve scalability • The system bus limits system size
Large Shared Memory Multi-processors • Consist of many nodes, each of which may be a uni-processor or an SMP • Memory access is often NUMA, and sometimes cache coherency is not even provided • Performance is very poor with an off-the-shelf SMP OS • Requirements for good performance: • Locality of service to request • Independence between services
DISCO • Uses a virtual machine monitor to run multiple commodity OSes on a scalable multi-processor • Virtual machine monitor • An additional layer between the OS and the hardware • Virtualizes processor, memory, and I/O • The OS is unaware of the virtualization (ideally) • Exports a simple, general interface to the commodity OS
DISCO Architecture • [Figure: several OS instances (commodity OSes, an SMP-OS, a thin specialized OS) run on top of DISCO, which runs on the processing elements (PEs) of a ccNUMA multiprocessor joined by an interconnect]
Implementation Details • Virtual CPUs • Use direct execution on the real CPU • Fast: most instructions run at native speed • Must detect and emulate operations that cannot be safely exported to the VM • Primarily privileged instructions: TLB modification, direct physical memory or I/O operations • Must also keep a data structure to save registers and other state • Used when a virtual CPU is not scheduled on a real CPU • Virtual CPUs use affinity scheduling to maintain cache locality (sketched below)
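Below is a minimal sketch, in C, of how trap-and-emulate for a virtual CPU could be organized. The struct layout, the `handle_privileged_trap` entry point, and the toy instruction decoders are all invented for illustration; they are not DISCO's actual data structures or MIPS decoding logic.

```c
/* Hypothetical sketch of trap-and-emulate for a virtual CPU.
 * All names are invented for illustration; this is not DISCO's code. */
#include <stdint.h>

#define NUM_GP_REGS 32                  /* MIPS general-purpose registers */

typedef struct vcpu {
    uint64_t gp_regs[NUM_GP_REGS];      /* saved while not scheduled      */
    uint64_t pc;                        /* program counter                */
    uint64_t cp0_status;                /* shadow of privileged CP0 state */
    int      last_real_cpu;             /* hint for affinity scheduling   */
} vcpu_t;

/* Toy decoders standing in for real MIPS instruction decoding. */
static int      is_tlb_write(uint32_t insn)    { return (insn >> 26) == 0x10; }
static int      is_status_write(uint32_t insn) { return (insn >> 26) == 0x11; }
static uint64_t trapped_operand(vcpu_t *v, uint32_t insn)
                { return v->gp_regs[(insn >> 16) & 0x1f]; }
static void     emulate_tlb_write(vcpu_t *v)   { (void)v; /* next slide */ }
static void     emulate_unsafe_op(vcpu_t *v, uint32_t insn) { (void)v; (void)insn; }

/* Most instructions run natively; privileged ones trap here and are
 * emulated against the vCPU's shadow state, not the real hardware. */
void handle_privileged_trap(vcpu_t *v, uint32_t insn) {
    if (is_tlb_write(insn))
        emulate_tlb_write(v);
    else if (is_status_write(insn))
        v->cp0_status = trapped_operand(v, insn);
    else
        emulate_unsafe_op(v, insn);     /* phys. memory / I/O accesses  */
    v->pc += 4;                         /* advance past the trapped insn */
}
```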
Implementation Details • Virtual Physical Memory • Adds a level of address translation • Maintains physical-to-machine address mappings • Needed because each VM uses physical addresses that start at 0 and continue up to the size of the VM's memory • Performed by emulating TLB instructions • When the OS tries to insert an entry into the TLB, DISCO intercepts it and inserts the translated version • The TLB is flushed on virtual CPU switches • TLB misses are also more expensive due to the required trap • A second-level software TLB is added to improve performance (sketched below)
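The extra translation level can be pictured as below: a per-VM pmap array rewrites the guest's "physical" frame number into a machine frame number on every intercepted TLB insert, with a second-level software TLB caching the result. All names (`vm_t`, `insert_guest_tlb_entry`, the sizes) are hypothetical.

```c
#include <stdint.h>

#define VM_PAGES      4096   /* pages in this VM's "physical" memory */
#define SOFT_TLB_SIZE 256    /* entries in the 2nd-level software TLB */

typedef struct vm {
    uint64_t pmap[VM_PAGES];           /* physical page -> machine page */
} vm_t;

typedef struct tlb_entry {
    uint64_t vpn;                      /* virtual page number  */
    uint64_t pfn;                      /* page frame number    */
    int      writable;
} tlb_entry_t;

static tlb_entry_t soft_tlb[SOFT_TLB_SIZE];

static void write_hw_tlb(const tlb_entry_t *e) { (void)e; /* MMU write */ }

/* Intercepted guest TLB write: the guest supplied a virtual-to-physical
 * entry; rewrite it as virtual-to-machine before it reaches hardware. */
int insert_guest_tlb_entry(vm_t *vm, tlb_entry_t *e) {
    if (e->pfn >= VM_PAGES)
        return -1;                         /* outside the VM's memory   */
    e->pfn = vm->pmap[e->pfn];             /* physical -> machine       */
    soft_tlb[e->vpn % SOFT_TLB_SIZE] = *e; /* cache for cheaper refills */
    write_hw_tlb(e);
    return 0;
}
```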
Implementation Details • Virtual I/O • Intercepts all device accesses from the VM through special OS device drivers • Virtualizes both disk and network I/O • DISCO provides both persistent and non-persistent disks • Persistent disks cannot be shared • Non-persistent disks are implemented via copy-on-write (sketched below)
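A non-persistent copy-on-write disk can be sketched as follows: reads fall through to a shared, read-only base image until a sector is written, at which point that sector gets a private copy. This is an illustrative toy, not DISCO's disk code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NUM_SECTORS 1024

typedef struct cow_disk {
    const uint8_t *base;                 /* read-only base image (shared) */
    uint8_t *private_copy[NUM_SECTORS];  /* NULL until first write        */
} cow_disk_t;

/* Reads come from the private copy if one exists, else the base image. */
void cow_read(cow_disk_t *d, uint32_t sector, uint8_t *buf) {
    const uint8_t *src = d->private_copy[sector]
        ? d->private_copy[sector]
        : d->base + (size_t)sector * SECTOR_SIZE;
    memcpy(buf, src, SECTOR_SIZE);
}

/* The first write to a sector breaks sharing by allocating a copy. */
void cow_write(cow_disk_t *d, uint32_t sector, const uint8_t *buf) {
    if (!d->private_copy[sector]) {
        d->private_copy[sector] = malloc(SECTOR_SIZE);
        if (!d->private_copy[sector]) return;  /* allocation failed */
    }
    memcpy(d->private_copy[sector], buf, SECTOR_SIZE);
}
```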
Why use a VMM? • DISCO is aware of the NUMA-ness • Hides the NUMA-ness from the commodity OS • Requires less work than engineering a NUMA-aware OS • Performs better than a NUMA-unaware OS • A good middle ground • How? • Dynamic page migration and page replication • Maintains locality between a virtual CPU's cache misses and the memory pages on which those misses occur
Memory Management • Pages heavily accessed by only one node are migrated to that node • Change the physical-to-machine address mapping • Invalidate TLB entries that point to the old location • Copy the page to the local node's memory • Pages that are heavily read-shared are replicated to the nodes accessing them most heavily • Downgrade TLB entries pointing to the page to read-only • Copy the page • Update the relevant TLB entries to point at the local replica (entries stay read-only so later writes can be caught) • The migration path is sketched below
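The migration path might look roughly like this in C; the shootdown, page-array, and allocator helpers are trivial stand-ins for real machine operations, and all names are invented.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE     4096
#define MACHINE_PAGES 64

/* Toy stand-ins for the real machine: a flat page array, a no-op TLB
 * shootdown, and a trivial per-node allocator. */
static uint8_t  machine_mem[MACHINE_PAGES][PAGE_SIZE];
static uint64_t next_free = MACHINE_PAGES / 2;

static void     invalidate_tlb_entries_for(uint64_t mpage) { (void)mpage; }
static uint8_t *page_addr(uint64_t mpage) { return machine_mem[mpage]; }
static uint64_t alloc_page_on_node(int node) { (void)node; return next_free++; }

/* Migrate one of a VM's "physical" pages to the node that uses it most:
 * shoot down stale mappings, copy, then update the pmap entry so later
 * TLB refills install the local copy. */
void migrate_page(uint64_t *pmap_entry, int hot_node) {
    uint64_t old_mpage = *pmap_entry;
    uint64_t new_mpage = alloc_page_on_node(hot_node);

    invalidate_tlb_entries_for(old_mpage);  /* no one may touch old copy */
    memcpy(page_addr(new_mpage), page_addr(old_mpage), PAGE_SIZE);
    *pmap_entry = new_mpage;                /* physical -> new machine   */
}
```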
Aren’t VMs memory inefficient? • Traditionally, VMs tend to replicate the memory used for each system image • Additionally, structures such as the disk cache are not shared • DISCO uses the notion of a global buffer cache to reduce the memory footprint
Page sharing • DISCO keeps a data structure that maps disk sectors to memory pages • If two VMs request the same disk sector, both are given the same read-only buffer page • Modifications to pages are performed via copy-on-write • Only works for non-persistent copy-on-write disks (see the sketch below)
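One plausible shape for that sector-to-page map is a small hash table: the first VM to read a sector pulls it into a fresh page, and every later request for the same sector gets the existing read-only page. Names and sizes are invented.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_BUCKETS 1024
#define PAGE_SIZE     4096

typedef struct buf_entry {
    uint64_t          sector;
    void             *page;        /* shared page, mapped read-only */
    struct buf_entry *next;
} buf_entry_t;

static buf_entry_t *cache[CACHE_BUCKETS];

/* Stand-in for actually reading the sector from the backing disk. */
static void *read_sector_into_new_page(uint64_t sector) {
    (void)sector;
    return calloc(1, PAGE_SIZE);
}

/* Return the shared page for a sector, loading it on first request.
 * A second VM asking for the same sector gets the existing page. */
void *lookup_shared_page(uint64_t sector) {
    buf_entry_t **bucket = &cache[sector % CACHE_BUCKETS];
    for (buf_entry_t *e = *bucket; e; e = e->next)
        if (e->sector == sector)
            return e->page;        /* reuse: no second copy in memory */

    buf_entry_t *e = malloc(sizeof *e);
    if (!e) return NULL;
    e->sector = sector;
    e->page   = read_sector_into_new_page(sector);
    e->next   = *bucket;
    *bucket   = e;
    return e->page;                /* later writes go copy-on-write */
}
```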
Page sharing • Sharing is effective even when data moves between VMs in network packets, e.g. over NFS: the transferred pages are shared rather than copied
Tornado • An OS designed to take advantage of shared-memory multi-processors • Object-oriented structure • Every virtual and physical resource is represented by an independent object • Ensures natural locality and independence • A resource's lock and data structures are stored on the same node as the resource • Resources are managed independently and at a fine grain • No global source of contention
OO structure • Example: page fault • A separate File Cache Manager (FCM) object exists for each region of memory • COR -> Cached Object Representative • All objects are specific to either the faulting process or the file(s) backing the process • Problem: hard to implement global policies • (the fault-dispatch path is sketched below)
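A hypothetical sketch of the per-region dispatch: page faults are routed to the FCM owning the faulting address, so faults on different regions contend on different objects. The `region_t`/`fcm_t` types are illustrative only, not Tornado's interfaces.

```c
#include <stdint.h>
#include <stddef.h>

/* Each region of a process's address space has its own File Cache
 * Manager, so faults on different regions touch different objects. */
typedef struct fcm {
    int (*handle_fault)(struct fcm *self, uintptr_t addr);
} fcm_t;

typedef struct region {
    uintptr_t start, end;              /* [start, end) address range  */
    fcm_t    *fcm;                     /* the FCM backing this region */
} region_t;

/* Route a fault to the owning region's FCM; no global lock is taken,
 * which gives locality, but is also why global policy is hard. */
int page_fault(region_t *regions, size_t n, uintptr_t addr) {
    for (size_t i = 0; i < n; i++)
        if (addr >= regions[i].start && addr < regions[i].end)
            return regions[i].fcm->handle_fault(regions[i].fcm, addr);
    return -1;                         /* unmapped: segmentation fault */
}
```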
Clustered objects • Even with an OO structure, widely shared objects can be expensive due to contention • Need replication, distribution, and partitioning to reduce contention • Clustered objects are a systematic way to do this • Give the illusion of a single object, but are actually composed of multiple component (rep) objects • Each component handles a subset of the processors • Must handle consistency across reps
Clustered object implementation • Per-processor translation table • Contains a pointer to the local rep of each clustered object • Entries are created on demand via a combination of a global miss-handling object and a clustered-object-specific miss handler (see the sketch below)
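The translation-table mechanism could look like the sketch below: dereferencing a clustered object consults the current processor's table and, on a miss, asks the object's own miss handler to produce (or pick) a rep for this CPU. Everything here is invented for illustration.

```c
#define MAX_CPUS  64
#define MAX_COBJS 256

typedef struct rep rep_t;              /* one per-processor representative */

typedef struct clustered_obj {
    /* object-specific miss handler: create a rep for this CPU, or point
     * several CPUs at one shared rep */
    rep_t *(*miss_handler)(struct clustered_obj *self, int cpu);
} cobj_t;

/* Per-processor translation table: one pointer per clustered object. */
static rep_t *xlate[MAX_CPUS][MAX_COBJS];

/* Dereference a clustered object on a given CPU; the caller sees the
 * illusion of a single object. */
rep_t *cobj_deref(cobj_t *objs[], int obj_id, int cpu) {
    rep_t *r = xlate[cpu][obj_id];
    if (!r) {                          /* global miss-handling path */
        r = objs[obj_id]->miss_handler(objs[obj_id], cpu);
        xlate[cpu][obj_id] = r;        /* slot filled on demand     */
    }
    return r;
}
```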
Memory Allocation • Need an efficient, highly concurrent allocator that maximizes locality • Uses local pools of memory • However, small block allocations still suffer from false sharing • An additional small pool of strictly local memory is used (sketched below)
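A minimal sketch of the two ideas combined, assuming per-processor free lists and blocks padded to a cache line so that allocations handed out on different CPUs can never share a line. The block size and pool shape are assumptions, not Tornado's allocator.

```c
#include <stdlib.h>

#define CACHE_LINE 128                 /* padded block size (bytes) */
#define NUM_CPUS   64

/* Each block is padded to a full cache line, so two blocks handed to
 * different CPUs can never occupy the same line (no false sharing). */
typedef union block { union block *next; char pad[CACHE_LINE]; } block_t;

static block_t *free_list[NUM_CPUS];   /* one strictly local pool per CPU */

void *small_alloc(int cpu) {
    block_t *b = free_list[cpu];
    if (b) {                           /* fast path: no locks, no sharing */
        free_list[cpu] = b->next;
        return b;
    }
    /* refill from this CPU's local memory; aligned_alloc is a stand-in */
    return aligned_alloc(CACHE_LINE, sizeof(block_t));
}

void small_free(int cpu, void *p) {    /* return block to the owner pool */
    block_t *b = p;
    b->next = free_list[cpu];
    free_list[cpu] = b;
}
```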
Synchronization • The use of objects, and further of clustered objects, reduces the scope of each lock and limits lock contention to a single rep • Existence guarantees are hard • A thread must determine whether an object is currently being de-allocated by another thread • This often requires a lock hierarchy whose root is a global lock • Tornado instead uses a semi-automatic garbage collector • A thread never needs to test for existence, so no locking is required (sketched below)
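One way to picture the existence guarantee is per-CPU reference counting with deferred destruction, as sketched below. This toy spins while draining references and still has the classic reclamation race that a real system avoids by deferring the free until every processor passes a quiescent point; it illustrates only the bookkeeping, not Tornado's actual garbage collector.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NUM_CPUS 64

typedef struct gc_obj {
    atomic_int  refs[NUM_CPUS];        /* per-CPU temporary references  */
    atomic_bool dying;                 /* set once unlinked from tables */
} gc_obj_t;

/* Callers bracket every use; nobody ever takes an existence lock. */
bool obj_enter(gc_obj_t *o, int cpu) {
    atomic_fetch_add(&o->refs[cpu], 1);
    if (atomic_load(&o->dying)) {      /* raced with destruction: back off */
        atomic_fetch_sub(&o->refs[cpu], 1);
        return false;
    }
    return true;
}

void obj_exit(gc_obj_t *o, int cpu) {
    atomic_fetch_sub(&o->refs[cpu], 1);
}

/* Destruction: unlink the object from every lookup table first (so no
 * new references can start), then free once all references drain. */
void obj_destroy(gc_obj_t *o) {
    atomic_store(&o->dying, true);
    for (int c = 0; c < NUM_CPUS; c++)
        while (atomic_load(&o->refs[c]) != 0)
            ;                          /* a real system defers instead */
    free(o);
}
```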
Protected Procedure Calls • Since Tornado is a microkernel, IPC traffic is significant • Need a fast IPC mechanism that maintains locality • Protected Procedure Calls (PPCs) maintain locality by: • Spawning a new server thread on the same processor as the client to service the client's request • Keeping all client-specific data in data structures stored on the client's side (sketched below)
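A toy model of a PPC's locality properties: the call runs on the caller's own processor (here, simply inline) and per-client bookkeeping stays in client-side state. The interface is invented for illustration, not Tornado's API.

```c
#include <stdio.h>

/* Per-client state, kept on the client's side of the call. */
typedef struct client_state { int calls_made; } client_state_t;

typedef long (*ppc_handler_t)(client_state_t *cs, long arg);

/* A PPC behaves like a kernel call: a server thread is spawned on the
 * *current* processor, so neither code nor data migrate across CPUs.
 * This toy version just runs the handler inline to model that. */
long ppc_call(ppc_handler_t handler, client_state_t *cs, long arg) {
    cs->calls_made++;                  /* client-local bookkeeping    */
    return handler(cs, arg);           /* "server thread" on same CPU */
}

static long echo_service(client_state_t *cs, long arg) {
    (void)cs;
    return arg;                        /* trivial stand-in for a server */
}

int main(void) {
    client_state_t cs = {0};
    printf("%ld (%d calls)\n", ppc_call(echo_service, &cs, 42), cs.calls_made);
    return 0;
}
```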
Performance • Comparison to other large shared-memory multi-processors
Conclusion • Illustrated two different approaches to making efficient use of shared-memory multi-processors • DISCO adds an extra layer between the hardware and the OS • Less engineering effort, more overhead • Tornado redesigns the OS to take advantage of locality and independence • More engineering effort and less overhead, but local, independent algorithms may work poorly with real-world loads