Kernel-Kernel Communication in a Shared-Memory Multiprocessor
By ELISEU M. CHAVES,* JR., PRAKASH CH. DAS,* THOMAS J. LEBLANC, BRIAN D. MARSH* AND MICHAEL L. SCOTT
Introduction
• Computer architecture can influence multiprocessor operating system design
  • How to distribute kernel functionality across processors
  • Inter-kernel communication
  • Data structure and location
  • How to maintain data integrity and provide fast access
• Some examples follow...
Introduction
• Bus-based shared-memory multiprocessors
  • Kernel code is shared by all processors
  • Typically have uniform memory access (UMA)
  • Use explicit synchronization to control access to data structures (critical sections, locks, spin locks, semaphores); see the spin-lock sketch below
  • Advantages
    • Easy to port an existing uniprocessor kernel and add explicit synchronization
    • Decent performance
    • Simplified load balancing and global resource management
  • Disadvantages
    • Good for smaller machines but not very scalable
    • Increased contention leads to performance degradation
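The explicit synchronization mentioned above is typically a short critical section guarded by a spin lock. Below is a minimal C11 sketch; the type and function names are illustrative assumptions, not the API of any particular kernel.

/* Minimal test-and-set spin lock, the kind of explicit synchronization a
 * shared-memory kernel uses around a shared data structure.
 * Illustrative sketch only; names are not from any actual kernel. */
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)
{
    /* Busy-wait until the flag was previously clear (test-and-set). */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Usage: guard the critical section that touches shared kernel data. */
static spinlock_t runqueue_lock = SPINLOCK_INIT;   /* hypothetical structure */

void example_critical_section(void)
{
    spin_lock(&runqueue_lock);
    /* ... manipulate the shared data structure ... */
    spin_unlock(&runqueue_lock);
}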
Introduction
• Distributed systems and distributed-memory computers
  • Kernel data is distributed among processors
  • Each processor executes a copy of the kernel
  • Local data is operated on directly; remote data is accessed via remote invocation (see the message-passing sketch below)
  • Uses implicit synchronization to control access to data
  • Examples: hypercubes and mesh-connected machines
  • Advantages
    • Works well for machines with no shared memory
    • Modular design gives increased fault protection
    • Can use non-preemption for the bulk of synchronization
  • Disadvantages
    • Reliance on a communication network to send invocations
    • Security of data while in the network
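A sketch of what the remote-invocation path looks like on such a machine follows. The message layout and the net_receive/net_send/kk_dispatch primitives are hypothetical stand-ins for whatever the interconnect and kernel actually provide.

/* Sketch of message-based kernel-kernel invocation on a distributed-memory
 * machine. Message layout and primitives are hypothetical. */
#include <stddef.h>
#include <stdint.h>

struct kk_request {
    uint32_t op;         /* which kernel operation to perform */
    uint32_t src_node;   /* node to send the reply to         */
    uint64_t arg[4];     /* arguments passed by value         */
};

/* Assumed primitives supplied by the interconnect / kernel. */
extern void     net_receive(void *buf, size_t len);
extern void     net_send(uint32_t node, const void *buf, size_t len);
extern uint64_t kk_dispatch(uint32_t op, const uint64_t *arg);

/* Per-node server loop: local data is touched only by the local kernel,
 * so most operations need no explicit locks (implicit synchronization
 * through non-preemption and the ordering of incoming messages). */
void kk_server_loop(void)
{
    for (;;) {
        struct kk_request req;
        net_receive(&req, sizeof req);
        uint64_t result = kk_dispatch(req.op, req.arg);
        net_send(req.src_node, &result, sizeof result);
    }
}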
Large-Scale Shared-Memory Multiprocessors
• Memory can be accessed directly, but at varying cost (NUMA)
  • Some have coherent caches between processors (ccNUMA)
• Large-scale shared-memory multiprocessors have properties in common with both
  • Bus-based systems, which use remote access
  • Distributed-memory multicomputers, which use remote invocation
• Kernel designers must choose between remote ACCESS and remote INVOCATION
• The trade-offs between remote access and remote invocation remain largely unknown (circa 1993)
Target Parameters of this Research
• General-purpose parallel operating system
• Shared-memory access from all processors
• Full range of kernel services
• Reasonable use of all processors
• Arrange the system as a collection of nodes
  • Possibly with caches?
• NUMA architecture
• No cache coherence
Kernel-Kernel Communication Options
• Remote access
  • Execution stays on node i, reading and writing node j's memory when necessary
• Remote invocation
  • The processor at node i sends a message to node j's processor requesting an operation on i's behalf
• Bulk data transfer
  • Move the required data from node j to node i
  • Needs a specific request to the kernel
  • Re-map memory from node j to node i (data migration)
  • Cache coherence blurs the distinction with remote access
Types of Remote Invocation [1/2]
• Interrupt level
  • The request executes in the interrupt handler ("bottom half")
  • Does not execute in the context of a process, so it:
    • Can't sleep
    • Can't call schedule()
    • Can't transfer data to/from user space
  • Interrupts are enabled during execution so the top half can continue handling other interrupts
  • Mutual exclusion is achieved by masking interrupts
    • Coarse granularity can unnecessarily prohibit too many other invocations
  • Outgoing invocations are prohibited while incoming invocations are disabled (to avoid deadlock)
    • Limiting when accessing data on two different processors
  • Yes, lots of limitations, but very fast! (see the sketch below)
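A rough sketch of the interrupt-level path is below. The request queue, the IPI primitive, and the splhigh()/splx() interrupt-masking calls (BSD-style) are assumptions for illustration, not the Psyche interface.

/* Sketch of interrupt-level remote invocation. Queue, IPI, and
 * splhigh()/splx() interrupt-masking primitives are assumed. */

struct ri_request {
    void (*fn)(void *);   /* operation to run on the target node */
    void *arg;
};

/* Assumed primitives. */
extern void               ri_enqueue(int node, struct ri_request *req);
extern struct ri_request *ri_dequeue(int node);
extern void               send_ipi(int node);
extern int                this_node(void);
extern int                splhigh(void);
extern void               splx(int s);

/* Caller side: post the request and interrupt the target processor. */
void ri_interrupt_level(int target_node, struct ri_request *req)
{
    ri_enqueue(target_node, req);
    send_ipi(target_node);
}

/* Target side: runs in the interrupt handler ("bottom half"), with no
 * process context, so fn() must not sleep, schedule, or touch user space. */
void ri_interrupt_handler(void)
{
    struct ri_request *req;
    while ((req = ri_dequeue(this_node())) != NULL)
        req->fn(req->arg);
}

/* Local code on the target node gets mutual exclusion with the handler by
 * masking interrupts around its critical section (coarse-grained). */
void local_update(void)
{
    int s = splhigh();
    /* ... modify the shared kernel data structure ... */
    splx(s);
}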
Types of Remote Invocation [2/2]
• Kernel process level
  • Normal "top half" activity
  • From user space: the interrupt handler performs an asynchronous trap and the remote invocation executes
  • From kernel space: the interrupt handler queues the invocation for execution when the kernel becomes idle or when control returns to user space
  • Has a process context, so the requested operation is free to block during its execution
  • A process holding semaphores or scheduler-based locks can perform process-level RIs
  • Deadlock is still possible, but can be avoided
  • Requires the requesting process to block when making the RI (see the sketch below)
    • Busy-waiting would lock out other incoming process-level RIs
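The process-level path can be pictured as follows. The semaphore and queueing primitives are illustrative assumptions; the essential point is only that the requester blocks instead of busy-waiting.

/* Sketch of process-level remote invocation: the requester blocks on a
 * semaphore until a kernel process on the target node has run the
 * operation in full process context. All primitives are assumed. */

struct semaphore { int count; };   /* placeholder type */

struct plevel_request {
    void (*fn)(void *);    /* operation; may block, since it has a process context */
    void *arg;
    struct semaphore done; /* the requesting process sleeps here                   */
};

/* Assumed kernel primitives. */
extern void sem_init(struct semaphore *s, int count);
extern void sem_wait(struct semaphore *s);    /* blocks the calling process */
extern void sem_signal(struct semaphore *s);
extern void plevel_enqueue(int node, struct plevel_request *req);
extern void send_ipi(int node);

/* Caller side: block rather than busy-wait, so other incoming
 * process-level RIs are not locked out on this node. */
void ri_process_level(int target_node, struct plevel_request *req)
{
    sem_init(&req->done, 0);
    plevel_enqueue(target_node, req);
    send_ipi(target_node);        /* target queues it for a kernel process */
    sem_wait(&req->done);
}

/* Run later by a kernel process on the target node. */
void plevel_execute(struct plevel_request *req)
{
    req->fn(req->arg);
    sem_signal(&req->done);
}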
Coexistence Is Possible
• Remote access and interrupt-level RI can be used on the same data structure
• Deadlock-avoidance rule: it must always be possible to perform incoming invocations while waiting for an outgoing invocation (sketched below)
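One way to honor that rule, sketched with the same assumed primitives as the interrupt-level example above:

/* Sketch of the deadlock-avoidance rule: while waiting for an outgoing
 * invocation to complete, keep serving incoming ones, so two nodes that
 * invoke each other simultaneously cannot deadlock. */
extern void ri_interrupt_handler(void);   /* drains this node's queue (assumed) */

void ri_wait_for_reply(volatile int *done)
{
    while (!*done) {
        /* Incoming invocations must remain possible here: either leave the
         * RI interrupt unmasked, or poll and drain the queue explicitly. */
        ri_interrupt_handler();
    }
}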
Trade-Offs
• Direct costs of remote operations
  • Latency of the operation: compare (R - 1) * n against C
    • R: ratio of remote to local access time
    • n: number of memory accesses made by the operation
    • C: overhead of a remote invocation
  • If C is greater, implement the operation using remote access
  • Notes:
    • In general C is fixed, so use RI for longer operations
    • Parameter passing is not included in the calculation
  • A small break-even calculation is sketched below
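To make the inequality concrete, here is a back-of-the-envelope calculation in C. The timing figures are the Butterfly numbers quoted later in the case study; working in microseconds, so that (R - 1) * n becomes n * (remote - local), is an assumption about the units.

/* Break-even between remote access and remote invocation.
 * Remote access costs roughly R times a local reference, so an operation
 * with n references pays n * (remote - local) extra when run remotely;
 * remote invocation pays a fixed overhead C instead. Numbers are the
 * BBN Butterfly figures quoted later in the talk. */
#include <stdio.h>

int main(void)
{
    double remote = 6.88;   /* us, remote 32-bit read         */
    double local  = 0.518;  /* us, local 32-bit read          */
    double C      = 56.0;   /* us, interrupt-level RI latency */

    double extra_per_ref = remote - local;     /* ~ (R - 1) * t_local          */
    double breakeven     = C / extra_per_ref;  /* n at which RI starts to win  */

    printf("break-even at about %.1f remote references\n", breakeven);
    /* With the rounded figures used later in the talk (6 us per reference,
     * 60 us RI overhead) this comes out near 10, i.e. an interrupt-level RI
     * starts to pay off at roughly 11 remote references. */
    return 0;
}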
Trade-Offs
• Indirect costs for local operations
  • RI may need the invoker's context as a parameter
  • Operations arranged to increase node locality?
  • Accesses between the invoker and target processors not interleaved?
  • Using process-level RI for all remote accesses may allow data structures to be implemented without explicit synchronization
    • Uses the lack of preemption within the kernel to provide implicit synchronization
• Avoiding explicit synchronization
  • Can improve the speed of both remote and local operations, because explicit synchronization is expensive
• Remote memory access may require that large portions of the kernel data space on other processors be mapped into each instance of the kernel
  • May not scale well to large machines
  • Mapping on demand is not a good option
Trade-Offs
• Competition for processor and memory cycles
  • Where operations serialize
    • Remote invocation: at the processor that executes them; operations serialize whether or not they touch common data
    • Remote memory access: at the memory; more chance of parallel execution, because operations do more than just access data
    • If operations compete for locked data, they may serialize anyway
• Tolerance for contention
  • Remote memory access has slightly higher throughput because of the possibility of parallel operation
  • However, this can introduce greater variance in completion time
Trade-Offs
• Compatibility with the conceptual kernel model
• Procedure-based kernel
  • Gives a uniform model for user and kernel processes
  • Closely mimics the hardware of a UMA multiprocessor
  • Minimizes context switching
  • Fits nicely with remote memory access
• Message-based kernel
  • Compartmentalization of the kernel
  • Synchronization is subsumed by message passing
  • Closely mimics the hardware of a distributed-memory multicomputer
  • Easier to debug
  • Remote invocation is more appropriate
Case Study: Psyche on the BBN Butterfly
• Roughly 12:1 remote-to-local memory access time ratio
• 6.88 µs: read a 32-bit word from remote memory
• 0.518 µs: read a 32-bit word from local memory
• 4.27 µs: write a 32-bit word to remote memory
• 0.398 µs: write a 32-bit word to local memory
• 56 µs: average latency of an interrupt-level remote invocation
• 421 µs: average latency of a process-level remote invocation
Latency of Kernel Operations
Times in the columns represent the cost of operations:
• Columns 1 & 3: measurements for the original Psyche kernel
• Column 2: synchronization without context switching
• Column 4: operation subsumed in another operation with coarse-grained locking
• Column 6: cost if data were always accessed via remote invocation (no explicit synchronization required)
• Column 5: cost of remote invocations in a hybrid kernel
% Latency from Locking
• Cost of synchronization
• Measured with locking off vs. locking on
Remote Access Penalty
• NL (no locking): (RA(off) - local(off)) / RA(off)
• L (locking): (RA(off) - local(off)) / RA(on)
• The overhead of explicit synchronization and remote access is a function of operation complexity
• Example
  • An interrupt-level RI can be justified to avoid about 11 remote references
  • Assuming a ~6 µs remote memory access cost and a ~60 µs overhead for an interrupt-level RI
RI Time as a Percentage of RA Time
• RA: locking in the case of remote access
  • Column 6 as a % of column 3
• B: locking in both cases
  • Column 5 as a % of column 3
  • Simulates a hybrid kernel
• N: locking in neither case
  • Column 6 as a % of column 4
  • Simulates the desired operation being subsumed by another operation with coarse-grained locking
Not So Fast
• Not such a clear win for remote invocation
  • The experiments represent extreme cases:
    • Simple enough to perform via interrupt-level RI
    • Complex enough to absorb the overhead of process-level RI
  • Medium-sized operations might not accept the limitations of interrupt-level RI or the overhead of process-level RI
• Throughput for tiny operations
  • Remote memory accesses steal bus and memory cycles from the processor whose memory is being used
  • Remote invocations steal the entire processor for the entire operation
  • Experiments show that interrupt-level RI hurts the throughput of the remote processor, while remote access does not
Conclusions
• Good kernel design must consider
  • The cost of the remote invocation mechanism
  • The cost of atomic operations and synchronization
  • The ratio of remote to local access time
• Break-even points depend on the number of remote references in any given operation
• Interrupt-level RI has a small number of uses
  • TLB shootdown
  • Console I/O
• The results could apply to ccNUMA machines as well
• The risks of putting data structures in the interrupt level of the kernel might be a reason to use remote access instead for small operations