Practical Concerns for Scalable Synchronization Jonathan Walpole (PSU) Paul McKenney (IBM) Tom Hart (University of Toronto)
The problem • “i++” is dangerous if “i” is global • [Animation: CPU 0 and CPU 1 each execute load r1,i / inc r1 / store r1,i on the same variable i. Both CPUs load the same initial value, both increment it to i+1 in their registers, and both store i+1 back, so one of the two increments is lost.]
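A minimal user-space sketch of this lost update, assuming GCC and pthreads (the counter name, thread count, and iteration count are arbitrary; "volatile" merely keeps the counter out of a register so the load/inc/store sequence is visible):

    /* race.c - two threads increment a shared counter with no synchronization.
     * Build: gcc -O2 -pthread race.c
     * The final count usually falls well short of 2000000 because the
     * load/inc/store sequences of the two threads interleave. */
    #include <pthread.h>
    #include <stdio.h>

    static volatile long counter;            /* the shared, global "i" */

    static void *incrementer(void *arg)
    {
        for (long n = 0; n < 1000000; n++)
            counter++;                       /* load r1,i ; inc r1 ; store r1,i */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;

        pthread_create(&t0, NULL, incrementer, NULL);
        pthread_create(&t1, NULL, incrementer, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }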
The solution – critical sections • Classic multiprocessor solution: spinlocks • CPU 1 waits for CPU 0 to release the lock • Counts are accurate, but locks have overhead! spin_lock(&mylock); i++; spin_unlock(&mylock);
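A hedged kernel-style sketch of the same counter with the lock declaration included (the lock and counter names are illustrative):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(mylock);      /* one global lock guarding the counter */
    static long i;

    void safe_increment(void)
    {
        spin_lock(&mylock);              /* CPU 1 spins here while CPU 0 holds the lock */
        i++;                             /* counts are now accurate ...                 */
        spin_unlock(&mylock);            /* ... but every call pays the lock overhead   */
    }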
Critical-section efficiency • Lock acquisition time: Ta • Critical section time: Tc • Lock release time: Tr • Critical-section efficiency = Tc / (Ta + Tc + Tr) • Ignoring lock contention and cache conflicts in the critical section
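A rough worked example, assuming a 100-instruction critical section and using the POWER4 local-lock round-trip cost quoted later in this talk (about 1058 normal-instruction times for Ta + Tr): efficiency ≈ 100 / (100 + 1058) ≈ 8.6%, so more than 90% of the time is synchronization overhead.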
Critical section efficiency • [Graph: critical-section efficiency as a function of critical section size]
Questions • Have synchronization instructions got faster? • Relative to normal instructions? • In absolute terms? • What are the implications of this for the performance of operating systems? • Can we fix this problem by adding more CPUs?
What’s going on? • Taller memory hierarchies • Memory speeds have not kept up with CPU speeds • 1984: no caches needed, since instructions were slower than memory accesses • 2005: 3-4 level cache hierarchies, since instructions are orders of magnitude faster than memory accesses
Why does this matter? • Synchronization implies sharing data across CPUs • normal instructions tend to hit in top-level cache • synchronization operations tend to miss • Synchronization requires a consistent view of data • between cache and memory • across multiple CPUs • requires CPU-CPU communication • Synchronization instructions see memory latency!
… but that’s not all! • Longer pipelines • 1984: Many clock cycles per instruction • 2005: Many instructions per clock cycle • 20-stage pipelines • Out of order execution • Keeps the pipelines full • Must not reorder the critical section before its lock! • Synchronization instructions stall the pipeline!
Reordering means weak memory consistency • Memory barriers – additional synchronization instructions are needed to manage reordering
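A hedged kernel-style fragment showing the classic barrier pairing (the variable names are illustrative; smp_wmb() and smp_rmb() are the Linux SMP barrier primitives):

    static int data;
    static int ready;

    /* Writer: publish the data, then set the flag. */
    void publish(int value)
    {
        data = value;
        smp_wmb();            /* order the store to data before the store to ready */
        ready = 1;
    }

    /* Reader: check the flag, then read the data. */
    int consume(void)
    {
        if (!ready)
            return -1;
        smp_rmb();            /* order the load of ready before the load of data */
        return data;
    }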
What is the cost of all this?
Instruction cost, normalized to a normal instruction:

  Operation                                               1.45 GHz IBM POWER4   3.06 GHz Intel Xeon
  Normal Instruction                                                      1.0                   1.0
  Atomic Increment                                                      183.1                 402.3
  SMP Write Memory Barrier                                              328.6                   0.0
  Read Memory Barrier                                                   328.9                 402.3
  Write Memory Barrier                                                  400.9                   0.0
  Local Lock Round Trip (LL/SC acquire/release)                        1057.5                1138.8
  CAS Cache Transfer & Invalidate (unknown values, NBS)                 247.1                 847.1
  CAS Blind Cache Transfer (known values, spinlocks)                    257.1                 993.9
The net result? • 1984: Lock contention was the main issue • 2005: Critical section efficiency is a key issue • Even if the lock is always free when you try to acquire it, performance can still suck!
How has this affected OS design? • Multiprocessor OS designers search for “scalable” synchronization strategies • reader-writer locking instead of global locking • data locking and partitioning • Per-CPU reader-writer locking • Non-blocking synchronization • The “common case” is read-mostly access to linked lists and hash-tables • asymmetric strategies favouring readers are good
Review - Global locking • A symmetric approach (also called “code locking”) • A critical section of code is guarded by a lock • Only one thread at a time can hold the lock • Examples include • Monitors • Java “synchronized” on global object • Linux spin_lock() on global spinlock_t • What is the problem with global locking?
Review - Global locking • A symmetric approach (also called “code locking”) • A critical section of code is guarded by a lock • Only one thread at a time can hold the lock • Examples include • Monitors • Java “synchronized” on global object • Linux spin_lock() on global spinlock_t • Global locking doesn’t scale due to lock contention!
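A hedged kernel-style fragment of global ("code") locking, with one lock guarding every operation on a shared list (struct item, the table, and the lock are illustrative names; the list helpers are from <linux/list.h>):

    struct item {
        int key;
        struct list_head list;
    };

    static DEFINE_SPINLOCK(table_lock);      /* single lock for the whole table */
    static LIST_HEAD(table);

    void table_insert(struct item *item)
    {
        spin_lock(&table_lock);              /* every CPU serializes here ...      */
        list_add(&item->list, &table);
        spin_unlock(&table_lock);
    }

    int table_contains(int key)
    {
        struct item *p;
        int found = 0;

        spin_lock(&table_lock);              /* ... even read-only lookups contend */
        list_for_each_entry(p, &table, list) {
            if (p->key == key) {
                found = 1;
                break;
            }
        }
        spin_unlock(&table_lock);
        return found;
    }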
Review - Reader-writer locking • Many readers can concurrently hold the lock • Writers exclude readers and other writers • The result? • No lock contention in read-mostly scenarios • So it should scale well, right?
Review - Reader-writer locking • Many readers can concurrently hold the lock • Writers exclude readers and other writers • The result? • No lock contention in read-mostly scenarios • So it should scale well, right? • … wrong!
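A hedged kernel-style variant of the same table using a reader-writer lock (reusing the illustrative struct item and list from the previous sketch):

    static DEFINE_RWLOCK(table_rwlock);

    int table_contains(int key)
    {
        struct item *p;
        int found = 0;

        read_lock(&table_rwlock);        /* many readers may hold this at once ...      */
        list_for_each_entry(p, &table, list) {
            if (p->key == key) {
                found = 1;
                break;
            }
        }
        read_unlock(&table_rwlock);      /* ... yet each acquisition still updates the
                                            shared lock word, bouncing its cache line  */
        return found;
    }

    void table_insert(struct item *item)
    {
        write_lock(&table_rwlock);       /* writers exclude readers and other writers */
        list_add(&item->list, &table);
        write_unlock(&table_rwlock);
    }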
Scalability of reader/writer locking • [Timeline diagram: CPU 0 and CPU 1 take turns executing read-acquire memory barriers and critical sections on the same lock, so even read-only acquisitions serialize on the shared lock word] • Reader/writer locking does not scale due to critical section efficiency!
Review - Data locking • A lock per data item instead of one per collection • Per-hash-bucket locking for hash tables • CPUs acquire locks for different hash chains in parallel • CPUs incur memory-latency and pipeline-flush overheads in parallel • Data locking improves scalability by executing critical section “overhead” in parallel
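A hedged kernel-style sketch of per-hash-bucket locking (bucket count and names are illustrative; struct item is reused from the earlier sketch):

    #define TABLE_BUCKETS 256

    struct bucket {
        spinlock_t lock;                      /* one lock per hash chain */
        struct list_head chain;
    };
    static struct bucket hash_table[TABLE_BUCKETS];

    void table_init(void)
    {
        int i;

        for (i = 0; i < TABLE_BUCKETS; i++) {
            spin_lock_init(&hash_table[i].lock);
            INIT_LIST_HEAD(&hash_table[i].chain);
        }
    }

    void table_insert(struct item *item)
    {
        struct bucket *b = &hash_table[(unsigned int)item->key % TABLE_BUCKETS];

        spin_lock(&b->lock);                  /* CPUs working on different buckets take
                                                 different locks, so the memory-latency
                                                 and pipeline-flush costs run in parallel */
        list_add(&item->list, &b->chain);
        spin_unlock(&b->lock);
    }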
Review - Per-CPU reader-writer locking • One lock per CPU (called brlock in Linux) • Readers acquire their own CPU's lock • Writers acquire all CPUs' locks • In read-only workloads CPUs never exchange locks • no memory latency is incurred • Per-CPU R/W locking improves scalability by removing memory latency from read-lock acquisition for read-mostly scenarios
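A schematic, hedged sketch of per-CPU reader-writer locking; this is the idea, not the actual Linux brlock API (the locks are assumed to be initialized with spin_lock_init() at startup):

    static spinlock_t percpu_lock[NR_CPUS];

    void reader_lock(void)
    {
        int cpu = get_cpu();                  /* pin to this CPU (disables preemption) */

        spin_lock(&percpu_lock[cpu]);         /* lock word stays in this CPU's cache,
                                                 so no cross-CPU memory latency */
    }

    void reader_unlock(void)
    {
        spin_unlock(&percpu_lock[smp_processor_id()]);
        put_cpu();
    }

    void writer_lock(void)                    /* writers pay for every CPU's lock */
    {
        int cpu;

        for_each_possible_cpu(cpu)
            spin_lock(&percpu_lock[cpu]);
    }

    void writer_unlock(void)
    {
        int cpu;

        for_each_possible_cpu(cpu)
            spin_unlock(&percpu_lock[cpu]);
    }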
Scalability comparison • Expected scalability on read-mostly workloads • Global locking – poor due to lock contention • R/W locking – poor due to critical section efficiency • Data locking – better? • R/W data locking – better still? • Per-CPU R/W locking – the best we can do?
Actual scalability • [Graph: scalability of locking strategies using read-only workloads in a hash-table benchmark] • Measurements taken on a 4-CPU 700 MHz P-III system • Similar results are obtained on more recent CPUs
[Graph: performance at different update fractions on 8 1.45 GHz POWER4 CPUs]
What are the lessons so far? • Avoid lock contention! • Avoid synchronization instructions! • … especially in the read path!
How about non-blocking synchronization? • Basic idea – copy & flip pointer (no locks!) • Read a pointer to a data item • Create a private copy of the item to update in place • Swap the old item for the new one using an atomic compare & swap (CAS) instruction on its pointer • CAS fails if current pointer not equal to initial value • Retry on failure • NBS should enable fast reads … in theory!
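A hedged user-space sketch of the copy-and-flip-pointer idea using GCC atomic builtins (struct config and the field being updated are illustrative; it assumes shared_cfg was initialized at startup, and error handling is omitted):

    #include <stdlib.h>
    #include <string.h>

    struct config { int a, b; };

    static struct config *shared_cfg;          /* readers load this pointer */

    void update_b(int new_b)
    {
        struct config *newp = malloc(sizeof(*newp));
        struct config *oldp = __atomic_load_n(&shared_cfg, __ATOMIC_ACQUIRE);

        do {
            memcpy(newp, oldp, sizeof(*newp)); /* private copy ...     */
            newp->b = new_b;                   /* ... updated in place */
            /* Publish: succeeds only if shared_cfg still equals oldp; on
             * failure the builtin refreshes oldp and we retry. */
        } while (!__atomic_compare_exchange_n(&shared_cfg, &oldp, newp, 0,
                                              __ATOMIC_RELEASE, __ATOMIC_ACQUIRE));

        /* oldp cannot simply be free()d here: a reader may still hold a
         * reference to it; this is the reclamation problem discussed next. */
    }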
Problems with NBS in practice • Reusing memory causes problems • Readers holding references can be hijacked during data structure traversals when memory is reclaimed • Readers see inconsistent data structures when memory is reused • How and when should memory be reclaimed?
Immediate reclamation? • In practice, readers must either • Use LL/SC to test if pointers have changed, or • Verify that version numbers associated with data structures have not changed (2 memory barriers) • Synchronization instructions slow NBS readers!
Reader-friendly solutions • Never reclaim memory? • Type-stable memory? • Needs a free pool per data structure type • Readers can still be hijacked to the free pool • Exposes OS to denial of service attacks • Ideally, defer reclaiming memory until it's safe! • Defer reclamation of a data item until references to it are no longer held by any thread
How should we defer reclamation? • Wait for a while then delete? • … but how long should you wait? • Maintain reference counts or per-CPU “hazard pointers” on data that is in use?
How should we defer reclamation? • Wait for a while then delete? • … but how long should you wait? • Maintain reference counts or per-CPU “hazard pointers” on data that is in use? • Requires synchronization in read path! • Challenge – deferring destruction without using synchronization instructions in the read path
Quiescent-state-based reclamation • Coding convention: • Don’t allow a quiescent state to occur in a read-side critical section • Reclamation strategy: • Only reclaim data after all CPUs in the system have passed through a quiescent state • Example quiescent states: • Context switch in non-preemptive kernel • Yield in preemptive kernel • Return from system call …
Coding conventions for readers • Delineate read-side critical section • rcu_read_lock() and rcu_read_unlock() primitives • may compile to nothing on most architectures • Don’t hold references outside critical sections • Re-traverse data structure to pick up reference • Don’t yield the CPU during critical sections • Don’t voluntarily yield • Don’t block, don’t leave the kernel …
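A hedged kernel-style reader that follows these conventions (struct foo and gbl_foo are illustrative names):

    struct foo {
        int a;
        struct rcu_head rcu;                /* used later by the reclaimer */
    };
    static struct foo *gbl_foo;             /* always points at the current version */

    int read_a(void)
    {
        struct foo *p;
        int val = -1;

        rcu_read_lock();                    /* delineate the read-side critical section    */
        p = rcu_dereference(gbl_foo);       /* pick up a reference to the current version  */
        if (p)
            val = p->a;                     /* use it only inside the critical section     */
        rcu_read_unlock();                  /* do not carry p past this point              */
        return val;
    }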
Overview of the basic idea • Writers create (and publish) new versions • Using locking or NBS to synchronize with each other • Register call-backs to destroy old versions when safe • call_rcu() primitive registers a call back with a reclaimer • Call-backs are deferred and memory reclaimed in batches • Readers do not use synchronization • While they hold a reference to a version it will not be destroyed • Completion of read-side critical sections is “inferred” by the reclaimer from observation of quiescent states
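A hedged kernel-style writer for the same gbl_foo, publishing a new version and deferring reclamation of the old one (the update lock is illustrative; error handling is omitted):

    static DEFINE_SPINLOCK(foo_update_lock);    /* writers synchronize with each other */

    static void free_old_foo(struct rcu_head *head)
    {
        kfree(container_of(head, struct foo, rcu));
    }

    void update_a(int new_a)
    {
        struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);
        struct foo *oldp;

        spin_lock(&foo_update_lock);
        oldp = gbl_foo;                         /* safe: we hold the writer lock   */
        *newp = *oldp;                          /* private copy of the old version */
        newp->a = new_a;
        rcu_assign_pointer(gbl_foo, newp);      /* publish the new version */
        spin_unlock(&foo_update_lock);

        call_rcu(&oldp->rcu, free_old_foo);     /* reclaim only after a grace period;
                                                   synchronize_rcu() followed by kfree()
                                                   would also work, but blocks the writer */
    }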
Overview of RCU API • Readers: rcu_read_lock(), rcu_dereference() • Writers: rcu_assign_pointer() • Reclaimer: call_rcu(), synchronize_rcu() • [Diagram: writers and readers share memory-consistent mutable pointers into a collection of versions of immutable objects; the reclaimer destroys old versions]
Context switch as a quiescent state • [Timeline diagram: CPU 0 runs a series of RCU read-side critical sections separated by context switches while CPU 1 removes an element. A critical section that overlaps the removal may hold a reference to the old version; later critical sections can't hold such a reference, but RCU can't tell until CPU 0 passes through its next context switch.]
Grace periods • [Timeline diagram: CPU 1 deletes an element while CPU 0 runs RCU read-side critical sections separated by context switches. The grace period extends from the deletion until every CPU has subsequently passed through a quiescent state (here, a context switch); any longer interval also qualifies as a grace period.]
Quiescent states and grace periods • Example quiescent states • Context switch (non-preemptive kernels) • Voluntary context switch (preemptive kernels) • Kernel entry/exit • Blocking call • Grace periods • A period during which every CPU has gone through a quiescent state