Practical Concerns for Scalable Synchronization
Jonathan Walpole (PSU), Paul McKenney (IBM), Tom Hart (University of Toronto)
The problem • “i++” is dangerous if “i” is global • [animation: CPU 0 and CPU 1 each execute load r1,i / inc r1 / store r1,i on the shared variable i] • Both CPUs load the same value of i and increment their private registers to i+1 • Both then store i+1, so two increments leave i+1 instead of i+2: one update is lost
The solution – critical sections • Classic multiprocessor solution: spinlocks • CPU 1 waits for CPU 0 to release the lock • Counts are accurate, but locks have overhead! spin_lock(&mylock); i++; spin_unlock(&mylock);
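The slide's spin_lock()/spin_unlock() are the Linux-kernel primitives. As a rough user-space analogue (a minimal sketch, not the authors' code; counter, mylock, NTHREADS and NINCS are illustrative names, and POSIX spinlocks stand in for the kernel's spinlock_t), the same pattern looks like this:

/* User-space analogue of the slide's kernel example, using POSIX spinlocks. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NINCS    1000000

static long counter;                 /* the shared "i" */
static pthread_spinlock_t mylock;

static void *worker(void *arg)
{
        (void)arg;
        for (int n = 0; n < NINCS; n++) {
                pthread_spin_lock(&mylock);   /* serialize the update */
                counter++;                    /* load, inc, store now appear atomic
                                                 to the other threads */
                pthread_spin_unlock(&mylock);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        pthread_spin_init(&mylock, PTHREAD_PROCESS_PRIVATE);
        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

        /* Always prints NTHREADS * NINCS; without the lock it usually will not. */
        printf("counter = %ld\n", counter);
        return 0;
}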
Critical-section efficiency • Lock acquisition (Ta), critical section (Tc), lock release (Tr) • Critical-section efficiency = Tc / (Ta + Tc + Tr) • Ignoring lock contention and cache conflicts in the critical section
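A quick illustrative calculation (numbers assumed for this example, not taken from the measurements later in the talk): if lock acquisition and release each cost roughly 400 normal-instruction times and the critical section itself costs 200, then efficiency = 200 / (400 + 200 + 400) = 20%, i.e. four fifths of the locked region is pure synchronization overhead.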
Critical-section efficiency • [graph: critical-section efficiency as a function of critical-section size]
Questions • Have synchronization instructions got faster? • Relative to normal instructions? • In absolute terms? • What are the implications of this for the performance of operating systems? • Can we fix this problem by adding more CPUs?
What’s going on? • Taller memory hierarchies • Memory speeds have not kept up with CPU speeds • 1984: no caches needed, since instructions were slower than memory accesses • 2005: 3-4 level cache hierarchies, since instructions are orders of magnitude faster than memory accesses
Why does this matter? • Synchronization implies sharing data across CPUs • normal instructions tend to hit in top-level cache • synchronization operations tend to miss • Synchronization requires a consistent view of data • between cache and memory • across multiple CPUs • requires CPU-CPU communication • Synchronization instructions see memory latency!
… but that’s not all! • Longer pipelines • 1984: Many clock cycles per instruction • 2005: Many instructions per clock cycle • 20-stage pipelines • Out of order execution • Keeps the pipelines full • Must not reorder the critical section before its lock! • Synchronization instructions stall the pipeline!
Reordering means weak memory consistency • Memory barriers: additional synchronization instructions are needed to manage reordering
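As an illustration of the ordering problem the barriers solve, here is a minimal C11 sketch of the publish/consume pattern (illustrative names; the release/acquire orderings play the role that smp_wmb()/smp_rmb() plus plain loads and stores play in the kernel):

/* Minimal C11 sketch of why ordering/barriers are needed (illustrative). */
#include <stdatomic.h>

struct msg { int payload; };

static struct msg shared;
static atomic_int ready;    /* 0 = not published, 1 = published */

void producer(void)
{
        shared.payload = 42;                          /* initialize the data ... */
        atomic_store_explicit(&ready, 1,
                              memory_order_release);  /* ... then publish; without the
                                                         ordering, the flag could become
                                                         visible before the data */
}

int consumer(void)
{
        if (atomic_load_explicit(&ready, memory_order_acquire))
                return shared.payload;                /* guaranteed to see 42 */
        return -1;                                    /* not published yet */
}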
What is the cost of all this?
Instruction costs, normalized to a normal instruction (cost 1.0):

Operation | 1.45 GHz IBM POWER4 | 3.06 GHz Intel Xeon
Normal Instruction | 1.0 | 1.0
Atomic Increment | 183.1 | 402.3
SMP Write Memory Barrier | 328.6 | 0.0
Read Memory Barrier | 328.9 | 402.3
Write Memory Barrier | 400.9 | 0.0
Local Lock Round Trip (lock acquisition/release) | 1057.5 | 1138.8
CAS Cache Transfer & Invalidate (NBS, unknown old value) | 247.1 | 847.1
CAS Blind Cache Transfer (spinlocks, known old value) | 257.1 | 993.9
The net result? • 1984: Lock contention was the main issue • 2005: Critical section efficiency is a key issue • Even if the lock is always free when you try to acquire it, performance can still suck!
How has this affected OS design? • Multiprocessor OS designers search for “scalable” synchronization strategies • reader-writer locking instead of global locking • data locking and partitioning • Per-CPU reader-writer locking • Non-blocking synchronization • The “common case” is read-mostly access to linked lists and hash-tables • asymmetric strategies favouring readers are good
Review - Global locking • A symmetric approach (also called “code locking”) • A critical section of code is guarded by a lock • Only one thread at a time can hold the lock • Examples include • Monitors • Java “synchronized” on a global object • Linux spin_lock() on a global spinlock_t • What is the problem with global locking? • Global locking doesn't scale, due to lock contention!
Review - Reader-writer locking • Many readers can concurrently hold the lock • Writers exclude readers and other writers • The result? • No lock contention in read-mostly scenarios • So it should scale well, right? • … wrong!
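For concreteness, a user-space reader-writer locking sketch (a minimal example, not the benchmark code; tbl_lock, table and the function names are illustrative). Note that even the read path performs atomic read-modify-writes on the shared lock word, so every reader bounces the lock's cache line between CPUs, which is exactly the critical-section-efficiency problem the next slide illustrates:

/* Reader-writer locking with POSIX rwlocks (illustrative). */
#include <pthread.h>

static pthread_rwlock_t tbl_lock = PTHREAD_RWLOCK_INITIALIZER;
static int table[1024];

int reader_lookup(int idx)
{
        int v;

        pthread_rwlock_rdlock(&tbl_lock);   /* atomic RMW on the shared lock word */
        v = table[idx];
        pthread_rwlock_unlock(&tbl_lock);   /* another atomic RMW */
        return v;
}

void writer_update(int idx, int val)
{
        pthread_rwlock_wrlock(&tbl_lock);   /* excludes readers and other writers */
        table[idx] = val;
        pthread_rwlock_unlock(&tbl_lock);
}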
Scalability of reader/writer locking • [timeline diagram: on CPU 0 and CPU 1, each read-acquire involves a memory barrier and a lock round trip that dwarf the critical section itself] • Reader/writer locking does not scale, due to critical-section efficiency!
Review - Data locking • A lock per data item instead of one per collection • Per-hash-bucket locking for hash tables • CPUs acquire locks for different hash chains in parallel • CPUs incur memory-latency and pipeline-flush overheads in parallel • Data locking improves scalability by executing critical section “overhead” in parallel
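A minimal per-bucket (data) locking sketch for a hash table (illustrative names; POSIX spinlocks stand in for the kernel's per-bucket locks):

/* Per-hash-bucket locking: CPUs working on different chains never contend. */
#include <pthread.h>

#define NBUCKETS 256

struct node {
        int key, value;
        struct node *next;
};

struct bucket {
        pthread_spinlock_t lock;     /* one lock per hash chain */
        struct node *head;
};

static struct bucket table[NBUCKETS];

void table_init(void)
{
        for (int i = 0; i < NBUCKETS; i++) {
                pthread_spin_init(&table[i].lock, PTHREAD_PROCESS_PRIVATE);
                table[i].head = NULL;
        }
}

int lookup(int key, int *value)
{
        struct bucket *b = &table[(unsigned)key % NBUCKETS];
        int found = 0;

        pthread_spin_lock(&b->lock);          /* contends only within this chain */
        for (struct node *n = b->head; n; n = n->next) {
                if (n->key == key) {
                        *value = n->value;
                        found = 1;
                        break;
                }
        }
        pthread_spin_unlock(&b->lock);
        return found;
}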
Review - Per-CPU reader-writer locking • One lock per CPU (called brlock in Linux) • Readers acquire their own CPU’s lock • Writers acquire all CPU’s locks • In read-only workloads CPUs never exchange locks • no memory latency is incurred • Per-CPU R/W locking improves scalability by removing memory latency from read-lock acquisition for read-mostly scenarios
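A minimal user-space sketch of the per-CPU ("big-reader") idea (illustrative, not the Linux brlock source; sched_getcpu() stands in for the kernel's notion of the current CPU, and the padding is a crude way to keep each CPU's lock on its own cache line):

#define _GNU_SOURCE                 /* for sched_getcpu() */
#include <pthread.h>
#include <sched.h>

#define NCPUS 8                     /* illustrative; size to the machine */

struct percpu_rwlock {
        pthread_rwlock_t lock;
        char pad[64];               /* crude cache-line padding between CPUs' locks */
};

static struct percpu_rwlock brlock[NCPUS];

void br_init(void)
{
        for (int i = 0; i < NCPUS; i++)
                pthread_rwlock_init(&brlock[i].lock, NULL);
}

int br_read_lock(void)                      /* returns the slot to pass to unlock */
{
        int cpu = sched_getcpu() % NCPUS;   /* readers take only their own CPU's lock */
        pthread_rwlock_rdlock(&brlock[cpu].lock);
        return cpu;
}

void br_read_unlock(int cpu)
{
        pthread_rwlock_unlock(&brlock[cpu].lock);
}

void br_write_lock(void)                    /* writers take every CPU's lock */
{
        for (int i = 0; i < NCPUS; i++)
                pthread_rwlock_wrlock(&brlock[i].lock);
}

void br_write_unlock(void)
{
        for (int i = 0; i < NCPUS; i++)
                pthread_rwlock_unlock(&brlock[i].lock);
}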
Scalability comparison • Expected scalability on read-mostly workloads • Global locking – poor due to lock contention • R/W locking – poor due to critical section efficiency • Data locking – better? • R/W data locking – better still? • Per-CPU R/W locking – the best we can do?
Actual scalability • [graph: scalability of locking strategies under a read-only workload in a hash-table benchmark] • Measurements taken on a 4-CPU 700 MHz P-III system • Similar results are obtained on more recent CPUs
[graph: performance at different update fractions on 8 1.45 GHz POWER4 CPUs]
What are the lessons so far? • Avoid lock contention ! • Avoid synchronization instructions ! • … especially in the read-path !
How about non-blocking synchronization? • Basic idea – copy & flip pointer (no locks!) • Read a pointer to a data item • Create a private copy of the item to update in place • Swap the old item for the new one using an atomic compare & swap (CAS) instruction on its pointer • CAS fails if current pointer not equal to initial value • Retry on failure • NBS should enable fast reads … in theory!
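A minimal copy-and-swap-pointer sketch using C11 atomics (illustrative names; it assumes "current" already points at a valid item, shows only the update path, and deliberately ignores the reclamation problem discussed next):

#include <stdatomic.h>
#include <stdlib.h>

struct item {
        int a, b;
};

static _Atomic(struct item *) current;      /* shared pointer to the live version */

int update_b(int new_b)
{
        struct item *oldp, *newp;

        for (;;) {
                oldp = atomic_load(&current);     /* read the current version */
                newp = malloc(sizeof(*newp));
                if (!newp)
                        return -1;
                *newp = *oldp;                    /* private copy ... */
                newp->b = new_b;                  /* ... updated in place */
                if (atomic_compare_exchange_strong(&current, &oldp, newp))
                        break;                    /* published the new version */
                free(newp);                       /* lost the race: retry with a fresh copy */
        }
        /* The old version cannot simply be free()d here: readers may still
         * hold references to it.  That is the reclamation problem below. */
        return 0;
}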
Problems with NBS in practice • Reusing memory causes problems • Readers holding references can be hijacked during data structure traversals when memory is reclaimed • Readers see inconsistent data structures when memory is reused • How and when should memory be reclaimed?
Immediate reclamation? • In practice, readers must either • Use LL/SC to test if pointers have changed, or • Verify that version numbers associated with data structures have not changed (2 memory barriers) • Synchronization instructions slow NBS readers!
Reader-friendly solutions • Never reclaim memory? • Type-stable memory? • Needs a free pool per data-structure type • Readers can still be hijacked to the free pool • Exposes the OS to denial-of-service attacks • Ideally, defer reclaiming memory until it's safe! • Defer reclamation of a data item until references to it are no longer held by any thread
How should we defer reclamation? • Wait for a while, then delete? • … but how long should you wait? • Maintain reference counts or per-CPU “hazard pointers” on data that is in use? • … but that requires synchronization in the read path! • Challenge: deferring destruction without using synchronization instructions in the read path
Quiescent-state-based reclamation • Coding convention: • Don’t allow a quiescent state to occur in a read-side critical section • Reclamation strategy: • Only reclaim data after all CPUs in the system have passed through a quiescent state • Example quiescent states: • Context switch in non-preemptive kernel • Yield in preemptive kernel • Return from system call …
Coding conventions for readers • Delineate read-side critical section • rcu_read_lock() and rcu_read_unlock() primitives • may compile to nothing on most architectures • Don’t hold references outside critical sections • Re-traverse data structure to pick up reference • Don’t yield the CPU during critical sections • Don’t voluntarily yield • Don’t block, don’t leave the kernel …
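Put together, a reader following these conventions looks roughly like this (a simplified Linux-kernel-style sketch; struct foo, gbl_foo and read_foo_b are illustrative names, not part of the kernel API):

#include <linux/rcupdate.h>

struct foo {
        int a;
        int b;
};

struct foo *gbl_foo;    /* shared pointer; updaters publish new versions here */

int read_foo_b(void)
{
        struct foo *p;
        int b;

        rcu_read_lock();                 /* delineate the read-side critical section */
        p = rcu_dereference(gbl_foo);    /* fetch the current version's pointer */
        b = p ? p->b : -1;               /* use it only inside the critical section */
        rcu_read_unlock();               /* no reference to *p may be held past here */
        return b;
}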
Overview of the basic idea • Writers create (and publish) new versions • Using locking or NBS to synchronize with each other • Register call-backs to destroy old versions when safe • call_rcu() primitive registers a call back with a reclaimer • Call-backs are deferred and memory reclaimed in batches • Readers do not use synchronization • While they hold a reference to a version it will not be destroyed • Completion of read-side critical sections is “inferred” by the reclaimer from observation of quiescent states
Overview of RCU API • Readers: rcu_read_lock(), rcu_read_unlock(), rcu_dereference() • Writers: rcu_assign_pointer() • Reclaimer: call_rcu(), synchronize_rcu() • [diagram: writers publish and readers traverse mutable pointers (with the required memory consistency) into a collection of versions of immutable objects; the reclaimer destroys old versions once it is safe]
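And the matching update side, reusing struct foo and gbl_foo from the reader sketch above (again a simplified kernel-style sketch, assuming gbl_foo already points at a valid item; foo_lock is an illustrative writer-side lock):

#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>

static DEFINE_SPINLOCK(foo_lock);

void update_foo_b(int new_b)
{
        struct foo *newp, *oldp;

        newp = kmalloc(sizeof(*newp), GFP_KERNEL);
        if (!newp)
                return;

        spin_lock(&foo_lock);                /* writers synchronize with each other */
        oldp = gbl_foo;                      /* safe: foo_lock excludes other writers */
        *newp = *oldp;                       /* copy the current version ... */
        newp->b = new_b;                     /* ... and modify the private copy */
        rcu_assign_pointer(gbl_foo, newp);   /* publish the new version */
        spin_unlock(&foo_lock);

        synchronize_rcu();                   /* wait for a grace period: every
                                                pre-existing reader has finished */
        kfree(oldp);                         /* now the old version is safe to free */
        /* Alternatively, embed a struct rcu_head in struct foo and use
         * call_rcu() to defer the kfree() without blocking here. */
}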
Context switch as a quiescent state • [timeline diagram: CPU 0 runs a series of RCU read-side critical sections separated by context switches while CPU 1 removes an element] • A read-side critical section that was in progress when the element was removed may still hold a reference to the old version • Critical sections that begin after the removal can't hold such a reference, but RCU can't tell • Once a CPU passes through a context switch, it can't hold a reference to the old version, and RCU can tell
Grace periods • [timeline diagram: CPU 0 runs RCU read-side critical sections separated by context switches while CPU 1 deletes an element] • A grace period extends from the deletion until every CPU has subsequently passed through a context switch (a quiescent state) • Any longer interval also qualifies as a grace period, so reclamation may safely be deferred and batched • Only after a grace period ends can the deleted element be reclaimed
Quiescent states and grace periods • Example quiescent states • Context switch (non-preemptive kernels) • Voluntary context switch (preemptive kernels) • Kernel entry/exit • Blocking call • Grace periods • A period during which every CPU has gone through a quiescent state