Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation

Authors: Thomas E. Hart, Paul E. McKenney, and Angela Demke Brown • Presented by Pengyuan Wang, 9/11/2007

Outline • Introduction • Memory Reclamation Schemes • Experimental Methodology • Performance Analysis • Conclusion

Motivation • Instruction/pipeline costs on an 8-CPU 1.45 GHz PPC (2005) [Figure: instruction/pipeline cost chart; one label reads "384.5 insts"] • Source: Paul E. McKenney. Abstraction, Reality and RCU. 2005

Critical Section Efficiency • The critical-section efficiency of lock-based synchronization has decreased dramatically over the last 25 years.

Locking • Uses mutual exclusion (locks) to ensure that concurrent operations do not interfere with one another • Problems with locking: • Expensive • Priority inversion • Convoying • Deadlock

Non-blocking synchronization (NBS) • Roll back to resolve conflicting changes instead of spinning or blocking • Use atomic instructions to hide complex updates behind a single commit point:
    int old, new;
    do {
        old = cnt->val;
        new = old + 1;
    } while (!CAS(&cnt->val, old, new));  /* CAS returns nonzero on success; retry on failure */
• Properties of NBS: • Wait-free: every thread eventually makes progress (no starvation, livelock, or deadlock) • Lock-free: at least one thread eventually makes progress, but some threads may be delayed indefinitely (no livelock or deadlock) • Obstruction-free: a thread eventually makes progress if all other threads are suspended (no deadlock) • NBS is still expensive: references and updates must use carefully crafted sequences of atomic operations
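The retry loop on this slide can be written with standard C11 atomics. The following is a minimal sketch, not code from the paper: the type `counter_t` and the function `counter_increment` are names assumed for illustration, and `atomic_compare_exchange_weak` stands in for the slide's `CAS`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared counter; the type name is an assumption. */
typedef struct {
    atomic_int val;
} counter_t;

/* Lock-free increment: read, compute, then commit with a single CAS.
 * Returns the value that this call installed. */
static int counter_increment(counter_t *cnt)
{
    assert(cnt != NULL);
    int old = atomic_load(&cnt->val);
    int desired;
    do {
        desired = old + 1;
        /* On failure the CAS refreshes `old` with the current value,
         * so the next iteration recomputes from up-to-date input. */
    } while (!atomic_compare_exchange_weak(&cnt->val, &old, desired));
    return desired;
}
```

The loop is lock-free rather than wait-free: a CAS fails only because some other thread's CAS succeeded, so the system as a whole makes progress, but one unlucky thread may retry many times.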

Lockless Synchronization • A broad class of algorithms that avoid locks • May or may not be non-blocking • A popular idea in lockless synchronization: separate removing the references to a node from reclaiming its memory • Example: RCU • Very low or zero reader-side overhead • Best for read-mostly data structures • Major challenge of lockless synchronization: read/reclaim races

Read/Reclaim Race [Figure: threads T1 and T2 racing on node N]

Contributions of this paper • Compares three memory reclamation schemes to determine: • The strengths and limitations of each scheme • Which factors affect their performance • What should be considered when implementing a memory reclamation scheme • Methodology: vary one factor at a time and measure performance with a microbenchmark.

Blocking Schemes • What is a blocking scheme? • There is no progress guarantee. • The failure of one thread may delay reclamation indefinitely, and the ensuing memory exhaustion will eventually block all threads. • Quiescent-State-Based Reclamation (QSBR) • “A quiescent state for thread T is a state in which T holds no references to shared nodes” • Example of a quiescent state: a context switch • “A grace period is a time interval [a,b] such that, after time b, all nodes removed before time a may safely be reclaimed.” = “any interval of time during which all threads pass through at least one quiescent state” • Example: read-copy update (RCU) • Source: Thomas E. Hart, Paul E. McKenney, and Angela Demke Brown. Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation. 2006

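QSBR
    for (i = 0; i < 100; i++)
        if (list_find(L, i))
            break;
    quiescent_state();  /* called outside the operations, when no references are held */
[Figure: timeline of threads T1, T2, and T3 passing through quiescent states (QS), with a grace period marked along the time axis]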

RCU in detail • Reader side example
    /* Read-only search using locking */
    struct el *p;
    read_lock(&list_lock);
    p = search(mykey);
    if (p == NULL) {
        /* handle error condition */
    } else {
        /* access *p w/out modifying */
    }
    read_unlock(&list_lock);

    /* Read-only search using RCU */
    struct el *p;
    rcu_read_lock();   /* nop unless CONFIG_PREEMPT */
    p = search(mykey);
    if (p == NULL) {
        /* handle error condition */
    } else {
        /* access *p w/out modifying */
    }
    rcu_read_unlock(); /* nop unless CONFIG_PREEMPT */

RCU in detail • Writer side example
    /* Delete using RCU */
    void delete(long mykey)
    {
        struct el *p;

        spin_lock(&list_lock);
        p = search(mykey);
        if (p != NULL) {
            list_del_rcu(p);
        }
        spin_unlock(&list_lock);
        if (p != NULL) {
            /* defer my_free(p) until a grace period has elapsed */
            call_rcu(&p->rcuhead, (void (*)(void *))my_free, p);
        }
    }

    /* Delete using locking */
    void delete(long mykey)
    {
        struct el *p;

        write_lock(&list_lock);
        p = search(mykey);
        if (p != NULL) {
            list_del(p);
        }
        write_unlock(&list_lock);
        if (p != NULL) {
            my_free(p);  /* no reader can still hold p once the lock is released */
        }
    }
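The RCU writer separates unlinking a node from freeing it by queueing a callback. A minimal single-threaded sketch of such a deferred-callback queue follows. All names here (`deferred_cb`, `cb_queue`, `defer_call`, `run_callbacks`, `count_call`) are hypothetical and are not the kernel's API; a real implementation would keep per-CPU queues and run them only after a grace period.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback record, embedded in or allocated next to the node. */
struct deferred_cb {
    struct deferred_cb *next;
    void (*func)(void *);
    void *arg;
};

/* FIFO queue of callbacks waiting for a grace period to end. */
struct cb_queue {
    struct deferred_cb *head;
    struct deferred_cb *tail;
};

/* Queue func(arg) for later; the caller has already unlinked the node. */
static void defer_call(struct cb_queue *q, struct deferred_cb *cb,
                       void (*func)(void *), void *arg)
{
    assert(q != NULL && cb != NULL && func != NULL);
    cb->next = NULL;
    cb->func = func;
    cb->arg = arg;
    if (q->head == NULL)
        q->head = cb;
    else
        q->tail->next = cb;
    q->tail = cb;
}

/* Run every queued callback; call only once a grace period has elapsed.
 * Returns the number of callbacks invoked. */
static int run_callbacks(struct cb_queue *q)
{
    int n = 0;
    struct deferred_cb *cb = q->head;
    q->head = q->tail = NULL;
    while (cb != NULL) {
        struct deferred_cb *next = cb->next;  /* read before func may free cb */
        cb->func(cb->arg);
        cb = next;
        n++;
    }
    return n;
}

/* Example callback for experiments: counts invocations instead of freeing. */
static void count_call(void *arg)
{
    (*(int *)arg)++;
}
```

In a real writer the callback would free the node, as `my_free` does on the slide; the counting callback only makes the deferral observable.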

Detecting Grace Period
1. An entity needing to wait for a grace period enqueues a callback onto a per-CPU list (nxlist, curlist).
2. Some time later, this CPU informs all other CPUs of the beginning of a grace period (bitmask).
3. As each CPU learns of the new grace period, it takes a snapshot of its quiescent-state counters.
4. Each CPU periodically compares its snapshot against the current values of its quiescent-state counters. As soon as any of the counters differs from the snapshot, the CPU records the fact that it has passed through a quiescent state.
5. The last CPU to record that it has passed through a quiescent state also records the fact that the grace period has ended.
6. As each CPU learns that a grace period has ended, it executes any of its callbacks that were waiting for the end of that grace period.
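Steps 2 through 5 can be modeled with a bitmask and per-CPU counters. The sketch below is a single-threaded model under stated assumptions: `NCPUS`, `struct gp_state`, and the `gp_*` functions are hypothetical names, all snapshots are taken at once when the grace period starts (the slide has each CPU take its own as it learns of the grace period), and there is one counter per CPU with no locking or atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 4  /* assumed CPU count for this sketch */

struct gp_state {
    uint32_t need_qs;               /* bitmask of CPUs that have not reported yet */
    unsigned long qs_count[NCPUS];  /* per-CPU quiescent-state counters */
    unsigned long snapshot[NCPUS];  /* counter values when the grace period began */
    bool in_progress;
};

/* Steps 2 and 3: begin a grace period and snapshot every CPU's counter. */
static void gp_start(struct gp_state *gp)
{
    gp->need_qs = (1u << NCPUS) - 1;
    for (int cpu = 0; cpu < NCPUS; cpu++)
        gp->snapshot[cpu] = gp->qs_count[cpu];
    gp->in_progress = true;
}

/* A CPU passes through a quiescent state (for example, a context switch). */
static void gp_note_quiescent(struct gp_state *gp, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    gp->qs_count[cpu]++;
}

/* Steps 4 and 5: periodic check run by one CPU.
 * Returns true if this call ended the grace period. */
static bool gp_check(struct gp_state *gp, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    if (!gp->in_progress || !(gp->need_qs & (1u << cpu)))
        return false;               /* nothing to report */
    if (gp->qs_count[cpu] == gp->snapshot[cpu])
        return false;               /* no quiescent state since the snapshot */
    gp->need_qs &= ~(1u << cpu);    /* record that this CPU has passed one */
    if (gp->need_qs == 0) {
        gp->in_progress = false;    /* last CPU to report ends the grace period */
        return true;
    }
    return false;
}
```

When `gp_check` returns true, callbacks queued before `gp_start` are safe to run, which corresponds to step 6.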

Epoch-Based Reclamation (EBR) • Follows QSBR in using grace periods • Uses critical-region primitives inside each operation instead of a quiescent-state primitive outside the operations • Application-independent
    int search(struct list *l, long key)
    {
        node_t *cur;

        critical_enter();
        for (cur = l->list_head; cur != NULL; cur = cur->next) {
            if (cur->key >= key) {
                critical_exit();
                return (cur->key == key);
            }
        }
        critical_exit();
        return 0;
    }
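One way `critical_enter()` and `critical_exit()` could track epochs is sketched below with C11 atomics. This is an assumption-laden illustration, not the paper's implementation: the thread-id parameter, `MAX_THREADS`, `try_advance_epoch`, and `safe_to_free` are hypothetical, and the rule that a node retired in epoch e may be freed once the global epoch reaches e + 2 is the usual epoch-scheme convention rather than something stated on the slide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 4  /* assumed thread count for this sketch */

static atomic_uint global_epoch;

static struct {
    atomic_bool active;  /* true while the thread is in a critical region */
    atomic_uint epoch;   /* global epoch the thread observed on entry */
} thr[MAX_THREADS];

static void critical_enter(int tid)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    /* Sequentially consistent stores order the announcement before
     * any later reads of shared nodes. */
    atomic_store(&thr[tid].active, true);
    atomic_store(&thr[tid].epoch, atomic_load(&global_epoch));
}

static void critical_exit(int tid)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    atomic_store(&thr[tid].active, false);
}

/* Advance the global epoch only if every thread currently inside a
 * critical region has observed the current epoch. */
static bool try_advance_epoch(void)
{
    unsigned cur = atomic_load(&global_epoch);
    for (int t = 0; t < MAX_THREADS; t++) {
        if (atomic_load(&thr[t].active) && atomic_load(&thr[t].epoch) != cur)
            return false;
    }
    return atomic_compare_exchange_strong(&global_epoch, &cur, cur + 1);
}

/* Assumed rule: a node retired in retire_epoch is safe to free once the
 * global epoch has advanced at least twice past it. */
static bool safe_to_free(unsigned retire_epoch)
{
    return atomic_load(&global_epoch) - retire_epoch >= 2;
}
```

A thread that stays inside a critical region prevents the epoch from advancing a second time, which is why the slides classify EBR as a blocking scheme.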

Non-blocking schemes • What is a non-blocking scheme? • Reclamation cannot be delayed indefinitely, so there is no risk of blocking due to memory exhaustion. • Hazard-Pointer-Based Reclamation (HPBR) • Threads use hazard pointers to protect nodes from reclamation by other threads. [Figure: hazard-pointer slots HP[0] through HP[3] held by threads T1 and T2, with a slot pointing to node N]
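The protect-and-recheck step at the heart of hazard pointers can be sketched with C11 atomics. The names `hp`, `hp_protect`, `hp_clear`, and `hp_can_reclaim` are hypothetical, and a full implementation would also keep per-thread lists of retired nodes and scan the slots in batches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HP 4  /* matches the four slots HP[0]..HP[3] in the figure */

static _Atomic(void *) hp[MAX_HP];  /* global hazard-pointer slots */

/* Publish a hazard pointer to the node currently stored in *src.
 * The re-read confirms the node was still reachable after publication. */
static void *hp_protect(int slot, _Atomic(void *) *src)
{
    assert(slot >= 0 && slot < MAX_HP);
    void *p;
    do {
        p = atomic_load(src);
        atomic_store(&hp[slot], p);  /* seq_cst store: the per-node fence */
    } while (p != atomic_load(src));
    return p;
}

/* Release the slot once the thread no longer dereferences the node. */
static void hp_clear(int slot)
{
    assert(slot >= 0 && slot < MAX_HP);
    atomic_store(&hp[slot], NULL);
}

/* A removed node may be freed only if no slot still points at it. */
static bool hp_can_reclaim(void *p)
{
    for (int i = 0; i < MAX_HP; i++) {
        if (atomic_load(&hp[i]) == p)
            return false;
    }
    return true;
}
```

Because every node visited during a traversal needs its own `hp_protect` call, the cost of this step grows with traversal length.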

Performance Factors • Memory Consistency (memory fences) • Data structures and Workload • Linked list/queue • Read-mostly/update-heavy • Threads and Scheduling • Contention • Preemption • Memory Constraints

Test Program
    while (parent's timer has not expired) {
        for i from 1 to 100 do {
            key = random key;
            op = random operation;
            d = data structure;
            op(d, key);
        }
        if (using QSBR)
            quiescent_state();
    }
• Execution time = test duration / total # of operations • CPU time = execution time / min{# of threads, # of CPUs}
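The two metrics can be computed directly from the formulas as stated on this slide. The function names and the choice of nanoseconds are assumptions made for illustration.

```c
#include <assert.h>

/* Execution time per operation = test duration / total number of operations. */
static double exec_time_per_op(double test_duration_ns, unsigned long total_ops)
{
    assert(total_ops > 0);
    return test_duration_ns / (double)total_ops;
}

/* CPU time = execution time / min(number of threads, number of CPUs),
 * following the slide's formula. */
static double cpu_time_per_op(double exec_time_ns, unsigned threads, unsigned cpus)
{
    unsigned divisor = threads < cpus ? threads : cpus;
    assert(divisor > 0);
    return exec_time_ns / (double)divisor;
}
```

For example, a one-second test (1e9 ns) that completes one million operations gives 1000 ns per operation; with 8 threads on 4 CPUs the slide's formula divides that by 4.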

Scalability with Traversal Length

Scalability with Traversal Length • Explanation: the dominant factor is the number of atomic instructions (memory fences) per operation • HPBR requires O(n) fences for a traversal of n nodes • EBR requires O(1) fences per operation • QSBR needs only one fence per several operations
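A simple counting model makes the asymptotic difference concrete. It is only a sketch: the constant `EBR_FENCES_PER_OP` is an assumption (one fence each for entering and leaving the critical region), `OPS_PER_QS` reflects the test program's call to `quiescent_state()` after every 100 operations, and the function names are hypothetical.

```c
#include <assert.h>

#define OPS_PER_QS 100       /* test loop: one quiescent_state() per 100 operations */
#define EBR_FENCES_PER_OP 2  /* assumption: one fence on enter, one on exit */

/* HPBR: one fence per node visited, so O(n) per operation. */
static unsigned long hpbr_fences(unsigned long ops, unsigned long nodes_per_op)
{
    return ops * nodes_per_op;
}

/* EBR: a constant number of fences per operation, independent of list length. */
static unsigned long ebr_fences(unsigned long ops)
{
    return ops * EBR_FENCES_PER_OP;
}

/* QSBR: one fence per quiescent state, amortized over OPS_PER_QS operations. */
static unsigned long qsbr_fences(unsigned long ops)
{
    assert(OPS_PER_QS > 0);
    return (ops + OPS_PER_QS - 1) / OPS_PER_QS;  /* round up */
}
```

Under these assumptions, 1000 operations on a 50-node traversal cost 50000 fences with HPBR, 2000 with EBR, and 10 with QSBR.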

Scalability with Threads (No Preemption) • The relative performance of the schemes stays constant as the number of threads changes

Scalability with Threads (With Preemption) • In the read-only case, the only factor that matters is read-side overhead • In the write-heavy case, HPBR performs best because of its non-blocking design

Scalability with Threads (With Preemption) • With a busy-waiting policy instead of a yielding policy, the grace period can stall and memory can be exhausted

Fair Evaluation of Algorithms • Lock-free performance improves relative to locking as the update fraction increases, because lock-free updates require fewer atomic instructions than locking does. • Two lockless algorithms can be compared accurately only when both use the same reclamation scheme.

Conclusion • Different schemes have very different overheads. • No scheme is globally optimal. • QSBR is usually the best reclamation scheme. • HPBR is a good choice when there is preemption and a high update fraction.
