
Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation




Presentation Transcript


  1. Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation Authors: Thomas E. Hart, Paul E. McKenney, and Angela Demke Brown Presented by Pengyuan Wang 9/11/2007

  2. Outline • Introduction • Memory Reclamation Schemes • Experimental Methodology • Performance Analysis • Conclusion

  3. Motivation • Instruction/Pipeline Costs on an 8-CPU 1.45 GHz PPC (2005) • [cost table; sample entry: 384.5 instructions] • source: Paul E. McKenney. Abstraction, Reality and RCU. 2005

  4. Critical Section Efficiency • The critical-section efficiency of lock-based synchronization has been decreasing dramatically over the last 25 years.

  5. Locking • Using mutual exclusion (locks) to ensure that concurrent operations do not interfere with one another. • Problems of locking: • Expensive • Priority inversion • Convoying • Deadlock

  6. Non-blocking synchronization • Roll back to resolve conflicting changes instead of spinning or blocking • Use atomic instructions to hide complex updates behind a single commit point:

    int old, new;
    do {
        old = cnt->val;
        new = old + 1;
    } while (!CAS(&cnt->val, old, new));

  • Properties of NBS: • Wait-free: every thread eventually makes progress (no starvation, livelock, or deadlock) • Lock-free: some thread eventually makes progress, but individual threads may be indefinitely delayed (no livelock or deadlock) • Obstruction-free: a thread eventually makes progress if all other threads are suspended (no deadlock) • It is still expensive!!! • Must use carefully crafted sequences of atomic operations for references and updates
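As a sketch only, the CAS retry loop above written against C11 <stdatomic.h>; the struct and function names (counter, counter_inc) are illustrative, not from the paper:

```c
/* Minimal sketch of the slide's CAS loop, assuming C11 atomics. */
#include <stdatomic.h>

struct counter { _Atomic int val; };

/* Atomically increment cnt->val; returns the value we installed.
 * The whole update is committed by a single compare-and-swap. */
static int counter_inc(struct counter *cnt)
{
    int old = atomic_load(&cnt->val);
    /* On failure, atomic_compare_exchange_weak reloads 'old' with
     * the current value, so the loop retries with fresh data. */
    while (!atomic_compare_exchange_weak(&cnt->val, &old, old + 1))
        ;
    return old + 1;
}
```

Note the convention used here: the loop retries while the compare-and-swap fails, which is why the slide's `while (CAS(...))` condition is inverted.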

  7. Lockless Synchronization • A broad class of algorithms that avoid locks • May or may not be non-blocking • A common idea in lockless synchronization: separate the removal of a reference from the reclamation of the memory • Example: RCU • Very low or zero read-side overhead • Best for read-mostly data structures • Major challenge of lockless synchronization: read/reclaim races
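A minimal sketch of the removal/reclamation split just described; the helper names (list_remove, reclaim_all) are hypothetical, and in a real scheme the reclaim step would run only after a grace period or hazard-pointer scan has shown it safe:

```c
/* Sketch: split "remove the reference" from "reclaim the memory".
 * Removal unlinks the node so new readers cannot find it; the free
 * is deferred until known safe (modeled by an explicit call here). */
#include <stdlib.h>

struct node { int key; struct node *next; };

static struct node *pending;          /* removed but not yet freed */

/* Unlink the first node with the given key; defer its free. */
static int list_remove(struct node **head, int key)
{
    for (struct node **pp = head; *pp; pp = &(*pp)->next) {
        if ((*pp)->key == key) {
            struct node *n = *pp;
            *pp = n->next;            /* step 1: remove the reference */
            n->next = pending;        /* park on the pending list     */
            pending = n;
            return 1;
        }
    }
    return 0;
}

/* Step 2, run only once no reader can still hold a reference:
 * actually reclaim the parked memory.  Returns the number freed. */
static int reclaim_all(void)
{
    int freed = 0;
    while (pending) {
        struct node *n = pending;
        pending = n->next;
        free(n);
        freed++;
    }
    return freed;
}
```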

  8. Read/Reclaim Race • [diagram: thread T1 holds a reference to node N while thread T2 removes N; if T2 frees N immediately, T1's subsequent access is a use-after-free]

  9. Contributions of this paper • Compare three memory reclamation schemes to determine: • The strengths and limitations of each scheme • The factors that affect their performance • What should be considered when implementing a memory reclamation scheme • Methodology: vary these factors and gauge performance using a microbenchmark.

  10. Outline • Introduction • Memory Reclamation Schemes • Experimental Methodology • Performance Analysis • Conclusion

  11. Blocking Schemes • What is a blocking scheme? • There is no progress guarantee: the failure of some thread may indefinitely delay reclamation, and the ensuing memory exhaustion will eventually block all threads. • Quiescent-State-Based Reclamation (QSBR) • "A quiescent state for thread T is a state in which T holds no references to shared nodes" • Example: a context switch • "A grace period is a time interval [a,b] such that, after time b, all nodes removed before time a may safely be reclaimed" = "any interval of time during which all threads pass through at least one quiescent state" • Example: read-copy update (RCU) • source: Thomas E. Hart, Paul E. McKenney and Angela Demke Brown. Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation. 2006

  12. QSBR

    for (i = 0; i < 100; i++)
        if (list_find(L, i))
            break;
    quiescent_state();

  [timeline diagram: quiescent states (QS) of threads T1, T2, and T3 over time; a grace period is an interval during which every thread passes through at least one QS]

  13. RCU in detail • Reader-side example

    /* Read-only search using locking */
    struct el *p;
    read_lock(&list_lock);
    p = search(mykey);
    if (p == NULL) {
        /* handle error condition */
    } else {
        /* access *p w/out modifying */
    }
    read_unlock(&list_lock);

    /* Read-only search using RCU */
    struct el *p;
    rcu_read_lock();   /* nop unless CONFIG_PREEMPT */
    p = search(mykey);
    if (p == NULL) {
        /* handle error condition */
    } else {
        /* access *p w/out modifying */
    }
    rcu_read_unlock(); /* nop unless CONFIG_PREEMPT */

  14. RCU in detail • Writer-side example

    /* Deletion using RCU */
    void delete(long mykey)
    {
        struct el *p;
        spin_lock(&list_lock);
        p = search(mykey);
        if (p != NULL) {
            list_del_rcu(p);
            spin_unlock(&list_lock);
            call_rcu(&p->rcuhead,
                     (void (*)(void *))my_free, p);
            return;
        }
        spin_unlock(&list_lock);
    }

    /* Deletion using locking */
    void delete(long mykey)
    {
        struct el *p;
        write_lock(&list_lock);
        p = search(mykey);
        if (p != NULL)
            list_del(p);
        write_unlock(&list_lock);
        if (p != NULL)
            my_free(p);
    }

  15. Detecting a Grace Period 1. An entity needing to wait for a grace period enqueues a callback onto a per-CPU list. (nxlist, curlist) 2. Some time later, this CPU informs all other CPUs of the beginning of a grace period. (bitmask) 3. As each CPU learns of the new grace period, it takes a snapshot of its quiescent-state counters. 4. Each CPU periodically compares its snapshot against the current values of its quiescent-state counters. As soon as any of the counters differ from the snapshot, the CPU records the fact that it has passed through a quiescent state. 5. The last CPU to record that it has passed through a quiescent state also records the fact that the grace period has ended. 6. As each CPU learns that a grace period has ended, it executes any of its callbacks that were waiting for the end of that grace period.
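The counter-snapshot logic of steps 2-5 can be modeled in a few lines of single-threaded C; NCPUS, qs_ctr, and snap are illustrative names, and real RCU does this per-CPU with proper memory ordering:

```c
/* Hypothetical single-file model of grace-period detection:
 * each "CPU" has a quiescent-state counter; a grace period starts
 * by snapshotting all counters and ends once every counter has
 * advanced past its snapshot. */
#define NCPUS 4

static unsigned long qs_ctr[NCPUS];   /* bumped at each quiescent state */
static unsigned long snap[NCPUS];     /* snapshot taken at GP start     */

/* Step 3: record the counters as of the start of the grace period. */
static void grace_period_start(void)
{
    for (int i = 0; i < NCPUS; i++)
        snap[i] = qs_ctr[i];
}

/* A CPU passes through a quiescent state (e.g. a context switch). */
static void quiescent_state(int cpu)
{
    qs_ctr[cpu]++;
}

/* Steps 4-5: the grace period has ended once every CPU's counter
 * has changed since the snapshot, i.e. each CPU passed a QS. */
static int grace_period_done(void)
{
    for (int i = 0; i < NCPUS; i++)
        if (qs_ctr[i] == snap[i])
            return 0;
    return 1;
}
```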

  16. Epoch-Based Reclamation • Follows QSBR in using grace periods • Uses a critical-region primitive inside each operation instead of a quiescent-state primitive outside the operation • Application-independent

    int search(struct list *l, long key)
    {
        node_t *cur;
        critical_enter();
        for (cur = l->list_head; cur != NULL; cur = cur->next) {
            if (cur->key >= key) {
                critical_exit();
                return (cur->key == key);
            }
        }
        critical_exit();
        return 0;
    }
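A hypothetical single-threaded model of the epoch machinery behind critical_enter/critical_exit, using the three limbo lists typical of EBR; the names (ebr_retire, ebr_advance) are illustrative, and a real implementation advances the epoch only after checking every thread in a critical region has observed the current one:

```c
/* Minimal EBR model: three limbo lists indexed by epoch; a retired
 * node is freed only when the global epoch has advanced twice, so no
 * thread still in a critical region can reference it. */
#include <stdlib.h>

#define NEPOCHS 3

struct ebr_node { struct ebr_node *next; };

static int global_epoch;
static struct ebr_node *limbo[NEPOCHS];

/* Retire a removed node into the current epoch's limbo list. */
static void ebr_retire(struct ebr_node *n)
{
    n->next = limbo[global_epoch];
    limbo[global_epoch] = n;
}

/* Advance the epoch and free the list two epochs back, which no
 * reader can still see.  Returns the number of nodes freed. */
static int ebr_advance(void)
{
    global_epoch = (global_epoch + 1) % NEPOCHS;
    int old = (global_epoch + 1) % NEPOCHS;   /* epoch two behind */
    int freed = 0;
    while (limbo[old]) {
        struct ebr_node *n = limbo[old];
        limbo[old] = n->next;
        free(n);
        freed++;
    }
    return freed;
}
```

Freeing from the epoch two behind the current one is what guarantees that no thread still inside a critical region from the retirement epoch can hold a reference to the freed nodes.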

  17. Non-blocking schemes • What is a non-blocking scheme? • Reclamation cannot be indefinitely delayed, so there is no risk of blocking due to memory exhaustion. • Hazard-Pointer-Based Reclamation (HPBR) • Uses hazard pointers to protect nodes from reclamation by other threads. • [diagram: hazard-pointer array HP[0]–HP[3]; threads T1 and T2 publish pointers to the nodes they access, protecting node N]
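A minimal sketch of the hazard-pointer discipline; hp_set, hp_clear, and hp_try_free are illustrative names, and real HPBR also needs memory fences between publishing a pointer and re-validating it, plus batched scans for efficiency:

```c
/* Sketch: readers publish the pointer they are about to dereference
 * in a hazard-pointer slot; a reclaimer may free a removed node only
 * if no slot still points to it. */
#include <stdlib.h>
#include <stddef.h>

#define NHAZARDS 4

static void *hp[NHAZARDS];            /* one slot per reader here */

/* Reader: publish p before dereferencing it; clear when done. */
static void hp_set(int slot, void *p) { hp[slot] = p; }
static void hp_clear(int slot)        { hp[slot] = NULL; }

/* Reclaimer: free p only when no hazard pointer protects it.
 * Returns 1 if freed, 0 if still hazardous (caller retries later). */
static int hp_try_free(void *p)
{
    for (int i = 0; i < NHAZARDS; i++)
        if (hp[i] == p)
            return 0;                 /* still in use by a reader */
    free(p);
    return 1;
}
```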

  18. Outline • Introduction • Memory Reclamation Schemes • Experimental Methodology • Performance Analysis • Conclusion

  19. Performance Factors • Memory Consistency (memory fences) • Data structures and Workload • Linked list/queue • Read-mostly/update-heavy • Threads and Scheduling • Contention • Preemption • Memory Constraints

  20. Test Program

    while (parent's timer has not expired) {
        for i from 1 to 100 do {
            key = random key;
            op = random operation;
            d = data structure;
            op(d, key);
        }
        if (using QSBR)
            quiescent_state();
    }

  • Execution time = test duration / total # of operations • CPU time = execution time / min{# of threads, # of CPUs}

  21. Outline • Introduction • Memory Reclamation Schemes • Experimental Methodology • Performance Analysis • Conclusion

  22. Scalability with Traversal Length

  23. Scalability with Traversal Length • Explanation: the dominant factor is per-operation atomic instructions (memory fences) • HPBR requires O(n) fences for a traversal of n nodes • EBR requires O(1) fences • QSBR needs only one fence per several operations

  24. Scalability with Threads (No Preemption) • The relative performance is constant

  25. Scalability with Threads (With Preemption) • In the read-only case, the only relevant factor is read-side overhead • In the write-heavy case, HPBR performs best due to its non-blocking design

  26. Scalability with Threads (With Preemption) • With a busy-waiting policy instead of a yielding policy, a preempted thread can stall the grace period and exhaust all memory

  27. Fair Evaluation of Algorithms • Lock-free performance improves as the update fraction increases, because lock-free updates require fewer atomic instructions than locking does. • Two lockless algorithms can be compared accurately only when both use the same reclamation scheme.

  28. Outline • Introduction • Memory Reclamation Schemes • Experimental Methodology • Performance Analysis • Conclusion

  29. Conclusion • Different schemes have very different overheads • There is no globally optimal scheme • QSBR is usually the best reclamation scheme • HPBR is good when there is preemption and a high update fraction
