A Dynamic Elimination-Combining Stack Algorithm Gal Bar-Nissan, Danny Hendler and Adi Suissa, Department of Computer Science, BGU, January 2011 Presented by: Ilya Mirsky, 28.03.2011
Outline • Concurrent programming terms • Motivation • Introduction • DECS: The Algorithm • DECS Performance evaluation • NB-DECS • Summary
Concurrent programming terms • Locks (coarse- and fine-grained) • Non-blocking algorithms • Wait-freedom • Lock-freedom • Obstruction-freedom • Linearizability • Memory contention • Latency
Motivation • Concurrent stacks are widely used in parallel applications and operating systems. • A simple implementation using a coarse-grained locking mechanism creates a "hot spot" at the central stack object and poses a sequential bottleneck. • There is a need for a scalable concurrent stack that performs well under low, medium, and high workloads, with no dependency on the ratio of operation types (push/pop).
Introduction • Two key synchronization paradigms for the construction of scalable concurrent data structures are software combining and elimination. • The most highly scalable concurrent stack algorithm previously known is the lock-free elimination-backoff stack (Hendler, Shavit, Yerushalmi). • The HSY stack is highly efficient under low contention, as well as under high contention when the workload is symmetric. • Unfortunately, when workloads are asymmetric, the performance of HSY deteriorates to that of a sequential stack. • Flat combining (by Hendler et al.) significantly outperforms HSY at low and medium contention levels, but it does not scale and even deteriorates at high contention levels.
Introduction - DECS • DECS employs both the combining and the elimination mechanisms. • Scales well for all workload types, and outperforms other stack implementations. • Maintains the simplicity and low overhead of the HSY stack. • Uses a contention-reduction layer as a backoff scheme for the central stack: an elimination-combining layer. • A non-blocking implementation, NB-DECS, is also presented: a lock-free variant of DECS in which threads that have waited for too long may cancel their "combining contract" and retry their operation on the central stack.
Introduction - DECS [Animation: operations meet in the elimination-combining layer in front of the central stack; colliding threads delegate their operations to a single delegate thread and sleep ("zzz…") while it applies the combined operations on the central stack, and are woken ("Wake up!") when their operations complete.]
DECS - The Algorithm • The data structures
MultiOp: int id; int op; int length; int cStatus; Cell cell; MultiOp next; MultiOp last;
Cell: Data data; Cell next;
Elimination-combining layer: collision array and locations array
Central stack: a linked list of Cells
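The records above can be sketched as plain C++ structs. Field names follow the slide; the `Data` typedef and the status and operation constants are assumptions added for illustration, not the paper's definitions:

```cpp
#include <cassert>

// Sketch of the DECS records listed above. Field names follow the
// slide; the Data typedef and the enum constants are illustrative.
typedef int Data;

enum CStatus { INIT, FINISHED };
enum OpType  { PUSH, POP };

struct Cell {
    Data data;   // the pushed value
    Cell* next;  // next cell on the central stack
};

struct MultiOp {
    int id;        // id of the thread that issued the operation
    int op;        // PUSH or POP
    int length;    // number of operations delegated in this multi-op
    int cStatus;   // collision status (e.g. INIT, FINISHED)
    Cell cell;     // cell carrying this operation's data
    MultiOp* next; // next operation delegated to the same thread
    MultiOp* last; // tail of the delegated-operation list
};
```

The `next`/`last` pointers let a delegate thread link several delegated operations into one list and apply them in a single pass over the central stack.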
DECS - The Algorithm [Animation: three threads invoke push(data1), push(data2), and pop() on the central stack; each wishes there were someone in a similar situation to collide with.]
DECS - The Algorithm • multiOptInfo = initMultiOp() (for a pop operation); multiOptInfo = initMultiOp(data) (for a push operation).
DECS - The Algorithm [Animation: thread 6 (push(data1), MultiOp: id=6, op=PUSH, length=1, cStatus=INIT) registers as a passive collider in the collision array and waits; thread 2 (pop(), MultiOp: id=2, op=POP, length=1, cStatus=INIT) arrives as an active collider, reads thread 6's entry through the locations array, and collides with it.]
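The passive/active hand-off pictured above can be illustrated with a single collision slot (the real layer uses a whole collision array plus a locations array; this one-slot version and its names are an illustration only):

```cpp
#include <atomic>
#include <cassert>

// One-slot sketch of the collision hand-off: a passive collider
// publishes its operation record; an active collider claims it with a
// compare-and-swap, so at most one active thread wins each partner.
struct OpRecord { int threadId; bool isPush; };

std::atomic<OpRecord*> slot{nullptr};

// Passive side: announce the record if the slot is free.
bool announce(OpRecord* rec) {
    OpRecord* expected = nullptr;
    return slot.compare_exchange_strong(expected, rec);
}

// Active side: try to capture a waiting partner for collision.
OpRecord* capture() {
    OpRecord* rec = slot.load();
    if (rec != nullptr && slot.compare_exchange_strong(rec, nullptr))
        return rec;  // collision succeeded
    return nullptr;  // nobody waiting, or another active thread won
}
```

The CAS in `capture` is what guarantees that two active colliders cannot both pair with the same passive thread.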
DECS - The Algorithm • Central Stack Functions
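The slide names the central-stack functions without showing their code. A minimal sequential sketch of a linked-list stack conveys the shape (the real DECS central stack applies a whole multi-op list per access and is accessed under contention, which this simplification omits):

```cpp
#include <cassert>

// Minimal sequential linked-list stack, for illustration only.
struct Node { int data; Node* next; };

Node* top = nullptr;

void push(int v) {
    top = new Node{v, top};  // new node points at the old top
}

bool pop(int* out) {
    if (top == nullptr) return false;  // empty stack
    Node* n = top;
    top = n->next;
    *out = n->data;
    delete n;
    return true;
}
```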
DECS - The Algorithm [Animation: thread 2 observes that thread 6's operation is a PUSH while its own is a POP, so the two can be eliminated; thread 6 sleeps meanwhile.]
DECS - The Algorithm • Elimination-Combining Layer Functions
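The elimination step in the scenario above can be sketched as follows. Two single operations with opposite semantics (PUSH vs POP) eliminate each other by transferring the pushed value directly to the popper, and both are marked FINISHED without touching the central stack; the struct and names are simplified stand-ins for the slide's MultiOp, not the paper's code:

```cpp
#include <cassert>

enum Op { PUSH, POP };
enum Status { INIT, FINISHED };

// Simplified stand-in for MultiOp: one value instead of a cell list.
struct SimpleOp {
    Op op;
    int length;
    Status cStatus;
    int data;
};

// Illustrative collide step: opposite single ops eliminate.
bool tryEliminate(SimpleOp& a, SimpleOp& b) {
    if (a.op != b.op && a.length == 1 && b.length == 1) {
        int value = (a.op == PUSH) ? a.data : b.data;
        if (a.op == POP) a.data = value; else b.data = value;
        a.length = b.length = 0;               // nothing left to apply
        a.cStatus = b.cStatus = FINISHED;      // both ops complete
        return true;
    }
    return false;  // identical semantics: combine instead (not shown)
}
```

When the operations have identical semantics, DECS combines them into one multi-op list for the delegate thread instead of eliminating them.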
DECS - The Algorithm [Animation: thread 2 performs the elimination while thread 6 sleeps ("Working…", "Done!"): data1 is transferred from thread 6's push to thread 2's pop, and both MultiOps move from length=1, cStatus=INIT to length=0, cStatus=FINISHED.]
DECS - The Algorithm [Animation: thread 2 wakes thread 6 ("Wake up man, I've done your job!"), and thread 6 thanks it ("Thank you T. 2, let's go have a beer; I'm buying!").]
DECS Performance Evaluation • Hardware • A 128-way UltraSparc T2 Plus (T5140) server: a 2-chip system in which each chip contains 8 cores, and each core multiplexes 8 hardware threads. • Running the Solaris 10 OS. • The cores in each CPU share the same L2 cache. • C++ code compiled with GCC with the -O3 flag. • Compared against: • The Treiber stack • The HSY elimination-backoff stack • The flat-combining stack
DECS Performance Evaluation • Course of experiments • Threads repeatedly apply operations on the stack for a fixed duration of 1 sec, and the resulting throughput is measured, varying the level of concurrency from 1 to 128. • Throughput is measured on both symmetric and asymmetric workloads. • Stacks are pre-populated with enough cells so that pop operations do not operate on an empty stack. • Each data point is the average of 3 runs.
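The measurement loop described above can be sketched as follows. This is a hedged illustration, not the paper's harness: worker threads repeatedly apply an operation until the fixed duration elapses, and the summed operation count gives the throughput; `apply_op` is a stand-in for push()/pop() on the stack under test.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

static void apply_op() { /* stand-in for a stack operation */ }

// Run nThreads workers for the given duration; return total ops done.
std::uint64_t run_benchmark(int nThreads, std::chrono::milliseconds duration) {
    std::atomic<bool> stop{false};
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < nThreads; ++i) {
        workers.emplace_back([&stop, &total] {
            std::uint64_t ops = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                apply_op();
                ++ops;
            }
            total.fetch_add(ops);  // publish this worker's count
        });
    }
    std::this_thread::sleep_for(duration);  // fixed measurement window
    stop.store(true);
    for (auto& t : workers) t.join();
    return total.load();
}
```

Varying `nThreads` from 1 to 128 and averaging several runs reproduces the shape of the experiment described on the slide.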
DECS Performance Evaluation • Symmetric workload [throughput graph; x-axis: number of threads]
DECS Performance Evaluation • Moderately-asymmetric workload [throughput graph; x-axis: number of threads]
DECS Performance Evaluation • Fully-asymmetric workload [throughput graph; x-axis: number of threads]
NB-DECS • DECS is blocking. • For some applications a non-blocking implementation may be preferable because it is more robust to thread failures. • NB-DECS is a lock-free variant of DECS that allows threads that have delegated their operations to another thread, and have waited for too long, to cancel their "combining contracts" and retry their operations on the central stack.
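The cancellation race implied above can be sketched with a single CAS on the operation's status. The names and states here are illustrative, not the paper's: a waiting thread CASes its operation from INIT to CANCELLED before retrying on the central stack, while the delegate CASes it from INIT to TAKEN before doing the work; exactly one CAS can win, so each operation is either cancelled or performed, never both.

```cpp
#include <atomic>
#include <cassert>

enum Status { INIT, TAKEN, CANCELLED };  // illustrative state names

// Waiting thread: give up on the combining contract.
bool try_cancel(std::atomic<int>& status) {
    int expected = INIT;
    return status.compare_exchange_strong(expected, CANCELLED);
}

// Delegate thread: claim the operation before working on it.
bool try_take(std::atomic<int>& status) {
    int expected = INIT;
    return status.compare_exchange_strong(expected, TAKEN);
}
```

Because both sides race through CAS on the same word, the variant stays lock-free: a thread whose cancellation fails knows its operation was (or will be) completed by the delegate.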
Summary • DECS comprises a combining-elimination layer, and therefore benefits from collisions of operations with reverse as well as identical semantics. • Empirical evaluation shows that DECS outperforms the best known stack algorithms for all workload types. • NB-DECS provides a lock-free variant for applications that require non-blocking progress. • The idea of a combining-elimination layer could be used to efficiently implement other concurrent data structures.