Wait-Free Queues with Multiple Enqueuers and Dequeuers Alex Kogan, Erez Petrank Computer Science, Technion, Israel
FIFO queues • One of the most fundamental and common data structures [diagram: a queue holding 5, 3, 2, 9; elements are enqueued at one end and dequeued at the other]
Concurrent FIFO queues • A concurrent implementation supports “correct” concurrent addition and removal of elements • correct = linearizable • The access to the shared memory must be synchronized [diagram: multiple threads enqueue and dequeue concurrently; a dequeue on an empty queue reports “empty!”]
Non-blocking synchronization • No thread is blocked waiting for another thread to complete its operation • e.g., no locks / critical sections • Progress guarantees: • Obstruction-freedom • progress is guaranteed only in the eventual absence of interference • Lock-freedom • among all threads trying to apply an operation, one will succeed • Wait-freedom • a thread completes its operation in a bounded number of steps
Lock-freedom • Among all threads trying to apply an operation, one will succeed • an opportunistic approach • make attempts until succeeding • global progress • all but one thread may starve • Many efficient and scalable lock-free queue implementations exist
Wait-freedom • A thread completes its operation in a bounded number of steps • regardless of what other threads are doing • A highly desired property of any concurrent data structure • but commonly regarded as inefficient and too costly to achieve • Particularly important in several domains • real-time systems • systems operating under an SLA • heterogeneous environments
Related work: existing wait-free queues • Limited concurrency • one enqueuer and one dequeuer [Lamport’83] • multiple enqueuers, one concurrent dequeuer [Jayanti&Petrovic’05] • multiple dequeuers, one concurrent enqueuer [David’04] • Universal constructions [Herlihy’91] • a generic method to transform any (sequential) object into a lock-free/wait-free concurrent object • expensive, impractical implementations • (Almost) no experimental results
Related work: lock-free queue [Michael & Scott’96] • One of the most scalable and efficient lock-free implementations • Widely adopted by industry • part of the Java concurrency package (java.util.concurrent) • Relatively simple and intuitive implementation • Based on a singly-linked list of nodes [diagram: nodes 12, 4, 17 linked between head and tail]
MS-queue brief review: enqueue [animation: enqueue(9) links a new node after the last node with a CAS on its next pointer, then advances tail with a second CAS; a concurrent enqueue(5) performs the same two steps in turn]
MS-queue brief review: dequeue [animation: a CAS advances head to the next node and the dequeued value (12) is returned] • a code sketch of the queue follows below
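For readers who want the review in code: below is a minimal Java sketch of the MS-queue. Class and method names are ours; the sketch simplifies the published algorithm, relying on the garbage collector to rule out ABA and omitting some of the original consistency re-checks.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the Michael & Scott lock-free queue.
class MSQueue<T> {
    static class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head, tail;

    MSQueue() {
        Node<T> dummy = new Node<>(null);          // sentinel node
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T v) {
        Node<T> node = new Node<>(v);
        while (true) {                             // retry loop: lock-free, not wait-free
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (next == null) {
                // CAS 1: link the new node after the current last node
                if (last.next.compareAndSet(null, node)) {
                    // CAS 2: swing tail to the new node (a peer may do this for us)
                    tail.compareAndSet(last, node);
                    return;
                }
            } else {
                tail.compareAndSet(last, next);    // help advance a lagging tail
            }
        }
    }

    T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> next = first.next.get();
            if (next == null) throw new IllegalStateException("empty");
            // CAS: advance head; the new first node carries the dequeued value
            if (head.compareAndSet(first, next)) return next.value;
        }
    }
}
```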
Our idea (in a nutshell) • Based on the lock-free queue by Michael & Scott • Helping mechanism • each operation is applied in a bounded time • “Wait-free” implementation scheme • each operation is applied exactly once
Helping mechanism • Each operation is assigned a dynamic age-based priority • inspired by the Doorway mechanism used in the Bakery mutex • Each thread accessing the queue • chooses a monotonically increasing phase number • writes down its phase and operation info in a special state array (one entry per thread: phase: long, pending: boolean, enqueue: boolean, node: Node) • helps all threads with a non-larger phase to apply their operations — see the sketch below
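A Java sketch of this state entry and the basic helping loop. OpDesc, NUM_THREADS, and the helper names are illustrative; the Node type is sketched with the internal-structures slides below, helpEnq/helpDeq after the walkthroughs, and the fields live inside the queue class.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// State entry per thread, with the fields shown on the slide.
class OpDesc {
    final long phase;       // age-based priority chosen by the thread
    final boolean pending;  // true until the operation has been linearized
    final boolean enqueue;  // operation type: true = enqueue, false = dequeue
    final Node node;        // node to insert, or node involved in the removal
    OpDesc(long phase, boolean pending, boolean enqueue, Node node) {
        this.phase = phase; this.pending = pending;
        this.enqueue = enqueue; this.node = node;
    }
}

// One entry per thread; an entry is replaced atomically with CAS.
final AtomicReferenceArray<OpDesc> state = new AtomicReferenceArray<>(NUM_THREADS);

// Basic helping loop: help every pending operation whose phase is not
// larger than ours (helpEnq/helpDeq are sketched later).
void help(long phase) {
    for (int i = 0; i < state.length(); i++) {
        OpDesc desc = state.get(i);
        if (desc != null && desc.pending && desc.phase <= phase) {
            if (desc.enqueue) helpEnq(i, desc.phase);
            else helpDeq(i, desc.phase);
        }
    }
}
```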
Helping mechanism in action [animation: the state array holds one entry per thread with its phase, pending and enqueue flags, and node reference; a thread choosing a new phase (e.g., 10) must help every pending entry with a non-larger phase (“I need to help!”), while pending entries with a larger phase (e.g., 11) need not be helped (“I do not need to help!”)]
Helping mechanism in action • The number of operations that may linearize before any given operation is bounded • hence, wait-freedom
Optimized helping • The basic scheme has two drawbacks: • the number of steps executed by each thread on every operation depends on n (the number of threads) • even when there is no contention • it creates scenarios where many threads help the same operations • e.g., when many threads access the queue concurrently • a large amount of redundant work • Optimization: help one thread at a time, in a cyclic manner (sketched below) • faster threads help slower peers in parallel • reduces the amount of redundant work
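One way this optimization could look, continuing the sketch above; nextToHelp would be a thread-local field in a real implementation, and the paper's exact scheme may differ in details.

```java
// Optimized helping: each thread helps at most one peer per own
// operation, cycling through the state array.
int nextToHelp = 0;  // thread-local in a real implementation

void helpOne(long myPhase) {
    OpDesc desc = state.get(nextToHelp);
    if (desc != null && desc.pending && desc.phase <= myPhase) {
        if (desc.enqueue) helpEnq(nextToHelp, desc.phase);
        else helpDeq(nextToHelp, desc.phase);
    }
    nextToHelp = (nextToHelp + 1) % state.length();  // a different peer next time
}
```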
How to choose the phase numbers • Every time ti chooses a phase number, it is greater than the number chosen by any thread that completed its choice before ti • defines a logical order on operations and provides wait-freedom • Like in the Bakery mutex: • scan through state • calculate the maximal phase value + 1 • requires O(n) steps • Alternative: use an atomic counter • requires O(1) steps [diagram: scanning phases 4, 3, 5 in state yields 6 as the next phase] • both options are sketched below
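Both options in code, continuing the sketch above. Note that with concurrent choices the Bakery-style scan may hand out equal phases; as in the Bakery mutex, the algorithm has to tolerate such ties.

```java
import java.util.concurrent.atomic.AtomicLong;

// Bakery-style choice: O(n) scan of state for the maximal phase, plus one.
long maxPhase() {
    long max = -1;
    for (int i = 0; i < state.length(); i++) {
        OpDesc d = state.get(i);
        if (d != null && d.phase > max) max = d.phase;
    }
    return max + 1;  // greater than any phase chosen strictly earlier
}

// Alternative: a shared counter gives a unique, increasing phase in O(1).
final AtomicLong phaseCounter = new AtomicLong();
long nextPhase() { return phaseCounter.getAndIncrement(); }
```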
“Wait-free” design scheme • Break each operation into three atomic steps • can be executed by different threads • cannot be interleaved • Initial change of the internal structure • concurrent operations realize that there is an operation-in-progress • Updating the state of the operation-in-progress as being performed (linearized) • Fixing the internal structure • finalizing the operation-in-progress
Internal structures • The queue is a singly-linked list of nodes between head and tail, alongside the per-thread state array [diagram: queue nodes and the state array] • enqTid: int — holds the ID of the thread that performs / has performed the insertion of the node into the queue [diagram: elements annotated “these elements were enqueued by Thread 0”, “this element was enqueued by Thread 1”; -1 = not set] • deqTid: int — holds the ID of the thread that performs / has performed the removal of the node from the queue [diagram: “this element was dequeued by Thread 1”; -1 = not set] • a Node sketch follows below
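A Java sketch of this node layout; field names follow the slides, and -1 means “not set”.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Queue node with the two thread-ID fields from these slides.
class Node {
    final int value;
    final AtomicReference<Node> next = new AtomicReference<>(null);
    final int enqTid;                                   // inserting thread's ID, -1 for the sentinel
    final AtomicInteger deqTid = new AtomicInteger(-1); // removing thread's ID, set by CAS in Step 1 of dequeue
    Node(int value, int enqTid) {
        this.value = value;
        this.enqTid = enqTid;
    }
}
```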
enqueue operation [animation: the thread with ID 2 enqueues the value 6] • Creating a new node: the thread allocates a node holding 6 with enqTid = 2 • Announcing a new operation: it installs a new state entry with a fresh phase (10), pending = true, enqueue = true, and a reference to the new node • Step 1: Initial change of the internal structure — a CAS links the node after the last node • Step 2: Updating the state of the operation-in-progress as being performed — a CAS on the state entry clears pending • Step 3: Fixing the internal structure — a CAS advances tail to the new node
enqueue operation, concurrent case [animation: thread 0 creates a node holding 3 and announces its operation with phase 11 while Step 1 of thread 2’s enqueue is already applied (tail.next is non-null); thread 0 first helps finish that operation — Step 2: a CAS marks thread 2’s state entry as performed, Step 3: a CAS advances tail — and only then applies Step 1 of its own operation with a CAS] • a code sketch of these steps follows below
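A condensed Java sketch of the three enqueue steps with helping, continuing the sketches above. head and tail are AtomicReference&lt;Node&gt; fields of the queue, the method names are ours, and several checks of the full algorithm are abridged; consult the paper for the exact code.

```java
boolean isStillPending(int tid, long phase) {
    OpDesc d = state.get(tid);
    return d.pending && d.phase <= phase;
}

void helpEnq(int tid, long phase) {
    while (isStillPending(tid, phase)) {
        Node last = tail.get();
        Node next = last.next.get();
        if (last != tail.get()) continue;        // inconsistent snapshot: retry
        if (next == null) {                      // no enqueue mid-flight
            if (isStillPending(tid, phase)) {
                // Step 1: link the announced node after the last node
                if (last.next.compareAndSet(null, state.get(tid).node)) {
                    helpFinishEnq();
                    return;
                }
            }
        } else {
            helpFinishEnq();                     // an enqueue is mid-flight: finish it first
        }
    }
}

void helpFinishEnq() {
    Node last = tail.get();
    Node next = last.next.get();
    if (next != null) {
        int tid = next.enqTid;                   // owner of the in-progress enqueue
        OpDesc cur = state.get(tid);
        if (last == tail.get() && cur.node == next) {
            // Step 2: mark the owner's operation as performed (linearized)
            OpDesc done = new OpDesc(cur.phase, false, true, next);
            state.compareAndSet(tid, cur, done);
            // Step 3: fix the internal structure by advancing tail
            tail.compareAndSet(last, next);
        }
    }
}
```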
dequeue operation [animation: the thread with ID 2 dequeues] • Announcing a new operation: the thread installs a new state entry with a fresh phase (10), pending = true, enqueue = false • Updating state to refer to the first node: a CAS stores a reference to the current first node in the state entry • Step 1: Initial change of the internal structure — a CAS sets the first node’s deqTid to 2 • Step 2: Updating the state of the operation-in-progress as being performed — a CAS on the state entry clears pending • Step 3: Fixing the internal structure — a CAS advances head past the removed node • a code sketch follows below
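A matching condensed sketch of the dequeue steps, under the same caveats; the empty-queue path and some race checks are simplified.

```java
void helpDeq(int tid, long phase) {
    while (isStillPending(tid, phase)) {
        Node first = head.get();
        Node last = tail.get();
        Node next = first.next.get();
        if (first != head.get()) continue;       // inconsistent snapshot: retry
        if (first == last) {
            if (next == null) {                  // queue is empty
                OpDesc cur = state.get(tid);
                if (last == tail.get() && isStillPending(tid, phase)) {
                    // linearize an "empty" result with node = null
                    state.compareAndSet(tid, cur,
                            new OpDesc(cur.phase, false, false, null));
                }
            } else {
                helpFinishEnq();                 // an enqueue is mid-flight
            }
        } else {                                 // queue is not empty
            OpDesc cur = state.get(tid);
            if (!isStillPending(tid, phase)) break;
            // record which node the dequeuer is about to remove
            if (first == head.get() && cur.node != first) {
                OpDesc upd = new OpDesc(cur.phase, true, false, first);
                if (!state.compareAndSet(tid, cur, upd)) continue;
            }
            // Step 1: stamp the first node with the dequeuer's ID
            first.deqTid.compareAndSet(-1, tid);
            helpFinishDeq();
        }
    }
}

void helpFinishDeq() {
    Node first = head.get();
    Node next = first.next.get();
    int tid = first.deqTid.get();                // owner of the in-progress dequeue
    if (tid != -1) {
        OpDesc cur = state.get(tid);
        if (first == head.get() && next != null) {
            // Step 2: mark the owner's operation as performed (linearized)
            OpDesc done = new OpDesc(cur.phase, false, false, cur.node);
            state.compareAndSet(tid, cur, done);
            // Step 3: fix the internal structure by advancing head
            head.compareAndSet(first, next);
        }
    }
}
```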
Benchmarks • Enqueue-Dequeue benchmark • the queue is initially empty • each thread iteratively performs enqueue and then dequeue • 1,000,000 iterations per thread • 50%-Enqueue benchmark • the queue is initialized with 1000 elements • each thread decides uniformly at random which operation to perform, with equal odds for enqueue and dequeue • 1,000,000 operations per thread • a sketch of the driver loop follows below
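A sketch of the enqueue-dequeue driver; WFQueue and the enq/deq signatures are placeholders for the queue under test.

```java
// Each thread alternates enqueue and dequeue; completion time is the
// time until the last thread finishes.
static void enqDeqBenchmark(WFQueue q, int nThreads) throws InterruptedException {
    final int ITERS = 1_000_000;
    Thread[] workers = new Thread[nThreads];
    for (int t = 0; t < nThreads; t++) {
        final int tid = t;
        workers[t] = new Thread(() -> {
            for (int i = 0; i < ITERS; i++) {    // enqueue, then dequeue
                q.enq(tid, i);                   // assumed signature: thread ID + value
                q.deq(tid);
            }
        });
    }
    long start = System.nanoTime();
    for (Thread w : workers) w.start();
    for (Thread w : workers) w.join();
    System.out.printf("%d threads: %.0f ms%n",
            nThreads, (System.nanoTime() - start) / 1e6);
}
```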
Tested algorithms Compared implementations: • MS-queue • Base wait-free queue • Optimized wait-free queue • Opt 1: optimized helping (help one thread at a time) • Opt 2: atomic counter-based phase calculation • Measure completion time as a function of # threads
Enqueue-Dequeue benchmark • TBD: add figures
The impact of optimizations • TBD: add figures
Optimizing further: false sharing • Created by accesses to the state array • Resolved by stretching the state with dummy pads (sketched below) • TBD: add figures
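One way to stretch state with dummy pads, continuing the sketch above; the padding factor is illustrative and depends on cache-line and reference sizes.

```java
// Leave PAD-1 unused slots between consecutive entries so that entries of
// different threads land on different cache lines (e.g., 64-byte lines
// and 4-byte compressed references).
static final int PAD = 16;

final AtomicReferenceArray<OpDesc> paddedState =
        new AtomicReferenceArray<>(NUM_THREADS * PAD);

OpDesc getState(int tid) {
    return paddedState.get(tid * PAD);           // entry i lives at slot i*PAD
}

boolean casState(int tid, OpDesc expect, OpDesc update) {
    return paddedState.compareAndSet(tid * PAD, expect, update);
}
```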
Optimizing further: memory management • Every attempt to update state is preceded by the allocation of a new record • these records can be reused when the attempt fails • (more) validation checks can be performed to reduce the number of failed attempts • When an operation is finished, remove the reference from its state entry to the list node • helps the garbage collector
Implementing the queue without GC • Apply the Hazard Pointers technique [Michael’04] (sketched below) • each thread is associated with hazard pointers • single-writer multi-reader registers • used by threads to point to objects they may access later • when an object should be deleted, a thread stores its address in a special stack • once in a while, it scans the stack and recycles objects only if no hazard pointer points to them • In our case, the technique can be applied with a slight modification in the dequeue method
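A highly simplified hazard-pointer sketch. Since Java is garbage-collected, this only illustrates how the protocol would look in a manually managed runtime; the names and THRESHOLD value are ours.

```java
static final int THRESHOLD = 64;                 // illustrative scan trigger

final AtomicReferenceArray<Node> hazard =
        new AtomicReferenceArray<>(NUM_THREADS); // one SWMR slot per thread

Node protectHead(int tid) {
    while (true) {
        Node h = head.get();
        hazard.set(tid, h);                      // publish: "I may access h"
        if (h == head.get()) return h;           // validate h is still the head
    }
}

void retire(Node n, java.util.List<Node> retired) {
    retired.add(n);                              // defer reclamation
    if (retired.size() >= THRESHOLD) {
        // recycle only nodes no thread currently protects
        retired.removeIf(node -> !isProtected(node));
    }
}

boolean isProtected(Node node) {
    for (int i = 0; i < hazard.length(); i++)
        if (hazard.get(i) == node) return true;
    return false;
}
```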
Summary • First wait-free queue implementation supporting multiple enqueuers and dequeuers • Wait-freedom incurs an inherent trade-off • it bounds the completion time of a single operation • but has a cost in the “typical” case • The additional cost can be reduced to a tolerable level • The proposed design scheme might be applicable to other wait-free data structures