Topic 5: Synchronization and Costs for Shared Memory
“... You will be assimilated. Resistance is futile.” (Star Trek)
ELEG652-06F
Synchronization
• The orchestration of two or more threads (or processes) so that they complete a task correctly and avoid any data races
• Data race or race condition
  • “There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write”
• Atomicity and/or serializability
Atomicity
• Atomic: from the Greek “atomos”, which means indivisible
• An “all or none” scheme
• An instruction (or a group of instructions) appears to execute in a single, indivisible step
• All side effects of the instruction(s) in the block are seen in their entirety or not at all
• Side effects: writes and (causal) reads to the variables inside the atomic block
Atomicity
• Word-aligned loads and stores are atomic on almost all architectures
• Unaligned and larger-than-word accesses are usually not atomic
• What happens when non-atomic operations go wrong
  • The final result may be a garbled combination of values
  • Complete operations might be lost in the process
• Strong versus weak atomicity
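The "lost operation" case above can be illustrated with a shared counter. This is a minimal sketch in C11, assuming `<stdatomic.h>` is available; the names `plain_counter`, `atomic_counter`, `unsafe_increment`, `safe_increment` and `counter_demo` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared counters, for illustration only. */
static int        plain_counter  = 0;  /* unsafe when shared between threads */
static atomic_int atomic_counter = 0;  /* updated with one atomic RMW */

/* Non-atomic increment: three separate steps. A second thread can run
   between the read and the write, so one of the two updates is lost. */
void unsafe_increment(void) {
    int tmp = plain_counter;  /* read   */
    tmp += 1;                 /* modify */
    plain_counter = tmp;      /* write  */
}

/* Atomic increment: the read-modify-write is indivisible.
   Returns the value held before the increment. */
int safe_increment(void) {
    return atomic_fetch_add(&atomic_counter, 1);
}

/* Single-threaded sanity check; with one thread both versions agree. */
void counter_demo(void) {
    for (int i = 0; i < 1000; i++) {
        unsafe_increment();
        safe_increment();
    }
    assert(plain_counter == 1000);
    assert(atomic_load(&atomic_counter) == 1000);
}
```

With several threads calling `unsafe_increment` concurrently, the final value of `plain_counter` may fall short of the number of calls; `safe_increment` does not have that problem.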
Synchronization
• Applied to shared variables
• Synchronization may or may not enforce ordering
• High-level synchronization types
  • Semaphores
  • Mutexes
  • Barriers
  • Critical sections
  • Monitors
  • Condition variables
Semaphores
• Intelligent counters of resources
  • Zero means not available
• An abstract data type with two operations
  • P (probeer te verlagen, “try to decrease”): waits (busy-waits or sleeps) if the resource is not available
  • V (verhoog, “increase”): frees the resource
• Binary vs. blocking vs. counting semaphores
  • Binary: the initial value allows threads to obtain it
  • Blocking: the initial value blocks the threads
  • Counting: the initial value is not zero
• Note: P and V are atomic operations!
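The P and V operations described above can be sketched with C11 atomics. This is a busy-waiting illustration only, not how an operating system implements semaphores; the type `sem_sketch_t` and the functions `sem_init_sketch`, `sem_try_p`, `sem_p` and `sem_v` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int count; } sem_sketch_t;  /* hypothetical type */

void sem_init_sketch(sem_sketch_t *s, int initial) {
    assert(initial >= 0);  /* zero means "not available" */
    atomic_init(&s->count, initial);
}

/* One attempt at P: succeed only if a unit is available. The
   compare-exchange makes "check for non-zero, then decrement" atomic. */
bool sem_try_p(sem_sketch_t *s) {
    int c = atomic_load(&s->count);
    while (c > 0) {
        /* On failure, c is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return true;
    }
    return false;  /* count was zero: resource not available */
}

/* P: busy-wait until a unit can be taken. */
void sem_p(sem_sketch_t *s) {
    while (!sem_try_p(s))
        ;  /* spin; a scheduler-based version would sleep here */
}

/* V: release one unit. */
void sem_v(sem_sketch_t *s) {
    atomic_fetch_add(&s->count, 1);
}
```

Initializing the count to 1 gives the binary case, to 0 the blocking case, and to any larger value the counting case.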
Mutex
• Mutual exclusion lock
• A binary semaphore that ensures that one thread (and only one) accesses the resource at a time
  • P: lock the mutex
  • V: unlock the mutex
• It does not enforce ordering
• Fine- vs. coarse-grained locking
Barriers
• A high-level programming construct
• Ensures that every participating thread waits at a program point until all other participating threads have arrived, before any of them continues
• Types of barriers
  • Tree barriers (software assisted)
  • Centralized barriers
  • Tournament barriers
  • Fine-grained barriers
  • Butterfly-style barriers
  • Consistency barriers (e.g. #pragma omp flush)
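As one concrete instance of the centralized barrier listed above, here is a sense-reversing sketch in C11. It is an illustration under assumptions, not the course's implementation; `barrier_sketch_t`, `barrier_init_sketch` and `barrier_wait_sketch` are hypothetical names, and each thread is assumed to keep its own `local_sense` flag, initially false.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  remaining;  /* threads still to arrive in this episode */
    atomic_bool sense;      /* flips each time the barrier opens */
    int         total;      /* number of participating threads */
} barrier_sketch_t;         /* hypothetical type */

void barrier_init_sketch(barrier_sketch_t *b, int total) {
    assert(total > 0);
    b->total = total;
    atomic_init(&b->remaining, total);
    atomic_init(&b->sense, false);
}

/* Every participating thread calls this with its own local_sense. */
void barrier_wait_sketch(barrier_sketch_t *b, bool *local_sense) {
    *local_sense = !*local_sense;  /* the sense that marks this episode */
    if (atomic_fetch_sub(&b->remaining, 1) == 1) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store(&b->remaining, b->total);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;  /* spin until the last thread flips the shared sense */
    }
}
```

Reversing the sense on every episode lets the same barrier object be reused immediately, without a second counter to detect that all threads have left.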
Critical Sections
• A piece of code that is executed by one and only one thread at any point in time
• If T1 finds the CS in use, it waits until the CS is free
• Special case
  • Conditional critical sections: threads wait on a given signal to resume execution
• Better implemented with lock-free techniques (e.g. transactional memory)
Monitors and Condition Variables
• A monitor consists of:
  • A set of procedures to work on shared variables
  • A set of shared variables
  • An invariant
  • A lock to protect from access by other threads
• Condition variables
  • The invariant in a monitor (but it can be used in other schemes)
  • It is a signal placeholder for other threads' activities
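The monitor ingredients listed above (procedures, shared variables, an invariant, a lock, and a condition variable) can be sketched with POSIX threads. This is an illustrative assumption about how one might map the slide onto `pthread_mutex_t` and `pthread_cond_t`; the type `monitor_sketch_t` and the procedures `monitor_init`, `monitor_put` and `monitor_take` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;       /* protects the shared variable */
    pthread_cond_t  not_empty;  /* signalled when an item is added */
    int             items;      /* shared variable; invariant: items >= 0 */
} monitor_sketch_t;             /* hypothetical type */

void monitor_init(monitor_sketch_t *m) {
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->not_empty, NULL);
    m->items = 0;
}

/* Monitor procedure: add one item and wake one waiting thread. */
void monitor_put(monitor_sketch_t *m) {
    pthread_mutex_lock(&m->lock);
    m->items++;
    pthread_cond_signal(&m->not_empty);
    pthread_mutex_unlock(&m->lock);
}

/* Monitor procedure: wait until an item exists, then remove it.
   Returns the number of items left. */
int monitor_take(monitor_sketch_t *m) {
    pthread_mutex_lock(&m->lock);
    while (m->items == 0)  /* re-check after every wake-up */
        pthread_cond_wait(&m->not_empty, &m->lock);
    m->items--;
    assert(m->items >= 0);  /* the invariant holds on exit */
    int left = m->items;
    pthread_mutex_unlock(&m->lock);
    return left;
}
```

The `while` loop around `pthread_cond_wait` matters: a woken thread must re-check the condition, because another thread may have taken the item first.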
Much More …
• However, all of these are abstractions
• Major elements
  • A synchronization element that ensures atomicity: locks!
  • A synchronization element that ensures ordering: barriers!
• Implementations and types
  • Common types of atomic primitives
  • Read-modify-write cycles
• Synchronization overhead may break a system
  • Unnecessary consistency actions
  • Communication cost between threads
• Why do distributed memory machines have “implicit” synchronization?
Topic 5a: Locks
Implementation
• Atomic primitives
  • Fetch-and-Φ operations
    • Read-modify-write cycles
    • Test-and-set
    • Fetch-and-store
      • Exchange register and memory
    • Fetch-and-add
    • Compare-and-swap
      • Conditionally exchange the value of a memory location
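For readers who want to try these primitives, the sketch below maps each one onto a standard C11 `<stdatomic.h>` operation. The wrapper names (`test_and_set`, `fetch_and_store`, `fetch_and_add`, `compare_and_swap`, `primitives_demo`) are chosen here to match the slide and are not library functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* test-and-set: set the flag, report whether it was already set. */
bool test_and_set(atomic_flag *f) {
    return atomic_flag_test_and_set(f);
}

/* fetch-and-store: write a new value, return the old one. */
int fetch_and_store(atomic_int *loc, int new_value) {
    return atomic_exchange(loc, new_value);
}

/* fetch-and-add: add a delta, return the old value. */
int fetch_and_add(atomic_int *loc, int delta) {
    return atomic_fetch_add(loc, delta);
}

/* compare-and-swap: store new_value only if *loc still equals expected.
   Returns true when the swap happened. */
bool compare_and_swap(atomic_int *loc, int expected, int new_value) {
    return atomic_compare_exchange_strong(loc, &expected, new_value);
}

void primitives_demo(void) {
    atomic_flag f = ATOMIC_FLAG_INIT;
    atomic_int  x = 5;
    bool was_set = test_and_set(&f);          /* flag was clear */
    int  old1 = fetch_and_store(&x, 7);       /* x: 5 -> 7 */
    int  old2 = fetch_and_add(&x, 3);         /* x: 7 -> 10 */
    bool miss = compare_and_swap(&x, 9, 0);   /* x != 9: unchanged */
    bool hit  = compare_and_swap(&x, 10, 0);  /* x == 10: x -> 0 */
    assert(!was_set && old1 == 5 && old2 == 7);
    assert(!miss && hit && atomic_load(&x) == 0);
}
```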
Implementation
• Used by programmers to implement more complex synchronization constructs
• Waiting behavior
  • Scheduler based: the process/thread is descheduled and rescheduled at a later time
  • Busy wait: the process/thread polls the resource until it is available
  • Dependent on the hardware / OS / scheduler behavior
Types of (Software) Locks: The Spin Lock Family
• The simple test-and-set lock
  • Polls a shared Boolean variable: a binary semaphore
  • Uses fetch-and-Φ operations to operate on the binary semaphore
  • Expensive!
    • Wastes bandwidth
    • Generates extra bus transactions
• The test-and-test-and-set approach
  • Polls with plain reads while the lock is in use, and attempts the test-and-set only when the lock appears free
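The difference between the two polling strategies above is easiest to see side by side. The following C11 sketch is an illustration only; `spinlock_sketch_t` and the `spin_*`, `tas_acquire` and `ttas_acquire` functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_sketch_t;  /* hypothetical */

void spin_init(spinlock_sketch_t *l) { atomic_init(&l->locked, false); }

/* One test-and-set attempt; true means the caller now holds the lock. */
bool spin_try_acquire(spinlock_sketch_t *l) {
    return atomic_exchange(&l->locked, true) == false;
}

/* Simple test-and-set lock: every poll is an atomic write. */
void tas_acquire(spinlock_sketch_t *l) {
    while (!spin_try_acquire(l))
        ;
}

/* Test-and-test-and-set: poll with plain reads (served from the local
   cache) and attempt the atomic write only when the lock looks free. */
void ttas_acquire(spinlock_sketch_t *l) {
    for (;;) {
        while (atomic_load(&l->locked))
            ;  /* read-only spin */
        if (spin_try_acquire(l))
            return;
    }
}

void spin_release(spinlock_sketch_t *l) { atomic_store(&l->locked, false); }

void spin_demo(void) {
    spinlock_sketch_t l;
    spin_init(&l);
    tas_acquire(&l);                      /* free lock: acquired at once */
    bool second = spin_try_acquire(&l);   /* lock is held: must fail */
    assert(!second);
    spin_release(&l);
    ttas_acquire(&l);                     /* free again: acquired at once */
    spin_release(&l);
}
```

On a cache-coherent machine the read-only spin in `ttas_acquire` hits in the local cache, so bus traffic occurs mainly when the lock is released and the waiters race for it.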
Types of (Software) Locks: The Spin Lock Family
• Delay-based locks
  • Spin locks in which a delay is introduced between tests of the lock
  • Constant delay
  • Exponential back-off
    • Best results
    • The test-and-test-and-set scheme is not needed
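Types of (Software) Locks: The Spin Lock Family
Pseudocode:
enum lock_state { UNLOCKED, LOCKED };
typedef enum lock_state lock_t;

/* test_and_set(L, v): atomically stores v in *L and returns the old value */
void acquire_lock(lock_t *L) {
    int delay = 1;
    /* keep trying while the lock was already held by someone else */
    while (test_and_set(L, LOCKED) == LOCKED) {
        sleep(delay);   /* back off before the next attempt */
        delay *= 2;     /* exponential back-off */
    }
}

void release_lock(lock_t *L) {
    *L = UNLOCKED;
}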
Types of (Software) Locks: The Ticket Lock
• Reduces the number of fetch-and-Φ operations
  • Only one per lock acquisition
• Strongly fair lock
  • No starvation
• A FIFO service
• Implementation: two counters
  • A request counter and a release counter
Types of (Software) Locks: The Ticket Lock (animation frames, threads T1 to T5)
• Request = 0, Release = 0: T1 acquires the lock
• Request = 1, Release = 0: T2 requests the lock
• Request = 2, Release = 0: T3 requests the lock
• Request = 3, Release = 1: T1 releases the lock, T2 gets the lock, T4 requests the lock
• Request = 4, Release = 1: T5 requests the lock
• Request = 5, Release = 1: T1 requests the lock
• Request = 5, Release = 2: T2 releases the lock, T3 acquires the lock
Types of (Software) Locks: The Ticket Lock
• Reduces the number of fetch-and-Φ operations
  • Only read operations on the release counter
• However, a lot of memory and network bandwidth is still wasted
• Back-off techniques are also used
  • Exponential back-off
    • A bad idea
  • Constant delay
    • Minimum time of holding a lock
  • Proportional back-off
    • Dependent on how many threads are waiting for the lock
Types of (Software) Locks: The Ticket Lock
Pseudocode:
unsigned int next_ticket = 0;
unsigned int now_serving = 0;

void acquire_lock() {
    /* atomically take the next ticket; returns the value before the increment */
    unsigned int my_ticket = fetch_and_increment(&next_ticket);
    while (1) {
        /* proportional back-off: wait longer the further back we are in line */
        sleep(my_ticket - now_serving);
        if (now_serving == my_ticket)
            return;
    }
}

void release_lock() {
    now_serving = now_serving + 1;
}
Types of (Software) Locks: The Array-Based Queue Lock
• Motivation: contention on the ticket lock's release counter
  • Cache coherence and memory traffic
  • Invalidation of the counter variable and requests to a single memory bank
• Two elements
  • An array and a tail pointer that indexes the array
  • The array is as big as the number of processors
• Fetch-and-store: address of the array element
• Fetch-and-increment: tail pointer
• FIFO ordering
Types of (Software) Locks: The Array-Based Queue Lock (animation frames, threads T1 to T5)
• Initially: the tail pointer points to the beginning of the array; all array elements except the first one are marked Wait (Enter, Wait, Wait, Wait, Wait)
• T1 gets the lock
• T2 requests
• T3 requests
• T1 releases, T2 gets (Wait, Enter, Wait, Wait, Wait)
• T4 requests
• T1 requests
• T2 releases, T3 gets (Wait, Wait, Enter, Wait, Wait)
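The array-based queue lock walked through above can be sketched in C11 as follows. This is a simplified illustration: `NPROC`, `array_lock_t` and the `array_lock_*` functions are hypothetical names, the slot count is assumed to be a power of two so that the wrapping tail counter stays consistent, and a production version would pad each slot to its own cache line.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NPROC 4  /* assumed maximum number of contending threads */

typedef struct {
    atomic_bool may_enter[NPROC];  /* one flag per slot: Enter or Wait */
    atomic_uint tail;              /* next free slot, taken modulo NPROC */
} array_lock_t;                    /* hypothetical type */

void array_lock_init(array_lock_t *l) {
    atomic_init(&l->tail, 0);
    for (int i = 0; i < NPROC; i++)
        atomic_init(&l->may_enter[i], i == 0);  /* only slot 0 says Enter */
}

/* Returns the slot index, which the caller passes back on release. */
unsigned array_lock_acquire(array_lock_t *l) {
    unsigned slot = atomic_fetch_add(&l->tail, 1) % NPROC;
    while (!atomic_load(&l->may_enter[slot]))
        ;  /* each waiter spins on its own array element */
    return slot;
}

void array_lock_release(array_lock_t *l, unsigned slot) {
    atomic_store(&l->may_enter[slot], false);               /* reset own slot */
    atomic_store(&l->may_enter[(slot + 1) % NPROC], true);  /* admit the next */
}

void array_lock_demo(void) {
    array_lock_t l;
    array_lock_init(&l);
    unsigned first = array_lock_acquire(&l);   /* slot 0 says Enter */
    array_lock_release(&l, first);             /* slot 1 now says Enter */
    unsigned second = array_lock_acquire(&l);
    assert(first == 0 && second == 1);
    array_lock_release(&l, second);
}
```

Only one fetch-and-increment is issued per acquisition, and a release writes to the single slot its successor is watching instead of a counter that every waiter polls.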
Types of (Software) Locks: The Queue Locks
• The array-based lock uses too much memory
  • Linear space (relative to the number of processors) per lock
• Array
  • Easy to implement
• Linked list: QNODE
  • Cache management
Types of (Software) Locks: The MCS Lock
• Characteristics
  • FIFO ordering
  • Spins on locally accessible flag variables
  • Small amount of space per lock
  • Works equally well on machines with and without coherent caches
• Similar to the QNODE implementation of queue locks
  • QNODEs are assigned to local memory
  • Threads spin on local memory
MCS: How does it work?
• Each processor enqueues its own private lock variable into a queue and spins on it
  • Key: spin locally
  • CC model: spin in local cache
  • DSM model: spin in local private memory
  • No contention
• On lock release, the releaser unlocks the next lock in the queue
  • Bus/network contention occurs only on the actual unlock
  • No starvation (the order of lock acquisitions is defined by the list)
MCS Lock
• Requires atomic instructions:
  • compare-and-swap
  • fetch-and-store
• If there is no compare-and-swap
  • an alternative release algorithm
  • extra complexity
  • loss of strict FIFO ordering
  • theoretical possibility of starvation
• Details: Mellor-Crummey and Scott's 1991 paper
MCS: Example
[Figure: three queue states labelled Init, Proc 1 gets, and Proc 2 tries; the lock holds a Tail pointer and each queue node has Flag and Next fields; CPU 1 to CPU 4 are queued]
• CPU 1 holds the “real” lock
• CPU 2, CPU 3 and CPU 4 spin on their flags
• When CPU 1 releases, it releases the lock and changes the flag variable of the next CPU in the list
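The queue manipulation in the MCS example can be sketched in C11. This is an illustrative rendering of the general scheme (fetch-and-store to enqueue, compare-and-swap to release), not a transcription of the 1991 paper; `qnode_t`, `mcs_lock_t` and the `mcs_*` functions are names assumed here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;    /* successor in the queue, if any */
    atomic_bool             locked;  /* the flag this thread spins on */
} qnode_t;

typedef struct { _Atomic(qnode_t *) tail; } mcs_lock_t;

void mcs_init(mcs_lock_t *l) { atomic_init(&l->tail, NULL); }

void mcs_acquire(mcs_lock_t *l, qnode_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* fetch-and-store: append myself and learn who was last before me */
    qnode_t *pred = atomic_exchange(&l->tail, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);  /* link behind the predecessor */
        while (atomic_load(&me->locked))
            ;                           /* spin on my own flag only */
    }
}

void mcs_release(mcs_lock_t *l, qnode_t *me) {
    qnode_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* compare-and-swap: if I am still the tail, the queue is empty */
        qnode_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        /* a newcomer swapped the tail but has not linked in yet */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);  /* hand the lock to the successor */
}

void mcs_demo(void) {
    mcs_lock_t l;
    qnode_t a, b;
    mcs_init(&l);
    mcs_acquire(&l, &a);                   /* empty queue: no spinning */
    assert(atomic_load(&l.tail) == &a);
    mcs_release(&l, &a);                   /* no successor: queue empties */
    assert(atomic_load(&l.tail) == NULL);
    mcs_acquire(&l, &b);
    mcs_release(&l, &b);
}
```

Each thread passes its own `qnode_t`, typically allocated in memory local to that thread, which is what makes the spin in `mcs_acquire` local.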
Implementation: Modern Alternatives
• Fetch-and-Φ operations
  • They are restrictive
  • Not all architectures support all of them
• Problem: a single general atomic operation is hard!
• Solution: provide two primitives from which atomic operations can be built
  • Load linked and store conditional
  • Remember the PowerPC lwarx and stwcx. instructions
An Example: Swap
Exchange the contents of register R4 with the memory location pointed to by R1
try: mov R3, R4       ; copy the exchange value
     ld  R2, 0(R1)    ; load the old memory value
     st  R3, 0(R1)    ; store the new value
     mov R4, R2       ; put the old value in R4
Not atomic!
An Example: Atomic Swap
Swap (fetch-and-store) using ll and sc
try: mov  R3, R4      ; copy the exchange value
     ll   R2, 0(R1)   ; load linked
     sc   R3, 0(R1)   ; store conditional
     beqz R3, try     ; retry if the store failed
     mov  R4, R2      ; put the loaded value in R4
If another processor writes to the location pointed to by R1 before the sc completes, the reservation (usually kept in a register) is lost. The sc then fails and the code loops back to try again.
Another Example: Fetch and Increment and Spin Lock
Fetch and increment using ll-sc
try: ll   R2, 0(R1)   ; load linked: R2 = old value
     addi R3, R2, #1  ; R3 = old value + 1 (R2 keeps the fetched value)
     sc   R3, 0(R1)   ; store conditional
     beqz R3, try     ; retry if the store failed
Spin lock using ll-sc
The exch instruction is equivalent to the atomic swap instruction block presented earlier. Assume that the lock is not cacheable. Note: 0 = unlocked; 1 = locked
        li   R2, #1
lockit: exch R2, 0(R1)   ; atomic exchange
        bnez R2, lockit  ; already locked: spin
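The same retry pattern can be written portably in C11, which has no direct ll/sc but offers a weak compare-exchange that compilers for LL/SC machines typically lower to such a pair. The sketch below builds a general fetch-and-Φ from it; `fetch_and_phi`, `increment`, `twice` and `fetch_and_phi_demo` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Generic fetch-and-phi built from a retry loop. The weak form may fail
   spuriously, much as sc can, so it always sits inside a loop. */
int fetch_and_phi(atomic_int *loc, int (*phi)(int)) {
    int old = atomic_load(loc);  /* plays the role of ll */
    while (!atomic_compare_exchange_weak(loc, &old, phi(old)))
        ;  /* like a failed sc: old now holds the fresh value, so retry */
    return old;
}

static int increment(int v) { return v + 1; }
static int twice(int v)     { return 2 * v; }

void fetch_and_phi_demo(void) {
    atomic_int x = 20;
    int a = fetch_and_phi(&x, increment);  /* x: 20 -> 21 */
    int b = fetch_and_phi(&x, twice);      /* x: 21 -> 42 */
    assert(a == 20 && b == 21 && atomic_load(&x) == 42);
}
```

One difference worth keeping in mind: compare-exchange succeeds whenever the value matches, even if it changed and changed back in between, whereas a store conditional fails after any intervening write to the location.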
Performance Penalty Example
Suppose there are 10 processors on a bus and each tries to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won’t matter much!). Determine the performance penalty.
Answer
It takes over 12,000 cycles in total for all processors to pass through the lock!
Note the contention for the lock and the serialization of the bus transactions.
See the example on p. 596, Hennessy and Patterson, 3rd ed.
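One way to arrive at this figure is sketched below. It rests on an assumed accounting, in the spirit of the textbook example: while $i$ processors are still contending, they issue $i$ load-linked misses, $i$ store-conditional misses and one store to release the lock, i.e. $2i + 1$ bus transactions.

```latex
\[
\sum_{i=1}^{n} (2i + 1) = n(n+1) + n = n^2 + 2n
\]
\[
n = 10:\quad 10^2 + 2 \cdot 10 = 120 \text{ bus transactions}
\;\Rightarrow\; 120 \times 100 = 12{,}000 \text{ clock cycles}
\]
```

The quadratic term is the cost of contention: every release restarts the race among all remaining waiters.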
Performance Penalty
• Assume the same example as before (100 cycles per bus transaction, 10 processors), but consider the case of a queuing lock that only updates on a miss (Hennessy and Patterson, p. 603)
Performance Penalty
• Answer:
  • First time: n + 1
  • Subsequent accesses: 2(n - 1)
  • Total: 3n - 1
  • 29 bus cycles, or 2,900 clock cycles
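Written out for the stated values ($n = 10$ processors, 100 clock cycles per bus transaction), the arithmetic above is:

```latex
\[
\underbrace{(n + 1)}_{\text{first time}}
+ \underbrace{2(n - 1)}_{\text{subsequent accesses}}
= 3n - 1
\]
\[
n = 10:\quad 3 \cdot 10 - 1 = 29 \text{ bus cycles}
\;\Rightarrow\; 29 \times 100 = 2{,}900 \text{ clock cycles}
\]
```

The cost grows linearly in $n$, and for ten processors it is roughly a quarter of the 12,000 cycles quoted for the simple spin lock.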
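Implementing Locks Using Coherence
Spin on a cached copy, then exchange:
lockit: ld   R2, 0(R1)   ; load the lock
        bnez R2, lockit  ; not available: spin
        li   R2, #1      ; load the locked value
        exch R2, 0(R1)   ; swap
        bnez R2, lockit  ; branch if the lock was not 0
The same lock using ll-sc:
lockit: ll   R2, 0(R1)   ; load linked
        bnez R2, lockit  ; not available: spin
        li   R2, #1      ; locked value
        sc   R2, 0(R1)   ; store conditional
        beqz R2, lockit  ; branch if the store fails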