This paper examines different spin lock algorithms for shared-memory multiprocessors and proposes alternative approaches to reduce communication overhead during spin waiting. The performance of these alternatives is benchmarked and compared.
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors Paper: Thomas E. Anderson Presentation: Emerson Murphy-Hill Slide 1 (of 23)
Introduction • Shared-memory multiprocessors • Mutual exclusion required • Hardware almost always provides primitives for it • Direct mutual exclusion • Mutual exclusion through locking • Interest here: short critical regions, spin locks • The problem: spinning processors consume communication bandwidth – how can we reduce it? Slide 2 (of 23)
Range of Architectures • Two dimensions: • Interconnect type (multistage network or bus) • Cache type • So six architectures are considered: • Multistage network without private caches • Multistage network with invalidation-based cache coherence using RD • Bus without coherent private caches • Bus with snoopy write-through invalidation-based cache coherence • Bus with snoopy write-back invalidation-based cache coherence • Bus with snoopy distributed-write cache coherence • All architectures provide an atomic read-modify-write operation (e.g., Test-and-Set) Slide 3 (of 23)
Why Spinlocks are Slow • Tradeoff: polling more frequently gets you the lock sooner, but the extra communication slows everyone else down • Latency is also an issue: a more complicated spin-lock algorithm adds overhead to each acquisition, even when there is no contention Slide 4 (of 23)
A Spin-Waiting Algorithm • Spin on Test-and-Set while(TestAndSet(lock) = BUSY); <critical section> lock := CLEAR; • Slow, because: • The lock holder must contend with non-lock holders for bus/network access • Spinning requests slow other requests Slide 5 (of 23)
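A minimal C11 sketch of spin on Test-and-Set, assuming atomic_exchange plays the role of the Test-and-Set primitive; the spinlock_t name and the 1 = BUSY / 0 = CLEAR encoding are illustrative, not from the paper:

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;            /* 1 = BUSY, 0 = CLEAR */

    void spin_lock(spinlock_t *lock) {
        /* atomic_exchange writes BUSY and returns the old value,
           i.e., it acts as a Test-and-Set */
        while (atomic_exchange(lock, 1) == 1)
            ;                                 /* every iteration generates bus/network traffic */
    }

    void spin_unlock(spinlock_t *lock) {
        atomic_store(lock, 0);                /* lock := CLEAR */
    }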
Another Spin-Waiting Algorithm • Spin on Read (Test-and-Test-and-Set) while(lock=BUSY or TestAndSet(lock)=BUSY); <critical section> lock := CLEAR; • For architectures with per-processor caches • Like the previous algorithm, but reads generate no network/bus communication while the lock is held • For short critical sections, this is still slow, because the time to quiesce (for all processors to resume spinning after a release) dominates Slide 6 (of 23)
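A corresponding C11 sketch of spin on read (again with illustrative names): the inner loop spins on the locally cached value and only attempts the real Test-and-Set once the lock appears free.

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;            /* 1 = BUSY, 0 = CLEAR */

    void spin_lock(spinlock_t *lock) {
        for (;;) {
            /* read-only spin: hits in the local cache, no bus traffic */
            while (atomic_load_explicit(lock, memory_order_relaxed) == 1)
                ;
            /* lock looked free: now try the real Test-and-Set */
            if (atomic_exchange(lock, 1) == 0)
                return;
        }
    }

    void spin_unlock(spinlock_t *lock) {
        atomic_store(lock, 0);
    }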
Reasons Why Quiescence is Slow • Elapsed time between the Read and the Test-and-Set • All cached copies of the lock are invalidated by a Test-and-Set, even if the test fails • Invalidation-based cache coherence requires O(P) bus/network cycles, because a written value has to be propagated to every processor (even though it is the same value every time!) Slide 7 (of 23)
Validation Slide 8 (of 23)
Validation (a bit more) Slide 9 (of 23)
Now, Speed it Up… • The author presents 5 alternative approaches • Key insight: 4 of them are based on the observation that communication during spin waiting resembles CSMA (Ethernet) networking protocols Slide 10 (of 23)
1/5: Static Delay on Lock Release • When a processor notices the lock has been released, it waits a fixed amount of time before trying a Test-and-Set • Each processor is assigned a static delay (slot) • Good performance when: • There are few slots and few spinning processors • There are many slots and many spinning processors Slide 11 (of 23)
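A hedged C11 sketch of the static-slot variant; my_slot, SLOT_CYCLES, and the busy-wait loops are illustrative choices, not the paper's exact scheme:

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;                 /* 1 = BUSY, 0 = CLEAR */
    #define SLOT_CYCLES 100L                       /* length of one slot (tunable, illustrative) */

    void spin_lock(spinlock_t *lock, int my_slot) {
        for (;;) {
            while (atomic_load_explicit(lock, memory_order_relaxed) == 1)
                ;                                  /* spin on the cached copy */
            for (volatile long i = 0; i < my_slot * SLOT_CYCLES; i++)
                ;                                  /* wait my statically assigned slot */
            if (atomic_exchange(lock, 1) == 0)     /* then try the Test-and-Set */
                return;
        }
    }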
2/5: Backoff on Lock Release • Like Ethernet backoff • Wait a small amount of time between the Read and the Test-and-Set • If a processor collides with another processor, it backs off for a longer random interval • Indirectly, processors base their backoff interval on the number of spinning processors • But… Slide 12 (of 23)
More on Backoff… • Processors should not change their mean delay if another processor acquires the lock • Maximum time to delay should be bounded • Initial delay on arrival should be a fraction of the last delay Slide 13 (of 23)
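Putting slides 12 and 13 together, a hedged C11 sketch of backoff on lock release; the MAX_DELAY bound, the halving of the previous delay, and the use of rand() are illustrative parameter choices, not the paper's exact ones:

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef atomic_int spinlock_t;                 /* 1 = BUSY, 0 = CLEAR */
    #define MAX_DELAY 65536L                       /* bound on the backoff interval (illustrative) */

    /* last_delay persists per processor across acquisitions */
    void spin_lock(spinlock_t *lock, long *last_delay) {
        long delay = *last_delay / 2 + 1;          /* start at a fraction of the last delay */
        for (;;) {
            while (atomic_load_explicit(lock, memory_order_relaxed) == 1)
                ;                                  /* read-spin; the delay is NOT changed here,
                                                      even if others acquire and release the lock */
            if (atomic_exchange(lock, 1) == 0) {   /* Test-and-Set */
                *last_delay = delay;
                return;
            }
            /* collision: another processor won the Test-and-Set, so back off */
            long pause = rand() % delay + 1;
            for (volatile long i = 0; i < pause; i++)
                ;
            if (delay < MAX_DELAY)
                delay *= 2;                        /* bounded (roughly exponential) growth */
        }
    }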
3/5: Static Delay before Reference while(lock=BUSY or TestAndSet(lock)=BUSY) delay(); <critical section> • Here you simply check the lock less often • Good when: • Checking frequently with few other spinners • Checking infrequently with many spinners Slide 14 (of 23)
4/5: Backoff before Reference while(lock=BUSY or TestAndSet(lock)=BUSY) { delay(); delay += randomBackoff(); } <critical section> • Analogous to backoff on lock release • Both dynamic and static backoff are bad when the critical section is long: they just keep backing off while the lock is being held Slide 15 (of 23)
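A hedged C11 rendering of this loop; the backoff increment and the MAX_DELAY cap are illustrative:

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef atomic_int spinlock_t;                 /* 1 = BUSY, 0 = CLEAR */
    #define MAX_DELAY 65536L                       /* cap on the inter-poll delay (illustrative) */

    void spin_lock(spinlock_t *lock) {
        long delay = 1;
        /* short-circuit: the Test-and-Set is only attempted when the cached read says CLEAR */
        while (atomic_load_explicit(lock, memory_order_relaxed) == 1 ||
               atomic_exchange(lock, 1) == 1) {
            for (volatile long i = 0; i < delay; i++)
                ;                                  /* poll less and less often */
            delay += rand() % 16 + 1;              /* random backoff increment */
            if (delay > MAX_DELAY)
                delay = MAX_DELAY;
        }
    }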
5/5: Queue • We can't estimate the backoff from the number of waiting processors, and we can't keep an explicit process queue in memory (it would be just as contended as the lock itself!) • This author's contribution (finally):
Init: flags[0] := HAS_LOCK; flags[1..P-1] := MUST_WAIT; queueLast := 0;
Lock: myPlace := ReadAndIncrement(queueLast); while(flags[myPlace mod P]=MUST_WAIT); <critical section>
Unlock: flags[myPlace mod P] := MUST_WAIT; flags[(myPlace+1) mod P] := HAS_LOCK;
Slide 16 (of 23)
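A hedged C11 sketch of this array-based queue lock; P, the function names, and returning myPlace to the caller are illustrative details (in practice each flag would also be padded onto its own cache line or memory module):

    #include <stdatomic.h>

    #define P 64                                   /* number of processors (illustrative) */
    enum { MUST_WAIT = 0, HAS_LOCK = 1 };

    static atomic_int  flags[P];
    static atomic_uint queueLast;

    void queue_init(void) {
        atomic_store(&flags[0], HAS_LOCK);
        for (int i = 1; i < P; i++)
            atomic_store(&flags[i], MUST_WAIT);
        atomic_store(&queueLast, 0);
    }

    unsigned queue_lock(void) {
        /* atomic_fetch_add plays the role of ReadAndIncrement */
        unsigned myPlace = atomic_fetch_add(&queueLast, 1);
        while (atomic_load(&flags[myPlace % P]) == MUST_WAIT)
            ;                                      /* each waiter spins on its own flag */
        return myPlace;                            /* caller passes this to queue_unlock */
    }

    void queue_unlock(unsigned myPlace) {
        atomic_store(&flags[myPlace % P], MUST_WAIT);
        atomic_store(&flags[(myPlace + 1) % P], HAS_LOCK);
    }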
More on Queuing • Works especially well for multistage networks: each flag can live on a separate memory module, so no single memory location is saturated with requests • Works less well on a bus without cache coherence, because each processor still has to poll a single memory location over the bus • Lock latency is increased (extra overhead per acquisition), so performance is poor when there is no contention Slide 17 (of 23)
Benchmark Spin-lock Alternatives Slide 18 (of 23)
Overhead vs. Number of Slots Slide 19 (of 23)
Spin-waiting Overhead for a Burst Slide 20 (of 23)
Network Hardware Solutions • Combining Networks • Multiple paths to same memory location • Hardware Queuing • Eliminates polling across the network • Goodman’s Queue Links • Stores the name of the next processor in the queue directly in each processor’s cache • Eliminates need for memory access for queuing Slide 21 (of 23)
Bus Hardware Solutions • Invalidate cached copies ONLY when the Test-and-Set succeeds • Read broadcast • Whenever some other processor reads a value that I know is invalid in my cache, I grab a copy of that value too (piggyback) • Eliminates the cascade of read misses • Special handling of Test-and-Set • The cache and bus controllers don't touch the bus while the lock is busy • Essentially, a Test-and-Set isn't issued as long as the cached copy shows it is certain to fail Slide 22 (of 23)
Conclusions • Spin-locking performance doesn’t scale • A variant of Ethernet backoff has good results when there is little lock contention • Queuing (parallelizing lock handoff) has good results when there are waiting processors • A little supportive hardware goes a long way towards a healthy multiprocessor relationship Slide 23 (of 23)