Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. John M. Mellor-Crummey. Michael L. Scott. Joseph Garvey & Joshua San Miguel. Dance Hall Machines?. Atomic Instructions.
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors • John M. Mellor-Crummey, Michael L. Scott • Joseph Garvey & Joshua San Miguel
Atomic Instructions • A family of instructions known as fetch_and_Φ: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap • Some can be used to simulate others, but often with overhead • Some lock types require a particular primitive, either to be implemented at all or to be implemented efficiently
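As an illustration of one primitive simulating another, here is a minimal sketch in C11 that builds fetch_and_add out of compare_and_swap. The helper name fetch_and_add_via_cas is hypothetical, and the retry loop is exactly the kind of overhead the slide mentions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: emulate fetch_and_add using only compare_and_swap.
 * Returns the value that was stored before the addition. */
static int fetch_and_add_via_cas(atomic_int *addr, int delta)
{
    assert(addr != NULL);
    int old = atomic_load(addr);
    /* On failure the call writes the current value back into `old`,
     * so each retry works with fresh data. */
    while (!atomic_compare_exchange_weak(addr, &old, old + delta)) {
        /* another processor changed *addr in between: retry */
    }
    return old;
}
```

Under contention the loop may repeat many times, whereas a native fetch_and_add completes in one atomic operation.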
Test_and_set: Basic
  type lock = (unlocked, locked)
  procedure acquire_lock (lock *L)
    while test_and_set (L) == locked ;
  procedure release_lock (lock *L)
    *L = unlocked
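The basic lock above could be sketched in C11 roughly as follows, assuming atomic_flag_test_and_set stands in for test_and_set. The names tas_lock, tas_init, tas_try_acquire, tas_acquire and tas_release are made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag clear = unlocked, flag set = locked. */
typedef struct { atomic_flag held; } tas_lock;

static void tas_init(tas_lock *L) { atomic_flag_clear(&L->held); }

/* One test_and_set attempt; true means the caller now holds the lock. */
static bool tas_try_acquire(tas_lock *L)
{
    assert(L != NULL);
    /* test_and_set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&L->held, memory_order_acquire);
}

/* Spin with an atomic read-modify-write on every iteration. */
static void tas_acquire(tas_lock *L)
{
    while (!tas_try_acquire(L)) { /* spin */ }
}

static void tas_release(tas_lock *L)
{
    atomic_flag_clear_explicit(&L->held, memory_order_release);
}
```

Every failed attempt is still a write to the shared flag, which is where the interconnect traffic of this lock comes from.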
Test_and_set: Basic [Figure: three processors (P), each with its own cache ($), sharing one Memory]
Test_and_set: test_and_test_and_set
  type lock = (unlocked, locked)
  procedure acquire_lock (lock *L)
    while 1
      if *L == unlocked
        if test_and_set (L) == unlocked
          return
  procedure release_lock (lock *L)
    *L = unlocked
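A possible C11 rendering of test_and_test_and_set is sketched below. It assumes atomic_exchange on a boolean plays the role of test_and_set, and the ttas_* names are invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { atomic_bool locked; } ttas_lock;

static void ttas_init(ttas_lock *L) { atomic_init(&L->locked, false); }

/* One test_and_test_and_set attempt; true means the lock was acquired. */
static bool ttas_try_acquire(ttas_lock *L)
{
    assert(L != NULL);
    /* First test: an ordinary read, which can be served from the local
     * cache while the lock stays held. */
    if (atomic_load_explicit(&L->locked, memory_order_relaxed))
        return false;
    /* Second test: the atomic exchange returns the old value, so `false`
     * means this caller is the one that took the lock. */
    return !atomic_exchange_explicit(&L->locked, true, memory_order_acquire);
}

static void ttas_acquire(ttas_lock *L)
{
    while (!ttas_try_acquire(L)) { /* spin, mostly on the cached copy */ }
}

static void ttas_release(ttas_lock *L)
{
    atomic_store_explicit(&L->locked, false, memory_order_release);
}
```

The atomic instruction is issued only when the lock looks free, so waiting processors mostly read instead of write.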
Test_and_set: test_and_test_and_set [Figure: three processors (P), each with its own cache ($), sharing one Memory]
Test_and_set: test_and_set with backoff
  type lock = (unlocked, locked)
  procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
      pause (delay)
      delay = delay * 2
  procedure release_lock (lock *L)
    *L = unlocked
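The backoff variant might look like this in C11. Two things here are assumptions that do not appear on the slide: the delay is capped (the cap parameter), and pause is approximated by a busy-wait loop. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { atomic_flag held; } backoff_lock;

static void backoff_init(backoff_lock *L) { atomic_flag_clear(&L->held); }

/* Double the delay, saturating at `cap` (the cap is an assumption). */
static unsigned next_delay(unsigned delay, unsigned cap)
{
    assert(cap >= 1);
    return (delay >= cap / 2) ? cap : delay * 2;
}

/* Stand-in for pause(delay): burn roughly `iterations` loop iterations. */
static void pause_for(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* Returns the number of failed attempts, purely for illustration. */
static unsigned backoff_acquire(backoff_lock *L, unsigned cap)
{
    unsigned delay = 1, failures = 0;
    while (atomic_flag_test_and_set_explicit(&L->held, memory_order_acquire)) {
        pause_for(delay);               /* stay off the interconnect for a while */
        delay = next_delay(delay, cap); /* wait longer after each failure */
        failures++;
    }
    return failures;
}

static void backoff_release(backoff_lock *L)
{
    atomic_flag_clear_explicit(&L->held, memory_order_release);
}
```

Without a cap, a long run of failures could push the delay far beyond the length of any critical section, which is why one is added in this sketch.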
Ticket Lock
  type lock = record
    next_ticket = 0
    now_serving = 0
  procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (&L->next_ticket)
    while 1
      if L->now_serving == my_ticket
        return
  procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
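A minimal C11 sketch of the ticket lock follows, assuming atomic_fetch_add is an acceptable stand-in for fetch_and_increment. The ticket_* names are illustrative only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed in */
} ticket_lock;

static void ticket_init(ticket_lock *L)
{
    atomic_init(&L->next_ticket, 0u);
    atomic_init(&L->now_serving, 0u);
}

/* Returns the ticket drawn, purely so the example can be inspected. */
static unsigned ticket_acquire(ticket_lock *L)
{
    assert(L != NULL);
    unsigned my_ticket =
        atomic_fetch_add_explicit(&L->next_ticket, 1u, memory_order_relaxed);
    /* Spin with plain reads until our number comes up (FIFO order). */
    while (atomic_load_explicit(&L->now_serving, memory_order_acquire) != my_ticket) { }
    return my_ticket;
}

static void ticket_release(ticket_lock *L)
{
    /* Only the lock holder writes now_serving, so load + store is enough. */
    unsigned next =
        atomic_load_explicit(&L->now_serving, memory_order_relaxed) + 1u;
    atomic_store_explicit(&L->now_serving, next, memory_order_release);
}
```

Unsigned wraparound keeps the two counters consistent even after they overflow.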
Array-Based Queuing Locks
  type lock = record
    slots = array [0…numprocs - 1] of (has_lock, must_wait)   // slots[0] starts as has_lock, all others as must_wait
    next_slot = 0
  procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (&L->next_slot)
    // Various modulo work to handle overflow
    while L->slots[my_place] == must_wait ;
    L->slots[my_place] = must_wait
  procedure release_lock (lock *L)
    // my_place is the value this processor obtained in acquire_lock
    L->slots[(my_place + 1) mod numprocs] = has_lock
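The array-based lock could be sketched in C11 as below. NUMPROCS is a hypothetical compile-time constant, the array_* names are invented, and the "modulo work" is simplified by assuming NUMPROCS is a power of two so that counter wraparound stays harmless. Cache-line padding of the slots is omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>

#define NUMPROCS 4  /* hypothetical processor count */
static_assert((NUMPROCS & (NUMPROCS - 1)) == 0,
              "this sketch assumes a power-of-two slot count");

enum { HAS_LOCK = 0, MUST_WAIT = 1 };

typedef struct {
    atomic_int  slots[NUMPROCS];  /* a real version pads each slot to a cache line */
    atomic_uint next_slot;
} array_lock;

static void array_lock_init(array_lock *L)
{
    atomic_init(&L->slots[0], HAS_LOCK);       /* first arrival gets the lock */
    for (int i = 1; i < NUMPROCS; i++)
        atomic_init(&L->slots[i], MUST_WAIT);
    atomic_init(&L->next_slot, 0u);
}

/* Returns my_place, which the caller must pass to array_release. */
static unsigned array_acquire(array_lock *L)
{
    unsigned my_place = atomic_fetch_add(&L->next_slot, 1u) % NUMPROCS;
    /* Each waiter spins on its own slot. */
    while (atomic_load_explicit(&L->slots[my_place], memory_order_acquire) == MUST_WAIT) { }
    /* Re-arm the slot for the next trip around the array. */
    atomic_store_explicit(&L->slots[my_place], MUST_WAIT, memory_order_relaxed);
    return my_place;
}

static void array_release(array_lock *L, unsigned my_place)
{
    atomic_store_explicit(&L->slots[(my_place + 1) % NUMPROCS], HAS_LOCK,
                          memory_order_release);
}
```

Returning my_place from acquire and passing it to release is one way to carry the per-processor value across the two calls.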
Array-Based Queuing Locks [Figure: three processors (P), each labeled with its own my_place and its own cache ($), above a shared Memory holding next_slot and slots]
MCS Locks
  type qnode = record
    qnode *next
    bool locked
  type lock = qnode*
  procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null
      I->locked = true
      predecessor->next = I
      while I->locked ;
  procedure release_lock (lock *L, qnode *I)
    if I->next == Null
      if compare_and_swap (L, I, Null)
        return
      while I->next == Null ;
    I->next->locked = false
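The MCS pseudocode above maps fairly directly onto C11 atomics. The sketch below assumes atomic_exchange for fetch_and_store and atomic_compare_exchange_strong for compare_and_swap. The mcs_* names are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool locked;
} qnode;

typedef _Atomic(qnode *) mcs_lock;  /* NULL means free; otherwise the queue tail */

static void mcs_acquire(mcs_lock *L, qnode *I)
{
    assert(L != NULL && I != NULL);
    atomic_store_explicit(&I->next, NULL, memory_order_relaxed);
    /* fetch_and_store: append ourselves and learn who was the tail before. */
    qnode *pred = atomic_exchange_explicit(L, I, memory_order_acq_rel);
    if (pred != NULL) {
        atomic_store_explicit(&I->locked, true, memory_order_relaxed);
        atomic_store_explicit(&pred->next, I, memory_order_release);
        /* Spin only on our own node. */
        while (atomic_load_explicit(&I->locked, memory_order_acquire)) { }
    }
}

static void mcs_release(mcs_lock *L, qnode *I)
{
    qnode *succ = atomic_load_explicit(&I->next, memory_order_acquire);
    if (succ == NULL) {
        qnode *expected = I;
        /* No visible successor: try to swing the tail back to NULL. */
        if (atomic_compare_exchange_strong_explicit(
                L, &expected, NULL, memory_order_release, memory_order_relaxed))
            return;
        /* Someone is mid-enqueue: wait until they link themselves in. */
        while ((succ = atomic_load_explicit(&I->next, memory_order_acquire)) == NULL) { }
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Each caller supplies its own qnode, which must stay alive from acquire until the matching release returns.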
MCS Locks [Figure: lock L and its queue of qnodes in successive states, labeled 1-R, 2-B, 2-R, 3-B, 3-R, 3-E, 4-R, 4-B, 5-B, shown next to the release_lock pseudocode]
Results: Single Processor Lock/Release Time • Butterfly’s atomic insns are very expensive • Butterfly’s atomic insns can’t manipulate 24-bit pointers
Which lock should I use?
• Atomic insns >> normal insns && 1 processor latency is very important -> don't use MCS
• If processes might be preempted -> test_and_set with exponential backoff
• fetch_and_store supported?
    Yes -> MCS
    No -> fetch_and_increment supported?
      Yes -> Ticket
      No -> test_and_set w/ exp backoff
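The decision slide can be read as a small function. The sketch below is one interpretation, not the presenters' code: the platform struct and its field names are hypothetical, the preemption caveat is assumed to take priority, and when MCS is ruled out the choice is assumed to fall through to the rest of the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LOCK_MCS, LOCK_TICKET, LOCK_TAS_BACKOFF } lock_choice;

/* Hypothetical description of the target machine and workload. */
typedef struct {
    bool has_fetch_and_store;
    bool has_fetch_and_increment;
    bool atomics_very_expensive;        /* atomic insns >> normal insns */
    bool uncontended_latency_critical;  /* 1-processor latency matters a lot */
    bool may_be_preempted;
} platform;

static lock_choice choose_lock(const platform *p)
{
    assert(p != NULL);
    /* Assumption: the preemption caveat overrides everything else. */
    if (p->may_be_preempted)
        return LOCK_TAS_BACKOFF;
    /* Slide caveat: avoid MCS when both conditions hold. */
    bool avoid_mcs = p->atomics_very_expensive && p->uncontended_latency_critical;
    if (p->has_fetch_and_store && !avoid_mcs)
        return LOCK_MCS;
    /* Assumption: with MCS unavailable or ruled out, continue down the tree. */
    if (p->has_fetch_and_increment)
        return LOCK_TICKET;
    return LOCK_TAS_BACKOFF;
}
```

For example, a machine with both primitives and no caveats selects MCS, while one with neither primitive ends at test_and_set with backoff.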
Centralized Barrier [Figure: processors P0-P3 arriving at a single shared counter; counter values shown: 4, 3, 2, 1, 0]
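The slide shows only a figure, so the following is a sketch of one common form of centralized barrier, the sense-reversing variant, written in C11. It is an assumption that this matches what the figure depicts. The central_barrier names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint count;    /* processors still to arrive in this episode */
    atomic_bool sense;    /* flips once per barrier episode */
    unsigned    nthreads;
} central_barrier;

static void central_barrier_init(central_barrier *B, unsigned nthreads)
{
    assert(nthreads >= 1);
    B->nthreads = nthreads;
    atomic_init(&B->count, nthreads);
    atomic_init(&B->sense, false);
}

/* Each thread keeps its own local_sense, initially false. */
static void central_barrier_wait(central_barrier *B, bool *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_sub_explicit(&B->count, 1u, memory_order_acq_rel) == 1u) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store_explicit(&B->count, B->nthreads, memory_order_relaxed);
        atomic_store_explicit(&B->sense, *local_sense, memory_order_release);
    } else {
        /* Everyone else spins on the single shared flag. */
        while (atomic_load_explicit(&B->sense, memory_order_acquire) != *local_sense) { }
    }
}
```

Reversing the sense each episode lets the same counter and flag be reused without a separate reset phase.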
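Software Combining Tree Barrier [Figure: processors P0-P3 grouped as (P0, P2) and (P1, P3) under a tree of nodes, each node showing counter values 0, 1, 2]
Tournament Barrier [Figure: tournament bracket over P0-P3 with nodes labeled W, L, and C]
Dissemination Barrier [Figure: processors P0-P3]
New Tree-Based Barrier [Figure: processors P0-P3 arranged in a tree; values shown: 3, 2, 1, 0]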
Barrier Decision Tree
Multiprocessor?
    Distributed Shared Memory -> New Tree-Based Barrier (tree wakeup) or Dissemination Barrier
    Broadcast-Based Cache-Coherent -> New Tree-Based Barrier (central wakeup) or Centralized Barrier
Architectural Recommendations • No dance hall • No need for complicated hardware synch • Need a full set of fetch_and_Φ