Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors • John M. Mellor-Crummey, Michael L. Scott • Joseph Garvey & Joshua San Miguel

Dance Hall Machines?

Atomic Instructions
• Various instructions known as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap
• Some can be used to simulate others, but often with overhead
• Some lock types require a particular primitive in order to be implemented at all, or to be implemented efficiently
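As one illustration of simulating a primitive with another, the sketch below builds fetch_and_add out of compare_and_swap using C11 atomics. The function name fetch_and_add_via_cas is hypothetical; the retry loop is the overhead a native fetch_and_add instruction would avoid.

```c
#include <stdatomic.h>

/* Sketch only: emulate fetch_and_add with a compare_and_swap retry loop.
   fetch_and_add_via_cas is an illustrative name, not from the slides. */
static int fetch_and_add_via_cas(atomic_int *addr, int delta)
{
    int old = atomic_load(addr);
    /* On failure, atomic_compare_exchange_weak writes the current value
       of *addr back into old, so the next attempt uses fresh data. */
    while (!atomic_compare_exchange_weak(addr, &old, old + delta)) {
        /* another thread changed *addr in the meantime; retry */
    }
    return old;  /* value observed just before the successful update */
}
```

Under contention the loop may run several times per call, which is why a lock that depends on fetch_and_add can be noticeably slower on hardware that offers only compare_and_swap.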

Test_and_set: Basic
• type lock = (unlocked, locked)
• procedure acquire_lock (lock *L)
    • while test_and_set (L) == locked ;
• procedure release_lock (lock *L)
    • *L = unlocked
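The basic lock maps closely onto C11's atomic_flag. This is a minimal sketch: the type and function names are assumptions, and tas_try_acquire is a hypothetical helper added only so the lock state can be probed without spinning.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A clear flag means unlocked; a set flag means locked. */
typedef struct { atomic_flag held; } tas_lock;

static void tas_acquire(tas_lock *l)
{
    /* atomic_flag_test_and_set returns the previous value:
       true means another thread already holds the lock. */
    while (atomic_flag_test_and_set(&l->held))
        ;  /* every iteration is an atomic read-modify-write */
}

static void tas_release(tas_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Hypothetical helper: one attempt, no spinning. */
static bool tas_try_acquire(tas_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}
```

Because each spin iteration is an atomic write, every waiting processor keeps generating memory or coherence traffic while the lock is held.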

Test_and_set: Basic [Diagram: three processors (P), each with a cache ($), sharing one memory]

Test_and_set: test_and_test_and_set
• type lock = (unlocked, locked)
• procedure acquire_lock (lock *L)
    • while 1
        • if *L == unlocked
            • if test_and_set (L) == unlocked
                • return
• procedure release_lock (lock *L)
    • *L = unlocked
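A possible C11 rendering of test_and_test_and_set is sketched below. It uses an atomic_bool so the lock can be read without modifying it; the names are assumptions, and the default sequentially consistent orderings are used for simplicity rather than performance.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock;

static void ttas_acquire(ttas_lock *l)
{
    for (;;) {
        /* Read-only spin: on a cache-coherent machine this can be served
           from the local cache while the lock stays held. */
        while (atomic_load(&l->locked))
            ;
        /* The lock looked free, so now try the atomic test_and_set. */
        if (!atomic_exchange(&l->locked, true))
            return;
    }
}

static void ttas_release(ttas_lock *l)
{
    atomic_store(&l->locked, false);
}
```

The atomic exchange is attempted only after a plain read sees the lock free, so waiters stop issuing writes while the lock is held.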

Test_and_set: test_and_test_and_set [Diagram: three processors (P), each with a cache ($), sharing one memory]

Test_and_set: test_and_set with backoff
• type lock = (unlocked, locked)
• procedure acquire_lock (lock *L)
    • delay = 1
    • while test_and_set (L) == locked
        • pause (delay)
        • delay = delay * 2
• procedure release_lock (lock *L)
    • *L = unlocked
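The backoff variant can be sketched in C11 as follows. Two details are assumptions not found on the slide: pause is modeled as a crude busy loop (pause_for), and the delay is capped at a hypothetical MAX_DELAY so it cannot grow without bound.

```c
#include <stdatomic.h>

enum { MAX_DELAY = 1024 };  /* hypothetical cap, not from the slide */

typedef struct { atomic_flag held; } backoff_lock;

/* Crude busy-wait standing in for the slide's pause(delay). */
static void pause_for(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

/* Double the delay, saturating at MAX_DELAY. */
static unsigned next_delay(unsigned delay)
{
    return delay < MAX_DELAY / 2 ? delay * 2 : MAX_DELAY;
}

static void backoff_acquire(backoff_lock *l)
{
    unsigned delay = 1;
    while (atomic_flag_test_and_set(&l->held)) {
        pause_for(delay);          /* stay off the interconnect for a while */
        delay = next_delay(delay); /* wait longer after each failure */
    }
}

static void backoff_release(backoff_lock *l)
{
    atomic_flag_clear(&l->held);
}
```

Suitable starting and maximum delays depend on the machine and on typical critical-section length, so the constants here are placeholders to be tuned.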

Ticket Lock
• type lock = record
    • next_ticket = 0
    • now_serving = 0
• procedure acquire_lock (lock *L)
    • my_ticket = fetch_and_increment (&L->next_ticket)
    • while 1
        • if L->now_serving == my_ticket
            • return
• procedure release_lock (lock *L)
    • L->now_serving = L->now_serving + 1
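A minimal C11 sketch of the ticket lock follows. The names are assumptions, and ticket_acquire returns the ticket it drew purely so the FIFO order can be observed; the slide's procedure returns nothing.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed in */
} ticket_lock;

/* Returns the ticket drawn (for illustration only). */
static unsigned ticket_acquire(ticket_lock *l)
{
    /* fetch_and_increment: take a ticket. Unsigned wraparound is harmless
       because tickets are only ever compared for equality. */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ;  /* read-only spin on a single shared word */
    return my_ticket;
}

static void ticket_release(ticket_lock *l)
{
    /* Only the lock holder writes now_serving, so a separate load and
       store are enough; no atomic read-modify-write is needed. */
    atomic_store(&l->now_serving, atomic_load(&l->now_serving) + 1);
}
```

Waiters are served strictly in the order they drew tickets, but they all spin on the same now_serving word.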

Array-Based Queuing Locks
• type lock = record
    • slots = array [0 … numprocs - 1] of (has_lock, must_wait)
    • next_slot = 0
• procedure acquire_lock (lock *L)
    • my_place = fetch_and_increment (&L->next_slot)
    • // Various modulo work to handle overflow
    • while L->slots[my_place] == must_wait ;
    • L->slots[my_place] = must_wait
• procedure release_lock (lock *L)
    • L->slots[(my_place + 1) mod numprocs] = has_lock
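The array-based queuing lock can be sketched in C11 as below. Several choices are assumptions: NUMPROCS is a hypothetical constant chosen as a power of two so that a plain modulo stays correct when the unsigned counter wraps, slot 0 starts in the has_lock state, and my_place is returned to the caller rather than kept in per-processor storage.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { NUMPROCS = 4 };  /* hypothetical; a power of two keeps % safe on wraparound */

typedef struct {
    atomic_bool must_wait[NUMPROCS];  /* false means has_lock */
    atomic_uint next_slot;
} array_lock;

static void array_lock_init(array_lock *l)
{
    atomic_init(&l->must_wait[0], false);  /* first arrival gets the lock */
    for (int i = 1; i < NUMPROCS; i++)
        atomic_init(&l->must_wait[i], true);
    atomic_init(&l->next_slot, 0);
}

/* Returns the slot this caller spun on; pass it to array_release. */
static unsigned array_acquire(array_lock *l)
{
    unsigned my_place = atomic_fetch_add(&l->next_slot, 1) % NUMPROCS;
    while (atomic_load(&l->must_wait[my_place]))
        ;  /* each waiter spins on its own slot */
    atomic_store(&l->must_wait[my_place], true);  /* re-arm slot for the next lap */
    return my_place;
}

static void array_release(array_lock *l, unsigned my_place)
{
    /* Hand the lock to whoever holds (or will take) the next slot. */
    atomic_store(&l->must_wait[(my_place + 1) % NUMPROCS], false);
}
```

A practical version would likely pad each slot to its own cache line so that neighboring waiters do not share a line; that detail is omitted here for brevity.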

Array-Based Queuing Locks [Diagram: three processors (P), each with its own my_place and a cache ($); memory holds next_slot and slots]

MCS Locks
• type qnode = record
    • qnode *next
    • bool locked
• type lock = qnode*
• procedure acquire_lock (lock *L, qnode *I)
    • I->next = Null
    • qnode *predecessor = fetch_and_store (L, I)
    • if predecessor != Null
        • I->locked = true
        • predecessor->next = I
        • while I->locked ;
• procedure release_lock (lock *L, qnode *I)
    • if I->next == Null
        • if compare_and_swap (L, I, Null)
            • return
        • while I->next == Null ;
    • I->next->locked = false
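The MCS pseudocode translates fairly directly to C11 atomics. The sketch below uses assumed names (mcs_lock, mcs_acquire, mcs_release) and the default sequentially consistent orderings for simplicity; a tuned version would likely use weaker orderings.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;  /* successor in the queue, if any */
    atomic_bool locked;            /* true while this waiter must spin */
} qnode;

typedef _Atomic(qnode *) mcs_lock; /* NULL = free, otherwise tail of the queue */

static void mcs_acquire(mcs_lock *L, qnode *I)
{
    atomic_store(&I->next, NULL);
    qnode *pred = atomic_exchange(L, I);   /* fetch_and_store: join the queue */
    if (pred != NULL) {                    /* queue was not empty */
        atomic_store(&I->locked, true);
        atomic_store(&pred->next, I);      /* link behind the predecessor */
        while (atomic_load(&I->locked))
            ;                              /* spin on our own node only */
    }
}

static void mcs_release(mcs_lock *L, qnode *I)
{
    qnode *succ = atomic_load(&I->next);
    if (succ == NULL) {
        qnode *expected = I;
        /* No visible successor: try to swing the tail back to NULL. */
        if (atomic_compare_exchange_strong(L, &expected, NULL))
            return;
        /* Someone is mid-enqueue; wait for the link to appear. */
        while ((succ = atomic_load(&I->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);    /* pass the lock on */
}
```

Each caller supplies its own qnode, for example one per thread, and must pass the same node to mcs_release that it passed to mcs_acquire.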

MCS Locks [Animation: the lock pointer L and queue nodes for processors 1 to 5 (labeled 1-R, 2-B, 2-R, 3-B, 3-R, 3-E, 4-R, 4-B, 5-B), shown next to the release_lock procedure]

Results: Scalability – Distributed Memory Architecture

Results: Scalability – Cache Coherent Architecture

Results: Single Processor Lock/Release Time
• The Butterfly's atomic instructions are very expensive
• The Butterfly's atomic instructions can't handle 24-bit pointers

Results: Network Congestion

Which lock should I use?
• fetch_and_store supported?
    • Yes: MCS
    • No: fetch_and_increment supported?
        • Yes: Ticket
        • No: test_and_set w/ exp backoff
• Atomic instructions >> normal instructions && 1-processor latency is very important → don't use MCS
• If processes might be preempted → test_and_set with exponential backoff

Centralized Barrier [Diagram: processors P0 to P3 and a counter taking values 0 to 4]

Software Combining Tree Barrier [Diagram: processors P0 to P3 arranged in a combining tree, with counters taking values 0 to 2]

Tournament Barrier [Diagram: processors P0 to P3 paired in tournament rounds, with nodes labeled W, L, and C]

Dissemination Barrier [Diagram: processors P0 to P3 and the flags they set for one another]

New Tree-Based Barrier [Diagram: processors P0 to P3 arranged in a tree, with counters taking values 0 to 3]

Summary

Results – Distributed Shared Memory

Results – Broadcast-Based Cache-Coherent

Results – Local vs. Remote Spinning

Barrier Decision Tree
• Multiprocessor?
    • Distributed Shared Memory: Dissemination Barrier or New Tree-Based Barrier (tree wakeup)
    • Broadcast-Based Cache-Coherent: Centralized Barrier or New Tree-Based Barrier (central wakeup)

Architectural Recommendations
• No dance hall architectures
• No need for complicated hardware synchronization
• Need a full set of fetch_and_Φ instructions
