ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines
Two Parallel Architectures • Shared memory machines. • Distributed memory machines.
Shared Memory: Logical View • [Diagram: a single shared memory space accessed by processors proc1, proc2, proc3, …, procN]
Shared Memory Machines • Small number of processors: shared memory with coherent caches (SMP). • Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
SMPs • 2- or 4-processor PCs are now commodity. • Good price/performance ratio. • Memory is sometimes a bottleneck (see later). • Typical price (8-node): ~$20–40k.
Physical Implementation • [Diagram: processors proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to shared memory]
Shared Memory Machines • Small number of processors: shared memory with coherent caches (SMP). • Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
CC-NUMA: Physical Implementation • [Diagram: processors proc1 … procN, each with a private cache, connected through an interconnect; the memory modules mem1 … memN are distributed across the nodes]
Caches in Multiprocessors • Suffer from the coherence problem: • the same line appears in two or more caches • one processor writes a word in the line • the other processors can now read stale data • Leads to the need for a coherence protocol • avoids coherence problems • Many protocols exist; we will look at a simple one.
What is Coherence? • What does it mean for memory to be shared? • Intuitively, a read returns the last value written. • This notion is not well defined in a system without a global clock.
The Notion of “last written” in a Multiprocessor System • [Timeline diagram: processors P0–P3 issue reads and writes to x (r(x), w(x)) on separate timelines, with no global ordering among them]
The Notion of “last written” in a Single-machine System • [Timeline diagram: a single machine issues w(x), w(x), r(x), r(x) along one timeline, so “last written” is well defined]
Coherence: a Clean Definition • Is achieved by referring back to the single machine case. • Called sequential consistency.
Sequential Consistency (SC) • Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.
Returning to our Example • [Timeline diagram from before: processors P0–P3 issue reads and writes to x]
Another Way of Defining SC • All memory references of a single process execute in program order. • All writes are globally ordered.
SC: Example 1 • Initial values of x and y are 0. • [Diagram: two processors issue w(x,1), w(y,1), r(x), r(y)] • What are the possible final values?
SC: Example 2 • [Diagram: two processors issue w(x,1), w(y,1), r(y), r(x)]
SC: Example 3 • [Diagram: two processors issue w(x,1), w(y,1), r(y), r(x), arranged differently from Example 2]
SC: Example 4 • [Diagram: two processors issue r(x), w(x,1), w(x,2), r(x)]
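To make these examples concrete, here is a minimal litmus-test sketch (not from the original slides), assuming the operations of Example 1 are split so that one processor executes w(x,1) then r(y) while the other executes w(y,1) then r(x). With C11 sequentially consistent atomics, the outcome in which both reads return 0 can never be observed, which is exactly the behaviour SC forbids; plain non-atomic accesses on real hardware give no such guarantee.

/* sc_litmus.c -- store-buffering litmus test (illustrative sketch).
   Build with: cc -std=c11 -pthread sc_litmus.c                      */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;   /* shared variables, initially 0                  */
int rx, ry;        /* values observed by the two threads             */

void *proc0(void *arg) {
    atomic_store(&x, 1);      /* w(x,1) -- seq_cst by default        */
    ry = atomic_load(&y);     /* r(y)                                */
    return NULL;
}

void *proc1(void *arg) {
    atomic_store(&y, 1);      /* w(y,1)                              */
    rx = atomic_load(&x);     /* r(x)                                */
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t t0, t1;
        pthread_create(&t0, NULL, proc0, NULL);
        pthread_create(&t1, NULL, proc1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* Under sequential consistency at least one read must see 1. */
        if (rx == 0 && ry == 0)
            printf("non-SC outcome at iteration %d\n", i);
    }
    return 0;
}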
Implementation • There are many ways of implementing SC. • In fact, implementations sometimes enforce even stronger conditions. • We will look at a simple one: the MSI protocol.
Physical Implementation • [Diagram repeated from before: processors proc1 … procN, each with a private cache, connected by a bus to shared memory]
Fundamental Assumption • The bus is a reliable, ordered broadcast bus. • Every message sent by a processor is received by all other processors in the same order. • Also called a snooping bus • Processors (or caches) snoop on the bus.
States of a Cache Line • Invalid • Shared • read-only, one of many cached copies • Modified • read-write, sole valid copy
Processor Transactions • processor read(x) • processor write(x)
Bus Transactions • bus read(x) • asks for copy with no intent to modify • bus read-exclusive(x) • asks for copy with intent to modify
State Diagram: Steps 0–9 • [The MSI state diagram is built up incrementally over ten slides, starting from the three states I (Invalid), S (Shared), M (Modified) and adding one transition per step. The completed diagram (Step 9) has the following transitions, written as observed event / bus action:]
• I → S on PrRd / BuRd
• S → S on PrRd / – (read hit, no bus traffic)
• I → M on PrWr / BuRdX
• S → M on PrWr / BuRdX
• M → M on PrWr / – (reads also hit locally)
• M → S on BuRd / Flush
• S → S on BuRd / –
• S → I on BuRdX / –
• M → I on BuRdX / Flush
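The completed diagram can also be written down as a transition function. The sketch below is illustrative only (the type and function names are my own, not from the slides); it encodes the MSI transitions for a single cache line, returning the next state and the action the cache must place on the bus.

/* msi.c -- MSI transition function (illustrative sketch). */
typedef enum { INVALID, SHARED, MODIFIED } State;            /* I, S, M      */
typedef enum { PR_RD, PR_WR, BU_RD, BU_RDX } Event;          /* observed     */
typedef enum { NONE, BUS_RD, BUS_RDX, FLUSH } Action;        /* issued       */

State msi_next(State s, Event e, Action *act) {
    *act = NONE;
    switch (s) {
    case INVALID:
        if (e == PR_RD) { *act = BUS_RD;  return SHARED;   }  /* PrRd/BuRd   */
        if (e == PR_WR) { *act = BUS_RDX; return MODIFIED; }  /* PrWr/BuRdX  */
        return INVALID;                                       /* ignore bus  */
    case SHARED:
        if (e == PR_RD)  return SHARED;                       /* PrRd/-      */
        if (e == PR_WR) { *act = BUS_RDX; return MODIFIED; }  /* PrWr/BuRdX  */
        if (e == BU_RD)  return SHARED;                       /* BuRd/-      */
        if (e == BU_RDX) return INVALID;                      /* BuRdX/-     */
        return SHARED;
    case MODIFIED:
        if (e == PR_RD || e == PR_WR) return MODIFIED;        /* PrRd,PrWr/- */
        if (e == BU_RD)  { *act = FLUSH; return SHARED;  }    /* BuRd/Flush  */
        if (e == BU_RDX) { *act = FLUSH; return INVALID; }    /* BuRdX/Flush */
        return MODIFIED;
    }
    return s;
}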
In Reality • Most machines use a slightly more complicated protocol (4 states instead of 3). • See architecture books (MESI protocol).
Problem: False Sharing • Occurs when two or more processors access different data in the same cache line, and at least one of them writes. • Leads to a ping-pong effect: the line bounces between caches.
False Sharing: Example (1 of 3) #pragma omp parallel for schedule(cyclic) for( i=0; i<n; i++ ) a[i] = b[i]; • Let’s assume: • p = 2 • element of a takes 4 words • cache line has 32 words
False Sharing: Example (2 of 3) • [Diagram: a single cache line holds a[0] through a[7]; with the cyclic schedule and p = 2, a[0], a[2], a[4], a[6] are written by processor 0 and a[1], a[3], a[5], a[7] by processor 1]
False Sharing: Example (3 of 3) • [Diagram: successive writes to a[0], a[1], a[2], a[3], a[4], a[5], … make the cache line ping-pong between P0 and P1, each write invalidating the other processor’s copy]
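Two common remedies are sketched below (illustrative only; the 64-byte cache-line size and the names are assumptions, not from the slides): give each thread a contiguous block of iterations, or pad per-thread data so that no two threads write into the same cache line.

/* false_sharing_fix.c -- two ways to avoid false sharing (sketch).
   Build with: cc -fopenmp false_sharing_fix.c                       */
#include <omp.h>

#define N           1024
#define CACHE_LINE  64         /* assumed line size in bytes         */
#define MAX_THREADS 64         /* arbitrary upper bound on threads   */

double a[N], b[N];

/* Remedy 1: block (static) scheduling gives each thread a contiguous
   chunk of a[], so two threads can share a line only at chunk edges. */
void copy_blocked(void) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = b[i];
}

/* Remedy 2: pad per-thread data so each thread's counter occupies
   its own cache line.                                                */
struct padded { long value; char pad[CACHE_LINE - sizeof(long)]; };
struct padded hits[MAX_THREADS];

void count_positive(void) {
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < N; i++)
            if (a[i] > 0.0)
                hits[t].value++;   /* no other thread touches this line */
    }
}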
Summary • Sequential consistency. • Bus-based coherence protocols. • False sharing.
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors J.M. Mellor-Crummey, M.L. Scott (MCS Locks)
Introduction • Busy-waiting techniques – heavily used in synchronization on shared memory MPs • Two general categories: locks and barriers • Locks ensure mutual exclusion • Barriers provide phase separation in an application
Problem • Busy-waiting synchronization constructs tend to: • have a significant impact on network traffic due to cache invalidations • suffer from contention, which leads to poor scalability • Main cause: spinning on remote variables
The Proposed Solution • Minimize access to remote variables • Instead, spin on local variables • Claim: • It can be done entirely in software (no need for fancy and costly hardware support) • Spinning on local variables will minimize contention, allow for good scalability, and give good performance
Spin Lock 1: Test-and-Set Lock • Repeatedly test-and-set a boolean flag indicating whether the lock is held • Problem: contention for the flag (read-modify-write instructions are expensive) • Causes lots of network traffic, especially on cache-coherent architectures (because of cache invalidations) • Variation: test-and-test-and-set – less traffic
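A minimal sketch of the test-and-test-and-set variant using C11 atomics (illustrative; this is not the paper's code, and the names are my own):

/* tas_lock.c -- test-and-test-and-set spin lock (sketch, C11 atomics). */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } tas_lock;   /* initialize held to false */

void tas_acquire(tas_lock *l) {
    for (;;) {
        /* "test": spin with cheap reads (hits in the local cache)        */
        while (atomic_load(&l->held))
            ;
        /* "test-and-set": the expensive read-modify-write, attempted
           only when the lock looks free                                   */
        if (!atomic_exchange(&l->held, true))
            return;                              /* lock acquired          */
    }
}

void tas_release(tas_lock *l) {
    atomic_store(&l->held, false);
}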
Test-and-Set with Backoff Lock • Pause between successive test-and-set attempts (“backoff”) • T&S with exponential backoff idea: while (test_and_set(L) fails) { pause(delay); delay = delay * 2; }
Spin Lock 2: The Ticket Lock • Two counters (nr_requests and nr_releases) • Lock acquire: fetch-and-increment on the nr_requests counter; the caller waits until its “ticket” equals the value of the nr_releases counter • Lock release: increment the nr_releases counter
Spin Lock 2: The Ticket Lock • Advantage over T&S: polls with read operations only • Still generates lots of traffic and contention • Can be further improved by using backoff
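A hedged sketch of the ticket lock using C11 atomics (again illustrative rather than the paper's code); note that the acquire path polls now_serving with plain reads only:

/* ticket_lock.c -- ticket lock (sketch, C11 atomics). */
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* nr_requests: next ticket to hand out */
    atomic_uint now_serving;   /* nr_releases: ticket being served     */
} ticket_lock;                 /* zero-initialize both counters        */

void ticket_acquire(ticket_lock *l) {
    /* fetch-and-increment gives each caller a unique ticket            */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    /* wait (read-only polling) until our ticket comes up               */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;   /* backoff proportional to (my_ticket - now_serving) could go here */
}

void ticket_release(ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);
}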