ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines
Two Parallel Architectures • Shared memory machines. • Distributed memory machines.
Shared Memory: Logical View [Figure: processors proc1, proc2, proc3, …, procN all access a single shared memory space]
Shared Memory Machines • Small number of processors: shared memory with coherent caches (SMP). • Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
SMPs • 2- or 4-processor PCs are now commodity items. • Good price/performance ratio. • Memory is sometimes a bottleneck (see later). • Typical price (8-node): ~ $20-40k.
Physical Implementation [Figure: proc1 … procN each have a private cache (cache1 … cacheN); the caches connect over a bus to the shared memory]
CC-NUMA: Physical Implementation [Figure: each processor proc1 … procN has a private cache (cache1 … cacheN) and a local memory (mem1 … memN); the nodes are joined by an interconnect]
Caches in Multiprocessors • Suffer from the coherence problem: • the same line appears in two or more caches • one processor writes a word in the line • other processors can now read stale data • Leads to the need for a coherence protocol • avoids coherence problems • Many protocols exist; we will look at a simple one.
What is coherence? • What does it mean for memory to be shared? • Intuitively: a read returns the last value written. • This notion is not well defined in a system without a global clock.
The Notion of “last written” in a Multi-processor System [Figure: four timelines; P0 performs r(x), P1 performs w(x), P2 performs w(x), P3 performs r(x), with no global ordering among them]
The Notion of “last written” in a Single-machine System [Figure: a single timeline on which w(x), w(x), r(x), r(x) occur in one total order, so “last written” is unambiguous]
Coherence: a Clean Definition • Is achieved by referring back to the single machine case. • Called sequential consistency.
Sequential Consistency (SC) • Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.
Returning to our Example [Figure: the four timelines again; P0: r(x), P1: w(x), P2: w(x), P3: r(x)]
Another Way of Defining SC • All memory references of a single process execute in program order. • All writes are globally ordered.
SC: Example 1 Initial values of x and y are 0. w(x,1) w(y,1) r(x) r(y) What are the possible final values read?
SC: Example 2 [Figure: P1 executes w(x,1) then w(y,1); P2 executes r(y) then r(x)]
SC: Example 3 [Figure: P1 executes w(x,1) then w(y,1); P2 executes r(y) then r(x)]
SC: Example 4 r(x) w(x,1) w(x,2) r(x)
Implementation • There are many ways of implementing SC. • In fact, implementations often enforce somewhat stronger conditions. • We will look at a simple one: the MSI protocol.
Physical Implementation [Figure: proc1 … procN each have a private cache (cache1 … cacheN); the caches connect over a bus to the shared memory]
Fundamental Assumption • The bus is a reliable, ordered broadcast bus: • every message sent by a processor is received by all other processors in the same order. • Also called a snooping bus: • processors (or their caches) snoop on the bus.
States of a Cache Line • Invalid • Shared • read-only, one of many cached copies • Modified • read-write, sole valid copy
Processor Transactions • processor read(x) • processor write(x)
Bus Transactions • bus read(x) • asks for copy with no intent to modify • bus read-exclusive(x) • asks for copy with intent to modify
State Diagram: Steps 0–9 Building the MSI state machine one transition at a time (notation: observed event / resulting bus action; "-" means no bus action): • Step 0: three states, I (Invalid), S (Shared), M (Modified); no transitions yet. • Step 1: I → S on PrRd / BuRd. • Step 2: S → S on PrRd / - (read hit, no bus traffic). • Step 3: I → M on PrWr / BuRdX. • Step 4: S → M on PrWr / BuRdX. • Step 5: M → M on PrWr / - (reads also hit in Modified). • Step 6: M → S on BuRd / Flush (another cache reads: write back and downgrade). • Step 7: S → S on BuRd / - (another cache reads a shared line: no action). • Step 8: S → I on BuRdX / - (another cache writes: invalidate). • Step 9: M → I on BuRdX / Flush (another cache writes: write back and invalidate).
In Reality • Most machines use a slightly more complicated protocol (4 states instead of 3). • See architecture books (MESI protocol).
Problem: False Sharing • Occurs when two or more processors access different data in the same cache line, and at least one of them writes. • Leads to a ping-pong effect: the line bounces between caches.
False Sharing: Example (1 of 3) for( i=0; i<n; i++ ) a[i] = b[i]; • Let's assume we parallelize this loop with: • p = 2 processors • an element of a takes 4 words • a cache line holds 32 words (i.e., 8 elements of a)
False Sharing: Example (2 of 3) [Figure: one cache line holds a[0] … a[7]; some elements are written by processor 0 and the others by processor 1, so both processors write the same line]
False Sharing: Example (3 of 3) [Figure: the line ping-pongs between the caches: P0 writes a[0], a[2], a[4], … while P1 writes a[1], a[3], a[5], …; each write invalidates the other cache's copy and forces the data to move across the bus]
Summary • Sequential consistency. • Bus-based coherence protocols. • False sharing.