ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines
Two Parallel Architectures • Shared memory machines. • Distributed memory machines.
Shared Memory: Logical View [Figure: processors proc1, proc2, proc3, …, procN all access a single shared memory space]
Shared Memory Machines • Small number of processors: shared memory with coherent caches (SMP). • Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
SMPs • 2- or 4-processor PCs are now commodity items. • Good price/performance ratio. • Memory is sometimes a bottleneck (see later). • Typical price (8-node): ~ $20-40k.
Physical Implementation [Figure: proc1 … procN each have a private cache (cache1 … cacheN); the caches connect over a bus to the shared memory]
CC-NUMA: Physical Implementation [Figure: each processor proc1 … procN has a private cache (cache1 … cacheN) and a local memory (mem1 … memN); the nodes are joined by an interconnect]
Caches in Multiprocessors • Suffer from the coherence problem: • the same line appears in two or more caches • one processor writes a word in the line • other processors can now read stale data • Leads to the need for a coherence protocol • avoids coherence problems • Many protocols exist; we will look at a simple one.
What is coherence? • What does it mean for memory to be shared? • Intuitively: a read returns the last value written. • This notion is not well defined in a system without a global clock.
The Notion of “last written” in a Multi-processor System [Figure: four timelines; P0 performs r(x), P1 performs w(x), P2 performs w(x), P3 performs r(x), with no global ordering among them]
The Notion of “last written” in a Single-machine System [Figure: a single timeline on which w(x), w(x), r(x), r(x) occur in one total order, so “last written” is unambiguous]
Coherence: a Clean Definition • Is achieved by referring back to the single machine case. • Called sequential consistency.
Sequential Consistency (SC) • Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.
Returning to our Example [Figure: the four timelines again; P0: r(x), P1: w(x), P2: w(x), P3: r(x)]
Another Way of Defining SC • All memory references of a single process execute in program order. • All writes are globally ordered.
SC: Example 1 Initial values of x and y are 0. w(x,1) w(y,1) r(x) r(y) What are the possible final values read?
SC: Example 2 [Figure: P1 executes w(x,1) then w(y,1); P2 executes r(y) then r(x)]
SC: Example 3 [Figure: P1 executes w(x,1) then w(y,1); P2 executes r(y) then r(x)]
SC: Example 4 r(x) w(x,1) w(x,2) r(x)
Implementation • There are many ways of implementing SC. • In fact, implementations often enforce somewhat stronger conditions. • We will look at a simple one: the MSI protocol.
Physical Implementation [Figure: proc1 … procN each have a private cache (cache1 … cacheN); the caches connect over a bus to the shared memory]
Fundamental Assumption • The bus is a reliable, ordered broadcast bus: • every message sent by a processor is received by all other processors in the same order. • Also called a snooping bus: • processors (or their caches) snoop on the bus.
States of a Cache Line • Invalid • Shared • read-only, one of many cached copies • Modified • read-write, sole valid copy
Processor Transactions • processor read(x) • processor write(x)
Bus Transactions • bus read(x) • asks for copy with no intent to modify • bus read-exclusive(x) • asks for copy with intent to modify
State Diagram: Steps 0–9 Building the MSI state machine one transition at a time (notation: observed event / resulting bus action; "-" means no bus action): • Step 0: three states, I (Invalid), S (Shared), M (Modified); no transitions yet. • Step 1: I → S on PrRd / BuRd. • Step 2: S → S on PrRd / - (read hit, no bus traffic). • Step 3: I → M on PrWr / BuRdX. • Step 4: S → M on PrWr / BuRdX. • Step 5: M → M on PrWr / - (reads also hit in Modified). • Step 6: M → S on BuRd / Flush (another cache reads: write back and downgrade). • Step 7: S → S on BuRd / - (another cache reads a shared line: no action). • Step 8: S → I on BuRdX / - (another cache writes: invalidate). • Step 9: M → I on BuRdX / Flush (another cache writes: write back and invalidate).
In Reality • Most machines use a slightly more complicated protocol (4 states instead of 3). • See architecture books (MESI protocol).
Problem: False Sharing • Occurs when two or more processors access different data in the same cache line, and at least one of them writes. • Leads to a ping-pong effect: the line bounces between caches.
False Sharing: Example (1 of 3) for( i=0; i<n; i++ ) a[i] = b[i]; • Let's assume we parallelize this loop with: • p = 2 processors • an element of a takes 4 words • a cache line holds 32 words (i.e., 8 elements of a)
False Sharing: Example (2 of 3) [Figure: one cache line holds a[0] … a[7]; some elements are written by processor 0 and the others by processor 1, so both processors write the same line]
False Sharing: Example (3 of 3) [Figure: the line ping-pongs between the caches: P0 writes a[0], a[2], a[4], … while P1 writes a[1], a[3], a[5], …; each write invalidates the other cache's copy and forces the data to move across the bus]
Summary • Sequential consistency. • Bus-based coherence protocols. • False sharing.