380 likes | 581 Views
Distributed Shared Memory (part 1). Distributed Shared Memory (DSM). shared memory. network. mem0. mem1. mem2. memN. proc0. proc1. proc2. procN. Shared memory programming. Standard – pthread synchronizations Barriers Locks Semaphores. Sequential SOR.
E N D
Distributed Shared Memory (DSM) shared memory network mem0 mem1 mem2 memN ... proc0 proc1 proc2 procN
Shared memory programming • Standard – pthread • synchronizations • Barriers • Locks • Semaphores
Sequential SOR for some number of timesteps/iterations { for (i=0; i<n; i++ ) for( j=1, j<n, j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] grid[i][j-1] + grid[i][j+1] ); for( i=0; i<n; i++ ) for( j=1; j<n; j++ ) grid[i][j] = temp[i][j]; }
Parallel SOR with Barriers (1 of 2) void* sor (void* arg) { int slice = (int)arg; int from = (slice * (n-1))/p + 1; int to = ((slice+1) * (n-1))/p + 1; for some number of iterations { … } }
Parallel SOR with Barriers (2 of 2) for (i=from; i<to; i++) for (j=1; j<n; j++) temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]); barrier(); for (i=from; i<to; i++) for (j=1; j<n; j++) grid[i][j]=temp[i][j]; barrier();
Differences between SMP and Software DSM • Delay: tradeoffs, such as block size • Software => traps: cost of read/write misses • Goals of caches: multiprocessor = performance, dist. system = transparency • bus vs. long networks: reliance on serialization and broadcast.
Consequent differences in protocols and applications • Bigger block size • Cost amortization, higher hit ratio for larger blocks? • Reduced overhead • But therefore... • Migration vs. Replication • False sharing increases • DSM protocol more complex: Must handle lost, corrupted, and out-of-order packets • Above, coupled with cost of traps, => SDSM consistency cost much higher!
Results of high consistency costs • Manage sharing more carefully • Align data to page boundaries
Consistency Models • Sequential Consistency • All processors observe the same order • Must correspond to some serial order • Only ordering constraint is that reads/writes of P1 appear in the same order, but no restrictions on relative ordering between processors.
Common consistency protocols • Write update • Multicast update to all replicas • Write invalidate • Invalidate cached copies in p2, p3 • Cache miss if p2/p3 access X • Valid data from other cache
Conventional Implementation • As proposed by Li & Hudak, TOCS ‘86. • Use virtual memory to implement sharing. • Shared memory divided up by virtual memory pages. • Use single-writer, multiple-reader write-invalidate coherence protocol. • Keep pages in one of three states: • invalid, read-only, read-write
Example shared memory proc0 proc1 proc2 procN
Example: Read Access Hit read proc0 proc1 proc2 procN
Example: Write Access Hit write proc0 proc1 proc2 procN
Example: Read Access Miss read proc0 proc1 proc2 procN
Example: Read Fault read fault proc0 proc1 proc2 procN
Example: Replication on Read read proc0 proc1 proc2 procN
Example: Write Access Miss write proc0 proc1 proc2 procN
Example: Write Fault write fault proc0 proc1 proc2 procN
Example: Write Invalidation write proc0 proc1 proc2 procN
Example: Write Access to Read-Only write proc0 proc1 proc2 procN
Example: Write Fault write fault proc0 proc1 proc2 procN
Example: Write Invalidation write proc0 proc1 proc2 procN
How to Remember Locations? • Broadcast on miss (as in SMP). • Static home. • Dynamic home or owner.
Ownership and Owner Location • Owner is the last writer. • Owner maintains copyset. • Every processor maintains probable owner (not always the real owner).
Ownership Location • Every read or write miss is sent to (local) probable owner. • If owner, handle appropriately, else forward to probable owner.
Ownership Modification • If write miss, new writer becomes owner, and all forwarders set probable owner to requester. • If read miss, set probable owner to responding processor.
Example • Initially, owner(page0) = p0, and probable owner(page0) = p0 everywhere. • Write miss by p1, sends message to its probable owner (p0), handled there, new owner = p1, probable owner(0) on p0 = 1. • Read miss by p2, sends message to probable owner (p0), forwarded to probable owner (p1), handled there, probable owner(0) on p2 becomes p1.
Implement synchronizations • Use messages to implement synchronizations
Barriers • Designate one processor as barrier manager. • When a process waits at a barrier, it sends an arrival message to the barrier manager and waits. • When barrier manager has received all messages, it sends a departure message to all processes.
Locks • Designate one process as the lock manager for a particular lock. • When a process acquires a lock, it sends an acquire message to the manager and waits. • Manager forwards message to last acquirer. • If lock free, send lock grant message. • If lock held, hold on to request until free, and then send lock grant message.
Problem: False Sharing • Concurrent access to different data within the same consistency unit. • With page as consistency unit, lots of opportunity for false sharing. • Two flavors: • read-write • write-write
Read-Write False Sharing (Cont.) w(x) w(x) w(x) r(x) r(y) r(y) synch
Read-Write False Sharing (Cont.) w(x) w(x) w(x) r(x) r(y) r(y) synch
Write-Write False Sharing w(x) w(x) w(x) r(x) w(y) w(y) synch
Summary • Software shared memory on distributed memory hardware. • Uses virtual memory. • Home migration to improve locality • important because of high latencies. • Sequential consistency suffers from false sharing