ECE 1747: Parallel Programming Distributed Shared Memory (DSM)
Multiprocessor (SMP) [Figure: processors proc1, proc2 and proc3, with four copies of X shown, all holding X=0]
Consistency Models • Sequential Consistency • All processors observe the same order • Must correspond to some serial order • The only ordering constraint is that the reads/writes of each processor (e.g. P1) appear in that processor's program order; there is no restriction on the relative ordering between processors.
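To make "must correspond to some serial order" concrete, here is a minimal sketch that is not part of the original slides. It assumes a hypothetical two-processor litmus test (P1 runs `x = 1; r1 = y;`, P2 runs `y = 1; r2 = x;`) and enumerates every interleaving that keeps each processor's operations in program order. All function and variable names are illustrative.

```c
#include <stdbool.h>

/* Hypothetical litmus test, starting from x = y = 0:
 *   P1: x = 1; r1 = y;
 *   P2: y = 1; r2 = x;
 * seen[r1][r2] is set for every outcome reachable by some interleaving
 * that keeps each processor's own operations in program order. */
static bool seen[2][2];

static void step(int pc1, int pc2, int x, int y, int r1, int r2)
{
    if (pc1 == 2 && pc2 == 2) {                 /* both processors finished */
        seen[r1][r2] = true;
        return;
    }
    if (pc1 == 0) step(1, pc2, 1, y, r1, r2);   /* P1: x = 1  */
    if (pc1 == 1) step(2, pc2, x, y, y, r2);    /* P1: r1 = y */
    if (pc2 == 0) step(pc1, 1, x, 1, r1, r2);   /* P2: y = 1  */
    if (pc2 == 1) step(pc1, 2, x, y, r1, x);    /* P2: r2 = x */
}

/* Explore every sequentially consistent execution of the litmus test. */
void enumerate_sc_outcomes(void)
{
    seen[0][0] = seen[0][1] = seen[1][0] = seen[1][1] = false;
    step(0, 0, 0, 0, 0, 0);
}

/* True if the outcome (r1, r2) occurred in some serial order. */
bool outcome_seen(int r1, int r2)
{
    return seen[r1][r2];
}
```

After `enumerate_sc_outcomes()`, the outcome r1 = 0, r2 = 0 is never marked: no serial order produces it, so a sequentially consistent memory must not produce it either.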
Common consistency protocols • Write update • Multicast update to all replicas • Write invalidate • Invalidate cached copies in p2, p3 • Cache miss if p2/p3 access X • Valid data from other cache
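The difference between the two protocols can be sketched with a toy model of a single variable X replicated in three caches. This is an illustrative simplification, not the protocol of any particular machine; the names `cache_reset`, `write_update`, `write_invalidate` and `read_x` are made up for the example.

```c
#include <stdbool.h>

#define NCACHES 3

/* Toy model: one shared variable X, replicated in NCACHES caches. */
struct cache_line { bool valid; int value; };

static struct cache_line cache[NCACHES];
static int misses;

/* Start with a valid copy of X = 0 in every cache. */
void cache_reset(void)
{
    for (int p = 0; p < NCACHES; p++) {
        cache[p].valid = true;
        cache[p].value = 0;
    }
    misses = 0;
}

/* Write update: the writer multicasts the new value to all replicas. */
void write_update(int writer, int v)
{
    cache[writer].valid = true;
    cache[writer].value = v;
    for (int p = 0; p < NCACHES; p++)
        if (cache[p].valid)
            cache[p].value = v;
}

/* Write invalidate: every other cached copy is invalidated. */
void write_invalidate(int writer, int v)
{
    for (int p = 0; p < NCACHES; p++)
        cache[p].valid = false;
    cache[writer].valid = true;
    cache[writer].value = v;
}

/* Read X; an invalid copy is a cache miss served from another cache.
 * The model always keeps at least one valid copy after cache_reset(). */
int read_x(int reader)
{
    if (!cache[reader].valid) {
        misses++;
        for (int p = 0; p < NCACHES; p++)
            if (cache[p].valid) {
                cache[reader] = cache[p];
                break;
            }
    }
    return cache[reader].value;
}

int miss_count(void)
{
    return misses;
}
```

In this model a write under write update costs a multicast but later reads hit, while a write under write invalidate makes the next read on each other processor a miss.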
Distributed Shared Memory (DSM) [Figure: processors proc0, proc1, proc2, ..., procN, each with its own memory mem0, mem1, mem2, ..., memN, connected by a network and presented as one shared memory]
DSM programming • Standard – pthread-like • synchronizations • Barriers • Locks • Semaphores
Sequential SOR
for some number of timesteps/iterations {
    /* Update interior points only; rows and columns 0 and n are assumed to be fixed boundaries. */
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            grid[i][j] = temp[i][j];
}
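Parallel SOR with Barriers (1 of 2)
void* sor (void* arg) {
    int slice = (int)(intptr_t)arg;          /* slice index passed through the thread argument; needs <stdint.h> */
    int from = (slice * (n-1))/p + 1;        /* first row owned by this slice */
    int to = ((slice+1) * (n-1))/p + 1;      /* one past the last row owned by this slice */
    for some number of iterations { … }
}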
Parallel SOR with Barriers (2 of 2)
for (i = from; i < to; i++)
    for (j = 1; j < n; j++)
        temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
barrier();
for (i = from; i < to; i++)
    for (j = 1; j < n; j++)
        grid[i][j] = temp[i][j];
barrier();
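The `from`/`to` arithmetic in `sor()` is easy to get wrong, so here is a small sketch that checks it. It uses the same formulas as the slide; the helper names `slice_bounds` and `slices_tile_rows` are introduced only for this example.

```c
/* Row range [from, to) owned by `slice` out of p slices, using the same
 * arithmetic as sor(): interior rows 1..n-1 are split among the slices. */
void slice_bounds(int slice, int n, int p, int *from, int *to)
{
    *from = (slice * (n - 1)) / p + 1;
    *to   = ((slice + 1) * (n - 1)) / p + 1;
}

/* Returns 1 if the p slices cover rows 1..n-1 with no gap and no overlap. */
int slices_tile_rows(int n, int p)
{
    int expected = 1;                 /* first interior row */
    for (int s = 0; s < p; s++) {
        int from, to;
        slice_bounds(s, n, p, &from, &to);
        if (from != expected || to < from)
            return 0;
        expected = to;                /* next slice must start here */
    }
    return expected == n;             /* last slice must end at row n-1 */
}
```

For n = 10 and p = 4 the slices are [1,3), [3,5), [5,7) and [7,10). Each thread writes only its own rows of `temp` and `grid`, and the two barriers separate the phase that reads neighbouring rows from the phase that overwrites them.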
Sequential Consistency DSM • As proposed by Li & Hudak (PODC '86; TOCS '89). • Use virtual memory to implement sharing. • Shared memory divided up by virtual memory pages. • Use an SMP-like coherence protocol. • Keep pages in one of three states: • invalid, read-only, read-write
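The three page states can be sketched as a small state machine for one page shared by three nodes. This is a simplified single-writer/multiple-reader model written for illustration, not the manager algorithms from the paper; a real system would take these transitions inside a virtual-memory protection fault handler. All names are hypothetical.

```c
#define NNODES 3

enum page_state { INVALID, READ_ONLY, READ_WRITE };

static enum page_state state[NNODES];   /* state of the page on each node */
static int faults;                      /* number of page faults taken */

/* Initially only `owner` holds the page, in read-write state. */
void page_init(int owner)
{
    for (int n = 0; n < NNODES; n++)
        state[n] = INVALID;
    state[owner] = READ_WRITE;
    faults = 0;
}

/* Read access: faults only if the local copy is invalid. */
void page_read(int node)
{
    if (state[node] != INVALID)
        return;                          /* read hit */
    faults++;
    for (int n = 0; n < NNODES; n++)
        if (state[n] == READ_WRITE)
            state[n] = READ_ONLY;        /* writer is downgraded */
    state[node] = READ_ONLY;             /* faulting node gets a read copy */
}

/* Write access: faults unless the node already has the page read-write. */
void page_write(int node)
{
    if (state[node] == READ_WRITE)
        return;                          /* write hit */
    faults++;
    for (int n = 0; n < NNODES; n++)
        if (n != node)
            state[n] = INVALID;          /* invalidate all other copies */
    state[node] = READ_WRITE;
}

enum page_state page_state_of(int node) { return state[node]; }
int page_faults(void) { return faults; }
```

The model keeps the usual invariant: at any time either one node holds the page read-write, or any number of nodes hold it read-only.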
SC implementation • Synchronous read/write • Writes must be propagated before moving on to the next operation
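Read-Write False Sharing (Cont.) [Figure: processor timelines with operations w(x), w(x), w(x), r(x), r(y), r(y)]
Read-Write False Sharing (Cont.) [Figure: processor timelines with operations w(x), w(x), w(x), r(x), r(y), r(y) and a synch point]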
Weak Consistency (WEAKC) • Data modifications are only propagated at the time of synchronization. • Works fine if program is properly synchronized through system primitives. • All programs should be …
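A minimal sketch of the idea that modifications are propagated only at synchronization, assuming two processors that each keep a private copy of a four-word page. The per-word dirty flags are a simplification chosen for this example, and the names are hypothetical.

```c
#include <stdbool.h>

#define WC_NPROC  2
#define WC_NWORDS 4

static int  local_copy[WC_NPROC][WC_NWORDS];   /* each processor's copy */
static bool dirty[WC_NPROC][WC_NWORDS];        /* written since last sync */

void wc_init(void)
{
    for (int p = 0; p < WC_NPROC; p++)
        for (int i = 0; i < WC_NWORDS; i++) {
            local_copy[p][i] = 0;
            dirty[p][i] = false;
        }
}

/* Writes stay local until the next synchronization. */
void wc_write(int p, int i, int v)
{
    local_copy[p][i] = v;
    dirty[p][i] = true;
}

/* Reads see only the local copy. */
int wc_read(int p, int i)
{
    return local_copy[p][i];
}

/* Synchronization: every modified word is propagated to all copies.
 * If two processors wrote the same word without synchronizing, the
 * program has a data race and the result here is arbitrary. */
void wc_sync(void)
{
    for (int p = 0; p < WC_NPROC; p++)
        for (int i = 0; i < WC_NWORDS; i++)
            if (dirty[p][i]) {
                for (int q = 0; q < WC_NPROC; q++)
                    local_copy[q][i] = local_copy[p][i];
                dirty[p][i] = false;
            }
}
```

A read of a word written by another processor returns the old value until `wc_sync()` runs, which is acceptable only for programs that synchronize before they depend on each other's writes.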
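Read-Write False Sharing (Before) [Figure: processor timelines with operations w(x), w(x), w(x), r(x), r(y), r(y) and a synch point]
Read-Write False Sharing (WEAKC) [Figure: processor timelines with operations w(x), w(x), r(y), r(y), r(x) and a synch point]
Write-Write False Sharing [Figure: processor timelines with operations w(x), w(x), w(x), r(x), w(y), w(y) and a synch point]
Write-Write False Sharing (WEAKC) [Figure: processor timelines with operations w(x), w(x), w(x), r(x), w(y), w(y) and a synch point]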
Multiple Writer (MW) Protocols • Allows multiple writers per page. • Modifications merged at synchronization (according to weakc definition). • Modifications are recorded through a mechanism called twinning and diffing.
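Write-Write False Sharing and MW [Figure: processor timelines with operations w(x), w(x), w(x), w(y), r(x), w(y) and a synch point]
Creating a diff (delta) [Figure: a write-protected page is written with w(x); a twin is made and the page becomes writable; the page and its twin are compared to produce the diff (delta), and the page is write-protected again]
Write-Write False Sharing and MW [Figure: the same timelines with a twin on each writer; the modifications to x and to y are exchanged at the synch point]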
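Twinning and diffing can be sketched in a few lines. The example assumes an eight-word page and word-granularity comparison; the `diff_entry` layout and the function names are illustrative and not taken from any particular DSM system.

```c
#include <string.h>

#define PAGE_WORDS 8

/* One modified word: its position in the page and its new value. */
struct diff_entry { int index; int value; };

/* Twin: a pristine copy of the page, made before the first write. */
void make_twin(const int *page, int *twin)
{
    memcpy(twin, page, PAGE_WORDS * sizeof *page);
}

/* Diff: every word that differs from the twin. Returns the entry count. */
int make_diff(const int *page, const int *twin, struct diff_entry *out)
{
    int n = 0;
    for (int i = 0; i < PAGE_WORDS; i++)
        if (page[i] != twin[i]) {
            out[n].index = i;
            out[n].value = page[i];
            n++;
        }
    return n;
}

/* Merge another writer's modifications into the local copy. */
void apply_diff(int *page, const struct diff_entry *d, int n)
{
    for (int k = 0; k < n; k++)
        page[d[k].index] = d[k].value;
}
```

If one processor writes x (say word 0) and another writes y (say word 5) on the same page, each produces a one-entry diff. Applying each diff to the other copy at synchronization leaves both copies identical, so neither writer has to invalidate the other's page.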
Release Consistency (RC) • Distinguish acquires from releases • Ordinary read/write wait until the previous acquire is performed • Release waits until previous read/write are performed • Acquire/release are sequentially consistent w.r.t. one another
Eager & Lazy Release Consistency • Eager release consistency: transfer consistency information at release of a lock. • Lazy release consistency: transfer consistency information at acquire of a lock.
Eager Release Consistency [Figure: p1: w(x) rel; p2: acq w(x) rel; p3: acq w(x) rel; p4: acq r(x)]
Lazy Release Consistency [Figure: p1: w(x) rel; p2: acq w(x) rel; p3: acq w(x) rel; p4: acq r(x)]
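The cost difference between the two variants can be sketched by counting consistency messages for a lock that passes along a chain of four processors, each writing x before releasing. The model is deliberately crude: one message per transfer, no lock traffic or acknowledgements, and in the eager case every other processor is assumed to hold a copy. All names are hypothetical.

```c
#define RC_NPROC 4

struct rc_sim {
    int copy[RC_NPROC];   /* each processor's copy of x */
    int last_releaser;    /* lazy variant only; -1 if nobody has released */
    int messages;         /* consistency messages sent so far */
};

void rc_init(struct rc_sim *s)
{
    for (int p = 0; p < RC_NPROC; p++)
        s->copy[p] = 0;
    s->last_releaser = -1;
    s->messages = 0;
}

/* Eager: at release, push the modification to every other processor. */
void eager_release(struct rc_sim *s, int p)
{
    for (int q = 0; q < RC_NPROC; q++)
        if (q != p) {
            s->copy[q] = s->copy[p];
            s->messages++;
        }
}

/* Lazy: a release only records who released last. */
void lazy_release(struct rc_sim *s, int p)
{
    s->last_releaser = p;
}

/* Lazy: at acquire, pull the modification from the last releaser. */
void lazy_acquire(struct rc_sim *s, int p)
{
    if (s->last_releaser >= 0 && s->last_releaser != p) {
        s->copy[p] = s->copy[s->last_releaser];
        s->messages++;
    }
}

/* All but the last processor acquire, write x and release in turn;
 * the last processor acquires and reads x. Returns the value read. */
int run_chain(struct rc_sim *s, int lazy)
{
    rc_init(s);
    for (int p = 0; p < RC_NPROC - 1; p++) {
        if (lazy) lazy_acquire(s, p);
        s->copy[p] = p + 1;                       /* w(x) */
        if (lazy) lazy_release(s, p);
        else      eager_release(s, p);
    }
    if (lazy) lazy_acquire(s, RC_NPROC - 1);
    return s->copy[RC_NPROC - 1];                 /* r(x) */
}
```

Both variants deliver the final value of x to the last processor, but under these assumptions the eager run sends 9 messages and the lazy run sends 3, because only the next acquirer needs to see each modification.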
Lazy Release Consistency • Acquiring processor determines which modifications it needs to see. [Figure: p1: w(x) rel; p2: acq w(y) rel; p3: acq r(x) r(y); synch]
Vector Timestamps [Figure: p1: w(x) rel, with timestamps (1 0 0) and (0 0 0); p2: acq w(y) rel, with timestamps (1 1 0) and (0 0 0); p3: acq r(x) r(y), with timestamp (0 0 0)]
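One common way to use such vectors, sketched here under the assumption that entry p counts the intervals of processor p that a node has seen, is to compare the acquirer's vector with the releaser's and then take the elementwise maximum. The function names are hypothetical.

```c
#define VT_NPROC 3

/* For each processor p, count the intervals the releaser has seen but
 * the acquirer has not. Returns the total number of missing intervals. */
int missing_intervals(const int *acquirer, const int *releaser, int *missing)
{
    int total = 0;
    for (int p = 0; p < VT_NPROC; p++) {
        missing[p] = releaser[p] > acquirer[p] ? releaser[p] - acquirer[p] : 0;
        total += missing[p];
    }
    return total;
}

/* After the acquire, the acquirer has seen everything the releaser had. */
void vt_merge(int *acquirer, const int *releaser)
{
    for (int p = 0; p < VT_NPROC; p++)
        if (releaser[p] > acquirer[p])
            acquirer[p] = releaser[p];
}
```

With the vectors from the slide, an acquirer at (0 0 0) that acquires from a releaser at (1 1 0) is missing one interval of the first processor and one of the second, which correspond to the writes to x and y that it must see before r(x) and r(y).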
DSM Summary • Relaxed consistency • application’s definition of correctness • >70% of the performance of the corresponding message-passing applications