Definitions
• A synchronous application is one where all processes must reach certain points before execution continues.
• Local synchronization is a requirement that a subset of processes (usually neighbors) reach a synchronization point before execution continues.
• A barrier is the basic message-passing mechanism for synchronizing processes.
• Deadlock occurs when groups of processes wait permanently for messages that can never arrive, because the sending processes are themselves waiting permanently for messages.
Barrier Illustration
• Each process executes until it reaches the barrier, then waits until every process has arrived
• C: MPI_Barrier(MPI_COMM_WORLD);
• mpiJava: MPI.COMM_WORLD.Barrier();
Counter (Linear) Barrier
• Master processor
  For (i=0; i<P; i++) // Arrival phase
    Receive null message from any processor
  For (i=0; i<P; i++) // Departure phase
    Send null message to release slaves
• Slave processors
  Send null message to enter barrier
  Receive null message for barrier release
• Note: The separate arrival and departure phases prevent a processor from entering the next barrier before all processors have been released from the current one
Tree (Non-linear) Barrier (P0 … P7)
• Entry phase: processors signal pairwise up a tree (P1→P0, P3→P2, P5→P4, P7→P6; then P2→P0, P6→P4; then P4→P0)
• Release phase: release messages fan back down the tree in the reverse direction
• Note: the implementation logic is similar to divide and conquer
Butterfly Barrier (P0 … P7)
• Pairs of processors exchange messages in lg(P) stages; after the final stage every processor has (transitively) heard from every other, so no separate release phase is needed
• Stage 1: P0↔P1; P2↔P3; P4↔P5; P6↔P7
• Stage 2: P0↔P2; P1↔P3; P4↔P6; P5↔P7
• Stage 3: P0↔P4; P1↔P5; P2↔P6; P3↔P7
Local Synchronization
Synchronize with neighbors before proceeding
• Even-numbered processors
  Send null message to processor i-1
  Receive null message from processor i-1
  Send null message to processor i+1
  Receive null message from processor i+1
• Odd-numbered processors
  Receive null message from processor i+1
  Send null message to processor i+1
  Receive null message from processor i-1
  Send null message to processor i-1
• Notes:
  • Local synchronization is an incomplete barrier: processors exit as soon as they have received messages from their neighbors
  • Deadlock can occur if the message-passing order is incorrect (e.g. if every processor tries to send first with blocking sends)
  • MPI_Sendrecv() and MPI_Sendrecv_replace() are deadlock free
Local Synchronization Example
• Heat distribution problem
  • Goal: determine the final temperature at each point of an n x n grid
  • Initial boundary condition: the temperatures at the grid edges are known and held fixed
  • A processor cannot proceed to the next iteration until local synchronization completes
DO
  Average each grid point with its four neighbors
UNTIL temperature changes are small enough
New value = (∑ neighbors) / 4
Sequential Heat Distribution Code
Initialize rows 0,n and columns 0,n of g and h
iteration = 0
DO
  FOR (i=1; i<n; i++)
    FOR (j=1; j<n; j++)
      IF (iteration % 2)
        h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4
      ELSE
        g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]) / 4
  iteration++
UNTIL max |g[i][j] – h[i][j]| < tolerance or iteration > MAX
Block or Strip Partitioning
Assign portions of the grid to processors in the topology
• Block partitioning (e.g. a 4 x 4 arrangement of blocks for P = 16)
  • Eight messages exchanged at each iteration (a send and a receive with each of four neighbors)
  • Data exchanged per message is n/√P
• Strip partitioning (e.g. column strips p0 … p7)
  • Four messages exchanged at each iteration (a send and a receive with each of two neighbors)
  • Data exchanged per message is n
• Question: Which is better?
Parallel Implementation
• Algorithm modifications
  • Declare "ghost" rows and columns to hold copies of adjacent processors' edge data (e.g. a 10 x 10 array for an 8 x 8 block)
  • Exchange edge data with the neighbor processors to the north, south, east, and west
  • Perform the calculation for the local grid cells
Heat Distribution Partitioning
Main logic
  For each iteration
    For each local point: compute new temperature
    SendRcv(row-1, col, point)  // exchange with north neighbor
    SendRcv(row+1, col, point)  // exchange with south neighbor
    SendRcv(row, col-1, point)  // exchange with west neighbor
    SendRcv(row, col+1, point)  // exchange with east neighbor
SendRcv(row, col, point)  // communicates only if (row,col) is not local
  If myrank is even
    Send(point, p_row,col)
    Recv(point, p_row,col)
  Else
    Recv(point, p_row,col)
    Send(point, p_row,col)
Note: p_row,col denotes the processor owning point (row,col); the even/odd send-receive ordering prevents deadlock
Fully Synchronized Example
• Data parallel computations
  • Simultaneously apply the same operation to different data
• Sequential code
  for (i=0; i<n; i++) a[i] = someFunction(a[i]);
• Shared memory code
  forall (i=0; i<n; i++) {bodyOfInstructions}
  • In these cases, the end of the loop is a natural barrier
• Distributed memory code
  for each local a[i]: {someFunction(a[i])}
  barrier();
Data Parallel Example: A[] += k
• Each processor handles one element: A[0] += k on p0, A[1] += k on p1, …, A[n-1] += k on pn-1
• All processors execute instructions in "lock step"
  forall (i=0; i<n; i++) a[i] += k
• Note: multicomputer configurations partition the numbers into blocks
Prefix Sum Problem
• Definition: Given numbers a[i], i = 0, …, n-1, the prefix sum replaces each a[i] with a[0] + a[1] + … + a[i]
• Note: the prefix sum algorithm works for any associative operation
• Application: radix sort
• Sequential code
  for (j=0; j<lg(n); j++)
    for (i=2^j; i<n; i++) a[i] += a[i-2^j];
• Parallel shared memory code
  for (j=0; j<lg(n); j++)
    forall (i=2^j; i<n; i++) a[i] += a[i-2^j];
• Parallel distributed memory code
  for (j=1; j<=lg(n); j++)
    if (myrank + 2^(j-1) < n) send(a[myrank], myrank + 2^(j-1))
    if (myrank >= 2^(j-1)) { receive(sum, myrank – 2^(j-1)); a[myrank] += sum }
  // each rank sends its old value before the receive updates a[myrank]
Synchronous Iteration
• Processes synchronize at each iteration step
• Example: simulation of natural processes
• Shared memory code
  for (j=0; j<n; j++)        // iteration loop
    forall (i=0; i<N; i++)   // N processes execute in parallel
      body(i);
• Distributed memory code
  for (j=0; j<n; j++)
    body(myRank);
    barrier();
Example: n equations in n unknowns
a_{n-1,0}x_0 + a_{n-1,1}x_1 + … + a_{n-1,n-1}x_{n-1} = b_{n-1}
∙∙∙
a_{k,0}x_0 + a_{k,1}x_1 + … + a_{k,n-1}x_{n-1} = b_k
∙∙∙
a_{1,0}x_0 + a_{1,1}x_1 + … + a_{1,n-1}x_{n-1} = b_1
a_{0,0}x_0 + a_{0,1}x_1 + … + a_{0,n-1}x_{n-1} = b_0
• Or rewrite equation k to solve for x_k:
x_k = (b_k – a_{k,0}x_0 – … – a_{k,k-1}x_{k-1} – a_{k,k+1}x_{k+1} – … – a_{k,n-1}x_{n-1}) / a_{k,k}
    = (b_k – ∑_{j≠k} a_{k,j}x_j) / a_{k,k}
Jacobi Iteration
xnew_i = initial guess
DO
  x_i = xnew_i
  xnew_i = calculated next guess
UNTIL ∑_i |xnew_i – x_i| < tolerance
• Jacobi iteration is guaranteed to converge if the matrix is diagonally dominant: |a_{k,k}| > ∑_{j≠k} |a_{k,j}| for every row k
• (Figure: the error shrinks between successive iterations i and i+1)
Parallel Jacobi Code
x_i = b_i   // initial guess
DO
  for each local i
    sum = –a_{i,i} * x_i
    FOR (j=0; j<n; j++)
      sum += a_{i,j} * x_j        // note a_{i,j}, not a_{i,i}
    xnew_i = (b_i – sum) / a_{i,i}
  allgather(xnew_i into x)        // every processor receives all xnew values
  barrier()
UNTIL iterations > MAX or ∑_i |xnew_i – x_i| < tolerance
• Note: Allgather() collects each processor's xnew_i into every processor's x array and is itself a synchronization point
Additional Jacobi Notes
• What if P (processor count) < n?
  • Answer: allocate blocks of variables to processors
• Block allocation
  • Allocate consecutive x_i to each processor
• Cyclic allocation
  • Allocate x_0, x_P, … to p0
  • Allocate x_1, x_{P+1}, … to p1 … etc.
• Question: Which allocation scheme is better?
(Figure: Jacobi performance — total time split into computation and communication, plotted for 4 to 24 processors)
Cellular Automata
• Definition
  • The system has a finite grid of cells
  • Each cell can assume one of a finite number of states
  • Neighbor cells affect a cell according to a rule set
  • All cell state changes occur simultaneously
  • The system iterates through a number of generations
• Serious applications:
  • Fluid and gas dynamics
  • Biological growth
  • Airplane wing airflow
  • Erosion modeling
  • Groundwater pollution
Conway’s Game of Life
• The grid is a two-dimensional array of cells
• The grid ends can optionally wrap around (like a torus)
• Each cell
  • Can hold one "organism"
  • Has eight neighbor cells: north, northeast, east, southeast, south, southwest, west, northwest
• Rules (run the simulation over many generations)
  • An organism dies of loneliness if 0 or 1 organisms live in neighbor cells
  • An organism survives if 2 or 3 organisms live in neighbor cells
  • An empty cell with exactly 3 living neighbors gives birth to a new organism
  • An organism dies of overpopulation if >= 4 organisms live in neighbor cells
Sharks and Fishes
• The grid (ocean) is modeled by a three-dimensional array
• The grid ends can optionally wrap around (like a torus)
• Each cell
  • Can hold either a fish or a shark, but not both
  • Has twenty-six neighbor cells
• Rules for fish
  • Fish move randomly to empty adjacent cells
  • If there are no empty adjacent cells, fish stay put
  • Fish of breeding age leave a baby fish in the vacated cell
  • Fish die after x generations
• Rules for sharks
  • Sharks move randomly to adjacent cells containing fish, eating the fish
  • If no adjacent cells have fish, the shark moves randomly to an empty cell; it stays put if there are no empty cells
  • Sharks of breeding age leave a baby shark in the vacated cell
  • Sharks die if they don't eat a fish for y generations