440 likes | 546 Views
Introductory Parallel Applications. Courtesy: Dr. David Walker, Cardiff University. Example1 – wave equation Courtesy: David Walker, Cardiff University. Problem – Vibrating string of length L fixed at both ends given an initial displacement. To determine Ψ (x, t). displacement. x=L. x=0.
E N D
Introductory Parallel Applications Courtesy: Dr. David Walker, Cardiff University
Example1 – wave equationCourtesy: David Walker, Cardiff University • Problem – Vibrating string of length L fixed at both ends given an initial displacement. • To determine Ψ(x, t) displacement x=L x=0
Vibrating String Problem • To solve ∂2Ψ/c2∂t2 - ∂2Ψ/∂x2 = 0 • Subject to Ψ(0, t) = Ψ(L, t) = 0; Ψ(x, 0) = u(x) • Numerical approximation: Ψi(t+∆t) = 2Ψi(t) – Ψi(t-∆t) + tau2(Ψi-1(t) -2Ψi(t) + Ψ i+1(t)) where tau = c ∆t/ ∆x and u(x) = sinx
Vibrating String Problem – Sequential Problem for(i=0; i<npoints; i++){ x = 2*pi*i/(npoints-1); x = sin(x); ps[i] = old_ps[i] = x; } for(j=0; j<nsteps; j++){ for(i=1; i<npoints-1; i++){ new_ps[i] = 2*ps[i] – old_ps[i] + tau*tau*(ps[i-1] + 2ps[i] + ps[i+1]); } for(i=0; i<npoints; i++){ old_ps[i] = ps[i]; ps[i] = new_ps[i]; } }
Vibrating String – Parallel Formulation • The points are divided across processes • Communication between neighboring processors • Data structure – an array for every processor with sizes local_size+2
Vibrating String – Parallel Code /* Set up 1D topology */ MPI_Cart_create(MPI_COMM_WORLD,1,&nprocs,&periods, reorder,&new_comm); MPI_Cart_coords (new_comm, rank, 1, &mypos) MPI_Cart_shift (new_comm, 0, -1, &right, &left); /* Initialise array */ local_npts = npoints/nprocs; lbound = mypos*local_npts; for(i=0;i<local_npts;i++){ x = 2.0*PI*(double)(lbound+i)/(double)(npoints-1); x = sin(x); psi[i+1] = old_psi[i+1] = x; }
Vibrating String – Parallel Code /* Do the updates */ istart = (mypos==0) ? 2 : 1; iend = (mypos==nprocs-1) ? local_npts-1 : local_npts; for(j=0;j<NSTEPS;j++){ MPI_Sendrecv (&psi[1], 1, MPI_DOUBLE, left, 111, &psi[local_npts+1], 1, MPI_DOUBLE, right, 111, new_comm, &status); MPI_Sendrecv (&psi[local_npts], 1, MPI_DOUBLE, right, 111, &psi[0], 1, MPI_DOUBLE, left, 111, new_comm, &status); for(i=istart;i<=iend;i++){ new_psi[i] = 2.0*psi[i] - old_psi[i] + tau*tau*(psi[i-1]- 2.0*psi[i]+psi[i+1]); } for(i=1;i<=local_npts;i++){ old_psi[i] = psi[i]; psi[i] = new_psi[i]; } }
Example 2 – Laplace equation problemCourtesy: Dr. David Walker, Cardiff University • Electric field around a conducting object in a box • To solve ∂2Ψ/∂x2 + ∂2Ψ/∂y2 = 0 • Subject to Ψ(0, y) = Ψ(L1, y) = 0; Ψ(x, 0) = Ψ(x, L2) = 0; Ψ(x, y) =1 on S(x, y) x=0 x=L1 y=0 S(x, y) y=L2
Laplace equation – Numerical approximation • We consider a square problem, L1 = L2 = L • Ψi,jk = ¼(Ψi-1,jk-1 + Ψi+1,jk-1 + Ψi,j-1k-1 + Ψi,j+1k-1) 0 1
Laplace Equation – Sequential code double phi[NXMAX][NYMAX]; double old_phi[NXMAX][NYMAX]; int mask[NXMAX][NYMAX]; setup_grid(phi, nptsx, nptsy, mask); for(k=0; k<nsteps; k++){ for(i=0; i<nptsx; i++){ for(j=0; j<nptsy; j++){ old_phi[i][j] = phi[i, j]; } } for(i=0; i<nptsx; i++){ for(j=0; j<nptsy; j++){ if(mask[i][j]) phi[i][j] = ¼*(old_phi[i][j-1] + old_phi[i][j+1] + old_phi[i-1][[j] + old_phi[i][[j-1]); } } }
Laplace Equation – Sequential code setup_grid(phi, nptsx, nptsy, mask){ for(i=0; i<nptsx; i++){ for(j=0; j<nptsy; j++){ phi[i, j] = 0.0; mask[i, j] = 1; } } mask[0…nptsx-1][0] = mask[0..nptsx-1][nptsy-1] = mask[0][0..nptsy-1] = mask[nptsx-1][0..nptsy-1] = 0; mask[nptsx/2, nptsx/2-1][nptsy/2, nptsy/2-1] = 0; phi[nptsx/2, nptsx/2-1][nptsy/2, nptsy/2-1] = 1;
Laplace Equation - Parallelization • A 2-D topology of processes are considered • The updates of boundary values for a given processor may need communication from its neighboring processors. • Following are main data structures double phi[NYMAX+2][NXMAX+2]; double oldphi[NYMAX+2][NXMAX+2]; int mask[NYMAX+2][NXMAX+2];
LAPLACE Equation - Parallelization /* Work out number of processes in each direction of the process mesh */ dims[0] = dims[1] = 0; MPI_Dims_create (nprocs, 2, dims); npy = dims[0]; npx = dims[1]; /* Set up 2D topology */ periods[0] = periods[1] = 0; MPI_Cart_create (MPI_COMM_WORLD, 2, dims, periods, 1, &new_comm); MPI_Cart_coords (new_comm, rank, 2, coords); myposy = coords[0]; myposx = coords[1]; /* Determine neighbouring processes for communication shifts */ MPI_Cart_shift (new_comm, 0, -1, &down, &up); MPI_Cart_shift (new_comm, 1, -1, &right, &left); /* Initialise arrays */ setup_grid (phi, npx, npy, myposx, myposy, nptsx, nptsy, &local_nptsx, &local_nptsy, mask);
Laplace Equation - Parallelization /* Iterate to find solution */ for(k=1;k<=nsteps;k++){ for(i=1;i<=local_nptsy;i++) for(j=1;j<=local_nptsx;j++) oldphi[i][j] = phi[i][j]; MPI_Sendrecv (&oldphi[1][1], local_nptsy, MPI_DOUBLE, up, 111, &oldphi[local_nptsx+1][1], local_nptsy, MPI_DOUBLE,down, 111, new_comm, &status); MPI_Sendrecv (&oldphi[local_nptsx][1],local_nptsy,MPI_DOUBLE,down,112, &oldphi[0][1], local_nptsy, MPI_DOUBLE, up, 112, new_comm, &status);
Laplace Equation - Parallelization for(i=1;i<=local_nptsx;i++) sbuf[i-1] = oldphi[i][1]; MPI_Sendrecv (sbuf, local_nptsx, MPI_DOUBLE, left, 113, rbuf, local_nptsx, MPI_DOUBLE, right, 113, new_comm, &status); for(j=1;j<=local_nptsy;j++) oldphi[i][local_nptsy+1] = rbuf[i-1]; for(i=1;i<=local_nptsx;i++) sbuf[i-1] = oldphi[i][local_nptsy]; MPI_Sendrecv (sbuf, local_nptsx, MPI_DOUBLE, right, 114, rbuf, local_nptsx, MPI_DOUBLE, left, 114, new_comm, &status); for(i=1;i<=local_nptsx;i++) oldphi[i][0] = rbuf[i-1]; for(i=1;i<=local_nptsx;i++) for(j=1;j<=local_nptsy;j++) if (mask[i][j]) phi[i][j] = 0.25*(oldphi[i][j-1] + oldphi[i][j+1] + oldphi[i-1][j] + oldphi[i+1][j]); }
Laplace Equation - Parallelization setup_grid (phi, npx, npy, myposx, myposy, nptsx, nptsy, local_ptrx, local_ptry, mask){ local_nptsx = nptsx/npx; local_nptsy = nptsy/npy; for(j=0;j<=local_nptsy+1;j++) for(i=0;i<=local_nptsx+1;i++){ phi[i][j] = 0.0; mask[i][j] = 1; } if (myposx == 0) mask[1][1..local_nptsy] = 0; if (myposx == npx-1) mask[local_nptsx][1..local_nptsy] = 0; if (myposy == 0) mask[1..local_nptsx][1] =0; if (myposy == npy-1) mask[1..local_nptsx][local_nptsy] = 0; for(i=1;i<=local_nptsx;i++){ global_x = local_nptsx*myposx + i - 1; if (global_x == nptsx/2 || global_x == nptsx/2-1){ for(j=1;j<=local_nptsx;j++){ global_y = local_nptsy*myposy + j - 1; if (global_y == nptsy/2 || global_y == nptsy/2-1){ mask[i][j] = 0; phi[i][j] = 1.0; } } } } *local_ptrx = local_nptsx; *local_ptry = local_nptsy;
Laplace Equation – Performance Analysis • For n2 grid size, N = PxQ processor mesh, tcalc time for a flop, tshift time for communication per grid point • Tseq = • T(N) = 4n2tcalc (4n2/N)tcalc + (2n/P)tshift + (2n/Q)tshift
A more irregular problem – Molecular Dynamics SimulationCourtesy: Dr. David Walker, Cardiff University • In the previous problems, communication requirements are known in advance • In the current Molecular Dynamics simulation problem, the amount of data that are communicated between processors are not known in advance • The communication is slightly irregular
The Problem • A domain consisting of number of particles (molecules) • Each molecule, i is exerted a force, fij by another molecule, j • The sum of all the forces, Fi = ∑jfij makes the particles assume a new position and velocity • Particles that are r distance apart do not influence each other • Given initial velocities and positions of particles, their movements are followed for discrete time steps
MD - Solution • The cutoff distance, r is used to reduce the time for summation from O(n2) r r Domain decomposed into cells of size rxr Particles in one cell interact with particles in the neighbouring 8 cells and particles in the same cell
MD - Solution Data structures: An array of all particles. Each element holds <position, velocity> A 2D array of linked lists, one for each cell. Each element of a linked list contains pointers to particles. struct particle{ double position[2]; double velocity[2]; } Particles[MAX_PARTICLES]; struct list{ particle* part; struct list* next; }*List[MAX_CELLSX][MAX_CELLSY]; Particles Linked List
MD – Sequential Logic Initialize Particles and Lists; for each time step for each particle i Let cell(k, l) hold i F[i] = 0; for each particle j in this cell and neighboring 8 cells, and are r distance from i{ F[i]+= f[i, j]; } update particle[i].{position, velocity} due to F[i]; if new position in new cell (m,n) update Lists[k,l] and Lists[m,n]
MD – Parallel Version A 2D array of processors similar to Laplace Each processor holds a set of cells r • Differences: • A processor can communicate with the diagonal neighbors • Amount of data communicated varies over time steps • Receiver does not know the amount of data r
MDS – parallel solution • Steps 1. Communication – Each processor communicates parameters of the particles on the boundary cells to its 8 neighboring cells Challenges – to communicate diagonal cells 2. Update – Each processor calculates new particle velocities and positions 3. Migration – Particles may migrate to cells in other processors • Other challenges: • Appropriate packing of data. • Particles may have to go through several hops during migration • Assumptions: • 1. For simplicity, let us assume that particles are transported to only neighboring cells during migration
MDS – parallel solution – 1st step • Communication of boundary data B B b b A b A a a a b b a a b a b a b a b a b b B B a a A b A a C C c c D c D d d d c c d d c d c d c d c d c c C d d C D c d d
MDS – parallel solution – 1st step • Communication of boundary data B A B B b A a B b a B b A b a b A a A b a a b a b a b a b b a a b b a b a b a b a a b a b a b a b a a b B b b a a B A b b B B A a a A B b a A b A a C D C c D d c d c d a a A b b B b A B a C D C D d D d C c C d c D C c c D d c d c d d d c c c d c c d d d d c c c d d c c d c c d d c c d c d d d d D D C c c C c c d C d d C D d C c d c d
MDS – parallel solution – 1st step • Communication of boundary data B A B B b A a B b a B b A b a b A a A b a a b a b a b a b b a a b b a b a b a b a a b a b a b a b a a b B b b a a B A b b B B A a a A B b a A b A a C C D D C c D d c d c d a a A b b A B b A B B a C D C D d D d C c C d c D C c c D d c d c d d d c c c d c c d d d d c c c d d c c d c c d d c c d c d d d d D D C c c C c c d C d d C D d C c d c d Can be achieved by ? Shift left, shift right, shift up, shift down
MDS – parallel solution – 1st step Left shift nsend = 0; for(i=0; i<local_cellsx; i++){ for each particle p in cell (i, 1){ pack position of p in sbuf nsend += 2 } } MPI_Sendrecv(sbuf, nsend, …, left,.. rbuf, max_particles*2, …, right, &status); MPI_Getcount(status, MPI_DOUBLE, &nrecv); particles = nrecv/2; for(i=0; i<particles; i++){ read (x,y) from next 2 positions in rbuf; add (x,y) to particles[local_particle_count+i]; determine cell k, l for the particle Add it to list (k, l); }
MDS – parallel solution – 2nd step Update: • Similar to sequential program. • A processor has all the required information for calculating Fi for all its particles • Thus new position and velocity determined. • If new position belongs to the same cell in the same processor, do nothing • If new position belongs to the different cell in the same processor, update link lists for old and new cells.
MDS – parallel solution – 3rd step • If new position belongs to the different cell in a different processor – particle migration for each particle p update {position, velocity} determine new cell if new cell # old cell delete p from list of old cell if(different processor) pack p into appropriate communication buffer remove p from particle array Shift left Shift right Shift up Shift down
MDS – parallel solution – 3rd step • This shifting is a bit different from the previous shifting • A processor may just act as a transit point for a particle • Hence particles have to be packed with care Shift left: MPI_Sendrecv(leftbuf, nsend_left, …, left rbuf, max_size*4, .., right, &status); MPI_Getcount(status, MPI_DOUBLE, &nrecv); particles = nrecv/4; for(i=0; i<particles; i++){ read next 4 numbers in {x, y vx, vy} if(particle in this process) add particle to particle array determine cell add particle to list for the cell else put data in the appropriate comm. buffer for the next up or down shifts }
MDS – comments • Generic solution • A particle can move to any cell • Force can be from any distance • Load balancing
A Dynamical System- WaTor Courtesy: Dr. David Walker, Cardiff • A 2-D ocean in which sharks and fish survive • 2 important features • Need for dynamic load distribution • Potential conflicts due to updates by different processors • Features shared by other advanced parallel applications
WaTor – The problem • Ocean divided into grids • Each grid cell can be empty or have a fish or a shark • Grid initially populated with fishes and sharks in a random manner • Population evolves over discrete time steps according to certain rules
WaTor - Rules Fish: • At each time step, a fish tries to move to a neighboring empty cell. If not empty, it stays • If a fish reaches a breeding age, when it moves, it breeds, leaving behind a fish of age 0. Fish cannot breed if it doesn’t move. • Fish never starves
WaTor - Rules Shark: • At each time step, if one of the neighboring cells has a fish, the shark moves to that cell eating the fish. If not and if one of the neighboring cells is empty, the shark moves there. Otherwise, it stays. • If a shark reaches a breeding age, when it moves, it breeds, leaving behind a shark of age 0. shark cannot breed if it doesn’t move. • Sharks eat only fish. If a shark reaches a startvation age (time steps since last eaten), it dies.
Inputs and Data Structures • Sequential Code Logic • Initialize ocean array and swimmers list • In each time step, go through the swimmers in the order in which they are stored and perform updates Inputs: • Size of the grid • Distribution of sharks and fishes • Shark and fish breeding ages • Shark starvation age Data structures: A 2-D grid of cells struct ocean{ int type /* shark or fish or empty */ struct swimmer* occupier; }ocean[MAXX][MAXY] A linked list of swimmers struct swimmer{ int type; int x,y; int age; int last_ate; int iteration; swimmer* prev; swimmer* next; } *List;
Towards a Parallel Code • 2-D data distribution similar to Laplace and molecular dynamics is used. Each processor holds a grid of ocean cells. • For communication, each processor needs data from 4 neighboring processors. • 2 new challenges – potential for conflicts, load balancing
1st Challenge – Potential for Conflicts • Unlike previous problems, border cells may change during updates due to fish or shark movement • Border cells need to be communicated back to the original processor. Hence update step involves communication • In the meantime, the original processor may have updated the border cell. Hence potential conflicts Time T S F S F F F F S F S F S F F F F F F F F Time T+1
2 Techniques • Rollback updates for those particles (fish or shark) that have crossed processor boundary and are in potential conflicts. • May lead to several rollbacks until a free space is found. • 2nd technique is synchronization during updates to avoid conflicts in the first place.
2 Techniques • During update, a processor x sends its data first to processor y, allows y to perform its updates, get the updates from y, and then performs its own updates. • Synchronization can be done by sub-partitioning. • Divide a grid owned by a processor into sub-grids. 2 2 1 1 4 4 3 3 update
Load Imbalance • The workload distribution changes over time • 2-D block distribution is not optimal Techniques: • Dynamic load balancer • Static load balancing by a different data distribution
Dynamic load balancing • Performed at each time step • Orthogonal recursive bisection Problems: Complexity in finding the neighbors and data for communication
Static Data Distribution • Cyclic 0,0 0,1 0,2 0,3 0,0 0,1 0,2 0,3 1,0 1,1 1,2 1,3 1,0 1,1 1,2 1,3 2,0 2,1 2,2 2,3 2,0 2,1 2,2 2,3 3,0 3,1 3,2 3,3 3,0 3,1 3,2 3,3 Increase in boundary data; increase in communication Problems: