Parallel Processing (CS 676)
Lecture: Grouping Data and Communicators in MPI
Jeremy R. Johnson
*Parts of this lecture were derived from chapters 6 and 7 of Pacheco.
Introduction
• Objective: To introduce MPI commands for creating derived types and communicators, and to discuss performance models, performance considerations in MPI, and ways of reducing communication.
• Topics
  • MPI datatypes and packing
  • Revised version of Get_data
  • Matrix transposition
  • Creating communicators
  • Topologies
    • Grids
  • Matrix multiplication
    • Fox's algorithm
  • Performance model
Derived Types
• Because of communication latency, it is usually a good idea to package several elements into a single message.
• MPI_Send and MPI_Recv allow a message to be specified by a start address, a basic type, and a count.
  • This allows multiple data elements to be sent in one message,
  • but requires the elements to be of the same type
  • and to be contiguous in memory.
• A generalized type is a sequence {(t0,d0), …, (tn-1,dn-1)}, where
  • ti is an existing type, and
  • di is a displacement.
Functions for Creating MPI Types
• int MPI_Type_struct(int count, int block_lengths[], MPI_Aint displacements[], MPI_Datatype typelist[], MPI_Datatype* new_mpi_t)
• int MPI_Address(void* location, MPI_Aint* address)
• int MPI_Type_commit(MPI_Datatype* new_mpi_t)
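The topics list mentions a revised Get_data; as an illustration of how these three calls fit together, here is a minimal sketch (an assumption, not Pacheco's exact listing) that builds a struct type describing two floats and an int so that all three can be broadcast in one message. The helper name Build_derived_type and the variable names are introduced here.

/* Sketch (assumed names): build a derived type describing (a, b, n) so that
 * process 0 can broadcast all three values in a single message. */
void Build_derived_type(float* a_ptr, float* b_ptr, int* n_ptr,
                        MPI_Datatype* mesg_mpi_t_ptr) {
    int          block_lengths[3] = {1, 1, 1};
    MPI_Datatype typelist[3]      = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
    MPI_Aint     displacements[3];
    MPI_Aint     start_address, address;

    /* Displacements are measured relative to the address of the first member. */
    MPI_Address(a_ptr, &start_address);
    displacements[0] = 0;
    MPI_Address(b_ptr, &address);
    displacements[1] = address - start_address;
    MPI_Address(n_ptr, &address);
    displacements[2] = address - start_address;

    MPI_Type_struct(3, block_lengths, displacements, typelist, mesg_mpi_t_ptr);
    MPI_Type_commit(mesg_mpi_t_ptr);   /* must commit before use in communication */
}

/* Usage: every process calls Build_derived_type(&a, &b, &n, &mesg_mpi_t), then
 * MPI_Bcast(&a, 1, mesg_mpi_t, 0, MPI_COMM_WORLD); the displacements are relative
 * to &a, so &a is the buffer argument. */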
Other Derived Datatype Constructors
• int MPI_Type_vector(int count, int block_length, int stride, MPI_Datatype element_type, MPI_Datatype* new_mpi_t)
• int MPI_Type_contiguous(int count, MPI_Datatype old_type, MPI_Datatype* new_mpi_t)
• int MPI_Type_indexed(int count, int block_lengths[], int displacements[], MPI_Datatype old_type, MPI_Datatype* new_mpi_t)
Transpose

float A[10][10];   /* stored in row-major order */

/* Send the 3rd row of A from process 0 to process 1. */
if (my_rank == 0) {
    MPI_Send(&(A[2][0]), 10, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
} else {   /* my_rank == 1 */
    MPI_Recv(&(A[2][0]), 10, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
}

/* This doesn't work for columns, since they are not contiguous. */
Transpose

float A[10][10];   /* stored in row-major order */
MPI_Datatype column_mpi_t;

/* Send the 3rd column of A from process 0 to process 1. */
MPI_Type_vector(10, 1, 10, MPI_FLOAT, &column_mpi_t);
MPI_Type_commit(&column_mpi_t);

if (my_rank == 0) {
    MPI_Send(&(A[0][2]), 1, column_mpi_t, 1, 0, MPI_COMM_WORLD);
} else {   /* my_rank == 1 */
    MPI_Recv(&(A[0][2]), 1, column_mpi_t, 0, 0, MPI_COMM_WORLD, &status);
}
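Since column_mpi_t has the same type signature as ten contiguous floats (see the type-matching slides below), a column can be received into a row. The following sketch is an illustration rather than the lecture's code: it transposes A into a second array B on a single process, using the fact that MPI_Sendrecv to a process's own rank is legal and does not deadlock.

/* Sketch (assumed names): local transpose of A into B by sending each column
 * of A with column_mpi_t and receiving it as a contiguous row of B. */
float A[10][10], B[10][10];
MPI_Datatype column_mpi_t;
MPI_Status status;
int j;

MPI_Type_vector(10, 1, 10, MPI_FLOAT, &column_mpi_t);
MPI_Type_commit(&column_mpi_t);

for (j = 0; j < 10; j++) {
    MPI_Sendrecv(&(A[0][j]),  1, column_mpi_t, my_rank, j,   /* send column j */
                 &(B[j][0]), 10, MPI_FLOAT,    my_rank, j,   /* recv as row j */
                 MPI_COMM_WORLD, &status);
}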
Upper Triangular Matrix

float A[n][n];   /* Complete matrix */
float T[n][n];   /* Upper triangle  */
int displacements[n];
int block_lengths[n];
MPI_Datatype index_mpi_t;

for (i = 0; i < n; i++) {
    block_lengths[i] = n - i;
    displacements[i] = (n+1)*i;
}
MPI_Type_indexed(n, block_lengths, displacements, MPI_FLOAT, &index_mpi_t);
MPI_Type_commit(&index_mpi_t);

if (my_rank == 0) {
    MPI_Send(A, 1, index_mpi_t, 1, 0, MPI_COMM_WORLD);
} else {
    MPI_Recv(T, 1, index_mpi_t, 0, 0, MPI_COMM_WORLD, &status);
}
Type Matching
• When can a receiving process match the data sent by a sending process?
  • MPI_Send(message, send_count, send_mpi_t, 1, 0, MPI_COMM_WORLD)
  • MPI_Recv(message, recv_count, recv_mpi_t, 0, 0, MPI_COMM_WORLD, &status)
• Given a derived type {(t0,d0), …, (tn-1,dn-1)}:
  • Displacements do not matter.
  • The type signatures {t0,…,tn-1} and {u0,…,um-1} must be compatible:
    n ≤ m and ti = ui for i = 0,…,n-1.
• For collective communications, the sending and receiving types must be identical.
Type Matching Example
• For the type column_mpi_t (a column of a 10 × 10 array of floats):
  {(MPI_FLOAT,0), (MPI_FLOAT,10*sizeof(float)), (MPI_FLOAT,20*sizeof(float)), …, (MPI_FLOAT,90*sizeof(float))}
• Its signature is {MPI_FLOAT, …, MPI_FLOAT} (MPI_FLOAT repeated 10 times).
• Example: a column can be sent to a row.

float A[10][10];   /* stored in row-major order */

if (my_rank == 0)
    MPI_Send(&(A[0][0]), 1, column_mpi_t, 1, 0, MPI_COMM_WORLD);
else if (my_rank == 1)
    MPI_Recv(&(A[0][0]), 10, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
Pack and Unpack
• MPI_Pack and MPI_Unpack allow a user to copy non-contiguous memory locations into a contiguous buffer, and to copy a contiguous buffer back into non-contiguous memory locations.
• int MPI_Pack(void* pack_data, int in_count, MPI_Datatype datatype, void* buffer, int buffer_size, int* position, MPI_Comm comm)
  • The data is copied into the buffer starting at location buffer + *position.
  • On output, *position points to the first location in the buffer after the packed data.
• int MPI_Unpack(void* buffer, int size, int* position, void* unpack_data, int count, MPI_Datatype datatype, MPI_Comm comm)
  • Data starting at location buffer + *position is copied into the memory referenced by unpack_data.
  • count data elements of type datatype are copied into unpack_data.
  • *position is updated to point to the location in buffer after the data just copied.
• Messages constructed with MPI_Pack should be communicated with the datatype argument MPI_PACKED.
Deciding Which Method to Use
• Creating a derived type has overhead associated with it.
  • Whether it pays off depends on the number of times the type will be used.
• System buffering can be avoided with Pack/Unpack.
• Variable-length messages can be sent with Pack/Unpack.
Variable Length Messages

float* entries;
int*   column_subscripts;
int    nonzeros;
int    position;
int    row_number;
char   buffer[HUGE];

if (my_rank == 0) {
    /* Pack the sparse row (count, row number, values, column indices). */
    position = 0;
    MPI_Pack(&nonzeros, 1, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
    MPI_Pack(&row_number, 1, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
    MPI_Pack(entries, nonzeros, MPI_FLOAT, buffer, HUGE, &position, MPI_COMM_WORLD);
    MPI_Pack(column_subscripts, nonzeros, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
    MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
}
Variable Length Messages (cont.)

else {   /* my_rank == 1 */
    MPI_Recv(buffer, HUGE, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
    position = 0;
    MPI_Unpack(buffer, HUGE, &position, &nonzeros, 1, MPI_INT, MPI_COMM_WORLD);
    MPI_Unpack(buffer, HUGE, &position, &row_number, 1, MPI_INT, MPI_COMM_WORLD);
    entries = (float *) malloc(nonzeros*sizeof(float));
    column_subscripts = (int *) malloc(nonzeros*sizeof(int));
    MPI_Unpack(buffer, HUGE, &position, entries, nonzeros, MPI_FLOAT, MPI_COMM_WORLD);
    MPI_Unpack(buffer, HUGE, &position, column_subscripts, nonzeros, MPI_INT, MPI_COMM_WORLD);
}
Communicators
• A mechanism for treating a subset of processes as a universe for communication (both point-to-point and collective).
• Types
  • intra-communicator
  • inter-communicator
• Components
  • group (an ordered collection of processes)
  • context (a unique identifier)
  • optional additional information, such as a topology
• New communicators are created from existing communicators.
Working with Groups, Contexts, and Communicators
• int MPI_Comm_group(MPI_Comm comm, MPI_Group* group)
• int MPI_Group_incl(MPI_Group old_group, int new_group_size, int ranks_in_old_group[], MPI_Group* new_group)
• int MPI_Comm_create(MPI_Comm old_comm, MPI_Group group, MPI_Comm* new_comm)
Creating a Communicator

/* Create a communicator out of the first row of q^2 processes
 * organized in a q × q grid in row-major order. */
MPI_Group group_world;
MPI_Group first_row_group;
MPI_Comm  first_row_comm;
int*      process_ranks;

process_ranks = (int*) malloc(q*sizeof(int));
for (proc = 0; proc < q; proc++)
    process_ranks[proc] = proc;

MPI_Comm_group(MPI_COMM_WORLD, &group_world);
MPI_Group_incl(group_world, q, process_ranks, &first_row_group);
MPI_Comm_create(MPI_COMM_WORLD, first_row_group, &first_row_comm);
Using a Communicator

/* Broadcast the first block to all processes in the same row. */
int    my_rank_in_first_row;
float* A_00;

if (my_rank < q) {
    MPI_Comm_rank(first_row_comm, &my_rank_in_first_row);
    A_00 = (float *) malloc(n_bar*n_bar*sizeof(float));
    if (my_rank_in_first_row == 0) { /* initialize A_00 */ }
    MPI_Bcast(A_00, n_bar*n_bar, MPI_FLOAT, 0, first_row_comm);
}
MPI_Comm_split
• int MPI_Comm_split(MPI_Comm old_comm, int split_key, int rank_key, MPI_Comm* new_comm)

MPI_Comm my_row_comm;
int my_row;

/* my_rank is the rank in MPI_COMM_WORLD; q*q = p */
my_row = my_rank/q;
MPI_Comm_split(MPI_COMM_WORLD, my_row, my_rank, &my_row_comm);

/* This creates q new communicators. Processes with the same value of split_key
 * form a new group. The rank in the new communicator is determined by rank_key,
 * and order is preserved. If the same rank_key is used by several processes,
 * the choice is arbitrary. */
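The same call can also carve the grid into column communicators; below is a minimal sketch under the same q × q row-major layout (my_col and my_col_comm are names introduced here, not the lecture's).

/* Sketch (assumed names): create one communicator per column of the q × q grid.
 * Processes in the same column share my_rank % q, so it serves as split_key. */
MPI_Comm my_col_comm;
int my_col;

my_col = my_rank % q;
MPI_Comm_split(MPI_COMM_WORLD, my_col, my_rank, &my_col_comm);
/* Within my_col_comm, ranks run 0..q-1 down the column, ordered by my_rank. */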
Topologies
• Communicators can have attributes. One such attribute is a topology.
• A topology is a mechanism for associating a different addressing scheme with the processes belonging to a group.
• It provides a virtual interconnection organization of the processes that may be convenient for a particular algorithm.
• Types
  • Cartesian (grid)
  • Graph
Working with Cartesian Topologies
• int MPI_Cart_create(MPI_Comm old_comm, int number_of_dims, int dim_sizes[], int wrap_around[], int reorder, MPI_Comm* cart_comm)
• int MPI_Cart_rank(MPI_Comm comm, int coordinates[], int* rank)
• int MPI_Cart_coords(MPI_Comm comm, int rank, int number_of_dims, int coordinates[])
• int MPI_Cart_sub(MPI_Comm cart_comm, int free_coords[], MPI_Comm* new_comm)
Creating a Cartesian Topology

/* Create a communicator with a 2D grid topology. */
MPI_Comm grid_comm;
int dim_sizes[2];
int wrap_around[2];
int reorder = 1;

dim_sizes[0] = dim_sizes[1] = q;
wrap_around[0] = wrap_around[1] = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, wrap_around, reorder, &grid_comm);
Creating a Sub-Cartesian Topology

int free_coords[2];
MPI_Comm row_comm;
MPI_Comm col_comm;

/* Create a communicator for each row of grid_comm. */
free_coords[0] = 0;
free_coords[1] = 1;
MPI_Cart_sub(grid_comm, free_coords, &row_comm);

/* Create a communicator for each column of grid_comm. */
free_coords[0] = 1;
free_coords[1] = 0;
MPI_Cart_sub(grid_comm, free_coords, &col_comm);
Cartesian Addressing

int coordinates[2];
int my_grid_rank;

MPI_Comm_rank(grid_comm, &my_grid_rank);
MPI_Cart_coords(grid_comm, my_grid_rank, 2, coordinates);

/* Inverse operation. */
MPI_Cart_rank(grid_comm, coordinates, &my_grid_rank);
Matrix Multiplication
• Let A, B be n × n matrices, and C = A*B.

void Serial_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            C[i][j] = 0.0;
            for (k = 0; k < n; k++)
                C[i][j] = C[i][j] + A[i][k]*B[k][j];
        }
}
Parallel Matrix Multiplication

/* Distribute the matrices by rows. */
void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
    for each column of B {
        Allgather(column);
        Compute dot product of my row of A with column;
    }
}

/* The matrices can also be distributed by blocks of rows,
 * and B could instead be distributed by columns. */
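A minimal MPI sketch of this Allgather approach, assuming p divides n, a block of n_bar = n/p consecutive rows of A, B, and C per process (stored row-major as flat arrays), and ranks assigned to row blocks in order; the function and variable names are assumptions, not the lecture's code.

/* Sketch (assumed names; needs <mpi.h> and <stdlib.h>): for each column j of B,
 * gather the full column, then update the local rows of C with the dot products
 * of the local rows of A and that column. */
void Allgather_matrix_mult(float* A_local, float* B_local, float* C_local,
                           int n, int p, MPI_Comm comm) {
    int    n_bar = n / p;
    float* col_piece = (float*) malloc(n_bar * sizeof(float)); /* my part of column j */
    float* col       = (float*) malloc(n * sizeof(float));     /* full column j       */
    int    i, j, k;

    for (j = 0; j < n; j++) {
        /* Copy my strided column entries into a contiguous buffer. */
        for (i = 0; i < n_bar; i++)
            col_piece[i] = B_local[i*n + j];
        /* Every process ends up with the complete column j of B (in rank order). */
        MPI_Allgather(col_piece, n_bar, MPI_FLOAT, col, n_bar, MPI_FLOAT, comm);

        /* C_local[i][j] = (row i of A_local) . (column j of B) */
        for (i = 0; i < n_bar; i++) {
            C_local[i*n + j] = 0.0;
            for (k = 0; k < n; k++)
                C_local[i*n + j] += A_local[i*n + k] * col[k];
        }
    }
    free(col_piece);
    free(col);
}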
Cyclic Matrix Multiplication

/* Arrange the processes in a ring, storing a row of A and a row of B in each
 * process. Ci,* = Ai,0 * B0,* + … + Ai,n-1 * Bn-1,* */
void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
    i = rank;
    Blocal = ith row of B;
    Alocal = ith row of A;
    Clocal = 0;               /* ith row of C */
    dest = (i-1+n) % n;
    src  = (i+1) % n;
    for (k = 0; k < n; k++) {
        Ci,* = Ci,* + Ai,(i+k) mod n * B(i+k) mod n,*;
        send_recv(Blocal, dest, src);
    }
}
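A minimal MPI sketch of the cyclic algorithm for p = n processes with one row per process, using MPI_Sendrecv_replace to rotate the rows of B around the ring; the function and variable names are assumptions.

/* Sketch (assumed names): each process owns row `rank` of A, B, and C.
 * At stage k, B_row holds row (rank+k) mod n of B. */
void Cyclic_matrix_mult(float* A_row,   /* my row of A, length n */
                        float* B_row,   /* my row of B, length n */
                        float* C_row,   /* my row of C, length n */
                        int n, MPI_Comm comm) {
    int rank, k, j, col;
    MPI_Status status;
    MPI_Comm_rank(comm, &rank);

    int dest = (rank - 1 + n) % n;   /* pass my copy of B "up" the ring   */
    int src  = (rank + 1) % n;       /* receive the next row from "below" */

    for (j = 0; j < n; j++) C_row[j] = 0.0;

    for (k = 0; k < n; k++) {
        col = (rank + k) % n;        /* B_row currently holds row `col` of B */
        for (j = 0; j < n; j++)
            C_row[j] += A_row[col] * B_row[j];
        /* Rotate: overwrite B_row with the row received from src. */
        MPI_Sendrecv_replace(B_row, n, MPI_FLOAT, dest, 0, src, 0, comm, &status);
    }
}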
Fox's Matrix Multiplication
• Let A, B be p × p matrices, and C = A*B.
• Organize the processors into a q × q grid, q = sqrt(p).
• Store the (i,j) block on processor (i,j).
• Broadcast elements of A as k = 0,…,q-1.
• Cyclically rotate the elements of B.
Example
Fox's Matrix Multiplication

/* Uses a block matrix allocation. Group the processors in a q × q grid,
 * where q = sqrt(p). Processor (i,j) stores A[i,j] and, initially, B[i,j]. */
void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
    i = my process row;
    j = my process column;
    dest = ((i-1+q) % q, j);
    src  = ((i+1) % q, j);
    for (stage = 0; stage < q; stage++) {
        k_bar = (i + stage) mod q;
        Broadcast A[i,k_bar] across process row i;
        C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j];
        Send B[k_bar,j] to dest;
        Receive B[(k_bar+1) mod q, j] from src;
    }
}
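A minimal MPI sketch of the stage loop, assuming the row_comm and col_comm communicators built earlier with MPI_Cart_sub (so a process's rank in row_comm is its column coordinate and its rank in col_comm is its row coordinate), n_bar = n/q, contiguous local blocks, and a local multiply-accumulate routine Local_matrix_mult supplied elsewhere; these names are assumptions, not the lecture's code.

/* Sketch (assumed names): one process per (my_row, my_col) grid position.
 * A_local/B_local hold this process's blocks; C_local accumulates the result. */
void Fox(int q, int n_bar, float* A_local, float* B_local, float* C_local,
         int my_row, int my_col, MPI_Comm row_comm, MPI_Comm col_comm) {
    float* A_tmp = (float*) malloc(n_bar*n_bar*sizeof(float));
    int stage, k_bar, bcast_root;
    int dest = (my_row - 1 + q) % q;   /* rank in col_comm is the row index */
    int src  = (my_row + 1) % q;
    MPI_Status status;

    for (stage = 0; stage < q; stage++) {
        k_bar = (my_row + stage) % q;
        bcast_root = k_bar;            /* column k_bar owns the A block to share */
        if (bcast_root == my_col) {
            MPI_Bcast(A_local, n_bar*n_bar, MPI_FLOAT, bcast_root, row_comm);
            Local_matrix_mult(A_local, B_local, C_local, n_bar);  /* C += A*B */
        } else {
            MPI_Bcast(A_tmp, n_bar*n_bar, MPI_FLOAT, bcast_root, row_comm);
            Local_matrix_mult(A_tmp, B_local, C_local, n_bar);
        }
        /* Rotate the B blocks up the process column. */
        MPI_Sendrecv_replace(B_local, n_bar*n_bar, MPI_FLOAT,
                             dest, 0, src, 0, col_comm, &status);
    }
    free(A_tmp);
}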
Variant of Fox's Matrix Multiplication
• Let A, B be q × q matrices, and C = A*B.
• Organize the processors into a sqrt(p) × sqrt(p) grid.
• Store the (i, (j+i) mod q) block of A and the ((i+j) mod q, j) block of B on processor (i,j).
• Cyclically rotate the rows of A to the left.
• Cyclically rotate the columns of B upward.
Example
Variant of Fox's Matrix Multiplication

/* Uses a block matrix allocation. Group the processors in a q × q grid,
 * where q = sqrt(p). Processor (i,j) stores A[i,(i+j) mod q] and, initially,
 * B[(i+j) mod q, j]. */
void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
    i = my process row;
    j = my process column;
    coldest = ((i-1+q) % q, j);
    colsrc  = ((i+1) % q, j);
    rowdest = (i, (j-1+q) % q);
    rowsrc  = (i, (j+1) % q);
    for (stage = 0; stage < q; stage++) {
        k_bar = (i + j + stage) mod q;
        C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j];
        Send_Recv A[i,k_bar] to/from rowdest, rowsrc;
        Send_Recv B[k_bar,j] to/from coldest, colsrc;
    }
}
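A minimal MPI sketch of this shift-based variant under the same assumptions as the previous sketch (row_comm, col_comm, n_bar, and a Local_matrix_mult helper, all names introduced here); it presumes the skewed initial placement described above has already been set up.

/* Sketch (assumed names): each stage multiplies the resident blocks and then
 * shifts A one position left along the row and B one position up the column. */
void Fox_variant(int q, int n_bar, float* A_local, float* B_local, float* C_local,
                 int my_row, int my_col, MPI_Comm row_comm, MPI_Comm col_comm) {
    int stage;
    MPI_Status status;
    int rowdest = (my_col - 1 + q) % q, rowsrc = (my_col + 1) % q;  /* ranks in row_comm */
    int coldest = (my_row - 1 + q) % q, colsrc = (my_row + 1) % q;  /* ranks in col_comm */

    for (stage = 0; stage < q; stage++) {
        Local_matrix_mult(A_local, B_local, C_local, n_bar);   /* C += A*B */
        /* Shift A one position left along the process row. */
        MPI_Sendrecv_replace(A_local, n_bar*n_bar, MPI_FLOAT,
                             rowdest, 0, rowsrc, 0, row_comm, &status);
        /* Shift B one position up the process column. */
        MPI_Sendrecv_replace(B_local, n_bar*n_bar, MPI_FLOAT,
                             coldest, 0, colsrc, 0, col_comm, &status);
    }
}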
Performance Model
• Communication cost: C(n) = α + βn
  • α = latency
  • 1/β = bandwidth
• Empirically determine α and β by measuring the time to send/recv messages of different lengths.
  • Least squares fit.
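As an illustration of the measurement, here is a ping-pong timing sketch (an assumption, not the lecture's code) that prints (n, time) pairs between ranks 0 and 1, suitable for a least squares fit of C(n) = α + βn.

/* Sketch (assumed): half the round-trip time of an n-byte ping-pong
 * approximates C(n) = alpha + beta*n. */
#include <stdio.h>
#include <mpi.h>

#define NTRIALS 100

static char buf[1<<20];                 /* messages up to 1 MB */

int main(int argc, char* argv[]) {
    int rank, n, t;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (n = 1; n <= (1<<20); n *= 2) {
        double start = MPI_Wtime();
        for (t = 0; t < NTRIALS; t++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)   /* average one-way time for an n-byte message */
            printf("%d %e\n", n, (MPI_Wtime() - start) / (2.0*NTRIALS));
    }
    MPI_Finalize();
    return 0;
}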
Analysis of Matrix Multiplication
• Let A, B be n × n matrices, and C = A*B.
• Sequential cost:
  • T(n) = an³ + bn² + cn + d = Θ(n³)
  • Least squares fit: T(n) ≈ an³
Analysis of Parallel Matrix Multiplication (Allgather)
• Let A, B be n × n matrices, and C = A*B.
• Let p = number of processors.
• Store the ith block of n/p rows of A, B, and C on process i.
• Parallel computing time: Θ(n³/p + p·log(p) + n²·log(p))
  • Computation time: p·(n/p · n · n/p) = n³/p
  • Communication time: p·log(p)·(α + β·n²/p) [Allgather]
Analysis of Parallel Matrix Multiplication (Cyclic)
• Let A, B be n × n matrices, and C = A*B.
• Let p = number of processors.
• Store the ith block of n/p rows of A, B, and C on process i.
• Parallel computing time: Θ(n³/p + p + n²)
  • Computation time: p·(n/p · n/p · n) = n³/p
  • Communication time: p·(α + β·n²/p)
Analysis of Parallel Matrix Multiplication (Fox)
• Let A, B be n × n matrices, and C = A*B.
• Let p = q² be the number of processors, organized in a q × q grid.
• Store the (i,j)th n/q × n/q block of A, B, and C on process (i,j).
• Parallel computing time: Θ(n³/p + q·log(q) + log(q)·n²/q)
  • Computation time: q·(n/q · n/q · n/q) = n³/q² = n³/p
  • Communication time: q·log(q)·(α + β·(n/q)²)
Analysis of Parallel Matrix Multiplication (Fox Variant)
• Let A, B be n × n matrices, and C = A*B.
• Let p = q² be the number of processors, organized in a q × q grid.
• Store the (i,j)th n/q × n/q block of A, B, and C on process (i,j).
• Parallel computing time: Θ(n³/p + q + n²/q)
  • Computation time: q·(n/q · n/q · n/q) = n³/q² = n³/p
  • Communication time: q·(α + β·(n/q)²)