COLUMNWISE BLOCK-STRIPED DECOMPOSITION
Kamolov Nizomiddin
Distribution Processing Lab. 20126142
Columnwise Block Striped Matrix • Partitioning through domain decomposition • Each primitive task is associated with • one column of the matrix • one vector element
Matrix-Vector Multiplication
[Figure: the columns of A are distributed in blocks; Processor 0's and Processor 1's initial computations cover their own column blocks, and likewise Proc 2 and Proc 3.]
c0 = a0,0 b0 + a0,1 b1 + a0,2 b2 + a0,3 b3 + a0,4 b4
c1 = a1,0 b0 + a1,1 b1 + a1,2 b2 + a1,3 b3 + a1,4 b4
c2 = a2,0 b0 + a2,1 b1 + a2,2 b2 + a2,3 b3 + a2,4 b4
c3 = a3,0 b0 + a3,1 b1 + a3,2 b2 + a3,3 b3 + a3,4 b4
c4 = a4,0 b0 + a4,1 b1 + a4,2 b2 + a4,3 b3 + a4,4 b4
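As a concrete illustration of the per-process work implied above, here is a minimal C sketch (not the course code; the array names, the column bounds low/high, and the storage layout of the local block are assumptions): each process multiplies only the columns it owns by its block of b, producing a full-length partial result that still has to be combined with the other processes' partials.

/* Hedged sketch: local phase of columnwise block-striped
 * matrix-vector multiplication.  This process owns columns
 * [low, high) of A (stored as n rows of high-low entries) and the
 * matching block of b; it accumulates a full-length partial result
 * c_part that must later be combined with the other partials. */
void local_multiply(double **a_cols,  /* n x (high-low) block of A */
                    double *b_block,  /* elements b[low..high) */
                    double *c_part,   /* length n, partial sums */
                    int n, int low, int high)
{
    for (int i = 0; i < n; i++)
        c_part[i] = 0.0;
    for (int j = low; j < high; j++)      /* owned columns only */
        for (int i = 0; i < n; i++)       /* every row of A */
            c_part[i] += a_cols[i][j - low] * b_block[j - low];
}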
All-to-all Exchange (Before) [figure: processes P0 through P4]
All-to-all Exchange (After) [figure: processes P0 through P4]
Phases of Parallel Algorithm [figure] • Multiplications: task i, holding b and column i of A, computes a partial result ~c • All-to-all exchange of the partial results • Reduction: each task sums the values it receives to obtain its element of c
Agglomeration and Mapping • Static number of tasks • Regular communication pattern (all-to-all) • Computation time per task is constant • Strategy: • Agglomerate groups of columns • Create one task per MPI process
Complexity Analysis • Sequential algorithm complexity: Θ(n²) • Parallel algorithm computational complexity: Θ(n²/p) • Communication complexity of all-to-all: Θ(log p + n) if pipelined; otherwise Θ(p + n), i.e. (p−1)(c + n/p) for a per-message constant c • Overall complexity: Θ(n²/p + log p + n)
Isoefficiency Analysis • Sequential time complexity: Θ(n²) • Only parallel overhead is the all-to-all exchange • When n is large, message transmission time dominates message latency • Parallel communication time: Θ(n) • Isoefficiency relation: n² ≥ Cpn, so n ≥ Cp • Scalability function same as the rowwise algorithm: C²p
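Spelled out as a math block (a reconstruction under the usual isoefficiency notation, where C is the desired-efficiency constant and M(n) = n² is the memory needed for an n × n matrix; these symbols are assumptions, not part of the slide):

n^2 \;\ge\; C\,p\,n \;\Longrightarrow\; n \;\ge\; C\,p
\qquad\text{and}\qquad
\frac{M(C p)}{p} \;=\; \frac{(C p)^2}{p} \;=\; C^2 p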
Reading a Block-Column Matrix from a File • Function read_col_striped_matrix uses the MPI library routine MPI_Scatterv to distribute the elements of each row among the processes.
Header for MPI_Scatterv
int MPI_Scatterv (
    void *send_buffer,
    int *send_cnt,
    int *send_disp,
    MPI_Datatype send_type,
    void *receive_buffer,
    int receive_cnt,
    MPI_Datatype receive_type,
    int root,
    MPI_Comm communicator)
Function parameters
send_buffer: pointer to the buffer containing the elements to be scattered.
send_cnt: element i is the number of contiguous elements in send_buffer going to process i.
send_disp: element i is the offset in send_buffer of the first element going to process i.
send_type: the type of the elements in send_buffer.
receive_buffer: pointer to the buffer that receives this process's portion of the elements.
receive_cnt: the number of elements this process will receive.
receive_type: the type of the elements in receive_buffer.
root: the ID of the process with the data to scatter.
communicator: the communicator in which the scatter is occurring.
MPI_Scatterv is a collective communication function; all of the processes in a communicator participate in its execution.
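A minimal usage sketch, assuming read_col_striped_matrix drives MPI_Scatterv roughly like this for each matrix row (the function and variable names below are illustrative, not the course's actual code):

/* Hedged sketch: scatter one matrix row among processes by columns.
 * send_cnt[i] and send_disp[i] describe process i's share of the row;
 * full_row only needs to be valid on the root process. */
#include <mpi.h>

void scatter_row(double *full_row,           /* valid only on root */
                 int *send_cnt, int *send_disp,
                 double *my_part, int my_cols,
                 int root, MPI_Comm comm)
{
    MPI_Scatterv(full_row, send_cnt, send_disp, MPI_DOUBLE,
                 my_part, my_cols, MPI_DOUBLE, root, comm);
}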
Printing a Block-Column Matrix • Data motion is the opposite of that used when reading the matrix • Replace “scatter” with “gather” • print_col_striped_matrix makes use of MPI function MPI_Gatherv to collect row elements onto process 0
Header for MPI_Gatherv
int MPI_Gatherv (
    void *send_buffer,
    int send_cnt,
    MPI_Datatype send_type,
    void *receive_buffer,
    int *receive_cnt,
    int *receive_disp,
    MPI_Datatype receive_type,
    int root,
    MPI_Comm communicator)
Function parameters
send_buffer: pointer to the buffer containing the elements this process sends.
send_cnt: the number of elements this process sends.
send_type: the type of the elements in send_buffer.
receive_buffer: the starting address of the buffer used to collect the incoming elements (as well as the elements the process is sending to itself).
receive_cnt: an array; element i is the number of elements to receive from process i.
receive_disp: an array; element i is the starting point in receive_buffer of the elements received from process i.
receive_type: the type the elements should be converted to before they are put in receive_buffer.
root: the ID of the process collecting the elements.
communicator: indicates the set of processes participating in the gather.
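A matching sketch for the printing path, again with illustrative names rather than the course's code: each process contributes its piece of a row and the root process assembles the full row before printing it.

/* Hedged sketch: gather the pieces of one row onto the root process
 * for printing.  receive_cnt/receive_disp are only used on the root. */
#include <mpi.h>

void gather_row(double *my_part, int my_cols,
                double *full_row,            /* valid only on root */
                int *recv_cnt, int *recv_disp,
                int root, MPI_Comm comm)
{
    MPI_Gatherv(my_part, my_cols, MPI_DOUBLE,
                full_row, recv_cnt, recv_disp, MPI_DOUBLE,
                root, comm);
}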
Header for MPI_Alltoallv
int MPI_Alltoallv (
    void *send_buffer,
    int *send_cnt,
    int *send_disp,
    MPI_Datatype send_type,
    void *receive_buffer,
    int *receive_cnt,
    int *receive_disp,
    MPI_Datatype receive_type,
    MPI_Comm communicator)
Count/Displacement Arrays • MPI_Alltoallv requires two pairs of count/displacement arrays • First pair for values being sent • send_cnt: number of elements • send_disp: index of first element • Second pair for values being received • recv_cnt: number of elements • recv_disp: index of first element
Function create_uniform_xfer_arrays • First array • How many elements received from each process (always same value) • Uses ID and utility macro block_size • Second array • Starting position of each process’ block • Assume blocks in process rank order
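A hedged sketch of how such count/displacement arrays could be filled, assuming the usual block-decomposition macros (the macro definitions and function names below are assumptions, not copied from the course utilities):

#include <mpi.h>

#define BLOCK_LOW(id, p, n)   ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n)  (BLOCK_LOW((id) + 1, (p), (n)) - 1)
#define BLOCK_SIZE(id, p, n)  (BLOCK_HIGH(id, p, n) - BLOCK_LOW(id, p, n) + 1)

/* Receive-side arrays as described above: the count is the same for
 * every source process, and blocks land in process rank order. */
void uniform_xfer_arrays(int id, int p, int n, int *cnt, int *disp)
{
    for (int i = 0; i < p; i++) {
        cnt[i]  = BLOCK_SIZE(id, p, n);
        disp[i] = (i > 0) ? disp[i - 1] + cnt[i - 1] : 0;
    }
}

/* Send-side arrays: process i receives the slice of this process's
 * full-length partial result that corresponds to the elements i owns. */
void mixed_xfer_arrays(int p, int n, int *cnt, int *disp)
{
    for (int i = 0; i < p; i++) {
        cnt[i]  = BLOCK_SIZE(i, p, n);
        disp[i] = BLOCK_LOW(i, p, n);
    }
}

Both pairs of arrays are then passed to the exchange, e.g. MPI_Alltoallv(c_part, send_cnt, send_disp, MPI_DOUBLE, recv_buf, recv_cnt, recv_disp, MPI_DOUBLE, comm), after which each process sums its p received blocks element by element to obtain its own block of c (the variable names here are likewise assumptions).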
Run-time Expression • χ: time to compute one iteration of the inner product loop • Computational time: χ n ⌈n/p⌉ • The all-to-all exchange requires p − 1 messages per process, each of length about n/p • 8 bytes per element • Total execution time: χ n ⌈n/p⌉ + (p − 1)(λ + 8⌈n/p⌉/β), where λ is the message latency and β the bandwidth in bytes per second
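For reference, a tiny C helper that evaluates this model (a sketch only: chi, lambda, and beta stand for whatever machine parameters one measures; nothing here is a measured result):

#include <math.h>

/* Hedged sketch: predicted run time of the columnwise block-striped
 * algorithm under the model above. */
double predicted_time(int n, int p, double chi, double lambda, double beta)
{
    double block = ceil((double)n / p);                      /* columns per process */
    double comp  = chi * n * block;                          /* local multiplications */
    double comm  = (p - 1) * (lambda + 8.0 * block / beta);  /* all-to-all exchange */
    return comp + comm;
}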
Speedup of three MPI programs multiplying a 1,000 × 1,000 matrix by a 1,000-element vector on a commodity cluster; the figure highlights the speedup of the program based on the columnwise block-striped implementation.