
COLUMNWISE BLOCK-STRIPED DECOMPOSITION


Presentation Transcript


  1. COLUMNWISE BLOCK-STRIPED DECOMPOSITION Kamolov Nizomiddin Distribution Processing Lab. 20126142

  2. Columnwise Block Striped Matrix • Partitioning through domain decomposition • Task associated with • Column of matrix • Vector element

  3. Matrix-Vector Multiplication (figure: the columns of A and the elements of b are shaded by owner, showing Processor 0's initial computation, Processor 1's initial computation, Proc 2, and Proc 3) c0 = a0,0 b0 + a0,1 b1 + a0,2 b2 + a0,3 b3 + a0,4 b4 c1 = a1,0 b0 + a1,1 b1 + a1,2 b2 + a1,3 b3 + a1,4 b4 c2 = a2,0 b0 + a2,1 b1 + a2,2 b2 + a2,3 b3 + a2,4 b4 c3 = a3,0 b0 + a3,1 b1 + a3,2 b2 + a3,3 b3 + a3,4 b4 c4 = a4,0 b0 + a4,1 b1 + a4,2 b2 + a4,3 b3 + a4,4 b4
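This column-oriented grouping is exactly what the decomposition exploits: the process that owns column j can form every product ai,j bj locally, without communication. A minimal sketch of that local phase is below; the names partial_products, my_A, my_b, and partial_c are illustrative and are not taken from the original program.

    /* Local multiplication phase (sketch, C99). This process owns my_cols
     * consecutive columns of the n x n matrix A (stored in my_A) and the
     * matching block my_b of the vector b. */
    void partial_products(int n, int my_cols,
                          double my_A[n][my_cols], const double *my_b,
                          double *partial_c)            /* length n */
    {
        for (int i = 0; i < n; i++) {
            partial_c[i] = 0.0;
            for (int j = 0; j < my_cols; j++)
                partial_c[i] += my_A[i][j] * my_b[j];   /* a[i][owned col j] * b[owned j] */
        }
        /* partial_c still has to be combined with the other processes'
         * contributions: the all-to-all exchange and reduction shown on
         * the following slides. */
    }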

  4. All-to-all Exchange (Before): figure showing the data held by processes P0 through P4 before the exchange

  5. All-to-all Exchange (After): figure showing the data held by processes P0 through P4 after the exchange

  6. Phases of Parallel Algorithm (figure): starting from column i of A and element b_i, the multiplications produce a partial result vector ~c on each process; an all-to-all exchange redistributes the partial results; a reduction sums them into the final vector c

  7. Agglomeration and Mapping • Static number of tasks • Regular communication pattern (all-to-all) • Computation time per task is constant • Strategy: • Agglomerate groups of columns • Create one task per MPI process

  8. Complexity Analysis • Sequential algorithm complexity: Θ(n²) • Parallel algorithm computational complexity: Θ(n²/p) • Communication complexity of all-to-all: Θ(log p + n) • (if pipelined; otherwise Θ(p + n), from (p-1)(c + n/p)) • Overall complexity: Θ(n²/p + log p + n)

  9. Isoefficiency Analysis • Sequential time complexity: Θ(n²) • Only parallel overhead is all-to-all • When n is large, message transmission time dominates message latency • Parallel communication time: Θ(n) • n² ≥ Cpn ⇒ n ≥ Cp • Scalability function same as rowwise algorithm: C²p
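The last bullet follows in one more step. A short worked derivation, assuming the usual isoefficiency setup T(n,1) ≥ C·T0(n,p) with memory requirement M(n) = n² (these symbols are the standard ones, not taken from the slide itself):

    \[ T(n,1) = \Theta(n^2), \qquad T_0(n,p) = \Theta(pn) \]
    \[ n^2 \ge C\,p\,n \;\Longrightarrow\; n \ge Cp \]
    \[ \text{scalability function: } \frac{M(Cp)}{p} = \frac{(Cp)^2}{p} = C^2 p \]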

  10. Reading a Block-Column Matrix Function read_col_striped_matrix makes use of the MPI library routine MPI_Scatterv to distribute the elements of each row among the processes.

  11. MPI_Scatterv

  12. Header for MPI_Scatterv int MPI_Scatterv (void *send_buffer, int *send_cnt, int *send_disp, MPI_Datatype send_type, void *receive_buffer, int receive_cnt, MPI_Datatype receive_type, int root, MPI_Comm communicator)

  13. Function parameters send_buffer: pointer to the buffer containing the elements to be scattered. send_cnt: element i is the number of contiguous elements in send_buffer going to process i. send_disp: element i is the offset in send_buffer of the first element going to process i. send_type: the type of the elements in send_buffer. recv_buffer: pointer to the buffer containing this process's portion of the elements received. recv_cnt: the number of elements this process will receive. recv_type: the type of the elements in recv_buffer. root: the ID of the process with the data to scatter. communicator: the communicator in which the scatter is occurring. MPI_Scatterv is a collective communication function: all of the processes in a communicator participate in its execution.
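As a sketch of how such a call might look when one n-element row is scattered so that each process receives the elements of its own columns: the wrapper name scatter_row is illustrative (this is not the original read_col_striped_matrix code), and send_cnt/send_disp are assumed to be filled in beforehand, for example by a helper like the one on slide 21.

    #include <mpi.h>

    /* Scatter one row of the matrix: process `root` holds the full row,
     * and every process (root included) receives its my_cnt column elements. */
    void scatter_row(double *full_row,                 /* significant on root only */
                     int *send_cnt, int *send_disp,    /* per-destination counts/offsets */
                     double *my_part, int my_cnt,
                     int root, MPI_Comm comm)
    {
        MPI_Scatterv(full_row, send_cnt, send_disp, MPI_DOUBLE,
                     my_part, my_cnt, MPI_DOUBLE, root, comm);
    }

Reading the whole matrix would then loop over its n rows, storing the received pieces consecutively, e.g. scatter_row(row_i, send_cnt, send_disp, &my_block[i * my_cnt], my_cnt, 0, MPI_COMM_WORLD) for each row i.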

  14. Printing a Block-Column Matrix • Data motion is the opposite of that used when reading the matrix • Replace "scatter" with "gather" • print_col_striped_matrix makes use of the MPI function MPI_Gatherv to collect row elements onto process 0

  15. Function MPI_Gatherv

  16. Header for MPI_Gatherv int MPI_Gatherv (void *send_buffer, int send_cnt, MPI_Datatype send_type, void *receive_buffer, int *receive_cnt, int *receive_disp, MPI_Datatype receive_type, int root, MPI_Comm communicator)

  17. Function parameters send_type: the type of the elements in send_buffer. recv_buffer: the starting address of the buffer used to collect incoming elements (as well as the elements the process is sending to itself). recv_cnt: an array; element i is the number of elements to receive from process i. recv_disp: an array; element i is the starting point in recv_buffer of the elements received from process i. recv_type: the type the elements should be converted to before they are put in recv_buffer. communicator: indicates the set of processes participating in the gather.
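Mirroring the scatter sketch earlier, a hedged example of the gathering side used for printing: process 0 assembles one full row from the pieces held by all processes. The name gather_row is illustrative and this is not the original print_col_striped_matrix code.

    #include <mpi.h>

    /* Gather one row: every process contributes its my_cnt column elements,
     * and process `root` assembles the full n-element row in full_row. */
    void gather_row(double *my_part, int my_cnt,
                    double *full_row,                  /* significant on root only */
                    int *recv_cnt, int *recv_disp,     /* per-source counts/offsets */
                    int root, MPI_Comm comm)
    {
        MPI_Gatherv(my_part, my_cnt, MPI_DOUBLE,
                    full_row, recv_cnt, recv_disp, MPI_DOUBLE, root, comm);
    }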

  18. Function MPI_Alltoallv

  19. Header for MPI_Alltoallv int MPI_Alltoallv (void *send_buffer, int *send_cnt, int *send_disp, MPI_Datatype send_type, void *receive_buffer, int *receive_cnt, int *receive_disp, MPI_Datatype receive_type, MPI_Comm communicator)

  20. Count/Displacement Arrays • MPI_Alltoallv requires two pairs of count/displacement arrays • First pair for values being sent • send_cnt: number of elements • send_disp: index of first element • Second pair for values being received • recv_cnt: number of elements • recv_disp: index of first element
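Putting the two array pairs to use, here is a hedged sketch of the exchange-plus-reduction phases of the algorithm (slide 6). It assumes the uniform block sizes described on the next slide; partial_c, my_c, and exchange_and_reduce are illustrative names rather than code from the original program.

    #include <mpi.h>
    #include <stdlib.h>

    /* After the local multiplications, partial_c (length n) holds this
     * process's contribution to every element of c. Each process sends
     * process j the slice of partial_c that j owns, receives p slices of
     * its own block, and sums them. */
    void exchange_and_reduce(double *partial_c,
                             int *send_cnt, int *send_disp,  /* block of each destination */
                             int *recv_cnt, int *recv_disp,  /* my_cnt elements per source */
                             double *my_c, int my_cnt, int p,
                             MPI_Comm comm)
    {
        double *recv_buf = malloc((size_t)p * my_cnt * sizeof(double));

        /* All-to-all exchange of the partial results. */
        MPI_Alltoallv(partial_c, send_cnt, send_disp, MPI_DOUBLE,
                      recv_buf, recv_cnt, recv_disp, MPI_DOUBLE, comm);

        /* Reduction: sum the p incoming contributions elementwise
         * (recv_disp is assumed to be src * my_cnt, the uniform layout). */
        for (int i = 0; i < my_cnt; i++) {
            my_c[i] = 0.0;
            for (int src = 0; src < p; src++)
                my_c[i] += recv_buf[src * my_cnt + i];
        }
        free(recv_buf);
    }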

  21. Function create_uniform_xfer_arrays • First array • How many elements received from each process (always same value) • Uses ID and utility macro block_size • Second array • Starting position of each process’ block • Assume blocks in process rank order
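Based only on the two bullets above, the helper might look roughly like the sketch below; BLOCK_LOW/BLOCK_SIZE are the usual block-decomposition macros, and the exact code in the original source may differ.

    #include <stdlib.h>

    #define BLOCK_LOW(id, p, n)   ((id) * (n) / (p))
    #define BLOCK_HIGH(id, p, n)  (BLOCK_LOW((id) + 1, (p), (n)) - 1)
    #define BLOCK_SIZE(id, p, n)  (BLOCK_HIGH((id), (p), (n)) - BLOCK_LOW((id), (p), (n)) + 1)

    /* Build count/displacement arrays in which every count is the same
     * (this process's block size) and the blocks appear in process-rank order. */
    void create_uniform_xfer_arrays(int id, int p, int n,
                                    int **cnt, int **disp)
    {
        *cnt  = malloc(p * sizeof(int));
        *disp = malloc(p * sizeof(int));
        (*cnt)[0]  = BLOCK_SIZE(id, p, n);    /* same value for every process */
        (*disp)[0] = 0;
        for (int i = 1; i < p; i++) {
            (*cnt)[i]  = BLOCK_SIZE(id, p, n);
            (*disp)[i] = (*disp)[i - 1] + (*cnt)[i - 1];
        }
    }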

  22. Run-time Expression • χ: inner product loop iteration time • Computational time: χ n ⌈n/p⌉ • The all-to-all exchange requires p-1 messages, each of length about n/p • 8 bytes per element • Total execution time: χ n ⌈n/p⌉ + (p-1)(λ + 8⌈n/p⌉/β), where λ is the message latency and β the bandwidth

  23. Benchmarking Results

  24. Speedup of three MPI programs multiplying a 1,000 x 1,000 matrix by a 1,000-element vector on a commodity cluster, including the program based on the columnwise block-striped implementation.

  25. Thank you very much for listening.
