150 likes | 263 Views
Teng Ma @PAL Group Meeting. Design and Implementation of Non-Blocking Communication for Cell Messaging Layer. Outline. Cell Messaging Layer(CML2.5) Non-Blocking Communication Performance Discussion. Cell Messaging Layer(CML2.5) . Supports a subset of the MPI library.
E N D
Teng Ma @PAL Group Meeting Design and Implementation of Non-Blocking Communication for Cell Messaging Layer
Outline • Cell Messaging Layer(CML2.5) • Non-Blocking Communication • Performance • Discussion
Cell Messaging Layer(CML2.5) • Supports a subset of the MPI library. • MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Bcast, MPI_Reduce,MPI_Allreduce, MPI_Wtime, MPI_Abort, and MPI_Finalize. • My target: • MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Waitall, MPI_Test
Inter-nodes communication Inter-Cell communication in the Cell Messaging Layer Non-blocking communication does not change the method of intra-nodes communication.
Intra-nodes communication • This is semantic of CML receiver-initiated method. Recvreqs[rank][tag]
Problems introduced by Non-Blocking for intra-nodes • Different from the blocking communication, we need another information 'issue order' to do matching for non-blocking communication. Sender(rank0) Receiver(rank1) MPI_Isend(buf,count,MPI_INT, 2,0,...) MPI_Irecv(buf,count,MPI_INT, 0,3,...) MPI_Isend(buf,count,MPI_INT, 1,3,...) MPI_Irecv(buf,count,MPI_INT, 3,0,...) MPI_Isend(buf,count,MPI_INT, 1,4,...) MPI_Irecv(buf,count,MPI_INT, 0,3,...) MPI_Isend(buf,count,MPI_INT, 1,3,...) MPI_Irecv(buf,count,MPI_INT, 0,4,...) Rank uses red font; Tag uses blue font We need three information to do matching(Rank, Tag, index(issue order))!
3D array implementation • Two changes: • Changes 'Recvreqs' to 3D array like Recvreqs[Rank][Tag][Index]. • Sender and receiver generate index locally to stand for the issue order for the same rank and tag. After the operation is done, return this index. • Requirement • Requires a fast algorithm which can get and return index in constant time. Vector algorithm is better.
Two index generation methods Method 1 • The index is the index of 1st 0 bit in a bit string is the index and meanwhile set this position's bit as `1'. Returning index is just to set this bit as `0'. • Pros: O(1) algorithm. • Cons: Need modulo operation which is expensive in SPE. It's scalar algorithm which can't make use of SPU intrinsics instructions. Method 2 • Preallocated a vector with 16 char variables and initialized as (0, 1,2,...,15). Getting index is just to get an element from the vector, and set the element as '-1' and rotate left. Returning index is searching from the rightmost element of the vector, and find '-1' and set back as the index. • Pros: vector algorithm which can make use of spu instrict instructions. • Cons: worst case is O(N) algorithm which needs search '-1' to return index and non '-1' to get index. And it only supports 0-15 index.
Pros and cons of 3D implementation • Pros: • Sender can use local info(rank, tag and index)to do matching fast. • Less changes from CML2.5. • Cons: • Waste of memory. (70KB memory pre-allocated for Recvreqs to support 4 tags and 16 outstanding operations. 64*17*4*16) • Index generation and returning is expensive in SPU. • Only support limited tags.
2D array implementation • Recvreqs[rank+1][Tag][Index] ==>Recvreqs[rank+1][OUTSTANDING_OP] An example of using searching to do matching on 2D array Recvreqs[rank+1][OUTSTANDING_OP] for out of order finishing requests.
Pros and Cons of 2D array implementation • Pros: • Save memory use in SPU. (17KB memory preallocated for Recvreqs. It can support any tag and maximum 16 outstanding operations) • Get rid of expensive operation--index generation and index returning. • Cons: • Sender needs to search in a row for the matching. The worst case is O(#OUTSTANDING).
Effect of increasing outstanding op number (2D array implementation) 128KB message 0 Bytes message # of outstanding op is configured by users according to the application.
Conclusion • CML has Non-blocking communication now!! • The bandwidth of CML_2D for 192KB messages is 23.908GB/s(93.4% of theory peak performance 25.6GB/s). • The overhead of latency brought by non-blocking can be accepted. • Users can configure outstanding # according to applications.