Design and Implementation of Non-Blocking Communication for Cell Messaging Layer

Teng Ma @ PAL Group Meeting

Outline
• Cell Messaging Layer (CML2.5)
• Non-Blocking Communication
• Performance
• Discussion

Cell Messaging Layer (CML2.5)
• Supports a subset of the MPI library:
  MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Wtime, MPI_Abort, and MPI_Finalize.
• My target:
  MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Waitall, MPI_Test

Inter-node communication
Figure: Inter-Cell communication in the Cell Messaging Layer.
• Non-blocking communication does not change the method of intra-node communication.

Intra-node communication
• These are the semantics of CML's receiver-initiated method.
Request table: Recvreqs[rank][tag]
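The receiver-initiated idea can be sketched in plain C as follows. This is only an illustration under assumptions: the `recv_request` record, the table sizes, and the helper names `post_recv` and `try_send` are hypothetical, and an ordinary `memcpy` stands in for the DMA transfer between SPE local stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_RANKS 4   /* assumed sizes, for illustration only */
#define NUM_TAGS  8

/* Hypothetical request record: where the receiver wants the data to land. */
typedef struct {
    int    posted;    /* 1 once the receiver has announced its buffer */
    void  *dest;
    size_t bytes;
} recv_request;

/* The sender's table, indexed as Recvreqs[rank][tag]. */
static recv_request Recvreqs[NUM_RANKS][NUM_TAGS];

/* Receiver side: start the exchange by posting a request to the sender. */
void post_recv(int my_rank, int tag, void *dest, size_t bytes)
{
    assert(my_rank >= 0 && my_rank < NUM_RANKS);
    assert(tag >= 0 && tag < NUM_TAGS);
    Recvreqs[my_rank][tag].dest   = dest;
    Recvreqs[my_rank][tag].bytes  = bytes;
    Recvreqs[my_rank][tag].posted = 1;
}

/* Sender side: deliver only if the receiver has already posted.
 * Returns 1 on delivery, 0 if the sender has to retry later. */
int try_send(int dest_rank, int tag, const void *src, size_t bytes)
{
    assert(dest_rank >= 0 && dest_rank < NUM_RANKS);
    assert(tag >= 0 && tag < NUM_TAGS);
    recv_request *req = &Recvreqs[dest_rank][tag];
    if (!req->posted)
        return 0;
    assert(bytes <= req->bytes);    /* message must fit the posted buffer */
    memcpy(req->dest, src, bytes);  /* stand-in for the DMA put */
    req->posted = 0;                /* slot is free for the next message */
    return 1;
}
```

With one slot per (rank, tag) pair, this lookup is a single array access, which is enough as long as at most one receive per pair is in flight.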

Problems introduced by Non-Blocking for intra-node communication
• Unlike blocking communication, non-blocking communication needs one more piece of information, the 'issue order', to match sends with receives.

Sender (rank 0)                          Receiver (rank 1)
MPI_Isend(buf,count,MPI_INT, 2,0,...)    MPI_Irecv(buf,count,MPI_INT, 0,3,...)
MPI_Isend(buf,count,MPI_INT, 1,3,...)    MPI_Irecv(buf,count,MPI_INT, 3,0,...)
MPI_Isend(buf,count,MPI_INT, 1,4,...)    MPI_Irecv(buf,count,MPI_INT, 0,3,...)
MPI_Isend(buf,count,MPI_INT, 1,3,...)    MPI_Irecv(buf,count,MPI_INT, 0,4,...)

In each call, the fourth argument is the peer rank and the fifth is the tag (shown in red and blue on the original slide).
We need three pieces of information to do matching: rank, tag, and index (issue order)!
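To make the role of the issue order concrete, the sketch below labels each operation with a sequence number kept per (rank, tag) pair. It is an illustration under assumptions: `match_key`, `issue_counter`, and `next_key` are hypothetical names, the table sizes are arbitrary, and a plain counter stands in for whatever index generator an implementation actually uses.

```c
#include <assert.h>

#define NUM_RANKS 4   /* assumed sizes, for illustration only */
#define NUM_TAGS  8

/* The three pieces of information used for matching. */
typedef struct {
    int rank;    /* peer rank */
    int tag;     /* message tag */
    int index;   /* issue order among operations with the same rank and tag */
} match_key;

/* Kept locally by each side; counts operations issued per (rank, tag). */
typedef struct {
    int count[NUM_RANKS][NUM_TAGS];
} issue_counter;

/* Build the key for the next operation issued to or from peer_rank. */
match_key next_key(issue_counter *c, int peer_rank, int tag)
{
    assert(peer_rank >= 0 && peer_rank < NUM_RANKS);
    assert(tag >= 0 && tag < NUM_TAGS);
    match_key k;
    k.rank  = peer_rank;
    k.tag   = tag;
    k.index = c->count[peer_rank][tag]++;
    return k;
}
```

Applied to the receiver column above, the four receives get the keys (0,3,0), (3,0,0), (0,3,1), and (0,4,0), so the two receives from rank 0 with tag 3 are no longer ambiguous.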

3D array implementation
• Two changes:
  • Change 'Recvreqs' to a 3D array: Recvreqs[Rank][Tag][Index].
  • Sender and receiver each generate the index locally; it represents the issue order among operations with the same rank and tag. When the operation completes, the index is returned for reuse.
• Requirement:
  • A fast algorithm that can get and return an index in constant time. A vector (SIMD) algorithm is preferable.

Two index generation methods
Method 1
• The index is the position of the first 0 bit in a bit string; getting an index also sets that bit to '1'. Returning an index just resets that bit to '0'.
• Pros: O(1) algorithm.
• Cons: Needs a modulo operation, which is expensive on the SPE. It is a scalar algorithm, so it cannot make use of SPU intrinsics.
Method 2
• Preallocate a vector of 16 char elements initialized to (0, 1, 2, ..., 15). Getting an index takes an element from the vector, sets that element to '-1', and rotates the vector left. Returning an index searches from the rightmost element of the vector for a '-1' and writes the index back there.
• Pros: Vector algorithm, so it can make use of SPU intrinsics.
• Cons: The worst case is O(N), since returning an index searches for a '-1' and getting an index searches for a non-'-1' element. It also supports only indices 0-15.
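Both methods can be approximated in portable scalar C, as sketched below. This is one reading of the slide, not the CML source: all type and function names are hypothetical, the loop in `bit_get` stands in for a hardware find-first-zero step, and the array rotation in `vec_get` emulates what an SPU vector rotate would do in one instruction.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INDEX 16

/* Method 1: bit i of 'used' is 1 while index i is handed out. */
typedef struct { uint16_t used; } bit_alloc;

int bit_get(bit_alloc *a)
{
    for (int i = 0; i < MAX_INDEX; i++) {
        if (!(a->used & (1u << i))) {
            a->used |= (uint16_t)(1u << i);   /* mark as taken */
            return i;
        }
    }
    return -1;                                /* all 16 indices in use */
}

void bit_put(bit_alloc *a, int idx)
{
    assert(idx >= 0 && idx < MAX_INDEX);
    a->used &= (uint16_t)~(1u << idx);        /* mark as free again */
}

/* Method 2: 16 chars holding free indices; -1 marks a taken entry. */
typedef struct { signed char slot[MAX_INDEX]; } vec_alloc;

void vec_init(vec_alloc *v)
{
    for (int i = 0; i < MAX_INDEX; i++)
        v->slot[i] = (signed char)i;
}

static void rotate_left(vec_alloc *v)
{
    signed char first = v->slot[0];
    for (int i = 0; i + 1 < MAX_INDEX; i++)
        v->slot[i] = v->slot[i + 1];
    v->slot[MAX_INDEX - 1] = first;
}

int vec_get(vec_alloc *v)
{
    /* Rotate past taken entries; worst case touches all N elements. */
    for (int tries = 0; tries < MAX_INDEX; tries++) {
        if (v->slot[0] != -1) {
            int idx = v->slot[0];
            v->slot[0] = -1;
            rotate_left(v);
            return idx;
        }
        rotate_left(v);
    }
    return -1;                                /* nothing free */
}

void vec_put(vec_alloc *v, int idx)
{
    assert(idx >= 0 && idx < MAX_INDEX);
    /* Search from the rightmost element for a -1 and refill it. */
    for (int i = MAX_INDEX - 1; i >= 0; i--) {
        if (v->slot[i] == -1) {
            v->slot[i] = (signed char)idx;
            return;
        }
    }
    assert(0);                                /* put without a matching get */
}
```

Method 1 always hands out the lowest free index, while Method 2 hands indices out in rotation order, so the two sketches can return different indices after the same sequence of gets and puts.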

Pros and cons of 3D implementation
• Pros:
  • The sender can use local information (rank, tag, and index) to do matching fast.
  • Few changes from CML2.5.
• Cons:
  • Wastes memory: about 70KB is preallocated for Recvreqs to support 4 tags and 16 outstanding operations (64*17*4*16 bytes).
  • Index generation and return are expensive on the SPU.
  • Supports only a limited number of tags.
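The memory figure can be checked by multiplying out the four factors given on the slide. Reading them as 64 bytes per request entry, 17 rows, 4 tags, and 16 outstanding operations is an assumption; the slide states only the product.

```latex
\[
64 \times 17 \times 4 \times 16 = 69\,632 \text{ bytes} = 68\ \mathrm{KiB} \approx 70\ \mathrm{KB}
\]
```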

2D array implementation
• Recvreqs[rank+1][Tag][Index] ==> Recvreqs[rank+1][OUTSTANDING_OP]
Figure: an example of using search to do matching on the 2D array Recvreqs[rank+1][OUTSTANDING_OP] when requests finish out of order.

Pros and Cons of 2D array implementation
• Pros:
  • Saves memory on the SPU: 17KB is preallocated for Recvreqs, which supports any tag and at most 16 outstanding operations.
  • Gets rid of the expensive operations: index generation and index return.
• Cons:
  • The sender has to search a row to find the match. The worst case is O(#OUTSTANDING).
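A minimal sketch of the row search is shown below. It rests on assumptions not stated on the slides: the `recv_slot` layout, the function names, and in particular the `seq` field, which is one possible way to keep issue order among requests with the same tag once slots are reused out of order.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ROWS       17   /* rank + 1 rows, as on the slide */
#define OUTSTANDING_OP 16   /* user-configurable limit */

/* Hypothetical slot layout for one outstanding receive. */
typedef struct {
    int      in_use;
    int      tag;
    unsigned seq;    /* issue order within the row (assumed field) */
    void    *dest;
} recv_slot;

static recv_slot Recvreqs2D[NUM_ROWS][OUTSTANDING_OP];
static unsigned  next_seq[NUM_ROWS];   /* wraparound ignored in this sketch */

/* Receiver: store the request in the first free slot of the row.
 * Returns the slot number, or -1 if the row is full. */
int post_recv_2d(int row, int tag, void *dest)
{
    assert(row >= 0 && row < NUM_ROWS);
    for (int i = 0; i < OUTSTANDING_OP; i++) {
        recv_slot *r = &Recvreqs2D[row][i];
        if (!r->in_use) {
            r->in_use = 1;
            r->tag    = tag;
            r->seq    = next_seq[row]++;
            r->dest   = dest;
            return i;
        }
    }
    return -1;
}

/* Sender: scan the whole row for the oldest request with this tag.
 * Cost is O(OUTSTANDING_OP). Returns the slot number, or -1 if none. */
int match_send_2d(int row, int tag)
{
    assert(row >= 0 && row < NUM_ROWS);
    int best = -1;
    for (int i = 0; i < OUTSTANDING_OP; i++) {
        const recv_slot *r = &Recvreqs2D[row][i];
        if (r->in_use && r->tag == tag &&
            (best < 0 || r->seq < Recvreqs2D[row][best].seq))
            best = i;
    }
    return best;
}

/* Free a slot once its transfer has finished. */
void complete_2d(int row, int slot)
{
    assert(row >= 0 && row < NUM_ROWS);
    assert(slot >= 0 && slot < OUTSTANDING_OP);
    Recvreqs2D[row][slot].in_use = 0;
}
```

Because the tag is stored in the slot instead of being an array dimension, any tag value works, and the table shrinks by the factor of 4 tags compared with the 3D layout (64*17*16 bytes = 17KB if each entry is 64 bytes).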

Performance—latency

Performance—Bandwidth(2D array)‏

Effect of increasing the number of outstanding operations (2D array implementation)
Figure panels: 128KB message; 0-byte message.
• The number of outstanding operations is configured by users according to the application.

Conclusion
• CML has non-blocking communication now!
• The bandwidth of CML_2D for 192KB messages is 23.908GB/s (93.4% of the theoretical peak of 25.6GB/s).
• The latency overhead introduced by non-blocking communication is acceptable.
• Users can configure the number of outstanding operations according to their applications.
