New Progress in Open MPI p2p communication: Elan and SiCortex
Teng Ma, George Bosilca @ 2008 ICL retreat
P2P communication in Open MPI (layers from top to bottom)
• MPI application
• MPI level
• PML (p2p management layer): OB1 or DR
• BML (BTL management layer)
• BTLs: MX BTL, Elan BTL, UDAPL BTL, SM BTL, OFUD BTL, SCTP BTL, GM BTL, Openib BTL, TCP BTL
• Coming soon: SiCortex BTL, Xensocket BTL
Recap of the first Elan BTL version
• Used Elan Tport to implement the BTL send interface and Elan RDMA to implement the BTL put and get interfaces.
• Provided bandwidth comparable to the vendor's Quadrics MPI, but still had a latency problem.
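The mapping described in this recap can be pictured as a small table of function pointers. The following is a minimal sketch only: every name in it (`toy_btl_t`, `toy_elan_send`, `elan_path_t`, and so on) is hypothetical, the signatures are simplified, and the stubs merely record which Elan mechanism a real implementation would call. It is not the Open MPI BTL interface or the libelan API.

```c
#include <stddef.h>

/* Which Elan mechanism a call would be routed to (illustrative only). */
typedef enum { VIA_NONE, VIA_TPORT, VIA_RDMA } elan_path_t;

/* Records the path taken by the most recent call. */
static elan_path_t last_path = VIA_NONE;

/* Stubs: a real BTL would call into the Elan library at these points. */
static int toy_elan_send(const void *buf, size_t len)
{
    (void)buf; (void)len;
    last_path = VIA_TPORT;   /* two-sided send goes through Tport */
    return 0;
}

static int toy_elan_put(const void *src, size_t len)
{
    (void)src; (void)len;
    last_path = VIA_RDMA;    /* one-sided write goes through RDMA */
    return 0;
}

static int toy_elan_get(void *dst, size_t len)
{
    (void)dst; (void)len;
    last_path = VIA_RDMA;    /* one-sided read goes through RDMA */
    return 0;
}

/* Hypothetical, heavily simplified BTL-like module with three entry points. */
typedef struct toy_btl {
    const char *name;
    int (*send)(const void *buf, size_t len);
    int (*put)(const void *src, size_t len);
    int (*get)(void *dst, size_t len);
} toy_btl_t;

static const toy_btl_t toy_elan_btl = {
    "elan", toy_elan_send, toy_elan_put, toy_elan_get
};
```

An upper layer would only see the three function pointers, which is why the underlying mechanism can later be changed without touching the callers.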
Memory copy issue
• Open MPI Elan BTL: Elan system buffer -> BTL buffer -> user buffer (two copies)
• Quadrics MPI: Elan system buffer -> user buffer (one copy)
Elan queue send/recv
• Receiving does not need pre-registered buffers: the message is stored in an Elan system buffer (in the Elan queue).
• The Elan queue performs better than Elan Tport for message sizes <= 2 KB; 2 KB is the size of one Elan queue slot.
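The size threshold above suggests a simple selection rule for the send path. Below is a minimal sketch of such a rule, assuming the cutoff is exactly the 2 KB slot size and that the comparison is on the payload length alone. The names `choose_send_path`, `send_path_t`, and `ELAN_QUEUE_SLOT_SIZE` are hypothetical and do not come from Open MPI or libelan.

```c
#include <stddef.h>

/* One Elan queue slot, per the slide (hypothetical constant name). */
enum { ELAN_QUEUE_SLOT_SIZE = 2048 };

typedef enum { SEND_VIA_QUEUE, SEND_VIA_TPORT } send_path_t;

/* Pick the queue for messages that fit in one slot, Tport otherwise. */
static send_path_t choose_send_path(size_t msg_len)
{
    return (msg_len <= ELAN_QUEUE_SLOT_SIZE) ? SEND_VIA_QUEUE
                                             : SEND_VIA_TPORT;
}
```

A real implementation would probably also have to account for any header that travels with the payload in the same slot; that detail is not covered by the slides.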
Elan BTL status now
• Fixed the backward rank initialization bug and the finalization bug (no bugs now).
• Supports multi-rail on a single node.
• Uses the Elan queue, Tport, and RDMA to implement the Open MPI send and put protocols.
• Small-message latency has improved a lot.
• Provides multi-thread support.
Programming environment
• MPI library (libscmpi.a)
• Slurm
• DMA library (libscdma.a)
An example of a "get" using the SiCortex DMA engine
    recvbuf = (char *) (((uintptr_t) &bigbuf[65536]) & (~65535ULL)); // 64 KB alignment
    ret = scdma_map_bds(ctx, 3, recvbuf, rs->bd_count); // map into DMA buffer
    void *cmd = (void *) scdma_cq_head_spinwait(ctx); // find a cmd header
    uint64_t segmentComplete = 0;
    scdma_build_s_bf_bf_cmdend_put(cmd,
        peers[client->serverRank].route_handles[0],
        peers[client->serverRank].ports[0],
        client->returnRank,
        rs->bd_base + i, 0, // source
        3 + i, 0, // destination
        sysconf(_SC_PAGESIZE), // size of transfer
        0,
        (uintptr_t) &segmentComplete);
    __asm__ volatile("sync"); /* force those out to memory */
    scdma_cq_post(ctx); // issue the command to the DMA engine
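The first line of the example uses a common alignment trick: take the address 64 KB into an over-allocated buffer and clear its low 16 bits, which yields a 64 KB aligned address that still lies inside the buffer. The sketch below isolates just that step in portable C. The helper name `align_64k_within` is hypothetical, and it assumes the caller allocated at least 64 KB more than the aligned region needs.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_64K ((uintptr_t)65536)

/*
 * Return a 64 KB aligned pointer inside bigbuf.
 * The result lies between 1 and 65536 bytes past the start of bigbuf,
 * so bigbuf must be at least 65536 bytes larger than the space needed.
 */
static char *align_64k_within(char *bigbuf)
{
    uintptr_t addr = (uintptr_t)&bigbuf[ALIGN_64K]; /* step 64 KB in */
    return (char *)(addr & ~(ALIGN_64K - 1));       /* round down    */
}
```

For example, with `char *big = malloc(2 * 65536);` the pointer returned by `align_64k_within(big)` leaves at least 64 KB of usable, aligned space before the end of the allocation.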
Future work
• Improve the latency of the Elan BTL when sending via Tport.
• Finish the development of the SiCortex BTL.