Application Mapping Over OFIWG SFI
Sean Hefty

MPI Over SFI Example
• MPI implementation over SFI
• Demonstrates possible usage model
• Initialization
• Send injection
• Send completions
• Polling
• RMA
• Counters
• Completions

Query Interfaces: Tagged

    /* Tagged provider */

    /* Reliable unconnected endpoint */
    hints.type = FID_RDM;

    /* Address vector optimized for minimal memory footprint
       and no internal lookups */
    #ifdef MPIDI_USE_AV_MAP
    hints.addr_format = FI_ADDR;
    #else
    hints.addr_format = FI_ADDR_INDEX;
    #endif

    /* Transport agnostic */
    hints.protocol = FI_PROTO_UNSPEC;

    /* Behavior required by endpoint */
    hints.ep_cap = FI_TAGGED | FI_BUFFERED_RECV |
                   FI_REMOTE_COMPLETE | FI_CANCEL;

    /* Default flags to apply to data transfer operations */
    hints.op_flags = FI_REMOTE_COMPLETE;

Query Interfaces: RMA/Atomics

Separate endpoint for RMA operations

    /* RMA provider */
    hints.type = FID_RDM;
    #ifdef MPIDI_USE_AV_MAP
    hints.addr_format = FI_ADDR;
    #else
    hints.addr_format = FI_ADDR_INDEX;
    #endif
    hints.protocol = FI_PROTO_UNSPEC;

    /* FI_RMA | FI_ATOMICS: support for RMA and atomic operations
       FI_REMOTE_READ | FI_REMOTE_WRITE: remote RMA read and write support */
    hints.ep_cap = FI_RMA | FI_ATOMICS | FI_REMOTE_COMPLETE |
                   FI_REMOTE_READ | FI_REMOTE_WRITE;
    hints.op_flags = FI_REMOTE_COMPLETE;

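The ep_cap hints describe behavior the endpoint must provide. One plausible way to read that requirement is as a superset test on capability bits, sketched below. The CAP_* names, their bit values, and caps_satisfied() are hypothetical and exist only for this illustration; they are not SFI definitions, and the slides do not say how a provider actually matches hints.

```c
#include <stdint.h>

/* Hypothetical capability bits, chosen only for this sketch. */
enum {
    CAP_TAGGED          = 1u << 0,
    CAP_RMA             = 1u << 1,
    CAP_ATOMICS         = 1u << 2,
    CAP_REMOTE_COMPLETE = 1u << 3,
    CAP_REMOTE_READ     = 1u << 4,
    CAP_REMOTE_WRITE    = 1u << 5
};

/* Returns 1 when every required capability bit is also set in
 * provider_caps, 0 otherwise. Extra provider bits are ignored. */
static int caps_satisfied(uint64_t provider_caps, uint64_t required_caps)
{
    return (provider_caps & required_caps) == required_caps;
}
```

Under this reading, a provider that offers more than the RMA endpoint asks for still qualifies, while one that lacks any requested bit does not.
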
Query Interfaces: Message Queue

    /* Event queue optimized to report tagged completions */
    eq_attr.mask   = FI_EQ_ATTR_MASK_V1;
    eq_attr.domain = FI_EQ_DOMAIN_COMP;
    eq_attr.format = FI_EQ_FORMAT_TAGGED;
    fi_eq_open(domainfd, &eq_attr, &p2p_eqfd, NULL);

    /* Event queue optimized to report RMA completions */
    eq_attr.mask   = FI_EQ_ATTR_MASK_V1;
    eq_attr.domain = FI_EQ_DOMAIN_COMP;
    eq_attr.format = FI_EQ_FORMAT_DATA;
    fi_eq_open(domainfd, &eq_attr, &rma_eqfd, NULL);

    /* Associate endpoints with event queues */
    fi_bind(tagged_epfd, p2p_eqfd, FI_SEND | FI_RECV);
    fi_bind(rma_epfd, rma_eqfd, FI_READ | FI_WRITE);

Query Limits

Query endpoint limits

    /* Maximum 'inject' data size: the buffer is reusable
       immediately after the function call returns */
    optlen = sizeof(max_buffered_send);
    fi_getopt(tagged_epfd, FI_OPT_ENDPOINT, FI_OPT_MAX_INJECTED_SEND,
              &max_buffered_send, &optlen);

    /* Maximum application-level message size */
    optlen = sizeof(max_send);
    fi_getopt(tagged_epfd, FI_OPT_ENDPOINT, FI_OPT_MAX_MSG_SIZE,
              &max_send, &optlen);

Short Send

    int MPIDI_Send(buf, count, datatype, rank, tag, comm, context_offset, **request)
    {
        data_sz = get_size(count, datatype);
        if (data_sz <= max_buffered_send) {
            match_bits = init_sendtag(comm->context_id + context_offset,
                                      comm->rank, tag, 0);
            /* Small sends map directly to the tagged-injectto call.
               COMM_TO_PHYS: fabric address provided directly to provider */
            fi_tinjectto(tagged_epfd, buf, data_sz,
                         COMM_TO_PHYS(comm, rank), match_bits);
        } else {
            ...
        }
    }

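The short-send path relies on two small pieces of logic: packing the communicator context, source rank, and MPI tag into one match-bits value, and comparing the payload size against the inject limit. The sketch below shows one way both could look. The field widths, the meaning of the fourth argument (called kind here), and the names pack_match_bits() and choose_send_path() are assumptions for illustration; the slides do not show how init_sendtag lays out its bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit layout: | kind:4 | context:12 | rank:16 | tag:32 | */
#define TAG_BITS     32
#define RANK_BITS    16
#define CONTEXT_BITS 12
#define KIND_BITS    4

/* Pack the matching fields into one 64-bit value. */
static uint64_t pack_match_bits(uint16_t context_id, uint16_t rank,
                                uint32_t tag, uint8_t kind)
{
    /* Reject values that would overflow their (assumed) field. */
    assert(context_id < (1u << CONTEXT_BITS));
    assert(kind < (1u << KIND_BITS));

    return ((uint64_t)kind       << (TAG_BITS + RANK_BITS + CONTEXT_BITS))
         | ((uint64_t)context_id << (TAG_BITS + RANK_BITS))
         | ((uint64_t)rank       << TAG_BITS)
         |  (uint64_t)tag;
}

typedef enum { SEND_PATH_INJECT, SEND_PATH_REQUEST } send_path_t;

/* Payloads up to the inject limit need no request object, because
 * the buffer is reusable as soon as the inject call returns. */
static send_path_t choose_send_path(size_t data_sz, size_t max_buffered_send)
{
    return data_sz <= max_buffered_send ? SEND_PATH_INJECT
                                        : SEND_PATH_REQUEST;
}
```

With this layout, pack_match_bits(1, 2, 3, 0) yields 0x0001000200000003, and a payload exactly at the limit still takes the inject path.
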
Large Message Send

    int MPIDI_Send(buf, count, datatype, rank, tag, comm, context_offset, **request)
    {
        /* code for type calculations, tag creation, etc */

        /* Large sends require request allocation */
        REQUEST_CREATE(sreq);

        /* SFI completion context embedded in request object */
        fi_tsendto(MPIDI_Global.tagged_epfd, send_buf, data_sz, NULL,
                   COMM_TO_PHYS(comm, rank), match_bits,
                   &(REQ_OF2(sreq)->of2_context));
        *request = sreq;
    }

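Embedding the completion context inside the request means the request can be recovered from the context pointer with plain pointer arithmetic, without a lookup table. A minimal sketch of that idea follows, assuming the context is an ordinary struct member. The struct layouts and mark_complete() are hypothetical; only the names of2_context and context_to_request come from the slides, and the slides do not show how context_to_request is implemented.

```c
#include <stddef.h>

/* Hypothetical stand-in for the provider's per-operation context. */
struct op_context {
    void *internal[4];
};

/* Hypothetical request object with the context embedded in it. */
struct request {
    int kind;
    int completed;
    void (*callback)(struct request *req);
    struct op_context of2_context;
};

/* Step back from the embedded member to the start of the enclosing
 * request (the usual "container of" idiom). */
static struct request *context_to_request(void *op_context)
{
    return (struct request *)((char *)op_context
                              - offsetof(struct request, of2_context));
}

/* Example completion callback: just records that the send finished. */
static void mark_complete(struct request *req)
{
    req->completed = 1;
}
```

The pointer handed to the send call is the address of the embedded member, and the same pointer comes back in the completion event, so the progress loop can reach the request and invoke its callback.
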
Progress/Polling for Completions

Fields of the tagged entry align with those of data_entry, so a single tagged entry can receive events from both queues.

    int MPIDI_Progress()
    {
        eq_tagged_entry_t wc;
        fid_eq_t fd[2] = {p2p_eqfd, rma_eqfd};

        for (i = 0; i < 2; i++) {
            MPID_Request *req;
            rc = fi_eq_read(fd[i], (void *)&wc, sizeof(wc));
            handle_errs(rc);
            req = context_to_request(wc.op_context);
            req->callback(req);
        }
    }

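The progress loop reads both queues into one tagged entry, which only works if the shorter data entry is a prefix of the tagged entry. The sketch below states that property with compile-time checks. The two struct definitions and mock_eq_read() are hypothetical stand-ins, not the SFI types; the actual field lists are not given in the slides.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry formats: the tagged entry repeats the data
 * entry's fields in the same order and appends the tag. */
struct data_entry {
    void    *op_context;
    uint64_t flags;
    size_t   len;
    void    *buf;
    uint64_t data;
};

struct tagged_entry {
    void    *op_context;
    uint64_t flags;
    size_t   len;
    void    *buf;
    uint64_t data;
    uint64_t tag;
};

/* Compile-time checks that the shared fields line up. */
_Static_assert(offsetof(struct data_entry, op_context) ==
               offsetof(struct tagged_entry, op_context),
               "op_context must sit at the same offset");
_Static_assert(offsetof(struct data_entry, len) ==
               offsetof(struct tagged_entry, len),
               "len must sit at the same offset");
_Static_assert(sizeof(struct data_entry) <= sizeof(struct tagged_entry),
               "tagged entry must be large enough for a data entry");

/* Mock queue read: copies one entry of entry_size bytes into buf.
 * Returns the number of bytes copied, or -1 if buf is too small. */
static int mock_eq_read(const void *src, size_t entry_size,
                        void *buf, size_t buf_len)
{
    if (buf_len < entry_size)
        return -1;
    memcpy(buf, src, entry_size);
    return (int)entry_size;
}
```

Because op_context sits at the same offset in both formats, the loop can read wc.op_context without knowing which queue produced the event.
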
RMA Completions (Counters and Completions)

    int MPIDI_Win_fence(MPID_Win *win)
    {
        /* Synchronize software counters via completions */
        PROGRESS_WHILE(win->started != win->completed);

        /* Synchronize hardware counters */
        fi_sync(WIN_OF2(win)->rma_epfd, FI_WRITE | FI_READ | FI_BLOCK, NULL);

        /* Notify any request-based objects that use counter completion */
        RequestQ->notify();
    }
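
The software-counter half of the fence amounts to driving progress until every started operation has been counted as completed. A minimal sketch of that loop is shown here, using a mock progress function that retires one operation per call. The struct window, mock_progress(), and fence_wait() are hypothetical; the slides do not show the expansion of PROGRESS_WHILE, and the hardware-counter step (fi_sync) is not modeled.

```c
/* Hypothetical window state: operations issued vs. operations whose
 * completion events have been processed. */
struct window {
    unsigned started;
    unsigned completed;
};

/* Mock progress engine: retires at most one outstanding operation
 * per call. A real engine would poll the event queues instead. */
static void mock_progress(struct window *win)
{
    if (win->completed < win->started)
        win->completed++;
}

/* Drive progress until the counters match. Returns the number of
 * progress calls that were needed. */
static unsigned fence_wait(struct window *win)
{
    unsigned polls = 0;

    while (win->started != win->completed) {
        mock_progress(win);
        polls++;
    }
    return polls;
}
```

A window with three outstanding operations needs three progress calls in this mock, and a second fence with nothing outstanding returns immediately.
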