Profile Guided MPI Protocol Selection for Point-to-Point Communication Calls

Aniruddha Marathe, David K. Lowenthal
Department of Computer Science
The University of Arizona
Tucson, AZ
{amarathe,dkl}@cs.arizona.edu

Zheng Gu, Matthew Small, Xin Yuan
Department of Computer Science
Florida State University
Tallahassee, FL
{zgu,small,xyuan}@cs.fsu.edu

May 9, 2011
Motivation
• Need for an on-line protocol selection scheme:
  • The optimal protocol for a communication routine is application and architecture specific
• Existing approaches
  • Off-line: protocol selection at program compilation time
  • Static: one protocol per application
  • Difficult to adapt to the program's runtime characteristics
Contributions
• On-line protocol selection algorithm
• Protocol cost model
  • Employed by the on-line protocol selection algorithm to estimate the total execution time per protocol
• Sender-initiated Post-copy protocol
  • A novel protocol to complement the existing set of protocols
On-line Protocol Selection Algorithm
• Selects the optimal communication protocol for a communication phase dynamically
• Protocol selection algorithm split into two phases:
  • Phase 1: execution time estimation per protocol
  • Phase 2 (optimization): buffer usage profiling
• System works with four protocols
On-line Protocol Selection Algorithm
• Execution of phase 1 of a sample application: n tasks, m MPI calls per task
• [Timeline diagram: Rank 1 through Rank n]
On-line Protocol Selection Algorithm: Phase 1 (Estimating Execution Times)
• [Timeline diagram: Rank 1 through Rank n, from the start of the phase to the end of the phase]
• For each MPI call in the phase (MPI Call 1 through MPI Call m), every rank records an estimated time tprotocol for each candidate protocol
• At the end of the phase the per-protocol totals t are compared and the optimal protocol = min(t) (a small code sketch of this step follows below)
• Execution time of the algorithm is linear in the number of MPI calls per phase
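The selection step at the end of phase 1 can be pictured as keeping one running estimate per protocol and taking the minimum. The following is a minimal C sketch of that idea, not the paper's implementation; NUM_PROTOCOLS, estimate_call_time(), and the function names are assumptions made for illustration, and the stub cost function only keeps the sketch compilable.

    #include <stddef.h>
    #include <float.h>

    #define NUM_PROTOCOLS 4   /* Pre-copy, Post-copy, two Rendezvous variants */

    static double phase_time[NUM_PROTOCOLS];   /* running estimate per protocol */

    /* Placeholder for the cost model of the later slides: returns the modeled
     * time of one call under protocol p.  A real system would plug in the
     * measured t-terms here; this stub just keeps the sketch compilable. */
    static double estimate_call_time(int p, size_t msg_size) {
        return (double)msg_size * (p + 1) * 1e-9;   /* dummy value */
    }

    void phase_start(void) {
        for (int p = 0; p < NUM_PROTOCOLS; p++)
            phase_time[p] = 0.0;
    }

    /* Called once per intercepted MPI call in the phase. */
    void record_call(size_t msg_size) {
        for (int p = 0; p < NUM_PROTOCOLS; p++)
            phase_time[p] += estimate_call_time(p, msg_size);
    }

    /* End of phase: the optimal protocol is the one with the minimum
     * estimated total time, i.e. min(t) over the candidate protocols. */
    int phase_end_select(void) {
        int best = 0;
        double best_time = DBL_MAX;
        for (int p = 0; p < NUM_PROTOCOLS; p++) {
            if (phase_time[p] < best_time) {
                best_time = phase_time[p];
                best = p;
            }
        }
        return best;
    }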
Point-to-Point Protocols
• Our system uses the following protocols
  • Existing protocols (Yuan et al. 2009):
    • Pre-copy
    • Sender-initiated Rendezvous
    • Receiver-initiated Rendezvous
  • New protocol:
    • Post-copy
• Protocols categorized based on:
  • Message size
  • Arrival patterns of the communicating tasks
Pre-copy Protocol
• [Timeline diagram of MPI calls and data operations on sender and receiver]
• Sender: MPI_Send → local buffer copy → RDMA Write of the request → sender idle until the ACK arrives → MPI_Barrier
• Receiver: MPI_Recv → RDMA Read of the data → RDMA Write of the ACK → MPI_Barrier
Post-copy Protocol
• [Timeline diagram of MPI calls and data operations on sender and receiver]
• Sender: MPI_Send → RDMA Write of the request + data → short idle wait for the ACK → MPI_Barrier
• Receiver: MPI_Recv → local buffer copy → ACK → MPI_Barrier
• Sender spends significantly less idle time compared to Pre-copy (a rough code sketch of this flow follows below)
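A rough, compilable sketch of the post-copy flow described above. The RDMA operations are replaced by no-op stubs (rdma_write_request_and_data, poll_for_message, send_ack, wait_for_ack); these names are placeholders for illustration, not a real verbs API and not the authors' implementation.

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        size_t len;
        char   payload[4096];   /* pre-registered intermediate buffer slot */
    } msg_slot_t;

    /* No-op stubs standing in for the one-sided RDMA operations. */
    static void rdma_write_request_and_data(int dst, const void *buf, size_t len) { (void)dst; (void)buf; (void)len; }
    static msg_slot_t *poll_for_message(int src) { static msg_slot_t s; (void)src; return &s; }
    static void send_ack(int src) { (void)src; }
    static void wait_for_ack(int dst) { (void)dst; }

    /* Sender side: a single RDMA write carries the request and the data,
     * so the sender only waits for a short ACK rather than for the receiver
     * to read the data back (as in pre-copy). */
    void postcopy_send(int dst, const void *buf, size_t len) {
        rdma_write_request_and_data(dst, buf, len);
        wait_for_ack(dst);
    }

    /* Receiver side: copy out of the intermediate slot, then acknowledge. */
    void postcopy_recv(int src, void *buf, size_t len) {
        msg_slot_t *slot = poll_for_message(src);
        memcpy(buf, slot->payload, len < slot->len ? len : slot->len);
        send_ack(src);
    }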
Protocol Cost Model
• Supports five basic MPI operations:
  • MPI_Send
  • MPI_Recv
  • MPI_Isend
  • MPI_Irecv
  • MPI_Wait
• Important terms:
  • tmemreg - buffer registration time
  • tmemcopy - buffer memory copy time
  • trdma_read - buffer RDMA Read time
  • trdma_write - buffer RDMA Write time
  • tfunc_delay - constant book-keeping time
Post-copy Protocol Cost Model: Sender Early
• [Timeline diagram: sender executes MPI_Isend (tmemreg, trdma_write, tfunc_delay) and MPI_Wait (tfunc_delay); receiver executes MPI_Irecv (tfunc_delay), the buffer copy (tmemcopy), and MPI_Wait (tfunc_delay)]
• Sender total time = tmemreg + trdma_write + 2 × tfunc_delay
• Receiver total time = tmemcopy + 2 × tfunc_delay
Post-copy Protocol Cost Model: Receiver Early
• [Timeline diagram: receiver executes MPI_Irecv (tfunc_delay) and waits (twait_delay) in MPI_Wait before the buffer copy (tmemcopy, tfunc_delay); sender executes MPI_Isend (tmemreg, trdma_write, tfunc_delay) and MPI_Wait (tfunc_delay)]
• Sender total time = tmemreg + trdma_write + 2 × tfunc_delay
• Receiver total time = twait_delay + tmemcopy + 2 × tfunc_delay
• (These expressions are written out as code below.)
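The two sets of cost expressions above can be written directly as code. This is a minimal sketch using the model's terms; the struct and function names are assumptions for illustration, not the paper's implementation.

    typedef struct {
        double t_memreg;      /* buffer registration time       */
        double t_memcopy;     /* buffer memory copy time        */
        double t_rdma_write;  /* buffer RDMA Write time         */
        double t_func_delay;  /* constant book-keeping time     */
        double t_wait_delay;  /* time spent waiting in MPI_Wait */
    } cost_terms_t;

    /* Sender total time (same in both cases):
     * tmemreg + trdma_write + 2 x tfunc_delay */
    double postcopy_sender_cost(const cost_terms_t *c) {
        return c->t_memreg + c->t_rdma_write + 2.0 * c->t_func_delay;
    }

    /* Receiver total time, sender early: tmemcopy + 2 x tfunc_delay */
    double postcopy_recv_cost_sender_early(const cost_terms_t *c) {
        return c->t_memcopy + 2.0 * c->t_func_delay;
    }

    /* Receiver total time, receiver early:
     * twait_delay + tmemcopy + 2 x tfunc_delay */
    double postcopy_recv_cost_receiver_early(const cost_terms_t *c) {
        return c->t_wait_delay + c->t_memcopy + 2.0 * c->t_func_delay;
    }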
Optimization: Buffer Usage Profiling
• Example code snippet:

  ...
  MPI_Send(buff1, ...);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Recv(buff1, ...);
  ...
Optimization: Buffer Usage Profiling
Phase 2 (Buffer Usage Profiling)
• [Timeline diagram: Rank 1 through Rank n, from the start of the phase]
• Each rank records the buffer used by every MPI call, e.g. MPI_Send(Buff 1), MPI_Recv(Buff 2), MPI_Send(Buff 3), MPI_Recv(Buff 1)
• A buffer that reappears in a later call (Buff 1 here) marks where the earlier send on that buffer must be completed (a sketch of this bookkeeping follows below)
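One way to picture the bookkeeping behind such a profile: remember, for each buffer passed to a send, where it was last used, so that a later call reusing the same buffer marks the earliest point at which the deferred completion (MPI_Wait) must be placed. The structure and function names below are assumptions for illustration, not the authors' implementation.

    #include <stddef.h>

    #define MAX_TRACKED 64

    typedef struct {
        const void *buf;       /* user buffer passed to the MPI call      */
        int         call_idx;  /* position of the call in the phase trace */
    } buffer_use_t;

    static buffer_use_t last_send_use[MAX_TRACKED];
    static int          num_tracked = 0;

    /* Record that the call at position call_idx sent from buf. */
    void profile_send(const void *buf, int call_idx) {
        if (num_tracked < MAX_TRACKED) {
            last_send_use[num_tracked].buf = buf;
            last_send_use[num_tracked].call_idx = call_idx;
            num_tracked++;
        }
    }

    /* If a later call reuses buf, return the index of the earlier send: that
     * send can become MPI_Isend, with MPI_Wait placed just before the reuse.
     * Returns -1 if the buffer has not been seen before. */
    int profile_reuse(const void *buf) {
        for (int i = 0; i < num_tracked; i++)
            if (last_send_use[i].buf == buf)
                return last_send_use[i].call_idx;
        return -1;
    }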
Optimization: Buffer Usage Profiling
• Conversion of synchronous calls to asynchronous calls, guided by the buffer usage profile (a compilable version follows below)

  Original code:
  ...
  MPI_Send(buff1, ...);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Recv(buff1, ...);
  ...

  Transformed code:
  ...
  MPI_Isend(buff1, ..., req1);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Wait(req1, ...);
  MPI_Recv(buff1, ...);
  ...
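For completeness, here is a self-contained, compilable version of the transformed sequence on rank 0, paired with a matching partner sequence on rank 1 so the example can actually run with two ranks. The buffer sizes, tags, and the partner-rank code are placeholders for illustration; only the handling of buff1 mirrors the transformation on the slide.

    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv) {
        double buff1[N] = {0}, buff2[N], buff3[N] = {0};
        MPI_Request req1;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2 && rank == 0) {
            /* Transformed sequence: the blocking send of buff1 becomes
             * MPI_Isend, and its MPI_Wait is deferred until just before
             * buff1 is reused by the final receive. */
            MPI_Isend(buff1, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req1);
            MPI_Recv (buff2, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (buff3, N, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD);
            MPI_Wait (&req1, MPI_STATUS_IGNORE);   /* buff1 is reused next */
            MPI_Recv (buff1, N, MPI_DOUBLE, 1, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (size >= 2 && rank == 1) {
            /* Matching partner sequence so the example is deadlock-free. */
            double tmp[N];
            MPI_Recv (tmp, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (tmp, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
            MPI_Recv (tmp, N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (tmp, N, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }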
Performance Evaluation
• Test cluster:
  • 16 nodes
  • Intel Xeon processors (64-bit), 8 cores at 2.33 GHz
  • 8 GB system memory
  • InfiniBand interconnect
• Software: MVAPICH2
• Benchmarks:
  • Sparse Matrix
  • CG
  • Sweep3D
  • Microbenchmarks
Performance Evaluation
• Single communication phase per application
Performance Evaluation
• The system chose the optimal protocol for each phase dynamically
Performance Evaluation
• Real and modeled execution times for the Sparse Matrix application [chart: Real vs. Modeled]
• Modeling accuracy: 95% to 99%
• Modeling overhead: less than 1% of total execution time
Summary
• Our system for on-line protocol selection was successfully tested on real applications and microbenchmarks
• Protocol cost model: high accuracy with negligible overhead
• The Sender-initiated Post-copy protocol was successfully implemented
Questions?

Thank You!