Profile Guided MPI Protocol Selection for Point-to-Point Communication Calls

Aniruddha Marathe, David K. Lowenthal
Department of Computer Science
The University of Arizona
Tucson, AZ
{amarathe,dkl}@cs.arizona.edu

Zheng Gu, Matthew Small, Xin Yuan
Department of Computer Science
Florida State University
Tallahassee, FL
{zgu,small,xyuan}@cs.fsu.edu

May 9, 2011
Motivation
• Need for an on-line protocol selection scheme:
  • The optimal protocol for a communication routine is application and architecture specific
• Existing approaches
  • Off-line: protocol selection at program compilation time
  • Static: one protocol per application
  • Difficult to adapt to the program's runtime characteristics
Contributions
• On-line protocol selection algorithm
• Protocol cost model
  • Employed by the on-line protocol selection algorithm to estimate the total execution time per protocol
• Sender-initiated Post-copy protocol
  • A novel protocol to complement the existing set of protocols
On-line Protocol Selection Algorithm
• Selects the optimal communication protocol for a communication phase dynamically
• Protocol selection algorithm split into two phases:
  • Phase 1: execution time estimation per protocol
  • Phase 2 (optimization): buffer usage profiling
• System works with four protocols
On-line Protocol Selection Algorithm
• Execution of phase 1 of a sample application: n tasks, m MPI calls per task
• [Timeline diagram: Rank 1 through Rank n]
On-line Protocol Selection Algorithm: Phase 1 (Estimating Execution Times)
• [Timeline diagram: Rank 1 through Rank n, from the start of the phase to the end of the phase]
• For each MPI call in the phase (MPI Call 1 through MPI Call m), every rank records an estimated time tprotocol for each candidate protocol
• At the end of the phase the per-protocol totals t are compared and the optimal protocol = min(t) (a small code sketch of this step follows below)
• Execution time of the algorithm is linear in the number of MPI calls per phase
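The selection step at the end of phase 1 can be pictured as keeping one running estimate per protocol and taking the minimum. The following is a minimal C sketch of that idea, not the paper's implementation; NUM_PROTOCOLS, estimate_call_time(), and the function names are assumptions made for illustration, and the stub cost function only keeps the sketch compilable.

    #include <stddef.h>
    #include <float.h>

    #define NUM_PROTOCOLS 4   /* Pre-copy, Post-copy, two Rendezvous variants */

    static double phase_time[NUM_PROTOCOLS];   /* running estimate per protocol */

    /* Placeholder for the cost model of the later slides: returns the modeled
     * time of one call under protocol p.  A real system would plug in the
     * measured t-terms here; this stub just keeps the sketch compilable. */
    static double estimate_call_time(int p, size_t msg_size) {
        return (double)msg_size * (p + 1) * 1e-9;   /* dummy value */
    }

    void phase_start(void) {
        for (int p = 0; p < NUM_PROTOCOLS; p++)
            phase_time[p] = 0.0;
    }

    /* Called once per intercepted MPI call in the phase. */
    void record_call(size_t msg_size) {
        for (int p = 0; p < NUM_PROTOCOLS; p++)
            phase_time[p] += estimate_call_time(p, msg_size);
    }

    /* End of phase: the optimal protocol is the one with the minimum
     * estimated total time, i.e. min(t) over the candidate protocols. */
    int phase_end_select(void) {
        int best = 0;
        double best_time = DBL_MAX;
        for (int p = 0; p < NUM_PROTOCOLS; p++) {
            if (phase_time[p] < best_time) {
                best_time = phase_time[p];
                best = p;
            }
        }
        return best;
    }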
Point-to-Point Protocols
• Our system uses the following protocols
  • Existing protocols (Yuan et al. 2009):
    • Pre-copy
    • Sender-initiated Rendezvous
    • Receiver-initiated Rendezvous
  • New protocol:
    • Post-copy
• Protocols categorized based on:
  • Message size
  • Arrival patterns of the communicating tasks
Pre-copy Protocol
• [Timeline diagram of MPI calls and data operations on sender and receiver]
• Sender: MPI_Send → local buffer copy → RDMA Write of the request → sender idle until the ACK arrives → MPI_Barrier
• Receiver: MPI_Recv → RDMA Read of the data → RDMA Write of the ACK → MPI_Barrier
Post-copy Protocol
• [Timeline diagram of MPI calls and data operations on sender and receiver]
• Sender: MPI_Send → RDMA Write of the request + data → short idle wait for the ACK → MPI_Barrier
• Receiver: MPI_Recv → local buffer copy → ACK → MPI_Barrier
• Sender spends significantly less idle time compared to Pre-copy (a rough code sketch of this flow follows below)
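A rough, compilable sketch of the post-copy flow described above. The RDMA operations are replaced by no-op stubs (rdma_write_request_and_data, poll_for_message, send_ack, wait_for_ack); these names are placeholders for illustration, not a real verbs API and not the authors' implementation.

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        size_t len;
        char   payload[4096];   /* pre-registered intermediate buffer slot */
    } msg_slot_t;

    /* No-op stubs standing in for the one-sided RDMA operations. */
    static void rdma_write_request_and_data(int dst, const void *buf, size_t len) { (void)dst; (void)buf; (void)len; }
    static msg_slot_t *poll_for_message(int src) { static msg_slot_t s; (void)src; return &s; }
    static void send_ack(int src) { (void)src; }
    static void wait_for_ack(int dst) { (void)dst; }

    /* Sender side: a single RDMA write carries the request and the data,
     * so the sender only waits for a short ACK rather than for the receiver
     * to read the data back (as in pre-copy). */
    void postcopy_send(int dst, const void *buf, size_t len) {
        rdma_write_request_and_data(dst, buf, len);
        wait_for_ack(dst);
    }

    /* Receiver side: copy out of the intermediate slot, then acknowledge. */
    void postcopy_recv(int src, void *buf, size_t len) {
        msg_slot_t *slot = poll_for_message(src);
        memcpy(buf, slot->payload, len < slot->len ? len : slot->len);
        send_ack(src);
    }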
Protocol Cost Model
• Supports five basic MPI operations:
  • MPI_Send
  • MPI_Recv
  • MPI_Isend
  • MPI_Irecv
  • MPI_Wait
• Important terms:
  • tmemreg - buffer registration time
  • tmemcopy - buffer memory copy time
  • trdma_read - buffer RDMA Read time
  • trdma_write - buffer RDMA Write time
  • tfunc_delay - constant book-keeping time
Post-copy Protocol Cost Model: Sender Early
• [Timeline diagram: sender executes MPI_Isend (tmemreg, trdma_write, tfunc_delay) and MPI_Wait (tfunc_delay); receiver executes MPI_Irecv (tfunc_delay), the buffer copy (tmemcopy), and MPI_Wait (tfunc_delay)]
• Sender total time = tmemreg + trdma_write + 2 × tfunc_delay
• Receiver total time = tmemcopy + 2 × tfunc_delay
Post-copy Protocol Cost Model: Receiver Early
• [Timeline diagram: receiver executes MPI_Irecv (tfunc_delay) and waits (twait_delay) in MPI_Wait before the buffer copy (tmemcopy, tfunc_delay); sender executes MPI_Isend (tmemreg, trdma_write, tfunc_delay) and MPI_Wait (tfunc_delay)]
• Sender total time = tmemreg + trdma_write + 2 × tfunc_delay
• Receiver total time = twait_delay + tmemcopy + 2 × tfunc_delay
• (These expressions are written out as code below.)
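The two sets of cost expressions above can be written directly as code. This is a minimal sketch using the model's terms; the struct and function names are assumptions for illustration, not the paper's implementation.

    typedef struct {
        double t_memreg;      /* buffer registration time       */
        double t_memcopy;     /* buffer memory copy time        */
        double t_rdma_write;  /* buffer RDMA Write time         */
        double t_func_delay;  /* constant book-keeping time     */
        double t_wait_delay;  /* time spent waiting in MPI_Wait */
    } cost_terms_t;

    /* Sender total time (same in both cases):
     * tmemreg + trdma_write + 2 x tfunc_delay */
    double postcopy_sender_cost(const cost_terms_t *c) {
        return c->t_memreg + c->t_rdma_write + 2.0 * c->t_func_delay;
    }

    /* Receiver total time, sender early: tmemcopy + 2 x tfunc_delay */
    double postcopy_recv_cost_sender_early(const cost_terms_t *c) {
        return c->t_memcopy + 2.0 * c->t_func_delay;
    }

    /* Receiver total time, receiver early:
     * twait_delay + tmemcopy + 2 x tfunc_delay */
    double postcopy_recv_cost_receiver_early(const cost_terms_t *c) {
        return c->t_wait_delay + c->t_memcopy + 2.0 * c->t_func_delay;
    }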
Optimization: Buffer Usage Profiling
• Example code snippet:

  ...
  MPI_Send(buff1, ...);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Recv(buff1, ...);
  ...
Optimization: Buffer Usage Profiling
Phase 2 (Buffer Usage Profiling)
• [Timeline diagram: Rank 1 through Rank n, from the start of the phase]
• Each rank records the buffer used by every MPI call, e.g. MPI_Send(Buff 1), MPI_Recv(Buff 2), MPI_Send(Buff 3), MPI_Recv(Buff 1)
• A buffer that reappears in a later call (Buff 1 here) marks where the earlier send on that buffer must be completed (a sketch of this bookkeeping follows below)
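One way to picture the bookkeeping behind such a profile: remember, for each buffer passed to a send, where it was last used, so that a later call reusing the same buffer marks the earliest point at which the deferred completion (MPI_Wait) must be placed. The structure and function names below are assumptions for illustration, not the authors' implementation.

    #include <stddef.h>

    #define MAX_TRACKED 64

    typedef struct {
        const void *buf;       /* user buffer passed to the MPI call      */
        int         call_idx;  /* position of the call in the phase trace */
    } buffer_use_t;

    static buffer_use_t last_send_use[MAX_TRACKED];
    static int          num_tracked = 0;

    /* Record that the call at position call_idx sent from buf. */
    void profile_send(const void *buf, int call_idx) {
        if (num_tracked < MAX_TRACKED) {
            last_send_use[num_tracked].buf = buf;
            last_send_use[num_tracked].call_idx = call_idx;
            num_tracked++;
        }
    }

    /* If a later call reuses buf, return the index of the earlier send: that
     * send can become MPI_Isend, with MPI_Wait placed just before the reuse.
     * Returns -1 if the buffer has not been seen before. */
    int profile_reuse(const void *buf) {
        for (int i = 0; i < num_tracked; i++)
            if (last_send_use[i].buf == buf)
                return last_send_use[i].call_idx;
        return -1;
    }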
Optimization: Buffer Usage Profiling
• Conversion of synchronous calls to asynchronous calls, guided by the buffer usage profile (a compilable version follows below)

  Original code:
  ...
  MPI_Send(buff1, ...);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Recv(buff1, ...);
  ...

  Transformed code:
  ...
  MPI_Isend(buff1, ..., req1);
  MPI_Recv(buff2, ...);
  MPI_Send(buff3, ...);
  MPI_Wait(req1, ...);
  MPI_Recv(buff1, ...);
  ...
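For completeness, here is a self-contained, compilable version of the transformed sequence on rank 0, paired with a matching partner sequence on rank 1 so the example can actually run with two ranks. The buffer sizes, tags, and the partner-rank code are placeholders for illustration; only the handling of buff1 mirrors the transformation on the slide.

    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv) {
        double buff1[N] = {0}, buff2[N], buff3[N] = {0};
        MPI_Request req1;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2 && rank == 0) {
            /* Transformed sequence: the blocking send of buff1 becomes
             * MPI_Isend, and its MPI_Wait is deferred until just before
             * buff1 is reused by the final receive. */
            MPI_Isend(buff1, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req1);
            MPI_Recv (buff2, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (buff3, N, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD);
            MPI_Wait (&req1, MPI_STATUS_IGNORE);   /* buff1 is reused next */
            MPI_Recv (buff1, N, MPI_DOUBLE, 1, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (size >= 2 && rank == 1) {
            /* Matching partner sequence so the example is deadlock-free. */
            double tmp[N];
            MPI_Recv (tmp, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (tmp, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
            MPI_Recv (tmp, N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (tmp, N, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }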
Performance Evaluation
• Test cluster:
  • 16 nodes
  • Intel Xeon processors (64-bit), 8 cores at 2.33 GHz
  • 8 GB system memory
  • InfiniBand interconnect
• Software: MVAPICH2
• Benchmarks:
  • Sparse Matrix
  • CG
  • Sweep3D
  • Microbenchmarks
Performance Evaluation
• Single communication phase per application
Performance Evaluation
• The system chose the optimal protocol for each phase dynamically
Performance Evaluation
• Real and modeled execution times for the Sparse Matrix application [chart: Real vs. Modeled]
• Modeling accuracy: 95% to 99%
• Modeling overhead: less than 1% of total execution time
Summary
• Our system for on-line protocol selection was successfully tested on real applications and microbenchmarks
• Protocol cost model: high accuracy with negligible overhead
• The Sender-initiated Post-copy protocol was successfully implemented
Questions?

Thank You!