Communication latency in distributed memory systems
Ming Kin Lai
May 4, 2007
Latency
• Latency = time delay
• Memory latency = time that elapses between making a request for a value stored in memory and receiving the associated value
• Communication (or network) latency = time between sending and starting to receive data on a network link
Memory latency
• On a uniprocessor: the memory wall – the speed gap between processor and memory
• On a NUMA machine (a tightly coupled distributed memory multiprocessor), memory latency is
  • small for a local reference
  • large for a remote reference
Reduction of memory latency
• Tolerance (hiding) – hiding the effect of memory-access latencies by overlapping useful computation with memory references
• Avoidance – minimizing remote references by co-locating computation with the data it accesses
• Latency tolerance and avoidance are complementary to each other
Memory latency avoidance
• Use of cache
Memory latency tolerance
• Multithreading
  • when one thread waits for a data access, another is executed
  • successful when the overheads associated with the implementation, e.g. thread switching time, are less than the benefit gained from overlapping computation and memory access
• Prefetching (into cache) – see the sketch after this list
• Out-of-order execution
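As an illustration of prefetching, here is a minimal C sketch, assuming a GCC/Clang-style compiler that provides the __builtin_prefetch intrinsic; the array, the loop, and the prefetch distance DIST are illustrative choices, not taken from the slides.

#include <stddef.h>

/* Illustrative prefetch distance: how far ahead of the current iteration
 * to request data, chosen to roughly cover the memory latency. */
#define DIST 16

/* Sum an array while prefetching elements DIST iterations ahead, so the
 * memory access for a future iteration overlaps with the computation of
 * the current one. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 1);  /* read access, low temporal locality */
        sum += a[i];
    }
    return sum;
}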
Reduction of network (communication) latency
• Tolerance (hiding) – hiding the effect of latencies by overlapping useful computation with communication
• Avoidance – minimizing communication by co-locating computation with the data it accesses
Communication latency tolerance
• In general, successful only if other work is available
• Multithreading
  • when one thread waits for communication, another is executed
  • successful when the overheads associated with the implementation, e.g. thread switching time, are less than the benefit gained from overlapping computation and communication
Communication latency tolerance
• Multithreading
  • exploits parallelism across multiple threads
• Prefetching
  • finds parallelism within a single thread
  • the request for data (i.e. the prefetch request) must be issued far in advance of the use of the data in the execution stream
  • requires the ability to predict what data is needed ahead of time
Comm latency avoidance in software DSMs
• Data replication in local memory, using local memory as a cache for remote locations
• Relaxed consistency models
Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo, 1998
• Comm latency avoidance is not sufficient; comm latency tolerance is needed
• By combining prefetching and multithreading such that
  • multithreading hides synchronization latency, and
  • prefetching hides memory latency,
  3 of the 8 applications achieve better performance than with either technique used individually
Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo, 1998 (cont.)
• But combining prefetching and multithreading such that both techniques attempt to hide memory latency is not a good idea – redundant overhead
• The best overall approach depends on
  • the predictability of memory access patterns
  • the extent to which lock stalls dominate synchronization time
  • etc.
Comm latency tolerance in message passing
• “Appropriate” placement of non-blocking communication calls
• MPI’s non-blocking calls
  • MPI_Isend() and MPI_Irecv()
MPI’s non-blocking send
• A non-blocking post-send (MPI_Isend) initiates a send operation, but does not complete it
• The post-send may return before the message is copied out of the send buffer
• A separate complete-send (MPI_Wait) call is needed to verify that the send operation has completed, i.e. the data has been copied out of the send buffer
MPI’s non-blocking receive
• A non-blocking post-receive (MPI_Irecv) initiates a receive operation, but does not complete it
• The post-receive may return before the message is stored into the receive buffer
• A separate complete-receive (MPI_Wait) call is needed to verify that the receive operation has completed, i.e. the data has been received into the receive buffer
• MPI_Send = MPI_Isend + MPI_Wait
• MPI_Recv = MPI_Irecv + MPI_Wait
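A minimal C sketch of the equivalence above: a blocking MPI_Recv behaves like an MPI_Irecv followed immediately by MPI_Wait (the buffer, count, tag, and source here are illustrative).

#include <mpi.h>

/* Blocking form: returns only after the message is in buf. */
void blocking_recv(double *buf, int count, int src, MPI_Comm comm)
{
    MPI_Recv(buf, count, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
}

/* Equivalent split form: post the receive, then complete it immediately.
 * Waiting right away leaves no room for overlap; the value of the split
 * comes from putting computation between the two calls. */
void split_recv(double *buf, int count, int src, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}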
“Appropriate” placement of non-blocking calls
• To achieve maximum overlap between computation and communication, communications should be
  • started as soon as possible
  • completed as late as possible
• A send should be
  • posted as soon as the data to be sent is available
  • completed just before the send buffer is to be reused
• A receive should be
  • posted as soon as the receive buffer can be reused
  • completed just before the data in the receive buffer is to be used
• Sometimes, overlap can be increased by re-ordering computations (see the sketch below)
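A hedged C/MPI sketch of this placement for a simple two-rank exchange; the buffer size N, the peer rank, and the local_work()/use_received() helpers are illustrative placeholders, not part of MPI or of the slides.

#include <mpi.h>

#define N 1024

/* Illustrative computation that does not touch the message buffers. */
static void local_work(double *out, int n)
{
    for (int i = 0; i < n; i++) out[i] = (double)i * 0.5;
}

/* Illustrative consumer of the received data. */
static double use_received(const double *in, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += in[i];
    return s;
}

double exchange_with_overlap(int peer, MPI_Comm comm)
{
    double sendbuf[N], recvbuf[N], scratch[N];
    MPI_Request sreq, rreq;

    /* Post the receive as soon as the receive buffer is free. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, comm, &rreq);

    /* Produce the outgoing data, then post the send as soon as it is ready. */
    local_work(sendbuf, N);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, comm, &sreq);

    /* Independent computation overlaps with the messages in flight. */
    local_work(scratch, N);

    /* Complete as late as possible: wait only when a buffer is needed again. */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);   /* before using recvbuf   */
    double result = use_received(recvbuf, N);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);   /* before reusing sendbuf */
    return result;
}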
Communication latency tolerance in MESSENGERS-C
• Multithreading
  • multiple sending and receiving threads
• I/O multiplexing using poll() (similar to select()) with sockets
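A minimal sketch of socket I/O multiplexing with poll() in C, in the spirit of the approach described above; the descriptor set and the handler functions are illustrative assumptions, not MESSENGERS-C code.

#include <poll.h>
#include <unistd.h>

/* Illustrative handlers for ready sockets. */
static void handle_incoming(int fd)
{
    char buf[256];
    ssize_t n = read(fd, buf, sizeof buf);   /* drain pending data */
    (void)n;                                 /* error handling omitted */
}

static void handle_outgoing(int fd)
{
    (void)fd;                                /* flush queued outgoing data here */
}

/* Wait on several sockets at once and service whichever become ready,
 * instead of blocking on a single connection. */
void multiplex(int *socks, int nsocks)
{
    struct pollfd fds[nsocks];               /* C99 variable-length array */
    for (int i = 0; i < nsocks; i++) {
        fds[i].fd = socks[i];
        fds[i].events = POLLIN | POLLOUT;
    }
    if (poll(fds, (nfds_t)nsocks, -1) > 0) { /* block until something is ready */
        for (int i = 0; i < nsocks; i++) {
            if (fds[i].revents & POLLIN)  handle_incoming(fds[i].fd);
            if (fds[i].revents & POLLOUT) handle_outgoing(fds[i].fd);
        }
    }
}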
References
• Matthew Haines and Wim Bohm, “An Evaluation of Software Multithreading in a Conventional Distributed Memory Multiprocessor,” 1993
• Yong Yan, Xiaodong Zhang, and Zhao Zhang, “A Memory-layout Oriented Run-time Technique for Locality Optimization,” 1995
• P. H. Wang, H. Wang, J. D. Collins, E. Grochowski, M. Kling, and J. P. Shen, “Memory Latency-Tolerance Approaches for Itanium Processors: Out-of-Order Execution vs. Speculative Precomputation,” Eighth International Symposium on High-Performance Computer Architecture (HPCA), 2002