Communication latency in distributed memory systems
Ming Kin Lai
May 4, 2007
Latency
• Latency = time delay
• Memory latency = time that elapses between making a request for a value stored in memory and receiving the associated value
• Communication (or network) latency = time between sending and starting to receive data on a network link
Memory latency
• On a uniprocessor: the memory wall – the speed gap between processor and memory
• On a NUMA machine (a tightly coupled distributed memory multiprocessor), memory latency is
  • small for a local reference
  • large for a remote reference
Reduction of memory latency
• Tolerance (hiding) – hiding the effect of memory-access latencies by overlapping useful computation with memory references
• Avoidance – minimizing remote references by co-locating computation with the data it accesses
• Latency tolerance and avoidance are complementary to each other
Memory latency avoidance
• Use of cache
Memory latency tolerance
• Multithreading
  • when one thread waits for a data access, another is executed
  • successful when the overheads associated with the implementation, e.g. thread switching time, are less than the benefit gained from overlapping computation and memory access
• Prefetching (into cache) – see the sketch after this list
• Out-of-order execution
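As an illustration of prefetching, here is a minimal C sketch, assuming a GCC/Clang-style compiler that provides the __builtin_prefetch intrinsic; the array, the loop, and the prefetch distance DIST are illustrative choices, not taken from the slides.

#include <stddef.h>

/* Illustrative prefetch distance: how far ahead of the current iteration
 * to request data, chosen to roughly cover the memory latency. */
#define DIST 16

/* Sum an array while prefetching elements DIST iterations ahead, so the
 * memory access for a future iteration overlaps with the computation of
 * the current one. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 1);  /* read access, low temporal locality */
        sum += a[i];
    }
    return sum;
}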
Reduction of network (communication) latency
• Tolerance (hiding) – hiding the effect of latencies by overlapping useful computation with communication
• Avoidance – minimizing communication by co-locating computation with the data it accesses
Communication latency tolerance
• In general, successful only if other work is available
• Multithreading
  • when one thread waits for communication, another is executed
  • successful when the overheads associated with the implementation, e.g. thread switching time, are less than the benefit gained from overlapping computation and communication
Communication latency tolerance
• Multithreading
  • exploits parallelism across multiple threads
• Prefetching
  • finds parallelism within a single thread
  • the request for data (i.e. the prefetch request) must be issued far in advance of the use of the data in the execution stream
  • requires the ability to predict what data is needed ahead of time
Comm latency avoidance in software DSMs
• Data replication in local memory, using local memory as a cache for remote locations
• Relaxed consistency models
Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo, 1998
• Comm latency avoidance is not sufficient; comm latency tolerance is needed
• By combining prefetching and multithreading such that
  • multithreading hides synchronization latency, and
  • prefetching hides memory latency,
  3 of the 8 applications achieve better performance than with either technique used individually
Comparative Evaluation of Latency Tolerance Techniques for Software Distributed Shared Memory – Mowry, Chan and Lo, 1998 (cont.)
• But combining prefetching and multithreading such that both techniques attempt to hide memory latency is not a good idea – redundant overhead
• The best overall approach depends on
  • the predictability of memory access patterns
  • the extent to which lock stalls dominate synchronization time
  • etc.
Comm latency tolerance in message passing
• “Appropriate” placement of non-blocking communication calls
• MPI’s non-blocking calls
  • MPI_Isend() and MPI_Irecv()
MPI’s non-blocking send
• A non-blocking post-send (MPI_Isend) initiates a send operation, but does not complete it
• The post-send may return before the message is copied out of the send buffer
• A separate complete-send (MPI_Wait) call is needed to verify that the send operation has completed, i.e. the data has been copied out of the send buffer
MPI’s non-blocking receive
• A non-blocking post-receive (MPI_Irecv) initiates a receive operation, but does not complete it
• The post-receive may return before the message is stored into the receive buffer
• A separate complete-receive (MPI_Wait) call is needed to verify that the receive operation has completed, i.e. the data has been received into the receive buffer
• MPI_Send = MPI_Isend + MPI_Wait
• MPI_Recv = MPI_Irecv + MPI_Wait
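A minimal C sketch of the equivalence above: a blocking MPI_Recv behaves like an MPI_Irecv followed immediately by MPI_Wait (the buffer, count, tag, and source here are illustrative).

#include <mpi.h>

/* Blocking form: returns only after the message is in buf. */
void blocking_recv(double *buf, int count, int src, MPI_Comm comm)
{
    MPI_Recv(buf, count, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
}

/* Equivalent split form: post the receive, then complete it immediately.
 * Waiting right away leaves no room for overlap; the value of the split
 * comes from putting computation between the two calls. */
void split_recv(double *buf, int count, int src, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}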
“Appropriate” placement of non-blocking calls
• To achieve maximum overlap between computation and communication, communications should be
  • started as soon as possible
  • completed as late as possible
• A send should be
  • posted as soon as the data to be sent is available
  • completed just before the send buffer is to be reused
• A receive should be
  • posted as soon as the receive buffer can be reused
  • completed just before the data in the receive buffer is to be used
• Sometimes, overlap can be increased by re-ordering computations (see the sketch below)
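A hedged C/MPI sketch of this placement for a simple two-rank exchange; the buffer size N, the peer rank, and the local_work()/use_received() helpers are illustrative placeholders, not part of MPI or of the slides.

#include <mpi.h>

#define N 1024

/* Illustrative computation that does not touch the message buffers. */
static void local_work(double *out, int n)
{
    for (int i = 0; i < n; i++) out[i] = (double)i * 0.5;
}

/* Illustrative consumer of the received data. */
static double use_received(const double *in, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += in[i];
    return s;
}

double exchange_with_overlap(int peer, MPI_Comm comm)
{
    double sendbuf[N], recvbuf[N], scratch[N];
    MPI_Request sreq, rreq;

    /* Post the receive as soon as the receive buffer is free. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, comm, &rreq);

    /* Produce the outgoing data, then post the send as soon as it is ready. */
    local_work(sendbuf, N);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, comm, &sreq);

    /* Independent computation overlaps with the messages in flight. */
    local_work(scratch, N);

    /* Complete as late as possible: wait only when a buffer is needed again. */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);   /* before using recvbuf   */
    double result = use_received(recvbuf, N);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);   /* before reusing sendbuf */
    return result;
}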
Communication latency tolerance in MESSENGERS-C
• Multithreading
  • multiple sending and receiving threads
• I/O multiplexing using poll() (similar to select()) with sockets
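A minimal sketch of socket I/O multiplexing with poll() in C, in the spirit of the approach described above; the descriptor set and the handler functions are illustrative assumptions, not MESSENGERS-C code.

#include <poll.h>
#include <unistd.h>

/* Illustrative handlers for ready sockets. */
static void handle_incoming(int fd)
{
    char buf[256];
    ssize_t n = read(fd, buf, sizeof buf);   /* drain pending data */
    (void)n;                                 /* error handling omitted */
}

static void handle_outgoing(int fd)
{
    (void)fd;                                /* flush queued outgoing data here */
}

/* Wait on several sockets at once and service whichever become ready,
 * instead of blocking on a single connection. */
void multiplex(int *socks, int nsocks)
{
    struct pollfd fds[nsocks];               /* C99 variable-length array */
    for (int i = 0; i < nsocks; i++) {
        fds[i].fd = socks[i];
        fds[i].events = POLLIN | POLLOUT;
    }
    if (poll(fds, (nfds_t)nsocks, -1) > 0) { /* block until something is ready */
        for (int i = 0; i < nsocks; i++) {
            if (fds[i].revents & POLLIN)  handle_incoming(fds[i].fd);
            if (fds[i].revents & POLLOUT) handle_outgoing(fds[i].fd);
        }
    }
}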
References
• Matthew Haines and Wim Bohm, “An Evaluation of Software Multithreading in a Conventional Distributed Memory Multiprocessor,” 1993
• Yong Yan, Xiaodong Zhang, and Zhao Zhang, “A Memory-layout Oriented Run-time Technique for Locality Optimization,” 1995
• P. H. Wang, H. Wang, J. D. Collins, E. Grochowski, M. Kling, and J. P. Shen, “Memory Latency-Tolerance Approaches for Itanium Processors: Out-of-Order Execution vs. Speculative Precomputation,” Eighth International Symposium on High-Performance Computer Architecture (HPCA), 2002