Efficient Asynchronous Message Passing via SCI with Zero-Copying

Efficient Asynchronous Message Passing via SCI with Zero-Copying SCI Europe 2001 – Trinity College Dublin Joachim Worringen*, Friedrich Seifert+, Thomas Bemmerl*

Agenda • What is Zero-Copying? What is it good for? Zero-Copying with SCI • Support through SMI-Library Shared Memory Interface • Zero-Copy Protocols in SCI-MPICH Memory Allocation Setups Performance Optimizations • Performance Evaluation Point-to-Point Application Kernel Asynchronous Communication SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Zero-Copying • Transfer of data between two user-level accessible memory buffers with N explicit intermediate copies:N-way–Copying • No intermediate copy:Zero-Copying • Effective Bandwidth and Efficiency: SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

SCI DMA FastEthernet GigaEthernet Efficiency Comparison SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Zero-Copying with SCI • SCI does zero-copy by nature. But: SCI via IO-Bus is limited: • No SMP-style shared memory • Specially allocated memory regions were required • No general zero-copy possible New possibility: • Using user-allocated buffers for SCI communication • Allows general zero-copy! Connection setup is always required. SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

SMI LibraryShared Memory Interface • High-Level SCI support library • for parallel applications or libraries • Application startup • Synchronization & basic communication • Shared-Memory setup: • Collective regions • Point-2-point regions • Individual regions • Dynamic memory management • Data transfer SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Data Moving (I) • Shared Memory Paradigm: • Import remote memory in local address space • Perform memcpy() or maybe DMA • SMI Support: • region type REMOTE • Synchronous (PIO): SMI_Memcpy() • Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait() • Problems: • High Mapping Overhead • Resource Usage (ATT entries on PCI-SCI adapter) SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Mapping Overhead • Not suitable for dynamic memory setups! SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Data Moving (II) • Connection Paradigm: • Connect to remote memory location • No representation in local address space • only DMA possible • SMI support: • Region type RDMA • Synchronous / Asynchronous DMA:SMI_Put/SMI_Iput, SMI_Get/SMI_Iget, SMI_Memwait • Problems: • Alignment restrictions • Source needs to be pinned down SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Setup Acceleration • Memory buffer setup costs time ! • Reduce number of operations to increase performance • Desirable: only one operation per buffer • Problem: limited ressources • Solution: caching of SCI segment states by lazy-release • Leave buffers registered, remote segments connected or mapped • Release unneeded resources if setup of new resource fails • Different replacement strategies possible:LRU, LFU, best-fit, random, immediate • Attention: remote segment deallocation! • Callback on connection event to release local connection • MPI persistent communication operations: • Pre-register user buffer & higher „hold“ priority SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Memory Allocation • Allocate „good“ memory: • MPI_Alloc_mem() / MPI_Free_mem() • Part of MPI-2 (mostly for single-sided operations) • SCI-MPICH defines attributes: • type: shared, private or default • Shared memory performs best. • alignment: none, specified or default • Non-shared memory should be page-aligned • „Good“ memory should only be enforced for communication buffers! SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Zero-Copy Protocols • Applicable for hand-shake based rendez-vous protocol • Requirements: • registered user allocated buffers or • regular SCI segments • „good“ memory via MPI_Alloc_mem() • State of memory range must be known • SMI provides query functionality • Registering / Connection / Mapping may fail • Several different setups possible • Fallback mechanism required SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Sender Receiver Application Thread Device Thread Device Thread Application Thread Isend Ask to send OK to send Done Continue Done Asynchronous Rendez-Vous Control Messages Irecv Data Transfer Wait Wait SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Test Setup • Systems used for performance evaluation: • Pentium-III @ 800 MHz • 512 MB RAM @ 133 MHz • 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE) • Dolphin D330 (single ring topology) • Linux 2.4.4-bigphysarea • modified SCI driver (user memory for SCI) SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Bandwidth Comparison SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Application Kernel: NPB IS • Parallel bucket sort • Keys are integer numbers • Dominant communication:MPI_Alltoallv for distributed key array: SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

MPI_Alltoallv Performance • MPI_Alltoallv is translated into point-to-point operations: MPI_Isend / MPI_Irecv / MPI_Waitall • Improved performance with asynchronous DMA operations • Application speedup deduced SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Computation Synchronous Asynchronous totaltime computation time Asynchronous Communication • Goal: Overlap Computation & Communication • How to quantify the efficiency for this? • Typical overlapping effect: SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Saturation and Efficiency(I) • Two parameters are required: • Saturation s • Duration of computation period required to make total time (communication & computation) increase • Efficiency e • Relation of overhead to message latency SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Computation Synchronous Asynchronous Saturation s Saturation and Efficiency(II) ttotal tmsg_a ttotal - tbusy tmsg_s tbusy SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Experimental Setup: Overlap Micro-Benchmark to quantify overlapping: latency = MPI_Wtime() if (sender) MPI_Isend(msg, msgsize) while (elapsed_time < spinning_duration) spin (with multiple threads) MPI_Wait() else MPI_Recv() latency = MPI_Wtime() - latency SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Experimental Setup: Spinning Different ways of keeping CPU busy: • FIXEDSpin on single variable for a given amount of CPU time • No memory stress • DAXPYPerform a given number of DAXPY operations on vectors (vectorsizes x, y equivalent to message size) • Stress memory system SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

DAXPY – 64kiB Message SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

DAXPY – 256kiB Message SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

FIXED – 64kiB Message SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Asynchronous Performance • Saturation and Efficiency derived from experiments: SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Summary & Outlook • Efficient utilization of new SCI driver functionality for MPI communication: • Max. bandwidth of 230 MiB/s (regular) 190 MiB/s (user) • Connection overhead hidden by segment caching • Asynchronous communication pays off much earlier than before • New (?)quantification scheme for efficiency of asynchronous communication • Flexible MPI memory allocation supports MPI application writer • Connection-oriented DMA transfers reduce resource utilization • DMA alignment problems • Segment callback required for improved connection caching SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme

Efficient Asynchronous Message Passing via SCI with Zero-Copying

Efficient Asynchronous Message Passing via SCI with Zero-Copying

Presentation Transcript

Message Passing Basics

Message Passing

Message Passing Communication

Message Passing Programming with MPI

Message Passing Models

Message Passing

Message-Passing

Message Passing Programming with MPI

Message Passing

Message Passing Interface

Message Passing

Message Passing Programming with MPI

Message Passing

Message-Passing Computing

Message Passing

Message-Passing Computing

Message Passing Programming with MPI

Message Passing Interface

Message Passing Interface

Asynchronous Point-to-Point Message Passing

Message Passing Computing