270 likes | 395 Views
Efficient Asynchronous Message Passing via SCI with Zero-Copying. SCI Europe 2001 – Trinity College Dublin. Joachim Worringen * , Friedrich Seifert + , Thomas Bemmerl *. Agenda. What is Zero-Copying? What is it good for? Zero-Copying with SCI Support through SMI-Library
E N D
Efficient Asynchronous Message Passing via SCI with Zero-Copying SCI Europe 2001 – Trinity College Dublin Joachim Worringen*, Friedrich Seifert+, Thomas Bemmerl*
Agenda • What is Zero-Copying? What is it good for? Zero-Copying with SCI • Support through SMI-Library Shared Memory Interface • Zero-Copy Protocols in SCI-MPICH Memory Allocation Setups Performance Optimizations • Performance Evaluation Point-to-Point Application Kernel Asynchronous Communication SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Zero-Copying • Transfer of data between two user-level accessible memory buffers with N explicit intermediate copies:N-way–Copying • No intermediate copy:Zero-Copying • Effective Bandwidth and Efficiency: SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
SCI DMA FastEthernet GigaEthernet Efficiency Comparison SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Zero-Copying with SCI • SCI does zero-copy by nature. But: SCI via IO-Bus is limited: • No SMP-style shared memory • Specially allocated memory regions were required • No general zero-copy possible New possibility: • Using user-allocated buffers for SCI communication • Allows general zero-copy! Connection setup is always required. SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
SMI LibraryShared Memory Interface • High-Level SCI support library • for parallel applications or libraries • Application startup • Synchronization & basic communication • Shared-Memory setup: • Collective regions • Point-2-point regions • Individual regions • Dynamic memory management • Data transfer SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Data Moving (I) • Shared Memory Paradigm: • Import remote memory in local address space • Perform memcpy() or maybe DMA • SMI Support: • region type REMOTE • Synchronous (PIO): SMI_Memcpy() • Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait() • Problems: • High Mapping Overhead • Resource Usage (ATT entries on PCI-SCI adapter) SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Mapping Overhead • Not suitable for dynamic memory setups! SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Data Moving (II) • Connection Paradigm: • Connect to remote memory location • No representation in local address space • only DMA possible • SMI support: • Region type RDMA • Synchronous / Asynchronous DMA:SMI_Put/SMI_Iput, SMI_Get/SMI_Iget, SMI_Memwait • Problems: • Alignment restrictions • Source needs to be pinned down SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Setup Acceleration • Memory buffer setup costs time ! • Reduce number of operations to increase performance • Desirable: only one operation per buffer • Problem: limited ressources • Solution: caching of SCI segment states by lazy-release • Leave buffers registered, remote segments connected or mapped • Release unneeded resources if setup of new resource fails • Different replacement strategies possible:LRU, LFU, best-fit, random, immediate • Attention: remote segment deallocation! • Callback on connection event to release local connection • MPI persistent communication operations: • Pre-register user buffer & higher „hold“ priority SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Memory Allocation • Allocate „good“ memory: • MPI_Alloc_mem() / MPI_Free_mem() • Part of MPI-2 (mostly for single-sided operations) • SCI-MPICH defines attributes: • type: shared, private or default • Shared memory performs best. • alignment: none, specified or default • Non-shared memory should be page-aligned • „Good“ memory should only be enforced for communication buffers! SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Zero-Copy Protocols • Applicable for hand-shake based rendez-vous protocol • Requirements: • registered user allocated buffers or • regular SCI segments • „good“ memory via MPI_Alloc_mem() • State of memory range must be known • SMI provides query functionality • Registering / Connection / Mapping may fail • Several different setups possible • Fallback mechanism required SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Sender Receiver Application Thread Device Thread Device Thread Application Thread Isend Ask to send OK to send Done Continue Done Asynchronous Rendez-Vous Control Messages Irecv Data Transfer Wait Wait SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Test Setup • Systems used for performance evaluation: • Pentium-III @ 800 MHz • 512 MB RAM @ 133 MHz • 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE) • Dolphin D330 (single ring topology) • Linux 2.4.4-bigphysarea • modified SCI driver (user memory for SCI) SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Bandwidth Comparison SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Application Kernel: NPB IS • Parallel bucket sort • Keys are integer numbers • Dominant communication:MPI_Alltoallv for distributed key array: SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
MPI_Alltoallv Performance • MPI_Alltoallv is translated into point-to-point operations: MPI_Isend / MPI_Irecv / MPI_Waitall • Improved performance with asynchronous DMA operations • Application speedup deduced SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Computation Synchronous Asynchronous totaltime computation time Asynchronous Communication • Goal: Overlap Computation & Communication • How to quantify the efficiency for this? • Typical overlapping effect: SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Saturation and Efficiency(I) • Two parameters are required: • Saturation s • Duration of computation period required to make total time (communication & computation) increase • Efficiency e • Relation of overhead to message latency SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Computation Synchronous Asynchronous Saturation s Saturation and Efficiency(II) ttotal tmsg_a ttotal - tbusy tmsg_s tbusy SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Experimental Setup: Overlap Micro-Benchmark to quantify overlapping: latency = MPI_Wtime() if (sender) MPI_Isend(msg, msgsize) while (elapsed_time < spinning_duration) spin (with multiple threads) MPI_Wait() else MPI_Recv() latency = MPI_Wtime() - latency SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Experimental Setup: Spinning Different ways of keeping CPU busy: • FIXEDSpin on single variable for a given amount of CPU time • No memory stress • DAXPYPerform a given number of DAXPY operations on vectors (vectorsizes x, y equivalent to message size) • Stress memory system SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
DAXPY – 64kiB Message SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
DAXPY – 256kiB Message SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
FIXED – 64kiB Message SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Asynchronous Performance • Saturation and Efficiency derived from experiments: SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme
Summary & Outlook • Efficient utilization of new SCI driver functionality for MPI communication: • Max. bandwidth of 230 MiB/s (regular) 190 MiB/s (user) • Connection overhead hidden by segment caching • Asynchronous communication pays off much earlier than before • New (?)quantification scheme for efficiency of asynchronous communication • Flexible MPI memory allocation supports MPI application writer • Connection-oriented DMA transfers reduce resource utilization • DMA alignment problems • Segment callback required for improved connection caching SCI Europe 2001 – Trinity College Dublin Lehrstuhl für Betriebssysteme