About some OS features related to Cluster Computing Loïc Prylli LIP (RESO team) ENS-Lyon/INRIA/CNRS France
Outline • Introduction • Revisiting OS-bypass • Asynchronous-IO APIs (TCP/IP, disk) • Application: remote file access • Conclusion
Node hardware view (Myrinet) • Host: processor, memory, disks on the PCI bus • Myrinet network card: SRAM, LANai processor
Software view • Kernel: socket interface, IP stack, BIP driver • Libraries: Internet applications, PVM, MPI-BIP, Madeleine, BIP • Embedded software: BIP "firmware"
What is OS-bypass (request side)? • User-space: application and libraries, OS-bypass request queue • Kernel-space: OS services • Device-space: network embedded firmware
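The request side can be sketched as a queue shared between the user-space library (producer) and the NIC firmware (consumer): the library writes a descriptor and advances a head index, with no system call on the critical path. A minimal sketch — the descriptor layout and the names `send_req`, `post_request` are illustrative, not BIP's actual format:

```c
#include <stdint.h>

#define QUEUE_DEPTH 64  /* power of two, fixed at init time */

/* Hypothetical send descriptor posted to the NIC. */
struct send_req {
    uint64_t buf_addr;   /* address of a registered buffer */
    uint32_t length;
    uint32_t dest_node;
};

/* Ring shared between the library (producer) and firmware (consumer). */
struct req_queue {
    struct send_req slots[QUEUE_DEPTH];
    volatile uint32_t head;  /* written by the library */
    volatile uint32_t tail;  /* written by the firmware */
};

/* Post a request entirely from user space: no kernel involvement. */
int post_request(struct req_queue *q, const struct send_req *r)
{
    uint32_t next = (q->head + 1) % QUEUE_DEPTH;
    if (next == q->tail)
        return -1;               /* queue full */
    q->slots[q->head] = *r;
    q->head = next;              /* firmware polls head as a "doorbell" */
    return 0;
}

/* What the firmware side does when it polls the ring. */
int consume_request(struct req_queue *q, struct send_req *out)
{
    if (q->tail == q->head)
        return -1;               /* empty */
    *out = q->slots[q->tail];
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    return 0;
}
```

The kernel's only role is at setup time: it maps the queue pages into both the process and the device, then steps out of the data path.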
OS-bypass for data movement via "memory registration" • Application and libraries live in user space (virtual memory) • The OS registers/pins the pages • The network embedded firmware then DMAs directly to and from user memory
Memory registration • Problem: maintaining coherence between • the OS view of the virtual address space • the NIC view of the virtual address space • Particularly across fork/mmap/munmap, e.g.: • send operation -> implicit registration • munmap/mmap -> the address space changes • next send operation -> reuses an obsolete registration • Strong dependency on OS internals
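The stale-registration hazard above can be illustrated with a user-level registration cache: a send library caches (address, handle) pairs to avoid re-pinning, and must invalidate them when the application unmaps a range, otherwise a later mmap() at the same address silently hits the old entry. A toy sketch — the cache structure and names are hypothetical, not a real library's:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 16

/* Hypothetical user-level registration cache: maps a virtual address
 * range to the NIC registration handle obtained when it was pinned. */
struct reg_entry {
    void    *addr;
    size_t   len;
    uint32_t handle;   /* NIC-side registration id */
    int      valid;
};

static struct reg_entry cache[CACHE_SLOTS];
static uint32_t next_handle = 1;

/* Reuse a cached registration, or (pretend to) pin and register. */
uint32_t lookup_or_register(void *addr, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr && cache[i].len >= len)
            return cache[i].handle;       /* hit: pinning cost avoided */
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].valid) {
            cache[i] = (struct reg_entry){ addr, len, next_handle++, 1 };
            return cache[i].handle;
        }
    return 0;  /* cache full (eviction omitted in this sketch) */
}

/* Must run when the application munmap()s a range; otherwise a later
 * mmap() at the same address hits the stale entry and the NIC would
 * DMA through an obsolete translation. */
void invalidate_range(void *addr, size_t len)
{
    (void)len;
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr)
            cache[i].valid = 0;
}
```

Intercepting every fork/mmap/munmap reliably is exactly where the "strong dependency on OS internals" comes from.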
OS-bypass: when is it useful? • Without OS-bypass: communication library -> system call -> parameter validation, access control, protocol processing in the kernel -> network interface • With OS-bypass: the communication library performs parameter validation, access control and protocol processing itself -> network interface
Example in the architecture of BIP • Security level: • with no network protection, OS-bypass is best • in the new development with network protection, parameter validation, access checking and fairness are done in the kernel for message sending
Sometimes clusters are not in a closed "MPI" environment: Internet, Grids, Storage
Mixing cluster communications and other I/O activities • The 10,000-connections problem: how to deal efficiently with 10K connections (generally TCP connections, but also disk I/Os and internal cluster communications) • Typical application loop: • wait for some event (from any source) • treat the request • Problems: • poll/select/MPI_WaitAny are not scalable • mixing them with cluster communications makes it worse • independent threads are problematic when the set of connections changes
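The scalability problem comes from select()/poll() rescanning the whole descriptor set on every call. On Linux the usual fix (not named on the slide, but in the same spirit as the completion queues discussed next) is epoll, which keeps the interest set in the kernel so the per-wait cost no longer grows with the number of idle connections. A minimal sketch:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait for readability on one fd via epoll; returns the number of
 * ready descriptors (0 on timeout), or -1 on error. The interest set
 * lives in the kernel: adding or removing a connection is a single
 * epoll_ctl() call, not a rebuild of an fd array on every wait. */
int wait_readable(int fd, int timeout_ms)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return n;
}
```

A real event loop would create the epoll instance once and wait on many descriptors; this sketch uses one fd only to keep the mechanism visible.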
An API to manage disk or TCP/IP efficiently • Problem: the old POSIX I/O API is limited: no concurrency/pipelining is possible without threads • Solution: kernel-managed completion queues (as provided by the Linux AIO project) • Functionality similar to NT completion ports or FreeBSD kqueue • Interface: • io_submit_req() => (read, write, send, recv requests) • io_getevents()
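The submit/reap pattern can be sketched against the Linux AIO kernel interface, reached through raw syscalls (glibc does not wrap them; the libaio library offers similar wrappers). One request is queued with io_submit and its completion reaped from the kernel-managed queue with io_getevents — many requests could be in flight at once, which is the concurrency POSIX read()/write() cannot express:

```c
#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

/* Thin wrappers: glibc does not expose these syscalls directly. */
static long io_setup(unsigned nr, aio_context_t *ctx)
{ return syscall(SYS_io_setup, nr, ctx); }
static long io_submit(aio_context_t ctx, long n, struct iocb **iocbs)
{ return syscall(SYS_io_submit, ctx, n, iocbs); }
static long io_getevents(aio_context_t ctx, long min, long max,
                         struct io_event *ev, struct timespec *t)
{ return syscall(SYS_io_getevents, ctx, min, max, ev, t); }
static long io_destroy(aio_context_t ctx)
{ return syscall(SYS_io_destroy, ctx); }

/* Read up to `len` bytes from `path` at offset 0 through the AIO
 * queue; returns the completion's result (bytes read) or -1. */
long aio_read_once(const char *path, char *buf, size_t len)
{
    aio_context_t ctx = 0;
    if (io_setup(8, &ctx) < 0)          /* queue of up to 8 requests */
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) { io_destroy(ctx); return -1; }

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    struct iocb *list[1] = { &cb };
    long res = -1;
    if (io_submit(ctx, 1, list) == 1) {               /* queue request */
        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)  /* reap completion */
            res = ev.res;
    }
    close(fd);
    io_destroy(ctx);
    return res;
}
```

An event loop would instead call io_getevents with min=0 or a timeout and dispatch whatever completions have accumulated, regardless of whether they came from disk or network requests.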
AIO subsystem structure: overcoming the MPI_WaitAny or select()/poll() design • Application/libraries: submit requests, consume the event queue • OS: queues completion events, driven by interrupts • Hardware
Application: an NFS replacement for clusters • Shared-library implementation on top of either GM, BIP, or TCP/IP • Uses Linux-AIO for TCP/IP and disk I/O • Server-side export: • either an in-memory filesystem, • or some local filesystem • Conceptually similar to DAFS: • adds transparent use • point-to-point design
Usual NFS architecture • Client: application -> Virtual File System -> NFS client (next to local Ext2/VFAT and their buffer cache over IDE/SCSI disks) • TCP/RPC carries the requests to the server • Server: NFS server (kernel) -> Virtual File System -> Ext2/VFAT -> buffer cache -> IDE/SCSI local disks
Our simple remote file access protocol architecture • Client: application -> VIA/BIP/GM • Server: Virtual File System -> Ext2/FAT32 -> buffer cache -> IDE/SCSI local disks
Results on Myrinet • Micro-benchmark: 100 MB copy
Conclusion • OS-bypass is not necessarily a performance optimisation; it is an architectural choice. • The APIs and subsystems for cluster network communication and for disk/TCP I/O are evolving similarly: • strongly asynchronous design • completion queues (a feature missing from MPI)