Scalable Fault Tolerance: Xen Virtualization for PGAS Models on High-Performance Networks

Daniele Scarpazza, Oreste Villa, Fabrizio Petrini, Jarek Nieplocha, Vinod Tipparaju, Manoj Krishnan (Pacific Northwest National Laboratory)
Radu Teodorescu, Jun Nakano, Josep Torrellas (University of Illinois)
Duncan Roweth (Quadrics)
In collaboration with Patrick Mullaney (Novell) and Wayne Augsburger (Mellanox)

Project Motivation • Component count in high-end systems has been growing • How do we utilize large (10^3 to 10^5 processor) systems for solving complex science problems? • Fundamental problems • Scalability to massive processor counts • Application performance on a single processor given the increasingly complex memory hierarchy • Hardware and software failures • [Figure: MTBF as a function of system size]

Multiple FT Techniques • Application drivers • Multidisciplinary, multiresolution, and multiscale nature of scientific problems drive the demand for high end systems • Applications place increasingly differing demands on the system resources: disk, network, memory, and CPU • Some of them have natural fault resiliency and require very little support • System drivers • Different I/O configurations, programmable or simple/commodity NICs, proprietary/custom/commodity operating systems • Tradeoffs between acceptable rates of failure and cost • Cost effectiveness is the main constraint in HPC • Therefore, it is not cost-effective or practical to rely on a single fault tolerance approach for all applications and systems

Key Elements of SFT • BCS: Buffered Coscheduling provides global coordination of system activities, communication, and checkpoint/restart (CR) • ReVive (I/O): ReVive and ReVive I/O provide efficient CR capability for shared-memory servers (cluster nodes) • IBA/QsNET: virtualization of high-performance network interfaces and protocols • FT ARMCI: fault-tolerance module for the ARMCI runtime system • XEN: hypervisor to enable virtualization of the compute node environment, including the OS (external dependency) • [Diagram label: Focus of the talk]

Transparent System-level CR of PGAS Applications on Infiniband and QsNET • We explored a new approach to cluster fault tolerance by integrating Xen with the latest generations of Infiniband and Quadrics high-performance networks • Focus on Partitioned Global Address Space (PGAS) programming models • Most existing work has focused on MPI • Design goals • Low overhead • Transparent migration

Main Contributions • Integration of Xen and Infiniband • Enhanced Xen’s kernel modules to fully support user-level Infiniband protocols and IP over IB with minimal overhead • Support for Partitioned Global Address Space (PGAS) programming models • Emphasis on ARMCI • Automatic detection of a Global Recovery Line and coordinated migration • Performs a live migration without any change to user applications • Experimental evaluation

Xen Hypervisor • On each machine, Xen allows the creation of a privileged virtual machine (Dom0) and one or more non-privileged VMs (DomUs) • Xen provides the ability to pause, un-pause, checkpoint, and resume DomUs • Xen employs para-virtualization • Non-privileged domains run a modified operating system featuring guest device drivers • Their requests are forwarded to the native device driver in Dom0 using a split driver model
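The pause, un-pause, checkpoint, and resume operations listed on this slide can be pictured as a small state machine over a DomU. The sketch below is a toy model only: the `DomU` class, its `State` values, and the dictionary standing in for guest memory are all hypothetical names introduced for illustration, and none of this is the Xen API.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    SAVED = "saved"  # image written out; the domain is no longer resident


class DomU:
    """Toy model of a non-privileged domain (hypothetical, not the Xen API)."""

    def __init__(self, name):
        self.name = name
        self.state = State.RUNNING
        self.memory = {}  # stand-in for the guest memory image

    def _require(self, *allowed):
        if self.state not in allowed:
            raise RuntimeError(f"{self.name}: not allowed while {self.state.value}")

    def pause(self):
        self._require(State.RUNNING)
        self.state = State.PAUSED

    def unpause(self):
        self._require(State.PAUSED)
        self.state = State.RUNNING

    def checkpoint(self):
        """Return a copy of the guest image and mark the domain as saved."""
        self._require(State.RUNNING, State.PAUSED)
        image = dict(self.memory)
        self.state = State.SAVED
        return image

    @classmethod
    def resume(cls, name, image):
        """Create a running domain from a previously saved image."""
        dom = cls(name)
        dom.memory = dict(image)
        return dom


# Usage: pause a domain, checkpoint it, and resume it from the saved image.
dom = DomU("compute0")
dom.memory["x"] = 42
dom.pause()
image = dom.checkpoint()
restored = DomU.resume("compute0", image)
```

In this model a resumed domain could just as well be created on a different host, which is the property that migration relies on.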

Infiniband Device Driver • The driver is implemented in two sections • A paravirtualized section for slow-path control operations (e.g., queue pair creation) • A direct-access section for fast-path data operations (transmit/receive) • Based on the Ohio State/IBM implementation • Driver was extended to support additional CPU architectures and Infiniband adapters • Added a proxy layer to allow subnet and connection management from guest VMs • Suspend/resume is propagated to the applications, not only to kernel modules • Several stability improvements

Software Stack

Xen/Infiniband Device Driver: Communication Performance

Parallel Programming Models • Single Threaded • Data Parallel, e.g. HPF • Multiple Processes • Partitioned-Local Data Access: MPI • Uniform-Global-Shared Data Access: OpenMP • Partitioned-Global-Shared Data Access: Co-Array Fortran • Uniform-Global-Shared + Partitioned Data Access: UPC, Global Arrays, X10

Fault Tolerance in PGAS Models • Implementation considerations • 1-sided communication, perhaps some 2-sided and collectives • Special considerations in the implementation of a global recovery line • Memory operations need to be synchronized for checkpoint/restart • Memory is a combination of local and global (globally visible) • Global memory could be shared from the OS view • Pinned and registered with the network adapter • [Figure: SMP node 0 through SMP node n connected by a network; each node is drawn with blocks labeled P, L, and G]

Xen-enabled ARMCI • Runtime system for one-sided communication • Global Arrays, Rice Co-Array Fortran, GPSHMEM, IBM X10 port under way • Portable high-performance remote memory copy interface • Asynchronous remote memory access (RMA) • Fast collective operations • Zero-copy protocols, explicit NIC support • “Pure” non-blocking communication, 99.9% overlap • Data locality • Shared memory within an SMP node and RMA across nodes • High performance delivered on a wide range of platforms • Multi-protocol and multi-method implementation • [Figure: Fundamental communication models in HPC: message passing (2-sided model, P0 sends and P1 receives), remote memory access or RMA (1-sided model, P0 puts to P1), and shared-memory load/stores (0-sided model, A=B)] • [Figure: Examples of data transfers optimized in ARMCI]
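The 2-sided, 1-sided, and 0-sided models named on this slide differ in how many processes must take part in a transfer. The following is a minimal single-process simulation of that difference, not ARMCI code: `Process`, `send`, `receive`, and `put` are hypothetical helpers, and a plain dictionary stands in for each process's memory.

```python
import queue


class Process:
    """Hypothetical stand-in for one process with private memory."""

    def __init__(self, rank):
        self.rank = rank
        self.memory = {}
        self.inbox = queue.Queue()  # used only by the 2-sided model


# 2-sided model: the sender and the receiver both make a call.
def send(src, name, dst):
    dst.inbox.put(src.memory[name])


def receive(proc, name):
    proc.memory[name] = proc.inbox.get()


# 1-sided model: only the initiator makes a call; the target stays passive.
def put(src, src_name, dst, dst_name):
    dst.memory[dst_name] = src.memory[src_name]


# Message passing: P0 sends B, P1 receives into A.
p0, p1 = Process(0), Process(1)
p0.memory["B"] = 7
send(p0, "B", p1)
receive(p1, "A")

# Remote memory access: P0 puts B into A on P1 without any call on P1.
q0, q1 = Process(0), Process(1)
q0.memory["B"] = 8
put(q0, "B", q1, "A")

# 0-sided model: both processes see the same segment, so A = B is a plain store.
shared = {"B": 9}
r0_view = r1_view = shared
r1_view["A"] = r0_view["B"]
```

The 1-sided case is the one that matters most for checkpointing: because the target never posts a matching call, it cannot tell from its own state whether a remote write is still in flight.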

Global Recovery Lines (GRLs) • A GRL is required before each checkpoint / migration • A GRL is required for Infiniband networks because • IBA does not allow location-independent layer 2 and 3 addresses • IBA hardware maintains stateful connections not accessible by software • The protocol that enforces a GRL has • A drain phase, which completes any ongoing communication • Followed by a global silence, where it is possible to perform node migration • And a resume phase, where the processing nodes acquire knowledge of the new network topology
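The three phases of the GRL protocol (drain, global silence, resume) can be sketched as a coordinator loop over the nodes. This is an illustrative model under stated assumptions, not the authors' implementation: `Node`, `global_recovery_line`, the `pending` list standing in for outstanding network operations, and the host names are all hypothetical.

```python
class Node:
    """Hypothetical processing node taking part in a GRL."""

    def __init__(self, rank, address):
        self.rank = rank
        self.address = address
        self.pending = []   # stand-in for outstanding communication
        self.peers = {}     # rank -> address, this node's view of the topology
        self.quiet = False

    def drain(self):
        # Complete any ongoing communication, then stop issuing new operations.
        while self.pending:
            self.pending.pop()  # stand-in for waiting on a completion
        self.quiet = True

    def resume(self, topology):
        # Learn the new network topology and start communicating again.
        self.peers = dict(topology)
        self.quiet = False


def global_recovery_line(nodes, migrate):
    # Phase 1: drain every node.
    for node in nodes:
        node.drain()

    # Phase 2: global silence; only now is it safe to checkpoint or migrate.
    if any(node.pending or not node.quiet for node in nodes):
        raise RuntimeError("network is not silent")
    migrate(nodes)

    # Phase 3: resume with the topology as it stands after migration.
    topology = {node.rank: node.address for node in nodes}
    for node in nodes:
        node.resume(topology)
    return topology


# Usage: rank 1 moves from hostB to hostC during the silent window.
nodes = [Node(0, "hostA"), Node(1, "hostB")]
nodes[0].pending.append("put to rank 1")


def move_rank1(all_nodes):
    all_nodes[1].address = "hostC"


topology = global_recovery_line(nodes, move_rank1)
```

Rebuilding the peer table in the resume phase reflects the reasons given on the slide: addresses are not location independent and connection state lives in hardware, so connections have to be re-established after a node moves.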

GRL and Resource Management

Experimental Evaluation • Our experimental testbed is a cluster of 8 Dell PowerEdge 1950 servers • Each node has two dual-core Intel Xeon Woodcrest processors running at 3.0 GHz, with 8 GB of memory • The cluster is interconnected by Mellanox Infinihost III 4X HCA adapters • SUSE Linux Enterprise Server 10.0 • Xen 3.02

Timing of a GRL

Scalability of Network Drain

Scalability of Network Resume

Save and Restore Latencies

Migration Latencies

Conclusion • We have presented a novel software infrastructure that allows completely transparent checkpoint/restart • We have implemented a device driver that enhances the existing Xen/Infiniband drivers • Support for PGAS programming models • Minimal overhead, tens of milliseconds • Most of the time is spent saving/restoring the node image
