A New Approach in the System Software Design for Large-Scale Parallel Computers
Juan Fernández1,2, Eitan Frachtenberg1, Fabrizio Petrini1, Salvador Coll1 and José C. Sancho1
1Performance and Architecture Lab, CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
2Grupo de Arquitectura y Computación Paralela (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
email: {juanf,eitanf,fabrizio,scoll,jcsancho}@lanl.gov
Motivation [diagram: cluster nodes, each running a local OS] • Hardware and local OSs are glued together by System Software: • Resource Management • Communications • Parallel Development and Debugging Tools • Parallel File System • Fault Tolerance • System software is a key factor to maximize usability, performance and scalability on large-scale systems!!!
Motivation • System software complexity due to multiple factors: • Extremely complex global state • Non-deterministic behavior inherent to computing systems and parallel apps • Local OSs lack global awareness of parallel apps • Independent design of different components • User-level applications rely on system software
Outline • Motivation • Goals • Core Primitives • Resource Management • Communication Libraries • Ongoing and future work
Goals • Target • Simplifying design and implementation of the system software for large-scale parallel computers • Simplicity, performance, scalability, determinism • Approach • Built atop a basic set of three primitives • Global synchronization/scheduling • Vision • SIMD system running MIMD applications (variable granularity in the order of hundreds of μs)
Outline • Motivation • Goals • Core Primitives • Resource Management • Communication Libraries • Ongoing and future work
Core Primitives • System software built atop three primitives • Xfer-And-Signal • Transfer block of data to a set of nodes • Optionally signal local/remote event upon completion • Compare-And-Write • Compare global variable on a set of nodes • Optionally write global variable on the same set of nodes • Test-Event • Poll local event
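The three primitives above can be sketched as a tiny API. The following is a hypothetical in-memory emulation over a set of simulated nodes; on the real hardware these operations map to NIC-supported multicast, global query, and event signalling, and all class/function names and signatures here are illustrative assumptions, not the authors' actual interface.

```python
class Node:
    """One simulated cluster node: globally addressable memory + local events."""
    def __init__(self):
        self.memory = {}      # stand-in for global, virtually addressable memory
        self.events = set()   # pending local events

class Cluster:
    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]

    def xfer_and_signal(self, data, addr, dests, event=None):
        """Transfer a block of data to a set of nodes (multicast PUT);
        optionally signal an event on each destination upon completion."""
        for d in dests:
            self.nodes[d].memory[addr] = data
            if event is not None:
                self.nodes[d].events.add(event)

    def compare_and_write(self, addr, op, value, dests,
                          write_addr=None, write_value=None):
        """Compare a global variable on a set of nodes; optionally write a
        global variable on the same set when the condition holds everywhere."""
        ops = {"==": lambda a, b: a == b, ">=": lambda a, b: a >= b}
        ok = all(ops[op](self.nodes[d].memory.get(addr), value) for d in dests)
        if ok and write_addr is not None:
            for d in dests:
                self.nodes[d].memory[write_addr] = write_value
        return ok

    def test_event(self, node, event):
        """Poll a local event (non-blocking)."""
        return event in self.nodes[node].events
```

For instance, disseminating a block to all nodes and then issuing a global query is one `xfer_and_signal` followed by one `compare_and_write`, which is the pattern the resource-management and communication layers below build on.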
Core Primitives The proposed mechanisms simplify design and implementation!!!
Core Primitives • Implementation • Global, virtually addressable shared memory • Remote Direct Memory Access (RDMA) • Hardware-supported multicast • Hardware-supported global query • Computing capability in the NIC • Portability • InfiniBand, BlueGene/L, QsNET
Outline • Motivation • Goals • Core Primitives • Resource Management • Communication Libraries • Ongoing and future work
Resource Management • STORM: Scalable TOol for Resource Management [1,2] • Job launching • binary and data dissemination • actual launching of a parallel job • reporting of job termination • Job scheduling • FCFS, gang scheduling, ... [3] • new scheduling algorithms can be “plugged” • Heartbeat/strobe at regular intervals (time slices) • Monitoring • Built atop the three core primitives [1] “Scalable Resource Management in High Performance Computers.” E. Frachtenberg, J. Fernández, F. Petrini, and S. Coll. Cluster´02. [2] “STORM: Lightning-Fast Resource Management.” E. Frachtenberg, J. Fernández, F. Petrini, S. Pakin and S. Coll. SC´02. [3] “Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources.” E. Frachtenberg, D. G. Feitelson, F. Petrini and J. Fernández. IPDPS´03.
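The job-launching steps above (dissemination, launch, termination report) can be sketched in terms of the primitives from [1,2]. This is a hedged, simplified stand-in (plain Python over one dict per node), not STORM's real NIC-based implementation; `launch_job` and the key names are assumptions for the sketch.

```python
def launch_job(nodes, binary):
    """Simulated STORM-style launch over a list of per-node memory dicts."""
    # 1. Binary and data dissemination: one multicast Xfer-And-Signal
    #    that also signals a "launched" event on every node.
    for mem in nodes:
        mem["binary"] = binary
        mem["launched"] = True
    # 2. Simulated execution: each node eventually records an exit status.
    for mem in nodes:
        mem["status"] = 0          # 0 = clean termination
    # 3. Termination report: one Compare-And-Write-style global query
    #    checking "status == 0" on every node.
    return all(mem.get("status") == 0 for mem in nodes)
```

The point of the sketch is that each of the three steps costs a single collective primitive rather than a tree of point-to-point messages, which is where the scalability comes from.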
Outline • Motivation • Goals • Core Primitives • Resource Management • Communication Libraries • Ongoing and future work
Communication Libraries • BCS-MPI: Buffered Coscheduled MPI [4] • Global synchronization [5] • Heartbeat/strobe sent at regular intervals (time slices) • All system activities are tightly coupled • Global Scheduling • Exchange of communication requirements • Communication scheduling • Perform real transmission and reduce computations [6] • Implementation on the NIC (Elan3 - QsNet) • Built atop the three core primitives [4] “BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers.” J. Fernández, E. Frachtenberg, and F. Petrini. SC´03. [5] “Scalable Collective Communication on the ASCI Q Machine.” J. Fernández, E. Frachtenberg, and F. Petrini. Hot Interconnects 11. [6] “Scalable NIC-based Reduction on Large-scale Clusters.” A. Moody, J. Fernández, F. Petrini and D. K. Panda. SC´03.
Communication Libraries • BCS-MPI: real-time communication scheduling • Global Strobe (time slice starts) • Exchange of communication requirements • Global Synchronization • Communication scheduling • Global Synchronization • Real transmission • Global Strobe (time slice ends) • Time Slice (hundreds of μs)
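The strobe-to-strobe interval above can be reduced to two pure functions over pending message descriptors. This is an illustrative reconstruction, not the paper's exact algorithm: the scheduling rule here (greedily admit descriptors until a per-slice byte budget is exhausted, defer the rest) and all names are assumptions for the sketch.

```python
def schedule_slice(pending, budget_bytes):
    """Communication scheduling phase: pick the descriptors that fit this slice."""
    chosen, used = [], 0
    for desc in pending:              # descriptors already exchanged globally
        if used + desc["size"] <= budget_bytes:
            chosen.append(desc)
            used += desc["size"]
    return chosen

def run_time_slice(pending, budget_bytes):
    """One strobe-to-strobe interval: schedule, transmit, defer the remainder."""
    chosen = schedule_slice(pending, budget_bytes)
    transmitted = [d for d in pending if d in chosen]    # real transmission phase
    remaining = [d for d in pending if d not in chosen]  # carried to next slice
    return transmitted, remaining
```

Because every node sees the same exchanged descriptors and runs the same deterministic rule, all nodes reach the same schedule without further negotiation, which is what lets the transmission phase proceed in lockstep.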
Ongoing and future work • Improved system utilization • Scheduling multiple jobs • QoS for different types of traffic • Scheduling messages may provide traffic segregation • Transparent fault tolerance [7] • BCS MPI simplifies the state of the machine • Kernel-level implementation of BCS-MPI • User-level solution is already working • Deterministic replay of MPI programs • Ordered resource scheduling may enforce reproducibility [7] “On the Feasibility of Incremental Checkpointing for Scientific Computing.” J. C. Sancho, F. Petrini, G. Johnson, J. Fernández and E. Frachtenberg. IPDPS´04.
Motivation Growing gap between workstation and cluster usability!!!
Motivation • System software complexity due to multiple factors: • Extremely complex global state • Thousands of processes, threads, open files, pending messages, etc. • Non-deterministic behavior • Inherent to computing systems (e.g., OS process scheduling) • Induced by parallel applications (e.g., MPI_ANY_SOURCE) • Local OSs lack global awareness of parallel applications • Interference with fine-grain synchronization operations • Non-scalable collective communication primitives • Independent design of different components • Redundancy of functionality (e.g., communication protocols) • Missing functionality (e.g., QoS for user-level vs. system-level traffic) • User-level applications rely on system software • System software performance/scalability impacts user-application performance/scalability
Resource Management • Job Launching STORM is 40 times faster than the best reported result!!!
Resource Management • Job Scheduling STORM is able to use very small time slices: RESPONSIVENESS !!!
Communication Libraries • Non-blocking primitives: MPI_Isend/Irecv
Communication Libraries • Blocking primitives: MPI_Send/Recv
Communication Libraries • Global Synchronization Protocol • Global Message Scheduling Phase • Microphases: Descriptor Exchange + Message Scheduling • Message Transmission Phase: • Microphases: Point-to-point, Barrier and Broadcast, Reduce
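The Descriptor Exchange microphase lets every node match posted sends against posted receives before any byte moves, so the transmission microphases carry only pre-agreed messages. The matching rule below (first unmatched receive on the destination with the same tag wins) is a speculative simplification for illustration, not BCS-MPI's actual matching logic.

```python
def match_descriptors(sends, recvs):
    """Pair each send descriptor (src, dst, tag) with one unmatched
    receive descriptor (node, tag) posted on the destination node."""
    matched, unmatched = [], list(recvs)
    for s in sends:
        for r in unmatched:
            if r["node"] == s["dst"] and r["tag"] == s["tag"]:
                matched.append((s, r))
                unmatched.remove(r)   # each receive satisfies one send
                break
    return matched
```

Since the exchanged descriptors are globally visible, every node computes the same matching deterministically, which is one reason the approach also makes deterministic replay (mentioned under future work) plausible.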
Communication Libraries • SAGE with timing.input (IA32): 0.5% SPEEDUP!!!