A New Approach in the System Software Design for Large-Scale Parallel Computers
Juan Fernández1,2, Eitan Frachtenberg1, Fabrizio Petrini1, Salvador Coll1 and José C. Sancho1
1Performance and Architecture Lab, CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
2Grupo de Arquitectura y Computación Paralela (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
email: {juanf,eitanf,fabrizio,scoll,jcsancho}@lanl.gov
Motivation [diagram: cluster nodes, each running a local OS] • Hardware and local OSs are glued together by System Software: • Resource Management • Communications • Parallel Development and Debugging Tools • Parallel File System • Fault Tolerance • System software is a key factor to maximize usability, performance and scalability on large-scale systems!!!
Motivation • System software complexity due to multiple factors: • Extremely complex global state • Non-deterministic behavior inherent to computing systems and parallel apps • Local OSs lack global awareness of parallel apps • Independent design of different components • User-level applications rely on system software
Outline • Motivation • Goals • Core Primitives • Resource Management • Communication Libraries • Ongoing and future work
Goals • Target • Simplifying design and implementation of the system software for large-scale parallel computers • Simplicity, performance, scalability, determinism • Approach • Built atop a basic set of three primitives • Global synchronization/scheduling • Vision • SIMD system running MIMD applications (variable granularity in the order of hundreds of μs)
Outline • Motivation • Goals • Core Primitives • Resource Management • Communication Libraries • Ongoing and future work
Core Primitives • System software built atop three primitives • Xfer-And-Signal • Transfer block of data to a set of nodes • Optionally signal local/remote event upon completion • Compare-And-Write • Compare global variable on a set of nodes • Optionally write global variable on the same set of nodes • Test-Event • Poll local event
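The three primitives above can be sketched as a tiny API. The following is a hypothetical in-memory emulation over a set of simulated nodes; on the real hardware these operations map to NIC-supported multicast, global query, and event signalling, and all class/function names and signatures here are illustrative assumptions, not the authors' actual interface.

```python
class Node:
    """One simulated cluster node: globally addressable memory + local events."""
    def __init__(self):
        self.memory = {}      # stand-in for global, virtually addressable memory
        self.events = set()   # pending local events

class Cluster:
    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]

    def xfer_and_signal(self, data, addr, dests, event=None):
        """Transfer a block of data to a set of nodes (multicast PUT);
        optionally signal an event on each destination upon completion."""
        for d in dests:
            self.nodes[d].memory[addr] = data
            if event is not None:
                self.nodes[d].events.add(event)

    def compare_and_write(self, addr, op, value, dests,
                          write_addr=None, write_value=None):
        """Compare a global variable on a set of nodes; optionally write a
        global variable on the same set when the condition holds everywhere."""
        ops = {"==": lambda a, b: a == b, ">=": lambda a, b: a >= b}
        ok = all(ops[op](self.nodes[d].memory.get(addr), value) for d in dests)
        if ok and write_addr is not None:
            for d in dests:
                self.nodes[d].memory[write_addr] = write_value
        return ok

    def test_event(self, node, event):
        """Poll a local event (non-blocking)."""
        return event in self.nodes[node].events
```

For instance, disseminating a block to all nodes and then issuing a global query is one `xfer_and_signal` followed by one `compare_and_write`, which is the pattern the resource-management and communication layers below build on.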
Core Primitives The proposed mechanisms simplify design and implementation!!!
Core Primitives • Implementation • Global, virtually addressable shared memory • Remote Direct Memory Access (RDMA) • Hardware-supported multicast • Hardware-supported global query • Computing capability in the NIC • Portability • InfiniBand, BlueGene/L, QsNET
Outline • Motivation • Goals • Core Primitives • Resource Management • Communication Libraries • Ongoing and future work
Resource Management • STORM: Scalable TOol for Resource Management [1,2] • Job launching • binary and data dissemination • actual launching of a parallel job • reporting of job termination • Job scheduling • FCFS, gang scheduling, ... [3] • new scheduling algorithms can be “plugged” • Heartbeat/strobe at regular intervals (time slices) • Monitoring • Built atop the three core primitives [1] “Scalable Resource Management in High Performance Computers.” E. Frachtenberg, J. Fernández, F. Petrini, and S. Coll. Cluster´02. [2] “STORM: Lightning-Fast Resource Management.” E. Frachtenberg, J. Fernández, F. Petrini, S. Pakin and S. Coll. SC´02. [3] “Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources.” E. Frachtenberg, D. G. Feitelson, F. Petrini and J. Fernández. IPDPS´03.
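The job-launching steps above (dissemination, launch, termination report) can be sketched in terms of the primitives from [1,2]. This is a hedged, simplified stand-in (plain Python over one dict per node), not STORM's real NIC-based implementation; `launch_job` and the key names are assumptions for the sketch.

```python
def launch_job(nodes, binary):
    """Simulated STORM-style launch over a list of per-node memory dicts."""
    # 1. Binary and data dissemination: one multicast Xfer-And-Signal
    #    that also signals a "launched" event on every node.
    for mem in nodes:
        mem["binary"] = binary
        mem["launched"] = True
    # 2. Simulated execution: each node eventually records an exit status.
    for mem in nodes:
        mem["status"] = 0          # 0 = clean termination
    # 3. Termination report: one Compare-And-Write-style global query
    #    checking "status == 0" on every node.
    return all(mem.get("status") == 0 for mem in nodes)
```

The point of the sketch is that each of the three steps costs a single collective primitive rather than a tree of point-to-point messages, which is where the scalability comes from.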
Outline • Motivation • Goals • Core Primitives • Resource Management • Communication Libraries • Ongoing and future work
Communication Libraries • BCS-MPI: Buffered Coscheduled MPI [4] • Global synchronization [5] • Heartbeat/strobe sent at regular intervals (time slices) • All system activities are tightly coupled • Global Scheduling • Exchange of communication requirements • Communication scheduling • Perform real transmission and reduce computations [6] • Implementation on the NIC (Elan3 - QsNet) • Built atop the three core primitives [4] “BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers.” J. Fernández, E. Frachtenberg, and F. Petrini. SC´03. [5] “Scalable Collective Communication on the ASCI Q Machine.” J. Fernández, E. Frachtenberg, and F. Petrini. Hot Interconnects 11. [6] “Scalable NIC-based Reduction on Large-scale Clusters.” A. Moody, J. Fernández, F. Petrini and D. K. Panda. SC´03.
Communication Libraries • BCS-MPI: real-time communication scheduling • Global Strobe (time slice starts) • Exchange of communication requirements • Global Synchronization • Communication scheduling • Global Synchronization • Real transmission • Global Strobe (time slice ends) • Time Slice (hundreds of μs)
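The strobe-to-strobe interval above can be reduced to two pure functions over pending message descriptors. This is an illustrative reconstruction, not the paper's exact algorithm: the scheduling rule here (greedily admit descriptors until a per-slice byte budget is exhausted, defer the rest) and all names are assumptions for the sketch.

```python
def schedule_slice(pending, budget_bytes):
    """Communication scheduling phase: pick the descriptors that fit this slice."""
    chosen, used = [], 0
    for desc in pending:              # descriptors already exchanged globally
        if used + desc["size"] <= budget_bytes:
            chosen.append(desc)
            used += desc["size"]
    return chosen

def run_time_slice(pending, budget_bytes):
    """One strobe-to-strobe interval: schedule, transmit, defer the remainder."""
    chosen = schedule_slice(pending, budget_bytes)
    transmitted = [d for d in pending if d in chosen]    # real transmission phase
    remaining = [d for d in pending if d not in chosen]  # carried to next slice
    return transmitted, remaining
```

Because every node sees the same exchanged descriptors and runs the same deterministic rule, all nodes reach the same schedule without further negotiation, which is what lets the transmission phase proceed in lockstep.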
Ongoing and future work • Improved system utilization • Scheduling multiple jobs • QoS for different types of traffic • Scheduling messages may provide traffic segregation • Transparent fault tolerance [7] • BCS MPI simplifies the state of the machine • Kernel-level implementation of BCS-MPI • User-level solution is already working • Deterministic replay of MPI programs • Ordered resource scheduling may enforce reproducibility [7] “On the Feasibility of Incremental Checkpointing for Scientific Computing.” J. C. Sancho, F. Petrini, G. Johnson, J. Fernández and E. Frachtenberg. IPDPS´04.
Motivation Growing gap between workstation and cluster usability!!!
Motivation • System software complexity due to multiple factors: • Extremely complex global state • Thousands of processes, threads, open files, pending messages, etc. • Non-deterministic behavior • Inherent to computing systems (e.g., OS process scheduling) • Induced by parallel applications (e.g., MPI_ANY_SOURCE) • Local OSs lack global awareness of parallel applications • Interference with fine-grain synchronization operations • Non-scalable collective communication primitives • Independent design of different components • Redundancy of functionality (e.g., communication protocols) • Missing functionality (e.g., QoS for user-level vs. system-level traffic) • User-level applications rely on system software • System software performance/scalability impacts user-application performance/scalability
Resource Management • Job Launching STORM is 40 times faster than the best reported result!!!
Resource Management • Job Scheduling STORM is able to use very small time slices: RESPONSIVENESS !!!
Communication Libraries • Non-blocking primitives: MPI_Isend/Irecv
Communication Libraries • Blocking primitives: MPI_Send/Recv
Communication Libraries • Global Synchronization Protocol • Global Message Scheduling Phase • Microphases: Descriptor Exchange + Message Scheduling • Message Transmission Phase: • Microphases: Point-to-point, Barrier and Broadcast, Reduce
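The Descriptor Exchange microphase lets every node match posted sends against posted receives before any byte moves, so the transmission microphases carry only pre-agreed messages. The matching rule below (first unmatched receive on the destination with the same tag wins) is a speculative simplification for illustration, not BCS-MPI's actual matching logic.

```python
def match_descriptors(sends, recvs):
    """Pair each send descriptor (src, dst, tag) with one unmatched
    receive descriptor (node, tag) posted on the destination node."""
    matched, unmatched = [], list(recvs)
    for s in sends:
        for r in unmatched:
            if r["node"] == s["dst"] and r["tag"] == s["tag"]:
                matched.append((s, r))
                unmatched.remove(r)   # each receive satisfies one send
                break
    return matched
```

Since the exchanged descriptors are globally visible, every node computes the same matching deterministically, which is one reason the approach also makes deterministic replay (mentioned under future work) plausible.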
Communication Libraries • SAGE with timing.input (IA32): 0.5% SPEEDUP!!!