Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems Kei Davis and Fabrizio Petrini {kei,fabrizio}@lanl.gov Performance and Architectures Lab (PAL), CCS-3
Schedule • 09.00-09.45 Introduction • 09.45-10.00 Break • 10.00-10.45 Existing Systems • 10.45-11.00 Break • 11.00-11.45 Case Study • 11.45-12.00 Break • 12.00-12.45 A New Approach
Part 1: Introduction • The need for more capability • The big issues • A taxonomy of systems in three dimensions
The Need for More Capability "The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible." (Charles Babbage, 1791-1871) Our interest is in scientific computing—large-scale, numerical, parallel applications run on large-scale parallel machines.
Definitions • Computing capacity: total deliverable computing power from a system or set of systems. (Power—rate of delivery) • Computing capability: computing power available to a single application. Highest-end computing is primarily concerned with capability—why else build such machines?
The Need for Large-Scale Parallel Machines • It is the insatiable demand for ever more computational capability that has driven the creation of many Tflop-scale parallel machines (Earth Simulator, LANL’s ASCI Q, LLNL’s Thunder and BlueGene/L, etc.) • Petaflop machines are on the horizon, for example DARPA HPCS program (High Productivity Computing Systems)
One-upmanship? Is this merely one-upmanship with the Japanese? From The Roadmap for the Revitalization of High-End Computing, Computing Research Association: […] there is a growing recognition that a new set of scientific and engineering discoveries could be catalyzed by access to very-large-scale computer systems—those in the 100 teraflop to petaflop range.
Requirements for ASCI In our own arena (Advanced Simulation and Computing (ASC) for stockpile stewardship; climate, ocean, and urban infrastructure modeling; etc.), estimates of the demand for Capability and general physics arguments indicate that within 10 years a machine of 1000 TF = 1 PetaFlop (PF) will be needed to execute the most demanding jobs. Such demand is inevitable; it should not be viewed, however, as some plateau in required Capability: there are sound technical reasons to expect even greater Capability demand in the future.
Large Component Count • Increases in performance will be achieved through single-processor improvements and increases in component count • For example, BlueGene/L will have 133,120 processors and 608,256 memory modules • The large component count will make any assumption of complete reliability unrealistic
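A back-of-the-envelope sketch (not from the slides) makes the reliability point concrete: assuming independent failures, a system of N components each with mean time between failures (MTBF) M fails on average every M/N hours. The per-component MTBF below is an assumed, illustrative figure; only the BlueGene/L component counts come from the slide.

```c
/* Sketch: system MTBF under an independent-failure assumption.
 * The component MTBF is an assumed, illustrative figure. */
#include <stdio.h>

int main(void) {
    const double component_mtbf_hours = 1.0e6;       /* assumed: 1M hours per part */
    const long   components = 133120L + 608256L;     /* BlueGene/L CPUs + DRAM modules */

    double system_mtbf_hours = component_mtbf_hours / (double)components;
    printf("System MTBF: %.1f hours (%.0f minutes)\n",
           system_mtbf_hours, system_mtbf_hours * 60.0);
    return 0;
}
```

Even with very reliable parts, the machine as a whole can expect a failure every hour or two, which is why complete reliability cannot be assumed.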
Sensitivity to Failures • In a large-scale machine a failure of a single component usually causes a significant fraction of the system to fail because • Components are strongly coupled (e.g., a failure of a fan will lead to other failures due to overheating) • The state of the application is not stored redundantly, and loss of any state is catastrophic • In capability mode, many processing nodes are running the same application, and are tightly coupled together
The Need for Transparent Fault-Tolerance • System software must be resilient to failures, to allow continued execution of applications in the presence of failures • Most of the investment is in the application software ($250M/year for MPI software in the ASCI Tri-Labs) • Economic constraints impose a limited level of redundancy • Other considerations include cost of development, scalability, and efficiency
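One standard way to reason about the efficiency cost of fault tolerance (not discussed on this slide) is the checkpoint/restart model: Young's approximation gives the checkpoint interval that minimizes lost work as tau ≈ sqrt(2·C·MTBF), where C is the time to write one checkpoint. The figures in this sketch are illustrative assumptions, not measurements.

```c
/* Sketch: Young's approximation for the optimal checkpoint interval.
 * All figures are assumed for illustration. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double checkpoint_cost_s = 600.0;         /* assumed: 10 minutes to dump state */
    const double system_mtbf_s     = 4.0 * 3600.0;  /* assumed: a failure every 4 hours */

    double interval_s = sqrt(2.0 * checkpoint_cost_s * system_mtbf_s);
    /* Rough fraction of machine time spent writing checkpoints (ignoring restarts). */
    double overhead = checkpoint_cost_s / interval_s;

    printf("Checkpoint every %.0f s, ~%.0f%% of time spent checkpointing\n",
           interval_s, overhead * 100.0);
    return 0;
}
```

As the system MTBF shrinks with component count, this overhead grows, which is the economic pressure behind efficient, transparent fault-tolerance mechanisms.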
The JASONs Report • A recent report from the JASONs, a committee of distinguished scientists chartered by the US government, raised the sensitive question of whether ASCI machines can be used as capability engines • For that to be possible, major advances in fault tolerance are needed • The report's recommendation is to skip one generation of supercomputers, owing to the lack of good technical/scientific solutions
State of the Art in Large-Scale Supercomputers • We can assemble large-scale systems by wiring together hardware and “bolting together” software components • But we have almost no control over the machine: not only faults but also performance anomalies
1.2 The Big Issues From DoE Office of Science By the end of this decade petascale computers with thousands of times more computational power than any in current use will be vital tools for expanding the frontiers of science and for addressing vital national priorities. These systems will have tens to hundreds of thousands of processors, an unprecedented level of complexity, and will require significant new levels of scalability and fault management. [Emphasis added]
Office of Science cont’d Current and future large-scale parallel systems require that such services be implemented in a fast and scalable manner so that the OS/R does not become a performance bottleneck. Without reliable, robust operating systems and runtime environments the computational science research community will be unable to easily and completely employ future generations of extreme-scale systems for scientific discovery.
DARPA The Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) mission: Provide economically viable high productivity systems for the national security and industrial user communities with the following design attributes in the latter part of this decade: • Performance • Programmability • Portability • Robustness
Our Translation • Performance—achieving achievable performance (not, e.g., some percentage of theoretical peak) • Programmability/portability—standard interfaces, transparency of mechanisms for fault tolerance • Robustness—graceful failover
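To make "achievable performance" concrete, one common device (not on the slide) is a roofline-style bound: the attainable rate of a kernel is limited by the smaller of the compute peak and memory bandwidth times arithmetic intensity. All hardware and application figures below are assumed for illustration; the point is that a low fraction of theoretical peak can still be a high fraction of what is actually achievable.

```c
/* Sketch: fraction of theoretical peak vs. fraction of a roofline-style
 * achievable bound. All figures are assumed for illustration. */
#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }

int main(void) {
    const double peak_gflops    = 5.6;   /* assumed: theoretical peak of one node, GFLOP/s */
    const double mem_bw_gbs     = 6.4;   /* assumed: sustained memory bandwidth, GB/s */
    const double flops_per_byte = 0.25;  /* assumed: arithmetic intensity of the kernel */

    double achievable = min2(peak_gflops, mem_bw_gbs * flops_per_byte);
    const double measured = 1.4;         /* assumed: measured application rate, GFLOP/s */

    printf("%.0f%% of theoretical peak, but %.0f%% of the achievable bound\n",
           100.0 * measured / peak_gflops, 100.0 * measured / achievable);
    return 0;
}
```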
1.3 A Taxonomy of Systems Q: Is it a supercomputer or just a cluster? A: It is a continuum along multiple dimensions. A taxonomy of systems in three dimensions: • Degree of integration of the compute node; • Collective primitives provided by the network interface, programmability, global address space; • Degree of integration of system software.
Note This taxonomy is useful for our explication, but we make no claims that it • is canonical, or • captures highly specialized architectures (for example, custom-designed special-purpose digital processors, vector processors, floating-point processors). We are concerned with the big 'general purpose' parallel machines.
Compute Node Degree of integration of compute node between processors, memory, and network interface • Single processor—SMP—multiple CPU cores per chip • Number of levels of cache, proximity of caches to CPU core • Proximity of network interface to CPU core: on-chip—off-chip direct connection—separated by I/O interface
Network Interface • Collective primitives provided by network interface: none—functionally rich; • Programmability of network interface: none—general purpose • Provision of virtual global address space
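A minimal MPI example (not from the slides) of the kind of collective primitive a functionally rich network interface can offload: a global reduction issued as a single call, which on suitable hardware can be executed in the network rather than by host software.

```c
/* Sketch: a global sum across all MPI processes, the sort of collective
 * that a capable network interface can execute directly. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* each process contributes its rank */
    double global = 0.0;

    /* One collective call; the reduction may be offloaded to the NIC. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of ranks: %.0f\n", global);

    MPI_Finalize();
    return 0;
}
```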
System software • Degree of integration of system software much more about this later…