Programming Environment and Performance Modeling for million-processor machines

Programming Environment and Performance Modeling for million-processor machines Laxmikant (Sanjay) Kale Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign Http://charm.Cs.uiuc.edu PPL-Dept of Computer Science, UIUC

Context: Group Mission and Approach • To enhance Performance and Productivity in programming complex parallel applications • Performance: scalable to very large number of processors • Productivity: of human programmers • Complex: irregular structure, dynamic variations • Approach: Application Oriented yet CS centered research • Develop enabling technology, for a wide collection of apps. • Develop, use and test it in the context of real applications • Develop standard library of reusable parallel components PPL-Dept of Computer Science, UIUC

Project Objective and Overview • Focus on extremely large parallel machines • Exemplified by Blue Gene/Cyclops • Issues: • Programming Environment: • Objects, threads, compiler support • Runtime performance adaptation • Performance modeling • Coarse grained models • Fine grained models • Hybrid • Applications: • Unstructured Meshes (FEM/Crack Propagation), .. David Padua Sanjay Kale Sarita Adve Phillipe Geubelle PPL-Dept of Computer Science, UIUC

Project Objective and Overview • Focus on extremely large parallel machines • Exemplified by Blue Gene/Cyclops • Issues: • Programming Environment • Runtime performance adaptation • Performance modeling • Coarse grained models • Fine grained models • Hybrid • Applications: • Unstructured Meshes (FEM/Crack Propagation), .. David Padua Sanjay Kale Sarita Adve Phillipe Geubelle PPL-Dept of Computer Science, UIUC

Multi-partition Decomposition • Idea: divide the computation into a large number of pieces • Independent of number of processors • Typically larger than number of processors • Let the system map entities to processors • Optimal division of labor between “system” and programmer: • Decomposition done by programmer, • Everything else automated PPL-Dept of Computer Science, UIUC

Object-based Parallelization User is only concerned with interaction between objects System implementation User View PPL-Dept of Computer Science, UIUC

Charm++ • Parallel C++ with Data Driven Objects • Object Arrays/ Object Collections • Object Groups: • Global object with a “representative” on each PE • Asynchronous method invocation • Prioritized scheduling • Information sharing abstractions: readonly, tables,.. • Mature, robust, portable • http://charm.cs.uiuc.edu PPL-Dept of Computer Science, UIUC

Data driven execution Scheduler Scheduler Message Q Message Q PPL-Dept of Computer Science, UIUC

Load Balancing Framework • Based on object migration • Partitions implemented as objects (or threads) are mapped to available processors by LB framework • Measurement based load balancers: • Principle of persistence • Computational loads and communication patterns • Runtime system measures actual computation times of every partition, as well as communication patterns • Variety of “plug-in” LB strategies available • Scalable to a few thousand processors • Including those for situations when principle of persistence does not apply PPL-Dept of Computer Science, UIUC

Building on Object-based Parallelism • Application induced load imbalances • Environment induced performance issues: • Dealing with extraneous loads on shared m/cs • Vacating workstations • Heterogeneous clusters • Shrinking and Expanding jobs to available Pes • Object “migration”: novel uses • Automatic checkpointing • Automatic prefetching for out-of-core execution • Reuse: object based components PPL-Dept of Computer Science, UIUC

Applications • Charm++ developed in the context of real applications • Current applications we are involved with: • Molecular dynamics • Crack propagation • Rocket simulation: fluid dynamics + structures + • QM/MM: Material properties via quantum mech • Cosmology simulations: parallel analysis+viz • Cosmology: gravitational with multiple timestepping PPL-Dept of Computer Science, UIUC

Molecular Dynamics • Collection of [charged] atoms, with bonds • Newtonian mechanics • At each time-step • Calculate forces on each atom • Bonds: • Non-bonded: electrostatic and van der Waal’s • Calculate velocities and advance positions • 1 femtosecond time-step, millions needed! • Thousands of atoms (1,000 - 100,000) PPL-Dept of Computer Science, UIUC

Object Based Parallelization for MD PPL-Dept of Computer Science, UIUC

Performance Data: SC2000 PPL-Dept of Computer Science, UIUC

Charm++ Is a Good Match for M-PIM • Encapsulation : objects • Cost model: • Object data, read-only data, remote data • Migration and resource management: automatic • One sided communication: since the beginning • Asynchronous global operations (reductions, ..) • Modularity: • see 1996 paper for why DD Objects enable modularity • Acceptability: • C++ • Now also: AMPI on top of charm++ PPL-Dept of Computer Science, UIUC

AMPI: Goals • Runtime adaptivity for MPI programs • Based on multi-domain decomposition and dynamic load balancing features of Charm++ • Minimal changes to the original MPI code • Full MPI 1.1 standard compliance • Additional support for coupled codes • Automatic conversion of existing MPI programs: Polaris/Padua AMPIzer Original MPI Code AMPI Code AMPI Runtime PPL-Dept of Computer Science, UIUC

How Good Is the Programmability • I.E. Do programmers find it easy/good • We think so  • Certainly a good intermediate level model • Higher level abstractions can be built on it • But what kinds of abstractions? • We think domain-specific ones PPL-Dept of Computer Science, UIUC

Decomposition Mapping Charm++ HPF Scheduling expression MPI Specialization Domain specific frameworks /AMPI PPL-Dept of Computer Science, UIUC

S S Q Q Further Match With MPIM • Ability to predict: • Which data is going to be needed and • Which code will execute • Based on the ready queue of object method invocations • So, we can: • Prefetch data accurately • Prefetch code if needed PPL-Dept of Computer Science, UIUC

So, What Are We Doing About It? • How to develop any programming environment for a machine that isn’t built yet • Blue Gene/C emulator using charm++ • Completed last year • Implememnts low level BG/C API • Packet sends, extract packet from comm buffers • Emulation runs on machines with hundreds of “normal” processors • Charm++ on blue Gene /C Emulator PPL-Dept of Computer Science, UIUC

Charm++ Charm++ BG/C low level API Converse Structure of the Emulators Blue Gene/C Low-level API Charm++ Converse PPL-Dept of Computer Science, UIUC

BG/C Nodes Hardware thread Simulating (Host) Processor Emulation on a Parallel Machine PPL-Dept of Computer Science, UIUC

Extensions to Charm++ for BG/C • Microtasks: • Objects may fire microtasks that can be executed by any thread on the same node • Increases parallelism • Overhead: sub-microsecond • Issue: • Object affinity: map to thread or node? • Thread, currently. • Microtasks alleviate load balancing within a node PPL-Dept of Computer Science, UIUC

Emulation efficiency • How much time does it take to run an emulation? • 8 Million processors being emulated on 100 • In addition, lower cache performance • Lots of tiny messages • On a Linux cluster: • Emulation shows good speedup PPL-Dept of Computer Science, UIUC

Emulation efficiency 1000 BG/C nodes (10x10x10) Each with 200 threads (total of 200,000 user-level threads) But Data is preliminary, based on one simulation PPL-Dept of Computer Science, UIUC

Emulator to Simulator • Step 1: Coarse grained simulation • Simulation: performance prediction capability • Models contention for processor/thread • Also models communication delay based on distance • Doesn’t model memory access on chip, or network • How to do this in spite of out-of-order message delivery? • Rely on determinism of Charm++ programs • Time stamped messages and threads • Parallel time-stamp correction algorithm PPL-Dept of Computer Science, UIUC

Applications on the current system • Using BG Charm++ • LeanMD: • Research quality Molecular Dyanmics • Version 0: only electrostatics + van der Vaal • Simple AMR kernel • Adaptive tree to generate millions of objects • Each holding a 3D array • Communication with “neighbors” • Tree makes it harder to find nbrs, but Charm makes it easy PPL-Dept of Computer Science, UIUC

Emulator to Simulator • Step 2: Add fine grained procesor simulation • Sarita Adve: RSIM based simulation of a node • SMP node simulation: completed • Also: simulation of interconnection network • Millions of thread units/caches to simulate in detail? • Step 3: Hybrid simulation • Instead: use detailed simulation to build model • Drive coarse simulation using model behavior • Further help from compiler and RTS PPL-Dept of Computer Science, UIUC

Modeling layers Applications For each: need a detailed simulation and a simpler (e.g. table-driven) “model” Libraries/RTS Chip Architecture Network model And methods for combining them PPL-Dept of Computer Science, UIUC

Summary • Charm++ (data-driven migratable objects) • is a well-matched candidate programming model for M-PIMs • We have developed an Emulator/Simulator • For BG/C • Runs on parallel machines • We have Implemented multi-million object applications using Charm++ • And tested on emulated Blue Gene/C • More info: http://charm.cs.uiuc.edu • Emulator is available for download, along with Charm PPL-Dept of Computer Science, UIUC

Programming Environment and Performance Modeling for million-processor machines

Programming Environment and Performance Modeling for million-processor machines

Presentation Transcript

Performance Modeling

Performance-Driven Processor Allocation

Lecture 2 Single Processor Machines: Memory Hierarchies and Processor Features

Measuring and Modeling Hyper-threaded Processor Performance

Programming Environment and Performance Modeling for million-processor machines

Distributed Memory Machines and Programming

Single Processor Machines: Memory Hierarchies and Processor Features

PAPR - Network Processor Architecture and Programming

Programming Environment

Programming Environment

High Performance Programming on a Single Processor: Memory Hierarchies

Performance Modeling

Programming the Cell Processor: Achieving High Performance and Efficiency

Programming for Cache Performance

Chapter 14 – Performance and Processor Design

Programming the Cell Processor: Achieving High Performance and Efficiency

Performance Modeling

Programming Environment

Programming the Cell Processor: Achieving High Performance and Efficiency