Application Performance Analysis on Blue Gene/L
Jim Pool, P.I.; Maciej Brodowicz, Sharon Brunett, Tom Gottschalk, Dan Meiron, Paul Springer, Thomas Sterling, Ed Upchurch
Caltech’s Role in Blue Gene/L Project
• Understand implications of the BG/L network architecture and drive results from real-world ASCI applications
• Develop statistical models of applications, processors as message generators, and the network (a toy sketch of the message-generator view follows this list)
• Focus on
  • Application communications distribution
  • Network contention as a function of load, machine size and adaptive routing
• Represent all 64K nodes explicitly in the statistical model
• Create trace analysis tools to characterize applications
  • Extensible Trace Facility (ETF)
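As a toy illustration (not the actual Caltech model), the C sketch below treats each of the 64K nodes as an independent message generator with exponentially distributed inter-arrival times and reports the resulting offered load; the distribution and all parameters are illustrative assumptions.

/*
 * Minimal sketch: each of the 64K nodes modeled as an independent
 * message generator with exponential inter-arrival times. In the real
 * model, contention would then be a function of this offered load,
 * the machine size and the (adaptive) routing policy.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NODES 65536            /* 64K nodes represented explicitly    */
#define MEAN_GAP_US 50.0       /* assumed mean time between messages  */
#define SIM_TIME_US 10000.0    /* simulated window, microseconds      */

/* Draw an exponentially distributed inter-arrival time. */
static double exp_sample(double mean)
{
    double u = (rand() + 1.0) / (RAND_MAX + 2.0);
    return -mean * log(u);
}

int main(void)
{
    long total_msgs = 0;

    /* Each node generates messages independently over the window. */
    for (int n = 0; n < NODES; n++) {
        double t = 0.0;
        while ((t += exp_sample(MEAN_GAP_US)) < SIM_TIME_US)
            total_msgs++;
    }

    double load = (double)total_msgs / NODES / SIM_TIME_US;
    printf("messages generated: %ld, offered load: %.4f msg/us/node\n",
           total_msgs, load);
    return 0;
}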
ETF Built-in Trace Options
• MPI events (see the interception sketch after this list)
  • All point-to-point communications (MPI-1)
  • All collective communications (MPI-1)
  • Non-blocking request tracking
  • Communicator creation and destruction
  • MPI datatype decoding (requires MPI-2)
  • Languages: C, Fortran
  • Easy instrumentation of applications
• Memory reference and program execution tracing
  • Tracking of statically and dynamically allocated arrays (identifiers, element sizes, dimensions)
  • Tracking of scalar variables
  • Read and write accesses to individual scalars and array elements as well as contiguous vectors of elements
  • Function calls
  • Program execution phases
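A hedged sketch of how a trace facility can capture MPI point-to-point events through the standard PMPI profiling interface; ETF's actual interception mechanism and record format are not shown in these slides and may differ.

/*
 * Sketch only: intercept MPI_Send via the PMPI profiling interface,
 * emit a trace record, then forward to the real implementation.
 */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int rank, type_bytes;
    MPI_Comm_rank(comm, &rank);
    MPI_Type_size(datatype, &type_bytes);

    /* A trace record: source rank, destination, tag, message size. */
    fprintf(stderr, "TRACE send %d -> %d tag=%d bytes=%d\n",
            rank, dest, tag, count * type_bytes);

    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}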
ETF Tracing Example for Magnetohydrodynamic (MHD) Code with Adaptive Mesh Refinement (AMR)
• Parallel MHD fluid code solves the equations of hydrodynamics and the resistive Maxwell's equations
• Part of a larger application which computes dynamic responses to strong shock waves impinging on target materials
• Fortran 90 + MPI
• MPI Cartesian communicators
• Nearest-neighbor communications use non-blocking send/recv
• MPI_Allreduce for calculating stable time steps (a sketch of this pattern follows)
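The MHD code itself is Fortran 90; the C sketch below shows the same MPI pattern the bullets describe: a Cartesian communicator, a non-blocking nearest-neighbor ghost-cell exchange, and an MPI_Allreduce that picks the globally stable time step. The 2-D decomposition and buffer lengths are illustrative assumptions, not the application's actual layout.

/* Sketch of the communication pattern; not the application code. */
#include <mpi.h>
#include <stdio.h>

#define NGHOST 1024   /* illustrative ghost-cell buffer length */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Periodic 2-D Cartesian communicator sized to the job. */
    int nprocs, rank, dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Comm cart;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);

    /* Non-blocking nearest-neighbor exchange along one dimension. */
    double send[2][NGHOST] = {{0.0}}, recv[2][NGHOST];
    MPI_Request req[4];
    int lo, hi;
    MPI_Cart_shift(cart, 0, 1, &lo, &hi);
    MPI_Irecv(recv[0], NGHOST, MPI_DOUBLE, lo, 0, cart, &req[0]);
    MPI_Irecv(recv[1], NGHOST, MPI_DOUBLE, hi, 1, cart, &req[1]);
    MPI_Isend(send[0], NGHOST, MPI_DOUBLE, hi, 0, cart, &req[2]);
    MPI_Isend(send[1], NGHOST, MPI_DOUBLE, lo, 1, cart, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* Globally stable time step = minimum of the local CFL limits. */
    double dt_local = 1.0e-3, dt_global;
    MPI_Allreduce(&dt_local, &dt_global, 1, MPI_DOUBLE, MPI_MIN, cart);
    if (rank == 0)
        printf("globally stable dt = %g\n", dt_global);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}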
AMR MHD: Communication Profile (figure) — 20 time steps on 32 processors, 128x128 cells; panels for max. refinement level = 1 and max. level = 2
Lennard-Jones Molecular Dynamics
• Short-range molecular dynamics application simulating Newtonian interactions in large groups of atoms
  • Production code from Sandia National Laboratories
• Simulations are large in two dimensions
  • Number of atoms and number of time steps
• Spatial decomposition case selected
  • Each processing node keeps track of the positions and movement of the atoms in a 3-D box
• Computations carried out in a single time step correspond to femtoseconds of real time
  • A meaningful simulation of the evolution of the system's state typically requires thousands of time steps
• Point-to-point MPI messages are exchanged across each of the 6 sides of the box per time step (see the halo-exchange sketch after this list)
• Code is written in Fortran and MPI
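The production LJS code is Fortran + MPI; this C sketch only illustrates the per-timestep exchange across the six faces of a node's 3-D box described above. The face buffer size, packing and use of MPI_Sendrecv are assumptions for illustration, not the Sandia code.

/* Sketch: one exchange per face of the 3-D box, six faces per step. */
#include <mpi.h>

#define FACE_ATOMS 4096   /* illustrative number of atoms per face */

void exchange_faces(MPI_Comm cart3d,
                    double faces_out[6][FACE_ATOMS],
                    double faces_in[6][FACE_ATOMS])
{
    for (int dim = 0; dim < 3; dim++) {
        int lo, hi;
        MPI_Cart_shift(cart3d, dim, 1, &lo, &hi);

        /* Send to the "high" neighbor, receive from the "low" one... */
        MPI_Sendrecv(faces_out[2*dim],   FACE_ATOMS, MPI_DOUBLE, hi, dim,
                     faces_in[2*dim],    FACE_ATOMS, MPI_DOUBLE, lo, dim,
                     cart3d, MPI_STATUS_IGNORE);
        /* ...and the reverse direction for the opposite face. */
        MPI_Sendrecv(faces_out[2*dim+1], FACE_ATOMS, MPI_DOUBLE, lo, dim+3,
                     faces_in[2*dim+1],  FACE_ATOMS, MPI_DOUBLE, hi, dim+3,
                     cart3d, MPI_STATUS_IGNORE);
    }
}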
Lennard-Jones Molecular Dynamics (figure) — communication steps, typical grid cell and cutoff radius, and computational cycle model
LJS Single Processor BG/L Performance: Original Code vs. Tuned for BG/L (chart) — improvement (%) vs. number of atoms per BG/L CPU (15,625 to 500,000), with "good cache reuse" annotated
LJS Molecular Dynamics Performance: Fixed Problem Size of 1 Billion Atoms (chart) — compute time and communications time per single iteration (ms) vs. number of BG/L CPUs (2k to 64k)
LJS Speedup: BG/L vs. ASCI Red (3200 Nodes), 1 Billion Atom Problem (chart) — speedup vs. number of BlueGene/L nodes (2k to 64k)
LJS Communications Time: 500,000 Atoms per BG/L Node (chart) — communications time per iteration (ms) for physical nearest-neighbor mapping vs. random mapping on 4x4x4 (64), 8x8x8 (512) and 16x16x16 (4096) BG/L node configurations
What is QMC and Why is it a Good Fit for BG/L?
• QMC is a finite all-electron Quantum Monte Carlo code used to determine quantum properties of materials with extremely high accuracy
• Developed at Caltech by Bill Goddard's ASCI Material Properties group
• Interesting characteristics
  • Low memory requirements
  • After initialization, highly parallel and scalable
  • Minimal set of MPI calls required
    • Non-blocking point-to-point, reduction, probe, communicator and collective calls
  • No communication during QMC working steps
  • Communicating convergence statistics takes 7200 bytes regardless of problem size and node count
  • Code already ported to many platforms (Linux, AIX, IRIX, etc.)
  • C++ and MPI sources
Iterative QMC Algorithm
For each processor do:
  Steps = Total Steps / number of processors
  Generate walkers
  Equilibrate walkers
  for each step
    generate QMC statistics
  send QMC statistics to master node
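A minimal C sketch of the loop above (the real code is C++). The 7200-byte convergence-statistics buffer and the reduction to the master node follow the slides; the total step count, the use of MPI_Reduce, and the placeholder statistic are assumptions.

/* Sketch of the iterative QMC loop; not the actual Caltech code. */
#include <mpi.h>
#include <string.h>

#define TOTAL_STEPS 6400000                 /* assumed total workload    */
#define STATS_BYTES 7200                    /* fixed-size statistics     */
#define STATS_DOUBLES (STATS_BYTES / sizeof(double))

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int steps = TOTAL_STEPS / nprocs;       /* Steps = Total / #procs    */
    double stats[STATS_DOUBLES], global[STATS_DOUBLES];
    memset(stats, 0, sizeof stats);

    /* generate_walkers(); equilibrate_walkers();  -- omitted here */

    for (int s = 0; s < steps; s++) {
        /* ... QMC working step, no communication ... */
        stats[0] += 1.0;                    /* placeholder statistic     */
    }

    /* Merge convergence statistics on the master node (rank 0). */
    MPI_Reduce(stats, global, STATS_DOUBLES, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}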
QMC Communications Time for 100,000 Steps per Node, Reduce Using the Torus (chart) — time (seconds, log scale, 0.001 to 1) vs. BG/L configuration: 8x8x8 (512), 16x16x16 (4K), 32x16x16 (8K), 32x32x16 (16K), 32x32x32 (32K), 64x32x32 (64K)
Future Application Porting and Analysis for BG/L
• ASCI solid dynamics code simulating the mechanical response of polycrystalline materials, such as tantalum
• Address memory constraints, grain load imbalance and MPI_Waitall() efficiency as we port and tune to BG/L (a generic sketch of this request pattern follows)
  • Good stress test for BG/L robustness
• Scalable simulation of polycrystalline response with an assumed grain shape: the space-filling polyhedron given by the Wigner-Seitz cell of a BCC crystal. The 390-grain example shown here was run on LLNL's IBM SP3, frost.
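A generic sketch (not the actual solid dynamics code) of the non-blocking request pattern whose MPI_Waitall() cost the slide flags for tuning: with many neighboring grains and imbalanced work per grain, ranks reach the wait at different times and it dominates. The function name, arguments and buffer layout are assumptions.

/* Sketch: exchange one buffer per neighboring grain, then wait on all. */
#include <mpi.h>
#include <stdlib.h>

void exchange_grain_boundaries(MPI_Comm comm, int nneighbors,
                               const int *neighbor_rank,
                               double **sendbuf, double **recvbuf, int len)
{
    MPI_Request *req = malloc(2 * nneighbors * sizeof(MPI_Request));

    for (int i = 0; i < nneighbors; i++) {
        MPI_Irecv(recvbuf[i], len, MPI_DOUBLE, neighbor_rank[i], 0, comm,
                  &req[2*i]);
        MPI_Isend(sendbuf[i], len, MPI_DOUBLE, neighbor_rank[i], 0, comm,
                  &req[2*i + 1]);
    }

    /* With imbalanced grains, this wait dominates; it is the call the
     * slide identifies as a tuning target on BG/L. */
    MPI_Waitall(2 * nneighbors, req, MPI_STATUSES_IGNORE);
    free(req);
}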