Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
http://charm.cs.uiuc.edu
Department of Computer Science, Parallel Programming Laboratory
Roadmap
• BlueGene Architecture
• Need for an Emulator
• Charm++ BlueGene
• Converse BlueGene
• Future Work
Blue Gene: Processor-in-Memory Case Study
• Five steps to a PetaFLOPS, taken from http://www.research.ibm.com/bluegene/ :
  PROCESSOR (1 GFlop/s, 0.5 MB) → NODE/CHIP (25 GFlop/s, 12.5 MB) → BOARD → TOWER → BLUE GENE (1 PFlop/s, 0.5 TB)
• FUNCTIONAL MODEL: a 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.
SMP Node
• 25 processors (200 processing elements)
• Input/output buffer: 32 x 128 bytes
• Network: connected to six neighbors via duplex links
  • 16 bits @ 500 MHz = 1 GB/s per link
• Latencies: 5 cycles per hop, 75 cycles per turn
Processor
• 500 MHz clock
• Memory-side cache eliminates coherency problems
• Access times: 10 cycles to local cache, 20 cycles to remote cache, 10 cycles on a cache miss
• 8 integer units sharing 2 floating-point units
• 8 PEs x 25 processors x ~40,000 nodes = ~8 x 10^6 processing elements! (checked below)
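A quick arithmetic check of these machine-scale figures; a minimal sketch using only the numbers already quoted on the slides above:

#include <cstdio>

int main() {
    // Functional model: a 34 x 34 x 36 mesh of shared-memory nodes.
    const long nodes = 34L * 34 * 36;            // 41,616 -- the "~40,000" above
    const long pes   = nodes * 25 * 8;           // 25 processors/node, 8 PEs/processor
    // Peak speed: 1 GFlop/s per processor.
    const double peakPFlops = nodes * 25 * 1e9 / 1e15;
    // Link bandwidth: 16 bits (2 bytes) at 500 MHz.
    const double linkGBps = 2.0 * 500e6 / 1e9;
    std::printf("nodes = %ld, PEs = %ld (~8e6)\n", nodes, pes);
    std::printf("peak = %.2f PFlop/s, link = %.1f GB/s\n", peakPFlops, linkGBps);
    return 0;
}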
Need for an Emulator
• An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.
Emulator Objectives
• Emulate Blue Gene and other PetaFLOPS machines.
• Memory and time limitations of a single processor require that the emulation be performed on a parallel architecture.
• Issues:
  • Assume that a program written for a processor-in-memory machine already handles out-of-order execution and messaging.
  • Therefore no complex event queue/rollback is needed.
Emulator Implementation
• What are the basic data structures and interface? (a sketch follows this slide)
  • Machine configuration (topology), handler registration
  • Nodes with node-level shared data
  • Threads (associated with each node) representing processing elements
  • Communication between nodes
• How do we manage all these objects on a parallel architecture, and how do we handle object-to-object communication?
• The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
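The slides list the ingredients of the interface without showing it; below is a minimal hypothetical sketch of what such an emulator API could look like in C++. All names (BgConfig, BgRegisterHandler, BgSendPacket, BgGetNodeData, BgNodeStart) are illustrative assumptions, not the emulator's actual symbols:

#include <cstddef>

// Machine configuration: mesh topology and threads per emulated node.
struct BgConfig {
    int sizeX, sizeY, sizeZ;     // e.g. 34 x 34 x 36 for the full machine
    int workerThreadsPerNode;    // emulated processing elements per node
    int commThreadsPerNode;      // threads that drain the node's inBuffer
};

// A handler runs on the destination node when a message addressed to it
// arrives; registration returns the ID used to name it in messages.
typedef void (*BgHandler)(void *msg);
int BgRegisterHandler(BgHandler h);

// Send 'msg' to worker thread 'tid' on node (x,y,z); the emulator invokes
// the handler registered under 'handlerId' there.
void BgSendPacket(int x, int y, int z, int tid,
                  int handlerId, std::size_t bytes, void *msg);

// Node-level shared data, visible to all threads of one emulated node.
void *BgGetNodeData();
void  BgSetNodeData(void *data);

// Entry point invoked once on every emulated node at startup.
void BgNodeStart(int argc, char **argv);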
Experiments on the Emulator
• Sample applications implemented:
  • Primes
  • Jacobi relaxation (kernel sketched below)
  • MD prototype: 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
• Ran the full Blue Gene configuration (with 8 x 10^6 threads) on ~100 ASCI-Red processors.
[Figure: ApoA-I, 92k atoms]
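Jacobi relaxation is the simplest of these applications; as a reference for what each emulated thread computes, here is a minimal serial sketch of one sweep (the 2-D five-point stencil is an assumption; the slides do not specify the variant):

#include <vector>

// One sweep of 2-D Jacobi relaxation: each interior point is replaced by
// the average of its four neighbors. 'n' includes the boundary rows.
void jacobiSweep(const std::vector<std::vector<double>> &in,
                 std::vector<std::vector<double>> &out, int n) {
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j)
            out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +
                                in[i][j-1] + in[i][j+1]);
}

On the emulator, each node would own a block of the grid and exchange boundary values with its mesh neighbors between sweeps.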
Collective Operations
• Explore different algorithms for broadcasts and reductions: OCTREE, LINE, RING (ring sketch below)
• Platform: a "primitive" 30 x 30 x 20 (10 threads per node) Blue Gene emulation on a 50-processor Linux cluster.
[Figure: octree, line, and ring broadcast patterns over the x/y/z mesh]
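As an illustration of one of these strategies, a ring broadcast forwards the message around one mesh dimension, one hop at a time. A minimal sketch, reusing the hypothetical BgSendPacket interface from the earlier slide (RingMsg and ringHandlerId are invented for illustration):

#include <cstddef>

// Declared in the hypothetical interface sketch above.
void BgSendPacket(int x, int y, int z, int tid,
                  int handlerId, std::size_t bytes, void *msg);

// Hypothetical ring broadcast along the x dimension: each node handles
// the message locally, then forwards it to its +x neighbor, wrapping at
// the mesh edge, until the message returns to the originating column.
struct RingMsg {
    int originX;    // x coordinate of the node that started the broadcast
};

void ringForward(RingMsg *m, int myX, int myY, int myZ,
                 int sizeX, int ringHandlerId) {
    int nextX = (myX + 1) % sizeX;     // wrap around the mesh edge
    if (nextX == m->originX) return;   // the ring is complete
    BgSendPacket(nextX, myY, myZ, /*tid=*/0,
                 ringHandlerId, sizeof(RingMsg), m);
}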
Converse BlueGene Emulator: Objectives
• Performance estimation (with proper time stamping; see the sketch below)
• Provide an API for building Charm++ on top of the emulator.
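The slides do not say how timestamps are assigned; one plausible scheme, purely an assumption for illustration, is to carry a virtual send time on each message and add a network delay modeled from the per-hop and per-turn latencies quoted on the SMP-node slide:

// Hypothetical virtual-time scheme for performance estimation: each
// worker thread keeps a virtual clock, and every message carries its
// send time plus a modeled network delay.
struct Timestamped {
    double sendTime;   // sender's virtual time when the message left
    double recvTime;   // sendTime + modeledLatency(...)
};

double modeledLatency(int hops, int turns) {
    const double cycle = 1.0 / 500e6;              // 500 MHz clock
    return (5.0 * hops + 75.0 * turns) * cycle;    // 5 cycles/hop, 75 cycles/turn
}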
Bluegene Emulator: Node Structure
• Communication threads
• Worker threads
• inBuffer
• Affinity message queue
• Non-affinity message queue
(sketched as data structures below)
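Rendered as data structures, the node might look roughly like this (a sketch only; the queue types and field names are assumptions based on the component list above):

#include <queue>
#include <vector>

// Sketch of one emulated node, mirroring the components listed above.
// Communication threads drain the inBuffer, routing each message either
// to a specific worker's affinity queue or to the shared non-affinity
// queue, from which any worker thread may pick up work.
struct Message {
    int handlerId;
    std::vector<char> payload;
};

struct BgNode {
    std::queue<Message> inBuffer;                // packets arriving off the network
    std::vector<std::queue<Message>> affinityQ;  // one queue per worker thread
    std::queue<Message> nonAffinityQ;            // shared by all worker threads
    void *nodeData;                              // node-level shared data
    int numWorkerThreads;
    int numCommThreads;
};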
Performance
• Pingpong latencies (round-trip time):
  • Converse Bluegene pingpong is close to plain Converse pingpong: 81–103 µs vs. 92 µs RTT
  • Charm++ pingpong: 116 µs RTT
  • Charm++ Bluegene pingpong: 134–175 µs RTT
Charm++ on Top of the Emulator
• Each BlueGene thread represents a Charm++ node.
• Name conflicts to resolve (one possible scheme sketched below):
  • Cpv, Ctv (Converse processor- and thread-private variables)
  • MsgSend, etc.
  • CkMyPe(), CkNumPes(), etc.
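The slides only name the conflicts; one conventional way to resolve them is to redirect the conflicting names to emulator-specific versions behind macros. A hypothetical sketch (BgMyPe/BgNumPes and the CMK_BLUEGENE_CHARM guard are invented for illustration, not the actual renames):

// Hypothetical renaming layer: when building Charm++ on the emulator,
// redirect processor-level queries to emulated-thread equivalents, so
// that "my PE" means "my BlueGene worker thread" rather than the
// physical processor running the emulation.
#ifdef CMK_BLUEGENE_CHARM           // guard name is invented
int BgMyPe();                       // rank of this emulated processing element
int BgNumPes();                     // total number of emulated PEs
#define CkMyPe()   BgMyPe()
#define CkNumPes() BgNumPes()
#endif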
Future Work: Simulator
• LeanMD: a fully functional MD application with cutoff only.
• How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
• Several layers of detail to measure:
  • Basic: correctly model performance; timestamp messages with correction for out-of-order execution.
  • More detailed: network performance, memory access, modeling the sharing of floating-point units, estimation techniques.