Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
http://charm.cs.uiuc.edu
Department of Computer Science, Parallel Programming Laboratory
Roadmap
• BlueGene Architecture
• Need for an Emulator
• Charm++ BlueGene
• Converse BlueGene
• Future Work
Blue Gene: Processor-in-Memory Case Study
• Five steps to a PetaFLOPS, taken from http://www.research.ibm.com/bluegene/ :
  PROCESSOR (1 GFlop/s, 0.5 MB) → NODE/CHIP (25 GFlop/s, 12.5 MB) → BOARD → TOWER → BLUE GENE (1 PFlop/s, 0.5 TB)
• FUNCTIONAL MODEL: a 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.
SMP Node
• 25 processors (200 processing elements)
• Input/output buffer: 32 x 128 bytes
• Network: connected to six neighbors via duplex links
  • 16 bits @ 500 MHz = 1 GB/s per link
• Latencies: 5 cycles per hop, 75 cycles per turn
Processor
• 500 MHz clock
• Memory-side cache eliminates coherency problems
• Access times: 10 cycles to local cache, 20 cycles to remote cache, 10 cycles on a cache miss
• 8 integer units sharing 2 floating-point units
• 8 PEs x 25 processors x ~40,000 nodes = ~8 x 10^6 processing elements! (checked below)
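A quick arithmetic check of these machine-scale figures; a minimal sketch using only the numbers already quoted on the slides above:

#include <cstdio>

int main() {
    // Functional model: a 34 x 34 x 36 mesh of shared-memory nodes.
    const long nodes = 34L * 34 * 36;            // 41,616 -- the "~40,000" above
    const long pes   = nodes * 25 * 8;           // 25 processors/node, 8 PEs/processor
    // Peak speed: 1 GFlop/s per processor.
    const double peakPFlops = nodes * 25 * 1e9 / 1e15;
    // Link bandwidth: 16 bits (2 bytes) at 500 MHz.
    const double linkGBps = 2.0 * 500e6 / 1e9;
    std::printf("nodes = %ld, PEs = %ld (~8e6)\n", nodes, pes);
    std::printf("peak = %.2f PFlop/s, link = %.1f GB/s\n", peakPFlops, linkGBps);
    return 0;
}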
Need for an Emulator
• An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.
Emulator Objectives
• Emulate Blue Gene and other PetaFLOPS machines.
• Memory and time limitations of a single processor require that the emulation be performed on a parallel architecture.
• Issues:
  • Assume that a program written for a processor-in-memory machine already handles out-of-order execution and messaging.
  • Therefore no complex event queue/rollback is needed.
Emulator Implementation
• What are the basic data structures and interface? (a sketch follows this slide)
  • Machine configuration (topology), handler registration
  • Nodes with node-level shared data
  • Threads (associated with each node) representing processing elements
  • Communication between nodes
• How do we manage all these objects on a parallel architecture, and how do we handle object-to-object communication?
• The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
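The slides list the ingredients of the interface without showing it; below is a minimal hypothetical sketch of what such an emulator API could look like in C++. All names (BgConfig, BgRegisterHandler, BgSendPacket, BgGetNodeData, BgNodeStart) are illustrative assumptions, not the emulator's actual symbols:

#include <cstddef>

// Machine configuration: mesh topology and threads per emulated node.
struct BgConfig {
    int sizeX, sizeY, sizeZ;     // e.g. 34 x 34 x 36 for the full machine
    int workerThreadsPerNode;    // emulated processing elements per node
    int commThreadsPerNode;      // threads that drain the node's inBuffer
};

// A handler runs on the destination node when a message addressed to it
// arrives; registration returns the ID used to name it in messages.
typedef void (*BgHandler)(void *msg);
int BgRegisterHandler(BgHandler h);

// Send 'msg' to worker thread 'tid' on node (x,y,z); the emulator invokes
// the handler registered under 'handlerId' there.
void BgSendPacket(int x, int y, int z, int tid,
                  int handlerId, std::size_t bytes, void *msg);

// Node-level shared data, visible to all threads of one emulated node.
void *BgGetNodeData();
void  BgSetNodeData(void *data);

// Entry point invoked once on every emulated node at startup.
void BgNodeStart(int argc, char **argv);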
Experiments on the Emulator
• Sample applications implemented:
  • Primes
  • Jacobi relaxation (kernel sketched below)
  • MD prototype: 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
• Ran the full Blue Gene configuration (with 8 x 10^6 threads) on ~100 ASCI-Red processors.
[Figure: ApoA-I, 92k atoms]
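Jacobi relaxation is the simplest of these applications; as a reference for what each emulated thread computes, here is a minimal serial sketch of one sweep (the 2-D five-point stencil is an assumption; the slides do not specify the variant):

#include <vector>

// One sweep of 2-D Jacobi relaxation: each interior point is replaced by
// the average of its four neighbors. 'n' includes the boundary rows.
void jacobiSweep(const std::vector<std::vector<double>> &in,
                 std::vector<std::vector<double>> &out, int n) {
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j)
            out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +
                                in[i][j-1] + in[i][j+1]);
}

On the emulator, each node would own a block of the grid and exchange boundary values with its mesh neighbors between sweeps.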
Collective Operations
• Explore different algorithms for broadcasts and reductions: OCTREE, LINE, RING (ring sketch below)
• Platform: a "primitive" 30 x 30 x 20 (10 threads per node) Blue Gene emulation on a 50-processor Linux cluster.
[Figure: octree, line, and ring broadcast patterns over the x/y/z mesh]
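As an illustration of one of these strategies, a ring broadcast forwards the message around one mesh dimension, one hop at a time. A minimal sketch, reusing the hypothetical BgSendPacket interface from the earlier slide (RingMsg and ringHandlerId are invented for illustration):

#include <cstddef>

// Declared in the hypothetical interface sketch above.
void BgSendPacket(int x, int y, int z, int tid,
                  int handlerId, std::size_t bytes, void *msg);

// Hypothetical ring broadcast along the x dimension: each node handles
// the message locally, then forwards it to its +x neighbor, wrapping at
// the mesh edge, until the message returns to the originating column.
struct RingMsg {
    int originX;    // x coordinate of the node that started the broadcast
};

void ringForward(RingMsg *m, int myX, int myY, int myZ,
                 int sizeX, int ringHandlerId) {
    int nextX = (myX + 1) % sizeX;     // wrap around the mesh edge
    if (nextX == m->originX) return;   // the ring is complete
    BgSendPacket(nextX, myY, myZ, /*tid=*/0,
                 ringHandlerId, sizeof(RingMsg), m);
}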
Converse BlueGene Emulator: Objectives
• Performance estimation (with proper time stamping; see the sketch below)
• Provide an API for building Charm++ on top of the emulator.
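The slides do not say how timestamps are assigned; one plausible scheme, purely an assumption for illustration, is to carry a virtual send time on each message and add a network delay modeled from the per-hop and per-turn latencies quoted on the SMP-node slide:

// Hypothetical virtual-time scheme for performance estimation: each
// worker thread keeps a virtual clock, and every message carries its
// send time plus a modeled network delay.
struct Timestamped {
    double sendTime;   // sender's virtual time when the message left
    double recvTime;   // sendTime + modeledLatency(...)
};

double modeledLatency(int hops, int turns) {
    const double cycle = 1.0 / 500e6;              // 500 MHz clock
    return (5.0 * hops + 75.0 * turns) * cycle;    // 5 cycles/hop, 75 cycles/turn
}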
Bluegene Emulator: Node Structure
• Communication threads
• Worker threads
• inBuffer
• Affinity message queue
• Non-affinity message queue
(sketched as data structures below)
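Rendered as data structures, the node might look roughly like this (a sketch only; the queue types and field names are assumptions based on the component list above):

#include <queue>
#include <vector>

// Sketch of one emulated node, mirroring the components listed above.
// Communication threads drain the inBuffer, routing each message either
// to a specific worker's affinity queue or to the shared non-affinity
// queue, from which any worker thread may pick up work.
struct Message {
    int handlerId;
    std::vector<char> payload;
};

struct BgNode {
    std::queue<Message> inBuffer;                // packets arriving off the network
    std::vector<std::queue<Message>> affinityQ;  // one queue per worker thread
    std::queue<Message> nonAffinityQ;            // shared by all worker threads
    void *nodeData;                              // node-level shared data
    int numWorkerThreads;
    int numCommThreads;
};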
Performance
• Pingpong latencies (round-trip time):
  • Converse Bluegene pingpong is close to plain Converse pingpong: 81–103 µs vs. 92 µs RTT
  • Charm++ pingpong: 116 µs RTT
  • Charm++ Bluegene pingpong: 134–175 µs RTT
Charm++ on Top of the Emulator
• Each BlueGene thread represents a Charm++ node.
• Name conflicts to resolve (one possible scheme sketched below):
  • Cpv, Ctv (Converse processor- and thread-private variables)
  • MsgSend, etc.
  • CkMyPe(), CkNumPes(), etc.
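The slides only name the conflicts; one conventional way to resolve them is to redirect the conflicting names to emulator-specific versions behind macros. A hypothetical sketch (BgMyPe/BgNumPes and the CMK_BLUEGENE_CHARM guard are invented for illustration, not the actual renames):

// Hypothetical renaming layer: when building Charm++ on the emulator,
// redirect processor-level queries to emulated-thread equivalents, so
// that "my PE" means "my BlueGene worker thread" rather than the
// physical processor running the emulation.
#ifdef CMK_BLUEGENE_CHARM           // guard name is invented
int BgMyPe();                       // rank of this emulated processing element
int BgNumPes();                     // total number of emulated PEs
#define CkMyPe()   BgMyPe()
#define CkNumPes() BgNumPes()
#endif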
Future Work: Simulator
• LeanMD: a fully functional MD application with cutoff only.
• How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
• Several layers of detail to measure:
  • Basic: correctly model performance; timestamp messages with correction for out-of-order execution.
  • More detailed: network performance, memory access, modeling the sharing of floating-point units, estimation techniques.