Blue Gene Simulator

Gengbin Zheng (gzheng@uiuc.edu) • Gunavardhan Kakulapati (kakulapa@uiuc.edu) • Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign • http://charm.cs.uiuc.edu

Overview • Blue Gene Emulator • Blue Gene Simulator • Timing correction schemes • Performance and results

Emulation on a Parallel Machine (figure labels: BG/C nodes, hardware threads, simulating (host) processor)

Blue Gene Emulator: functional view (figure: one Blue Gene/C node with communication threads, worker threads, inBuffer, CorrectionQ, affinity message queues and non-affinity message queues)

Blue Gene Emulator: functional view (figure: several such nodes, each with its own communication threads, worker threads, inBuffer, CorrectionQ, affinity and non-affinity message queues, above the Converse scheduler and Converse queue)

What it is capable of • Blue Gene API support • Blue Gene Charm++ • Structured Dagger • Trace Projections

Emulator to Simulator • Emulator: • Used to study the programming model and application development • Simulator: • Adds performance prediction capability • Models communication latency based on a network model • Does not model on-chip memory access or network contention

Simulator • Parallel performance is hard to model • Communication subsystem • Out of order messages • Communication/computation overlap • Event dependencies • Parallel Discrete Event Simulation • Emulation program executes in parallel with event time stamp correction. • Exploit inherent determinacy of application

How to simulate? • Time stamping events • Per-thread timer (sharing one physical timer) • Time stamp messages • Calculate communication latency based on network model • Parallel event simulation • When a message is sent out, calculate the predicted arrival time at the destination Blue Gene processor • When a message is received, update the current time: currTime = max(currTime, recvTime) • Time stamp correction

Time stamping messages and threads • Thread timer: curT • Message sent: RecvT(msg) = curT + Latency • Message scheduled: curT = max(curT, RecvT(msg))
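The two timer rules on this slide can be illustrated with a minimal sketch. The class and method names (`ThreadTimer`, `Message`, `compute`, `send`, `schedule`) are hypothetical and not part of the simulator's API, and the latency is passed in as a plain number, whereas the slides say it comes from a network model.

```python
from dataclasses import dataclass


@dataclass
class Message:
    recv_time: float  # predicted arrival time at the destination (RecvT)


class ThreadTimer:
    """Per-thread virtual timer (curT in the slides). Hypothetical helper."""

    def __init__(self) -> None:
        self.cur_t = 0.0

    def compute(self, duration: float) -> None:
        # Advance virtual time by the cost of local work.
        self.cur_t += duration

    def send(self, latency: float) -> Message:
        # Message sent: RecvT(msg) = curT + Latency
        return Message(recv_time=self.cur_t + latency)

    def schedule(self, msg: Message) -> None:
        # Message scheduled: curT = max(curT, RecvT(msg))
        self.cur_t = max(self.cur_t, msg.recv_time)


sender, receiver = ThreadTimer(), ThreadTimer()
sender.compute(5.0)
msg = sender.send(latency=2.0)  # predicted arrival time is 5.0 + 2.0
receiver.compute(3.0)           # receiver has only reached time 3.0
receiver.schedule(msg)          # so it jumps forward to the arrival time
```

Taking the maximum means a message never executes before its predicted arrival, and a busy thread never moves backwards in time.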

Need for timestamp correction • Time stamp correction is needed for out-of-order messages • Out-of-order delivery can occur: • A message arrives late, after some other message has already advanced the thread time into the future • The late message then executes in the context of that future time, although its predicted receive time is earlier
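A small sketch of the problem, assuming a hypothetical helper `execute_in_arrival_order` that applies the rule curT = max(curT, recvTime) to messages in the order the host machine happens to deliver them. The message names and times are made up for illustration.

```python
def execute_in_arrival_order(messages):
    """messages: list of (name, recv_time, duration) in host delivery order.

    Returns a list of (name, start_time, end_time) as stamped without correction.
    """
    cur_t = 0.0
    timeline = []
    for name, recv_time, duration in messages:
        start = max(cur_t, recv_time)  # thread timer rule from the slides
        cur_t = start + duration
        timeline.append((name, start, cur_t))
    return timeline


# B (predicted for t=10) is delivered by the host before A (predicted for t=4).
delivered = [("B", 10.0, 1.0), ("A", 4.0, 1.0)]
timeline = execute_in_arrival_order(delivered)
# A is stamped as starting at 11.0, after B, even though it was predicted
# to arrive at 4.0. This is the error that timestamp correction removes.
```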

Parallel correction algorithm • Sort message executions by receive time • Adjust time stamps when needed • Use correction messages to announce a change in an event's startTime • Send correction messages along the same path the original messages were sent • Events already in the timeline may have to move
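A sketch of the local part of this algorithm on a single timeline, under the assumption that every event is triggered by exactly one message. The function name `correct_linear` and the dictionary layout are hypothetical. Events whose start time changes are the ones that would need to send correction messages for the messages they emitted; that forwarding step is not shown.

```python
def correct_linear(events):
    """events: list of dicts with keys name, recv_time, duration, start.

    Returns (corrected_events, changed_names): the timeline re-sorted by
    receive time with recomputed start times, and the names of events
    whose start time moved.
    """
    ordered = sorted(events, key=lambda e: e["recv_time"])
    cur_t = 0.0
    corrected, changed = [], []
    for e in ordered:
        start = max(cur_t, e["recv_time"])
        if start != e["start"]:
            changed.append(e["name"])  # would trigger correction messages
        corrected.append({**e, "start": start})
        cur_t = start + e["duration"]
    return corrected, changed


# Uncorrected timeline: B ran first, A was stamped after it.
uncorrected = [
    {"name": "B", "recv_time": 10.0, "duration": 1.0, "start": 10.0},
    {"name": "A", "recv_time": 4.0, "duration": 1.0, "start": 11.0},
]
corrected, changed = correct_linear(uncorrected)
```

In this example only A moves (to its predicted time 4.0), so only messages sent by A would need corrections.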

Timestamps Correction (figures: execution timelines ordered by RecvTime. The initial order is M1 M2 M3 M4 M5 M6 M7 M8; after M8 is corrected the order becomes M1 M2 M3 M8 M4 M5 M6 M7, which generates correction messages. A correction message for M4 can move M4 earlier, giving M1 M2 M4 M3 M5 M6 M7, or later, giving M1 M2 M3 M5 M6 M4 M7.)

Linear-order correction • Works only when • Programs have no alternate orders of execution • Messages are processed in the same order across multiple executions • E.g., MPI programs with no wildcard receives, structured-dagger code with no “overlap” or “forall”

Reasons: • The correction algorithm can break dependency logic, because it is based only on receive time • Cases: • An event depends on several messages • The last message triggers the computation • A message is buffered until some condition holds • Example where the linear-order scheme is invalid: Jacobi-1D

Solution • Use structured dagger to retrieve dependence information • As the program runs, form a chain of bluegene logs that preserves the dependency information • Bluegene logs are kept for entry functions and structured dagger functions

Timestamp correction scheme • Every event has a list of backward and forward dependents • An event cannot start until its backward dependents have finished • Define effRecvTime = max(recvTime, endOfBackDeps) • An event can start only after its effRecvTime: startTime = max(effRecvTime, timeline.last.endTime)
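The two formulas on this slide can be written out as a short sketch. The `Event` class and `append_event` helper are hypothetical, and `endOfBackDeps` is assumed here to be the latest end time among the backward dependents.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    recv_time: float
    duration: float
    back_deps: list = field(default_factory=list)  # events that must finish first
    start_time: float = 0.0

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration

    def eff_recv_time(self) -> float:
        # effRecvTime = max(recvTime, endOfBackDeps)
        end_of_back_deps = max((d.end_time for d in self.back_deps), default=0.0)
        return max(self.recv_time, end_of_back_deps)


def append_event(timeline: list, event: Event) -> None:
    # startTime = max(effRecvTime, timeline.last.endTime)
    last_end = timeline[-1].end_time if timeline else 0.0
    event.start_time = max(event.eff_recv_time(), last_end)
    timeline.append(event)


timeline = []
e1 = Event("e1", recv_time=2.0, duration=3.0)
e2 = Event("e2", recv_time=1.0, duration=1.0, back_deps=[e1])
e3 = Event("e3", recv_time=4.0, duration=1.0)
for e in (e1, e2, e3):
    append_event(timeline, e)
```

Here e2's message arrives first (time 1.0), but e2 cannot start before e1 ends, so its effective receive time is e1's end time.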

Timestamp correction scheme • Unlike the previous case, the timeline is not sorted on the recvTime of the events • The timeline is sorted on effRecvTime • Steps to process a correction message • Find the earliest event updated by the message • Cut the timeline at that event • Recalculate effRecvTimes from that point on • Reinsert the events into the timeline in order of effRecvTime

Non-linear order correction scheme • The new scheme: • Takes the event dependencies into account • Works even when messages can be received in different orders in different runs • Requires all the dependencies to be captured using structured dagger • But the timing correction is very slow; several optimizations are possible
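The four steps for processing a correction message can be sketched on a single timeline as follows. All names (`Event`, `rebuild`, `apply_correction`) are hypothetical, dependencies are assumed to form an acyclic graph, and sending further correction messages for events whose times changed is left out.

```python
class Event:
    def __init__(self, name, recv_time, duration, back_deps=()):
        self.name = name
        self.recv_time = recv_time
        self.duration = duration
        self.back_deps = list(back_deps)
        self.start_time = 0.0
        self.end_time = 0.0

    def eff_recv_time(self):
        return max([self.recv_time] + [d.end_time for d in self.back_deps])


def rebuild(timeline, events):
    """Reinsert events at the end of timeline in effRecvTime order."""
    pending = list(events)
    while pending:
        # Only consider events whose backward dependents are already placed,
        # so their effRecvTime is computed from up-to-date end times.
        ready = [e for e in pending
                 if not any(d in pending for d in e.back_deps)]
        nxt = min(ready, key=lambda e: e.eff_recv_time())
        pending.remove(nxt)
        last_end = timeline[-1].end_time if timeline else 0.0
        nxt.start_time = max(nxt.eff_recv_time(), last_end)
        nxt.end_time = nxt.start_time + nxt.duration
        timeline.append(nxt)


def apply_correction(timeline, event, new_recv_time):
    old_idx = timeline.index(event)
    event.recv_time = new_recv_time
    new_eff = event.eff_recv_time()
    # Step 1: earliest position that can change is either where the event
    # was, or where it now belongs.
    new_idx = next((i for i, e in enumerate(timeline)
                    if e is not event and e.eff_recv_time() > new_eff),
                   len(timeline))
    cut = min(old_idx, new_idx)
    # Step 2: cut the timeline from that event.
    tail = timeline[cut:]
    del timeline[cut:]
    # Steps 3 and 4: recompute effRecvTimes and reinsert in that order.
    rebuild(timeline, tail)


m1 = Event("M1", 0.0, 2.0)
m2 = Event("M2", 3.0, 2.0)
m3 = Event("M3", 6.0, 2.0)
timeline = []
rebuild(timeline, [m1, m2, m3])
apply_correction(timeline, m3, new_recv_time=1.0)  # M3 actually arrives early
```

After the correction, M3 moves ahead of M2 and M2's start time is pushed back, so M2 is an event that would in turn generate corrections for the messages it sent.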

Optimizations to online correction scheme • Overwrite old corrections: • An event can get multiple correction messages. • Reduce the number of corrections • Same scheme if correction message arrives earlier than the message itself • Use multisend • Messages destined to same real processor but different events can be sent collectively.
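A sketch of two of these optimizations: overwriting older corrections for the same event, and grouping corrections by destination for a multisend. `CorrectionBuffer`, `group_for_multisend`, and the `real_pe_of` mapping are hypothetical names, and the sketch assumes that a later correction for an event always supersedes an earlier one.

```python
from collections import defaultdict


class CorrectionBuffer:
    """Keeps only the most recent correction per event."""

    def __init__(self):
        self.pending = {}  # event_id -> latest corrected recv_time

    def add(self, event_id, new_recv_time):
        # A newer correction overwrites the old one instead of queueing behind it.
        self.pending[event_id] = new_recv_time

    def take(self, event_id, default):
        # Also covers a correction that arrived before the message itself:
        # when the message shows up, use the stored value if there is one.
        return self.pending.pop(event_id, default)


def group_for_multisend(corrections, real_pe_of):
    """Group (event_id, new_recv_time) pairs by destination real processor."""
    batches = defaultdict(list)
    for event_id, new_recv_time in corrections:
        batches[real_pe_of[event_id]].append((event_id, new_recv_time))
    return dict(batches)


buf = CorrectionBuffer()
buf.add("e7", 12.0)
buf.add("e7", 9.5)                      # replaces the 12.0 correction
recv_e7 = buf.take("e7", default=15.0)  # corrected value is used
recv_e8 = buf.take("e8", default=15.0)  # no correction: keep original time

batches = group_for_multisend(
    [("e1", 3.0), ("e2", 4.0), ("e3", 5.0)],
    real_pe_of={"e1": 0, "e2": 1, "e3": 0},
)
```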

More optimizations • Prioritize messages based on their predicted recvTime. • Lazy processing • Process correction messages periodically. • Allows corrections to be overwritten. • Batch processing • Process many correction messages at a time • Many events will be affected • Choose the earliest and reinsert in the order of effRecvTime. • Ability to start corrections in the middle • Can ignore the startup events for timing correction
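Prioritizing by predicted recvTime can be sketched with a binary heap from Python's standard `heapq` module. The `PriorityInbox` class is a hypothetical illustration, not the emulator's actual message queue.

```python
import heapq


class PriorityInbox:
    """Delivers messages in order of predicted recvTime, smallest first."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: equal times keep insertion order

    def push(self, recv_time, payload):
        heapq.heappush(self._heap, (recv_time, self._seq, payload))
        self._seq += 1

    def pop(self):
        recv_time, _, payload = heapq.heappop(self._heap)
        return recv_time, payload

    def __len__(self):
        return len(self._heap)


inbox = PriorityInbox()
inbox.push(10.0, "B")
inbox.push(4.0, "A")
inbox.push(7.0, "C")
order = [inbox.pop()[1] for _ in range(len(inbox))]
```

Executing messages in this order reduces the number of out-of-order executions among messages that are already queued, which in turn reduces the corrections needed later.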

Timing correction still very slow • Observations: • Don’t let the execution go far ahead of the correction wave • A large difference means many wrong events to be corrected • Following the execution wave too closely may not help either • A new scheme • Similar to the one used for GVT (global virtual time)

GVT-like scheme • Use a heartbeat • Periodically broadcast a request for the GVT • The GVT • Is the time after which events are invalid due to pending corrections • Is computed as the minimum of the predicted recvTimes of all correction messages and the startTimes of all affected events • Use a parameter “leash”: execution of the program cannot go beyond “GVT + leash”
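A sketch of the GVT computation and the leash check. The function names are hypothetical, the per-processor values are made-up numbers, and the global minimum that the heartbeat would gather across processors is modelled here as a plain `min` over a list. Treating an event exactly at GVT + leash as allowed is also an assumption.

```python
def local_gvt(correction_recv_times, affected_start_times):
    """Minimum over pending corrections and affected events on one processor."""
    candidates = list(correction_recv_times) + list(affected_start_times)
    # With nothing pending, this processor places no bound on the GVT.
    return min(candidates) if candidates else float("inf")


def may_execute(event_start_time, gvt, leash):
    # Execution cannot go beyond GVT + leash.
    return event_start_time <= gvt + leash


# One heartbeat round over three simulating processors.
local_values = [
    local_gvt([40.0, 55.0], [48.0]),
    local_gvt([], [62.0]),
    local_gvt([], []),
]
gvt = min(local_values)  # stands in for the global reduction
leash = 20.0
allowed = may_execute(58.0, gvt, leash)
blocked = may_execute(61.0, gvt, leash)
```

A small leash keeps execution close to the correction wave at the cost of more waiting; a large leash lets execution run ahead and produce more events that later need correcting.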

Projections before correction

Projections after correction

Correctness of the scheme (using Jacobi1D)

Predicted time vs latency factor

Predicted speedup

More work • Ongoing work • Make sure gvt scheme is correct • Future work • The presented scheme is on-line correction • Explore the off-line (post-mortem) correction scheme using generated traces.
