Blue Gene Simulator
Gengbin Zheng (gzheng@uiuc.edu), Gunavardhan Kakulapati (kakulapa@uiuc.edu)
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
Overview
• Blue Gene Emulator
• Blue Gene Simulator
• Timing correction schemes
• Performance and results
Emulation on a Parallel Machine
[Figure labels: BG/C nodes; hardware thread; simulating (host) processor]
Blue Gene Emulator: functional view
[Figure: one Blue Gene/C node, showing communication threads, worker threads, inBuffer, CorrectionQ, affinity message queues, and non-affinity message queues]
Blue Gene Emulator: functional view
[Figure: two emulated nodes, each with communication threads, worker threads, inBuff, CorrectionQ, affinity message queues, and non-affinity message queues, together with the Converse scheduler and Converse Q]
What it is capable of
• Blue Gene API support
• Blue Gene Charm++
• Structured Dagger
• Trace Projections
Emulator to Simulator
• Emulator:
  • Study the programming model and application development
• Simulator:
  • Adds performance prediction capability
  • Models communication latency based on a network model
  • Does not model on-chip memory access or network contention
Simulator
• Parallel performance is hard to model
  • Communication subsystem
  • Out-of-order messages
  • Communication/computation overlap
  • Event dependencies
• Parallel Discrete Event Simulation
  • The emulation program executes in parallel, with event time stamp correction
  • Exploits the inherent determinacy of the application
How to simulate?
• Time stamping events
  • Per-thread timer (all threads share one physical timer)
  • Time stamp messages
  • Calculate communication latency based on the network model
• Parallel event simulation
  • When a message is sent out, calculate the predicted arrival time at the destination Blue Gene processor
  • When a message is received, update the current time: currTime = max(currTime, recvTime)
  • Time stamp correction
Time stamping messages and threads
• Thread timer: curT
• Message sent: RecvT(msg) = curT + Latency
• Message scheduled: curT = max(curT, RecvT(msg))
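The two time-stamping rules can be sketched in a few lines of Python. This is a minimal illustration only: the names `ThreadTimer`, `Message`, `compute`, `send`, and `schedule` are hypothetical, and the latency is passed in as a plain number rather than derived from any particular network model.

```python
from dataclasses import dataclass


@dataclass
class Message:
    recv_time: float  # predicted arrival time at the destination


class ThreadTimer:
    """Virtual timer for one emulated thread (curT on the slide)."""

    def __init__(self) -> None:
        self.cur_time = 0.0

    def compute(self, duration: float) -> None:
        # Advance virtual time by the cost of local work.
        self.cur_time += duration

    def send(self, latency: float) -> Message:
        # Message sent: RecvT(msg) = curT + Latency
        return Message(recv_time=self.cur_time + latency)

    def schedule(self, msg: Message) -> None:
        # Message scheduled: curT = max(curT, RecvT(msg))
        self.cur_time = max(self.cur_time, msg.recv_time)


sender, receiver = ThreadTimer(), ThreadTimer()
sender.compute(5.0)
msg = sender.send(latency=2.0)  # predicted arrival at t = 7.0
receiver.compute(3.0)           # receiver has only reached t = 3.0
receiver.schedule(msg)          # receiver jumps forward to t = 7.0

busy = ThreadTimer()
busy.compute(10.0)
busy.schedule(msg)              # already past 7.0, so the timer stays at 10.0
```

The last two lines show the case that motivates correction: a thread whose timer is already ahead of the message's predicted arrival executes it "in the future".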
Need for timestamp correction
• Time stamp correction is needed for out-of-order messages
• Out-of-order delivery can occur:
  • A message arrives late, after some other message has already advanced the thread time into the future
  • The late message then executes in the context of the future, although its predicted time is earlier
Parallel correction algorithm
• Sort message execution by receive time
• Adjust time stamps when needed
• Use correction messages to announce a change in an event's startTime
• Send correction messages along the same path the original messages were sent
• Events already in the timeline may have to move
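A rough sketch of the local part of this algorithm, for a single timeline, might look as follows. The `Event` class and `correct_linear` function are hypothetical names, and the sketch only reports which events changed instead of actually sending correction messages.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    recv_time: float   # predicted receive time of the triggering message
    duration: float    # execution time of the event
    start_time: float = 0.0

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration


def correct_linear(executed: list) -> list:
    """Re-sort events by recv_time and recompute start times in place.

    Returns the names of events whose start time changed; in the scheme
    above, those events would send correction messages downstream.
    """
    old_starts = {e.name: e.start_time for e in executed}
    executed.sort(key=lambda e: e.recv_time)
    clock = 0.0
    for e in executed:
        e.start_time = max(clock, e.recv_time)
        clock = e.end_time
    return [e.name for e in executed if e.start_time != old_starts[e.name]]


# M8 was predicted for t = 4 but ran late, at t = 11, after M2.
timeline = [
    Event("M1", recv_time=0.0, duration=2.0, start_time=0.0),
    Event("M2", recv_time=10.0, duration=1.0, start_time=10.0),
    Event("M8", recv_time=4.0, duration=3.0, start_time=11.0),
]
changed = correct_linear(timeline)
order = [e.name for e in timeline]
```

After the call, M8 sits between M1 and M2 and starts at its predicted time, and only M8 needs to announce a new start time.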
Timestamps Correction
[Figure: execution timeline ordered by RecvTime: M1 M2 M3 M4 M5 M6 M7 M8]
Timestamps Correction
[Figure: execution timeline after M8 is moved: M1 M2 M3 M8 M4 M5 M6 M7]
Timestamps Correction
[Figure: timeline M1 M2 M3 M4 M5 M6 M7 M8 becomes M1 M2 M3 M8 M4 M5 M6 M7, and a correction message is sent]
Timestamps Correction
[Figure: a correction message for M4 arrives at the timeline M1 M2 M3 M4 M5 M6 M7; M4 moves earlier (M1 M2 M4 M3 M5 M6 M7) or later (M1 M2 M3 M5 M6 M4 M7), and further correction messages are sent]
Linear-order correction
• Works only when
  • Programs have no alternate orders of execution
  • Messages are processed in the same order across executions
• E.g., MPI programs with no wildcard receives; Structured Dagger code with no “overlap” or “forall”
Reasons:
• The correction algorithm breaks dependency logic
  • It is based only on receive time
• Cases:
  • An event depends on several messages
    • The last message triggers the computation
  • A message is buffered until some condition holds
• Example where the correction scheme is invalid: Jacobi-1D
Solution
• Use Structured Dagger to retrieve dependence information
• As the program runs, form a chain of Blue Gene logs that preserves the dependency information
• Blue Gene logs are kept for entry functions and Structured Dagger functions
Timestamp correction scheme
• Every event has a list of backward and forward dependents
• An event cannot start until its backward dependents have finished
• Define effRecvTime = max(recvTime, endOfBackDeps)
• An event can start only after its effRecvTime: startTime = max(effRecvTime, timeline.last.endTime)
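The two formulas translate almost directly into code. The sketch below assumes a hypothetical `Event` class that stores only backward dependents, and a hypothetical helper `append_event` that places an event at the end of a timeline.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    recv_time: float
    duration: float
    back_deps: list = field(default_factory=list)  # events that must finish first
    start_time: float = 0.0

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration

    def eff_recv_time(self) -> float:
        # effRecvTime = max(recvTime, endOfBackDeps)
        end_of_back_deps = max((d.end_time for d in self.back_deps), default=0.0)
        return max(self.recv_time, end_of_back_deps)


def append_event(timeline: list, event: Event) -> None:
    # startTime = max(effRecvTime, timeline.last.endTime)
    last_end = timeline[-1].end_time if timeline else 0.0
    event.start_time = max(event.eff_recv_time(), last_end)
    timeline.append(event)


timeline = []
a = Event("A", recv_time=0.0, duration=4.0)
b = Event("B", recv_time=1.0, duration=1.0, back_deps=[a])
append_event(timeline, a)  # A runs from t = 0 to t = 4
append_event(timeline, b)  # B's message arrived at t = 1, but B must wait for A
```

Here B's message is received at t = 1, yet its effective receive time is 4 because it cannot start before A ends.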
Timestamp correction scheme
• Unlike the previous case, the timeline is not sorted on the recvTime of the events
• The timeline is sorted on effRecvTime
• Steps to process a correction message:
  • Find the earliest event updated by the message
  • Cut the timeline from that event
  • Calculate new effRecvTimes from that point on
  • Reinsert the events into the timeline in order of effRecvTime
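One way to picture the cut-and-reinsert steps is the sketch below. It is an illustration under assumptions, not the simulator's code: `Event`, `place`, and `apply_correction` are hypothetical names, all events live on one timeline, and an event is reinserted only once the backward dependents that were also cut have been placed.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare events by identity, not by field values
class Event:
    name: str
    recv_time: float
    duration: float
    back_deps: list = field(default_factory=list)
    start_time: float = 0.0

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration

    def eff_recv_time(self) -> float:
        back_end = max((d.end_time for d in self.back_deps), default=0.0)
        return max(self.recv_time, back_end)


def place(timeline: list, event: Event) -> None:
    last_end = timeline[-1].end_time if timeline else 0.0
    event.start_time = max(event.eff_recv_time(), last_end)
    timeline.append(event)


def apply_correction(timeline: list, event: Event, new_recv_time: float) -> None:
    event.recv_time = new_recv_time
    new_eff = event.eff_recv_time()
    # Earliest affected position: the event itself, or the first event
    # that should now come after it.
    cut = next(i for i, e in enumerate(timeline)
               if e is event or e.eff_recv_time() > new_eff)
    pending = timeline[cut:]  # cut the timeline from that point
    del timeline[cut:]
    while pending:
        # An event is ready once none of its backward dependents is pending.
        ready = [e for e in pending
                 if not any(d in pending for d in e.back_deps)]
        nxt = min(ready, key=lambda e: e.eff_recv_time())
        pending.remove(nxt)
        place(timeline, nxt)  # recomputes the start time on reinsertion


a = Event("A", 0.0, 2.0)
b = Event("B", 3.0, 2.0)
c = Event("C", 10.0, 1.0)
d = Event("D", 4.0, 1.0, back_deps=[c])  # D cannot start before C ends
timeline = []
for e in (a, b, c, d):
    place(timeline, e)

apply_correction(timeline, c, new_recv_time=1.0)  # C actually arrives much earlier
order = [e.name for e in timeline]
```

In this example C moves ahead of B, and D, which was waiting on C, is pulled forward as well.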
Non-linear order correction scheme
• The new scheme:
  • Takes event dependencies into account
  • Works even when messages can be received in different orders in different runs
  • Requires all dependencies to be captured using Structured Dagger
• But the timing correction is very slow; several optimizations are possible
Optimizations to online correction scheme
• Overwrite old corrections:
  • An event can get multiple correction messages
  • Overwriting reduces the number of corrections
  • The same scheme applies if a correction message arrives earlier than the message itself
• Use multisend
  • Messages destined for the same real processor but different events can be sent collectively
More optimizations
• Prioritize messages based on their predicted recvTime
• Lazy processing
  • Process correction messages periodically
  • Allows corrections to be overwritten
• Batch processing
  • Process many correction messages at a time
  • Many events will be affected
  • Choose the earliest and reinsert in order of effRecvTime
• Ability to start corrections in the middle
  • Can ignore the startup events for timing correction
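Overwriting, lazy processing, and batching fit together naturally: corrections are buffered per event, a newer correction replaces an older one, and the buffer is drained periodically. A minimal sketch, assuming a hypothetical `CorrectionBuffer` class keyed by an event identifier:

```python
class CorrectionBuffer:
    """Holds pending corrections until the next periodic flush."""

    def __init__(self) -> None:
        self._pending = {}  # event id -> latest corrected recvTime

    def add(self, event_id: str, new_recv_time: float) -> None:
        # A later correction for the same event overwrites the earlier one.
        self._pending[event_id] = new_recv_time

    def flush(self) -> list:
        # Hand back the whole batch, earliest corrected time first, so the
        # caller can cut the timeline once at the earliest affected event.
        batch = sorted(self._pending.items(), key=lambda item: item[1])
        self._pending.clear()
        return batch


buf = CorrectionBuffer()
buf.add("M4", 9.0)
buf.add("M7", 5.0)
buf.add("M4", 6.0)         # replaces the earlier correction for M4
first_batch = buf.flush()  # two corrections instead of three
second_batch = buf.flush() # nothing left to process
```

The longer the interval between flushes, the more corrections can be overwritten, at the cost of the timeline staying stale for longer.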
Timing correction still very slow
• Observations:
  • Don’t let the execution go far ahead of the correction wave
    • A large difference means many wrong events to be corrected
  • Closely following the execution wave also may not help
• A new scheme
  • Similar to the one used for GVT (global virtual time)
GVT-like scheme
• Use a heartbeat
  • Periodically broadcast a request for the GVT
• GVT
  • Is the time after which events are invalid due to pending corrections
  • Computed as the minimum of the predicted recvTimes of all correction messages and the startTimes of all affected events
• Use a parameter “leash”: execution of the program cannot go beyond “gvt + leash”
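The GVT computation and the leash check can be sketched as two small functions. The names `compute_gvt` and `may_execute` are hypothetical, and the sketch ignores how the values are gathered across processors (for example by a min-reduction after each heartbeat).

```python
def compute_gvt(correction_recv_times, affected_start_times) -> float:
    """Minimum over pending correction recvTimes and affected event startTimes."""
    candidates = list(correction_recv_times) + list(affected_start_times)
    # With nothing pending, no event is invalid, so the bound is infinite.
    return min(candidates) if candidates else float("inf")


def may_execute(event_time: float, gvt: float, leash: float) -> bool:
    # Execution cannot go beyond gvt + leash.
    return event_time <= gvt + leash


gvt = compute_gvt(correction_recv_times=[12.0, 8.0],
                  affected_start_times=[9.5])
allowed = may_execute(10.0, gvt, leash=3.0)  # within gvt + leash
blocked = may_execute(11.5, gvt, leash=3.0)  # too far ahead, must wait
```

A small leash keeps execution close to the correction wave and limits wasted work; a large one lets execution run ahead but risks more events needing correction.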
More work
• Ongoing work
  • Make sure the GVT scheme is correct
• Future work
  • The presented scheme is on-line correction
  • Explore an off-line (post-mortem) correction scheme using generated traces