Two Lectures on Grid Computing Keshav Pingali Cornell University
What is grid computing? • Two models: • Distributed collaboration: • Large system is built from software components and instruments residing at different sites on the Internet • High-performance distributed computation • Utility computing: • Metaphor of the electrical power grid • Programs execute wherever there are computational resources • Programs are mobile to take advantage of changing resource availability
Application-level Checkpoint-Restart for High-performance Computing Keshav Pingali Cornell University Joint work with Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Paul Stodghill
Trends in HPC (i) • Old picture of high-performance computing: • Turn-key big-iron platforms • Short-running codes • Modern high-performance computing: • Large clusters of commodity parts • Lower reliability than turn-key systems • Example: MTBF of ASCI is 6 hours • Long-running codes • Protein-folding on Blue Gene/L: one year • Program runtimes greatly exceed mean time to failure ⇒ Fault-tolerance is critical
Fault tolerance • Fault tolerance comes in different flavors • Mission-critical systems: (eg) air traffic control system • No down-time, fail-over, redundancy • Computational science applications • Restart after failure, minimizing lost work • Fault models • Fail-stop: • Failed process dies silently • No corrupted messages or shared data • Byzantine: • Arbitrary misbehavior is allowed • Our focus: • Computational science applications • Fail-stop faults • One solution: checkpoint/restart (CPR)
Trends in HPC (ii) • Grid computing: utility computing • Programs execute wherever there are computational resources • Programs are mobile to take advantage of changing resource availability • Key mechanism: checkpoint/restart • Identical platforms at different sites on the grid • Platform-dependent checkpoints (cf. Condor) • Different platforms at different sites on the grid • Platform-independent (portable) checkpoints
Checkpoint/restart (CPR) • System-level checkpointing (SLC), (eg) Condor • core-dump-style snapshots of computations • mechanisms are very architecture- and OS-dependent • checkpoints are not portable • Application-level checkpointing (ALC) • programs are self-checkpointing and self-restarting • (eg) n-body codes save and restore positions and velocities of particles • amount of state saved can be much smaller than with SLC • IBM's BlueGene protein folding • Megabytes vs terabytes • Alegra (Sandia) • App-level restart file only 5% of core size • Disadvantages of current application-level checkpointing • manual implementation • requires global barriers in programs
Our Approach • Automate application-level checkpointing • Minimize programmer effort • Generalize to arbitrary MPI programs w/o barriers [Architecture diagram: Original Application → Precompiler → Application + State-saving, running over a Thin Coordination Layer (with failure detector) on top of each process's MPI Implementation and a reliable communication layer]
Outline • Saving single-process state • stack, heap, globals • MPI programs: [PPoPP 2003,ICS2003,SC2004] • coordination of single-process states into a global snapshot • non-blocking protocol: no barriers needed in program • OpenMP programs:[EWOMP 2004, ASPLOS 2004] • blocking protocol • barriers and locks • Ongoing work • portable checkpoints • program analysis to reduce amount of saved state
System Architecture: Single-Processor Checkpointing [Same architecture diagram: Original Application → Precompiler → Application + State-saving, over the Thin Coordination Layer, MPI Implementation, and reliable communication layer]
Precompiler • Where to checkpoint • At calls to potentialCheckpoint() function • Mandatory calls in main process (initiator) • Other calls are optional • Process checks if a global checkpoint has been requested, and if so, joins the protocol to save its state • Inserted by programmer or automated tool • Currently inserted by programmer • Transformed program can save its state only at calls to potentialCheckpoint()
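A minimal sketch of how such call sites might look in an iterative MPI code. Only the hook name potentialCheckpoint() comes from the slide; the solver loop and the routine's placement are hypothetical.

    /* Hedged sketch: potentialCheckpoint() is the hook named on the slide;
       everything else is illustrative, not the actual C3 runtime. */
    void potentialCheckpoint(void);   /* assumed: provided by the checkpointing runtime */

    void solve(double *u, int n, int iters) {
        for (int it = 0; it < iters; it++) {
            /* ... one iteration of computation and MPI communication ... */

            /* The process checks here whether a global checkpoint has been
               requested and, if so, joins the protocol and saves its state. */
            potentialCheckpoint();
        }
    }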
Saving Application State • Heap • special malloc that tracks memory that is allocated and freed • Stack • Location stack (LS): tracks which function invocations led to the place where the checkpoint was taken • Variable Description Stack (VDS): records local variables in these function invocations that must be saved • On recovery • LS is used to re-execute the sequence of function invocations and re-create stack frames • VDS is used to restore variables into the stack • Similar to approach used in Dome [Beguelin et al.] • Globals • precompiler inserts statements to save them
Example

    main() {
      int a;
      VDS.push(&a, sizeof a);
      if (restart) {
        load LS; copy LS to LS.old;
        jump dequeue(LS.old);
      }
      // …
      LS.push(label1);
    label1:
      function();
      LS.pop();
      // …
      VDS.pop();
    }

    function() {
      int b;
      VDS.push(&b, sizeof b);
      if (restart)
        jump dequeue(LS.old);
      // …
      LS.push(label2);
      potentialCheckpoint();
    label2:
      if (restart) {
        load VDS; restore variables;
      }
      LS.pop();
      // …
      VDS.pop();
    }
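The heap side of this scheme can be pictured as a tracking allocator: every live block stays on a list so the checkpointer can walk the heap and write it out. A minimal sketch; the list layout and the names ckpt_malloc, ckpt_free, and save_heap are assumptions, not the actual C3 implementation.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct block {
        struct block *next;
        size_t size;
        /* payload follows the header */
    } block_t;

    static block_t *live_blocks = NULL;

    void *ckpt_malloc(size_t size) {
        block_t *b = malloc(sizeof(block_t) + size);
        if (!b) return NULL;
        b->size = size;
        b->next = live_blocks;
        live_blocks = b;
        return b + 1;                       /* hand back the payload */
    }

    void ckpt_free(void *p) {
        block_t *b = (block_t *)p - 1;
        block_t **cur = &live_blocks;
        while (*cur && *cur != b)
            cur = &(*cur)->next;
        if (*cur) *cur = b->next;           /* unlink from the live list */
        free(b);
    }

    void save_heap(FILE *out) {             /* called at checkpoint time */
        for (block_t *b = live_blocks; b; b = b->next) {
            fwrite(&b->size, sizeof b->size, 1, out);
            fwrite(b + 1, 1, b->size, out);
        }
    }

Because only live blocks are written, the saved heap can be far smaller than a core-dump-style snapshot, which is the ALC advantage claimed earlier.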
Sequential Experiments (vs Condor) • Checkpoint sizes are comparable.
Ongoing work: reducing saved state • Automatic placement of potentialCheckpoint calls • Program analysis to determine live data at a checkpoint [with Radu Rugina] • Incremental state-saving • Recomputation vs saving state • e.g. protein folding, A*B = C • Prior work: • CATCH (Fuchs, Urbana): uses runtime learning rather than static analysis • Beck, Plank and Kingsley (UTK): memory exclusion analysis of static data
Outline • Saving single-process state • Stack, heap, globals, locals, … • MPI programs: • coordination of single-process states into a global snapshot • non-blocking protocol • OpenMP programs: • blocking protocol • barriers and locks • Ongoing work
System Architecture: Distributed Checkpointing [Same architecture diagram: Original Application → Precompiler → Application + State-saving, over the Thin Coordination Layer, MPI Implementation, and reliable communication layer]
Need for Coordination • Horizontal lines – events in each process • Recovery line • line connecting checkpoints on each processor • represents global system state on recovery • Problem with communication • messages may cross recovery line [Timeline diagram: processes P and Q with their checkpoints; past, future, early, and late messages relative to the recovery line]
Late Messages • Record message data at receiver as part of checkpoint • On recovery, re-read recorded message data [Diagram: a late message crossing the recovery line — sent before the sender's checkpoint, received after the receiver's checkpoint]
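One way to picture the late-message rule is a receive wrapper that logs any message sent in the previous epoch so it can be re-read on restart. A sketch under stated assumptions: my_epoch, late_log, and the convention that the matching send wrapper prepends the sender's epoch are all hypothetical, not the actual coordination layer.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    extern int my_epoch;        /* assumed: this process's current epoch */
    extern FILE *late_log;      /* assumed: log that becomes part of the checkpoint */

    int logged_recv(void *buf, int count, MPI_Datatype dt,
                    int src, int tag, MPI_Comm comm, MPI_Status *st) {
        int elem_size;
        MPI_Type_size(dt, &elem_size);
        int nbytes = count * elem_size;

        /* The matching send wrapper is assumed to prepend the sender's epoch. */
        char *wire = malloc(sizeof(int) + nbytes);
        int rc = MPI_Recv(wire, (int)sizeof(int) + nbytes, MPI_BYTE, src, tag, comm, st);

        int sender_epoch;
        memcpy(&sender_epoch, wire, sizeof(int));
        memcpy(buf, wire + sizeof(int), nbytes);

        if (sender_epoch < my_epoch) {      /* late: sent before the recovery line */
            fwrite(&nbytes, sizeof nbytes, 1, late_log);
            fwrite(buf, 1, nbytes, late_log);
        }
        free(wire);
        return rc;
    }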
Early Messages • Must suppress the resending of the message on recovery [Diagram: an early message crossing the recovery line — sent after the sender's checkpoint, received before the receiver's checkpoint]
Early Messages • Must suppress the resending of the message on recovery • What about non-deterministic events before the send? • Must ensure the application generates the same early message on recovery • Record and replay all non-deterministic events between checkpoint and send • In our applications, non-determinism arises from wild-card MPI receives [Diagram: an early message crossing the recovery line from P's post-checkpoint execution to Q's pre-checkpoint execution]
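A sketch of how record/replay of wild-card receives could look: on the original run, log which rank actually matched MPI_ANY_SOURCE; on re-execution, post the receive to that specific rank so the same early message is regenerated. The flags replaying and recording and the log nd_log are assumptions, and the sketch glosses over exactly when logging is active; wild-card tags would be handled the same way.

    #include <mpi.h>
    #include <stdio.h>

    extern int replaying;       /* assumed: true while re-executing after a restart */
    extern int recording;       /* assumed: true between the checkpoint and stopRecording */
    extern FILE *nd_log;        /* assumed: log of non-deterministic outcomes */

    int replayed_recv(void *buf, int count, MPI_Datatype dt,
                      int src, int tag, MPI_Comm comm, MPI_Status *st) {
        int wildcard = (src == MPI_ANY_SOURCE);

        if (wildcard && replaying)
            fread(&src, sizeof src, 1, nd_log);        /* force the original match */

        int rc = MPI_Recv(buf, count, dt, src, tag, comm, st);

        if (wildcard && recording && !replaying)
            fwrite(&st->MPI_SOURCE, sizeof(int), 1, nd_log);  /* remember who matched */
        return rc;
    }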
Difficulty of Coordination • No communication ⇒ no coordination necessary [Diagram: processes P and Q checkpoint independently; no messages cross the recovery line]
Difficulty of Coordination • No communication ⇒ no coordination necessary • BSP-style programs ⇒ checkpoint at barrier [Diagram: only past and future messages; the recovery line at the barrier is not crossed]
Difficulty of Coordination • No communication ⇒ no coordination necessary • BSP-style programs ⇒ checkpoint at barrier • General MIMD programs • System-level checkpointing (eg. Chandy-Lamport) • Forces checkpoints to avoid early messages • Only consistent cuts [Diagram: past, future, and late messages around the recovery line; no early messages]
Difficulty of Coordination • No communication ⇒ no coordination necessary • BSP-style programs ⇒ checkpoint at barrier • General MIMD programs • System-level checkpointing (eg. Chandy-Lamport) • Only late messages: consistent cuts • Application-level checkpointing • Checkpoint locations fixed, may not force • Late and early messages: inconsistent cuts • Requires new protocol [Diagram: past, future, late, and early messages all surrounding or crossing the recovery line]
MPI-specific issues • Non-FIFO communication – tags • Non-blocking communication • Collective communication • MPI_Reduce(), MPI_AllGather(), MPI_Bcast()… • Internal MPI library state • Visible: • non-blocking request objects, datatypes, communicators, attributes • Invisible: • buffers, address mappings, etc.
Outline • Saving single-process state • Stack, heap, globals, locals, … • MPI programs: • coordination of single-process states into a global snapshot • non-blocking protocol • OpenMP programs: • blocking protocol • barriers and locks • Ongoing work
The Global View • A program's execution is divided into a series of disjoint epochs • Epochs are separated by recovery lines • A failure in Epoch n means all processes roll back to the prior recovery line [Timeline diagram: initiator, process P, and process Q progressing through Epoch 0, Epoch 1, Epoch 2, …, Epoch n, separated by recovery lines]
Protocol Outline (I) • Initiator checkpoints, sends a pleaseCheckpoint message to all others • After receiving this message, each process checkpoints at the next available spot • Sends every other process Q the number of messages it sent to Q in the last epoch [Diagram: recovery line starting at the initiator's checkpoint; pleaseCheckpoint messages to processes P and Q]
Protocol Outline (II) • After checkpointing, each process keeps a record, containing: • data of messages from last epoch (late messages) • non-deterministic events: • In our applications, non-determinism arises mainly from wild-card MPI receives [Diagram: after the pleaseCheckpoint message, processes P and Q enter a recording phase]
Protocol Outline (IIIa) • Globally, ready to stop recording when • all processes have received their late messages • all processes have sent their early messages
Protocol Outline (IIIb) • Locally, when a process has received all its late messages, it sends a readyToStopRecording message to the initiator [Diagram: processes P and Q sending readyToStopRecording to the initiator]
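The local test follows from the per-peer counts exchanged right after pleaseCheckpoint (Protocol Outline I). A sketch of that check; the array names and nprocs are hypothetical.

    /* Sketch: compare late messages received so far against the counts each
       peer reported for the last epoch. Names are illustrative only. */
    extern int nprocs;
    extern int late_expected[];   /* messages each peer says it sent me in the last epoch */
    extern int late_received[];   /* late messages actually received so far */

    int ready_to_stop_recording(void) {
        for (int p = 0; p < nprocs; p++)
            if (late_received[p] < late_expected[p])
                return 0;          /* still waiting for a late message from peer p */
        return 1;                  /* can send readyToStopRecording to the initiator */
    }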
Protocol Outline (IV) • When the initiator receives readyToStopRecording from everyone, it sends stopRecording to everyone • A process stops recording when it receives • stopRecording message from initiator OR • message from a process that has itself stopped recording [Diagram: initiator sends stopRecording to P and Q; an application message from a process that has already stopped recording also ends recording at the receiver]
Protocol Discussion • Why can't we just wait to receive the stopRecording message? • Our record would depend on a non-deterministic event, invalidating it. • The application message may be different or may not be resent on recovery. [Diagram: an application message from a stopped process arrives at P before the initiator's stopRecording message]
Protocol Details • Piggyback 4 bytes of control data to tell if a message is late, early, etc. • Can be reduced to 2 bits • On recovery • reinitialize MPI library using MPI_Init() • restore individual process states • recreate datatypes and communicators • repost non-blocking receives • ensure that all calls to MPI_Send() and MPI_Recv() are suppressed/fed with data as necessary
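A sketch of the piggybacking idea: attach the sender's epoch number (the "4 bytes of control data", reducible to 2 bits) to each outgoing message so the receiver can classify it. The wire format, my_epoch, and the restriction to contiguous datatypes are assumptions for illustration.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    extern int my_epoch;    /* assumed: current epoch of this process */

    /* Prepend the sender's epoch so the receiver can classify the message.
       Contiguous datatypes only, for brevity. */
    int piggybacked_send(const void *buf, int count, MPI_Datatype dt,
                         int dest, int tag, MPI_Comm comm) {
        int elem_size;
        MPI_Type_size(dt, &elem_size);
        int nbytes = count * elem_size;

        char *wire = malloc(sizeof(int) + nbytes);
        memcpy(wire, &my_epoch, sizeof(int));             /* 4 bytes of control data */
        memcpy(wire + sizeof(int), buf, nbytes);

        int rc = MPI_Send(wire, (int)sizeof(int) + nbytes, MPI_BYTE, dest, tag, comm);
        free(wire);
        return rc;
    }

    /* At the receiver (see the receive sketch earlier):
         sender_epoch <  my_epoch  ->  late message, record its data
         sender_epoch == my_epoch  ->  ordinary intra-epoch message
         sender_epoch >  my_epoch  ->  early message, suppress its resend on recovery */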
MPI Collective Communications • Single communication involving multiple processes • Single-Receiver: multiple senders, one receiver e.g. Gather, Reduce • Single-Sender: one sender, multiple receivers e.g. Bcast, Scatter • AlltoAll: every process in group sends data to every other process e.g. AlltoAll, AllGather, AllReduce, Scan • Barrier: everybody waits for everybody else to reach barrier before going on. (Only collective call with explicit synchronization guarantee)
Possible Solutions • We have a protocol for point-to-point messages. Why not reimplement all collectives as point-to-point messages? • Lots of work and less efficient than native implementation. • Checkpoint collectives directly without breaking them up. • May be complex but requires no reimplementation of MPI internals.
AlltoAll Example • Data flows represent application-level semantics of how data travels • Do NOT correspond to real messages • Used to reason about the application's view of communications • AlltoAll data flows form a bidirectional clique since data flows in both directions • Recovery line may be • Before the AlltoAll • After the AlltoAll • Straddling the AlltoAll [Diagram: processes P, Q, and R each calling MPI_AlltoAll(), with data-flow edges between every pair]
Straddling AlltoAll: What to do • Straddling the AlltoAll – only a single case • P-Q and P-R – late/early data flows • Record result and replay for P • Suppress P's call to MPI_AlltoAll • Record/replay non-determinism before P's MPI_AlltoAll call [Diagram: recovery line straddling the AlltoAll; P's result is recorded]
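A sketch of the record-and-replay idea for a process like P whose checkpoint falls after the collective: on the original run the result is saved; on recovery the call is suppressed and the saved result is fed back. The flag replaying, the log coll_log, and the restriction to a contiguous MPI_BYTE AlltoAll are assumptions, not the actual coordination layer.

    #include <mpi.h>
    #include <stdio.h>

    extern int replaying;     /* assumed: true while re-executing after a restart */
    extern FILE *coll_log;    /* assumed: log of collective results, saved with the checkpoint */

    /* Record/replay wrapper for a straddling AlltoAll (MPI_BYTE case for brevity). */
    int replayed_alltoall(const void *sendbuf, void *recvbuf,
                          int nbytes_per_rank, MPI_Comm comm) {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);
        int total = nbytes_per_rank * nprocs;

        if (replaying) {
            /* The other processes already received our contribution before the
               failure, so the call is suppressed and the result is replayed. */
            fread(recvbuf, 1, total, coll_log);
            return MPI_SUCCESS;
        }
        int rc = MPI_Alltoall(sendbuf, nbytes_per_rank, MPI_BYTE,
                              recvbuf, nbytes_per_rank, MPI_BYTE, comm);
        fwrite(recvbuf, 1, total, coll_log);    /* record the result for possible replay */
        return rc;
    }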
Implementation • Two large parallel platforms • Lemieux: Pittsburgh Supercomputing Center • 750 Compaq Alphaserver ES45 nodes • Node: four 1-GHz Alpha processors, 4 GB memory, 38GB local disk • Tru64 Unix operating system. • Quadrics interconnection network. • Velocity 2: Cornell Theory Center • 128 dual-processor Windows cluster • Benchmarks: • NAS suite: CG, LU, SP • SMG2000, HPL
Runtimes on Lemieux • Compare original running time of code with running time using C3 system without taking any checkpoints • Chronic overhead is small (< 10%)
Overheads w/checkpointing on Lemieux • Configuration 1: no checkpoints • Configuration 2: go through motions of taking 1 checkpoint, but nothing written to disk • Configuration 3: write checkpoint data to local disk on each machine • Measurement noise: about 2-3% • Conclusion: relative cost of taking a checkpoint is fairly small
Overhead w/checkpointing on V2 • Overheads on V2 for taking checkpoints are also fairly small.
Outline • Saving single-process state • Stack, heap, globals, locals, … • MPI programs: • coordination of single-process states into a global snapshot • non-blocking protocol • OpenMP programs: • blocking protocol • barriers and locks • Ongoing work
Blocking Protocol • Protocol: • Barrier • Each thread records its own state • Thread 0 records shared state • Barrier [Diagram: threads A and B saving state between the two barriers]
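The four steps map almost directly onto OpenMP constructs. A minimal sketch, assuming hypothetical per-thread and shared save routines (save_thread_state, save_shared_state).

    #include <omp.h>

    extern void save_thread_state(int tid);  /* assumed: saves this thread's private state */
    extern void save_shared_state(void);     /* assumed: saves heap and globals shared by all threads */

    /* Blocking protocol: barrier, each thread saves its own state, thread 0
       saves the shared state, barrier. Must be called by every thread of the team. */
    void blocking_checkpoint(void) {
        #pragma omp barrier
        save_thread_state(omp_get_thread_num());
        if (omp_get_thread_num() == 0)
            save_shared_state();
        #pragma omp barrier
    }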
Problems Due to Barriers • Protocol introduces new barriers into program • May cause errors or deadlock • If checkpoint crosses barrier: • Thread A reaches barrier, waits for thread B • Thread B reaches checkpoint, calls first barrier • Both threads advance • Thread B recording checkpoint • Thread A computing, possibly corrupting checkpoint [Diagram: the checkpoint location crosses an application barrier between threads A and B]
Resolving the Barrier Problem • Cannot allow checkpoints to cross barriers • On recovery some threads before barrier, others after • Illegal situation • Solution: • Whenever checkpoint crosses barrier, force checkpoint before barrier [Diagram: both threads take the forced checkpoint before the application barrier]
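One way to enforce "force the checkpoint before the barrier" is to wrap each application barrier so the team decides together whether a checkpoint is pending and, if so, takes it first. A sketch; checkpoint_requested and the reuse of blocking_checkpoint() from the earlier sketch are assumptions.

    #include <omp.h>

    extern volatile int checkpoint_requested;  /* assumed: set asynchronously by an initiator */
    extern void blocking_checkpoint(void);     /* the routine sketched above */

    static int take_it;                        /* shared decision so all threads agree */

    /* Replacement for an application barrier: decide once, as a team, whether a
       checkpoint is pending and take it before the real barrier, so no recovery
       line can straddle the barrier. */
    void checkpointing_barrier(void) {
        #pragma omp single
        take_it = checkpoint_requested;        /* one thread samples the flag */
        /* implicit barrier at the end of single: every thread sees the same take_it */
        if (take_it)
            blocking_checkpoint();
        #pragma omp barrier
    }

The single construct matters: if each thread read the flag independently, some could enter the checkpoint's barriers while others skipped them, recreating exactly the mismatched-barrier deadlock described above.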
Deadlock due to locks • Suppose checkpoint crosses dependence • Thread B grabs lock, will not release until after checkpoint • Thread A won't checkpoint before it acquires lock • Since checkpoint is barrier: Deadlock [Diagram: thread B holds the lock across its checkpoint while thread A must acquire the lock before reaching its own checkpoint]