
Compiler-Generated Staggered Checkpointing

Presentation Transcript


  1. Compiler-Generated Staggered Checkpointing. Alison N. Norman (Department of Computer Sciences, The University of Texas at Austin), Sung-Eun Choi (Los Alamos National Laboratory), Calvin Lin (Department of Computer Sciences, The University of Texas at Austin)

  2. The Importance of Clusters • Scientific computation is increasingly performed on clusters • Cost-effective: Created from commodity parts • Scientists want more computational power • Cluster computational power is easy to increase by adding processors → Cluster size keeps increasing! The University of Texas at Austin

  3. Clusters Are Not Perfect • Failure rates are increasing • The number of moving parts is growing (processors, network connections, disks, etc.) • Mean Time Between Failure (MTBF) is shrinking How can we deal with these failures? The University of Texas at Austin

  4. Options for Fault-Tolerance • Redundancy in space • Each participating process has a backup process • Expensive! • Redundancy in time • Processes save state and then roll back for recovery • Lighter-weight fault tolerance The University of Texas at Austin

  5. Today’s Answer Programmers place checkpoints • Small checkpoint size • Synchronous • Every process checkpoints in the same place in the code • Global synchronization before and after checkpoints The University of Texas at Austin

  6. What’s the Problem? • Future systems will be larger • Checkpointing will hurt program performance • Many processes checkpointing synchronously will result in network and file system contention • Checkpointing to local disk not viable • Application programmers are only willing to pay 1% overhead for fault-tolerance • The solution: • Avoid synchronous checkpoints The University of Texas at Austin

  7. Solution: Staggered Checkpointing • Spread individual checkpoints in time to reduce network and file system contention • Possible approaches exist • Dynamic---Runtime overhead! • Do not guarantee reduced contention • This talk is going to explain: • Why staggered checkpointing is a good solution • Difficulties of staggered checkpointing • How a compiler can help The University of Texas at Austin

  8. Contributions • Show that synchronous checkpointing will suffer significant contention • Show that staggered checkpointing improves performance: • Reduces checkpoint latency up to a factor of 23 • Enables more frequent checkpoints • Describe a prototype compiler for identifying staggered checkpoints • Show that there is great potential for staggering checkpoints within applications The University of Texas at Austin

  9. Talk Outline • Motivation • Our Solution • Build communication graph • Create vector clocks • Identify recovery lines • Results • Future Work • Related Work • Conclusion The University of Texas at Austin

  10. Understanding Staggered Checkpointing [Animated figure: processes 0, 1, 2, … 64K plotted against time, with checkpoints and recovery lines.] Today: more processes and more data with synchronous checkpoints means contention. Tomorrow: that's easy, we'll stagger the checkpoints… Not so fast: there is communication! If a receive is saved but the matching send is not, the state is inconsistent (it could not have existed) and the recovery line is invalid. If a send is saved but the receive is not, the state is consistent (it could have existed) and the recovery line is valid [Randell 75]. The University of Texas at Austin

  11. Complications with Staggered Checkpointing Checkpoints must be placed carefully: • Want valid recovery lines • Want low contention • Want small state This is difficult! The University of Texas at Austin

  12. Our Solution Compiler places staggered checkpoints • Builds communication graph • Calculates vector clocks • Identifies valid recovery lines The University of Texas at Austin

  13. Assumptions in our Prototype Compiler • Number of nodes known at compile-time • Communication only dependent on: • Node rank • Number of nodes in the system • Other constants • Explicit communication • Implementation assumes MPI The University of Texas at Austin
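
  For illustration only (this example is not from the paper), a pattern that fits these assumptions is a ring exchange whose partner ranks are functions of only the rank and the process count; the helper name ring_step below is hypothetical.

```c
/* Illustrative sketch only (not from the paper): a communication pattern the
 * prototype's assumptions admit, because each partner rank depends only on
 * the process rank and the number of processes. */
#include <mpi.h>

void ring_step(double *buf, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* statically analyzable */
    int left  = (rank - 1 + size) % size;   /* statically analyzable */

    /* Exchange the buffer around the ring. A partner rank read from input
     * data, by contrast, would violate the assumptions above. */
    MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, right, 0, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```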

  14. First Step: Build Communication Graph • Find neighbor at each communication call • Symbolic expression analysis • Constant propagation and folding • Example: MPI_irecv(x, x, x, from_process, …) with from_process = node_rank % sqrt(no_nodes) - 1 • Instantiate each process • Control-dependence analysis • Not all communication calls are executed every time • Match sends with receives, etc. The University of Texas at Austin
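
  As a rough sketch of the "instantiate each process" step (the loop, the NO_NODES value, and the printing are assumptions; the neighbor expression is the illustrative one from the slide), with the node count fixed at compile time the expression folds to a concrete partner for every rank:

```c
/* Sketch: once no_nodes is a compile-time constant, the slide's neighbor
 * expression can be evaluated for every rank, giving one concrete edge of
 * the communication graph per process. NO_NODES = 16 is an assumption. */
#include <math.h>
#include <stdio.h>

enum { NO_NODES = 16 };

int main(void)
{
    for (int node_rank = 0; node_rank < NO_NODES; node_rank++) {
        /* expression taken verbatim from the slide, purely illustrative */
        int from_process = node_rank % (int)sqrt((double)NO_NODES) - 1;
        printf("rank %2d receives from %d\n", node_rank, from_process);
    }
    return 0;
}
```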

  15. Example: Communication Graph [Figure: communication graph for processes 0, 1, and 2 plotted against time.] The University of Texas at Austin

  16. Second Step: Calculate Vector Clocks • Use communication graph • Create vector clocks (we will review!) • Iterate through calls • Track dependences • Keep current clocks with each call The University of Texas at Austin
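
  A minimal sketch of the clock bookkeeping this step performs, assuming the standard vector clock rules; the types and NPROCS below are illustrative, not the compiler's internal representation:

```c
/* Sketch of vector clock maintenance while iterating through the calls in
 * the communication graph. NPROCS and vclock_t are illustrative. */
#define NPROCS 3

typedef struct { int v[NPROCS]; } vclock_t;

/* A local event on process p (including issuing a send): tick p's own entry. */
static void vc_local_event(vclock_t *c, int p)
{
    c->v[p] += 1;
}

/* A receive on process p of a message stamped with the sender's clock:
 * take the component-wise maximum, then tick p's own entry. */
static void vc_receive(vclock_t *c, int p, const vclock_t *sender)
{
    for (int i = 0; i < NPROCS; i++)
        if (sender->v[i] > c->v[i])
            c->v[i] = sender->v[i];
    c->v[p] += 1;
}
```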

  17. Example: Calculate Vector Clocks [Figure: processes 0, 1, and 2 on a time axis; events within each process are numbered [1], [2], [3], …, and each communication call is labeled with its vector clock [P0, P1, P2], e.g. [1,0,0], [2,0,0], [3,2,0], [4,5,2] on process 0.] Vector clocks track events within a process and capture inter-process dependences [Lamport 78]. The University of Texas at Austin

  18. Next Step: Identify All Possible Valid Recovery Lines [Figure: the same vector-clock timeline as the previous slide, with the many possible valid recovery lines drawn across it.] There are so many! Final step: choose some, and then place them in the code… The University of Texas at Austin
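
  The check itself is not spelled out on the slide; a minimal sketch, assuming the standard consistency condition (a line is valid exactly when no checkpoint has observed an event that another process performed after its own checkpoint):

```c
/* Sketch: check whether a candidate recovery line is consistent.
 * NPROCS and vclock_t mirror the vector clock sketch above (assumptions).
 * clocks[p] is the vector clock recorded at process p's candidate checkpoint. */
#define NPROCS 3
typedef struct { int v[NPROCS]; } vclock_t;

static int recovery_line_is_valid(const vclock_t clocks[NPROCS])
{
    for (int i = 0; i < NPROCS; i++)
        for (int j = 0; j < NPROCS; j++)
            if (clocks[j].v[i] > clocks[i].v[i])
                return 0;   /* process j's checkpoint already reflects an event
                               that process i performed after its own checkpoint,
                               so the saved global state could never have existed */
    return 1;
}
```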

  19. Talk Outline • Motivation • Our Solution • Results • Methodology • Contention Effects • Benchmark Results • Future Work • Related Work • Conclusion The University of Texas at Austin

  20. Methodology [Diagram: Compiler → Trace Generator → Simulator → File System.] • Compiler Implementation • Implemented in Broadway Compiler [Guyer & Lin 2000] • Accepts C code, generates C code with checkpoints • Trace Generator • Generates traces from pre-existing benchmarks • Uses static analysis and profiling • Event-driven Simulator • Models computation events, communication events, and checkpointing events • Network, file system modeled optimistically • Cluster characteristics are modeled after an actual cluster The University of Texas at Austin

  21. Synthetic Benchmark • Large number of sequential instructions • 2 checkpoint locations per process • Simulated with 2 checkpointing policies • Policy 1: Synchronous • Every process checkpoints simultaneously • Barrier before, barrier after • Policy 2: Staggered • Processes checkpoint in groups of four • Spread evenly throughout the sequential instructions The University of Texas at Austin
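
  A small sketch of the staggered policy described above; the function name and the exact spreading rule are assumptions, since the slide only states that processes checkpoint in groups of four spread evenly through the sequential instructions:

```c
/* Sketch of the synthetic benchmark's staggered policy: ranks checkpoint in
 * groups of four, and the groups' checkpoints are spread evenly through the
 * long run of sequential instructions. Names and the exact spreading rule
 * are assumptions, not taken from the paper. */
static long checkpoint_offset(int rank, int nprocs, long seq_instructions)
{
    int group      = rank / 4;               /* groups of four processes */
    int num_groups = (nprocs + 3) / 4;

    /* place group g's checkpoint at fraction (g + 1) / (num_groups + 1)
       of the sequential region */
    return (long)(((double)(group + 1) / (num_groups + 1)) * seq_instructions);
}
```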

  22. Staggering Improves Performance [Graphs: checkpoint performance with 16GB and with 256GB checkpointed by the system.] The University of Texas at Austin

  23. What About a Fixed Problem Size? [Graph: average checkpoint time per process, in seconds, versus number of processes, for synchronous and staggered policies with 16GB checkpointed by the system (numbers represented in the previous graph).] • Staggered checkpointing improves performance • Staggered checkpointing becomes more helpful as the number of processes increases and as the amount of data checkpointed increases The University of Texas at Austin

  24. What About a Fixed Problem Size? [Graph: average checkpoint time per process, in seconds, versus number of processes, for synchronous and staggered policies with 16GB and 256GB checkpointed by the system (numbers represented in the previous graph); staggering shows up to a 23x improvement.] • Staggered checkpointing improves performance • Staggered checkpointing becomes more helpful as the number of processes increases and as the amount of data checkpointed increases The University of Texas at Austin

  25. Staggering Allows More Checkpoints • Staggered checkpointing allows processes to checkpoint more often • Can checkpoint 9.3x more frequently for 4K processes The University of Texas at Austin

  26. Benchmark Characteristics • IS and BT are versions of the NAS Parallel Benchmarks • ek-simple is a CFD benchmark The University of Texas at Austin

  27. Unique Valid Recovery Lines Number of statically unique valid recovery lines • Lots of point-to-point communication means many unique valid recovery lines • ek-simple is most representative of real applications • These recovery lines differ only with respect to dependence-creating communication The University of Texas at Austin

  28. Future Work • Develop a heuristic to identify good recovery lines • Determining the optimal one is NP-complete [Li et al. 94] • Scalable simulation • Develop more realistic contention models • Relax assumptions in the compiler • Dynamically changing communication patterns The University of Texas at Austin

  29. Related Work • Checkpointing with compilers • Compiler-Assisted [Beck et al 1994] • Automatic Checkpointing [Choi & Deitz 2002] • Application-Level Non-Blocking [Bronevetsky et al 2003] • Dynamic fault-tolerant protocols • Message logging [Elnozahy et al 2002] The University of Texas at Austin

  30. Conclusions • Synchronous checkpointing suffers from contention • Staggered checkpointing reduces contention • Reduces checkpoint latency by up to a factor of 23 • Allows the application to tolerate more failures without a corresponding increase in overhead • A compiler can identify where to stagger checkpoints • Unique valid recovery lines are numerous in applications with point-to-point communication The University of Texas at Austin

  31. Thank you! The University of Texas at Austin

  32. Dynamic Valid Recovery Lines Number of dynamically unique valid recovery lines The University of Texas at Austin

  33. Fault Model • Crash • Send omission • Receive omission • General omission • Arbitrary failures with message authentication • Arbitrary (Byzantine) failures The University of Texas at Austin

  34. Vector Clock Formula The University of Texas at Austin
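
  The formula on this slide exists only as an image in the original deck; what follows is the standard vector clock update rule (consistent with the [Lamport 78] citation on slide 17), reconstructed here rather than transcribed from the slide.

```latex
% Standard vector clock rules for process P_i with clock V_i
% (reconstruction, not transcribed from the slide).
\begin{align*}
  &\text{local event or send on } P_i:
      && V_i[i] \leftarrow V_i[i] + 1 \\
  &\text{receive on } P_i \text{ of a message stamped } V_m:
      && V_i[k] \leftarrow \max(V_i[k],\, V_m[k]) \ \ \forall k,
         \quad\text{then } V_i[i] \leftarrow V_i[i] + 1
\end{align*}
```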

  35. Message Logging • Saves all sent messages to stable storage • In the future, storing this data will be untenable • Message logging relies on checkpointing so that logs can be cleared The University of Texas at Austin

  36. In-Flight Messages: Why We Don't Care • We reason about them at the application level, so… • Messages are assumed received at the actual receive call or at the wait • We will know if any messages crossed the recovery line, and we can prepare for recovery by checkpointing that information The University of Texas at Austin

  37. C-Breeze • In-house compiler • Allows us to reason about code at various phases of compilation • Allows us to add our own phases The University of Texas at Austin

  38. In the future… • Systems will be more complex • Programs will be more complex • Checkpointing will be more complex • Programmers should not waste time and talent handling fault-tolerance [Diagram: software stack with layers labeled Algorithm, FORTRAN/C/C++, MPI, and Checkpointing.] The University of Texas at Austin

  39. The University of Texas at Austin

  40. Solution Goals • Transparent checkpointing • Use the compiler to place checkpoints • Low failure-free execution overhead • Stagger checkpoints • Minimize checkpoint state • Support legacy code The University of Texas at Austin

  41. The Intuition • Fault-tolerance requires valid recovery lines • Many possible valid recovery lines • Find them • Automatically choose a good one • Small state, low contention • Flexibility is key The University of Texas at Austin

  42. Our Solution • Where is the set of valid recovery lines? • Determine communication pattern • Use vector clocks • Which recovery line should we use? • Develop heuristics based on cluster architecture and application (not done yet) The University of Texas at Austin

  43. Overview: Status • Discover communication pattern • Create vector clocks • Identify possible recovery lines • Select recovery line • Experimentation • Performance model and heuristic The University of Texas at Austin

  44. Finding Neighbors Find the neighbors for each process (taken from NAS benchmark bt):
    p = sqrt(no_nodes)
    cell_coord[0][0] = node % p
    cell_coord[1][0] = node / p
    j = cell_coord[0][0] - 1
    i = cell_coord[1][0] - 1
    from_process = (i - 1 + p) % p + p * j
    MPI_irecv(x, x, x, from_process, …)
  After substitution, constant propagation, and folding, the compiler obtains:
    from_process = (node / sqrt(no_nodes) - 1 - 1 + sqrt(no_nodes)) % sqrt(no_nodes) + sqrt(no_nodes) * (node % sqrt(no_nodes) - 1)
  The University of Texas at Austin

  45. Final Step: Recovery Lines • Discover possible recovery lines • Choose a good one • Determining the optimal one is NP-complete [Li 94] • Develop a heuristic • Rough performance model for staggering • Goals • Valid recovery line • Reduce bandwidth contention • Reduce storage space The University of Texas at Austin

  46. What About a Fixed Problem Size? [Graph: average checkpoint speedup per process (x faster), staggered over synchronous, versus number of processes, by amount of data checkpointed by the system (numbers represented in the previous graph).] • Staggered checkpointing improves performance • Staggered checkpointing becomes more helpful as the number of processes increases and as the amount of data checkpointed increases The University of Texas at Austin
