ScalaTrace: Scalable Compression and Replay of Communication Traces


Presentation Transcript


  1. ScalaTrace: Scalable Compression and Replay of Communication Traces. Frank Mueller, Mike Noeth, Prasun Ratn (North Carolina State University); Martin Schulz, Bronis R. de Supinski (Lawrence Livermore National Laboratory)

  2. Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work

  3. Introduction • Contemporary HPC Systems • Size > 1000 processors • e.g., IBM Blue Gene/L: 64K processors • Challenges on HPC Systems (large-scale scientific applications) • Communication scaling (MPI) • Communication analysis • Task mapping • Procurements also require performance prediction of future systems

  4. Communication Analysis • Existing approaches and shortcomings • Source code analysis: + does not require machine time; - wrong abstraction level (source code often complicated); - no dynamic information • Lightweight statistical analysis (mpiP): + low instrumentation cost; - less information available (i.e., aggregate metrics only) • Fully captured, lossless traces (Vampir, VNG): + full trace available for offline analysis; - traces generated per task, not scalable; - gather only traces on a subset of nodes; - use a cluster for visualization

  5. Our Approach • Trace-driven approach to analyze MPI communication • Goals • Extract entire communication trace • Maintain structure • Full replay (independent of original application) • Scalable • Lossless • Rapid instrumentation • MPI implementation independent

  6. ScalaTrace Design Overview Two Parts: • Recording traces • Use MPI profiling layer • Compress at the task-level • Compress across all nodes • Replaying traces

  7. Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work

  8. Intra-Node Compression Framework • Umpire [SC'00] → wrapper generator for MPI profiling layer • Initialization wrapper • Tracing wrapper • Termination wrapper • Intra-node compression of MPI calls • Provides load scalability • Interoperability with cross-node framework • Event aggregation • Special handling of MPI_Waitsome • Maintain structure of call sequences → stack walk signatures • XOR signature for speed; XOR match necessary (not sufficient) for the same call sequence
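A minimal sketch of the tracing-wrapper idea (not the actual Umpire-generated ScalaTrace code): the wrapper intercepts the MPI call, records the event together with its calling-context signature, and forwards the call to the real implementation via the PMPI layer. record_event(), stack_signature(), and OP_SEND are illustrative names.

```c
/* Sketch of a profiling-layer (PMPI) tracing wrapper; helper names are
 * hypothetical.  Assumes the MPI-3 const-qualified MPI_Send signature. */
#include <mpi.h>

enum op_code { OP_SEND, OP_RECV /* ... */ };
void record_event(enum op_code op, unsigned long sig, int count,
                  MPI_Datatype type, int peer, int tag, MPI_Comm comm);
unsigned long stack_signature(void);

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    /* record the event before forwarding it to the real implementation */
    record_event(OP_SEND, stack_signature(), count, type, dest, tag, comm);
    return PMPI_Send(buf, count, type, dest, tag, comm);
}
```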

  9. Intra-Node Compression Example • Consider the MPI operation stream: op1, op2, op3, op4, op5, op3, op4, op5 • (Diagram: target head/tail pointers walk the operation queue; the trailing op3, op4, op5 matches the preceding op3, op4, op5, and the matching head and tail sections are merged)

  10. Intra-Node Compression Example • Consider the MPI operation stream: op1, op2, op3, op4, op5, op3, op4, op5 • Algorithm in the paper • Compressed result: op1, op2, (op3, op4, op5) with iters = 2
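One way to detect this repetition (an illustrative sketch under assumed data structures, not ScalaTrace's exact algorithm): check whether the most recent len operations repeat the len operations immediately before them; if so, fold them into one section with an iteration count.

```c
/* Illustrative loop detection over a flat array of recorded ops; the
 * op_t fields are assumptions for this sketch. */
#include <string.h>

typedef struct { int code; int peer_offset; int tag; } op_t;

/* Returns 1 if ops[n-2*len .. n-len-1] equals ops[n-len .. n-1], i.e.
 * the tail repeats the section just before it and can be merged into
 * a single record with an incremented iteration count. */
static int tail_repeats(const op_t *ops, int n, int len)
{
    if (n < 2 * len) return 0;
    return memcmp(&ops[n - 2 * len], &ops[n - len], len * sizeof(op_t)) == 0;
}
```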

  11. Event Aggregation • Domain-specific encoding → improves compression • MPI_Waitsome (array_of_requests, incount, outcount, …) blocks until one or more requests are satisfied • Number of Waitsome calls in a loop is nondeterministic • Take advantage of general usage to delay compression • MPI_Waitsome does not compress until a different operation is executed • Accumulate output parameter outcount
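A sketch of the aggregation idea (the trace_rec type and helper names are assumptions, not ScalaTrace's structures): consecutive Waitsome events are not folded into the compressed stream right away; the call count and the outcount output parameter accumulate on the most recent record until a different operation arrives.

```c
/* Delayed compression of MPI_Waitsome events; names are illustrative. */
#define OP_WAITSOME 42                     /* placeholder op code */

typedef struct trace_rec {
    int  op_code;                          /* e.g. OP_WAITSOME            */
    int  calls;                            /* nondeterministic call count */
    long outcount;                         /* accumulated completions     */
} trace_rec;

void append_waitsome_record(int outcount); /* hypothetical helper */

void on_waitsome(trace_rec *last, int outcount)
{
    if (last && last->op_code == OP_WAITSOME) {
        last->calls    += 1;               /* fold into the pending record */
        last->outcount += outcount;
    } else {
        append_waitsome_record(outcount);  /* start a new Waitsome record */
    }
}
```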

  12. Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work

  13. Inter-Node Framework Interoperability • Single Program, Multiple Data (SPMD) nature of MPI codes • Match operations across nodes by manipulating parameters • Source / destination offsets • Request offsets

  14. Location-Independent Encoding • Point-to-point communication specifies targets by MPI rank • MPI rank parameters will not match across tasks • Use offsets instead • 16 processor (4x4) 2D stencil example • MPI rank targets (source/destination) • 9 communicates with 8, 5, 10, 13 • 10 communicates with 9, 6, 11, 14 • MPI offsets • 9 & 10 communicate with -1, -4, +1, +4
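A small sketch of the encoding (function names are illustrative): peers are stored as offsets relative to the caller's own rank, so ranks 9 and 10 in the 4x4 stencil above both record the identical set -1, -4, +1, +4 and their events can be matched across tasks.

```c
/* Location-independent encoding of point-to-point peers. */
static int encode_peer(int my_rank, int peer_rank)
{
    return peer_rank - my_rank;   /* e.g. rank 9 sending to 8 records -1 */
}

static int decode_peer(int my_rank, int offset)
{
    return my_rank + offset;      /* replay recovers the concrete rank */
}
```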

  15. Request Handles • Asynchronous MPI ops associated w/ an MPI_Request handle • Handle nondeterministic across tasks • Parameter mismatch in inter-node framework • Handle w/ circular request buffer (pre-set size) • On asynchronous MPI operation, store handle in buffer • On lookup of a handle, use offset from current position in buffer (e.g., with H1, H2, H3 stored, lookup of H1 yields -2) • Requires special handling in replay mechanism
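A sketch of the circular request buffer (buffer size and names are assumptions): asynchronous operations append their MPI_Request handle, and a later wait/test records the handle as an offset back from the most recently stored request, which matches across tasks even though the handles themselves do not.

```c
#include <mpi.h>

#define REQ_BUF_SIZE 64                     /* pre-set size (assumed value) */

static MPI_Request req_buf[REQ_BUF_SIZE];
static long        req_pos = 0;             /* total requests recorded      */

static void remember_request(MPI_Request r)
{
    req_buf[req_pos % REQ_BUF_SIZE] = r;
    req_pos++;
}

/* Returns 0 for the newest request, -1 for the one before it, and so on
 * (with H1, H2, H3 buffered, looking up H1 yields -2); returns 1 (an
 * impossible offset) if the handle has aged out of the buffer. */
static int request_offset(MPI_Request r)
{
    for (long back = 0; back < REQ_BUF_SIZE && back < req_pos; back++)
        if (req_buf[(req_pos - 1 - back) % REQ_BUF_SIZE] == r)
            return (int)(-back);
    return 1;
}
```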

  16. Inter-Node Compression Framework • Invoked after all computations are done (in MPI_Finalize wrapper) • Merges operation queues produced by the task-level framework • Job size scalability • (Diagram: Task 0 and Task 1 both hold op1, op2, op3 and their queues match; Task 2 and Task 3 hold op4, op5, op6)

  17. Inter-Node Compression Framework • There's more: relaxed reordering of events of different nodes (dependence check) → paper • Invoked after all computations are done (in MPI_Finalize wrapper) • Merges operation queues produced by the task-level framework • Job size scalability • (Diagram: next, the op4, op5, op6 queues of Task 2 and Task 3 match and are merged)

  18. Reduction over Binary Radix Tree • Cross-node framework merges operation queues of each task • Merge algorithm supports merging two queues at a time → paper • Radix layout facilitates compression (constant stride between nodes) • Need a control mechanism to order the merging process
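A sketch of a binary-radix (binomial-tree) merge order, not the paper's exact control code: in round k, every rank whose k-th bit is set sends its compressed queue to the partner 2^k below it and drops out, while the partner merges the two queues; after log2(P) rounds, task 0 holds the global trace. send_queue_to() and merge_queue_from() are hypothetical helpers.

```c
/* Binomial-tree reduction order over task queues; helpers are hypothetical. */
void send_queue_to(int dest);
void merge_queue_from(int src);

void radix_reduce(int rank, int nprocs)
{
    for (int step = 1; step < nprocs; step <<= 1) {
        if (rank & step) {                  /* this bit set: send and finish  */
            send_queue_to(rank - step);
            return;
        }
        if (rank + step < nprocs)           /* bit clear: receive and merge   */
            merge_queue_from(rank + step);
    }
    /* only rank 0 reaches this point with the fully merged master queue */
}
```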

  19. Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work

  20. Replay Mechanism • Motivation • Can replay traces on any architecture • Useful for rapid prototyping → procurements • Communication tuning (Miranda, SC'05) • Communication analysis (patterns) • Communication tuning (inefficiencies) • Replay design • Replays the comprehensive trace produced by the recording framework • Parses trace, loads task-level op queues (inverse of merge algorithm) • Replays on-the-fly (inverse of compression algorithm) • Timing deltas
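A sketch of the replay loop (types and names are illustrative assumptions): each task walks its decompressed operation queue on the fly, waits for the recorded compute delta, and re-issues the communication with the recorded parameters, turning peer offsets back into ranks.

```c
#include <mpi.h>
#include <stddef.h>

enum { OP_SEND = 1, OP_RECV = 2 };

typedef struct replay_op {
    int    op_code;                 /* OP_SEND, OP_RECV, ...              */
    int    count, tag;
    int    peer_offset;             /* relative peer (see slide 14)       */
    double delta_sec;               /* recorded compute time before op    */
    struct replay_op *next;
} replay_op;

static char scratch[1 << 20];       /* dummy payload (assumes count fits) */

static void spin_for(double sec)    /* emulate the recorded compute time  */
{
    double t0 = MPI_Wtime();
    while (MPI_Wtime() - t0 < sec) { /* busy wait */ }
}

void replay(replay_op *queue, int my_rank)
{
    for (replay_op *op = queue; op != NULL; op = op->next) {
        spin_for(op->delta_sec);
        int peer = my_rank + op->peer_offset;
        if (op->op_code == OP_SEND)
            MPI_Send(scratch, op->count, MPI_BYTE, peer, op->tag,
                     MPI_COMM_WORLD);
        else if (op->op_code == OP_RECV)
            MPI_Recv(scratch, op->count, MPI_BYTE, peer, op->tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... other MPI operations ... */
    }
}
```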

  21. Experimental Results • Environment • 1024-node BG/L at Lawrence Livermore National Laboratory • Stencil micro-benchmarks • Raptor real-world application [Greenough'03]

  22. Trace System Validation • Uncompressed trace dumps compared • Replay • Matching profiles from mpiP

  23. Task Scalability Performance • Varied: • Strong (task) scaling: number of nodes • Examined metrics: • Trace file size • Memory usage • Compression time (or write time) • Timed replay accuracy • Results for • 1/2/3D stencils • Raptor application • NAS Parallel Benchmarks

  24. Trace File Size – 3D Stencil • Constant size for fully compressed traces (inter-node) • Log scale • (Chart with labeled points at ~100 MB, ~0.5 MB, and ~10 kB)

  25. Memory Usage – 3D Stencil • Constant memory usage for fully compressed (inter-node) • min = leaves, avg = middle layer (decreases w/ node #), max ~ task 0 • Average memory usage decreases w/ more processors • (Chart with labeled point at ~0.5 MB)

  26. Trace File Size – Raptor App • Sub-linear increase for fully compressed traces (inter-node) • NOT on log scale • (Chart with labeled points at 93 MB, 80 MB, and 35 MB)

  27. Memory Usage – Raptor • Constant memory usage for fully compressed (inter-node) • Average memory usage decreases w/ more processors • (Chart with labeled point at ~500 MB)

  28. Load Scaling – 3D Stencil • Both intra- and inter-node compression result in constant size • Log scale

  29. Trace File Size – NAS PB Codes • Log-scale file size [Bytes], 32-512 CPUs • 3 categories for none / intra- / inter-node compression • Focus: blue = full compression • Near-constant size (EP, also DT) • Instead of exponential • Sub-linear (MG, also LU) • Still good • Non-scalable (FT, also BT, CG, IS) • Still 2-4 orders of magnitude smaller • But could improve • Due to complex comm. patterns along diagonal of 2D layout • Even with varying # of endpoints

  30. Memory Usage – NAS PB Codes • Log-scale memory [Bytes], 32-512 CPUs • Categories: min, avg, max, root 0 • Near-constant size (EP, also DT) • Also constant in memory • Sub-linear (MG, also LU) • Sometimes constant in memory • Non-scalable (FT, also BT, CG, IS) • Non-scalable in memory

  31. Compression/Write Overhead – NAS PB • Log-scale time [ms], 32-512 CPUs • none / intra- / inter-node (full compression) • Near-constant size (EP, also DT) • Inter-node compression fastest • Sub-linear (LU, also MG) • Intra-node faster than inter-node • Non-scalable (FT, also BT, CG, IS) • Not competitive; better write times with intra-node compression

  32. Timed Replay – NAS PB • Record delta times b/w MPI calls • Path-sensitive → histograms • Replay compute time as delay on BG/L • Bars in replay experiments: (1) uninstrumented (orig. trace), (2) w/ mpiP, (3) uncompressed (w/ mpiP), (4) node/intra compressed (w/ mpiP), (5) glob./inter compressed (w/ mpiP) • Report timing of replay: 32-512 CPUs • Compute (delay) • Communicate (just replayed) • Use fine-grained clock • Small, scalable traces & timing retained • (Charts for FT, CG, MG)
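A sketch of path-sensitive delta-time recording (bin layout and names are assumptions): the elapsed time since the previous MPI event is binned into a small per-record histogram, so timing information stays scalable even after compression.

```c
#include <mpi.h>

#define NBINS 16

typedef struct {
    double bin_max[NBINS];        /* upper bound of each bin, in seconds   */
    long   count[NBINS];          /* number of deltas that fell in the bin */
} delta_hist;

static double last_event;         /* time stamp of the previous MPI event  */

static void record_delta(delta_hist *h)
{
    double now   = MPI_Wtime();   /* fine-grained clock */
    double delta = now - last_event;
    last_event = now;
    for (int b = 0; b < NBINS; b++) {
        if (delta <= h->bin_max[b] || b == NBINS - 1) {
            h->count[b]++;        /* last bin catches overflow deltas */
            break;
        }
    }
}
```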

  33. ScalaTrace Summary Contributions: • Scalable approach to capturing full trace of communication • Near-constant trace sizes for some apps (others: more work) • Near-constant memory requirement • Rapid analysis via replay mechanism from trace (w/o app) • Record Δt to retain timing behavior → scalable • Fast timeline search, easy outlier detection • Lossless MPI tracing of any number of nodes feasible • May store & visualize MPI traces on a desktop Future Work: • Task layout model (i.e. Miranda) • Post-analysis stencil identification • Tuning → detect non-scalable MPI usage • Support for procurements • Offload compression to I/O nodes

  34. Acknowledgements • Mike Noeth (NCSU, intern @ LLNL) • Prasun Ratn (NCSU, intern @ LLNL) • Martin Schulz (LLNL) • Bronis R. de Supinski (LLNL) • IPDPS’07 best paper, work on timed replay forthcoming • Availability under BSD license: moss.csc.ncsu.edu/~mueller/scala.html • Funded in part by Humboldt Foundation, NSF CCF-0429653, CNS-0410203, CAREER CCR-0237570 • Part of this work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48

  35. Global Compression/Write Time [ms] • (Chart: average time per node and max. time for BT, CG, DT, EP, FT, IS, LU, MG)

  36. Trace File Size – Raptor • Near constant size for fully compressed traces • Linear scale

  37. Memory Usage – Raptor Results • Same as 3D stencil • Investigating the min memory usage for 1024 tasks

  38. Intra-Node Compression Algorithm • Intercept MPI call • Identify target • Identify merger • Match verification • Compression

  39. Call Sequence ID: Stack Walk Signature • Maintain structure by distinguishing between operations • XOR signature → speed • XOR match necessary (not sufficient) → same calling context
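A sketch of such a signature (an illustration, not ScalaTrace's exact code): XOR the return addresses on the current call stack. It is fast to compute, and equal signatures are necessary but not sufficient evidence of the same calling context, so a full stack comparison confirms a match before merging.

```c
#include <execinfo.h>   /* backtrace(): GNU/glibc extension */
#include <stdint.h>

/* XOR of the return addresses on the current call stack. */
static uintptr_t stack_signature(void)
{
    void *frames[64];
    int   depth = backtrace(frames, 64);
    uintptr_t sig = 0;
    for (int i = 0; i < depth; i++)
        sig ^= (uintptr_t)frames[i];
    return sig;
}
```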

  40. Inter-Node Merge Algorithm • Iterate through both queues • Find match • Maintain order • Compress ops

  41. Inter-Node Merge Example • Consider two tasks, each with its own operation queue • (Diagram: master queue of Task 0 holds Sequences 1 and 4; slave queue of Task 1 holds Sequences 1-4; the master and slave iterators find a match on Sequence 1)

  42. Inter-Node Merge Example • (Diagram: the matched Sequence 1 entries are merged and the slave iterator advances through the slave queue)

  43. Cross-Node Merge Example • (Diagram: Task 1 has been added to the participant list of the merged Sequence 1; the slave queue now holds Sequences 2-4, and Sequence 4 matches the master queue's Sequence 4)

  44. Temporal Cross-Node Reordering • Requirement: queue maintains order of operations • Merge algorithm maintains order of operations too strictly • Unmatched sequences in slave queue always moved to master • Results in poorer compression • Solution: only move operations that must be moved • Intersect task participation lists of matched & unmatched ops • Intersection empty → no dependency • Otherwise → ops must be moved
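A sketch of the dependency test used for relaxed reordering (participant lists are modeled here as sorted int arrays; names are illustrative): an unmatched slave-queue sequence only has to move ahead of the matched one if its participant list shares at least one task with it.

```c
/* Returns 1 if the two sorted participant lists share a task (dependency),
 * 0 if they are disjoint and the unmatched sequence can stay in place. */
static int participants_intersect(const int *a, int na, const int *b, int nb)
{
    int i = 0, j = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j]) return 1;      /* shared task: must preserve order */
        if (a[i] < b[j]) i++; else j++;
    }
    return 0;
}
```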

  45. Dependency Example • Consider the 4-task job (tasks 1 & 3 have already merged) • (Diagram: master queue of Task 0 holds Sequences 1 and 2; slave queue holds Sequence 2 with participant Task 1 and Sequence 1 with participant Task 3; a match is found)

  46. Dependency Example • Consider the 4-task job (tasks 1 & 3 have already merged) • (Diagram: master queue of Task 0 with Sequences 1 and 2; slave queue with Sequence 2 of Task 1 and Sequence 1 of Task 3; the iterators advance)

  47. Dependency Example • Consider the 4-task job (tasks 1 & 3 have already merged) • (Diagram: the unmatched Sequence 2 of Task 1 is moved into the master queue ahead of Task 0's Sequences 1 and 2, creating duplicates; the slave queue retains Sequence 1 of Task 3)

  48. Dependency Example • Consider the 4-task job (tasks 1 & 3 have already merged) • (Diagram: Task 1 is added to the participant list of the master queue's matching Sequence 2; the slave queue still holds Sequence 1 of Task 3)
