ScalaTrace: Scalable Compression and Replay of Communication Traces
Frank Mueller, Mike Noeth, Prasun Ratn (North Carolina State University)
Martin Schulz, Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work
Introduction • Contemporary HPC systems • Size > 1000 processors • e.g., IBM Blue Gene/L: 64k processors • Challenges on HPC systems (large-scale scientific applications) • Communication scaling (MPI) • Communication analysis • Task mapping • Procurements also require performance prediction of future systems
Communication Analysis • Existing approaches and shortcomings • Source code analysis: + does not require machine time; - wrong abstraction level (source code often complicated); - no dynamic information • Lightweight statistical analysis (mpiP): + low instrumentation cost; - less information available (i.e., aggregate metrics only) • Fully captured, lossless traces (Vampir, VNG): + full trace available for offline analysis; - traces generated per task are not scalable (gather traces only on a subset of nodes, use a visualization cluster on I/O nodes)
Our Approach • Trace-driven approach to analyze MPI communication • Goals • Extract entire communication trace • Maintain structure • Full replay (independent of original application) • Scalable • Lossless • Rapid instrumentation • MPI implementation independent
ScalaTrace Design Overview Two Parts: • Recording traces • Use MPI profiling layer • Compress at the task-level • Compress across all nodes • Replaying traces
Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work
Intra-Node Compression Framework • Umpire [SC'00] wrapper generator for the MPI profiling layer • Initialization wrapper • Tracing wrapper • Termination wrapper • Intra-node compression of MPI calls • Provides load scalability • Interoperability with cross-node framework • Event aggregation • Special handling of MPI_Waitsome • Maintain structure of call sequences via stack walk signatures • XOR signature for speed; an XOR match is necessary (but not sufficient) for the same call sequence
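To make the profiling-layer idea concrete, here is a minimal sketch of a tracing wrapper, assuming a hypothetical record_event helper (the actual ScalaTrace wrappers are generated by Umpire's wrapper generator and feed the intra-node compression layer):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical helper: in ScalaTrace this would hand the event to the
     * intra-node compression layer; here it only logs the call. */
    static void record_event(const char *op, int count, int peer)
    {
        fprintf(stderr, "trace: %s count=%d peer=%d\n", op, count, peer);
    }

    /* The MPI profiling layer lets the tool define MPI_Send itself and
     * forward to the real implementation through PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        record_event("MPI_Send", count, dest);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

Linking this into the application (or preloading it) intercepts every MPI_Send without touching the application code or the MPI library itself.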
Intra-Node Compression Example • Consider the MPI operation stream: op1, op2, op3, op4, op5, op3, op4, op5 • The repeated subsequence op3, op4, op5 is matched against the existing queue entries (head and tail pointers walk the candidate targets) and merged in place • Result: op1, op2, ((op3, op4, op5), iters = 2) • Full algorithm in the paper
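The actual compression algorithm (based on regular section descriptors) is described in the paper; the fragment below is only a simplified sketch, under the assumption that an immediately repeated tail of the queue is folded into one entry with an iteration count:

    /* Simplified sketch: if the last 'len' ops repeat the 'len' ops just
     * before them, fold the copy into the original by bumping iteration
     * counts.  ScalaTrace's real RSD/PRSD-based algorithm is more general. */
    typedef struct { int op_id; int iters; } TraceOp;

    static int try_fold_tail(TraceOp *q, int *n, int len)
    {
        if (*n < 2 * len) return 0;
        TraceOp *first = &q[*n - 2 * len], *second = &q[*n - len];
        for (int i = 0; i < len; i++)
            if (first[i].op_id != second[i].op_id) return 0;   /* no repeat */
        for (int i = 0; i < len; i++)
            first[i].iters += second[i].iters;                 /* merge copy */
        *n -= len;                                             /* drop tail */
        return 1;
    }

Applied to the stream above with len = 3, the second op3, op4, op5 group folds into the first, leaving op1, op2 and one group with iters = 2.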
Event Aggregation • Domain-specific encoding improves compression • MPI_Waitsome (incount, array_of_requests, outcount, …) blocks until one or more requests are satisfied • Number of Waitsome calls in a loop is nondeterministic • Take advantage of typical usage and delay compression • MPI_Waitsome is not compressed until a different operation is executed • Accumulate the output parameter outcount
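A rough sketch of the delayed-compression idea for MPI_Waitsome, with illustrative (non-ScalaTrace) field names: consecutive Waitsome calls accumulate into a single pending record that is only emitted once a different operation arrives.

    /* Aggregation state for a run of consecutive MPI_Waitsome calls. */
    typedef struct {
        int pending;          /* inside a run of Waitsome calls? */
        int total_outcount;   /* accumulated completed requests  */
        int calls;            /* number of Waitsome calls so far  */
    } WaitsomeAgg;

    static void on_waitsome(WaitsomeAgg *agg, int outcount)
    {
        agg->pending         = 1;
        agg->total_outcount += outcount;
        agg->calls          += 1;
    }

    static void on_other_op(WaitsomeAgg *agg)
    {
        if (agg->pending) {
            /* emit one aggregated Waitsome record, then reset */
            agg->pending = agg->total_outcount = agg->calls = 0;
        }
    }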
Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work
Inter-Node Framework Interoperability • Single Program, Multiple Data (SPMD) nature of MPI codes • Match operations across nodes by manipulating parameters • Source / destination offsets • Request offsets
Location-Independent Encoding • Point-to-point communication specifies targets by MPI rank • MPI rank parameters will not match across tasks • Use offsets instead • 16 processor (4x4) 2D stencil example • MPI rank targets (source/destination) • 9 communicates with 8, 5, 10, 13 • 10 communicates with 9, 6, 11, 14 • MPI offsets • 9 & 10 communicate with -1, -4, +1, +4
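A small sketch of the offset encoding, assuming the peer is addressed within the same communicator (helper names are illustrative):

    #include <mpi.h>

    /* Store the peer as an offset from the caller's own rank, so the same
     * stencil exchange looks identical on every task. */
    static int encode_peer(MPI_Comm comm, int peer_rank)
    {
        int my_rank;
        MPI_Comm_rank(comm, &my_rank);
        return peer_rank - my_rank;     /* e.g. rank 9 sending to 8 stores -1 */
    }

    /* Replay resolves the offset back to an absolute rank. */
    static int decode_peer(MPI_Comm comm, int offset)
    {
        int my_rank;
        MPI_Comm_rank(comm, &my_rank);
        return my_rank + offset;
    }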
Request Handles • Asynchronous MPI ops are associated with an MPI_Request handle • Handles are nondeterministic across tasks, causing parameter mismatches in the inter-node framework • Solution: circular request buffer of pre-set size • On an asynchronous MPI operation, store the handle in the buffer • On a later lookup of the handle, record the offset from the current buffer position (e.g., looking up H1 after H2 and H3 yields -2) • Requires special handling in the replay mechanism
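A possible sketch of such a circular request buffer (the size and helper names are illustrative, not ScalaTrace's actual implementation):

    #include <mpi.h>

    #define REQ_RING_SIZE 64

    typedef struct {
        MPI_Request ring[REQ_RING_SIZE];
        int head;                        /* total handles pushed so far */
    } ReqRing;

    /* Asynchronous operations push their request handle into the ring. */
    static void ring_push(ReqRing *r, MPI_Request req)
    {
        r->ring[r->head % REQ_RING_SIZE] = req;
        r->head++;
    }

    /* Waits/tests record a handle as a negative offset from the current
     * position, which is reproducible across tasks. */
    static int ring_offset(const ReqRing *r, MPI_Request req)
    {
        for (int back = 1; back <= REQ_RING_SIZE && back <= r->head; back++)
            if (r->ring[(r->head - back) % REQ_RING_SIZE] == req)
                return -back;            /* e.g. -2: pushed two ops ago */
        return 0;                        /* not found */
    }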
Inter-Node Compression Framework • Invoked after all computation is done (in the MPI_Finalize wrapper) • Merges the operation queues produced by the task-level framework, e.g., Task 0 and Task 1 queues (op1, op2, op3) match and merge, as do Task 2 and Task 3 queues (op4, op5, op6) • Provides job-size scalability • There is more: relaxed reordering of events of different nodes (dependence check), see paper
Reduction over Binary Radix Tree • Cross-node framework merges the operation queues of each task • Merge algorithm supports merging two queues at a time (see paper) • Radix layout facilitates compression (constant stride between nodes) • Need a control mechanism to order the merging process
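The control mechanism can be pictured as a standard recursive-halving reduction over ranks; the sketch below shows only that ordering, with placeholder hooks standing in for ScalaTrace's actual queue serialization and merge routines:

    #include <stdio.h>

    /* Placeholder hooks: in ScalaTrace these would serialize, ship, and
     * merge the per-task operation queues. */
    static void send_queue_to(int dest)   { printf("send queue to %d\n", dest); }
    static void merge_queue_from(int src) { printf("merge queue from %d\n", src); }

    /* Merge ordering over a binary radix tree: in round k, tasks with bit k
     * set send their queue to the partner with bit k clear and stop; the
     * partner merges it in.  After log2(P) rounds, rank 0 holds everything. */
    static void merge_reduce(int rank, int nprocs)
    {
        for (int bit = 1; bit < nprocs; bit <<= 1) {
            if (rank & bit) { send_queue_to(rank - bit); break; }
            if (rank + bit < nprocs) merge_queue_from(rank + bit);
        }
    }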
Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work
Replay Mechanism • Motivation • Can replay traces on any architecture • Useful for rapid prototyping in procurements • Communication tuning (Miranda, SC'05) • Communication analysis (patterns) • Communication tuning (inefficiencies) • Replay design • Replays the comprehensive trace produced by the recording framework • Parses the trace and loads task-level op queues (inverse of the merge algorithm) • Replays on the fly (inverse of the compression algorithm) • Timing deltas
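A highly simplified sketch of what a replay step might look like; the event layout, the OP_* codes, and the use of usleep for the recorded compute delay are all assumptions for illustration, not ScalaTrace's actual interfaces:

    #include <mpi.h>
    #include <unistd.h>

    typedef struct {
        int    op;            /* e.g. OP_SEND, OP_RECV */
        int    peer_offset;   /* location-independent peer encoding */
        int    count;         /* message size in bytes */
        double delta_usec;    /* recorded compute time before this call */
    } Event;

    enum { OP_SEND, OP_RECV };

    static void replay_event(const Event *e, char *buf, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        usleep((useconds_t)e->delta_usec);       /* simulate the compute phase */
        int peer = rank + e->peer_offset;        /* decode offset to a rank */
        if (e->op == OP_SEND)
            MPI_Send(buf, e->count, MPI_BYTE, peer, 0, comm);
        else
            MPI_Recv(buf, e->count, MPI_BYTE, peer, 0, comm, MPI_STATUS_IGNORE);
    }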
Experimental Results • Environment • 1024-node BG/L at Lawrence Livermore National Laboratory • Stencil micro-benchmarks • Raptor, a real-world application [Greenough'03]
Trace System Validation • Uncompressed trace dumps compared • Replay • Matching profiles from mpiP
Task Scalability Performance • Varied: • Strong (task) scaling: number of nodes • Examined metrics: • Trace file size • Memory usage • Compression time (or write time) • Timed replay accuracy • Results for • 1/2/3D stencils • Raptor application • NAS Parallel Benchmarks
Trace File Size – 3D Stencil • Constant size for fully compressed traces (inter-node) • Log-scale plot (annotated sizes: 100 MB, 0.5 MB, 10 kB)
Memory Usage – 3D Stencil • Constant memory usage for fully compressed traces (inter-node) • min = leaves, avg = middle layer (decreases with node count), max ~ task 0 • Average memory usage decreases with more processors • (Plot annotation: 0.5 MB)
Trace File Size – Raptor App • Sub-linear increase for fully compressed traces (inter-node) • NOT on log scale (annotated sizes: 93 MB, 80 MB, 35 MB)
Memory Usage – Raptor • Constant memory usage for fully compressed traces (inter-node) • Average memory usage decreases with more processors • (Plot annotation: 500 MB)
Load Scaling – 3D Stencil • Both intra- and inter-node compression result in constant size • Log scale
Trace File Size – NAS PB Codes • Log-scale file size [Bytes], 32-512 CPUs; three bars for none / intra- / inter-node compression; focus on blue = full compression • Near-constant size (EP, also DT): instead of exponential growth • Sub-linear (MG, also LU): still good • Non-scalable (FT, also BT, CG, IS): still 2-4 orders of magnitude smaller, but could improve; due to complex communication patterns along the diagonal of the 2D layout, even with varying numbers of endpoints
Memory Usage – NAS PB Codes • Log-scale memory [Bytes], 32-512 CPUs; categories for min, avg, max, root (task 0) • Near-constant trace size (EP, also DT): also constant in memory • Sub-linear (MG, also LU): sometimes constant in memory • Non-scalable (FT, also BT, CG, IS): non-scalable in memory
Compression/Write Overhead – NAS PB • Log-scale time [ms], 32-512 CPUs; none / intra- / inter-node (full compression) • Near-constant size (EP, also DT): inter-node compression fastest • Sub-linear (LU, also MG): intra-node faster than inter-node • Non-scalable (FT, also BT, CG, IS): not competitive; better to write with intra-node compression only
Timed Replay – NAS PB • Record delta times between MPI calls in path-sensitive histograms • Replay compute time as delay on BG/L • Bars in replay experiments: uninstrumented (original run), with mpiP, uncompressed trace (w/ mpiP), node/intra compressed (w/ mpiP), global/inter compressed (w/ mpiP) • Report timing of replay for 32-512 CPUs: compute (delay) vs. communicate (just replayed) • Use a fine-grained clock • Result: small, scalable traces with timing behavior retained (shown for FT, CG, MG)
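One way to picture the path-sensitive delta-time recording is a small histogram keyed by calling context; the exponential binning below is only an assumed layout for illustration:

    #include <mpi.h>
    #include <math.h>

    #define NBINS 32

    /* One histogram per calling context (call path); exponential bins keep
     * the record small while preserving the shape of the delay distribution. */
    typedef struct { unsigned long count[NBINS]; } DeltaHist;

    static void record_delta(DeltaHist *h, double last_t, double now)
    {
        double usec = (now - last_t) * 1e6;
        int bin = (usec <= 1.0) ? 0 : (int)log2(usec);
        if (bin >= NBINS) bin = NBINS - 1;
        h->count[bin]++;
    }

    /* Usage: take t = MPI_Wtime() when an MPI call returns; on the next
     * call invoke record_delta(&hist_for_this_call_path, t, MPI_Wtime()). */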
ScalaTrace Summary Contributions: • Scalable approach to capturing the full trace of communication • Near-constant trace sizes for some apps (others need more work) • Near-constant memory requirement • Rapid analysis via replay mechanism from the trace (without the app) • Recording delta times retains timing behavior in a scalable way • Fast timeline search, easy outlier detection • Lossless MPI tracing of any number of nodes is feasible • May store & visualize MPI traces on a desktop Future Work: • Task layout model (e.g., Miranda) • Post-analysis stencil identification • Tuning: detect non-scalable MPI usage • Support for procurements • Offload compression to I/O nodes
Acknowledgements • Mike Noeth (NCSU, intern @ LLNL) • Prasun Ratn (NCSU, intern @ LLNL) • Martin Schulz (LLNL) • Bronis R. de Supinski (LLNL) • IPDPS’07 best paper, work on timed replay forthcoming • Availability under BSD license: moss.csc.ncsu.edu/~mueller/scala.html • Funded in part by Humboldt Foundation, NSF CCF-0429653, CNS-0410203, CAREER CCR-0237570 • Part of this work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48
Global Compression/Write Time [ms] • [Backup chart: average time per node and max. time for BT, CG, DT, EP, FT, IS, LU, MG]
Trace File Size – Raptor • Near constant size for fully compressed traces • Linear scale
Memory Usage – Raptor Results • Same as 3D stencil • Investigating the min memory usage for 1024 tasks
Intra-Node Compression Algorithm • Intercept MPI call • Identify target • Identify merger • Match verification • Compression
Call Sequence ID: Stack Walk Signature • Maintain structure by distinguishing between operations from different calling contexts • XOR signature for speed • An XOR match is necessary (but not sufficient) for the same calling context
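A minimal sketch of the XOR signature, assuming glibc's backtrace() is available to walk the stack (ScalaTrace's own stack walk may differ):

    #include <stdint.h>
    #include <execinfo.h>

    #define MAX_FRAMES 16

    /* XOR the return addresses on the call stack into one word so calling
     * contexts can be compared cheaply.  Equal signatures are necessary but
     * not sufficient, so a full stack comparison confirms the match. */
    static uintptr_t stack_signature(void)
    {
        void *frames[MAX_FRAMES];
        int n = backtrace(frames, MAX_FRAMES);
        uintptr_t sig = 0;
        for (int i = 0; i < n; i++)
            sig ^= (uintptr_t)frames[i];
        return sig;
    }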
Inter-Node Merge Algorithm • Iterate through both queues • Find match • Maintain order • Compress ops
Inter-Node Merge Example • Consider two tasks, each with its own operation queue: master queue (Sequence 1 … Sequence 4, participants: Task 0) and slave queue (Sequence 1 … Sequence 4, participants: Task 1) • [Diagrams: master iterator (MI), slave iterator (SI), and slave head (SH) advance through the queues; on each MATCH the participant lists are merged (Task 0, Task 1) and the matched slave sequence is removed from the slave queue]
Temporal Cross-Node Reordering • Requirement: the merged queue maintains the order of operations • The basic merge algorithm maintains order too strictly: unmatched sequences in the slave queue are always moved to the master, which results in poorer compression • Solution: only move operations that must be moved • Intersect the task participation lists of matched & unmatched ops • Empty intersection: no dependency • Otherwise: ops must be moved
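A small sketch of the dependency test, with participant lists simplified to a bitmask (illustrative only; ScalaTrace's participant lists are more general):

    #include <stdint.h>

    typedef struct {
        uint64_t participants;   /* bit i set => task i participates */
    } Seq;

    /* An unmatched slave sequence only has to be moved ahead of a matched
     * master sequence if their participant lists intersect. */
    static int must_move(const Seq *matched, const Seq *unmatched)
    {
        return (matched->participants & unmatched->participants) != 0;
        /* empty intersection => no dependency => leave it in place */
    }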
Dependency Example • Consider a 4-task job (tasks 1 & 3 have already merged into one slave queue) • Master queue: Sequence 1, Sequence 2 (participants: Task 0); slave queue: Sequence 2 (participants: Task 1), Sequence 1 (participants: Task 3) • [Diagrams: strictly maintaining slave order moves Sequence 2 (Task 1) into the master ahead of Sequence 1, producing duplicate Sequence 2 entries; with the participation-list check, Sequence 1 (Task 0) and Sequence 2 (Task 1) do not intersect, so Sequence 2 can instead merge with the master's matching Sequence 2]