ScalaTrace: Scalable Compression and Replay of Communication Traces
Frank Mueller, Mike Noeth, Prasun Ratn (North Carolina State University)
Martin Schulz, Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work
Introduction • Contemporary HPC systems • Size > 1000 processors • e.g., IBM Blue Gene/L: 64k processors • Challenges on HPC systems (large-scale scientific applications) • Communication scaling (MPI) • Communication analysis • Task mapping • Procurements also require performance prediction of future systems
Communication Analysis • Existing approaches and shortcomings • Source code analysis: + does not require machine time; - wrong abstraction level (source code often complicated); - no dynamic information • Lightweight statistical analysis (mpiP): + low instrumentation cost; - less information available (i.e., aggregate metrics only) • Fully captured, lossless traces (Vampir, VNG): + full trace available for offline analysis; - traces generated per task are not scalable (gather traces only on a subset of nodes, use a visualization cluster on I/O nodes)
Our Approach • Trace-driven approach to analyze MPI communication • Goals • Extract entire communication trace • Maintain structure • Full replay (independent of original application) • Scalable • Lossless • Rapid instrumentation • MPI implementation independent
ScalaTrace Design Overview Two Parts: • Recording traces • Use MPI profiling layer • Compress at the task-level • Compress across all nodes • Replaying traces
Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work
Intra-Node Compression Framework • Umpire [SC'00] wrapper generator for the MPI profiling layer • Initialization wrapper • Tracing wrapper • Termination wrapper • Intra-node compression of MPI calls • Provides load scalability • Interoperability with cross-node framework • Event aggregation • Special handling of MPI_Waitsome • Maintain structure of call sequences via stack walk signatures • XOR signature for speed; an XOR match is necessary (but not sufficient) for the same call sequence
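To make the profiling-layer idea concrete, here is a minimal sketch of a tracing wrapper, assuming a hypothetical record_event helper (the actual ScalaTrace wrappers are generated by Umpire's wrapper generator and feed the intra-node compression layer):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical helper: in ScalaTrace this would hand the event to the
     * intra-node compression layer; here it only logs the call. */
    static void record_event(const char *op, int count, int peer)
    {
        fprintf(stderr, "trace: %s count=%d peer=%d\n", op, count, peer);
    }

    /* The MPI profiling layer lets the tool define MPI_Send itself and
     * forward to the real implementation through PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        record_event("MPI_Send", count, dest);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

Linking this into the application (or preloading it) intercepts every MPI_Send without touching the application code or the MPI library itself.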
Intra-Node Compression Example • Consider the MPI operation stream: op1, op2, op3, op4, op5, op3, op4, op5 • The repeated subsequence op3, op4, op5 is matched against the existing queue entries (head and tail pointers walk the candidate targets) and merged in place • Result: op1, op2, ((op3, op4, op5), iters = 2) • Full algorithm in the paper
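The actual compression algorithm (based on regular section descriptors) is described in the paper; the fragment below is only a simplified sketch, under the assumption that an immediately repeated tail of the queue is folded into one entry with an iteration count:

    /* Simplified sketch: if the last 'len' ops repeat the 'len' ops just
     * before them, fold the copy into the original by bumping iteration
     * counts.  ScalaTrace's real RSD/PRSD-based algorithm is more general. */
    typedef struct { int op_id; int iters; } TraceOp;

    static int try_fold_tail(TraceOp *q, int *n, int len)
    {
        if (*n < 2 * len) return 0;
        TraceOp *first = &q[*n - 2 * len], *second = &q[*n - len];
        for (int i = 0; i < len; i++)
            if (first[i].op_id != second[i].op_id) return 0;   /* no repeat */
        for (int i = 0; i < len; i++)
            first[i].iters += second[i].iters;                 /* merge copy */
        *n -= len;                                             /* drop tail */
        return 1;
    }

Applied to the stream above with len = 3, the second op3, op4, op5 group folds into the first, leaving op1, op2 and one group with iters = 2.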
Event Aggregation • Domain-specific encoding improves compression • MPI_Waitsome (incount, array_of_requests, outcount, …) blocks until one or more requests are satisfied • Number of Waitsome calls in a loop is nondeterministic • Take advantage of typical usage and delay compression • MPI_Waitsome is not compressed until a different operation is executed • Accumulate the output parameter outcount
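A rough sketch of the delayed-compression idea for MPI_Waitsome, with illustrative (non-ScalaTrace) field names: consecutive Waitsome calls accumulate into a single pending record that is only emitted once a different operation arrives.

    /* Aggregation state for a run of consecutive MPI_Waitsome calls. */
    typedef struct {
        int pending;          /* inside a run of Waitsome calls? */
        int total_outcount;   /* accumulated completed requests  */
        int calls;            /* number of Waitsome calls so far  */
    } WaitsomeAgg;

    static void on_waitsome(WaitsomeAgg *agg, int outcount)
    {
        agg->pending         = 1;
        agg->total_outcount += outcount;
        agg->calls          += 1;
    }

    static void on_other_op(WaitsomeAgg *agg)
    {
        if (agg->pending) {
            /* emit one aggregated Waitsome record, then reset */
            agg->pending = agg->total_outcount = agg->calls = 0;
        }
    }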
Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work
Inter-Node Framework Interoperability • Single Program, Multiple Data (SPMD) nature of MPI codes • Match operations across nodes by manipulating parameters • Source / destination offsets • Request offsets
Location-Independent Encoding • Point-to-point communication specifies targets by MPI rank • MPI rank parameters will not match across tasks • Use offsets instead • 16 processor (4x4) 2D stencil example • MPI rank targets (source/destination) • 9 communicates with 8, 5, 10, 13 • 10 communicates with 9, 6, 11, 14 • MPI offsets • 9 & 10 communicate with -1, -4, +1, +4
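A small sketch of the offset encoding, assuming the peer is addressed within the same communicator (helper names are illustrative):

    #include <mpi.h>

    /* Store the peer as an offset from the caller's own rank, so the same
     * stencil exchange looks identical on every task. */
    static int encode_peer(MPI_Comm comm, int peer_rank)
    {
        int my_rank;
        MPI_Comm_rank(comm, &my_rank);
        return peer_rank - my_rank;     /* e.g. rank 9 sending to 8 stores -1 */
    }

    /* Replay resolves the offset back to an absolute rank. */
    static int decode_peer(MPI_Comm comm, int offset)
    {
        int my_rank;
        MPI_Comm_rank(comm, &my_rank);
        return my_rank + offset;
    }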
Request Handles • Asynchronous MPI ops are associated with an MPI_Request handle • Handles are nondeterministic across tasks, causing parameter mismatches in the inter-node framework • Solution: circular request buffer of pre-set size • On an asynchronous MPI operation, store the handle in the buffer • On a later lookup of the handle, record the offset from the current buffer position (e.g., looking up H1 after H2 and H3 yields -2) • Requires special handling in the replay mechanism
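A possible sketch of such a circular request buffer (the size and helper names are illustrative, not ScalaTrace's actual implementation):

    #include <mpi.h>

    #define REQ_RING_SIZE 64

    typedef struct {
        MPI_Request ring[REQ_RING_SIZE];
        int head;                        /* total handles pushed so far */
    } ReqRing;

    /* Asynchronous operations push their request handle into the ring. */
    static void ring_push(ReqRing *r, MPI_Request req)
    {
        r->ring[r->head % REQ_RING_SIZE] = req;
        r->head++;
    }

    /* Waits/tests record a handle as a negative offset from the current
     * position, which is reproducible across tasks. */
    static int ring_offset(const ReqRing *r, MPI_Request req)
    {
        for (int back = 1; back <= REQ_RING_SIZE && back <= r->head; back++)
            if (r->ring[(r->head - back) % REQ_RING_SIZE] == req)
                return -back;            /* e.g. -2: pushed two ops ago */
        return 0;                        /* not found */
    }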
Inter-Node Compression Framework • Invoked after all computation is done (in the MPI_Finalize wrapper) • Merges the operation queues produced by the task-level framework, e.g., Task 0 and Task 1 queues (op1, op2, op3) match and merge, as do Task 2 and Task 3 queues (op4, op5, op6) • Provides job-size scalability • There is more: relaxed reordering of events of different nodes (dependence check), see paper
Reduction over Binary Radix Tree • Cross-node framework merges the operation queues of each task • Merge algorithm supports merging two queues at a time (see paper) • Radix layout facilitates compression (constant stride between nodes) • Need a control mechanism to order the merging process
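The control mechanism can be pictured as a standard recursive-halving reduction over ranks; the sketch below shows only that ordering, with placeholder hooks standing in for ScalaTrace's actual queue serialization and merge routines:

    #include <stdio.h>

    /* Placeholder hooks: in ScalaTrace these would serialize, ship, and
     * merge the per-task operation queues. */
    static void send_queue_to(int dest)   { printf("send queue to %d\n", dest); }
    static void merge_queue_from(int src) { printf("merge queue from %d\n", src); }

    /* Merge ordering over a binary radix tree: in round k, tasks with bit k
     * set send their queue to the partner with bit k clear and stop; the
     * partner merges it in.  After log2(P) rounds, rank 0 holds everything. */
    static void merge_reduce(int rank, int nprocs)
    {
        for (int bit = 1; bit < nprocs; bit <<= 1) {
            if (rank & bit) { send_queue_to(rank - bit); break; }
            if (rank + bit < nprocs) merge_queue_from(rank + bit);
        }
    }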
Outline • Introduction • Intra-Node Compression Framework • Inter-Node Compression Framework • Replay Mechanism • Experimental Results • Conclusion and Future Work
Replay Mechanism • Motivation • Can replay traces on any architecture • Useful for rapid prototyping in procurements • Communication tuning (Miranda, SC'05) • Communication analysis (patterns) • Communication tuning (inefficiencies) • Replay design • Replays the comprehensive trace produced by the recording framework • Parses the trace and loads task-level op queues (inverse of the merge algorithm) • Replays on the fly (inverse of the compression algorithm) • Timing deltas
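A highly simplified sketch of what a replay step might look like; the event layout, the OP_* codes, and the use of usleep for the recorded compute delay are all assumptions for illustration, not ScalaTrace's actual interfaces:

    #include <mpi.h>
    #include <unistd.h>

    typedef struct {
        int    op;            /* e.g. OP_SEND, OP_RECV */
        int    peer_offset;   /* location-independent peer encoding */
        int    count;         /* message size in bytes */
        double delta_usec;    /* recorded compute time before this call */
    } Event;

    enum { OP_SEND, OP_RECV };

    static void replay_event(const Event *e, char *buf, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        usleep((useconds_t)e->delta_usec);       /* simulate the compute phase */
        int peer = rank + e->peer_offset;        /* decode offset to a rank */
        if (e->op == OP_SEND)
            MPI_Send(buf, e->count, MPI_BYTE, peer, 0, comm);
        else
            MPI_Recv(buf, e->count, MPI_BYTE, peer, 0, comm, MPI_STATUS_IGNORE);
    }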
Experimental Results • Environment • 1024-node BG/L at Lawrence Livermore National Laboratory • Stencil micro-benchmarks • Raptor, a real-world application [Greenough'03]
Trace System Validation • Uncompressed trace dumps compared • Replay • Matching profiles from mpiP
Task Scalability Performance • Varied: • Strong (task) scaling: number of nodes • Examined metrics: • Trace file size • Memory usage • Compression time (or write time) • Timed replay accuracy • Results for • 1/2/3D stencils • Raptor application • NAS Parallel Benchmarks
Trace File Size – 3D Stencil • Constant size for fully compressed traces (inter-node) • Log-scale plot (annotated sizes: 100 MB, 0.5 MB, 10 kB)
Memory Usage – 3D Stencil • Constant memory usage for fully compressed traces (inter-node) • min = leaves, avg = middle layer (decreases with node count), max ~ task 0 • Average memory usage decreases with more processors • (Plot annotation: 0.5 MB)
Trace File Size – Raptor App • Sub-linear increase for fully compressed traces (inter-node) • NOT on log scale (annotated sizes: 93 MB, 80 MB, 35 MB)
Memory Usage – Raptor • Constant memory usage for fully compressed traces (inter-node) • Average memory usage decreases with more processors • (Plot annotation: 500 MB)
Load Scaling – 3D Stencil • Both intra- and inter-node compression result in constant size • Log scale
Trace File Size – NAS PB Codes • Log-scale file size [Bytes], 32-512 CPUs; three bars for none / intra- / inter-node compression; focus on blue = full compression • Near-constant size (EP, also DT): instead of exponential growth • Sub-linear (MG, also LU): still good • Non-scalable (FT, also BT, CG, IS): still 2-4 orders of magnitude smaller, but could improve; due to complex communication patterns along the diagonal of the 2D layout, even with varying numbers of endpoints
Memory Usage – NAS PB Codes • Log-scale memory [Bytes], 32-512 CPUs; categories for min, avg, max, root (task 0) • Near-constant trace size (EP, also DT): also constant in memory • Sub-linear (MG, also LU): sometimes constant in memory • Non-scalable (FT, also BT, CG, IS): non-scalable in memory
Compression/Write Overhead – NAS PB • Log-scale time [ms], 32-512 CPUs; none / intra- / inter-node (full compression) • Near-constant size (EP, also DT): inter-node compression fastest • Sub-linear (LU, also MG): intra-node faster than inter-node • Non-scalable (FT, also BT, CG, IS): not competitive; better to write with intra-node compression only
Timed Replay – NAS PB • Record delta times between MPI calls in path-sensitive histograms • Replay compute time as delay on BG/L • Bars in replay experiments: uninstrumented (original run), with mpiP, uncompressed trace (w/ mpiP), node/intra compressed (w/ mpiP), global/inter compressed (w/ mpiP) • Report timing of replay for 32-512 CPUs: compute (delay) vs. communicate (just replayed) • Use a fine-grained clock • Result: small, scalable traces with timing behavior retained (shown for FT, CG, MG)
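One way to picture the path-sensitive delta-time recording is a small histogram keyed by calling context; the exponential binning below is only an assumed layout for illustration:

    #include <mpi.h>
    #include <math.h>

    #define NBINS 32

    /* One histogram per calling context (call path); exponential bins keep
     * the record small while preserving the shape of the delay distribution. */
    typedef struct { unsigned long count[NBINS]; } DeltaHist;

    static void record_delta(DeltaHist *h, double last_t, double now)
    {
        double usec = (now - last_t) * 1e6;
        int bin = (usec <= 1.0) ? 0 : (int)log2(usec);
        if (bin >= NBINS) bin = NBINS - 1;
        h->count[bin]++;
    }

    /* Usage: take t = MPI_Wtime() when an MPI call returns; on the next
     * call invoke record_delta(&hist_for_this_call_path, t, MPI_Wtime()). */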
ScalaTrace Summary Contributions: • Scalable approach to capturing the full trace of communication • Near-constant trace sizes for some apps (others need more work) • Near-constant memory requirement • Rapid analysis via replay mechanism from the trace (without the app) • Recording delta times retains timing behavior in a scalable way • Fast timeline search, easy outlier detection • Lossless MPI tracing of any number of nodes is feasible • May store & visualize MPI traces on a desktop Future Work: • Task layout model (e.g., Miranda) • Post-analysis stencil identification • Tuning: detect non-scalable MPI usage • Support for procurements • Offload compression to I/O nodes
Acknowledgements • Mike Noeth (NCSU, intern @ LLNL) • Prasun Ratn (NCSU, intern @ LLNL) • Martin Schulz (LLNL) • Bronis R. de Supinski (LLNL) • IPDPS’07 best paper, work on timed replay forthcoming • Availability under BSD license: moss.csc.ncsu.edu/~mueller/scala.html • Funded in part by Humboldt Foundation, NSF CCF-0429653, CNS-0410203, CAREER CCR-0237570 • Part of this work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48
Global Compression/Write Time [ms] • [Backup chart: average time per node and max. time for BT, CG, DT, EP, FT, IS, LU, MG]
Trace File Size – Raptor • Near constant size for fully compressed traces • Linear scale
Memory Usage – Raptor Results • Same as 3D stencil • Investigating the min memory usage for 1024 tasks
Intra-Node Compression Algorithm • Intercept MPI call • Identify target • Identify merger • Match verification • Compression
Call Sequence ID: Stack Walk Signature • Maintain structure by distinguishing between operations from different calling contexts • XOR signature for speed • An XOR match is necessary (but not sufficient) for the same calling context
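A minimal sketch of the XOR signature, assuming glibc's backtrace() is available to walk the stack (ScalaTrace's own stack walk may differ):

    #include <stdint.h>
    #include <execinfo.h>

    #define MAX_FRAMES 16

    /* XOR the return addresses on the call stack into one word so calling
     * contexts can be compared cheaply.  Equal signatures are necessary but
     * not sufficient, so a full stack comparison confirms the match. */
    static uintptr_t stack_signature(void)
    {
        void *frames[MAX_FRAMES];
        int n = backtrace(frames, MAX_FRAMES);
        uintptr_t sig = 0;
        for (int i = 0; i < n; i++)
            sig ^= (uintptr_t)frames[i];
        return sig;
    }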
Inter-Node Merge Algorithm • Iterate through both queues • Find match • Maintain order • Compress ops
Inter-Node Merge Example • Consider two tasks, each with its own operation queue: master queue (Sequence 1 … Sequence 4, participants: Task 0) and slave queue (Sequence 1 … Sequence 4, participants: Task 1) • [Diagrams: master iterator (MI), slave iterator (SI), and slave head (SH) advance through the queues; on each MATCH the participant lists are merged (Task 0, Task 1) and the matched slave sequence is removed from the slave queue]
Temporal Cross-Node Reordering • Requirement: the merged queue maintains the order of operations • The basic merge algorithm maintains order too strictly: unmatched sequences in the slave queue are always moved to the master, which results in poorer compression • Solution: only move operations that must be moved • Intersect the task participation lists of matched & unmatched ops • Empty intersection: no dependency • Otherwise: ops must be moved
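A small sketch of the dependency test, with participant lists simplified to a bitmask (illustrative only; ScalaTrace's participant lists are more general):

    #include <stdint.h>

    typedef struct {
        uint64_t participants;   /* bit i set => task i participates */
    } Seq;

    /* An unmatched slave sequence only has to be moved ahead of a matched
     * master sequence if their participant lists intersect. */
    static int must_move(const Seq *matched, const Seq *unmatched)
    {
        return (matched->participants & unmatched->participants) != 0;
        /* empty intersection => no dependency => leave it in place */
    }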
Dependency Example • Consider a 4-task job (tasks 1 & 3 have already merged into one slave queue) • Master queue: Sequence 1, Sequence 2 (participants: Task 0); slave queue: Sequence 2 (participants: Task 1), Sequence 1 (participants: Task 3) • [Diagrams: strictly maintaining slave order moves Sequence 2 (Task 1) into the master ahead of Sequence 1, producing duplicate Sequence 2 entries; with the participation-list check, Sequence 1 (Task 0) and Sequence 2 (Task 1) do not intersect, so Sequence 2 can instead merge with the master's matching Sequence 2]