UCRL-PRES-230290
Stack Trace Analysis for Large Scale Debugging using MRNet
Dorian C. Arnold, Barton P. Miller, University of Wisconsin
Dong Ahn, Bronis R. de Supinski, Gregory L. Lee, Martin Schulz, Lawrence Livermore National Laboratory
Scaling Tools
• Machine sizes are increasing
• New cluster close to or above 10,000 cores
• Blue Gene/L: over 131,000 cores
• Not only applications need to scale
• Support environment
• Tools
• Challenges
• Data collection, storage, and analysis
• Scalable process management and control
• Visualization
Stack Trace Analysis for Large Scale Debugging using MRNet, Paradyn Week, 4/30-5/1 2007
Debugging on BlueGene/L
• Typical debug session includes many interactions
• 4096 is only 3% of BG/L!
Scalability Limitations
• Large volumes of debug data
• Single frontend for all node connections
• Centralized data analysis
• Vendor licensing limitations
• Approach: scalable, lightweight debugger
• Reduce exploration space to small subset
• Online aggregation using a TBŌN
• Full-featured debugger for deeper digging
Outline
• Case study: CCSM
• STAT Approach
• Concept of Stack Traces
• Identification of Equivalence Classes
• Implementation
• Using Tree-based Overlay Networks
• Data and Work Flow in STAT
• Evaluation
• Conclusions
Case Study: CCSM
• Community Climate System Model (CCSM)
• Used to make climate predictions
• Coupled models for atmosphere, ocean, sea ice and land surface
• Implementation
• Multiple Program Multiple Data (MPMD) model
• MPI-based application
• Distinct components for each model
• Typically requires significant node count
• Models executed concurrently
• Several hundred tasks
Observations
• Intermittently hangs with 472 tasks
• Non-deterministic
• Only at large scale
• Appears at seemingly random code locations
• Hard to reproduce: 2 hangs over next 10 days (~50 runs)
• Current approach:
• Attach to job using TotalView
• Collect stack traces from all 472 tasks
• Visualize cross-node callgraph
CCSM Callgraph
[Figure: CCSM callgraph]
Lessons Learned
• Some bugs only occur at large scales
• Non-deterministic & hard to reproduce
• Stack traces can provide useful insight
• Many bugs are temporal in nature
• Need tools that:
• Combine spatial and temporal observations
• Discover application behavior
• Run effectively at scale
STAT Approach
• Sample application stack traces
• Across time and space
• Through third party interface
• Using a DynInst based daemon
• Merge/analyze traces:
• Discover equivalent process behavior
• Group similar processes
• Facilitate scalable analysis/data presentation
• Leverage TBŌN model (MRNet)
• Communicate traces back to a frontend
• Merge on the fly within MRNet filters
Singleton Stack Trace
[Figure: stack trace of a single application process (Appl.)]
Merging Stack Traces
• Multiple traces over space or time
• Taken independently
• Stored in graph representation
• Create call graph prefix tree
• Only merge nodes with identical stack backtrace
• Retains context information
• Advantages
• Compressed representation
• Scalable visualization
• Scalable analysis
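A minimal Python sketch of the prefix-tree merge described on this slide. It is an illustration only, not STAT's implementation: the node layout, the function names (`new_node`, `add_trace`, `count_named`, `render`) and the sample traces are all assumptions made for the example. Frames are merged only when the whole backtrace leading to them is identical, so the same function reached through two different call paths stays as two nodes.

```python
def new_node():
    # Hypothetical node layout: the tasks that reached this frame, plus callees.
    return {"tasks": set(), "children": {}}

def add_trace(root, task, frames):
    """Insert one stack trace (outermost frame first) taken from one task."""
    node = root
    node["tasks"].add(task)
    for frame in frames:
        # Reuse the child only if the path from the root is identical so far.
        node = node["children"].setdefault(frame, new_node())
        node["tasks"].add(task)

def count_named(node, name):
    """Count the tree nodes that carry a given function name."""
    return sum((child_name == name) + count_named(child, name)
               for child_name, child in node["children"].items())

def render(node, name="(root)", depth=0):
    """Return an indented text view of the merged tree, one line per node."""
    lines = ["  " * depth + f"{name} {sorted(node['tasks'])}"]
    for child_name in sorted(node["children"]):
        lines.extend(render(node["children"][child_name], child_name, depth + 1))
    return lines

# Made-up traces from four tasks.
traces = {
    0: ["main", "solve", "MPI_Waitall"],
    1: ["main", "solve", "MPI_Waitall"],
    2: ["main", "solve", "MPI_Barrier"],
    3: ["main", "io_write", "MPI_Barrier"],
}
root = new_node()
for task, frames in traces.items():
    add_trace(root, task, frames)
print("\n".join(render(root)))
```

Tasks 0 and 1 collapse onto one path, which is the compression the slide refers to, while `MPI_Barrier` appears twice because its two callers differ.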
Merging Stack Traces
[Figure: merged stack traces]
2D-Trace/Space Analysis
[Figure: one trace from each of several application processes (Appl, Appl, …, Appl)]
Prefix Tree vs. DAG
[Figure: two panels labeled TotalView and STAT]
2D-Trace/Time Analysis
[Figure: a series of traces over time from one application process (Appl)]
Time & Space Analysis
• Both 2D techniques insufficient
• Spatial aggregation misses temporal component
• Temporal aggregation misses parallel aspects
• Multiple samples, multiple processes
• Track global program behavior over time
• Merge into single, 3D prefix tree
• Challenges:
• Scalable data representation
• Scalable analysis
• Scalable and useful visualization/results
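One way to picture the combined time and space merge is sketched below in Python. It is a simplification under stated assumptions, not the STAT data structure: the tree is stored flat as a dictionary keyed by call path, the names `merge_samples` and `equivalence_classes` are hypothetical, and two tasks are treated as equivalent when they were observed in exactly the same set of complete traces across all snapshots.

```python
from collections import defaultdict

W = ("main", "step", "MPI_Waitall")
C = ("main", "step", "compute")
B = ("main", "step", "MPI_Barrier")

# samples[t][task] is the trace of `task` at snapshot t (made-up data).
samples = [
    {0: W, 1: C, 2: B, 3: W},
    {0: W, 1: W, 2: B, 3: W},
    {0: W, 1: C, 2: B, 3: W},
]

def merge_samples(samples):
    """Merge every (snapshot, task) trace into one path -> tasks mapping."""
    visits = defaultdict(set)
    for snapshot in samples:
        for task, frames in snapshot.items():
            for depth in range(1, len(frames) + 1):
                visits[frames[:depth]].add(task)
    return visits

def equivalence_classes(samples):
    """Group tasks whose set of observed traces over time is identical."""
    seen = defaultdict(set)
    for snapshot in samples:
        for task, frames in snapshot.items():
            seen[task].add(frames)
    classes = defaultdict(list)
    for task, paths in seen.items():
        classes[frozenset(paths)].append(task)
    return sorted(sorted(tasks) for tasks in classes.values())

visits = merge_samples(samples)
classes = equivalence_classes(samples)
print(classes)
```

In this toy data, task 1 alternates between computing and waiting, so it lands in a class of its own, whereas a single snapshot would have grouped it with either nothing or the waiting tasks depending on when it was taken.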
3D-Trace/Space/Time Analysis
[Figure: 4 nodes / 10 snapshots]
3D-Trace/Space/Time Analysis
[Figure: 288 nodes / 10 snapshots]
Implementation Details
• Communication through MRNet
• Single data stream from BE to FE
• Filters implement tree merge
• Tree depth can be configured
• Three major components
• Backend (BE) daemons gathering traces
• Communication processes merging prefix trees
• Frontend (FE) tool storing the final graph
• Final result saved as GML or DOT file
• Node classes color coded
• External visualization tools
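As an illustration of the final export step, the Python sketch below writes a merged tree as Graphviz DOT text, coloring nodes by the set of tasks that reached them. The function name `to_dot`, the palette, the label layout and the input format (call path mapped to a task set, with every prefix present) are assumptions for this example; the file STAT actually writes may look different.

```python
# Hypothetical palette; each distinct task set gets one color (reused if exhausted).
PALETTE = ["lightblue", "lightgreen", "salmon", "khaki"]

def to_dot(visits):
    """Render a path -> tasks mapping as a DOT digraph (all prefixes must be keys)."""
    ids = {path: f"n{i}" for i, path in enumerate(sorted(visits))}
    classes = sorted({frozenset(tasks) for tasks in visits.values()}, key=sorted)
    lines = ["digraph stat {", "  node [style=filled];"]
    for path, node_id in ids.items():
        tasks = visits[path]
        color = PALETTE[classes.index(frozenset(tasks)) % len(PALETTE)]
        label = f"{path[-1]}\\n{len(tasks)} task(s)"
        lines.append(f'  {node_id} [label="{label}", fillcolor={color}];')
        if len(path) > 1:
            # Edge from the caller (the path minus its last frame) to this frame.
            lines.append(f"  {ids[path[:-1]]} -> {node_id};")
    lines.append("}")
    return "\n".join(lines)

# Made-up merged tree for four tasks.
visits = {
    ("main",): {0, 1, 2, 3},
    ("main", "solve"): {0, 1, 2},
    ("main", "solve", "MPI_Waitall"): {0, 1, 2},
    ("main", "io"): {3},
    ("main", "io", "MPI_Barrier"): {3},
}
dot = to_dot(visits)
print(dot)
```

The resulting text can be handed to an external viewer, for example by saving it to a file and running Graphviz's `dot -Tpng` on it.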
Work and Data Flow
[Figure. Legend: STAT Frontend (FE), MRNet Communication Process (CP), Filter, STAT Tool Daemon (BE), Application Processes (MPI). Labels: trace( count, freq. ), Tree Merge, Node 1, Node 2, …, Node N-1, Node N]
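The data flow in this figure can be imitated in plain Python to show why merging inside the tree helps. This sketch does not use the MRNet API (MRNet filters are written in C++); `local_tree`, `merge_filter` and `reduce_over_tree` are hypothetical names, and the fan-out, task count and traces are made up. Each level of the loop stands for one layer of communication processes, so the frontend only ever receives one already-merged tree per child.

```python
def local_tree(task, frames):
    """Prefix tree a backend would hold for one trace of one task."""
    return {tuple(frames[:i]): {task} for i in range(1, len(frames) + 1)}

def merge_filter(trees):
    """Merge the trees received from child nodes: union task sets path by path."""
    merged = {}
    for tree in trees:
        for path, tasks in tree.items():
            merged.setdefault(path, set()).update(tasks)
    return merged

def reduce_over_tree(leaves, fanout):
    """Apply merge_filter level by level until one tree reaches the frontend."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level, depth = list(leaves), 0
    while len(level) > 1:
        level = [merge_filter(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        depth += 1
    return level[0], depth

# Sixteen made-up tasks: even ranks wait, odd ranks sit in a barrier.
leaves = [
    local_tree(rank, ["main", "step",
                      "MPI_Waitall" if rank % 2 == 0 else "MPI_Barrier"])
    for rank in range(16)
]
final_tree, depth = reduce_over_tree(leaves, fanout=4)
print(depth, {path[-1]: len(tasks) for path, tasks in final_tree.items()})
```

Changing `fanout` changes the number of merge levels, which mirrors the configurable tree depth mentioned in the implementation notes.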
STAT Performance
• 3844 processors, 0.741 seconds
• 1024x4 Cluster, 1.4 GHz Itanium2, Quadrics QsNetII
Conclusions
• Scaling tools poses challenges
• Data management and process control
• New strategies for tools needed
• STAT – Scalable Stacktrace Analysis
• Lightweight tool to identify process classes
• Based on merged callgraph prefix trees
• Aggregation in Time and Space
• Orthogonal to full featured debuggers
• Implementation based on TBŌNs
• Scalable data collection and aggregation
• Enables significant speedup
More Information
• Paper published at IPDPS 2007: Stack Trace Analysis for Large Scale Debugging, D. Arnold, D.H. Ahn, B.R. de Supinski, G. Lee, B.P. Miller, and M. Schulz
• Project website & Demo tomorrow: http://www.paradyn.org/STAT
• TBŌN computing papers & open-source prototype, MRNet, available at http://www.paradyn.org/mrnet