
Stack Trace Analysis for Large Scale Debugging using MRNet



Presentation Transcript


  1. Stack Trace Analysis for Large Scale Debugging using MRNet (UCRL-PRES-230290)
     Dorian C. Arnold, Barton P. Miller (University of Wisconsin)
     Dong Ahn, Bronis R. de Supinski, Gregory L. Lee, Martin Schulz (Lawrence Livermore National Laboratory)

  2. Scaling Tools
     • Machine sizes are increasing
       • New clusters close to or above 10,000 cores
       • Blue Gene/L: over 131,000 cores
     • Not only applications need to scale
       • Support environment
       • Tools
     • Challenges
       • Data collection, storage, and analysis
       • Scalable process management and control
       • Visualization
     Stack Trace Analysis for Large Scale Debugging using MRNet, Paradyn Week, 4/30-5/1 2007

  3. Debugging on BlueGene/L
     • Typical debug session includes many interactions
     • 4096 is only 3% of BG/L!

  4. Scalability Limitations
     • Large volumes of debug data
     • Single frontend for all node connections
     • Centralized data analysis
     • Vendor licensing limitations
     • Approach: scalable, lightweight debugger
       • Reduce exploration space to small subset
       • Online aggregation using a TBŌN
       • Full-featured debugger for deeper digging

  5. Outline
     • Case study: CCSM
     • STAT Approach
       • Concept of Stack Traces
       • Identification of Equivalence Classes
     • Implementation
       • Using Tree-based Overlay Networks
       • Data and Work Flow in STAT
     • Evaluation
     • Conclusions

  6. Case Study: CCSM
     • Community Climate System Model (CCSM)
       • Used to make climate predictions
       • Coupled models for atmosphere, ocean, sea ice and land surface
     • Implementation
       • Multiple Program Multiple Data (MPMD) model
       • MPI-based application
       • Distinct components for each model
     • Typically requires significant node count
       • Models executed concurrently
       • Several hundred tasks

  7. Observations
     • Intermittently hangs with 472 tasks
       • Non-deterministic
       • Only at large scale
       • Appears at seemingly random code locations
       • Hard to reproduce: 2 hangs over next 10 days (~50 runs)
     • Current approach:
       • Attach to job using TotalView
       • Collect stack traces from all 472 tasks
       • Visualize cross-node callgraph

  8. CCSM Callgraph

  9. Lessons Learned
     • Some bugs only occur at large scales
     • Non-deterministic & hard to reproduce
     • Stack traces can provide useful insight
     • Many bugs are temporal in nature
     • Need tools that:
       • Combine spatial and temporal observations
       • Discover application behavior
       • Run effectively at scale

  10. STAT Approach
      • Sample application stack traces
        • Across time and space
        • Through third party interface
        • Using a DynInst based daemon
      • Merge/analyze traces:
        • Discover equivalent process behavior
        • Group similar processes
        • Facilitate scalable analysis/data presentation
      • Leverage TBŌN model (MRNet)
        • Communicate traces back to a frontend
        • Merge on the fly within MRNet filters
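
The "group similar processes" step on this slide can be illustrated with a small sketch. It is not STAT's code: it assumes each task's stack trace is already available as a list of frame names (outermost first), and the function name `equivalence_classes` and the sample data are made up for illustration.

```python
from collections import defaultdict


def equivalence_classes(traces):
    """Group task ranks whose call paths are identical.

    `traces` maps a task rank to its call path, outermost frame first.
    Returns a dict mapping each distinct call path (as a tuple) to the
    sorted list of ranks that share it.
    """
    classes = defaultdict(list)
    for rank, frames in sorted(traces.items()):
        classes[tuple(frames)].append(rank)
    return dict(classes)


# Hypothetical sample: four tasks, two distinct behaviors.
sample = {
    0: ["main", "solve", "MPI_Waitall"],
    1: ["main", "solve", "MPI_Waitall"],
    2: ["main", "solve", "MPI_Barrier"],
    3: ["main", "solve", "MPI_Waitall"],
}

for path, ranks in equivalence_classes(sample).items():
    print(" > ".join(path), ranks)
```

With classes like these, a full-featured debugger only needs to attach to one representative task per class, which is the "reduce exploration space" idea from the earlier slide.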

  11. Singleton Stack Trace
      [figure: stack trace of a single application process ("Appl.")]

  12. Merging Stack Traces
      • Multiple traces over space or time
        • Taken independently
        • Stored in graph representation
      • Create call graph prefix tree
        • Only merge nodes with identical stack backtrace
        • Retains context information
      • Advantages
        • Compressed representation
        • Scalable visualization
        • Scalable analysis
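
A minimal sketch of the call graph prefix tree described on this slide, assuming traces arrive as lists of frame names (outermost first). The `Node` class, `insert_trace`, and the sample traces are illustrative names, not part of STAT. Two frames are merged only when the whole path leading to them is identical, so the same function reached through different callers stays in separate nodes.

```python
class Node:
    """One frame in the call graph prefix tree."""

    def __init__(self, name):
        self.name = name
        self.ranks = set()   # tasks whose trace passes through this node
        self.children = {}   # frame name -> Node


def insert_trace(root, rank, frames):
    """Add one trace (outermost frame first) for one task."""
    node = root
    node.ranks.add(rank)
    for frame in frames:
        if frame not in node.children:
            node.children[frame] = Node(frame)
        node = node.children[frame]
        node.ranks.add(rank)


def count_nodes(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Hypothetical traces: MPI_Barrier is reached from two different callers.
sample = {
    0: ["main", "init", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "MPI_Waitall"],
    3: ["main", "solve", "MPI_Waitall"],
}

root = Node("/")
for rank, frames in sample.items():
    insert_trace(root, rank, frames)

print(count_nodes(root), "nodes for", sum(len(f) for f in sample.values()), "frames")
```

The two `MPI_Barrier` nodes remain distinct because their backtraces differ, which is how the tree retains calling context while still compressing the shared `main` and `solve` prefixes.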

  13. Merging Stack Traces

  14. 2D-Trace/Space Analysis
      [figure: multiple application processes (Appl … Appl)]

  15. Prefix Tree vs. DAG
      [figure panels: TotalView, STAT]

  16. 2D-Trace/Time Analysis
      [figure: a single application process (Appl) over time]

  17. Time & Space Analysis
      • Both 2D techniques insufficient
        • Spatial aggregation misses temporal component
        • Temporal aggregation misses parallel aspects
      • Multiple samples, multiple processes
        • Track global program behavior over time
        • Merge into single, 3D prefix tree
      • Challenges:
        • Scalable data representation
        • Scalable analysis
        • Scalable and useful visualization/results
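
One way to picture the combined time-and-space merge is a prefix tree whose nodes count, per task, how many snapshots passed through them. The sketch below is an assumption about a workable representation, not the data structure STAT uses; `new_node`, `add_sample`, `merge_3d`, and the snapshots are hypothetical.

```python
from collections import Counter


def new_node():
    """A tree node: per-rank hit counts plus child frames."""
    return {"hits": Counter(), "children": {}}


def add_sample(root, rank, frames):
    """Record one snapshot of one task (outermost frame first)."""
    node = root
    for frame in frames:
        node = node["children"].setdefault(frame, new_node())
        node["hits"][rank] += 1


def merge_3d(snapshots):
    """Merge a time-ordered list of {rank: frames} snapshots into one tree."""
    root = new_node()
    for snapshot in snapshots:
        for rank, frames in snapshot.items():
            add_sample(root, rank, frames)
    return root


# Hypothetical run: rank 0 never leaves MPI_Waitall, rank 1 makes progress.
snapshots = [
    {0: ["main", "MPI_Waitall"], 1: ["main", "compute"]},
    {0: ["main", "MPI_Waitall"], 1: ["main", "compute"]},
    {0: ["main", "MPI_Waitall"], 1: ["main", "MPI_Barrier"]},
]

tree = merge_3d(snapshots)
for name, child in tree["children"]["main"]["children"].items():
    print(name, dict(child["hits"]))
```

A task that shows the same count as the number of snapshots at a leaf (rank 0 above) never moved during sampling, which a purely spatial merge of one snapshot could not reveal.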

  18. 3D-Trace/Space/Time Analysis: 4 Nodes / 10 Snapshots
      [figure: multiple application processes (Appl … Appl)]

  19. 3D-Trace/Space/Time Analysis: 288 Nodes / 10 Snapshots

  20. Implementation Details
      • Communication through MRNet
        • Single data stream from BE to FE
        • Filters implement tree merge
        • Tree depth can be configured
      • Three major components
        • Backend (BE) daemons gathering traces
        • Communication processes merging prefix trees
        • Frontend (FE) tool storing the final graph
      • Final result saved as GML or DOT file
        • Node classes color coded
        • External visualization tools
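
To make the "saved as ... DOT file" step concrete, here is a hedged sketch that serializes a small merged prefix tree as Graphviz DOT text. The tree layout (frame name mapped to a `(ranks, subtree)` pair), the function `to_dot`, and the edge labels showing task counts are assumptions for illustration; STAT's actual output format and color coding are not reproduced here.

```python
def to_dot(tree):
    """Render a prefix tree as Graphviz DOT text.

    `tree` maps a frame name to a (ranks, subtree) pair, where `ranks`
    is the set of tasks whose traces pass through that frame.
    """
    lines = ["digraph stat {", '  n0 [label="/"];']
    counter = [0]  # mutable counter shared with the nested walker

    def walk(parent_id, subtree):
        for name, (ranks, children) in sorted(subtree.items()):
            counter[0] += 1
            node_id = f"n{counter[0]}"
            lines.append(f'  {node_id} [label="{name}"];')
            lines.append(f'  {parent_id} -> {node_id} [label="{len(ranks)}"];')
            walk(node_id, children)

    walk("n0", tree)
    lines.append("}")
    return "\n".join(lines)


# Hypothetical merged tree for three tasks.
merged = {
    "main": ({0, 1, 2}, {
        "MPI_Barrier": ({2}, {}),
        "MPI_Waitall": ({0, 1}, {}),
    }),
}

dot_text = to_dot(merged)
print(dot_text)
```

The resulting text can be written to a file and rendered by an external tool such as Graphviz, for example with `dot -Tpng stat.dot -o stat.png`.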

  21. Work and Data Flow
      [figure labels: STAT Frontend (FE); MRNet Communication Process (CP) with Tree Merge filter; STAT Tool Daemon (BE); MPI Application Processes on Node 1, Node 2, … Node N-1, Node N; request trace( count, freq. )]
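
The data flow in this figure (backends produce local trees, communication processes merge them, the frontend receives one tree) can be simulated sequentially. This sketch does not use the MRNet API; `merge_trees` stands in for the tree-merge filter, `reduce_over_tbon` for the overlay network, and `fanout` is an assumed parameter for the number of children per communication process.

```python
def local_tree(rank, frames):
    """Prefix tree for one task: frame name -> [set_of_ranks, subtree]."""
    tree = {}
    node = tree
    for frame in frames:
        node[frame] = [{rank}, {}]
        node = node[frame][1]
    return tree


def merge_trees(a, b):
    """Filter step: merge prefix tree `b` into `a` and return `a`."""
    for name, (ranks, children) in b.items():
        if name in a:
            a[name][0] |= ranks
            merge_trees(a[name][1], children)
        else:
            # Copy so that later merges never mutate the input tree.
            a[name] = [set(ranks), merge_trees({}, children)]
    return a


def reduce_over_tbon(trees, fanout):
    """Merge level by level, `fanout` children per communication process."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(trees)
    if not level:
        return {}
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fanout):
            merged = {}
            for tree in level[i:i + fanout]:
                merge_trees(merged, tree)
            next_level.append(merged)
        level = next_level
    return level[0]


# Hypothetical job: eight tasks, one of them stuck in a different call.
traces = {r: ["main", "solve", "MPI_Waitall"] for r in range(7)}
traces[7] = ["main", "solve", "MPI_Barrier"]

leaves = [local_tree(rank, frames) for rank, frames in traces.items()]
result = reduce_over_tbon(leaves, fanout=2)
print(sorted(result["main"][1]["solve"][1]))
```

Because the merge is associative, the final tree does not depend on the fan-out, so the tree depth can be tuned for performance without changing the result. In a real overlay network the merges at each level would run in parallel on separate processes.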

  22. STAT Performance
      • 1024x4 Cluster, 1.4 GHz Itanium2, Quadrics QsNetII
      • 3844 processors, 0.741 seconds

  23. Conclusions
      • Scaling tools poses challenges
        • Data management and process control
        • New strategies for tools needed
      • STAT – Scalable Stacktrace Analysis
        • Lightweight tool to identify process classes
        • Based on merged callgraph prefix trees
        • Aggregation in Time and Space
        • Orthogonal to full featured debuggers
      • Implementation based on TBŌNs
        • Scalable data collection and aggregation
        • Enables significant speedup

  24. More Information
      • Paper published at IPDPS 2007: "Stack Trace Analysis for Large Scale Debugging", D. Arnold, D.H. Ahn, B.R. de Supinski, G. Lee, B.P. Miller, and M. Schulz
      • Project website & Demo tomorrow: http://www.paradyn.org/STAT
      • TBŌN computing papers & open-source prototype, MRNet, available at http://www.paradyn.org/mrnet
