
Stochastic Program Execution Tracing

SIGMA is a set of IBM/UMD tools that gather detailed statistics to help understand cache behavior in real applications, including MPI and OpenMP programs written in Fortran and C. The tools complement hardware counters and provide restructuring hints such as padding and blocking. The original approach relies on static instrumentation, trace compression, and extraction of loops and memory strides, followed by post-execution simulation and memory profiling. dynSIGMA combines Dyninst with SIGMA so that the sampling rate can be varied without recompilation and instrumentation can be toggled during execution. The memtime metric summarizes application memory performance, and characteristic patterns (vectors of per-object memtime) are compared using simple linear correlation. The example application is seis, a seismic simulation from SPEChpc2002 with variable timesteps and a different data pattern for each process. Irregular accesses, which spoil compression, are handled with hybrid traces and a modified linear regression method. Experiments use the NAS Parallel Benchmarks with liveness analysis, which gives up to 90% runtime reduction, and transactional instrumentation. Future work covers larger datasets, distributions other than uniform, and irregular control flow.


Presentation Transcript


  1. Stochastic Program Execution Tracing Jeff Odom, UMD

  2. SIGMA Goals • IBM/UMD tools to understand caches • Focus on detailed statistics • Complement existing hardware counters • Ability to handle real applications • MPI and OpenMP programs • Fortran and C • Provide hints about restructuring • Padding (both inter and intra data structures) • Blocking • UMD effort funded by PERC2

  3. Original SIGMA Approach • Static instrumentation • Capture full information about memory use • Produce compact trace • Extracts loops and memory strides • Post execution tools • Detailed simulator • Full discrete event simulator • Memory profiler • Portion of accesses attributed to each data structure

  4. Representing Program Execution • Capture full execution behavior • Record all basic blocks and memory addresses • Produces large traces (due to looping) • Trace compression • Maintain pattern buffer • Scan for repeating patterns • Extract memory strides • Repeat algorithms for nested loops [Figure: compressed trace layout with an RPT record followed by BLK1, BLK2, BLK3 and their ADR entries; field labels: Base, Count, Length, Stride]
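The stride-extraction step on this slide can be illustrated with a minimal sketch. It assumes a flat stream of addresses and folds it greedily into constant-stride runs; the `StrideRun` struct and `compressStrides` function are hypothetical names, and the record layout is an assumption rather than SIGMA's actual trace format. Pattern-buffer scanning and nested-loop handling are not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One compressed run: `count` addresses starting at `base`, each `stride` apart.
// The field names echo the slide's labels; the layout itself is an assumption.
struct StrideRun {
    std::int64_t base;
    std::int64_t stride;
    std::size_t count;
};

// Greedily fold an address stream into constant-stride runs.
std::vector<StrideRun> compressStrides(const std::vector<std::int64_t>& addrs) {
    std::vector<StrideRun> runs;
    std::size_t i = 0;
    while (i < addrs.size()) {
        StrideRun run{addrs[i], 0, 1};
        if (i + 1 < addrs.size()) {
            // The first two addresses fix the candidate stride.
            run.stride = addrs[i + 1] - addrs[i];
            run.count = 2;
            // Extend the run while the next address keeps the same stride.
            while (i + run.count < addrs.size() &&
                   addrs[i + run.count] - addrs[i + run.count - 1] == run.stride) {
                ++run.count;
            }
        }
        runs.push_back(run);
        i += run.count;
    }
    return runs;
}
```

For the stream 100, 104, 108, 112, 500, 507, 514 this produces two runs, (base 100, stride 4, count 4) and (base 500, stride 7, count 3). An irregular stream degenerates into many short runs, which is the compression problem the later slides return to.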

  5. Trace Compression Isn’t Enough • A few seconds… • Slows execution considerably • Generates gigabytes

  6. Sampling • We want… • Shorter execution times • Smaller traces • We need… • Representative traces • Where to sample? • Timestep boundary • Outermost loop • Requires manual identification (for now)
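Sampling at a timestep boundary can be sketched as a small decision function called once per iteration of the outermost loop. The class name and the `period` and `window` knobs are hypothetical, chosen for illustration; the slides do not specify how the sample rate is expressed.

```cpp
#include <cstddef>

// Decides, at each timestep boundary, whether tracing should be enabled.
// `period` and `window` are hypothetical knobs, not SIGMA parameters.
class TimestepSampler {
public:
    TimestepSampler(std::size_t period, std::size_t window)
        : period_(period), window_(window) {}

    // Trace the first `window` timesteps out of every `period` timesteps.
    bool shouldTrace(std::size_t timestep) const {
        return period_ != 0 && (timestep % period_) < window_;
    }

private:
    std::size_t period_;
    std::size_t window_;
};
```

With `TimestepSampler(10, 1)`, one timestep in ten is traced, so the trace shrinks by roughly that factor as long as the sampled timesteps are representative of the rest.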

  7. Dyninst + SIGMA = dynSIGMA • Dyninst adds flexibility • Vary sample rate without recompilation • Adaptive/progressive rate during execution • Target application runs at native speed when instrumentation turned off • Leverage existing SIGMA infrastructure • Only generate trace • Offline simulation/profiling steps unchanged • Dual application framework • Mutatee generates trace • Mutator toggles instrumentation

  8. Memtime • Simple but effective metric of application memory performance

  9. Characteristic Pattern • Local and global data objects given canonical name • Vector of objects’ memtime is characteristic data pattern • Comparison of characteristic patterns done with simple linear correlation • Can also be applied for function objects
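Comparing two characteristic patterns can be sketched as a correlation between two equal-length vectors of per-object memtime values. The sketch below assumes that "simple linear correlation" means the Pearson coefficient and that both vectors list the same canonical object names in the same order; the function name is hypothetical, and the memtime values are treated as given inputs because the transcript does not define how they are computed.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Pearson correlation between two characteristic patterns.
// Assumption: element i of both vectors refers to the same data object.
double patternCorrelation(const std::vector<double>& a,
                          const std::vector<double>& b) {
    if (a.size() != b.size() || a.size() < 2) {
        throw std::invalid_argument("patterns must have equal length >= 2");
    }
    const double n = static_cast<double>(a.size());
    double meanA = 0.0, meanB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        meanA += a[i];
        meanB += b[i];
    }
    meanA /= n;
    meanB /= n;

    double cov = 0.0, varA = 0.0, varB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - meanA;
        const double db = b[i] - meanB;
        cov += da * db;
        varA += da * da;
        varB += db * db;
    }
    if (varA == 0.0 || varB == 0.0) {
        throw std::domain_error("a constant pattern has no defined correlation");
    }
    return cov / std::sqrt(varA * varB);
}
```

A value near 1 suggests that two samples spend their memory time on the same objects in the same proportions; a low or negative value suggests the sampled trace is not representative of the other.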

  10. Example Application: seis • Seismic simulation from SPEChpc2002 • Models multiple seismic processes • Process results pipelined • Variable timesteps • Different data pattern for each process • C & Fortran • Fortran – data processing • C – dynamic memory management, IO

  11. Space & Time Gains From Sampling • Includes 0:12 instrumentation overhead

  12. Challenge of Irregularity • Compression requires regular accesses • Sampling may hide poor compression • Each sample may compress poorly • Offset by low sampling rate • Sampling may not be accurate enough • Control flow sampled as well • Sample boundary requires manual definition

  13. Hybrid Traces • Accuracy may be more important than execution time, but storage capacity may be limited • Modeling data access at particular points can be more accurate than timestep sampling • Many codes are mostly regular, but irregular patterns spoil compression

  14. Modified Linear Regression • Establish linear pattern (min 3 points) at each memory access location • Look for repetitions of pattern with higher-level strides • Once input no longer matches pattern, treat further input as irregular until new pattern discovered
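The first step, establishing a linear pattern from at least three points, can be sketched as follows. This covers only that step under simple assumptions: the input is the address sequence seen at a single memory access location, and the search for repetitions with higher-level strides is not shown. `LinearPattern` and `establishPattern` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A run of `length` values starting at `base`, each `stride` apart.
// length == 0 means no pattern could be established.
struct LinearPattern {
    std::int64_t base;
    std::int64_t stride;
    std::size_t length;
};

// Try to establish a linear pattern beginning at index `start`.
// At least three points with a constant stride are required.
LinearPattern establishPattern(const std::vector<std::int64_t>& values,
                               std::size_t start) {
    const std::size_t kMinPoints = 3;
    const LinearPattern none{0, 0, 0};
    if (start >= values.size() || values.size() - start < kMinPoints) {
        return none;
    }
    const std::int64_t stride = values[start + 1] - values[start];
    std::size_t length = 2;
    // Extend while consecutive differences keep matching the stride.
    while (start + length < values.size() &&
           values[start + length] - values[start + length - 1] == stride) {
        ++length;
    }
    if (length < kMinPoints) {
        return none;
    }
    return LinearPattern{values[start], stride, length};
}
```

On the sequence 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 used in the later slides, starting at index 0 yields base 0, stride 1, length 3. Starting at index 3 (the value 5) yields no pattern, so that input would be treated as irregular.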

  15. Modified Linear Regression • Irregular sequence modeled using uniform distribution • Pattern matching done local to each instrumentation (memory access) point • Original SIGMA pattern matches globally

  16. Modified Linear Regression • Example: 0, 1, 2, 5, 9, 10, 11, 12, 2, 5

  17. Modified Linear Regression • Example: 0, 1, 2, 5, 9, 10, 11, 12, 2, 5

  18. Modified Linear Regression • Example: 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 • Becomes: 0 + x + 10y + {5,9,2,5}

  19. Modified Linear Regression • Example: 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 • Becomes: 0 + x + 10y + {5,9,2,5} • Becomes: 0 + x + 10y + {l:2, h:9}
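The decomposition on these slides can be checked with a small sketch that replays a trace against an already-fitted two-level pattern and summarizes whatever the pattern does not predict as uniform bounds. It does not perform the pattern discovery itself. The struct and function names are hypothetical, and the iteration counts (3 values of x, 2 values of y) are an assumption inferred from the example, since the slide gives only `0 + x + 10y`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-level linear pattern: base + strideX * x + strideY * y,
// with x in [0, countX) varying fastest and y in [0, countY).
struct TwoLevelPattern {
    std::int64_t base;
    std::int64_t strideX;
    std::size_t countX;
    std::int64_t strideY;
    std::size_t countY;
};

// Low and high bounds of a uniform distribution over the irregular values.
struct UniformBounds {
    std::int64_t low;
    std::int64_t high;
};

struct HybridSummary {
    std::vector<std::int64_t> irregular;  // values the pattern did not predict
    std::size_t matched;                  // values the pattern did predict
};

// Walk the trace in order; a value is regular only if it equals the next
// value the pattern expects, otherwise it is recorded as irregular.
HybridSummary replay(const TwoLevelPattern& p,
                     const std::vector<std::int64_t>& trace) {
    HybridSummary out{{}, 0};
    const std::size_t total = p.countX * p.countY;
    for (std::int64_t value : trace) {
        bool hit = false;
        if (out.matched < total) {
            const std::int64_t x = static_cast<std::int64_t>(out.matched % p.countX);
            const std::int64_t y = static_cast<std::int64_t>(out.matched / p.countX);
            hit = (value == p.base + p.strideX * x + p.strideY * y);
        }
        if (hit) {
            ++out.matched;
        } else {
            out.irregular.push_back(value);
        }
    }
    return out;
}

// Summarize irregular values by their minimum and maximum.
// The caller must pass a non-empty vector.
UniformBounds boundsOf(const std::vector<std::int64_t>& values) {
    const auto mm = std::minmax_element(values.begin(), values.end());
    return UniformBounds{*mm.first, *mm.second};
}
```

Replaying 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 against `TwoLevelPattern{0, 1, 3, 10, 2}` matches 0, 1, 2, 10, 11, 12 and leaves 5, 9, 2, 5 as irregular, whose bounds are low 2 and high 9, in agreement with the slide. Storing only the two bounds instead of the full irregular list is what keeps the hybrid trace small, at the cost of the exact values.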

  20. Experiment Setup • NAS Parallel Benchmarks 3.2 Serial Version, Class S • IBM XL C 8.0, XL Fortran 10.1 • DyninstAPI 5.0, including • Liveness analysis • Up to 90% runtime reduction by excluding one SPR (MQ) • Additional 3% improvement with other GPR/FPR • Transactional instrumentation • Instrumentation always on (no sampling)

  21. Transactional Instrumentation
      BPatch_thread *thr;
      BPatch_process *proc;
      proc = thr->getProcess();
      proc->beginInsertionSet();
      …
      thr->insertSnippet(…);
      thr->insertSnippet(…);
      …
      proc->finalizeInsertionSet(true);
      • Reduces • Memory allocation • Insertion time • Atomic operation

  22. Trace Size

  23. Accuracy

  24. Future Work • Larger datasets (NPB Class B,C) • Some results already gathered for W • Distributions other than uniform • Irregular control flow • Example: Upper triangular matrix does not need to iterate all MxN values • Uses edge instrumentation • BPatch_basicBlock::getIncomingEdges • BPatch_basicBlock::getOutgoingEdges • BPatch_edge::getPoint
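The upper triangular example can be made concrete with a short sketch that counts inner-loop iterations per row. It assumes a square n x n matrix for simplicity (the slide speaks of MxN), and the function name is hypothetical. It only illustrates why the control flow is irregular and does not use Dyninst edge instrumentation.

```cpp
#include <cstddef>
#include <vector>

// Inner-loop trip count for each row of an upper-triangular traversal
// of an n x n matrix (assumption: square matrix, diagonal included).
std::vector<std::size_t> upperTriangularTrips(std::size_t n) {
    std::vector<std::size_t> trips;
    trips.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t count = 0;
        for (std::size_t j = i; j < n; ++j) {
            ++count;  // only elements with j >= i are visited
        }
        trips.push_back(count);
    }
    return trips;
}
```

For n = 4 the trip counts are 4, 3, 2, 1, for a total of 10 visits instead of 16. Because the inner count changes with the outer index, a single fixed (count, stride) descriptor for the inner loop does not fit, which is the kind of case that recording which edges are taken is meant to address.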

  25. Questions?
