UMD's SIGMA project gathers detailed statistics on cache behavior in real applications, including MPI and OpenMP programs written in Fortran and C, and offers restructuring hints such as padding and blocking. It complements hardware counters with static instrumentation, trace compression, and identification of memory access patterns. The dynSIGMA tool combines Dyninst with SIGMA so that the sampling rate can be varied and instrumentation toggled adaptively at run time. The memtime metric summarizes application memory performance, and characteristic pattern analysis compares the memtime of data objects. The example application is a seismic simulation with variable timesteps and a different data pattern for each process. Irregular accesses that compress poorly are handled with hybrid traces and a modified linear regression method. Experiments use the NAS Parallel Benchmarks with Dyninst liveness analysis, which reduces runtime by up to 90%, and transactional instrumentation. Future work covers larger datasets and irregular control flow.
Stochastic Program Execution Tracing Jeff Odom, UMD
SIGMA Goals • IBM/UMD tools to understand caches • Focus on detailed statistics • Complement existing hardware counters • Ability to handle real applications • MPI and OpenMP programs • Fortran and C • Provide hints about restructuring • Padding (both inter and intra data structures) • Blocking • UMD effort funded by PERC2
Original SIGMA Approach • Static instrumentation • Capture full information about memory use • Produce compact trace • Extracts loops and memory strides • Post execution tools • Detailed simulator • Full discrete event simulator • Memory profiler • Portion of accesses attributed to each data structure
Representing Program Execution • Capture full execution behavior • Record all basic blocks and memory addresses • Produces large traces (due to looping) • Trace compression • Maintain pattern buffer • Scan for repeating patterns • Extract memory strides • Repeat algorithms for nested loops • [Figure: compressed trace made of repeat (RPT), basic block (BLK) and address (ADR) records, with fields labeled Base, Count, Length and Stride]
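The stride extraction step can be illustrated with a minimal sketch. This is not SIGMA's trace format or algorithm: the `StrideRun` record and both function names are hypothetical, and the sketch folds only the innermost level of repetition (no basic block records, no nested loops). It assumes a run is any maximal sequence of addresses with a constant difference.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical compressed record: `count` addresses starting at `base`,
// each `stride` bytes after the previous one.
struct StrideRun {
    std::uint64_t base;
    std::int64_t stride;
    std::size_t count;
};

// Greedily fold consecutive addresses with a constant difference into runs.
std::vector<StrideRun> compress_strides(const std::vector<std::uint64_t>& addrs) {
    std::vector<StrideRun> runs;
    std::size_t i = 0;
    while (i < addrs.size()) {
        StrideRun run{addrs[i], 0, 1};
        if (i + 1 < addrs.size()) {
            // The first difference fixes the stride of this run.
            run.stride = static_cast<std::int64_t>(addrs[i + 1] - addrs[i]);
            std::size_t j = i + 1;
            while (j < addrs.size() &&
                   static_cast<std::int64_t>(addrs[j] - addrs[j - 1]) == run.stride) {
                ++run.count;
                ++j;
            }
        }
        i += run.count;
        runs.push_back(run);
    }
    return runs;
}

// Inverse of compress_strides: regenerate the full address sequence.
std::vector<std::uint64_t> expand_strides(const std::vector<StrideRun>& runs) {
    std::vector<std::uint64_t> addrs;
    for (const StrideRun& run : runs) {
        for (std::size_t k = 0; k < run.count; ++k) {
            addrs.push_back(run.base + static_cast<std::uint64_t>(run.stride) * k);
        }
    }
    return addrs;
}
```

For the made-up sequence 100, 104, 108, 112, 300, 308, 316 this yields two records, (base 100, stride 4, count 4) and (base 300, stride 8, count 3), and expanding them reproduces the input.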
Trace Compression Isn’t Enough • A few seconds… • Slows execution considerably • Generates gigabytes
Sampling • We want… • Shorter execution times • Smaller traces • We need… • Representative traces • Where to sample? • Timestep boundary • Outermost loop • Requires manual identification (for now)
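One simple way to picture sampling at timestep boundaries is a periodic sampler, sketched below. The class name and the fixed period are assumptions for illustration only; the slides do not say how dynSIGMA chooses which timesteps to trace.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical periodic sampler: trace one timestep out of every `period`.
class TimestepSampler {
public:
    // A period of 0 is treated as 1 (trace every timestep).
    explicit TimestepSampler(std::size_t period)
        : period_(period == 0 ? 1 : period), step_(0) {}

    // Call once at each timestep boundary (for example, at the top of the
    // outermost loop). Returns true if the coming timestep should be traced.
    bool on_timestep_boundary() {
        const bool trace = (step_ % period_ == 0);
        ++step_;
        return trace;
    }

private:
    std::size_t period_;
    std::size_t step_;
};
```

With a period of 4, timesteps 0, 4 and 8 out of the first ten are traced and the rest run uninstrumented.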
Dyninst + SIGMA = dynSIGMA • Dyninst adds flexibility • Vary sample rate without recompilation • Adaptive/progressive rate during execution • Target application runs at native speed when instrumentation turned off • Leverage existing SIGMA infrastructure • Only generate trace • Offline simulation/profiling steps unchanged • Dual application framework • Mutatee generates trace • Mutator toggles instrumentation
Memtime • Simple but effective metric of application memory performance
Characteristic Pattern • Local and global data objects given canonical name • Vector of objects’ memtime is characteristic data pattern • Comparison of characteristic patterns done with simple linear correlation • Can also be applied for function objects
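The comparison step can be sketched as a Pearson correlation between two memtime vectors. This assumes both vectors list the same canonically named objects in the same order; the function name is hypothetical and the slides do not specify which correlation coefficient the tool computes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson linear correlation of two equal-length vectors.
// Returns 0.0 if the inputs are empty, differ in length, or either one
// has zero variance (the coefficient is undefined in those cases).
double linear_correlation(const std::vector<double>& a, const std::vector<double>& b) {
    if (a.size() != b.size() || a.empty()) {
        return 0.0;
    }
    const double n = static_cast<double>(a.size());
    double mean_a = 0.0;
    double mean_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        mean_a += a[i];
        mean_b += b[i];
    }
    mean_a /= n;
    mean_b /= n;

    double cov = 0.0;
    double var_a = 0.0;
    double var_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - mean_a;
        const double db = b[i] - mean_b;
        cov += da * db;
        var_a += da * da;
        var_b += db * db;
    }
    if (var_a == 0.0 || var_b == 0.0) {
        return 0.0;
    }
    return cov / std::sqrt(var_a * var_b);
}
```

A value near 1 would indicate that a sampled trace attributes memtime to data objects in nearly the same proportions as the full trace.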
Example Application: seis • Seismic simulation from SPEChpc2002 • Models multiple seismic processes • Process results pipelined • Variable timesteps • Different data pattern for each process • C & Fortran • Fortran – data processing • C – dynamic memory management, IO
Space & Time Gains From Sampling • Includes 0:12 instrumentation overhead
Challenge of Irregularity • Compression requires regular accesses • Sampling may hide poor compression • Each sample may compress poorly • Offset by low sampling rate • Sampling may not be accurate enough • Control flow sampled as well • Sample boundary requires manual definition
Hybrid Traces • Accuracy may be more important than execution time, but storage capacity may be limited • Modeling data access at particular points can be more accurate than timestep sampling • Many codes are mostly regular, but irregular patterns spoil compression
Modified Linear Regression • Establish linear pattern (min 3 points) at each memory access location • Look for repetitions of pattern with higher-level strides • Once input no longer matches pattern, treat further input as irregular until new pattern discovered
Modified Linear Regression • Irregular sequence modeled using uniform distribution • Pattern matching done local to each instrumentation (memory access) point • Original SIGMA pattern matches globally
Modified Linear Regression • Example: 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 • Becomes: 0 + x + 10y + {5,9,2,5} • Becomes: 0 + x + 10y + {l:2, h:9}
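The example can be checked with a small sketch that takes the regular pattern as given and separates the input into matched and irregular values. Reading the model 0 + x + 10y as base 0, inner stride 1 over three points, and outer stride 10 over two repetitions is an interpretation of the slide. All type and function names are hypothetical, and the sketch does not implement pattern discovery, which the slides do not specify in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Regular part: value = base + inner_stride * x + outer_stride * y,
// with 0 <= x < inner_count and 0 <= y < outer_count.
struct LinearPattern {
    std::int64_t base;
    std::int64_t inner_stride;
    std::size_t inner_count;
    std::int64_t outer_stride;
    std::size_t outer_count;
};

// Irregular part, summarised as the bounds of a uniform distribution.
struct UniformSummary {
    std::int64_t low;
    std::int64_t high;
    std::size_t samples;
};

struct SplitResult {
    std::vector<std::int64_t> irregular;  // values the pattern did not explain
    std::size_t matched;                  // values the pattern did explain
};

// Walk the input once. A value equal to the next expected pattern value is
// consumed by the pattern; anything else is set aside as irregular.
SplitResult split_by_pattern(const std::vector<std::int64_t>& input,
                             const LinearPattern& p) {
    SplitResult out{{}, 0};
    std::size_t x = 0;
    std::size_t y = 0;
    for (std::int64_t v : input) {
        const bool active = y < p.outer_count;
        const std::int64_t expected =
            p.base + p.inner_stride * static_cast<std::int64_t>(x) +
            p.outer_stride * static_cast<std::int64_t>(y);
        if (active && v == expected) {
            ++out.matched;
            if (++x == p.inner_count) {  // inner run finished, next repetition
                x = 0;
                ++y;
            }
        } else {
            out.irregular.push_back(v);
        }
    }
    return out;
}

// Reduce the irregular values to low and high bounds.
UniformSummary summarise_irregular(const std::vector<std::int64_t>& values) {
    if (values.empty()) {
        return UniformSummary{0, 0, 0};
    }
    auto mm = std::minmax_element(values.begin(), values.end());
    return UniformSummary{*mm.first, *mm.second, values.size()};
}
```

Applied to 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 with the pattern (0, 1, 3, 10, 2), six values match the pattern, the irregular values are 5, 9, 2, 5, and their summary has low 2 and high 9, in agreement with the slide.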
Experiment Setup • NAS Parallel Benchmarks 3.2 Serial Version, Class S • IBM XL C 8.0, XL Fortran 10.1 • DyninstAPI 5.0, including • Liveness analysis • Up to 90% runtime reduction by excluding one SPR (MQ) • Additional 3% improvement with other GPR/FPR • Transactional instrumentation • Instrumentation always on (no sampling)
Transactional Instrumentation
BPatch_thread *thr;
BPatch_process *proc;
proc = thr->getProcess();
proc->beginInsertionSet();
…
thr->insertSnippet(…);
thr->insertSnippet(…);
…
proc->finalizeInsertionSet(true);
• Reduces • Memory allocation • Insertion time • Atomic operation
Future Work • Larger datasets (NPB Class B,C) • Some results already gathered for W • Distributions other than uniform • Irregular control flow • Example: Upper triangular matrix does not need to iterate all MxN values • Uses edge instrumentation • BPatch_basicBlock::getIncomingEdges • BPatch_basicBlock::getOutgoingEdges • BPatch_edge::getPoint
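The upper triangular case can be made concrete with a small sketch that counts inner-loop iterations per row. It assumes the loop nest visits element (i, j) only when j >= i; the function name is hypothetical. The point is that the inner trip count changes with the outer index, so a single fixed count per loop does not describe the control flow.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inner-loop trip count for each row of an upper triangular traversal
// over a rows x cols matrix (only elements with j >= i are visited).
std::vector<std::size_t> upper_triangular_trip_counts(std::size_t rows,
                                                      std::size_t cols) {
    std::vector<std::size_t> counts(rows, 0);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = i; j < cols; ++j) {
            ++counts[i];
        }
    }
    return counts;
}
```

For a 4x4 matrix the per-row counts are 4, 3, 2, 1, a total of 10 visits instead of the 16 that a full rectangular traversal would make.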