Stochastic Program Execution Tracing Jeff Odom, UMD
SIGMA Goals • IBM/UMD tools to understand caches • Focus on detailed statistics • Complement existing hardware counters • Ability to handle real applications • MPI and OpenMP programs • Fortran and C • Provide hints about restructuring • Padding (both between and within data structures) • Blocking • UMD effort funded by PERC2
Original SIGMA Approach • Static instrumentation • Capture full information about memory use • Produce compact trace • Extract loops and memory strides • Post-execution tools • Detailed simulator • Full discrete-event simulator • Memory profiler • Portion of accesses attributed to each data structure
Representing Program Execution • Capture full execution behavior • Record all basic blocks and memory addresses • Produces large traces (due to looping) • Trace compression • Maintain pattern buffer • Scan for repeating patterns • Extract memory strides • Repeat algorithms for nested loops • [Figure: compressed trace record made of RPT, BLK, and ADR entries, with fields labeled Base, Count, Length, and Stride]
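The stride-extraction step can be illustrated with a minimal sketch. It is not SIGMA's actual encoder or trace format: the `StrideRun` record and the `compressStrides` and `expandStrides` functions are hypothetical names introduced here, and the sketch only folds a flat address stream into constant-stride runs, without the pattern buffer or the nested-loop pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: addresses base, base+stride, ..., base+(count-1)*stride.
struct StrideRun {
    std::int64_t base;
    std::int64_t stride;
    std::size_t count;
};

// Greedily fold an address stream into maximal constant-stride runs.
std::vector<StrideRun> compressStrides(const std::vector<std::int64_t>& addrs) {
    std::vector<StrideRun> runs;
    std::size_t i = 0;
    while (i < addrs.size()) {
        StrideRun run{addrs[i], 0, 1};
        if (i + 1 < addrs.size()) {
            // The first two addresses fix the candidate stride.
            run.stride = addrs[i + 1] - addrs[i];
            run.count = 2;
            // Extend the run while the same stride keeps holding.
            while (i + run.count < addrs.size() &&
                   addrs[i + run.count] - addrs[i + run.count - 1] == run.stride) {
                ++run.count;
            }
        }
        runs.push_back(run);
        i += run.count;
    }
    return runs;
}

// Expand runs back into the original address stream (the encoding is lossless).
std::vector<std::int64_t> expandStrides(const std::vector<StrideRun>& runs) {
    std::vector<std::int64_t> out;
    for (const StrideRun& r : runs) {
        assert(r.count > 0);  // every run covers at least one address
        for (std::size_t k = 0; k < r.count; ++k) {
            out.push_back(r.base + static_cast<std::int64_t>(k) * r.stride);
        }
    }
    return out;
}
```

For a regular stream such as 100, 104, 108, 112 followed by 200, 300, 400, the sketch stores two records instead of seven addresses. A stream with no repeated stride degrades to roughly one record per two addresses, which is the irregularity problem the later slides address.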
Trace Compression Isn’t Enough • A few seconds… • Slows execution considerably • Generates gigabytes
Sampling • We want… • Shorter execution times • Smaller traces • We need… • Representative traces • Where to sample? • Timestep boundary • Outermost loop • Requires manual identification (for now)
Dyninst + SIGMA = dynSIGMA • Dyninst adds flexibility • Vary sample rate without recompilation • Adaptive/progressive rate during execution • Target application runs at native speed when instrumentation is turned off • Leverage existing SIGMA infrastructure • Only trace generation changes • Offline simulation/profiling steps unchanged • Dual application framework • Mutatee generates trace • Mutator toggles instrumentation
Memtime • Simple but effective metric of application memory performance
Characteristic Pattern • Local and global data objects are given canonical names • The vector of the objects’ memtime values is the characteristic data pattern • Characteristic patterns are compared using simple linear correlation • Can also be applied to function objects
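A minimal sketch of the comparison step follows. It assumes that "simple linear correlation" means the Pearson correlation coefficient and that both vectors list memtime values for the same canonical object names in the same order. The function name `pearsonCorrelation` is hypothetical, and returning 0 for a constant vector is a convention chosen here, not something the slides specify.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two characteristic patterns.
// a[i] and b[i] are assumed to be memtime values for the same canonical object.
double pearsonCorrelation(const std::vector<double>& a,
                          const std::vector<double>& b) {
    assert(a.size() == b.size() && a.size() >= 2);
    const double n = static_cast<double>(a.size());

    double meanA = 0.0, meanB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        meanA += a[i];
        meanB += b[i];
    }
    meanA /= n;
    meanB /= n;

    double cov = 0.0, varA = 0.0, varB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - meanA;
        const double db = b[i] - meanB;
        cov += da * db;
        varA += da * da;
        varB += db * db;
    }
    // Correlation is undefined for a constant vector; report 0 (assumed convention).
    if (varA == 0.0 || varB == 0.0) {
        return 0.0;
    }
    return cov / std::sqrt(varA * varB);
}
```

A value near 1 would suggest that two samples exercise the data objects in the same proportions, and a value near 0 would suggest a different data pattern, as with the separate seismic processes in the seis example.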
Example Application: seis • Seismic simulation from SPEChpc2002 • Models multiple seismic processes • Process results pipelined • Variable timesteps • Different data pattern for each process • C & Fortran • Fortran – data processing • C – dynamic memory management, IO
Space & Time Gains From Sampling • Includes 0:12 instrumentation overhead
Challenge of Irregularity • Compression requires regular accesses • Sampling may hide poor compression • Each sample may compress poorly • Offset by low sampling rate • Sampling may not be accurate enough • Control flow sampled as well • Sample boundary requires manual definition
Hybrid Traces • Accuracy may be more important than execution time, but storage capacity may be limited • Modeling data access at particular points can be more accurate than timestep sampling • Many codes are mostly regular, but irregular patterns spoil compression
Modified Linear Regression • Establish linear pattern (min 3 points) at each memory access location • Look for repetitions of pattern with higher-level strides • Once input no longer matches pattern, treat further input as irregular until new pattern discovered
Modified Linear Regression • Irregular sequence modeled using uniform distribution • Pattern matching done local to each instrumentation (memory access) point • Original SIGMA pattern matches globally
Modified Linear Regression • Example: 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 • Becomes: 0 + x + 10y + {5,9,2,5} • Becomes: 0 + x + 10y + {l:2, h:9}
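The example can be traced with a minimal sketch of one access point's model. The slides do not spell out how the higher-level stride is discovered, so the sketch takes it as an argument (10 for this example). The `HybridModel` struct and `fitHybrid` function are hypothetical names, and the matching rule is one possible reading of "treat further input as irregular", not the actual dynSIGMA implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: base + innerStride*x + outerStride*y, plus irregular values.
struct HybridModel {
    std::int64_t base = 0;
    std::int64_t innerStride = 0;         // coefficient of x
    std::size_t innerCount = 0;           // x ranges over [0, innerCount)
    std::int64_t outerStride = 0;         // coefficient of y (supplied by caller)
    std::size_t outerCount = 0;           // complete repetitions seen (range of y)
    std::vector<std::int64_t> irregular;  // values that did not fit the pattern
    std::int64_t lo = 0, hi = 0;          // uniform bounds over irregular values
};

HybridModel fitHybrid(const std::vector<std::int64_t>& seq,
                      std::int64_t outerStride) {
    // A linear pattern needs at least 3 points, and they must be collinear.
    assert(seq.size() >= 3);
    assert(seq[1] - seq[0] == seq[2] - seq[1]);

    HybridModel m;
    m.base = seq[0];
    m.innerStride = seq[1] - seq[0];
    m.outerStride = outerStride;

    // Establish the inner pattern: extend while the stride keeps holding.
    std::size_t i = 3;
    while (i < seq.size() && seq[i] - seq[i - 1] == m.innerStride) {
        ++i;
    }
    m.innerCount = i;
    m.outerCount = 1;

    // Match later input against the next expected repetition; anything else
    // is irregular. A partially matched final repetition is not counted.
    std::size_t x = 0;
    for (; i < seq.size(); ++i) {
        const std::int64_t expected =
            m.base + static_cast<std::int64_t>(x) * m.innerStride +
            static_cast<std::int64_t>(m.outerCount) * m.outerStride;
        if (seq[i] == expected) {
            if (++x == m.innerCount) {
                x = 0;
                ++m.outerCount;
            }
        } else {
            m.irregular.push_back(seq[i]);
        }
    }

    // Summarize the irregular values by the bounds of a uniform distribution.
    if (!m.irregular.empty()) {
        auto mm = std::minmax_element(m.irregular.begin(), m.irregular.end());
        m.lo = *mm.first;
        m.hi = *mm.second;
    }
    return m;
}
```

Called on 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 with a higher-level stride of 10, the sketch establishes 0 + x for x in 0..2, matches 10, 11, 12 as the repetition at y = 1, and leaves {5, 9, 2, 5} as irregular values with bounds l:2 and h:9. Storing only the bounds keeps the trace small at the cost of losing the exact irregular addresses.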
Experiment Setup • NAS Parallel Benchmarks 3.2 Serial Version, Class S • IBM XL C 8.0, XL Fortran 10.1 • DyninstAPI 5.0, including • Liveness analysis • Up to 90% runtime reduction by excluding one SPR (MQ) • Additional 3% improvement with other GPR/FPR • Transactional instrumentation • Instrumentation always on (no sampling)
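Transactional Instrumentation • Reduces memory allocation and insertion time • Atomic operation

    BPatch_thread *thr;      // must point at a thread of the mutatee (setup elided)
    BPatch_process *proc;
    proc = thr->getProcess();
    proc->beginInsertionSet();          // start batching snippet insertions
    …
    thr->insertSnippet(…);
    thr->insertSnippet(…);
    …
    proc->finalizeInsertionSet(true);   // apply the batch; true requests atomic insertion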
Future Work • Larger datasets (NPB Class B, C) • Some results already gathered for Class W • Distributions other than uniform • Irregular control flow • Example: a loop over an upper triangular matrix does not need to iterate over all M×N values • Uses edge instrumentation • BPatch_basicBlock::getIncomingEdges • BPatch_basicBlock::getOutgoingEdges • BPatch_edge::getPoint