Shadow Profiling: Hiding Instrumentation Costs with Parallelism Tipp Moseley Alex Shye Vijay Janapa Reddi Dirk Grunwald (University of Colorado) Ramesh Peri (Intel Corporation)
Motivation • An ideal profiler will… • Collect arbitrarily detailed and abundant information • Incur negligible overhead • A real profiler, e.g., one using Pin, satisfies condition 1 • But the cost is high • 3X for basic-block (BBL) counting • 25X for loop profiling • 50X or higher for memory profiling • A real profiler, e.g., PMU sampling or code patching, satisfies condition 2 • But the detail is very coarse
Motivation • High detail, high overhead: Pintools, Valgrind, ATOM, … • Low detail, low overhead: VTune, DCPI, OProfile, PAPI, pfmon, Pin Probes, … • High detail, low overhead: “Bursty Tracing” (Sampled Instrumentation), Novel Hardware, Shadow Profiling
Goal • To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead • Enable developers to focus on other things
The Big Idea • Stems from fault tolerance work on deterministic replication • Periodically fork(), profile “shadow” processes * Assuming instrumentation overhead of 3X
Challenges • Threads • Shared Memory • Asynchronous Interrupts • System Calls • JIT overhead • Overhead vs. Number of CPUs • Maximum speedup is Number of CPUs • If profiler overhead is 50X, need at least 51 CPUs to run in real-time (probably many more) • Too many complications to ensure deterministic replication
Goal (Revised) • To create a profiler capable of sampling detailed traces (bursts) with negligible overhead • Trade abundance for low overhead • Like SimPoints or SMARTS (but not as smart :)
The Big Idea (revised) • Do not strive for full, deterministic replica • Instead, profile many short, mostly deterministic bursts • Profile a fixed number of instructions • “Fake it” for system calls • Must not allow shadow to side-effect system
Design Overview • Monitor uses Pin Probes (code patching) • Application runs natively • Monitor receives periodic timer signal and decides when to fork() • After fork(), child uses PIN_ExecuteAt() functionality to switch Pin from Probe to JIT mode. • Shadow process profiles as usual, except handling of special cases • Monitor logs special read() system calls and pipes result to shadow processes
System Calls • For SPEC CPU2000, system calls occur around 35 times per second • Forking after each puts lots of pressure on CoW pages, Pin JIT engine • 95% of dynamic system calls can be safely handled • Some system calls can be allowed to execute (49%) • getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, …
System Calls • Some can be replaced with success assumed (39%) • write, ftruncate, writev, unlink, rename, … • Some are handled specially, but execution may continue (1.8%) • mmap2, open(creat), mmap, mprotect, mremap, fcntl • read() is special (5.4%) • For reads from pipes/sockets, the data must be logged from the original app and piped to the shadow • For reads from files, the shadow must close and reopen the file after the fork(), because the file offset lives in the open file description, which fork() shares between shadow and native process • ioctl() is special (4.8%) • Frequent in perlbmk • Behavior is device-dependent, so the safest action is to simply terminate the segment and re-fork()
Other Issues • Shared Memory • Disallow writes to shared memory • Asynchronous Interrupts (Userspace signals) • Since we are only mostly deterministic, no longer an issue • When main program receives a signal, pass it along to live children • JIT Overhead • After each fork(), it is like Pinning a new program • Warmup is too slow • Use Persistent Code Caching [CGO’07]
Multithreaded Programs • Issue: fork() does not duplicate all threads • Only the thread that called fork() • Solution: • Barrier all threads in the program and store their CPU state • Fork the process and clone new threads for those that were destroyed • Identical address space; only register state was really ‘lost’ • In each new thread, restore previous CPU state • Modified clone() handling in Pin VM • Continue execution, virtualize thread IDs for relevant system calls
Tuning Overhead • Load • Number of active shadow processes • Tested 0.125, 0.25, 0.5, 1.0, 2.0 • Sample Size • Number of instructions to profile • Longer samples for less overhead, more data • Shorter samples for more evenly dispersed data • Tested 1M, 10M, 100M
Experiments • Value Profiling • Typical overhead ~100X • Accuracy measured by Difference in Invariance • Path Profiling • Typical overhead 50% - 10X • Accuracy measured by percent of hot paths detected (2% threshold) • All experiments use SPEC2000 INT Benchmarks with “ref” data set • Arithmetic mean of 3 runs presented
Results - Value Profiling Overhead • Overhead versus native execution • Several configurations less than 1% • Path profiling exhibits similar trends
Results - Value Profiling Accuracy • All configurations within 7% of perfect profile • Lower is better
Results - Path Profiling Accuracy • Most configurations over 90% accurate • Higher is better • Some benchmarks (e.g., 176.gcc, 186.crafty, 187.parser) have millions of paths, but few are “hot”
Results - Page Fault Increase • Proportional increase in page faults • Shadow/Native
Results - Page Fault Rate • Difference in page faults per second experienced by native application
Future Work • Improve stability for multithreaded programs • Investigate effects of different persistent code cache policies • Compare sampling policies • Random (current) • Phase/event-based • Static analysis • Study convergence • Apply technique • Profile-guided optimizations • Simulation techniques
Conclusion • Shadow Profiling allows collection of bursts of detailed traces • Accuracy is over 90% • Incurs negligible overhead • Often less than 1% • With increasing numbers of cores, allows developers’ focus to shift from profiling to applying optimizations