Shadow Profiling: Hiding Instrumentation Costs with Parallelism Tipp Moseley Alex Shye Vijay Janapa Reddi Dirk Grunwald (University of Colorado) Ramesh Peri (Intel Corporation)
Motivation • An ideal profiler will… • Collect arbitrarily detailed and abundant information • Incur negligible overhead • A real profiler, e.g., one using Pin, satisfies condition 1 • But the cost is high • 3X for basic-block (BBL) counting • 25X for loop profiling • 50X or higher for memory profiling • A real profiler, e.g., PMU sampling or code patching, satisfies condition 2 • But the detail is very coarse
Motivation • High detail, high overhead: Pintools, Valgrind, ATOM, … • Low detail, low overhead: VTune, DCPI, OProfile, PAPI, pfmon, Pin Probes, … • High detail, low overhead: “Bursty Tracing” (Sampled Instrumentation), Novel Hardware, Shadow Profiling
Goal • To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead • Enable developers to focus on other things
The Big Idea • Stems from fault tolerance work on deterministic replication • Periodically fork(), profile “shadow” processes * Assuming instrumentation overhead of 3X
Challenges • Threads • Shared Memory • Asynchronous Interrupts • System Calls • JIT overhead • Overhead vs. Number of CPUs • Maximum speedup is Number of CPUs • If profiler overhead is 50X, need at least 51 CPUs to run in real-time (probably many more) • Too many complications to ensure deterministic replication
Goal (Revised) • To create a profiler capable of sampling detailed traces (bursts) with negligible overhead • Trade abundance for low overhead • Like SimPoints or SMARTS (but not as smart :)
The Big Idea (revised) • Do not strive for full, deterministic replica • Instead, profile many short, mostly deterministic bursts • Profile a fixed number of instructions • “Fake it” for system calls • Must not allow shadow to side-effect system
Design Overview • Monitor uses Pin Probes (code patching) • Application runs natively • Monitor receives periodic timer signal and decides when to fork() • After fork(), child uses PIN_ExecuteAt() functionality to switch Pin from Probe to JIT mode. • Shadow process profiles as usual, except handling of special cases • Monitor logs special read() system calls and pipes result to shadow processes
System Calls • For SPEC CPU2000, system calls occur around 35 times per second • Forking after each puts lots of pressure on CoW pages, Pin JIT engine • 95% of dynamic system calls can be safely handled • Some system calls can be allowed to execute (49%) • getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, …
System Calls • Some can be replaced with success assumed (39%) • write, ftruncate, writev, unlink, rename, … • Some are handled specially, but execution may continue (1.8%) • mmap2, open(creat), mmap, mprotect, mremap, fcntl • read() is special (5.4%) • For reads from pipes/sockets, the data must be logged from the original app and piped to the shadow • For reads from files, the shadow must close and reopen the file after the fork(), because the file offset lives in the open file description, which fork() shares between shadow and native process • ioctl() is special (4.8%) • Frequent in perlbmk • Behavior is device-dependent, so the safest action is to simply terminate the segment and re-fork()
Other Issues • Shared Memory • Disallow writes to shared memory • Asynchronous Interrupts (Userspace signals) • Since we are only mostly deterministic, no longer an issue • When main program receives a signal, pass it along to live children • JIT Overhead • After each fork(), it is like Pinning a new program • Warmup is too slow • Use Persistent Code Caching [CGO’07]
Multithreaded Programs • Issue: fork() does not duplicate all threads • Only the thread that called fork() • Solution: • Barrier all threads in the program and store their CPU state • Fork the process and clone new threads for those that were destroyed • Identical address space; only register state was really ‘lost’ • In each new thread, restore previous CPU state • Modified clone() handling in Pin VM • Continue execution, virtualize thread IDs for relevant system calls
Tuning Overhead • Load • Number of active shadow processes • Tested 0.125, 0.25, 0.5, 1.0, 2.0 • Sample Size • Number of instructions to profile • Longer samples for less overhead, more data • Shorter samples for more evenly dispersed data • Tested 1M, 10M, 100M
Experiments • Value Profiling • Typical overhead ~100X • Accuracy measured by Difference in Invariance • Path Profiling • Typical overhead 50% - 10X • Accuracy measured by percent of hot paths detected (2% threshold) • All experiments use SPEC2000 INT Benchmarks with “ref” data set • Arithmetic mean of 3 runs presented
Results - Value Profiling Overhead • Overhead versus native execution • Several configurations less than 1% • Path profiling exhibits similar trends
Results - Value Profiling Accuracy • All configurations within 7% of perfect profile • Lower is better
Results - Path Profiling Accuracy • Most configurations over 90% accurate • Higher is better • Some benchmarks (e.g., 176.gcc, 186.crafty, 187.parser) have millions of paths, but few are “hot”
Results - Page Fault Increase • Proportional increase in page faults • Shadow/Native
Results - Page Fault Rate • Difference in page faults per second experienced by native application
Future Work • Improve stability for multithreaded programs • Investigate effects of different persistent code cache policies • Compare sampling policies • Random (current) • Phase/event-based • Static analysis • Study convergence • Apply technique • Profile-guided optimizations • Simulation techniques
Conclusion • Shadow Profiling allows collection of bursts of detailed traces • Accuracy is over 90% • Incurs negligible overhead • Often less than 1% • With increasing numbers of cores, allows developers’ focus to shift from profiling to applying optimizations