Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems

Kim Hazelwood, Greg Lueck, Robert Cohn

Dynamic Binary Instrumentation • Inserts or modifies arbitrary instructions in executing binaries, e.g., instruction count, where a counter++ is inserted before each instruction:
counter++; sub $0xff, %edx
counter++; cmp %esi, %edx
counter++; jle <L1>
counter++; mov $0x1, %edi
counter++; add $0x10, %eax
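The instruction-count idea can be modeled in a few lines. The sketch below is a toy model in Python, not Pin's API: `instrument`, `run`, and `bump` are hypothetical names, and each "instruction" is only its assembly text. It shows the one point the slide makes, that an analysis routine is placed before every original instruction.

```python
# Toy model of instruction-count instrumentation (NOT Pin's API).
# Each "instruction" is just its assembly text; executing it is a no-op here.

counter = 0

def bump():
    """Analysis routine: the inserted `counter++`."""
    global counter
    counter += 1

def instrument(instructions):
    """Return a new stream with the analysis routine before each instruction."""
    out = []
    for ins in instructions:
        out.append(("analysis", bump))
        out.append(("original", ins))
    return out

def run(stream):
    """Walk the stream: call analysis routines, step over original text."""
    for kind, payload in stream:
        if kind == "analysis":
            payload()

original = [
    "sub $0xff, %edx",
    "cmp %esi, %edx",
    "jle <L1>",
    "mov $0x1, %edi",
    "add $0x10, %eax",
]
instrumented = instrument(original)
run(instrumented)
print("Count", counter)
```

The original list is left untouched; only the instrumented copy is executed, which mirrors how the modified copy stands in for the original code.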

Instruction Count Output
$ /bin/ls
Makefile imageload.out itrace proccount imageload inscount atrace itrace.out
$ pin -t inscount.so -- /bin/ls
Makefile imageload.out itrace proccount imageload inscount atrace itrace.out
Count 422838

How Does it Work? • Generates and caches modified copies of instructions • Modified (cached) instructions are executed in lieu of original instructions [Diagram labels: EXE, Transform, Profile, Code Cache, Execute]
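A minimal sketch of the translate-and-cache loop described on this slide, assuming a toy program where each address holds one instruction and a pointer to the next. The names `program`, `code_cache`, `translate`, and `execute` are illustrative and do not come from Pin.

```python
# Toy model of a translate-and-cache loop (all names are illustrative).
# "Original code" maps an address to (instruction text, next address or None).

program = {
    0x10: ("mov $0x1, %edi", 0x14),
    0x14: ("add $0x10, %eax", 0x18),
    0x18: ("cmp %esi, %edx", None),   # None marks the end of the run
}

code_cache = {}      # original address -> translated (modified) copy
translations = 0     # how many times the translator ran

def translate(addr):
    """Produce a modified copy of the instruction at addr."""
    global translations
    translations += 1
    text, nxt = program[addr]
    return ("instrumented: " + text, nxt)

def execute(entry):
    """Run from entry, always executing the cached copy, never the original."""
    trace = []
    addr = entry
    while addr is not None:
        if addr not in code_cache:          # miss: translate once
            code_cache[addr] = translate(addr)
        text, addr = code_cache[addr]       # hit: reuse the cached copy
        trace.append(text)
    return trace

first = execute(0x10)
second = execute(0x10)   # the second run is served entirely from the cache
```

The translator runs once per address; every later execution reuses the cached copy, which is why the cost of translation is amortized over repeated execution.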

Why “Dynamic” Instrumentation? • Robustness! • No need to recompile or relink • Discover code at runtime • Handle dynamically-generated code • Attach to running processes

The Code Discovery Problem on x86 • Indirect jump to ?? • Data interspersed with code • Pad for alignment [Diagram labels: Instr 1, Instr 2, Instr 3, Jump Reg, DATA, Instr 5, Instr 6, Uncond Branch, PADDING, Instr 8]

Intel Pin • A dynamic binary instrumentation system • Easy-to-use instrumentation interface • Supports multiple platforms • Four ISAs – IA32, Intel64, IPF, ARM • Four OSes – Linux, Windows, FreeBSD, MacOS • Popular and well supported • 32,000+ downloads • 400+ citations • 500+ mailing list subscribers

Research Applications • Gather profile information about applications • Compare programs generated by competing compilers • Generate a select stream of live information for event-driven simulation • Add security features • Emulate new hardware • Anything and everything multicore

The Problem with Modern Tools • Many research tools do not support multithreaded guest applications • Providing support for MT apps is mostly straightforward • Providing scalable support can be tricky!

Issues that Arise • Gaining control of executing threads • Determining what should be private vs. shared between threads • Code cache maintenance and consistency • Concurrent instruction writes • Providing/handling thread-local storage • Handling indirect branches • Handling signals / system calls

The Pin Architecture [Diagram: a Pin Tool (Instrumentation Code, Call-Back Handlers, Analysis Code) on top of Pin (JIT Compiler, Dispatcher, Code Cache, Syscall Emulator, Signal Emulator); threads T1 and T2 are shown in the components; legend: Serialized, Parallel]

Code Cache Consistency • Cached code must be removed for a variety of reasons: • Dynamically unloaded code • Ephemeral/adaptive instrumentation • Self-modifying code • Bounded code caches [Diagram labels: EXE, Transform, Profile, Code Cache, Execute]

Motivating a Bounded Code Cache • The Perl Benchmark

Flushing the Code Cache • Option 1: All threads have a private code cache (oops, doesn’t scale) • Option 2: Shared code cache across threads • If one thread flushes the code cache, other threads may resume in stale memory

Naïve Flush • Wait for all threads to return from the code cache to the VM • Could wait indefinitely! [Timeline diagram, "Flush Delay": Thread1 and Thread2 run VM, CC1, VM, then stall before CC2; Thread3 runs VM, CC1, VM, CC2]

Generational Flush • Allow threads to continue to make progress in a separate area of the code cache • Requires a high water mark [Timeline diagram: Thread1, Thread2, and Thread3 each run VM, CC1, VM, CC2 with no stall]
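The difference between the two flush schemes can be sketched with a small bookkeeping model. This is an illustration under stated assumptions, not Pin's implementation: the class `GenerationalCache` and its methods are hypothetical, cache sizes are not tracked, and a real system would trigger `flush` at the high-water mark so that the new generation has room while the old one is still occupied.

```python
class GenerationalCache:
    """Hypothetical model: each flush opens a new generation instead of stalling."""

    def __init__(self):
        self.current = 0          # generation that receives new code
        self.live = {0: set()}    # generation -> threads still executing in it
        self.freed = []           # generations whose memory was reclaimed

    def enter(self, thread):
        """Thread leaves the VM and enters the current generation."""
        self.live[self.current].add(thread)
        return self.current

    def leave(self, thread, gen):
        """Thread returns to the VM; reclaim gen if it is stale and empty."""
        self.live[gen].discard(thread)
        self._reclaim(gen)

    def flush(self):
        """Open a new generation; no thread is stalled."""
        old = self.current
        self.current += 1
        self.live[self.current] = set()
        self._reclaim(old)

    def _reclaim(self, gen):
        # Free a generation only when it is stale and no thread is inside it.
        if gen != self.current and gen in self.live and not self.live[gen]:
            del self.live[gen]
            self.freed.append(gen)


cc = GenerationalCache()
g1, g2, g3 = cc.enter("T1"), cc.enter("T2"), cc.enter("T3")

cc.leave("T1", g1)
cc.flush()                      # T1 triggers the flush from the VM
g1 = cc.enter("T1")             # T1 continues at once in the new generation

cc.leave("T2", g2)
g2 = cc.enter("T2")             # T2 also moves on without waiting for T3
freed_before = list(cc.freed)   # the old generation is still pinned by T3

cc.leave("T3", g3)              # the last thread leaves the old generation
freed_after = list(cc.freed)
```

In the naïve scheme T1 and T2 would stall until T3 left the old code; here the old generation is simply kept alive until its last thread exits, then reclaimed.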

Memory Scalability of the Code Cache • Ensuring scalability also requires carefully configuring the code stored in the cache • Trace Lengths • First basic block is non-speculative, others are speculative • Longer traces = fewer entries in the lookup table, but more unexecuted code • Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code
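The trade-off can be made concrete with a deliberately simple counting model. The assumptions are mine, not the slides': a straight-line run of basic blocks, each ending in a conditional branch, is cut into traces of at most `max_blocks_per_trace` blocks; every interior block contributes one off-trace exit stub and the last block of a trace contributes two; and every block after the first in a trace counts as speculative. The function `trace_stats` is hypothetical.

```python
import math

def trace_stats(num_blocks, max_blocks_per_trace):
    """Count lookup entries, exit stubs, and speculative blocks (toy model)."""
    traces = math.ceil(num_blocks / max_blocks_per_trace)
    return {
        # One lookup-table entry per trace.
        "lookup_entries": traces,
        # One stub per interior block plus two at each trace end:
        # (num_blocks - traces) + 2 * traces.
        "exit_stubs": num_blocks + traces,
        # Blocks after the first in each trace may never execute.
        "speculative_blocks": num_blocks - traces,
    }

short = trace_stats(12, 1)   # one basic block per trace
long_ = trace_stats(12, 4)   # four basic blocks per trace
```

Under these assumptions, the short-trace configuration has more lookup entries and more exit stubs, while the long-trace configuration holds more speculative code that may go unexecuted.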

Effect of Trace Length on Trace Count

Effect of Trace Length on Memory

Rewriting Instructions • Pin must regularly rewrite branches • No atomic branch write on x86 • We use a neat trick*: “old” 5-byte branch → 2-byte self branch followed by n-2 bytes of “new” branch → “new” 5-byte branch * Sundaresan et al. 2006
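The three states on this slide can be traced with a byte-level simulation. This is a sketch, not Pin's code: it assumes a 5-byte `jmp rel32` (opcode `E9`), uses `EB FE` as the 2-byte self branch, and treats each 2-byte store as atomic, which on real hardware depends on alignment and is the property the trick relies on. The helpers `jmp_rel32`, `patch_branch`, and `classify` are hypothetical.

```python
SELF_BRANCH = bytes([0xEB, 0xFE])   # jmp rel8 with displacement -2: jumps to itself

def jmp_rel32(displacement):
    """Encode a 5-byte x86 `jmp rel32`."""
    return bytes([0xE9]) + displacement.to_bytes(4, "little", signed=True)

def patch_branch(code, offset, new_branch):
    """Rewrite the branch at offset; yield the 5 bytes seen after each store."""
    assert len(new_branch) == 5
    # Step 1: one 2-byte store parks any arriving thread on a self branch.
    code[offset:offset + 2] = SELF_BRANCH
    yield bytes(code[offset:offset + 5])
    # Step 2: write the remaining n-2 bytes of the new branch behind it.
    code[offset + 2:offset + 5] = new_branch[2:]
    yield bytes(code[offset:offset + 5])
    # Step 3: one 2-byte store replaces the self branch with the new head.
    code[offset:offset + 2] = new_branch[:2]
    yield bytes(code[offset:offset + 5])

def classify(five_bytes, old, new):
    """Name what a thread fetching at this address would decode."""
    if five_bytes[:2] == SELF_BRANCH:
        return "self"
    if five_bytes == old:
        return "old"
    if five_bytes == new:
        return "new"
    return "torn"

old = jmp_rel32(0x100)
new = jmp_rel32(-0x40)
code = bytearray(old)
states = [classify(s, old, new) for s in patch_branch(code, 0, new)]
```

At every intermediate point a fetching thread sees the old branch, the self branch, or the new branch, never a mix of old and new bytes; a thread that lands on the self branch simply spins until the final store completes.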

Performance Results • We use the SPEC OMP 2001 benchmarks • Thread count set via the OMP_NUM_THREADS environment variable • We compare: • Native performance and scalability • Pin (no Pintool) performance and scalability • Pin (lightweight Pintool) scalability: InsCount Pintool, which counts instructions at basic-block (BB) granularity • Pin (middleweight Pintool) scalability: MemTrace Pintool, which records memory addresses • Pin (heavyweight Pintool) scalability: CMP$im, which collects memory addresses and applies a software model of the CMP cache

Native Scalability of SPEC OMP 2001

Performance Scalability (No Instrumentation)

Performance Scalability (LightWeight Instrumentation)

Performance Scalability (MiddleWeight Instrumentation)

Performance Scalability (HeavyWeight Instrumentation)

Memory Scalability

Summary • Dynamic instrumentation tools are useful • In the multicore era, we must provide support for MT application analysis and simulation • Providing MT support in Pin was easy • Making it robust and scalable was not easy • http://www.pintool.org
