Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems
Kim Hazelwood, Greg Lueck, Robert Cohn
Dynamic Binary Instrumentation
• Inserts or modifies arbitrary instructions in executing binaries, e.g., to count instructions:
counter++;
sub $0xff, %edx
counter++;
cmp %esi, %edx
counter++;
jle <L1>
counter++;
mov $0x1, %edi
counter++;
add $0x10, %eax
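The insertion shown on this slide can be modeled in a few lines. This is only a sketch under simplifying assumptions: instructions are plain strings, the sequence is treated as straight-line code (the `jle` is never taken), and the names `Instr`, `instrument`, and `run` are hypothetical, not part of Pin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str                  # assembly text, or "counter++" for inserted code
    is_analysis: bool = False  # True for code added by the instrumentation


def instrument(original):
    """Return a copy with a counter increment inserted before each instruction."""
    out = []
    for ins in original:
        out.append(Instr("counter++", is_analysis=True))
        out.append(ins)
    return out


def run(instrumented):
    """Walk the sequence once; only the inserted analysis code has an effect here."""
    counter = 0
    for ins in instrumented:
        if ins.is_analysis:
            counter += 1
    return counter


original = [Instr(t) for t in (
    "sub $0xff, %edx",
    "cmp %esi, %edx",
    "jle <L1>",
    "mov $0x1, %edi",
    "add $0x10, %eax",
)]
instrumented = instrument(original)
count = run(instrumented)  # one increment per original instruction
```

The original instructions are left in order and unmodified; only extra analysis code is interleaved with them.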
Instruction Count Output
$ /bin/ls
Makefile imageload.out itrace proccount imageload inscount atrace itrace.out
$ pin -t inscount.so -- /bin/ls
Makefile imageload.out itrace proccount imageload inscount atrace itrace.out
Count 422838
How Does it Work?
• Generates and caches modified copies of instructions
• Modified (cached) instructions are executed in lieu of the original instructions
[Figure: pipeline with the stages EXE, Transform, Profile, Code Cache, Execute]
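The generate, cache, and execute cycle described here can be sketched as a lookup-or-translate loop. Everything below is a toy model, not Pin's implementation: `CodeCache`, `dispatch`, and `translate` are hypothetical names, "code" is a Python closure that returns the next address, and the two-block program is invented for illustration.

```python
class CodeCache:
    """Maps an original code address to its translated (modified) copy."""

    def __init__(self, translate):
        self._translate = translate  # hypothetical JIT: address -> callable
        self._cache = {}
        self.compiles = 0            # how many times the JIT had to run

    def lookup(self, addr):
        code = self._cache.get(addr)
        if code is None:             # miss: translate once, then reuse
            code = self._translate(addr)
            self._cache[addr] = code
            self.compiles += 1
        return code


def dispatch(cache, entry):
    """Run cached copies in lieu of the original code until the program ends."""
    executed = []
    addr = entry
    while addr is not None:
        executed.append(addr)
        addr = cache.lookup(addr)()  # execute cached copy, get next address
    return executed


state = {"iterations": 0}


def translate(addr):
    # Toy "program": block 0x10 falls through to 0x20, and 0x20 loops back
    # to 0x10 until three iterations have run.
    if addr == 0x10:
        def block():
            state["iterations"] += 1
            return 0x20
    elif addr == 0x20:
        def block():
            return 0x10 if state["iterations"] < 3 else None
    else:
        raise KeyError(addr)
    return block


cache = CodeCache(translate)
executed = dispatch(cache, 0x10)
```

Each block is translated once and then executed repeatedly from the cache, which is where the approach recovers its translation cost.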
Why “Dynamic” Instrumentation?
• Robustness!
• No need to recompile or relink
• Discover code at runtime
• Handle dynamically-generated code
• Attach to running processes
The Code Discovery Problem on x86
[Figure: a code region laid out as Instr 1, Instr 2, Instr 3, Jump Reg, DATA, Instr 5, Instr 6, Uncond Branch, PADDING, Instr 8]
• Indirect jump to ??
• Data interspersed with code
• Pad for alignment
Intel Pin
• A dynamic binary instrumentation system
• Easy-to-use instrumentation interface
• Supports multiple platforms
  • Four ISAs – IA32, Intel64, IPF, ARM
  • Four OSes – Linux, Windows, FreeBSD, MacOS
• Popular and well supported
  • 32,000+ downloads
  • 400+ citations
  • 500+ mailing list subscribers
Research Applications
• Gather profile information about applications
• Compare programs generated by competing compilers
• Generate a select stream of live information for event-driven simulation
• Add security features
• Emulate new hardware
• Anything and everything multicore
The Problem with Modern Tools
• Many research tools do not support multithreaded guest applications
• Providing support for MT apps is mostly straightforward
• Providing scalable support can be tricky!
Issues that Arise
• Gaining control of executing threads
• Determining what should be private vs. shared between threads
• Code cache maintenance and consistency
• Concurrent instruction writes
• Providing/handling thread-local storage
• Handling indirect branches
• Handling signals / system calls
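The Pin Architecture
[Figure: a Pin Tool (instrumentation code, call-back handlers, analysis code) running on top of Pin (JIT compiler, dispatcher, code cache, syscall emulator, signal emulator); threads T1 and T2 are shown inside the components, and each component is marked as either serialized or parallel]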
Code Cache Consistency
• Cached code must be removed for a variety of reasons:
  • Dynamically unloaded code
  • Ephemeral/adaptive instrumentation
  • Self-modifying code
  • Bounded code caches
[Figure: pipeline with the stages EXE, Transform, Profile, Code Cache, Execute]
Motivating a Bounded Code Cache
• The Perl Benchmark
Flushing the Code Cache
• Option 1: All threads have a private code cache (oops, doesn’t scale)
• Option 2: Shared code cache across threads
  • If one thread flushes the code cache, other threads may resume in stale memory
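Naïve Flush
• Wait for all threads to return from the code cache to the VM
• Could wait indefinitely!
[Figure: timeline for Thread1, Thread2, and Thread3. Each runs VM, then CC1, then VM. Thread1 and Thread2 then stall in the VM (the flush delay) until Thread3 has left CC1, after which all three continue in CC2]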
Generational Flush
• Allow threads to continue to make progress in a separate area of the code cache
• Requires a high water mark
[Figure: timeline for Thread1, Thread2, and Thread3. Each runs VM, CC1, VM, CC2 with no stall]
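One way to picture the generational scheme is as reference counting per cache generation: a flush retires the current generation immediately, and its memory is reclaimed only once the last thread still executing in it has returned to the VM. The sketch below is a model under that assumption, not Pin's actual implementation; `GenerationalCache` and its methods are hypothetical names.

```python
import threading


class GenerationalCache:
    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0          # generation that receives newly generated code
        self._residents = {0: 0}  # generation -> threads executing inside it
        self.freed = []           # retired generations whose memory was reclaimed

    def enter(self):
        """A thread leaves the VM and enters the newest generation."""
        with self._lock:
            gen = self.current
            self._residents[gen] += 1
            return gen

    def leave(self, gen):
        """A thread returns to the VM from generation `gen`."""
        with self._lock:
            self._residents[gen] -= 1
            self._reclaim_if_drained(gen)

    def flush(self):
        """Retire the current generation without waiting for any thread."""
        with self._lock:
            old = self.current
            self.current += 1
            self._residents[self.current] = 0
            self._reclaim_if_drained(old)

    def _reclaim_if_drained(self, gen):
        # Only a retired generation with no threads left inside can be freed.
        if gen != self.current and self._residents.get(gen) == 0:
            del self._residents[gen]
            self.freed.append(gen)


cache = GenerationalCache()
g1 = cache.enter()                # Thread1 executing in generation 0
g2 = cache.enter()                # Thread2 executing in generation 0
cache.flush()                     # nobody stalls; generation 0 is only retired
freed_right_after_flush = list(cache.freed)
cache.leave(g1)                   # Thread1 returns to the VM ...
g1 = cache.enter()                # ... and resumes in generation 1
cache.leave(g2)                   # last resident leaves: generation 0 is freed
```

The sketch does not model memory limits or the high water mark the slide mentions.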
Memory Scalability of the Code Cache
• Ensuring scalability also requires carefully configuring the code stored in the cache
• Trace Lengths
  • First basic block is non-speculative, others are speculative
  • Longer traces = fewer entries in the lookup table, but more unexecuted code
  • Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code
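Rewriting Instructions
• Pin must regularly rewrite branches
• No atomic branch write on x86
• We use a neat trick*:
  1. Start with the “old” 5-byte branch
  2. Overwrite its first two bytes with a 2-byte self branch
  3. Write the remaining n-2 bytes of the “new” branch
  4. Overwrite the self branch with the first two bytes of the “new” branch, leaving the “new” 5-byte branch
* Sundaresan et al. 2006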
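The byte-level sequence can be checked with a small simulation. It assumes the standard x86 encodings `E9 rel32` (5-byte near jump) and `EB FE` (2-byte short jump to itself), and it assumes each 2-byte store is observed atomically by other threads, which on real hardware depends on alignment and on the processor's rules for cross-modifying code. The function names are hypothetical.

```python
import struct

JMP_REL32 = 0xE9                   # opcode of the 5-byte near jump
SELF_BRANCH = bytes([0xEB, 0xFE])  # 2-byte short jump to its own address


def encode_jmp(rel32):
    return bytes([JMP_REL32]) + struct.pack("<i", rel32)


def decode(code, offset):
    """What a thread fetching the instruction at `offset` would execute."""
    if bytes(code[offset:offset + 2]) == SELF_BRANCH:
        return "spin"
    if code[offset] == JMP_REL32:
        return ("jmp", struct.unpack_from("<i", code, offset + 1)[0])
    return "invalid"


def patch_steps(code, offset, new_branch):
    """Yield a snapshot of `code` after every store the patching thread makes."""
    # Step 1: a single 2-byte store parks arriving threads on a self branch.
    code[offset:offset + 2] = SELF_BRANCH
    yield bytes(code)
    # Step 2: write the tail of the new branch; these stores need not be atomic
    # because no thread can get past the self branch.
    for i in range(2, len(new_branch)):
        code[offset + i] = new_branch[i]
        yield bytes(code)
    # Step 3: a single 2-byte store publishes the head of the new branch.
    code[offset:offset + 2] = new_branch[:2]
    yield bytes(code)


old, new = encode_jmp(0x100), encode_jmp(0x2468)
code = bytearray(old)
seen = [decode(snapshot, 0) for snapshot in patch_steps(code, 0, new)]

# For contrast: a naive byte-by-byte overwrite exposes a torn displacement.
naive = bytearray(old)
naive[1] = new[1]
torn = decode(naive, 0)            # jump to a target that is neither old nor new
```

In every snapshot a concurrent thread sees either the self branch or the complete new branch, whereas the naive overwrite briefly produces a jump to a mixed target.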
Performance Results
• We use the SPEC OMP 2001 benchmarks
  • OMP_NUM_THREADS environment variable
• We compare
  • Native performance and scalability
  • Pin (no Pintool) performance and scalability
  • Pin (lightweight Pintool) scalability
    • InsCount Pintool – counts instructions at BB granularity
  • Pin (middleweight Pintool) scalability
    • MemTrace Pintool – records memory addresses
  • Pin (heavyweight Pintool) scalability
    • CMP$im – collects memory addresses and applies a software model of the CMP cache
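Counting at basic-block (BB) granularity means adding the block's instruction count once per block execution instead of incrementing once per instruction. The comparison below is an illustrative model only; the block sizes, the execution path, and the function names are made up, and it does not reproduce the InsCount Pintool itself.

```python
def count_per_instruction(blocks, path):
    """One analysis call before every instruction."""
    counter = calls = 0
    for name in path:
        for _ in blocks[name]:
            counter += 1
            calls += 1
    return counter, calls


def count_per_block(blocks, path):
    """One analysis call per basic block, adding the block's size."""
    counter = calls = 0
    for name in path:
        counter += len(blocks[name])
        calls += 1
    return counter, calls


# Hypothetical program: three basic blocks, the middle one executed 10 times.
blocks = {
    "entry": ["i1", "i2", "i3"],
    "loop": ["i4", "i5", "i6", "i7"],
    "exit": ["i8", "i9"],
}
path = ["entry"] + ["loop"] * 10 + ["exit"]

per_ins = count_per_instruction(blocks, path)
per_bb = count_per_block(blocks, path)
```

Both variants report the same instruction count, but the per-block variant makes far fewer analysis calls, which is why it qualifies as a lightweight tool.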
Summary
• Dynamic instrumentation tools are useful
• In the multicore era, we must provide support for MT application analysis and simulation
• Providing MT support in Pin was easy
• Making it robust and scalable was not easy
• http://www.pintool.org