PiPA: Pipelined Profiling and Analysis on Multi-core Systems
Qin Zhao, Ioana Cutcutache, Weng-Fai Wong
CGO 2008
Why PiPA?
Code profiling and analysis
• very useful for understanding program behavior
• implemented using dynamic instrumentation systems
• several challenges – coverage, accuracy, overhead
– overhead due to the instrumentation engine
– overhead due to the profiling code
The performance problem!
• Cachegrind – 100x slowdown
• Pin dcache – 32x slowdown
Need faster tools!
Our Goals
Improve the performance
• reduce the overall profiling and analysis overhead
• but maintain the accuracy
How?
• parallelize!
• optimize
Keep it simple
• easy to understand
• easy to build new analysis tools
Previous Approach
Parallelized slice profiling
• SuperPin, Shadow Profiling
• suitable for simple, independent tasks
[Diagram: timelines comparing the uninstrumented original application, the instrumented application (instrumentation overhead plus profiling overhead), and the SuperPinned application, which runs instrumented slices in parallel]
PiPA Key Idea
Pipelining!
• stage 0 – the instrumented application, which emits profile information
• stage 1 – profile processing, in separate threads or processes
• stage 2 – parallel analysis over the recovered profiles (e.g. analysis on profiles 1–4)
[Diagram: timeline contrasting the original application with the pipelined, instrumented application; instrumentation and profiling overhead stay in stage 0, while profile processing and analysis overlap with execution in stages 1 and 2]
PiPA Challenges
Minimize the profiling overhead
• Runtime Execution Profile (REP)
Minimize the communication between stages
• double buffering
Design efficient parallel analysis algorithms
• we focus on cache simulation
PiPA Prototype: Cache Simulation
Our Prototype
Implemented in DynamoRIO
Three stages
• Stage 0: instrumented application – collects the REP
• Stage 1: parallel profile recovery and splitting
• Stage 2: parallel cache simulation
Experiments
• SPEC2000 & SPEC2006 benchmarks
• 3 systems: dual-core, quad-core, eight-core
Communication
Keys to minimize the overhead
• double buffering
• shared buffers
• large buffers
Example – communication between stage 0 and stage 1: the profiling thread at stage 0 fills shared buffers that are drained by the processing threads at stage 1 (see the sketch below)
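A minimal sketch of this double-buffering handoff, using POSIX threads and hypothetical names (not PiPA's actual code): the profiling thread fills one shared buffer while a processing thread drains the other, so neither side stalls on a partially filled buffer.

```c
#include <pthread.h>
#include <stddef.h>

#define BUF_SIZE (16 * 1024 * 1024)   /* large buffers amortize handoffs */

typedef struct {
    char            data[BUF_SIZE];
    size_t          used;
    int             full;             /* 1 = ready for stage 1 to drain */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} shared_buf_t;

static shared_buf_t bufs[2] = {
    { .lock = PTHREAD_MUTEX_INITIALIZER, .cond = PTHREAD_COND_INITIALIZER },
    { .lock = PTHREAD_MUTEX_INITIALIZER, .cond = PTHREAD_COND_INITIALIZER },
};
static int active = 0;                /* buffer stage 0 is currently filling */

/* Stage 0: called when the active buffer fills up. */
static void switch_buffer(void) {
    shared_buf_t *cur = &bufs[active];
    pthread_mutex_lock(&cur->lock);
    cur->full = 1;                    /* hand the full buffer to stage 1 */
    pthread_cond_signal(&cur->cond);
    pthread_mutex_unlock(&cur->lock);

    active ^= 1;                      /* continue in the other buffer,   */
    shared_buf_t *nxt = &bufs[active];
    pthread_mutex_lock(&nxt->lock);   /* waiting if it is still in use   */
    while (nxt->full)
        pthread_cond_wait(&nxt->cond, &nxt->lock);
    pthread_mutex_unlock(&nxt->lock);
}

/* Stage 1: wait for a buffer to fill, drain it, and return it. */
static void drain_buffer(shared_buf_t *b) {
    pthread_mutex_lock(&b->lock);
    while (!b->full)
        pthread_cond_wait(&b->cond, &b->lock);
    /* ... recover REP records from b->data[0 .. b->used) ... */
    b->used = 0;
    b->full = 0;
    pthread_cond_signal(&b->cond);    /* buffer is free for stage 0 again */
    pthread_mutex_unlock(&b->lock);
}
```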
Stage 0: Profiling
compact profile, minimal overhead
Stage 0: Profiling
Runtime Execution Profile (REP)
• fast profiling
• small profile size
• easy information extraction
Hierarchical structure (a sketch of this hierarchy follows)
• profile buffers
• data units
• slots
Can be customized for different analyses
• in our prototype we consider cache simulation
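A minimal sketch of REP's hierarchy with hypothetical field names (the paper's exact layout may differ): a profile buffer is an array of small data units, and each unit carries the slots – recorded register values – for one executed basic block.

```c
#include <stdint.h>

#define UNITS_PER_BUFFER 4096
#define NUM_SLOTS        2        /* e.g. eax and esp in the next slide */

typedef struct {
    uint32_t bb_tag;              /* identifies the static bb record     */
    uint32_t slots[NUM_SLOTS];    /* register values captured at runtime */
} rep_unit_t;                     /* 12 bytes with two slots             */

typedef struct rep_buffer {
    rep_unit_t         units[UNITS_PER_BUFFER];
    uint8_t            canary[64];   /* writing here means the buffer is */
                                     /* full and must be handed off      */
    struct rep_buffer *next;         /* link to the next profile buffer  */
} rep_buffer_t;
```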
REP Example
[Figure: layout of a REP profile, with a base pointer into the first buffer; reconstructed from the slide]
• REPS (static part) – per-basic-block records describing each memory reference; the record tagged 0x080483d7 has num_slots: 2, num_refs: 3, with refs:
– ref0: pc 0x080483d7, type read, size 4, offset 12, value_slot 1, size_slot -1 (the load mov eax ← [eax + 0x0c] in bb1)
– ref1: pc 0x080483dc, type read, size 4, offset 0, value_slot 2, size_slot -1 (pop ebp)
– ref2: pc 0x080483dd, type read, size 4, offset 4, value_slot 2, size_slot -1 (the next pop reuses the same esp slot at offset 4)
• REPD (dynamic part) – 12-byte REP units written at runtime, each holding the bb tag plus the recorded register values (eax for slot 1, esp for slot 2); a later unit for bb2 (pop ebx; pop ecx; cmp eax, 0; jz label_bb3) records esp
• each buffer ends with a canary zone, followed by a pointer to the next buffer
Profiling Optimizations
Store register values in REP
• avoid computing the memory address at profiling time
Register liveness analysis
• avoid register stealing when possible
Record a single register value for multiple references
• a single stack pointer value for a sequence of push/pop
• the base address for multiple accesses to the same structure
More in the paper
Profiling Overhead
[Charts: slowdown relative to native execution on the 2-core, 4-core, and 8-core systems for SPECint2000, SPECfp2000, and SPEC2000 overall, comparing optimized instrumentation against instrumentation without optimization]
Avg slowdown: ~3x
Stage 1: Profile Recovery
fast recovery
Stage 1: Profile Recovery
Need to reconstruct the full memory reference information
• <pc, address, type, size>
Example – expanding the REP unit for bb1 (recorded slot values: eax = 0x2304, esp = 0x141a) against the static record tagged 0x080483d7:

PC          Address                Type   Size
0x080483d7  0x2310 (= eax + 12)    read   4
0x080483dc  0x141a (= esp + 0)     read   4
...         ...                    ...    ...

(a recovery sketch in code follows)
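A minimal sketch, with hypothetical struct layouts, of stage 1's expansion: each reference's address is the register value recorded in its value slot plus a static offset, exactly as in the worked example above.

```c
#include <stdint.h>

typedef struct {             /* static reference descriptor (REPS) */
    uint32_t pc;
    char     type;           /* 'r' = read, 'w' = write            */
    uint8_t  size;           /* access size in bytes               */
    int32_t  offset;         /* displacement from the base value   */
    int8_t   value_slot;     /* 1-based slot holding the base reg  */
} rep_ref_t;

typedef struct {             /* full record produced by stage 1    */
    uint32_t pc, address;
    char     type;
    uint8_t  size;
} mem_ref_t;

/* Expand one dynamic unit: 'slots' holds the register values that
 * were recorded for this execution of the basic block. */
static void recover_unit(const rep_ref_t *refs, int num_refs,
                         const uint32_t *slots, mem_ref_t *out) {
    for (int i = 0; i < num_refs; i++) {
        out[i].pc      = refs[i].pc;
        out[i].address = slots[refs[i].value_slot - 1] + refs[i].offset;
        out[i].type    = refs[i].type;
        out[i].size    = refs[i].size;
    }
}
```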
Profile Recovery Overhead
Factor 1: buffer size
• experiments done on the 8-core system, using 8 recovery threads
[Chart: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with small (64KB), medium (1MB), and large (16MB) buffers]
Profile Recovery Overhead
Factor 2: the number of recovery threads
• experiments done on the 8-core system, using 16MB buffers
[Chart: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with 0, 2, 4, 6, and 8 recovery threads]
Profile Recovery Overhead
Factor 3: the number of available cores
• experiments done using 16MB buffers and 8 recovery threads
[Chart: slowdown relative to profiling alone for SPECint2000, SPECfp2000, and SPEC2000 on 2, 4, and 8 cores]
Profile Recovery Overhead
Factor 4: the impact of using REP
• experiments done on the 8-core system with 16MB buffers and 8 threads
[Chart: per-benchmark slowdown relative to native execution across SPEC2000]
• PiPA using REP – 4.5x average slowdown
• PiPA using the standard profile format <pc, address, type, size> – 20.7x average slowdown
Stage 2: Cache Simulation
parallel analysis, independent simulators
Stage 2: Parallel Cache Simulation
How to parallelize?
• split the address trace into independent groups
Set-associative caches
• partition the cache sets and simulate them using several independent simulators
• merge the results (number of hits and misses) at the end of the simulation
Example (see the sketch below):
• 32K cache, 32-byte line, 4-way associative => 256 sets
• 4 independent simulators, each one simulates 64 sets (round-robin distribution)
• two memory references that access different sets are independent
• e.g. addresses 0xbf9c4614, 0xbf9c4705, 0xbf9c460d (sets 48, 56, 48) all go to simulator 0, while 0xbf9c4a34, 0xbf9c4a5c, 0xbf9c4a60 (sets 81, 82, 83) go to simulators 1, 2, 3
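A minimal sketch (not the authors' simulator) of how addresses are routed: the set index selects the owning simulator, so simulators never share cache state and their hit/miss counts can simply be summed at the end. The geometry and trace are taken from the slide's example.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 32        /* bytes per cache line               */
#define NUM_SETS  256       /* 32KB / (32-byte line * 4 ways)     */
#define NUM_SIMS  4         /* independent simulators             */

static unsigned set_index(uint32_t addr) {
    return (addr / LINE_SIZE) % NUM_SETS;
}

/* Round-robin distribution: simulator k owns sets k, k+4, k+8, ... */
static unsigned owner(uint32_t addr) {
    return set_index(addr) % NUM_SIMS;
}

int main(void) {
    /* the addresses from the slide's example trace */
    uint32_t trace[] = { 0xbf9c4614, 0xbf9c4705, 0xbf9c460d,
                         0xbf9c4a34, 0xbf9c4a5c, 0xbf9c4a60 };
    for (int i = 0; i < 6; i++)
        printf("0x%08x -> set %2u -> simulator %u\n",
               trace[i], set_index(trace[i]), owner(trace[i]));
    return 0;   /* sets 48,56,48 go to simulator 0; 81,82,83 to 1,2,3 */
}
```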
Cache Simulation Overhead
Experiments done on the 8-core system
• 8 recovery threads and 8 cache simulators
[Chart: per-benchmark slowdown relative to native execution across SPEC2000]
• PiPA – 10.5x average slowdown
• Pin dcache – 32x average slowdown
• PiPA speedup over dcache: 3x
SPEC2006 Results
Experiments done using the 8-core system
[Chart: per-benchmark slowdown relative to native execution across SPEC2006]
• profiling only – 3x
• profiling + recovery – 3.7x
• full cache simulation – 10.2x
• average speedup over dcache: 3.27x
Summary
PiPA is an effective technique for parallel profiling and analysis
• based on pipelining
• drastically reduces both profiling time and analysis time
• full cache simulation incurs only a 10.5x slowdown
Runtime Execution Profile
• requires minimal instrumentation code
• compact enough to ensure optimal buffer usage
• makes it easy for the next stages to recover the full trace
Parallel cache simulation
• the cache sets are partitioned across several independent simulators
Future Work
Design APIs
• hide the communication between the pipeline stages
• let tool writers focus only on the instrumentation and analysis tasks
Further improve the efficiency
• parallel profiling
• workload monitoring
More analysis algorithms
• branch prediction simulation
• memory dependence analysis
• ...
Pin Prototype
Second implementation, in Pin
Preliminary results
• 2.6x speedup over Pin dcache
Plan to release PiPA
• www.comp.nus.edu.sg/~ioana