Using Interpretation for Profiling the Alpha 21264a. Kip Walker Mike Burrows, Úlfar Erlingsson, Mark Vandevoorde, Carl Waldspurger, William E. Weihl and many more.
Introduction
[Slide background: a wall of per-instruction profile listings of Alpha assembly, e.g. "20 7 0xcab8 ldt $f20, 72(sp)"]
Changing this ONE instruction will make my Java programs run 2.3% faster! HOW CAN I FIND IT? HOW DO I FIX IT??
The Options • Read the source - not always useful • Read the assembly - hard, not always useful • Simulation - very slow, infeasible • Instrumentation - slow, interference • Sample-based profiling - not enough detail • Or use periodic interpretation
It’s Not Easy • A true story: • Sometimes program X runs twice as long as usual • Variance due to # of bytes in environment vars! • Base address of main()’s stack had dramatic effect • Simulation eventually revealed the problem • Information requirements: • Detailed instruction behavior profile • Contents of registers • Correlated data for nearby instructions
Outline • Out-of-order Processors • Performance Problems • Why Interpretation? • Profiling Infrastructure • An Example • Evaluation • Future Work • Summary
Out-of-order Processors • Try to exploit instruction-level parallelism • Fetch, issue 4 instructions at a time • Many function units • Retire up to 11 instructions in a cycle • Fetch in-order • Execute out of order • Retire in-order
Enemies of Performance • Bad cache utilization • Static stalls / dependences • Branch misprediction and illegal re-ordering: pipeline traps!
Traps • Processor detects that it let “bad things” happen • wrong instructions executed • instructions may have seen incorrect data • up to 80 in-flight instructions thrown out! • Branch mispredict: [Diagram: a beq is fetched and predicted not taken, so fetch continues past it; when the beq executes it is TAKEN, the instructions fetched behind it are ABORTED, and fetch starts again]
Memory Order Traps • Memory operations are freely reordered • Must enforce consistent view of memory • Problems are detected dynamically
(a) reordered operations to overlapping bytes - “order” trap
program order: store to X ... load from X
execute order: load from X ... store to X
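The overlap condition in (a) can be sketched as a byte-range check. This is only an illustration of the condition, not the hardware's detection mechanism; `MemOp`, `overlaps`, and `is_order_hazard` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str  # "load" or "store"
    addr: int  # effective address of the first byte accessed
    size: int  # access width in bytes


def overlaps(a: MemOp, b: MemOp) -> bool:
    """True if the two accesses touch at least one common byte."""
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size


def is_order_hazard(older: MemOp, younger: MemOp) -> bool:
    """Older store, younger load, overlapping bytes: if the load executes
    before the store, it has seen stale data and must be replayed."""
    return (older.kind == "store" and younger.kind == "load"
            and overlaps(older, younger))
```

Note that the check works on byte ranges, so a load that reads only part of the stored bytes still counts.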
Troll Traps (b) accesses resulting in contention for a cache line - “troll” trap • not allowed to have more than one outstanding fill request • unspecified ordering of responses from L2 cache • replay the load until the fill happens [Diagram: Load from X and Load from Y both miss in the L1 data cache and wait on fills from the L2 cache]
Wrong Size Traps (c) wide load follows narrow store - “size” trap [Diagram: Store-long mem(x) sits in the store queue while Load-quad mem(x) goes to the L1 data cache]
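Case (c) comes down to comparing access widths for accesses that share bytes. A minimal sketch, assuming a longword store is 4 bytes and a quadword load is 8 bytes (the Alpha widths); `is_size_hazard` is a hypothetical helper, not part of any tool.

```python
STL_BYTES = 4  # store-long: 4-byte store
LDQ_BYTES = 8  # load-quad: 8-byte load


def is_size_hazard(store_addr: int, store_size: int,
                   load_addr: int, load_size: int) -> bool:
    """True when a later, wider load reads bytes written by a narrower store."""
    shares_bytes = (store_addr < load_addr + load_size
                    and load_addr < store_addr + store_size)
    return shares_bytes and load_size > store_size
```

For example, `is_size_hazard(x, STL_BYTES, x, LDQ_BYTES)` is true for any address `x`, which matches the Store-long / Load-quad pair on the slide.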
A Better Way • Need a runtime solution • Notice when two instructions in a trace “match” • Observe effective addresses of memory ops • Interpret instruction traces • Emulate (most) operations • Apply statistically to cover whole system • Extends the power of sample-based profiling
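The idea of noticing when two instructions in a trace "match" can be sketched as a scan over the effective addresses observed during interpretation. This is an illustrative sketch, not the actual profiler code: `TraceOp`, `find_culprits`, the `window` default, and the effective address used in the usage note are all assumptions.

```python
from typing import List, NamedTuple, Tuple


class TraceOp(NamedTuple):
    pc: int    # address of the instruction
    kind: str  # "load" or "store"
    addr: int  # effective address observed while interpreting
    size: int  # access width in bytes


def find_culprits(trace: List[TraceOp],
                  window: int = 80) -> List[Tuple[int, int, str]]:
    """For each load, report earlier stores (within `window` trace entries)
    that touch overlapping bytes, as (load_pc, store_pc, reason)."""
    found = []
    for i, op in enumerate(trace):
        if op.kind != "load":
            continue
        for prev in trace[max(0, i - window):i]:
            if prev.kind != "store":
                continue
            if prev.addr < op.addr + op.size and op.addr < prev.addr + prev.size:
                # A wider load after a narrower store suggests a size trap;
                # otherwise the pair is a possible order trap.
                reason = "size" if op.size > prev.size else "order"
                found.append((op.pc, prev.pc, reason))
    return found
```

With a trace containing a 4-byte store at pc 0x203f10d0 and a later 8-byte load at pc 0x2002d2cc to the same (made-up) effective address, the scan reports the pair with reason "size". The results are only possible culprits: the scan cannot tell whether the hardware actually reordered the two operations.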
Available Information • Control Flow – Edge Frequencies • Return address (in register or on stack) • Branch taken direction • Computed values • Function arguments, results • Load/store addresses • Possible replay trap culprits
ProfileMe on Alpha 21264a • Pipeline: fetch (branch predictor, icache) → map → issue → exec → retire • Fetch counter overflow? Random selection gives a fetched instruction the ProfileMe tag! • Tagged? Capture! • Internal processor registers record: pc, retired?, imiss?, taken?, map stall?, notrap?, replay?, mispredict?, dtbmiss?, … • Then: interrupt
ProfileMe Interrupt • Program instruction stream: execute instructions → ProfileMe interrupt (!) → interrupt returns → execute instructions • In the interrupt handler: read counters; get PID/PC • Log event in hash table
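The logging step can be pictured as a counter keyed by what the handler reads. A minimal sketch, assuming the key is (PID, PC, event name); the real handler's key layout and hash table are not described on the slide, and `SampleLog` is a hypothetical name.

```python
from collections import Counter


class SampleLog:
    """Aggregates ProfileMe-style samples: one count per (pid, pc, event)."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, pid: int, pc: int, event: str) -> None:
        """Called once per interrupt with the values read from the counters."""
        self.counts[(pid, pc, event)] += 1

    def total(self, event: str) -> int:
        """Total samples seen for one event across all processes and PCs."""
        return sum(n for (_, _, e), n in self.counts.items() if e == event)
```

Aggregating in place keeps the per-interrupt work to one table update, and repeated samples at a hot PC cost no extra space.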
Interpretation - Value Profiling • Program instruction stream: execute native → ProfileMe interrupt (!) → interrupt returns → execute native • The interrupt delivers the register contents • Interpret in interrupt handler [Diagram: partial CFG]; update regs, memory • Log register contents with profile data • Interrupt returns with new register values
Interpreter Details • Initial register values delivered with interrupt • Interpret n instructions or until bail: PALcode (OS support), page fault • Branches and jumps are interpreted (can’t detect mispredicts) • Memory accesses are performed (can’t detect cache misses) • Final register state updated
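The loop described above can be sketched on a toy instruction set. Everything here is an assumption made for illustration: the tuple encoding of instructions, the dictionary model of registers and memory, and the names `interpret` and `MASK64`. A missing memory address stands in for a page fault.

```python
MASK64 = (1 << 64) - 1


def interpret(program, pc, regs, memory, n):
    """Interpret up to n instructions starting at pc.

    program: dict pc -> tuple such as ("addq", dst, src1, src2)
    regs:    dict register name -> value (initial values from the interrupt)
    memory:  dict address -> value; a missing address models a page fault
    Returns (final_pc, final_regs, captured_values, stop_reason).
    """
    regs = dict(regs)  # work on a copy; the caller installs the final state
    captured = []
    for _ in range(n):
        insn = program.get(pc)
        if insn is None:
            return pc, regs, captured, "unmapped"
        op, *args = insn
        if op == "call_pal":  # PALcode: bail, let the hardware run it
            return pc, regs, captured, "palcode"
        if op == "addq":
            dst, a, b = args
            regs[dst] = (regs[a] + regs[b]) & MASK64
            captured.append((pc, regs[dst]))  # arithmetic: result value
            pc += 4
        elif op == "ldq":
            dst, base, disp = args
            addr = (regs[base] + disp) & MASK64
            if addr not in memory:  # would page-fault: bail
                return pc, regs, captured, "page fault"
            regs[dst] = memory[addr]
            captured.append((pc, addr))  # memory op: effective address
            pc += 4
        elif op == "bne":
            src, target = args
            pc = target if regs[src] != 0 else pc + 4
        else:  # only (most) operations are emulated
            return pc, regs, captured, "unsupported"
    return pc, regs, captured, "done"
```

On a bail, the returned pc is the instruction that was not interpreted, so native execution can resume there with the updated registers.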
Values Captured • Arithmetic - result value • Memory op - effective address • Indirect jump - destination address • … and current return address in all cases
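One way to store captured values is a per-instruction histogram. A minimal sketch with hypothetical names (`ValueProfile`, `record`, `summary`); it is not the profiler's actual data structure or report format.

```python
from collections import Counter, defaultdict


class ValueProfile:
    """Per-PC histogram of captured values."""

    def __init__(self) -> None:
        self.by_pc = defaultdict(Counter)

    def record(self, pc: int, value: int) -> None:
        self.by_pc[pc][value] += 1

    def summary(self, pc: int):
        """Return (total samples, distinct values, top value, top value %)."""
        hist = self.by_pc.get(pc)
        if not hist:
            return (0, 0, None, 0.0)
        total = sum(hist.values())
        value, count = hist.most_common(1)[0]
        return (total, len(hist), value, 100.0 * count / total)
```

A summary where one value accounts for nearly all samples is the interesting case, both for code specialization and for pointing at a single likely culprit.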
Interpretation - Replay Traps • Program instruction stream: execute native → ProfileMe interrupt (!) → interrupt returns → execute native • In the interrupt handler: interpret, then analyze effective addresses and register dependences • Report possible culprits as value samples
Example Profile - MTRT
> dcpiprof $labels $db -pm replays mtrtbase.exe
Column          Total  Period (for events)
------          -----  ------
replays:count     397  126976
===========================================================
replays
 :count       %  procedure                            image
    100  25.19%  ...OctNode.Intersect(...)            mtrtbase.exe
     51  12.85%  java.io.BufferedInputStream.read()   mtrtbase.exe
     48  12.09%  ...Vector.Dot(...)                   mtrtbase.exe
Replays in OctNode.Intersect
> dcpilist $labels $db -pm replays \
    '...OctNode.Intersect(...)' mtrtbase.exe
...OctNode.Intersect(...):
replays
 :count
         code elided
      0  0x2002d2a0  stt   $f8, 104(sp)
      0  0x2002d2a4  bis   a0, a0, s5
      0  0x2002d2a8  bis   a1, a1, s6
      0  0x2002d2ac  bis   a2, a2, s4
      0  0x2002d2b0  stt   $f19, 8(sp)
      0  0x2002d2b4  bsr   ra, 0x20022250
      0  0x2002d2b8  bis   v0, v0, a0
      0  0x2002d2bc  cpys  $f31,$f31,$f17
      0  0x2002d2c0  cpys  $f31,$f31,$f18
      0  0x2002d2c4  cpys  $f31,$f31,$f19
      0  0x2002d2c8  bis   v0, v0, s2
     43  0x2002d2cc  ldq   at, 0(a0)
      0  0x2002d2d0  bsr   ra, 0x20027a50
Order? Wrong Size? Troll? Queue Full?
Replay Trap Value Profile
> dcpilist $labels $db -pm replays -vreplay \
    '...OctNode.Intersect(...)' mtrtbase.exe
...OctNode.Intersect(...):
replays
 :count                                     vtot  thld  nv
         code elided
      0  0x2002d2a0  stt   $f8, 104(sp)        5   1.0   1  (100.0% 0x2002b4f8)
      0  0x2002d2a4  bis   a0, a0, s5          0   0.0   0
      0  0x2002d2a8  bis   a1, a1, s6          0   0.0   0
      0  0x2002d2ac  bis   a2, a2, s4          0   0.0   0
      0  0x2002d2b0  stt   $f19, 8(sp)         0   0.0   0
      0  0x2002d2b4  bsr   ra, 0x20022250      0   0.0   0
      0  0x2002d2b8  bis   v0, v0, a0          0   0.0   0
      0  0x2002d2bc  cpys  $f31,$f31,$f17      0   0.0   0
      0  0x2002d2c0  cpys  $f31,$f31,$f18      0   0.0   0
      0  0x2002d2c4  cpys  $f31,$f31,$f19      0   0.0   0
      0  0x2002d2c8  bis   v0, v0, s2          0   0.0   0
     43  0x2002d2cc  ldq   at, 0(a0)          25   1.0   1  (100.0% 0x203f10d0)
      0  0x2002d2d0  bsr   ra, 0x20027a50      0   0.0   0
Possible Conflicting Instruction (accesses overlapping bytes): 0x203f10d0
Conflicting Instruction
> dcpilist -vreplay -vshow 1 $labels $db -pm repl '0x203f10d0' \
    mtrtbase.exe
comp_alloc_fast:
replays
 :count                                     vtot  thld  nv
      0  0x203f10c0  ldq   t1, 64(s0)         88   1.0   4  (48.9% 0x203f10d8)
      0  0x203f10c4  ldq   v0, 56(s0)         98   1.0  12  (43.9% 0x203f10dc)
      0  0x203f10c8  subq  t1, a2, t1          0   0.0   0
      0  0x203f10cc  blt   t1, 0x203f1134      0   0.0   0
      1  0x203f10d0  stl   a1, 0(v0)          16   1.0  16  (6.2% T 0x2002b464)
      0  0x203f10d4  addq  v0, a2, t2          0   0.0   0
      0  0x203f10d8  stq   t1, 64(s0)         43   1.0   2  (97.7% 0x203f10d8)
      1  0x203f10dc  stq   t2, 56(s0)         46   1.0   6  (89.1% 0x203f10dc)
      0  0x203f10e0  ret   zero, (ra), 1       0   0.0   0
• 4-byte method pointer write in code for JVM’s new; 8-byte object header read for null check • wrong_size replay trap for every allocation • Fix with 4-byte reads for null check! • 2.3% speedup across SPECjvm98 (yes it matters!!)
Avoiding Traps • “Build a better …” {program,compiler,processor} • Change access widths • Try to get loads/stores further apart • Correct unfortunate data alignment • Avoid filling load/store queues • Improve instruction slotting
Interpretation Parameters • Frequency: don’t need to interpret on every interrupt • Duration: longer runs find more possible traps (interacting instructions can be > 80 apart!), but they are more expensive: • we are running at highest priority • more time interpreting • more culprit data to collect
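The two parameters can be pictured as a small policy object consulted on every interrupt. A sketch under assumptions: `InterpretPolicy` and its methods are hypothetical names, and interpreting on a fixed every-k-th interrupt is only one possible way to choose when to interpret.

```python
class InterpretPolicy:
    """Interpret `length` instructions on every `every`-th interrupt."""

    def __init__(self, every: int, length: int) -> None:
        if every < 1 or length < 1:
            raise ValueError("every and length must be positive")
        self.every = every
        self.length = length
        self._seen = 0

    def on_interrupt(self) -> int:
        """Return how many instructions to interpret for this interrupt
        (0 means: just log the sample and return)."""
        self._seen += 1
        return self.length if self._seen % self.every == 0 else 0

    def mean_instructions_per_interrupt(self) -> float:
        """Average interpretation work per interrupt: length / every."""
        return self.length / self.every
```

For instance, `InterpretPolicy(every=128, length=128)` interprets on average one instruction per interrupt, while doubling `length` alone doubles that average and lengthens each stay in the high-priority handler.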
Evaluation - Overhead • Single runs of 11 early cpu2000 int benchmarks • Dual 667 MHz Alpha 21264a • Paths of 128 every 128 interrupts [Overhead chart not recovered; surviving labels: 225/sec, ?, ?]
Future Work • Measure overhead for other frequencies/lengths • Evaluate ability to actually find culprits • Optimize data flow • Sample unbiasing • more likely to discover culprits nearby • more interpretation windows will cover both instrs. • Try to filter more unlikely culprits
Summary • Low-impact way to get trace information • No special requirements for processor • Benefits of statistical sampling • Manageable overhead • Useful applications • Value profiling - code specialization, online optim. • Path profiling - edge counts • Pipeline trap explanation - replay trap culprits