Idempotent Code Generation: Implementation, Analysis, and Evaluation

Marc de Kruijf, Karthikeyan Sankaralingam
CGO 2013, Shenzhen

Example source code

    int sum(int *array, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += array[i];
      return x;
    }

Example assembly code

      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

exceptions, mis-speculations, faults

Example assembly code

      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

Bad Stuff Happens!

Example assembly code

      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

R0 and R1 are unmodified, so to recover: just re-execute!
(convention: use checkpoints/buffers)

It’s Idempotent!
idempoh… what…?

A region is idempotent if re-executing it produces the same result, as this function does:

    int sum(int *data, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += data[i];
      return x;
    }
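To make the term concrete, here is a minimal C sketch (not from the talk). It contrasts the slide's `sum` with a hypothetical variant, `sum_in_place`, that accumulates into one of its own inputs. The second function and its name are assumptions made purely for illustration.

```c
#include <assert.h>

/* Idempotent: reads its inputs but never overwrites them, so running
 * it again from the top yields the same result. */
int sum(const int *array, int len) {
    assert(len >= 0);
    int x = 0;
    for (int i = 0; i < len; ++i)
        x += array[i];
    return x;
}

/* Hypothetical non-idempotent variant: accumulates into array[0],
 * overwriting an input that a re-execution would read again. */
int sum_in_place(int *array, int len) {
    assert(len >= 0);
    for (int i = 1; i < len; ++i)
        array[0] += array[i];
    return len > 0 ? array[0] : 0;
}
```

Calling `sum` twice on `{1, 2, 3}` returns 6 both times. Calling `sum_in_place` twice on the same array returns 6 and then 11, because the first call destroyed the original `array[0]`.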

Idempotent Region Construction
previously… in PLDI ’12

idempotent regions All The Time
[Figure: the code before and after region construction]

Idempotent Code Generation
now… in CGO ’13

    int sum(int *array, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += array[i];
      return x;
    }

how do we get from here...

Idempotent Code Generation
now… in CGO ’13

      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

to here...

Idempotent Code Generation
now… in CGO ’13

      R2 = load [R1]
      R1 = 0
    LOOP:
      R4 = load [R0 + R2]
      R1 = add R1, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R1

not here (this is not idempotent: R1 is read and then overwritten) ...
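The difference between the two register allocations can be checked with a small simulation. The sketch below is an illustration, not the talk's tooling: the register-file model, the memory layout (`mem[0]` holds `len`, `mem[1..3]` hold the data), the fault-injection parameter, and the function names are all assumptions. It runs the loop once with a simulated fault after one iteration, then recovers by re-executing from the top.

```c
#include <assert.h>

enum { R0, R1, R2, R3, R4, NREGS };

/* Runs the slide's loop with register `acc` as the accumulator.
 * Returns 0 if a simulated fault hits after `fault_after` iterations
 * (a negative value means never), 1 if the region completes. */
static int run_region(int r[NREGS], const int *mem, int acc, int fault_after) {
    int iters = 0;
    r[R2] = mem[r[R1]];               /* R2 = load [R1]        */
    r[acc] = 0;                       /* acc = 0               */
    do {
        if (iters++ == fault_after)
            return 0;                 /* fault: abandon region */
        r[R4] = mem[r[R0] + r[R2]];   /* R4 = load [R0 + R2]   */
        r[acc] += r[R4];              /* acc = add acc, R4     */
        r[R2] -= 1;                   /* R2 = sub R2, 1        */
    } while (r[R2] != 0);             /* bnez R2, LOOP         */
    return 1;
}

/* Faults once, recovers by plain re-execution, returns the sum. */
int sum_with_recovery(int acc) {
    const int mem[4] = {3, 5, 1, 2};  /* mem[0] = len, mem[1..3] = data */
    int r[NREGS] = {0};               /* R0 = 0 (base), R1 = 0 (&len)   */
    assert(acc == R1 || acc == R3);
    if (!run_region(r, mem, acc, 1))  /* fault after one iteration */
        run_region(r, mem, acc, -1);  /* re-execute from the top   */
    return r[acc];
}
```

With `R3` as the accumulator, recovery yields the correct sum of 8. With `R1` as the accumulator, the faulting run has already overwritten the pointer to `len`, so the re-execution loads a different trip count and returns 5.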

Idempotent Code Generation
now… in CGO ’13

      R3 = R1
      R2 = load [R3]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

and not here (this is slow) ...

Idempotent Code Generation
now… in CGO ’13

      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

here...

Idempotent Code Generation
applications to prior work

exceptions
– Hampton & Asanović, ICS ’06
– De Kruijf & Sankaralingam, MICRO ’11
– Menon et al., ISCA ’12
mis-speculations
– Kim et al., TOPLAS ’06
– Zhang et al., ASPLOS ’13
faults
– De Kruijf et al., ISCA ’10
– Feng et al., MICRO ’11
– De Kruijf et al., PLDI ’12

Idempotent Code Generation
executive summary

(1) how do we generate efficient idempotent code?
– not covered in this talk
– algorithms made available in source code form: http://research.cs.wisc.edu/vertical/iCompiler
(2) how do external factors affect overhead?
– idempotent region size
– instruction set (ISA) characteristics
– control flow side-effects
– each can affect overheads by 10% or more

Presentation Overview

❶ Introduction
❷ Analysis
– idempotent region size
– ISA characteristics
– control flow side-effects
❸ Evaluation

Analysis (a) idempotent region size

[Plot: overhead vs. region size, annotated left to right]
– number of inputs increasing
– likelihood of spills growing
– maximum spill cost reached
– amortized over more instructions

Analysis (b) ISA characteristics

(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
    ADD R1, R2 -> R1    idempotent? NO!
    ADD R1, R2 -> R3    idempotent? YES!
(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
    for register-memory, register spills may be less costly (microarchitecture dependent)
(3) number of available registers
    impact is obvious, but… more registers is not always enough (see back-up slide)
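Point (1) can be checked directly by executing each form twice. The following C sketch models a register file as a plain array; the function names and the four-register file are assumptions made for illustration.

```c
#include <assert.h>

enum { NREGS = 4 };

/* Two-address form, ADD R1, R2 -> R1: the destination is also a
 * source, so a second execution sees the already-updated value. */
void add_two_address(int r[NREGS], int dst, int src) {
    assert(dst >= 0 && dst < NREGS && src >= 0 && src < NREGS);
    r[dst] = r[dst] + r[src];
}

/* Three-address form, ADD R1, R2 -> R3: with a destination distinct
 * from both sources, re-execution recomputes the same value. */
void add_three_address(int r[NREGS], int dst, int a, int b) {
    assert(dst >= 0 && dst < NREGS);
    assert(a >= 0 && a < NREGS && b >= 0 && b < NREGS);
    r[dst] = r[a] + r[b];
}
```

Starting from `R1 = 2, R2 = 3`, running the three-address add twice leaves `R3 = 5` both times, while running the two-address add twice gives `R1 = 5` and then `R1 = 8`. A three-address instruction only helps when the allocator actually picks a destination different from the sources.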

Analysis (c) control flow side-effects

[Figure: the sequence "x = ...", "... = f(x)", "y = ..." crossed by region boundaries, marking x’s live interval and x’s “shadow interval” given no side-effects]

Analysis (c) control flow side-effects

[Figure: the sequence "x = ...", "... = f(x)", "y = ..." crossed by region boundaries, marking x’s live interval and x’s “shadow interval” given side-effects]

Presentation Overview

❶ Introduction
❷ Analysis
– idempotent region size
– ISA characteristics
– control flow side-effects
❸ Evaluation

Evaluation methodology

benchmarks
– SPEC 2006, PARSEC, and Parboil suites
measurements
– performance overhead: dynamic instruction count
  – for x86, using PIN
  – for ARM, using gem5
– region size: instructions between boundaries (path length)

Evaluation (a) idempotent region size

[Plot: overhead vs. region size. "You are HERE": regions of 10+ instructions (baseline: typically 10-30 instructions), overhead 13.1% (geometric mean); larger region sizes marked "?"]

Evaluation (a) idempotent region size

[Plot: overhead vs. region size, annotated with detection latency. Overhead 13.1% at the baseline; larger region sizes marked "?"]

Evaluation (a) idempotent region size

[Plot: overhead vs. region size, annotated with detection latency and re-execution time. Overheads shown: 11.1%, 13.1%, 0.06%; one region size still marked "?"]

Evaluation (b) ISA characteristics

[Chart: percentage overhead, x86-64 vs. ARMv7]
– three-address support matters more for FP benchmarks
– register-memory matters more for integer benchmarks

Evaluation (c) control flow side-effects

[Chart: percentage overhead, no side-effects vs. side-effects]
– substantial only in two cases; insubstantial otherwise
– intuition: the compiler typically already spills for control flow divergence

Presentation Overview

❶ Introduction
❷ Analysis
❸ Evaluation

Conclusions

(a) region size – matters a lot; large regions are ideal if recovery is infrequent
    overheads approach zero as regions grow
(b) instruction set – matters when region sizes must be small
    overheads drop below 10% only with careful co-design
(c) control flow side-effects – generally do not matter
    supporting control flow side-effects is not expensive

Conclusions

code generation and static analysis algorithms
– http://research.cs.wisc.edu/vertical/iCompiler
applicability not limited to architecture design
– see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”

thank you!

Back-up Slides

ISA Characteristics
more registers isn’t always enough

C code:
    x = 0;
    if (y > 0)
      x = 1;
    z = x + y;

machine code:
    R0 = 0
    if (R1 > 0)
      R0 = 1
    R2 = R0 + R1

ISA Characteristics
more registers isn’t always enough

C code:
    x = 0;
    if (y > 0)
      x = 1;
    z = x + y;

machine code:
    R0 = 0
    if (R1 > 0)   R3 = R0   R3 = 1
    R2 = R3 + R1

need an extra instruction no matter what

ISA Characteristics
idempotence vs. fewer registers

[Chart: percentage overhead; series labeled "no idempotence, #GPR reduced from 16"]
data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only)

Very Large Regions
how do we get there?

Problem #1: aliasing analysis
– no flow-sensitive analysis in LLVM; hurts loops
Problem #2: loop optimizations
– boundaries in loops are bad for everyone (next slides)
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help
Problem #3: large array structures
– awareness of array access patterns can help (next slides)
Problem #4: intra-procedural scope
– limited scope aggravates all effects listed above

Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:
    for (i = 0; i < X; i++) {
      ...
    }

CFG + SSA:
    i0 = φ(0, i1)
    i1 = i0 + 1
    if (i1 < X)

Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:
    for (i = 0; i < X; i++) {
      ...
    }

machine code:
    R0 = 0
    R0 = R0 + 1
    if (R0 < X)

No Boundaries = No Problem

Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:
    for (i = 0; i < X; i++) {
      ...
    }

machine code:
    R0 = 0
    R0 = R0 + 1
    if (R0 < X)

Very Large Regions
Re: Problem #2 (cuts in loops are bad)

C code:
    for (i = 0; i < X; i++) {
      ...
    }

machine code:
    R1 = 0
    R0 = R1
    R1 = R0 + 1
    if (R1 < X)

– “redundant” copy
– extra boundary (pressure)

Very Large Regions
Re: Problem #3 (array access patterns)

PLDI ’12 algorithm makes this simplifying assumption:

    [x] = a;            [x] = a;
    b = [x];     =>     b = a;
    [x] = c;            [x] = c;

non-clobber antidependences… GONE!
cheap for scalars, expensive for arrays
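The transformation on this slide amounts to forwarding the stored value to the later load. Written as two C functions, it looks roughly like the sketch below; the function names are hypothetical and the pointer `x` stands in for the slide's memory location `[x]`.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the load of [x] is followed by a store to [x]. That is an
 * antidependence, but not a clobber, because the value read was
 * written earlier in the same region. */
int region_before(int *x, int a, int c) {
    assert(x != NULL);
    int b;
    *x = a;      /* [x] = a */
    b = *x;      /* b = [x] */
    *x = c;      /* [x] = c */
    return b;
}

/* After forwarding the stored value: no load of [x] remains, so the
 * antidependence on [x] disappears from the analysis. */
int region_after(int *x, int a, int c) {
    assert(x != NULL);
    int b;
    *x = a;      /* [x] = a */
    b = a;       /* b = a   */
    *x = c;      /* [x] = c */
    return b;
}
```

Both versions return `a` and leave `c` in memory, and both can be re-executed safely. For a scalar the forwarded value fits in a register; tracking every element of a large array this way is what makes the assumption expensive.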

Very Large Regions
Re: Problem #3 (array access patterns)

not really practical for large arrays
but if we don’t do it, non-clobber antidependences remain

    // initialize:
    int array[100];
    memset(array, 0, 100 * sizeof(int));
    // accumulate:
    for (...)
      array[i] += foo(i);

solution: handle potential non-clobbers in a post-pass
(same way we deal with loop clobbers in static analysis)
