
Compiler Construction of Idempotent Regions and Applications in Architecture Design


Presentation Transcript


  1. Compiler Construction of Idempotent Regions and Applications in Architecture Design Marc de Kruijf Advisor: Karthikeyan Sankaralingam PhD Defense 07/20/2012

  2. Example source code int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x; } 2

  3. Example assembly code R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 (annotated with exceptions, mis-speculations, and faults) 3

  4. Example assembly code R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 Bad Stuff Happens! 4

  5. Example assembly code R0 and R1 are unmodified R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 just re-execute! convention: use checkpoints/buffers 5

  6. It’s Idempotent! idempoh… what…? = int sum(int *data, int len) { int x = 0; for (int i = 0; i < len; ++i) x += data[i]; return x; } 6
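
For contrast, here is a hypothetical variant that is not idempotent (an invented example, not from the slides): it overwrites one of its own inputs, so re-executing it after a partial run would not reproduce the original result.

    /* Not idempotent: the region reads and overwrites the same memory.
     * If execution fails partway through the loop and the region is
     * re-executed from the top, the elements already accumulated into
     * array[0] get added in a second time. */
    int sum_in_place(int *array, int len) {
        for (int i = 1; i < len; ++i)
            array[0] += array[i];   /* clobbers the live-in array[0] */
        return array[0];
    }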

  7. Thesis the thing that I am defending idempotent regions All The Time specifically… – using compiler analysis (intra-procedural) – transparent; no programmer intervention – hardware/software co-design: software analysis, hardware execution 7

  8. Thesis prelim.pptx defense.pptx preliminary exam (11/2010) – idempotence: concept and simple empirical analysis – compiler: preliminary design & partial implementation – architecture: some area and power savings…? defense (07/2012) – idempotence: formalization and detailed empirical analysis – compiler: complete design, source code release* – architecture: compelling benefits (various) * http://research.cs.wisc.edu/vertical/iCompiler 8

  9. Contributions & Findings a summary contribution areas – idempotence: models and analysis framework – compiler: design, implementation, and evaluation – architecture: design and evaluation findings – potentially large idempotent regions exist in applications – for compilation, larger is better – small regions (5-15 instructions), 10-15% overheads – large regions (50+ instructions), 0-2% overheads – enables efficient exception and hardware fault recovery 9

  10. Overview ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation 10

  11. Idempotence Models idempotence: what does it mean? • Definition • a region is idempotent iff its re-execution has no side-effects • a region is idempotent iff it preserves its inputs OK, but what does it mean to preserve an input? four models (next slides): A, B, C, & D Model A An input is a variable that is live-in to the region. A region preserves an input if the input is not overwritten. 11
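
One way to write the intuition behind both phrasings (a paraphrase, not the thesis's formal notation): let R be the region's effect on machine state and P any partial execution of R that is cut short by a failure. If R preserves its inputs, then P cannot change anything R's result depends on, so re-execution from the region's entry is safe:

    R(P(s)) = R(s)   for every machine state s and every partial execution P of R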

  12. Idempotence Model A a starting point [CFG figure: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16; …; ?? = mem[R4]? Live-ins: {all registers} \ {R1}, {all memory}] 12

  13. Idempotence Model A a starting point [CFG figure: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16. Live-ins: {all registers}, {all memory}] 13

  14. Idempotence Model A a starting point [CFG figure: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16; …; ?? = mem[R4]? Live-ins: {all registers}, {all memory} \ {mem[R4]}] Two regions. Some stuff not covered. 14

  15. Idempotence Model B varying control flow assumptions [CFG figure: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16; …; ?? = mem[R4]?] live-in but dynamically dead at time of write – OK to overwrite if control flow is invariable. One region. Some stuff still not covered. 15
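
A hypothetical source-level rendering of the same idea (invented example): slot is live-in to the region because one path reads it, but on the path that writes it the old value is already dead. Model A still counts this as a clobbered input; Model B does not, provided re-execution repeats the same branch outcome (invariable control flow).

    int slot;                      /* hypothetical global */
    int region(int cond, int v) {
        if (cond)
            return slot;           /* this path reads the old value of slot         */
        slot = v;                  /* this path overwrites a dynamically dead value */
        return v;
    }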

  16. Idempotence Model C varying sequencing assumptions [CFG figure: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16] allow the final instruction to overwrite an input (to include otherwise ineligible instructions). One region. Everything covered. 16
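
A hypothetical sketch of the sequencing relaxation (invented example; the slide's own instance is the SP update, which overwrites the live-in stack pointer): the write that clobbers an input is allowed only as the region's last action, so a boundary follows it immediately. A failure before the write re-executes with the input intact; after the write, the region is already complete.

    int counter;                   /* hypothetical global: an input updated in place */
    void bump(void) {
        int t = counter + 1;       /* reads the input; input still preserved         */
        /* ... arbitrary other work can sit here ...                                 */
        counter = t;               /* final instruction: the only overwrite allowed  */
    }                              /* a new region starts right after this point     */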

  17. Idempotence Model D varying isolation assumptions [CFG figure: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16] memory may be concurrently read in another thread – consider it an input. Two regions. Everything covered. 17
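
A hypothetical sketch of the isolation concern (invented example): Model D drops the assumption that the region's memory is private, so a location another thread may read concurrently is treated as an input. The store to it then clobbers an input and must be separated from the rest of the region by a boundary, which is why the slide's example splits into two regions.

    int shared;                    /* hypothetical: may be read by other threads  */
    void publish(int v) {
        int t = v * 2;             /* private work: safe to re-execute            */
        shared = t;                /* externally visible store: treated as an     */
    }                              /* input under Model D, so a cut separates it  */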

  18. Idempotence Models an idempotence taxonomy [figure: three axes – control axis (Model B), sequencing axis (Model C), isolation axis (Model D)] 18

  19. what are the implications? 19

  20. Empirical Analysis methodology measurement – dynamic region size (path length) – subject to axis constraints – x86 dynamic instruction count (using PIN) benchmarks – SPEC 2006, PARSEC, and Parboil suites experimental configurations – unconstrained: ideal upper bound (Model C) – oblivious: actual in normal compiled code (Model C) – X-constrained: ideal upper bound constrained by axis X 20

  21. Empirical Analysis oblivious vs. unconstrained average region size*: 5.2 (oblivious) vs. 160.2 (unconstrained) – a 30x improvement is achievable *geometrically averaged across suites 21

  22. Empirical Analysis control axis sensitivity average region size*: 160.2 (unconstrained) drops to 40.1 (control-constrained); oblivious is 5.2 – a 4x DROP if control flow is variable *geometrically averaged across suites 22

  23. Empirical Analysis isolation axis sensitivity average region size*: 40.1 drops to 27.4 (isolation-constrained); unconstrained is 160.2, oblivious is 5.2 – a 1.5x DROP if there is no isolation *geometrically averaged across suites 23

  24. Empirical Analysis sequencing axis sensitivity non-idempotent instructions*: 0.19% – inherently non-idempotent instructions are rare and of limited type *geometrically averaged across suites 24

  25. Idempotence Models a summary a spectrum of idempotence models – significant opportunity: region sizes of 100+ possible – 4x reduction constraining the control axis – 1.5x reduction constraining the isolation axis two models going forward – architectural idempotence & contextual idempotence – both are effectively the ideal case (Model C) – architectural idempotence: invariable control always – contextual idempotence: variable control w.r.t. locals 25

  26. Overview ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation 26

  27. Compiler Design choose your own adventure Analyze: identify semantic clobber antidependences • Partition: cut semantic clobber antidependences • Code Gen: preserve semantic idempotence [outline figure: Analyze → Partition → Code Gen → Compiler Evaluation] 27
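
To make "clobber antidependence" concrete, here is a hypothetical source-level sketch (invented names; the compiler actually works on LLVM IR): the region reads a live-in value and later overwrites it, which is the pattern Analyze flags, Partition cuts with a region boundary between the read and the write, and Code Gen then keeps intact through register allocation.

    int n;                         /* hypothetical global                         */
    int step(void) {
        int old = n;               /* read of the live-in value of n              */
        /* Partition places a region boundary between this read and the write     */
        /* below, so no single region both reads and overwrites its input.        */
        n = old + 1;               /* write that clobbers the live-in             */
        return old;
    }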

  28. Compiler Evaluation preamble Analyze: identify semantic clobber antidependences (free) • Partition: cut semantic clobber antidependences (free) • Code Gen: preserve semantic idempotence (NOT free) what do you mean: PERFORMANCE OVERHEADS? 28

  29. Compiler Evaluation preamble [figure: register pressure overhead vs. region size] code-gen choices: preserve input values in registers and spill other values (if needed), or spill input values to the stack and allocate other values to registers 29

  30. Compiler Evaluation methodology compiler implementation – LLVM, support for both x86 and ARM benchmarks – SPEC 2006, PARSEC, and Parboil suites measurements – performance overhead: dynamic instruction count – for x86, using PIN – for ARM, using gem5 (just for ISA comparison at end) – region size: instructions between boundaries (path length) – x86 only, using PIN 30

  31. Results, Take 1/3 initial results – 13.1% performance overhead (percentage increase in x86 dynamic instruction count, geometrically averaged across suites) Non-Trivial Overhead 31

  32. Results, Take 1/3 analysis of trade-offs [figure: overhead vs. region size, showing the register pressure curve and a 10+ instructions mark; You are HERE (typically 10-30 instructions)] 32

  33. Results, Take 1/3 analysis of trade-offs [figure: overhead vs. region size, showing register pressure and detection latency curves] 33

  34. Results, Take 2/3 minimizing register pressure performance overhead: 13.1% before, 11.1% after – Still Non-Trivial Overhead 34

  35. Results, Take 2/3 analysis of trade-offs [figure: overhead vs. region size, showing register pressure, detection latency, and re-execution time curves] 35

  36. Big Regions how do we get there? Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; really hurts loops Problem #2: loop optimizations – boundaries in loops are bad for everyone – loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help Problem #3: large array structures – awareness of array access patterns can help Problem #4: intra-procedural scope – limited scope aggravates all effects listed above 36

  37. Big Regions how do we get there? solutions can be automated – a lot of work… what would be the gain? ad hoc for now – consider PARSEC and Parboil suites as a case study – aliasing annotations – manual loop refactoring, scalarization, etc. – partitioning algorithm refinements (application-specific) – inlining annotations 37
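
A hypothetical example of the kind of aliasing annotation meant above (not taken from the benchmark sources): without restrict, the compiler must assume the stores through out might overwrite memory the loads from in still need, which looks like a clobbered input and forces region cuts inside the loop; with restrict, the whole loop can stay in one idempotent region.

    /* C99 'restrict' promises the two pointers never alias. */
    void scale(float * restrict out, const float * restrict in, int n, float k) {
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * k;    /* never overwrites its own inputs */
    }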

  38. Results, Take 3/3 big regions performance overhead: 13.1% before, 0.06% after – Gain = BIG; Trivial Overhead 38

  39. Results, Take 3/3 50+ instructions is good enough [figure: register pressure overhead vs. region size, with the 50+ instruction point and a (mis-optimized) range annotated] 39

  40. ISA Sensitivity you might be curious ISA matters? (1) two-address (e.g. x86) vs. three-address (e.g. ARM) (2) register-memory (e.g. x86) vs. register-register (e.g. ARM) (3) number of available registers the short version – impact of (1) & (2) not significant (+/- 2% overall) – even less significant as regions grow larger – impact of (3): to get the same performance w/ idempotence – increase registers by 0% (large regions) to ~60% (small regions) 40
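
A hypothetical illustration of why axis (1) can matter at all (an invented example, not compiler output): when a region input is updated, a two-address ISA must first copy it because its add overwrites a source operand, while a three-address ISA can write the sum to a fresh register, so the difference is at most an extra move per such update.

    int f(int x, int y) {          /* x stands for a region input                  */
        int t = x + y;             /* three-address ISA: add t, x, y (x untouched) */
                                   /* two-address ISA: mov t, x; add t, y          */
        return t;                  /* ordinary code could have clobbered x instead */
    }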

  41. Compiler Design & Evaluation a summary design and implementation – static analysis algorithms, modular and perform well – code-gen algorithms, modular and perform well – LLVM implementation, source code available* findings – pressure-related performance overheads range from ~0% (large regions) to ~15% (small regions) – greatest opportunity: loop-intensive applications – ISA effects are insignificant * http://research.cs.wisc.edu/vertical/iCompiler 41

  42. Overview ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation 42

  43. Architecture Recovery: It’s Real safety first speed first (safety second) 43

  44. Architecture Recovery: It’s Real lots of sharp turns [pipeline figure: Fetch (1), Decode (2), Execute (3), Write-back (4), shown again out of order – closer to the truth] 44

  45. Architecture Recovery: It’s Real lots of interaction [pipeline figure: Fetch (1), Decode (2), Execute (3), Write-back (4); by the time the problem (!!!) is caught, it is too late!] 45

  46. Architecture Recovery: It’s Real bad stuff can happen – mis-speculation: (a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc. – hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc. – exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc. 46

  47. Architecture Recovery: It’s Real bad stuff can happen – recovery trade-offs: detection latency, register pressure, re-execution time – mis-speculation: (a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc. – hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc. – exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc. 47

  48. Architecture Recovery: It’s Real bad stuff can happen – mis-speculation: (a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc. – hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc. – exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc. 48

  49. Architecture Recovery: It’s Real bad stuff can happen target systems: integrated GPU, low-power CPU, high-reliability systems – exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc. – hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc. 49

  50. GPU Exception Support 50
