
Diverge-Merge Processor (DMP)

Diverge-Merge Processor (DMP). Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt. HPS Research Group, University of Texas at Austin; *Microsoft Research. Outline: Predicated Execution, Diverge-Merge Processor (DMP), Implementation of DMP, Experimental Evaluation




Presentation Transcript


  1. Diverge-Merge Processor (DMP) • Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt • HPS Research Group, University of Texas at Austin • *Microsoft Research

  2. Outline • Predicated Execution • Diverge-Merge Processor (DMP) • Implementation of DMP • Experimental Evaluation • Conclusion

  3. Predicated Execution • Convert control flow dependence to data dependence • Source: if (cond) { b = 0; } else { b = 1; } • (normal branch code) A: p1 = (cond); branch p1, TARGET • B: mov b, 1; jmp JOIN • C: TARGET: mov b, 0 • (predicated code) A: p1 = (cond) • B: (!p1) mov b, 1 • C: (p1) mov b, 0 • [Figure: CFG in which A branches (T/N) to B or C, which rejoin at D]
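The if-conversion on this slide can be sketched as a tiny interpreter, purely for illustration. The function names `run_branchy` and `run_predicated` and the tuple encoding of guarded instructions are assumptions made for this sketch, not part of the presentation. The point it shows is that the predicated version always fetches both moves, and the guard `p1` decides which one actually writes `b`.

```python
def run_branchy(cond):
    """Normal branch code: control flow picks exactly one of the two moves."""
    if cond:
        return 0  # block C (TARGET): mov b, 0
    return 1      # block B: mov b, 1


def run_predicated(cond):
    """Predicated code: straight-line execution, each move guarded by p1."""
    regs = {"p1": bool(cond), "b": None}
    # Hypothetical encoding: (guard register, required guard value, dest, immediate)
    program = [
        ("p1", False, "b", 1),  # (!p1) mov b, 1
        ("p1", True, "b", 0),   # (p1)  mov b, 0
    ]
    fetched = 0
    for guard, wanted, dest, imm in program:
        fetched += 1                 # every instruction is fetched; there is no branch
        if regs[guard] == wanted:
            regs[dest] = imm         # guard holds: the write takes effect
        # otherwise the instruction behaves like a nop
    return regs["b"], fetched
```

Both versions compute the same `b` for either value of `cond`; the predicated one trades a possible misprediction for always fetching two moves.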

  4. Benefit of Predicated Execution • Predicated execution can be high performance and energy-efficient. • [Figure: pipeline diagram with stages Fetch, Decode, Rename, Schedule, RegisterRead, Execute for blocks A to F; panels: Predicated Execution (one block shown as nop) and Branch Prediction (pipeline flush)]

  5. Limitations/Problems of Predication • ISA: Predicate registers and predicated instructions are required. Dynamic-hammock predication [Klauser’98] can solve this problem, but it is only applicable to simple hammocks. • Adaptivity: Static predication is not adaptive to run-time branch behavior. Branch behavior changes based on input set, phase, and control-flow path. Wish branches [Kim’05] target this problem. • Complex CFG: A large subset of control-flow graphs is not converted to predicated code: function calls, loops, many instructions inside a region, and complex CFGs. Hyperblock [Mahlke’92] cannot adapt to frequently-executed paths dynamically.

  6. Outline • Predicated Execution • Diverge-Merge Processor (DMP) • Implementation of DMP • Experimental Evaluation • Conclusion

  7. Diverge-Merge Processor (DMP) • DMP can dynamically predicate complex branches (in addition to simple hammocks). • The compiler identifies diverge branches and control-flow merge (CFM) points. • The microarchitecture decides when and what to predicate dynamically.

  8. Dynamic Predication • Klauser et al. [PACT’98]: Dynamic-hammock predication • Code: A: p1 = (cond); branch p1, TARGET • B: mov R1, 1; jmp JOIN • C: TARGET: mov R1, 0 • H: JOIN: add R5, R1, 1 • Low-confidence branch at A: (mov R1, 1) PR10 = 1; (mov R1, 0) PR11 = 0 • select-µops (φ-nodes in SSA): PR12 = (cond) ? PR11 : PR10 • [Figure: CFG in which A branches (T/N) to C or B, which rejoin at H]

  9. Diverge-Merge Processor • [Figure: CFG with blocks A to H; A is the diverge branch, H is the CFM point where select-µops are inserted; legend: frequently executed path, not frequently executed path]

  10. Diverge-Merge Processor • [Figure: CFG with blocks A to H; legend: frequently executed path, not frequently executed path, diverge branch, executed block, CFM point]

  11. Control-Flow Graphs • [Figure: five CFG types, each starting at block A: simple hammock, nested hammock, frequently-hammock, loop, non-merging]

  12. Dual-path Execution vs. DMP • [Figure: side-by-side comparison of dual-path execution and DMP after a low-confidence branch at A; each side shows path 1 and path 2 through blocks C, B, D, E, F, with the CFM point marked]

  13. Control-Flow Graphs • [Figure: five CFG types, each starting at block A: simple hammock, nested hammock, frequently-hammock, loop, non-merging; two of the types are annotated "sometimes"]

  14. Distribution of Mispredicted Branches • 66% of mispredicted branches can be dynamically predicated in DMP.

  15. Distribution of Mispredicted Branches • 66% of mispredicted branches can be dynamically predicated in DMP.

  16. Outline • Predicated Execution • Diverge-Merge Processor (DMP) • Implementation of DMP • Experimental Evaluation • Conclusion

  17. Fetch Mechanism • [Figure: CFG with blocks A to H; A is a low-confidence diverge branch and H is the CFM point; labels: round-robin fetch, predicted path]
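One way to picture the round-robin fetch named on this slide is the sketch below. It is only an illustration under stated assumptions: the function name `round_robin_fetch`, the block-level granularity, the example paths, and the rule that a path stops fetching once it reaches the CFM block are all hypothetical and are not taken from the presentation. Each path is assumed to end at the CFM block.

```python
def round_robin_fetch(path1, path2, cfm):
    """Alternate fetch between two paths until both have reached the CFM block.

    Returns the fetch order as (path id, block) pairs, followed by a final
    ("merged", cfm) entry once both paths have arrived.
    Assumption: both path lists contain the cfm block.
    """
    paths = [list(path1), list(path2)]
    pos = [0, 0]             # next block to fetch on each path
    done = [False, False]    # whether each path has reached the CFM point
    order = []
    turn = 0
    while not all(done):
        if not done[turn]:
            block = paths[turn][pos[turn]]
            if block == cfm:
                done[turn] = True            # this path waits at the CFM point
            else:
                order.append((turn + 1, block))
                pos[turn] += 1
        turn = 1 - turn                      # switch to the other path
    order.append(("merged", cfm))            # single-path fetch resumes here
    return order


# Hypothetical paths after a low-confidence diverge branch at A:
fetch_order = round_robin_fetch(["C", "H"], ["B", "E", "H"], cfm="H")
```

With these example paths, the shorter path reaches H first and idles while the longer one finishes, after which fetch continues from H on a single path.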

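  18. Dynamic Predication • Forks RAT, RAS, and GHR • Renaming (architectural → renamed): branch r0, C → branch pr10, C; p1 = pr10 • add r1 ← r3, #1 → add pr21 ← pr13, #1 (p1) • add r1 ← r2, #-1 → add pr31 ← pr12, #-1 (!p1) • select-µop: pr41 = p1 ? pr21 : pr31 • add r4 ← r1, r3 → add pr24 ← pr41, pr13 • [Figure: RAT1 and RAT2 snapshots along blocks A, B, C, E, H, showing the mapping for r1 changing from PR11 to PR21 and PR31, then to PR41 after the select-µop]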
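The renaming example on this slide can be replayed with a short sketch. The helper names `rename_path` and `merge_at_cfm`, the tuple encoding of instructions, and the rule "emit one select-µop per architectural register whose two RAT mappings differ" are assumptions made for illustration; the physical register names are taken from the slide and fed in through a fixed list rather than a real free list.

```python
def rename_path(path, rat, fresh, guard):
    """Rename one path against its own RAT copy and return the renamed µops."""
    uops = []
    for dest, src, imm in path:              # hypothetical encoding: add dest <- src, #imm
        phys_src = rat[src]                  # read the source mapping first
        phys_dest = next(fresh)              # allocate a new physical register
        rat[dest] = phys_dest
        uops.append(f"add {phys_dest} <- {phys_src}, #{imm} ({guard})")
    return uops


def merge_at_cfm(rat1, rat2, fresh, pred):
    """At the CFM point, emit a select-µop for each register mapped differently."""
    merged, selects = dict(rat1), []
    for reg in sorted(rat1):
        if rat1[reg] != rat2[reg]:
            phys = next(fresh)
            selects.append(f"{phys} = {pred} ? {rat1[reg]} : {rat2[reg]}")
            merged[reg] = phys
    return merged, selects


rat = {"r1": "pr11", "r2": "pr12", "r3": "pr13"}
fresh = iter(["pr21", "pr31", "pr41"])       # stand-in for a free list
rat1, rat2 = dict(rat), dict(rat)            # fork the RAT at the diverge branch
uops1 = rename_path([("r1", "r3", 1)], rat1, fresh, "p1")
uops2 = rename_path([("r1", "r2", -1)], rat2, fresh, "!p1")
merged, selects = merge_at_cfm(rat1, rat2, fresh, "p1")
```

After the merge, `r1` maps to `pr41`, so the instruction after the CFM point (`add r4 ← r1, r3`) reads `pr41` regardless of which path turns out to be correct.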

  19. DMP Support • ISA support: mark diverge branches/CFM points. • Compiler support [CGO’07]: the compiler identifies diverge branches and the corresponding CFM points. • Hardware support: confidence estimator, fetch mechanisms, load/store processing, instruction retirement, dynamic predication.

  20. Hardware Complexity Analysis • [Table: columns: DMP, Dyn. ham., Dual-path, Multi-path, SW pred., Wish br.; rows: Front-End, Confidence Estimator, Rename Support, Predicate Registers, Select-µop Gen., ST-LD Forwarding, Check Flush/no Flush; the per-cell marks are not preserved in the transcript]

  21. Outline • Predicated Execution • Diverge-Merge Processor (DMP) • Implementation of DMP • Experimental Evaluation • Conclusion

  22. Simulation Methodology • Benchmarks: 12 SPEC 2000 INT, 5 SPEC 95 INT • Different input sets for profiling and evaluation • Alpha ISA execution-driven simulator • Baseline processor configuration: 64KB perceptron predictor/O-GEHL (paper); minimum 30-cycle branch misprediction penalty; 8-wide, 512-entry instruction window; 2KB, 12-bit history enhanced JRS confidence estimator • Less aggressive processor (paper) • Power model using Wattch
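As a rough illustration of the confidence estimator in the baseline configuration, here is a minimal sketch of a JRS-style estimator: a table of resetting counters indexed by the branch PC XORed with global branch history. The 4-bit counter width, the threshold, and the class and method names are assumptions made for this sketch; the slide only gives the 2KB size and the 12-bit history, and the "enhanced" variant used in the evaluation may index the table differently.

```python
class JRSConfidenceEstimator:
    """Sketch of a JRS-style estimator: 2**12 counters x 4 bits = 2 KB (assumed split)."""

    def __init__(self, index_bits=12, counter_bits=4, threshold=15):
        self.mask = (1 << index_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)
        self.history = 0                     # global branch history register

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def is_high_confidence(self, pc):
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, prediction_correct, taken):
        i = self._index(pc)
        if prediction_correct:
            # Saturating increment on a correct prediction.
            self.table[i] = min(self.table[i] + 1, self.counter_max)
        else:
            # Reset on a misprediction.
            self.table[i] = 0
        self.history = ((self.history << 1) | int(taken)) & self.mask


est = JRSConfidenceEstimator(threshold=2)    # low threshold only to keep the demo short
pc = 0x40
before = est.is_high_confidence(pc)
est.update(pc, prediction_correct=True, taken=False)
est.update(pc, prediction_correct=True, taken=False)
after_two_correct = est.is_high_confidence(pc)
est.update(pc, prediction_correct=False, taken=False)
after_mispredict = est.is_high_confidence(pc)
```

In a DMP-like design, a branch that the estimator reports as low confidence would be a candidate for dynamic predication, while high-confidence branches would simply follow the branch predictor.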

  23. Different CFG types

  24. Performance Improvement

  25. Energy Consumption

  26. Outline • Predicated Execution • Diverge-Merge Processor (DMP) • Implementation of DMP • Experimental Evaluation • Conclusion

  27. Conclusion • DMP introduces the concept of frequently-hammocks, and it dynamically predicates complex CFGs. • DMP can overcome the three major limitations of software predication: ISA support, adaptivity, and complex CFGs. • DMP reduces branch mispredictions energy-efficiently: 19% performance improvement, 9% less energy. • DMP divides the work between the compiler and the microarchitecture: the compiler analyzes the control-flow graphs, and the microarchitecture decides when and what to predicate dynamically.

  28. Thank You!!

  29. Questions?
