Diverge-Merge Processor (DMP)
Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt
HPS Research Group, University of Texas at Austin
*Microsoft Research
Outline
• Predicated Execution
• Diverge-Merge Processor (DMP)
• Implementation of DMP
• Experimental Evaluation
• Conclusion
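Predicated Execution
• Convert control flow dependence to data dependence.

Source code:
    if (cond) { b = 0; } else { b = 1; }

(normal branch code)
    A:  p1 = (cond)
        branch p1, TARGET
    B:  mov b, 1
        jmp JOIN
    C:  TARGET: mov b, 0
    D:  JOIN:

(predicated code)
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0

[Figure: control-flow graph in which block A branches (T / N) to blocks C and B, which rejoin at block D; in the predicated version, A, B, and C form a single straight-line sequence.]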
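As an illustration of the if/else example on this slide, the following minimal Python sketch models both versions and checks that they produce the same value of b. The function names and the way a false predicate is modeled (the guarded move is simply skipped) are assumptions made for the sketch, not part of the original slides.

```python
def branch_version(cond: bool) -> int:
    """Normal branch code: only one of the two moves is executed."""
    if cond:
        b = 0  # block C (TARGET): mov b, 0
    else:
        b = 1  # block B: mov b, 1
    return b


def predicated_version(cond: bool) -> int:
    """Predicated code: both moves are issued, each guarded by a predicate."""
    p1 = cond
    b = None
    # Each entry is (predicate, value); a move whose predicate is false
    # has no effect on b, which models a predicated-off instruction.
    for predicate, value in ((not p1, 1), (p1, 0)):
        if predicate:
            b = value
    return b


for cond in (True, False):
    print(cond, branch_version(cond), predicated_version(cond))
```

The predicated version has no control-flow choice to mispredict: the outcome depends only on the data value p1.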
Benefit of Predicated Execution
• Predicated execution can be high performance and energy-efficient.

[Figure: two pipeline diagrams (stages: Fetch, Decode, Rename, Schedule, RegisterRead, Execute) processing blocks A to F. With predicated execution, the block on the untaken side flows through as a nop. With branch prediction, a misprediction causes a pipeline flush.]
Limitations/Problems of Predication
• ISA: Predicate registers and predicated instructions
  • Dynamic-Hammock Predication [Klauser’98] can solve this problem, but it is only applicable to simple hammocks.
• Adaptivity: Static predication is not adaptive to run-time branch behavior.
  • Branch behavior changes based on input set, phase, control-flow path.
  • Wish Branches [Kim’05]
• Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
  • Function calls, loops, many instructions inside a region, and complex CFGs
  • Hyperblock [Mahlke’92] cannot adapt to frequently-executed paths dynamically.
Diverge-Merge Processor (DMP)
• DMP can dynamically predicate complex branches (in addition to simple hammocks).
• The compiler identifies
  • Diverge branches
  • Control-flow merge (CFM) points
• The microarchitecture decides when and what to predicate dynamically.
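Dynamic Predication
Klauser et al. [PACT’98]: Dynamic-hammock predication

Original code:
    A:  p1 = (cond)
        branch p1, TARGET
    B:  mov R1, 1
        jmp JOIN
    C:  TARGET: mov R1, 0
    H:  JOIN: add R5, R1, 1

After the low-confidence branch in A is dynamically predicated:
    B:  (mov R1, 1)   PR10 = 1
    C:  (mov R1, 0)   PR11 = 0
    select-µop (φ-node in SSA):   PR12 = (cond) ? PR11 : PR10
    H:  JOIN: add R5, R1, 1

[Figure: control-flow graph in which block A branches (T / N) to blocks C and B, which rejoin at block H.]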
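The select-µop on this slide can be illustrated with a small Python sketch. It assumes, as the slide shows, that both sides of the branch write R1 into different physical registers (PR10 and PR11) and that a select-µop picks the correct one into PR12 once the condition is known. The dictionary-based register file and the function name are hypothetical, chosen only for illustration.

```python
def dynamic_predication(cond: bool) -> int:
    """Model both paths executing, then a select-µop merging R1."""
    phys = {}

    # Both paths are executed; each writes its own physical register.
    phys["PR10"] = 1  # block B: mov R1, 1
    phys["PR11"] = 0  # block C (TARGET): mov R1, 0

    # select-µop: PR12 = (cond) ? PR11 : PR10
    phys["PR12"] = phys["PR11"] if cond else phys["PR10"]

    # Block H (JOIN): add R5, R1, 1, with R1 now mapped to PR12.
    r5 = phys["PR12"] + 1
    return r5


print(dynamic_predication(True))   # branch taken: R1 = 0, so R5 = 1
print(dynamic_predication(False))  # branch not taken: R1 = 1, so R5 = 2
```

Instructions after the join point read PR12 only, so they do not need to know which path was architecturally correct.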
Diverge-Merge Processor

[Figure: control-flow graph with blocks A to H. Block A ends in the diverge branch, and block H is the CFM point, where select-µops are inserted. Legend: frequently executed path, not frequently executed path.]
Diverge-Merge Processor

[Figure: the same control-flow graph with blocks A to H. Legend: frequently executed path, not frequently executed path, diverge branch, executed block, CFM point.]
Control-Flow Graphs

[Figure: five types of control-flow graph, each starting at block A:]
• simple hammock
• nested hammock
• frequently-hammock
• loop
• non-merging
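Dual-path Execution vs. DMP

[Figure: side-by-side comparison of dual-path execution and DMP after a low-confidence branch in block A. Each panel shows path 1 and path 2 through blocks B and C, followed by blocks D, E, and F; the CFM point is marked on both paths in the DMP panel.]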
Control-Flow Graphs (continued)

[Figure: the same five control-flow graph types (simple hammock, nested hammock, frequently-hammock, loop, non-merging), with two of them annotated "sometimes".]
Distribution of Mispredicted Branches
• 66% of mispredicted branches can be dynamically predicated in DMP.
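Fetch Mechanism

[Figure: control-flow graph with blocks A to H. Block A ends in a low-confidence diverge branch; the two paths are fetched round-robin until the CFM point at block H, after which fetch continues on the predicted path.]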
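The round-robin fetch named on this slide can be sketched in Python as follows. The sketch assumes that fetch alternates one basic block per turn between the two paths and that each path stops when it reaches the CFM point; the granularity of a turn and the example paths are assumptions, since the slide does not specify them.

```python
def round_robin_fetch(path_a, path_b, cfm):
    """Return the fetch order for two paths that merge at the CFM point.

    path_a and path_b are lists of block names following the diverge
    branch; each list is expected to reach the block named by cfm.
    """
    order = []
    iterators = [iter(path_a), iter(path_b)]
    active = [True, True]
    turn = 0
    while any(active):
        if active[turn]:
            block = next(iterators[turn], None)
            if block is None or block == cfm:
                # This path has reached the CFM point; stop fetching it.
                active[turn] = False
            else:
                order.append((turn, block))
        turn = 1 - turn  # alternate between the two paths
    # Both paths have merged: the CFM block is fetched once.
    order.append(("merged", cfm))
    return order


# Hypothetical paths: A -> B -> E -> H and A -> C -> E -> H.
print(round_robin_fetch(["B", "E", "H"], ["C", "E", "H"], "H"))
```

In this example block E is fetched once per path, because it lies before the CFM point on both of them.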
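Dynamic Predication
• Forks RAT, RAS, and GHR

Original instruction        Renamed instruction
branch r0, C                branch pr10, C      (p1 = pr10)
add r1 ← r3, #1             add pr21 ← pr13, #1     (p1)
add r1 ← r2, #-1            add pr31 ← pr12, #-1    (!p1)
add r4 ← r1, r3             add pr24 ← pr41, pr13

select-µop: pr41 = p1 ? pr21 : pr31

[Figure: control-flow graph with blocks A, B, C, E, and H, and the two forked register alias tables. R1 is mapped to PR11 before the branch, to PR21 in RAT1, to PR31 in RAT2, and to PR41 after the select-µop.]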
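One way to picture how select-µops follow from the forked RATs is the Python sketch below. It assumes that, at the CFM point, the two RAT copies are compared and that one select-µop is generated for every architectural register whose mapping differs. The function name, the allocator callback, and the string form of the µop are hypothetical; the register names follow the slide.

```python
def generate_select_uops(rat1, rat2, predicate, allocate):
    """Compare two forked RATs and emit select-µops for differing mappings.

    rat1, rat2: dicts from architectural register to physical register.
    predicate:  name of the predicate register guarding path 1.
    allocate:   callable returning a fresh physical register name.
    """
    uops = []
    merged = dict(rat1)
    for arch in sorted(rat1):
        if rat1[arch] != rat2[arch]:
            dest = allocate(arch)
            uops.append(f"{dest} = {predicate} ? {rat1[arch]} : {rat2[arch]}")
            merged[arch] = dest
    return uops, merged


# RAT at the diverge branch, as read from the slide: r1 -> pr11.
fork = {"r1": "pr11", "r2": "pr12", "r3": "pr13"}
rat1 = dict(fork, r1="pr21")  # path 1: add r1 <- r3, #1
rat2 = dict(fork, r1="pr31")  # path 2: add r1 <- r2, #-1

uops, merged = generate_select_uops(rat1, rat2, "p1", lambda arch: "pr41")
print(uops)    # ['pr41 = p1 ? pr21 : pr31']
print(merged)  # r1 now maps to pr41, so 'add r4 <- r1, r3' reads pr41
```

Registers written on neither path, or renamed identically on both, need no select-µop in this model.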
DMP Support
• ISA Support
  • Mark diverge branches/CFM points.
• Compiler Support [CGO’07]
  • The compiler identifies diverge branches and the corresponding CFM points.
• Hardware Support
  • Confidence estimator
  • Fetch mechanisms
  • Load/store processing
  • Instruction retirement
  • Dynamic predication
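Hardware Complexity Analysis

[Table: hardware support required by each mechanism. The individual cell entries did not survive extraction.]
Columns: DMP, Dyn. ham., Dual path, Multipath, SW pred., Wish br.
Rows: Front-End, Confidence Estimator, Rename Support, Predicate Registers, Select-µop Gen., ST-LD Forwarding, Check Flush/no Flush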
Simulation Methodology
• 12 SPEC 2000 INT, 5 SPEC 95 INT
  • Different input sets for profiling and evaluation
• Alpha ISA execution-driven simulator
• Baseline processor configuration
  • 64KB perceptron predictor/O-GEHL (paper)
  • Minimum 30-cycle branch misprediction penalty
  • 8-wide, 512-entry instruction window
  • 2KB, 12-bit history enhanced JRS confidence estimator
• Less aggressive processor (paper)
• Power model using Wattch
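To make the role of the confidence estimator concrete, here is a minimal Python sketch of a resetting-counter estimator in the JRS style: a table of saturating counters indexed by the branch address XORed with the branch history, incremented on a correct prediction and reset on a misprediction. The counter width (4 bits), the threshold (15), and the class and method names are assumptions for illustration; the slide gives only the 2KB size and the 12-bit history, and the "enhanced" variant used in the evaluation is not modeled here.

```python
class ResettingCounterEstimator:
    """Simplified JRS-style confidence estimator (illustrative sketch)."""

    def __init__(self, index_bits=12, counter_bits=4, threshold=15):
        # 2**12 entries of 4 bits each is 2KB, matching the slide's size
        # under the assumed counter width.
        self.mask = (1 << index_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)

    def _index(self, pc, history):
        return (pc ^ history) & self.mask

    def is_high_confidence(self, pc, history):
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc, history, prediction_correct):
        i = self._index(pc, history)
        if prediction_correct:
            # Saturating increment on a correct prediction.
            self.table[i] = min(self.table[i] + 1, self.counter_max)
        else:
            # Reset on a misprediction.
            self.table[i] = 0


est = ResettingCounterEstimator()
pc, history = 0x4A10, 0b101100111010
for _ in range(15):
    est.update(pc, history, prediction_correct=True)
print(est.is_high_confidence(pc, history))  # True after 15 correct predictions
est.update(pc, history, prediction_correct=False)
print(est.is_high_confidence(pc, history))  # False right after a misprediction
```

In DMP, a low-confidence result on a diverge branch is what triggers dynamic predication.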
Conclusion
• DMP introduces the concept of frequently-hammocks and it dynamically predicates complex CFGs.
• DMP can overcome the three major limitations of software predication: ISA support, adaptivity, complex CFG.
• DMP reduces branch mispredictions energy-efficiently.
  • 19% performance improvement, 9% less energy
• DMP divides the work between the compiler and the microarchitecture:
  • The compiler analyzes the control-flow graphs.
  • The microarchitecture decides when and what to predicate dynamically.