EE 382N Guest Lecture: Wish Branches • Hyesoon Kim • HPS Research Group • The University of Texas at Austin
Lecture Outline • Predicated execution • Wish branches • 2D-profiling
Motivation • Branch predictors are still not perfect. • Deeper pipelines and larger instruction windows increase the branch misprediction penalty. • Predicated execution can eliminate branch mispredictions by converting control dependencies to data dependencies. However, predicated code has overhead.
Predicated Execution
• Convert control-flow dependency to data dependency
• Source: if (cond) { b = 0; } else { b = 1; }
• Normal branch code (A goes to C when taken (T) or to B when not taken (N); both paths reach D):
    A: p1 = (cond)
       branch p1, TARGET
    B: mov b, 1
       jmp JOIN
    C: TARGET: mov b, 0
• Predicated code (A, B, C, D are executed in sequence):
    A: p1 = (cond)
    B: (!p1) mov b, 1
    C: (p1) mov b, 0
    D: add x, b, 1
• Pro: Eliminate hard-to-predict branches
• Cons: (1) Fetch blocks B and C all the time (2) Wait until p1 is resolved
The Overhead of Predicated Execution
• [Figure: bar chart with baseline "non-predicated"; data labels -2%, 16%, 13%]
• Predicated code: A: p1 = (cond); B: (!p1) mov b,1; C: (p1) mov b,0; D: add x, b, 1
• The same code with the predicate values already known: p1 = (cond); (0) mov b,1; (1) mov b,0
• If all overhead is ideally eliminated, predicated execution would provide a 16% improvement in average execution time
The Problem • Due to the predication overhead, predicated execution sometimes reduces performance. • Branch misprediction characteristics depend on run-time behavior: input set, control-flow path, and phase behavior. The compiler cannot accurately estimate the run-time behavior of branches.
Predicated Code Performance vs. Branch Misprediction Rate
• Execution time (normal branch code) = exec_T * P(T) + exec_N * P(N) + misp_penalty * P(misprediction)
• Execution time (predicated code) = exec_pred
• Converting a branch to predicated code could hurt performance if the run-time misprediction rate is lower than the profile-time misprediction rate
• [Figure: regions labeled "Predicated code performs better" and "Normal branch code performs better", a point marked X, and points for profile-time (input A) and run-time (input B)]
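The two execution-time formulas above can be turned into a small break-even calculation. The sketch below is only an illustration: the cycle counts and misprediction rates are hypothetical values, not measurements from the lecture, and the function names are made up for this example.

```python
def normal_branch_time(exec_t, exec_n, p_taken, misp_penalty, p_misp):
    """Expected execution time of normal branch code (formula from the slide)."""
    return exec_t * p_taken + exec_n * (1.0 - p_taken) + misp_penalty * p_misp


def break_even_misp_rate(exec_t, exec_n, p_taken, misp_penalty, exec_pred):
    """Misprediction rate at which normal branch code and predicated code tie."""
    no_misp_time = exec_t * p_taken + exec_n * (1.0 - p_taken)
    return (exec_pred - no_misp_time) / misp_penalty


# Hypothetical numbers, chosen only for illustration.
EXEC_T, EXEC_N, EXEC_PRED = 3.0, 3.0, 5.0   # cycles per path / predicated block
PENALTY, P_TAKEN = 30.0, 0.5                # misprediction penalty, taken rate

threshold = break_even_misp_rate(EXEC_T, EXEC_N, P_TAKEN, PENALTY, EXEC_PRED)

# Profile-time input A mispredicts 10% of the time: predication looks profitable.
time_profile = normal_branch_time(EXEC_T, EXEC_N, P_TAKEN, PENALTY, 0.10)
# Run-time input B mispredicts only 2% of the time: predication now loses.
time_runtime = normal_branch_time(EXEC_T, EXEC_N, P_TAKEN, PENALTY, 0.02)

print(f"break-even misprediction rate: {threshold:.3f}")
print(f"profile-time: normal={time_profile:.2f} vs predicated={EXEC_PRED:.2f}")
print(f"run-time:     normal={time_runtime:.2f} vs predicated={EXEC_PRED:.2f}")
```

With these assumed numbers the branch looks worth predicating at profile time but would run faster as a normal branch at run time, which is the situation the slide warns about.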
Lecture Outline • Predicated execution • Wish branches • 2D-profiling
Wish Branches [Kim et al. Micro-38] • A new type of control flow instruction • 3 types: wish jump/join and wish loop • The compiler generates code (with wish branches) that can be executed either as predicated code or non-predicated code (normal branch code) • The hardware decides to execute predicated code or normal branch code at run-time based on the confidence of branch prediction • Easy to predict: normal branch code • Hard to predict: predicated code
Wish Jump/Join
• Normal branch code:
    A: p1 = (cond)
       branch p1, TARGET
    B: mov b, 1
       jmp JOIN
    C: TARGET: mov b, 0
    D: JOIN:
• Predicated code:
    A: p1 = (cond)
    B: (!p1) mov b, 1
    C: (p1) mov b, 0
• Wish jump/join code:
    A: p1 = (cond)
       wish.jump p1, TARGET
    B: (!p1) mov b, 1
       wish.join !p1, JOIN
    C: TARGET: (p1) mov b, 0
    D: JOIN:
• High confidence: the wish jump/join code behaves like normal branch code, with the predicates treated as true:
    A: p1 = (cond)
       branch p1, TARGET
    B: (1) mov b, 1
       wish.join (1), JOIN
    C: TARGET: (1) mov b, 0
• Low confidence: the wish jump and wish join act as nops, and blocks B and C execute as predicated code
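The two execution modes of the wish jump/join code can be sketched as a toy model. This is an illustration of the idea only, not the hardware mechanism: the function name, the block labels in the returned list, and the way a flush is counted are assumptions made for this example.

```python
def execute_hammock(cond, high_confidence, predicted_taken=False):
    """Toy model of `if (cond) b = 0; else b = 1;` compiled with wish jump/join.

    Returns (b, blocks, flushes), where `blocks` lists the blocks on the
    final (correct) path and `flushes` counts pipeline flushes.
    """
    p1 = bool(cond)
    blocks, flushes, b = ["A"], 0, None
    if high_confidence:
        # Normal-branch mode: only the predicted side is fetched.
        if predicted_taken != p1:
            flushes += 1          # wrong path is squashed and refetched
        if p1:
            blocks.append("C")    # TARGET: (1) mov b, 0
            b = 0
        else:
            blocks.append("B")    # (1) mov b, 1
            b = 1
    else:
        # Predicated mode: wish jump/join act as nops, B and C are both fetched.
        blocks += ["B", "C"]
        if not p1:
            b = 1                 # (!p1) mov b, 1
        if p1:
            b = 0                 # (p1) mov b, 0
    blocks.append("D")
    return b, blocks, flushes


for cond in (True, False):
    print("low confidence :", cond, execute_hammock(cond, high_confidence=False))
    print("high confidence:", cond, execute_hammock(cond, True, predicted_taken=False))
```

Both modes produce the same value of b; they differ only in how many blocks are fetched and whether a misprediction costs a flush.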
Wish Loop
• Source: do { a++; i++; } while (i<N);
• Normal backward branch code:
    X: LOOP: add a, a, 1
             add i, i, 1
             p1 = (i<N)
             branch p1, LOOP
    Y: EXIT:
• Wish loop code:
    H: mov p1, 1
    X: LOOP: (p1) add a, a, 1
             (p1) add i, i, 1
             (p1) p1 = (cond)
             wish.loop p1, LOOP
    Y: EXIT:
• High confidence: the predicates in the loop body are treated as true, (1)
• Low confidence: the loop body executes as predicated code
Mispredicted Case 1: Early-Exit
• Correct execution: H X1 X2 X3 Y (loop branch outcomes: T T N)
• Early-exit (low confidence): H X1 X2 Y … (predictions: T N); the pipeline is flushed and X3 Y is fetched
• Compared to normal branch code: predicate data dependency and one extra instruction (-)
Mispredicted Case 2: Late-Exit
• Correct execution: H X1 X2 X3 Y (loop branch outcomes: T T N)
• Late-exit (low confidence): H X1 X2 X3 X4 X5 Y … (predictions: T T T T N); X4 and X5 become nops
• Compared to normal branch code: pro: reduce flush penalty (+++); cons: predicate data dependency and one extra instruction (-)
Mispredicted Case 3: No-Exit
• Correct execution: H X1 X2 X3 Y (loop branch outcomes: T T N)
• Late-exit: H X1 X2 X3 X4 X5 Y … (predictions: T T T T N); X4 and X5 become nops
• No-exit: H X1 X2 X3 X4 X5 X6 … (predictions: T T T T T); the pipeline is flushed
• No-exit compared to normal branch code: predicate data dependency and one extra instruction (-)
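The three misprediction cases for a wish loop in low-confidence mode can be told apart by comparing the real and the predicted outcomes of the loop branch. The sketch below is a simplified model under stated assumptions: outcomes are written as strings of "T"/"N" as on the slides, `predicted` is assumed to cover everything fetched up to the point where the real exit resolves, and the function name and result fields are hypothetical.

```python
def classify_wish_loop(actual, predicted):
    """Classify a wish loop execution in low-confidence mode.

    actual    -- real loop-branch outcomes, e.g. "TTN" (exit after 3 iterations)
    predicted -- predicted outcomes for the iterations that were fetched
    """
    real_exit = actual.index("N")
    if "N" not in predicted:
        # The front end is still inside the loop: flush is required.
        return {"case": "no-exit", "flush": True, "nop_iterations": 0}
    pred_exit = predicted.index("N")
    if pred_exit < real_exit:
        # Left the loop too soon: missing iterations must be refetched.
        return {"case": "early-exit", "flush": True, "nop_iterations": 0}
    if pred_exit == real_exit:
        return {"case": "correct", "flush": False, "nop_iterations": 0}
    # Extra iterations have a false predicate and flow through as nops.
    return {"case": "late-exit", "flush": False,
            "nop_iterations": pred_exit - real_exit}


# The examples from the slides: correct execution is H X1 X2 X3 Y (T T N).
for predicted in ("TTN", "TN", "TTTTN", "TTTTT"):
    print(predicted, classify_wish_loop("TTN", predicted))
```

Only the late-exit case avoids the flush, which is where the wish loop gains over a normal backward branch.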
Questions? • What kind of branches should be converted to wish branches (jump/join)? • Why not all branches? • What kind of branches should be converted to wish loops?
Advantages/Disadvantages of Wish Branches • Advantages compared to predicated execution • Reduce the overhead of predication • Increase the benefits of predicated code by allowing the compiler to generate more aggressively-predicated code • Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (Wish loops) • Make predicated code less dependent on machine configuration (e.g. branch predictor)
Advantages/Disadvantages of Wish Branches • Disadvantages compared to predicated execution • Extra branch instructions use machine resources • Extra branch instructions increase the contention for branch predictor table entries • May constrain the compiler’s scope for code optimizations
Wish Branch Support • ISA Support • Predicated execution, wish branch instruction • Compiler Support • Wish branch generation algorithms • The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support • Instruction decode logic • Predicate dependency elimination module • Confidence estimator • Front-end and branch misprediction detection/recovery module
ISA Support • Using existing hint bits (IA-64, x86, PowerPC) • Hint bits can be ignored, so a wish branch can be treated as a normal branch • Instruction format: OPCODE | btype | wtype | target offset | p • btype: branch type (0: normal branch, 1: wish branch) • wtype: wish branch type (0: jump, 1: loop, 2: join) • p: predicate register identifier
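A decoder's view of the hint fields can be sketched as follows. The field values (btype 0/1, wtype 0/1/2) come from the slide; everything else, including the function name, the `hardware_supports_wish` flag, and the returned strings, is a hypothetical illustration. No bit widths or positions are assumed, since the slide does not give them.

```python
from enum import IntEnum


class WType(IntEnum):
    """Wish branch type values as listed on the slide."""
    JUMP = 0
    LOOP = 1
    JOIN = 2


def decode_branch_hint(btype, wtype, hardware_supports_wish=True):
    """Interpret the btype/wtype hint fields of a branch instruction.

    Hardware without wish-branch support can ignore the hints and
    treat every branch as a normal branch.
    """
    if btype == 0 or not hardware_supports_wish:
        return "normal branch"
    return "wish " + WType(wtype).name.lower()


print(decode_branch_hint(btype=1, wtype=1))                                # wish loop
print(decode_branch_hint(btype=1, wtype=2, hardware_supports_wish=False))  # normal branch
```

Because the hints are advisory, the same binary stays correct on a processor that implements only normal branches and predication.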
Wish Branch Support • ISA Support • Predicated execution, wish branch instruction • Compiler Support • Wish branch generation algorithms • The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support • Instruction decode logic • Predicate dependency elimination module • Confidence estimator • Front-end and branch misprediction detection/recovery module
Compiler Support
• Major phase ordering with wish branch generation in code generation [ORC]: region formation, if-conversion, loop opt (swp, unrolling), global inst. sched, register allocation, local inst. sched
• [Figure: compiler steps marked as new, existing, or modified. Boxes: select candidates, cost-benefit analysis, wish jump conversion, predicate selected blocks, edge/value profiling, if-conversion, branch elimination, wish join insertion, wish loop conversion, loop opt]
Wish Branch Generation Algorithm • Wish jump/join candidates: all branches that are suitable for if-conversion • If the number of instructions in the fall-through block > N (N=5): wish jump and join are inserted • All other candidate branches are converted to predicated code • A loop branch is converted into a wish loop when the loop body has fewer than L instructions (L=30)
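The selection rules above can be written as a small decision function. This is a minimal sketch, not the ORC implementation: the function name and parameters are hypothetical, and the fallback to a normal branch for loops at or above the size limit and for branches unsuitable for if-conversion is an assumption, since the slide only lists the positive cases.

```python
N_FALLTHROUGH = 5   # N from the slide
L_LOOP_BODY = 30    # L from the slide


def choose_conversion(is_loop_branch, suitable_for_if_conversion=False,
                      fallthrough_insts=0, loop_body_insts=0):
    """Pick how the compiler emits one branch, following the slide's thresholds."""
    if is_loop_branch:
        if loop_body_insts < L_LOOP_BODY:
            return "wish loop"
        return "normal branch"          # assumption: large loops stay as they are
    if not suitable_for_if_conversion:
        return "normal branch"          # assumption: not a candidate at all
    if fallthrough_insts > N_FALLTHROUGH:
        return "wish jump/join"
    return "predicated code"


print(choose_conversion(False, True, fallthrough_insts=12))   # wish jump/join
print(choose_conversion(False, True, fallthrough_insts=3))    # predicated code
print(choose_conversion(True, loop_body_insts=8))             # wish loop
```

Short fall-through blocks are simply predicated because the extra wish jump and join instructions would cost more than fetching a few useless instructions.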
Wish Branch Support • ISA Support • Predicated execution, wish branch instruction • Compiler Support • Wish branch generation algorithms • The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support • Instruction decode logic • Predicate dependency elimination module • Front-end and branch misprediction detection/recovery module • Confidence estimator
Hardware Support • Instruction fetch/decode logic • Decoder: decode wish branches • BTB: mark wish branches • Wish branch state machine hardware • A wish loop stays in low-confidence mode until the loop exits • Predicate dependency elimination module • High-confidence mode: predicate values are predicted • Branch misprediction detection/recovery module • No flush if a wish branch is mispredicted during low-confidence mode • Confidence estimator
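JRS Confidence Estimator
• Estimates how much confidence the processor has in a branch prediction
• Trained with branch misprediction information
• Assigning Confidence to Conditional Branch Predictions [Jacobsen et al. Micro-29]
• [Figure: the PC and the global BHR (m bits) are combined (shown as + in the figure) to index a table of 2^m entries of n-bit counters; the selected counter is compared against a threshold (> th?) to produce High Confidence or Low Confidence]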
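A counter-table confidence estimator of this shape can be sketched in a few lines. The sketch makes several assumptions that the slide does not state: the PC and BHR are combined with XOR, the counters saturate on correct predictions and reset to zero on a misprediction, and the table size, counter width, and threshold are arbitrary illustrative values. Tags, which the simulated configuration mentions, are not modeled.

```python
class JRSConfidenceEstimator:
    """Toy confidence estimator: table of n-bit counters indexed by PC and BHR."""

    def __init__(self, index_bits=10, counter_bits=4, threshold=7):
        self.index_mask = (1 << index_bits) - 1       # 2^m entries
        self.counter_max = (1 << counter_bits) - 1    # n-bit saturating counter
        self.threshold = threshold
        self.counters = [0] * (1 << index_bits)

    def _index(self, pc, bhr):
        # Assumption: PC and global branch history are combined with XOR.
        return (pc ^ bhr) & self.index_mask

    def is_high_confidence(self, pc, bhr):
        """High confidence when the counter exceeds the threshold (> th?)."""
        return self.counters[self._index(pc, bhr)] > self.threshold

    def update(self, pc, bhr, prediction_correct):
        """Train with the outcome of the branch prediction."""
        i = self._index(pc, bhr)
        if prediction_correct:
            self.counters[i] = min(self.counters[i] + 1, self.counter_max)
        else:
            self.counters[i] = 0   # assumption: reset on a misprediction


est = JRSConfidenceEstimator()
PC, BHR = 0x4000, 0b1011           # hypothetical branch address and history
for _ in range(8):
    est.update(PC, BHR, prediction_correct=True)
confident_after_streak = est.is_high_confidence(PC, BHR)
est.update(PC, BHR, prediction_correct=False)
confident_after_miss = est.is_high_confidence(PC, BHR)
print(confident_after_streak, confident_after_miss)
```

For a wish branch, a high-confidence result would select normal-branch mode and a low-confidence result would select predicated mode.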
Experimental Infrastructure • IA-64 provides full support for predication • Convert IA-64 traces to micro-ops to simulate an out-of-order superscalar processor model • Flow: Source Code → IA-64 Compiler (ORC) → IA-64 Binary → Trace generation module → IA-64 Trace → Micro-op Translator → µops → Micro-op Simulator
Simulation Methodology • Nine SPEC 2000 integer benchmarks • Baseline Processor Configuration • Front End • Large and accurate branch predictor (64KB hybrid branch predictor: gshare + local) • Minimum 30-cycle branch misprediction penalty • 64KB, 2-cycle latency I-cache • Execution Core • 8-wide out-of-order processor • 512-entry instruction window • Confidence Estimator • 1KB tagged 16-bit history JRS confidence estimator (Jacobsen et al. MICRO-29)
Performance Improvement
• [Figure: results relative to non-predicated code; data labels -4%, 14%, 2.02, 8%, 24%]
• Callout 1: 16% over conditional branch prediction (w/o mcf), 11% over selective-predication (w/o mcf), 7% over aggressive predication (w/o mcf)
• Callout 2: 14% over conditional branch prediction, 13% over selective-predication, and 16% over aggressive-predication
• Callout 3: 12% over conditional branch prediction, 11% over selective-predication, 13% over aggressive predication
• SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis
• AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated
Wish Branch: Conclusion • New control flow instructions: wish branches (jump/join/loop) • Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture • Compiler: analyzes the control-flow graph and generates code • Microarchitecture: makes run-time decision to use predication • Wish branches provide significant performance benefits • 16% compared to conditional branch prediction • 13% compared to selectively predicated code • Wish branches can make predicated execution more viable and effective in high performance processors • By enabling adaptive and aggressive predicated execution
Lecture Outline • Predicated execution • Wish branches • 2D-profiling
2D-profiling • Goal: Identify input-dependent branches by using a single input set for profiling • If We Know a Branch is Input-Dependent • May not convert it to predicated code. • May convert it to a wish branch. • May not perform other compiler optimizations or may perform them less aggressively. • Hot-path/trace/superblock-based optimizations [Fisher’81, Pettis’90, Hwu’93, Merten’99]
Key Insight of 2D-profiling • Phase behavior in prediction accuracy is a good indicator of input dependence • [Figure: curves labeled input-dependent and input-independent, with phase 1, phase 2, and phase 3 marked]
Traditional Profiling • [Figure: prediction accuracy (pr. Acc) over time for brA and brB, each with its MEAN pr.Acc marked] • MEAN pr.Acc(brA) ≈ MEAN pr.Acc(brB), so the behavior of brA appears the same as the behavior of brB
2D-profiling • [Figure: prediction accuracy (pr. Acc) over time for brA and brB, each with its MEAN pr.Acc and STD pr.Acc marked] • MEAN pr.Acc(brA) ≈ MEAN pr.Acc(brB), but STD pr.Acc(brA) ≠ STD pr.Acc(brB), so behavior of brA ≠ behavior of brB • A: input-dependent br, B: input-independent br
2D-profiling Mechanism
• The profiler collects branch prediction accuracy information for every static branch over time
• Slice size = M instructions; time is divided into Slice 1, Slice 2, …, Slice N
• Per slice, for each branch: mean Pr.Acc(brA,s1), mean Pr.Acc(brA,s2), ..., mean Pr.Acc(brA,sN); mean Pr.Acc(brB,s1), mean Pr.Acc(brB,s2), ..., mean Pr.Acc(brB,sN); ...
• Calculate MEAN (brA, brB, …), Standard deviation (brA, brB, …), PAM: Points Above Mean (brA, brB, …)
• [Figure: per-slice accuracy of brA around its mean, PAM: 50%; per-slice accuracy of brB at its mean, PAM: 0%]
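The three per-branch statistics can be computed directly from the per-slice accuracies. The sketch below is illustrative only: the function name is hypothetical, the accuracy values are made-up data chosen to reproduce the PAM values in the figure, and the population standard deviation is an assumed choice since the slide does not say which variant is used.

```python
from statistics import mean, pstdev


def profile_2d(slice_accuracies):
    """Return (MEAN, standard deviation, PAM) for one static branch.

    slice_accuracies -- mean prediction accuracy of the branch in each slice
    PAM (Points Above Mean) is the fraction of slices above the overall mean.
    """
    m = mean(slice_accuracies)
    sd = pstdev(slice_accuracies)   # assumption: population standard deviation
    pam = sum(1 for acc in slice_accuracies if acc > m) / len(slice_accuracies)
    return m, sd, pam


# Made-up per-slice accuracies: both branches have the same mean accuracy.
br_a = [0.95, 0.55, 0.95, 0.55]   # accuracy changes from slice to slice
br_b = [0.75, 0.75, 0.75, 0.75]   # accuracy is flat over time

for name, data in (("brA", br_a), ("brB", br_b)):
    m, sd, pam = profile_2d(data)
    print(f"{name}: mean={m:.2f} std={sd:.2f} PAM={pam:.0%}")
```

A profiler that kept only the mean could not tell these two branches apart; the standard deviation and PAM expose the time-varying behavior of brA.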
2D-profiling: Conclusion & Future Work • 2D-profiling is a new profiling technique to find input-dependent characteristics by using a single input data set for profiling • 2D-profiling uses time-varying information instead of just average data • Phase behavior in prediction accuracy in a profile run indicates input dependence • Future Work: • Better predicated code/wish branch generation algorithms • Detecting other input-dependent program characteristics