EE 382N Guest Lecture Wish Branches

EE 382N Guest Lecture: Wish Branches. Hyesoon Kim, HPS Research Group, The University of Texas at Austin

Lecture Outline • Predicated execution • Wish branches • 2D-profiling

Motivation • Branch predictors are still not perfect. • Deeper pipelines and larger instruction windows increase the branch misprediction penalty. • Predicated execution can eliminate branch mispredictions by converting control dependencies to data dependencies. However, predicated code has overhead.

Predicated Execution
Convert control-flow dependency to data dependency.

Source:
    if (cond) { b = 0; } else { b = 1; }

Normal branch code (basic blocks A, B, C, D):
    A:  p1 = (cond)
        branch p1, TARGET
    B:  mov b, 1
        jmp JOIN
    C:  TARGET: mov b, 0
    D:  JOIN: add x, b, 1

Predicated code:
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
    D:  add x, b, 1

• Pro: eliminates hard-to-predict branches
• Cons: (1) blocks B and C are fetched all the time (2) instructions that depend on b wait until p1 is resolved

The Overhead of Predicated Execution
[Figure: execution time relative to non-predicated code; bar labels: -2%, 16%, 13%]

Predicated code:
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
    D:  add x, b, 1

Same code with the predicate values already known:
    p1 = (cond)
    (0) mov b, 1
    (1) mov b, 0

• If all overhead were ideally eliminated, predicated execution would provide a 16% improvement in average execution time.

The Problem • Due to the predication overhead, predicated execution sometimes reduces performance. • Branch misprediction characteristics depend on run-time behavior: input set, control-flow path, and phase behavior. The compiler cannot accurately estimate the run-time behavior of branches.

Predicated Code Performance vs. Branch Misprediction Rate
[Figure: execution time vs. branch misprediction rate. Below the crossover point X, normal branch code performs better; above it, predicated code performs better. The profile-time rate (input A) and the run-time rate (input B) are marked.]
• Converting a branch to predicated code could hurt performance if the run-time misprediction rate is lower than the profile-time misprediction rate
• Execution time of normal branch code = exec_T * P(T) + exec_N * P(N) + misp_penalty * P(misprediction)
• Execution time of predicated code = exec_pred
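The cost model on this slide can be turned into a small calculation. The following is a minimal sketch, not the compiler's actual cost-benefit pass: the function names are hypothetical and the cycle counts in the usage note are made-up illustrative values.

```python
def exec_time_branch(exec_t, exec_n, p_taken, misp_penalty, p_misp):
    """Expected time of normal branch code, following the slide's formula.

    P(N) is taken to be 1 - P(T).
    """
    return exec_t * p_taken + exec_n * (1.0 - p_taken) + misp_penalty * p_misp


def break_even_misp_rate(exec_t, exec_n, p_taken, misp_penalty, exec_pred):
    """Misprediction rate at which both code versions take the same time.

    Above this rate predicated code is faster; below it the branch is.
    """
    time_without_misp = exec_t * p_taken + exec_n * (1.0 - p_taken)
    return max(0.0, (exec_pred - time_without_misp) / misp_penalty)


def should_predicate(exec_t, exec_n, p_taken, misp_penalty, p_misp, exec_pred):
    """True if predicated code is expected to be faster than the branch."""
    branch_time = exec_time_branch(exec_t, exec_n, p_taken, misp_penalty, p_misp)
    return exec_pred < branch_time
```

With illustrative values exec_T = exec_N = 3, exec_pred = 6, and a 30-cycle penalty, the break-even rate is 10%. A branch profiled at a 15% misprediction rate would be predicated, yet the same branch running at 5% on another input would have been faster as a normal branch, which is the situation the slide describes.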

Wish Branches [Kim et al. Micro-38] • A new type of control flow instruction with 3 types: wish jump, wish join, and wish loop • The compiler generates code (with wish branches) that can be executed either as predicated code or as non-predicated code (normal branch code) • The hardware decides whether to execute predicated code or normal branch code at run-time, based on the confidence of the branch prediction • Easy to predict: normal branch code • Hard to predict: predicated code

Wish Jump/Join
[Figure: control-flow graphs of blocks A, B, C, D with taken/not-taken edges, shown for high-confidence and low-confidence execution of the wish jump/join code]

Normal branch code:
    A:  p1 = (cond)
        branch p1, TARGET
    B:  mov b, 1
        jmp JOIN
    C:  TARGET: mov b, 0
    D:  JOIN:

Predicated code:
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0

Wish jump/join code:
    A:  p1 = (cond)
        wish.jump p1, TARGET
    B:  (!p1) mov b, 1
        wish.join !p1, JOIN
    C:  TARGET: (p1) mov b, 0
    D:  JOIN:

Wish jump/join code as drawn for the high-confidence case, with predicates shown as (1):
    B:  (1) mov b, 1
        wish.join (1), JOIN
    C:  TARGET: (1) mov b, 0
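The run-time choice between the two execution modes can be summarized in a few lines. This is a sketch of the behavior as the slides describe it (normal branch code when confidence is high, predicated code with no flush when confidence is low); the function name and the returned fields are hypothetical.

```python
def wish_jump_outcome(high_confidence, predicted_taken, actual_taken):
    """Model how one dynamic wish jump is handled.

    High confidence: behave like a normal branch, so only the predicted
    path is fetched and a wrong prediction costs a pipeline flush.
    Low confidence: execute the predicated code, so both B and C are
    fetched and a wrong prediction does not cause a flush.
    """
    if high_confidence:
        return {
            "mode": "normal branch code",
            "fetch_both_paths": False,
            "flush": predicted_taken != actual_taken,
        }
    return {
        "mode": "predicated code",
        "fetch_both_paths": True,
        "flush": False,
    }
```

The sketch makes the trade-off visible: low-confidence mode never flushes but always pays the fetch overhead of both paths, while high-confidence mode avoids that overhead and pays only when the confidence estimate turns out to be wrong.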

Wish Loop
Source:
    do { a++; i++; } while (i < N);

Normal backward branch code:
    X:  LOOP: add a, a, 1
              add i, i, 1
              p1 = (i < N)
              branch p1, LOOP
    Y:  EXIT:

Wish loop code:
    H:  mov p1, 1
    X:  LOOP: (p1) add a, a, 1
              (p1) add i, i, 1
              (p1) p1 = (cond)
              wish.loop p1, LOOP
    Y:  EXIT:

[Figure: control-flow graphs of blocks H, X, Y with taken/not-taken edges, shown for high-confidence and low-confidence execution; in the high-confidence case the predicates are shown as (1)]

Mispredicted Case 1: Early-Exit
• Correct execution: H X1 X2 X3 Y (loop branch outcomes: T T N)
• Early-exit (low confidence): H X1 X2 Y … (loop branch predicted: T N); the pipeline is flushed and execution resumes with X3 Y
• Compared to normal branch code: cons: predicate data dependency and one extra instruction (-)

Mispredicted Case 2: Late-Exit
• Correct execution: H X1 X2 X3 Y (loop branch outcomes: T T N)
• Late-exit (low confidence): H X1 X2 X3 X4 X5 Y … (loop branch predicted: T T T T N); the extra iterations X4 and X5 become nops
• Compared to normal branch code: pro: reduced flush penalty (+++); cons: predicate data dependency and one extra instruction (-)

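Mispredicted Case 3: No-Exit
• Correct execution: H X1 X2 X3 Y (loop branch outcomes: T T N)
• Late-exit: H X1 X2 X3 X4 X5 Y … (loop branch predicted: T T T T N); X4 and X5 become nops
• No-exit: H X1 X2 X3 X4 X5 X6 … (loop branch predicted: T T T T T …); the front end has not exited the loop, so the pipeline is flushed and execution continues at Y
• Compared to normal branch code: cons: predicate data dependency and one extra instruction (-)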
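The three misprediction cases for a wish loop in low-confidence mode can be told apart by comparing how many iterations the front end fetched with how many the loop really executes. The sketch below is one way to express that; the function and its arguments are hypothetical, and it assumes the comparison is made when the real loop exit resolves.

```python
def classify_wish_loop(actual_iters, fetched_iters):
    """Classify a low-confidence wish loop execution.

    actual_iters:  iterations the loop really executes.
    fetched_iters: iterations fetched before the front end predicted the
                   exit, or None if it was still fetching the loop when the
                   real exit resolved.
    Returns (case, needs_flush, wasted_iterations).
    """
    if fetched_iters is None:
        # No-exit: the front end never left the loop, so flush.
        return ("no-exit", True, 0)
    if fetched_iters < actual_iters:
        # Early-exit: iterations are missing, so flush and refetch.
        return ("early-exit", True, 0)
    if fetched_iters > actual_iters:
        # Late-exit: extra iterations have a false predicate (nops).
        return ("late-exit", False, fetched_iters - actual_iters)
    return ("correct", False, 0)
```

For the slides' example, where the loop really runs three iterations, fetching two iterations is an early-exit, fetching five is a late-exit with two wasted iterations and no flush, and never predicting the exit is a no-exit.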

Questions? • What kind of branches should be converted to wish branches (jump/join)? • Why not all branches? • What kind of branches should be converted to wish loops?

Advantages/Disadvantages of Wish Branches • Advantages compared to predicated execution • Reduce the overhead of predication • Increase the benefits of predicated code by allowing the compiler to generate more aggressively-predicated code • Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (Wish loops) • Make predicated code less dependent on machine configuration (e.g. branch predictor)

Advantages/Disadvantages of Wish Branches • Disadvantages compared to predicated execution • Extra branch instructions use machine resources • Extra branch instructions increase the contention for branch predictor table entries • May constrain the compiler’s scope for code optimizations

Wish Branch Support • ISA Support • predicated execution, wish branch instruction • Compiler Support • Wish branch generation algorithms: the compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support • Instruction decode logic • Predicate dependency elimination module • Confidence estimator • Front-end and branch misprediction detection/recovery module

ISA Support • Using existing hint bits (IA-64, x86, PowerPC) • Hint bits can be ignored: a wish branch can be treated as a normal branch. • Instruction format: OPCODE | btype | wtype | target offset | p • btype: branch type (0: normal branch, 1: wish branch) • wtype: wish branch type (0: jump, 1: loop, 2: join) • p: predicate register identifier

Wish Branch Support • ISA Support • predicated execution, wish branch instruction • Compiler Support • Wish branch generation algorithms: the compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support • Instruction decode logic • Predicate dependency elimination module • Confidence estimator • Front-end and branch misprediction detection/recovery module

Compiler Support
• Major phase ordering with wish branch generation in code generation [ORC]: region formation → if-conversion → loop opt (swp, unrolling) → global inst. sched → register allocation → local inst. sched
• Steps shown in the figure: edge/value profiling, select candidates, cost-benefit analysis, predicate selected blocks, branch elimination, wish jump conversion, wish join insertion, wish loop conversion
• [Figure legend: new, modified, existing]

Wish Branch Generation Algorithm • Wish jump/join candidates: all branches that are suitable for if-conversion • If the number of instructions in the fall-through block is greater than N (N=5), a wish jump and a wish join are inserted • All other candidates are converted to predicated code • A loop branch is converted into a wish loop when the loop body has fewer than L instructions (L=30)
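The selection rules on this slide amount to a short decision procedure. The sketch below is an illustration only: the function name and arguments are hypothetical, and leaving non-candidate branches and larger loops as normal branches is an assumption, since the slide does not say what happens to them.

```python
N_FALLTHROUGH = 5   # threshold N from the slide
L_LOOP_BODY = 30    # threshold L from the slide


def choose_conversion(is_loop_branch, if_convertible=False,
                      fallthrough_insts=0, loop_body_insts=0):
    """Pick the code form for one static branch."""
    if is_loop_branch:
        if loop_body_insts < L_LOOP_BODY:
            return "wish loop"
        return "normal branch"  # assumption: large loops are left alone
    if not if_convertible:
        return "normal branch"  # assumption: not a wish jump/join candidate
    if fallthrough_insts > N_FALLTHROUGH:
        return "wish jump/join"
    return "predicated code"
```

For example, an if-convertible branch with an 8-instruction fall-through block becomes a wish jump/join pair, while one with a 3-instruction fall-through block is simply predicated.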

Wish Branch Support • ISA Support • predicated execution, wish branch instruction • Compiler Support • Wish branch generation algorithms The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support • Instruction decode logic • Predicate dependency elimination module • Front-end and branch misprediction detection/recovery module • Confidence estimator

Hardware Support • Instruction fetch/decode logic • Decoder: decode wish branches • BTB: mark wish branches • Wish branch state machine hardware • A wish loop stays in low-confidence mode until the loop exits • Predicate dependency elimination module • High-confidence mode: predicate values are predicted • Branch misprediction detection/recovery module • No flush if a wish branch is mispredicted during low-confidence mode • Confidence estimator

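JRS Confidence Estimator • Estimates how much confidence the processor has in a branch prediction • Trained with branch misprediction information • Reference: Assigning Confidence to Conditional Branch Predictions [Jacobsen et al. Micro-29] • [Figure: the PC is combined with the global BHR to form an m-bit index into a table of 2^m n-bit counters; if the selected counter is greater than a threshold (> th?), the prediction is high confidence, otherwise low confidence]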
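A small software model helps show how such an estimator is trained. The sketch below follows the resetting-counter scheme from the cited Jacobsen et al. paper (count correct predictions, reset on a misprediction). The class name, the XOR indexing, the table size, and the threshold are assumptions for illustration, and the tags and 16-bit history of the evaluated configuration are not modeled.

```python
class JRSConfidenceEstimator:
    """Table of saturating counters indexed by PC XOR global history."""

    def __init__(self, index_bits=10, counter_bits=4, threshold=14):
        self.mask = (1 << index_bits) - 1             # m-bit index
        self.max_count = (1 << counter_bits) - 1      # n-bit counter limit
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)          # 2^m entries
        self.bhr = 0                                  # global branch history

    def _index(self, pc):
        return (pc ^ self.bhr) & self.mask

    def is_high_confidence(self, pc):
        """High confidence if the selected counter exceeds the threshold."""
        return self.table[self._index(pc)] > self.threshold

    def update(self, pc, taken, prediction_correct):
        """Train with the resolved branch outcome."""
        i = self._index(pc)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0                         # reset on misprediction
        # Shift the outcome into the history (kept as wide as the index).
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask
```

A counter only exceeds the threshold after a run of consecutive correct predictions in the same history context, so a single misprediction drops that context back to low confidence, which in turn steers the next wish branch there to predicated execution.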

Experimental Infrastructure • IA-64 provides full support for predication • IA-64 traces are converted to micro-ops to simulate an out-of-order superscalar processor model • Tool flow: Source Code → IA-64 Compiler (ORC) → IA-64 Binary → Trace generation module → IA-64 Trace → Micro-op Translator → µops → Micro-op Simulator

Simulation Methodology • Nine SPEC 2000 integer benchmarks • Baseline Processor Configuration • Front End • Large and accurate branch predictor (64KB hybrid branch predictor: gshare + local) • Minimum 30-cycle branch misprediction penalty • 64KB, 2-cycle latency I-cache • Execution Core • 8-wide out-of-order processor • 512-entry instruction window • Confidence Estimator • 1KB tagged 16-bit history JRS confidence estimator (Jacobsen et al. MICRO-29)

Performance Improvement
[Figure: execution time normalized to non-predicated code; chart labels: -4%, 14%, 2.02, 8%, 24%]
• SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis
• AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated
• Callout: 12% over conditional branch prediction, 11% over selective-predication, 13% over aggressive-predication
• Callout: 14% over conditional branch prediction, 13% over selective-predication, 16% over aggressive-predication
• Callout (w/o mcf): 16% over conditional branch prediction, 11% over selective-predication, 7% over aggressive-predication

Wish Branch: Conclusion • New control flow instructions: wish branches (jump/join/loop) • Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture • Compiler: analyzes the control-flow graph and generates code • Microarchitecture: makes the run-time decision to use predication • Wish branches provide significant performance benefits • 16% compared to conditional branch prediction • 13% compared to selectively predicated code • Wish branches can make predicated execution more viable and effective in high-performance processors • By enabling adaptive and aggressive predicated execution

2D-profiling • Goal: Identify input-dependent branches by using a single input set for profiling • If We Know a Branch is Input-Dependent • May not convert it to predicated code. • May convert it to a wish branch. • May not perform other compiler optimizations or may perform them less aggressively. • Hot-path/trace/superblock-based optimizations [Fisher’81, Pettis’90, Hwu’93, Merten’99]

Key Insight of 2D-profiling • Phase behavior in prediction accuracy is a good indicator of input dependence • [Figure: prediction accuracy of an input-dependent branch and an input-independent branch, with phase 1, phase 2, and phase 3 marked]

Traditional Profiling • [Figure: prediction accuracy (pr. Acc) over time for brA and brB, with MEAN pr.Acc(brA) and MEAN pr.Acc(brB) marked] • MEAN pr.Acc(brA) ≈ MEAN pr.Acc(brB) • behavior of brA ≈ behavior of brB

2D-profiling • [Figure: prediction accuracy (pr. Acc) over time for brA and brB, with the MEAN and STD of pr.Acc marked for each branch] • MEAN pr.Acc(brA) ≈ MEAN pr.Acc(brB) • STD pr.Acc(brA) ≠ STD pr.Acc(brB) • behavior of brA ≠ behavior of brB • A: input-dependent branch, B: input-independent branch

2D-profiling Mechanism
• The profiler collects branch prediction accuracy information for every static branch over time (slice size = M instructions)

         Slice 1               Slice 2               …     Slice N
    brA  mean Pr.Acc(brA,s1)   mean Pr.Acc(brA,s2)   ...   mean Pr.Acc(brA,sN)
    brB  mean Pr.Acc(brB,s1)   mean Pr.Acc(brB,s2)   ...   mean Pr.Acc(brB,sN)
    ...

• Calculate MEAN (brA, brB, …), standard deviation (brA, brB, …), and PAM: Points Above Mean (brA, brB, …)
• [Figure: per-slice accuracy plotted against each branch's mean; brA has PAM: 50%, brB has PAM: 0%]
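The three per-branch statistics can be computed directly from the per-slice accuracies. The following is a minimal sketch: the function name and the sample data are hypothetical, and using the population standard deviation is an assumption, since the slide does not say which form the profiler uses.

```python
from statistics import mean, pstdev


def profile_2d(slice_accuracies):
    """Return (mean, standard deviation, PAM) for one static branch.

    slice_accuracies: the branch's mean prediction accuracy in each time
    slice. PAM (Points Above Mean) is the fraction of slices whose
    accuracy is strictly above the overall mean.
    """
    m = mean(slice_accuracies)
    sd = pstdev(slice_accuracies)
    above = sum(1 for acc in slice_accuracies if acc > m)
    return m, sd, above / len(slice_accuracies)


# Illustrative data: same mean accuracy, very different behavior over time.
br_a = [0.9, 0.5, 0.9, 0.5]   # accuracy changes from phase to phase
br_b = [0.7, 0.7, 0.7, 0.7]   # accuracy is stable
```

Here both branches average about 70% accuracy, so a traditional profile cannot distinguish them, while the standard deviation (about 0.2 versus 0) and PAM (50% versus 0%) separate them, matching the brA/brB contrast on the slide.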

2D-profiling: Conclusion & Future Work • 2D-profiling is a new profiling technique to find input-dependent characteristics by using a single input data set for profiling • 2D-profiling uses time-varying information instead of just average data • Phase behavior in prediction accuracy in a profile run → input-dependent • Future Work: • Better predicated code/wish branch generation algorithms • Detecting other input-dependent program characteristics
