1 / 32

Computer Architecture Advanced Branch Prediction

Computer Architecture Advanced Branch Prediction. Lihu Rappoport and Adi Yoaz. Introduction. Need to predict: Conditional branch direction (taken or no taken) Actual direction is known only after execution Wrong direction prediction causes a full flush

inara
Download Presentation

Computer Architecture Advanced Branch Prediction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Computer ArchitectureAdvanced Branch Prediction Lihu Rappoport and Adi Yoaz

  2. Introduction • Need to predict: • Conditional branch direction (taken or no taken) • Actual direction is known only after execution • Wrong direction prediction causes a full flush • All taken branch (conditional taken or unconditional) targets • Target of direct branches known at decode • Target of indirect branches known at execution • Branch type • Conditional, uncond. direct, uncond. indirect, call, return • Target: minimize branch misprediction rate for a given predictor size

  3. What/Who/When We Predict/Fix Fetch Decode Execute Target Array • Branch type • conditional • unconditional direct • unconditional indirect • call • return • Branch target • Fix TA miss • Fix wrong direct target Cond. Branch Predictor • Predict conditional T/NT on TA miss try to fix Fix Wrong prediction Return Stack Buffer • Predict return target Fix TA miss Fix Wrong prediction Indirect Target Array • Predict indirect target • override TA target Fix Wrong prediction Dec Flush Exe Flush

  4. Branches and Performance • MPI : misprediction-per-instruction: # of incorrectly predicted branches MPI = total # of instructions • MPI correlates well with performance. For example: • MPI = 1% (1 out of 100 instructions @1 out of 20 branches) • IPC=2 (IPC is the average number of instructions per cycle), • flush penalty of 10 cycles • We get: • MPI = 1%  flush in every 100 instructions • flush in every 50 cycles (since IPC=2), • 10 cycles flush penalty every 50 cycles • 20% in performance

  5. Target Array • TA is accessed using the branch address (branch IP) • Implemented as an n-way set associative cache • Tags usually partial • Save space • Can get false hits • Few branches aliased to the same entry • No correctness only performance • TA predicts the following • Indication that instruction is a branch • Predicted target • Branch type • Unconditional: take target • Conditional: predict direction • TA allocated / updated at execution Branch IP type target tag hit / miss (indicates a branch) predicted type predicted target

  6. Conditional BranchDirection Prediction

  7. One-Bit Predictor • Problem: 1-bit predictor has a double mistake in loops Branch Outcome 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 Prediction ? 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 counter array / cache Prediction (at fetch): previous branch outcome branch IP Update (at execution) Update bit with branch outcome

  8. taken taken taken 11 ST 00 SNT 01 WNT 10 WT not- taken taken not-taken not-taken not-taken Predict not-taken Predict taken Bimodal (2-bit) Predictor • A 2-bit counter avoids the double mistake in glitches • Need “more evidence” to change prediction • 2 bits encode one of 4 states • 00 – strong NT, 01 – weakly NT, 10 – weakly taken, 11 – strong taken • Initial state: weakly-taken (most branches are taken) • Update • Branch was actually taken: increment counter (saturate at 11) • Branch was actually not-taken: decrement counter (saturate at 00) • Predict according to m.s.bit of counter (0 – NT, 1 – taken) • Does not predict well branches with patterns like 010101…

  9. Bimodal Predictor (cont.) l.s. bits of branch IP 2-bit-sat counter array Prediction = msb of counter Update counter with branch outcome

  10. Bimodal Predictor - example • Br1 prediction • Pattern: 1 0 1 0 1 0 • counter: 2 3 2 3 2 3 • Prediction: T T T T T T • Br2 prediction • Pattern: 0 1 0 1 0 1 • counter: 2 1 2 1 2 1 • Prediction: T nT T nT T nT • Br3 prediction • Pattern: 1 1 1 1 1 0 • counter: 2 3 3 3 3 3 • Prediction: T T T T T T • Code: • Loop: …. • br1: if (n/2) { • ……. } • br2: if ((n+1)/2) { • ……. } • n-- • br3: JNZ n, Loop

  11. 0 BHR n 2n-1 2-Level Prediction: Local Predictor • Save the history ofeach branch in a Branch History Register (BHR): • A shift-register updated by branch outcome • Saves the last n outcomes of the branch • Used as a pointer to an array of bits specifying direction per history • Example: assume n=6 • Assume the pattern 000100010001 . . . • At the steady-state, the following patterns are repeated in the BHR: 000100010001 . . . 000100 001000 010001 100010 • Following 000100, 010001, 100010 the jump is not taken • Following 001000 the jump is taken

  12. 2-bit-sat counter array BHR history prediction = msb of counter Update History with branch outcome Update counter with branch outcome Local Predictor (cont.) • There could be glitches from the pattern • Use 2-bit saturating counters instead of 1 bit to record outcome: • Too long BHRs are not good: • Past history may be no longer relevant • Warm-Up is longer • Counter array becomes too big

  13. 2-bit-sat counter arrays history cache prediction = msb of counter Branch IP tag history Update History with branch outcome Update counter with branch outcome Local Predictor: private counter arrays Holding BHRs and counter arrays for many branches: Predictor size: #BHRs × (tag_size + history_size + 2 ×2 history_size) Example: #BHRs = 1024; tag_size=8; history_size=6  size=1024 × (8 + 6 + 2×26) = 142Kbit

  14. 2-bit-sat counter array history cache Branch IP tag history prediction = msb of counter Local Predictor: shared counter arrays • Using a single counter array shared by all BHR’s • All BHR’s index the same array • Branches with similar patterns interfere with each other Predictor size: #BHRs × (tag_size + history_size) + 2 ×2 history_size Example: #BHRs = 1024; tag_size=8; history_size=6  size=1024 × (8 + 6) + 2×26 = 14.1Kbit

  15. 2-bit-sat counter array history cache Branch IP tag history h prediction = msb of counter h+m m l.s.bits of IP Local Predictor: lselect • lselect reduces inter-branch-interference in the counter array Predictor size: #BHRs × (tag_size + history_size) + 2 ×2 history_size + m

  16. 2-bit-sat counter array history cache Branch IP tag history h prediction = msb of counter h h l.s.bits of IP Local Predictor: lshare lsharereduces inter-branch-interference in the counter array: maps common patterns in different branches to different counters Predictor size: #BHRs × (tag_size + history_size) + 2 ×2 history_size

  17. Global Predictor • The behavior of some branches is highly correlated with the behavior of other branches: if (x < 1) . . . if (x > 1) . . . • Using a Global History Register (GHR), the prediction of the second ifmay be based on the direction of the firstif • Forother branches the history interference might be destructive

  18. 2-bit-sat counter array GHR history prediction = msb of counter Update History with branch outcome Update counter with branch outcome Global Predictor (cont.) The predictor size: history_size + 2*2 history_size Example: history_size = 12  size = 8 K Bits

  19. 2-bit-sat counter array GHR history prediction = msb of counter Branch IP Update History with branch outcome Update counter with branch outcome Global Predictor: Gshare gshare combines the global history information with the branch IP

  20. Global GHR Prediction Bimodal / Local Branch IP +1 if Bimodal / Local correct and Global wrong -1 if Bimodal / Local wrong and Global correct Chooser array (an array of 2-bit sat. counters) Chooser A chooser selects between 2 predictor that predict the same branch: Use the predictor that was more correct in the past • The chooser may also be indexed by the GHR

  21. Speculative History Updates • Deep pipeline  many cycles between fetch and branch resolution • If history is updated only at resolution • Local: future occurrences of the same branch may see stale history • Global: future occurrences of all branches may see stale history • History is speculatively updated according to the prediction • History must be corrected if the branch is mispredicted • Speculative updates are done in a special field to enable recovery • Speculative History Update • Speculative history updated assuming previous predictions are correct • Speculation bit set to indicate that speculative history is used • Counter array is not updated speculatively • Prediction can change only on a misprediction (state 01→10 or 10→01) • On branch resolution • Update the real history and reset speculative histories if mispredicted

  22. Return Stack Buffer • A return instruction is a special case of an indirect branch: • Each times it jumps to a different target • The target is determined by the location of the corresponding call instruction • The idea: • Hold a small stack of targets • When the target array predicts a call • Push the address of the instruction which follows the call into the stack • When the target array predicts a return • Pop a target from the stack and use it as the return address

  23. Branch Prediction in commercial Processors

  24. Older Processors • 386 / 486 • All branches are statically predicted Not Taken • Pentium • IP based, 2-bit saturating counters (Lee-Smith) • BTB miss - statically predicted Not Taken

  25. 1001 0 counters Tag Target Hist V T P IP LRR 9 Pred= msb of counter 1 2 1 4 2 9 32 32 128 sets 15 Branch Type 00- cond 01- ret 10- call 11- uncond Per-Set Way 3 Way 0 Way 1 Way 2 Intel Pentium III • 2-level, local histories, per-set counters • 4-way set associative: 512 entries in 128 sets Return Stack Buffer

  26. Alpha 264 - LG Chooser Local • New entry on the Local stage is allocated on a global stage miss-prediction • Chooser state-machines: 2 bit each: • one bit saves last time global correct/wrong, • and the other bit saves for the local correct/wrong • Chooses Local only if local was correct and global was wrong Histories Counters In each entry: 6 bit tag + 10 bit History 256 1024 IP 3 4 ways Global Counters 4096 GHR 2 12 Chooser Counters 4096 2

  27. Pentium® M • Combines 3 predictors • Bimodal, Global and Loop predictor • Loop predictor analyzes branches to see if they have loop behavior • Moving in one direction (taken or NT) a fixed number of times • Ended with a single movement in the opposite direction

  28. Pentium® M – Indirect Branch Predictor • Indirect branch targets is data dependent • Can have many targets: e.g., a case statement • Can still have only a single target at run time • Resolved at execution  high misprediction penalty • Used in object-oriented code (C++, Java) • becomes a growing source of branch mispredictions • A dedicated indirect branch target predictor (iTA) • Chooses targets based on a global history (similar to global predictor) • Initially indirect branch is allocated only in the target array (TA) • If target is mispredicted  allocate an iTA entry corresponding to the global history leading to this instance of the indirect branch • Data-dependent indirect branches allocate as many targets as needed • Monotonic indirect branches are still predicted by the TA

  29. hit M HIT Target Array indirect branch Branch IP U X Target Predicted Target Target IndirectTargetPredictor Global history hit Indirect branch target prediction (cont) • Prediction from the iTA is used if • TA indicates an indirect branch • iTA hits for the current global history (XORed with branch address)

  30. Backup

  31. 13 bit global history AMD-K6 BHT - Branch History Table 2-level 8,192-entry global predictor: • 16 Entry BTC - Branch Target Cache • Supplies the first 16 bytes of target instructions to the decoders when the branches are predicted. • Organized as 16 entries of 16 bytes. • Avoids a bubble for correct predictions. • No Target Address Buffer: • Address ALUs calculate target addresses on-the-fly during decode GHR 2-bit-sat counter array • 16 Entry RAS - Return Address Stack • Caches the return addresses

  32. HP PA-8000 • 256 X 3- bit branch history table (BHT) • Instead of 2-bit-sat counters, stores the results of the last three iterations of each branch • The prediction is based on a majority vote of the three bits • Offers a similar level of hysteresis and accuracy as 2-bit-sat, but easier to update (shift results vs. read-modify-write) • The BHT is only updated as branch instructions are retired • prevents corrupting the history information with speculative executions of the branch

More Related