
9th Lecture




  1. 9th Lecture
     • Branch prediction (rest)
     • Predication
     • Intel Pentium II/III
     • Intel Pentium 4

  2. Hybrid Predictors
     • McFarling's second strategy is to combine multiple separate branch predictors, each tuned to a different class of branches.
     • A combining (hybrid) predictor needs two or more component predictors and a predictor selection mechanism.
     • Examples:
        • McFarling: a two-bit predictor combined with a gshare two-level adaptive predictor,
        • Young and Smith: compiler-based static branch prediction combined with a two-level adaptive predictor,
        • and many more combinations.
     • Hybrid predictors are often better than single-type predictors.
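The combining idea can be sketched in a few lines. The following is a minimal model, not McFarling's exact design: the table size, the initial counter values, and the names `HybridPredictor`, `bump`, and `simulate` are assumptions made for illustration. A two-bit (bimodal) table and a gshare table each make a prediction, and a third table of two-bit chooser counters, indexed by branch address, selects between them. The chooser is only trained when the two components disagree.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class HybridPredictor:
    def __init__(self, index_bits=10):          # table size is an assumption
        self.mask = (1 << index_bits) - 1
        size = 1 << index_bits
        self.bimodal = [1] * size               # 2-bit counters, weakly not taken
        self.gshare = [1] * size                # 2-bit counters, weakly not taken
        self.chooser = [1] * size               # 0-1: use bimodal, 2-3: use gshare
        self.history = 0                        # global branch history register

    def _indices(self, pc):
        # Bimodal uses the branch address; gshare XORs it with the history.
        return pc & self.mask, (pc ^ self.history) & self.mask

    def predict(self, pc):
        b_idx, g_idx = self._indices(pc)
        b_pred = self.bimodal[b_idx] >= 2
        g_pred = self.gshare[g_idx] >= 2
        return g_pred if self.chooser[b_idx] >= 2 else b_pred

    def update(self, pc, taken):
        b_idx, g_idx = self._indices(pc)
        b_ok = (self.bimodal[b_idx] >= 2) == taken
        g_ok = (self.gshare[g_idx] >= 2) == taken
        if b_ok != g_ok:                        # train chooser only on disagreement
            self.chooser[b_idx] = bump(self.chooser[b_idx], g_ok)
        self.bimodal[b_idx] = bump(self.bimodal[b_idx], taken)
        self.gshare[g_idx] = bump(self.gshare[g_idx], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def simulate(predictor, trace):
    """Return one True/False per branch: was the prediction correct?"""
    hits = []
    for pc, taken in trace:
        hits.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return hits


# A branch that strictly alternates defeats the two-bit counter,
# but gshare learns it and the chooser switches over to gshare.
trace = [(0x40, n % 2 == 0) for n in range(200)]
hits = simulate(HybridPredictor(), trace)
print(sum(hits[-100:]), "of the last 100 predictions correct")
```

On this trace the bimodal counter oscillates between its two middle states and is wrong every time, so the example shows why a selector that learns per branch which component to trust can help.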

  3. Simulations of Grunwald 1998
     Table 1.1: SAg, gshare, and McFarling's combining predictor (table not reproduced in this transcript)

  4. Results
     • Simulation of Keeton et al. 1998 using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.
     • The speculative execution factor, given by the number of instructions decoded divided by the number of instructions committed, is 1.4 for the database programs.
     • Two different conclusions may be drawn from these simulation results:
        • branch predictors should be further improved,
        • and/or branch prediction is only effective if the branch is predictable.
     • If a branch outcome depends on irregular data inputs, the branch often behaves irregularly.
     → Question: How confident is a branch prediction?
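As a rough worked example, the two reported rates can be combined into a per-instruction figure. This assumes both percentages refer to the same instruction stream, which the slide does not state; the function names are illustrative only.

```python
def mispredictions_per_instruction(branch_frequency, misprediction_rate):
    """Fraction of all instructions that are mispredicted branches."""
    return branch_frequency * misprediction_rate


def speculative_execution_factor(decoded, committed):
    """Instructions decoded divided by instructions committed."""
    return decoded / committed


rate = mispredictions_per_instruction(0.21, 0.14)
print(f"{rate:.4f} mispredictions per instruction")
print(f"about one misprediction every {1 / rate:.0f} instructions")

# A factor of 1.4 means 140 instructions are decoded per 100 committed,
# i.e. 40 decoded instructions are thrown away.
print(speculative_execution_factor(140, 100))
```

Under that assumption, roughly 3% of all instructions are mispredicted branches, which is consistent with a large share of decoded work being discarded.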

  5. 4.3.4 Predicated Instructions and Multipath Execution: Confidence Estimation
     • Confidence estimation is a technique for assessing the quality of a particular prediction.
     • Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor.
     • A low-confidence branch is one that changes direction frequently and irregularly, making its outcome hard or even impossible to predict.
     • Each prediction falls into one of four classes:
        • correctly predicted with high confidence, C(HC),
        • correctly predicted with low confidence, C(LC),
        • incorrectly predicted with high confidence, I(HC), and
        • incorrectly predicted with low confidence, I(LC).

  6. Implementation of a Confidence Estimator
     • Information from the branch prediction tables is used:
        • The state of the saturating counters can serve as a confidence estimate: speculate more aggressively when the confidence level is higher.
        • A miss distance counter (MDC) table can be used: each time a branch is predicted, the value in the MDC is compared to a threshold. If the value is above the threshold, the branch is considered to have high confidence; otherwise it has low confidence.
        • In a PAs predictor scheme, a small number of branch history patterns typically leads to correct predictions. The confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.
     • Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution.
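A miss distance counter table might look like the sketch below. The slide only says that the counter value is compared to a threshold, so the update rule used here (count correct predictions, reset to zero on a misprediction), the table size, the threshold, and the names `MissDistanceEstimator` and `classify` are assumptions for illustration.

```python
class MissDistanceEstimator:
    def __init__(self, entries=1024, threshold=3, max_count=15):
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count          # counters saturate at this value
        self.mdc = [0] * entries            # one miss distance counter per entry

    def high_confidence(self, pc):
        """High confidence if the counter is above the threshold."""
        return self.mdc[pc % self.entries] > self.threshold

    def update(self, pc, prediction_correct):
        i = pc % self.entries
        if prediction_correct:
            self.mdc[i] = min(self.mdc[i] + 1, self.max_count)
        else:
            self.mdc[i] = 0                 # a miss resets the distance


def classify(estimator, pc, prediction_correct):
    """Label one prediction with one of the four classes C(HC), C(LC), I(HC), I(LC)."""
    confident = estimator.high_confidence(pc)
    estimator.update(pc, prediction_correct)
    outcome = "C" if prediction_correct else "I"
    return outcome + ("(HC)" if confident else "(LC)")


estimator = MissDistanceEstimator(threshold=3)
outcomes = [True, True, True, True, True, False, True]
print([classify(estimator, 0x80, ok) for ok in outcomes])
```

A good estimator keeps the I(HC) and C(LC) classes small; the single I(HC) in this run shows the cost of a misprediction that arrives after a streak of correct ones.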

  7. Predicated Instructions
     • The ISA provides predicated (conditional) instructions and one or more predicate registers.
     • Predicated instructions use a predicate register as an additional input operand.
     • The Boolean result of a condition test is recorded in a (one-bit) predicate register.
     • Predicated instructions are fetched, decoded, and placed in the instruction window like non-predicated instructions.
     • How far a predicated instruction proceeds speculatively in the pipeline before its predicate is resolved depends on the processor architecture:
        • A predicated instruction executes only if its predicate is true; otherwise the instruction is discarded. In this case predicated instructions are not executed before the predicate is resolved.
        • Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed but commits only if the predicate is true; otherwise the result is discarded.

  8. Predication Example
     Code with branch b1:
         if (x == 0) {              /* branch b1 */
             a = b + c;
             d = e - f;
         }
         g = h * i;                 /* instruction independent of branch b1 */
     Predicated form:
         Pred = (x == 0);           /* branch b1: Pred is set to true if x equals 0 */
         if Pred then a = b + c;    /* the operations are only performed */
         if Pred then d = e - f;    /* if Pred is set to true */
         g = h * i;
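The equivalence of the two forms can be checked with a small model. This sketch imitates the "execute, then commit only if the predicate is true" variant: both results are always computed, and the predicate only selects whether they are written back. Python's conditional expression stands in for a hardware select and is not itself branch-free; the function names are illustrative.

```python
def with_branch(x, a, b, c, d, e, f, h, i):
    if x == 0:                      # branch b1
        a = b + c
        d = e - f
    g = h * i                       # independent of branch b1
    return a, d, g


def predicated(x, a, b, c, d, e, f, h, i):
    pred = (x == 0)                 # predicate register
    new_a = b + c                   # executed regardless of pred
    new_d = e - f
    a = new_a if pred else a        # result committed only if pred is true
    d = new_d if pred else d
    g = h * i
    return a, d, g


for x in (0, 1):
    args = (x, 1, 2, 3, 9, 10, 6, 7, 8)
    print(x, with_branch(*args), predicated(*args))
```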

  9. Predication
     • Predication can eliminate a branch and therefore the associated branch prediction → the distance between mispredictions increases.
     • The run length of a code block is increased → better compiler scheduling.
     • Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.
     • Predicated instructions that are discarded still consume processor resources, especially fetch bandwidth.
     • Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body.
     • The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.

  10. Eager (Multipath) Execution
     • Execution proceeds down both paths of a branch, and no prediction is made.
     • When a branch resolves, all operations on the non-taken path are discarded.
     • Oracle execution is eager execution with unlimited resources; it gives the same theoretical maximum performance as perfect branch prediction.
     • With limited resources, the eager execution strategy must be employed carefully.
     • A mechanism is required that decides when to employ prediction and when to employ eager execution, e.g. a confidence estimator.
     • Rarely implemented (IBM mainframes), but explored in some research projects: Dansoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution.

  11. Figure (tree diagrams not reproduced): (a) single-path speculative execution, (b) full eager execution, (c) disjoint eager execution. Edge labels are cumulative path probabilities (e.g. .7, .49, .34, .24 along the predicted path and .3, .21, .15, .10 off it); the node numbers 1 to 6 give the order in which paths are executed.
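The selection rule behind disjoint eager execution can be sketched as a greedy search: always spend the next execution resource on the unexecuted path with the highest cumulative probability. The prediction accuracy of 0.7 matches the probabilities that survive from the figure; the limit of six resources and the function name are assumptions for illustration.

```python
import heapq


def disjoint_eager_order(accuracy, resources):
    """Return (path, cumulative probability) pairs in the order they are chosen.

    In a path string, 'P' follows the predicted direction of a branch
    and 'N' follows the non-predicted direction.
    """
    # heapq is a min-heap, so probabilities are stored negated.
    frontier = [(-accuracy, "P"), (-(1.0 - accuracy), "N")]
    heapq.heapify(frontier)
    chosen = []
    while frontier and len(chosen) < resources:
        neg_prob, path = heapq.heappop(frontier)
        prob = -neg_prob
        chosen.append((path, prob))
        # Both successors of the chosen path become candidates.
        heapq.heappush(frontier, (-(prob * accuracy), path + "P"))
        heapq.heappush(frontier, (-(prob * (1.0 - accuracy)), path + "N"))
    return chosen


for path, prob in disjoint_eager_order(0.7, 6):
    print(f"{path:5s} {prob:.2f}")
```

With these inputs the predicted path is followed three levels deep before the first non-predicted path (probability .3) becomes the most likely remaining candidate, which is the behaviour that distinguishes disjoint eager execution from both single-path speculation and full eager execution.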

  12. 4.3.5 Prediction of Indirect Branches
     • Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately.
     • Indirect branches occur frequently in machine code compiled from programs written in object-oriented languages such as C++ and Java.
     • One simple solution is to extend the PHT entries to include the branch target addresses.
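One way to picture a history-indexed table that stores target addresses is the sketch below. The slide gives no details, so everything here is an assumption made for illustration: the table is indexed by the branch address XORed with a hash of recently seen targets, each entry holds the last target observed for that index, and the class name, table size, and hash function are hypothetical.

```python
class IndirectTargetPredictor:
    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        self.targets = {}       # table index -> last target seen for that index
        self.history = 0        # hash of recent target addresses

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Return the predicted target address, or None if the entry is empty."""
        return self.targets.get(self._index(pc))

    def update(self, pc, target):
        self.targets[self._index(pc)] = target
        # Shift in some low-order bits of the new target.
        self.history = ((self.history << 4) ^ (target >> 2)) & self.mask


# A virtual call site that alternates between two method bodies.
predictor = IndirectTargetPredictor()
hits = []
for n in range(20):
    target = 0x1004 if n % 2 == 0 else 0x2008
    hits.append(predictor.predict(0x10) == target)
    predictor.update(0x10, target)
print(hits)
```

A table that remembered only the last target per branch address would mispredict every call in this alternating pattern; indexing with history lets the two cases occupy different entries.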

  13. Branch handling techniques and implementations

     Technique                              Implementation examples
     -------------------------------------  ----------------------------------------------------------
     No branch prediction                   Intel 8086
     Static prediction:
       always not taken                     Intel i486
       always taken                         Sun SuperSPARC
       backward taken, forward not taken    HP PA-7x00
       semistatic with profiling            early PowerPCs
     Dynamic prediction:
       1-bit                                DEC Alpha 21064, AMD K5
       2-bit                                PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
       two-level adaptive                   Intel PentiumPro, Pentium II, AMD K6, Athlon
     Hybrid prediction                      DEC Alpha 21264
     Predication                            Intel/HP Merced and most signal processors, e.g. ARM
                                            processors, TI TMS320C6201, and many others
     Eager execution (limited)              IBM mainframes: IBM 360/91, IBM 3090
     Disjoint eager execution               none yet

  14. High-Bandwidth Branch Prediction
     • Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle,
        • e.g. the GAg predictor is independent of the branch address.
     • When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, which complicates I-cache access.
        • Possible solution: a trace cache in combination with next-trace prediction.
     • Most likely a combination of branch handling techniques will be applied,
        • e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.

  15. The Intel P5 and P6 family P5 P6 NetBurst including L2 cache

  16. Micro-Dataflow in PentiumPro 1995
     • "... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ..."
     • R. P. Colwell, R. L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.

  17. PentiumPro and Pentium II/III
     • The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of the P6 family.
     • This three-way superscalar, pipelined microarchitecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages.
     • The Pentium II/III processor has twelve stages with a pipestage time 33 percent less than that of the Pentium processor, which helps achieve a higher clock rate on any given manufacturing process.
     • A wide instruction window is provided by an instruction pool.
     • Optimized scheduling requires the fundamental "execute" phase to be replaced by decoupled "issue/execute" and "retire" phases. This allows instructions to be started in any order but always retired in the original program order.
     • Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.

  18. Pentium® Pro Processor and Pentium II/III Microarchitecture

  19. Pentium II/III block diagram (figure not reproduced). Labeled units: External Bus, L2 Cache, Bus Interface Unit, Memory Reorder Buffer, D-cache Unit, Memory Interface Unit, Instruction Fetch Unit (with I-cache), Branch Target Buffer, Instruction Decode Unit, Microcode Instruction Sequencer, Register Alias Table, Reservation Station Unit, Functional Units, Reorder Buffer & Retirement Register File.

  20. Pentium II/III: The In-Order Section
     • The instruction fetch unit (IFU) accesses a non-blocking I-cache and contains the Next IP unit.
     • The Next IP unit provides the I-cache index (based on inputs from the BTB), trap/interrupt status, and branch-misprediction indications from the integer FUs.
     • Branch prediction:
        • two-level adaptive scheme of Yeh and Patt,
        • the BTB contains 512 entries and maintains branch history information and the predicted branch target address.
     • Branch misprediction penalty: at least 11 cycles, on average 15 cycles.
     • The instruction decoder unit (IDU) is composed of three separate decoders.

  21. Pentium II/III: The In-Order Section (Continued)
     • A decoder breaks an IA-32 instruction down into µops, each comprising an opcode, two source operands, and one destination operand. These µops are of fixed length.
        • Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
        • some instructions are decoded into one to four µops (by the general decoder),
        • more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops.
     • The µops are sent to the register alias table (RAT), where register renaming is performed, i.e., the logical IA-32 register references are converted into references to physical registers.
     • Then, with added status information, the µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).

  22. The Fetch/Decode Unit (figure not reproduced): (a) in-order section, (b) instruction decoder unit (IDU). Labels: IA-32 instructions, Instruction Fetch Unit, Next_IP, I-cache, Alignment, Branch Target Buffer, Instruction Decode Unit (Simple Decoder, Simple Decoder, General Decoder), Microcode Instruction Sequencer, Register Alias Table, µop1, µop2, µop3.

  23. The Out-of-Order Execute Section
     • When the µops flow into the ROB, they effectively take a place in program order.
     • µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
     • µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program.
     • After completion, the result goes to two different places: the RSU and the ROB.
     • The RSU has five ports and can issue at a peak rate of 5 µops per cycle.

  24. Latencies and throughput for Pentium II/III FUs

  25. Issue/Execute Unit (figure not reproduced). The Reservation Station Unit issues through five ports, with results going to/from the Reorder Buffer:
     • Port 0: MMX Functional Unit, Floating-point Functional Unit, Integer Functional Unit
     • Port 1: MMX Functional Unit, Jump Functional Unit, Integer Functional Unit
     • Port 2: Load Functional Unit
     • Port 3: Store Functional Unit
     • Port 4: Store Functional Unit

  26. The In-Order Retire Section
     • A µop can be retired
        • if its execution is completed,
        • if it is its turn in program order,
        • and if no interrupt, trap, or misprediction occurred.
     • Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
     • Three µops can be retired per clock cycle.
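The three retirement conditions can be sketched with a queue that stands in for the reorder buffer. This is a simplified model: the entry fields and the function name are hypothetical, and an entry flagged with an exception simply blocks retirement here, whereas a real processor would flush and redirect.

```python
from collections import deque

RETIRE_WIDTH = 3    # µops retired per clock cycle, as stated on the slide


def retire_cycle(rob):
    """Retire up to RETIRE_WIDTH µops from the head of the reorder buffer."""
    retired = []
    while rob and len(retired) < RETIRE_WIDTH:
        head = rob[0]                       # oldest µop: its turn in program order
        if not head["completed"] or head["exception"]:
            break                           # younger µops must wait behind it
        retired.append(rob.popleft()["id"])
    return retired


completed = [True, True, False, True, True, True]
rob = deque({"id": n, "completed": c, "exception": False}
            for n, c in enumerate(completed))

print(retire_cycle(rob))    # µop 2 is not finished, so only 0 and 1 retire
rob[0]["completed"] = True  # µop 2 finishes execution
print(retire_cycle(rob))    # width limit: three µops retire
print(retire_cycle(rob))    # the last µop retires one cycle later
```

Although µops 3 to 5 finished before µop 2, none of them can retire until µop 2 does, which is what keeps the architectural state in program order.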

  27. Retire Unit (figure not reproduced). Labels: to/from D-cache, Memory Interface Unit, Reservation Station Unit, Retirement Register File, to/from Reorder Buffer.

  28. The Pentium II/III Pipeline (figure not reproduced)
     (a) BTB0, BTB1 (BTB access); IFU0, IFU1, IFU2 (I-cache access, fetch and predecode); IDU0, IDU1 (decode); RAT (register renaming); ROB read (reorder buffer read)
     (b) ROB read (reorder buffer read); RSU (reservation station, issue); Port 0 to Port 4 (execution and completion); ROB write (reorder buffer write-back)
     (c) ROB read (reorder buffer read); RRF (retirement)

  29. Pentium® Pro Processor Basic Execution Environment
     • General-purpose registers: eight 32-bit registers
     • Segment registers: six 16-bit registers
     • EFLAGS register: 32 bits
     • EIP (instruction pointer register): 32 bits
     • Address space*: 0 to 2^32 - 1
     * The address space can be flat or segmented.

  30. Application Programming Registers

  31. Pentium III

  32. Pentium II/III Summary and Offspring
     • Pentium III in 1999, initially at 450 MHz (0.25 micron technology), former name Katmai
     • two 32 kB caches, faster floating-point performance
     • Coppermine is a shrink of Pentium III down to 0.18 micron.

  33. Pentium 4
     • Was announced for mid-2000 under the code name Willamette
     • native IA-32 processor with Pentium III processor core
     • running at 1.5 GHz
     • 42 million transistors
     • 0.18 µm
     • 20 pipeline stages (integer pipeline), IF and ID not included
     • trace execution cache (TEC) for the decoded µops
     • NetBurst micro-architecture

  34. Pentium 4 Features
     Rapid Execution Engine:
     • Intel: "Arithmetic Logic Units (ALUs) run at twice the processor frequency"
     • Fact: two ALUs running at processor frequency, connected with a multiplexer running at twice the processor frequency
     Hyper Pipelined Technology:
     • Twenty-stage pipeline to enable high clock rates
     • Frequency headroom and performance scalability

  35. Advanced Dynamic Execution
     • Very deep, out-of-order, speculative execution engine
        • Up to 126 instructions in flight (3 times more than the Pentium III processor)
        • Up to 48 loads and 24 stores in the pipeline (2 times more than the Pentium III processor)
     • Branch prediction
        • based on µops
        • 4K-entry branch target array (8 times larger than in the Pentium III processor)
        • new algorithm (not specified) that reduces mispredictions by about one third compared to the gshare predictor of the P6 generation

  36. First-Level Caches
     • 12k µop Execution Trace Cache (~100 k)
     • The Execution Trace Cache removes decoder latency from main execution loops
     • The Execution Trace Cache integrates the path of program execution flow into a single line
     • Low-latency 8 kByte data cache with 2-cycle latency

  37. Second-Level Caches
     • Included on the die
     • Size: 256 kB
     • Full-speed, unified, 8-way second-level on-die Advanced Transfer Cache
     • 256-bit data bus to the level 2 cache
     • Delivers ~45 GB/s data throughput (at 1.4 GHz processor frequency)
     • Bandwidth and performance increase with processor frequency

  38. NetBurst Micro-Architecture

  39. Streaming SIMD Extensions 2 (SSE2) Technology
     • SSE2 extends MMX and SSE technology with 144 new instructions, which include support for:
        • 128-bit SIMD integer arithmetic operations,
        • 128-bit SIMD double-precision floating-point operations,
        • cache and memory management operations.
     • Further enhances and accelerates video, speech, encryption, image, and photo processing.

  40. 400 MHz Intel NetBurst Micro-Architecture System Bus
     • Provides 3.2 GB/s throughput (3 times faster than the Pentium III processor).
     • Quad-pumped 100 MHz scalable bus clock to achieve 400 MHz effective speed.
     • Split-transaction, deeply pipelined.
     • 128-byte lines with 64-byte accesses.
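The quoted throughput can be reproduced with a short calculation of peak bandwidth as clock rate times transfers per clock times transfer width. The 8-byte (64-bit) data bus width is an assumption, since the slide does not state it; the function name is illustrative.

```python
def peak_bandwidth_gb_s(clock_mhz, transfers_per_clock, width_bytes):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return clock_mhz * 1e6 * transfers_per_clock * width_bytes / 1e9


# System bus: 100 MHz clock, quad-pumped, assumed 8 bytes wide.
print(peak_bandwidth_gb_s(100, 4, 8))

# The same formula applied to the 256-bit (32-byte) L2 cache bus at 1.4 GHz
# gives the "~45 GB/s" quoted for the second-level cache.
print(peak_bandwidth_gb_s(1400, 1, 32))
```

These are peak figures; sustained throughput depends on how well transactions can be pipelined.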

  41. Pentium 4 data types

  42. Pentium 4

  43. Pentium 4 Offspring
     Foster
     • Pentium 4 with external L3 cache and DDR-SDRAM support
     • intended for servers
     • clock rate 1.7 - 2 GHz
     • to be launched in Q2/2001
     Northwood
     • 0.13 µm process technology
     • new 478-pin socket
