The Microarchitecture Of The Pentium 4 Processor Presented By David Phillips CSE 520 The “Netburst” Microarchitecture
Pentium 4 Features * Deep instruction pipeline. * Industry Leading Clock Frequency (in the year 2000) of 1.5 GHz. * Specialized Low Latency L1 “Trace” Cache. * Improved Branch Prediction Algorithm. * “Double Pumped” ALUs that enable “bursts” of instruction issues that exceed front-end bandwidth. * Out Of Order Execution. * Speculative “replay” execution logic.
Background The Challenge: More MEGAHERTZ The Goal: Achieve Greater CPU Performance than the previous generation (and clock higher than competing CPUs). * Increase the clock frequency of the chip. Which element of the CPU imposes boundaries on clock frequency? * Pipeline Depth * Why? As clock frequency increases there is less time for each stage to complete its task. Longer stages must be divided into shorter tasks that can be completed within each (shorter) clock cycle. Effect: More stages = Deeper Pipeline
But doesn't increasing the pipeline depth have disadvantages? Yes. * When branches are predicted incorrectly, more CPU cycles are flushed (lost). * When a cache miss occurs, more stages are stalled. * Deeper buffering of data is required which adds overhead and complexity. Then how does performance increase with all this added complexity and overhead? * Since each pipeline stage is shorter (in the deeper pipeline), the CPU can run at a faster clock rate. * The increased speed of the processor creates an overall net-increase in performance even though the added complexity imposes some overhead. Important Consequence * Increasing pipeline depth without increasing clock frequency will cause performance loss! * Increasing CPU frequency by 50% may achieve only 30% overall performance gain.
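The 50% / 30% figure above can be checked with a little arithmetic. The sketch below assumes the usual performance model (execution time = instruction count × cycles per instruction / clock frequency); the function name is illustrative and not from the presentation.

```python
def implied_cpi_growth(freq_gain: float, perf_gain: float) -> float:
    """Factor by which average cycles-per-instruction (CPI) must have grown.

    Assumed model: time = instructions * CPI / frequency,
    so perf_gain = freq_gain / cpi_growth.
    """
    return freq_gain / perf_gain


# Numbers from the slide: a 1.5x clock yields only a 1.3x speedup.
growth = implied_cpi_growth(freq_gain=1.5, perf_gain=1.3)
print(f"CPI grew by about {100 * (growth - 1):.0f}%")
```

Under this model, the deeper pipeline costs roughly 15% more cycles per instruction, which the higher clock more than pays back. The same formula shows why deepening the pipeline without raising the clock (freq_gain = 1.0) can only lose performance.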
* The Pentium 4 was designed to run at 1.5x the clock speed of the Pentium III. How? * The depth of the CPU pipeline was doubled from 10 stages to 20. * Later models of the Pentium 4 increased the pipeline depth to 31 stages. * The longer CPU pipeline enables the processor to run at a much higher clock rate.
In-Order Front End * Fetches undecoded IA32 instructions from the L2 cache. * Decodes fetched instructions into lower level micro-instructions. * The decoded micro-instructions are written to the L1 “trace cache” and placed into a First-In-First-Out queue for the Out-Of-Order-Execution logic. * Performs branch prediction as instructions are decoded. The predicted branch target is then fetched from the L2 cache. * Contains the Micro-code ROM which stores the micro-operations for complex IA32 instructions that are decoded into 4 or more micro-operations.
The L1 Trace Cache Purpose * Stores the decoded micro-operations. * Branch targets are stored in the same cache line as the branch source. * During the instruction fetch stage, the trace cache is checked first. - If there is a hit, the micro-operations are retrieved, which saves a decode step. - If there is a miss, the instruction has to be fetched from the L2 cache and decoded all over again. Why Bother? The IA32 instructions have variable widths and many types of options, which makes them cumbersome to decode. By storing the decoded instructions in the trace cache, the instructions do not need to be re-decoded the next time around. A branch mis-prediction suffers less of a penalty if the “correct” path trace is still in the cache and can be quickly retrieved to bypass the decoder.
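The hit/miss behaviour described above can be sketched as a small lookup table. This is a toy model under stated assumptions: the class name, the fake decoder, and the absence of any capacity limit or eviction policy are all illustrative, not details of the real trace cache.

```python
from typing import Callable, Dict, List


class TraceCache:
    """Toy model: maps a fetch address to already-decoded micro-ops."""

    def __init__(self, decode: Callable[[int], List[str]]) -> None:
        self._decode = decode              # stand-in for "fetch from L2 and decode"
        self._lines: Dict[int, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, address: int) -> List[str]:
        if address in self._lines:         # hit: the decoder is bypassed
            self.hits += 1
            return self._lines[address]
        self.misses += 1                   # miss: decode, then remember the result
        uops = self._decode(address)
        self._lines[address] = uops
        return uops


def fake_decode(address: int) -> List[str]:
    # Hypothetical decoder: invents two micro-ops per address.
    return [f"uop{address:#x}.0", f"uop{address:#x}.1"]


tc = TraceCache(fake_decode)
for addr in (0x100, 0x104, 0x100, 0x104, 0x100):   # a small loop body
    tc.fetch(addr)
print(tc.hits, tc.misses)
```

In this run only the first pass through the loop pays the decode cost; every later iteration is served from the cache.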
Out Of Order Execution Logic Purpose * Allow instructions to begin execution as soon as their operands are ready. * While one instruction is stalled, subsequent independent instructions are allowed to continue down the pipeline without waiting for the previous instruction to be ready. * Instructions that continue down the pipeline are considered to be “in-flight”. The registers/memory that each instruction modifies are not committed to permanent state until the instruction is “retired” in order to ensure that instructions commit their results in original program order. * Concept: Instructions execute OUT-OF-ORDER, but commit their results IN-ORDER. * Out-Of-Order Execution allows the processor to keep its functional units busy even if an instruction is fetched that must wait. Independent instructions following the stalled instruction can go around the stalled instruction and execute. This increases overall instruction bandwidth which yields faster performance due to the increased instruction level parallelism. * The Pentium 4 has enough buffering to allow 126 instructions, 48 loads, and 24 stores to be “in flight” at one time.
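The "execute out of order, commit in order" rule can be illustrated with a short simulation. This is a minimal sketch: the instruction names and finish cycles are hypothetical, and the model ignores retirement width and every other limit of the real hardware.

```python
from collections import deque


def simulate(program, finish_cycle):
    """program: instruction names in program order.
    finish_cycle: name -> cycle in which that instruction finishes executing.
    Returns (completion_order, retirement), where retirement is a list of
    (cycle, name) pairs in the order results are committed."""
    completion_order = sorted(
        program, key=lambda name: (finish_cycle[name], program.index(name))
    )
    in_flight = deque(program)      # in-flight instructions, oldest first
    retirement = []
    cycle = 0
    while in_flight:
        # Only the oldest instruction may retire; finished younger
        # instructions wait behind it.
        while in_flight and finish_cycle[in_flight[0]] <= cycle:
            retirement.append((cycle, in_flight.popleft()))
        cycle += 1
    return completion_order, retirement


program = ["load", "add", "mul", "sub"]
finish = {"load": 5, "add": 6, "mul": 1, "sub": 2}   # hypothetical timings
completed, retired = simulate(program, finish)
print(completed)
print(retired)
```

Here `mul` and `sub` finish long before the stalled `load`, yet they commit only after `load` and `add` have retired, so the committed state always matches original program order.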
Out Of Order Execution Instruction Scheduler * Preserves program order. * Maintains instruction dependencies by stalling dependent instructions until their operands are ready. * Consists of two separate queues: the Memory Operations Queue (Load/Store) and the Non-Memory Operations Queue (ADD, SUB, etc). The queues can be read out of order with respect to each other. * The schedulers send instructions from the queues to their intended execution units via four separate dispatch ports.
Out Of Order Execution Dispatch Ports * 1 Port For LOADS (Can schedule 1 LOAD per cycle) * 1 Port For STORES (Can schedule 1 STORE per cycle) * 2 Ports for ALU operations (Each can schedule 2 instructions per cycle) * Considering the best case scenario, 6 instructions can be scheduled each cycle (1 LOAD, 1 STORE, 4 ALU instructions), which hints at the origin of the name “Netburst”. * The front end trace cache can deliver 3 instructions per cycle. However, the schedulers can sometimes dispatch up to 6 instructions per cycle. This behavior is described as periodic “bursts” of instructions that exceed the front end bandwidth.
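The "burst" behaviour follows from the port arithmetic above: 1 + 1 + 2 × 2 = 6 dispatch slots per cycle against 3 micro-ops per cycle from the front end. The sketch below is a simplified model that assumes the queued micro-ops always have the ideal mix for the ports; the stall length is a made-up input.

```python
FRONT_END_WIDTH = 3     # micro-ops per cycle delivered by the trace cache
PORT_WIDTHS = {"load": 1, "store": 1, "alu0": 2, "alu1": 2}
PEAK_DISPATCH = sum(PORT_WIDTHS.values())   # best case: 6 per cycle


def dispatch_per_cycle(stalled_cycles: int, total_cycles: int) -> list:
    """Micro-ops dispatched in each cycle.

    While the schedulers are stalled, micro-ops pile up in the queue;
    afterwards the backlog drains at the peak dispatch rate.
    """
    queue = 0
    dispatched = []
    for cycle in range(total_cycles):
        queue += FRONT_END_WIDTH
        if cycle < stalled_cycles:
            dispatched.append(0)
        else:
            n = min(queue, PEAK_DISPATCH)
            queue -= n
            dispatched.append(n)
    return dispatched


trace = dispatch_per_cycle(stalled_cycles=2, total_cycles=6)
print(trace)
```

After a two-cycle stall the model dispatches 6 per cycle until the backlog is gone, then settles back to the 3 per cycle the front end can sustain. The burst is only ever a catch-up; the long-run rate is still bounded by the front end.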
Out Of Order Execution Logic Main Components Re-Order Buffer (ROB) * Contains an entry for each in-flight instruction. * Each entry stores the status of the instruction, which is used to track and commit the instruction in original program order. Register Alias Table (RAT) * The Pentium 4 performs register renaming to remove false dependencies between instructions. This table keeps track of the most recent rename information. * Example:
//Before registers are renamed
    mov EAX, 12
    add EAX, EBX
    mov EAX, 13
    add EAX, EBX
Register renaming removes false dependencies to allow instructions to run in parallel. After renaming, the second mov/add pair no longer reuses the register written by the first pair. The RAT tracks the most recent register renamed from EAX and EBX:
    mov $t1, 12    ; RAT [EAX] = $t1
    add $t1, $t2   ; RAT [EBX] = $t2
    mov $t3, 13    ; RAT [EAX] = $t3
    add $t3, $t2   ; RAT [EBX] is still $t2 (EBX is only read)
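The renaming idea can be sketched in a few lines. Unlike the simplified two-operand listing on the slide, this sketch gives every write a fresh register (including the result of each add), so its register numbers differ from the slide's. The `$tN` names and the tuple encoding of instructions are illustrative assumptions.

```python
from itertools import count


def rename(program):
    """program: list of (op, dest, src); src is a register name or an int.
    'add' also reads its destination register.
    Returns a list of (op, new_dest, sources) using renamed registers."""
    rat = {}                                  # architectural -> current physical
    fresh = (f"$t{i}" for i in count(1))      # hypothetical physical registers

    def lookup(reg):
        # First read of a register: give its existing value a physical home.
        if reg not in rat:
            rat[reg] = next(fresh)
        return rat[reg]

    renamed = []
    for op, dest, src in program:
        sources = [lookup(dest)] if op == "add" else []
        sources.append(lookup(src) if isinstance(src, str) else src)
        rat[dest] = next(fresh)               # every write gets a new register
        renamed.append((op, rat[dest], tuple(sources)))
    return renamed


program = [("mov", "EAX", 12), ("add", "EAX", "EBX"),
           ("mov", "EAX", 13), ("add", "EAX", "EBX")]
result = rename(program)
for line in result:
    print(line)
```

The second mov/add pair reads and writes none of the registers written by the first pair, so the false dependency on EAX is gone; the only shared register is the one holding EBX, which both adds merely read.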
Execution Unit “Double Pumped” Arithmetic And Logic Units * The ALUs for the common operations run at twice the speed of the main clock. * Allows the ALUs to execute two instructions PER CYCLE. Separate Register Files * Integer and Floating Point both have their own 128-entry register file. Low Latency L1 Data Cache * Performs one load and one store per clock cycle. A loaded value typically takes 2 cycles (Integer) to 6 cycles (Floating Point) to become available. * New “speculative” access algorithm assumes that most accesses hit in the L1 cache. * “Replay” * The instruction scheduler dispatches dependent operations before the loaded value is available. If the L1 hits, the value will be ready by the time the dependent operation uses it. If the L1 misses, the dependent operation will have used an invalid value and must be canceled and re-executed (or “replayed”). Store To Load Forwarding * Before stores are committed to permanent machine state (L1 cache), they are kept in a “store buffer”. * If a load depends on an earlier store that has not yet been committed, the store buffer forwards the value to the load in order to prevent a stall.
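Store-to-load forwarding can be sketched as a small buffer that is searched before the cache. This is a toy model: the class and method names are assumptions, and it ignores access sizes, partial overlaps, and the buffer's capacity.

```python
class StoreBuffer:
    """Holds stores that have executed but not yet committed to the cache."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory     # stands in for committed state (L1 data cache)
        self.pending = []        # (address, value) pairs, oldest first

    def store(self, address: int, value: int) -> None:
        self.pending.append((address, value))

    def load(self, address: int):
        # Search youngest-first so the load sees the most recent store.
        for addr, value in reversed(self.pending):
            if addr == address:
                return value, "forwarded"
        return self.memory.get(address, 0), "cache"

    def commit_oldest(self) -> None:
        addr, value = self.pending.pop(0)
        self.memory[addr] = value


sb = StoreBuffer(memory={0x10: 1})
sb.store(0x10, 42)              # executed, but not yet committed
print(sb.load(0x10))            # served from the store buffer
print(sb.load(0x20))            # no pending store: read the cache
sb.commit_oldest()
print(sb.load(0x10))            # now read from committed state
```

The load of `0x10` gets the new value without waiting for the store to commit, which is the stall the slide says forwarding avoids.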
Memory Subsystem L2 Cache * Stores (non-decoded) instructions that miss in the L1 Trace Cache. * Stores data that misses in the L1 Data Cache. * 7 cycles for a loaded value to become available. Hardware Pre-fetcher * Monitors the memory accesses made by the CPU to predict which locations will be needed in the future. * Tries to stay 256 bytes ahead of the current data access location. - Speeds up instruction fetches, array data accesses, etc.
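The "stay 256 bytes ahead" policy can be illustrated with a simple forward-streaming model. This is only a sketch: the 64-byte line size is an assumption chosen for illustration, the class name is hypothetical, and the model skips the pattern detection a real prefetcher performs before it starts fetching.

```python
PREFETCH_DISTANCE = 256   # bytes ahead of the current access (from the slide)
LINE_SIZE = 64            # assumed cache-line size, for illustration only


class StreamPrefetcher:
    """Requests every line that starts within PREFETCH_DISTANCE bytes ahead."""

    def __init__(self) -> None:
        self.requested = set()   # line start addresses already requested

    def access(self, address: int) -> list:
        """Record a demand access; return the line addresses newly prefetched."""
        current_line = address - address % LINE_SIZE
        issued = []
        line = current_line + LINE_SIZE
        while line <= address + PREFETCH_DISTANCE:
            if line not in self.requested:
                self.requested.add(line)
                issued.append(line)
            line += LINE_SIZE
        return issued


pf = StreamPrefetcher()
first = pf.access(0)       # cold start: fill the whole look-ahead window
second = pf.access(64)     # steady state: one new line per line consumed
print(first, second)
```

Once the window is full, a sequential walk through an array triggers one new request per line consumed, so the data is already on its way by the time the program reaches it.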
Critique * Much of the focus of this chip was to increase the clock frequency due to the marketing strategy of “more MHz” based upon the “Megahertz Myth”. * No discussion of the power density or thermal management of the chip. This chip created much more heat than competing processors (due to the increased clock frequency). * Intel had to abandon this architecture in later generations due to the amount of heat that was generated by this chip. * Benchmark results: the greatly improved benchmark figures depended heavily on code that was specifically optimized for this CPU. There is no mention of how much performance gain legacy applications should expect (without being re-compiled). The paper cherry-picks the benchmarks that it presents. * When similar benchmarks were performed with legacy applications that had many branches and floating point calculations, the Pentium 4 typically matched (and sometimes performed worse than) the fastest Pentium III.
Class Term Project Assembly code optimization techniques for AMD64 Athlon/Opteron Processor. Applying optimization techniques to assembly code fragments and analyzing their computation times with performance tools to determine which optimizations provide the most significant performance gain. The goal is to better understand the considerations that compilers must take into account when generating code for a specific architecture.
Thank you. Questions?