
The anatomy of a modern superscalar processor



Presentation Transcript


  1. The anatomy of a modern superscalar processor Constantinos Kourouyiannis Madhava Rao Andagunda

  2. Outline • Introduction • Microarchitecture • Alpha 21264 processor • Sim-alpha simulator • Out of order execution • Prediction-Speculation

  3. Introduction • Superscalar processing is the ability to initiate multiple instructions during the same cycle. • It aims at producing ever-faster microprocessors. • A typical superscalar processor fetches and decodes several instructions at a time. Instructions are executed in parallel based on the availability of their operand data rather than their original program sequence. Upon completion, instructions are re-sequenced so that they update the process state in the correct program order.

  4. Outline • Introduction • Microarchitecture • Alpha 21264 processor • Sim-alpha simulator • Out of order execution • Prediction-Speculation

  5. Microarchitecture • Instruction Fetch and Branch Prediction • Decode and Register Dependence Analysis • Issue and Execution • Memory Operation Analysis and Execution • Instruction Reorder and Commit

  6. Organization of a superscalar processor

  7. Instruction Fetch and Branch Prediction • The fetch phase supplies instructions to the rest of the processing pipeline. • An instruction cache is used to reduce the latency and increase the bandwidth of the instruction fetch process. • The PC is used to search the cache contents to determine whether the instruction being addressed is present in one of the cache lines. • In a superscalar implementation, the fetch phase fetches multiple instructions per cycle from the cache.
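The lookup step above can be sketched as splitting the PC into a tag, a set index, and a byte offset. The 64 KB / 2-way geometry below is borrowed from the 21264's caches described later; the 64-byte line size is an assumption for illustration only.

```python
LINE_SIZE = 64          # bytes per cache line (assumed, not from the slides)
NUM_WAYS = 2            # 2-way set associative
CACHE_SIZE = 64 * 1024  # 64 KB total
NUM_SETS = CACHE_SIZE // (LINE_SIZE * NUM_WAYS)  # 512 sets

def split_address(pc):
    """Split a fetch PC into (tag, set index, byte offset)."""
    offset = pc % LINE_SIZE
    index = (pc // LINE_SIZE) % NUM_SETS
    tag = pc // (LINE_SIZE * NUM_SETS)
    return tag, index, offset

def lookup(cache, pc):
    """cache maps a set index to the list of tags resident in that set
    (at most one per way); a hit means the tag is already present."""
    tag, index, _ = split_address(pc)
    return tag in cache.get(index, [])
```

On a hit, the whole line is available, which is what lets a superscalar fetch unit pull several adjacent instructions out in one cycle.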

  8. Branch Instructions • Recognizing conditional branches • Decode information (extra bits) is held in the instruction cache with every instruction. • Determining the branch outcome • Branch prediction uses information about the past history of branch outcomes. • Computing the branch target • Usually an integer addition (PC + offset). • A Branch Target Buffer holds the target address used the last time the branch was executed. • Transferring control • If the branch is taken, there is at least one clock cycle of delay to recognize the branch, modify the PC, and fetch instructions from the target address.
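A minimal sketch of history-based outcome prediction is the classic two-bit saturating counter table. This is the generic textbook scheme, not the 21264's tournament predictor described later; the table size and PC-indexing are illustrative assumptions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits.
    Counter values: 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries  # start weakly not-taken

    def predict(self, pc):
        """Return True if the branch at `pc` is predicted taken."""
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        """Train on the resolved outcome, saturating at 0 and 3."""
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The two-bit hysteresis means a single anomalous outcome (e.g. a loop exit) does not immediately flip a well-established prediction.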

  9. Instruction Decode • Instructions are removed from the fetch buffers, examined, and data dependence linkages are set up. • Data dependences • True dependences: can cause a read-after-write (RAW) hazard. • Artificial dependences: can cause write-after-read (WAR) and write-after-write (WAW) hazards. • Hazards • RAW: occurs when a consuming instruction reads a value before the producing instruction writes it. • WAR: occurs when an instruction writes a value before a preceding instruction reads it. • WAW: occurs when multiple instructions update the same storage location, but not in the proper order.
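The three hazard classes above can be detected mechanically from register names alone. A small sketch, representing each instruction as a (destination, sources) pair:

```python
def classify_hazards(instructions):
    """instructions: list of (dest, [sources]) in program order.
    Returns (earlier_idx, later_idx, kind) for each hazard found."""
    hazards = []
    for i, (di, si) in enumerate(instructions):
        for j in range(i + 1, len(instructions)):
            dj, sj = instructions[j]
            if di in sj:
                hazards.append((i, j, "RAW"))  # later reads earlier's result
            if dj in si:
                hazards.append((i, j, "WAR"))  # later overwrites an earlier source
            if di == dj:
                hazards.append((i, j, "WAW"))  # both write the same register
    return hazards
```

Only the RAW pairs reflect real data flow; the WAR and WAW pairs are the artificial dependences that renaming (next slide) removes.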

  10. Instruction Decode (cont.) • Example of data hazards • For each instruction, the decode phase sets up the operation to be executed, the identities of the storage elements where its inputs reside, and the location where its result must be placed. • Artificial dependences are eliminated through register renaming.

  11. Instruction Issue and Parallel Execution • Run-time checking for the availability of data and resources. • An instruction is ready to execute as soon as its input operands are available. However, there are other constraints, such as execution units and register file ports. • An issue queue is responsible for holding instructions until their input operands are available. • Out-of-order execution: the ability to execute instructions not in program order, but as soon as their operands are ready.
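The wakeup-and-select step can be sketched as a pure function: scan the queue oldest-first and pick instructions whose sources are all ready, up to the machine's issue width. Structural constraints beyond issue width (units, register ports) are left out of this sketch.

```python
def issue_ready(queue, ready_regs, width):
    """queue: list of (name, sources) in program order.
    ready_regs: set of registers whose values are available.
    Returns up to `width` ready instructions, oldest first."""
    issued = []
    for name, srcs in queue:
        if len(issued) == width:
            break  # issue width exhausted this cycle
        if all(s in ready_regs for s in srcs):
            issued.append(name)  # ready: all operands available
    return issued
```

Note that a stalled instruction does not block younger ready ones behind it; that is precisely what makes this out-of-order issue rather than in-order issue.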

  12. Handling Memory Operations • For memory operations, the decode phase cannot identify the memory locations that will be accessed. • The determination of the memory location that will be accessed requires an address calculation, usually integer addition. • Once a valid address is obtained, the load or store operation is submitted to memory.

  13. Committing State • In this phase, the effects of an instruction are allowed to modify the logical process state. • The purpose of this phase is to preserve the appearance of a sequential execution model, even though the actual execution is not sequential. • Machine state is separated into physical and logical: the physical state is updated as operations complete, while the logical state is updated in sequential program order.
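In-order commit is usually implemented with a reorder buffer. A minimal sketch, assuming a FIFO of entries with a completion flag (the commit width of 4 is an illustrative assumption):

```python
def commit(rob, width=4):
    """Retire completed instructions strictly in program order.
    rob: list of {'name': ..., 'done': bool} in program order.
    Returns the names committed this cycle; stops at the first
    instruction that has not completed, even if later ones have."""
    committed = []
    while rob and rob[0]["done"] and len(committed) < width:
        committed.append(rob.pop(0)["name"])
    return committed
```

An instruction that completed out of order simply waits at its ROB slot; only when everything older has committed does its result become part of the logical state.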

  14. Outline • Introduction • Microarchitecture • Alpha 21264 processor • Sim-alpha simulator • Out of order execution • Prediction-Speculation

  15. Alpha 21264 (EV6)

  16. Instruction Fetch • Fetches 4 instructions per cycle • Large 64 KB 2-way associative instruction cache • Branch predictor: dynamically chooses between local and global history

  17. Register Renaming • Assignment of a unique storage location to each write reference to a register. • The allocated register becomes part of the architectural state only when the instruction commits. • Eliminates WAW and WAR dependences while preserving the RAW dependences necessary for correct computation. • Beyond the 64 architectural registers, 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement, supporting an 80-instruction in-flight window.
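A minimal sketch of the renaming step: a map table tracks the current physical register for each architectural register, and every write allocates a fresh one. (Free-list recycling at commit is omitted; the register naming scheme is illustrative.)

```python
def rename(instructions, num_arch_regs=32):
    """Rewrite (dest, [sources]) tuples so every write gets a fresh
    physical register, removing WAR/WAW while keeping RAW intact."""
    rat = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}  # map table
    next_phys = num_arch_regs
    out = []
    for dest, srcs in instructions:
        new_srcs = [rat[s] for s in srcs]  # read current mappings: RAW preserved
        rat[dest] = f"p{next_phys}"        # fresh physical register per write
        next_phys += 1
        out.append((rat[dest], new_srcs))
    return out
```

After renaming, two writes to the same architectural register land in different physical registers, so the WAW and WAR orderings no longer constrain issue.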

  18. Issue Queues • 20-entry integer queue • can issue 4 instructions per cycle • 15-entry floating-point queue • can issue 2 instructions per cycle • A list of pending instructions is kept, and each cycle the queues select instructions from this list as their input data become ready. • The queues issue instructions speculatively, and older instructions are given priority over newer ones in the queue. • An issue queue entry becomes available when the instruction issues or is squashed due to mis-speculation.

  19. Execution Engine • All execution units require access to the register file. • The register file is split into two clusters that contain duplicates of the 80-entry register file. • Two pipes access a single register file to form a cluster and the two clusters are combined to support 4-way integer execution. • Two floating point execution pipes are organized in a single cluster with a single 72-entry register file.

  20. Memory System • Supports in-flight memory references and out-of-order operation • Receives up to 2 memory operations from the integer execution pipes every cycle • 64 KB 2-way set associative data cache and direct mapped level-two cache (ranges from 1 to 16 MB) • 3-cycle latency for integer loads and 4 cycles for FP loads

  21. Store/Load Memory Ordering • The memory system supports the capabilities of out-of-order execution but maintains an in-order architectural memory model. • Results would be incorrect if a later load issued prior to an earlier store to the same address. • This RAW memory dependence cannot be handled by the rename logic, because the memory address is not known before instruction issue. • If a load is incorrectly issued before an earlier store to the same address, the 21264 trains the out-of-order execution core to avoid this on subsequent executions of the same load: it sets a bit in a load wait table that forces the issue point of the load to be delayed until all prior stores have issued.
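The load wait table mechanism can be sketched as a PC-indexed array of "wait" bits. The table size, indexing, and interface below are illustrative assumptions; only the train-on-violation / delay-behind-stores behavior comes from the slide.

```python
class LoadWaitTable:
    """Sketch of a PC-indexed table of wait bits. A load that once
    issued ahead of a conflicting older store gets its bit set, and
    is thereafter held until all prior stores have issued."""

    def __init__(self, entries=1024):
        self.bits = [False] * entries

    def train(self, load_pc):
        """Called when an ordering violation is detected for this load."""
        self.bits[load_pc % len(self.bits)] = True

    def must_wait(self, load_pc, prior_stores_pending):
        """A marked load waits only while older stores are still unissued."""
        return self.bits[load_pc % len(self.bits)] and prior_stores_pending
```

An unmarked load still issues aggressively; the table only penalizes loads that have actually misbehaved before, which keeps the common case fast.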

  22. Load Hit/Miss Prediction • To achieve the 3-cycle integer load hit latency, it is necessary to speculatively issue consumers of integer load data before knowing whether the load hit or missed in the data cache. • If the load eventually misses, two integer cycles are squashed, and all integer instructions that issued during those cycles are pulled back into the issue queue to be re-issued later. • The 21264 predicts when loads will miss and does not speculatively issue the consumers of the load in that case. Effective load latency: 5 cycles for an integer load hit that is incorrectly predicted to miss.
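The consumer-visible latencies the slide gives can be tabulated directly. Only the two hit cases are stated on the slide; actual miss latencies depend on the memory hierarchy and are left out of this sketch.

```python
def integer_load_latency(hit, predicted_hit):
    """Consumer-visible integer load latency for the cases on the
    slide. `hit`: did the load hit the data cache; `predicted_hit`:
    did the hit/miss predictor expect a hit."""
    if hit and predicted_hit:
        return 3      # consumers issued speculatively: the gamble pays off
    if hit and not predicted_hit:
        return 5      # consumers were held back: 2 extra cycles on a hit
    return None       # actual miss: hierarchy-dependent, not given here
```

The trade-off is visible in the numbers: predicting "miss" too eagerly turns 3-cycle hits into 5-cycle hits, while predicting "hit" too eagerly costs squashed issue cycles on real misses.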

  23. Outline • Introduction • Microarchitecture • Alpha 21264 processor • Sim-alpha simulator • Out of order execution • Prediction-Speculation

  24. Sim-alpha simulator • Sim-alpha is a simulator that models the Alpha 21264. • It models the implementation constraints and low-level features of the 21264. • It allows the user to vary the processor's parameters, such as fetch width, reorder buffer size, and issue queue sizes.

  25. Outline • Introduction • Microarchitecture • Alpha 21264 processor • Sim-alpha simulator • Out of order execution • Prediction-Speculation

  26. Out-Of-Order Execution • Why out-of-order execution? • In-order processors stall: the pipeline may not stay full because of the frequent stalls. • An out-of-order processor moves an instruction to execution as soon as it has no outstanding dependences, i.e., it allows the instructions that are ready to proceed.

  27. Out-Of-Order Execution • Theme: stall only on RAW hazards and structural hazards. • RAW hazard example: LD R4, 10(R5) followed by ADD R6, R4, R8 (the add must wait for the loaded value in R4). • Structural hazard: occurs because of resource conflicts. Example: if the CPU is designed with a single interface to memory, that interface is always used during instruction fetch (IF) and is also used in the MEM stage for load and store operations; when a load or store reaches the MEM stage, IF must stall.
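The payoff of issuing around a RAW stall can be shown with a toy cycle model. The slide's LD/ADD pair is extended with one invented independent instruction; the 1-wide in-order issue, the fixed 2-cycle latency, and the unlimited-width out-of-order model are all simplifying assumptions.

```python
def in_order_cycles(program, latency=2):
    """Completion time on a 1-wide in-order pipe that stalls until
    every source is ready. program: list of (dest, [sources])."""
    ready_at, issue, finish = {}, 0, 0
    for dest, srcs in program:
        start = max([issue] + [ready_at.get(s, 0) for s in srcs])
        ready_at[dest] = start + latency
        finish = max(finish, start + latency)
        issue = start + 1  # younger instructions cannot issue earlier
    return finish

def out_of_order_cycles(program, latency=2):
    """Same program, but each instruction issues as soon as its
    operands are ready (issue-width limits dropped for brevity):
    the completion time is just the dataflow critical path."""
    ready_at = {}
    for dest, srcs in program:
        start = max([0] + [ready_at.get(s, 0) for s in srcs])
        ready_at[dest] = start + latency
    return max(ready_at.values())
```

On the test program below, the independent third instruction slips past the stalled ADD, so the out-of-order model finishes a cycle earlier.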

  28. Outline • Introduction • Microarchitecture • Alpha 21264 processor • Sim-alpha simulator • Out of order execution • Prediction-Speculation

  29. Prediction & Speculation • Problem: serialization due to dependences. Waiting until the producing instruction executes leaves resources idle, but we need optimal utilization of resources. • Solution: predict the unknown information and allow the processor to proceed, assuming the predicted information is correct. • If the prediction is correct, proceed normally. • If not, squash the speculatively executed instructions, restore the status, and restart in the correct direction.
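The predict / proceed / squash-and-restore loop above can be sketched at a high level with an explicit checkpoint. All names here are illustrative; real hardware restores rename maps and flushes pipelines rather than copying dictionaries.

```python
import copy

def speculate(state, predicted_path, correct_path, execute):
    """Checkpoint the state, run the predicted path; on a
    misprediction, discard the speculative work by restoring the
    checkpoint, then run the correct path. `execute(state, path)`
    mutates state; the recovered state is returned."""
    checkpoint = copy.deepcopy(state)   # saved status for recovery
    execute(state, predicted_path)      # proceed assuming the prediction
    if predicted_path != correct_path:  # misprediction detected
        state = checkpoint              # squash: restore the saved status
        execute(state, correct_path)    # restart in the correct direction
    return state
```

The key property is that the returned state is identical whether the prediction was right on the first try or recovered after a squash; speculation changes performance, never the final result.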

  30. Taxonomy of Speculation • Predicting a branch outcome: two possibilities, so the worst case is 50% accuracy. • Predicting the value of a 32-bit register: 2^32 possibilities. Worst case: ?
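The worked numbers behind the slide's question: a blind guess over n equally likely outcomes is right 1/n of the time, which is why value speculation is so much harder than control speculation.

```python
branch_outcomes = 2                       # taken or not taken
value_outcomes = 2 ** 32                  # any 32-bit register value

worst_case_branch = 1 / branch_outcomes   # 0.5, i.e. 50% accuracy
worst_case_value = 1 / value_outcomes     # roughly 2.3e-10 accuracy
```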

  31. Prediction & Speculation • Control speculation • Current branch predictors are highly accurate. • Implemented in commercial processors. • Data speculation • A lot of research (value profiling, value prediction, etc.). • Not implemented in current processors. • So far it has had very little effect on current designs.
