
Out-of-Order Speculative Execution

Out-of-Order Speculative Execution: Designing a Configurable Simulator for an OOO Microprocessor. By Mustafa Imran Ali, ID# 230203. Presentation outline: Introduction; Examples - Representative Micro-architectures; Some Issues - Limitations and Other Approaches; Simulator Details.

Presentation Transcript


  1. Out-of-Order Speculative Execution: Designing a Configurable Simulator for an OOO Microprocessor. By Mustafa Imran Ali, ID# 230203

  2. Presentation Outline
      • Introduction
      • Examples - Representative Micro-architectures
      • Some Issues - Limitations and Other Approaches
      • Simulator Details
      COE 501 Presentation by Mustafa Imran Ali

  3. Out-of-order Speculative Execution – Maximizing ILP
      • In-order Execution
        • Pipelining – exploiting temporal parallelism through overlap
        • Superscalar – more parallelism by allowing multiple instructions to issue
      • Problem – Pipeline Stalls
        • Data dependencies allow limited ILP
        • Large latency functions cause structural hazards
        • Data loads - Cache miss stalls

  4. Out-of-order Speculative Execution
      • Instructions execute as soon as possible and in parallel with other nondependent work
      • This results in faster execution because critical-path computations start and complete quickly
      • The processor speculatively fetches and executes instructions even though it may not know immediately whether they will be on the final execution path
      • Multilevel branch prediction avoids waiting for the outcome of multiple branches

  5. OOO Speculative Execution - Benefits
      • Reduced reliance on compilers
        • Compilers cannot examine runtime dependencies
      • No need for recompilation
        • Source code access not always possible
        • Binary compatibility with existing code

  6. OOO Speculative Execution - Problems and Issues
      • Overcoming WAW and WAR hazards – Register Renaming
      • More branches/cycle – accurate branch prediction
      • Register Renaming – Dependency checking mechanism (large comparisons)
      • Data forwarding from producers to consumers – use of tagging and broadcast mechanism
      • Exceptions – Committing instructions in program order

  7. Compaq Alpha 21264 (1998)
      • OOO superscalar with speculative execution
      • Fetches 4 instructions/cycle
      • Dynamically issues up to 6 instructions/cycle: 4 integer and 2 floating point
      • Can speculate through up to 20 branches
      • 64 architectural registers
      • 41 integer + 41 floating point rename registers
      • Up to 80 instructions in-flight + 32 in-flight loads + 32 in-flight stores
      • 20-entry integer queue: issues 4 instructions
      • 15-entry floating point queue: issues 2 instructions
      • Can retire at most 11 instructions/cycle; can sustain a rate of 8/cycle (over short periods)

  8. Stages in Instruction Pipeline
      • Provides 4 instructions/cycle
      • Maps virtual registers to physical registers
      • Dynamically selects from up to 6 instructions – issue reordering takes place
      • All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers

  9. Register Renaming Process
      • Assigns a unique storage location to each write-reference to a register
      • Speculatively allocates a register to each instruction with a register result
      • The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits
      • Allows an instruction to speculatively issue and deposit its result into the register file before the instruction retires

  10. Register Renaming Process (continued)
      • The processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any)
      • Register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register
      • The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs
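The renaming steps on slides 9 and 10 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Alpha 21264 mapper: the class name `RenameMap`, the FIFO free list, and the register counts are hypothetical, and a direct-indexed table stands in for the CAM lookup the slide describes.

```python
from collections import deque


class RenameMap:
    """Toy register mapper: architectural -> physical register numbers."""

    def __init__(self, num_arch, num_phys):
        assert num_phys >= num_arch
        # Assumption: architectural register i starts in physical register i.
        self.map = list(range(num_arch))
        # The remaining physical registers are free for speculative results.
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (phys_dest, phys_srcs, stale)."""
        # Sources are looked up first so they only see older writers.
        phys_srcs = [self.map[s] for s in srcs]
        if dest is None:
            return None, phys_srcs, None
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        stale = self.map[dest]            # old mapping, reusable after retirement
        phys_dest = self.free.popleft()   # unique location for this write
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs, stale

    def retire(self, stale):
        """Once the instruction retires, the stale register can be reused."""
        if stale is not None:
            self.free.append(stale)


rm = RenameMap(num_arch=4, num_phys=8)
# r1 = r2 + r3 followed by r1 = r1 + r2: the two writes get distinct registers.
d1, s1, old1 = rm.rename(dest=1, srcs=[2, 3])
d2, s2, old2 = rm.rename(dest=1, srcs=[1, 2])
print(d1, s1, old1)
print(d2, s2, old2)
```

The second instruction reads the first one's new physical register, so the true dependence is kept while the WAW hazard on r1 disappears.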

  11. Map (register rename) and Queue Stages
      • The map stage renames programmer-visible register numbers to internal register numbers
      • Structures are duplicated for integer and floating point execution
      • The queue stage stores instructions until they are ready to issue

  12. Out-of-order Issue Queues
      • Issue queue logic maintains 2 lists of pending instructions in separate integer and floating-point queues
      • Scoreboards maintain the status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions
      • The scoreboard unit notifies all instructions in the queue that require the register value when functional unit or load-data results become available

  13. Out-of-order Execution
      • Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle
      • Queues are collapsible: an entry becomes immediately available once the instruction issues or is squashed due to misspeculation
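The oldest-first selection on slide 13 can be sketched as below. The names `QueueEntry` and `select`, the `age` field, and the functional-unit classes are assumptions made for illustration; a real arbiter does this in parallel hardware rather than with a sort.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    age: int               # smaller value = older instruction (fetch order)
    operands_ready: bool
    fu: str                # functional-unit class the instruction needs


def select(queue, free_fus, width):
    """Pick up to `width` of the oldest entries whose operands and FU are ready."""
    free = dict(free_fus)  # copy: FU class -> units still free this cycle
    issued = []
    for entry in sorted(queue, key=lambda e: e.age):
        if len(issued) == width:
            break
        if entry.operands_ready and free.get(entry.fu, 0) > 0:
            free[entry.fu] -= 1
            issued.append(entry)
    # Collapse the queue: slots of issued entries become available at once.
    issued_ids = {id(e) for e in issued}
    remaining = [e for e in queue if id(e) not in issued_ids]
    return issued, remaining


queue = [
    QueueEntry(0, False, "alu"),   # oldest, but still waiting on an operand
    QueueEntry(1, True, "alu"),
    QueueEntry(2, True, "mul"),
    QueueEntry(3, True, "alu"),    # ready, but the only ALU is already taken
]
issued, remaining = select(queue, {"alu": 1, "mul": 1}, width=4)
print([e.age for e in issued], [e.age for e in remaining])
```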

  14. Retire Mechanism
      • Assigns each mapped instruction a slot in a circular in-flight window (in fetch order)
      • Tracks the internal register usage for all in-flight instructions
      • Each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction
      • This (stale) register can be freed for other use after the instruction retires

  15. Exception Handling
      • An exception causes all younger instructions in the in-flight window to be squashed and removed from all queues in the system
      • The register map is backed up to the state before the last squashed instruction using the saved map state
      • Registers allocated by the squashed instructions become immediately available
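The recovery steps on slide 15 can be sketched as follows. The dictionary fields `seq`, `map_before`, and `phys_dest` are hypothetical, the window is assumed to be kept in fetch order, and "the state before the last squashed instruction" is read here as the map saved by the oldest squashed instruction.

```python
def squash_younger(window, rename_map, free_list, faulting_seq):
    """Squash every in-flight entry younger than `faulting_seq`.

    `window` is assumed to be in fetch order; each entry carries the rename
    map as it was just before that instruction was renamed.
    """
    kept = [e for e in window if e["seq"] <= faulting_seq]
    squashed = [e for e in window if e["seq"] > faulting_seq]
    if squashed:
        # Restore the map saved by the oldest squashed instruction.
        rename_map[:] = squashed[0]["map_before"]
        # Registers allocated by squashed instructions are free again.
        for e in squashed:
            if e["phys_dest"] is not None:
                free_list.append(e["phys_dest"])
    return kept


# Three in-flight instructions writing r1, r2 and r0 in that order.
window = [
    {"seq": 0, "map_before": [0, 1, 2], "phys_dest": 3},
    {"seq": 1, "map_before": [0, 3, 2], "phys_dest": 4},
    {"seq": 2, "map_before": [0, 3, 4], "phys_dest": 5},
]
rename_map = [5, 3, 4]   # map after all three have been renamed
free_list = [6, 7]
window = squash_younger(window, rename_map, free_list, faulting_seq=0)
print(rename_map, free_list, [e["seq"] for e in window])
```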

  16. HP PA-RISC 8000

  17. ROB Size Performance Effect

  18. AMD K-5 ROB Entry

  19. AMD K-5 Reservation Station Entry

  20. Approaches for Billion Transistor Architectures
      • Advanced superscalar processors
        • scale up from current designs to issue 16 or 32 instructions per cycle
      • Superspeculative processors
        • enhance wide-issue superscalar performance by speculating aggressively at every point in the processor pipeline

  21. SPARC64 V9

  22. Pentium III and 4 Register Renaming and ROB

  23. One Billion Transistors, One Uniprocessor, One Chip?

  24. Superspeculative Architecture

  25. Area Issues
      • Large circuitry is required to feed the processor with a continuous instruction stream
      • Dynamic execution requires a large number of comparisons for dependency checking
      • The sizes of the reorder buffer and the reservation stations/rename registers increase accordingly

  26. Limitations
      • Larger issue machines have high peak to sustained rate ratios – Intel Pentium Pro architecture approach
      • Beyond issue widths of 8, the inherently limited ILP of a single thread gives diminishing returns – more architectures are switching to Simultaneous Multithreading

  27. Alternate Approaches

  28. OOO Speculative Execution Processor - Simulator Design
      • Tracking all the activities of the pipelined machine in each clock cycle
      • Issue unit design that solves structural and data hazards
      • Dependency checking mechanisms
      • Strategy for sending data from producers to consumers

  29. Data Structures
      • Instruction Queue
      • Execution Tracking Hardware Structure
        • Register File Producer Table
        • Reservation Stations
        • The Reorder Buffer
        • Functional Units State Structure

  30. Service Functions
      • Issue
      • Dispatch
      • Completion
      • CDB Snooping
      • Retirement and Writeback

  31. Overall Structure

  32. Producer Table
      • Each register is extended by a tag and a valid flag
      • Valid = true iff the register contains appropriate data
      • Otherwise, the tag points to the instruction producing the data

  33. Reservation Stations
      • Full bit is set if the entry is occupied
      • Tag points to the ROB tag of the instruction
      • op1 and op2 hold the source references

  34. The Reorder Buffer
      • Realized as a FIFO with ROBhead and ROBtail
      • New instructions are placed at ROBtail, and the instruction's RS entry is tagged with this index
      • Each cycle, the valid bit of the ROBhead entry is checked for instruction completion
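The FIFO behaviour on slide 34 can be sketched as a circular buffer. The class and method names, the explicit `count` field, and the modulo wrap of the head and tail pointers are assumptions; the slide only states that the ROB is a FIFO with ROBhead and ROBtail.

```python
class ReorderBuffer:
    """Toy circular ROB; an entry's index doubles as its tag."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0    # ROBhead: oldest in-flight instruction
        self.tail = 0    # ROBtail: next free slot
        self.count = 0   # occupancy, so full and empty can be told apart

    def allocate(self, dest):
        """Place a new instruction at ROBtail and return its tag."""
        if self.count == len(self.entries):
            raise RuntimeError("ROB full: issue must stall")
        tag = self.tail
        self.entries[tag] = {"dest": dest, "valid": False, "data": None}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def complete(self, tag, data):
        """Mark an entry as completed and record its result."""
        self.entries[tag]["valid"] = True
        self.entries[tag]["data"] = data

    def retire(self):
        """Pop the head entry if it has completed, else return None."""
        if self.count == 0 or not self.entries[self.head]["valid"]:
            return None
        entry = self.entries[self.head]
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry["dest"], entry["data"]


rob = ReorderBuffer(size=2)
t0 = rob.allocate("r1")
t1 = rob.allocate("r2")
rob.complete(t1, 7)        # the younger instruction finishes first
blocked = rob.retire()     # head has not completed, so nothing retires
rob.complete(t0, 5)
first, second = rob.retire(), rob.retire()
t2 = rob.allocate("r3")    # the tail has wrapped around to slot 0
print(blocked, first, second, t2)
```

Retirement stays in program order even though completion does not, which is what lets exceptions be taken precisely.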

  35. Issue Protocol
      if (there is a free RS and a free ROB entry) {
        RS.full := 1;
        RS.tag := ROBtail;
        for all operands x of Ii with address r
          if Rr.valid = 1
            RS.opx := Rr;
          else if CDB.tag = Rr.tag and CDB.valid
            RS.opx := CDB;
          else
            RS.opx := ROB[Rr.tag];
        if (Ii has a destination register r) {
          Rr.tag := ROBtail;
          Rr.valid := 0;
          ROB[ROBtail].dest := r;
        } else
          ROB[ROBtail].dest := none;
        ROBtail := ROBtail + 1;
      }
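The issue protocol on slide 35 can be sketched in Python as below. All class and field names (`Slot`, `RobEntry`, `Station`, `Core`, `rob_count`) are hypothetical, and the modulo wrap of the tail pointer is an assumption that the slide leaves implicit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Slot:
    """Value-or-tag record used for registers, operands and the CDB."""
    valid: bool = False
    data: int = 0
    tag: int = -1


@dataclass
class RobEntry:
    valid: bool = False
    data: int = 0
    dest: Optional[int] = None


@dataclass
class Station:
    full: bool = False
    tag: int = -1
    ops: list = field(default_factory=list)


@dataclass
class Core:
    regs: list
    rob: list
    stations: list
    cdb: Slot = field(default_factory=Slot)
    rob_tail: int = 0
    rob_count: int = 0


def issue(core, srcs, dest):
    """Try to issue one instruction; returns False on a structural stall."""
    rs = next((s for s in core.stations if not s.full), None)
    if rs is None or core.rob_count == len(core.rob):
        return False
    tail = core.rob_tail
    rs.full, rs.tag, rs.ops = True, tail, []
    for r in srcs:
        reg = core.regs[r]
        if reg.valid:                                      # value is in the register file
            op = Slot(True, reg.data, -1)
        elif core.cdb.valid and core.cdb.tag == reg.tag:   # value is on the CDB now
            op = Slot(True, core.cdb.data, reg.tag)
        else:                                              # copy ROB entry: data or just the tag
            entry = core.rob[reg.tag]
            op = Slot(entry.valid, entry.data, reg.tag)
        rs.ops.append(op)
    core.rob[tail] = RobEntry(valid=False, data=0, dest=dest)
    if dest is not None:
        core.regs[dest].tag = tail       # the register now waits on this instruction
        core.regs[dest].valid = False
    core.rob_tail = (tail + 1) % len(core.rob)
    core.rob_count += 1
    return True


core = Core(
    regs=[Slot(True, 10 * i) for i in range(4)],
    rob=[RobEntry() for _ in range(4)],
    stations=[Station() for _ in range(2)],
)
ok1 = issue(core, srcs=[1, 2], dest=3)   # r3 = r1 op r2, both operands ready
ok2 = issue(core, srcs=[3, 1], dest=2)   # r2 = r3 op r1, waits on ROB tag 0
ok3 = issue(core, srcs=[0], dest=0)      # stalls: no free reservation station
print(ok1, ok2, ok3)
```

Sources are read before the destination register is retagged, so an instruction that reads and writes the same register still picks up the older producer.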

  36. Dispatch Protocol
      if there is a RS with RS.opx.valid = 1 for all operands x
         and the function unit is not stalled {
        Pass instruction, operands, and tag to FU
        RS.full := 0;
      }

  37. Completion Protocol
      if FU has result and got CDB-acknowledge {
        CDB.valid := 1;
        CDB.data := result from FU;
        CDB.tag := tag from FU;
        ROB[CDB.tag].valid := 1;
        ROB[CDB.tag].data := CDB.data;
      }

  38. CDB Snooping
      For all operands x:
      if RS.full = 1 and RS.opx.valid = 0 and RS.opx.tag = CDB.tag {
        RS.opx := CDB;
      }
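The dispatch, completion, and snooping protocols on slides 36 to 38 fit together roughly as sketched below. The names are hypothetical, CDB arbitration (the acknowledge signal) is not modelled, and the explicit check of `cdb.valid` during snooping is an added assumption that the slide does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Operand:
    valid: bool = False
    data: int = 0
    tag: int = -1


@dataclass
class Station:
    full: bool = False
    tag: int = -1
    ops: list = field(default_factory=list)


def dispatch(stations, fu_stalled=False):
    """Send the first fully ready station to the FU; returns (tag, operands)."""
    if fu_stalled:
        return None
    for rs in stations:
        if rs.full and all(op.valid for op in rs.ops):
            rs.full = False                      # the entry is free again
            return rs.tag, [op.data for op in rs.ops]
    return None


def complete(cdb, rob, result, tag):
    """Broadcast an FU result on the CDB and record it in the ROB."""
    cdb.valid, cdb.data, cdb.tag = True, result, tag
    rob[tag]["valid"] = True
    rob[tag]["data"] = result


def snoop(stations, cdb):
    """Waiting operands whose tag matches the CDB capture the broadcast value."""
    if not cdb.valid:                            # assumption: ignore an idle bus
        return
    for rs in stations:
        if not rs.full:
            continue
        for op in rs.ops:
            if not op.valid and op.tag == cdb.tag:
                op.valid, op.data = True, cdb.data


cdb = Operand()
rob = [{"valid": False, "data": 0} for _ in range(4)]
stations = [Station(full=True, tag=3, ops=[Operand(True, 5), Operand(False, 0, 2)])]
before = dispatch(stations)           # second operand still waits on tag 2
complete(cdb, rob, result=42, tag=2)
snoop(stations, cdb)
after = dispatch(stations)
print(before, after)
```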

  39. Retirement/Writeback Protocol
      if ROB not empty and ROB[ROBhead].valid = 1 {
        if instruction in the ROB[ROBhead] requires writeback {
          x := ROB[ROBhead].dest;
          Rx.data := ROB[ROBhead].data;
          if ROBhead = Rx.tag
            Rx.valid := 1;
        }
        ROBhead := ROBhead + 1;
      }
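The retirement step on slide 39 can be sketched as follows. The dictionary layout and the modulo wrap of the head pointer are assumptions; the example uses two in-flight writers of the same register to show why the tag comparison is needed.

```python
def retire(state):
    """Retire the ROB head if it has completed; returns True if it did."""
    rob = state["rob"]
    if state["count"] == 0:
        return False
    head = state["head"]
    entry = rob[head]
    if not entry["valid"]:
        return False                   # head not finished: retirement waits
    dest = entry["dest"]
    if dest is not None:               # the instruction requires writeback
        reg = state["regs"][dest]
        reg["data"] = entry["data"]
        if reg["tag"] == head:         # only the newest producer revalidates
            reg["valid"] = True
    state["head"] = (head + 1) % len(rob)
    state["count"] -= 1
    return True


state = {
    # r0 is waiting on ROB entry 1, the younger of two writers.
    "regs": [{"data": 0, "valid": False, "tag": 1}],
    "rob": [
        {"dest": 0, "valid": True, "data": 5},
        {"dest": 0, "valid": True, "data": 9},
    ],
    "head": 0,
    "count": 2,
}
r1 = retire(state)
after_first = dict(state["regs"][0])   # data written, still not valid
r2 = retire(state)
after_second = dict(state["regs"][0])  # newest writer retired, now valid
r3 = retire(state)                     # ROB empty
print(after_first, after_second, r3)
```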

  40. Configurable Parameters
      • Probability of memory misses
      • Probability of correct branch prediction
      • Branch mis-prediction penalty
      • Cache miss penalty
      • Window size for instruction issue
      • Number of issues/cycle
      • Number of Functional Units (FUs)
      • Pipeline depth/latency of each FU
      • Number of CDBs
      • Size of reservation stations/rename registers (RS)
      • Operand matching mechanism in each RS
      • Size of re-order buffer
      • Branch prediction mechanisms (optional)
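One way to group most of the parameters on slide 40 is a single configuration record, sketched below. The field names, the default values, and the validation rules are all assumptions for illustration; the slides do not give concrete values.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SimConfig:
    """Hypothetical simulator configuration; every default is a placeholder."""
    mem_miss_prob: float = 0.05          # probability of a memory miss
    branch_correct_prob: float = 0.90    # probability of a correct prediction
    mispredict_penalty: int = 7          # cycles
    cache_miss_penalty: int = 20         # cycles
    window_size: int = 8                 # instructions examined for issue
    issue_width: int = 4                 # issues per cycle
    fu_count: dict = field(default_factory=lambda: {"alu": 2, "mul": 1, "mem": 1})
    fu_latency: dict = field(default_factory=lambda: {"alu": 1, "mul": 3, "mem": 2})
    num_cdbs: int = 2
    rs_size: int = 16
    rob_size: int = 32

    def __post_init__(self):
        # Reject settings that cannot describe a working machine.
        for name in ("mem_miss_prob", "branch_correct_prob"):
            p = getattr(self, name)
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"{name} must be a probability, got {p}")
        if self.issue_width > self.window_size:
            raise ValueError("issue_width cannot exceed window_size")
        if set(self.fu_count) != set(self.fu_latency):
            raise ValueError("fu_count and fu_latency must name the same FUs")


cfg = SimConfig(issue_width=2, rob_size=64)
print(cfg.issue_width, cfg.rob_size, cfg.fu_latency["mul"])
```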

  41. Performance Metrics
      • Number of clock cycles on an instruction trace
      • Number of stalls (various types)
      • Effect on hardware costs
      • Peak vs. sustained rates (actual issues vs. maximum possible)
      • Percentage resource utilization
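Several of the metrics on slide 41 follow directly from counters a simulator would keep. The sketch below assumes hypothetical counter names and approximates the sustained rate by instructions completed per cycle; the numbers in the example are made up.

```python
def summarize(cycles, instructions, issue_width, fu_busy_cycles):
    """Derive rate and utilization figures from raw simulation counters."""
    if cycles <= 0 or instructions <= 0:
        raise ValueError("cycles and instructions must be positive")
    ipc = instructions / cycles                   # sustained rate
    return {
        "ipc": ipc,
        "peak_to_sustained": issue_width / ipc,   # peak rate = issue width
        "utilization": {fu: busy / cycles for fu, busy in fu_busy_cycles.items()},
    }


report = summarize(
    cycles=100,
    instructions=150,
    issue_width=4,
    fu_busy_cycles={"alu": 80, "mul": 25},
)
print(report)
```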

  42. OOO Speculative Micro-architecture Simulators
      • SimpleScalar
        • University of Wisconsin in Madison
        • www.simplescalar.com
      • KScalar
        • Universidad Autónoma de Barcelona
        • www.caos.uab.es/kscalar

  43. SimpleScalar v3.0
      • The tool set includes sample simulators ranging from a fast functional simulator to a detailed, dynamically scheduled processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction
      • Includes performance visualization tools, statistical analysis resources, and debug and verification infrastructure
      • Includes a machine definition infrastructure that permits most architectural details to be separated from simulator implementations

  44. KScalar
      • Allows analyzing the performance behavior of a wide range of processor microarchitectures: from a very simple in-order, scalar pipeline, to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction
      • The simulator interprets executables for the Alpha AXP instruction set: from very short program fragments to large applications
      • The object program's execution may be simulated in varying levels of detail: either cycle-by-cycle, observing all the pipeline events that determine processor performance, or a million cycles at once, taking statistics of the main performance issues

  45. Study Direction
      • Modeling and comparison of representative micro-architectures
      • Parameters modeling a commercial micro-architecture's OOO speculative execution core
      • SPEC benchmark instruction traces
      • Analysis of the relative importance of supporting assumptions

  46. Study Direction (continued)
      • Modeling resource utilization of a simultaneous multithreaded workload
      • Comparison of resource utilization and performance metrics of single-thread vs. SMT execution
      • Use of instruction traces that model a multi-thread workload (e.g. modeling Hyperthreading in Pentium 4)
