Out-of-Order Speculative Execution

Out-of-Order Speculative Execution Designing a Configurable Simulator for an OOO Microprocessor By Mustafa Imran Ali ID# 230203

Presentation Outline
• Introduction
• Examples - Representative Micro-architectures
• Some Issues - Limitations and Other Approaches
• Simulator Details
COE 501 Presentation by Mustafa Imran Ali

Out-of-order Speculative Execution – Maximizing ILP
• In-order Execution
  • Pipelining – exploiting temporal parallelism through overlap
  • Superscalar – more parallelism by allowing multiple instructions to issue
• Problem – Pipeline Stalls
  • Data dependencies allow limited ILP
  • Long-latency operations cause structural hazards
  • Data loads – cache miss stalls

Out-of-order Speculative Execution
• Instructions execute as soon as possible and in parallel with other nondependent work
• Results in faster execution because critical-path computations start and complete quickly
• The processor speculatively fetches and executes instructions even though it may not know immediately whether they will be on the final execution path
• Multilevel branch prediction avoids waiting for the outcome of multiple branches

OOO Speculative Execution - Benefits
• Reduced reliance on compilers
  • Compilers cannot examine runtime dependencies
• No need for recompilation
  • Source code access is not always possible
  • Binary compatibility with existing code

OOO Speculative Execution - Problems and Issues
• Overcoming WAW and WAR hazards – register renaming
• More branches/cycle – requires accurate branch prediction
• Register renaming – dependency checking mechanism (large number of comparisons)
• Data forwarding from producers to consumers – use of a tagging and broadcast mechanism
• Exceptions – committing instructions in program order

Compaq Alpha 21264 (1998)
• OOO superscalar with speculative execution
• Fetches 4 instructions/cycle
• Dynamically issues up to 6 instructions/cycle: 4 integer and 2 floating point
• Can speculate through up to 20 branches
• 64 architectural registers
• 41 integer + 41 floating-point rename registers
• Up to 80 instructions in flight + 32 in-flight loads + 32 in-flight stores
• 20-entry integer queue: issues 4 instructions
• 15-entry floating-point queue: issues 2 instructions
• Can retire at most 11 instructions/cycle; can sustain a rate of 8/cycle (over short periods)

Stages in Instruction Pipeline
[Figure: pipeline diagram; the callouts below survive from the slide]
• Provides 4 instructions/cycle
• Maps virtual registers to physical registers
• Dynamically selects from up to 6 instructions – issue reordering takes place
• All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers

Register Renaming Process
• Assigns a unique storage location to each write-reference to a register
• Speculatively allocates a register to each instruction with a register result
• The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits
• Allows an instruction to speculatively issue and deposit its result into the register file before the instruction retires

Register Renaming Process (continued)
• The processor maintains storage with each internal register indicating the user-visible register that is currently associated with it (if any)
• Register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register
• The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored if a misspeculation occurs
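The renaming steps above can be illustrated with a minimal Python sketch. It is not the 21264 mapper: for brevity it uses a table indexed by architectural register instead of the CAM organization described on the slide, and the names `RenameMap`, `rename`, and the register counts are assumptions made for this example.

```python
class RenameMap:
    """Toy register mapper: architectural -> physical, with a free list."""

    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register i starts out in physical register i.
        self.map = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def rename(self, srcs, dest):
        """Rename one instruction; returns (phys_srcs, phys_dest, stale_phys)."""
        phys_srcs = [self.map[s] for s in srcs]   # look up sources before remapping dest
        if dest is None:
            return phys_srcs, None, None
        if not self.free:
            raise RuntimeError("no free physical register: the mapper must stall")
        stale = self.map[dest]      # old mapping, freed only when the instruction retires
        new = self.free.pop(0)      # unique storage location for this write
        self.map[dest] = new
        return phys_srcs, new, stale


rm = RenameMap(num_arch=4, num_phys=8)
# r1 = r2 + r3 ; r1 = r1 + r2  (two writes to r1 get distinct physical registers)
first = rm.rename(srcs=[2, 3], dest=1)
second = rm.rename(srcs=[1, 2], dest=1)
print(first, second)
```

Because each write to r1 lands in a different physical register, the WAW hazard between the two instructions disappears, while the second instruction still reads the first one's result.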

Map (register rename) and Queue Stages
• The map stage renames programmer-visible register numbers to internal register numbers
• Structures are duplicated for integer and floating-point execution
• The queue stage stores instructions until they are ready to issue

Out-of-order Issue Queues
• Issue queue logic maintains 2 lists of pending instructions in separate integer and floating-point queues
• Scoreboards maintain the status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions
• The scoreboard unit notifies all instructions in the queue that require the register value when functional-unit or load-data results become available

Out-of-order Execution
• Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle
• Queues are collapsible: an entry becomes immediately available once the instruction issues or is squashed due to misspeculation
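The oldest-first selection rule can be sketched in Python as follows. The queue layout (a list of dicts ordered oldest first), the unit names, and the function name `select_oldest_ready` are assumptions for illustration, not details of the 21264 arbiter.

```python
def select_oldest_ready(queue, ready_regs, free_units, width):
    """Pick up to `width` oldest entries whose sources are ready and whose unit is free.

    queue: list of dicts ordered oldest first, each with 'srcs' and 'unit'.
    Issued entries are removed from `queue`, so the queue collapses.
    """
    issued = []
    units = dict(free_units)            # copy: unit kind -> number free this cycle
    for entry in list(queue):           # iterate over a snapshot while mutating queue
        if len(issued) == width:
            break
        operands_ready = all(s in ready_regs for s in entry["srcs"])
        if operands_ready and units.get(entry["unit"], 0) > 0:
            units[entry["unit"]] -= 1
            issued.append(entry)
            queue.remove(entry)
    return issued


queue = [
    {"id": 0, "srcs": [5], "unit": "alu"},   # oldest, but still waiting on register 5
    {"id": 1, "srcs": [1], "unit": "alu"},
    {"id": 2, "srcs": [2], "unit": "alu"},   # ready, but the only ALU is taken by id 1
    {"id": 3, "srcs": [1], "unit": "mul"},
]
issued = select_oldest_ready(queue, ready_regs={1, 2},
                             free_units={"alu": 1, "mul": 1}, width=4)
print([e["id"] for e in issued], [e["id"] for e in queue])
```

Instructions 1 and 3 issue past the stalled instruction 0, which is the reordering that gives out-of-order issue its benefit.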

Retire Mechanism
• Assigns each mapped instruction a slot in a circular in-flight window (in fetch order)
• Tracks the internal register usage for all in-flight instructions
• Each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction
• This (stale) register can be freed for other use after the instruction retires

Exception Handling
• An exception causes all younger instructions in the in-flight window to be squashed and removed from all queues in the system
• The register map is backed up to the state before the last squashed instruction using the saved map state
• Registers allocated by the squashed instructions become immediately available
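The retire mechanism and the squash path can be sketched together in Python. This is a simplified model under stated assumptions: the class name `InFlightWindow`, the per-instruction full copy of the map, and the register counts are illustrative, and "the state before the last squashed instruction" is interpreted here as the map saved by the oldest squashed instruction.

```python
from collections import deque


class InFlightWindow:
    def __init__(self, num_arch, num_phys):
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))
        self.window = deque()               # fetch order, oldest on the left

    def map_instruction(self, dest):
        saved_map = list(self.map)          # map state before this instruction
        new = self.free.popleft()           # raises IndexError if no register is free
        stale = self.map[dest]
        self.map[dest] = new
        self.window.append({"dest": dest, "new": new,
                            "stale": stale, "saved_map": saved_map})

    def retire_oldest(self):
        entry = self.window.popleft()
        self.free.append(entry["stale"])    # old contents can no longer be referenced
        return entry

    def squash_from(self, index):
        """Squash window[index] and every younger instruction."""
        squashed = [self.window.pop() for _ in range(len(self.window) - index)]
        if not squashed:
            return []
        for entry in squashed:
            self.free.append(entry["new"])  # allocated registers become available
        self.map = squashed[-1]["saved_map"]  # saved by the oldest squashed instruction
        return squashed


w = InFlightWindow(num_arch=4, num_phys=8)
w.map_instruction(dest=1)
w.map_instruction(dest=2)
w.map_instruction(dest=1)
w.squash_from(1)                            # squash the two youngest instructions
print(w.map, list(w.free))
```

Copying the whole map per instruction is wasteful but keeps the sketch short; it makes the restore a single assignment.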

HP PA-RISC 8000

ROB Size Performance Effect

AMD K-5 ROB Entry

AMD K-5 Reservation Station Entry

Approaches for Billion Transistor Architectures
• Advanced superscalar processors
  • Scale up from current designs to issue 16 or 32 instructions per cycle
• Superspeculative processors
  • Enhance wide-issue superscalar performance by speculating aggressively at every point in the processor pipeline

SPARC64 V9

Pentium III and 4 Register Renaming and ROB

One Billion Transistors, One Uniprocessor, One Chip?

Superspeculative Architecture

Area Issues
• Large circuitry is required to feed the processor with a continuous instruction stream
• Dynamic execution requires a large number of comparisons for dependency checking
• The sizes of the reorder buffer and the reservation stations/rename registers increase accordingly

Limitations
• Wider-issue machines have high peak-to-sustained rate ratios – Intel Pentium Pro architecture approach
• Beyond issue widths of 8, the inherently limited ILP of a single thread gives diminishing returns – more architectures are switching to Simultaneous Multithreading

Alternate Approaches

OOO Speculative Execution Processor - Simulator Design
• Tracking all the activities of the pipelined machine in each clock cycle
• Issue unit design that resolves structural and data hazards
• Dependency checking mechanisms
• Strategy for sending data from producers to consumers

Data Structures
• Instruction Queue
• Execution Tracking Hardware Structure
• Register File Producer Table
• Reservation Stations
• The Reorder Buffer
• Functional Units State Structure

Service Functions
• Issue
• Dispatch
• Completion
• CDB Snooping
• Retirement and Writeback

Overall Structure

Producer Table
• Each register is extended by a tag and a valid flag
• Valid = true iff the register contains appropriate data
• Otherwise, the tag points to the instruction producing the data

Reservation Stations
• The full bit is set if the entry is occupied
• Tag holds the ROB tag of the instruction
• op1 and op2 hold the source references

The Reorder Buffer
• Realized as a FIFO with ROBhead and ROBtail
• New instructions are placed at ROBtail, and the instruction is tagged in its RS with this index
• Each cycle, the entry at ROBhead is checked for instruction completion
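A FIFO with head and tail pointers is commonly realized as a circular array. The Python sketch below shows one way to do that; the class and method names and the explicit `count` field are assumptions for illustration, not the presenter's implementation.

```python
class ReorderBuffer:
    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0       # oldest in-flight instruction (ROBhead)
        self.tail = 0       # next free slot (ROBtail)
        self.count = 0      # distinguishes a full buffer from an empty one

    def is_full(self):
        return self.count == len(self.entries)

    def allocate(self, dest):
        """Append at the tail; the returned index is the tag stored in the RS."""
        if self.is_full():
            raise RuntimeError("ROB full: issue must stall")
        tag = self.tail
        self.entries[tag] = {"dest": dest, "valid": False, "data": None}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def complete(self, tag, data):
        self.entries[tag]["valid"] = True
        self.entries[tag]["data"] = data

    def retire(self):
        """Remove and return the head entry if it has completed, else None."""
        if self.count == 0 or not self.entries[self.head]["valid"]:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry


rob = ReorderBuffer(size=2)
t0 = rob.allocate("r1")
t1 = rob.allocate("r2")
rob.complete(t1, 7)          # the younger instruction finishes first
stalled = rob.retire()       # None: the head has not completed yet
rob.complete(t0, 3)
first = rob.retire()
second = rob.retire()
```

Even though the younger instruction completes first, entries leave the buffer strictly in allocation order, which is what keeps retirement in program order.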

Issue Protocol
if (there is a free RS and a free ROB entry) {
    RS.full := 1;
    RS.tag := ROBtail;
    for all operands x of Ii with address r
        if Rr.valid = 1
            RS.opx := Rr;
        else if CDB.tag = Rr.tag and CDB.valid
            RS.opx := CDB;
        else
            RS.opx := ROB[Rr.tag];
    if (Ii has a destination register r) {
        Rr.tag := ROBtail;
        Rr.valid := 0;
        ROB[ROBtail].dest := r;
    } else
        ROB[ROBtail].dest := none;
    ROBtail := ROBtail + 1;
}
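The issue protocol can be transcribed into runnable Python roughly as follows. The dict-based layouts for registers, reservation stations, ROB entries, and the CDB are assumptions made for this sketch, as is wrapping ROBtail modulo the buffer size.

```python
def issue(instr, regs, rs, rob, rob_tail, cdb):
    """Issue one instruction; returns the new ROB tail, or None on a stall.

    regs: name -> {"valid", "data", "tag"}; rs: list of station dicts;
    rob: list of entry dicts (None = free); cdb: {"valid", "tag", "data"}.
    """
    station = next((s for s in rs if not s["full"]), None)
    if station is None or rob[rob_tail] is not None:
        return None                                # no free RS or no free ROB entry
    station["full"], station["tag"] = True, rob_tail
    station["ops"] = []
    for r in instr["srcs"]:                        # read sources before re-tagging dest
        reg = regs[r]
        if reg["valid"]:
            op = {"valid": True, "data": reg["data"], "tag": None}
        elif cdb["valid"] and cdb["tag"] == reg["tag"]:
            op = {"valid": True, "data": cdb["data"], "tag": None}
        else:
            producer = rob[reg["tag"]]             # may or may not have completed
            op = {"valid": producer["valid"], "data": producer["data"],
                  "tag": reg["tag"]}
        station["ops"].append(op)
    dest = instr.get("dest")
    rob[rob_tail] = {"dest": dest, "valid": False, "data": None}
    if dest is not None:
        regs[dest]["tag"], regs[dest]["valid"] = rob_tail, False
    return (rob_tail + 1) % len(rob)


regs = {f"r{i}": {"valid": True, "data": i * 10, "tag": None} for i in range(4)}
rs = [{"full": False, "tag": None, "ops": []} for _ in range(2)]
rob = [None] * 4
cdb = {"valid": False, "tag": None, "data": None}

tail = issue({"srcs": ["r1", "r2"], "dest": "r3"}, regs, rs, rob, 0, cdb)
tail = issue({"srcs": ["r3", "r1"], "dest": "r1"}, regs, rs, rob, tail, cdb)
stalled = issue({"srcs": ["r0"], "dest": "r2"}, regs, rs, rob, tail, cdb)  # no free RS
```

The second instruction picks up r3 as a pending operand tagged with ROB entry 0, since the first instruction has not produced its result yet.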

Dispatch Protocol
if there is a RS with RS.opx.valid = 1 for all operands x and the functional unit is not stalled {
    pass instruction, operands, and tag to FU;
    RS.full := 0;
}

Completion Protocol
if FU has result and got CDB acknowledge {
    CDB.valid := 1;
    CDB.data := result from FU;
    CDB.tag := tag from FU;
    ROB[CDB.tag].valid := 1;
    ROB[CDB.tag].data := CDB.data;
}

CDB Snooping
for all operands x:
    if RS.full = 1 and RS.opx.valid = 0 and RS.opx.tag = CDB.tag {
        RS.opx := CDB;
    }
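Completion and CDB snooping work as a pair: one broadcasts a tagged result, the other captures it. A minimal Python sketch, assuming the same hypothetical dict layouts for stations, ROB entries, and the CDB (none of which come from the presentation), and adding an explicit check that the CDB holds a valid result:

```python
def complete(result, tag, cdb, rob):
    """Completion: drive the result onto the CDB and record it in the ROB."""
    cdb.update(valid=True, data=result, tag=tag)
    rob[tag]["valid"] = True
    rob[tag]["data"] = result


def snoop(stations, cdb):
    """CDB snooping: waiting operands whose tag matches capture the value."""
    captured = 0
    for station in stations:
        if not station["full"]:
            continue                      # empty stations ignore the bus
        for op in station["ops"]:
            if cdb["valid"] and not op["valid"] and op["tag"] == cdb["tag"]:
                op["valid"], op["data"] = True, cdb["data"]
                captured += 1
    return captured


rob = [{"dest": "r3", "valid": False, "data": None}]
cdb = {"valid": False, "tag": None, "data": None}
stations = [
    {"full": True, "ops": [{"valid": False, "tag": 0, "data": None},
                           {"valid": True, "tag": None, "data": 10}]},
    {"full": False, "ops": [{"valid": False, "tag": 0, "data": None}]},
]
complete(30, 0, cdb, rob)
captured = snoop(stations, cdb)
```

Only the occupied station captures the value; the stale operand left in the empty station is ignored because its full bit is clear.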

Retirement/Writeback Protocol
if ROB not empty and ROB[ROBhead].valid = 1 {
    if instruction in ROB[ROBhead] requires writeback {
        x := ROB[ROBhead].dest;
        Rx.data := ROB[ROBhead].data;
        if ROBhead = Rx.tag
            Rx.valid := 1;
    }
    ROBhead := ROBhead + 1;
}
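The tag comparison in the retirement step matters when a younger instruction writes the same register. The Python sketch below illustrates this, assuming hypothetical dict layouts for registers and ROB entries and a wrap-around head pointer.

```python
def retire(rob, rob_head, regs):
    """Retire the head entry if complete; returns the new head (unchanged on a stall)."""
    entry = rob[rob_head]
    if entry is None or not entry["valid"]:
        return rob_head
    dest = entry["dest"]
    if dest is not None:                      # instruction requires writeback
        regs[dest]["data"] = entry["data"]
        if regs[dest]["tag"] == rob_head:     # no younger producer has re-tagged it
            regs[dest]["valid"] = True
    rob[rob_head] = None
    return (rob_head + 1) % len(rob)


# Two in-flight writes to r1; the register's tag names the younger one (entry 1).
regs = {"r1": {"valid": False, "data": 0, "tag": 1}}
rob = [{"dest": "r1", "valid": True, "data": 5},
       {"dest": "r1", "valid": False, "data": None}]

head = retire(rob, 0, regs)
valid_after_first = regs["r1"]["valid"]       # still False: entry 1 is the producer
stalled_head = retire(rob, head, regs)        # entry 1 has not completed
rob[1].update(valid=True, data=9)
head = retire(rob, head, regs)
```

Without the tag check, retiring the older write would mark r1 valid while a younger producer was still outstanding, and later instructions would read a stale value.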

Configurable Parameters
• Probability of memory misses
• Probability of correct branch prediction
• Branch misprediction penalty
• Cache miss penalty
• Window size for instruction issue
• Number of issues/cycle
• Number of functional units (FUs)
• Pipeline depth/latency of each FU
• Number of CDBs
• Size of reservation stations/rename registers (RS)
• Operand matching mechanism in each RS
• Size of reorder buffer
• Branch prediction mechanisms (optional)
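One way to gather these parameters in a simulator is a single configuration record. The Python sketch below is illustrative only: the field names, the functional-unit kinds, and every default value are placeholders chosen for the example, not values from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class SimConfig:
    # All defaults are placeholders, not values from the presentation.
    memory_miss_prob: float = 0.05
    branch_predict_correct_prob: float = 0.90
    branch_mispredict_penalty: int = 10
    cache_miss_penalty: int = 50
    issue_window_size: int = 16
    issues_per_cycle: int = 4
    fu_count: dict = field(default_factory=lambda: {"alu": 2, "mul": 1, "mem": 1})
    fu_latency: dict = field(default_factory=lambda: {"alu": 1, "mul": 3, "mem": 2})
    num_cdbs: int = 2
    rs_size: int = 8
    rob_size: int = 32

    def validate(self):
        """Reject obviously inconsistent settings; returns self for chaining."""
        for name in ("memory_miss_prob", "branch_predict_correct_prob"):
            p = getattr(self, name)
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"{name} must be a probability, got {p}")
        if set(self.fu_latency) != set(self.fu_count):
            raise ValueError("fu_latency and fu_count must name the same units")
        if min(self.issues_per_cycle, self.num_cdbs, self.rs_size, self.rob_size) < 1:
            raise ValueError("sizes and widths must be at least 1")
        return self


baseline = SimConfig().validate()
big_rob = SimConfig(rob_size=64).validate()   # vary one parameter per experiment
```

Keeping each experiment as a small delta from a baseline record makes parameter sweeps, such as the effect of ROB size, easy to script.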

Performance Metrics
• Number of clock cycles on an instruction trace
• Number of stalls (various types)
• Effect on hardware costs
• Peak vs. sustained rates (actual issues vs. maximum possible)
• Percentage resource utilization
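Several of these metrics follow directly from counters a simulator would collect per run. A small Python sketch, assuming hypothetical counter names (`cycles`, `issued`, `busy_cycles`) and a run in which at least one instruction issued:

```python
def summarize(cycles, issued, peak_issue_width, busy_cycles):
    """Derive rate and utilization metrics from raw counters.

    busy_cycles: unit name -> number of cycles in which the unit was occupied.
    Assumes cycles > 0 and issued > 0.
    """
    sustained = issued / cycles                     # actual issues per cycle
    return {
        "cycles": cycles,
        "sustained_issue_rate": sustained,
        "peak_to_sustained": peak_issue_width / sustained,
        "utilization_pct": {unit: 100.0 * busy / cycles
                            for unit, busy in busy_cycles.items()},
    }


report = summarize(cycles=200, issued=400, peak_issue_width=4,
                   busy_cycles={"alu": 150, "mul": 50})
print(report)
```

In this made-up run the machine sustains 2 issues per cycle against a peak of 4, a peak-to-sustained ratio of 2.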

OOO Speculative Micro-architecture Simulators
• SimpleScalar
  • University of Wisconsin in Madison
  • www.simplescalar.com
• KScalar
  • Universidad Autónoma de Barcelona
  • www.caos.uab.es/kscalar

SimpleScalar v3.0
• Tool set includes sample simulators ranging from a fast functional simulator to a detailed, dynamically scheduled processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction
• Includes performance visualization tools, statistical analysis resources, and debug and verification infrastructure
• Includes a machine definition infrastructure that permits most architectural details to be separated from simulator implementations

KScalar
• Allows analyzing the performance behavior of a wide range of processor microarchitectures: from a very simple in-order, scalar pipeline to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction
• The simulator interprets executables for the Alpha AXP instruction set: from very short program fragments to large applications
• The program's execution may be simulated at varying levels of detail: either cycle by cycle, observing all the pipeline events that determine processor performance, or millions of cycles at once, collecting statistics on the main performance issues

Study Direction
• Modeling and comparison of representative micro-architectures
  • Parameters modeling the OOO speculative execution cores of commercial micro-architectures
  • SPEC benchmark instruction traces
  • Analysis of the relative importance of supporting assumptions

Study Direction (continued)
• Modeling resource utilization of simultaneous multithreaded workloads
  • Comparison of resource utilization and performance metrics of single-thread vs. SMT execution
  • Use of instruction traces that model a multi-thread workload (e.g. modeling Hyperthreading in Pentium 4)
