Out-of-Order Core Implementation: Basics and Examples

Lecture 18: Core Design • Today: basics of implementing a correct ooo core: • register renaming, commit, LSQ, issue queue

The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr fetch Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 Committed Reg Map R1P1 R2P2 Register File P1-P64 R1  R1+R2 R2  R1+R3 BEQZ R2 R3  R1+R2 R1  R3+R2 Decode & Rename P33  P1+P2 P34  P33+P3 BEQZ P34 P35  P33+P34 P36  P35+P34 ALU ALU ALU Speculative Reg Map R1P36 R2P34 Instr Fetch Queue Results written to regfile and tags broadcast to IQ Issue Queue (IQ)

Rename A lr1  lr2 + lr3 B lr2  lr4 + lr5 C lr6  lr1 + lr3 D lr6  lr1 + lr2 RAR lr3 RAW lr1 WAR lr2 WAW lr6 A ; BC ; D pr7  pr2 + pr3 pr8  pr4 + pr5 pr9  pr7 + pr3 pr10  pr7 + pr8 RAR pr3 RAW pr7 WAR x WAW x AB ; CD

Commit Example Assume a processor with 6 logical regs and 10 physical regs Map Old / New lr1 pr1 pr7 lr2 pr2 pr8 lr6 pr6 pr9 lr6 pr9 pr10 lr3 pr3 pr1 lr4 pr4 pr2 A lr1  lr2 + lr3 B lr2  lr4 + lr5 C lr6  lr1 + lr3 D lr6  lr1 + lr2 E lr3  lr6 + lr2 F lr4  lr3 + lr4 pr7  pr2 + pr3 pr8  pr4 + pr5 pr9  pr7 + pr3 pr10  pr7 + pr8 pr1  pr10 + pr8 pr2  pr1 + pr4

Out-of-Order Loads/Stores Ld R1  [R2] Ld R3  [R4] St R5  [R6] Ld R7  [R8] Ld R9[R10]

Memory Dependence Checking Ld 0x abcdef • The issue queue checks for • register dependences and • executes instructions as soon • as registers are ready • Loads/stores access memory • as well – must check for RAW, • WAW, and WAR hazards for • memory as well • Hence, first check for register • dependences to compute • effective addresses; then check • for memory dependences Ld St Ld Ld 0x abcdef St 0x abcd00 Ld 0x abc000 Ld 0x abcd00

Memory Dependence Checking • Load and store addresses are • maintained in program order in • the Load/Store Queue (LSQ) • Loads can issue if they are • guaranteed to not have true • dependences with earlier stores • Stores can issue only if we are • ready to modify memory (can not • recover if an earlier instr raises • an exception) Ld 0x abcdef Ld St Ld Ld 0x abcdef St 0x abcd00 Ld 0x abc000 Ld 0x abcd00

The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr fetch Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 Instr 7 Committed Reg Map R1P1 R2P2 Register File P1-P64 R1  R1+R2 R2  R1+R3 BEQZ R2 R3  R1+R2 R1  R3+R2 LD R4  8[R3] ST R4  8[R1] Decode & Rename P33  P1+P2 P34  P33+P3 BEQZ P34 P35  P33+P34 P36  P35+P34 P37  8[P35] P37  8[P36] ALU ALU ALU Speculative Reg Map R1P36 R2P34 Results written to regfile and tags broadcast to IQ Instr Fetch Queue Issue Queue (IQ) ALU P37  [P35 + 8] P37  [P36 + 8] D-Cache LSQ

Speculative Issue • Instr I1 leaves the issue queue at start of cycle 6; the instr • then reads operands from the regfile, wires are traversed, • instruction executes, result is available at end of cycle 8 • If operand availability is broadcast to issue queue in cycle 9, • dependent instruction leaves in cycle 10 • This causes a 4-cycle gap between successive instrs • Hence, if we know that the instruction takes a cycle to • execute, the operand is broadcast to the issue queue in • cycle 6 and the dependent instr leaves issue queue in • cycle 7; the input operand is correctly bypassed at the FU

Load Hit Speculation • The previous optimization assumes that we know the exact • latency for every operation • This is true for all ops except loads (cache hit or miss?) • Assume hit and schedule accordingly; on a cache miss, • must squash all speculatively issued instructions; an • instruction therefore sits in the queue until load hits are • determined

Register Rename Logic Map Table Physical Source Regs Physical Dest Regs Logical Source Regs Mux Free Pool Logical Dest Regs Dependence Check Logic Logical Source Reg

Map Table – RAM 7-bits 7-bits 7-bits 7-bits 7-bits Phys reg id Num entries = Num logical regs Shadow copies (shift register)

Map Table – CAM 5-bits 1-bit 1-bit Logical reg id v a l i d Num entries = Num phys regs Shadow copies

Wakeup Logic tag1 tagIW … = = or or rdyL tagL tagR rdyR . . . . . . rdyL tagL tagR rdyR

Selection Logic Issue window req grant enable anyreq Arbiter cell enable • For multiple FUs, will need sequential selectors

Structure Complexities • Critical structures: • register map tables, issue queue, LSQ, register file, • register bypass • Cycle time is heavily influenced by: • window size (physical register size), issue width (#FUs) • Conflict between the desire to increase IPC and clock speed

Title • Bullet

Out-of-Order Core Implementation: Basics and Examples

Out-of-Order Core Implementation: Basics and Examples

Presentation Transcript

CMOS VLSI Design Lecture 1 3 : Design for Low Power

New Basic Core I Design

The Design Core

IP Core Design

Lecture 10 Design Principles #1

Session 2: Core Infrastructure Design

Lecture #23

Furniture Design (3)

Multi-core architectures

Multi-Core Architectures and Shared Resource Management Lecture 3: Interconnects

Lecture 2 (Mapping Applications to Multi-core Arch)

Lecture 1: Introduction to Software Design

Lecture 1 Overview

Lecture 1

Lecture 3: Review CPU Design

Lecture 5 The Design Process: Design Teams and Translating Plans into Design Criteria

The Core

Lecture 11 - User Interface Design and Component Level Design

Design and Analysis of Experiments Lecture 3.1