Computer Structure: The P6 Micro-Architecture, An Example of an Out-Of-Order Micro-processor. Lihu Rappoport and Adi Yoaz
The P6 Family • Features • Out-Of-Order execution • Register renaming • Speculative execution • Multiple branch prediction • Super-pipeline: 12 pipe stages
P6 Architecture [Block diagram: in-order front end (BIU, IFU, BPU, ID, MS, RAT), out-of-order core (RS, ROB, IEU, FEU, AGU, MIU, DCU, MOB), L2 cache on the external bus] • In-Order Front End • BIU: Bus Interface Unit • IFU: Instruction Fetch Unit (includes IC) • BPU: Branch Prediction Unit • ID: Instruction Decoder • MS: Micro-Instruction Sequencer • RAT: Register Alias Table • Out-of-order Core • ROB: Reorder Buffer • RRF: Real Register File • RS: Reservation Stations • IEU: Integer Execution Unit • FEU: Floating-point Execution Unit • AGU: Address Generation Unit • MIU: Memory Interface Unit • DCU: Data Cache Unit • MOB: Memory Order Buffer • L2: Level 2 cache • In-Order Retire
P6 Pipeline [Pipeline diagram: in-order front end stages I1-I8, out-of-order core stages O1-O3, in-order retirement stages R1-R2] • In-Order Front End: 1: Next IP 2: ICache lookup 3: ILD (instruction length decode) 4: rotate 5: ID1 6: ID2 7: RAT - rename sources; ALLOC - assign destinations • Out-of-order Core: 8: ROB - read sources; RS - schedule data-ready uops for dispatch 9: RS - dispatch uops 10: EX • In-order Retirement: 11: Retirement
In-Order Front End [Diagram: bytes flow through IFU and ILD into the IQ, then through ID (with MS) into the IDQ; the BPU feeds the next-IP mux] • BPU - Branch Prediction Unit - predicts the next fetch address • IFU - Instruction Fetch Unit • iTLB translates virtual to physical addresses (accesses the PMH on a miss) • IC supplies 16 bytes/cycle (accesses the L2 cache on a miss) • ILD - Instruction Length Decode - splits bytes into instructions • IQ - Instruction Queue - buffers the instructions • ID - Instruction Decode - decodes instructions into uops • MS - Micro-Sequencer - provides uops for complex instructions
Branch Prediction • Implementation • Use local history to predict direction • Need to predict multiple branches per cycle • Need to predict branches before previous branches are resolved • Branch history is updated first based on the prediction, and later corrected based on actual execution (speculative history) • Target address is taken from the BTB • Prediction rate: ~92% (~60 instructions between mispredictions) • A high prediction rate is crucial for long pipelines • Especially important for OOOE and speculative execution: • On a misprediction, all instructions following the branch in the instruction window are flushed • The effective size of the window is determined by prediction accuracy • An RSB (Return Stack Buffer) is used for Call/Return pairs
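The local-history scheme above can be sketched in a few lines: a per-branch history register selects one of several 2-bit saturating counters, the MSB of the selected counter gives the prediction, and the history is shifted speculatively on every update. This is a minimal sketch with illustrative parameters (a 4-bit history, as the BTB slide suggests), not the exact P6 circuit.

```python
HIST_BITS = 4  # per-branch local history length (illustrative)

class LocalPredictor:
    """2-level local predictor: history register + 2-bit counters."""
    def __init__(self):
        self.history = 0
        self.counters = [2] * (1 << HIST_BITS)  # start weakly taken

    def predict(self):
        # Prediction = MSB of the counter selected by the history
        return self.counters[self.history] >= 2

    def update(self, taken):
        # Train the selected counter, then shift the outcome into
        # the (speculative) history register
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

p = LocalPredictor()
hits = 0
for taken in [True, False] * 20:      # strictly alternating pattern
    hits += (p.predict() == taken)
    p.update(taken)
# After a short warm-up the history disambiguates the two phases,
# so a pattern a simple 2-bit counter would mispredict half the
# time becomes perfectly predictable: 38/40 correct here.
```

Note how the history register is the key: a single 2-bit counter sees an alternating branch as 50% taken, while the local history maps the two phases to two different counters.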
Branch Prediction - Clustering [Figure: a fetch line containing several jmp instructions, some predicted taken and some not; jumps out of the line and into the middle of the fetch line] • Need to provide predictions for the entire fetch line each cycle • Predict the first taken branch in the line, following the fetch IP • Implemented by: • Splitting the IP into offset within line, set, and tag • If the tag of more than one way matches the fetch IP: • The offsets of the matching ways are ordered • Ways with an offset smaller than the fetch IP offset are discarded • The first branch that is predicted taken is chosen as the predicted branch
The P6 BTB [Figure: a per-set BTB entry with valid bit, 9-bit tag, target, offset, 4-bit local history, and branch type (00 - cond, 01 - ret, 10 - call, 11 - uncond); per-set 2-bit counters with prediction = MSB of counter; LRR replacement bits; a Return Stack Buffer alongside] • 2-level prediction: local histories with per-set counters • 4-way set associative: 512 entries in 128 sets • Up to 4 branches can have a tag match
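The set/tag/offset split used by the clustering logic can be illustrated directly from the geometry above: 16-byte fetch lines give a 4-bit offset, 128 sets give a 7-bit set index, and the slide's figure suggests a 9-bit partial tag. A sketch, with the field widths treated as illustrative:

```python
def btb_index(ip):
    """Split a fetch IP for a 4-way, 128-set BTB over 16B fetch lines."""
    offset = ip & 0xF            # 4-bit offset within the 16B fetch line
    set_idx = (ip >> 4) & 0x7F   # 7 bits select one of 128 sets
    tag = (ip >> 11) & 0x1FF     # 9-bit partial tag (per the figure)
    return offset, set_idx, tag

off, s, tag = btb_index(0x00401A2C)
# All branches in the same 16B line share a set and tag; their
# offsets order them so the first predicted-taken branch at or
# after the fetch IP can be chosen.
```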
In-Order Front End: Decoder [Figure: 16 instruction bytes from the IFU feed the ILD; decoders D0/D1/D2/D3 produce up to 4/1/1/1 uops] • ILD: determines where each IA instruction starts • IQ: buffers the instructions • Decoders convert instructions into uops • If an instruction aligned with decoder 1/2/3 decodes into more than 1 uop, it is deferred to the next cycle • IDQ: buffers uops, smoothing the decoder's variable throughput
Micro Operations (Uops) • Each CISC instruction is broken into one or more RISC uops • Simplicity: • Each uop is (relatively) simple • Canonical representation of src/dest (2 sources, 1 destination) • Increased ILP • e.g., pop eax becomes esp1 <- esp0+4, eax1 <- [esp0] • Simple instructions translate to a few uops • Typical uop count (not necessarily cycle count!): • Reg-Reg ALU/Mov inst: 1 uop • Mem-Reg Mov (load): 1 uop • Mem-Reg ALU (load + op): 2 uops • Reg-Mem Mov (store): 2 uops (st addr, st data) • Reg-Mem ALU (ld + op + st): 4 uops • Complex instructions need ucode
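The uop-count table and the `pop eax` example can be captured in a small sketch. The uop text strings are illustrative notation, not real P6 uop encodings:

```python
# Typical uop counts per x86 instruction class (from the table above)
UOP_COUNT = {
    "reg-reg alu/mov": 1,
    "load":            1,   # mem-reg mov
    "load+op":         2,   # mem-reg alu
    "store":           2,   # reg-mem mov: STA (address) + STD (data)
    "load+op+store":   4,   # reg-mem alu
}

def crack_pop(dest):
    """Crack `pop <dest>` into two simple uops, as on the slide:
    a load from the old esp, and an esp increment."""
    return [f"{dest} <- MEM[esp0]", "esp1 <- esp0 + 4"]

uops = crack_pop("eax")
# The two uops have independent destinations, so a following
# `pop ebx` can rename esp and proceed in parallel - the ILP
# gain the slide refers to.
```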
Out-of-order Core [Block diagram, as before] • Reorder Buffer (ROB): • Holds "not yet retired" instructions • 40 entries, in-order • Reservation Stations (RS): • Holds "not yet executed" instructions • 20 entries • Execution Units: • IEU: Integer Execution Unit • FEU: Floating-point Execution Unit • Memory related units: • AGU: Address Generation Unit • MIU: Memory Interface Unit • DCU: Data Cache Unit • MOB: orders memory operations • L2: Level 2 cache
Alloc & RAT • Perform register allocation and renaming for ≤4 uops/cycle • The Register Alias Table (RAT): • Maps architectural registers to physical registers • For each arch reg, holds the number of the latest phy reg that updates it • When a new uop that writes to an arch reg R is allocated: • Record the phy reg allocated to the uop as the latest reg that updates R • The Allocator (Alloc): • Assigns each uop an entry number in the ROB / RS • For each source (architectural register) of the uop: • Looks up the RAT to find the latest phy reg updating it • Writes it into the RS entry • Allocates Load & Store buffers in the MOB
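The Alloc/RAT steps above amount to a simple per-uop loop: look up each source in the RAT (falling back to the architectural RRF when no in-flight producer exists), then point the RAT's entry for the destination at the uop's own ROB entry. A minimal sketch, with the register and entry names chosen to echo the renaming-example slides:

```python
def rename(uops, rat, next_rob_entry):
    """Rename (dest, src1, src2) uops through the RAT.
    rat maps arch reg -> latest physical (ROB) destination."""
    renamed = []
    for dest, *srcs in uops:
        # Sources: latest in-flight phy reg, else the committed RRF copy
        phys_srcs = [rat.get(s, f"RRF[{s}]") for s in srcs]
        # Destination: the ROB entry allocated to this uop becomes
        # the latest producer of the arch register
        phys_dest = f"ROB{next_rob_entry}"
        rat[dest] = phys_dest
        renamed.append((phys_dest, *phys_srcs))
        next_rob_entry += 1
    return renamed

# add eax, ebx, eax followed by sub eax, ecx, eax:
out = rename([("eax", "ebx", "eax"), ("eax", "ecx", "eax")], {}, 37)
# The second uop's eax source correctly reads ROB37, the entry
# just allocated to the first uop - the WAW/WAR hazards are gone.
```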
Re-order Buffer (ROB) • Holds 40 uops which are not yet committed • In the same order as in the program • Provides a large physical register space for register renaming • One physical register per ROB entry • Physical register number = entry number • Each uop has only one destination • Buffers the execution results until retirement • Valid data is set after the uop executes and its result is written to the physical reg
RRF - Real Register File • Holds the Architectural Register File • Architectural registers are numbered: 0 - EAX, 1 - EBX, ... • The value of an architectural register is the value written to it by the last committed instruction which writes to this register
Uop flow through the ROB • Uops are entered in order • Registers are renamed by the entry number • Once assigned: execution order is unimportant • After execution: • Entries are marked "executed" and wait for retirement • An executed entry can be "retired" once all prior instructions have retired • Commit architectural state only after speculation (branch, exception) has been resolved • Retirement: • Detect exceptions and mispredictions • Initiate repair to get the machine back on the right track • Update "real registers" with the values of the renamed registers • Update memory • Leave the ROB
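The in-order retirement rule above ("retire only if executed and all prior uops have retired") reduces to popping from the head of a FIFO and stopping at the first not-yet-executed entry. A minimal sketch of one retirement cycle, assuming a retire width of 4:

```python
from collections import deque

def retire_cycle(rob, width=4):
    """Retire up to `width` uops from the head of the ROB.
    rob: deque of {'id': ..., 'executed': bool}, oldest first."""
    retired = []
    while rob and len(retired) < width and rob[0]["executed"]:
        # Here the real machine would copy the result to the RRF
        # and check for exceptions before leaving the ROB
        retired.append(rob.popleft())
    return retired

rob = deque({"id": i, "executed": e}
            for i, e in enumerate([True, True, False, True]))
done = retire_cycle(rob)
# Entry 3 has executed, but it cannot retire past entry 2 -
# out-of-order execution, strictly in-order commit.
```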
Reservation Station (RS) • Pool of all "not yet executed" uops • Holds the uop attributes and the uop source data until it is dispatched • When a uop is allocated in the RS, operand values are updated: • If an operand is from an architectural register, the value is taken from the RRF • If an operand is from a phy reg with data valid set, the value is taken from the ROB • If an operand is from a phy reg with data valid not set, wait for the value • The RS maintains operand status "ready / not ready" • Each cycle, executed uops make more operands "ready" • The RS arbitrates the WB busses between the units • The RS monitors the WB busses to capture data needed by waiting uops • Data can be bypassed directly from the WB bus to an execution unit • Uops whose operands are all ready can be dispatched for execution • The dispatcher chooses which of the ready uops to execute next • Dispatches the chosen uops to the functional units
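The wakeup/select behavior described above can be sketched as two phases per cycle: writeback tags broadcast on the WB busses clear matching operands, then the dispatcher selects among fully-ready uops. The oldest-first selection policy and the two-wide dispatch width here are illustrative choices, not the documented P6 arbitration scheme:

```python
def wakeup_and_select(rs, wb_tags, width=2):
    """One RS cycle. rs: list of {'id', 'waiting': set of phy-reg tags};
    wb_tags: tags broadcast on the writeback busses this cycle."""
    # Wakeup: capture results the uop was waiting for
    for uop in rs:
        uop["waiting"] -= wb_tags
    # Select: dispatch up to `width` uops whose operands are all ready
    dispatched = [u for u in rs if not u["waiting"]][:width]
    for u in dispatched:
        rs.remove(u)          # reclaim the RS entry
    return dispatched

rs = [{"id": "sub", "waiting": {"ROB37"}},
      {"id": "mul", "waiting": {"ROB10", "ROB37"}}]
out = wakeup_and_select(rs, {"ROB37"})
# ROB37's writeback wakes both uops, but only `sub` becomes fully
# ready; `mul` stays in the RS waiting for ROB10.
```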
Register Renaming example [Figure: the uop "Add EAX, EBX, EAX" in the IDQ is renamed by the RAT / Alloc into "Add ROB37, ROB19, RRF0", with the corresponding entries shown in the ROB, RS, and RRF]
Register Renaming example (2) [Figure: the next uop "sub EAX, ECX, EAX" is renamed by the RAT / Alloc into "sub ROB38, ROB23, ROB37"]
Out-of-order Core: Execution Units [Figure: RS dispatch ports - Port 0: IEU, FAU, SHF, FMU, FDIV, IDIV; Port 1: IEU, JEU; Port 2: AGU (load address); Ports 3,4: AGU (store address) and SDB, feeding the MIU and DCU; an internal 0-delay bypass within each EU, a 1st bypass in the MIU, and a 2nd bypass in the RS]
In-Order Retire [Block diagram, as before] • ROB: • Retires up to 4 uops per clock • Copies the values to the RRF • Retirement is done in order • Performs exception checking
In-order Retirement • The process of committing the results to the architectural state of the processor • Retire up to 4 uops per clock • Copy the values to the RRF • Retirement is done in order • Perform exception checking • An instruction is retired after the following checks: • The instruction has executed • All previous instructions have retired • The instruction isn't mispredicted • No exceptions
Pipeline: Fetch [Pipeline diagram: Predict/Fetch, Decode, Alloc, Schedule, EX, Retire; queues: IQ, IDQ, RS, ROB] • Fetch 16B from the I$ • Length-decode the instructions within the 16B • Write the instructions into the IQ
Pipeline: Decode [Pipeline diagram, as before] • Read 4 instructions from the IQ • Translate the instructions into uops • Asymmetric decoders (4-1-1-1) • Write the resulting uops into the IDQ
Pipeline: Allocate [Pipeline diagram, as before] • Allocate, port-bind and rename 4 uops • Allocate a ROB/RS entry per uop • If source data is available from the ROB or RRF, write the data to the RS • Otherwise, mark the data as not ready in the RS
Pipeline: EXE [Pipeline diagram, as before] • Ready/Schedule: • Check for data-ready uops whose needed functional unit is available • Select and dispatch ≤6 ready uops/clock to EXE • Reclaim RS entries • Write back results into the RS/ROB • Write results into the result buffer • Snoop the write-back ports for results that are sources of uops in the RS • Update the data-ready status of these uops in the RS
Pipeline: Retire [Pipeline diagram, as before] • Retire the ≤4 oldest uops in the ROB • A uop may retire if: • its ready bit is set • it does not cause an exception • all preceding candidates are eligible for retirement • Commit results from the result buffer to the RRF • Reclaim the ROB entry • In case of an exception: • Nuke and restart
Jump Misprediction - Flush at Retire • Flush the pipe when the mispredicted jump retires • Retirement is in order, so: • All the instructions preceding the jump have already retired • All the instructions remaining in the pipe follow the jump • Flush all the instructions in the pipe • Disadvantage: • The misprediction is known once the jump executes • We keep fetching instructions from the wrong path until the branch retires
Jump Misprediction - Flush at Execute • When the JEU detects a jump misprediction it: • Flushes the in-order front-end • Instructions already in the OOO part continue to execute • Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed • Starts fetching and decoding from the "correct" path • The "correct" path may still be wrong: • A preceding uop that hasn't executed may cause an exception • A preceding jump executed OOO can also mispredict • The "correct" instruction stream is stalled at the RAT • The RAT was wrongly updated by wrong-path instructions as well • When the mispredicted branch retires: • Reset all state in the Out-of-Order Engine (RAT, RS, ROB, MOB, etc.) • Only instructions following the jump are left - they must all be flushed • Reset the RAT to point only to architectural registers • Un-stall the in-order machine • The RS gets uops from the RAT and starts scheduling and dispatching them
Pipeline: Branch gets to EXE [Pipeline diagram: Fetch, Decode, Alloc, Schedule, JEU, Retire; queues: IQ, IDQ, RS, ROB]
Pipeline: Mispredicted Branch EXE [Pipeline diagram, as before; the JEU raises a flush] • Flush the front-end and re-steer it to the correct path • The RAT state was already updated by the wrong path • Block further allocation • Update the BPU • The OOO core is not flushed: instructions already in the OOO continue to execute • Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed • Block younger branches from clearing
Pipeline: Mispredicted Branch Retires [Pipeline diagram, as before; a clear is raised at retire] • When the mispredicted branch retires: • Flush the OOO core • Only instructions following the jump are left - they must all be flushed • Reset all state in the OOO (RAT, RS, ROB, MOB, etc.) • Reset the RAT to point only to architectural registers • Allow allocation of uops from the correct path
Instant Reclamation • Allows a faster recovery after a jump misprediction • Allows execution/allocation of uops from the correct path before the mispredicted jump retires • Every few cycles, take a checkpoint of the RAT • In case of a misprediction: • Flush the front-end and re-steer it to the correct path • Recover the RAT to the latest checkpoint taken prior to the misprediction • Recover the RAT to its exact state at the misprediction: • Rename 4 uops/cycle from the checkpoint up to the branch • Flush all uops younger than the branch in the OOO
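The recovery step above ("restore the latest checkpoint, then re-rename up to the branch") can be sketched as follows. The sequence numbers, checkpoint interval, and log representation are illustrative; the point is that only the few uops between the checkpoint and the branch need to be replayed, not the whole window:

```python
def recover_rat(checkpoints, rename_log, branch_seq):
    """Recover the RAT state just after a mispredicted branch.
    checkpoints: {seq: RAT snapshot dict} taken every few uops;
    rename_log: [(seq, arch_reg, phys_reg)] for every renamed uop."""
    # Latest checkpoint at or before the branch
    ckpt_seq = max(s for s in checkpoints if s <= branch_seq)
    rat = dict(checkpoints[ckpt_seq])
    # Replay the renames between the checkpoint and the branch
    # (a handful of uops, renamed a few per cycle)
    for seq, arch, phys in rename_log:
        if ckpt_seq < seq <= branch_seq:
            rat[arch] = phys
    return rat

ckpts = {0: {"eax": "RRF"}, 4: {"eax": "ROB3", "ebx": "ROB2"}}
log = [(5, "eax", "ROB5"), (6, "ecx", "ROB6"), (7, "eax", "ROB7")]
rat = recover_rat(ckpts, log, branch_seq=6)
# Uop 7 is younger than the branch: its rename is discarded,
# exactly as the wrong-path uops are flushed from the OOO.
```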
Instant Reclamation: Mispredicted Branch EXE [Pipeline diagram, as before; JEClear raised at the JEU, BPU update] • JEClear is raised on mispredicted macro-branches • Flush the front-end and re-steer it to the correct path • Flush all younger uops in the OOO • Update the BPU • Block further allocation
Pipeline: Instant Reclamation: EXE [Pipeline diagram, as before] • Restore the RAT from the latest checkpoint before the branch • Recover the RAT to its state just after the branch • Before any instruction on the wrong path • Meanwhile the front-end starts fetching and decoding instructions from the correct path
Pipeline: Instant Reclamation: EXE (2) [Pipeline diagram, as before] • Once done restoring the RAT: • Allow allocation of uops from the correct path
Large ROB and RS are Important • Large RS: • Increases the window in which we look for independent instructions • Exposes more parallelism potential • Large ROB: • The ROB is a superset of the RS: ROB size ≥ RS size • Allows covering long-latency operations (cache miss, divide) • Example: • Assume a Load misses the L1 cache • Data takes ~10 cycles to return, so ~30 new instructions get into the pipeline • Instructions following the Load cannot commit, and pile up in the ROB • Instructions independent of the Load are executed, and leave the RS • As long as the ROB is not full, we can keep executing instructions • A 40-entry ROB can cover an L1 cache miss • It cannot cover an L2 cache miss, which takes hundreds of cycles
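The example's arithmetic is worth making explicit: the ROB can hide a miss only while the uops piling up behind the stalled load still fit in it. A back-of-the-envelope sketch, assuming roughly 3 uops allocated per cycle (an illustrative rate, consistent with ~30 uops in ~10 cycles):

```python
def rob_covers(rob_size, alloc_per_cycle, miss_latency_cycles):
    """True if the ROB can absorb the uops allocated while a
    load at its head waits out a cache miss."""
    return miss_latency_cycles * alloc_per_cycle <= rob_size

l1_miss = rob_covers(rob_size=40, alloc_per_cycle=3, miss_latency_cycles=10)
l2_miss = rob_covers(rob_size=40, alloc_per_cycle=3, miss_latency_cycles=100)
# 10 cycles * 3 uops = 30 uops: fits in 40 entries, so execution
# continues during an L1 miss. 100 cycles * 3 = 300 uops: the ROB
# fills long before an L2/memory miss returns, and allocation stalls.
```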
P6 Caches • Blocking caches severely hurt OOO • A cache miss prevents other cache requests (which could possibly be hits) from being served • This hurts one of the main gains of OOO: hiding cache misses • Both the L1 and L2 caches in the P6 are non-blocking • They initiate the actions necessary to return data for a cache miss while they respond to subsequent cached data requests • Support up to 4 outstanding misses • Misses translate into outstanding requests on the P6 bus • The bus can support up to 8 outstanding requests • Subsequent requests for the same missed cache line are squashed • Squashed requests are not counted in the number of outstanding requests • Once the engine has executed beyond the 4 outstanding requests: • Subsequent load requests are placed in the load buffer
OOO Execution of Memory Operations • The RS operates based on register dependencies • The RS cannot detect memory dependencies: movl %eax, -4(%ebp) # store: MEM[ebp-4] ← eax movl -4(%ebp), %ebx # load: ebx ← MEM[ebp-4] • The RS dispatches memory uops when the data for the address calculation is ready, and the MOB and Address Generation Unit (AGU) are free • The AGU computes the linear address: Segment-Base + Base-Address + (Scale*Index) + Displacement • It sends the linear address to the MOB, to be stored in the Load Buffer or Store Buffer • The MOB resolves memory dependencies and enforces memory ordering • Some memory dependencies can be resolved statically: store r1,a ; load r2,b - the load can advance before the store • Problem: some cannot: store r1,[r3] ; load r2,b - the load must wait until r3 is known
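The AGU's linear-address formula named above is straightforward to sketch. The operand values are purely illustrative; only the formula (segment base + base + scale*index + displacement, modulo the 32-bit address space) comes from the slide:

```python
def agu_linear_address(seg_base, base, index, scale, disp):
    """Linear address as computed by the AGU:
    Segment-Base + Base + (Scale * Index) + Displacement, mod 2^32."""
    assert scale in (1, 2, 4, 8)  # legal x86 scale factors
    return (seg_base + base + scale * index + disp) & 0xFFFFFFFF

# e.g. an access like movl -8(%ebx,%esi,4), %eax with a flat segment:
addr = agu_linear_address(seg_base=0, base=0x1000, index=4, scale=4, disp=-8)
```

Note that two uops with syntactically different operands can still compute the same linear address, which is why the MOB, not the RS, must resolve memory dependencies.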
Load and Store Ordering • x86 has a small register set, so it uses memory often • Preventing Stores from passing Stores/Loads: 3%~5% perf. loss • The P6 chooses not to allow Stores to pass Stores/Loads • Preventing Loads from passing Loads/Stores: big perf. loss • The P6 allows Loads to pass Stores, and Loads to pass Loads • Stores are not executed OOO • Stores are never performed speculatively • There is no transparent way to undo them • Stores are also never re-ordered among themselves • The Store Buffer dispatches a store only when: • the store has both its address and its data, and • there are no older stores awaiting dispatch • A store commits its write to memory (DCU) at retirement
Store Implemented as 2 Uops • A store is decoded as two independent uops: • STA (store-address): calculates the address of the store • STD (store-data): stores the data into the Store Data Buffer • The actual write to memory is done when the store retires • Separating STA & STD is important for memory OOO • It allows the STA to dispatch earlier, even before the data is known • Address conflicts are resolved earlier, opening the memory pipeline for other loads • STA and STD can be issued to execution units in parallel • STA is dispatched to the AGU when its sources (base+index) are ready • STD is dispatched to the SDB when its source operand is available
Memory Order Buffer (MOB) • Store Coloring: • Each Store is allocated in-order in the Store Buffer, and gets an SBID • Each Load is allocated in-order in the Load Buffer, and gets an LBID + the current SBID • A Load is checked against all previous stores • Stores with SBID ≤ the load's SBID • A Load is blocked if: • There is an unresolved address of a relevant STA • There is a STA to the same address, but the data is not ready • Resources are missing (DTLB miss, DCU miss) • The MOB writes the blocking info into the load buffer • It re-dispatches the load when a wake-up signal is received • If the Load is not blocked, it is executed (bypassed)
MOB (Cont.) • If a Load misses in the DCU: • The DCU marks the write-back data as invalid • It assigns a fill buffer to the load, and issues an L2 request • When the critical chunk is returned, wake up and re-dispatch the load • Store → Load Forwarding: • If an older STA has the same address as the load and its data is ready, the Load gets its data directly from the SB (no DCU access) • Memory Disambiguation: • The MOB predicts if a load can proceed despite unknown STAs • Predict colliding: block the Load if there is an unknown STA (as usual) • Predict non-colliding: execute even if there are unknown STAs • In case of a wrong prediction: • The entire pipeline is flushed when the load retires
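The load check described on the last two slides (store coloring plus blocking and forwarding) can be sketched as a scan over the Store Buffer: only stores with SBID ≤ the load's SBID are relevant, and the youngest relevant store to the load's address decides the outcome. This is a simplified model assuming whole-entry address matches; the real MOB also handles partial overlaps and resource misses:

```python
def check_load(load_addr, load_sbid, store_buffer):
    """store_buffer: list of (sbid, addr or None, data or None),
    oldest first; addr None means the STA has not executed yet.
    Returns (status, forwarded_data)."""
    result = ("hit_cache", None)          # default: read the DCU
    for sbid, addr, data in store_buffer:
        if sbid > load_sbid:
            continue                       # younger than the load: ignore
        if addr is None:
            result = ("block_unknown_sta", None)   # unresolved STA
        elif addr == load_addr:
            # Matching STA: forward if the STD has delivered the data
            result = ("forward", data) if data is not None \
                     else ("block_no_std", None)
    return result

sb = [(1, 0x100, 7), (2, 0x200, None)]    # SBID 2: STA done, STD pending
r1 = check_load(0x100, load_sbid=3, store_buffer=sb)  # forwards 7 from SB
r2 = check_load(0x200, load_sbid=3, store_buffer=sb)  # blocked on STD
```

Scanning oldest to youngest and letting later entries overwrite the result makes the youngest relevant store win, which matches the forwarding rule: a load must see the most recent older store to its address.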
Pipeline: Load: Allocate [Pipeline diagram: Alloc, Schedule, AGU, LB Write, MOB, DTLB, DCU, WB, Retire; queues: IDQ, RS, ROB, LB] • Allocate ROB/RS and MOB entries • Assign a Store ID (SBID) to enable ordering
Pipeline: Bypassed Load: EXE [Pipeline diagram, as before] • The RS checks when the data used for the address calculation is ready • The AGU calculates the linear address: DS-Base + base + (Scale*Index) + Disp. • Write the load into the Load Buffer • DTLB: Virtual → Physical translation + DCU set access • The MOB checks blocking and forwarding • DCU read / Store Data Buffer read (Store → Load forwarding) • Write back data / write the block code
Pipeline: Blocked Load Re-dispatch [Pipeline diagram, as before] • The MOB determines which loads are ready, and schedules one • The load arbitrates for the MEU • DTLB: Virtual → Physical translation + DCU set access • The MOB checks blocking/forwarding • DCU way select / Store Data Buffer read • Write back data / write the block code
Pipeline: Load: Retire [Pipeline diagram, as before] • Reclaim the ROB and LB entries • Commit results to the RRF
Pipeline: Store: Allocate [Pipeline diagram: Alloc, Schedule, AGU, SB, DTLB, Retire] • Allocate ROB/RS entries • Allocate a Store Buffer entry
Pipeline: Store: STA EXE [Pipeline diagram, as before; the SB holds the virtual and physical addresses] • The RS checks when the data used for the address calculation is ready • It dispatches the STA to the AGU • The AGU calculates the linear address • Write the linear address to the Store Buffer • DTLB: Virtual → Physical translation • Load Buffer Memory Disambiguation verification • Write the physical address to the Store Buffer