Advanced Microarchitecture Lecture 6: Superscalar Decode and Other Pipelining
RISC ISA Format • This should be review… • Fixed-length • MIPS: all insts are 32 bits (4 bytes) • Few formats • MIPS has 3: R-, I-, J-formats • Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP • Regularity across formats (when possible/practical) • MIPS, Alpha: opcode in same bit-position for all formats • MIPS: rs & rt fields in same bit-position for R- and I-formats (the J-format has no register fields) • Alpha: ra/fa field in same bit-position for all 5 formats
RISC Decode (MIPS) • Instruction layout: opcode (6 bits) | other (20 bits) | func (6 bits, R-format only) • Decode on opcode[5,3] first, then opcode[2,0] • 001xxx = Immediate • 1xxxxx = Memory (1x0: LD, 1x1: ST) • 000xxx = Br/Jump (except for 000000, which is R-format and uses func)
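The opcode patterns on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the opcode is taken from bits 31:26 of the instruction word, the function name `classify_mips` is hypothetical, and only the slide's simplified patterns are covered, not the full MIPS opcode map.

```python
def classify_mips(inst: int) -> str:
    """Classify a 32-bit MIPS word using only the slide's opcode patterns."""
    opcode = (inst >> 26) & 0x3F   # opcode occupies the top 6 bits
    hi = opcode >> 3               # opcode[5,3]
    if opcode == 0b000000:
        return "R-format"          # operation is chosen by the func field
    if hi & 0b100:                 # 1xxxxx = Memory
        return "store" if hi & 0b001 else "load"   # 1x1: ST, 1x0: LD
    if hi == 0b001:                # 001xxx = Immediate
        return "immediate"
    if hi == 0b000:                # 000xxx = Br/Jump
        return "branch/jump"
    return "other"                 # patterns the slide does not cover


if __name__ == "__main__":
    for word in (0x00000020, 0x20000000, 0x8C000000, 0xAC000000, 0x10000000):
        print(f"{word:#010x}: {classify_mips(word)}")
```

Because every field sits at a fixed bit position, the whole classification is a handful of shifts and masks, which is why replicating the decoder is cheap for a RISC ISA.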
Superscalar Decode for RISC ISAs • To sustain X instructions per cycle, must decode X instructions per cycle • Just duplicate the hardware • Scalar: 1-wide fetch, one 32-bit inst, one Decoder, one decoded inst • Superscalar: 4-wide fetch, four 32-bit insts, four Decoders, four decoded insts
VLIW/EPIC ISAs • Compiler finds the parallelism, packs multiple instructions into a “very long” instruction • RISC: Add, Load, Branch, Sub, Store, Xor as separate instructions • VLIW (EPIC-like): “template” bits plus the instructions of one bundle, e.g. IMB | Add | Load | Branch and IMI | Sub | Store | Xor
VLIW Decoder • Similar to superscalar RISC decoder • Bundle: tmplt | inst0 | inst1 | inst2 • A Template Decoder for tmplt, plus one Decoder per instruction slot, each producing a decoded inst
CISC ISA • RISC focus on fast access to information • easy decode, I$, large RF’s, D$ • CISCs are older • designed in era with fewer transistors, chips • each memory access very expensive • pack as much work into as few bytes as possible • more “expressive” instructions • compare to simple RISC insts • better potential code generation • more complex code generation in practice
Example: VAX • Superset of ISAs, incl. IBM360, DEC PDP-11 • VAX = “Virtual Address Extension” • 16 32-bit registers • 32-bit memory addressing • Encoding: • 1-2 byte opcode, followed by 0-6 operand specifiers, each of which may be up to 5 bytes • Opcode implies datatype, size, # operands • Orthogonality: any opcode with any addressing mode
VAX operand addressing Any mode could be applied to any instruction!
x86 • CISC, stemming from the original 8086 • Many ways to do the same/similar operation • Example: “Move” instructions • General Purpose data movement • R→R, M→R, R→M, I→R, I→M • Exchanges • EAX ↔ ECX, byte order within a register • Stack Manipulation • push/pop: R ↔ Stack, PUSHA/POPA • Type Conversion • Conditional Moves
x86 Encoding • Basic x86 Instruction: Prefixes (0-4 bytes) | Opcode (1-2 bytes) | Mod R/M (0-1 bytes) | SIB (0-1 bytes) | Displacement (0/1/2/4 bytes) | Immediate (0/1/2/4 bytes) • Shortest inst: 1 byte; longest inst: 15 bytes • Opcode specifies the operation, and whether the Mod R/M byte is used • Most instructions use the Mod R/M byte • Mod R/M specifies whether the optional SIB byte is used • Mod R/M and SIB may specify additional constants
Mod R/M Byte • Layout: MM rrr mmm = Mode (2 bits) | Register (3 bits, 1 of 8 registers) | R/M (3 bits) • Mode = 00: No displacement, use Mem[ reg_mmm ] • Mode = 01: 8-bit displacement, Mem[ reg_mmm + SExt(disp) ] • Mode = 10: 32-bit displacement (similar to previous) • Mode = 11: Register-to-Register, use reg_mmm • Examples: Mod R/M = 11 011 001 is Add EBX, ECX; Mod R/M = 00 011 001 is Add EBX, [ECX]
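A small sketch of the field split described on this slide, written in Python. The helper names `split_modrm` and `operands` are hypothetical, the register list is the usual IA-32 numbering, and the sketch deliberately ignores the R/M = 4 and Mod=00, R/M = 5 special cases.

```python
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def split_modrm(byte: int):
    """Return the (mode, register, r/m) fields of a Mod R/M byte."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def operands(byte: int, disp: int = 0):
    """Return (register operand, r/m operand) as strings.

    Simplification: R/M = 4 (SIB) and Mod=00, R/M = 5 are not handled.
    """
    mode, reg, rm = split_modrm(byte)
    if mode == 0b11:
        rm_text = REG32[rm]                      # register-to-register
    elif mode == 0b00:
        rm_text = f"[{REG32[rm]}]"               # no displacement
    else:
        rm_text = f"[{REG32[rm]}{disp:+#x}]"     # 8- or 32-bit displacement
    return REG32[reg], rm_text


if __name__ == "__main__":
    print("Add", *operands(0b11011001))   # the slide's register-register example
    print("Add", *operands(0b00011001))   # the slide's register-memory example
```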
Exceptions • Mod=00, R/M = 5 → get operand from 32-bit immediate • Example: Mod R/M = 00 010 101 followed by 0cff1234 is Add EDX = EDX+Mem[0cff1234] • Mod=00, 01 or 10, R/M = 4 → use the “SIB” byte • SIB = Scale/Index/Base, laid out as ss iii bbb (e.g. Mod R/M = 00 010 100 followed by a SIB byte) • iii ≠ 4: use si, where si = reg_iii << ss • iii = 4: use 0 • bbb ≠ 5: use reg_bbb • bbb = 5: use 32-bit imm (Mod = 00 only)
Opcode Confusion • There are different opcodes for A←B and B←A • MOV EAX, EBX = 10001011 11 000 011 • MOV EBX, EAX = 10001001 11 000 011 • MOV EAX, EBX = 10001001 11 011 000 • If Opcode = 0F, then use next byte as opcode • If Opcode = D8-DF, then FP instruction (e.g. 11011000 followed by a Mod R/M byte holding 11 | FP opcode | R/M)
x86 Decode Example • Instruction bytes: opcode | Mod R/M | SIB | Disp | Imm • opcode = 11000111: MOV reg/mem ← imm (use Mod R/M, 32-bit Imm to follow) • Mod R/M = 10000100: Mod=2 (use 32-bit Disp), reg ignored, R/M = 4 (use SIB) • SIB = 11000011: ss=3 (scale by 8), use EAX, EBX • Effect: *( (EAX<<3) + EBX + Disp ) = Imm • Total: 11 byte instruction • Note: Add 4 prefixes, and you reach the max size
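The byte count in this example can be checked with a short Python sketch. It assumes 32-bit addressing and an opcode that is always followed by a 32-bit immediate; the function names `addressing_bytes`, `mov_imm32_length` and `sib_address` are hypothetical, and the bbb = 5 case is not modeled in the address calculation.

```python
def addressing_bytes(modrm: int, sib: int = 0) -> int:
    """Bytes used by Mod R/M, the optional SIB byte, and the displacement."""
    mod, rm = modrm >> 6, modrm & 0b111
    n = 1                                   # the Mod R/M byte itself
    if mod == 0b11:
        return n                            # register operand: nothing follows
    has_sib = rm == 0b100
    if has_sib:
        n += 1                              # SIB byte
    if mod == 0b01:
        n += 1                              # 8-bit displacement
    elif mod == 0b10:
        n += 4                              # 32-bit displacement
    elif rm == 0b101 or (has_sib and (sib & 0b111) == 0b101):
        n += 4                              # Mod=00 exceptions: 32-bit constant
    return n


def mov_imm32_length(modrm: int, sib: int = 0, prefixes: int = 0) -> int:
    """Length of prefixes + 1-byte opcode + addressing bytes + 32-bit immediate."""
    return prefixes + 1 + addressing_bytes(modrm, sib) + 4


def sib_address(sib: int, disp: int, regs) -> int:
    """Effective address (index << scale) + base + disp; bbb = 5 not modeled."""
    ss, iii, bbb = sib >> 6, (sib >> 3) & 0b111, sib & 0b111
    index = 0 if iii == 0b100 else regs[iii] << ss
    return index + regs[bbb] + disp


if __name__ == "__main__":
    print(mov_imm32_length(0b10000100, 0b11000011))              # slide's example
    print(mov_imm32_length(0b10000100, 0b11000011, prefixes=4))  # with 4 prefixes
```

Note that the length depends on the opcode, on two fields of Mod R/M, and on one field of SIB, which is what makes finding the next instruction boundary expensive.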
In RISC (MIPS) • lui R1 = Disp[31:16] • ori R1 = R1, Disp[15:0] • add R1 = R1 + R2 • shli R3 = R3 << 3 • add R3 = R3 + R1 • lui R1 = Imm[31:16] • ori R1 = R1, Imm[15:0] • st [R3] R1 • 8 instructions, 32 bits each • 32 bytes total • 2.9x bigger than the 11-byte x86 instruction!
x86-64 / EM64T • 8→16 general purpose registers • only 3-bit register fields? • Registers extended from 32→64 bits each • Default: instructions still 32-bit • New “REX” prefix byte to specify additional information • Layout: REX (0100 m R I B) | opcode | Mod R/M (md rrr mrm) | SIB (ss iii bbb) • m=0: 32-bit mode, m=1: 64-bit mode • R extends rrr, I extends iii, B extends bbb • Register specifiers are now 4 bits each: can choose 1 of 16 registers
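How the REX bits widen the 3-bit register fields can be sketched as follows. This is a hedged illustration, not a full decoder: the function name `extended_specifiers` is hypothetical, the bit names R, I, B follow the slide (vendor manuals call the I bit X), and the case where B extends the mrm field of an instruction without a SIB byte is left out.

```python
def extended_specifiers(rex: int, modrm: int, sib: int):
    """Return the 4-bit (reg, index, base) numbers formed from REX + rrr/iii/bbb."""
    if rex >> 4 != 0b0100:
        raise ValueError("not a REX prefix (upper nibble must be 0100)")
    r = (rex >> 2) & 1            # extends rrr in Mod R/M
    i = (rex >> 1) & 1            # extends iii in SIB
    b = rex & 1                   # extends bbb in SIB
    rrr = (modrm >> 3) & 0b111
    iii = (sib >> 3) & 0b111
    bbb = sib & 0b111
    # Each REX bit becomes the most significant bit of a 4-bit specifier.
    return (r << 3) | rrr, (i << 3) | iii, (b << 3) | bbb


if __name__ == "__main__":
    # REX = 0100 0101: R = 1, I = 0, B = 1
    print(extended_specifiers(0b01000101, 0b10000100, 0b11000011))
```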
64-bit Extensions to IA32 [Figure labels: IA32; IA32+64-bit exts; CPU architect] Ugly? Scary? … but it works (Taken from Bob Colwell’s Eckert-Mauchly Award Talk, ISCA 2005)
x86 Decode Hardware [Figure labels: Instruction bytes; Prefix Decoder; Num Prefixes; three Left Shift blocks; two adders (+); opcode decoder; 2nd opcode decoder; Mod R/M decoder; SIB decoder]
Decoded x86 Format • RISC: easy to expand to the union of needed info • generalized opcode (not too hard) • reg1, reg2, reg3, immediate (possibly extended) • some fields ignored • CISC: union of all possible info is huge • generalized opcode (too many options) • up to 3 regs, 2 immediates • segment information • “rep” specifiers • would lead to 100’s of bits • common case only needs a fraction → a lot of waste
x86 RISC-like uops • Each x86 instruction decoded into a variable number of “uops” (micro-ops - Intel) or ROPs (RISC ops - AMD) • Each uop is RISC-like • Uops have limitations to keep union of info practical • ADD EAX, EBX → 1 uop: ADD EAX, EBX • ADD EAX, [EBX] → 2 uops: Load tmp = [EBX]; ADD EAX, tmp • ADD [EAX], EBX → 4 uops: Load tmp = [EAX]; ADD tmp, EBX; STA EAX; STD tmp
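The three ADD cases on this slide can be mimicked with a toy cracker in Python. It is only a sketch: operands are plain strings, a memory operand is anything wrapped in square brackets, and the function names `is_mem` and `crack_add` are hypothetical. Real decoders emit encoded uops, not text.

```python
def is_mem(operand: str) -> bool:
    """Treat an operand written as [REG] as a memory reference."""
    return operand.startswith("[") and operand.endswith("]")


def crack_add(dst: str, src: str) -> list:
    """Crack 'ADD dst, src' into the uop sequences shown on the slide."""
    if is_mem(dst) and is_mem(src):
        raise ValueError("ADD cannot take two memory operands")
    if is_mem(dst):
        addr = dst[1:-1]
        # load, operate, then store address (STA) and store data (STD)
        return [f"Load tmp = {dst}", f"ADD tmp, {src}", f"STA {addr}", "STD tmp"]
    if is_mem(src):
        return [f"Load tmp = {src}", f"ADD {dst}, tmp"]
    return [f"ADD {dst}, {src}"]


if __name__ == "__main__":
    for dst, src in (("EAX", "EBX"), ("EAX", "[EBX]"), ("[EAX]", "EBX")):
        uops = crack_add(dst, src)
        print(f"ADD {dst}, {src}: {len(uops)} uop(s): {'; '.join(uops)}")
```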
uop Limits • How many uops can a decoder generate? • For complex x86 insts, many are needed (10’s, 100’s?) • Makes decoder horribly complex • Typically there’s a limit to keep complexity under control • One x86 instruction → 1-4 uops • Most instructions translate to 1.5-2.0 uops • Ok, what happens if a complex instruction needs more than 4 uops?
UROM/MS for Complex x86 Insts • UROM (ucode-ROM) stores the uop equivalents for nasty x86 instructions • “Nasty” could be large/complex (> 4 uops like PUSHA or STRREP.MOV) or obsolete instructions (AAA) • Microsequencer (MS) is the control logic that interfaces between the post-decode pipestages, the UROM, the decoders and the PC-generation
UROM/MS Example (3 uop-wide) • Complex instructions get uops from ucode sequencer [Figure: cycle-by-cycle table (Cycle 1 through Cycle n+1) with rows Fetch (x86 insts), Decode (uops) and UROM (uops), following ADD, STORE, SUB, REP.MOV, ADD [ ], XOR, LOAD and INC, including the STA, STD and uJCC uops, through the front end]
Superscalar CISC Decode • Instruction Length Decode (ILD) • Where are the instructions? • Limited decode – just enough to parse prefixes, modes • Shift/Alignment • Get the right bytes to the decoders • Decode • Crack into uops And then do this for N instructions per cycle!
ILD Recurrence/Loop • PC_i = X • PC_{i+1} = PC_i + sizeof( Mem[PC_i] ) • PC_{i+2} = PC_{i+1} + sizeof( Mem[PC_{i+1}] ) = PC_i + sizeof( Mem[PC_i] ) + sizeof( Mem[PC_{i+1}] ) • Can’t find start of next instruction without decoding the first • Critical loop not pipelineable • ILD of 4 instructions per cycle implies that clock cycle time will be at least 4 x latency(ILD)
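The recurrence can be written out directly, which makes the serial dependence visible. This Python sketch uses a made-up toy encoding in which the first byte of every instruction holds its length; `toy_sizeof` and `find_starts` are hypothetical names, and real x86 length decode is far more involved.

```python
def toy_sizeof(code: bytes, pc: int) -> int:
    """Toy encoding (an assumption): the first byte of an instruction is its length."""
    return code[pc]


def find_starts(code: bytes, pc: int, n: int, sizeof=toy_sizeof) -> list:
    """Return the start addresses of the next n instructions."""
    starts = []
    for _ in range(n):
        starts.append(pc)
        # PC_{i+1} = PC_i + sizeof(Mem[PC_i]): each step needs the previous one.
        pc += sizeof(code, pc)
    return starts


if __name__ == "__main__":
    # Four instructions of lengths 3, 5, 1 and 2.
    code = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1, 2, 0])
    print(find_starts(code, 0, 4))
```

Each loop iteration cannot begin until the previous `sizeof` has finished, so n instructions per cycle means n length decodes back to back within one cycle.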
Decode Implementation [Figure, spanning Cycle 1 to Cycle 3: Instruction Bytes (ex. 16 bytes) pass through three ILD (limited decode) blocks giving Length 1, Length 2 and Length 3; adders (+) accumulate the bytes decoded; a Left Shifter separates Inst 1, Inst 2, Inst 3 and the Remainder; Decoder 1, Decoder 2 and Decoder 3 decode them] • ILD dominates cycle time; not scalable
Hardware-Intensive Decode • Decode from every possible instruction starting point! [Figure: 16 ILD blocks, one per byte position] • Giant MUXes to select instruction bytes [Figure: three Decoders fed by the MUXes]
ILD in Hardware-Intensive Approach [Figure: 16 Decoder blocks, one per byte position, with labels 6 bytes, 4 bytes, 3 bytes and an adder (+)] • Previous: 3 ILD + 2 add • Now: 1 ILD + 2 (mux + add) • Total bytes decode = 11
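The idea of length-decoding at every byte position and then selecting can be sketched in Python. The same made-up toy encoding is assumed (a byte's value is the length of an instruction starting there, with 0 treated as 1), and `toy_sizeof` and `parallel_ild` are hypothetical names; in hardware the first step is one ILD block per byte working simultaneously.

```python
def toy_sizeof(code: bytes, pc: int) -> int:
    """Toy encoding (an assumption): a byte's value is the length if an inst starts here."""
    return max(1, code[pc])


def parallel_ild(code: bytes, n: int, sizeof=toy_sizeof) -> list:
    """Return the start offsets of up to n instructions in a fetch block."""
    # Step 1: speculative length at every offset. No entry depends on another,
    # so all of these can be computed at the same time (1 ILD latency).
    lengths = [sizeof(code, off) for off in range(len(code))]
    # Step 2: walk the table. Each step is only a select (mux) plus an add.
    starts, pc = [], 0
    while len(starts) < n and pc < len(code):
        starts.append(pc)
        pc += lengths[pc]
    return starts


if __name__ == "__main__":
    code = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1])
    print(parallel_ild(code, 3))
```

Most of the speculative lengths are thrown away, which is the hardware and power cost the approach pays to shorten the critical path.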
Predecoding • ILD loop is hardware intensive, impacts latency, and can consume substantial power • Observation: when instructions A, B and C are decoded into lengths 3, 5 and 1, the next time we encounter A, B and C, their lengths will still be the same! • cache the ILD work • do once, reuse many times
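"Do once, reuse many times" is essentially memoization of the length decode. The Python sketch below keeps the decoded lengths in a dictionary keyed by address and counts how often the expensive path runs. The class name `PredecodeCache` and the toy encoding are assumptions for illustration; real designs store the predecode bits alongside the bytes in the I$ and compute them when the line is filled.

```python
def toy_sizeof(code: bytes, pc: int) -> int:
    """Toy encoding (an assumption): the first byte of an instruction is its length."""
    return code[pc]


class PredecodeCache:
    """Remember instruction lengths so the ILD work is done once per address."""

    def __init__(self, code: bytes, sizeof=toy_sizeof):
        self.code = code
        self.sizeof = sizeof
        self.lengths = {}      # address -> cached length
        self.ild_calls = 0     # how many times the expensive ILD actually ran

    def length(self, pc: int) -> int:
        if pc not in self.lengths:
            self.ild_calls += 1
            self.lengths[pc] = self.sizeof(self.code, pc)
        return self.lengths[pc]


if __name__ == "__main__":
    code = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1])   # A, B, C with lengths 3, 5, 1
    cache = PredecodeCache(code)
    for _ in range(2):                          # run the same code twice
        pc = 0
        while pc < len(code):
            pc += cache.length(pc)
    print(cache.ild_calls, cache.lengths)
```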
Decoder Example: AMD K5 • From Memory: 8 bytes (b0 b1 b2 … b7), 8 bits each • Predecode Logic adds +5 bits per byte: 8 x (8-bit inst + 5-bit predecode) = 13 bytes • I$ stores the predecoded bytes • I$ to Decode: 16 x (8-bit inst + 5-bit predecode) • Decode: up to 4 ROPs
Decoder Example: AMD K5 • Predecode information makes decode easier • Instruction start/end location (ILD) • Number of ROPs needed per inst • Opcode and prefix locations • Power/performance tradeoffs • Larger I$ (5 extra bits per 8-bit byte increases data by 62.5%) • Longer I$ latency, more I$ power consumption • Remove logic from decode • Shorter branch mispred penalty, simpler logic • Cache and reuse decode work → less decode power • Longer effective IL1 miss latency
Limits on Decode • Max branches (color allocation) • Taken branches • Incomplete instructions • x86 insts are not aligned, may span two cache lines • can’t decode until both halves have been fetched • Instruction complexity • decoding “complex” (2-4 uop) instructions requires a more complex decoder; expensive to replicate • compromise: fewer complex decoders plus simpler decoders for instructions with single-uop mappings
Decoder Example: Intel P-Pro • 16 Raw Instruction Bytes feed Decoder 0 (4 uops), Decoder 1 (1 uop) and Decoder 2 (1 uop) • Branch Address Calculator: fetch resteer if needed • If instruction in Decoder 1 or 2 requires > 1 uop, do not generate any output, and then shift to Decoder to the left on next cycle • Only Decoder 0 can interface with the uROM and MS
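The effect of the 4-1-1 rule on decode throughput can be approximated with a small Python model. It is a simplification under stated assumptions: instructions are represented only by their uop counts, decode is in order, a multi-uop instruction that lands on Decoder 1 or 2 ends the group and is retried on Decoder 0 in the next cycle, and instructions needing more than 4 uops (the uROM/MS path) are rejected. The name `decode_cycles_411` is hypothetical.

```python
def decode_cycles_411(uop_counts: list) -> list:
    """Group instructions (given as uop counts) into per-cycle decode groups."""
    cycles, i = [], 0
    while i < len(uop_counts):
        if uop_counts[i] > 4:
            raise ValueError("needs the uROM/MS path, which is not modeled here")
        group = [uop_counts[i]]        # Decoder 0 accepts any 1-4 uop instruction
        i += 1
        # Decoders 1 and 2 accept only single-uop instructions, in program order.
        while len(group) < 3 and i < len(uop_counts) and uop_counts[i] == 1:
            group.append(1)
            i += 1
        cycles.append(group)
    return cycles


if __name__ == "__main__":
    for stream in ([4, 1, 1], [1, 2, 1, 1], [2, 2, 2]):
        groups = decode_cycles_411(stream)
        print(stream, "->", groups, f"({len(groups)} cycle(s))")
```

In this model a stream of back-to-back 2-uop instructions decodes at only one instruction per cycle, while a 4/1/1 ordering reaches the peak of 6 uops in one cycle.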
Decoder Example: Intel P4 • L2 Cache supplies raw instruction bytes to a 4-uop Decoder (with uROM): decode at most one inst per cycle • Decoded uops go through the trace const. buffer into the Trace Cache: fetch up to 3 uops per cycle • P4 has a strangled front-end: at best it can only deliver 3 uops per cycle; contrast to P-Pro, which can deliver up to 6 uops per cycle (if they’re 4/1/1) • More on this when we study the P4 in detail
Pipeline Control • Stages: BPred, I$, I$, ILD, Rot, Dec, Dec, Ren, Disp (grouped as Fetch, Decode, Dispatch) • ROB, LSQ, RS full → stall front-end • This logic starting to get pretty intense • Except not everyone stalls…
Non-Uniform Stall • Just because there’s a stall condition somewhere does not imply that everybody has to stall [Figure: stages BP I$ I$ ILD Rot Dec Dec Ren Disp, each marked full or not full; ROB Full!; nops due to I$ miss]
Compressing/Serpentine Pipelines • Better “flow”, but much more complex since need to track how many insts can advance per stage [Figure: stages BP I$ I$ ILD Rot Dec Dec Ren Disp; ROB Full!; 1 entry free]
Lots o’ Stalls • I$ miss, ITLB miss • Decoder limitations • x86 4-1-1 limit • branch limits (max/cycle, max taken/cycle) • Renamer – out of physical registers
Smaller Control Domains • Separate long pipeline into multiple smaller pipelines [Figure: the single pipeline BPred I$ I$ Dec Dec Dec Ren Alloc Sched, then the same stages split into segments such as BPred; I$ I$; Dec Dec Dec; Ren Alloc Sched]
Smaller Control Domains (2) • Non-decoupled pipe needed logic to simultaneously control ~10 stages • Decoupled pipe needs multiple control logic circuits • each only needs to interact with ~5 stages (~3 real stages, plus the queue ahead and behind) • No direct control logic for stages outside of local pipeline [Figure: Pipeline Control Logic for the Decoupled case (I$ Dec Dec Dec Ren, with stages outside the local pipeline marked X) and for the Non-decoupled case]
Smaller Control Domains (3) • Queues can effectively add more pipeline stages [Figure: previous stage, enqueue logic, inter-pipe queue (latch), dequeue logic, next stage, with the cycle boundary marked] • Avoid this by writing and reading in the same cycle (affects timing, complexity)
Queues provide Smoothing • Approximation to serpentine pipes (compress only at certain locations – i.e., the queues) • Different levels of decoupling possible depending on frequency target, power, complexity tolerance • Note: RS is effectively a queue (more later) [Figure: BPred I$ I$ Dec Ren Sched, the “SimpleScalar” pipeline]
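The smoothing effect of a queue between two pipeline segments can be seen in a toy cycle-by-cycle model. This Python sketch makes several assumptions that are not from the slides: the per-cycle supply and demand patterns are invented, the back end drains before the front end fills (so an entry spends at least one cycle in the queue), and the function name `delivered_insts` is hypothetical.

```python
from collections import deque


def delivered_insts(supply: list, demand: list, capacity: int) -> int:
    """Count instructions reaching the back end through a queue of given capacity.

    supply[c]: instructions the front end could produce in cycle c.
    demand[c]: instructions the back end could accept in cycle c.
    """
    queue, delivered = deque(), 0
    for can_supply, can_accept in zip(supply, demand):
        # Back end reads entries that were enqueued in earlier cycles.
        for _ in range(min(can_accept, len(queue))):
            queue.popleft()
            delivered += 1
        # Front end fills the free slots; anything beyond that is a front-end stall.
        for _ in range(min(can_supply, capacity - len(queue))):
            queue.append(1)
    return delivered


if __name__ == "__main__":
    supply = [3, 3, 0, 0, 3, 3]   # e.g. an I$ miss in cycles 2-3
    demand = [0, 0, 3, 3, 0, 0]   # e.g. back end stalled, then draining
    for capacity in (3, 6):
        print(capacity, delivered_insts(supply, demand, capacity))
```

With the larger queue the front end keeps working while the back end is stalled, and the back end keeps working through the front end's miss; with the smaller queue half of that work is lost to stalls.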
Different Clocking Domains • Decoupling the pipe allows each segment to operate independently (local control) • Also means each can run at different speeds (P4) [Figure: TC TC Dec … Alloc Sched … WB Commit, with the queues (uopQ), (IAQ) and (ROB) separating segments; segment rates: 1x freq (3 uops/Mclk); ½x freq (6 uops per 2 Mclk’s = 3 uops/Mclk); 2x freq (2 uops/Fclk = 4 uops/Mclk); 1x freq (3 uops/Mclk)]
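The rate conversions on this slide are simple arithmetic: uops per local clock times the local clock's frequency relative to the main clock (Mclk). A minimal Python check, with the hypothetical helper name `uops_per_mclk`:

```python
from fractions import Fraction


def uops_per_mclk(uops_per_local_clk: int, freq_ratio) -> Fraction:
    """Convert a per-local-clock rate into uops per main clock (Mclk).

    freq_ratio is the local clock frequency divided by the Mclk frequency.
    """
    return uops_per_local_clk * Fraction(freq_ratio)


if __name__ == "__main__":
    print(uops_per_mclk(3, 1))                 # 1x freq segment
    print(uops_per_mclk(6, Fraction(1, 2)))    # half-speed segment, 6 uops per slow clock
    print(uops_per_mclk(2, 2))                 # double-speed segment, 2 uops per Fclk
```

The half-speed segment has to be twice as wide to match the 3 uops/Mclk of its neighbors, while the double-speed segment can be narrower and still offer more bandwidth.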