200 likes | 323 Views
CSE 502: Computer Architecture. Superscalar Decode. 4-wide superscalar fetch. 32-bit inst. 32-bit inst. 32-bit inst. 32-bit inst. Decoder. Decoder. Decoder. Decoder. decoded inst. decoded inst. decoded inst. decoded inst. superscalar. Superscalar Decode for RISC ISAs.
E N D
CSE 502:Computer Architecture Superscalar Decode
4-wide superscalar fetch 32-bit inst 32-bit inst 32-bit inst 32-bit inst Decoder Decoder Decoder Decoder decoded inst decoded inst decoded inst decoded inst superscalar Superscalar Decode for RISC ISAs • Decode X insns. per cycle (e.g., 4-wide) • Just duplicate the hardware • Instructions aligned at 32-bit boundaries 1-Fetch 32-bit inst Decoder decoded inst scalar
uop Limits • How many uops can a decoder generate? • For complex x86 insts, many are needed (10’s, 100’s?) • Makes decoder horribly complex • Typically there’s a limit to keep complexity under control • One x86 instruction 1-4 uops • Most instructions translate to 1.5-2.0 uops • What if a complex insn. needs more than 4 uops?
UROM/MS for Complex x86 Insts • UROM (microcode-ROM) stores the uop equivalents • Used for nasty x86 instructions • Complex like > 4 uops (PUSHA or STRREP.MOV) • Obsolete (like AAA) • Microsequencer (MS) handles the UROM interaction
SUB REP.MOV ADD [ ] ADD REP.MOV INC STA ADD [ ] mJCC STD XOR LOAD Cycle 4 Cycle 2 SUB LOAD STORE Cycle 3 Complex instructions, get uops from mcode sequencer Fetch- x86 insts Decode - uops UROM - uops UROM/MS Example (3 uop-wide) XOR ADD REP.MOV REP.MOV REP.MOV STORE ADD [ ] ADD [ ] ADD [ ] … SUB XOR XOR XOR … STORE … mJCC INC … LOAD mJCC … ADD Cycle 5 Cycle n Cycle n+1 Cycle 1
Superscalar CISC Decode • Instruction Length Decode (ILD) • Where are the instructions? • Limited decode – just enough to parse prefixes, modes • Shift/Alignment • Get the right bytes to the decoders • Decode • Crack into uops • Do this for N instructions per cycle!
ILD Recurrence/Loop PCi= X PCi+1 = PCi + sizeof( Mem[PCi] ) PCi+2 = PCi+1 + sizeof( Mem[PCi+1] ) = PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] ) • Can’t find start of next insn. before decoding the first • Must do ILD serially • ILD of 4 insns/cycle implies cycle time will be 4x • Critical component not pipeline-able
Inst 1 Inst 2 Inst 3 Remainder Cycle 2 Cycle 3 Decoder 1 Decoder 2 Decoder 3 Left Shifter Bad x86 Decode Implementation Instruction Bytes (ex. 16 bytes) Length 1 ILD (limited decode) Cycle 1 Length 2 ILD (limited decode) + Length 3 ILD (limited decode) + bytes decoded • ILD dominates cycle time; not scalable
Hardware-Intensive Decode Decode from every possible instruction starting point! ILD ILD ILD ILD ILD ILD ILD ILD ILD ILD ILD ILD ILD ILD ILD ILD Giant MUXes to select instruction bytes Decoder Decoder Decoder
ILD in Hardware-Intensive Approach Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder Decoder 6 bytes 4 bytes 3 bytes + Previous: 3 ILD + 2add Now: 1ILD + 2(mux+add) Total bytes decode = 11
Predecoding • ILD loop is hardware intensive • Impacts latency • Consumes substantial power • If instructions A, B and C are decoded • ... lengths for A, B, and C will still be the same next time • No need to repeat ILD • Possible to cache the ILD work
Decoder Example: AMD K5 (1/2) From Memory 8 bits 8 bytes b0 b1 b2 … b7 Predecode Logic +5 bits 13 bytes b0 b1 b2 … b7 8 (8-bit inst + 5-bit predecode) I$ 16 (8-bit inst + 5-bit predecode) Decode Up to 4 ROPs • Compute ILD on fetch, store ILD in the I$
Decoder Example: AMD K5 (2/2) • Predecode makes decode easier by providing: • Instruction start/end location (ILD) • Number of ROPs needed per inst • Opcode and prefix locations • Power/performance tradeoffs • Larger I$ (increase array by 62.5%) • Longer I$ latency, more I$ power consumption • Remove logic from decode • Shorter pipeline, simpler logic • Cache and reused decode work less decode power • Longer effective I-L1 miss latency (ILD on fill)
Decoder Example: Intel P-Pro • Only Decoder 0 interfaces with the uROM and MS • If insn. in Decoder 1 or Decoder 2 requires > 1 uop • do not generate output • shift to Decoder 0 on the next cycle 16 Raw Instruction Bytes Fetch resteer if needed mROM Decoder 0 Decoder 1 Decoder 2 Branch Address Calculator 4 uops 1 uop 1 uop
Fetch Rate is an ILP Upper Bound • Instruction fetch limits performance • To sustain IPC of N, must sustain a fetch rate of N per cycle • If you consume 1500 calories per day,but burn 2000 calories per day,then you will eventually starve. • Need to fetch N on average, not on every cycle • N-wide superscalar ideally fetches N insns. per cycle • This doesn’t happen in practice due to: • Instruction cache organization • Branches • … and interaction between the two
Instruction Cache Organization • To fetch N instructions per cycle... • L1-I line must be wide enough for N instructions • PC register selects L1-I line • A fetch group is the set of insns. starting at PC • For N-wide machine, [PC,PC+N-1] PC Cache Line Tag Inst Inst Inst Inst Tag Inst Inst Inst Inst Tag Inst Inst Inst Inst Decoder Tag Inst Inst Inst Inst Tag Inst Inst Inst Inst
Fetch Misalignment (1/2) • If PC = xxx01001, N=4: • Ideal fetch group is xxx01001 through xxx01100 (inclusive) PC: xxx01001 00 01 10 11 000 Tag Inst Inst Inst Inst 001 Tag Inst Inst Inst Inst 010 Tag Inst Inst Inst Inst 011 Tag Inst Inst Inst Inst Decoder 111 Tag Inst Inst Inst Inst Line width Fetch group • Misalignment reduces fetch width
Inst Inst Inst Fetch Misalignment (2/2) • Now takes two cycles to fetch N instructions • ½ fetch bandwidth! PC: xxx01001 00 01 10 11 000 Tag Inst Inst Inst Inst 001 Tag Inst Inst Inst Inst 010 Tag Inst Inst Inst Inst 011 Tag Inst Inst Inst Inst Decoder Cycle 2 111 PC: xxx01100 Tag Inst Inst Inst Inst 00 01 10 11 000 001 Tag Inst Inst Inst Inst Cycle 1 Inst Inst Inst 010 Tag Inst Inst Inst Inst 011 Tag Inst Inst Inst Inst Tag Inst Inst Inst Inst Decoder 111 Tag Inst Inst Inst Inst Inst • Might not be ½ by combining with the next fetch
Cache Line Reducing Fetch Fragmentation (1/2) • Make |Fetch Group| < |L1-I Line| Tag Inst Inst Inst Inst Inst Inst Inst Inst Tag Inst Inst Inst Inst Inst Inst Inst Inst Decoder PC Tag Inst Inst Inst Inst Inst Inst Inst Inst • Can deliver N insns. when PC > N from end of line
Reducing Fetch Fragmentation (2/2) • Needs a “rotator” to decode insns. in correct order Tag Inst Inst Inst Inst Inst Inst Inst Inst Tag Inst Inst Inst Inst Inst Inst Inst Inst Decoder PC Tag Inst Inst Inst Inst Inst Inst Inst Inst Rotator Inst Inst Inst Inst Aligned fetch group