Computer Architecture Principles
Dr. Mike Frank
CDA 5155, Summer 2003
Module #21: Multiple Issue Pipelining: Superscalar, VLIW, etc.
Multiple Issue (3.6)
• Goal: Enable multiple instructions to be issued in a single clock cycle. (Can get CPI < 1!)
• Two basic “flavors” of multiple-issue:
  • Superscalar:
    • Operates on an ordinary serial instruction stream format.
    • Instructions per clock varies widely.
    • Scheduling can be either dynamic or static.
  • VLIW (Very Long Instruction Word), a.k.a. EPIC (Explicitly Parallel Instruction Computing):
    • New format: parallel instructions grouped into blocks.
    • Instructions per clock fairly well fixed (by block size).
    • Mostly statically scheduled by compiler.
Code Example to be Used
C code fragment:
  double *p;
  do { *(p--) += c; } while (p);
DLX code fragment:
  Loop: LD   F0,0(R1)   ; F0 = *p
        ADDD F4,F0,F2   ; F4 = F0 + c
        SD   0(R1),F4   ; *p = F4
        SUBI R1,R1,#8   ; p--
        BNEZ R1,Loop    ; repeat until p == 0
Simple Superscalar RISC
• Typical superscalar: 1-8 insts. issued per cycle
  • Actual number depends on dependences, hazards
• Our simple example: at most 2 insts./cycle
• Instructions statically pre-paired to ease decoding:
  • 1st: one load/store/branch/integer-ALU op.
  • 2nd: one floating-point op.
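A minimal sketch of how this static pairing rule could be checked at issue time. This is not from the slides or the textbook: the instruction classes, struct fields, and the flattened register numbering are illustrative assumptions.

  #include <stdbool.h>
  #include <stdio.h>

  /* Illustrative instruction classes for the simple two-issue machine. */
  typedef enum { INT_ALU, LOAD, STORE, BRANCH, FP_OP } InstClass;

  typedef struct {
      InstClass cls;
      int dest;        /* destination register index (-1 if none)        */
      int src1, src2;  /* source register indices (-1 if unused);        */
                       /* register names are flattened to plain numbers  */
  } Inst;

  /* Slot 0 may hold a load/store/branch/integer-ALU op; slot 1 only an FP op. */
  static bool slot0_ok(const Inst *i) { return i->cls != FP_OP; }
  static bool slot1_ok(const Inst *i) { return i->cls == FP_OP; }

  /* Issue the pair only if both instructions fit their slots and the second
   * does not read a register written by the first (no intra-pair RAW hazard). */
  bool can_dual_issue(const Inst *a, const Inst *b)
  {
      if (!slot0_ok(a) || !slot1_ok(b))
          return false;
      if (a->dest != -1 && (a->dest == b->src1 || a->dest == b->src2))
          return false;
      return true;
  }

  int main(void)
  {
      Inst ld   = { LOAD,  0 /*F0*/, 1 /*R1*/, -1 };  /* LD   F0,0(R1) */
      Inst addd = { FP_OP, 4 /*F4*/, 0 /*F0*/,  2 };  /* ADDD F4,F0,F2 */
      /* The ADDD reads F0, which the LD writes, so the pair cannot issue together. */
      printf("dual issue LD/ADDD: %s\n", can_dual_issue(&ld, &addd) ? "yes" : "no");
      return 0;
  }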
Some Issues with this Approach
• If FP ops are multiple-cycle,
  • then issuing 1 FP inst./cycle requires multiple or pipelined FP functional units (or both)
  • FP ops may finish out of order, raising the usual completion issues (WAW hazards, exception handling)
• Why parallelize integer ops against FP ops?
  • Each uses different registers & functional units
  • Resource contention is minimized
  • Only contention is for FP registers on FP loads/stores
• Must detect & deal with structural & data hazards
• Note the issue with load latency:
  • Result available 1 cycle after EX (in MEM)
  • Not available for the next 3 instructions: the instruction issued with the load, plus both instructions in the next issue pair, cannot use it without stalling!
Unrolled Loop, Superscalar Version
• 5 elements per unrolled iteration; the schedule below takes 12 cycles, i.e., 2.4 clock cycles per element:
Pipeline Details in this Example
Key: F = inst. fetch, D = inst. decode, E = integer execution, 1234 = FP exec. stages, M = mem. access, W = writeback regs.
  Loop: LD   F0,0(R1)     FDEMW
        LD   F6,-8(R1)    FDEMW
        LD   F10,-16(R1)  FDEMW
        ADDD F4,F0,F2     FD1234MW
        LD   F14,-24(R1)  FDEMW
        ADDD F8,F6,F2     FD1234MW
        LD   F18,-32(R1)  FDEMW
        ADDD F12,F10,F2   FD1234MW
        SD   0(R1),F4     FDEMW
        ADDD F16,F14,F2   FD1234MW
        SD   -8(R1),F8    FDEMW
        ADDD F20,F18,F2   FD1234MW
        SD   -16(R1),F12  FDEMW
        SUBI R1,R1,#40    FDEMW
        SD   16(R1),F16   FDEMW
        BNEZ R1,Loop      FDEMW
        SD   8(R1),F20    FDEMW
Multiple Issue + Dynamic Scheduling
• Why? The usual advantages of dynamic scheduling:
  • Compiler-independent, data-dependent scheduling
• Multiple-issue Tomasulo:
  • Issue 1 integer + 1 FP instruction to reservation stations each cycle
  • Problem (again): issuing an FP load and an FP op simultaneously
    • If the instructions are dependent, simple hazard detection breaks down
  • Two solutions to this problem:
    1. Enter instructions into the issue tables in only half a clock cycle
    2. Statically schedule loads/stores far enough away from the instructions they have dependences with, then queue them until ready, as in Tomasulo's load/store buffers
• Another approach: only queue loads/stores
  • Use static scheduling for all other instructions
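A rough sketch of the per-cycle issue step for a two-way Tomasulo front end, issuing one integer and one FP instruction only if reservation stations of the right kind are free. This is an illustrative assumption, not the course's or the textbook's implementation; the pool sizes and names are invented.

  #include <stdio.h>

  /* Illustrative reservation-station pools for a 2-way Tomasulo front end. */
  typedef struct {
      int int_rs_free;   /* free integer/load-store reservation stations */
      int fp_rs_free;    /* free floating-point reservation stations     */
  } RsPools;

  typedef enum { INT_INST, FP_INST } Kind;

  /* Try to issue up to one integer and one FP instruction this cycle.
   * Returns how many were actually issued (0, 1, or 2). Issue stays in
   * order: if the first instruction cannot issue, the second waits too. */
  int issue_cycle(RsPools *rs, Kind first, Kind second)
  {
      int issued = 0;

      int *slot = (first == INT_INST) ? &rs->int_rs_free : &rs->fp_rs_free;
      if (*slot == 0) return 0;       /* structural hazard: stall both    */
      (*slot)--; issued++;

      slot = (second == INT_INST) ? &rs->int_rs_free : &rs->fp_rs_free;
      if (*slot == 0) return issued;  /* second instruction waits a cycle */
      (*slot)--; issued++;

      return issued;
  }

  int main(void)
  {
      RsPools rs = { .int_rs_free = 2, .fp_rs_free = 1 };
      /* e.g. an LD (integer/load station) paired with an ADDD (FP station) */
      printf("issued %d instructions this cycle\n",
             issue_cycle(&rs, INT_INST, FP_INST));
      return 0;
  }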
Timing of Multiple Issue, Dynamically Scheduled (cycles 1-12)
Key: I = issue, E = execute, M = memory, W = writeback, ~ = stall, > = still in execute
  LD   F0,0(R1)    IEM
  ADDD F4,F0,F2    I~~E>W
  SD   0(R1),F4    IE~~~M
  SUBI R1,R1,#8    IEW
  BNEZ R1,Loop     IE
  LD   F0,0(R1)    IE~M
  ADDD F4,F0,F2    I~~~E>W
  SD   0(R1),F4    IE~~~~M
  SUBI R1,R1,#8    IEW
  BNEZ R1,Loop     IE
VLIW (Very Long Instruction Word)
• Also called EPIC (Explicitly Parallel Instruction Computing) by Intel in the IA-64.
• Basic idea:
  • Have the compiler determine multiple independent instructions that can execute simultaneously.
  • Package them into a wide, fixed-size bundle.
  • Slots in the bundle correspond to functional units.
• Advantages:
  • Permits very wide multiple issue (high parallelism)
  • Avoids the runtime complexity of dynamic scheduling
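As an illustration only, a fixed-format bundle might be declared as below. The 2 memory / 2 FP / 1 integer-or-branch slot mix comes from the next slide; the struct layout, opcode encoding, and the use of opcode 0 as a NOP are invented for this sketch.

  #include <stdio.h>

  /* One 32-bit-style operation; opcode 0 serves as the NOP filler here. */
  typedef struct {
      unsigned op;         /* opcode (0 = NOP)  */
      unsigned rd, rs, rt; /* register fields   */
  } Op;

  /* One VLIW bundle: each slot is bound to a particular functional unit,
   * matching the 2 memory / 2 FP / 1 integer-or-branch mix on the next slide. */
  typedef struct {
      Op mem[2];   /* two memory-reference slots  */
      Op fp[2];    /* two floating-point slots    */
      Op intbr;    /* one integer-ALU/branch slot */
  } Bundle;

  /* Count how many of the bundle's five slots carry real work; unfilled
   * slots are wasted encoding space (one of the drawbacks discussed later). */
  int useful_ops(const Bundle *b)
  {
      int n = 0;
      n += (b->mem[0].op != 0) + (b->mem[1].op != 0);
      n += (b->fp[0].op  != 0) + (b->fp[1].op  != 0);
      n += (b->intbr.op  != 0);
      return n;
  }

  int main(void)
  {
      Bundle b = {0};        /* start as five NOPs            */
      b.mem[0].op = 1;       /* e.g. a load in memory slot 0  */
      b.fp[0].op  = 2;       /* e.g. an FP add in FP slot 0   */
      printf("%d of 5 slots used\n", useful_ops(&b));
      return 0;
  }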
VLIW Example
• Each VLIW word contains:
  • 2 slots for memory-reference instructions
  • 2 slots for floating-point instructions
  • 1 slot for an integer operation or branch
• See fig. 4.29 on p. 286 (no electronic version available yet)
  • Shows our old familiar loop
  • Unrolled & packed into VLIW words
More on VLIW
• A technique for multiple issue
  • Statically scheduled by the compiler
• Difference vs. a statically scheduled superscalar:
  • Compiler pre-collects instructions into issue packets
  • Avoids or marks dependences within an issue packet
  • Avoids the need for dynamic dependence detection
• See example in 3rd ed., fig. 4.5, p. 318:
  • Unrolled loop on a 5-way VLIW
  • 2 memory references, 2 FP ops, and 1 int. op per clock
  • Achieves 9 cycles / 7 elements ≈ 1.3 cycles per array element!
  • 60% efficiency vs. peak instruction issue rate
Difficulties with Early VLIW
• Increased code size:
  • Aggressively unrolling loops to expand basic blocks
  • Unfilled instruction slots are wasted bits in the VLIW
  • Can be dealt with by alternative encodings or in-memory compression
• Lockstep operation:
  • All instructions in a packet proceed in lockstep
  • The entire pipe must stall if one functional unit does
  • Difficult to statically predict some stalls
    • e.g., cache misses
• Binary code incompatibilities:
  • Code layout depends on the microarchitecture version!
Limits to Multiple-Issue
• Can we continue increasing the issue width (and decreasing CPI) indefinitely?
  • No, not for serially-written programs in general.
• Some problems with increasing issue width:
  • Inherent limitations of ILP in programs.
  • Difficulty of scaling the shared register file or memory.
  • Dynamic scheduling complexity in superscalar.
  • Code size and binary incompatibility in VLIW.
An Important Lesson
• Multiple issue increases parallelism without requiring programmer effort, but it has limits!
• Beyond a certain point, increasing parallelism requires programmer participation!
  • Multithreaded shared-memory programming.
  • Better: distributed multiprocessor algorithms that take communication limits into account.
• Maximal performance requires a programming model based on the structure of physics!
  • My proposal: a 3-D mesh of reversible (maybe quantum-coherent) processing elements.
HW Support for More ILP (4.6)
• Static techniques described in 4.5 may miss a lot of ILP opportunities that occur dynamically.
• In this section we introduce some HW techniques:
  • Conditional or predicated instructions
  • Compiler speculation with HW support:
    • HW/SW cooperation for speculation
    • Speculation with poison bits
    • Speculative instructions with renaming
  • Hardware-based speculation
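To make the first item concrete, here is a small example of the idea behind conditional (predicated) instructions: the branch in the first form is replaced by an operation that always executes but only takes effect when its condition holds, as with a conditional-move instruction (e.g., a MIPS-style MOVZ/CMOVZ). The C below only contrasts the two source-level forms; whether a compiler actually emits a conditional move depends on the target, so take it as a sketch.

  #include <stdio.h>

  /* Branching form: the update of s is control-dependent on the test. */
  int with_branch(int a, int s, int t)
  {
      if (a == 0)
          s = t;
      return s;
  }

  /* Branch-free form: the test result selects the value instead.
   * On a machine with a conditional move this compiles to straight-line
   * code, turning the control dependence into a data dependence. */
  int with_cmov(int a, int s, int t)
  {
      return (a == 0) ? t : s;
  }

  int main(void)
  {
      printf("%d %d\n", with_branch(0, 5, 9), with_cmov(0, 5, 9));  /* 9 9 */
      printf("%d %d\n", with_branch(1, 5, 9), with_cmov(1, 5, 9));  /* 5 5 */
      return 0;
  }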
Dyn. Mult.-Iss. Timing Example
(Example from 3rd ed., figs. 3.25-3.26, pp. 221-223; cycles 1-19.)
  Iter 1: L.D    F0,0(R1)    IEMW
  Iter 1: ADD.D  F4,F0,F2    I EEEW
  Iter 1: S.D    F4,0(R1)    IE M
  Iter 1: DADDIU R1,R1,#-8   I EW
  Iter 1: BNE    R1,R2,Loop  I E
  Iter 2: L.D    F0,0(R1)    I EMW
  Iter 2: ADD.D  F4,F0,F2    I EEEW
  Iter 2: S.D    F4,0(R1)    I E M
  Iter 2: DADDIU R1,R1,#-8   I EW
  Iter 2: BNE    R1,R2,Loop  I E
  Iter 3: L.D    F0,0(R1)    I EMW
  Iter 3: ADD.D  F4,F0,F2    I EEEW
  Iter 3: S.D    F4,0(R1)    I E M
  Iter 3: DADDIU R1,R1,#-8   I EW
  Iter 3: BNE    R1,R2,Loop  I E
Annotations in the original figure mark the RAW hazards, structural hazards, and control hazards.
Note: only 1 integer adder unit, shared by ALU instructions and EA calculations for loads/stores.
Timing Example with Extra Adder
(Example from 3rd ed., figs. 3.27-3.28, pp. 223-225; cycles 1-19.)
  Iter 1: L.D    F0,0(R1)    IEMW
  Iter 1: ADD.D  F4,F0,F2    I EEEW
  Iter 1: S.D    F4,0(R1)    IE M
  Iter 1: DADDIU R1,R1,#-8   IEW
  Iter 1: BNE    R1,R2,Loop  I E
  Iter 2: L.D    F0,0(R1)    I EMW
  Iter 2: ADD.D  F4,F0,F2    I EEEW
  Iter 2: S.D    F4,0(R1)    I E M
  Iter 2: DADDIU R1,R1,#-8   IEW
  Iter 2: BNE    R1,R2,Loop  I E
  Iter 3: L.D    F0,0(R1)    I EMW
  Iter 3: ADD.D  F4,F0,F2    I EEEW
  Iter 3: S.D    F4,0(R1)    I E M
  Iter 3: DADDIU R1,R1,#-8   IEW
  Iter 3: BNE    R1,R2,Loop  I E
Annotations in the original figure mark the RAW hazards, structural hazards, and control hazards.
Note: 2 integer adder units, one for ALU instructions and another for EA calculations for loads/stores.