Embedded Computer Architectures
Hennessy & Patterson Chapter 4: Exploiting ILP with Software Approaches
Gerard Smit (Zilverling 4102), smit@cs.utwente.nl
André Kokkeler (Zilverling 4096), kokkeler@utwente.nl
Contents • Introduction • Processor Architecture • Loop Unrolling • Software Pipelining
Processor Architecture • 5 stage pipeline • Static scheduling • Integer and Floating Point unit
Processor Architecture
• Latencies (producing instruction => dependent instruction):
  • Integer ALU => Integer ALU: no latency
  • Floating point ALU => Floating point ALU: latency = 3
  • Load Memory => Store Memory: no latency
  • Integer ALU => Store Memory: no latency
  • Floating point ALU => Store Memory: latency = 2
  • Load Memory => Integer ALU: latency = 1
  • Load Memory => Floating point ALU: latency = 1
  • Integer ALU => Branch: latency = 1
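The latency list can be kept as a small lookup table. The sketch below is only an illustration: the names `LATENCY` and `stalls_needed` are made up here, and it assumes that a latency of n means n cycles must separate the producer from its dependent instruction, which is how the stall counts on the following slides read.

```python
# Latency table from the slides: (producer kind, consumer kind) -> cycles
# that must separate the two instructions. Names are illustrative only.
LATENCY = {
    ("int_alu", "int_alu"): 0,
    ("fp_alu", "fp_alu"): 3,
    ("load", "store"): 0,
    ("int_alu", "store"): 0,
    ("fp_alu", "store"): 2,
    ("load", "int_alu"): 1,
    ("load", "fp_alu"): 1,
    ("int_alu", "branch"): 1,
}


def stalls_needed(producer, consumer, gap=0):
    """Stall cycles still needed when `gap` independent instructions
    already sit between the producer and the dependent consumer."""
    return max(0, LATENCY[(producer, consumer)] - gap)


print(stalls_needed("load", "fp_alu"))          # L.D directly followed by ADD.D
print(stalls_needed("fp_alu", "store"))         # ADD.D directly followed by S.D
print(stalls_needed("fp_alu", "store", gap=1))  # one instruction in between
```

Filling the gap with useful, independent instructions is exactly what the scheduling steps on the later slides do.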
Loop Unrolling
• For i:=1000 downto 1 do x[i] := x[i]+s;
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• R1: pointer within array
• F2: value to be added (s)
• R2: last element in array
• F0: value in array
• F4: value to be written in array
Loop Unrolling
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Load Memory => FP ALU: 1 stall
Loop Unrolling
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          stall
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• FP ALU => Store Memory: 2 stalls
Loop Unrolling
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          stall
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          stall
          stall
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Integer ALU => Branch: 1 stall
Loop Unrolling
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          stall
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          stall
          stall
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1
          stall
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Smart compiler
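Loop Unrolling
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          DADDUI R1,R1,#-8    ; i ← i-1
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          stall
          BNE    R1,R2,Loop   ; repeat if i≠0
          S.D    8(R1),F4     ; x[i] ← x[i]+s (fills the branch delay slot)
• Moving DADDUI up removes the Integer ALU => Branch stall
• One stall remains: FP ALU => Store Memory needs 2 cycles between ADD.D and S.D, and BNE fills only one of them
• From 10 cycles per loop to 6 cycles per loop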
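The 10 and 6 cycle counts can be reproduced with a very small in-order, single-issue timing model. This is a sketch under assumptions: the function name `cycles`, the instruction tuples and the rule "producer/consumer pairs not listed on the latency slides have latency 0" are all choices made for this illustration, not part of the slides.

```python
# Latency table from the slides; pairs that are not listed are assumed to be 0.
LATENCY = {
    ("int_alu", "int_alu"): 0, ("fp_alu", "fp_alu"): 3,
    ("load", "store"): 0, ("int_alu", "store"): 0, ("fp_alu", "store"): 2,
    ("load", "int_alu"): 1, ("load", "fp_alu"): 1, ("int_alu", "branch"): 1,
}


def cycles(program):
    """Issue cycle of the last instruction of one loop iteration.

    `program` is a list of (kind, destination register or None, source registers).
    One instruction issues per cycle unless it must wait for a producer.
    """
    last_write = {}  # register -> (issue cycle, kind of the producing instruction)
    cycle = 0
    for kind, dest, srcs in program:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_write:
                issued, producer = last_write[reg]
                earliest = max(earliest, issued + 1 + LATENCY.get((producer, kind), 0))
        cycle = earliest
        if dest is not None:
            last_write[dest] = (cycle, kind)
    return cycle


original = [
    ("load", "F0", ["R1"]),            # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),    # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),     # S.D    0(R1),F4
    ("int_alu", "R1", ["R1"]),         # DADDUI R1,R1,#-8
    ("branch", None, ["R1", "R2"]),    # BNE    R1,R2,Loop
    ("nop", None, []),                 # NOP in the branch delay slot
]
scheduled = [
    ("load", "F0", ["R1"]),            # L.D    F0,0(R1)
    ("int_alu", "R1", ["R1"]),         # DADDUI R1,R1,#-8
    ("fp_alu", "F4", ["F0", "F2"]),    # ADD.D  F4,F0,F2
    ("branch", None, ["R1", "R2"]),    # BNE    R1,R2,Loop
    ("store", None, ["F4", "R1"]),     # S.D    8(R1),F4
]

print(cycles(original), cycles(scheduled))
```

The model only counts cycles; it does not decide where a stall is placed, so the remaining stall of the scheduled loop may show up in a different position than on the slide while the total stays the same.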
Loop Unrolling
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          DADDUI R1,R1,#-8    ; i ← i-1
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          BNE    R1,R2,Loop   ; repeat if i≠0
          S.D    8(R1),F4     ; x[i] ← x[i]+s
• 5 instructions
  • 3 'doing the job'
  • 2 control or 'overhead'
• Reduce overhead => loop unrolling
  • Add code
  • From 1000 iterations to 500 iterations
Loop Unrolling
• Original Code Sequence:
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Copy the loop body (L.D, ADD.D, S.D) with the correct 'data pointer' offset
Loop Unrolling
• Unrolled Code Sequence:
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]              (1 stall)
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s            (2 stalls)
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          L.D    F0,-8(R1)    ; F0 ← x[i-1]            (1 stall)
          ADD.D  F4,F0,F2     ; F4 ← x[i-1]+s          (2 stalls)
          S.D    -8(R1),F4    ; x[i-1] ← x[i-1]+s
          DADDUI R1,R1,#-16   ; i ← i-2                (1 stall)
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• There are still a lot of stalls. Removing them is easier if some additional registers are used
Loop Unrolling
• Unrolled Code Sequence (second copy renamed to F6 and F8):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]              (1 stall)
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s            (2 stalls)
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          L.D    F6,-8(R1)    ; F6 ← x[i-1]            (1 stall)
          ADD.D  F8,F6,F2     ; F8 ← x[i-1]+s          (2 stalls)
          S.D    -8(R1),F8    ; x[i-1] ← x[i-1]+s
          DADDUI R1,R1,#-16   ; i ← i-2                (1 stall)
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
Loop Unrolling
• Unrolled Code Sequence (second load moved up):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          L.D    F6,-8(R1)    ; F6 ← x[i-1]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          ADD.D  F8,F6,F2     ; F8 ← x[i-1]+s
          S.D    -8(R1),F8    ; x[i-1] ← x[i-1]+s
          DADDUI R1,R1,#-16   ; i ← i-2
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Stall annotations on the slide, top to bottom: 1 stall, 1 stall, 2 stalls, 1 stall
Loop Unrolling
• Unrolled Code Sequence (both adds before the stores):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          L.D    F6,-8(R1)    ; F6 ← x[i-1]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          ADD.D  F8,F6,F2     ; F8 ← x[i-1]+s
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          S.D    -8(R1),F8    ; x[i-1] ← x[i-1]+s
          DADDUI R1,R1,#-16   ; i ← i-2
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Annotations on the slide: store offsets +16 and +8; 2 stalls, 1 stall
Loop Unrolling
• Unrolled Code Sequence (DADDUI moved above the stores, store offsets adjusted by +16):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          L.D    F6,-8(R1)    ; F6 ← x[i-1]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          ADD.D  F8,F6,F2     ; F8 ← x[i-1]+s
          DADDUI R1,R1,#-16   ; i ← i-2
          S.D    16(R1),F4    ; x[i] ← x[i]+s
          S.D    8(R1),F8     ; x[i-1] ← x[i-1]+s
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
Loop Unrolling
• Unrolled Code Sequence (last store fills the branch delay slot):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          L.D    F6,-8(R1)    ; F6 ← x[i-1]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          ADD.D  F8,F6,F2     ; F8 ← x[i-1]+s
          DADDUI R1,R1,#-16   ; i ← i-2
          S.D    16(R1),F4    ; x[i] ← x[i]+s
          BNE    R1,R2,Loop   ; repeat if i≠0
          S.D    8(R1),F8     ; x[i-1] ← x[i-1]+s
Loop Unrolling
• In example: loop-unrolling factor 2
• In general: loop-unrolling factor k
• Limitations concerning k
  • Amdahl's law: 3000 cycles are always needed (3 'doing the job' instructions for each of the 1000 elements)
  • Increasing k => increasing number of registers
  • Increasing k => increasing code size
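The Amdahl limit can be made concrete with a short calculation. The sketch assumes that for k ≥ 2 the compiler finds a stall-free schedule (as it does for k = 2 above), that k divides the element count, and that each unrolled iteration costs 3 instructions per element plus DADDUI and BNE. The function name `total_cycles` is illustrative.

```python
def total_cycles(n_elements, k):
    """Total cycles for the loop with unrolling factor k.

    Assumes k >= 2, k divides n_elements and a stall-free schedule.
    """
    work = 3 * k        # L.D, ADD.D and S.D for each of the k elements
    overhead = 2        # one DADDUI and one BNE per unrolled iteration
    iterations = n_elements // k
    return iterations * (work + overhead)


for k in (2, 4, 5, 10):
    print(k, total_cycles(1000, k))
```

Under these assumptions the total is 3000 + 2000/k cycles: the overhead term shrinks as k grows, but the 3000 cycles of real work never go away.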
Software Pipelining
• Original loop (not unrolled):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]              (1 stall)
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s            (2 stalls)
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1                (1 stall)
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Three actions involved with actual calculations:
    F0 ← x[i]
    F4 ← x[i] + s
    x[i] ← x[i] + s
• Consider these as three different stages
Software Pipelining
• Original loop (not unrolled):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Three actions involved with actual calculations:
    F0 ← x[i]          Stage 1
    F4 ← x[i] + s      Stage 2
    x[i] ← x[i] + s    Stage 3
• Associate array element with the stages
Software Pipelining
• Original loop (not unrolled):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Three actions involved with actual calculations:
    F0 ← x[i]          Stage 1, x[i]
    F4 ← x[i] + s      Stage 2, x[i]
    x[i] ← x[i] + s    Stage 3, x[i]
Software Pipelining
• Normal Execution
• Stage 1: fill F0; Stage 2: read F0, fill F4; Stage 3: read F4
• Timeline (time runs downward):
    Stage 1    Stage 2    Stage 3
    x[1000]                           followed by 1 stall
               x[1000]                followed by 2 stalls
                          x[1000]
    x[999]                            followed by 1 stall
               x[999]                 followed by 2 stalls
                          x[999]
    x[998]                            followed by 1 stall
               x[998]                 followed by 2 stalls
                          x[998]
• The slide also marks registers F0 and F4 as empty or occupied over time
Software Pipelining
• Software Pipelined Execution
• Stage 1: fill F0; Stage 2: read F0, fill F4; Stage 3: read F4
• Timeline (time runs downward):
    Stage 1    Stage 2    Stage 3
    x[1000]
               x[1000]
    x[999]
                          x[1000]
               x[999]
    x[998]
                          x[999]
               x[998]
    x[997]
                          x[998]
• Stall annotations on the slide, top to bottom: 1 stall, 1 stall, 0 stalls, 1 stall, 0 stalls, 1 stall
• The slide also marks registers F0 and F4 as empty or occupied over time
Software Pipelining
• Software Pipelined Execution
    Start-up:
          L.D    F0,0(R1)     ; F0 ← x[1000]           Stage 1, x[1000]
          ADD.D  F4,F0,F2     ; F4 ← x[1000]+s         Stage 2, x[1000]
          L.D    F0,-8(R1)    ; F0 ← x[999]            Stage 1, x[999]
    Loop: S.D    0(R1),F4     ; x[i] ← F4              Stage 3, x[i]
          ADD.D  F4,F0,F2     ; F4 ← x[i-1]+s          Stage 2, x[i-1]
          L.D    F0,-16(R1)   ; F0 ← x[i-2]            Stage 1, x[i-2]
          BNE    R1,R2,Loop   ; repeat if i≠1
          DADDUI R1,R1,#-8    ; i ← i-1 (in the branch delay slot)
• Stall annotations on the slide, top to bottom: 1 stall, 1 stall, 0 stalls, 1 stall, 0 stalls
Software Pipelining
• Software Pipelined Execution
    Start-up:
          L.D    F0,0(R1)     ; F0 ← x[1000]           Stage 1, x[1000]
          ADD.D  F4,F0,F2     ; F4 ← x[1000]+s         Stage 2, x[1000]
          L.D    F0,-8(R1)    ; F0 ← x[999]            Stage 1, x[999]
    Loop: S.D    0(R1),F4     ; x[i] ← F4              Stage 3, x[i]
          ADD.D  F4,F0,F2     ; F4 ← x[i-1]+s          Stage 2, x[i-1]
          L.D    F0,-16(R1)   ; F0 ← x[i-2]            Stage 1, x[i-2]
          BNE    R1,R2,Loop   ; repeat if i≠1
          DADDUI R1,R1,#-8    ; i ← i-1 (in the branch delay slot)
• Stall annotations on the slide, top to bottom: 1 stall, 1 stall, 0 stalls, 0 stalls, 0 stalls
Software Pipelining • No stalls inside loop • Additional start-up (and clean-up) code • No reduction of control overhead • No additional registers
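The structure of a software-pipelined loop (start-up code, a kernel that mixes three different elements, clean-up code) can be sketched in a high-level language. The function below is only an illustration of the idea, not the slide's code: it uses 0-based Python indexing, the variables `f0` and `f4` mimic the two registers, and the name `add_scalar_pipelined` is made up here.

```python
def add_scalar_pipelined(x, s):
    """In-place x[i] += s, written as a software pipeline.

    Requires len(x) >= 2. Only two temporaries are used, like F0 and F4.
    """
    n = len(x)
    # Start-up code: get the first two elements into the pipeline.
    f0 = x[n - 1]            # stage 1 (load) for element n-1
    f4 = f0 + s              # stage 2 (add)  for element n-1
    f0 = x[n - 2]            # stage 1 (load) for element n-2
    # Kernel: each pass works on three different elements.
    for i in range(n - 1, 1, -1):
        x[i] = f4            # stage 3 (store) for element i
        f4 = f0 + s          # stage 2 (add)   for element i-1
        f0 = x[i - 2]        # stage 1 (load)  for element i-2
    # Clean-up code: drain the two elements still in flight.
    x[1] = f4                # stage 3 for element 1
    f4 = f0 + s              # stage 2 for element 0
    x[0] = f4                # stage 3 for element 0
    return x


print(add_scalar_pipelined([1.0, 2.0, 3.0, 4.0], 0.5))
```

The kernel still contains one store, one add and one load per pass, so the control overhead per element is unchanged; the gain comes only from separating dependent operations in time.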
VLIW • To simplify processor hardware: sophisticated compilers (loop unrolling, software pipelining etc.) • Extreme form: Very Long Instruction Word processors
VLIW
• Superscalar vs. VLIW
• [Figure: instructions pass through grouping, execution unit assignment and initiation on their way to the execution units; the diagram marks which of these steps are done in hardware.]
VLIW
• Suppose 4 functional units
  • Memory load unit
  • Floating point unit
  • Memory store unit
  • Integer/Branch unit
• Instruction: | Memory load | FP operation | Memory store | Integer/Branch |
VLIW
• Original loop (not unrolled):
    Loop: L.D    F0,0(R1)     ; F0 ← x[i]              (1 stall)
          ADD.D  F4,F0,F2     ; F4 ← x[i]+s            (2 stalls)
          S.D    0(R1),F4     ; x[i] ← x[i]+s
          DADDUI R1,R1,#-8    ; i ← i-1                (1 stall)
          BNE    R1,R2,Loop   ; repeat if i≠0
          NOP                 ; branch delay slot
• Resulting instruction words:
    Memory load | FP operation | Memory store | Integer/Branch
    L.D         |              |              |
    stall
                | ADD.D        |              |
    stall
    stall
                |              | S.D          |
• Limit stall cycles by clever compilers (loop unrolling, software pipelining)
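How poorly the unscheduled loop fills the four-slot instruction word can be counted directly. This is a rough sketch: the issue cycles are taken from the stalled single-issue schedule shown on the earlier slides (10 cycles including the delay slot), and the names `UNITS`, `ISSUE`, `build_words` and `utilization` are illustrative.

```python
UNITS = ["load", "fp", "store", "int_branch"]

# (operation, functional unit, issue cycle) for one iteration of the
# unscheduled loop; cycle numbers assume the stalls from the earlier slides.
ISSUE = [
    ("L.D", "load", 1),
    ("ADD.D", "fp", 3),
    ("S.D", "store", 6),
    ("DADDUI", "int_branch", 7),
    ("BNE", "int_branch", 9),
]


def build_words(issue, n_cycles):
    """One instruction word per cycle; '-' marks an empty slot."""
    words = [dict.fromkeys(UNITS, "-") for _ in range(n_cycles)]
    for op, unit, cycle in issue:
        words[cycle - 1][unit] = op
    return words


def utilization(words):
    """Fraction of slots that hold an operation."""
    used = sum(op != "-" for word in words for op in word.values())
    return used / (len(words) * len(UNITS))


words = build_words(ISSUE, n_cycles=10)
for word in words:
    print(" | ".join(f"{word[unit]:<6}" for unit in UNITS))
print(f"slot utilization: {utilization(words):.1%}")
```

Only 5 of the 40 slots are used in this model, which is why a VLIW depends on loop unrolling and software pipelining to find operations for the other slots.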
VLIW
• Superscalar vs. Dynamic VLIW
• [Figure: grouping, execution unit assignment and initiation between the instructions and the execution units; for Dynamic VLIW, initiation is marked as done in hardware.]
Dynamic VLIW • VLIW: no caches because no hardware to deal with cache misses • Dynamic VLIW: Hardware to stall on a cache miss. • Not used frequently
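VLIW
• Dynamic VLIW vs. Explicitly Parallel Instruction Computing (EPIC)
• [Figure: between the instructions and the execution units, Dynamic VLIW is marked with initiation in hardware; EPIC is marked with execution unit assignment and initiation in hardware.]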
EPIC • IA-64 architecture by HP and Intel • IA-64 is an instruction set architecture intended for implementation on EPIC • Itanium is first Intel product • 64-bit architecture • Basic concepts: • Instruction level parallelism indicated by compiler • Long or very long instruction words • Branch predication (≠ prediction) • Speculative loading
Key Features
• Large number of registers
  • IA-64 instruction format assumes 256
    • 128 * 64 bit integer, logical & general purpose
    • 128 * 82 bit floating point and graphic
  • 64 * 1 bit predicated execution registers (see later)
  • To support high degree of parallelism
• Multiple execution units
  • Expected to be 8 or more
  • Depends on number of transistors available
  • Execution of parallel instructions depends on hardware available
    • 8 parallel instructions may be split into two lots of four if only four execution units are available
IA-64 Execution Units • I-Unit • Integer arithmetic • Shift and add • Logical • Compare • Integer multimedia ops • M-Unit • Load and store • Between register and memory • Some integer ALU • B-Unit • Branch instructions • F-Unit • Floating point instructions
Instruction Format • 128 bit bundle • Holds three instructions (syllables) plus template • Can fetch one or more bundles at a time • Template contains info on which instructions can be executed in parallel • Not confined to single bundle • e.g. a stream of 8 instructions may be executed in parallel • Compiler will have re-ordered instructions to form contiguous bundles • Can mix dependent and independent instructions in same bundle
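A bundle can be modelled as a 128-bit integer. The slide gives only the bundle size and its contents (three instructions plus template); the field widths used below, a 5-bit template in the low bits followed by three 41-bit instruction slots, come from the IA-64 architecture definition rather than from the slide. The function names and the example values are placeholders, not real encodings.

```python
TEMPLATE_BITS = 5    # template field, lowest bits of the bundle
SLOT_BITS = 41       # each of the three instruction slots (syllables)


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for index in range(3)
    ]
    return template, slots


# Placeholder field values, chosen only to exercise the packing.
bundle = pack_bundle(0x10, [0x123, 0x456, 0x1FFFFFFFFFF])
print(hex(bundle), unpack_bundle(bundle))
```

5 + 3 × 41 = 128, so the template and the three slots fill the bundle exactly.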
Assembly Language Format
• [qp] mnemonic [.comp] dest = srcs //
  • qp: predicate register
    • If it is 1 at execution, the instruction executes and its result is committed to hardware
    • If it is 0, the result is discarded
  • mnemonic: name of instruction
  • comp: one or more instruction completers used to qualify the mnemonic
  • dest: one or more destination operands
  • srcs: one or more source operands
  • //: comment
• Instruction groups and stops are indicated by ;;
  • A group is a sequence without read after write or write after write
  • Groups do not need hardware register dependency checks
Assembly Examples
    ld8 r1 = [r5] ;;     // first group
    add r3 = r1, r4      // second group
• Second instruction depends on value in r1
  • Changed by first instruction
  • Cannot be in same group for parallel execution
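The grouping rule can be sketched as a greedy pass over the instruction stream: a stop is needed as soon as an instruction reads or writes a register that was written earlier in the current group. This is a simplified illustration that covers only the register read-after-write and write-after-write cases named on the slide; the real IA-64 rules contain further details and exceptions. The function name and the tuple layout are made up here.

```python
def split_into_groups(instructions):
    """Split a list of (text, destination registers, source registers)
    into instruction groups without register RAW or WAW dependencies."""
    groups, current, written = [], [], set()
    for text, dests, srcs in instructions:
        # RAW: a source was written in this group; WAW: a destination was.
        if written & (set(srcs) | set(dests)):
            groups.append(current)          # this is where ';;' would go
            current, written = [], set()
        current.append(text)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups


program = [
    ("ld8 r1 = [r5]", ["r1"], ["r5"]),
    ("add r3 = r1, r4", ["r3"], ["r1", "r4"]),
]
print(split_into_groups(program))
```

For the two instructions of the slide the pass produces two groups, because the add reads r1, which the load writes.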
Predication
• Pseudo code:
    if a == 0 then j = j+1 else k = k+1
• Using branches:
          cmp a,0
          jne L1
          add j,1
          jmp L2
    L1:   add k,1
    L2:
• Predicated:
    cmp.eq p1, p2 = 0, a ;;   // if a == 0 then p1 = 1 and p2 = 0, else p1 = 0 and p2 = 1
    (p1) add j = 1, j
    (p2) add k = 1, k
• Annotation on the slide (the code element it points at is not recoverable): should NOT be there, to enable parallelism
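Predicated execution can be imitated in a high-level language by computing both results and committing each one only when its predicate is true. The sketch below models the idea only; it is not IA-64 semantics in detail, and the function name `predicated_update` is made up for this illustration.

```python
def predicated_update(a, j, k):
    """Model of: cmp.eq p1, p2 = 0, a ;; (p1) add j = 1, j ; (p2) add k = 1, k"""
    # The compare sets a pair of complementary predicates.
    p1 = (a == 0)
    p2 = not p1
    # Both adds are executed, independent of each other and without a branch.
    j_result = j + 1
    k_result = k + 1
    # A result is committed only if its predicate is 1, otherwise discarded.
    j = j_result if p1 else j
    k = k_result if p2 else k
    return j, k


print(predicated_update(0, 5, 7))
print(predicated_update(3, 5, 7))
```

Because neither add depends on the other, the two predicated instructions can sit in the same instruction group, which is what removes the branches of the conventional version.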