
Embedded Computer Architectures



  1. Embedded Computer Architectures • Hennessy & Patterson, Chapter 4: Exploiting ILP with Software Approaches • Gerard Smit (Zilverling 4102), smit@cs.utwente.nl • André Kokkeler (Zilverling 4096), kokkeler@utwente.nl

  2. Contents • Introduction • Processor Architecture • Loop Unrolling • Software Pipelining

  3. Introduction

  4. Processor Architecture • 5-stage pipeline • Static scheduling • Integer and floating-point units

  5. Processor Architecture • Latencies: • Integer ALU => Integer ALU: no latency • Floating point ALU => Floating point ALU: latency = 3

  6. Processor Architecture • Latencies: • Load memory => Store memory: no latency

  7. Processor Architecture • Latencies: • Integer ALU => Store memory: no latency • Floating point ALU => Store memory: latency = 2

  8. Processor Architecture • Latencies: • Load memory => Integer ALU: latency = 1 • Load memory => Floating point ALU: latency = 1

  9. Processor Architecture • Latencies: • Integer ALU => Branch: latency = 1
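Summarizing slides 5-9, the latency is the number of stall cycles needed when the consuming instruction immediately follows the producing one:

      Producer        Consumer         Latency (stall cycles)
      FP ALU          FP ALU           3
      FP ALU          Store memory     2
      Load memory     FP ALU           1
      Load memory     Integer ALU      1
      Load memory     Store memory     0
      Integer ALU     Integer ALU      0
      Integer ALU     Store memory     0
      Integer ALU     Branch           1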

  10. Loop Unrolling • For i := 1000 downto 1 do x[i] := x[i] + s
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         DADDUI  R1,R1,#-8    ; i ← i-1
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • R1: pointer within array, F2: value to be added (s), R2: last element in array, F0: value in array, F4: value to be written in array
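For reference, the same loop in C (a minimal sketch; the function and parameter names are illustrative, with n standing for the 1000 elements on the slide):

    /* Reference version of the slide's loop: add the scalar s to every
       element of x, walking from the last element down to the first. */
    void add_scalar(double *x, double s, int n)
    {
        for (int i = n - 1; i >= 0; i--)
            x[i] = x[i] + s;
    }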

  11. Loop Unrolling
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         DADDUI  R1,R1,#-8    ; i ← i-1
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • Load memory => FP ALU: 1 stall (between L.D and ADD.D)

  12. Loop Unrolling
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         stall
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         DADDUI  R1,R1,#-8    ; i ← i-1
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • FP ALU => Store memory: 2 stalls (between ADD.D and S.D)

  13. Loop Unrolling
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         stall
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         stall
         stall
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         DADDUI  R1,R1,#-8    ; i ← i-1
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • Integer ALU => Branch: 1 stall (between DADDUI and BNE)

  14. Loop Unrolling
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         stall
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         stall
         stall
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         DADDUI  R1,R1,#-8    ; i ← i-1
         stall
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • A smart compiler can reschedule the instructions to hide these stalls

  15. Loop Unrolling
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         DADDUI  R1,R1,#-8    ; i ← i-1
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         stall
         BNE     R1,R2,Loop   ; repeat if i≠0
         S.D     8(R1),F4     ; x[i] ← x[i]+s (branch delay slot; offset is now 8(R1) because R1 has already been decremented)
  • The remaining stall is due to the FP ALU => Store memory latency (between ADD.D and S.D)
  • From 10 cycles per iteration to 6 cycles per iteration
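Where the 10 and the 6 come from (a worked count using the latencies of slides 5-9, not spelled out on the slide): the unscheduled loop executes 6 instructions (L.D, ADD.D, S.D, DADDUI, BNE, NOP) plus 4 stall cycles (1 after L.D, 2 after ADD.D, 1 after DADDUI), i.e. 10 cycles per element; the rescheduled loop executes 5 instructions plus 1 stall, i.e. 6 cycles per element.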

  16. Loop Unrolling
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         DADDUI  R1,R1,#-8    ; i ← i-1
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         BNE     R1,R2,Loop   ; repeat if i≠0
         S.D     8(R1),F4     ; x[i] ← x[i]+s
  • 5 instructions: 3 'doing the job', 2 control or 'overhead'
  • Reduce overhead => loop unrolling
  • Adds code
  • From 1000 iterations to 500 iterations
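In C terms, the unrolling by 2 that the next slides derive looks roughly like this (a sketch; it assumes the trip count is even, which holds for the 1000 iterations of the example):

    /* Loop unrolled by a factor of 2: two elements per iteration,
       half as many index updates and branches. */
    void add_unrolled2(double *x, double s, int n)   /* assumes n is even */
    {
        for (int i = n - 1; i >= 1; i -= 2) {
            x[i]     = x[i]     + s;    /* element i   */
            x[i - 1] = x[i - 1] + s;    /* element i-1 */
        }
    }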

  17. Loop Unrolling • Original code sequence:
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         DADDUI  R1,R1,#-8    ; i ← i-1
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • Copy the L.D / ADD.D / S.D part, with the correct 'data pointer' (offset)

  18. Loop Unrolling • Unrolled code sequence:
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         stall
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         stall
         stall
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         L.D     F0,-8(R1)    ; F0 ← x[i-1]
         stall
         ADD.D   F4,F0,F2     ; F4 ← x[i-1]+s
         stall
         stall
         S.D     -8(R1),F4    ; x[i-1] ← x[i-1]+s
         DADDUI  R1,R1,#-16   ; i ← i-2
         stall
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • There are still a lot of stalls. Removing them is easier if some additional registers are used

  19. Loop Unrolling • Unrolled code sequence (with extra registers F6 and F8):
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         stall
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         stall
         stall
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         L.D     F6,-8(R1)    ; F6 ← x[i-1]
         stall
         ADD.D   F8,F6,F2     ; F8 ← x[i-1]+s
         stall
         stall
         S.D     -8(R1),F8    ; x[i-1] ← x[i-1]+s
         DADDUI  R1,R1,#-16   ; i ← i-2
         stall
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot

  20. Loop Unrolling • Unrolled code sequence (second L.D moved up):
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         L.D     F6,-8(R1)    ; F6 ← x[i-1]
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         ADD.D   F8,F6,F2     ; F8 ← x[i-1]+s
         S.D     -8(R1),F8    ; x[i-1] ← x[i-1]+s
         DADDUI  R1,R1,#-16   ; i ← i-2
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • Stalls still remain before each S.D (FP ALU => Store) and before BNE (Integer ALU => Branch)

  21. Loop Unrolling • Unrolled code sequence (both ADD.Ds before both S.Ds):
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         L.D     F6,-8(R1)    ; F6 ← x[i-1]
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         ADD.D   F8,F6,F2     ; F8 ← x[i-1]+s
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         S.D     -8(R1),F8    ; x[i-1] ← x[i-1]+s
         DADDUI  R1,R1,#-16   ; i ← i-2
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • Stalls still remain before the stores and before BNE
  • Once DADDUI is moved up (next slide), the store offsets become +16 and +8

  22. Loop Unrolling • Unrolled code sequence (DADDUI moved up, store offsets adjusted):
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         L.D     F6,-8(R1)    ; F6 ← x[i-1]
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         ADD.D   F8,F6,F2     ; F8 ← x[i-1]+s
         DADDUI  R1,R1,#-16   ; i ← i-2
         S.D     16(R1),F4    ; x[i] ← x[i]+s
         S.D     8(R1),F8     ; x[i-1] ← x[i-1]+s
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot

  23. Loop Unrolling • Unrolled code sequence (second S.D moved into the branch delay slot):
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]
         L.D     F6,-8(R1)    ; F6 ← x[i-1]
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s
         ADD.D   F8,F6,F2     ; F8 ← x[i-1]+s
         DADDUI  R1,R1,#-16   ; i ← i-2
         S.D     16(R1),F4    ; x[i] ← x[i]+s
         BNE     R1,R2,Loop   ; repeat if i≠0
         S.D     8(R1),F8     ; x[i-1] ← x[i-1]+s (branch delay slot)
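Applying the latencies of slides 5-9 to this final sequence (a derived count, not stated on the slide): every dependence is now far enough apart that none of the 8 instructions stalls, so one iteration takes 8 cycles for 2 elements, i.e. about 4 cycles per element, compared with 6 cycles per element for the scheduled but non-unrolled loop of slide 15.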

  24. Loop Unrolling • In the example: loop-unrolling factor 2 • In general: loop-unrolling factor k • Limitations concerning k: • Amdahl's law: the 3 'real work' instructions per element (3 × 1000 = 3000 cycles) are always needed • Increasing k => increasing number of registers • Increasing k => increasing code size
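One practical point behind the "limitations concerning k" bullets: when k does not divide the trip count, the compiler also has to emit a clean-up loop for the leftover iterations. A minimal C sketch with k = 4 (names as in the earlier sketches):

    /* Unrolled by k = 4, with a remainder loop for the n mod 4 leftover elements. */
    void add_unrolled4(double *x, double s, int n)
    {
        int i = n - 1;
        for (; i >= 3; i -= 4) {         /* main unrolled loop: 4 elements per iteration */
            x[i]     += s;
            x[i - 1] += s;
            x[i - 2] += s;
            x[i - 3] += s;
        }
        for (; i >= 0; i--)              /* clean-up loop */
            x[i] += s;
    }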

  25. Software Pipelining • Original (non-unrolled) loop:
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]        (1 stall follows)
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s      (2 stalls follow)
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         DADDUI  R1,R1,#-8    ; i ← i-1          (1 stall follows)
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • Three actions are involved with the actual calculation: F0 ← x[i], F4 ← x[i] + s, x[i] ← x[i] + s
  • Consider these as three different stages

  26. Software Pipelining • Original (non-unrolled) loop: same code as the previous slide
  • Three actions are involved with the actual calculation:
      F0 ← x[i]          Stage 1
      F4 ← x[i] + s      Stage 2
      x[i] ← x[i] + s    Stage 3
  • Associate an array element with the stages

  27. Software Pipelining • Original (non-unrolled) loop: same code as the previous slide
  • Three actions are involved with the actual calculation, each associated with the array element it works on:
      F0 ← x[i]          Stage 1, x[i]
      F4 ← x[i] + s      Stage 2, x[i]
      x[i] ← x[i] + s    Stage 3, x[i]

  28. Software Pipelining • Normal execution (diagram): Stage 1 fills F0, Stage 2 reads F0 and fills F4, Stage 3 reads F4. Each element (x[1000], then x[999], then x[998], ...) passes through all three stages before the next element starts, with 1 stall between Stage 1 and Stage 2 and 2 stalls between Stage 2 and Stage 3; the diagram shades, over time, when each register is empty and when it is occupied.

  29. Software Pipelining • Software-pipelined execution (diagram): the same three stages, but now the stages of different elements overlap in time: while Stage 3 stores x[1000], Stage 2 already works on x[999] and Stage 1 loads x[998]. The 2-stall gap before each store disappears (0 stalls); only 1 stall remains after each load.

  30. Software Pipelining • Software-pipelined execution:
   ; start-up code (fills the pipeline)
         L.D     F0,0(R1)     ; F0 ← x[1000]      (Stage 1, x[1000])
         ADD.D   F4,F0,F2     ; F4 ← x[1000]+s    (Stage 2, x[1000])
         L.D     F0,-8(R1)    ; F0 ← x[999]       (Stage 1, x[999])
   ; steady-state kernel
   Loop: S.D     0(R1),F4     ; x[i] ← F4         (Stage 3, x[i])
         ADD.D   F4,F0,F2     ; F4 ← x[i-1]+s     (Stage 2, x[i-1])
         L.D     F0,-16(R1)   ; F0 ← x[i-2]       (Stage 1, x[i-2])
         BNE     R1,R2,Loop   ; repeat if i≠1
         DADDUI  R1,R1,#-8    ; decrement the pointer (branch delay slot)

  31. Software Pipelining • The same software-pipelined loop as the previous slide: once the start-up code has filled the pipeline, the kernel (S.D, ADD.D, L.D, BNE, DADDUI) runs with 0 stalls.

  32. Software Pipelining • No stalls inside loop • Additional start-up (and clean-up) code • No reduction of control overhead • No additional registers
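The same restructuring can be sketched at the C level (illustrative only; the compiler performs this on the instruction schedule, not on the source). The prologue corresponds to the start-up code of slide 30, the loop body to the stall-free kernel, and the epilogue to the clean-up code:

    /* Software-pipelined sketch of x[i] += s (assumes n >= 2).
       Each kernel iteration stores element i, adds for element i-1
       and loads element i-2 -- three stages of three different elements. */
    void add_swp(double *x, double s, int n)
    {
        double loaded = x[n - 1];          /* prologue: stage 1 for element n-1 */
        double summed = loaded + s;        /* prologue: stage 2 for element n-1 */
        loaded = x[n - 2];                 /* prologue: stage 1 for element n-2 */

        for (int i = n - 1; i >= 2; i--) {
            x[i] = summed;                 /* stage 3: store element i     */
            summed = loaded + s;           /* stage 2: add for element i-1 */
            loaded = x[i - 2];             /* stage 1: load element i-2    */
        }

        x[1] = summed;                     /* epilogue: drain the pipeline */
        x[0] = loaded + s;
    }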

  33. VLIW • To simplify processor hardware: sophisticated compilers (loop unrolling, software pipelining etc.) • Extreme form: Very Long Instruction Word processors

  34. VLIW • Superscalar vs. VLIW (diagram): instructions flow to the execution units; in a superscalar processor the hardware performs the grouping, execution-unit assignment and initiation of instructions, whereas in a VLIW processor these tasks are left to the compiler.

  35. VLIW • Suppose 4 functional units: • Memory load unit • Floating point unit • Memory store unit • Integer/Branch unit • Instruction word (diagram): [ Memory load | FP operation | Memory store | Integer/Branch ]

  36. VLIW • Original (non-unrolled) loop:
   Loop: L.D     F0,0(R1)     ; F0 ← x[i]        (1 stall follows)
         ADD.D   F4,F0,F2     ; F4 ← x[i]+s      (2 stalls follow)
         S.D     0(R1),F4     ; x[i] ← x[i]+s
         DADDUI  R1,R1,#-8    ; i ← i-1          (1 stall follows)
         BNE     R1,R2,Loop   ; repeat if i≠0
         NOP                  ; branch delay slot
  • Mapped onto the VLIW slots (diagram), the dependent L.D / ADD.D / S.D chain occupies one slot per cycle with stall cycles in between:
      Cycle   Memory load    FP operation    Memory store    Integer/Branch
      1       L.D
      2       (stall)
      3                      ADD.D
      4       (stall)
      5       (stall)
      6                                      S.D
  • Limit the stall cycles by clever compilers (loop unrolling, software pipelining)
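As an illustration of that last bullet (a sketch, not taken from the slides): using the latencies of slides 5-9, the 2x-unrolled and renamed loop of slide 23 could be packed into the four slots roughly as follows, giving about 7 cycles for two elements:

      Cycle   Memory load       FP operation      Memory store      Integer/Branch
      1       L.D  F0,0(R1)
      2       L.D  F6,-8(R1)
      3                         ADD.D F4,F0,F2
      4                         ADD.D F8,F6,F2                      DADDUI R1,R1,#-16
      5
      6                                           S.D 16(R1),F4     BNE R1,R2,Loop
      7                                           S.D 8(R1),F8      (branch delay slot)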

  37. VLIW • Superscalar vs. VLIW (diagram repeated from slide 34): for VLIW, the grouping, execution-unit assignment and initiation have moved from the hardware to the compiler.

  38. VLIW • Superscalar vs. dynamic VLIW (diagram): in a dynamic VLIW the compiler still performs the grouping and execution-unit assignment, but the initiation of instructions is handled by the hardware.

  39. Dynamic VLIW • VLIW: no caches because no hardware to deal with cache misses • Dynamic VLIW: Hardware to stall on a cache miss. • Not used frequently

  40. VLIW • Dynamic VLIW vs. Explicitly Parallel Instruction Computing (EPIC) (diagram): with EPIC the compiler indicates the grouping of instructions, while the hardware performs the execution-unit assignment and initiation.

  41. EPIC • IA-64: architecture by HP and Intel • IA-64 is an instruction set architecture intended for implementation using EPIC techniques • Itanium was the first Intel product implementing it • 64-bit architecture • Basic concepts: • Instruction-level parallelism indicated by the compiler • Long or very long instruction words • Branch predication (≠ prediction) • Speculative loading

  42. Key Features • Large number of registers • The IA-64 instruction format assumes 256: • 128 × 64-bit integer, logical & general purpose • 128 × 82-bit floating point and graphic • plus 64 × 1-bit predicate registers for predicated execution (see later) • To support a high degree of parallelism: • Multiple execution units • Expected to be 8 or more, depending on the number of transistors available • Execution of parallel instructions depends on the hardware available • e.g. 8 parallel instructions may be split into two lots of four if only four execution units are available

  43. IA-64 Execution Units • I-Unit • Integer arithmetic • Shift and add • Logical • Compare • Integer multimedia ops • M-Unit • Load and store • Between register and memory • Some integer ALU • B-Unit • Branch instructions • F-Unit • Floating point instructions

  44. Instruction Format Diagram

  45. Instruction Format • 128 bit bundle • Holds three instructions (syllables) plus template • Can fetch one or more bundles at a time • Template contains info on which instructions can be executed in parallel • Not confined to single bundle • e.g. a stream of 8 instructions may be executed in parallel • Compiler will have re-ordered instructions to form contiguous bundles • Can mix dependent and independent instructions in same bundle
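For concreteness (a standard IA-64 detail, not spelled out on the slide): the 128-bit bundle consists of three 41-bit instruction slots plus a 5-bit template, 3 × 41 + 5 = 128:

      | instruction slot 2 (41 bits) | instruction slot 1 (41 bits) | instruction slot 0 (41 bits) | template (5 bits) |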

  46. Assembly Language Format • [qp] mnemonic [.comp] dest = srcs // comment • qp – predicate register: if 1 at execution, the instruction executes and its result is committed; if 0, the result is discarded • mnemonic – name of the instruction • comp – one or more instruction completers used to qualify the mnemonic • dest – one or more destination operands • srcs – one or more source operands • // – comment • Instruction groups and stops are indicated by ;; • An instruction group is a sequence without read-after-write or write-after-write register dependences, so it needs no hardware register dependency checks

  47. Assembly Examples
      ld8 r1 = [r5] ;;    // first group
      add r3 = r1, r4     // second group
  • The second instruction depends on the value in r1, which is changed by the first instruction • They cannot be in the same group for parallel execution

  48. Predication

  49. Predication
  • Pseudo code:
      if a == 0 then j = j + 1 else k = k + 1
  • Using branches:
          cmp  a,0
          jne  L1
          add  j,1
          jmp  L2
      L1: add  k,1
      L2:
  • Predicated (if a == 0 then p1 = 1 and p2 = 0, else p1 = 0 and p2 = 1):
          cmp.eq p1, p2 = 0, a ;;
      (p1) add j = 1, j
      (p2) add k = 1, k
  • Note on the slide: the ';;' should NOT be there, to enable parallelism
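What the predicated version computes can be mimicked branch-free in C (a sketch for illustration; the function and variable names are hypothetical):

    /* Branch-free equivalent of the predicated IA-64 sequence:
       both additions are issued, each guarded by a 0/1 predicate. */
    void predicated_update(int a, int *j, int *k)
    {
        int p1 = (a == 0);    /* predicate p1: condition holds        */
        int p2 = !p1;         /* predicate p2: condition does not     */
        *j += p1;             /* (p1) add j = 1, j  -- no effect if p1 == 0 */
        *k += p2;             /* (p2) add k = 1, k  -- no effect if p2 == 0 */
    }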

  50. Speculative Loading
