1. ILP: Beyond Pipelining
Subbaiah Venkata
2. Objectives
Understand why we care about parallelism
Understand Instruction Level Parallelism (ILP)
Terms you should know:
Scalar (issue and execution)
Single In-Order Issue → In-Order execution
Single Out-Of-Order Issue → Out-Of-Order execution
SuperScalar (issue and execution)
Single In-Order issue → In-Order / Out-Of-Order execution
Multiple In-Order issue → In-Order / Out-Of-Order execution
Single Out-Of-Order issue → Out-Of-Order execution
Multiple Out-Of-Order issue → Out-Of-Order execution
Out-Of-Order Issue or Dynamic Scheduling (Hardware based)
Register renaming
Scoreboarding
Tomasulo Algorithm
. . .
Static Scheduling (Software based)
Code Movement
Loop Unrolling
VLIW processors
. . .
3. Ideas To Reduce Stalls
4. Parallelism
Basic concept: technology limits how fast we can execute a single instruction/operation
Do many things at the same time to make program execute faster
Can also do parallelism for throughput: increasing the number of tasks that complete in a given amount of time
5. Why is Parallelism Hard?
Dependencies
Can’t perform dependent computations at the same time
The dependencies in a program create an upper limit on what can be done in parallel
Communication
Processing units may need to share data
Synchronization
Communication required to enforce the ordering imposed by dependencies
Programming/Machine Languages
Sequential languages hide parallelism
6. Sequential Languages Hide Parallelism
Algorithms often have parallel structure
Example: Vector add
For correctness, need to impose a partial order on computations
Example: Finish one vector add before you do anything with the result
Sequential programming/machine languages impose a total order on computation
for (int i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
}
7. Exploiting Parallelism
What do we need?
Some way to do multiple computations at the same time
Some way to communicate data between the units that do the computation
Some way to synchronize the different units to enforce ordering when necessary
Some way to tell when operations can and cannot be done in parallel
Programmer
Compiler
Hardware
All of the parallel architectures we’ll see can be characterized by how they provide these items
8. What is ILP?
The characteristic of a program that certain instructions are independent, and can potentially be executed in parallel.
Any mechanism that creates, identifies, or exploits the independence of instructions, allowing them to be executed in parallel.
9. What is ILP? (cont.)
Why do we want/need ILP?
In a superscalar architecture?
What about a scalar architecture?
10. Instruction-Level Parallelism
Basic Idea: Take a sequential program and execute individual instructions in parallel
Execution pipelines (multiple) are the resource that does work in parallel
Register file is the communication mechanism between parallel instructions
Hardware or compiler can be responsible for detecting work that can be done in parallel
Instruction issue logic (e.g., a scoreboard) is generally the mechanism that enforces synchronization
11. Instruction-Level Parallel Processor
12. Where do we find ILP?
In basic blocks?
15-20% of (dynamic) instructions are branches in typical code, so a basic block averages only about 5-7 instructions
Across basic blocks?
how?
13. How do we expose ILP?
By moving instructions around.
How??
14. How do we expose ILP? (cont.)
Software
Hardware
15. Exposing ILP in Software
Instruction scheduling (changes ILP within a basic block)
loop unrolling (allows ILP across iterations by putting instructions from multiple iterations in the same basic block)
Others (trace scheduling, software pipelining)
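For loop unrolling in particular, here is a minimal sketch (assuming N is a multiple of 4) of the slide-6 vector add unrolled by four, so that one basic block now holds four independent adds the scheduler can overlap:

for (int i = 0; i < N; i += 4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];   /* no dependence on the add above */
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}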
16. Key Points
You can find, create, and exploit Instruction Level Parallelism in SW or HW
Loop level parallelism is usually easiest to see
Dependencies exist in a program and become hazards if the HW cannot resolve them
SW dependences and compiler sophistication determine whether the compiler can/should unroll loops
17. First HW ILP Technique: Out-of-Order Issue / Dynamic Scheduling
Problem: need to get stalled instructions out of the ID stage, so that subsequent instructions can begin execution.
Must separate detection of structural hazards from detection of data hazards
Must split ID operation into two:
Issue (decode, check for structural hazards)
Read operands (read operands only when there are NO data hazards)
i.e., must be able to issue even when a data hazard exists
Instructions issue in order, but proceed to EX out of order
18. HW Schemes: Instruction Parallelism
Why in HW at run time?
Works when real dependences can't be known at compile time
Compiler simpler
Code for one machine runs well on another
Key idea: Allow instructions behind stall to proceed
DIVD F0,F2,F4    ; long-latency divide produces F0
ADDD F10,F0,F8   ; RAW dependence on F0: stalls behind the divide
SUBD F12,F8,F14  ; independent of F0: can proceed past the stall
Enables out-of-order execution => out-of-order completion
19. HW Schemes: Instruction Parallelism
Out-of-order execution divides the ID stage:
1. Issue—decode instructions, check for structural hazards
2. Read operands—wait until no data hazards, then read operands
Scoreboards allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions
CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)
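A minimal C sketch of the split ID stage (names and types are illustrative, not from any real machine): issue checks only structural hazards, read-operands checks only data hazards:

typedef struct {
    int fu;                        /* functional unit this instruction needs */
    int src1_ready, src2_ready;    /* whether each source operand is available */
} Instr;

int can_issue(const Instr *in, const int fu_busy[]) {
    return !fu_busy[in->fu];                   /* 1. structural hazard check */
}

int can_read_operands(const Instr *in) {
    return in->src1_ready && in->src2_ready;   /* 2. data hazard check */
}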
20. Another Dynamic Algorithm: Tomasulo Algorithm
For the IBM 360/91, about 3 years after the CDC 6600 (1966)
Goal: High Performance without special compilers
Why study? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
21. Tomasulo Algorithm
Control & buffers distributed with Function Units (FU)
FU buffers called “reservation stations”; have pending operands
Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
Avoids WAR, WAW hazards
More reservation stations than registers, so can do optimizations compilers can’t
Results go to FUs from RSs, not through registers, over a Common Data Bus (CDB) that broadcasts results to all FUs
Loads and stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
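Register renaming can be sketched in C, with variables standing in for registers (an illustrative fragment, not from the slides): reusing a name creates WAR/WAW hazards, while a fresh name per result leaves only true (RAW) dependences:

/* before renaming: reuse of 'sum' creates WAW and WAR hazards */
sum = a / b;      /* slow divide writes sum */
r   = sum + c;    /* RAW: must wait for the divide */
sum = d - e;      /* WAW on sum, WAR with the read above */

/* after renaming: the subtract is now independent of the divide */
sum1 = a / b;
r    = sum1 + c;
sum2 = d - e;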
22. Tomasulo Organization
Resolve RAW memory conflict? (address in memory buffers)
Integer unit executes in parallel
23. Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk—Value of Source operands
Store buffers have V field, result to be stored
Qj, Qk—Reservation stations producing source registers (value to be written)
Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
Store buffers only have Qi for RS producing result
Busy—Indicates reservation station or FU is busy
Register result status—Indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
What you might have thought (the CDC 6600 scoreboard, for contrast):
1. 4 stages of instruction execution
2. Status of FU: normal things to keep track of (RAW, & structural for Busy):
Fi from the instruction format of the machine (Fi is the destination)
Add unit can Add or Sub
Rj, Rk - status of source registers (Yes means ready)
Qj, Qk - a No in Rj/Rk means waiting for an FU to write the result; Qj/Qk say which FU it is waiting for
3. Status of register result (WAW & WAR):
which FU is going to write into each register
Scoreboard on the 6600 = size of FU
6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17
FU latencies: Add 2, Mult 10, Div 40 clocks
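The reservation-station fields of slide 23 can be sketched as a C struct (a minimal sketch; field names follow the slide, the type itself is illustrative):

typedef struct {
    int    busy;      /* station holds a pending instruction */
    char   op;        /* operation to perform, e.g. '+' or '-' */
    double Vj, Vk;    /* source operand values, valid once Qj/Qk == 0 */
    int    Qj, Qk;    /* tags of the RSs producing the operands; 0 => ready */
} RS;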
24. Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard), control issues instruction & sends operands (renames registers).
2. Execution—operate on operands (EX)
When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units; mark reservation station available
Normal data bus: data + destination (“go to” bus)
Common data bus: data + source (“come from” bus)
64 bits of data + 4 bits of Functional Unit source address
A waiting unit captures the value when the source matches the Functional Unit it expects (the producer of its operand)
The CDB does the broadcast
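A minimal sketch of the write-result broadcast, using the RS struct sketched after slide 23 (the function and tag encoding are illustrative): every waiting station compares the result's source tag against its Qj/Qk and captures a match:

void cdb_broadcast(RS rs[], int n, int src_tag, double value) {
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].Qj == src_tag) { rs[i].Vj = value; rs[i].Qj = 0; }  /* operand j arrives */
        if (rs[i].Qk == src_tag) { rs[i].Vk = value; rs[i].Qk = 0; }  /* operand k arrives */
        /* a station may begin execution once Qj == 0 && Qk == 0 */
    }
}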
25.–44. Tomasulo Example: Cycles 0–16 and 55–57 (cycle-by-cycle reservation-station tables shown as figures on the slides)
45. Tomasulo Drawbacks
Complexity
Delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed
Performance limited by Common Data Bus
Multiple CDBs => more FU logic for parallel assoc stores
46. Tomasulo Loop Example
Loop: LD    F0, 0(R1)     ; load array element
      MULTD F4, F0, F2    ; multiply by the scalar in F2
      SD    F4, 0(R1)     ; store the result
      SUBI  R1, R1, #8    ; step the pointer back by one double (8 bytes)
      BNEZ  R1, Loop      ; repeat until R1 reaches 0
Assume Multiply takes 4 clocks
Assume first load takes 8 clocks (cache miss?), second load takes 4 clocks (hit)
For clarity, clock cycles will be shown for SUBI and BNEZ
In reality, the integer instructions run ahead
47.–68. Loop Example: Cycles 0–21 (cycle-by-cycle tables shown as figures on the slides)
69. Tomasulo Summary
Reservation stations: renaming to a larger set of registers + buffering of source operands
Prevents registers from becoming a bottleneck
Avoids WAR, WAW hazards of Scoreboard
Allows loop unrolling in HW
Not limited to basic blocks (the integer unit gets ahead, beyond branches)
Helps cache misses as well
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
70. Next: Branch Prediction