ILP: Beyond Pipelining

Presentation Transcript


    1. ILP: Beyond Pipelining Subbaiah Venkata

    2. 4/29/2012, GMU, ECE 511, Microprocessors. Objectives
        - Understand why we care about parallelism
        - Understand Instruction Level Parallelism (ILP)
        - Terms you should know:
            - Scalar (issue and execution)
                - Single In-Order Issue -> In-Order execution
                - Single Out-Of-Order Issue -> Out-Of-Order execution
            - SuperScalar (issue and execution)
                - Single In-Order issue -> In-Order / Out-Of-Order execution
                - Multiple In-Order issue -> In-Order / Out-Of-Order execution
                - Single Out-Of-Order Issue -> Out-Of-Order execution
                - Multiple Out-Of-Order issue -> Out-Of-Order execution
            - Out-Of-Order Issue or Dynamic Scheduling (Hardware based)
                - Register renaming
                - Scoreboarding
                - Tomasulo Algorithm
                - ...
            - Static Scheduling (Software based)
                - Code Movement
                - Loop Unrolling
                - VLIW processors
                - ...

    3. Ideas To Reduce Stalls

    4. Parallelism
        - Basic concept: technology limits how fast we can execute a single instruction/operation
        - Do many things at the same time to make a program execute faster
        - Can also use parallelism for throughput: increasing the number of tasks that complete in a given amount of time

    5. Why is Parallelism Hard?
        - Dependencies
            - Can't perform dependent computations at the same time
            - The dependencies in a program create an upper limit on what can be done in parallel
        - Communication
            - Processing units may need to share data
        - Synchronization
            - Communication required to enforce the ordering imposed by dependencies
        - Programming/Machine Languages
            - Sequential languages hide parallelism

    6. Sequential Languages Hide Parallelism
        - Algorithms often have parallel structure
            - Example: Vector add
        - For correctness, need to impose a partial order on computations
            - Example: Finish one vector add before you do anything with the result
        - Sequential programming/machine languages impose a total order on computation
            for (I = 0; I < N; I++) {
                C[I] = A[I] + B[I];
            }
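
A minimal sketch of the vector-add point, written in Python rather than the slide's C-like loop. No iteration reads a value written by another iteration, so visiting the iterations in any order gives the same result; the sequential loop's total order is stricter than correctness requires. The function names and the shuffled order are illustrative assumptions, not part of the slides.

```python
import random

def vector_add_in_order(a, b):
    """Vector add in the total order a sequential language imposes."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def vector_add_any_order(a, b, seed=0):
    """Same computation, visiting the iterations in a shuffled order."""
    c = [0] * len(a)
    order = list(range(len(a)))
    random.Random(seed).shuffle(order)  # stands in for "whatever order the hardware likes"
    for i in order:
        c[i] = a[i] + b[i]  # touches only element i, so order does not matter
    return c

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
same = vector_add_in_order(a, b) == vector_add_any_order(a, b)
```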

    7. Exploiting Parallelism
        - What do we need?
            - Some way to do multiple computations at the same time
            - Some way to communicate data between the units that do the computation
            - Some way to synchronize the different units to enforce ordering when necessary
            - Some way to tell when operations can and cannot be done in parallel
                - Programmer
                - Compiler
                - Hardware
        - All of the parallel architectures we'll see can be characterized by how they provide these items

    8. What is ILP?
        - The characteristic of a program that certain instructions are independent, and can potentially be executed in parallel.
        - Any mechanism that creates, identifies, or exploits the independence of instructions, allowing them to be executed in parallel.

    9. What is ILP?
        - The characteristic of a program that certain instructions are independent, and can potentially be executed in parallel.
        - Any mechanism that creates, identifies, or exploits the independence of instructions, allowing them to be executed in parallel.
        - Why do we want/need ILP?
            - In a superscalar architecture?
            - What about a scalar architecture?

    10. Instruction-Level Parallelism
        - Basic idea: take a sequential program and execute individual instructions in parallel
        - Execution pipelines (multiple) are the resource that does work in parallel
        - Register file is the communication mechanism between parallel instructions
        - Hardware or compiler can be responsible for detecting work that can be done in parallel
        - Instruction issue logic (scoreboard) is generally the mechanism that enforces synchronization

    11. Instruction-Level Parallel Processor

    12. Where do we find ILP?
        - In basic blocks?
            - 15-20% of (dynamic) instructions are branches in typical code
        - Across basic blocks?
            - How?
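
A back-of-the-envelope sketch of why basic blocks alone offer little ILP. It assumes, as a simplification not stated on the slide, that each basic block ends in exactly one branch, so the average block length is roughly the reciprocal of the branch fraction. With 15-20% branches that is only about 5 to 7 instructions per block. The function name is hypothetical.

```python
def average_basic_block_size(branch_fraction):
    """Approximate dynamic instructions per basic block.

    Assumes one branch terminates each block (a simplification).
    """
    if not 0 < branch_fraction <= 1:
        raise ValueError("branch_fraction must be in (0, 1]")
    return 1.0 / branch_fraction

for fraction in (0.15, 0.20):
    size = average_basic_block_size(fraction)
    print(f"{fraction:.0%} branches: about {size:.1f} instructions per block")
```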

    13. How do we expose ILP? By moving instructions around. How??

    14. How do we expose ILP? By moving instructions around. How??
        - Software
        - Hardware

    15. Exposing ILP in software
        - Instruction scheduling (changes ILP within a basic block)
        - Loop unrolling (allows ILP across iterations by putting instructions from multiple iterations in the same basic block)
        - Others (trace scheduling, software pipelining)
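
A minimal sketch of loop unrolling, applied to the vector add from earlier in the deck. The unroll factor of 4 and the function name are assumptions chosen for illustration. The unrolled body places four independent adds in one basic block with a single loop-back branch, and a cleanup loop handles lengths that are not a multiple of 4. Python itself gains nothing from this; the sketch only shows the transformation a compiler would apply.

```python
def vector_add_unrolled(a, b):
    """Vector add with the loop body unrolled four times."""
    n = len(a)
    c = [0] * n
    i = 0
    # Unrolled body: four independent adds per trip, one branch per trip.
    while i + 4 <= n:
        c[i] = a[i] + b[i]
        c[i + 1] = a[i + 1] + b[i + 1]
        c[i + 2] = a[i + 2] + b[i + 2]
        c[i + 3] = a[i + 3] + b[i + 3]
        i += 4
    # Cleanup loop for the remaining 0 to 3 elements.
    while i < n:
        c[i] = a[i] + b[i]
        i += 1
    return c
```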

    16. Key Points
        - You can find, create, and exploit Instruction Level Parallelism in SW or HW
        - Loop level parallelism is usually easiest to see
        - Dependencies exist in a program, and become hazards if HW cannot resolve them
        - SW dependencies/compiler sophistication determine if compiler can/should unroll loops

    17. First HW ILP Technique: Out-of-order Issue/Dynamic Scheduling
        - Problem: need to get stalled instructions out of the ID stage, so that subsequent instructions can begin execution
        - Must separate detection of structural hazards from detection of data hazards
        - Must split ID operation into two:
            - Issue (decode, check for structural hazards)
            - Read operands (read operands when NO DATA HAZARDS)
        - i.e., must be able to issue even when a data hazard exists
        - Instructions issue in-order, but proceed to EX out-of-order

    18. HW Schemes: Instruction Parallelism
        - Why in HW at run time?
            - Works when can't know real dependence at compile time
            - Compiler simpler
            - Code for one machine runs well on another
        - Key idea: Allow instructions behind stall to proceed
            DIVD F0,F2,F4
            ADDD F10,F0,F8
            SUBD F12,F8,F14
        - Enables out-of-order execution => out-of-order completion
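
A toy timing model of the DIVD/ADDD/SUBD sequence, to show what "allow instructions behind stall to proceed" buys. It is a sketch under stated assumptions, not a model of any real pipeline: one instruction issues per cycle, latencies are 40 clocks for divide and 2 for add/subtract, structural hazards and WAR/WAW hazards are ignored, and `schedule` is a hypothetical helper. ADDD must wait for F0 from DIVD; SUBD depends on neither.

```python
# (name, destination, sources, latency in clocks); latencies are assumed values.
PROGRAM = [
    ("DIVD", "F0", ("F2", "F4"), 40),
    ("ADDD", "F10", ("F0", "F8"), 2),
    ("SUBD", "F12", ("F8", "F14"), 2),
]

def schedule(program, out_of_order):
    """Return {instruction name: completion cycle} under the toy model."""
    ready_at = {}    # register -> cycle at which its pending result is available
    prev_start = 0   # cycle at which the previous instruction began executing
    finish = {}
    for slot, (name, dest, srcs, latency) in enumerate(program, start=1):
        operands_ready = max(ready_at.get(r, 0) for r in srcs)
        start = max(slot, operands_ready)      # RAW: wait for source operands
        if not out_of_order:
            # In-order execution: cannot start before the previous instruction has.
            start = max(start, prev_start + 1)
        prev_start = start
        ready_at[dest] = start + latency
        finish[name] = start + latency
    return finish

in_order = schedule(PROGRAM, out_of_order=False)
dynamic = schedule(PROGRAM, out_of_order=True)
```

In this model SUBD finishes long before DIVD and ADDD when it is allowed to go around the stall, which is the out-of-order completion the slide refers to.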

    19. HW Schemes: Instruction Parallelism
        - Out-of-order execution divides ID stage:
            1. Issue: decode instructions, check for structural hazards
            2. Read operands: wait until no data hazards, then read operands
        - Scoreboards allow instruction to execute whenever 1 & 2 hold, not waiting for prior instructions
        - CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)

    20. Another Dynamic Algorithm: Tomasulo Algorithm
        - For IBM 360/91 about 3 years after CDC 6600 (1966)
        - Goal: High performance without special compilers
        - Why study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...

    21. Tomasulo Algorithm
        - Control & buffers distributed with Function Units (FU)
            - FU buffers called "reservation stations"; have pending operands
        - Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
            - Avoids WAR, WAW hazards
            - More reservation stations than registers, so can do optimizations compilers can't
        - Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
        - Loads and stores treated as FUs with RSs as well
        - Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
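
A small sketch of why renaming avoids WAR and WAW hazards. The four-instruction sequence, the tag names `T0`, `T1`, ... and both helper functions are assumptions made for illustration; in Tomasulo's scheme the tags would be reservation station names rather than a fresh counter. Each destination gets a new name and each source reads the latest name for its register, so true (RAW) dependences survive while name conflicts disappear.

```python
# (op, destination, sources). Hypothetical sequence with one WAR and one WAW hazard.
PROGRAM = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F6", ("F0", "F8")),
    ("SUBD", "F8", ("F10", "F14")),  # WAR with ADDD, which reads F8
    ("MULD", "F6", ("F10", "F8")),   # WAW with ADDD, which also writes F6
]

def rename(program):
    """Give every destination a fresh tag; sources read the latest mapping."""
    mapping = {}
    renamed = []
    for n, (op, dest, srcs) in enumerate(program):
        new_srcs = tuple(mapping.get(r, r) for r in srcs)  # rename sources first
        mapping[dest] = f"T{n}"
        renamed.append((op, mapping[dest], new_srcs))
    return renamed

def name_hazards(program):
    """List (kind, earlier index, later index) for every WAR and WAW pair."""
    hazards = []
    for j, (_, dest_j, _) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_j in srcs_i:
                hazards.append(("WAR", i, j))
            if dest_j == dest_i:
                hazards.append(("WAW", i, j))
    return hazards

before = name_hazards(PROGRAM)
renamed = rename(PROGRAM)
after = name_hazards(renamed)
```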

    22. Tomasulo Organization
        - Resolve RAW memory conflict? (address in memory buffers)
        - Integer unit executes in parallel

    23. Reservation Station Components
        - Op: Operation to perform in the unit (e.g., + or -)
        - Vj, Vk: Value of source operands
            - Store buffers have V field, result to be stored
        - Qj, Qk: Reservation stations producing source registers (value to be written)
            - Note: No ready flags as in Scoreboard; Qj, Qk = 0 => ready
            - Store buffers only have Qi for RS producing result
        - Busy: Indicates reservation station or FU is busy
        - Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.
        - What you might have thought:
            1. 4 stages of instruction execution
            2. Status of FU: normal things to keep track of (RAW & structural for busy)
                - Fi from instruction format of the machine (Fi is dest)
                - Add unit can Add or Sub
                - Rj, Rk: status of registers (Yes means ready)
                - Qj, Qk: if a No in Rj, Rk, means waiting for a FU to write result; Qj, Qk say which FU it is waiting for
            3. Status of register result (WAW & WAR): which FU is going to write into registers
        - Scoreboard on 6600 = size of FU
        - 6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17
        - FU latencies: Add 2, Mult 10, Div 40 clocks
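
The fields above map naturally onto a small record type. This is a sketch, not the 360/91's actual layout: the class name, the Python types, and the use of `None` in place of the slide's `Qj, Qk = 0` convention are all assumptions. The `ready` method captures the note that there are no separate ready flags; an operand is ready exactly when no producer tag is outstanding.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                   # tag other stations use to refer to this one
    busy: bool = False          # Busy: station holds a pending instruction
    op: Optional[str] = None    # Op: operation to perform (e.g. "+" or "-")
    vj: Optional[float] = None  # Vj, Vk: source operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # Qj, Qk: tag of the station that will produce
    qk: Optional[str] = None    #         the operand; None means "ready"

    def ready(self) -> bool:
        """True when the station is in use and both operands have arrived."""
        return self.busy and self.qj is None and self.qk is None

# An add waiting for its second operand from a (hypothetical) station Mult1.
rs = ReservationStation("Add1", busy=True, op="+", vj=1.5, qk="Mult1")
waiting = not rs.ready()
```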

    24. Three Stages of Tomasulo Algorithm
        1. Issue: get instruction from FP Op Queue
            - If reservation station free (no structural hazard), control issues instruction & sends operands (renames registers)
        2. Execution: operate on operands (EX)
            - When both operands ready then execute; if not ready, watch Common Data Bus for result
        3. Write result: finish execution (WB)
            - Write on Common Data Bus to all awaiting units; mark reservation station available
        - Normal data bus: data + destination ("go to" bus)
        - Common data bus: data + source ("come from" bus)
            - 64 bits of data + 4 bits of Functional Unit source address
            - Write if matches expected Functional Unit (produces result)
            - Does the broadcast
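
The three stages can be sketched in a few functions. This is a deliberately reduced model under stated assumptions: there is no timing, no load/store buffering, and one station per instruction; `Station`, `issue`, `can_execute`, `write_result`, and the register values are hypothetical names and data. What it does show is the "come from" nature of the Common Data Bus: a result is broadcast with its source tag, and every waiting station or register that expects that tag captures the value.

```python
class Station:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = None
        self.v = [None, None]  # operand values (Vj, Vk)
        self.q = [None, None]  # producer tags (Qj, Qk); None means ready

OPS = {"ADDD": lambda x, y: x + y, "MULTD": lambda x, y: x * y}

def issue(station, op, dest, srcs, regs, reg_status):
    """Stage 1: claim a free station and rename the source operands."""
    if station.busy:
        return False  # structural hazard: the instruction cannot issue yet
    station.busy, station.op = True, op
    for n, r in enumerate(srcs):
        producer = reg_status.get(r)
        if producer is None:
            station.v[n], station.q[n] = regs[r], None  # value available now
        else:
            station.v[n], station.q[n] = None, producer  # wait for this tag
    reg_status[dest] = station.name  # later readers of dest wait on this station
    return True

def can_execute(station):
    """Stage 2 condition: both operands have arrived."""
    return station.busy and station.q == [None, None]

def write_result(station, stations, regs, reg_status):
    """Stage 3: broadcast (source tag, value) on the CDB and free the station."""
    value = OPS[station.op](*station.v)
    tag = station.name
    for s in stations:  # every waiting station snoops the bus
        for n in range(2):
            if s.q[n] == tag:
                s.v[n], s.q[n] = value, None
    for r, producer in list(reg_status.items()):
        if producer == tag:  # registers still expecting this tag take the value
            regs[r] = value
            del reg_status[r]
    station.busy = False
    return value

regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 0.0}
reg_status = {}
mult1, add1 = Station("Mult1"), Station("Add1")
stations = [mult1, add1]
issue(mult1, "MULTD", "F0", ("F2", "F4"), regs, reg_status)
issue(add1, "ADDD", "F8", ("F0", "F6"), regs, reg_status)  # F0 is pending on Mult1
add_blocked = not can_execute(add1)
write_result(mult1, stations, regs, reg_status)  # broadcast wakes up Add1
add_ready = can_execute(add1)
write_result(add1, stations, regs, reg_status)
```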

    25-44. Tomasulo Example: Cycles 0 through 16, then Cycles 55, 56, and 57 (one slide per cycle)

    45. Tomasulo Drawbacks
        - Complexity
            - Delays of 360/91, MIPS 10000, IBM 620?
        - Many associative stores (CDB) at high speed
        - Performance limited by Common Data Bus
            - Multiple CDBs => more FU logic for parallel assoc stores

    46. Tomasulo Loop Example
            Loop: LD    F0, 0(R1)
                  MULTD F4, F0, F2
                  SD    F4, 0(R1)
                  SUBI  R1, R1, #8
                  BNEZ  R1, Loop
        - Assume Multiply takes 4 clocks
        - Assume first load takes 8 clocks (cache miss?), second load takes 4 clocks (hit)
        - To be clear, will show clocks for SUBI, BNEZ
        - In reality, integer instructions are ahead

    47-68. Loop Example: Cycles 0 through 21 (one slide per cycle)

    69. Tomasulo Summary
        - Reservation stations: renaming to larger set of registers + buffering source operands
            - Prevents registers as bottleneck
            - Avoids WAR, WAW hazards of Scoreboard
            - Allows loop unrolling in HW
        - Not limited to basic blocks (integer unit gets ahead, beyond branches)
        - Helps cache misses as well
        - Lasting contributions
            - Dynamic scheduling
            - Register renaming
            - Load/store disambiguation
        - 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

    70. Branch Prediction Next
