1. ILP: Beyond Pipelining
Subbaiah Venkata
2. Objectives
Understand why we care about parallelism
Understand Instruction Level Parallelism (ILP)
Terms you should know:
Scalar (issue and execution)
Single In-Order Issue → In-Order execution
Single Out-Of-Order Issue → Out-Of-Order execution
SuperScalar (issue and execution)
Single In-Order issue → In-Order / Out-Of-Order execution
Multiple In-Order issue → In-Order / Out-Of-Order execution
Single Out-Of-Order issue → Out-Of-Order execution
Multiple Out-Of-Order issue → Out-Of-Order execution
Out-Of-Order Issue or Dynamic Scheduling (Hardware based)
Register renaming
Scoreboarding
Tomasulo Algorithm
. . .
Static Scheduling (Software based)
Code Movement
Loop Unrolling
VLIW processors
. . .
3. Ideas To Reduce Stalls
4. Parallelism
Basic concept: technology limits how fast we can execute a single instruction/operation
Do many things at the same time to make program execute faster
Can also do parallelism for throughput: increasing the number of tasks that complete in a given amount of time
5. Why is Parallelism Hard?
Dependencies
Can’t perform dependent computations at the same time
The dependencies in a program create an upper limit on what can be done in parallel
Communication
Processing units may need to share data
Synchronization
Communication required to enforce the ordering imposed by dependencies
Programming/Machine Languages
Sequential languages hide parallelism
6. Sequential Languages Hide Parallelism
Algorithms often have parallel structure
Example: Vector add
For correctness, need to impose a partial order on computations
Example: Finish one vector add before you do anything with the result
Sequential programming/machine languages impose a total order on computation
for (int i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
}
7. Exploiting Parallelism
What do we need?
Some way to do multiple computations at the same time
Some way to communicate data between the units that do the computation
Some way to synchronize the different units to enforce ordering when necessary
Some way to tell when operations can and cannot be done in parallel
Programmer
Compiler
Hardware
All of the parallel architectures we’ll see can be characterized by how they provide these items
8. What is ILP?
The characteristic of a program that certain instructions are independent, and can potentially be executed in parallel.
Any mechanism that creates, identifies, or exploits the independence of instructions, allowing them to be executed in parallel.
9. What is ILP? (cont.)
Why do we want/need ILP?
In a superscalar architecture?
What about a scalar architecture?
10. Instruction-Level Parallelism
Basic Idea: Take a sequential program and execute individual instructions in parallel
Execution pipelines (multiple) are the resource that does work in parallel
Register file is the communication mechanism between parallel instructions
Hardware or compiler can be responsible for detecting work that can be done in parallel
Instruction issue logic (e.g., a scoreboard) is generally the mechanism that enforces synchronization
11. Instruction-Level Parallel Processor
12. Where do we find ILP?
In basic blocks?
15-20% of (dynamic) instructions are branches in typical code, so a basic block averages only about 5-7 instructions
Across basic blocks?
how?
13. How do we expose ILP?
By moving instructions around.
How??
14. How do we expose ILP? (cont.)
Software
Hardware
15. Exposing ILP in Software
Instruction scheduling (changes ILP within a basic block)
loop unrolling (allows ILP across iterations by putting instructions from multiple iterations in the same basic block)
Others (trace scheduling, software pipelining)
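For loop unrolling in particular, here is a minimal sketch (assuming N is a multiple of 4) of the slide-6 vector add unrolled by four, so that one basic block now holds four independent adds the scheduler can overlap:

for (int i = 0; i < N; i += 4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];   /* no dependence on the add above */
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}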
16. Key Points
You can find, create, and exploit Instruction Level Parallelism in SW or HW
Loop level parallelism is usually easiest to see
Dependencies exist in a program and become hazards if the HW cannot resolve them
SW dependences and compiler sophistication determine whether the compiler can/should unroll loops
17. First HW ILP Technique: Out-of-Order Issue / Dynamic Scheduling
Problem: need to get stalled instructions out of the ID stage, so that subsequent instructions can begin execution.
Must separate detection of structural hazards from detection of data hazards
Must split ID operation into two:
Issue (decode, check for structural hazards)
Read operands (read operands only when there are NO data hazards)
i.e., must be able to issue even when a data hazard exists
Instructions issue in order, but proceed to EX out of order
18. HW Schemes: Instruction Parallelism
Why in HW at run time?
Works when real dependences can't be known at compile time
Compiler simpler
Code for one machine runs well on another
Key idea: Allow instructions behind stall to proceed
DIVD F0,F2,F4    ; long-latency divide produces F0
ADDD F10,F0,F8   ; RAW dependence on F0: stalls behind the divide
SUBD F12,F8,F14  ; independent of F0: can proceed past the stall
Enables out-of-order execution => out-of-order completion
19. HW Schemes: Instruction Parallelism
Out-of-order execution divides the ID stage:
1. Issue—decode instructions, check for structural hazards
2. Read operands—wait until no data hazards, then read operands
Scoreboards allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions
CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)
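A minimal C sketch of the split ID stage (names and types are illustrative, not from any real machine): issue checks only structural hazards, read-operands checks only data hazards:

typedef struct {
    int fu;                        /* functional unit this instruction needs */
    int src1_ready, src2_ready;    /* whether each source operand is available */
} Instr;

int can_issue(const Instr *in, const int fu_busy[]) {
    return !fu_busy[in->fu];                   /* 1. structural hazard check */
}

int can_read_operands(const Instr *in) {
    return in->src1_ready && in->src2_ready;   /* 2. data hazard check */
}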
20. Another Dynamic Algorithm: Tomasulo Algorithm
For the IBM 360/91, about 3 years after the CDC 6600 (1966)
Goal: High Performance without special compilers
Why study? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
21. Tomasulo Algorithm
Control & buffers distributed with Function Units (FU)
FU buffers called “reservation stations”; have pending operands
Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
Avoids WAR, WAW hazards
More reservation stations than registers, so can do optimizations compilers can’t
Results go to FUs from RSs, not through registers, over a Common Data Bus (CDB) that broadcasts results to all FUs
Loads and stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
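Register renaming can be sketched in C, with variables standing in for registers (an illustrative fragment, not from the slides): reusing a name creates WAR/WAW hazards, while a fresh name per result leaves only true (RAW) dependences:

/* before renaming: reuse of 'sum' creates WAW and WAR hazards */
sum = a / b;      /* slow divide writes sum */
r   = sum + c;    /* RAW: must wait for the divide */
sum = d - e;      /* WAW on sum, WAR with the read above */

/* after renaming: the subtract is now independent of the divide */
sum1 = a / b;
r    = sum1 + c;
sum2 = d - e;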
22. Tomasulo Organization
Resolve RAW memory conflict? (address in memory buffers)
Integer unit executes in parallel
23. Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk—Value of Source operands
Store buffers have V field, result to be stored
Qj, Qk—Reservation stations producing source registers (value to be written)
Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
Store buffers only have Qi for RS producing result
Busy—Indicates reservation station or FU is busy
Register result status—Indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
What you might have thought (the CDC 6600 scoreboard, for contrast):
1. 4 stages of instruction execution
2. Status of FU: normal things to keep track of (RAW, & structural for Busy):
Fi from the instruction format of the machine (Fi is the destination)
Add unit can Add or Sub
Rj, Rk - status of source registers (Yes means ready)
Qj, Qk - a No in Rj/Rk means waiting for an FU to write the result; Qj/Qk say which FU it is waiting for
3. Status of register result (WAW & WAR):
which FU is going to write into each register
Scoreboard on the 6600 = size of FU
6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17
FU latencies: Add 2, Mult 10, Div 40 clocks
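The reservation-station fields of slide 23 can be sketched as a C struct (a minimal sketch; field names follow the slide, the type itself is illustrative):

typedef struct {
    int    busy;      /* station holds a pending instruction */
    char   op;        /* operation to perform, e.g. '+' or '-' */
    double Vj, Vk;    /* source operand values, valid once Qj/Qk == 0 */
    int    Qj, Qk;    /* tags of the RSs producing the operands; 0 => ready */
} RS;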
24. Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard), control issues instruction & sends operands (renames registers).
2. Execution—operate on operands (EX)
When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units; mark reservation station available
Normal data bus: data + destination (“go to” bus)
Common data bus: data + source (“come from” bus)
64 bits of data + 4 bits of Functional Unit source address
A waiting unit captures the value when the source matches the Functional Unit it expects (the producer of its operand)
The CDB does the broadcast
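A minimal sketch of the write-result broadcast, using the RS struct sketched after slide 23 (the function and tag encoding are illustrative): every waiting station compares the result's source tag against its Qj/Qk and captures a match:

void cdb_broadcast(RS rs[], int n, int src_tag, double value) {
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].Qj == src_tag) { rs[i].Vj = value; rs[i].Qj = 0; }  /* operand j arrives */
        if (rs[i].Qk == src_tag) { rs[i].Vk = value; rs[i].Qk = 0; }  /* operand k arrives */
        /* a station may begin execution once Qj == 0 && Qk == 0 */
    }
}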
25.–44. Tomasulo Example: Cycles 0–16 and 55–57 (cycle-by-cycle reservation-station tables shown as figures on the slides)
45. Tomasulo Drawbacks
Complexity
Delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed
Performance limited by Common Data Bus
Multiple CDBs => more FU logic for parallel assoc stores
46. Tomasulo Loop Example
Loop: LD    F0, 0(R1)     ; load array element
      MULTD F4, F0, F2    ; multiply by the scalar in F2
      SD    F4, 0(R1)     ; store the result
      SUBI  R1, R1, #8    ; step the pointer back by one double (8 bytes)
      BNEZ  R1, Loop      ; repeat until R1 reaches 0
Assume Multiply takes 4 clocks
Assume first load takes 8 clocks (cache miss?), second load takes 4 clocks (hit)
For clarity, clock cycles will be shown for SUBI and BNEZ
In reality, the integer instructions run ahead
47.–68. Loop Example: Cycles 0–21 (cycle-by-cycle tables shown as figures on the slides)
69. Tomasulo Summary
Reservation stations: renaming to a larger set of registers + buffering of source operands
Prevents registers from becoming a bottleneck
Avoids WAR, WAW hazards of Scoreboard
Allows loop unrolling in HW
Not limited to basic blocks (the integer unit gets ahead, beyond branches)
Helps cache misses as well
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
70. Next: Branch Prediction