Dynamic Scheduling to Minimize Stalls: Tomasulo Algorithm
To: Dr. TenEyck
Submitted by Yanyu Liu
Teammate: Monaco
Pipelining vs. Out-of-Order Execution
• Pipelining
• Pipelining implies in-order execution: instructions are executed in program order. For example, consider the following code sequence:
• I1: DIVD R1, R2, R3
• I2: MULT R4, R1, R1
• I3: ADDD R1, R8, R9
• Instruction I1 blocks the execute stage because the division functional unit has a long latency. Instruction I2 must stall before it begins execution, both because the execute stage is occupied by I1 and because it needs the result of I1 (a data dependence). I3 depends on neither instruction, yet in an in-order pipeline it still has to wait behind them.
Out-of-Order Execution
Data dependencies and the different latencies of the functional units can cause additional delays that reduce performance. Out-of-order execution is used to eliminate some of these delays. Consider the execution of I1 to I3 on an out-of-order CPU. Instruction I3 can now enter the execution stage even before I1 does, since I3 does not depend on any result of the preceding instructions. It can even finish before I1, which causes a write-after-write (WAW) hazard on R1: if I1 writes R1 after I3, the register ends up holding the wrong value. Furthermore, I2 reads R1, which I1 writes, so there is a read-after-write (RAW) hazard between I1 and I2. Finally, if I3 writes R1 before I2 has read it, I2 sees the wrong value, which is a write-after-read (WAR) hazard.
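The three hazard types can be checked mechanically from the destination and source registers of two instructions. The following is a minimal Python sketch, not part of the original slides; the `Instr` type and the `hazards` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    name: str
    dest: str
    srcs: tuple

def hazards(earlier, later):
    """Return the hazard types between two instructions given in program order."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")   # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")   # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found

# The code sequence from the pipelining slide.
i1 = Instr("DIVD", "R1", ("R2", "R3"))
i2 = Instr("MULT", "R4", ("R1", "R1"))
i3 = Instr("ADDD", "R1", ("R8", "R9"))

print(hazards(i1, i2))  # {'RAW'}
print(hazards(i2, i3))  # {'WAR'}
print(hazards(i1, i3))  # {'WAW'}
```

Only the RAW hazard is a true data dependence; the other two arise because I1 and I3 happen to reuse the name R1.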
Static Scheduling vs. Dynamic Scheduling
• Compiler-based static scheduling separates dependent instructions to minimize hazards and stalls in the scheduled code. Dynamic scheduling uses a hardware mechanism to rearrange the instruction execution order at runtime to reduce stalls. Two points are worth noting:
• 1. It handles some cases where dependencies are unknown at compile time.
• 2. It cannot remove true data dependencies, but it tries to avoid stalling on them.
• Two dynamic scheduling methods are scoreboarding and Tomasulo's algorithm. Here, we discuss only Tomasulo's algorithm.
Tomasulo's Algorithm
• This scheme was invented by Robert Tomasulo and was first used in the IBM 360/91. It uses register renaming to eliminate output dependencies and anti-dependencies, i.e. WAW and WAR hazards. Output and anti-dependencies are only name dependencies; there is no actual data dependence. Tomasulo's algorithm implements register renaming through reservation stations, which are buffers that fetch and store instruction operands as soon as they are available.
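The effect of renaming on name dependencies can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the hardware mechanism itself: the tags `T1`, `T2`, ... and the `rename` function are hypothetical, whereas in Tomasulo's algorithm the tag is the name of the reservation station that will produce the value.

```python
def rename(program):
    """Give every destination a fresh tag and point sources at the latest producer."""
    latest = {}      # architectural register -> tag of its most recent producer
    renamed = []
    for n, (op, dest, srcs) in enumerate(program, start=1):
        # Sources are looked up before the destination is renamed.
        new_srcs = tuple(latest.get(r, r) for r in srcs)
        tag = f"T{n}"
        latest[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed

program = [
    ("DIVD", "R1", ("R2", "R3")),
    ("MULT", "R4", ("R1", "R1")),
    ("ADDD", "R1", ("R8", "R9")),
]
for instr in rename(program):
    print(instr)
# ('DIVD', 'T1', ('R2', 'R3'))
# ('MULT', 'T2', ('T1', 'T1'))
# ('ADDD', 'T3', ('R8', 'R9'))
```

After renaming, DIVD and ADDD no longer write the same name (no WAW), and ADDD no longer overwrites the name MULT reads (no WAR). The RAW dependence of MULT on DIVD remains, now expressed through the tag `T1`.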
Reservation Stations
• Each reservation station holds exactly one instruction and its operands, and has the following fields:
• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands. Store buffers have a single V field holding the value to be stored.
• Qj, Qk: reservation stations that will produce the source operands that are not yet available. An empty Q field means the value is already in the corresponding V field.
• Busy: indicates that the reservation station or functional unit is busy.
• Register result status: indicates which functional unit will write each register, if one exists.
• The load and store buffers each require a busy field.
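The fields listed above map naturally onto a small record type. The sketch below is one possible Python rendering, with hypothetical names (`ReservationStation`, `issue`, `regs`, `reg_status`) and made-up register values; it shows how an issuing instruction either captures an operand value in V or records the producing station in Q.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of the first operand, once known
    vk: Optional[float] = None   # value of the second operand, once known
    qj: Optional[str] = None     # station that will produce the first operand
    qk: Optional[str] = None     # station that will produce the second operand

def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Fill a free station. regs holds values; reg_status maps a register
    to the name of the station that will write it, if any."""
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = (None, reg_status[src_j]) if src_j in reg_status else (regs[src_j], None)
    rs.vk, rs.qk = (None, reg_status[src_k]) if src_k in reg_status else (regs[src_k], None)
    reg_status[dest] = rs.name   # the destination is now renamed to this station

# Illustrative state: F2 is still being loaded by Load2, F4 is available.
regs = {"F2": 0.0, "F4": 2.5}          # values are made up for the example
reg_status = {"F2": "Load2"}
mult1 = ReservationStation("Mult1")
issue(mult1, "MULTD", "F0", "F2", "F4", regs, reg_status)
print(mult1.qj, mult1.vk, reg_status["F0"])  # Load2 2.5 Mult1
```

The resulting state corresponds to MULTD F0, F2, F4 being issued while F2 is still pending: Qj names Load2, Vk holds the value of F4, and F0 is tagged with Mult1.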
Three Steps in Tomasulo's Algorithm
• 1. Issue: get the next instruction from the pending instruction queue.
• The instruction is issued to a free reservation station (no structural hazard).
• The selected RS is marked busy.
• Control sends the available instruction operands to the assigned RS (renaming registers).
• 2. Execution (EX): operate on the operands.
• When both operands are ready, start executing on the assigned FU.
• If any operand is not ready, watch the Common Data Bus (CDB) for the needed result.
• 3. Write result (WB): finish execution.
• Write the result on the Common Data Bus to all waiting units.
• Mark the reservation station as available.
• The Common Data Bus (CDB) is used for forwarding.
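The execute and write-result steps can be sketched as follows. This is a simplified Python illustration rather than a full simulator: the `Station` class, the `write_result` function, and the numeric values are assumptions made for the example, and CDB arbitration is ignored.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    name: str
    op: str
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None
    qk: Optional[str] = None
    busy: bool = True

    def ready(self):
        # Execution may start only when no operand is still pending.
        return self.busy and self.qj is None and self.qk is None

def write_result(tag, value, stations, regs, reg_status):
    """Broadcast (tag, value) on the CDB and free the producing station."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.name == tag:
            rs.busy = False
    # Only registers still tagged with this producer are updated;
    # a later writer may already have renamed the register.
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]

# Two stations waiting for the value loaded by Load2 (values are made up).
stations = [
    Station("Add1", "SUBD", vj=6.0, qk="Load2"),
    Station("Mult1", "MULTD", vk=4.0, qj="Load2"),
]
regs, reg_status = {}, {"F2": "Load2", "F8": "Add1", "F0": "Mult1"}
print([rs.ready() for rs in stations])   # [False, False]
write_result("Load2", 2.0, stations, regs, reg_status)
print([rs.ready() for rs in stations])   # [True, True]
print(regs, sorted(reg_status))          # {'F2': 2.0} ['F0', 'F8']
```

A single broadcast wakes up every waiting station at once, which is why the CDB acts as a forwarding path: consumers do not have to wait for the value to reach the register file.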
Example of Tomasulo Algorithm
The following code is used to illustrate the Tomasulo approach. The code is run on the DLX.

Functional unit                  # of RSs  EX latency
Integer                          1         0 cycles
Floating-point multiply/divide   2         10/40 cycles
Floating-point add               3         2 cycles

LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
Tomasulo Example Cycle 1

Instruction status
Instruction j    k   Issue  Execution complete  Write result
LD    F6    34+  R2  1
LD    F2    45+  R3
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
-     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (clock 1)
Register  F0     F2        F4  F6         F8       F10    F12 ... F30
FU        -      -         -   Load1
Cycle 2

Instruction status
Instruction j    k   Issue  Execution complete  Write result
LD    F6    34+  R2  1
LD    F2    45+  R3  2
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
-     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (clock 2)
Register  F0     F2        F4  F6         F8       F10    F12 ... F30
FU        -      Load2     -   Load1
Cycle 3

Instruction status
Instruction j    k   Issue  Execution complete  Write result
LD    F6    34+  R2  1      3
LD    F2    45+  R3  2
MULTD F0    F2   F4  3
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
-     Add3   No
0     Mult1  Yes   MULTD  -          R(F4)      Load2  -
0     Mult2  No

Register result status (clock 3)
Register  F0     F2        F4  F6         F8       F10    F12 ... F30
FU        Mult1  Load2     -   Load1
Cycle 4 • Load2 completing;
Cycle 7

Instruction status
Instruction j    k   Issue  Execution complete  Write result
LD    F6    34+  R2  1      3                   4
LD    F2    45+  R3  2      4                   5
MULTD F0    F2   F4  3
SUBD  F8    F6   F2  4      7
DIVD  F10   F0   F6  5
ADDD  F6    F8   F2  6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   Yes   SUBD   M(34+R2)   M(45+R3)   -      -
0     Add2   Yes   ADDD   -          M(45+R3)   Add1   -
-     Add3   No
8     Mult1  Yes   MULTD  M(45+R3)   R(F4)      -      -
0     Mult2  Yes   DIVD   -          M(34+R2)   Mult1  -

Register result status (clock 7)
Register  F0     F2        F4  F6         F8       F10    F12 ... F30
FU        Mult1  M(45+R3)  -   Add2       Add1     Mult2

• RS Add1 completing
Cycle 10

Instruction status
Instruction j    k   Issue  Execution complete  Write result
LD    F6    34+  R2  1      3                   4
LD    F2    45+  R3  2      4                   5
MULTD F0    F2   F4  3
SUBD  F8    F6   F2  4      7                   8
DIVD  F10   F0   F6  5
ADDD  F6    F8   F2  6      10

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   Yes   ADDD   M()–M()    M(45+R3)   -      -
0     Add3   No
5     Mult1  Yes   MULTD  M(45+R3)   R(F4)      -      -
0     Mult2  Yes   DIVD   -          M(34+R2)   Mult1  -

Register result status (clock 10)
Register  F0     F2        F4  F6         F8       F10    F12 ... F30
FU        Mult1  M(45+R3)  -   Add2       M()–M()  Mult2

• RS Add2 completing
Cycle 11

Instruction status
Instruction j    k   Issue  Execution complete  Write result
LD    F6    34+  R2  1      3                   4
LD    F2    45+  R3  2      4                   5
MULTD F0    F2   F4  3
SUBD  F8    F6   F2  4      7                   8
DIVD  F10   F0   F6  5
ADDD  F6    F8   F2  6      10                  11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op     Vj         Vk         Qj     Qk
0     Add1   No
0     Add2   No
0     Add3   No
4     Mult1  Yes   MULTD  M(45+R3)   R(F4)      -      -
0     Mult2  Yes   DIVD   -          M(34+R2)   Mult1  -

Register result status (clock 11)
Register  F0     F2        F4  F6         F8       F10    F12 ... F30
FU        Mult1  M(45+R3)  -   (M-M)+M()  M()–M()  Mult2

• Write back result of ADDD in this cycle
Cycle 15 • Mult1 completing
Cycle 16 Only Divide instruction remains
Cycle 57
Instruction block done
• Again we have:
• In-order issue
• Out-of-order execution and completion
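The cycle numbers in the walkthrough can be reproduced with a small timing model. The Python sketch below rests on several assumptions that are not stated on the slides: one instruction issues per cycle, a reservation station is always free, execution starts in the cycle after the last operand is broadcast, the result is written in the cycle after execution completes, and CDB conflicts are ignored. The 2-cycle load latency is inferred from the cycle tables, and the `schedule` function is a hypothetical helper.

```python
# ADDD/SUBD, MULTD and DIVD latencies come from the example's table;
# the load latency of 2 cycles is an assumption inferred from the cycle snapshots.
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

def schedule(program):
    """Return (op, dest, issue, exec_complete, write_result) for each instruction."""
    ready_at = {}   # register -> cycle in which its pending value appears on the CDB
    rows = []
    for issue, (op, dest, srcs) in enumerate(program, start=1):
        # Operands are captured at issue, so only earlier writers matter.
        operands_ready = max((ready_at.get(r, 0) for r in srcs), default=0)
        start = max(issue, operands_ready) + 1
        complete = start + LATENCY[op] - 1
        write = complete + 1
        rows.append((op, dest, issue, complete, write))
        ready_at[dest] = write   # later readers of dest wait for this producer
    return rows

program = [
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]
for row in schedule(program):
    print(row)
# ('LD', 'F6', 1, 3, 4)
# ('LD', 'F2', 2, 4, 5)
# ('MULTD', 'F0', 3, 15, 16)
# ('SUBD', 'F8', 4, 7, 8)
# ('DIVD', 'F10', 5, 56, 57)
# ('ADDD', 'F6', 6, 10, 11)
```

Under these assumptions the model agrees with the tables: ADDD writes F6 in cycle 11, long before DIVD finishes in cycle 57, yet DIVD still uses the old value of F6 because it captured that operand when it was issued.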
Tomasulo Approach Example: Reservation Stations and Register Tags.
Tomasulo Approach Example: Multiply and divide are the only instructions not finished.