Introduction to Dynamic Instruction Scheduling Using the Tomasulo Algorithm

Out of Order (OoO) Execution. EE457: Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm). By Gandhi Puvvada

References • EE557 Textbook • Prof. Dubois’ EE557 Classnotes • Prof. Annavaram’s slides • Prof. Patterson’s Lecture slides

Programs often have several small fragments of code, which can be executed in any order.

OoO (Out of Order) execution. Io = In order. "Execution" here means producing the results. "Completion" means committing the results (writing into the register file or memory). IoI (IoD) -> OoE -> IoC: In-order Issue/Dispatch, Out-of-order Execution, and finally In-order Completion/commitment.

IoC or OoC? IoI (IoD) -> OoE -> IoC: IoC (In-order Completion) is necessary to support exceptions (ex: page fault). Here we first present IoI (IoD) -> OoE -> OoC and then (at the end) IoI (IoD) -> OoE -> IoC.

OoC? But branches... OoC? Hope you are not executing instructions beyond a branch and committing them! Well, we dispatch a branch, suspend dispatching, and wait until the branch is resolved. Then we resume dispatching instructions beyond the branch, at either the fall-through area or the target area.

Instruction Scheduling (Re-ordering of instructions) • Basic block = a straight-line code sequence with no branches. • The compiler can perform static instruction scheduling. • The Tomasulo Algorithm lets us schedule instructions dynamically (in hardware). • Branch prediction and speculative execution beyond a branch (of course, with the ability to flush wrong-path instructions on a misprediction) will be covered later (and implemented on an FPGA in EE560).

Register renaming to allow later instructions to proceed. Original code: lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3); After renaming: lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $48, 60($3); add $48, $48, $48; sw $48, 60($3);

Static Scheduling (based on Prof. Dubois slide) • Strengths -- Hardware simplicity -- Compiler has a global view of the code (does not help the hardware much) • Weaknesses -- cannot be CPU-implementation specific -- cannot foresee dynamic events (cache misses, data-dependent delays, conditional branches) -- can only reschedule instructions in a basic block (basic block = a straight-line code sequence with no branches) -- cannot pre-compute memory addresses

Simple 5-stage pipeline (IF, ID, EX, M, WB): In-order execution. RAW dependency: solve it by forwarding; if not, by stalling. Dependent instructions are stalled in the ID stage. [Figure: 5-stage pipeline with IM and DM]

Simple 5-stage pipeline: Dependent instructions are stalled in the ID stage. [Figure: pipeline diagram with the instructions and, lw]

Simple 5-stage pipeline: Dependent instructions cannot be stalled in the EX stage. Why? [Figure: pipeline diagram with the instructions and, lw]

Provide multiple functional units (for simplicity, we avoid talking about the floating-point execution unit and the floating-point register file). Stall, after decoding, in queues. [Figure: IF and ID stages with IM, followed by queues and functional units (Integer, Multiply, Divide, Load/Store with DM), then WB]

Why do junior instructions carry their source register IDs into the EX stage? Well, they need to get help from Senior #1 or Senior #2 in the EX stage under the control of the FU. No more of that. There may be 40 seniors in front of you. So I, the dispatch unit, will tell you from which senior you need to get help for which source register. [Figure label: rs, rt (IDs) are carried into EX]

Tomasulo’s plan • OoO (Out of order) execution • Multiple functional units (say, Integer, DM, Multiplier, Divider) • Queues between the ID and EX stages (in place of the ID/EX register)

Out of order execution?! Problems all over??!! • For the time being, no branch prediction and no speculative execution beyond branches; just stall on a conditional branch • No support for precise exceptions for the time being. Even then, …

RAW, WAR, and WAW. RAW = Read After Write: lw $8, 40($2); add $9, $8, $7; WAR = Write After Read: add $9, $8, $6; lw $8, 40($2); WAW = Write After Write: add $9, $8, $6; lw $9, 40($2); WAW? How is it possible? Consider a printer or a FIFO. Why would anyone produce a result in $9 and then overwrite it with another result without ever using the first one?

WAW can easily occur! WAW? How is it possible? In out-of-order execution, instructions before a branch and instructions after the branch can coexist. For example, multiple iterations of this loop can coexist in the execution area. So, what? Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;

Say a company gives a standard bonus to most of its employees and a higher bonus to the managers. So you load the standard bonus into $3 from the stdbonus location in memory. Then you check whether this is a manager and, if so, load into $3 again (overwriting the earlier $3) the special bonus from the special location in memory. LW $3, stdbonus($0); BNE $1, $2, SKIP; LW $3, special($0);

RAW, WAR, and WAW (some terminology to remember). RAW = Read After Write: lw $8, 40($2); add $9, $8, $7; WAR = Write After Read: add $9, $8, $6; lw $8, 40($2); WAW = Write After Write: add $9, $8, $6; lw $9, 40($2); RAW: a true dependency. WAR: an anti-dependency. WAW: an output dependency. WAR and WAW are name dependences.

RAW, WAR, and WAW • In-order execution: We need to deal with RAW only. • Out-of-order execution: Now we need to deal with WAR and WAW besides RAW.
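The three definitions can be checked mechanically for a senior/junior pair of instructions. The following is a minimal Python sketch, not part of the original slides; the `Instr` record and the `hazards` function are hypothetical names introduced only for illustration, and each instruction is assumed to be described by the one register it writes (if any) and the registers it reads.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None (e.g. sw, bne)
    srcs: Tuple[str, ...]     # registers read


def hazards(senior: Instr, junior: Instr) -> Set[str]:
    """Return the register hazards the junior has with respect to the senior."""
    found = set()
    # RAW: the junior reads what the senior writes (true dependency).
    if senior.dest is not None and senior.dest in junior.srcs:
        found.add("RAW")
    # WAR: the junior writes what the senior reads (anti-dependency).
    if junior.dest is not None and junior.dest in senior.srcs:
        found.add("WAR")
    # WAW: both write the same register (output dependency).
    if senior.dest is not None and senior.dest == junior.dest:
        found.add("WAW")
    return found


# The three instruction pairs used on the slides.
raw = hazards(Instr("lw $8, 40($2)", "$8", ("$2",)),
              Instr("add $9, $8, $7", "$9", ("$8", "$7")))
war = hazards(Instr("add $9, $8, $6", "$9", ("$8", "$6")),
              Instr("lw $8, 40($2)", "$8", ("$2",)))
waw = hazards(Instr("add $9, $8, $6", "$9", ("$8", "$6")),
              Instr("lw $9, 40($2)", "$9", ("$2",)))
print(raw, war, waw)
```

A pair can carry more than one hazard at once, which is why the function returns a set rather than a single label.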

Limited Architectural Registers, More Physical Registers, Register Renaming. lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3); It is clear that the compiler is using $8 as a temporary register. If there is a delay in obtaining $2, the first part of the code cannot proceed. Unfortunately, the second part of the code cannot proceed either, because of the name dependency on $8.

If we had 64 registers instead of 32, then perhaps the compiler might have used $48 instead of $8 in the second part, and we could have executed the second part of the code before the first part! lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $48, 60($3); add $48, $48, $48; sw $48, 60($3); Reusing $8 in both parts is an example of name dependency.

Four different temporary registers can be used here, as shown: $8, $18, $28, and $48 (or, with code names, LION, TIGER, CAT, and ANT). With register numbers: lw $8, 40($2); add $18, $8, $8; sw $18, 40($2); lw $28, 60($3); add $48, $28, $28; sw $48, 60($3); With code names: lw LION, 40($2); add TIGER, LION, LION; sw TIGER, 40($2); lw CAT, 60($3); add ANT, CAT, CAT; sw ANT, 60($3);
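The LION/TIGER/CAT/ANT rewriting above can be reproduced by a simple rule: give every destination a fresh name, and make each source read the latest name of its register. The Python sketch below is only an illustration of that rule under simplifying assumptions; the `rename` function and the tuple encoding of instructions are hypothetical, and the pool of fresh names is assumed never to run out.

```python
def rename(program, fresh_names):
    """Give every destination a fresh name and rewrite later sources to match.

    program: list of (opcode, dest_or_None, [source registers]) tuples.
    fresh_names: iterable of unused names (a hypothetical, large-enough pool).
    """
    fresh = iter(fresh_names)
    current = {}      # architectural register -> its latest new name
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up BEFORE the destination is renamed, so that
        # "add $8, $8, $8" reads the old name and writes a new one.
        new_srcs = [current.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# The six-instruction example; sw has no destination register.
program = [
    ("lw",  "$8", ["$2"]),
    ("add", "$8", ["$8", "$8"]),
    ("sw",  None, ["$8", "$2"]),
    ("lw",  "$8", ["$3"]),
    ("add", "$8", ["$8", "$8"]),
    ("sw",  None, ["$8", "$3"]),
]
renamed = rename(program, ["LION", "TIGER", "CAT", "ANT"])
for instr in renamed:
    print(instr)
```

After renaming, the second half of the code no longer names anything written by the first half, so only the true (RAW) dependencies remain.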

Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled codes? Answer: Yes / No Why?

Answer: We cannot change the number of Architectural Registers. Register Renaming Through Tagging Registers: this solves the name dependency problems (WAR and WAW) while attending to the true dependency (RAW) through waiting in queues.

[Figure: the RST and the RF, each with entries $1 through $31, next to the instruction stream square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3); with dependent source and destination registers marked] RST = Register Status Table. RF = Register File.

[Figure: the RST and the RF, each with entries $1 through $31, next to the instruction stream square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3);]

Dispatch unit decodes and dispatches instructions. For destination operand, an instruction carries a TAG (but not the actual register name)! For source operands, an instruction carries either the values or TAGs of the operands (but not the actual register names)! square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3);

Register Renaming

TAGs for destinations or sources or for both? • A new tag is assigned to the destination register of the instruction being dispatched. • For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it has not been previously tagged) or the existing tag associated with the source register (if it has been tagged already) is conveyed to the instruction. • If a tag is conveyed for a source, then the instruction needs to wait for the original instruction with that destination tag to go on to the CDB and announce the value.
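The three bullets above can be sketched as a small Python model of the dispatch step. This is an illustrative sketch under stated assumptions, not the EE457 hardware: the `Dispatcher` class and its method names are hypothetical, tags come from a plain list, and a tag is assumed to be returned to the pool as soon as its value is announced on the CDB.

```python
class Dispatcher:
    """Toy model of dispatch-time renaming with an RST and an RF."""

    def __init__(self, reg_values, tags):
        self.rf = dict(reg_values)   # Register File: register -> value
        self.rst = {}                # Register Status Table: register -> pending tag
        self.free_tags = list(tags)  # assumed pool of unused tags

    def dispatch(self, dest, srcs):
        """Return (destination tag, operands) for one instruction."""
        operands = []
        for s in srcs:
            if s in self.rst:        # already tagged: convey the existing tag
                operands.append(("tag", self.rst[s]))
            else:                    # not tagged: convey the value itself
                operands.append(("value", self.rf[s]))
        dest_tag = None
        if dest is not None:         # sources were read first, then the new tag
            dest_tag = self.free_tags.pop(0)
            self.rst[dest] = dest_tag
        return dest_tag, operands

    def cdb_broadcast(self, tag, value):
        """A producer announces (tag, value) on the CDB."""
        for reg, pending in list(self.rst.items()):
            if pending == tag:       # only the latest writer updates the RF
                self.rf[reg] = value
                del self.rst[reg]
        self.free_tags.append(tag)   # assumption: tag is reclaimed here


d = Dispatcher({"$2": 100, "$3": 200, "$8": 0}, tags=[0, 1, 2, 3])
t_lw1, ops_lw1 = d.dispatch("$8", ["$2"])          # lw  $8, 40($2)
t_add1, ops_add1 = d.dispatch("$8", ["$8", "$8"])  # add $8, $8, $8
t_sw1, ops_sw1 = d.dispatch(None, ["$8", "$2"])    # sw  $8, 40($2)
t_lw2, ops_lw2 = d.dispatch("$8", ["$3"])          # lw  $8, 60($3)
d.cdb_broadcast(t_lw1, 7)   # first lw finishes; $8 in the RF must not change
print(ops_add1, ops_sw1, d.rf["$8"], d.rst["$8"])
```

In this sketch the first lw's result never reaches the RF, because by then the RST says $8 belongs to a later tag; waiting instructions would pick the value up from the CDB instead. That is how the WAW on $8 is resolved without stalling.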

Unique TAG • Like an SSN, we need a unique TAG • SSNs are reused. • Similarly, TAGs can be reused. • TAGs are similar to numbered TOKENs.

Take a number vs. Take a token. Take a number: helps to create a virtual queue. We do not need that here! Take a token: In State Bank of India, the cashier issues brass tokens to customers trying to draw money, as an identification (and not at all to put them in any virtual queue). Token numbers are in random order. The cashier verifies the signature in the records room, returns with the money, calls the token number, and issues the money. Tokens are reclaimed and reused.

TAGs (= Tokens) • How many tokens should the bank cashier have to start with? • What happens if the tokens run out? • Does he need to keep any order in holding and issuing tokens? • Does he have to collect the tokens back?

TAG FIFO (FIFOs are taught in EE560) • To issue and collect Tokens (TAGs), use a circular FIFO (First-In-First-Out) unit. While the FIFO order is not important here, a FIFO is easier to implement in hardware than a random-order pile. • Filled with (say) 64 tokens (in any order) initially on reset. • Tokens return out of order anyway. • Put returned tokens back in the FIFO and issue them again. [Figure: circular TAG FIFO with write pointer (wp) and read pointer (rp), shown full on reset, after 2 tokens issued, and after 1 token returned]
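The circular TAG FIFO can be modeled in a few lines of Python. This is a sketch under assumptions rather than the EE560 design: the class name `TagFifo` and its methods are hypothetical, and a `count` field is used to tell full from empty, where hardware would more likely compare pointers of extra width.

```python
class TagFifo:
    """Circular FIFO of tags: issue at the read pointer, return at the write pointer."""

    def __init__(self, depth=64):
        self.depth = depth
        self.slots = list(range(depth))  # filled with tokens 0..depth-1 on reset
        self.rp = 0                      # read pointer: next token to issue
        self.wp = 0                      # write pointer: slot for the next returned token
        self.count = depth               # number of tokens currently held (full on reset)

    def issue(self):
        """Hand out one tag, or None if the tokens have run out (dispatch must stall)."""
        if self.count == 0:
            return None
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.depth
        self.count -= 1
        return tag

    def give_back(self, tag):
        """Return a tag to the FIFO; tags may come back in any order."""
        if self.count == self.depth:
            raise ValueError("more tags returned than were issued")
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.depth
        self.count += 1


fifo = TagFifo()
a = fifo.issue()      # first token issued
b = fifo.issue()      # second token issued
fifo.give_back(b)     # the junior finishes first: its token returns first
print(a, b, fifo.rp, fifo.wp, fifo.count)
```

The pointer values after this sequence match the figure's progression: rp has advanced by two (2 tokens issued) and wp by one (1 token returned).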

[Figure: block diagram provided by Prof. Dubois, simplified for EE457, showing the TAG FIFO, Integer Divider, Integer Multiplier, and Issue Unit] CDB = Common Data Bus (compare it to a public announcement system)

Front-End & Back-End • IFQ Instruction Fetch Queue (a FIFO structure) • Dispatch unit (including RST, RF, Tag FIFO) • Load Store and other Issue Queues • Issue Unit • Functional units • CDB (Common Data Bus)

Bottleneck in the design • CDB = Common Data Bus. Do all instructions use the CDB? • sw? • j (jump)? • beq?

Load-store queue • Address calculation • Memory disambiguation. Mr. Bruin: Let me take a guess! You will now propose to have an MST (Memory Status Table), like the RST. And you will rename memory locations to solve the WAW and WAR problems among memory locations, right?!

MST (Memory Status Table)? No way! It is too big! We will just ask the junior to stall and wait to solve his WAR and WAW problems with his seniors. [Figure: an MST alongside Memory (locations 0, 1, ...), compared with the RST alongside the RF (registers $1 through $31)]

Address calculation for lw and sw: EE557 approach for address calculation; EE457/560 approach for address calculation. Dedicated adder, to compute the address, attached to the load-store queue.

Memory Disambiguation EE557

Memory Disambiguation. RAW: sw $2, 2000($0); lw $8, 2000($0); WAW: sw $2, 2000($0); sw $8, 2000($0); WAR: lw $2, 2000($0); sw $8, 2000($0);

Memory Disambiguation. RAW: sw $2, 2000($0); lw $8, 2000($0); This later lw can proceed only if there is no store ahead of it with the same address. WAW: sw $2, 2000($0); sw $8, 2000($0); This later sw can proceed only if there is no store ahead of it with the same address. WAR: lw $2, 2000($0); sw $8, 2000($0); This later sw can proceed only if there is no load ahead of it with the same address.
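The three rules above amount to one check per junior in the load-store queue. The Python sketch below is an illustration only; `MemOp` and `may_proceed` are hypothetical names, the queue is assumed to hold only not-yet-completed operations in program order, and every address is assumed to be already computed (a senior whose address is still unknown would also have to block the junior, which this sketch does not model).

```python
from typing import List, NamedTuple


class MemOp(NamedTuple):
    kind: str      # "lw" or "sw"
    address: int   # effective address, assumed already computed


def may_proceed(queue: List[MemOp], i: int) -> bool:
    """True if the operation at index i conflicts with no senior ahead of it."""
    me = queue[i]
    for senior in queue[:i]:
        if senior.address != me.address:
            continue                  # different location: no conflict
        if me.kind == "lw" and senior.kind == "sw":
            return False              # RAW: a store to this address is ahead
        if me.kind == "sw":
            return False              # WAW (store ahead) or WAR (load ahead)
    return True                       # includes a lw behind a lw to the same address


raw = [MemOp("sw", 2000), MemOp("lw", 2000)]
waw = [MemOp("sw", 2000), MemOp("sw", 2000)]
war = [MemOp("lw", 2000), MemOp("sw", 2000)]
free = [MemOp("sw", 2000), MemOp("lw", 2004)]
print([may_proceed(q, 1) for q in (raw, waw, war, free)])
```

Stalling the junior in this way is the alternative to renaming memory locations: no MST is needed, at the cost of waiting whenever the addresses match.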

Maintaining instructions in the order of arrival (issue order/program order) in a queue: Is it necessary, or is it desirable? In the case of the L-S Queue? In the case of the Integer and other queues (mult queue, div queue)?

