Embedded Computer Architectures
Hennessy & Patterson, Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation
Gerard Smit (Zilverling 4102), smit@cs.utwente.nl
André Kokkeler (Zilverling 4096), kokkeler@utwente.nl
Contents • Introduction • Hazards <= dependencies • Instruction Level Parallelism; Tomasulo’s approach • Branch prediction
Dependencies • True Data dependency • Name dependency • Antidependency • Output dependency • Control dependency
Data Dependency
[Figure: a chain Inst i -> Inst i+1 -> Inst i+2, where the result of each instruction is a data dependence feeding the next]
Two instructions are data dependent => risk of RAW hazard
Name Dependency
• Antidependence: Inst i reads a register or memory location that Inst j (later in program order) writes. Two instructions are antidependent => risk of WAR hazard
• Output dependence: Inst i writes a register or memory location that Inst j also writes. Two instructions are output dependent => risk of WAW hazard
Control Dependency • Branch condition determines whether instruction i is executed => i is control dependent on the branch
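The three dependence classes above map directly onto hazard types. A minimal Python sketch (the function and register names are illustrative, not from the lecture) that classifies the dependence between two instructions from their read/write register sets:

```python
# Illustrative sketch (not from the lecture): classify the dependence
# between instruction i and a later instruction j from their register sets.

def classify(writes_i, reads_i, writes_j, reads_j):
    """Return the hazard risks when j follows i in program order."""
    hazards = []
    if writes_i & reads_j:
        hazards.append("RAW")   # true data dependence: j reads i's result
    if reads_i & writes_j:
        hazards.append("WAR")   # antidependence: j overwrites what i reads
    if writes_i & writes_j:
        hazards.append("WAW")   # output dependence: both write the same name
    return hazards

# i: F0 = F2 + F4 ; j: F6 = F0 * F8 -> j reads what i writes
print(classify({"F0"}, {"F2", "F4"}, {"F6"}, {"F0", "F8"}))  # ['RAW']
```

Note that one instruction pair can exhibit several dependences at once; the hardware must guard against each resulting hazard separately.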
Instruction Level Parallelism
• Pipelining exploits ILP
• Other approach: dynamic scheduling => out-of-order execution
• Instruction Decode stage split into
  • Issue (decode, check for structural hazards)
  • Read Operands
Instruction Level Parallelism
• Scoreboard: an instruction may start only when
  • sufficient resources are available
  • there are no data dependencies
• Tomasulo's approach
  • minimize RAW hazards: execution waits until operands are available
  • register renaming to minimize WAW and WAR hazards
[Figure: the issue stage places instructions in a reservation station, which parks them while they wait for operands, before the read-operands stage]
Tomasulo's approach • Register Renaming
[Figure: four instructions (1: Read F0, 2: Write F0, 3: Read F0, 4: Write F0) plotted against time, each with an arrow connecting the register use to the single register F0 at the start of the instruction]
Problems arise if the arrows cross. Instr 2, 3, ... will be stalled; note that Instr 2 and 3 are stalled only because Instr 1 is not ready. If not for Instr 1, they could be executed earlier.
Tomasulo's approach • Register Renaming
[Figure: each waiting instruction now holds its own copy of the register (Instr 1.Register F0, Instr 3.Register F0)]
How is it arranged that the value is written into Instr 3.Register F0 and not into Instr 1.Register F0?
Tomasulo's approach • Register Renaming
[Figure: Instr 1 holds (Register F0, F0Source = Instr. k); Instr 3 holds (Register F0, F0Source = Instr. 2)]
The result of Instr 2 is labelled with 'Instr. 2'. The hardware checks whether an instruction is waiting for that result (by checking the F0Source fields of the waiting instructions) and places the result in the correct place.
Tomasulo's approach • Register Renaming
[Figure: each reservation-station entry for a waiting instruction holds an operation field (read or write), an F0Data field and an F0Source field. The operation and source fields are filled during Issue; the data fields are filled during execution.]
Tomasulo’s approach • Effects • Register Renaming: prevents WAW and WAR hazards • Execution starts when operands are available (datafields are filled): prevents RAW
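The renaming effect can be sketched in a few lines of Python. This is a toy illustration, not Tomasulo's full hardware algorithm: every write allocates a fresh tag, and later reads refer to the tag of the most recent writer, so the two writes to F0 no longer clash.

```python
# Minimal register-renaming sketch (illustrative only): each write gets a
# fresh tag, and later reads pick up the tag of the most recent writer,
# so WAW and WAR name clashes disappear.

def rename(instrs):
    table = {}      # architectural register -> tag of its last writer
    tag = 0
    out = []
    for dst, srcs in instrs:
        # Sources are renamed BEFORE the destination, so a read of F0
        # refers to the previous writer, never to this instruction.
        renamed_srcs = [table.get(s, s) for s in srcs]
        tag += 1
        new_dst = f"t{tag}"
        table[dst] = new_dst
        out.append((new_dst, renamed_srcs))
    return out

# The lecture's pattern: read F0, write F0, read F0, write F0.
prog = [("F2", ["F0"]), ("F0", ["F4"]), ("F6", ["F0"]), ("F0", ["F8"])]
print(rename(prog))
# [('t1', ['F0']), ('t2', ['F4']), ('t3', ['t2']), ('t4', ['F8'])]
```

The two writes to F0 receive distinct tags (t2 and t4), so instruction 4 can no longer overwrite the value instruction 3 is waiting for: the WAW and WAR hazards are gone by construction.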
Tomasulo's approach • Issue in more detail (issue is done sequentially)
Entry format: label, operation, data, source
[Figure: after issuing 'read F0', 'write F0', 'read F0', the reservation station holds read1 (read, data empty, source ?), write1 (write), read2 (read, data empty, source ?)]
This is the only information you have, so during issue you have to keep track of which instruction changed F0 last!
Tomasulo's approach • Issue in more detail
Keeping track of the register status during issue is done for every register.
[Figure: a register-status table maps F0 to its last writer: first write1, later write2. When read2 is issued, its source field is filled with write1, the instruction that wrote F0 last at that moment.]
Tomasulo's approach • Definitions for the MIPS
For each reservation station: Name, Busy, Operation, Vj, Vk, Qj, Qk, A
• Name = label
• Busy = in execution or not
• Operation = instruction
• V = operand value
• Q = operand source
• A = memory address (Load, Store)
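The station fields listed above can be captured as a small data structure. A hedged Python sketch: the field names follow the slide, while the `ready` helper is an illustrative addition, not part of the MIPS definitions.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of one reservation station (field names follow the slide;
# this shows the data layout, not a working scheduler).

@dataclass
class ReservationStation:
    name: str                    # station label (e.g. "Add1")
    busy: bool = False           # currently holding an instruction?
    op: Optional[str] = None     # operation to perform
    vj: Optional[float] = None   # Vj: value of first operand, if available
    vk: Optional[float] = None   # Vk: value of second operand, if available
    qj: Optional[str] = None     # Qj: station producing first operand, else None
    qk: Optional[str] = None     # Qk: station producing second operand, else None
    a: Optional[int] = None      # A: effective address (loads/stores)

    def ready(self) -> bool:
        # Execution may start once both source values are present,
        # i.e. no Q field still names a producing station.
        return self.busy and self.qj is None and self.qk is None

rs = ReservationStation("Add1", busy=True, op="ADD.D", vj=1.5, qk="Mult1")
print(rs.ready())  # False: still waiting on Mult1 via Qk
```

When Mult1 broadcasts its result, the hardware would fill `vk` and clear `qk`, at which point `ready()` becomes true: exactly the "execution starts when the data fields are filled" rule from the previous slide.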
Tomasulo's approach • Hardware view
[Figure: instructions flow from the instruction queue into the issue hardware, which performs register renaming and fills the reservation stations. The 'execution control hardware' checks which instructions have all operands and a free execution unit, and transports their operands to the execution units. Results, together with the identification of the instruction producing them, return via the Common Data Bus; the 'reservation fill hardware' puts each result in the correct place in the reservation stations.]
Branch prediction • Data Hazards => Tomasulo’s approach • Branch (control) hazards => Branch prediction • Goal: Resolve outcome of branch early => prevent stalls because of control hazards
Branch prediction; 1 history bit
Example:
  Outerloop: ...
             R = 10
  Innerloop: ...
             R = R - 1
             BNZ R, Innerloop
             ...
             Branch Outerloop
The history bit records whether the branch was taken the previous time:
• predict taken: fetch from 'Innerloop'
• predict not taken: fetch the next instruction
Actual outcome of the branch:
• taken: set the history bit to 'taken'
• not taken: set the history bit to 'not taken'
In this situation: correct prediction in 80% of branch evaluations
Branch prediction; 2 history bits
Same example as above. The two history bits form a four-state machine:
[Figure: two 'predict taken' states and two 'predict not taken' states; a taken outcome moves one step toward strongly-taken, a not-taken outcome one step toward strongly-not-taken, so two consecutive mispredictions are needed to flip the prediction]
In this application: correct prediction in 90% of branch evaluations
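A 2-bit saturating counter implements the four-state machine above. The sketch below (illustrative Python, not from the lecture) replays the inner loop: with R = 10 the branch is taken 9 times and falls through once per pass, and the single misprediction per pass yields the slide's 90% figure, where a 1-bit predictor would also miss the first iteration of the next pass (80%).

```python
# 2-bit saturating counter: states 0/1 predict 'not taken', 2/3 'taken'.
class TwoBitPredictor:
    def __init__(self, state=3):            # start at strongly 'taken'
        self.state = state

    def predict(self):
        return self.state >= 2              # True = predict taken

    def update(self, taken):
        # Saturate at 0 and 3; one step per actual outcome.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# Inner loop: branch taken 9 times, then falls through once; two passes.
p = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 2
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, "of", len(outcomes))  # 18 of 20 -> 90 %, as on the slide
```

The loop exit drops the counter only from 3 to 2, so the predictor still says 'taken' when the loop restarts: that single hysteresis step is the entire advantage over the 1-bit scheme.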
Branch prediction; Correlating branch predictors
  If (aa == 2) aa = 0;
  If (bb == 2) bb = 0;
  If (aa != bb) ...
The results of the first two branches are used in the prediction of the last one. Example: suppose aa == 2 and bb == 2; then the condition of the last 'if' is always false => if the previous two branches are not taken, the last branch is taken.
Branch prediction; Correlating branch predictors
Mechanism: suppose the results of the 3 previous branches are used to influence the prediction of the branch under consideration. There are 8 possible history sequences:
  br-3 br-2 br-1 => prediction for br
  NT   NT   NT   => T
  NT   NT   T    => NT
  ...
  T    T    T    => T
For the sequence (NT NT NT) the prediction is that the branch will be taken => fetch from the branch destination. Depending on the actual outcome of the branch under consideration, the prediction stored for the current sequence is updated:
• 1 bit of history per sequence: a (3,1) predictor
• 2 bits of history per sequence: a (3,2) predictor
  • represented by 2 bits: 2 combinations indicate 'predict taken', 2 indicate 'predict not taken'
  • updated by means of a state machine
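An (m,n) correlating predictor can be sketched by indexing per-branch n-bit counters with an m-bit global history register. Illustrative Python for a (2,2) variant; the class and parameter names are assumptions, not from the slides.

```python
# (2,2) correlating predictor sketch: 2 bits of global branch history
# select one of four 2-bit saturating counters per branch address.
class CorrelatingPredictor:
    def __init__(self, history_bits=2):
        self.history = 0                       # global history register
        self.mask = (1 << history_bits) - 1
        self.counters = {}                     # (pc, history) -> counter 0..3

    def predict(self, pc):
        # Unknown (pc, history) pairs default to 'strongly not taken'.
        return self.counters.get((pc, self.history), 0) >= 2

    def update(self, pc, taken):
        key = (pc, self.history)
        c = self.counters.get(key, 0)
        self.counters[key] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the actual outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask

p = CorrelatingPredictor()
p.update(0x40, True)     # outcome recorded under history 0b00
print(p.history)         # 1: the taken outcome shifted into the history
```

Because the counter is chosen by (branch, history) rather than by the branch alone, the aa/bb example above works: the last branch trains a separate counter for the history "previous two not taken", where it is always taken.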
Branch Target Buffer
Even with a good prediction, we do not know where to branch to until the target address has been computed late in the pipeline, and by then the next sequential instruction has already been fetched. Solutions:
• Delayed branch
• Branch target buffer
Branch Target Buffer
[Figure: the program counter is compared against a buffer of branch-instruction addresses, filled by the instruction-decode hardware; on a hit, the corresponding branch target is selected as the next fetch address from the instruction cache (memory)]
After the IF stage, the branch address is already in the PC.
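A branch target buffer is, at its core, an associative map from branch addresses to target addresses consulted during fetch. A minimal Python sketch (addresses and the 4-byte instruction size are illustrative assumptions):

```python
# Branch-target-buffer sketch: map branch-instruction addresses to their
# targets, so the fetch stage can redirect without decoding the branch.
class BTB:
    def __init__(self):
        self.entries = {}                 # branch PC -> target PC

    def lookup(self, pc):
        """Return (hit, next_pc): predicted target on a hit, otherwise
        fall through to the sequential successor (assumed 4-byte instrs)."""
        if pc in self.entries:
            return True, self.entries[pc]
        return False, pc + 4

    def record(self, pc, target):
        self.entries[pc] = target         # filled when a branch resolves taken

btb = BTB()
btb.record(0x100, 0x40)
print(btb.lookup(0x100))  # (True, 64): fetch is redirected right after IF
print(btb.lookup(0x104))  # (False, 264): miss, sequential fetch continues
```

A real BTB is a small tagged cache rather than an unbounded dictionary, but the lookup-during-fetch behaviour is the same: a hit puts the branch target in the PC immediately after the IF stage.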
Branch Folding
[Figure: as in the branch target buffer, the program counter is matched against the addresses of branch instructions, but the buffer stores the instruction at the branch target instead of the target address]
For unconditional branches this effectively removes the branch instruction (a penalty of -1 cycles).
Return Address Predictors
• Indirect branches: the branch address is known only at run time
• 80% of the time these are return instructions
• Solution: a small, fast stack; a procedure call pushes the return address, a procedure return (RET) pops it as the predicted target
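The small fast stack can be sketched as follows (illustrative Python; the depth and the overflow policy of discarding the oldest entry are assumptions, as real designs vary):

```python
# Return-address-stack sketch: calls push the fall-through address,
# returns pop it as the predicted target of the indirect branch.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack = []
        self.depth = depth                # small, fixed-size hardware stack

    def push(self, return_pc):            # on a procedure call
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overflow: oldest entry is lost
        self.stack.append(return_pc)

    def pop(self):                        # on a RET: the predicted target
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.push(0x200)                           # outer call
ras.push(0x300)                           # nested call
print(hex(ras.pop()))  # 0x300: the nested return is predicted correctly
```

As long as calls and returns nest properly and the stack does not overflow, every return is predicted exactly, which is why this beats a BTB for returns: the same RET instruction jumps to a different address on each invocation.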
Multiple Issue Processors
Goal: issue multiple instructions per clock cycle
• Superscalar: issue a varying number of instructions per clock
  • statically scheduled
  • dynamically scheduled
• VLIW: issue a fixed number of instructions per clock
  • statically scheduled
Hardware Based Speculation
• Multiple-issue processors => nearly 1 branch every clock cycle
• Dynamic scheduling + branch prediction: speculative fetch + issue
• Dynamic scheduling + branch speculation: speculative fetch + issue + execution
• KEY: do not perform updates that cannot be undone until you are sure the corresponding operation really should be executed
Hardware Based Speculation • Tomasulo:
[Figure: operation i, before a branch predicted not taken, writes its result to the register file as soon as it finishes: operations beyond that point are final. Operation k, issued beyond the branch, has its operand available, but its execution is postponed until it is clear whether the branch is taken. Depending on the branch outcome, the reservation stations are flushed or execution of operation k starts.]
Hardware Based Speculation • Speculation:
[Figure: results now go to a reorder buffer and are committed to the register file sequentially, in program order. Operation k beyond the branch predicted not taken may execute as soon as its operand is available, but its result moves from the reorder buffer to the register file only after the branch and all earlier operations have committed.]
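The commit discipline sketched above, complete out of order but retire in order, is what the reorder buffer enforces. A minimal Python illustration (list-based and unbounded, unlike real hardware; names are assumptions):

```python
# Reorder-buffer sketch: results complete out of order, but commit to
# the register file strictly in program (issue) order.
class ReorderBuffer:
    def __init__(self):
        self.entries = []                 # [dest, value_or_None] in issue order

    def issue(self, dest):
        self.entries.append([dest, None])
        return len(self.entries) - 1      # ROB index for this instruction

    def complete(self, idx, value):
        self.entries[idx][1] = value      # may happen in any order

    def commit(self, regfile):
        # Retire from the head only, and only while results are ready.
        while self.entries and self.entries[0][1] is not None:
            dest, value = self.entries.pop(0)
            regfile[dest] = value

rob, regs = ReorderBuffer(), {}
i0 = rob.issue("F0"); i1 = rob.issue("F2")
rob.complete(i1, 2.0)       # the younger instruction finishes first...
rob.commit(regs)            # ...but nothing commits: the head is not ready
rob.complete(i0, 1.0)
rob.commit(regs)
print(regs)  # {'F0': 1.0, 'F2': 2.0}
```

Squashing a mispredicted path amounts to discarding the uncommitted tail of the buffer: since none of those results ever reached the register file, the wrong-path updates are undone for free.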
Hardware Based Speculation • Some aspects
• Speculated instructions that cause a lot of work may turn out not to be needed => restrict the allowed actions in speculative mode
• The ILP available in a program is limited
• Realistic branch predictors are easier to implement but less effective than ideal prediction
Pentium Pro Implementation
[Figure: overview of the Pentium processor family]
Pentium Pro Implementation
• i486: CISC => problems with pipelining
• Two observations:
  • CISC instructions are translated into sequences of microinstructions
  • microinstructions are all of equal length
• Solution: pipeline the microinstructions
Pentium Pro Implementation
[Figure: micro-program control flow. The fetch-cycle routine ends with a jump to the indirect or execute routine; the indirect-cycle routine jumps to the execute routine; the interrupt-cycle routine jumps to the fetch routine; at the start of the execute cycle, a jump is made to the routine for the opcode (e.g. the AND or ADD routine), each of which ends with a jump to the fetch or interrupt routine.]
Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program.
Pentium Pro Implementation
• All RISC features are implemented on the execution of microinstructions instead of machine instructions
• Microinstruction-level pipeline with dynamically scheduled micro-operations:
  • fetch machine instruction (3 stages)
  • decode machine instruction into microinstructions (2 stages)
  • issue microinstructions (2 stages; register renaming and reorder-buffer allocation are performed here)
  • execute microinstructions (1 stage; floating-point units pipelined; execution takes between 1 and 32 cycles)
  • write back (3 stages)
  • commit (3 stages)
• Superscalar: up to 3 micro-operations issued per clock cycle
• 20 reservation stations and 5 functional units
• Reorder buffer (40 entries) and speculation used
Pentium Pro Implementation
Execution-unit latencies (in stages):
• Integer ALU: 1
• Integer load: 3
• Integer multiply: 4
• FP add: 3
• FP multiply: 5 (partially pipelined: multiplies can start every other cycle)
• FP divide: 32 (not pipelined)
Thread-Level Parallelism
• ILP: parallelism at the instruction level
• Thread-level parallelism: parallelism at a higher level
  • server applications
  • database queries
• A thread has all the information (instructions, data, PC, register state, etc.) needed to allow it to execute
  • on a separate processor, or
  • as a process on a single processor
Thread-Level Parallelism
• Potentially high efficiency
• Desktop applications:
  • costly to switch to applications reprogrammed at the thread level
  • thread-level parallelism is often hard to find
  => ILP continues to be the focus for desktop-oriented processors (for embedded processors, the situation is different)