ENGS 116 Lecture 6
Pipelining Difficulties and MIPS R4000
• Vincent H. Berk
• October 3, 2005
• Reading for today: 3.1, A.4-A.6, article: Yeager
• Reading for Wednesday: A.7, A.9-A.11, article: Smith & Pleszkun

Exception Characterization
• Synchronous vs. asynchronous
  – Synchronous: the event occurs at the same place every time the program runs with the same data
  – Asynchronous: caused by devices external to the CPU and memory, and also by hardware malfunctions
• User requested vs. user coerced
  – Requested: the user task asks for it
  – Coerced: a hardware event not under the control of the user program
• User maskable vs. user nonmaskable
  – Maskable: an event that can be disabled by the user task
• Within vs. between instructions
  – Within: occurs during the execution of an instruction; hard to handle; usually synchronous, since the instruction is the trigger
• Resume vs. terminate
  – Terminating: program execution always stops after the interrupt

Exception Handling
• Table of interrupt vector addresses
  – The OS stores the base address of this table in a CPU register
  – The addresses of the interrupt-handling routines are stored in the table
  – On interrupt int_num, the CPU jumps to the handler address stored at entry base + 4 * int_num
• Usually 16 or 32 interrupts
• Triggered by physical pins on the CPU, as well as by software calls
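A minimal sketch of the vector-table lookup, assuming 4-byte entries (the factor of 4 above), a 32-entry table, and memory modeled as a Python dict from byte address to 32-bit word. The names `vector_entry_address` and `dispatch`, the table base 0x100, and the handler addresses are all hypothetical.

```python
NUM_VECTORS = 32   # slide: usually 16 or 32 interrupts
ENTRY_BYTES = 4    # assumed: each entry holds one 32-bit handler address

def vector_entry_address(base, int_num):
    """Byte address of the table entry for interrupt number int_num."""
    if not 0 <= int_num < NUM_VECTORS:
        raise ValueError("no such interrupt")
    return base + ENTRY_BYTES * int_num

def dispatch(memory, base, int_num):
    """Read the handler address stored in the entry; the CPU would then jump there."""
    return memory[vector_entry_address(base, int_num)]

# Hypothetical OS setup: table at 0x100, handlers spaced 0x40 bytes apart from 0x2000.
TABLE_BASE = 0x100
memory = {vector_entry_address(TABLE_BASE, n): 0x2000 + 0x40 * n
          for n in range(NUM_VECTORS)}
```

For example, `dispatch(memory, TABLE_BASE, 5)` reads entry 0x114 and returns handler address 0x2140.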

Exception Examples (see also Figure A.27)
• I/O request: a device requests attention from the CPU
• System call or supervisor call from software
• Breakpoint or instruction tracing: software debugging, single-stepping
• Arithmetic: integer or FP overflow, underflow, division by zero
• Page fault: the requested virtual address was not present in main memory
• Misaligned address: bus error
• Memory protection: read/write/execute forbidden on the requested address
• Invalid opcode: the CPU was given a wrongly formatted instruction
• Hardware malfunction: CRC errors, component failure
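As a sketch, the five characterization axes can be encoded as a record and applied to a few of these examples. The per-exception entries follow the usual textbook classification as we understand it and should be checked against the text; `ExceptionKind`, `KINDS`, and `hardest` are our own names. Filtering for exceptions that occur within an instruction yet must let the program resume picks out the cases that make stopping and restarting hard.

```python
from typing import NamedTuple

class ExceptionKind(NamedTuple):
    synchronous: bool
    user_requested: bool
    user_maskable: bool
    within_instruction: bool
    resumable: bool

# Entries reflect our reading of the textbook classification (an assumption).
KINDS = {
    "I/O device request":    ExceptionKind(False, False, False, False, True),
    "System call":           ExceptionKind(True,  True,  False, False, True),
    "Integer overflow":      ExceptionKind(True,  False, True,  True,  True),
    "Page fault":            ExceptionKind(True,  False, False, True,  True),
    "Undefined instruction": ExceptionKind(True,  False, False, True,  False),
    "Hardware malfunction":  ExceptionKind(False, False, False, True,  False),
}

def hardest(kinds):
    """Names of exceptions that occur within an instruction but must resume."""
    return sorted(name for name, k in kinds.items()
                  if k.within_instruction and k.resumable)
```

Here `hardest(KINDS)` returns `["Integer overflow", "Page fault"]`.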

Pipelining Complications
• Exceptions: 5 instructions executing in a 5-stage pipeline
  – How to stop the pipeline?
  – How to restart the pipeline?
  – Who caused the exception?

Stage   Problem exceptions occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

Pipelining Complications
• Simultaneous exceptions in more than one pipeline stage, e.g.,
  – Load with a data page fault, detected in the MEM stage
  – Add (following the load) with an instruction page fault, detected in the IF stage
  – The add's fault happens BEFORE the load's fault, even though the add is the later instruction
• Solution #1
  – Interrupt status vector per instruction
  – Defer the check until the last stage; kill the state update if there is an exception
• Solution #2
  – Interrupt as soon as possible (ASAP)
  – Restart everything that is incomplete
• Another advantage of updating state late in the pipeline!
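The following sketch simulates Solution #1 for the load/add example, assuming a 5-stage pipeline without stalls in which instruction k enters IF in cycle k. Each instruction carries its own status list; a fault is only recorded when detected, and the trap is taken when a faulting instruction reaches WB. The function name and the `(name, faulting_stage)` encoding are our own.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def first_handled_exception(program):
    """program: list of (name, stage_that_faults or None) in program order.
    Assumes instruction k enters IF in cycle k and never stalls.
    Returns (faults in detection order, (cycle, name, stage) of the trap taken)."""
    status = [[] for _ in program]          # per-instruction interrupt status vector
    detected = []                           # (cycle, name) in the order faults occur in time
    last_cycle = len(program) + len(STAGES) - 2
    for cycle in range(last_cycle + 1):
        for k, (name, fault_stage) in enumerate(program):
            stage = cycle - k
            if 0 <= stage < len(STAGES) and STAGES[stage] == fault_stage:
                status[k].append(fault_stage)   # record it, but do not trap yet
                detected.append((cycle, name))
        k_wb = cycle - (len(STAGES) - 1)        # instruction now in the last stage
        if 0 <= k_wb < len(program) and status[k_wb]:
            # Trap here: earlier instructions have completed, later ones are killed.
            return detected, (cycle, program[k_wb][0], status[k_wb][0])
    return detected, None
```

For `[("LD", "MEM"), ("ADD", "IF")]` the add's fault is detected first (cycle 1, versus cycle 3 for the load), yet the trap is taken for the load in cycle 4, so the exception is handled in program order.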

Pipelining Complications
• Complex addressing modes and instructions
• Address modes: autoincrement changes a register during instruction execution
  – Interrupts? Need to restore the register state
  – Adds WAR and WAW hazards, since writes are no longer only in the last stage
• Memory-memory move instructions
  – Must be able to handle multiple page faults
  – Long-lived instructions: partial state save on interrupt
• Floating point: long execution time; out-of-order completion
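A minimal sketch of why autoincrement complicates restarts: the register update happens before the memory access is known to succeed, so on a fault the old value must be restored. The `ToyCPU` class, its `ld_autoinc` instruction, and the dict-based memory (a missing address models a page fault) are hypothetical; real hardware might instead delay the register write until the access succeeds.

```python
class PageFault(Exception):
    """Raised when the toy memory has no mapping for an address."""

class ToyCPU:
    """Hypothetical machine with one autoincrement load, LD rd, (rs)+.
    Memory is a dict from address to value; a missing address models a page fault."""

    def __init__(self, regs, memory):
        self.regs = dict(regs)
        self.memory = memory

    def ld_autoinc(self, rd, rs, size=8):
        addr = self.regs[rs]
        self.regs[rs] = addr + size          # side effect made mid-instruction
        try:
            self.regs[rd] = self.memory[addr]
        except KeyError:
            self.regs[rs] = addr             # undo it so the load can be restarted
            raise PageFault(hex(addr)) from None
```

After a fault, the OS can map the page and rerun the same instruction, which then sees the original register value.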

Stopping and Starting Execution
• The most difficult exceptions have two properties:
  – They occur within instructions
  – They must be restartable
• The pipeline must be shut down safely and its state saved for correct restarting
• Restarting is usually done by saving the PC of the instruction at which to restart
• Branches and delayed branches require special treatment
• Precise exceptions: the instructions before the faulting instruction are completed, while the faulting instruction and those after it are restarted
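When the faulting instruction sits in a branch delay slot, restarting from its PC alone would lose the branch outcome, so the handler has to save a second PC. A minimal sketch, assuming a single delay slot and 4-byte instructions; `saved_pcs`, `resume_trace`, and the addresses are hypothetical.

```python
INSTR_BYTES = 4  # assumed fixed-size 4-byte instructions

def saved_pcs(fault_pc, in_delay_slot=False, branch_next_pc=None):
    """PCs to save so execution can restart precisely at fault_pc.
    branch_next_pc: where the branch sends control after its delay slot
    (the target if taken, fault_pc + 4 if not)."""
    if in_delay_slot:
        return [fault_pc, branch_next_pc]
    return [fault_pc]

def resume_trace(pcs, count):
    """Fetch order after returning from the handler: the saved PCs first,
    then sequential fall-through (assumes no further branches)."""
    trace = list(pcs)
    while len(trace) < count:
        trace.append(trace[-1] + INSTR_BYTES)
    return trace[:count]
```

For a taken branch at 0x100 whose delay-slot instruction at 0x104 faults, with target 0x200, saving both PCs resumes at 0x104 and then 0x200; saving only 0x104 would wrongly fall through to 0x108.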

[Figure A.29: The MIPS pipeline with three additional unpipelined floating-point functional units. After IF and ID, an instruction executes in the integer unit (EX), the FP/integer multiplier, the FP adder, or the FP/integer divider, then proceeds to MEM and WB.]

[Figure A.31: A pipeline that supports multiple outstanding FP operations. After IF and ID, an instruction goes to the integer unit (EX), the FP/integer multiplier (M1-M7), the FP adder (A1-A4), or the FP/integer divider (DIV), then to MEM and WB.]

[Figure A.33: A typical FP code sequence showing the stalls arising from RAW hazards.]
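With the pipelined units of Figure A.31, a result is available only after all of a unit's EX stages, so a dependent instruction sees a latency of one less than the unit's depth. A sketch, assuming full forwarding, one instruction issued per cycle, and a consumer that needs its operand at the start of its own first EX stage; it covers only the pipelined units (not DIV), and the helper names are our own.

```python
# Number of EX stages per unit, read from Figure A.31 (the unpipelined divider is omitted).
EX_DEPTH = {"integer": 1, "fp_add": 4, "fp_mul": 7}

def latency(unit):
    """Cycles a dependent instruction must wait after the producer's first EX cycle,
    beyond the normal one-cycle spacing (assumes full forwarding)."""
    return EX_DEPTH[unit] - 1

def raw_stalls(producer_unit, distance):
    """RAW stall cycles for a consumer issued `distance` instructions after
    the producer (1 = the very next instruction)."""
    return max(0, latency(producer_unit) - (distance - 1))
```

A multiply followed immediately by a dependent instruction stalls 6 cycles; an FP add followed two instructions later by its consumer stalls 1.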

Case Study: MIPS R4000 (100 MHz to 200 MHz)
• 8-stage pipeline:
  – IF: first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access
  – IS: second half of the instruction cache access
  – RF: instruction decode and register fetch, hazard checking, and instruction cache hit detection
  – EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation
  – DF: data fetch, first half of the data cache access
  – DS: second half of the data cache access
  – TC: tag check, to determine whether the data cache access hit
  – WB: write back for loads and register-register operations
• With 8 stages, what is the impact on load delay? On branch delay? Why?
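One way to answer these questions is to count stage positions. A sketch, assuming loaded data can be forwarded at the end of DS (before TC confirms the hit), that a consumer needs it at the start of EX, and that a branch outcome known at the end of EX redirects IF in the next cycle; the classic 5-stage pipeline is included for comparison. The function names are hypothetical.

```python
R4000 = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]
CLASSIC = ["IF", "ID", "EX", "MEM", "WB"]

def load_delay(stages, data_ready, data_needed="EX"):
    """Cycles a dependent instruction must be held back after a load,
    assuming the value is forwarded at the end of `data_ready`."""
    return stages.index(data_ready) - stages.index(data_needed)

def branch_delay(stages, resolved="EX", fetch="IF"):
    """Instructions fetched after the branch before its outcome can redirect IF."""
    return stages.index(resolved) - stages.index(fetch)
```

This gives a load delay of 2 and a branch delay of 3 for the R4000, versus a load delay of 1 for the 5-stage pipeline: the longer pipeline puts the data cache access and branch resolution further from the stages that need their results.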

[Figure A.37: The eight-stage pipeline structure of the R4000 uses pipelined instruction and data cache accesses. Stages IF IS RF EX DF DS TC WB, with instruction memory, register file, ALU, and data memory accesses spread across them.]

Case Study: MIPS R4000
[Figure: pipeline timing diagrams for a load and a branch, each instruction entering IF one cycle after the previous one and flowing through IF IS RF EX DF DS TC WB]
• Two-cycle load latency
• Three-cycle branch latency (conditions are evaluated during the EX stage)
  – Delay slot plus two stalls
  – Branch-likely cancels the delay slot if the branch is not taken

MIPS R4000 Floating Point
• FP adder, FP multiplier, FP divider
• The last step of the FP multiplier/divider uses the FP adder hardware
• 8 kinds of stages in the FP units:

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U       -                 Unpack FP numbers
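Because the multiplier's final steps borrow adder stages, a multiply and an add issued close together can collide. A sketch of a reservation-table check: each operation is a list of the stage letters it occupies in each clock after issue. The sequences for `FP_ADD` (U, S+A, A+R, R+S) and `FP_MUL` (U, E+M, M, M, M, N, N+A, R) are not on this slide; they are our recollection of the textbook's R4000 figure and should be treated as assumptions. Any stage letter needed by both operations in the same clock counts as a structural hazard.

```python
# Stage letters used in each clock after issue (assumed sequences, see lead-in).
FP_ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]
FP_MUL = [{"U"}, {"E", "M"}, {"M"}, {"M"}, {"M"}, {"N"}, {"N", "A"}, {"R"}]

def conflicts(first, second, offset):
    """Stage letters both operations need in the same clock when `second`
    issues `offset` cycles after `first` (empty set = no structural hazard)."""
    clash = set()
    for t, stages in enumerate(second):
        cycle = t + offset
        if cycle < len(first):
            clash |= first[cycle] & stages
    return clash

def stalling_offsets(first, second, max_offset=8):
    """Issue offsets (1..max_offset) at which `second` would hit a structural hazard."""
    return [o for o in range(1, max_offset + 1) if conflicts(first, second, o)]
```

With these sequences, `stalling_offsets(FP_MUL, FP_ADD)` returns `[4, 5]`: an add issued 4 or 5 cycles after a multiply collides on the A and R stages.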

R4000 Performance
• CPI is not the ideal 1, because of:
  – Load stalls (1 or 2 clock cycles)
  – Branch stalls (2 cycles + unfilled delay slots)
  – FP result stalls: RAW data hazards (latency)
  – FP structural stalls: not enough FP hardware (parallelism)
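The load and branch stall counts can be sketched as two small functions, assuming the two-cycle load latency and the three-cycle branch latency (delay slot plus two stalls) from the R4000 slides. We additionally assume that only taken branches pay the two stall cycles; that detail is not stated on this slide. The function names are hypothetical.

```python
LOAD_DELAY = 2  # R4000 two-cycle load latency

def load_use_stalls(distance):
    """Stalls when the first user of a load is `distance` instructions later (1 = next)."""
    return max(0, LOAD_DELAY - (distance - 1))

def branch_stall_cycles(taken, delay_slot_filled):
    """Wasted cycles per branch: 2 stall cycles if taken (assumption: untaken
    branches skip them), plus 1 if the delay slot only holds a NOP."""
    return (2 if taken else 0) + (0 if delay_slot_filled else 1)
```

A load used by the next instruction costs 2 stalls and one used two instructions later costs 1, which is where "1 or 2 clock cycles" comes from.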

Instruction Level Parallelism
• Want to exploit parallelism among instruction sequences
• Branches interfere with parallelism: gcc has a branch every 5 or 6 instructions (on average)
• Need to find sequences of unrelated instructions that can be overlapped
• Often see loop-level parallelism:
    for (i = 0; i < 100; i = i + 1)
        x[i] = x[i] + y[i];
• Want to convert loop-level parallelism to instruction-level parallelism
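Loop unrolling is one standard way to make this conversion (our choice of technique; the slide does not name one). Each iteration reads and writes only x[i] and y[i], so four copies of the body can sit side by side without changing the result. This Python sketch only checks that the transformation preserves the result; the actual overlap would come from the compiler and the pipeline. The function names are hypothetical.

```python
def add_arrays(x, y):
    """Original loop: x[i] = x[i] + y[i]; no iteration uses another's result."""
    for i in range(len(x)):
        x[i] = x[i] + y[i]
    return x

def add_arrays_unrolled4(x, y):
    """Same loop unrolled by 4: the four statements in each body are
    independent, so they could be overlapped in the pipeline."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        x[i] = x[i] + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
        i += 4
    while i < n:            # cleanup when n is not a multiple of 4
        x[i] = x[i] + y[i]
        i += 1
    return x
```

Unrolling also cuts the loop overhead (counter update and branch) to one per four elements.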

FP Loop: Where are the Hazards?
    Loop: LD    F0, 0(R1)    ; F0 = vector element
          ADDD  F4, F0, F2   ; add scalar in F2
          SD    0(R1), F4    ; store result
          SUBI  R1, R1, #8   ; decrement pointer 8 bytes (DW)
          BNEZ  R1, Loop     ; branch if R1 != zero
          NOP                ; delayed branch slot

FP Loop Hazards
• Where are the stalls in the FP loop above?

FP Loop Showing Stalls
• How can the code be rewritten to minimize stalls?
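A rough stall counter for the FP loop, assuming single-issue in-order execution and latency values (stall cycles required between a producer and a dependent consumer) that are not on these slides: FP ALU to FP ALU 3, FP ALU to store 2, load to FP ALU 1, load to store 0, integer ALU to branch 1, taken from the textbook's usual assumptions for this example. The rescheduled version moves SUBI up and the store into the delay slot, changing its offset to 8(R1) because R1 has already been decremented; it is one possible schedule. `schedule` and the tuple encoding are our own.

```python
# Assumed latencies (not from the slides): stall cycles between producer and consumer.
LATENCY = {
    ("FP_ALU", "FP_ALU"): 3,
    ("FP_ALU", "STORE"): 2,
    ("LOAD", "FP_ALU"): 1,
    ("LOAD", "STORE"): 0,
    ("INT_ALU", "BRANCH"): 1,
}

def schedule(code):
    """Issue cycle (1-based) of each instruction, single issue and in order.
    code: list of (text, kind, dest, sources); unlisted pairs need no stall."""
    ready = {}      # register -> (issue cycle, kind) of its latest producer
    cycles = []
    cycle = 0
    for text, kind, dest, sources in code:
        cycle += 1                          # earliest slot: right after the previous one
        for reg in sources:
            if reg in ready:
                prod_cycle, prod_kind = ready[reg]
                gap = LATENCY.get((prod_kind, kind), 0)
                cycle = max(cycle, prod_cycle + gap + 1)   # wait out the latency
        cycles.append(cycle)
        if dest:
            ready[dest] = (cycle, kind)
    return cycles

ORIGINAL = [
    ("LD F0,0(R1)",   "LOAD",    "F0", ["R1"]),
    ("ADDD F4,F0,F2", "FP_ALU",  "F4", ["F0", "F2"]),
    ("SD 0(R1),F4",   "STORE",   None, ["R1", "F4"]),
    ("SUBI R1,R1,#8", "INT_ALU", "R1", ["R1"]),
    ("BNEZ R1,Loop",  "BRANCH",  None, ["R1"]),
    ("NOP",           "NOP",     None, []),              # delay slot
]

RESCHEDULED = [
    ("LD F0,0(R1)",   "LOAD",    "F0", ["R1"]),
    ("SUBI R1,R1,#8", "INT_ALU", "R1", ["R1"]),
    ("ADDD F4,F0,F2", "FP_ALU",  "F4", ["F0", "F2"]),
    ("BNEZ R1,Loop",  "BRANCH",  None, ["R1"]),
    ("SD 8(R1),F4",   "STORE",   None, ["R1", "F4"]),    # fills the delay slot
]
```

Under these assumptions the original loop takes 10 cycles per iteration (stalls before ADDD, SD, and BNEZ, plus the NOP), while the rescheduled loop takes 6, with one remaining stall before the store.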