Exceptions and Interrupts
• When an exception (overflow, page-fault) occurs there are several instructions in the pipeline.
• The offending instruction must be captured (in the EPC and Cause registers), earlier instructions must be completed, and later instructions must be flushed.
• When there is an interrupt (context-switch, I/O signal) the processor has more freedom. It will usually “drain” the pipeline and then transfer control to the interrupt handler.
• A new control line called EX.Flush zeros the control lines of the MEM and WB stages.
Computer Architecture- Superscalar Processors 1/18
Exception Flushing
• This style of exception and interrupt handling is called precise exceptions or precise interrupts: all instructions before the offending instruction are completed, and all instructions after the offending instruction are flushed.
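The rule above can be illustrated with a minimal Python sketch. The names `take_exception` and `PrecisePoint`, the example addresses, and the list-based pipeline model are assumptions made for illustration; real hardware does this with control signals, not lists. The offending instruction itself is only recorded here and is not counted as completed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PrecisePoint:
    completed: List[str]  # older instructions, allowed to finish
    flushed: List[str]    # younger instructions, squashed
    epc: int              # address of the offending instruction
    cause: str            # reason recorded in the Cause register


def take_exception(in_flight: List[Tuple[int, str]], offending: int, cause: str) -> PrecisePoint:
    """in_flight holds (pc, text) pairs, oldest instruction first (hypothetical model)."""
    pc, _ = in_flight[offending]
    return PrecisePoint(
        completed=[text for _, text in in_flight[:offending]],
        flushed=[text for _, text in in_flight[offending + 1:]],
        epc=pc,
        cause=cause,
    )


pipeline = [(0x40, "add"), (0x44, "lw"), (0x48, "sub"), (0x4C, "or")]
point = take_exception(pipeline, 1, "page-fault")
print(point)
```

Here the `lw` at address 0x44 faults, so `add` completes, `sub` and `or` are flushed, and 0x44 is what the EPC would hold.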
Goal: Reduce the Cycle Time
• Execution Time (ET) = (# instructions) * (average CPI) * (cycle time)
• Number of Instructions - Mainly dependent on the program itself and compiler technology.
• Average CPI - Depends on the architecture of the processor. In a pipelined datapath CPI -> 1.0
• Cycle Time - Is a physical trait of the chip and is independent of the processor's architecture. Right? Wrong!!!
Execution Time (ET) = (# instructions) * (average CPI) * (cycle time)
The execution time of a program is the product of the above 3 factors.
• Number of Instructions - Mainly dependent on the program itself and compiler technology. It also depends on the architecture: a RISC processor typically needs more instructions for the same program.
• Average CPI - Depends on the architecture of the processor. A pipelined datapath with forwarding, hazard detection, branch prediction, and an optimizing compiler that can reorder code can reduce the CPI to almost 1.0. A RISC processor executes more instructions but usually has a lower CPI.
• Cycle Time - The smaller the microprocessor is, the faster the clock can tick. This is a physical trait of the processor, so the cycle time is independent of the architecture of the chip. Right? Wrong!!!
Shorter Stages
• In our original multi-cycle datapath we assumed that accessing memory or using the ALU takes 2ns and that accessing the RF takes 1ns.
• Can this be shortened? Reading only 16 bits from memory, or adding two 16-bit numbers, should take less time.
• Problem: Our processor uses 32-bit words.
• Solution: Read from memory or perform an addition in 2 cycles.
• Problem: What’s the advantage?
• Solution: Pipeline the IF, EX, and MEM stages.
Super-Pipelining
• Original 5-stage pipeline
• New 8-stage pipeline: IF1 IF2 ID EX1 EX2 M1 M2 WB
• Splitting pipeline stages is called super-pipelining; each instruction is now executed in 8 stages.
• The CPI is the same: 1 instruction completes every cycle.
• However, the cycle time is shorter.
Super-Pipelined Execution Time
• A program executes 1,000,000 instructions.
• It is executed on a 5-stage pipeline with a cycle time of 2ns, and on an 8-stage super-pipeline with a cycle time of 1.3ns. Where does it execute faster? (Assume no hazards.)
• 5-stage: 1,000,000 * 1 * 2ns = 2,000,000ns = 2ms
• 8-stage: 1,000,000 * 1 * 1.3ns = 1,300,000ns = 1.3ms
• Speedup: 2/1.3 ≈ 1.54
• Problems with super-pipelining:
  • Splitting stages isn’t that simple.
  • More instructions are flushed on branch mispredictions.
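The comparison above is the execution-time formula applied twice. A small Python sketch of that arithmetic follows; the function name `execution_time_ns` is an assumption chosen for illustration.

```python
def execution_time_ns(instructions: int, cpi: float, cycle_time_ns: float) -> float:
    """ET = (# instructions) * (average CPI) * (cycle time), in nanoseconds."""
    return instructions * cpi * cycle_time_ns


five_stage = execution_time_ns(1_000_000, 1.0, 2.0)   # 5-stage pipeline, 2ns cycle
eight_stage = execution_time_ns(1_000_000, 1.0, 1.3)  # 8-stage super-pipeline, 1.3ns cycle
speedup = five_stage / eight_stage

print(f"5-stage: {five_stage / 1e6:.1f} ms")
print(f"8-stage: {eight_stage / 1e6:.1f} ms")
print(f"speedup: {speedup:.2f}")
```

The speedup comes out at roughly 1.54, and it holds only under the stated no-hazards assumption, since the CPI is kept at 1.0 for both pipelines.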
Goal: Lowering the CPI
• Is a CPI of 1.0 the lowest CPI achievable?
    addi $t2,$t0,4
    sw   $s0,0($t5)
    subi $t3,$t1,-4
    sw   $s1,0($t7)
• Every pair of instructions in the code above is independent.
• What if we had 2 pipelines? One that performs R-types and branches and one that performs loads and stores?
• A CPI of 0.5 is theoretically possible.
A Superscalar Pipeline
• A processor that can fetch more than one instruction in a cycle is called a superscalar or multiple-issue processor.
    addi $t2,$t0,4
    sw   $s0,0($t5)
    subi $t3,$t1,-4
    sw   $s1,0($t7)
• A scalar is a single data value.
• A vector is an array of data values.
• Supercomputers of the 70s and 80s (like the Cray X-MP) could perform operations on scalars or on vectors.
• The code: for(i=0;i<256;i++) c[i]=a[i]*b[i]; could be performed as a vector operation rather than as 256 separate scalar multiplications. A “regular” operation (c=a*b;) is called a scalar operation. The early Crays could issue 1 scalar or 1 vector operation per cycle.
• Processors that could perform more than 1 scalar operation per cycle were dubbed super-scalar processors.
• Multiple-issue means that multiple instructions are issued to the processor per cycle.
Superscalar Datapath
In order to implement the superscalar datapath we have to add the following capabilities:
• Another read port for the instruction memory, so that two instructions are fetched each cycle.
• Another 2 read ports and 1 write port for the register file.
• An additional ALU for effective address calculation.
• An additional sign-extender.
• Of course the real world isn’t a perfect world. Not all instructions will arrive paired off as expected.
• The IF stage must detect when two instructions can’t be issued in the same cycle and stall one of them. Or it can flip their order if the Load/Store instruction is the first of the pair.
• In the drawing on the previous page there is a mistake: 8, not 4, must be added to the current PC, because two instructions are fetched each cycle.
• We now have a new problem. In the single-issue pipeline an instruction using the value of a load must stall one cycle. With two instructions issued per cycle, the next two instructions can’t use the loaded value without stalling.
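The pairing check described above can be sketched in Python. This is a simplified model under stated assumptions: `Instr`, `LOAD_STORE`, and `can_dual_issue` are hypothetical names, only `lw` and `sw` count as memory operations, and the dependency test is conservative (the second instruction of a pair may neither read nor overwrite the register the first one writes).

```python
from typing import NamedTuple, Optional, Tuple

LOAD_STORE = {"lw", "sw"}  # assumed set of memory operations


class Instr(NamedTuple):
    op: str
    dest: Optional[str]     # register written, or None (e.g. sw, bne)
    srcs: Tuple[str, ...]   # registers read


def can_dual_issue(first: Instr, second: Instr) -> bool:
    """True if the two instructions may be issued in the same cycle."""
    # One pipeline handles ALU/branch instructions, the other loads/stores,
    # so a pair needs exactly one instruction of each kind.
    if (first.op in LOAD_STORE) == (second.op in LOAD_STORE):
        return False
    # Conservative dependency check on the register written by the first one.
    if first.dest is not None and (first.dest in second.srcs or first.dest == second.dest):
        return False
    return True


addi = Instr("addi", "$t2", ("$t0",))
sw = Instr("sw", None, ("$s0", "$t5"))
lw = Instr("lw", "$t0", ("$s1",))
addu = Instr("addu", "$t0", ("$t0", "$s2"))

print(can_dual_issue(addi, sw))    # independent ALU + store pair
print(can_dual_issue(lw, addu))    # addu needs the loaded value
print(can_dual_issue(addi, addu))  # two ALU instructions compete for one pipeline
```

Flipping the order of a pair, which the text allows when the load/store comes first, is handled here by the symmetric type check; only the dependency check depends on program order.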
Superscalar Code Scheduling
    Loop: lw   $t0,0($s1)        # $t0 = array element
          addu $t0,$t0,$s2       # $t0 = $t0 + $s2
          sw   $t0,0($s1)        # store result
          addi $s1,$s1,-4        # decrement pointer
          bne  $s1,$zero,Loop    # branch if $s1 != 0
• Reorder the code to avoid pipeline stalls in a superscalar processor:
          ALU or branch           Load or store       Cycle
    Loop:                         lw   $t0,0($s1)     1
          addi $s1,$s1,-4                             2
          addu $t0,$t0,$s2                            3
          bne  $s1,$zero,Loop     sw   $t0,4($s1)     4
• The store uses offset 4 because $s1 has already been decremented by the time it executes.
• 4 clock cycles to execute 5 instructions: CPI = 4/5 = 0.8, not 0.5
Loop Unrolling
• In loop unrolling, multiple copies of the loop body are made:
          ALU or branch           Load or store       Cycle
    Loop: addi $s1,$s1,-16        lw   $t0,0($s1)     1
                                  lw   $t1,12($s1)    2
          addu $t0,$t0,$s2        lw   $t2,8($s1)     3
          addu $t1,$t1,$s2        lw   $t3,4($s1)     4
          addu $t2,$t2,$s2        sw   $t0,16($s1)    5
          addu $t3,$t3,$s2        sw   $t1,12($s1)    6
                                  sw   $t2,8($s1)     7
          bne  $s1,$zero,Loop     sw   $t3,4($s1)     8
• The first load is issued together with the addi and still sees the old $s1, so its matching store uses offset 16.
• The overhead of the loop is reduced and more instructions can be scheduled in parallel.
• 14 instructions in 8 cycles: CPI = 8/14 = 0.57
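The CPI figures for the scheduled loop and the unrolled loop can be recomputed by counting issue slots. In the Python sketch below, the assignment of instructions to cycles is an assumption reconstructed from the 4-cycle and 8-cycle counts on the slides, and the labels are shorthand, not full instructions.

```python
# Each row is one clock cycle: (ALU/branch slot, load/store slot).
# None marks an empty slot.
scheduled_loop = [
    (None, "lw $t0"),
    ("addi $s1", None),
    ("addu $t0", None),
    ("bne", "sw $t0"),
]

unrolled_loop = [
    ("addi $s1", "lw $t0"),
    (None, "lw $t1"),
    ("addu $t0", "lw $t2"),
    ("addu $t1", "lw $t3"),
    ("addu $t2", "sw $t0"),
    ("addu $t3", "sw $t1"),
    (None, "sw $t2"),
    ("bne", "sw $t3"),
]


def cpi(schedule):
    """Cycles per instruction: number of rows divided by filled slots."""
    cycles = len(schedule)
    instructions = sum(slot is not None for row in schedule for slot in row)
    return cycles / instructions


print(f"scheduled loop: CPI = {cpi(scheduled_loop):.2f}")
print(f"unrolled loop:  CPI = {cpi(unrolled_loop):.2f}")
```

The empty slots are what keep both figures above the theoretical 0.5; unrolling leaves only 2 of 16 slots empty, compared with 3 of 8 in the scheduled loop.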
In Order Execution
    lw   $t0,20($s2)
    addu $t1,$t0,$t2
    sub  $s4,$s4,$t3
    slti $t5,$s4,20
• The above code is executed in program order; a cache miss will stall the pipeline until the data is read from memory.
• But the 3rd and 4th instructions are independent of the first two. Why wait? Let’s execute them.
• This is called Dynamic Pipelining or Out-Of-Order Execution (OOO Execution).
• In order to implement Out-Of-Order (OOO) execution the instructions that are stalled must “wait” somewhere. If not, they will be overwritten by the following instructions.
• The IBM/Motorola PowerPC family and the Intel Pentium Pro family of processors use Reservation Stations. Instructions are decoded and sent to the reservation station of the Functional Unit (FU) that will execute them:
  • ALU - executes most integer operations.
  • Integer multiplier/divider - some processors have separate units.
  • Memory Access Unit (MAU) - loads/stores from memory, usually has its own ALU for EA computation.
  • Branch Unit (BU) - checks conditions and computes the target PC.
  • FP add unit - performs FP additions, subtractions, negations …
  • FP multiplier/divider/sqrt - some processors have separate units.
• If the operands of the instruction are ready and the unit is free (no other instruction is using it) the instruction is executed.
• If not, it waits until the data and structural hazards are resolved.
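The readiness rule above (operands ready and unit free) can be sketched in Python. This is a simplified single-cycle model under stated assumptions: `Waiting` and `select_ready` are hypothetical names, one set of waiting instructions is scanned oldest first, and write-after-write and write-after-read hazards are ignored.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Waiting(NamedTuple):
    text: str
    unit: str                 # functional unit the instruction needs
    dest: Optional[str]       # register it will write
    srcs: Tuple[str, ...]     # registers it reads
    in_flight: bool = False   # already executing (e.g. a load waiting on a cache miss)


def select_ready(window: List[Waiting], free_units: Set[str]) -> List[str]:
    """Return the instructions that can start executing this cycle."""
    free = set(free_units)
    pending: Set[str] = set()  # registers not yet written by older instructions
    started = []
    for ins in window:  # oldest first
        operands_ready = not (set(ins.srcs) & pending)  # no data hazard
        unit_free = ins.unit in free                    # no structural hazard
        if not ins.in_flight and operands_ready and unit_free:
            started.append(ins.text)
            free.discard(ins.unit)
        if ins.dest is not None:
            pending.add(ins.dest)  # result is not available until a later cycle
    return started


window = [
    Waiting("lw $t0,20($s2)", "MAU", "$t0", ("$s2",), in_flight=True),  # cache miss
    Waiting("addu $t1,$t0,$t2", "ALU", "$t1", ("$t0", "$t2")),
    Waiting("sub $s4,$s4,$t3", "ALU", "$s4", ("$s4", "$t3")),
    Waiting("slti $t5,$s4,20", "ALU", "$t5", ("$s4",)),
]
print(select_ready(window, {"ALU"}))
```

With the load stuck on a cache miss, `addu` waits for `$t0`, `sub` starts ahead of it, and `slti` waits one more cycle for the new `$s4`.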
Reservation Stations
• IF
• ID
• Issue
• EX/MEM
• WB (Commit)
• Several instructions (2-4) are fetched every cycle.
• The MEM stage is now bypassed by most instructions.
OOO Execution, In-Order Commit
• Allowing instructions to commit out-of-order results in imprecise interrupts.
• An interrupted program resumes execution from the PC that was current when the interrupt occurred. Instructions that were in the pipeline are re-executed, so what?
• It is possible that a later instruction has already executed and written a value to the RF or memory. The program is now re-executed with updated register values. This can cause the program to execute incorrectly.
OOO Execution, In-Order Commit
    lw   $t0,0($s0)      # the word at 0($s0) holds 100
    addi $t1,$t1,1       # $t1 holds 50
    add  $t2,$t0,$t1     # $t2 = 151
• addi executes before the load. The load causes a page-fault, and an interrupt occurs. But addi has already committed (WB stage), so $t1 contains 51.
• When returning from the interrupt the code will be re-executed. $t1 is incremented again and $t2 will contain 152.
• This is called an imprecise interrupt. To avoid this, commit is done in program order. Instructions wait in the commit buffer until their “turn” arrives.
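The effect of the page-fault example above can be replayed in a few lines of Python. The sketch assumes that the loaded word is 100 and that the register incremented by addi starts at 50, which is the reading under which the slide's results of 151 and 152 work out; `final_t2` is a hypothetical name.

```python
def final_t2(out_of_order_commit: bool) -> int:
    """Value of $t2 after the lw page-faults once and the code is re-executed."""
    regs = {"$t0": 0, "$t1": 50, "$t2": 0}  # assumed initial register values
    memory_word = 100                        # assumed word at 0($s0)

    # First attempt: lw page-faults. With out-of-order commit the younger
    # addi has already written its result back before the fault is taken.
    if out_of_order_commit:
        regs["$t1"] += 1

    # After the handler returns, execution restarts at the lw.
    regs["$t0"] = memory_word                 # lw   $t0,0($s0)
    regs["$t1"] += 1                          # addi $t1,$t1,1
    regs["$t2"] = regs["$t0"] + regs["$t1"]   # add  $t2,$t0,$t1
    return regs["$t2"]


print("in-order commit:     $t2 =", final_t2(False))
print("out-of-order commit: $t2 =", final_t2(True))
```

With in-order commit the addi result from the first attempt is discarded, so the restart sees the original register state.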
Problems with In-Order Commit
• In-order commit might be slower than OOO commit, but it is safer and is used by all modern processors.
    div  $1,$2,$3
    add  $2,$3,$4
    sub  $5,$3,$7
    addi $3,$2,100
• The instructions following div don’t depend on $1.
• add can’t change $2 until div reads it. They can’t be issued in the same cycle (same problem for $3).
• Where does add save the result in $2?
• Where are temporary results saved?
OOO Execution, In-Order Commit
• The 32 registers of the ISA (Instruction Set Architecture) are called the logical registers. They are updated only in-order at the commit stage. In the case of an exception or interrupt they hold the state of the program.
• Instructions operate on physical registers. The MIPS R10000 has 64 physical registers. During decode each logical register is mapped to a physical one.
• Thus several instructions can hold the same value of a logical register without worrying that it will be updated before it is used.
• Instructions write to physical registers at the execute stage, thus the results are visible to instructions in the next cycle.
• At the commit stage the physical register is written into the logical register.
• Of course it is still impossible for instructions with dependencies to execute in the same cycle.
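The in-order update at the commit stage can be pictured as a queue that only releases finished instructions from its head. The Python sketch below is a minimal model; the name `commit_ready` and the dictionary fields are assumptions, and the copy from physical to logical register is reduced to a comment.

```python
from collections import deque


def commit_ready(buffer: deque) -> list:
    """Commit finished instructions from the head of the buffer, in program order."""
    committed = []
    while buffer and buffer[0]["done"]:
        entry = buffer.popleft()
        # Here the physical register would be written into the logical register.
        committed.append(entry["text"])
    return committed


buffer = deque([
    {"text": "lw $t0,0($s0)", "done": False},    # still waiting for memory
    {"text": "addi $t1,$t1,1", "done": True},    # finished early, out of order
    {"text": "add $t2,$t0,$t1", "done": False},
])

print(commit_ready(buffer))   # nothing commits: the oldest instruction is not done
buffer[0]["done"] = True      # the load completes
print(commit_ready(buffer))   # lw and addi commit, in program order
```

A finished instruction behind an unfinished one simply waits, which is what keeps the logical registers consistent if the older instruction raises an exception.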
Logical and Physical Registers
• The 32 registers of the ISA are called logical registers.
• They are mapped into physical registers.
    div  $p1,$p20,$p30     # cycle 1
    add  $p21,$p31,$p4     # cycle 1
    sub  $p5,$p32,$p7      # cycle 1 or 2
    addi $p33,$p21,100     # cycle 2
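One common textbook way to build such a mapping is to give every destination a fresh physical register while sources read the current map. The Python sketch below follows that scheme under stated assumptions: `rename` is a hypothetical name, logical register i starts out mapped to physical register i, free registers are taken in order from 32 to 63, and registers are never returned to the free list. The physical register numbers it produces therefore differ from the ones on the slide.

```python
def rename(program, num_logical=32, num_physical=64):
    """program: list of (op, dest, srcs) with logical register numbers."""
    mapping = {r: r for r in range(num_logical)}    # logical -> physical
    free = list(range(num_logical, num_physical))   # unused physical registers
    renamed = []
    for op, dest, srcs in program:
        # Sources read the mapping as it is before this instruction's write.
        phys_srcs = tuple(mapping[s] for s in srcs)
        # The destination gets a fresh physical register.
        phys_dest = free.pop(0)
        mapping[dest] = phys_dest
        renamed.append((op, phys_dest, phys_srcs))
    return renamed


program = [
    ("div", 1, (2, 3)),
    ("add", 2, (3, 4)),
    ("sub", 5, (3, 7)),
    ("addi", 3, (2,)),   # the immediate 100 is not a register and is omitted
]
for op, dest, srcs in rename(program):
    print(op, f"$p{dest}", ", ".join(f"$p{s}" for s in srcs))
```

After renaming, add writes a register that div never reads, so the two no longer conflict, while addi still reads the register that add writes, so the true dependency is preserved.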
PPC and Pentium III Diagram
OOO Execution, In-Order Commit
• The MIPS R10000 doesn’t use reservation stations. It uses structures called Instruction Queues.
• An instruction waits in the Integer, Address, or Floating Point queue until its dependencies are satisfied (the operand values are obtained) and an FU is available.
• The advantage is that there is more room in the queues for more instructions. The R10000 queues contain 16 instructions each. The PPC reservation stations contain 2 instructions each.
• Another advantage is that an instruction can be executed on any FU that becomes free (in the case of several FUs of the same type). An instruction in a reservation station is “stuck” to that unit.
• The disadvantage is that a queue may become a bottleneck and that the control is centralized. The control of reservation stations is distributed.
MIPS R10K