Advanced Topics in Pipelining
• Two methods to exploit instruction-level parallelism
• Superpipelining: longer (deeper) pipelines.
  • The ideal speedup is equal to the number of pipeline stages.
  • 8 or more pipeline stages are common in modern processors.
• Superscalar: multiple issue (CPI can be less than one)
  • Instruction execution rate exceeds the clock rate.
  • Example: a 6 GHz four-way multiple-issue processor has an ideal CPI = 0.25 (IPC = 4), a peak of 24 billion instructions/second.
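The superscalar arithmetic above can be checked in a few lines of Python. This is only a sketch of the ideal case, where every issue slot is filled on every cycle; the function name `instruction_rate` is ours and does not come from the slides.

```python
def instruction_rate(clock_hz: float, issue_width: int) -> tuple[float, float, float]:
    """Return (ideal CPI, ideal IPC, peak instructions per second)."""
    ipc = float(issue_width)   # best case: all issue slots filled every cycle
    cpi = 1.0 / ipc            # CPI is the reciprocal of IPC
    return cpi, ipc, clock_hz * ipc

cpi, ipc, rate = instruction_rate(6e9, 4)
print(f"CPI = {cpi}, IPC = {ipc}, peak rate = {rate:.3g} instructions/s")
```

Real programs fall short of this peak because dependencies and stalls leave some issue slots empty.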
Static Multiple Issue
• Two-issue 5-stage MIPS processor
  • Each issue packet: (R-type or branch) AND (load or store)
• VLIW concept
  • The compiler is responsible for removing dependencies between instruction pairs.
Example: Static Two-Issue MIPS 1/2
• Extra read and write ports are needed on the register file.
• Data dependencies result in more serious stalls.
  • In a superscalar pipeline, the next two instructions cannot use the result of a lw instruction without stalling.
• Example:
    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop
• Reorder the instructions to avoid as many pipeline stalls as possible.
Example: Static Two-Issue MIPS 2/2
    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop
• CPI = 4/5 = 0.8, IPC = 5/4 = 1.25 (once scheduled, the five instructions issue in four clock cycles)
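The CPI figure can be reproduced by writing out a two-issue schedule as a table and counting filled slots. The particular packing below (ALU or branch in the first slot, load or store in the second, with the store offset adjusted to 4 because addi now runs before sw) is an assumption made for illustration, and the helper name `cpi_ipc` is hypothetical.

```python
# One tuple per clock cycle: (ALU-or-branch slot, load/store slot).
# None marks an empty slot (a nop). This packing is an illustrative assumption.
schedule = [
    (None,                    "lw   $t0, 0($s1)"),
    ("addi $s1, $s1, -4",     None),
    ("addu $t0, $t0, $s2",    None),
    ("bne  $s1, $zero, Loop", "sw   $t0, 4($s1)"),
]

def cpi_ipc(schedule):
    """Return (CPI, IPC) for a schedule given as a list of issue packets."""
    cycles = len(schedule)
    instructions = sum(slot is not None for packet in schedule for slot in packet)
    return cycles / instructions, instructions / cycles

cpi, ipc = cpi_ipc(schedule)
print(f"CPI = {cpi:.2f}, IPC = {ipc:.2f}")
```

Three of the eight slots stay empty, which is why the loop reaches an IPC of only 1.25 rather than the ideal 2.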
Loop Unrolling 1/2
• Scheduled loop:
    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          sw   $t0, 4($s1)
          bne  $s1, $zero, Loop
• Loop unrolled four times:
    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -16
          lw   $t1, 12($s1)
          addu $t0, $t0, $s2
          lw   $t2, 8($s1)
          addu $t1, $t1, $s2
          lw   $t3, 4($s1)
          addu $t2, $t2, $s2
          sw   $t0, 16($s1)
          addu $t3, $t3, $s2
          sw   $t1, 12($s1)
          sw   $t2, 8($s1)
          sw   $t3, 4($s1)
          bne  $s1, $zero, Loop
• Register renaming: each unrolled copy uses its own temporary register ($t0 to $t3).
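The effect of unrolling can be sketched at a higher level in Python. This is an analogy, not a translation of the MIPS code: the loop bounds are simplified, the function names are ours, and the unrolled version assumes the array length is a multiple of four.

```python
def add_scalar(a, s):
    """Original loop: one element per iteration, walking from the end down."""
    a = list(a)
    i = len(a) - 1
    while i >= 0:
        a[i] += s
        i -= 1
    return a

def add_scalar_unrolled4(a, s):
    """Unrolled by four. Assumes len(a) is a multiple of 4."""
    a = list(a)
    i = len(a) - 1
    while i >= 0:
        # Four independent temporaries play the role of $t0..$t3 (register renaming).
        t0, t1, t2, t3 = a[i], a[i - 1], a[i - 2], a[i - 3]
        t0 += s
        t1 += s
        t2 += s
        t3 += s
        a[i], a[i - 1], a[i - 2], a[i - 3] = t0, t1, t2, t3
        i -= 4          # one index update and one branch per four elements
    return a

data = list(range(8))
print(add_scalar_unrolled4(data, 3) == add_scalar(data, 3))
```

Because the four temporaries are independent, a scheduler is free to interleave the loads, adds, and stores, which is what the unrolled MIPS loop does.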
Loop Unrolling 2/2
• CPI = 8/14 ≈ 0.57, IPC = 14/8 = 1.75 (14 instructions issue in 8 clock cycles)
Speculation
• Guess an outcome, for example that of a branch, and execute instructions based on the guess.
• Can be done by the compiler or by the hardware.
  • The compiler can use speculation to reorder instructions.
• A recovery mechanism fixes things up when the speculation turns out to be wrong.
• Results obtained from speculative execution are kept in temporary buffers until they are no longer speculative:
  • committed when the speculation was correct,
  • discarded otherwise.
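The commit-or-discard behavior described above can be sketched with a small buffer in Python. This is a toy model under our own assumptions (registers as a dictionary, a single outstanding guess); the class name `SpeculativeBuffer` is hypothetical.

```python
class SpeculativeBuffer:
    """Holds register writes made under a guess until the guess is resolved."""

    def __init__(self, registers):
        self.registers = registers   # architectural (visible) state
        self.pending = {}            # speculative results, not yet visible

    def write(self, reg, value):
        self.pending[reg] = value    # buffered, the register file is untouched

    def resolve(self, guess_was_correct):
        if guess_was_correct:
            self.registers.update(self.pending)   # commit
        self.pending.clear()                      # otherwise the results are discarded

regs = {"t0": 1, "t1": 2}
buf = SpeculativeBuffer(regs)

buf.write("t0", 99)
buf.resolve(guess_was_correct=False)   # wrong guess: t0 keeps its old value

buf.write("t1", 7)
buf.resolve(guess_was_correct=True)    # right guess: t1 is updated
print(regs)
```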
IA-64 Architecture
• RISC-style instruction set, almost like a 64-bit MIPS
• Differences:
  • IA-64 has more registers (128 integer, 128 floating-point, 8 special registers for branches).
  • IA-64 places instructions into groups or bundles (VLIW).
  • IA-64 includes special capabilities for speculation and branch elimination.
• Predication: branch elimination
  • Loop unrolling does not help with if-then-else statements.
Predication in IA-64
• 64 1-bit predicate registers
• Example:
          CMP Ra, Rb
          JNE else
          MOV Ra, 0
          JMP end
    else: MOV Ra, Rb
    end:  whatever
• Code with predicates:
          CMPEQ Ra, Rb, P1/P2
    [P1]  MOV Ra, 0
    [P2]  MOV Ra, Rb
• If the predicate is not true, the instruction becomes a nop.
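The equivalence between the branching code and the predicated code can be modeled in Python. This is a behavioral sketch only: the helper `mov_if` is a hypothetical stand-in for a predicated MOV, and nothing here reflects real IA-64 encoding or timing.

```python
def branchy(ra, rb):
    """The if-then-else version, using control flow."""
    if ra == rb:
        ra = 0
    else:
        ra = rb
    return ra

def mov_if(pred, old, new):
    """A predicated MOV: writes `new` when pred is true, otherwise acts as a nop."""
    return new if pred else old

def predicated(ra, rb):
    """The predicated version: both MOVs are issued, and only one takes effect."""
    p1 = (ra == rb)          # CMPEQ Ra, Rb, P1/P2 sets two complementary predicates
    p2 = not p1
    ra = mov_if(p1, ra, 0)   # [P1] MOV Ra, 0
    ra = mov_if(p2, ra, rb)  # [P2] MOV Ra, Rb
    return ra

print(all(predicated(a, b) == branchy(a, b) for a in range(4) for b in range(4)))
```

Both predicates are computed before either MOV runs, so the instruction sequence contains no branch.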
Predicates in ARM
• Almost all instructions can be conditionally executed.
  • Every instruction reserves a bit-field for the predicate, specifying whether that instruction should have an effect.
• Thirteen different predicates are available, each depending in some way on the four flags Carry, Overflow, Zero, and Negative.
• ARM's 16-bit Thumb instruction set has no branch predication, in order to save encoding space.
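How a predicate depends on the four flags can be sketched in Python. Only a subset of the ARM condition codes is shown, and the helper names `cmp_flags`, `CONDITIONS`, and `holds` are ours. The flags are modeled as the result of a 32-bit compare, which is an assumption about how they were set.

```python
MASK = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags produced by a 32-bit compare (a - b), as a dict with keys N, Z, C, V."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "N": bool(result >> 31),                    # result is negative when read as signed
        "Z": result == 0,                           # operands are equal
        "C": a >= b,                                # no borrow in the unsigned subtraction
        "V": bool(((a ^ b) & (a ^ result)) >> 31),  # signed overflow
    }

# A subset of the condition codes, each a function of the flags.
CONDITIONS = {
    "EQ": lambda f: f["Z"],
    "NE": lambda f: not f["Z"],
    "HI": lambda f: f["C"] and not f["Z"],               # unsigned >
    "LS": lambda f: (not f["C"]) or f["Z"],              # unsigned <=
    "GE": lambda f: f["N"] == f["V"],                    # signed >=
    "LT": lambda f: f["N"] != f["V"],                    # signed <
    "GT": lambda f: (not f["Z"]) and f["N"] == f["V"],   # signed >
    "LE": lambda f: f["Z"] or f["N"] != f["V"],          # signed <=
}

def holds(cond, flags):
    """True if an instruction predicated on `cond` would take effect."""
    return CONDITIONS[cond](flags)

flags = cmp_flags(3, 5)
print([name for name in CONDITIONS if holds(name, flags)])
```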
IA-64 Characteristics
• Itanium: 3.2 GFLOPS
• Itanium: 6.67 GFLOPS
Dynamic Pipeline Scheduling
• Dynamic pipeline scheduling is a hardware mechanism to avoid pipeline stalls.
• Example:
    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20
• Even though addu has to wait for lw to complete, the following two instructions (sub and slti) can be started.
• Out-of-order execution => more complicated pipeline control.
• Dynamic pipeline scheduling goes past stalls to find later instructions to execute while waiting for the stall to be resolved.
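The example above can be traced with a small dataflow model in Python. It is a sketch under simplifying assumptions that are ours, not the slides': latencies are made up (3 cycles for the load, 1 for ALU operations), functional units are unlimited, and only read-after-write dependences are tracked.

```python
# (text, destination register, source registers, latency in cycles)
# Latencies are illustrative assumptions, not figures from the slides.
program = [
    ("lw   $t0, 20($s2)",  "$t0", ["$s2"],        3),
    ("addu $t1, $t0, $t2", "$t1", ["$t0", "$t2"], 1),
    ("sub  $s4, $s4, $t3", "$s4", ["$s4", "$t3"], 1),
    ("slti $t5, $s4, 20",  "$t5", ["$s4"],        1),
]

def start_cycles(program):
    """Earliest start cycle of each instruction, limited only by RAW dependences."""
    ready = {}    # register -> cycle at which its newest value becomes available
    starts = []
    for _text, dest, sources, latency in program:
        start = max((ready.get(r, 0) for r in sources), default=0)
        starts.append(start)
        ready[dest] = start + latency
    return starts

for (text, *_), start in zip(program, start_cycles(program)):
    print(f"cycle {start}: {text}")
```

In this model sub and slti start before addu, which is the out-of-order behavior the slide describes.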
Dynamic Pipeline Scheduling
[Figure: instruction fetch and decode unit (in-order issue) → four reservation stations → functional units: integer, integer, floating point, load/store (out-of-order execution) → commit unit with reorder buffers (in-order commit)]
Dynamic Pipeline Scheduling
• 5-10 functional units, each with reservation stations (RS) that hold the operands and the operation.
• When the buffer contains all the operands and the unit is ready to execute, the result is calculated.
  • If necessary, the result is sent to other RS.
• The commit unit decides when it is safe to put the result into the register file or into memory (committing).
• Completion methods: in-order completion and out-of-order completion.
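In-order commit after out-of-order execution can be sketched with a toy reorder buffer in Python. The class and method names are ours, and the model ignores values, exceptions, and speculation. It shows only that results retire in program order.

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: instructions finish in any order but commit in program order."""

    def __init__(self):
        self.entries = deque()   # program order, oldest first
        self.committed = []      # tags in the order they were committed

    def issue(self, tag):
        self.entries.append({"tag": tag, "done": False})

    def complete(self, tag):
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True
        self._commit()

    def _commit(self):
        # Retire only from the head, so results become visible in program order.
        while self.entries and self.entries[0]["done"]:
            self.committed.append(self.entries.popleft()["tag"])

rob = ReorderBuffer()
for tag in ["lw", "addu", "sub", "slti"]:
    rob.issue(tag)
for tag in ["sub", "slti", "lw", "addu"]:   # out-of-order completion
    rob.complete(tag)
print(rob.committed)
```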
Pentium 4
• After being fetched, IA-32 instructions are translated into microoperations.
• Microoperations
  • dynamically scheduled
  • speculative pipelining
  • issue rate: three microoperations per cycle
• Deep pipelining: 20 stages
• 7 functional units
• Support for 126 outstanding operations
• Trace cache
Pentium 4 Datapath
[Figure: instruction prefetch and decode with branch prediction; trace cache; microoperation queue; dispatch and register renaming; register file; memory operation queue; integer and floating-point operation queue; functional units (complex instruction, two integer, floating point, load, store); commit unit; data cache]
Datapath Comparison 1/2
[Figure: clock rate (slower to faster) versus IPC (slower to faster) for the single-cycle, multi-cycle, pipelined, deeply pipelined, multiple-issue pipelined, and multiple-issue deeply pipelined datapaths]
Datapath Comparison 2/2
[Figure: hardware (shared to specialized) versus latency in instructions (1 to several) for the single-cycle, multi-cycle, pipelined, deeply pipelined, multiple-issue pipelined, and multiple-issue deeply pipelined datapaths]