Pipelined Design: Tutorial 9
Latency and Throughput
• Latency: total time to perform a single operation from start to end.
• Throughput: number of operations completed per unit of time, usually given in GOPS (giga-operations per second).
• The clock cycle is always bounded by the slowest operation (stage).
• Picosecond (ps): 10^-12 seconds
• Nanosecond (ns): 10^-9 seconds
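Latency and Throughput
Combinational blocks and their delays: A = 80ps, B = 30ps, C = 60ps, D = 50ps, E = 70ps, F = 10ps, followed by a register (REG) with a delay of 20ps.
• How to maximize throughput using one additional register?
  • Put it between C and D
• How to maximize throughput using 2 additional registers?
  • AB, CD, EF
• How to maximize throughput using 3 additional registers?
  • A, BC, D, EF
• What is the cycle time, latency and throughput in that case?
  • Cycle time: 110ps (slowest stage BC = 90ps, plus 20ps for the register)
  • Latency: 4 * 110 = 440ps
  • Throughput: (1/110) * 1000 = 9.09 GOPS
• How to get max throughput with min stages?
  • 5 stages (A, B, C, D, EF), since we cannot go lower than 100ps for a clock cycle: block A alone takes 80ps, plus 20ps for the register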
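The answers to this exercise can be checked by brute force. The following is a minimal sketch, not part of the original slides: the function names `stage_delays` and `best_partition` are made up for illustration, and it assumes that every stage ends in a register with the same 20ps delay and that the order of the blocks A..F cannot change.

```python
from itertools import combinations

DELAYS = [80, 30, 60, 50, 70, 10]  # blocks A..F in ps, taken from the exercise
REG_DELAY = 20                     # register delay in ps, taken from the exercise


def stage_delays(delays, cuts):
    """Split the block delays at the given cut positions; return the sum per stage."""
    bounds = [0, *cuts, len(delays)]
    return [sum(delays[a:b]) for a, b in zip(bounds, bounds[1:])]


def best_partition(delays, extra_regs, reg_delay=REG_DELAY):
    """Try every placement of the extra registers and keep the shortest cycle.

    Returns (cycle time in ps, latency in ps, throughput in GOPS, stage delays).
    """
    best = None
    for cuts in combinations(range(1, len(delays)), extra_regs):
        stages = stage_delays(delays, cuts)
        cycle = max(stages) + reg_delay  # slowest stage bounds the clock
        if best is None or cycle < best[0]:
            best = (cycle, stages)
    cycle, stages = best
    latency = cycle * len(stages)        # one cycle per stage
    throughput_gops = 1000 / cycle       # 1 op per cycle; 1000 ps per ns
    return cycle, latency, throughput_gops, stages


if __name__ == "__main__":
    for regs in range(1, 6):
        print(regs, best_partition(DELAYS, regs))
```

With 3 additional registers the search returns a 110ps cycle and 440ps latency; with 4 and with 5 additional registers the cycle stays at 100ps, which is why 5 stages are enough.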
Pipeline registers
• Select A: merges valP into valA, since only jxx and call need valP at later stages (E and M, respectively)
• Predicted PC
Control Hazards
• PC prediction (Fetch stage)
  • jxx, ret → ?
  • call, jmp → valC
  • Other → valP
• Branch prediction strategies
  • Always-taken
  • Never-taken
  • Backward taken, forward not taken
• Why are forward branches less common?
• How can we deal with stack prediction (ret)?
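The three branch prediction strategies listed above can be sketched as a small function. This is an illustrative sketch only: the names `predict_taken` and `predict_pc` and the strategy strings are hypothetical, `val_c` stands for the jump target (valC) and `val_p` for the fall-through address (valP). It covers conditional jumps only and says nothing about ret.

```python
def predict_taken(strategy, branch_pc, target_pc):
    """Return True if the conditional jump is predicted taken."""
    if strategy == "always-taken":
        return True
    if strategy == "never-taken":
        return False
    if strategy == "btfnt":
        # Backward taken, forward not taken: a target below the branch
        # address is a backward jump (typically the bottom of a loop).
        return target_pc < branch_pc
    raise ValueError(f"unknown strategy: {strategy}")


def predict_pc(strategy, branch_pc, val_c, val_p):
    """Pick the next fetch address for a conditional jump at branch_pc."""
    return val_c if predict_taken(strategy, branch_pc, val_c) else val_p


if __name__ == "__main__":
    # Backward jump from 0x100 to 0x080: predicted taken under BTFNT.
    print(hex(predict_pc("btfnt", 0x100, 0x080, 0x105)))
    # Forward jump from 0x100 to 0x200: predicted not taken under BTFNT.
    print(hex(predict_pc("btfnt", 0x100, 0x200, 0x105)))
```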
Data Hazards
• Data dependencies in processors
  • The result of one instruction is used as an operand by another (nop example)
• Stalling / forwarding / load-use hazard
Data Dependencies: 2 Nop's
# demo-h2.ys              1  2  3  4  5  6  7  8  9  10
0x000: irmovl $10,%edx    F  D  E  M  W
0x006: irmovl $3,%eax        F  D  E  M  W
0x00c: nop                      F  D  E  M  W
0x00d: nop                         F  D  E  M  W
0x00e: addl %edx,%eax                 F  D  E  M  W
0x010: halt                              F  D  E  M  W
Cycle 6:
• W (irmovl $3,%eax): R[%eax] ← 3
• D (addl %edx,%eax): valA ← R[%edx] = 10, valB ← R[%eax] = 0 (Error)
Stalling solution
# demo-h2.ys              1  2  3  4  5  6  7  8  9  10 11
0x000: irmovl $10,%edx    F  D  E  M  W
0x006: irmovl $3,%eax        F  D  E  M  W
0x00c: nop                      F  D  E  M  W
0x00d: nop                         F  D  E  M  W
bubble                                      E  M  W
0x00e: addl %edx,%eax                 F  D  D  E  M  W
0x010: halt                              F  F  D  E  M  W
• If an instruction follows too closely after one that writes a register, slow it down
• Hold the instruction in decode
• Dynamically inject a nop into the execute stage
Data Forwarding Solution
# demo-h2.ys              1  2  3  4  5  6  7  8  9  10
0x000: irmovl $10,%edx    F  D  E  M  W
0x006: irmovl $3,%eax        F  D  E  M  W
0x00c: nop                      F  D  E  M  W
0x00d: nop                         F  D  E  M  W
0x00e: addl %edx,%eax                 F  D  E  M  W
0x010: halt                              F  D  E  M  W
Cycle 6:
• W: irmovl in write-back stage, R[%eax] ← 3, W_dstE = %eax, W_valE = 3
• Destination value in W pipeline register
• Forward as valB for decode stage
• D: srcA = %edx, valA ← R[%edx] = 10; srcB = %eax, valB ← W_valE = 3
Limitation of Forwarding
# demo-luh.ys                             1  2  3  4  5  6  7  8  9  10 11
0x000: irmovl $128,%edx                   F  D  E  M  W
0x006: irmovl $3,%ecx                        F  D  E  M  W
0x00c: rmmovl %ecx, 0(%edx)                     F  D  E  M  W
0x012: irmovl $10,%ebx                             F  D  E  M  W
0x018: mrmovl 0(%edx),%eax  # Load %eax               F  D  E  M  W
0x01e: addl %ebx,%eax  # Use %eax                        F  D  E  M  W
0x020: halt                                                 F  D  E  M  W
Load-use dependency
• Value needed by end of decode stage in cycle 7
• Value read from memory in memory stage of cycle 8
Cycle 7:
• M (irmovl $10,%ebx): M_dstE = %ebx, M_valE = 10
• D (addl %ebx,%eax): valA ← M_valE = 10, valB ← R[%eax] = 0 (Error)
Cycle 8:
• M (mrmovl 0(%edx),%eax): M_dstM = %eax, m_valM ← M[128] = 3
Avoiding Load/Use Hazard
# demo-luh.ys                             1  2  3  4  5  6  7  8  9  10 11 12
0x000: irmovl $128,%edx                   F  D  E  M  W
0x006: irmovl $3,%ecx                        F  D  E  M  W
0x00c: rmmovl %ecx, 0(%edx)                     F  D  E  M  W
0x012: irmovl $10,%ebx                             F  D  E  M  W
0x018: mrmovl 0(%edx),%eax  # Load %eax               F  D  E  M  W
bubble                                                         E  M  W
0x01e: addl %ebx,%eax  # Use %eax                        F  D  D  E  M  W
0x020: halt                                                 F  F  D  E  M  W
• Stall the using instruction for one cycle
• Can then pick up the loaded value by forwarding from the memory stage
Cycle 8:
• W: W_dstE = %ebx, W_valE = 10
• M: M_dstM = %eax, m_valM ← M[128] = 3
• D: valA ← W_valE = 10, valB ← m_valM = 3
Control Logic
• Processing "ret": must stall until the instruction reaches write-back
• Load/use hazard: must stall between the memory read and the use
• Mispredicted branch: remove the wrongly fetched instructions from the pipe
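The three special cases can each be detected from the instruction codes sitting in the pipeline registers. The sketch below is an assumption-laden illustration, not the course's HCL: the function and parameter names are hypothetical, instruction codes are plain strings, `popl` is assumed to count as a load alongside `mrmovl`, and conditional jumps are assumed to be predicted taken.

```python
# Instructions that read memory into a register (assumed set for this sketch).
LOAD_INSTRUCTIONS = {"mrmovl", "popl"}


def load_use_hazard(e_icode, e_dst_m, d_src_a, d_src_b):
    """A load is in execute and the instruction in decode reads its destination."""
    return e_icode in LOAD_INSTRUCTIONS and e_dst_m in (d_src_a, d_src_b)


def processing_ret(d_icode, e_icode, m_icode):
    """A ret is somewhere between decode and memory, so the next PC is unknown."""
    return "ret" in (d_icode, e_icode, m_icode)


def mispredicted_branch(e_icode, e_cnd):
    """A jump predicted taken reaches execute and its condition turns out false."""
    return e_icode == "jxx" and not e_cnd


if __name__ == "__main__":
    # demo-luh.ys, cycle 7: mrmovl 0(%edx),%eax is in execute,
    # addl %ebx,%eax is in decode and reads %eax.
    print(load_use_hazard("mrmovl", "%eax", "%ebx", "%eax"))
```

Keeping the three checks as separate predicates mirrors how the slide lists them; combining them into stall and bubble signals per pipeline register is a further step not shown here.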
Implementing the forwarding logic
• Note: 5 forwarding sources (e_valE, m_valM, M_valE, W_valM, W_valE)
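Forwarding amounts to a priority selection: the decode stage compares a source register against the destination of every instruction still in flight and takes the value from the earliest pipeline stage that matches. The sketch below is illustrative only. `forward_operand` is a hypothetical name, and the source order shown in the example (e_valE, m_valM, M_valE, W_valM, W_valE) is assumed to be the five sources the slide refers to.

```python
RNONE = None  # stands for "no register"


def forward_operand(src, reg_file_value, sources):
    """Select the value for one decode-stage operand.

    src            -- source register name, or RNONE
    reg_file_value -- value read from the register file for src
    sources        -- (destination register, value) pairs, ordered from the
                      earliest pipeline stage (execute) to the latest (write-back)
    """
    if src is not RNONE:
        for dst, value in sources:
            if dst == src:
                return value  # most recent in-flight write wins
    return reg_file_value     # no match: the register file is up to date


if __name__ == "__main__":
    # demo-h2.ys, cycle 6: the two nops are in E and M (no destinations),
    # irmovl $3,%eax is in W, so W_dstE = %eax and W_valE = 3.
    sources = [
        (RNONE, 0),   # e_dstE, e_valE
        (RNONE, 0),   # M_dstM, m_valM
        (RNONE, 0),   # M_dstE, M_valE
        (RNONE, 0),   # W_dstM, W_valM
        ("%eax", 3),  # W_dstE, W_valE
    ]
    val_a = forward_operand("%edx", 10, sources)  # register file already holds 10
    val_b = forward_operand("%eax", 0, sources)   # forwarded from W_valE
    print(val_a, val_b)
```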