CSL718 : Pipelined Processors
Pipeline Timings
12th Jan, 2006
Anshul Kumar, CSE IITD
Pipelined Processors
Parallel architectures
• Function-parallel
  • Instruction level (ILP)
    • Pipelined processors
    • VLIWs
    • Superscalar processors
  • Thread level
  • Process level
• Data-parallel
Intel's terminology: intra ILP, inter ILP
Processor Performance
• MIPS and MFLOPS may not truly represent performance
• Execution time of a program is the true measure of performance
• SPEC rating is an acceptable measure
Execution Time and Clock Period
Instruction execution time = Tinst = CPI * t
Program execution time = Tprog = N * Tinst = N * CPI * t
• N : number of instructions
• CPI : cycles per instruction (average)
• t : clock cycle time
(Figure: an instruction passing through stages IF, D, RF, EX/AG, M, WB, each taking one clock cycle t)
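To make the formula concrete, a minimal sketch with hypothetical values (the instruction count, CPI and clock period below are illustrative, not from the course):

```python
# Program execution time from Tprog = N * CPI * t
# (all numeric values are hypothetical)
N = 1_000_000     # instructions executed
CPI = 1.4         # average cycles per instruction
t = 1e-9          # clock cycle time: 1 ns

Tinst = CPI * t   # average time per instruction
Tprog = N * Tinst # total program execution time

print(f"Tinst = {Tinst * 1e9:.2f} ns, Tprog = {Tprog * 1e3:.2f} ms")
```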
What influences clock period?
Tprog = N * CPI * t
• Technology - t
• Software - N
• Architecture - N * CPI * t
  • Instruction set architecture (ISA): trade-off N vs CPI * t
  • Microarchitecture (µA): trade-off CPI vs t
Determining Clock Period
Clock period = t = Pmax
Pmax = max propagation delay
(Figure: Reg -> Comb -> Reg on a common clock; Pmax is the delay through the combinational block)
Ideal Pipelining
• Tinst divided into S stages
• t = Tinst / S
• CPI = 1
• Effective time per instruction: Teff = 1 * Tinst / S
Pipelining with hazards
• Tinst divided into S stages
• Frequency of interruptions: b
• t = Tinst / S
• CPI = 1 + (S - 1) * b
• Teff = (1 + (S - 1) * b) * Tinst / S
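A small sketch comparing the two cases numerically; the formulas are the ones above, and the values of Tinst, S and b are taken from the later "Optimal Pipelining" example:

```python
# Effective time per instruction, ideal vs. with hazards
Tinst = 90e-9     # unpipelined instruction time (90 ns, from the later example)
S = 9             # pipeline stages
b = 0.2           # frequency of interruptions (hazards)

t = Tinst / S                          # cycle time, ignoring clocking overheads
Teff_ideal = 1 * t                     # CPI = 1
Teff_hazard = (1 + (S - 1) * b) * t    # CPI = 1 + (S - 1) * b

print(f"ideal:        Teff = {Teff_ideal * 1e9:.1f} ns")
print(f"with hazards: Teff = {Teff_hazard * 1e9:.1f} ns")
```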
A more realistic view
t = Pmax + C
• Pmax = max propagation delay
• C = clocking overhead
(Figure: Reg -> Comb -> Reg on a common clock; the cycle covers Pmax plus the overhead C)
Clocking Overhead
• Fixed overhead c
  • Setup time
  • Output delay
• Variable overhead (stretching factor) k
  • Clock skew
t = Tinst / S + k * Tinst / S + c = (1 + k) * Tinst / S + c
Pipelining with Clocking Overhead
Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]
Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)]
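The optimum follows from minimizing Teff with respect to S. A sketch of that derivation, using only the quantities defined above:

```latex
\begin{aligned}
T_{eff}(S) &= \frac{(1-b)(1+k)\,T_{inst}}{S} + (1-b)\,c + b(1+k)\,T_{inst} + b\,c\,S \\
\frac{dT_{eff}}{dS} &= -\frac{(1-b)(1+k)\,T_{inst}}{S^{2}} + b\,c = 0 \\
\Rightarrow\; S_{opt} &= \sqrt{\frac{(1-b)(1+k)\,T_{inst}}{b\,c}}
\end{aligned}
```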
Partitioning instruction into cycles with non-uniform stage times
Stages: IF, D, RF, AG, T, DF, EX, PA
• One action - one pipeline stage => large quantization overhead
• Multiple actions per stage?
• Multiple stages per action?
Example
Actions and their delays (90 ns in total):
• PC - MAR: 4 ns
• Cache Dir: 6 ns
• Cache Data: 10 ns
• Data - IR: 3 ns
• Decode: 6 + 6 ns
• Gen Addr: 9 ns
• Addr - MAR: 3 ns
• Cache Dir: 6 ns
• Cache Data: 10 ns
• Data - ALU: 3 ns
• Execute: 7 + 7 + 8 ns
• Put Away: 2 ns
Optimal Pipelining
Tinst = 4 + 6 + 10 + 3 + 12 + 9 + 3 + 6 + 10 + 3 + 22 + 2 = 90 ns
b = 0.2, c = 4 ns, k = 5%
Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)] = 9.7 ≈ 9
Tseg = Tinst / 9 = 10 ns
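A quick numeric check of the optimum, using exactly the figures on this slide:

```python
import math

# Parameters from the "Optimal Pipelining" slide
Tinst = 90.0   # ns, total unpipelined instruction time
b = 0.2        # hazard (interruption) frequency
c = 4.0        # ns, fixed clocking overhead
k = 0.05       # variable overhead (stretching factor), 5%

Sopt = math.sqrt((1 - b) * (1 + k) * Tinst / (b * c))
S = 9                   # the slide rounds 9.7 down to 9 stages
Tseg = Tinst / S        # ideal (quantization-free) segment time

print(f"Sopt = {Sopt:.2f}, chosen S = {S}, Tseg = {Tseg:.0f} ns")
# -> Sopt = 9.72, chosen S = 9, Tseg = 10 ns
```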
Example: Tseg = 10 ns
(Figure: the 90 ns action chain above, grouped into segments of at most 10 ns each)
S = 10, t = (1 + k) * Tseg + c = 14.5 ns, S * t = 145 ns
Example: Tseg = 13 ns
(Figure: the same action chain, grouped into segments of at most 13 ns each)
S = 9, t = (1 + k) * Tseg + c = 17.65 ns, S * t ≈ 159 ns
Example: Tseg = 20 ns
(Figure: the same action chain, grouped into segments of at most 20 ns each)
S = 5, t = (1 + k) * Tseg + c = 25 ns, S * t = 125 ns
Comparison

Tseg (ns)   S    t (ns)    S * t (ns)
   10      10    14.5        145
   13       9    17.65       159
   20       5    25          125
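The slides compare stage count, cycle time and flow-through time (S * t). The sketch below repeats that comparison and also applies the earlier Teff = [1 + (S - 1) * b] * t formula to estimate throughput; that last column is an extension, not part of the slide:

```python
# Compare the three partitionings of the 90 ns datapath.
b, c, k = 0.2, 4.0, 0.05
cases = [(10, 10), (13, 9), (20, 5)]   # (Tseg in ns, resulting stage count S)

for Tseg, S in cases:
    t = (1 + k) * Tseg + c             # cycle time including clocking overhead
    flow = S * t                       # time for one instruction to flow through
    Teff = (1 + (S - 1) * b) * t       # average time per instruction with hazards
    print(f"Tseg={Tseg:>2} ns: S={S:>2}, t={t:5.2f} ns, "
          f"S*t={flow:6.1f} ns, Teff={Teff:5.1f} ns")
```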
Cycle Quantization
• Delays are not integral multiples of the clock period
• Total overhead = clocking overhead + quantization overhead
• S * t ≥ Tinst + S * C (ignoring k)
• Quantization overhead = S * (t - C) - Tinst
• Reduces as the clock period becomes small
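Applying this to the three partitionings above (ignoring k, as the slide does, so that t = Tseg + C):

```python
# Quantization overhead = S * (t - C) - Tinst, with k ignored (t = Tseg + C)
Tinst, C = 90.0, 4.0
for Tseg, S in [(10, 10), (13, 9), (20, 5)]:
    t = Tseg + C
    q = S * (t - C) - Tinst            # equals S * Tseg - Tinst
    print(f"Tseg={Tseg:>2} ns, S={S:>2}: quantization overhead = {q:.0f} ns")
# -> 10 ns, 27 ns and 10 ns respectively
```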
Other Timing Approaches
• Self-timed circuits
  • No centralized free-running clock
  • An operation begins as soon as its inputs are available, i.e., all its predecessors have completed
  • Higher speed, lower power consumption
• Wave pipelining
  • Omit inter-stage registers
  • Reduced clocking overhead
Conventional vs Wave Pipelining
Conventional pipeline:
• Registers separate adjoining stages
• Clock period > max propagation delay
• Inter-stage data stored in registers
Wave pipeline:
• No registers between adjoining stages
• Clock period less than max propagation delay
• Waves of data propagate through the combinational network (effectively, data is "stored" in the combinational circuit delay!)
No pipelining
(Timing diagram: a single register pair around the combinational logic; input X produces output Y, and the next input X' is applied only after Y has been captured)
Conventional pipelining
(Timing diagram: inputs X, X' enter the first register; intermediate results Y, Y' and Z, Z' are held in inter-stage registers; results W appear at the output register, with every value advancing one stage per clock)
Wave pipelining
(Timing diagram: input register X and output register W only; the intermediate value Z' exists as a wave inside the combinational logic, with no inter-stage register to hold it)
Timing
(Figure: Reg -> Comb ckt -> Reg, input X, output Y, common clock of period T)
Constraint: T ≥ p + s
• T : clock period
• p : propagation delay
• s : set-up time
Timing with clock skew
(Figure: the same Reg -> Comb ckt -> Reg structure, with skew between the two register clocks)
Clock skew = δ
Constraint: T ≥ p + s + 2δ
Variation in propagation delay
• Different delays in different paths
• Delay variation due to process / temperature / power variations
• Data-dependent delay variations
Timing for wave pipelining
(Figure: Reg -> Comb ckt -> Reg, input X, output Y; the output arrives between pmin and pmax after the launching clock edge)
Constraint: T ≥ Δp + s + 4δ, where Δp = pmax - pmin
Timing for wave pipelining (expanded view)
(Figure: data launched at time 0 reaches the output between pmin and pmax and is sampled at clock edge nT; the next wave, launched at T, must not overwrite it before then)
• Sampling (setup) constraint: nT ≥ pmax + s + 2δ
• Non-interference constraint: pmin ≥ (n - 1)T + 2δ
• Combined: T ≥ Δp + s + 4δ
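Combining the two constraints gives the clock-period bound; a sketch of the algebra, using the symbols above:

```latex
\begin{aligned}
nT &\ge p_{max} + s + 2\delta \\
(n-1)T &\le p_{min} - 2\delta \\[4pt]
\Rightarrow\; T = nT - (n-1)T &\ge (p_{max} - p_{min}) + s + 4\delta \;=\; \Delta p + s + 4\delta
\end{aligned}
```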
Comparison
Conventional pipeline:
• T ≥ pmax / n + s + 2δ (plus cycle quantization overhead)
• nT ≥ pmax + n * s + 2nδ
Wave pipeline:
• T ≥ Δp + s + 4δ
• nT ≥ pmax + s + 2δ
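A small sketch that plugs hypothetical delay numbers (not from the course) into the two sets of bounds:

```python
# Minimum clock period and latency bounds: n-stage conventional pipeline
# versus a wave pipeline over the same logic (all numbers hypothetical).
pmax, pmin = 20.0, 16.0    # ns, longest and shortest paths through the logic
s, skew = 1.0, 0.5         # ns, register setup time and clock skew (delta)
n = 4                      # number of stages / waves in flight

dp = pmax - pmin                       # delay spread
T_conv = pmax / n + s + 2 * skew       # ignores cycle quantization overhead
T_wave = dp + s + 4 * skew             # depends on the spread, not on pmax

print(f"conventional: T >= {T_conv:.1f} ns, latency bound nT >= {n * T_conv:.1f} ns")
print(f"wave:         T >= {T_wave:.1f} ns, latency bound nT >= {pmax + s + 2 * skew:.1f} ns")
```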
Problems with wave pipelining
• Need to balance delays
• Narrow range of clock frequencies
• Control difficult
• Not very suitable for non-linear pipelines
Additional Reading
W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, "Wave-Pipelining: A Tutorial and Research Survey", IEEE Transactions on VLSI Systems, vol. 6, no. 3, pp. 464-474, September 1998.