CSL718 : Pipelined Processors
Pipeline Timings
12th Jan, 2006
Anshul Kumar, CSE IITD
Pipelined Processors
Parallel architectures
• Function-parallel
  • Instruction level (ILP)
    • Pipelined processors
    • VLIWs
    • Superscalar processors
  • Thread level
  • Process level
• Data-parallel
Intel's terminology: intra ILP, inter ILP
Processor Performance
• MIPS and MFLOPS may not truly represent performance
• Execution time of a program is the true measure of performance
• SPEC rating is an acceptable measure
Execution Time and Clock Period
Instruction execution time = Tinst = CPI * t
Program execution time = Tprog = N * Tinst = N * CPI * t
• N : number of instructions
• CPI : cycles per instruction (average)
• t : clock cycle time
(Figure: an instruction passing through stages IF, D, RF, EX/AG, M, WB, each taking one clock cycle t)
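To make the formula concrete, a minimal sketch with hypothetical values (the instruction count, CPI and clock period below are illustrative, not from the course):

```python
# Program execution time from Tprog = N * CPI * t
# (all numeric values are hypothetical)
N = 1_000_000     # instructions executed
CPI = 1.4         # average cycles per instruction
t = 1e-9          # clock cycle time: 1 ns

Tinst = CPI * t   # average time per instruction
Tprog = N * Tinst # total program execution time

print(f"Tinst = {Tinst * 1e9:.2f} ns, Tprog = {Tprog * 1e3:.2f} ms")
```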
What influences clock period?
Tprog = N * CPI * t
• Technology - t
• Software - N
• Architecture - N * CPI * t
  • Instruction set architecture (ISA): trade-off N vs CPI * t
  • Microarchitecture (µA): trade-off CPI vs t
Determining Clock Period
Clock period = t = Pmax
Pmax = max propagation delay
(Figure: Reg -> Comb -> Reg on a common clock; Pmax is the delay through the combinational block)
Ideal Pipelining
• Tinst divided into S stages
• t = Tinst / S
• CPI = 1
• Effective time per instruction: Teff = 1 * Tinst / S
Pipelining with hazards
• Tinst divided into S stages
• Frequency of interruptions: b
• t = Tinst / S
• CPI = 1 + (S - 1) * b
• Teff = (1 + (S - 1) * b) * Tinst / S
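A small sketch comparing the two cases numerically; the formulas are the ones above, and the values of Tinst, S and b are taken from the later "Optimal Pipelining" example:

```python
# Effective time per instruction, ideal vs. with hazards
Tinst = 90e-9     # unpipelined instruction time (90 ns, from the later example)
S = 9             # pipeline stages
b = 0.2           # frequency of interruptions (hazards)

t = Tinst / S                          # cycle time, ignoring clocking overheads
Teff_ideal = 1 * t                     # CPI = 1
Teff_hazard = (1 + (S - 1) * b) * t    # CPI = 1 + (S - 1) * b

print(f"ideal:        Teff = {Teff_ideal * 1e9:.1f} ns")
print(f"with hazards: Teff = {Teff_hazard * 1e9:.1f} ns")
```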
A more realistic view
t = Pmax + C
• Pmax = max propagation delay
• C = clocking overhead
(Figure: Reg -> Comb -> Reg on a common clock; the cycle covers Pmax plus the overhead C)
Clocking Overhead
• Fixed overhead c
  • Setup time
  • Output delay
• Variable overhead (stretching factor) k
  • Clock skew
t = Tinst / S + k * Tinst / S + c = (1 + k) * Tinst / S + c
Pipelining with Clocking Overhead
Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]
Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)]
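The optimum follows from minimizing Teff with respect to S. A sketch of that derivation, using only the quantities defined above:

```latex
\begin{aligned}
T_{eff}(S) &= \frac{(1-b)(1+k)\,T_{inst}}{S} + (1-b)\,c + b(1+k)\,T_{inst} + b\,c\,S \\
\frac{dT_{eff}}{dS} &= -\frac{(1-b)(1+k)\,T_{inst}}{S^{2}} + b\,c = 0 \\
\Rightarrow\; S_{opt} &= \sqrt{\frac{(1-b)(1+k)\,T_{inst}}{b\,c}}
\end{aligned}
```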
Partitioning instruction into cycles with non-uniform stage times
Stages: IF, D, RF, AG, T, DF, EX, PA
• One action - one pipeline stage => large quantization overhead
• Multiple actions per stage?
• Multiple stages per action?
Example
Actions and their delays (90 ns in total):
• PC - MAR: 4 ns
• Cache Dir: 6 ns
• Cache Data: 10 ns
• Data - IR: 3 ns
• Decode: 6 + 6 ns
• Gen Addr: 9 ns
• Addr - MAR: 3 ns
• Cache Dir: 6 ns
• Cache Data: 10 ns
• Data - ALU: 3 ns
• Execute: 7 + 7 + 8 ns
• Put Away: 2 ns
Optimal Pipelining
Tinst = 4 + 6 + 10 + 3 + 12 + 9 + 3 + 6 + 10 + 3 + 22 + 2 = 90 ns
b = 0.2, c = 4 ns, k = 5%
Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)] = 9.7 ≈ 9
Tseg = Tinst / 9 = 10 ns
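A quick numeric check of the optimum, using exactly the figures on this slide:

```python
import math

# Parameters from the "Optimal Pipelining" slide
Tinst = 90.0   # ns, total unpipelined instruction time
b = 0.2        # hazard (interruption) frequency
c = 4.0        # ns, fixed clocking overhead
k = 0.05       # variable overhead (stretching factor), 5%

Sopt = math.sqrt((1 - b) * (1 + k) * Tinst / (b * c))
S = 9                   # the slide rounds 9.7 down to 9 stages
Tseg = Tinst / S        # ideal (quantization-free) segment time

print(f"Sopt = {Sopt:.2f}, chosen S = {S}, Tseg = {Tseg:.0f} ns")
# -> Sopt = 9.72, chosen S = 9, Tseg = 10 ns
```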
Example: Tseg = 10 ns
(Figure: the 90 ns action chain above, grouped into segments of at most 10 ns each)
S = 10, t = (1 + k) * Tseg + c = 14.5 ns, S * t = 145 ns
Example: Tseg = 13 ns
(Figure: the same action chain, grouped into segments of at most 13 ns each)
S = 9, t = (1 + k) * Tseg + c = 17.65 ns, S * t ≈ 159 ns
Example: Tseg = 20 ns
(Figure: the same action chain, grouped into segments of at most 20 ns each)
S = 5, t = (1 + k) * Tseg + c = 25 ns, S * t = 125 ns
Comparison

Tseg (ns)   S    t (ns)    S * t (ns)
   10      10    14.5        145
   13       9    17.65       159
   20       5    25          125
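The slides compare stage count, cycle time and flow-through time (S * t). The sketch below repeats that comparison and also applies the earlier Teff = [1 + (S - 1) * b] * t formula to estimate throughput; that last column is an extension, not part of the slide:

```python
# Compare the three partitionings of the 90 ns datapath.
b, c, k = 0.2, 4.0, 0.05
cases = [(10, 10), (13, 9), (20, 5)]   # (Tseg in ns, resulting stage count S)

for Tseg, S in cases:
    t = (1 + k) * Tseg + c             # cycle time including clocking overhead
    flow = S * t                       # time for one instruction to flow through
    Teff = (1 + (S - 1) * b) * t       # average time per instruction with hazards
    print(f"Tseg={Tseg:>2} ns: S={S:>2}, t={t:5.2f} ns, "
          f"S*t={flow:6.1f} ns, Teff={Teff:5.1f} ns")
```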
Cycle Quantization
• Delays are not integral multiples of the clock period
• Total overhead = clocking overhead + quantization overhead
• S * t ≥ Tinst + S * C (ignoring k)
• Quantization overhead = S * (t - C) - Tinst
• Reduces as the clock period becomes small
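Applying this to the three partitionings above (ignoring k, as the slide does, so that t = Tseg + C):

```python
# Quantization overhead = S * (t - C) - Tinst, with k ignored (t = Tseg + C)
Tinst, C = 90.0, 4.0
for Tseg, S in [(10, 10), (13, 9), (20, 5)]:
    t = Tseg + C
    q = S * (t - C) - Tinst            # equals S * Tseg - Tinst
    print(f"Tseg={Tseg:>2} ns, S={S:>2}: quantization overhead = {q:.0f} ns")
# -> 10 ns, 27 ns and 10 ns respectively
```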
Other Timing Approaches
• Self-timed circuits
  • No centralized free-running clock
  • An operation begins as soon as its inputs are available, i.e., all its predecessors have completed
  • Higher speed, lower power consumption
• Wave pipelining
  • Omit inter-stage registers
  • Reduced clocking overhead
Conventional vs Wave Pipelining
Conventional pipeline:
• Registers separate adjoining stages
• Clock period > max propagation delay
• Inter-stage data stored in registers
Wave pipeline:
• No registers between adjoining stages
• Clock period less than max propagation delay
• Waves of data propagate through the combinational network (effectively, data is "stored" in the combinational circuit delay!)
No pipelining
(Timing diagram: a single register pair around the combinational logic; input X produces output Y, and the next input X' is applied only after Y has been captured)
Conventional pipelining
(Timing diagram: inputs X, X' enter the first register; intermediate results Y, Y' and Z, Z' are held in inter-stage registers; results W appear at the output register, with every value advancing one stage per clock)
Wave pipelining
(Timing diagram: input register X and output register W only; the intermediate value Z' exists as a wave inside the combinational logic, with no inter-stage register to hold it)
Timing
(Figure: Reg -> Comb ckt -> Reg, input X, output Y, common clock of period T)
Constraint: T ≥ p + s
• T : clock period
• p : propagation delay
• s : set-up time
Timing with clock skew
(Figure: the same Reg -> Comb ckt -> Reg structure, with skew between the two register clocks)
Clock skew = δ
Constraint: T ≥ p + s + 2δ
Variation in propagation delay
• Different delays in different paths
• Delay variation due to process / temperature / power variations
• Data-dependent delay variations
Timing for wave pipelining
(Figure: Reg -> Comb ckt -> Reg, input X, output Y; the output arrives between pmin and pmax after the launching clock edge)
Constraint: T ≥ Δp + s + 4δ, where Δp = pmax - pmin
Timing for wave pipelining (expanded view)
(Figure: data launched at time 0 reaches the output between pmin and pmax and is sampled at clock edge nT; the next wave, launched at T, must not overwrite it before then)
• Sampling (setup) constraint: nT ≥ pmax + s + 2δ
• Non-interference constraint: pmin ≥ (n - 1)T + 2δ
• Combined: T ≥ Δp + s + 4δ
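Combining the two constraints gives the clock-period bound; a sketch of the algebra, using the symbols above:

```latex
\begin{aligned}
nT &\ge p_{max} + s + 2\delta \\
(n-1)T &\le p_{min} - 2\delta \\[4pt]
\Rightarrow\; T = nT - (n-1)T &\ge (p_{max} - p_{min}) + s + 4\delta \;=\; \Delta p + s + 4\delta
\end{aligned}
```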
Comparison
Conventional pipeline:
• T ≥ pmax / n + s + 2δ (plus cycle quantization overhead)
• nT ≥ pmax + n * s + 2nδ
Wave pipeline:
• T ≥ Δp + s + 4δ
• nT ≥ pmax + s + 2δ
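A small sketch that plugs hypothetical delay numbers (not from the course) into the two sets of bounds:

```python
# Minimum clock period and latency bounds: n-stage conventional pipeline
# versus a wave pipeline over the same logic (all numbers hypothetical).
pmax, pmin = 20.0, 16.0    # ns, longest and shortest paths through the logic
s, skew = 1.0, 0.5         # ns, register setup time and clock skew (delta)
n = 4                      # number of stages / waves in flight

dp = pmax - pmin                       # delay spread
T_conv = pmax / n + s + 2 * skew       # ignores cycle quantization overhead
T_wave = dp + s + 4 * skew             # depends on the spread, not on pmax

print(f"conventional: T >= {T_conv:.1f} ns, latency bound nT >= {n * T_conv:.1f} ns")
print(f"wave:         T >= {T_wave:.1f} ns, latency bound nT >= {pmax + s + 2 * skew:.1f} ns")
```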
Problems with wave pipelining
• Need to balance delays
• Narrow range of clock frequencies
• Control difficult
• Not very suitable for non-linear pipelines
Additional Reading
W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, "Wave-Pipelining: A Tutorial and Research Survey", IEEE Transactions on VLSI Systems, vol. 6, no. 3, pp. 464-474, September 1998.