It’s all about latency
Henk Neefs
Dept. of Electronics and Information Systems (ELIS), University of Gent

Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions
Out-of-order processor pipeline
[diagram: I-cache → fetch → decode → rename → instruction window → execution units (LD, ST, INT) → ‘future’ register file; in-order retirement into the architectural register file]

Branch latency
[diagram: pipeline timeline of an instruction stream (BR, ST, XOR, LD, OR, ADD, ...); the instructions following a branch (BR) must wait for the branch latency before they can proceed]
Eliminate branch latency
• By prediction: predict the outcome of the branch => eliminate the dependency (with a high probability)
• By predication: convert the control dependency into a data dependency => eliminate the control dependency (sketch below)
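To make the predication idea concrete, here is a minimal sketch in C (not taken from the slides): the branched version creates a control dependency that must be predicted, while the predicated version computes both candidates and merely selects one, which compilers typically lower to a conditional-move instruction.

    /* Sketch: branch-based vs. predicated absolute value. */

    /* Branched version: the comparison steers control flow,
     * so the processor has to predict the branch. */
    int abs_branched(int x) {
        if (x < 0)
            return -x;
        return x;
    }

    /* Predicated version: both candidate values exist and the
     * comparison result only selects one of them (a data
     * dependency), so no branch outcome needs to be predicted. */
    int abs_predicated(int x) {
        int neg = -x;
        return x < 0 ? neg : x;
    }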
Load latency
while (pointer != 0) pointer = pointer.next;
Loop: LD  R1, R1(32)
      BNE R1, Loop
[diagram: timeline of repeated LD/BNE pairs issuing to the load execution unit]
load latency = 2 cycles, branch latency = 1 cycle
=> CPI = 2 cycles / 2 instructions = 1 cycle/instruction
When longer load latency
• When the L1-cache misses and the L2-cache hits:
  load latency = 2+6 cycles, branch latency = 1 cycle
  => CPI = 8 cycles / 2 instructions = 4 cycles/instruction
• When the L2-cache misses and main memory hits:
  load latency = 2+6+60 cycles
  => CPI = 68 cycles / 2 instructions = 34 cycles/instruction
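As a small illustration (the helper below is a sketch; the latencies are the ones from the two slides above): each iteration of the pointer-chasing loop issues two instructions and its length is set by the load latency, so the CPI follows directly from that latency.

    #include <stdio.h>

    /* Each iteration issues 2 instructions (LD + BNE) and takes roughly
     * one load latency, so CPI ~= load latency / 2. */
    static double pointer_chase_cpi(int load_latency_cycles) {
        const int instructions_per_iteration = 2;  /* LD + BNE */
        return (double)load_latency_cycles / instructions_per_iteration;
    }

    int main(void) {
        printf("L1 hit:      CPI = %.0f\n", pointer_chase_cpi(2));           /* 1  */
        printf("L2 hit:      CPI = %.0f\n", pointer_chase_cpi(2 + 6));       /* 4  */
        printf("main memory: CPI = %.0f\n", pointer_chase_cpi(2 + 6 + 60));  /* 34 */
        return 0;
    }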
Memory hierarchy
[diagram: execution units — register file — L1 cache — L2 cache — main memory — hard drive; storage capacity and latency increase going down the hierarchy]

L1 cache latency
[chart: IPC as a function of L1 cache latency; IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs]

Main memory latency
[chart: IPC as a function of main memory latency; IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs]
Performance and latency
performance change = sensitivity * load latency change
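A brief worked illustration of this linear model (the sensitivity value is invented for the example, not a measured number from the slides):

    #include <stdio.h>

    /* performance change = sensitivity * load latency change */
    static double ipc_change(double sensitivity_ipc_per_cycle,
                             double load_latency_change_cycles) {
        return sensitivity_ipc_per_cycle * load_latency_change_cycles;
    }

    int main(void) {
        /* Assumed sensitivity of -0.02 IPC per cycle; the load latency
         * grows by 6 cycles (e.g. an L1 miss served by the L2 cache). */
        printf("expected IPC change: %.2f\n", ipc_change(-0.02, 6.0));  /* -0.12 */
        return 0;
    }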
Increase performance by
• eliminating/reducing the load latency:
  • by prefetching: predict the next miss and fetch the data to e.g. the L1-cache
  • by address prediction: the address is known earlier => the load is executed earlier => the data is in the register file early
• or reducing the sensitivity to load latency:
  • by fine-grain multithreading
Some prefetch techniques
• Stride prefetching: search for a pattern with a constant stride, e.g. walking through a matrix (in row- or column-order)
  example miss addresses 20, 31, 42, 53, 64 => stride: 11 (sketch below)
• Markov prefetching: recurring patterns of misses
  [table: miss history => prediction, with example entries 10, 110, 15, 12, 100, ...]
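A minimal sketch of the stride-detection idea (the table layout and names below are assumptions, not the slides’ design): remember the last address and last stride per load, and once the same stride is seen twice, prefetch address + stride.

    #include <stdint.h>
    #include <stdbool.h>

    /* One entry per load instruction (indexed by PC in a real table). */
    struct stride_entry {
        uint64_t last_addr;   /* address of the previous access        */
        int64_t  last_stride; /* difference between the last two       */
        bool     confident;   /* the same stride was observed twice    */
    };

    /* On each access, update the entry; return true and set *prefetch_addr
     * once a constant stride has been confirmed (e.g. 20, 31, 42 ->
     * stride 11, prefetch 53). */
    static bool stride_predict(struct stride_entry *e, uint64_t addr,
                               uint64_t *prefetch_addr) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->confident   = (stride == e->last_stride);
        e->last_stride = stride;
        e->last_addr   = addr;
        if (e->confident) {
            *prefetch_addr = addr + stride;
            return true;
        }
        return false;
    }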
Stride prefetching
[chart: effect of stride prefetching on IPC; IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress]
Prefetching and sensitivity
Factors by which the “performance sensitivity to latency” increases with stride prefetching:

Latency is important: generalization to other processor architectures
Consider the schedule of a program over time. Present in every program execution:
• latency of instruction execution
• latency of communication
=> latency is important whatever the processor architecture
Optical interconnects (OI)
• Mature components:
  • Vertical-Cavity Surface Emitting Lasers (VCSELs)
  • Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in telecom and networks
• Useful for short inter-chip and even intra-chip interconnects?

OI in processor context
• At levels close to the processor core, latency is very important => the latency of OI determines how far OI penetrates into the memory hierarchy
• What is the latency of an optical interconnect?
An optical link
[diagram: buffer/modulation/bias → LED/VCSEL → fiber or light conductor → receiver diode → transimpedance amplifier]
Total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency
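As a rough sketch of this sum (all component values below are assumptions for illustration, not measurements from the thesis), the total link latency is the stage latencies plus the time of flight through the guide:

    #include <stdio.h>

    #define SPEED_OF_LIGHT_M_PER_NS 0.2998  /* ~30 cm/ns in vacuum */

    /* Time of flight through a guide with the given refractive index. */
    static double time_of_flight_ns(double length_m, double refractive_index) {
        return length_m * refractive_index / SPEED_OF_LIGHT_M_PER_NS;
    }

    int main(void) {
        /* Illustrative component latencies (assumed, in ns). */
        double buffer_ns   = 0.2;
        double vcsel_ns    = 0.5;   /* driver + intrinsic laser latency */
        double receiver_ns = 0.3;   /* diode + transimpedance amplifier */
        double tof_ns      = time_of_flight_ns(0.05, 1.5);  /* 5 cm of guide */

        double total_ns = buffer_ns + vcsel_ns + tof_ns + receiver_ns;
        printf("total optical link latency: %.2f ns\n", total_ns);
        return 0;
    }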
VCSEL characteristics
• A small semiconductor laser
• The carrier density should be high enough for lasing action

Total VCSEL link latency consists of
• Buffer latency
• Parasitic capacitances and series resistances of the VCSEL and pads
• Threshold carrier density build-up
• From low optical output to the final optical output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency

Total optical link latency @ 1 mW
[chart: total optical link latency for 0.6 µm and 0.25 µm CMOS technologies]
Conclusions
• Combining performance sensitivity and optical latency, we conclude:
  • optical interconnects are feasible to main memory and for multiprocessors
  • for interconnects close to the processor core, optical interconnects have too high a latency with present (telecom) devices, drivers and receivers
  => but an evolution towards lower-latency devices, drivers and receivers is now taking place...
For more information on the presented results:
Henk Neefs, Latentiebeheersing in processors, PhD thesis, Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs