It’s all about latency
Henk Neefs
Dept. of Electronics and Information Systems (ELIS), University of Gent

Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions
Out-of-order processor pipeline
[diagram: I-cache → fetch → decode → rename → instruction window → execution units (LD, ST, INT) → ‘future’ register file; in-order retirement into the architectural register file]

Branch latency
[diagram: pipeline timeline of an instruction stream (BR, ST, XOR, LD, OR, ADD, ...); the instructions following a branch (BR) must wait for the branch latency before they can proceed]
Eliminate branch latency
• By prediction: predict the outcome of the branch => eliminate the dependency (with a high probability)
• By predication: convert the control dependency into a data dependency => eliminate the control dependency (sketch below)
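To make the predication idea concrete, here is a minimal sketch in C (not taken from the slides): the branched version creates a control dependency that must be predicted, while the predicated version computes both candidates and merely selects one, which compilers typically lower to a conditional-move instruction.

    /* Sketch: branch-based vs. predicated absolute value. */

    /* Branched version: the comparison steers control flow,
     * so the processor has to predict the branch. */
    int abs_branched(int x) {
        if (x < 0)
            return -x;
        return x;
    }

    /* Predicated version: both candidate values exist and the
     * comparison result only selects one of them (a data
     * dependency), so no branch outcome needs to be predicted. */
    int abs_predicated(int x) {
        int neg = -x;
        return x < 0 ? neg : x;
    }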
Load latency
while (pointer != 0) pointer = pointer.next;
Loop: LD  R1, R1(32)
      BNE R1, Loop
[diagram: timeline of repeated LD/BNE pairs issuing to the load execution unit]
load latency = 2 cycles, branch latency = 1 cycle
=> CPI = 2 cycles / 2 instructions = 1 cycle/instruction
When longer load latency
• When the L1-cache misses and the L2-cache hits:
  load latency = 2+6 cycles, branch latency = 1 cycle
  => CPI = 8 cycles / 2 instructions = 4 cycles/instruction
• When the L2-cache misses and main memory hits:
  load latency = 2+6+60 cycles
  => CPI = 68 cycles / 2 instructions = 34 cycles/instruction
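As a small illustration (the helper below is a sketch; the latencies are the ones from the two slides above): each iteration of the pointer-chasing loop issues two instructions and its length is set by the load latency, so the CPI follows directly from that latency.

    #include <stdio.h>

    /* Each iteration issues 2 instructions (LD + BNE) and takes roughly
     * one load latency, so CPI ~= load latency / 2. */
    static double pointer_chase_cpi(int load_latency_cycles) {
        const int instructions_per_iteration = 2;  /* LD + BNE */
        return (double)load_latency_cycles / instructions_per_iteration;
    }

    int main(void) {
        printf("L1 hit:      CPI = %.0f\n", pointer_chase_cpi(2));           /* 1  */
        printf("L2 hit:      CPI = %.0f\n", pointer_chase_cpi(2 + 6));       /* 4  */
        printf("main memory: CPI = %.0f\n", pointer_chase_cpi(2 + 6 + 60));  /* 34 */
        return 0;
    }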
Memory hierarchy
[diagram: execution units — register file — L1 cache — L2 cache — main memory — hard drive; storage capacity and latency increase going down the hierarchy]

L1 cache latency
[chart: IPC as a function of L1 cache latency; IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs]

Main memory latency
[chart: IPC as a function of main memory latency; IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs]
Performance and latency
performance change = sensitivity * load latency change
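A brief worked illustration of this linear model (the sensitivity value is invented for the example, not a measured number from the slides):

    #include <stdio.h>

    /* performance change = sensitivity * load latency change */
    static double ipc_change(double sensitivity_ipc_per_cycle,
                             double load_latency_change_cycles) {
        return sensitivity_ipc_per_cycle * load_latency_change_cycles;
    }

    int main(void) {
        /* Assumed sensitivity of -0.02 IPC per cycle; the load latency
         * grows by 6 cycles (e.g. an L1 miss served by the L2 cache). */
        printf("expected IPC change: %.2f\n", ipc_change(-0.02, 6.0));  /* -0.12 */
        return 0;
    }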
Increase performance by
• eliminating/reducing the load latency:
  • by prefetching: predict the next miss and fetch the data to e.g. the L1-cache
  • by address prediction: the address is known earlier => the load is executed earlier => the data is in the register file early
• or reducing the sensitivity to load latency:
  • by fine-grain multithreading
Some prefetch techniques
• Stride prefetching: search for a pattern with a constant stride, e.g. walking through a matrix (in row- or column-order)
  example miss addresses 20, 31, 42, 53, 64 => stride: 11 (sketch below)
• Markov prefetching: recurring patterns of misses
  [table: miss history => prediction, with example entries 10, 110, 15, 12, 100, ...]
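A minimal sketch of the stride-detection idea (the table layout and names below are assumptions, not the slides’ design): remember the last address and last stride per load, and once the same stride is seen twice, prefetch address + stride.

    #include <stdint.h>
    #include <stdbool.h>

    /* One entry per load instruction (indexed by PC in a real table). */
    struct stride_entry {
        uint64_t last_addr;   /* address of the previous access        */
        int64_t  last_stride; /* difference between the last two       */
        bool     confident;   /* the same stride was observed twice    */
    };

    /* On each access, update the entry; return true and set *prefetch_addr
     * once a constant stride has been confirmed (e.g. 20, 31, 42 ->
     * stride 11, prefetch 53). */
    static bool stride_predict(struct stride_entry *e, uint64_t addr,
                               uint64_t *prefetch_addr) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->confident   = (stride == e->last_stride);
        e->last_stride = stride;
        e->last_addr   = addr;
        if (e->confident) {
            *prefetch_addr = addr + stride;
            return true;
        }
        return false;
    }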
Stride prefetching
[chart: effect of stride prefetching on IPC; IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress]
Prefetching and sensitivity
Factors by which the “performance sensitivity to latency” increases with stride prefetching:

Latency is important: generalization to other processor architectures
Consider the schedule of a program over time. Present in every program execution:
• latency of instruction execution
• latency of communication
=> latency is important whatever the processor architecture
Optical interconnects (OI)
• Mature components:
  • Vertical-Cavity Surface Emitting Lasers (VCSELs)
  • Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in telecom and networks
• Useful for short inter-chip and even intra-chip interconnects?

OI in processor context
• At levels close to the processor core, latency is very important => the latency of OI determines how far OI penetrates into the memory hierarchy
• What is the latency of an optical interconnect?
An optical link
[diagram: buffer/modulation/bias → LED/VCSEL → fiber or light conductor → receiver diode → transimpedance amplifier]
Total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency
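As a rough sketch of this sum (all component values below are assumptions for illustration, not measurements from the thesis), the total link latency is the stage latencies plus the time of flight through the guide:

    #include <stdio.h>

    #define SPEED_OF_LIGHT_M_PER_NS 0.2998  /* ~30 cm/ns in vacuum */

    /* Time of flight through a guide with the given refractive index. */
    static double time_of_flight_ns(double length_m, double refractive_index) {
        return length_m * refractive_index / SPEED_OF_LIGHT_M_PER_NS;
    }

    int main(void) {
        /* Illustrative component latencies (assumed, in ns). */
        double buffer_ns   = 0.2;
        double vcsel_ns    = 0.5;   /* driver + intrinsic laser latency */
        double receiver_ns = 0.3;   /* diode + transimpedance amplifier */
        double tof_ns      = time_of_flight_ns(0.05, 1.5);  /* 5 cm of guide */

        double total_ns = buffer_ns + vcsel_ns + tof_ns + receiver_ns;
        printf("total optical link latency: %.2f ns\n", total_ns);
        return 0;
    }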
VCSEL characteristics
• A small semiconductor laser
• The carrier density should be high enough for lasing action

Total VCSEL link latency consists of
• Buffer latency
• Parasitic capacitances and series resistances of the VCSEL and pads
• Threshold carrier density build-up
• From low optical output to the final optical output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency

Total optical link latency @ 1 mW
[chart: total optical link latency for 0.6 µm and 0.25 µm CMOS technologies]
Conclusions
• Combining performance sensitivity and optical latency, we conclude:
  • optical interconnects are feasible to main memory and for multiprocessors
  • for interconnects close to the processor core, optical interconnects have too high a latency with present (telecom) devices, drivers and receivers
  => but an evolution towards lower-latency devices, drivers and receivers is now taking place...
For more information on the presented results:
Henk Neefs, Latentiebeheersing in processors, PhD thesis, Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs