21.1 Efficient On-Chip Global Interconnects. Ron Ho, Ken Mai, Mark Horowitz. Computer Systems Laboratory, Department of Electrical Engineering, Stanford University.
On-chip wires and scaling • What do technology trends say? • Local repeated wires keep up with gates • Global repeated wires do not [Figure: wire delay/gate delay (log scale, 1 to 1000) versus technology generation (0.18 down to 0.013) for mid-layer and top-layer metals; panels: scaled-length wires and fixed-length wires]
On-chip wires and scaling (2) • What are designers doing about this? • Answer: Modular architectures • High-performance scaled compute cores • Long-wire, high-bandwidth global network • Multi-core CPUs: IBM Power, Mitsubishi Electric M32R, Cisco Toaster • Research architectures: MIT “RAW”, UT-Austin “GPA”, Stanford “Smart Memories”
On-chip wires and scaling (3) • Smart Memories on-chip wire network • Grid connects together computation tiles • 4.25mm-long 16B buses (in 0.1µm tech) • How do we build such a network? [Figure: 5cm² die, 0.3cm² quad, 0.075cm² tile (CPU, SRAM), global wire grid]
Outline • Review of repeated wires • Global wire characteristics • Low-swing architecture • Basic components • Techniques for lower latency • Testchip and measured results • Summary
Repeated wires: review • Traditional solution for long wires • CMOS inverters as repeater stages • Makes delay linear with wire length • Improves latency and bandwidth • Pick Lwire, Wgate to minimize delay [Figure: repeater chain with segment length Lwire and repeater width Wgate]
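The "delay linear with wire length" bullet can be checked with a simple Elmore-delay model. This is a minimal sketch: the per-millimetre wire resistance and capacitance and the repeater values below are placeholder assumptions, not numbers from the talk.

```python
def unrepeated_delay(length_mm, r_per_mm, c_per_mm):
    """Elmore delay of a bare distributed-RC wire; grows with length squared."""
    return 0.38 * r_per_mm * c_per_mm * length_mm ** 2


def repeated_delay(length_mm, seg_mm, r_per_mm, c_per_mm, r_drv, c_gate):
    """Delay of a wire cut into seg_mm segments, each driven by one repeater.

    r_drv (repeater output resistance) and c_gate (repeater input
    capacitance) are hypothetical lumped values for one repeater size.
    """
    n_seg = length_mm / seg_mm
    r_w = r_per_mm * seg_mm
    c_w = c_per_mm * seg_mm
    per_segment = 0.69 * r_drv * (c_w + c_gate) + r_w * (0.38 * c_w + 0.69 * c_gate)
    return n_seg * per_segment


if __name__ == "__main__":
    # Assumed values: 100 ohm/mm, 0.2 pF/mm, 500 ohm driver, 50 fF gate.
    params = dict(r_per_mm=100.0, c_per_mm=0.2e-12)
    for length in (2.5, 5.0, 10.0):
        bare = unrepeated_delay(length, **params)
        rep = repeated_delay(length, 1.25, r_drv=500.0, c_gate=50e-15, **params)
        print(f"{length:5.1f} mm: bare {bare * 1e12:7.1f} ps, repeated {rep * 1e12:7.1f} ps")
```

Doubling the length quadruples the bare-wire delay but only doubles the repeated-wire delay, which is the point of the slide.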
Repeated wires: power • The problem: wire and repeater power • For Smart Memories: > 100W at peak BW • Optimal delay != optimal power • Save 30% energy for 11% delay cost [Figure: contour plots over L/Lopt (0.5 to 2) and W/Wopt (0.5 to 1); one panel shows 2% delay contours with the best delay at (1,1) and 5% energy contours added, the other shows 2% E*D contours]
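The trade between delay and energy around the delay-optimal point can be explored with the same kind of Elmore model. The sketch below is illustrative only: the technology constants are placeholder assumptions, repeater parasitic output capacitance is ignored, and the numbers it prints are not the talk's measured or simulated contours.

```python
import math

# Placeholder technology numbers (assumptions, not from the talk).
R0 = 10e3          # output resistance of a unit-width repeater, ohms
C0 = 2e-15         # input capacitance of a unit-width repeater, farads
R_WIRE = 100.0     # wire resistance, ohms per mm
C_WIRE = 0.2e-12   # wire capacitance, farads per mm


def delay_per_mm(seg_mm, width):
    """Elmore delay per mm for repeaters of relative size width every seg_mm."""
    driver = 0.69 * (R0 / width) * (C_WIRE * seg_mm + C0 * width)
    wire = R_WIRE * seg_mm * (0.38 * C_WIRE * seg_mm + 0.69 * C0 * width)
    return (driver + wire) / seg_mm


def cap_per_mm(seg_mm, width):
    """Switched capacitance per mm (wire plus repeater input); energy scales with it."""
    return C_WIRE + C0 * width / seg_mm


# Closed-form minimum of delay_per_mm for this particular model.
L_OPT = math.sqrt(0.69 * R0 * C0 / (0.38 * R_WIRE * C_WIRE))
W_OPT = math.sqrt(R0 * C_WIRE / (R_WIRE * C0))


def relative_cost(l_scale, w_scale):
    """Return (delay ratio, energy ratio) relative to the delay-optimal design."""
    seg, width = l_scale * L_OPT, w_scale * W_OPT
    return (delay_per_mm(seg, width) / delay_per_mm(L_OPT, W_OPT),
            cap_per_mm(seg, width) / cap_per_mm(L_OPT, W_OPT))


if __name__ == "__main__":
    for l_scale, w_scale in [(1.0, 1.0), (1.5, 0.6), (2.0, 0.5)]:
        d, e = relative_cost(l_scale, w_scale)
        print(f"L/Lopt={l_scale}, W/Wopt={w_scale}: delay x{d:.2f}, energy x{e:.2f}")
```

Because the delay surface is flat near its minimum, moving toward longer segments and smaller repeaters costs little delay while removing a larger share of repeater capacitance.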
Repeated wires: limitations • Energy savings are limited to ~30% • Risetime/reliability constraints • Repeaters also add layout complexity • Lots of 16B-wide buses across chip • Via/wire and device congestion • Can we do better? • Save more energy and avoid complexity
Global wire characteristics • What can we exploit? • Wire regularity and homogeneity • Latency the same for each link • Design one, use everywhere • Very well-characterized wire environments • Global data synchronization to chip clock • Enables simple clock-based timing
Lowswing wires: architecture • Save a lot of power: V² • Save a lot of layout complexity [Figure: point-to-point link spanning the full link distance, with a transmitter, equalization circuits, differential twisted wires (bit and its complement), and a clocked amplifying receiver (clock Φ); signal levels marked between 0 and Vdd]
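The V² saving is simple arithmetic. The sketch below compares a full-swing wire with a low-swing one under two stated assumptions: that the low-swing wire is charged from a supply equal to its swing (which gives the quadratic saving), or that it is charged from the full supply and merely stopped early (which gives only a linear saving). The 1.8V supply and 100mV swing are taken from elsewhere in the deck and are used here as example inputs only; the wire capacitance is a placeholder.

```python
def energy_from_swing_supply(c_wire, v_swing):
    """Energy drawn per charging event when the supply voltage equals the swing."""
    return c_wire * v_swing ** 2


def energy_from_full_supply(c_wire, v_swing, vdd):
    """Energy drawn per charging event when charge c_wire * v_swing comes from vdd."""
    return c_wire * v_swing * vdd


if __name__ == "__main__":
    c_wire = 2e-12  # assumed 2 pF of wire capacitance (placeholder)
    vdd, v_swing = 1.8, 0.1
    full = energy_from_swing_supply(c_wire, vdd)
    low_quadratic = energy_from_swing_supply(c_wire, v_swing)
    low_linear = energy_from_full_supply(c_wire, v_swing, vdd)
    print(f"full swing:               {full * 1e12:.3f} pJ")
    print(f"low swing, swing supply:  {low_quadratic * 1e12:.3f} pJ")
    print(f"low swing, full supply:   {low_linear * 1e12:.3f} pJ")
```

The comparison ignores the second wire of the differential pair and all transmitter, receiver, and clock overhead.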
Lowswing wires: drivers • Use good linear resistance of NMOS devices • Lowswing, so we have small Vds • Devices sized for 10mm bus in 180nm • Emulates RC of 4.25mm bus in 100nm • Big challenge is driving long, lossy wires quickly [Figure: driver schematic with 8.1µm devices clocked by Φ, driving bit and its complement onto the wire with swing ΔV]
Lowswing wires: overdrive • Go faster: overdrive the transmitter voltage • A form of driver pre-emphasis • Speedup limited to about 2.5x • Leading edge of transition starts slowly [Figure: voltage (0 to 800mV) versus time (0 to 2τ, in time constants); annotated crossing times of 0.7τ, 0.4τ, and 0.3τ]
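The benefit of overdrive follows from the exponential step response of an RC node. The sketch below uses a lumped single-pole model, which is an assumption: the real wire is distributed, which is why its leading edge starts slowly and the speedup saturates. The 200mV target and the 400, 600, and 800mV drive levels are one reading of the plot's axis labels, not values stated in the talk; under that reading the lumped model reproduces the 0.7τ, 0.4τ, and 0.3τ annotations.

```python
import math


def time_to_target(v_target, v_drive):
    """Time, in units of tau, for a lumped RC node starting at 0V to reach
    v_target when driven toward v_drive: v(t) = v_drive * (1 - exp(-t / tau))."""
    if not 0 < v_target < v_drive:
        raise ValueError("need 0 < v_target < v_drive")
    return -math.log(1.0 - v_target / v_drive)


if __name__ == "__main__":
    target = 0.2  # volts (assumed target swing)
    for drive in (0.4, 0.6, 0.8):
        print(f"drive {drive:.1f} V: {time_to_target(target, drive):.2f} tau")
```

Driving harder shortens the time to reach the target, but the returns diminish with each additional step of overdrive.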
Lowswing wires: overdrive 2 • Overdrive costs extra power • Especially differential wires! (up & down) • Be careful when sending repeated bits • Wire will reach ±Voverdrive: hysteresis • So either stop driving on repeated bits • Requires some care to avoid noise • …Or balance out the wire after each bit • Wire equalization
Lowswing wires: equalization • Short the differential wires together • Makes the wires two-phase: prech & eval • Must do twice as much work • But it’s also twice as fast (pre-emphasis) [Figure: voltage (0 to 100%) versus time (0 to 2τ, in time constants)]
Lowswing wires: equalization 2 • Equalization shouldn’t cost any power • Activity factor is doubled (2X power) • But charge gets recycled (1/2X power) • Equalization saves overdrive power • Most of the wire does not split very far [Figure: one panel shows near-end and far-end voltage (0 to 0.4V) versus time (0 to 10 FO4s) across the drive and equalize phases; the other shows ΔV versus position along the wire (0% to 100%), with the 400mV overdrive and the 100mV target marked]
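The "2X activity, 1/2X charge" argument can be made concrete with an idealized charge-sharing model. This is a sketch under strong assumptions: each wire is a lumped capacitor that settles fully, the data is random, and the clock power of the equalizing devices is ignored. All names and values are illustrative.

```python
def energy_without_equalization(c_wire, v_swing, flip_probability=0.5):
    """Average supply energy per bit for a differential pair that holds its
    value: only a bit flip recharges one wire from 0 to v_swing."""
    return flip_probability * c_wire * v_swing ** 2


def energy_with_equalization(c_wire, v_swing):
    """Average supply energy per bit when the pair is shorted before every bit.

    Shorting two equal wires at 0 and v_swing leaves both at v_swing / 2
    without drawing supply charge; the next evaluation lifts one wire
    from the midpoint to v_swing and discharges the other to 0.
    """
    charge_from_supply = c_wire * (v_swing / 2)
    return charge_from_supply * v_swing


if __name__ == "__main__":
    c_wire, v_swing = 2e-12, 0.1  # placeholder capacitance and swing
    print(energy_without_equalization(c_wire, v_swing))
    print(energy_with_equalization(c_wire, v_swing))
```

In this model the two averages are equal: the wire now switches every cycle instead of half the time, but each switch draws only half the charge from the supply.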
Lowswing wires: equalization 3 • But: equalization is hard • Cannot use our overdrive trick • Requires balance devices along the wire • 10mm wire used balance gates every 2mm • Equalization costs a lot of clock power • Equalization makes activity factor = 1 • No longer data-dependent
Lowswing wires: receiver • PMOS inputs for low-voltage wires • Clocked with chip clock • Because data is clock aligned anyway • Very fast amplification: the smaller the input swing, the better! • So offset voltages are important [Figure: clocked receiver schematic with Vdd, clock Φ, inputs in+ and in-, outputs out+ and out-]
Lowswing wires: receiver 2 • Tradeoff in receiver sizing • Big receiver: less offset, more clk pwr • Small receiver: more offset, less clk pwr • Our design emphasized clock power • Small receiver w/ 3σ offsets of 90mV • Simulated via Monte Carlo in SPICE • In hindsight, a bad choice • Clock power mostly in balance circuits
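The sizing trade-off can be sketched numerically. The code below assumes Gaussian offsets and a Pelgrom-style rule in which offset sigma scales as one over the square root of device area; both are common modelling assumptions and are not stated in the talk. The 30mV sigma is simply the slide's 90mV 3σ figure divided by three, and the function names are hypothetical.

```python
import random


def sigma_after_resize(sigma_mv, area_scale):
    """Offset sigma after scaling input-device area by area_scale
    (assumes sigma is proportional to 1 / sqrt(area))."""
    return sigma_mv / area_scale ** 0.5


def fraction_exceeding(sigma_mv, limit_mv, trials=100_000, seed=0):
    """Monte Carlo estimate of the fraction of receivers whose offset
    magnitude exceeds limit_mv, assuming zero-mean Gaussian offsets."""
    rng = random.Random(seed)
    hits = sum(abs(rng.gauss(0.0, sigma_mv)) > limit_mv for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    sigma = 90.0 / 3.0  # mV, from the 3-sigma offset of 90mV
    for area_scale in (1, 2, 4):
        s = sigma_after_resize(sigma, area_scale)
        frac = fraction_exceeding(s, limit_mv=90.0)
        print(f"area x{area_scale}: sigma {s:.1f} mV, fraction beyond 90 mV: {frac:.4f}")
```

A larger receiver tightens the offset distribution, at the cost of more clocked capacitance, which is the trade-off the slide describes.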
Measured results • 180nm MOSIS/TSMC • 1.8V, 6 layers Al • 95ps FO4 delay • 9.2mm² testchip • Low-swing buses: 5, 7.5 and 10mm; 1-, 2-, and 3-b buses • Full-swing buses: 5mm and 10mm; 1-, 2-, and 3-b buses; optimally repeated • Offset block (re-spun on 250nm National process)