
21.1 Efficient On-Chip Global Interconnects

Ron Ho, Ken Mai, Mark Horowitz. Computer Systems Laboratory, Department of Electrical Engineering, Stanford University. On-chip wires and scaling: what do technology trends say? Local repeated wires keep up with gates; global repeated wires do not.



  1. 21.1 Efficient On-Chip Global Interconnects • Ron Ho, Ken Mai, Mark Horowitz • Computer Systems Laboratory, Department of Electrical Engineering, Stanford University

  2-4. On-chip wires and scaling • What do technology trends say? • Local repeated wires keep up with gates • Global repeated wires do not [Figure: wire delay / gate delay (log scale, 1 to 1000) with x-axis ticks from 0.18 down to 0.013; one panel for scaled-length wires and one for fixed-length wires, each showing mid-layer and top-layer metals]

  5. On-chip wires and scaling (2) • What are designers doing about this? • Answer: Modular architectures • High-performance scaled compute cores • Long-wire, high-bandwidth global network • Multi-core CPUs: IBM Power, Mitsubishi Electric M32R, Cisco Toaster • Research architectures: MIT “RAW”, UT-Austin “GPA”, Stanford “Smart Memories”

  6-9. On-chip wires and scaling (3) • Smart Memories on-chip wire network • Grid connects together computation tiles • 4.25mm-long 16B buses (in 0.1µm tech) • How do we build such a network? [Figure: 5cm² die, 0.3cm² quad, 0.075cm² tile containing CPU and SRAM, and the global wire grid]

  10. Outline • Review of repeated wires • Global wire characteristics • Low-swing architecture • Basic components • Techniques for lower latency • Testchip and measured results • Summary

  11-12. Repeated wires: review • Traditional solution for long wires • CMOS inverters as repeater stages • Makes delay linear with wire length • Improves latency and bandwidth • Pick Lwire, Wgate to minimize delay [Figure: repeater chain annotated with Lwire and Wgate]
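The "delay linear with wire length" point can be illustrated with a minimal sketch. It assumes a first-order RC model with the usual 0.38 (distributed) and 0.69 (lumped) 50%-delay coefficients; every parameter value and function name below is hypothetical and not taken from the talk.

```python
# All parameter values below are hypothetical, chosen only for illustration.
R_WIRE = 100.0      # wire resistance, ohm per mm
C_WIRE = 0.2e-12    # wire capacitance, farad per mm
R_DRV = 100.0       # repeater output resistance, ohm
C_GATE = 100e-15    # repeater input capacitance, farad


def unrepeated_delay(length_mm):
    """50% delay of a bare distributed RC wire: grows with length squared."""
    return 0.38 * R_WIRE * C_WIRE * length_mm ** 2


def repeated_delay(length_mm, seg_mm=1.0):
    """Delay of the same wire cut into equal repeated segments: grows linearly."""
    r_seg = R_WIRE * seg_mm
    c_seg = C_WIRE * seg_mm
    # Driver charging the segment plus next gate, then the segment's own RC.
    seg_delay = 0.69 * R_DRV * (c_seg + C_GATE) + r_seg * (0.38 * c_seg + 0.69 * C_GATE)
    return (length_mm / seg_mm) * seg_delay


for length in (2.5, 5.0, 10.0):
    print(f"{length:5.1f} mm  bare: {unrepeated_delay(length) * 1e12:7.1f} ps"
          f"  repeated: {repeated_delay(length) * 1e12:7.1f} ps")
```

Doubling the length quadruples the bare-wire delay but only doubles the repeated-wire delay, so repeaters pay off once the wire is long enough.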

  13-18. Repeated wires: power • The problem: wire and repeater power • For Smart Memories: > 100W at peak BW • Optimal delay != optimal power • Save 30% energy for 11% delay cost [Figure: 2% delay contours, 5% energy contours, and 2% E*D contours over L/Lopt (0.5 to 2) and W/Wopt (0.5 to 1); best delay at (1,1)]
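A rough sketch of why "optimal delay != optimal power": in a first-order RC repeater model, delay per unit length is flat near its minimum, while switched capacitance keeps falling as repeaters get smaller and sparser. The model and all device and wire numbers below are assumptions for illustration, so the percentages it prints will not match the 30% and 11% quoted on the slide.

```python
import math

# Hypothetical device and wire parameters (not from the talk).
R0 = 1.0e4       # output resistance of a unit-width repeater, ohm
C0 = 2.0e-15     # input capacitance per unit width, farad
CP = 2.0e-15     # output (parasitic) capacitance per unit width, farad
R_W = 100.0      # wire resistance, ohm per mm
C_W = 0.2e-12    # wire capacitance, farad per mm


def delay_per_mm(l, w):
    """First-order delay per mm for segment length l (mm) and repeater width w."""
    return (0.69 * R0 * (C0 + CP) / l + 0.69 * R0 * C_W / w
            + 0.38 * R_W * C_W * l + 0.69 * R_W * C0 * w)


def cap_per_mm(l, w):
    """Switched capacitance per mm; switching energy is proportional to this."""
    return C_W + (C0 + CP) * w / l


# Closed-form minimizers of delay_per_mm (the expression separates in l and w).
L_OPT = math.sqrt(0.69 * R0 * (C0 + CP) / (0.38 * R_W * C_W))
W_OPT = math.sqrt(R0 * C_W / (R_W * C0))


def relative(l_ratio, w_ratio):
    """Return (delay, energy) relative to the delay-optimal design point."""
    l, w = l_ratio * L_OPT, w_ratio * W_OPT
    return (delay_per_mm(l, w) / delay_per_mm(L_OPT, W_OPT),
            cap_per_mm(l, w) / cap_per_mm(L_OPT, W_OPT))


for l_ratio, w_ratio in ((1.0, 1.0), (1.5, 0.75), (2.0, 0.5)):
    d, e = relative(l_ratio, w_ratio)
    print(f"L/Lopt={l_ratio:.2f} W/Wopt={w_ratio:.2f}  delay x{d:.3f}  energy x{e:.3f}")
```

Moving down and to the right of (1,1), toward longer segments and narrower repeaters, trades a small delay penalty for a larger energy saving, which is the direction the contour plot points.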

  19. Repeated wires: limitations • Energy savings are limited to ~30% • Risetime/reliability constraints • Repeaters also add layout complexity • Lots of 16B-wide buses across chip • Via/wire and device congestion • Can we do better? • Save more energy and avoid complexity

  20. Global wire characteristics • What can we exploit? • Wire regularity and homogeneity • Latency the same for each link • Design one, use everywhere • Very well-characterized wire environments • Global data synchronization to chip clock • Enables simple clock-based timing

  21-27. Lowswing wires: architecture • Save a lot of power: V² • Save a lot of layout complexity [Figure: link schematic between Vdd and 0 with clock Φ and two nodes labeled bit; callouts in build order: "Transmitter Equalization circuits", "Differential, twisted wires", "Clocked amplifying receiver", "Point-to-point link distance"]
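The "V²" saving can be made concrete with a short calculation. This is a sketch only: it assumes the wire is charged from a dedicated supply at the swing voltage, so the energy drawn per charging event is C·V², and it counts wire switching energy alone (no clock, receiver, or second differential wire). The capacitance and the 100 mV swing are hypothetical; 1.8 V is the supply voltage listed with the measured results.

```python
def switching_energy(c_farads, v_swing):
    """Energy drawn from a supply at v_swing to charge c_farads from 0 to v_swing."""
    return c_farads * v_swing ** 2


C_BUS_WIRE = 2.0e-12   # hypothetical capacitance of one long wire, farads
VDD = 1.8              # full-swing supply voltage, volts
V_SWING = 0.1          # hypothetical low-swing level, volts

full = switching_energy(C_BUS_WIRE, VDD)
low = switching_energy(C_BUS_WIRE, V_SWING)
print(f"full swing: {full * 1e12:.3f} pJ  low swing: {low * 1e12:.3f} pJ"
      f"  ratio: {low / full:.4f}")
```

Because the ratio goes as the square of the swing, even after doubling for a differential pair the wire energy stays a small fraction of the full-swing value under these assumptions.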

  28-29. Lowswing wires: drivers • Use the good linear resistance of NMOS devices • Lowswing, so we have small Vds • Devices sized for 10mm bus in 180nm • Emulates RC of 4.25mm bus in 100nm • Big challenge is driving long, lossy wires quickly [Figure: driver schematic with ΔV supply, clock Φ, two nodes labeled bit, the wire, and devices labeled 8.1µm]

  30-34. Lowswing wires: overdrive • Go faster: overdrive the transmitter voltage • A form of driver pre-emphasis • Speedup limited to about 2.5x • Leading edge of transition starts slowly [Figure: voltage (0 to 800mV) versus time (0 to 2τ, in time constants), with crossing times labeled 0.7τ, 0.4τ, and 0.3τ]
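The three time labels on the overdrive plot are consistent with a simple single-pole RC step response. The sketch below assumes, from the plot's axis ticks rather than from anything stated in the talk, that the receiver target is 200mV and that the transmitter drives toward 400mV, 600mV, or 800mV. A real distributed wire deviates from this lumped model, which is one reason the speedup saturates.

```python
import math


def time_to_target(v_target, v_drive):
    """Time, in units of tau, for a single-pole RC node driven from 0 toward
    v_drive to cross v_target (lumped-model assumption)."""
    if not 0.0 < v_target < v_drive:
        raise ValueError("need 0 < v_target < v_drive")
    return -math.log(1.0 - v_target / v_drive)


V_TARGET = 0.2  # volts; assumed from the plot residue
baseline = time_to_target(V_TARGET, 0.4)
for v_drive in (0.4, 0.6, 0.8):
    t = time_to_target(V_TARGET, v_drive)
    print(f"drive to {v_drive * 1000:.0f} mV: {t:.2f} tau  (speedup x{baseline / t:.2f})")
```

Under these assumptions the crossing times round to 0.7, 0.4, and 0.3 time constants, and the largest speedup shown is a little under 2.5x.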

  35. Lowswing wires: overdrive 2 • Overdrive costs extra power • Especially differential wires! (up & down) • Be careful when sending repeated bits • Wire will reach +/-Voverdrive (hysteresis) • So either stop driving on repeated bits • Requires some care to avoid noise • …Or balance out the wire after each bit • Wire equalization

  36-38. Lowswing wires: equalization • Short the differential wires together • Makes the wires two-phase: prech & eval • Must do twice as much work • But it’s also twice as fast (pre-emphasis) [Figure: voltage (0 to 100%) versus time (0 to 2τ, in time constants)]

  39-41. Lowswing wires: equalization 2 • Equalization shouldn’t cost any power • Activity factor is doubled (2X power) • But charge gets recycled (1/2X power) • Equalization saves overdrive power • Most of the wire does not split very far [Figures: near-end and far-end voltage (0 to 0.4) versus time (0 to 10 FO4s) across the drive and equalize phases; ΔV (0 to 0.4) versus position along wire (0% to 100%), with 400mV overdrive and 100mV target marked]

  42. Lowswing wires: equalization 3 • But: equalization is hard • Cannot use our overdrive trick • Requires balance devices along the wire • 10mm wire used balance gates every 2mm • Equalization costs a lot of clock power • Equalization makes activity factor = 1 • No longer data-dependent

  43-45. Lowswing wires: receiver • PMOS inputs for low-voltage wires • Clocked with chip clock • Because data is clock aligned anyway • Very fast amplification: the smaller the input swing, the better! • So offset voltages are important [Figure: clocked receiver schematic with Vdd, clock Φ, inputs in+ and in-, outputs out+ and out-]

  46. Lowswing wires: receiver 2 • Tradeoff in receiver sizing • Big receiver: less offset, more clk pwr • Small receiver: more offset, less clk pwr • Our design emphasized clock power • Small receiver w/ 3σ offsets of 90mV • Simulated via Monte Carlo in SPICE • In hindsight, a bad choice • Clock power mostly in balance circuits
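To see why receiver offset matters at small swings, here is a toy Monte Carlo. It is not the SPICE simulation mentioned on the slide: it simply assumes the input-referred offset is Gaussian with zero mean, reads the 90mV figure as a three-sigma offset (sigma = 30mV), and counts a receiver as failing if its offset magnitude exceeds the differential input swing. The swing values and function name are hypothetical.

```python
import random


def offset_failure_fraction(sigma_mv, swing_mv, n=100_000, seed=0):
    """Fraction of sampled receivers whose |offset| exceeds the input swing,
    i.e. receivers that would misread at least one data polarity."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    fails = sum(1 for _ in range(n) if abs(rng.gauss(0.0, sigma_mv)) > swing_mv)
    return fails / n


SIGMA_MV = 30.0  # assumed: 90 mV quoted offset divided by 3
for swing_mv in (45.0, 90.0, 200.0):
    frac = offset_failure_fraction(SIGMA_MV, swing_mv)
    print(f"swing {swing_mv:5.1f} mV -> failing fraction {frac:.4%}")
```

The failing fraction rises steeply as the swing approaches the offset spread, which is the tension between small swings (fast, low power) and small receivers (large offsets).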

  47-50. Measured results • 180nm MOSIS/TSMC • 1.8V, 6 layers Al • 95ps FO4 delay • 9.2mm² testchip • Low-swing buses: 5, 7.5 and 10mm; 1-, 2-, and 3-b buses • Full-swing buses: 5mm and 10mm; 1-, 2-, and 3-b buses; optimally repeated • Offset block (re-spun on 250nm National process)
