PROPEL: Power & Area-Efficient, Scalable Opto-Electronic Network-on-Chips (NoCs) Thesis Defense Randy W. Morris, Jr. Affiliation: EECS, Ohio University E-mail: rm700603@ohio.edu Advisor: Avinash Kodi
Outline • Motivation & Background • PROPEL: Architecture • PROPEL: Implementation • Performance Analysis • Conclusion
Why Chip Multi-Processor? (1/2) After 2002, single-core designs show diminishing returns. Courtesy: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Morgan Kaufmann, San Francisco, 2007.
Why Chip Multi-Processor? (2/2) Courtesy: G. Konstadinidis et al., “Architecture and Physical Implementation of a Third Generation 65 nm, 16 Core, 32 Thread Chip-Multithreading SPARC Processor” Examples: RAW, Core 2 Duo, Quad Core, UltraSPARC
Wire Delay Problem [Figure: 20 mm × 20 mm chips labeled Past, Present and FUTURE, with 4, 16 and 64 cores.] • Wire delay is proportional to the wire's RC constant. • As technology scales, wire resistance increases while capacitance remains constant.
Network-on-Chip (NoC) [Figure: 16 processing cores (Core 0 to Core 15), each attached to a router, connected by links. Router components: Route Computation (RC), Virtual Channel (VC) buffers, Switch Allocator (SA), crossbar switch, credits in/out, and ports for the core and the +X, -X, +Y and -Y directions.]
Power Dissipation [Figure: Intel Tera-Flops (65 nm) tile power and routing power breakdown.] Courtesy: Y. Hoskote, “A 5-GHz Mesh Interconnect for A Teraflops Processor,” IEEE Computer Society, 2007, pp. 51-61 • 28% of a tile’s overall power goes to the router and links • Link power will account for a growing share of router power in future VLSI technologies • Router and link power should be about 10-15% of the tile’s power budget • Potential solutions: optics, RF and 3D stacking
Why use Optics? • Lower latency • Higher bandwidth (WDM, SDM & TDM) • Increased bandwidth density (compact parallel optics) • Low power (1.1 mW/Gbps) • Bit-rate independent of distance • Lower cross-talk • Does not suffer from impedance mismatch and signal reflection • Low signal attenuation
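Electrical Interconnect • R = wire resistance per unit length • C = wire capacitance per unit length • Cp = inverter output capacitance • C0 = inverter input capacitance • Rs = inverter resistance • Sopt = inverter size • Lopt = wire distance between inverters [Figure: RC link of wire segments of length lopt (R, C per unit length) driven by inverters of size sopt (rs, Cp, C0).]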
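The slide names the parameters of a repeated RC link but the sizing formula itself did not survive extraction. As a rough sketch, assuming the slide refers to the standard textbook model for an optimally repeated wire, the segment length and inverter size could be computed as below. The formulas are the commonly quoted ones for that model, not something stated in the deck, and the function name is hypothetical.

```python
import math


def optimal_repeater(r, c, r_s, c_0, c_p):
    """Return (l_opt, s_opt) for an optimally repeated RC wire.

    Assumed model (not taken from the slides):
        l_opt = sqrt(2 * r_s * (c_0 + c_p) / (r * c))
        s_opt = sqrt(r_s * c / (r * c_0))

    r   : wire resistance per unit length
    c   : wire capacitance per unit length
    r_s : output resistance of a minimum-sized inverter
    c_0 : input capacitance of a minimum-sized inverter
    c_p : output capacitance of a minimum-sized inverter
    """
    if min(r, c, r_s, c_0) <= 0 or c_p < 0:
        raise ValueError("parameters must be positive (c_p may be zero)")
    l_opt = math.sqrt(2.0 * r_s * (c_0 + c_p) / (r * c))
    s_opt = math.sqrt(r_s * c / (r * c_0))
    return l_opt, s_opt
```

Both results shrink as wire resistance per unit length grows, which is consistent with the deck's point that scaled wires need more repeaters and therefore more power.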
ITRS 2007 Transistor & Link Parameters [Table: electrical link device parameters for various VLSI technologies.] • Wire delay increases due to the RC constant • The Ioffn and Ishortckt current parameters increase
Optical Interconnect [Figure: an on-chip optical layer above an electronics layer. Optical layer: off-chip laser, on-chip modulator, transmission medium, photodetector. Electronics layer: buffer chain and driver on the transmitter side; TIA and limiting amplifier on the receiver side.]
Micro-ring Resonators • Resonant wavelength ($\lambda_0$): $\lambda_0 \, m = n_{eff} \cdot 2\pi R$, where $m$ is an integer, $n_{eff}$ is the effective refractive index and $R$ is the radius of the ring resonator • CMOS compatible • Low power (0.1 mW) • Small footprint (10 μm) • High bandwidth (10 Gb/s) [Figure: micro-ring resonator with n+/p+/n+ doped regions and control voltage $V_R$, shown for $V_R = V_{OFF}$ and $V_R = V_{ON}$, with Input Port 0, Output Port 0 and Output Port 1.]
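To make the resonance condition concrete, here is a minimal sketch that reads it in the usual ring-resonator form, m·λ0 = n_eff·2πR. The function names are hypothetical and any numeric inputs (effective index, radius, target wavelength) are illustrative assumptions, not values from the slides.

```python
import math


def resonant_wavelength(n_eff, radius, m):
    """Wavelength satisfying m * lambda_0 = n_eff * 2 * pi * R."""
    if m <= 0:
        raise ValueError("mode number m must be a positive integer")
    return n_eff * 2.0 * math.pi * radius / m


def nearest_mode(n_eff, radius, target_wavelength):
    """Integer mode number whose resonance lies closest to the target."""
    return max(1, round(n_eff * 2.0 * math.pi * radius / target_wavelength))


# Illustrative, assumed values: n_eff = 2.5, R = 5 um, target near 1550 nm.
m = nearest_mode(2.5, 5e-6, 1550e-9)
lambda_0 = resonant_wavelength(2.5, 5e-6, m)
```

Because m must be an integer, only a discrete set of wavelengths resonate in a given ring, which is what lets one ring select a single wavelength out of a multi-wavelength signal.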
Waveguide & Receiver
[1] N. Kirman et al., “Leveraging Optical Technology in Future Bus-based Chip Multiprocessors,” 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, Vol. 9, Iss. 13, Dec. 2006, pg. 492 – 50
[2] S. Koester et al., “Ge-on-SOI-Detector/Si-CMOS-Amplifier Receivers for High-Performance Optical-Communication Applications,” Journal of Lightwave Technology, Vol. 25, No. 1, January 2007
[3] C. Kromer et al., “A 100-mW 4X10 Gb/s Transceiver in 80-nm CMOS for High-Density Optical Interconnects,” IEEE Journal of Solid-State Circuits, Vol. 40, No. 12, December 2005
[4] D. Kuchta et al., “120-Gb/s VCSEL-based parallel-optical interconnect and custom 120-Gb/s testing station,” Journal of Lightwave Technology, Vol. 22, No. 9, pp. 2200-2212, Sept. 2004
Electrical/Optical Comparison [Figure: power-delay product at various technology nodes for a 5 mm link.] Optics is more advantageous at 52 nm for global and 45 nm for semi-global interconnects.
Critical Length • Critical length is the link distance beyond which an optical link becomes more advantageous than an electrical one. [Figure annotation: core-to-core distance.]
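A minimal sketch of the critical-length idea, under a deliberately simplified assumption: the electrical link's cost grows linearly with length while the optical link's cost is a fixed amount independent of distance. The deck compares power-delay products, so treating cost as a single scalar, along with the function names and any input values, is an assumption made only for illustration.

```python
def critical_length_mm(electrical_cost_per_mm, optical_fixed_cost):
    """Length at which a linear electrical cost equals a fixed optical cost."""
    if electrical_cost_per_mm <= 0:
        raise ValueError("electrical cost per mm must be positive")
    return optical_fixed_cost / electrical_cost_per_mm


def better_link(length_mm, electrical_cost_per_mm, optical_fixed_cost):
    """Pick the cheaper link type for a given length under the assumed model."""
    threshold = critical_length_mm(electrical_cost_per_mm, optical_fixed_cost)
    return "optical" if length_mm > threshold else "electrical"
```

Under this model, links shorter than the critical length stay electrical and longer ones go optical, which matches the deck's stated balance between optics and electronics.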
Advantages of PROPEL • Efficient use of optical components • Balance between optics and electronics • Simple network design: low diameter, dimension-order routing (DOR) • Scalability • Fault tolerance
PROPEL’s Design [Figure: 64 cores (0 to 63) grouped into tiles connected by an optical interconnect and fed by a broadband light source (wavelengths 0, 1, 2, …). Each tile, for example Tile 0, holds four cores (Core 0 to Core 3), an L2 cache and photonic transceivers.]
PROPEL’s Routing & Wavelength Assignment (x-direction) [Figure: Tiles 0 to 3, each with four cores and an L2 cache (16 cores in total), connected by Home Channels 0 to 3 carrying a broadband signal. Wavelength labels shown include λ1(0,0), λ2(0,0), λ3(0,0), λ0(1,0), λ2(2,0) and λ3(2,0), and the combined signals λ1(0,0)+λ2(0,0)+λ3(0,0) and λ0(1,0)+λ2(1,0)+λ3(2,0).]
PROPEL’s 64 Wavelength Design • Research has shown that 64 wavelengths can traverse a single waveguide. [Figure: optical inter-tile communication channels fed by a laser. Tiles 0 to 3 each contain four cores with L1 caches, a shared L2, and X-/Y-transmitters and X-/Y-receivers. The 64 wavelengths are split into four groups: λ(0-15), λ(16-31), λ(32-47) and λ(48-63).]
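The grouping of 64 wavelengths into four blocks of 16 can be sketched as follows. The block boundaries (0-15, 16-31, 32-47, 48-63) come from the slide; which tile owns which block is not recoverable from the figure, so the code indexes blocks by a neutral group number. All names are hypothetical.

```python
NUM_WAVELENGTHS = 64   # wavelengths per waveguide, per the slide
NUM_GROUPS = 4         # four blocks of wavelengths, per the slide
GROUP_SIZE = NUM_WAVELENGTHS // NUM_GROUPS  # 16 wavelengths per block


def wavelength_group(group):
    """Wavelength indices in block `group` (0 -> 0..15, 3 -> 48..63)."""
    if not 0 <= group < NUM_GROUPS:
        raise ValueError("group out of range")
    start = group * GROUP_SIZE
    return range(start, start + GROUP_SIZE)


def group_of(wavelength):
    """Block number that a wavelength index belongs to."""
    if not 0 <= wavelength < NUM_WAVELENGTHS:
        raise ValueError("wavelength out of range")
    return wavelength // GROUP_SIZE
```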
PROPEL’s x- and y-direction Implementation [Figure: a grid of tiles fed by an off-chip laser. Each tile contains Core 0 to Core 3 with L1 caches, a shared L2, an X-transmitter, X-receiver, Y-transmitter and Y-receiver. On-chip DRAM is organized as Bank 0 to Bank 15.]
Memory Routing and Wavelength Assignment [Figure: memory Banks 0 to 3. The transmitter side (from laser, to CMP) and the receiver side (from CMP) each use wavelength groups λ0-15, λ16-31, λ32-47 and λ48-63.]
Communication Example • Tile 3 communicates with Tile 8. [Figure: tile grid fed by a laser. Each tile has Core 0 to Core 3 with L1 caches, a shared L2 cache, X-/Y-transmitters and receivers, and a router with Route Computation (RC), Virtual Channel (VC) buffers, Switch Allocator (SA), credits in/out and a crossbar switch with ports X0 to X2 and Y0 to Y2.]
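The Tile 3 to Tile 8 example can be sketched with dimension-order routing, which the deck lists as PROPEL's routing scheme. This sketch assumes a 4×4 grid numbered row by row, X resolved before Y, and one optical hop per dimension; these are assumptions made for illustration, and the function names are hypothetical.

```python
GRID_WIDTH = 4  # assumed 4x4 tile grid, numbered row by row


def tile_coords(tile):
    """Map a tile number to (x, y) under the assumed row-major numbering."""
    return tile % GRID_WIDTH, tile // GRID_WIDTH


def tile_id(x, y):
    """Inverse of tile_coords."""
    return y * GRID_WIDTH + x


def dor_route(src, dst):
    """Tiles visited when routing X first, then Y, one hop per dimension."""
    sx, sy = tile_coords(src)
    dx, dy = tile_coords(dst)
    path = [src]
    if sx != dx:                      # x-direction hop along the source row
        path.append(tile_id(dx, sy))
    if sy != dy:                      # y-direction hop along the destination column
        path.append(tile_id(dx, dy))
    return path


print(dor_route(3, 8))  # prints [3, 0, 8]
```

Under these assumptions the packet takes one x-direction hop to Tile 0 and one y-direction hop to Tile 8, so no route needs more than two hops.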
Modulation Implementation [Figure: a broadband signal with per-wavelength modulation in groups λ0-15 (λ0 … λ15), λ16-31 (λ16 … λ31) and λ32-47 (λ32 … λ47).]
Multicasting & Broadcasting • Multicasting: single tile to multiple tiles. • Broadcasting: single tile to all-tile communication; uses 3 individual multicasts. [Figure: Tiles 0 to 15 with the sending tile and its communication links highlighted.]
Performance Evaluation • Cost & component comparison • Synthetic traffic (OPTISM): Uniform, Bit-reversal, Butterfly, Complement, Matrix transpose, Perfect shuffle • SPLASH-2 (Simics with GEMS and Garnet): FFT, LU, Radiosity and Ocean • Network topologies evaluated: electrical (Mesh, Cmesh and Flattened-butterfly) and optical (Circuit-switch, Shared-bus and Corona)
Electronic Parameters
[Figure: router with Route Computation (RC), Virtual Channel (VC) buffers, Switch Allocator (SA), crossbar switch, credits in/out, +X/-X/+Y/-Y ports and a Processing Element (PE).]
• Crossbar (0.8 mW/flit): $E_{sw} = w_f \times (C_{xbi} + C_{xbo}) \, V_{DD}^2$
• VC buffer (4.03 mW/flit): $P_{write} = P_{wordline} + (2 \times F \times P_{bitline}) + (F \times P_{memory\text{-}cell})$ and $P_{read} = P_{wordline} + F \times (P_{bitline,r} + P_{chg})$
• Electrical link (22 mW/mm): $P_{link} = P_{dynamic} + P_{leakage} + P_{short\text{-}ckt}$
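The four expressions on this slide translate directly into code. The sketch below evaluates them as written; the slide does not define w_f or F, so they are passed through as plain multipliers, and all function and argument names are hypothetical.

```python
def crossbar_energy(w_f, c_xbi, c_xbo, v_dd):
    """E_sw = w_f * (C_xbi + C_xbo) * V_DD^2."""
    return w_f * (c_xbi + c_xbo) * v_dd ** 2


def buffer_write_power(p_wordline, p_bitline, p_memory_cell, f):
    """P_write = P_wordline + 2*F*P_bitline + F*P_memory_cell."""
    return p_wordline + 2 * f * p_bitline + f * p_memory_cell


def buffer_read_power(p_wordline, p_bitline_r, p_chg, f):
    """P_read = P_wordline + F*(P_bitline_r + P_chg)."""
    return p_wordline + f * (p_bitline_r + p_chg)


def link_power(p_dynamic, p_leakage, p_short_ckt):
    """P_link = P_dynamic + P_leakage + P_short_ckt."""
    return p_dynamic + p_leakage + p_short_ckt
```

The quadratic dependence on V_DD in the crossbar term is why supply-voltage scaling has such a large effect on switching energy.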
Optical Parameters • Micro-ring modulator: 0.1 mW • Receiver circuitry: 1.1 mW/Gbps [Figure: on-chip optical layer (off-chip laser, on-chip modulator, transmission medium, photodetector) above the electronics layer (buffer chain, driver, TIA, limiting amplifier).]
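As a back-of-the-envelope sketch, the two figures on this slide can be combined into a per-wavelength link power. The sketch assumes one modulator and one receiver per wavelength, leaves out laser power and any other circuitry, and takes the bit rate as an input; 10 Gbps per wavelength is an assumed rate used only in the example. The function name is hypothetical.

```python
MODULATOR_MW = 0.1            # micro-ring modulator power, from the slide
RECEIVER_MW_PER_GBPS = 1.1    # receiver circuitry power, from the slide


def optical_link_power_mw(bit_rate_gbps, wavelengths=1):
    """Modulator plus receiver power for `wavelengths` channels (laser excluded)."""
    if bit_rate_gbps < 0 or wavelengths < 0:
        raise ValueError("bit rate and wavelength count must be non-negative")
    per_wavelength = MODULATOR_MW + RECEIVER_MW_PER_GBPS * bit_rate_gbps
    return wavelengths * per_wavelength


# Assumed example: a single wavelength at 10 Gbps.
single_channel_mw = optical_link_power_mw(10.0)
```

Unlike the electrical link figure, which is quoted per millimetre, nothing in this estimate depends on distance.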
Component Comparison • PROPEL is the most cost-effective NoC.
Synthetic Traffic Trace
• Uniform traffic: every node is equally likely to be a packet's destination.
• Bit-reversal: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; destination $a_0, a_1, \ldots, a_{n-2}, a_{n-1}$
• Butterfly: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; destination $a_0, a_{n-2}, \ldots, a_1, a_{n-1}$
• Complement: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; destination $a_{n-1}', a_{n-2}', \ldots, a_1', a_0'$
• Matrix transpose: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; destination $a_{n/2-1}, \ldots, a_0, a_{n-1}, \ldots, a_{n/2}$
• Perfect shuffle: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; destination $a_{n-2}, a_{n-3}, \ldots, a_0, a_{n-1}$
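The permutation patterns above are bit manipulations on an n-bit node address, so they are easy to sketch in code. This is a minimal illustration rather than the simulator's implementation; the function names are hypothetical, and matrix transpose is read as swapping the upper and lower halves of the address, which requires an even n.

```python
import random


def bit_reversal(src, n):
    """Reverse the order of the n address bits."""
    dst = 0
    for i in range(n):
        if src & (1 << i):
            dst |= 1 << (n - 1 - i)
    return dst


def butterfly(src, n):
    """Swap the most and least significant address bits."""
    msb = (src >> (n - 1)) & 1
    lsb = src & 1
    if msb == lsb:
        return src
    return src ^ ((1 << (n - 1)) | 1)


def complement(src, n):
    """Invert every address bit."""
    return ~src & ((1 << n) - 1)


def matrix_transpose(src, n):
    """Swap the upper and lower halves of the address (n must be even)."""
    if n % 2:
        raise ValueError("matrix transpose needs an even number of bits")
    half = n // 2
    low = src & ((1 << half) - 1)
    high = src >> half
    return (low << half) | high


def perfect_shuffle(src, n):
    """Rotate the address left by one bit."""
    msb = (src >> (n - 1)) & 1
    return ((src << 1) & ((1 << n) - 1)) | msb


def uniform(num_nodes, rng=random):
    """Pick a destination uniformly at random from all nodes."""
    return rng.randrange(num_nodes)
```

For a 64-core network, n = 6; for example, `complement(0, 6)` sends node 0 to node 63, and every permutation pattern maps each source to exactly one destination.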
Uniform Traffic Throughput • 25% improvement over Mesh • 9% improvement over Flattened-butterfly • Over 2× increase in performance over Circuit-switch, Cmesh and Shared-bus
Uniform Traffic Latency • PROPEL saturates at a network load of 0.5 • Saturates at a network load 0.1 higher than Flattened-butterfly • Saturates at a 2× higher network load than Shared-bus and Circuit-switch
Bit-Reversal Traffic Latency • PROPEL saturates at a network load of 0.25 • Saturates at a network load 0.25 higher than Flattened-butterfly • Saturates at a 1.5× higher network load than Shared-bus and Circuit-switch
Complement Traffic Latency • Networks with core concentration create communication hotspots.
Matrix Transpose Traffic Latency • PROPEL saturates at a network load of 0.3 • Circuit-switch saturates at a higher load than the electrical networks
Synthetic Traffic Power Dissipation 5× Reduction In Power
Simics Parameters • Simics is a full system simulator from Virtutech
SPLASH-2 Benchmarks • The FFT kernel is a 1-dimensional version of the radix-$\sqrt{n}$ six-step FFT algorithm. • The LU kernel factors a dense matrix into lower and upper triangular matrices. • Radiosity is a graphics kernel that computes the equilibrium distribution of light in a scene. • The Ocean application simulates large-scale ocean movements based on eddy and boundary currents.
Conclusion • PROPEL is a low-power, high-bandwidth NoC for future many-core processors. • PROPEL uses electronics for packet switching and optics for inter-router communication, which reduces the number of both electrical and optical components. • PROPEL uses the fewest optical components and the least area of the opto-electronic networks compared. • PROPEL outperforms well-known network topologies while dissipating less power.
Future Work • Use optics to go to memory • Dynamic Bandwidth • Dynamic Voltage Scaling • Application Integration with the NoC
Examples of NoCs (1/2)
• Torus. Advantages: reduced hop count; DOR. Disadvantages: difficult to integrate on-chip.
• Mesh. Advantages: simple to integrate on-chip; DOR. Disadvantages: high hop count.
Examples of NoCs (2/2)
• Flattened-butterfly. Advantages: max hop count of 2; reduced power dissipation. Disadvantages: not easily scalable.
• Cmesh. Advantages: reduced network diameter; fewer routers. Disadvantages: multiple cores share the same ports.
PROPEL Multicasting Example • Multicast example: Tile 0 communicates the same data to Tiles 1, 2 and 3. [Figure: Tiles 0 to 3 fed by a laser, each with four cores and L1 caches, a shared L2, and X-/Y-transmitters and X-/Y-receivers.]
PROPEL’s Implementation (3/4) [Figure: transmitters and receivers for Tiles 0 to 3 fed by an off-chip laser. Wavelength groups λ0-15, λ16-31, λ32-47 and λ48-63 are used on the paths to memory and from memory.]
PROPEL’s Design: 64-Wavelength Assignment • Research has shown that 64 wavelengths can traverse a single waveguide. • Wavelengths used for PROPEL are extended from 4 to 64.
PROPEL Broadcasting • Single tile to all-tile communication. • Uses 3 individual multicasts. [Figure: Tiles 0 to 15 with the sending tile and its communication links highlighted.]
Electrical Link Power Dissipation Optical Power Dissipation