CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect

M. Frank Chang*, Jason Cong, Adam Kaplan, Mishali Naik, Glenn Reinman, Eran Socher*, Sai-Wang Tam* • UCLA Computer Science Department • *UCLA Electrical Engineering Department • HPCA 2008

Network on Chip Challenges • Chip Multiprocessor Trends • Number of cores on chip increasing • Increased bandwidth demand on interconnect • Wires scaling poorly compared to transistors • Increased latency cost to communicate between distant points on die • Demands on future interconnect • Scalable NoC topology • Support high traffic volume with low latency • Constrained by power, silicon area, and compatibility with mainstream CMOS technology

Bandwidth Efficiency Problem in Repeaters @ 90nm CMOS Technology • Data rate: 4 Gbit/s • fT of 90nm CMOS can be as high as 120GHz • Signal bandwidth is only about 4GHz • 96% of the available bandwidth goes unused • Open question: how to use the extra bandwidth of CMOS for a higher data rate? • Answer: use well-developed RF techniques for data transmission
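The unused fraction quoted on this slide follows from the ratio of signal bandwidth to fT. The short check below is only a sketch: it treats fT as the available bandwidth, as the slide does, and the variable names are illustrative.

```python
# Figures quoted on the slide (90nm CMOS repeater link).
f_t_ghz = 120.0        # transistor fT, treated here as the available bandwidth
signal_bw_ghz = 4.0    # bandwidth occupied by the 4 Gbit/s baseband signal

used_fraction = signal_bw_ghz / f_t_ghz
unused_fraction = 1.0 - used_fraction

print(f"used:   {used_fraction:.1%}")    # used:   3.3%
print(f"unused: {unused_fraction:.1%}")  # unused: 96.7%
```

The result is consistent with the slide's figure of roughly 96%.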

Multi-Band RF-Interconnect • [Figure: data streams data 1 … data 10, each with a mixer at its own carrier frequency f1 … f10 on both ends of a shared transmission line, followed by a low-pass filter and an output buffer] • N different data streams (N=10 in the figure) may transmit on the same transmission line simultaneously
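To make the idea of several streams sharing one line concrete, here is a minimal numerical sketch of frequency-division multiplexing. It is not the circuit from the talk: it assumes simple on-off keying, an ideal noiseless line, and carriers with a whole number of cycles per bit period so that they are orthogonal. `SAMPLES_PER_BIT`, `CARRIER_CYCLES`, `transmit`, and `receive` are hypothetical names.

```python
import math

SAMPLES_PER_BIT = 64          # assumed simulation resolution
CARRIER_CYCLES = [4, 8, 12]   # assumed carriers: whole cycles per bit period

def carrier(cycles, k):
    """Carrier sample k within one bit period."""
    return math.cos(2 * math.pi * cycles * k / SAMPLES_PER_BIT)

def transmit(streams):
    """Mix each bit stream onto its own carrier and sum them on one line."""
    n_samples = len(streams[0]) * SAMPLES_PER_BIT
    line = [0.0] * n_samples
    for bits, cycles in zip(streams, CARRIER_CYCLES):
        for i in range(n_samples):
            bit = bits[i // SAMPLES_PER_BIT]
            line[i] += bit * carrier(cycles, i % SAMPLES_PER_BIT)
    return line

def receive(line, cycles):
    """Mix with the same carrier, then average over each bit period.

    The averaging plays the role of the low-pass filter: other carriers
    average to zero, the matching one leaves 0.5 * bit.
    """
    bits = []
    for start in range(0, len(line), SAMPLES_PER_BIT):
        acc = sum(line[start + k] * carrier(cycles, k)
                  for k in range(SAMPLES_PER_BIT))
        bits.append(1 if acc / SAMPLES_PER_BIT > 0.25 else 0)
    return bits

streams = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
line = transmit(streams)
recovered = [receive(line, c) for c in CARRIER_CYCLES]
print(recovered == streams)
```

All three streams occupy the same list of samples at the same time, yet each is recovered independently by its own carrier.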

Future Trends in RF-I

Interconnect Topology Comparison • Comparison across process technology of… • Traditional RC parallel bus • RF-Interconnect • Optical Interconnect • As process technology scales toward 22nm… • RF-I has lowest latency • RF-I consumes least energy • RF-I has highest data rate density • RF-I is compatible with current CMOS technology

RF-I Physical / Logical Organization • Physically • RF-I is shared bundle of transmission lines • Connected to and shared between set of RF-enabled routers • Logically • RF-I behaves as set of N express channels • Each channel assigned to source, destination router pair (s,d) • Both s and d must be RF-enabled

Architectural Challenges of RF-I • How many/which routers should be RF-enabled? • How many RF-I ports should each router have? • Dedicated or multiplexed with other ports? • How much RF-I bandwidth to allocate? • Total? Per communicating pair? • Impacts active layer area consumed by RF-I components • Which routing strategy to employ in presence of RF-I express channels? • Dynamic or static allocation of frequency bands to sources/destinations • Dynamic: requires arbitration overhead for channel assignment • Static: may miss opportunity to match changing communication demand

Our decisions… • How many/which routers should be RF-enabled? • 16 routers (3 per quadrant and 4 in center) • How many RF-I ports should each router have? • 16 dedicated ports • How much RF-I bandwidth to allocate? • Start with 256B total, 16B per cycle per communicating pair • Which routing strategy to employ in presence of RF-I express channels? • Shortest-Path Routing • Dynamic or static allocation of frequency bands to sources/destinations • Static, to save overhead
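The bandwidth decision above implies a fixed number of express channels. The sketch below works that out and builds a static band-assignment table. The router IDs 0..15 and the pairing of neighbors in that list are purely hypothetical and do not reflect the actual router placement.

```python
TOTAL_RF_BYTES_PER_CYCLE = 256   # total RF-I bandwidth from the slide
BYTES_PER_CHANNEL = 16           # per communicating pair, per cycle

# Number of one-way express channels that fit in the total allocation.
n_channels = TOTAL_RF_BYTES_PER_CYCLE // BYTES_PER_CHANNEL

# Hypothetical IDs for the 16 RF-enabled routers; the pairing 0<->1, 2<->3, ...
# is illustrative only.
rf_routers = list(range(16))
static_channels = {}             # (source, destination) -> frequency band index
band = 0
for a, b in zip(rf_routers[0::2], rf_routers[1::2]):
    static_channels[(a, b)] = band       # a -> b
    static_channels[(b, a)] = band + 1   # b -> a
    band += 2

print(n_channels, "one-way channels")
print(n_channels // 2, "bidirectional shortcuts")
```

With static allocation this table never changes at run time, which is what saves the arbitration overhead; 16 one-way channels correspond to the 8 bidirectional shortcut links of the MORFIC slide.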

Baseline Mesh Interconnect Topology • 10x10 mesh of 5-cycle pipelined routers • NoC runs at 2GHz • XY/YX routing • 64 4GHz 3-wide processor cores containing • 8KB L1 Data Cache • 8KB L1 Instruction Cache • 32 L2 Cache Banks • 256KB each • Organized as shared NUCA cache • 4 Main Memory Interfaces • Labeled with + in the figure • [Figure: floorplan of the 10x10 mesh. Legend: R (square) = router, C (circle) = processor core, $ (diamond) = L2 cache bank, + (plus) = main memory interface]

MORFIC: Mesh Overlaid with RF-InterConnect • Shared Z-shaped RF waveguide • Organized as 8 bidirectional shortcut links • Each direction of each shortcut can transmit simultaneously over shared medium • Router A can send a flit to other router A, B to B, … H to H in a single cycle • Router labeled X cannot directly send to any router not labeled X • E.g. Router B in upper left cannot send to router E in upper right directly • However, B in upper left can send to B in upper right, and then north to E using normal mesh link • [Figure: logical organization and physical organization of the shortcuts, with RF-enabled routers labeled A through H]
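The B-to-B-then-mesh example can be sketched as a shortest-path search over the mesh plus shortcut edges. The coordinates of the two B routers and of the destination are hypothetical, y+1 is taken to mean "north", and every hop (mesh link or shortcut) is counted equally; none of these choices come from the talk.

```python
from collections import deque

N = 10  # 10x10 mesh

def neighbors(node, shortcuts):
    """Mesh neighbors of a router, plus its RF-I shortcut partner if any."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < N and 0 <= ny < N:
            yield (nx, ny)
    if node in shortcuts:
        yield shortcuts[node]

def shortest_path(src, dst, shortcuts):
    """Breadth-first search; returns the list of routers from src to dst."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in neighbors(node, shortcuts):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

# Hypothetical positions: the two "B" routers and a destination "E" just north.
B_LEFT, B_RIGHT, E_RIGHT = (1, 8), (8, 8), (8, 9)
shortcuts = {B_LEFT: B_RIGHT, B_RIGHT: B_LEFT}   # one bidirectional shortcut

mesh_only = shortest_path(B_LEFT, E_RIGHT, {})
with_rf = shortest_path(B_LEFT, E_RIGHT, shortcuts)
print(len(mesh_only) - 1, "hops on the mesh alone")
print(len(with_rf) - 1, "hops via the B-to-B shortcut")
```

In this toy placement the shortcut path is B (left), B (right), then one mesh hop north, instead of eight mesh hops.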

MORFIC Results For 256B Total RF-I • 256B RF-I consumes 0.18% silicon overhead on a 400 mm² die • RF-I components: 0.13%, Router overhead: 0.05% • Normalized Splash-2 Execution Time and Average Packet Latency Results • Normalized to baseline mesh run-cycles/latency at 1 • Average 13% (max 18%) performance improvement • Average 22% (max 24%) packet latency improvement
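For a sense of scale, the area percentages on this slide can be converted to absolute area. This is plain arithmetic on the quoted numbers; the variable names are illustrative.

```python
die_area_mm2 = 400.0
overhead_percent = {"RF-I components": 0.13, "Router overhead": 0.05}

total_percent = sum(overhead_percent.values())
for name, pct in overhead_percent.items():
    print(f"{name}: {die_area_mm2 * pct / 100:.2f} mm^2")
print(f"Total ({total_percent:.2f}%): {die_area_mm2 * total_percent / 100:.2f} mm^2")
```

The total works out to about 0.72 mm², of which about 0.52 mm² is RF-I components and 0.20 mm² is router overhead.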

Deadlock: To Avoid or Confront? • South-Last Strategy [Ogras and Marculescu, 2006] • Routes which can lead to circular buffer dependence are forbidden → avoids deadlock • Deadlock Detection & Recovery (DDR) • Based on Duato and Pinkston’s theory [Duato and Pinkston, 2001] • If deadlock occurs, route all packets in the network on a spare virtual channel • Use deadlock-free XY-routing • Packets entering network after this point may be routed normally

How to detect deadlock…? • Rather than detect that deadlock has occurred • Detect circular buffer dependency • Each router maintains a list of other routers waiting on it • When the buffer at neighbor router d is full, sender s transmits a waiting-list message to d • Bit vector indicating which routers are waiting on s, as well as s’s ID • If a router is “waiting on itself,” circular buffer dependency has occurred • Raise DEADLOCK condition • If d’s buffer empties, s sends a one-time clear-waiting-list message to reset state
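The waiting-list scheme can be sketched as repeated merging of bit vectors. This is a simplified model rather than the hardware protocol: each router is assumed to be blocked on at most one neighbor, messages are delivered instantly, and the clear-waiting-list message is not modeled. `detect_circular_wait` and `blocked_on` are hypothetical names.

```python
def detect_circular_wait(blocked_on, n_routers):
    """Return the ID of a router found "waiting on itself", or None.

    blocked_on[s] = d means sender s is stalled because d's buffer is full.
    waiting[r] is a bit vector: bit i set means router i is waiting on r.
    """
    waiting = [0] * n_routers
    changed = True
    while changed:                            # repeat until no list grows
        changed = False
        for s, d in blocked_on.items():
            # Waiting-list message: routers waiting on s, plus s's own ID.
            message = waiting[s] | (1 << s)
            merged = waiting[d] | message
            if merged != waiting[d]:
                waiting[d] = merged
                changed = True
            if waiting[d] & (1 << d):         # d appears in its own list
                return d                      # raise DEADLOCK condition
    return None

# Four routers blocked in a ring: 0 -> 1 -> 2 -> 3 -> 0.
print(detect_circular_wait({0: 1, 1: 2, 2: 3, 3: 0}, 4))
# A chain with no cycle: 0 -> 1 -> 2 -> 3.
print(detect_circular_wait({0: 1, 1: 2, 2: 3}, 4))
```

The loop terminates because bits are only ever added to the waiting lists, so a cycle shows up as soon as a router's own bit comes back around.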

Deadlock Results • South-Last strategy too restrictive • Halves the average realizable performance • Deadlock is best detected and recovered from when it occurs • Detection happens reasonably quickly • Performance during recovery no worse than baseline

Varying Total RF-I Bandwidth • Application performance can degrade by more than 400% for small RF-I allocations • Too many packets waiting at RF-I access points • RF-I shortcuts become bottlenecks

Router Activity and Congestion • [Figure: per-router maps of # flits sent and # stall cycles for four configurations: Baseline 10x10 Mesh (no RF-I), Mesh + 32B RF-I with 25% Usage, Mesh + 32B RF-I with 100% Usage, Mesh + 256B RF-I with 100% Usage] • Lighter shade represents more activity • X% usage: X% of packets are routed on the shortest path (which may use RF-I), and the remaining (100-X)% are XY/YX routed (no RF-I) • Utilizing 32B RF-I 25% of the time spreads router activity while avoiding bottlenecks at shortcut-access points (compared to 100% usage)
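One way to realize "X% usage" is a per-packet random choice between the two routing modes. The talk does not say how packets are selected, so the random draw below is an assumption, and `choose_route` is a hypothetical helper.

```python
import random

def choose_route(usage_percent, rng):
    """Pick the routing mode for one packet.

    With probability usage_percent/100 the packet takes the shortest path
    (which may include an RF-I shortcut); otherwise it is XY/YX routed.
    """
    if rng.random() * 100 < usage_percent:
        return "shortest-path"
    return "xy-yx"

rng = random.Random(0)   # fixed seed so the sketch is repeatable
choices = [choose_route(25, rng) for _ in range(10_000)]
fraction_rf = choices.count("shortest-path") / len(choices)
print(f"{fraction_rf:.1%} of packets were allowed onto the shortest path")
```

At 25% usage roughly a quarter of the packets compete for the shortcuts, which is the effect the activity maps illustrate.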

Varying RF-I Utilization (1 / 2) • Search-and-Set technique • Finds best utilization and locks it for rest of app execution • RF-I over-utilized: router stalls outnumber 2/3 * flits-sent this period • When over-utilized: drop utilization 10% and lock • [Figure: utilization vs. time]
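A minimal sketch of a Search-and-Set controller follows. Only the over-utilization test and the "drop 10% and lock" action come from the slide. The upward sweep, the starting value, the step size, locking at 100%, and reading "10%" as 10 percentage points are all assumptions, and the class and method names are hypothetical.

```python
def is_over_utilized(router_stalls, flits_sent):
    # Condition from the slide: stalls outnumber 2/3 of flits sent this period.
    return router_stalls > (2 / 3) * flits_sent

class SearchAndSet:
    def __init__(self, start=10, step=10):
        self.utilization = start   # percent of packets allowed to use RF-I
        self.step = step           # assumed sweep step, in percentage points
        self.locked = False

    def end_of_period(self, router_stalls, flits_sent):
        """Update utilization from one period's counters and return it."""
        if self.locked:
            return self.utilization
        if is_over_utilized(router_stalls, flits_sent):
            # Drop utilization 10% and lock for the rest of execution.
            self.utilization = max(0, self.utilization - 10)
            self.locked = True
        elif self.utilization >= 100:
            self.locked = True     # assumed: nothing higher to try
        else:
            self.utilization = min(100, self.utilization + self.step)
        return self.utilization

controller = SearchAndSet()
periods = [(10, 100), (20, 100), (90, 100), (5, 100)]   # (stalls, flits sent)
history = [controller.end_of_period(s, f) for s, f in periods]
print(history)   # [20, 30, 20, 20]
```

Once locked, later periods no longer change the setting, even if congestion disappears, as in the last period of the example.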

Varying RF-I Utilization (2 / 2) • Fully-adaptive technique • Same as Search-and-Set, except will unlock and sweep if either flits-sent or router-stalls change by 150% over course of execution • Fully-adaptive can improve performance by as much as… • 6.5% over baseline on 32B RF-I allocation • 14.3% over baseline on 96B RF-I allocation • [Figure: utilization vs. time, annotated "If under-utilized", "If over-utilized", and "NETWORK CONDITIONS CHANGE!"]
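The unlock condition of the fully-adaptive technique can be sketched as a comparison against the traffic observed when the utilization was locked. "Change by 150%" is read here as a relative change of at least 1.5 from that reference, which is an interpretation and not something the slide spells out; `changed_by` and `UnlockMonitor` are hypothetical names.

```python
def changed_by(reference, current, threshold=1.5):
    """True if current differs from reference by at least threshold (relative)."""
    if reference == 0:
        return current > 0        # assumed: any traffic after none is a change
    return abs(current - reference) / reference >= threshold

class UnlockMonitor:
    """Remembers traffic at lock time and decides when to unlock and sweep."""

    def __init__(self, flits_sent, router_stalls):
        self.ref_flits = flits_sent
        self.ref_stalls = router_stalls

    def should_unlock(self, flits_sent, router_stalls):
        # Either counter changing enough triggers a new sweep.
        return (changed_by(self.ref_flits, flits_sent)
                or changed_by(self.ref_stalls, router_stalls))

monitor = UnlockMonitor(flits_sent=100, router_stalls=20)
print(monitor.should_unlock(120, 25))   # False: small drift
print(monitor.should_unlock(260, 20))   # True: flits sent changed by 160%
print(monitor.should_unlock(100, 50))   # True: stalls changed by 150%
```

When `should_unlock` returns true, the controller would discard its locked setting and repeat the search under the new network conditions.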

Ongoing Work: Reconfigurability of RF-I • For each channel • Source and destination may be reconfigured via frequency-band reassignment • Can assign variable # of channels to each source, destination pair (s,d) • Critical channels given more bandwidth • A flexible means to reconfigure topology • [Figure: one physical organization with two alternative logical organizations, A and B]

Conclusion • We introduce RF-I technology for on-chip communication • We present MORFIC architecture • 64 Cores, 32 L2 Cache Banks, 4 Memory Interfaces • RF-I provides an average 13% (max 18%) performance improvement for area cost of 0.18% of active layer • Deadlock detection and recovery performs better than deadlock avoidance • RF-I access points may become bottlenecks • Adapting RF-I utilization to changing network conditions can avoid congestion at access points

THANK YOU! • Any questions?
