380 likes | 429 Views
A High Throughput Network-on-Chip Architecture for System-on-Chip Interconnect. Abdelhafid Bouhraoua and M.E.S El-Rabaa Computer Engineering Department (COE) College of Computer Science and Engineering (CCSE) King Fahd University of Petroleum and Minerals (KFUPM)
E N D
A High Throughput Network-on-Chip Architecture for System-on-Chip Interconnect Abdelhafid Bouhraoua and M.E.S El-Rabaa Computer Engineering Department (COE) College of Computer Science and Engineering (CCSE) King Fahd University of Petroleum and Minerals (KFUPM) Dhahran, Eastern Province, Saudi Arabia
Outline • Networks-on-Chips • State of The Art • Fat Tree Network Properties • Modified Fat Tree • Router Architecture • Performance Evaluation • Conclusion
Outline • Networks-on-Chips • State of The Art • Fat Tree Network Properties • Modified Fat Tree • Router Architecture • Performance Evaluation • Conclusion
Semiconductor Industry Future Technology Evolution is Faster than Design Evolution
Chip Design Methodology • Chip Complexity: 100M – 3B. • ASIC Methodology Not Suitable • RTL Synthesis Back End • Impossible to handle Full RTL Design for the whole project • IP-Reuse Based Methodology • Opens a Wide Range of Possibilities • IP Blocks Together On a Chip “System-on-Chip” From ASIC System-on-Chip Era
SoC Constraints • Very Short Time-To-Market • Compressed Schedule • Very Short Lifecycle • Low Development Cost • Small Team • High Complexity • Available Silicon Resources to Produce Cost-Effective Highly Integrated SoCs. • Broad Range of IP Blocks • Impossibility to know them all
SoC Methodology Main Task: Integration of The IP Blocks • System Level Integration • Data Formatting and Conversion • Protocol Interfacing • Control Interfacing • Interconnection Level Integration • Signal Interfacing • Data Transfer Interfacing • Wire Interconnect and Back-end Integration
Interconnecting The IPs Brute Force Method: • Design The Interface Block for Every Pair of IPs in the SoC • Point-to-Point Communication between IPs Problems: • Design Effort Similar to that of a new IP Block • If 20 Different Blocks Around 400 New Designs !!!! • Point-to-Point Communication Wiring Mess • 50 Blocks; 8bits bidirectional More than 20,000 Globally Routed Wires Should Look For a Better Way
Networks-on-Chips “Route Packets NOT Wires”, William J. Dally Idea: Build a Complete on-Chip Network • Unified Communication Model (Similar to OSI Stack) • No Ad-hoc Effort • Standardized Interfacing (May be provided by IP Vendors) • Unified Network Elements (Routers, Link Interfaces) • No Design required by the SoC Teams • Flexible Interconnect and Reduced Global Wiring
NoC Requirements • Performance • How fast packets are moved across the network? • How much traffic is carried at the same time and for how long? • Overhead • How Big is its required Size (in Gates) ? • Adaptivity • Does it Adapt Easily to new Designs ? • Complexity • How Easy is Interfacing to it ?
Outline • Networks-on-Chips • State of The Art • Fat Tree Network Properties • Modified Fat Tree • Router Architecture • Performance Evaluation • Conclusion
Previous Work • Majority directly derived from other research (Interconnection Networks for Parallel Architectures) • Reproduce what has been learned in the area of inter-chip networks, • Focus on the router architecture alone to achieve certain goals in latency • Asynchronous design of NoCs, mainly GALS • Circuit switching techniques introduced to provide a certain guarantee for the latency. • Did not fully take advantage of the fact that the network is on-chip where the main gain is no-pin limitation. • Router architectures directly derived from inter-chip architectures where the routers were implemented on a single chip substantial overhead. • Added complexity to achieve guaranteed latency is an overkill in the on-chip context.
Which Network? Most Straightforward Crossbar • Good Throughput (maxes at 66%) • Non Scalable (Quadratic) • Complexity Of Implementation for Higher Number of I/Os.
2-D Mesh • Very Popular Topology in NoCs. • Very Suitable for the 2D nature of Chip Floorplanning (Tiling) • Very High Constraints • Inefficient routing algorithms (deadlock-free by construction) • Efficient routing algorithms (Complex implementation) • Poor performance: Saturation reached at 30 %.
Analysis • Low throughput. Means: latency cannot be guaranteed above the maximum throughput levels • Low throughput cause by contention over the output ports of routers among several incoming packets • Cannot prevent contention from happening. Contention makes router architectures more complex because they need to integrate buffering and prioritization logic. • Routers that implement both packet and circuit switching makes the architecture even more complex.
Methodology • Take advantage of the On-Chip Context: • Design frozen before tape out • No internal IO limitations • Aim for a High Throughput Architecture • Circuitry used at 30% of its maximum is NOT an optimal Solution (Clock frequency, power). • Reduced router size • Integrate a large number of routers • Wormhole routing vs. Store and Forward • Reduce required buffers in routers
R R R R R R R R R R R R C C C C C C C C Fat Tree What topology resembles a crossbar? Banyans or Multistage Interconnection Networks. • Bidirectional multistage or folded multistage networks • Bidirectional multistage are two entities: • The Fat Tree (FT) • The butterfly. • Fat Tree better than butterfly (previous work) • n+1 Stages (or rows) • Size is • Routers = n x 2n • Clients = 2n+1 • Diameter = 2logk + 1; n = log k
Outline • Networks-on-Chips • State of The Art • Fat Tree Network Properties • Modified Fat Tree • Router Architecture • Performance Evaluation • Conclusion
UP Router LEFT RIGHT Routing in Fat Tree Routing reduced to routing in a binary tree. Binary Trees Three Routing Directions • UP • RIGHT • LEFT
[0, l[ U [u, 2n+1 -1] Router (r,c) [l, l+2r-1[ [l+2r-1, u[ Routing in Fat Tree • Matrix n rows x 2(n-1) columns. • Router (r,c) • r : row index (rows are indexed from 0 to n-1) • c: column index (columns are indexed from 0 to 2(n+1) -1) • Size of the clients’ address space reachable using the downside ports is equal to 2r • It is always a continuous interval of addresses of the form [l, u[. • Lower bound l : smallest address reached from the router (r,c). • Smallest address within the range obtained by clearing the lowest r bits of the column c. • l = (c/2r) x 2r. • Upper bound u: largest address reached from the router (r,c). • Largest address obtained by adding 2r to the lower bound l. • u = l+ 2r.
R R R R R R R R R R R R C C C C C C C C Routing In Fat Tree “Summit” Routers Alternate Paths Routing UP: Adaptive Routing Down: Deterministic
Outline • Networks-on-Chips • State of The Art • Fat Tree Network Properties • Modified Fat Tree • Router Architecture • Performance Evaluation • Conclusion
Contention in Fat Tree • Packets coming from the UP links are never routed up • Only packets coming from the bottom links are routed up. • Since the number of UP links is equal to the number of bottom links, there cannot be any contention when routing up. • Contention occurs only when going down. • Bottom links are split in RIGHT and LEFT links, deterministic routing of packets will lead to contention. UP Contention on the way down Many Choices for Going UP LEFT RIGHT
R R R R R R R R R R R R C C C C C C C C Modified Fat Tree Doubling of downward links eliminates contention
Outline • Networks-on-Chips • State of The Art • Fat Tree Network Properties • Modified Fat Tree • Router Architecture • Performance Evaluation • Conclusion
k k Right Inputs Left Inputs 2k + 2 2k + 2 Router Architecture • No Crossbar • No Buffers (Pushed to the Clients) • Every downstream input simultaneously connected to two outputs. • Contention eliminated between the inputs going downstream. • Number of outputs is 2k+2 for k inputs (case of when the router is a summit) • Router models differ from each other only by two items: • Number of input and output ports on the down link • Routing function constants (r,c)
Address A B ≥ < LEFT l A B ≥ < RIGHT L A B ≥ < u UP Routing Circuitry • All network elements are constants and frozen at design time. • All lower bound and upper bound values, used to generate the routing functions, are constants for each router. • These constants are entered as inputs into the routing function • Routing Function implemented using comparators. • Constants needed by the routing function are: • l • L = l + 2r-1 • u
FIFO FIFO FIFO FIFO FIFO Client Interface • Buffers pushed to the Client Interfaces • Each incoming link is terminated with a FIFO memory. • The different FIFO memories connected to the client through a single shared bus. Up Link Down Links (from router) Client/IP Block • Bus can be wider to perform data transfers faster than what is received in the FIFOs. • The size of FIFOs customizable by design team according to the specifications
Outline • Networks-on-Chips • State of The Art • Fat Tree Network Properties • Modified Fat Tree • Router Architecture • Performance Evaluation • Conclusion
Simulation Conditions • Uniform Traffic Generation • Uniform Distribution of Destinations • Traffic Rate constant fraction of Maximum Link Bandwidth • Variable Packet Size (within a predetermined range; eg. 64 bytes +/- 10%) • Simulation Platform: Cycle-based C-based. Developed for this purpose
Throughput • More than 90% Throughput achieved • Compare with Regular Fat Tree
Area and Speed Buffer-less architecture less costly
Client Buffer Utilization • Buffers pushed to the client interfaces. • Considerable number of buffer lanes is necessary for every client interface. • Simulations shows a linear progression of the maximum number of lanes used during operation. • Obtained figures are an order of magnitude lower than the number imposed by the architecture. • Number of buffer lanes in the client interface can be tailored to suit the class of applications at hand while reducing buffering area.
Outline • Networks-on-Chips • State of The Art • Fat Tree Network Properties • Modified Fat Tree • Router Architecture • Performance Evaluation • Conclusion
Conclusion • A contention-free modified FT architecture is proposed. • Proposed architecture achieves maximum theoretical throughput and has smaller latency than conventional FTs. • Latency increases linearly with input load. • Achieved performance is actual performance using a contention-free network. • The area of the network is kept small because of the absence of buffers in the router architecture. • Number of buffer lanes in the client interfaces can be tailored for a specific platform to suit the class of applications at hand while reducing buffering area.