Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints
F. Gilabert†, D. Ludovici§, S. Medardoni‡, D. Bertozzi‡, L. Benini††, G. N. Gaydadjiev§
‡University of Ferrara. ††University of Bologna. †Universidad Politecnica de Valencia. §Delft University of Technology.
Multi-dimension topologies • The 2D mesh is frequently used for NoC design: it perfectly matches the 2D silicon surface, offers a high level of modularity and gives good controllability of electrical parameters. But its average latency and resource consumption scale poorly with network size. • Topologies with more than 2 dimensions are attractive: they offer higher bandwidth and lower average latency, and on-chip wiring is more cost-effective than off-chip. But physical design issues (decreased operating frequency, higher link latency) might impact their effectiveness and even their feasibility.
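As a back-of-the-envelope illustration of the latency argument (a sketch added here, not taken from the slides), the snippet below computes the diameter and average hop count of a k-ary n-mesh under minimal dimension-ordered routing and uniform traffic; for 64 nodes, the 2-ary 6-mesh roughly halves both metrics with respect to the 8-ary 2-mesh.

```python
# Illustrative sketch (not from the slides): diameter and average hop count of a
# k-ary n-mesh under minimal (dimension-ordered) routing and uniform traffic.
from itertools import product

def mesh_stats(k, n):
    """Return (diameter, average hop count) over all distinct node pairs."""
    nodes = list(product(range(k), repeat=n))
    dists = [sum(abs(a - b) for a, b in zip(u, v))
             for u in nodes for v in nodes if u != v]
    return max(dists), sum(dists) / len(dists)

for label, k, n in [("4-ary 2-mesh (16 nodes)", 4, 2),
                    ("2-ary 4-mesh (16 nodes)", 2, 4),
                    ("8-ary 2-mesh (64 nodes)", 8, 2),
                    ("2-ary 6-mesh (64 nodes)", 2, 6)]:
    diameter, avg = mesh_stats(k, n)
    print(f"{label}: diameter {diameter}, average hops {avg:.2f}")
```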
Objective: explore the effectiveness and feasibility of multi-dimensional topologies under realistic technological constraints • Physical synthesis impact on performance: regularity is broken by asymmetric tile sizes or heterogeneous tiles; which switch operating frequency can be achieved? What is the latency of injection links and of express links? Is over-the-cell routing required? • Our approach: physical parameters obtained from the physical synthesis are applied to system-level simulations, enabling silicon-aware performance analysis.
Objective: explore the effectiveness and feasibility of multi-dimensional topologies under realistic architectural constraints • Physical synthesis impact on performance • Impact of the chip I/O interface on topology performance: it may introduce an upper bound to topology performance, affecting the performance differentiation between topologies • Our approach: model the chip I/O interface and capture the implications of I/O performance on topology performance differentiation.
Objective: explore the effectiveness and feasibility of multi-dimensional topologies under realistic software constraints, i.e., the communication semantics of the middleware • Physical synthesis impact on performance • Impact of the chip I/O interface on topology performance • Realistically capture traffic behavior: the traffic pattern is usually abstracted as an average link bandwidth utilization or as a synthetic traffic pattern, which may lead to highly inaccurate performance predictions (traffic peaks, different kinds of messaging, synchronization mismatches) • Our approach: project network traffic based on the latest advances in MPSoC communication middleware and generate NoC traffic patterns "shaped" by that middleware (e.g., synchronization, communication semantics).
Backend synthesis flow • Communication semantics • Topologies under test • Physical synthesis • Layout-aware topology performance • Conclusions
Backend synthesis flow:
• Topology specification → topology generation (RTL SystemC/Verilog)
• Physical synthesis: floorplan, placement, clock tree synthesis, power grid, routing, post-routing optimization
• Netlist and parasitic extraction → PrimeTime (SDF timing)
• Transactional simulator with OCP traffic generator → simulation → VCD trace
• VCD trace → PrimeTime power estimation
Backend synthesis flow • Communication semantics • Topologies under test • Physical synthesis • Layout-aware topology performance • Conclusions
Tile architecture: each tile contains a processor core, connected through a Network Interface Initiator, and a local memory core, connected through a Network Interface Target. The two network interfaces can be used in parallel.
Communication protocol (producer tile ↔ consumer tile):
• Step 1: the producer checks its local semaphores for pending messages to the destination. If none is pending, it writes the data to the local tile memory and unblocks a semaphore at the consumer tile; the producer is then free to carry out other tasks.
• Step 2: the consumer, polling locally, detects the unblocked semaphore and requests the data from the producer.
• Step 3: the consumer reads the data from the producer tile.
• Step 4: the consumer sends a notification (resetting the semaphore) upon completion, which allows the producer to send another message to this consumer.
Implications: a message is sent only when the consumer is ready to read it; there is only one outstanding message per producer-consumer pair; network bandwidth utilization is low, but the topology faces tight latency constraints.
Dalla Torre, A. et al., "MP-Queue: an Efficient Communication Library for Embedded Streaming Multimedia Platform", IEEE Workshop on Embedded Systems for Real-Time Multimedia, 2007.
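To make the four-step handshake concrete, here is a minimal behavioral sketch (added for illustration, not the MP-Queue library code); Python threading semaphores stand in for the tile-local hardware semaphores, and plain variable accesses stand in for the NoC read/write transactions.

```python
# Behavioral sketch of the producer-consumer handshake described above.
# All names are illustrative; blocking acquire() stands in for local polling.
import threading

class Channel:
    """One producer-consumer pair with a single outstanding message."""
    def __init__(self):
        self.tile_memory = None                  # message buffer in the producer tile
        self.msg_ready = threading.Semaphore(0)  # semaphore at the consumer tile (step 1 unblocks it)
        self.slot_free = threading.Semaphore(1)  # producer-side semaphore (step 4 releases it)

def producer(ch, items):
    for item in items:
        ch.slot_free.acquire()       # step 1: wait until no message is outstanding
        ch.tile_memory = item        # write data into the local tile memory
        ch.msg_ready.release()       # unblock the semaphore at the consumer tile
        # the producer is now free to carry out other tasks

def consumer(ch, count, results):
    for _ in range(count):
        ch.msg_ready.acquire()           # step 2: local polling detects the unblocked semaphore
        results.append(ch.tile_memory)   # step 3: read data from the producer tile
        ch.slot_free.release()           # step 4: notify completion (reset semaphore)

ch, results = Channel(), []
t1 = threading.Thread(target=producer, args=(ch, range(4)))
t2 = threading.Thread(target=consumer, args=(ch, 4, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)   # [0, 1, 2, 3]
```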
Backend synthesis flow • Communication semantics • Topologies under test • Physical synthesis • Layout-aware topology performance • Conclusions
Topologies under test – 16 tiles: 4-ary 2-mesh (baseline topology).
Topologies under test – 16 tiles: 2-ary 4-mesh (high bandwidth), compared with the 4-ary 2-mesh baseline.
Topologies under test – 16 tiles: 2-ary 2-mesh (low latency), compared with the 4-ary 2-mesh baseline.
Topologies under test – 64 tiles: 8-ary 2-mesh (baseline topology).
Topologies under test – 64 tiles: 2-ary 6-mesh (high bandwidth), compared with the 8-ary 2-mesh baseline.
Topologies under test – 64 tiles: 2-ary 4-mesh (low latency), compared with the 8-ary 2-mesh baseline.
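For reference, a small back-of-the-envelope sketch (not post-layout data) of the switch resources implied by the configurations above; the "radix" here counts inter-switch ports plus one local port per attached tile, whereas the real design attaches two network interfaces (initiator and target) per tile.

```python
# Quick resource comparison of the k-ary n-mesh configurations under test.
def mesh_resources(k, n, tiles):
    switches = k ** n
    concentration = tiles // switches          # tiles attached to each switch
    max_neighbors = n if k == 2 else 2 * n     # inter-switch ports of a central switch
    radix = max_neighbors + concentration      # simplified: one local port per tile
    diameter = n * (k - 1)
    return switches, concentration, radix, diameter

configs = [("4-ary 2-mesh", 4, 2, 16), ("2-ary 4-mesh", 2, 4, 16),
           ("2-ary 2-mesh", 2, 2, 16), ("8-ary 2-mesh", 8, 2, 64),
           ("2-ary 6-mesh", 2, 6, 64), ("conc. 2-ary 4-mesh", 2, 4, 64)]
for name, k, n, tiles in configs:
    s, c, r, d = mesh_resources(k, n, tiles)
    print(f"{name:>20}: {s:2d} switches, {c} tile(s)/switch, "
          f"max radix {r}, diameter {d}")
```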
Backend synthesis flow • Communication semantics • Topologies under test • Physical synthesis • Layout-aware topology performance • Conclusions
Physical Synthesis • Link latency, maximum frequency, performance, area and power are quantified by post-layout analysis • For 16-tile systems, real physical parameter values were obtained • For 64-tile systems, physical parameter values were extrapolated from the 16-tile results because of synthesis time constraints.
Physical Synthesis – 16 tiles • Network building blocks were synthesized for maximum performance, considering only timing paths in the network logic (switch-to-switch links ignored at this stage) • Critical paths lie in the switches, never in the network interfaces • Network speed closely reflects the maximum switch radix.
Physical Synthesis – 16 tiles • Inter-switch wiring reduces performance • The connectivity pattern of the 2-ary 4-mesh results in a larger frequency drop than the 2D mesh • The 2-ary 2-mesh pays for its lower number of switching resources with a larger switch-to-switch separation, leading to a severe degradation of network performance.
Physical Synthesis – 16 tiles • Frequency-ratioed clock domain crossing in the network interface: the network speed affects the core speed • A maximum core speed of 500 MHz is assumed • Even after the post-layout speed drop, the cores cannot sustain the network speed, so a clock divider of 2 is applied.
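A minimal sketch of the divider selection implied by the frequency-ratioed clock domain crossing; the network frequencies used below are placeholders, not the post-layout values reported in the presentation.

```python
# Pick the smallest integer divider keeping the core clock at or below its
# assumed 500 MHz ceiling (illustrative, not the actual post-layout numbers).
MAX_CORE_FREQ_MHZ = 500.0

def core_divider(network_freq_mhz, max_core_freq_mhz=MAX_CORE_FREQ_MHZ):
    div = 1
    while network_freq_mhz / div > max_core_freq_mhz:
        div += 1
    return div

for f in (950.0, 700.0, 480.0):          # example post-layout network frequencies
    d = core_divider(f)
    print(f"network {f:.0f} MHz -> divider {d}, core {f / d:.0f} MHz")
```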
Physical Synthesis – 16 tiles • The 2-ary 4-mesh has a larger area footprint than the 2D mesh • The 2-ary 2-mesh reduces the number of switches, but each switch has a larger radix, so the area is not halved.
Physical Synthesis – 64 tiles • 64-tile hypercubes present very long links: the switch-to-switch link delay impacts overall network speed, which becomes unacceptably low for 64-tile systems • Link pipelining becomes mandatory: it makes it possible to sustain the network speed even in the presence of long links • The number of pipeline stages depends on the link length on the layout.
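As a rough illustration of how the number of retiming stages follows from the link length, the sketch below uses assumed wire-delay and clock-period figures (not values extracted from the backend flow).

```python
# Estimate link pipeline (retiming) stages so that no link segment limits the
# network clock. Wire delay and clock period are illustrative assumptions.
import math

def pipeline_stages(link_length_mm, wire_delay_ns_per_mm=0.5,
                    clock_period_ns=1.0, margin=0.8):
    """Stages needed so each link segment fits within the usable clock budget."""
    usable = clock_period_ns * margin            # timing budget left for the wire
    segments = math.ceil(link_length_mm * wire_delay_ns_per_mm / usable)
    return max(segments - 1, 0)                  # N segments need N-1 retiming stages

for length in (1.0, 3.0, 6.0):                   # example link lengths on the layout
    print(f"{length:.1f} mm link -> {pipeline_stages(length)} pipeline stage(s)")
```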
Physical Synthesis – 64 tiles: concentrated 2-ary 4-mesh.
Physical Synthesis – 64 tiles • Aggressive link pipelining: a 200% area overhead for a 20% improvement in performance, hence not usable.
Backend synthesis flow • Communication semantics • Topologies under test • Physical synthesis • Layout-aware topology performance • Conclusions
Workload distribution • Producer (P), worker (W) and consumer (C) tasks • External I/O devices are dedicated to either input or output data. [Figure: mapping of 2 producer, 2 consumer and 12 worker tasks onto the tiles]
Topology performance • One input and one output port to the external memory are assumed for 16-tile systems; 4 input and 4 output ports are assumed for 64-tile systems • I/O ports are accessed through sidewall tiles • The mapping of producer and consumer tasks is therefore constrained to these tiles.
Topology performance • Several I/O mapping strategies were considered; for the sake of space, only the most significant one is shown here • OneSided: all the I/O tiles are placed on the same side of the chip.
Topology performance – 16 tiles • The 2-ary 4-mesh reduces the total number of cycles by 27.4% • The 2-ary 2-mesh reduces cycles by only 1.6% over the hypercube: the chip I/O becomes the bottleneck • The real operating frequency of each topology changes the conclusions: the physical degradation is too severe to be compensated • The 2-ary 2-mesh shows superior energy-saving properties: 50% savings over the 2D mesh.
Topology performance – 64 tiles • The 2D mesh outperforms the non-reduced hypercubes • The systems under test are I/O constrained: computation tiles spend around 50% of their time waiting to send data to the consumer tile, which puts an upper bound on topology-related performance optimization • The improvement in terms of execution cycles is not large enough to offset the lower operating speed • Removing the I/O bottleneck is therefore mandatory to achieve performance differentiation between topologies.
Topology performance – 64 tiles • Network and tiles work at the same frequency, the maximum frequency for all tiles (I/O tiles and processing tiles) • The reduced 2-ary 4-mesh achieves very similar performance: its reduced number of cycles compensates for the low network frequency • It does so with fewer hardware resources: 4 times fewer switches, half the number of ports, and operation at half the frequency.
Backend synthesis flow • Communication semantics • Topologies under test • Physical synthesis • Layout-aware topology performance • Conclusions
Conclusions • Bottom-up approach to assess k-ary n-mesh topologies • A number of real-life issues are considered: physical constraints of nanoscale technologies, impact of the I/O interface, and communication semantics of the middleware • The intricate wiring of multi-dimension topologies, or the long links required by concentrated k-ary n-meshes, can be turned into two different kinds of performance overhead by means of proper design techniques:
Conclusions • Operating frequency reduction: in spite of a lower number of execution cycles, multi-dimension topologies lose in terms of real execution time due to their lower operating frequency; concentrated topologies provide a way to trade performance for power/area • Increased link latency: the use of retiming stages makes it possible to sustain the operating frequency while increasing network latency; the area and power overhead must be taken into account, and link pipelining cannot deliver a frequency higher than the one allowed by the switch radix itself • For 64-tile systems we found that, in general, the 2D mesh outperforms the hypercubes: in spite of fewer execution cycles, the real elapsed time is worse because of the lower operating frequency.
Conclusions • Unexpected results for the reduced 2-ary 4-mesh: a low-cost, low-performance solution was expected, but the results show low cost with performance similar to the 2D mesh • Increasing the core speed reduces the impact of I/O tile congestion on the processing tiles • A possible solution to the physical degradation issues of hypercubes is to decouple the network speed from the core speed (GALS) • Other solutions: high-performance, high-radix switches.