Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect. Jungju Oh, Alenka Zajic, Milos Prvulovic. Contents. Introduction Hybrid Network Low-Latency Transmission Line Ring Traffic Steering Evaluation Result Conclusion.
Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect Jungju Oh, Alenka Zajic, Milos Prvulovic
Contents • Introduction • Hybrid Network • Low-Latency Transmission Line Ring • Traffic Steering • Evaluation • Result • Conclusion
Introduction • On-chip communication latency is increasing • Broadcast interconnect • Insufficient bandwidth and excessive delay for many-core chips • Growing core counts → contention • Growing core counts → longer wires → larger wire capacitance → longer delay • Unfavorable wire delay with technology scaling • Packet-switched on-chip network (OCN) • Short links → fast communication between adjacent nodes • Scalable aggregate bandwidth • Packets traverse many links and pipelined routers • Growing core counts → increasing hop counts and latency for far-apart cores [Figure source: ITRS 2012]
Motivation • Switched on-chip network • Good latency for local traffic, but not for long-distance traffic • Much more local than long-distance traffic • Broadcast interconnect • Avoids routing latency even for long-distance traffic • Cannot handle much traffic
Hybrid Network • Exploit the strengths • Broadcast on Transmission Line (TL): low latency • Switched on-chip network: throughput • … and alleviate the weaknesses • Limited TL throughput: use the TL only for critical and/or long-distance traffic • High switching overhead for long-distance traffic: use the TL • Two critical components of this work • Transmission Line Broadcast Interconnect: the why and the how • Traffic Steering: which messages use which interconnect
Transmission Line • Why Transmission Line? • Extremely fast propagation • Uses an electromagnetic wave for signal propagation • 0.0075 ns/mm (unrepeated wire: 0.54 ns/mm) • Not affected by technology scaling • But expensive in terms of metal area (20 µm-wide vs. 0.135 µm global wire) • Limited throughput [Figure: cross-section dimensions of a transmission line vs. a traditional global wire]
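To put the two per-millimetre figures in perspective, the sketch below compares propagation delay over a single span. The 20 mm span and the function name are illustrative assumptions, not values from the talk, and both figures are treated as simple linear per-mm costs, which is a simplification.

```python
# Per-mm propagation delays as quoted on the slide.
TL_DELAY_NS_PER_MM = 0.0075    # transmission line
WIRE_DELAY_NS_PER_MM = 0.54    # unrepeated traditional wire


def propagation_delay_ns(length_mm: float, delay_ns_per_mm: float) -> float:
    """Delay over a span, assuming delay grows linearly with length."""
    return length_mm * delay_ns_per_mm


span_mm = 20.0  # hypothetical span, chosen only for illustration
tl_delay = propagation_delay_ns(span_mm, TL_DELAY_NS_PER_MM)
wire_delay = propagation_delay_ns(span_mm, WIRE_DELAY_NS_PER_MM)

print(f"TL:   {tl_delay:.3f} ns")
print(f"Wire: {wire_delay:.3f} ns")
print(f"Ratio: {wire_delay / tl_delay:.0f}x")
```

Under the linear assumption the ratio is independent of the span, so the choice of 20 mm only affects the absolute numbers.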
Transmission Line Ring • Transmission Line • Extremely fast propagation • But expensive in terms of metal area • Why Ring? • Minimizes overall TL cost • Allows fast arbitration (token passing)
Unidirectional Transmission Line Ring • Two major problems with TLs, caused by the many connections in a many-core chip • Attenuation of the signal (power split at connections) • Signal reflections/reverberations (discontinuity at connections) • Signal needs to stay stronger than the sum of noise and reverberations! • Unidirectional Transmission Line (UTL) ring makes this easy to design for • Chained directional couplers in a ring shape • Control of attenuation • Almost no reflected signal • Directional Coupler • Two transmission lines running in parallel
Unidirectional Transmission Line Ring • Directional Coupler • Two transmission lines running in parallel • Signal goes into one end ① • Most comes out on the other end ② • But some is transferred (EM-coupled) in the same direction onto the other line ③ • Directivity: (almost) no signal on ④ • Chain couplers using one line, use the other to connect transmitters/receivers [Figure: directional coupler with ports ①–④; transmission line with Rx1/Tx1 of Core 1 and Rx2/Tx2 of Core 2 attached through couplers]
Using the UTL Ring • Simple receiver/transmitter • Simple modulation: on-off keying • 1 bit = one or more consecutive pulses • How fast can we transfer? • Depends on the available spectrum of the transmission medium • UTL coupler: 20–60 GHz • 40 GHz clock, 2 pulses/bit → 20 Gbps • Transmitter • PLL (pulses) • Pass-gate (on/off pulses) • Amplifier (impedance matching) • Receiver • Pulse detector • Shift register (collects high-rate bits) [Figure: transmitter (PLL, data, amplifier) and receiver (detector, shift register, data)]
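The bit-rate arithmetic and the on-off keying scheme above can be sketched as follows. This is a minimal illustration, not the circuit from the talk: the function names are hypothetical, and the decoder's rule (a bit is 1 if any pulse was detected in its slot group) is an assumption.

```python
def bit_rate_gbps(clock_ghz: float, pulses_per_bit: int) -> float:
    """Each bit occupies `pulses_per_bit` clock periods."""
    return clock_ghz / pulses_per_bit


def ook_encode(bits, pulses_per_bit=2):
    """On-off keying: a 1 becomes consecutive pulses, a 0 becomes silence."""
    slots = []
    for bit in bits:
        slots.extend([1 if bit else 0] * pulses_per_bit)
    return slots


def ook_decode(slots, pulses_per_bit=2):
    """Assumed detection rule: any pulse within a group means the bit is 1."""
    bits = []
    for start in range(0, len(slots), pulses_per_bit):
        group = slots[start:start + pulses_per_bit]
        bits.append(1 if any(group) else 0)
    return bits


print(bit_rate_gbps(40, 2))            # 40 GHz clock, 2 pulses per bit
print(ook_encode([1, 0, 1]))
print(ook_decode(ook_encode([1, 0, 1])))
```

Sending more than one pulse per bit trades bit rate for easier detection, which is why the 40 GHz clock yields 20 Gbps rather than 40 Gbps.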
Traffic Steering • Which packet should use which network? • Static steering • E.g. packets traveling >8 hops go to the TL ring, the rest go on the mesh • Lacks adaptivity • When traffic is low, 8-hop, 7-hop, etc. packets could also benefit from the ring • When traffic is high, the ring can become saturated
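Static steering as described above amounts to a fixed hop-count comparison. The sketch below assumes row-major node numbering on an 8×8 mesh and Manhattan-distance hop counts; both the numbering and the function names are illustrative assumptions.

```python
def mesh_hops(src: int, dst: int, width: int = 8) -> int:
    """Manhattan distance between two nodes, assuming row-major numbering."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def steer_static(src: int, dst: int, hop_threshold: int = 8, width: int = 8) -> str:
    """Packets with more than `hop_threshold` hops use the ring, the rest the mesh."""
    return "ring" if mesh_hops(src, dst, width) > hop_threshold else "mesh"


print(steer_static(0, 63))  # corner to corner: 14 hops
print(steer_static(0, 1))   # neighbours: 1 hop
```

Because the threshold never changes, this rule cannot react to ring load, which is the lack of adaptivity noted on the slide.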
Adaptive Steering • Ring-Affinity Score • More hops → more benefit from using the ring • Non-critical packet → no benefit • Ring-Affinity Score = latency difference plus criticality adjustment • Threshold • Score above threshold → use the ring • Adjust the threshold to prevent ring bandwidth saturation • Too much traffic on the ring → queuing delays → all benefit disappears
Ring-Affinity Score • $\text{Score} = (L_{\text{mesh}} - L_{\text{ring}}) + C$ • $C$: criticality adjustment • Constant penalty applied to non-critical coherence messages, for simplicity • $L_{\text{mesh}} - L_{\text{ring}}$: latency benefit • $L_{\text{mesh}}$: latency estimate for the mesh • $L_{\text{ring}}$: latency estimate for the UTL ring • How to get $L_{\text{mesh}}$? • Depends on the packet's hop count and on mesh network congestion • Tried using just hop count times router latency: not good enough! • Small cache in each node stores recent latencies for each hop count • E.g. 8×8 mesh → 15 hop counts → 15 sets in the latency cache • Each set keeps the most recently observed latencies • Predictor chooses between using just the most recent latency, the average of the latest few latencies, or the average of all latencies in the set
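A minimal sketch of the per-node latency cache and the score computation follows. It is not the authors' implementation: the history depth, the size of the "latest few" window, the fallback per-hop latency, the penalty value, and the way the prediction mode is selected (passed in by the caller here) are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class MeshLatencyCache:
    """One set of recently observed mesh latencies per hop count."""

    def __init__(self, history: int = 4):
        # history depth is an assumed value; sets are created on first use
        self.sets = defaultdict(lambda: deque(maxlen=history))

    def record(self, hops: int, latency: float) -> None:
        self.sets[hops].append(latency)

    def estimate(self, hops: int, mode: str = "latest", fallback_per_hop: float = 3.0) -> float:
        seen = self.sets[hops]
        if not seen:
            # nothing observed yet: fall back to hops times an assumed router latency
            return hops * fallback_per_hop
        if mode == "latest":
            return seen[-1]
        if mode == "recent":
            recent = list(seen)[-2:]  # assumed window of two
            return sum(recent) / len(recent)
        return sum(seen) / len(seen)  # average of everything in the set


def ring_affinity_score(mesh_latency: float, ring_latency: float,
                        critical: bool, noncritical_penalty: float = 10.0) -> float:
    """Latency benefit plus a criticality adjustment (constant penalty if non-critical)."""
    adjustment = 0.0 if critical else -noncritical_penalty
    return (mesh_latency - ring_latency) + adjustment


cache = MeshLatencyCache()
for observed in (30, 40, 50):
    cache.record(8, observed)
print(cache.estimate(8, "latest"), cache.estimate(8, "recent"), cache.estimate(8, "all"))
print(ring_affinity_score(cache.estimate(8), ring_latency=20, critical=False))
```

Keying the sets by hop count lets congestion show up in the estimate without any global network state, since each node only records latencies it has observed itself.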
Ring-Affinity Scoring • Estimating the UTL ring latency • How long to transmit? Easy. • How long to get the token? • We see everything on the ring! • Can remember who sent the last few packets, and when • We know how far away the token is (last sender) • We can estimate how “fast” it “moves” • Example: 7 nodes in 10 cycles (0.7 nodes/cycle) • If the token is 30 nodes away, the estimated wait is about 43 cycles (30 / 0.7) • Detailed equations and explanations are in the paper [Figure: Core 3 sent a packet on the ring at cycle 10; Core 10 sent a packet on the ring at cycle 20]
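The token-wait estimate can be sketched from the two most recent senders observed on the ring. This is an illustrative simplification of the idea, with assumed names, an assumed 64-node ring, and the assumption that the token travels in increasing node order; the paper's actual equations are not reproduced here.

```python
def ring_distance(from_node: int, to_node: int, num_nodes: int = 64) -> int:
    """Nodes the token must pass going one way around the ring."""
    return (to_node - from_node) % num_nodes


def token_speed(prev_sender: int, prev_cycle: int,
                last_sender: int, last_cycle: int, num_nodes: int = 64) -> float:
    """Observed token speed in nodes per cycle between two transmissions."""
    cycles = last_cycle - prev_cycle
    if cycles <= 0:
        raise ValueError("transmissions must be at distinct, increasing cycles")
    return ring_distance(prev_sender, last_sender, num_nodes) / cycles


def estimate_token_wait(distance_nodes: int, speed_nodes_per_cycle: float) -> float:
    """Cycles until the token arrives: distance divided by speed."""
    return distance_nodes / speed_nodes_per_cycle


# Core 3 sent at cycle 10, Core 10 sent at cycle 20.
speed = token_speed(3, 10, 10, 20)
print(speed)
print(round(estimate_token_wait(30, speed), 1))
```

The units make the direction of the arithmetic clear: nodes divided by nodes per cycle yields cycles.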
Threshold and Re-steering • Threshold adjusted to manage UTL ring utilization • Low enough to avoid excessive queuing • But high enough not to waste the ring throughput • Target utilizations around 75% tend to work well • Threshold Management • Packet steered to the ring when its score exceeds the threshold • Increase the threshold when ring utilization is higher than desired • Decrease the threshold when ring utilization is too low • Re-Steering • Sudden burst of high-scoring packets… • Threshold adaptation takes a while • Meanwhile, ring packets have very long latencies • If a ring-steered packet sits in the queue too long, re-steer it to the mesh • How long is too long?
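The threshold management and re-steering rules above can be sketched as a small controller. Only the 75% target comes from the slide; the step size, the initial threshold, and the queue-time limit are hypothetical values, and the slide leaves "how long is too long" open.

```python
class ThresholdController:
    """Adjusts the steering threshold to hold ring utilization near a target."""

    def __init__(self, threshold: float = 0.0, target_util: float = 0.75,
                 step: float = 1.0, max_queue_cycles: int = 20):
        self.threshold = threshold
        self.target_util = target_util            # ~75% per the slide
        self.step = step                          # assumed adjustment step
        self.max_queue_cycles = max_queue_cycles  # assumed re-steering limit

    def update(self, measured_util: float) -> float:
        """Raise the threshold when the ring is too busy, lower it when underused."""
        if measured_util > self.target_util:
            self.threshold += self.step
        elif measured_util < self.target_util:
            self.threshold -= self.step
        return self.threshold

    def steer(self, score: float) -> str:
        return "ring" if score > self.threshold else "mesh"

    def should_resteer(self, cycles_in_queue: int) -> bool:
        """A ring-steered packet that waits too long falls back to the mesh."""
        return cycles_in_queue > self.max_queue_cycles


ctl = ThresholdController(threshold=10.0, step=2.0)
print(ctl.steer(12.0))     # score above threshold
print(ctl.update(0.90))    # ring too busy, threshold rises
print(ctl.steer(12.0))     # same score no longer qualifies
```

Re-steering covers the window in which the threshold has not yet caught up with a burst, so a packet is never stuck behind a saturated ring while the mesh sits idle.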
Evaluation • Simulated using SESC • 64-tile CMP, 2-issue OoO, 1 GHz, 32 KB L1 D/I cache, 1 MB slice of L2 • 8×8 mesh (switched NoC) with 128-bit link width, 8 VCs (24 buffers) • Applications from the PARSEC 3 and SPLASH-2 benchmark suites • Half of the applications show <20% improvement even with an ideal interconnect • Analysis focuses on applications sensitive to on-chip latency
Speedup 1.14x
Speedup • 4-concentrated mesh + UTL Ring • 8.7% improvement: 1.13× → 1.23× • Flattened Butterfly + UTL Ring • 5.7% improvement: 1.10× → 1.16×
Summary • Increasing core counts worsens on-chip latency • Unidirectional Transmission Line Ring • Low-latency • But limited throughput • Use UTL Ring with switched interconnect synergistically • UTL Ring for low latency • Switched interconnect for throughput • Adaptive traffic steering enables judicious use of the ring • Proposed traffic steering provides 14% performance improvement
Result: Latency Reduction of UTL Ring • UTL Ring latency is 55% lower than the mesh • Lower latency than advanced interconnects • >44% latency reduction over concentrated mesh and flattened butterfly • But we can only do this for 13% to 44% of messages (2.0% to 9.9% of the bits) [Figure labels: 44.3%, 43.9%]
Result: Speedup vs. Mesh Alone • Performs slightly better than advanced on-chip networks • 1.14× (mesh + UTL ring) • vs. 1.13× (concentrated mesh) and 1.10× (flattened butterfly)
Adaptive vs. Non-Adaptive Steering • Non-adaptive random steering • 0.63× slowdown on an application with high on-chip traffic (ocean-nc) • 1.02× speedup if 30% of packets use the UTL Ring randomly (RND30) • 0.96× slowdown if 50% do (RND50) • Adaptive traffic steering • 1.14× speedup (up to 1.20× with the 64 Gbps configuration)