Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks

Aniruddha N. Udipi, with Naveen Muralimanohar* and Rajeev Balasubramonian
University of Utah and *HP Labs

Motivation - I
• Future CMPs are likely to be power-limited
• On-chip networks consume 20-36% of total chip power
• Network power is dominated by routers
• Chip design and verification costs are tremendous
• Directory-based protocols are complicated and have the inherent problem of indirection
• Snooping-based protocols are well understood and simple to design
• Metal and wiring are cheap and plentiful
• We are no longer pin limited for the interconnection network

Motivation - II
• The future of multi-core computing is likely to diverge into two separate tracks
• Mid-range multicore machines for home/office: 16-64 cores
• Many-core machines for scientific/server applications: 1000s of cores
• Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores
• Design energy-efficient networks for moderate core counts

Executive Summary
• Elimination of routers leads us back to bus-based networks
• Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity
• Enhancing the life of buses for moderately sized CMPs
• Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring

Outline
• Overview
• Proposal I - Filtered Segmented Bus
• Proposal II - Low-swing Wiring
• Proposal III - Address Interleaved Buses
• Proposal IV - Page Coloring
• Evaluation
• Conclusion

Baseline Chip and Interconnect Organization
• Simple mesh used for illustration here, other options discussed in the paper
• Static-NUCA shared L2, each line has a “home” slice based on its address
• Figure legend: Core, L1, Router, L2

Where does energy go in the network?
• Router: 1.39e-10 J/access
• Link: 1.56e-11 J/access
• Router energy is roughly 8X link energy per access
• Energy estimates based on CACTI 6.0 and Orion 2.0

What is the solution?
• We are left with.. a bus!
• Could we really just use a bus?
• Not really
• Too many links activated on every transaction
• Energy saved by eliminating routers is lost by activating more links
• Poor performance due to increased arbitration times and network contention

We can do better..
• Useless snoop: the particular cache line is not present in any other core

Filtered Bus
• Segment and filter snoop transactions at intermediate points
• Two types of filters
  • Out-filter
  • In-filter
• Reduces number of links activated
• Allows for safe parallelism (serialization happens at the central bus if required)
• Figure legend: bus link, filter

Filters
• Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”
• Each of these is a Counting Bloom Filter
• 2 arrays of 10-bit entries
• Subsets of the address bits hashed into each of these arrays, incremented to add entries, decremented to remove entries
• To test for membership, simply check if entries in both arrays are non-zero
• Compact representation, false positives possible
• Figure legend: bus link, In + Out filter
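The counting Bloom filter described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which address bits feed each array or how many entries each array has, so the class name, `num_entries`, `offset_bits`, and the two bit-slicing hashes below are hypothetical choices.

```python
class CountingBloomFilter:
    """Two arrays of 10-bit counters, each indexed by a different subset of address bits."""

    COUNTER_MAX = (1 << 10) - 1  # 10-bit entries, as stated on the slide

    def __init__(self, num_entries=1024, offset_bits=6):
        # num_entries and offset_bits are assumptions made for this sketch only.
        self.num_entries = num_entries
        self.offset_bits = offset_bits
        self.arrays = [[0] * num_entries, [0] * num_entries]

    def _indices(self, address):
        line = address >> self.offset_bits                      # drop the block-offset bits
        first = line % self.num_entries                         # hash 1: low-order line-address bits
        second = (line // self.num_entries) % self.num_entries  # hash 2: the next group of bits
        return (first, second)

    def add(self, address):
        indices = self._indices(address)
        if any(a[i] == self.COUNTER_MAX for a, i in zip(self.arrays, indices)):
            raise OverflowError("counter would exceed 10 bits")
        for array, index in zip(self.arrays, indices):
            array[index] += 1

    def remove(self, address):
        # Callers are expected to remove only addresses they previously added.
        for array, index in zip(self.arrays, self._indices(address)):
            if array[index] > 0:
                array[index] -= 1

    def may_contain(self, address):
        # Member only if the entries in BOTH arrays are non-zero.
        return all(a[i] > 0 for a, i in zip(self.arrays, self._indices(address)))
```

Because an address is reported present whenever both of its counters are non-zero, two unrelated addresses can jointly make a third look present (a false positive), but an address that was added and not removed is never reported absent.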

Out-filter - Case 1
• Bloom filter in every segment keeps track of a superset of lines that call that segment “home” and have been sent “out” of that segment
• If a line has never left a segment, none of its transactions need to be seen outside
• Figure: completely localized transaction, only the home segment is activated (energy saved)
• Figure legend: bus link, in-filter, out-filter, activated bus, activated filter, R = requested address

Out-filter – Case 2
• If the line is being requested from outside its home segment, the transaction has to go out on the central bus
• The out-filter of the home segment is updated appropriately
• The in-filter then takes over
• Figure legend: bus link, in-filter, out-filter, activated bus, activated filter, R = requested address

In-filter
• Bloom filters keep track of a superset of lines currently present in the segment
• Only broadcast within the local segment if required (energy saved otherwise)
• Figure legend: bus link, in-filter, out-filter, activated bus, activated filter, R = requested address
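The out-filter and in-filter cases can be put together in a short Python sketch that decides which sub-buses see a given request. Several things here are assumptions rather than details from the slides: plain sets stand in for the Bloom filters (so there are no false positives), the home segment is assumed to always take part once a request reaches the central bus, evictions are not modeled, and the function and parameter names are hypothetical.

```python
def snoop(address, requester, home, out_filters, in_filters):
    """Return (segments whose sub-bus is activated, central_bus_used) for one request.

    out_filters[s] and in_filters[s] are plain sets standing in for Bloom filters.
    """
    segments = {requester}                    # the requester's own sub-bus always sees the request
    left_home = address in out_filters[home]

    if requester == home and not left_home:
        in_filters[home].add(address)
        return segments, False                # Case 1: completely localized transaction

    # Otherwise the transaction goes out on the central bus.
    segments.add(home)                        # assumption: the home segment always participates
    for seg, present in in_filters.items():
        if seg != requester and address in present:
            segments.add(seg)                 # in-filter hit: broadcast inside that segment
    if requester != home:
        out_filters[home].add(address)        # Case 2: record that the line has left its home
    in_filters[requester].add(address)        # the requester's segment now holds the line
    return segments, True
```

With four segments and empty filters, a request from the home segment stays local; once another segment has fetched the line, later requests go to the central bus but still skip every segment whose in-filter does not contain the line.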

Arbitration
• Global arbitration delay is non-trivial for a single bus connecting even 16 cores
• Multi-step arbitration, as required
• On every request
  • arbitrate for local bus and broadcast
  • if filter indicates that the transaction is complete, “validate” broadcast via wired-OR
  • if not, arbitrate for central bus and hold broadcast in a single-entry buffer until the central bus is available
  • at the remote sub-buses, priority is given to requests originating from the central bus
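The per-request steps above can be sketched as a small Python model of one sub-bus. This is a rough illustration, not the hardware design: the class and function names are hypothetical, a returned string stands in for the wired-OR validate signal, and stalling new local requests while the single-entry buffer is occupied is an assumption the slides do not state.

```python
from collections import deque


class SubBus:
    """One segment's bus, with a single-entry buffer toward the central bus."""

    def __init__(self):
        self.local_queue = deque()    # requests from cores in this segment
        self.from_central = deque()   # requests arriving from the central bus
        self.pending_global = None    # single-entry buffer awaiting the central bus

    def arbitrate(self):
        """Pick the next request for this sub-bus; central-bus traffic has priority."""
        if self.from_central:
            return "central", self.from_central.popleft()
        # Assumption: no new local broadcast starts while the buffer is occupied.
        if self.local_queue and self.pending_global is None:
            return "local", self.local_queue.popleft()
        return None


def after_local_broadcast(bus, request, filter_says_complete):
    """Either validate the local broadcast or hold it for the central bus."""
    if filter_says_complete:
        return "validated"            # stands in for the wired-OR validate signal
    bus.pending_global = request      # held until the central bus is won
    return "waiting_for_central_bus"
```

A caller would clear `pending_global` once the central bus grants the request, which lets the next local request proceed.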

Low-swing Wiring
• Differential low-swing wiring is up to 10X more energy efficient than regular wiring
• These have less impact on packet-switched networks since routers are the bottleneck anyway
• Amdahl’s law!
• Slightly increased latency, more metal requirement

Address Interleaved Buses
• As core counts increase, increased pressure on the bus due to contention
• At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip
• To shore up performance, increase the number of buses
  • different buses handle mutually exclusive addresses
  • increased metal requirement
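One simple way to make buses handle mutually exclusive addresses is to pick the bus from the low-order bits of the cache-line address. The Python sketch below assumes this scheme and a 64-byte line; neither the interleaving function nor the line size is given in the slides, and `select_bus` is a hypothetical name.

```python
def select_bus(address, num_buses=2, line_bytes=64):
    """Map an address to exactly one of num_buses address buses.

    line_bytes=64 and modulo interleaving are assumptions for illustration.
    """
    return (address // line_bytes) % num_buses
```

Consecutive cache lines alternate between buses, so every address is owned by exactly one bus and requests for the same line always serialize on the same bus.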

Page Coloring
• OS-assisted page-coloring for L2 cache
• We use a simple first-touch approach
• Improved locality helps any network, but is especially well-suited for our network because
  • More flexibility in page placement
  • Less negative impact by sub-optimal page placement
  • Improves filter behavior
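A first-touch policy can be sketched as follows: the first core to touch a page fixes that page's home L2 slice, and later accesses reuse the mapping. The 4KB page size comes from the Methodology slide; the class name, the dictionary-based page table, and the rule that core i is co-located with slice i modulo the slice count are assumptions for this sketch only.

```python
PAGE_SIZE = 4096  # 4KB pages, as listed under Methodology


class FirstTouchColoring:
    """Assign each page a home L2 slice the first time any core touches it."""

    def __init__(self, num_slices):
        self.num_slices = num_slices
        self.page_home = {}  # page number -> home L2 slice

    def home_slice(self, address, core):
        page = address // PAGE_SIZE
        # Assumption: core i sits next to slice i % num_slices.
        return self.page_home.setdefault(page, core % self.num_slices)
```

Pages touched first by a core end up homed near that core, which is the locality the filters benefit from; a page first touched by the "wrong" core keeps its placement in this sketch, since migration is not modeled.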

Methodology
• Virtutech SIMICS full-system simulator
• “g-cache” significantly modified to add network models
• CACTI 6.0 and Orion 2.0 for router/link energy computation
• 16 cores for most experiments, sensitivity analysis for 32- and 64-core systems
• 32nm process, 3GHz clock
• 32K D-L1, 16K I-L1, 2MB/slice shared L2
• 200 cycle main memory latency
• 4KB page size
• PARSEC, NAS, SPLASH-2 benchmark suites, run for the entire Region-Of-Interest/parallel section
• Baseline routers: 4 VCs, 8 buffers/VC

Energy Consumption – Address Network
• Ring – 20x
• Grid – 27x
• Fbfly – 31x

Energy Consumption – Data Network
• Ring – 2x
• Grid – 2.5x
• Fbfly – 3x

How is energy consumption reduced?
• Router : Link energy ratio is high enough to significantly impact energy characteristics
• Efficient Bloom filters, at 16KB/filter
• Out-filters are 85% accurate (note that there are only false positives, no false negatives)
• In-filters are 90% accurate

Effect of Page Coloring
• More locality
• Better filtering
• Out-filter accuracy increases from 85% to 97%

System Performance
• Ring – 7%
• Grid – 3%
• Fbfly – 1%

How does performance improve?
• Two basic reasons
  • Inherent indirection in directory-based protocols
  • Deep pipelines in routers increasing the no-load latency
• Avg. latency in bus-based network is 16.4 cycles
  • Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)
• Even in the most connected FBFLY, average of 1.5 hops per message, bare minimum two messages per transaction – 3 hops – 15 cycles without contention
  • Link (6 cyc) + Router (9 cyc)
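The latency figures on this slide can be checked with a few lines of Python. All numbers are taken from the slide; the variable names are illustrative, and the per-hop split is only what the 3-hop total implies, not a figure the slide states.

```python
# Bus-based network: average latency components, in cycles (from the slide).
bus_latency = {"arbitration": 3.7, "contention": 1.0, "bloom_filter": 1.2, "link": 10.5}
bus_total = sum(bus_latency.values())          # about 16.4 cycles

# Flattened butterfly: 1.5 hops per message, at least two messages per transaction.
hops_per_message = 1.5
messages_per_transaction = 2
fbfly_hops = hops_per_message * messages_per_transaction   # 3 hops

fbfly_latency = {"link": 6, "router": 9}       # cycles over those 3 hops, no contention
fbfly_total = sum(fbfly_latency.values())      # 15 cycles

# Implied per-hop cost (derived here, not stated on the slide).
link_per_hop = fbfly_latency["link"] / fbfly_hops
router_per_hop = fbfly_latency["router"] / fbfly_hops
```

The bus total already includes arbitration and contention, while the 15-cycle FBFLY figure is a contention-free minimum, which is why the two no-load numbers end up close.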

Scaling – 32 Cores – Energy
• Average energy reduction of 19X in address network, 3X in data network

32 Cores – Performance
• Average 5% drop in performance

Scaling – 64 Cores – Energy
• Average energy reduction of 13X in address network, 2.5X in data network

64 Cores – Performance
• Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses

Router Optimizations
• For packet-switched networks to be as energy efficient as bus-based networks, Router : Link energy ratio should be less than
  • 3.5X at 16 cores
  • 4.5X at 32 cores
  • 7X at 64 cores
• Current energy ratio is approx. 70X

Related Work
• Packet Switched Networks
  • Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA
• Hierarchical Networks
  • Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)
• Snoop Filtering
  • Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)
• Bus applications in CMPs
  • Manevich et al. (NOCS ’09)

Key Contributions
• For moderate core counts, buses just work!
  • Dramatic energy reduction
  • little or no loss in performance
  • simple snooping protocols, reduction in design complexity
• Low-swing wiring
• Multiple Address Interleaved buses
• OS-assisted page coloring
• Potential for router optimization

Thank you..
• Questions?
