Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks

  1. Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks
     Aniruddha N. Udipi, with Naveen Muralimanohar* and Rajeev Balasubramonian
     University of Utah and *HP Labs

  2. Motivation - I
     • Future CMPs are likely to be power-limited
     • On-chip networks consume 20-36% of total chip power
     • Network power is dominated by routers
     • Chip design and verification costs are tremendous
     • Directory-based protocols are complicated and have the inherent problem of indirection
     • Snooping-based protocols are well understood and simple to design
     • Metal and wiring are cheap and plentiful
     • We are no longer pin-limited for the interconnection network

  3. Motivation - II
     • Future of multi-core computing is likely to diverge into two separate tracks
     • Mid-range multicore machines for home/office: 16-64 cores
     • Many-core machines for scientific/server applications: 1000s of cores
     • Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores
     • Design energy-efficient networks for moderate core counts

  4. Executive Summary
     • Elimination of routers leads us back to bus-based networks
     • Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity
     • Enhancing the life of buses for moderately sized CMPs
     • Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring

  5. Outline
     • Overview
     • Proposal I - Filtered Segmented Bus
     • Proposal II - Low-swing Wiring
     • Proposal III - Address Interleaved Buses
     • Proposal IV - Page Coloring
     • Evaluation
     • Conclusion

  6. Baseline Chip and Interconnect Organization
     • Simple mesh used for illustration here, other options discussed in the paper
     • Static-NUCA shared L2, each line has a “home” slice based on its address
     [Figure labels: Core, L1, Router, L2]
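
As a minimal sketch of the static-NUCA mapping on this slide, a line's home slice can be derived from the address bits just above the cache-line offset. The 64-byte line size, the 16 slices, and the name `home_slice` are illustrative assumptions, not values given in the talk.

```python
LINE_BYTES = 64   # assumed cache-line size (not stated on the slides)
NUM_SLICES = 16   # assumed: one L2 slice per tile on a 16-core chip


def home_slice(addr: int) -> int:
    """Return the index of the L2 slice that is "home" for the line holding addr."""
    line_number = addr // LINE_BYTES      # drop the byte offset within the line
    return line_number % NUM_SLICES       # consecutive lines rotate across slices
```

With this mapping, every byte of one line shares a home, and consecutive lines land on consecutive slices.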

  7. Where does energy go in the network?
     • Router: 1.39e-10 J/access
     • Link: 1.56e-11 J/access
     • Chart annotation: 8X
     • Energy estimates based on CACTI 6.0 and Orion 2.0

  9. What is the solution?
     • We are left with a bus!
     • Could we really just use a bus?
     • Not really
       • Too many links activated on every transaction
       • Energy gained by eliminating routers is lost by activating more links
       • Poor performance due to increased arbitration times and network contention

  10. We can do better
      • Useless snoop: particular cache line not present in any other core

  11. Filtered Bus
      • Segment and filter snoop transactions at intermediate points
      • Two types of filters
        • Out-filter
        • In-filter
      • Reduces number of links activated
      • Allows for safe parallelism (serialization happens at the central bus if required)
      [Figure legend: Bus link, Filter]

  12. Filters
      • Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”
      • Each of these is a Counting Bloom Filter
        • 2 arrays of 10-bit entries
        • Subsets of the address bits hashed into each of these arrays, incremented to add entries, decremented to remove entries
        • To test for membership, simply check if entries in both arrays are non-zero
      • Compact representation, false positives possible
      [Figure legend: Bus link, In + Out Filter]
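
The counting Bloom filter described on this slide can be sketched as below. The two arrays of 10-bit counters and the add/remove/membership rules come from the slide. The array size (`index_bits`), the line-offset width, the choice of two disjoint bit fields as the "hash", and raising on overflow are assumptions made only for illustration.

```python
class CountingBloomFilter:
    """Two arrays of 10-bit counters, each indexed by a different subset of address bits."""

    COUNTER_MAX = (1 << 10) - 1  # 10-bit entries, as stated on the slide

    def __init__(self, index_bits: int = 12, line_offset_bits: int = 6):
        # index_bits and line_offset_bits are assumed values, not from the talk.
        self.index_bits = index_bits
        self.line_offset_bits = line_offset_bits
        self.arrays = [[0] * (1 << index_bits) for _ in range(2)]

    def _indices(self, addr: int):
        line = addr >> self.line_offset_bits
        mask = (1 << self.index_bits) - 1
        low = line & mask                         # first subset of address bits
        high = (line >> self.index_bits) & mask   # second, disjoint subset
        return (low, high)

    def add(self, addr: int) -> None:
        idxs = self._indices(addr)
        if any(a[i] == self.COUNTER_MAX for a, i in zip(self.arrays, idxs)):
            # How real hardware handles overflow is not covered by the slide.
            raise OverflowError("10-bit counter would overflow")
        for a, i in zip(self.arrays, idxs):
            a[i] += 1

    def remove(self, addr: int) -> None:
        # Only valid for an address that was previously added.
        for a, i in zip(self.arrays, self._indices(addr)):
            if a[i] > 0:
                a[i] -= 1

    def might_contain(self, addr: int) -> bool:
        # Present only if the entries in both arrays are non-zero.
        return all(a[i] > 0 for a, i in zip(self.arrays, self._indices(addr)))


def make_addr(high: int, low: int) -> int:
    """Hypothetical helper: build an address from the two index fields (defaults above)."""
    return ((high << 12) | low) << 6
```

A false positive arises when a queried address matches one added line in the first array and a different added line in the second: after adding `make_addr(2, 1)` and `make_addr(4, 3)`, the filter also reports `make_addr(4, 1)` as possibly present. It never reports an added line as absent, so it has no false negatives.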

  13. Out-filter - Case 1
      • Bloom filter in every segment keeps track of a superset of lines that call that segment “home” and have been sent “out” of that segment
      • If a line has never left a segment, none of its transactions need to be seen outside
      • Completely localized transaction: only the home segment is activated, energy saved
      [Figure legend: Bus link, In-filter, Out-filter, Activated bus, Activated filter, Home segment, R = requested address]

  14. Out-filter - Case 2
      • If the line is being requested from outside its home segment, the transaction has to go out on the central bus
      • The out-filter of the home segment is updated appropriately
      • The in-filter then takes over
      [Figure legend: Bus link, In-filter, Out-filter, Activated bus, Activated filter, Home segment, Update, R = requested address]

  15. In-filter
      • Bloom filters keep track of a superset of lines currently present in the segment
      • Only broadcast within the local segment if required
      • Energy saved in segments that do not hold the line
      [Figure legend: Bus link, In-filter, Out-filter, Activated bus, Activated filter, R = requested address]
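
The out-filter and in-filter cases can be put together in a rough sketch that returns which segments see a given snoop. This is not the authors' implementation. Exact Python sets stand in for the Bloom filters, so there are no false positives. The names `Segment` and `snoop` are hypothetical, and every request is assumed to leave the line cached in the requester's segment. How the request reaches the L2 home slice, and how evictions decrement the filters, is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    # Exact sets used as stand-ins for counting Bloom filters (an assumption).
    out_filter: set = field(default_factory=set)  # home lines that have left this segment
    in_filter: set = field(default_factory=set)   # lines currently present in this segment


def snoop(line: int, requester: int, home: int, segments: list) -> set:
    """Return the indices of the segments whose local bus is activated."""
    activated = {requester}  # the requester's own segment always broadcasts
    stays_local = requester == home and line not in segments[home].out_filter
    if not stays_local:
        if requester != home:
            # Out-filter case 2: the line is leaving its home segment.
            segments[home].out_filter.add(line)
        # The transaction is on the central bus; in-filters decide who rebroadcasts.
        for idx, seg in enumerate(segments):
            if idx != requester and line in seg.in_filter:
                activated.add(idx)
    segments[requester].in_filter.add(line)  # assumed: requester now holds the line
    return activated
```

In a four-segment example, a first request for a line from its home segment activates only that segment. A later request from another segment activates the requester and the home segment, and marks the line in the home segment's out-filter.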

  16. Arbitration
      • Global arbitration delay is non-trivial for a single bus connecting even 16 cores
      • Multi-step arbitration, as required
      • On every request
        • arbitrate for local bus and broadcast
        • if filter indicates that the transaction is complete, “validate” broadcast via wired-OR
        • if not, arbitrate for central bus and hold broadcast in a single-entry buffer until the central bus is available
        • at the remote sub-buses, priority is given to requests originating from the central bus
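
The last rule above, priority for central-bus traffic at a sub-bus, can be sketched as a small arbiter. The class name `SubBusArbiter` and the FIFO order within each class of request are assumptions. The slide states only that requests arriving from the central bus win over local ones.

```python
from collections import deque


class SubBusArbiter:
    """Grants a sub-bus to one pending request per call, central-bus requests first."""

    def __init__(self):
        self.from_central = deque()  # transactions arriving from the central bus
        self.local = deque()         # requests from cores on this sub-bus

    def request_local(self, core_id: int) -> None:
        self.local.append(core_id)

    def request_from_central(self, txn_id: int) -> None:
        self.from_central.append(txn_id)

    def grant(self):
        """Return ("central", txn_id), ("local", core_id), or None if idle."""
        if self.from_central:
            return ("central", self.from_central.popleft())
        if self.local:
            return ("local", self.local.popleft())
        return None
```

Even if a local core asked first, a pending central-bus transaction is granted ahead of it, so the globally ordered transaction is not held up behind local traffic.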

  18. Low-swing Wiring
      • Differential low-swing wiring is up to 10X more energy efficient than regular wiring
      • These have less impact on packet-switched networks since routers are the bottleneck anyway
      • Amdahl’s law!
      • Slightly increased latency, more metal requirement

  20. Address Interleaved Buses
      • As core counts increase, increased pressure on the bus due to contention
      • At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip
      • To shore up performance, increase the number of buses
        • different buses handle mutually exclusive addresses
        • increased metal requirement
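
One way to make buses handle mutually exclusive addresses is to select the bus from low-order bits of the line address. The slide does not say which bits are used, so the interleaving granularity, the 64-byte line size, and the name `bus_for_address` are assumptions in this sketch.

```python
def bus_for_address(addr: int, num_buses: int = 2, line_bytes: int = 64) -> int:
    """Pick the bus for addr so that each cache line maps to exactly one bus."""
    line_number = addr // line_bytes   # all bytes of a line share a bus
    return line_number % num_buses     # assumed: interleave at line granularity
```

Because every address maps to exactly one bus, transactions to the same line are still serialized on a single bus, while unrelated lines can proceed in parallel on the others.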

  22. Page Coloring
      • OS-assisted page-coloring for L2 cache
      • We use a simple first-touch approach
      • Improved locality helps any network, but is especially well-suited for our network because
        • More flexibility in page placement
        • Less negative impact by sub-optimal page placement
        • Improves filter behavior
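
A first-touch policy can be sketched as follows: the first core to touch a page fixes where that page's lines live. A dictionary stands in for the OS choosing a physical page of the right color. That stand-in, the one-slice-per-core layout, and the name `FirstTouchColoring` are assumptions, and the 4 KB page size is the one listed in the methodology.

```python
PAGE_BYTES = 4096  # 4KB pages, as listed in the methodology


class FirstTouchColoring:
    """Remembers, per page, the L2 slice of the first core that touched it."""

    def __init__(self):
        self.page_home = {}  # page number -> home slice (stand-in for OS page tables)

    def home_for(self, addr: int, requesting_core: int) -> int:
        page = addr // PAGE_BYTES
        # Assumed: slice i is co-located with core i, so the first toucher's
        # own slice becomes the page's home; later touches do not move it.
        return self.page_home.setdefault(page, requesting_core)
```

Pages used by a single core then stay in that core's segment, which is the locality the filters benefit from.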

  24. Methodology
      • Virtutech SIMICS full-system simulator
      • “g-cache” significantly modified to add network models
      • CACTI 6.0 and Orion 2.0 for router/link energy computation
      • 16 cores for most experiments, sensitivity analysis for 32- and 64-core systems
      • 32nm process, 3GHz clock
      • 32K D-L1, 16K I-L1, 2MB/slice shared L2
      • 200 cycle main memory latency
      • 4KB page size
      • PARSEC, NAS, SPLASH-2 benchmark suites, run for entire Region-Of-Interest/parallel section
      • Baseline routers: 4 VCs, 8 buffers/VC

  25. Energy Consumption - Address Network
      • Chart annotations: Ring 20x, Grid 27x, Fbfly 31x

  26. Energy Consumption - Data Network
      • Chart annotations: Ring 2x, Grid 2.5x, Fbfly 3x

  27. How does energy consumption reduce?
      • Router : Link energy ratio is high enough to significantly impact energy characteristics
      • Efficient Bloom filters, at 16KB/filter
      • Out-filters are 85% accurate (note that there are only false positives, no false negatives)
      • In-filters are 90% accurate

  28. Effect of Page Coloring
      • More locality
      • Better filtering
      • Out-filter accuracy increases from 85% to 97%

  29. System Performance
      • Chart annotations: Ring 7%, Grid 3%, Fbfly 1%

  30. How does performance improve?
      • Two basic reasons
        • Inherent indirection in directory-based protocols
        • Deep pipelines in routers increasing the no-load latency
      • Avg. latency in bus-based network is 16.4 cycles
        • Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)
      • Even in the most connected FBFLY, average of 1.5 hops per message, bare minimum two messages per transaction: 3 hops, 15 cycles without contention
        • Link (6 cyc) + Router (9 cyc)
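
The latency figures on this slide can be checked with a few lines of arithmetic. All numbers are taken from the slide, and the variable names are only illustrative.

```python
# Average bus transaction latency, in cycles, by component (from the slide).
bus_cycles = {
    "arbitration": 3.7,
    "contention": 1.0,
    "bloom_filter": 1.2,
    "link": 10.5,
}
bus_total = round(sum(bus_cycles.values()), 1)  # 16.4 cycles

# Flattened butterfly: 1.5 hops per message, at least two messages per
# transaction (request and reply), and no contention.
hops_per_message = 1.5
messages_per_transaction = 2
fbfly_hops = hops_per_message * messages_per_transaction  # 3 hops

fbfly_cycles = {"link": 6, "router": 9}
fbfly_total = sum(fbfly_cycles.values())  # 15 cycles
```

The contention-free flattened butterfly figure (15 cycles) is thus close to the bus average (16.4 cycles), even though the bus figure already includes arbitration and contention.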

  31. Scaling - 32 Cores - Energy
      • Average energy reduction of 19X in address network, 3X in data network

  32. 32 Cores - Performance
      • Average 5% drop in performance

  33. Scaling - 64 Cores - Energy
      • Average reduction of 13X in address network, 2.5X in data network

  34. 64 Cores - Performance
      • Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses

  35. Router Optimizations
      • For packet-switched networks to be as energy efficient as bus-based networks, Router : Link energy ratio should be less than
        • 3.5X at 16 cores
        • 4.5X at 32 cores
        • 7X at 64 cores
      • Current energy ratio is approx. 70X

  37. Related Work
      • Packet Switched Networks: Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA
      • Hierarchical Networks: Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)
      • Snoop Filtering: Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)
      • Bus applications in CMPs: Manevich et al. (NOCS ’09)

  38. Key Contributions
      • For moderate core counts, buses just work!
        • Dramatic energy reduction
        • Little or no loss in performance
        • Simple snooping protocols, reduction in design complexity
      • Low-swing wiring
      • Multiple address interleaved buses
      • OS-assisted page coloring
      • Potential for router optimization

  39. Thank you
      • Questions?
