1 / 22

CCNoC : On-Chip Interconnects for Cache-Coherent Manycore Server Chips

CCNoC : On-Chip Interconnects for Cache-Coherent Manycore Server Chips. CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli. LSI. Integrated Systems Laboratory. NoCs Major Power Consumer . Move towards manycore Tiled architectures

cyndi
Download Presentation

CCNoC : On-Chip Interconnects for Cache-Coherent Manycore Server Chips

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CCNoC: On-Chip Interconnects forCache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli LSI Integrated Systems Laboratory

  2. NoCs Major Power Consumer • Move towards manycore • Tiled architectures • Network-on-Chip (NoC) • Significant power consumer • 40% MIT RAW • 30% Intel Tera-scale • Cache coherent CMP • Server workloads C C C C Core Core $ $ $ $ C Crossbar C C C $ $ $ $ C C C C $ $ $ $ $ $ C C C C $ $ $ $

  3. Proposals to Reduce NoC Power • Multiple networks • Better area and power [Balfour & Dally ICS 2006] • Commercial server workloads • Traffic patterns are different • Run on cache coherent CMPs • Strong relation between coherence protocol and NoC • Not optimized for Commercial Server Workload traffic

  4. Contributions • Commercial server workloads • Optimized for reuse in L1, little sharing • Full blown coherence protocol in CMPs • Only some transitions are frequent • Duality in Request/Response message size • CCNoC • Full advantage of heterogeneity • Same number of buffers • 16% less power same performance as Mesh

  5. Outline • Overview • Why CCNoC? • Dual-router design • Evaluation • Conclusions

  6. Dual Router is More Efficient • Dual router • Two crossbars per routing node • Wires less expensive on-chip • Use more wires for better performance • Area and power grows faster than connectivity • Balfour & Dally ICS 2006 • Dual router: better performance, power and area N/2 bit wide N bit wide N/2 bit wide

  7. Right Dual Router Design • Avoid protocol level deadlock • Separate • Requests • Responses • Use Virtual Channels • CCNoC • sub-networks • Request / Response • No VCs needed • Same number of buffers • Buffers are power hungry H.S.Wang & L.S.Peh, MICRO 2003

  8. Protocol Activity • CMPs implement full blown coherence protocol • Some transitions are frequent [Hardavellas ISCA 2009] • Read clean block • Evict clean block • Write to unshared block • Other transitions needed for correctness (infrequent) • Read dirty block • Evict dirty • Write to shared block

  9. Frequent Read Protocol Activity Reader Directory Writer Short Req Read Req Short Req Short Resp Read Resp Long Resp Evict Clean Req

  10. Frequent Write Protocol Activity Writer Directory Short Req Fetch/Upgrade Req Short Req Short Resp Fetch Resp Long Resp Upgrade Resp

  11. Infrequent Read Protocol Activity Reader Directory Writer Short Req Downgrade Req Read Req Short Req Short Resp Long Resp Read Resp Downgrade Resp

  12. Infrequent Write Protocol Activity Writer Directory Reader 1 Reader 2 Fetch/Upgrade Req Short Req Inv Req Short Req Inv Req Short Resp Fetch Resp Inv Resp Long Resp Inv Resp Upgrade Resp Evict Dirty Req

  13. Traffic Analysis Request: 93% short Response: 86% long

  14. CCNoC Router Router NI Response Switch Request Switch Request network narrow: optimized for short messages Response network wide: optimized for long messages

  15. Previous Work • Balfour et al. ICS 2006 • Better than single large router • Read/Write traffic • Same number of reads and writes • Yoon et al. DAC 2010 • Physical channel better then virtual channel • Not optimized for cache coherent CMP • Running commercial server workloads

  16. Outline • Overview • Why CCNoC? • Dual-router design • Evaluation • Conclusions

  17. Evaluation Methodology • FLEXUS • Full system simulation • 16 or 8 UltraSPARC III ISA cores • Split I/D, 64KB L1 • 1 or 2 MB L2 • ORION 2.0 • power estimation • area estimation • Workloads • OLTP: TPC-C • IBM DB2 and Oracle • DSS: TPC-H • IBM DB2 • Q1, Q6, Q13, Q16 • Web: SPECweb99 • Apache and Zeus • Scientific: EM3D • Multiprogrammed: • SPEC2K • 2x: gcc, twolf, art, mcf

  18. Evaluation NoCs • Mesh-128 - baseline • 128 bit flit width • Torus - reference • 128 bit flit width • Mesh-176 – high performance • 176 bit flit width • CCNoC • Request: 48 bit flit width • Response: 128 bit flit width • Switches • Wormhole flow control • Input queued • Transmission protocol • On/Off • Input buffers • 2 entry

  19. Performance Performance loss: 2% Torus, 8% Mesh-176

  20. Power Savings Power savings: 16% Mesh-128, 22% Torus, 38% Mesh-176

  21. Conclusions • Duality in Request/Response traffic • Request: dominated by short messages • Response: dominated by long messages • Proposed CCNoC • Narrow request network • Wide response network • Showed significant power savings • 22% against Torus • 38% against Mesh-176

  22. Thank you! Q&A

More Related