Brian Rogers †‡ , Anil Krishna †‡ , Gordon Bell ‡ , Ken Vu ‡ , Xiaowei Jiang † , Yan Solihin †

Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scalability
36th International Symposium on Computer Architecture
Brian Rogers†‡, Anil Krishna†‡, Gordon Bell‡, Ken Vu‡, Xiaowei Jiang†, Yan Solihin†
† ‡ NC STATE UNIVERSITY

As Process Technology Scales …
[Figure: many cores (P) and cache blocks ($) connected to off-chip DRAM]
Scaling the Bandwidth Wall -- ISCA 2009

Problem
• Core growth >> memory bandwidth growth
  • Cores: ~ exponential growth (driven by Moore’s Law)
  • Bandwidth: ~ much slower growth (pin and power limitations)
• At each relative technology generation (T):
  • (# Cores = $2^T$) >> (Bandwidth = $B^T$)
• Some key questions (our contributions):
  • How constraining is the increasing gap between the # of cores and the available memory bandwidth?
  • How should future CMPs be designed; how should we allocate transistors to caches and cores?
  • What techniques can best reduce memory traffic demand?
=> Build an analytical CMP memory bandwidth model

Agenda
• Background / Motivation
• Assumptions / Scope
• CMP Memory Traffic Model
• Alternate Views of Model
• Memory Traffic Reduction Techniques
  • Indirect
  • Direct
  • Dual
• Conclusions

Assumptions / Scope
• Homogeneous cores
• Single-threaded cores (multi-threading adds to the problem)
• Co-scheduled sequential applications
  • Multi-threaded apps with data sharing evaluated separately
• Enough work to keep all cores busy
• Workloads static across technology generations
• Equal amount of cache per core
• Power/energy constraints are outside the scope of this study

Cache Miss Rate vs. Cache Size
• Relationship follows the power law of Hartstein et al. (√2 rule): $M = M_0 \cdot R^{-\alpha}$
  • M0 = miss rate at the old cache size, M = miss rate at the new cache size
  • R = new cache size / old cache size
  • α = sensitivity of the workload to cache size change
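
A minimal sketch of the power law above, to make the √2 rule concrete. The function name and the baseline miss rate of 0.10 are hypothetical, chosen only for illustration; α = 0.5 is the value the deck uses in its later baseline example.

```python
def scaled_miss_rate(m0, size_ratio, alpha=0.5):
    """Miss rate after resizing a cache, per M = M0 * R**(-alpha).

    m0         -- miss rate at the old cache size (hypothetical input)
    size_ratio -- R = new cache size / old cache size
    alpha      -- workload sensitivity to cache size
    """
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return m0 * size_ratio ** (-alpha)


if __name__ == "__main__":
    # Hypothetical baseline miss rate of 10%.
    for ratio in (1, 2, 4):
        print(ratio, scaled_miss_rate(0.10, ratio))
```

With α = 0.5, doubling the cache divides the miss rate by √2 and quadrupling it halves the miss rate.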

CMP Traffic Model
• Express chip area in terms of Core Equivalent Areas (CEAs)
  • Core = 1 CEA, Unit_of_Cache = 1 CEA
  • P = # cores, C = # cache CEAs, N = P+C, S = C/P
• Assume that non-core and non-cache components require a constant fraction of area
• Add # of cores term for CMP model:

CMP Traffic Model (2)
P = # cores, C = # cache CEAs, N = P+C, S = C/P
• Going from CMP1=<P1,C1> to CMP2=<P2,C2>
• Remove common terms, express M2 in terms of M1
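
The equations on these two slides did not survive extraction. The following is a sketch of the derivation they describe, not a copy of the slide. It assumes that per-core miss traffic follows the power law in the cache per core S, and that total traffic M is P times the per-core traffic. The constants $m_0$ and $S_0$ (per-core traffic at a reference cache size) are illustrative names that do not appear in the deck; they cancel in the ratio.

```latex
\begin{align*}
M &= P \cdot m_0 \left(\frac{S}{S_0}\right)^{-\alpha}
  && \text{(power law per core, times the number of cores)} \\
\frac{M_2}{M_1} &= \frac{P_2}{P_1} \left(\frac{S_2}{S_1}\right)^{-\alpha}
  && \text{(common terms } m_0, S_0 \text{ removed)}
\end{align*}
```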

One Generation of Scaling
• Baseline processor: 8 cores, 8 cache CEAs
  • N1=16, P1=8, C1=8, S1=1, and ~ fully utilized BW
  • α = 0.5
• How many cores are possible if 32 CEAs are now available?
• Ideal scaling = 2X # of cores at each successive technology generation
[Figure: number of cores under ideal scaling vs. BW-limited scaling]

CMP Design Constraint
P = # cores, C = # cache CEAs, N = P+C, S = C/P
• If available off-chip BW grows by a factor of B:
  • Total memory traffic should grow by at most a factor of B each generation
• Write S2 in terms of P2 and N2: $S_2 = (N_2 - P_2)/P_2$
• New technology: N2 CEAs, B bandwidth => solve for P2 numerically
  • P2 is the # of cores that can be supported
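
A hedged sketch of the numerical solve described above. It assumes the traffic relation M2/M1 = (P2/P1) · (S2/S1)^(-α), which follows from the deck's definitions but is not shown in the extracted text, and it uses plain bisection, which is one reasonable choice and not necessarily the authors' method. Function names are hypothetical; the defaults are the baseline from the deck (8 cores, 8 cache CEAs, α = 0.5).

```python
def traffic_ratio(p2, n2, p1=8.0, c1=8.0, alpha=0.5):
    """Assumed model: M2/M1 = (P2/P1) * (S2/S1)**(-alpha), with S = C/P."""
    s1 = c1 / p1
    s2 = (n2 - p2) / p2          # S2 written in terms of P2 and N2
    return (p2 / p1) * (s2 / s1) ** (-alpha)


def max_cores(n2, b=1.0, p1=8.0, c1=8.0, alpha=0.5, tol=1e-9):
    """Largest (fractional) P2 whose traffic growth stays within factor b.

    traffic_ratio increases monotonically with p2 on (0, n2), so
    bisection on that interval finds the crossing point.
    """
    lo, hi = tol, n2 - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if traffic_ratio(mid, n2, p1, c1, alpha) <= b:
            lo = mid             # still within the bandwidth budget
        else:
            hi = mid             # too much traffic, back off
    return lo


if __name__ == "__main__":
    # 32 CEAs available, off-chip bandwidth unchanged (B = 1).
    print(max_cores(32, b=1.0))
    # Same area, but bandwidth doubles (B = 2).
    print(max_cores(32, b=2.0))
```

Under these assumptions, 32 CEAs at constant bandwidth support roughly 11 cores instead of the 16 of ideal scaling; if bandwidth also doubles, the solve returns 16.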

Scaling Under Area Constraints
• With an increasing # of CEAs available, how many cores can be supported at a constant BW requirement?
  • 2x die area: 1.4x cores
  • 4x die area: 1.9x cores
  • 8x die area: 2.4x cores
  • 16x die area: 3.2x cores
  • …

Categories of Techniques
• Indirect
  • Cache Compression
  • DRAM Caches
  • 3D-stacked Cache
  • Unused Data Filter
  • Smaller Cores
• Direct
  • Link Compression
  • Sectored Caches
• Dual
  • Cache+Link Compression
  • Small Cache Lines
  • Data Sharing

Indirect – DRAM Cache
• F – influenced by increased density
[Figure: core scaling with a DRAM cache compared against ideal scaling]

Direct – Link Compression
• R – influenced by compression ratio
[Figure: core scaling with link compression compared against ideal scaling]

Dual – Small Cache Lines
• F, R – influenced by % unused data
[Figure: core scaling with small cache lines compared against ideal scaling]
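
The three technique slides name factors F and R but their plots are lost. The sketch below shows one way such factors could enter the traffic model. It assumes F multiplies the effective cache per core (indirect), R divides the off-chip traffic (direct), and a dual technique sets both. This reading, the function names, and the example values F = 8 and R = 2 are assumptions for illustration, not figures from the deck.

```python
def traffic_growth(p2, n2, f=1.0, r=1.0, p1=8.0, c1=8.0, alpha=0.5):
    """Assumed model: M2/M1 = (P2/P1) * (F*S2/S1)**(-alpha) / R."""
    s1 = c1 / p1
    s2 = (n2 - p2) / p2
    return (p2 / p1) * (f * s2 / s1) ** (-alpha) / r


def supportable_cores(n2, f=1.0, r=1.0, b=1.0):
    """Bisection for the largest P2 with traffic growth <= b."""
    lo, hi = 1e-9, n2 - 1e-9
    for _ in range(100):         # far more halvings than float precision needs
        mid = (lo + hi) / 2.0
        if traffic_growth(mid, n2, f=f, r=r) <= b:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    n2 = 32                      # one generation past the 16-CEA baseline
    print("no technique   :", supportable_cores(n2))
    print("indirect, F = 8:", supportable_cores(n2, f=8.0))   # hypothetical F
    print("direct,   R = 2:", supportable_cores(n2, r=2.0))   # hypothetical R
```

In this sketch a direct factor R acts exactly like extra bandwidth, while an indirect factor F is damped by the exponent α because it only helps through the miss-rate curve.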

Dual – Data Sharing
• Please see the paper for details on the modeling of sharing
• Data sharing is unlikely to provide a scalable solution

Summary of Individual Techniques
[Figure: comparison of individual techniques, grouped as Indirect, Direct, and Dual]

Summary of Combined Techniques

Conclusions
• Contributions
  • Simple, powerful analytical CMP memory traffic model
  • Quantify the significance of the memory BW wall problem
    • 10% of chip area for cores in 4 generations under a constant traffic requirement
  • Guide design (cores vs. cache) of future CMPs
    • Given fixed chip area and BW scaling, how many cores?
  • Evaluate memory traffic reduction techniques
    • Combinations can enable ideal scaling for several generations
• Need bandwidth-efficient computing:
  • Hardware/architecture level: DRAM caches, cache/link compression, prefetching, smarter memory controllers, etc.
  • Technology level: 3D chips, optical interconnects, etc.
  • Application level: working set reduction, locality enhancement, data vs. pipelined parallelism, computation vs. communication, etc.

Questions?
Thank You
Brian Rogers
bmrogers@ece.ncsu.edu
