3D Implemented SRAM/DRAM Hybrid Cache Architecture for High-Performance and Low Power Consumption. Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami, Kyushu University. Outline. Why 3D? Will 3D always work well? Support Adaptive Execution! Conclusions.
3D Implemented SRAM/DRAM Hybrid Cache Architecture for High-Performance and Low Power Consumption • Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami • Kyushu University
Outline • Why 3D? • Will 3D always work well? • Support Adaptive Execution! • Conclusions
From 2D to 3D! • Stack Multiple Dies • Connect Dies with Through-Silicon Vias (TSVs) • Stacking styles: Wire-bonding (WB) 3D stacking (System-in-Package or SiP); Package-on-Package (POP) 3D stacking; TSV Multi-Level 3D IC [Figure: multi-level 3D IC with Sensor, IO, RF, Analog, DRAM, and Processor layers] Source: Yuan Xie, “3D IC Design/Architecture,” Coolchips Special Session, 2009
Chip Implementation Examples from ISSCC’09 • Image Sensors • SRAM for SoCs • DRAM • Multi-core + SRAM connected with wireless TSVs [Figure: 8Gb 3D DRAM (Samsung); SRAM + Multicore (Keio Univ.); SRAM for SoCs (NEC); Image Sensor (MIT)] U. Kang et al., “8Gb DDR3 DRAM Using Through-Silicon-Via Technology,” ISSCC’09. H. Saito et al., “A Chip-Stacked Memory for On-Chip SRAM-Rich SoCs and Processors,” ISSCC’09. V. Suntharalingam et al., “A 4-Side Tileable Back Illuminated 3D-Integrated Mpixel CMOS Image Sensor,” ISSCC’09. K. Niitsu et al., “An Inductive-Coupling Link for 3D Integration of a 90nm CMOS Processor and a 65nm CMOS SRAM,” ISSCC’09.
Why 3D? (1/3) • Wire Length Reduction • Replace long, high capacitance wires by TSVs • Low Latency, Low Energy • Small footprint
Why 3D? (2/3) • Integration • From “Off-Chip” to “On-Chip” • Improved Communication • Low Latency, High Bandwidth, and Low Energy • Heterogeneous Integration • E.g. Emerging Devices
Why 3D? (3/3) [Figure: Performance Improvement (times, log scale from 0.1 to 100) and Power Consumption versus Process node (nm: 180, 130, 90, 65, 45, 32, 22, 15, 12), comparing Fine Process and Stacking] Source: N. Miyakawa, “3D Stacking Technology for Improvement of System Performance,” International Trade Partners Conference, Nov. 2008
Outline • Why 3D? • Will 3D always work well? • Support Adaptive Execution! • Conclusions
Importance of On-Chip Caches • Memory-Wall Problem • Memory bandwidth does not scale with the # of cores • Growing speed gap between processor cores and DRAMs • So, the problem becomes more serious • Let’s increase on-chip cache capacity, but… • Requires large chip area [Figure: die photos of Core 2 Duo (4MB L2 Cache) and Pentium4 (1MB L2 Cache), showing Core, Bus, and Level 2 cache areas] Source: http://www.atmarkit.co.jp/, http://www.chip-architect.com/
Will 3D always work well? “Stacking a DRAM Cache” • Ave. Memory Acc. Time = L1 Hit Time + L1 Miss Rate × (L2 Hit Time + L2 Miss Rate × Main-Memory Access Time) • Ave. Memory Acc. Time after stacking: ? [Figure: two cores with a 4MB Cache versus two cores with a Tag RAM and a stacked 32MB DRAM Cache]
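The trade-off behind the question mark can be tried out numerically with the two-level average memory access time expression. The following is a minimal sketch, not the authors' evaluation method: the function name `amat` and every hit time, miss rate, and memory latency below are hypothetical values chosen only for illustration.

```python
def amat(ht_l1, mr_l1, ht_l2, mr_l2, mem_time):
    """Average memory access time (cycles) for a two-level cache hierarchy."""
    return ht_l1 + mr_l1 * (ht_l2 + mr_l2 * mem_time)


# Hypothetical numbers (assumptions, not measurements from the slides).
# A small, fast L2 versus a large, slow L2 for a cache-size-sensitive program:
small_fast = amat(ht_l1=2, mr_l1=0.05, ht_l2=6, mr_l2=0.40, mem_time=200)
large_slow = amat(ht_l1=2, mr_l1=0.05, ht_l2=30, mr_l2=0.10, mem_time=200)

# The same large, slow L2 for a program whose L2 miss rate barely improves:
large_slow_insensitive = amat(ht_l1=2, mr_l1=0.05, ht_l2=30, mr_l2=0.38, mem_time=200)

print(f"small/fast L2:              {small_fast:.2f} cycles")
print(f"large/slow L2, sensitive:   {large_slow:.2f} cycles")
print(f"large/slow L2, insensitive: {large_slow_insensitive:.2f} cycles")
```

With these made-up inputs the larger cache helps only when its miss-rate reduction outweighs the longer L2 hit time, which is the tension the slide raises.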
Cache-Size Sensitivity Varies among Programs! [Figure: per-program plots, each labeled either “Sensitive!” or “Insensitive!”]
2D vs. 3D • 2D: 2MB SRAM Cache • 3D: 32MB DRAM Cache [Figure: Profit of 3D over 2D (the higher, the better) plotted against HTL2_OVERHEAD [cc] and MRL2_REDUCTION [points]; programs shown: 172.mgrid, LU, 171.swim, FMM, Ocean, 181.mcf, 256.bzip2, WaterSpatial, Cholesky, 188.ammp, Barnes, FFT, 179.art, 300.twolf, 301.apsi]
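The two axes above, L2 hit-time overhead and L2 miss-rate reduction, can be tied together through the average memory access time expression: per L2 access, the larger cache saves (miss-rate reduction × main-memory access time) cycles and costs the hit-time overhead. The sketch below works out the break-even point using the latencies listed in this deck's experimental setup (6, 28, and 181 clock cycles). The function names are hypothetical, and reading “points” as percentage points is an assumption.

```python
def break_even_reduction(ht_l2_overhead, mem_time):
    """Smallest L2 miss-rate reduction (as a fraction) at which the larger cache breaks even."""
    return ht_l2_overhead / mem_time


def profit(ht_l2_overhead, mr_l2_reduction, mem_time):
    """Cycles saved per L2 access by the larger cache; positive means 3D wins."""
    return mr_l2_reduction * mem_time - ht_l2_overhead


# Latencies from the experimental setup: SRAM L2 = 6, DRAM L2 = 28, main memory = 181 cycles.
overhead = 28 - 6
threshold = break_even_reduction(overhead, 181)
print(f"break-even miss-rate reduction: {100 * threshold:.1f} percentage points")
```

Under these assumptions a program needs its L2 miss rate to drop by roughly 12 percentage points before the stacked DRAM cache pays off; smaller reductions make the 3D configuration slower.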
Appropriate Cache Size Varies within Programs! [Figure: Ocean, behavior over the course of execution; the lower, the better]
Outline • Why 3D? • Will 3D always work well? • Adaptive Execution! • Conclusions
SRAM/DRAM Hybrid Cache Architecture • Support Two Operation Modes • High-Speed, Small Cache Mode (or SRAM Cache Mode) • Low-Speed, Large Cache Mode (or DRAM Cache Mode) • Adapt to variation of application behavior [Figure: SRAM Cache Mode: cores use the 4MB SRAM as the cache, 32MB DRAM Cache is power gated; DRAM Cache Mode: cores use the 32MB DRAM Cache, 4MB SRAM serves as Tag SRAM]
Microarchitecture (1/2) • 2-way set-associative SRAM Cache (Way 0, Way 1) • 2-way set-associative DRAM Cache (Way 0, Way 1) with tags in the 4MB Tag SRAM [Figure: two cores, the 4MB Tag SRAM, and the stacked 32MB DRAM Cache]
Microarchitecture (2/2) • 64b physical address: Tag field | Index | Offset • Assume Ld == Ls == 64B • SRAM (Size: Cs, Block: Ls, Asso. Ws) • DRAM (Size: Cd, Block: Ld, Asso. Wd) [Figure: tag comparators (=) and MUXes produce Data (SRAM), Hit/Miss (SRAM), Data (DRAM), and Hit/Miss (DRAM)]
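Because both modes share the same block size, the offset field is identical and only the index/tag boundary moves between modes. The sketch below illustrates that address split. The class `CacheGeometry` and its method names are hypothetical, power-of-two sizes are assumed, and the concrete sizes and 8-way associativity are taken from the experimental setup in this deck rather than from this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical helper; assumes size, block size, and associativity are powers of two."""
    size_bytes: int
    block_bytes: int
    ways: int

    @property
    def num_sets(self):
        return self.size_bytes // (self.block_bytes * self.ways)

    @property
    def offset_bits(self):
        return self.block_bytes.bit_length() - 1

    @property
    def index_bits(self):
        return self.num_sets.bit_length() - 1

    def split(self, addr):
        """Return (tag, index, offset) for a physical address."""
        offset = addr & (self.block_bytes - 1)
        index = (addr >> self.offset_bits) & (self.num_sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


MB = 1 << 20
sram = CacheGeometry(size_bytes=2 * MB, block_bytes=64, ways=8)    # SRAM Cache Mode
dram = CacheGeometry(size_bytes=32 * MB, block_bytes=64, ways=8)   # DRAM Cache Mode

addr = 0x12345678
print("SRAM mode (tag, index, offset):", sram.split(addr))
print("DRAM mode (tag, index, offset):", dram.split(addr))
```

With these sizes the DRAM-mode index is four bits wider than the SRAM-mode index, so the same address maps to a different set and a shorter tag depending on the operation mode.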
How to Adapt • Static Approach • Optimizes at program level • Does not change it during execution • Needs a static analysis • Dynamic Approach • Optimizes at interval level (or phase level) • Needs a run-time profiling [Figure: per-program plots for FFT, FMM, Barnes, and Ocean]
Experimental Set Up • Processor: In-Order, Core@3GHz • Benchmarks: SPEC CPU 2000, Splash2 • The operation mode is set at the beginning of the program execution (and is maintained until the end) • Assume an appropriate operation mode is known for each benchmark • Configurations: 2D-BASE, 3D-CONV, 3D-HYBRID • L1D, L1I Caches: 32KB, Access Lat.: 2 clock cycles • L2 SRAM Cache (2D): 2MB, 64B Block, 8way, Lat. 6 clock cycles • 3D DRAM L2 Cache: 32MB, 64B Block, 8way, Lat. 28 clock cycles • Main memory: Lat. 181 clock cycles
Evaluation Results [Figure: 2D-BASE, 3D-CONV, and 3D-HYBRID compared across Benchmark Programs]
Run-Time Mode Selection • Divide Program Execution into “epochs”, e.g. 200K L2 Misses • Predict an Appropriate Operation Mode for Next Epoch • Hardware Support for Measurement: in SRAM mode, a small tag RAM that stores sampled tags is used to predict DRAM-mode miss rates • if (condition) then transition from SRAM mode to DRAM mode! [Figure: Operation Mode (SRAM Cache Mode or DRAM Cache Mode) over epochs N-2, N-1, N, N+1]
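The epoch-based scheme can be sketched as a small control loop. The slide's exact switching condition is not recoverable from this text, so the rule below is an assumption: pick whichever mode has the lower estimated L2 cost (L2 hit time + L2 miss rate × main-memory latency), using the latencies from the experimental setup. The names `l2_cost`, `next_mode`, and `run` are hypothetical, and the sketch further assumes that miss-rate estimates for both modes are available in every epoch, whereas the slide only describes the prediction made while in SRAM mode.

```python
SRAM, DRAM = "SRAM", "DRAM"
HIT_TIME = {SRAM: 6, DRAM: 28}   # L2 latencies in clock cycles (experimental setup)
MEM_TIME = 181                   # main-memory latency in clock cycles


def l2_cost(mode, miss_rate):
    """Estimated cycles per L2 access in the given mode."""
    return HIT_TIME[mode] + miss_rate * MEM_TIME


def next_mode(sram_miss_rate, dram_miss_rate):
    """Assumed decision rule: choose the mode with the lower estimated cost."""
    if l2_cost(DRAM, dram_miss_rate) < l2_cost(SRAM, sram_miss_rate):
        return DRAM
    return SRAM


def run(epochs, start=SRAM):
    """epochs: list of (SRAM-mode miss rate, DRAM-mode miss rate) observed per epoch.

    Rates observed in epoch N select the mode used in epoch N+1.
    Returns the mode used in each epoch.
    """
    mode, trace = start, []
    for sram_mr, dram_mr in epochs:
        trace.append(mode)
        mode = next_mode(sram_mr, dram_mr)
    return trace


# Hypothetical per-epoch miss rates (fractions), not measured data.
trace = run([(0.10, 0.08), (0.50, 0.10), (0.50, 0.10), (0.12, 0.10)])
print(trace)
```

In this made-up trace the controller stays in SRAM mode while the predicted miss-rate gain is small and moves to DRAM mode one epoch after the gain becomes large, reflecting the one-epoch lag inherent in predicting from the previous epoch.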
Results [Figure: 2D-SRAM, DRAM-STACK, HYBRID-IDEAL, and HYBRID compared]
Outline • Why 3D? • Will 3D always work well? • Adaptive Execution! • Conclusions
Conclusions • The 3D solution is one of the most promising ways to achieve… • High performance • Low energy • It does not ALWAYS work well! • Run-time adaptive execution by considering memory access behavior
Acknowledgement • This research was supported in part by the New Energy and Industrial Technology Development Organization (NEDO)
Run-Time Mode Selection • Divide Program Execution into “epochs”, e.g. 200K L2 Misses • Predict an Appropriate Operation Mode for Next Epoch • Hardware Support for Measurement: in SRAM mode, a small tag RAM that stores sampled tags is used to predict DRAM-mode miss rates • if (condition) then transition from SRAM mode to DRAM mode! [Figure: Operation Mode (SRAM Cache Mode or DRAM Cache Mode) over epochs N-1, N, N+1, N+2]
Experimental Set Up • Processor: In-Order • Benchmarks: SPEC CPU 2000, Splash2 • L1 Cache: Size: 32KB, Access Time: 2 clock cycles • L2 Cache (SRAM Cache Mode): Size: 2MB, Access Time: 6 clock cycles • L2 Cache (DRAM Cache Mode): Size: 32MB, Access Time: 28 clock cycles • Main Memory: Access Time: 181 clock cycles