Towards Scalable and Energy-Efficient Memory System Architectures Rajeev Balasubramonian School of Computing University of Utah
Main Memory Problems
[Figure: processor connected to two DIMMs]
1. Energy
2. High capacity at high bandwidth
3. Reliability
Motivation: Memory Energy
• Contributions of memory to overall system energy:
  • 25-40% across IBM, Sun, and Google server data, as summarized by Meisner et al., ASPLOS'09
  • HP servers: 175 W out of ~785 W for 256 GB of memory (HP power calculator)
  • Intel SCC: the memory controller contributes 19-69% of chip power, ISSCC'10
Motivation: Reliability
• DRAM data from Schroeder et al., SIGMETRICS'09:
  • 25K-70K errors per billion device-hours per Mbit
  • 8% of DRAM DIMMs affected by errors every year
• DRAM error rates may get worse as scalability limits are reached; PCM error rates (hard and soft) are expected to be high as well
• Primary concern: storage and energy overheads for error detection and correction
• ECC support is not too onerous; chipkill is much worse
Motivation: Capacity, Bandwidth
[Figure: processor connected to two DIMMs]
• Cores are increasing, but pins are not
• High channel frequency → fewer DIMMs per channel
• Will eventually need disruptive shifts: NVM, optics
• Can't have high capacity, high bandwidth, and low energy: pick 2 of the 3!
Memory System Basics
[Figure: processor with multiple on-chip memory controllers, each driving a DIMM]
• Multiple on-chip memory controllers handle multiple 64-bit channels
Memory System Basics: FB-DIMM
[Figure: processor with memory controllers, each driving a daisy-chain of buffered DIMMs]
• FB-DIMM can boost capacity with narrow channels and buffering at each DIMM
What's a Rank?
[Figure: eight x8 DRAM chips on a DIMM feeding a 64b channel]
• Rank: the set of DRAM chips required to provide the 64b output expected by a JEDEC standard bus
• For example: eight x8 DRAM chips
What's a Bank?
[Figure: one bank spanning a portion of each x8 DRAM chip in the rank]
• Bank: a portion of a rank that is tied up when servicing a request; multiple banks in a rank enable parallel handling of multiple requests
What's an Array?
[Figure: banks subdivided into arrays within each x8 DRAM chip]
• Array: a matrix of cells; one array provides 1 bit/cycle
• Each array reads out an entire row
• Large array → high density
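Putting rank, bank, and array together: the memory controller splits a physical address into channel, rank, bank, row, and column fields. A minimal decode sketch follows; the field widths and their bit ordering are illustrative assumptions, not any specific product's mapping.

```python
# Illustrative physical-address decode; widths and ordering are assumed.
def decode(addr, col_bits=10, bank_bits=3, rank_bits=1, row_bits=14):
    col = addr & ((1 << col_bits) - 1);   addr >>= col_bits
    bank = addr & ((1 << bank_bits) - 1); addr >>= bank_bits
    rank = addr & ((1 << rank_bits) - 1); addr >>= rank_bits
    row = addr & ((1 << row_bits) - 1);   addr >>= row_bits
    return {"channel": addr, "rank": rank, "bank": bank, "row": row, "col": col}

print(decode(0x3F2A1C40))  # which channel/rank/bank/row/column serves this address
```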
What's a Row Buffer?
[Figure: DRAM array with wordlines and bitlines; RAS latches a row into the row buffer, CAS selects bits for the output pin]
Row Buffer Management
• Row buffer: the collection of rows read out by the arrays in a bank
• Row buffer hits incur low latency and low energy
• Bitlines must be precharged before a new row can be read
• Open-page policy: delays the precharge until a different row is encountered
• Close-page policy: issues the precharge immediately (both policies are sketched below)
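To make the policy trade-off concrete, here is a minimal sketch (a toy model, not the talk's simulator) that counts row buffer hits for one bank under each policy:

```python
def service(rows, policy):
    """Return the number of row-buffer hits for a stream of row addresses."""
    open_row = None            # row currently latched in the row buffer
    hits = 0
    for row in rows:
        if open_row == row:
            hits += 1          # hit: no precharge or activate needed
        # miss path: precharge the bitlines (if a row is open), then activate
        open_row = row if policy == "open" else None  # close-page precharges now
    return hits

stream = [3, 3, 3, 7, 7, 3]      # rows touched by successive requests
print(service(stream, "open"))   # 3 hits: locality rewards the open-page policy
print(service(stream, "close"))  # 0 hits: every access pays a full activate
```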
Primary Sources of Energy Inefficiency
• Overfetch: 8 KB of data read out for each cache line request (quantified below)
• Poor row buffer hit rates: diminished locality in multi-cores
• Electrical medium: bus speeds have been increasing
• Reliability measures: overhead in building a reliable system from inherently unreliable parts
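The overfetch bullet is easy to quantify: with the 8 KB row above and an assumed 64 B cache line, a row activation moves 128x more data than the request needs.

```python
# Back-of-the-envelope overfetch: one 64 B line served per 8 KB row activation.
row_bytes, line_bytes = 8 * 1024, 64
print(f"row utilization on a miss: {line_bytes / row_bytes:.2%}")  # 0.78%
```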
SECDED Support
[Figure: 64-bit data word plus 8-bit ECC]
• One extra x8 chip per rank
• Storage and energy overhead of 12.5%
• Cannot handle the complete failure of one chip
Chipkill Support I
[Figure: 64-bit data word plus 8-bit ECC, at most one bit from each DRAM chip]
• Use 72 DRAM chips to read out 72 bits
• Dramatic increase in activation energy and overfetch
• Storage overhead is still 12.5%
Chipkill Support II
[Figure: 8-bit data word plus 5-bit ECC, at most one bit from each DRAM chip]
• Use 13 DRAM chips to read out 13 bits
• Storage and energy overhead: 62.5%
• Other options exist; the trade-off is between energy and storage (overheads checked below)
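The storage overheads quoted on the last three slides follow directly from the ECC-to-data bit ratios; a quick check:

```python
# ECC storage overhead = ECC bits / data bits for each organization above.
schemes = [
    ("SECDED: 64b data + 8b ECC",         64, 8),
    ("Chipkill I: 72 chips, 1 bit each",  64, 8),
    ("Chipkill II: 13 chips, 1 bit each",  8, 5),
]
for name, data_bits, ecc_bits in schemes:
    print(f"{name}: {ecc_bits / data_bits:.1%}")
# prints 12.5%, 12.5%, 62.5%, matching the slides
```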
Summary So Far
We now understand…
• why memory energy is a problem: overfetch, row buffer miss rates
• why reliability incurs high energy overheads: chipkill support requires high activation energy per useful bit
• why capacity and bandwidth increases cost energy: they need high frequency and buffering per hop
Crucial Timing
Disruptive changes may be compelling today…
• Increasing role of memory energy
• Increasing role of memory errors
• Impact of multi-core: high bandwidth needs, loss of locality
• Emerging technologies (NVM, optics):
  • will require a revamp of memory architecture
  • ideas can be easily applied to NVM
  • the role of DRAM may change
Attacking the Problem
• Find ways to maximize row buffer utility
• Find ways to reduce overfetch
• Treat reliability as a first-class design constraint
• Use photonics and 3D to boost capacity and bandwidth
• Solutions must be very cost-sensitive
Maximizing Row Buffer Locality
• Micro-pages (ASPLOS'10)
• Handling multiple memory controllers (PACT'10)
• Ongoing work: better write scheduling, better bank management (data mapping, row closure)
Micro-Pages
• Key observation: most accesses to a page are localized to a small region (a micro-page)
Solution
• Identify hot micro-pages
• Co-locate hot micro-pages in reserved DRAM rows
• The memory controller keeps track of the re-direction (sketched below)
• Overheads are low if applications have a few hot micro-pages that account for most memory accesses
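A sketch of the re-direction idea; the micro-page size, table shape, and method names are illustrative assumptions, not details from the ASPLOS'10 paper.

```python
MICRO_PAGE = 1024  # assumed micro-page size in bytes

class Redirector:
    """Toy model of the memory controller's hot micro-page re-direction."""
    def __init__(self, reserved_rows):
        self.table = {}                  # original micro-page -> reserved slot
        self.free = list(reserved_rows)  # slots in the reserved DRAM rows

    def promote(self, micro_page):
        # Co-locate a hot micro-page in a reserved row, if space remains.
        if micro_page not in self.table and self.free:
            self.table[micro_page] = self.free.pop()

    def translate(self, addr):
        # Every access consults the table; cold micro-pages pass through.
        mp, offset = divmod(addr, MICRO_PAGE)
        return self.table.get(mp, mp) * MICRO_PAGE + offset

r = Redirector(reserved_rows=[100, 101])
r.promote(7)                            # micro-page 7 identified as hot
print(hex(r.translate(7 * 1024 + 16)))  # redirected into a reserved row
print(hex(r.translate(9 * 1024)))       # cold: address unchanged
```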
Results
• Overall: 9% improvement in performance and 15% reduction in energy
Handling Multiple Memory Controllers
[Figure: processor with multiple memory controllers and DIMMs]
• Data mapping across multiple memory controllers is key:
  • Must equalize load and queuing delays
  • Must minimize "distance"
  • Must maximize row buffer hit rates
Solution
• A cost function guides initial page placement (sketched below)
• A similar cost function guides page migration
• Initial page placement improves performance by 7%, page migration by 9%
• Row buffer hit rates can be doubled
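A placement cost function might combine the three criteria from the previous slide roughly as follows; the weights and terms here are illustrative assumptions, not the PACT'10 formulation.

```python
def cost(mc, w_load=1.0, w_dist=0.5, w_rb=2.0):
    """Lower is better: penalize load and distance, reward row buffer hits."""
    return (w_load * mc["queue_len"]      # equalize load and queuing delays
            + w_dist * mc["hops"]         # minimize "distance"
            - w_rb * mc["row_hit_rate"])  # maximize row buffer hit rates

controllers = [
    {"id": 0, "queue_len": 12, "hops": 1, "row_hit_rate": 0.6},
    {"id": 1, "queue_len": 3,  "hops": 2, "row_hit_rate": 0.4},
]
best = min(controllers, key=cost)         # place (or migrate) the page here
print(f"place page at memory controller {best['id']}")
```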
Reducing Overfetch
Key idea: eliminate overfetch by employing smaller arrays and activating a single array in a single chip: Single Subarray Access (SSA), ISCA'10 (mapping sketched below)
Positive effects:
• Minimizes activation energy
• Small activation footprint: more arrays can be asleep longer
• Enables higher parallelism and reduces queuing delays
Negative effects:
• Longer transfer time
• Drop in density
• No row buffer hits
• Vulnerable to chip failure
• Change to standards
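A sketch of the mapping change SSA implies, with chip and subarray counts assumed: the whole cache line sits in one subarray of one chip, so a read activates one small array instead of a row in all eight chips of the rank.

```python
CHIPS, SUBARRAYS = 8, 16  # assumed rank width and subarrays per chip

def ssa_target(line_addr):
    """Map a cache-line address to the single (chip, subarray) holding it."""
    return line_addr % CHIPS, (line_addr // CHIPS) % SUBARRAYS

print("conventional read: activates a row in all", CHIPS, "chips")
print("SSA read: activates one (chip, subarray):", ssa_target(0x1234))
```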
Energy Results
• Dynamic energy reduction of 6x
• In some cases, a 3x reduction in leakage
Performance Results
• SSA is better on half the programs (the memory-intensive ones)
Support for Reliability
• Checksum support per row allows low-cost error detection
• A second-tier error-correction scheme can be built on top, based on RAID (sketched below)
[Figure: data rows with per-row checksums spread across DRAM chips, plus a parity chip]
• Reads: a single array read
• Writes: two array reads and two array writes (old data and old parity are read to compute the new parity)
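A sketch of the RAID-style second tier under an assumed layout, with one parity chip XOR-ing the corresponding rows of the data chips: a row whose checksum fails is rebuilt from the surviving chips, and a write must read the old data and old parity before writing both anew, which is where the two reads and two writes come from.

```python
from functools import reduce

def parity(rows):
    """XOR the same-index row from every chip (toy rows as byte strings)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), rows)

chips = [bytes([i] * 4) for i in range(1, 8)]  # 7 data chips, 4 B toy rows
p = parity(chips)                              # parity row on the extra chip
rebuilt = parity(chips[:3] + chips[4:] + [p])  # chip 3 fails its checksum...
assert rebuilt == chips[3]                     # ...and is rebuilt from the rest
```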
Capacity and Bandwidth
• Silicon photonics can break the pin barrier at the processor
• But several concerns remain at the DIMM:
  • Breaking the DRAM pin barrier will impact cost!
  • High capacity → daisy-chaining and loss of power
  • High static power for photonics; need high utilization
  • Scheduling for large capacities
Exploiting 3D Stacks (ISCA'11)
[Figure: 3D stack of DRAM chips on an interface die with a stack controller, linked to the processor's memory controller by a waveguide]
• An interface die provides photonic penetration into the stack
• Does not impact DRAM design
• Few photonic hops; high utilization
• The interface die schedules low-level operations
Packet-Based Scheduling Protocol
• High capacity → high scheduling complexity
• Move to a packet-based interface (sketched below):
  • The processor issues an address request
  • The processor reserves a slot for the data return
  • Scheduling minutiae are handled by the stack controller
  • Data is returned at the correct time
  • A back-up slot is used in case the deadline is not met
• Better plug'n'play
• Reduced complexity at the processor
• Can handle heterogeneity
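A toy model of the request/return handshake described above; the timing parameters and field names are illustrative assumptions, not the protocol's actual packet format.

```python
def issue_request(addr, now, latency_estimate, backup_gap=8):
    """Processor side: send an address packet and reserve a data-return slot."""
    slot = now + latency_estimate
    return {"addr": addr, "slot": slot, "backup": slot + backup_gap}

def stack_controller(req, actual_latency, now):
    """Stack side: hide the scheduling minutiae; pick the slot that can be met."""
    done = now + actual_latency
    return req["slot"] if done <= req["slot"] else req["backup"]

req = issue_request(0xDEAD, now=0, latency_estimate=40)
print("data returns in slot", stack_controller(req, actual_latency=47, now=0))
# the estimate was missed, so the back-up slot (cycle 48) is used
```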
Summary
• Treat reliability as a first-order constraint
• Possible to use photonics to break the pin barrier without disrupting memory chip design: boosts bandwidth and capacity!
• Can reduce memory chip energy by reducing overfetch and with better row buffer management
Acks
• Terrific students in the Utah Arch group
• Prof. Al Davis (Utah) and collaborators at HP, Intel, IBM
• Funding from NSF, Intel, HP, and the University of Utah