1 / 51

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses. Kevin Chang. Donghyuk Lee, Zeshan Chishti , Alaa Alameldeen , Chris Wilkerson, Yoongu Kim, Onur Mutlu. Executive Summary. DRAM refresh interferes with memory accesses Degrades system performance and energy efficiency

zocha
Download Presentation

Improving DRAM Performance by Parallelizing Refreshes with Accesses

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Improving DRAM Performance by Parallelizing Refresheswith Accesses Kevin Chang Donghyuk Lee, ZeshanChishti, AlaaAlameldeen, Chris Wilkerson, Yoongu Kim, OnurMutlu

  2. Executive Summary • DRAM refresh interferes with memory accesses • Degrades system performance and energy efficiency • Becomes exacerbated as DRAM density increases • Goal: Serve memory accesses in parallel with refreshes to reduce refresh interference on demand requests • Our mechanisms: • 1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms • 2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays • Improve system performance and energy efficiency for a wide variety of different workloads and DRAM densities • 20.2% and 9.0% for 8-core systems using 32Gb DRAM • Very close to the ideal scheme without refreshes

  3. Outline • Motivation and Key Ideas • DRAM and Refresh Background • Our Mechanisms • Results

  4. Refresh Penalty Accesstransistor Memory Controller Processor Capacitor DRAM Refresh Read Refresh interferes with memory accesses Data Refresh delays requests by 100s of ns

  5. Existing Refresh Modes All-bank refresh in commodity DRAM (DDRx) Bank 7 Time Refresh … Bank 1 Bank 0 Per-bank refresh allows accesses to other banks while a bank is refreshing Per-bank refresh in mobile DRAM (LPDDRx) Round-robin order Bank 7 Time … … Bank 1 Bank 0

  6. Shortcomings of Per-Bank Refresh • Problem 1: Refreshes to different banks are scheduled in a strict round-robin order • The static ordering is hardwired into DRAM chips • Refreshes busy banks with many queued requests when other banks are idle • Key idea: Schedule per-bank refreshes to idle banks opportunistically in a dynamic order

  7. Shortcomings of Per-Bank Refresh • Problem 2: Banks that are being refreshed cannot concurrently serve memory requests Bank 0 Delayed by refresh Per-Bank Refresh RD Time

  8. Shortcomings of Per-Bank Refresh • Problem 2: Refreshing banks cannot concurrently serve memory requests • Key idea: Exploit subarrays within a bank to parallelize refreshes and accesses across subarrays Bank 0 RD Time Subarray 1 Subarray 0 Time Subarray Refresh Parallelize

  9. Outline • Motivation and Key Ideas • DRAM and Refresh Background • Our Mechanisms • Results

  10. DRAM System Organization • Banks can serve multiple requests in parallel DRAM Rank 1 Rank 1 Rank 0 Bank 7 … Bank 1 Bank 0

  11. DRAM Refresh Frequency • DRAM standard requires memory controllers to send periodic refreshes to DRAM Read/Write: roughly 50ns tRefLatency (tRFC): Varies based on DRAM chip density (e.g., 350ns) tRefPeriod (tREFI): Remains constant Timeline

  12. Increasing Performance Impact • DRAM is unavailable to serve requests forof time • 6.7% for today’s 4Gb DRAM • Unavailability increases with higher density due to higher tRefLatency • 23% / 41%for future 32Gb / 64Gb DRAM tRefLatency tRefPeriod

  13. All-Bank vs. Per-Bank Refresh • Shorter tRefLatencythan that of all-bank refresh • More frequent refreshes (shorter tRefPeriod) All-Bank Refresh: Employed in commodity DRAM (DDRx, LPDDRx) Refresh Read Refresh Staggered across banks to limit power Read Refresh Read Per-Bank Refresh: In mobile DRAM (LPDDRx) Read Refresh Can serve memory accesses in parallel with refreshesacross banks Bank 1 Bank 1 Timeline Timeline Refresh Bank 0 Bank 0

  14. Shortcomings of Per-Bank Refresh • 1) Per-bank refreshes are strictly scheduled in round-robin order (as fixed by DRAM’s internal logic) • 2) A refreshing bank cannot serve memory accesses Goal: Enable more parallelization between refreshes and accesses using practical mechanisms

  15. Outline • Motivation and Key Ideas • DRAM and Refresh Background • Our Mechanisms • 1. Dynamic Access-Refresh Parallelization (DARP) • 2. Subarray Access-Refresh Parallelization (SARP) • Results

  16. Our First Approach: DARP • Dynamic Access-Refresh Parallelization (DARP) • An improved scheduling policy for per-bank refreshes • Exploits refresh scheduling flexibility in DDR DRAM • Component 1: Out-of-order per-bank refresh • Avoids poor static scheduling decisions • Dynamically issues per-bank refreshes to idle banks • Component 2: Write-Refresh Parallelization • Avoids refresh interference on latency-critical reads • Parallelizes refreshes with a batch of writes

  17. 1) Out-of-Order Per-Bank Refresh • Dynamic scheduling policy that prioritizes refreshes to idle banks • Memory controllers decide which bank to refresh

  18. 1) Out-of-Order Per-Bank Refresh Baseline: Round robin Request queue (Bank 1) Request queue (Bank 0) Read Read Bank 1 Timeline Read Refresh Bank 0 Read Refresh Reduces refresh penalty on demand requests by refreshing idle banks first in a flexible order Our mechanism: DARP Saved cycles Delayed by refresh Saved cycles Refresh Read Bank 1 Refresh Read Bank 0

  19. Outline • Motivation and Key Ideas • DRAM and Refresh Background • Our Mechanisms • 1. Dynamic Access-Refresh Parallelization (DARP) • 1) Out-of-Order Per-Bank Refresh • 2) Write-Refresh Parallelization • 2. Subarray Access-Refresh Parallelization (SARP) • Results

  20. Refresh Interference on Upcoming Requests • Problem: A refresh may collide with an upcoming request in the near future Bank 1 Time Read Bank 0 Refresh Delayed by refresh Read

  21. DRAM Write Draining • Observations: • 1) Bus-turnaround latency when transitioning from writes to reads or vice versa • To mitigate bus-turnaround latency, writes are typically drained to DRAM in a batch during a period of time • 2) Writes are not latency-critical Turnaround Bank 1 Timeline Write Write Write Read Bank 0

  22. 2) Write-Refresh Parallelization • Proactively schedules refreshes when banks are serving write batches Turnaround Turnaround Baseline Timeline Bank 1 Read Write Write Write Bank 0 Read Refresh Avoids stalling latency-critical read requests by refreshing with non-latency-critical writes Write-refresh parallelization Saved cycles Delayed by refresh Timeline Bank 1 Read Write Write Write Read Refresh Bank 0 Refresh 2. Refresh during writes 1. Postpone refresh

  23. Outline • Motivation and Key Ideas • DRAM and Refresh Background • Our Mechanisms • 1. Dynamic Access-Refresh Parallelization (DARP) • 2. Subarray Access-Refresh Parallelization (SARP) • Results

  24. Our Second Approach: SARP Observations: 1. A bank is further divided into subarrays • Each has its own row buffer to perform refresh operations 2. Some subarrays and bank I/O remain completely idle during refresh Bank 7 … Bank 1 Bank 0 Row Buffer Subarray Idle Bank I/O

  25. Our Second Approach: SARP • Subarray Access-Refresh Parallelization (SARP): • Parallelizes refreshes and accesses within a bank

  26. Our Second Approach: SARP • Subarray Access-Refresh Parallelization (SARP): • Parallelizes refreshes and accesses within a bank Bank 7 … Read Refresh Data Bank 1 Bank 0 Bank 1 Subarray Timeline Subarray 1 Refresh Bank I/O Subarray 0 Read Very modest DRAM modifications: 0.71% die area overhead

  27. Outline • Motivation and Key Ideas • DRAM and Refresh Background • Our Mechanisms • Results

  28. Methodology • 100 workloads: SPEC CPU2006, STREAM, TPC-C/H, random access • System performance metric: Weighted speedup Simulator configurations DDR3 Rank Memory Controller Bank 7 … 8-coreprocessor Bank 1 Memory Controller Bank 0 L1 $: 32KB L2 $: 512KB/core

  29. Comparison Points • All-bank refresh [DDR3, LPDDR3, …] • Per-bank refresh [LPDDR3] • Elastic refresh [Stuecheli et al., MICRO ‘10]: • Postpones refreshes by a time delay based on the predicted rank idle time to avoid interference on memory requests • Proposed to schedule all-bank refreshes without exploiting per-bank refreshes • Cannot parallelize refreshes and accesses within a rank • Ideal (no refresh)

  30. System Performance 7.9% 12.3% 20.2% 1. Both DARP & SARP provide performance gains and combining them (DSARP) improves even more 2. Consistent system performance improvement across DRAM densities (within 0.9%, 1.2%, and 3.8% of ideal)

  31. Energy Efficiency 9.0% 5.2% 3.0% Consistent reduction on energy consumption

  32. Other Results and Discussion in the Paper • Detailed multi-core results and analysis • Result breakdown based on memory intensity • Sensitivity results on number of cores, subarray counts, refresh interval length, and DRAM parameters • Comparisons to DDR4 fine granularity refresh

  33. Executive Summary • DRAM refresh interferes with memory accesses • Degrades system performance and energy efficiency • Becomes exacerbated as DRAM density increases • Goal: Serve memory accesses in parallel with refreshes to reduce refresh interference on demand requests • Our mechanisms: • 1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms • 2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays • Improve system performance and energy efficiency for a wide variety of different workloads and DRAM densities • 20.2% and 9.0% for 8-core systems using 32Gb DRAM • Very close to the ideal scheme without refreshes

  34. Improving DRAM Performance by Parallelizing Refresheswith Accesses Kevin Chang Donghyuk Lee, ZeshanChishti, AlaaAlameldeen, Chris Wilkerson, Yoongu Kim, OnurMutlu

  35. Backup

  36. Comparison to Concurrent Work • Zhang et al., HPCA’14 • Ideas: • 1) Sub-rank refresh→ refreshes a subset of banks within a rank • 2) Subarray refresh → refreshes one subarray at a time • 3) Dynamic sub-rank refresh scheduling policies • Similarities: • 1) Leverage idle subarrays to serve accesses • 2) Schedule refreshes to idle banks first • Differences: • 1) Exploit write draining periods to hide refresh latency • 2) We provide detailed analysis on existing per-bank refresh in mobile DRAM • 3) Concrete description on our scheduling algorithm

  37. Performance Impact of Refreshes • Refresh penalty exacerbates as density grows Technology Feature Trend Current Future 43% Potential Range 23% 6.7% (By year 2020*) *ITRS Roadmap, 2011

  38. Temporal Flexibility • DRAM standard allows a few refresh commands to be issued early or late DRAMTimeline 3 4 2 5 6 1 tRefreshPeriod 3 2 4 5 1 Delayed by 1 refresh command 5 2 3 4 6 7 1 Ahead by 1 refresh command

  39. Refresh • Fixed number of refresh commands to refresh entire DRAM: Row1 Row1 DRAMTimeline … 1 2 3 N N+1 Row1 Row1 Row1 DRAMTimeline … 1 N+1 N+1

  40. Unfairness ( ) Our mechanisms do not unfairly slow down specific applications to gain performance

  41. Power Overhead Power overhead to parallelize a refresh operation and accesses over a four-activate window: Activate current Refresh current Extend both tFAW and tRRD timing parameters:

  42. Refresh Interval (7.8μs) 3.3% 5.3% 9.1%

  43. Die Area Overhead • Rambus DRAM model with 55nm • SARP area overhead: 0.71% in a 2Gb DRAM chip

  44. System Performance

  45. Effect of Memory Intensity

  46. DDR4 FGR

  47. Performance Breakdown • Out-of-order refresh improves performance by 3.2%/3.9%/3.0% over 8/16/32Gb DRAM • Write-refresh parallelization provides additional benefits of 4.3%/5.8%/5.2%

  48. tFAW Sweep Baseline

  49. Performance Degradation using Per-Bank Refresh Per-Bank Refresh Pathological latency = 3.5 * tRefLatency_AllBank

  50. Our Second Approach: SARP • Subarray Access-Refresh Parallelization (SARP): • Parallelizes refreshes and accesses within a bank • Problem: Shared address path for refreshes and accesses • Solution: Decouple the shared address path Access or Refresh Subarray Bank I/O

More Related