Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM

Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM. Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu. (alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, yilu4@illinois.edu.


Presentation Transcript


  1. Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu (alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, yilu4@illinois.edu ACM Sigmetrics/Performance 2012

  2. What Is Embedded DRAM?
     • 2nd Most Common Embedded Memory
     • Consists of 1 Transistor, 1 Capacitor cell
     • 2X-3X denser than SRAM
     • 2X-4X slower than SRAM
     • Supported by Key ASIC and IP Vendors
       • IBM, TSMC, NEC, Mosys, ST
     • Used in a Number of Applications
       • Servers, Networking, Storage, Gaming, Mobile
     • Industry Examples
       • IBM's P7
       • Sony PlayStations, Nintendo GameCube, Wii
       • Apple iPhone, Microsoft Zune HD, Xbox 360
       • Cisco Catalyst 3K-10K
     [Figure: eDRAM 1T1C memory cell, labeled Select, Data, and Storage Capacitor]

  3. Problem: eDRAM Refresh Causes Memory Bandwidth Loss
     • DRAM Capacitor has Finite Retention Time (W = Tref)
     • Example: W = 18us @ 100C = 4050 cycles @ 225 MHz
     • Example: R = 64 rows. All 64 rows will lose data in 4050 cycles!
     • Solution: Periodic Refresh: Reserve Refresh Cycles for Every Cell in Memory
     • Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ≈ 1.58%
     [Figure: a single bank with rows 1 to R, an R/W port, and a refresh port]
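The arithmetic on this slide can be reproduced with a short sketch. The function names `retention_cycles` and `periodic_refresh_loss` are hypothetical helpers introduced here for illustration; the `banks` parameter anticipates the RB/W form quoted on the following slide and defaults to the single-bank case.

```python
def retention_cycles(retention_us: float, clock_mhz: float) -> int:
    """Convert a retention time in microseconds to clock cycles (us * MHz = cycles)."""
    return round(retention_us * clock_mhz)


def periodic_refresh_loss(rows: int, retention_in_cycles: int, banks: int = 1) -> float:
    """Fraction of cycles reserved when every row of every bank is refreshed once per W."""
    return rows * banks / retention_in_cycles


# Numbers from the slide: W = 18 us at 225 MHz, R = 64 rows, one bank.
w = retention_cycles(18, 225)
loss = periodic_refresh_loss(rows=64, retention_in_cycles=w)
print(f"W = {w} cycles, periodic refresh loss = {loss:.2%}")
```

With the slide's numbers this gives W = 4050 cycles and a loss of about 1.58%, matching the figures quoted above.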

  4. Trend: Higher Density Multi-banked Macros (Mb/mm2)
     (1) More Rows are Packed Together and Need to be Refreshed
     (2) More Banks are Packed Together and Need to be Refreshed
     (3) Smaller Capacitor with Lower Geometry → Smaller W
     (4) Smaller W with Higher Temperature
     (5) Low Clock Speed Mode Decreases 'Clock-time' to Refresh
     Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB). Does not Scale with Larger Macros, Geometry & Low Power Modes.
     [Figure: memory banks 1 to B, each with rows 1 to R, and shared refresh and R/W ports 1 to M; circuitry is shared to conserve area]

  5. Examples of Periodic Refresh with Multi-banked Macros
     M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time
     The Problem is Only Getting Worse Over Time …

  6. Vendor Solution: Concurrent Refresh
     Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed
     [Figure: memory banks 1 to B, each with rows 1 to R, R/W ports 1 to M, and a concurrent refresh port]
     ++T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. Solid-State Circuits, IEEE Journal of, 40(6):1377–1387, June 2005.

  7. How is Concurrent Refresh Used Today?
     [Figure: memory banks 1 to B with per-bank pointers labeled RP1, RP2, ..., RP16, the accessed bank, a Next Concurrent Refresh Pointer, and a Deficit Register (bank, count)]
     Deficit Register Tracks Non-refreshed Bank(s)
     Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern
     • Concurrent Refresh Overhead is Proportional to 1 bank
     • Concurrent Refresh Overhead = R/W = 64 rows / 4050 cycles ≈ 1.58%
     Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler

  8. Goals of Our Work: An Industry Outlook
     • Design a Concurrent Refresh Scheduler that can
       • Provide Deterministic Memory Performance Guarantees
       • Maximize Memory Throughput (Optimality)
       • Be Universally Applicable
         • For any eDRAM macro with B banks, R Rows, M memory ports
         • For any characteristics of cell retention time W++, and Clock speed
       • Maximize Memory Burst Tolerance
       • Have Low Implementation Overhead
     ++Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM

  9. Problem Formulation
     • We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.
     • Sliding Window Constraint Supports X idle cycles in any (t, t+Y)
     • Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles
     [Figure: timelines contrasting a Fixed TDM Constraint (refreshes in Refresh Window 1, Refresh Window 2, Refresh Window 3, Refresh Window 4, ...) with a Sliding Window Constraint (refreshes in any refresh window)]
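The sliding window constraint can be checked mechanically on a trace of timeslots. The sketch below is a minimal illustration, assuming a trace is given as a list of booleans where `True` marks an idle slot; the function name `satisfies_sliding_window` and the trace encoding are assumptions made here, not part of the presentation.

```python
from typing import Sequence


def satisfies_sliding_window(idle: Sequence[bool], x: int, y: int) -> bool:
    """Return True if every run of y consecutive timeslots holds at least x idle slots."""
    if len(idle) < y:
        return True  # the trace is too short to contain a full window
    count = sum(1 for slot in idle[:y] if slot)  # idle slots in the first window
    if count < x:
        return False
    for t in range(y, len(idle)):
        # Slide the window one slot to the right: add slot t, drop slot t - y.
        count += int(idle[t]) - int(idle[t - y])
        if count < x:
            return False
    return True


# X = 1 idle in every Y = 4 slots.
periodic = [True, False, False, False] * 3          # one idle every 4 slots
late_idle = [True, False, False, False, False, True, False, False]

print(satisfies_sliding_window(periodic, x=1, y=4))
print(satisfies_sliding_window(late_idle, x=1, y=4))
```

The second trace fails because slots 1 through 4 form a window of four accesses with no idle slot, even though the trace contains two idle slots overall.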

  10. Key Performance Metrics
      • Refresh Overhead = X / Y
        • Memory bandwidth wasted on refresh
      • Burst Tolerance = Y – X
        • Maximum number of consecutive memory accesses without interruption for refresh
      We'll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1

  11. Our Solution: Versatile Refresh Algorithm
      [Figure: memory banks 1 to B with per-bank pointers labeled RP1, RP2, ..., RPB, a Next Concurrent Refresh Pointer, a Deficit Register (pointer, count), and a Max Register (count)]
      • Bank with deficit has priority for refresh.
      • Maximum Allowed Deficit Register Controls Burst Tolerance (Y)
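The slide names the registers but does not spell out their update rules, so the following is only an illustrative sketch of a deficit-based round-robin scheduler, not the authors' controller. Everything in it is an assumption: the class name `DeficitRefreshSketch`, reading the RP labels as per-bank row pointers, the single-port case (M = 1), refreshing one row per timeslot, and raising an error once the deficit count reaches the maximum.

```python
from typing import List, Optional, Tuple


class DeficitRefreshSketch:
    """Toy single-port scheduler: round-robin over banks plus one deficit register."""

    def __init__(self, banks: int, rows: int, max_deficit: int) -> None:
        if banks < 2:
            raise ValueError("need at least two banks to refresh around an access")
        self.banks, self.rows, self.max_deficit = banks, rows, max_deficit
        self.row_ptr: List[int] = [0] * banks    # assumed meaning of RP1..RPB
        self.bank_ptr = 0                        # next bank in round-robin order
        self.deficit_bank: Optional[int] = None  # bank that was skipped
        self.deficit_count = 0                   # how many refreshes it is owed

    def _refresh(self, bank: int) -> Tuple[int, int]:
        row = self.row_ptr[bank]
        self.row_ptr[bank] = (row + 1) % self.rows
        return bank, row

    def _next_bank(self) -> int:
        bank = self.bank_ptr
        self.bank_ptr = (bank + 1) % self.banks
        return bank

    def step(self, accessed_bank: Optional[int]) -> Tuple[int, int]:
        """Advance one timeslot; accessed_bank is None for an idle slot."""
        # The bank with a deficit has priority whenever it is not being accessed.
        if self.deficit_count > 0 and self.deficit_bank != accessed_bank:
            bank = self.deficit_bank
            self.deficit_count -= 1
            if self.deficit_count == 0:
                self.deficit_bank = None
            return self._refresh(bank)
        target = self._next_bank()
        if target == accessed_bank:
            # Conflict: record the skipped bank and refresh the next bank instead.
            if self.deficit_count == self.max_deficit:
                raise RuntimeError("deficit limit reached: an idle slot was owed")
            self.deficit_bank = target
            self.deficit_count += 1
            target = self._next_bank()
        return self._refresh(target)


sched = DeficitRefreshSketch(banks=4, rows=2, max_deficit=2)
trace = [0, 0, 0, 0, None, None, None]  # four reads of bank 0, then three idle slots
refreshed = [sched.step(bank) for bank in trace]
print(refreshed)  # list of (bank, row) pairs refreshed in each timeslot
```

In this toy run the repeated reads of bank 0 build up a deficit of two, and the first two idle slots are spent refreshing bank 0 before round-robin order resumes. A larger `max_deficit` lets the user postpone idle slots for longer, which is the sense in which the maximum-deficit register relates to burst tolerance here.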

  12. Necessary Refresh Overhead for any Algorithm: Intuition, X=1
      • At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)
      • An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
      • A total of BR inequalities to ensure cells are refreshed in time
      • Interestingly, only two of these inequalities matter
        • The one corresponding to the oldest cell
        • The one corresponding to the oldest "youngest cell in each bank"

  13. Necessary Refresh Overhead for any Algorithm: Derivation, X=1
      • How much can the adversary age the oldest cell?
        • Current age is at least BR-1
        • Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W
      • How much can the adversary age the oldest "youngest cell in each bank"?
        • Current age is at least B-1
        • Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
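Solving the two inequalities for Y gives the largest window any algorithm could tolerate with X = 1, and hence a lower bound of 1/Y on refresh overhead. The sketch below is a direct rearrangement of the inequalities on this slide; the function name `max_window_x1` is hypothetical, and the sample parameters are the B = 16, R = 128, W = 2500 example quoted later in the deck.

```python
def max_window_x1(banks: int, rows: int, retention: int) -> int:
    """Largest integer Y allowed by both necessary conditions when X = 1.

    (BR - 1) + Y <= W      ->  Y <= W - (BR - 1)
    (B - 1) + Y * R <= W   ->  Y <= (W - (B - 1)) / R
    A result below 1 means no window length satisfies the conditions.
    """
    oldest_cell_bound = retention - (banks * rows - 1)
    youngest_cell_bound = (retention - (banks - 1)) // rows
    return min(oldest_cell_bound, youngest_cell_bound)


y = max_window_x1(banks=16, rows=128, retention=2500)
print(f"Y <= {y}, so refresh overhead >= 1/{y} = {1 / y:.2%}")

# At W = RB + B - 1 the two bounds coincide at Y = B.
y_critical = max_window_x1(banks=16, rows=128, retention=16 * 128 + 16 - 1)
print(f"At W = RB + B - 1: Y <= {y_critical}")
```

For the sample parameters the second inequality is the binding one, giving Y ≤ 19 and an overhead of at least 1/19, close to R/W = 128/2500. At W = RB + B - 1 both bounds equal B, so the overhead bound there is 1/B.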

  14. Optimality for Versatile Refresh Overhead: Results, X=1
      • Necessity: Result for any Algorithm
      • Sufficiency: Result for VR Algorithm (with parameter X)
      Nearly Optimal Refresh For X=1

  15. Performance Guarantees of Versatile Refresh Algorithm
      [Figure: Worst-case Refresh Overhead (X/Y, from 0 to 1) versus Cell Retention Time (W), with curves for increasing X. The W axis marks RB and Wc = RB + B-1; the overhead axis marks R/W and 1/B. Annotations: "Bad" Region with High Overhead; Near-optimal Refresh Overhead for X = 1; Refresh Overhead ~ R/W, for W large]
      Why Would We Ever Use Large X?

  16. Why Would We Ever Use Large X?
      • Because of Burst Tolerance (large X → large Y – X)
        • If memory accesses are bursty, refreshes can be hidden
      • There is a Critical Value of X for Max Burst Tolerance
      • Example: B = 16, R = 128, W = 2500

  17. Calculations for Customer ASIC++
      R = 1024, B = 6 Banks, W = 18.2us @ 100C = 6825 cycles @ 375 MHz
      ++Note that these numbers have been sanitized

  18. Versatile Refresh Enhancement
      • Enhancement:
        • No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
        • Any idle slot is a no-conflict slot; but not vice versa
        • For VR, no-conflict slots are as good as idle slots.
      • Observation:
        • This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
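The distinction between idle and no-conflict slots can be stated as two small predicates. This is a minimal sketch for the single-port case (M = 1), assuming an access is encoded as a bank index and an idle slot as `None`; both function names are hypothetical.

```python
from typing import Optional


def is_idle_slot(accessed_bank: Optional[int]) -> bool:
    """An idle slot is one in which no bank is accessed."""
    return accessed_bank is None


def is_no_conflict_slot(accessed_bank: Optional[int], refresh_target: int) -> bool:
    """A no-conflict slot is one where the bank to be refreshed is not being accessed."""
    return accessed_bank != refresh_target


# Suppose the scheduler wants to refresh bank 2.
for accessed in (None, 5, 2):
    print(accessed, is_idle_slot(accessed), is_no_conflict_slot(accessed, refresh_target=2))
```

An idle slot (`None`) always passes the no-conflict test, while an access to bank 5 is a no-conflict slot without being idle. Only an access to bank 2 itself blocks the refresh.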

  19. Fully Enhanced Versatile Refresh Algorithm
      [Figure: memory banks 1 to B with per-bank pointers labeled RP1, RP2, ..., RPB, a Next Refresh Bank Pointer, a Deficit Register (pointer, count), and a Max Register (count), repeated for multiple memory ports (M); an Enforcer Module (User Logic) for X idles in Y timeslots, with no-conflict feedback]

  20. Simulation: Synthetic Statistical Workload
      • Parameter Alpha Controls Degree of Temporal Locality
        • alpha ~ 0 → always read from bank 1 (adversarial)
        • alpha ~ 1 → read from random banks (benign)
      • VR with X = 4: Min worst-case overhead (best for adversarial)
      • VR with X = 128: Max burst tolerance (best for benign)
      Refresh Overhead has Disappeared Completely!
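The slide gives only the two endpoints of the alpha parameter, so the generator below is one possible model that matches them, not the workload used in the simulation. It assumes that in each timeslot the reader jumps to a uniformly random bank with probability alpha and otherwise stays on the current bank; the function name `synthetic_accesses`, the zero-based bank index, and the seed handling are all assumptions.

```python
import random
from typing import List


def synthetic_accesses(n: int, banks: int, alpha: float, seed: int = 0) -> List[int]:
    """Generate n bank accesses; alpha = 0 pins bank 0, alpha = 1 picks banks uniformly."""
    rng = random.Random(seed)  # private generator so runs are repeatable
    bank = 0                   # "bank 1" on the slide, zero-based here
    trace = []
    for _ in range(n):
        if rng.random() < alpha:         # with probability alpha, jump
            bank = rng.randrange(banks)  # to a uniformly random bank
        trace.append(bank)
    return trace


adversarial = synthetic_accesses(n=20, banks=16, alpha=0.0)
benign = synthetic_accesses(n=20, banks=16, alpha=1.0)
print(adversarial)
print(benign)
```

Intermediate values of alpha produce runs of repeated accesses to one bank separated by random jumps, which is one way to express a varying degree of temporal locality.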

  21. Conclusion
      • With Versatile Refresh A Designer Can …
        • Exactly Calculate Available Memory Bandwidth
          • For any eDRAM macro with B banks, R Rows, M memory ports
          • For any characteristics of Temperature, W = Tref and Clock speed
        • Achieve Optimal Worst-case Memory Bandwidth
        • Design for Large Burst Tolerance
          • Potentially Eliminate Back-pressure
          • Simplify associated complex design and verification
        • Maximize Best-case Memory Bandwidth
        • Avail of a Formally Verified VR Controller
          • On a suitably reduced memory instance
