
Optimizing DRAM-Based Main Memories Using Intelligent Data Placement


Presentation Transcript


  1. Optimizing DRAM-Based Main Memories Using Intelligent Data Placement
     Ph.D. Thesis Proposal
     Kshitij Sudan

  2. Thesis Statement
     Improving DRAM access latency, power consumption, and capacity by leveraging intelligent data placement.

  3. Overview
     • System Re-design: increase capacity within a fixed power budget (Tiered Memory, under review)
     • Memory Interconnect: narrow, buffered channels to increase capacity (proposed work)
     • Memory Controller: maximize DRAM row-buffer utility (Micro-Pages, ASPLOS 2010)
     [Figure: system diagram of CPU, memory controller (MC), and DIMMs]

  4. Proposed Work: Re-Architecting Memory Channels

  5. Challenges in Increasing DRAM Capacity
     • Slow growth in CPU pin count limits the number of memory channels
     • Signal integrity limits capacity per channel
       • One remedy: use serial, point-to-point links
     • Drawbacks of using serial, point-to-point links:
       • Increased latency due to signal re-conditioning
       • Memory controller complexity limits resource use

  6. Increasing DRAM Capacity by Re-Architecting the Memory Channel
     • Re-architect the CPU-to-DRAM channel
       • Many skinny, serial channels vs. a few wide buses
       • CMPs might have changed the playing field
     • Improved signal integrity due to re-conditioning
     • New channel topology to reduce latency
     • Study effects of channel frequency

  7. Re-Architecting the Memory Channel
     Organize modules as a binary tree, and move some MC functionality to a “Buffer Chip”
     • Reduces module depth from O(n) to O(log n)
     • Reduces worst-case latency, improves signal integrity
     • Buffer chip manages low-level DRAM operations and channel arbitration
     • Not limited by worst-case latency like FB-DIMM
     • NUMA-like DRAM access: leverage data mapping
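
To make the O(n)-vs-O(log n) claim concrete, here is a minimal sketch (not from the proposal; the module counts and the assumption of a balanced binary tree are illustrative) comparing worst-case buffer-hop depth on a daisy-chained channel against the proposed tree topology:

```c
#include <math.h>
#include <stdio.h>

/* Illustrative comparison of worst-case hop counts: a daisy chain of n
 * modules is traversed end to end (O(n)), while a balanced binary tree of
 * buffer chips is traversed root to leaf (O(log n)). */
static int daisy_chain_depth(int n_modules) {
    return n_modules;                        /* worst case: last module */
}

static int binary_tree_depth(int n_modules) {
    return (int)ceil(log2(n_modules + 1));   /* worst case: deepest leaf */
}

int main(void) {
    for (int n = 4; n <= 64; n *= 2)
        printf("%2d modules: chain depth %2d, tree depth %d\n",
               n, daisy_chain_depth(n), binary_tree_depth(n));
    return 0;
}
```

The gap widens quickly: at 64 modules the chain's worst case is 64 hops, the tree's is 7, which is the latency and signal-integrity argument for the tree layout.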

  8. Past Work: Micro-Pages

  9. Increasing Row-Buffer Utility with Data Placement
     • Over-fetch due to large row-buffers
       • 8 KB read into the row buffer for a 64-byte cache line
       • Row-buffer utilization for a single request < 1% (64 B / 8 KB ≈ 0.8%)
     • Diminishing locality in multi-cores
       • Increasingly randomized memory access stream
       • Row-buffer hit rates bound to go down
     • Open-page policy and FR-FCFS request scheduling
       • Memory controller schedules requests to open row-buffers first
     Goal: improve row-buffer hit rates for CMPs

  10. Key Observation
      [Figure: post-L2 cache-block access pattern within OS pages]
      For heavily accessed pages in a given time interval, accesses are usually to only a few cache blocks

  11. Basic Idea
      [Figure: DRAM memory with a reserved region. 4 KB OS pages are divided into 1 KB micro-pages; the hottest micro-pages are migrated into the reserved DRAM region, while the coldest micro-pages stay in place]
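
As a rough illustration of the address arithmetic this implies (the 4 KB and 1 KB sizes are from the slide; the function names are hypothetical, not from the proposal), each OS page contains four micro-pages, and a physical address identifies its micro-page as follows:

```c
#include <stdint.h>
#include <stdio.h>

#define OS_PAGE_SIZE    4096u   /* 4 KB OS page, per the slide    */
#define MICRO_PAGE_SIZE 1024u   /* 1 KB micro-page, per the slide */

/* Each 4 KB OS page holds 4096/1024 = 4 micro-pages.  A physical address
 * maps to a (page number, micro-page sub-index) pair. */
static uint64_t os_page_of(uint64_t paddr)     { return paddr / OS_PAGE_SIZE; }
static unsigned micro_index_of(uint64_t paddr) {
    return (unsigned)((paddr % OS_PAGE_SIZE) / MICRO_PAGE_SIZE);
}

int main(void) {
    uint64_t paddr = 0x12345;   /* arbitrary example address */
    printf("addr 0x%llx -> OS page %llu, micro-page %u\n",
           (unsigned long long)paddr,
           (unsigned long long)os_page_of(paddr),
           micro_index_of(paddr));
    return 0;
}
```

Four micro-pages per OS page keeps the remapping state small while still isolating the few hot cache blocks per page that slide 10 observes.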

  12. Hardware Implementation: Hardware-Assisted Migration (HAM)
      [Figure: baseline 4 GB main memory with a 4 MB reserved DRAM region. A CPU memory request to physical address X (in page A) is looked up in a mapping table of (old address, new address) pairs; because X has been remapped to Y, the request is redirected to the new address Y inside the reserved region]
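
A minimal sketch of the redirection HAM performs (the flat table and linear scan are my simplifications; a real controller would use a small associative structure, and the entry count shown merely matches a 4 MB region of 1 KB micro-pages):

```c
#include <stdint.h>
#include <stddef.h>

/* Mapping table entry: a migrated micro-page's old physical address and
 * its new address inside the reserved DRAM region. */
struct ham_entry {
    uint64_t old_addr;
    uint64_t new_addr;
    int      valid;
};

#define HAM_ENTRIES 4096   /* illustrative: 4 MB region / 1 KB micro-pages */

static struct ham_entry ham_table[HAM_ENTRIES];

/* Redirect a request: if the micro-page containing paddr was migrated,
 * return the equivalent address in the reserved region; else pass through. */
static uint64_t ham_translate(uint64_t paddr) {
    uint64_t mp_base = paddr & ~((uint64_t)1024 - 1);  /* 1 KB-aligned base */
    for (size_t i = 0; i < HAM_ENTRIES; i++) {
        if (ham_table[i].valid && ham_table[i].old_addr == mp_base)
            return ham_table[i].new_addr | (paddr & (1024 - 1));
    }
    return paddr;   /* not migrated: use the original address */
}
```

Because the lookup keys on the 1 KB-aligned micro-page base, the table needs one entry per migrated micro-page rather than per cache line, which is what keeps the hardware overhead low.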

  13. Conclusions
      • On average, for applications with room for improvement and with our best-performing scheme:
        • Performance ↑ 9% (max. 18%)
        • Memory energy consumption ↓ 18% (max. 62%)
        • Row-buffer utilization ↑ 38%
      • Hardware-assisted migration offers better returns because it avoids the overheads of TLB shoot-downs and TLB misses

  14. Past Work: Tiered Memory

  15. Increasing DRAM Capacity in a Fixed Power Budget
      • DRAM power budget is increasing steadily with increases in capacity
      • Memory power budget in large systems is already close to 50% of the total power budget
      • DRAM low-power modes are hard to use in current systems
        • Coarse granularity at which low-power modes operate (a whole DRAM rank)
        • Data placement that increases bandwidth reduces opportunities to place ranks in low-power modes

  16. DRAM Power Management Challenges
      • DRAM supports low-power modes, but they are not easy to exploit:
        • Granularity at which memory can be put in a low-power mode is large
        • Memory interleaving randomly distributes accesses across ranks
        • Little coordination between memory managers (library, OS, and hypervisor)
      • As a result, no rank experiences sufficient idleness to warrant being placed in a low-power mode
      Few systems can exploit DRAM low-power modes aggressively

  17. Tiered Memory
      • Accesses to 4 KB OS pages show a step curve
      • Leverage this: place frequently accessed pages in active-mode DRAM ranks
      • Place “cold” pages in low-power-mode ranks

  18. Iso-Power Tiered Memory (I)
      • A DRAM rank in self-refresh mode consumes ~15% of the power of an idle rank in active mode
      • 1 rank in active-idle mode ≈ 6 ranks in self-refresh (1 / 0.15 ≈ 6.7)
      • By maintaining most of the memory in a low-power mode, we can build systems with much larger memory capacity in the same power budget
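
A quick back-of-the-envelope check of the capacity claim (the 15% self-refresh factor is from this slide; the 8-rank baseline and the 2h/22c and 4h/12c configurations come from the backup slides; access energy is deliberately ignored here):

```c
#include <stdio.h>

/* Iso-power check ignoring access energy: a configuration with n_hot
 * active-idle ranks and n_cold self-refresh ranks fits the baseline
 * budget if  n_hot + 0.15 * n_cold <= baseline_active_ranks. */
static double relative_power(int n_hot, int n_cold) {
    const double self_refresh_factor = 0.15;  /* ~15% of active-idle power */
    return n_hot + self_refresh_factor * n_cold;
}

int main(void) {
    const int baseline = 8;   /* 8 active ranks in the baseline system */
    printf("4h,12c (2x capacity): %.2f vs. budget %d\n",
           relative_power(4, 12), baseline);   /* 5.80 <= 8 */
    printf("2h,22c (3x capacity): %.2f vs. budget %d\n",
           relative_power(2, 22), baseline);   /* 5.30 <= 8 */
    return 0;
}
```

Both configurations come in well under the 8-rank budget even before accounting for access power, which is what the fuller analytical model on the backup slide handles.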

  19. Iso-Power Tiered Memory (II)
      • Two tiers of DRAM with heterogeneous power and performance characteristics
      • “Hot”-tier DRAM is always available; “cold”-tier DRAM uses the self-refresh low-power mode when idle
      • Place frequently accessed data in the hot tier
        • Maintains performance
        • Fewer accesses to the cold tier → reduced power
      • Batch references to the cold tier
        • Amortizes entry/exit overheads of the low-power mode
        • Stays in the low-power mode longer

  20. Intelligent Data Placement
      • Counters keep track of hot pages with low overhead
      • Every epoch, migrate hot pages from low-power ranks to active ranks
        • Requires page-table updates and TLB flushes
        • Still low overhead: after the first few epochs, the hot page set changes little
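
A minimal sketch of epoch-based hot-page tracking (counter width, table size, the threshold, and the callback are illustrative assumptions; the migration itself, i.e. the page-table update and TLB flush, is elided):

```c
#include <stdint.h>
#include <string.h>

#define NUM_PAGES     (1u << 20)  /* illustrative: 4 GB of 4 KB pages  */
#define HOT_THRESHOLD 64          /* illustrative access-count cutoff  */

static uint16_t access_count[NUM_PAGES];  /* per-page counters */

/* Called on each memory access, e.g., by the memory controller. */
static void note_access(uint32_t page) {
    if (access_count[page] != UINT16_MAX)  /* saturating counter */
        access_count[page]++;
}

/* At each epoch boundary: move pages that got hot to active ranks,
 * then reset the counters for the next epoch. */
static void end_of_epoch(void (*migrate_to_hot_tier)(uint32_t page)) {
    for (uint32_t p = 0; p < NUM_PAGES; p++)
        if (access_count[p] >= HOT_THRESHOLD)
            migrate_to_hot_tier(p);   /* page-table update + TLB flush */
    memset(access_count, 0, sizeof access_count);
}
```

A fixed threshold is only one policy; picking the top-k pages per epoch would bound the hot tier's occupancy directly. Either way, because the hot set stabilizes after the first few epochs, most epochs trigger few migrations.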

  21. Servicing Cold-Tier Requests in Batches
      • Buffer cold-tier accesses at the memory controller
      • Delay any request by at most t_g, which prevents starvation
      • t_g is chosen to amortize the entry/exit overheads of the low-power mode
      • Requires minimal change to the memory controller
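
One way the batching policy could look at the controller (the queue depth and the cycle-based t_g value are illustrative assumptions; t_g itself is the slide's guaranteed-delay bound):

```c
#include <stdbool.h>
#include <stdint.h>

#define T_G       2000  /* illustrative delay bound, in controller cycles */
#define MAX_BATCH 64    /* illustrative buffer depth */

struct cold_request { uint64_t addr; uint64_t arrival_cycle; };

static struct cold_request queue[MAX_BATCH];  /* FIFO-ordered buffer */
static int queue_len;

/* Release the batch when the oldest buffered request is about to exceed
 * its t_g delay bound, or when the buffer fills.  Waking the cold tier
 * once per batch amortizes the self-refresh entry/exit cost. */
static bool should_release_batch(uint64_t now) {
    if (queue_len == MAX_BATCH)
        return true;
    return queue_len > 0 && now - queue[0].arrival_cycle >= T_G;
}
```

Checking only the head of the FIFO suffices for the starvation guarantee: the oldest request is always the first to hit its bound.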

  22. Attributions
      • Re-architecting memory channel: Rajeev Balasubramonian, Al Davis, Niladrish Chatterjee, Manu Awasthi
      • Micro-Pages: Rajeev Balasubramonian, Al Davis, Niladrish Chatterjee, Manu Awasthi
      • Tiered Memory: Karthick Rajamani, Wei Huang, John Carter, Freeman Rawson

  23. Thanks! Questions?

  24. Backup Slides

  25. Other Work
      • Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches - Manu Awasthi, Kshitij Sudan, Rajeev Balasubramonian, John Carter, HPCA, February 2009.
      • Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality of Service - Kshitij Sudan, Sadagopan Srinivasan, Rajeev Balasubramonian, Ravi Iyer, under review.
      • A Novel System Architecture for Web-Scale Applications Using Lightweight CPUs and Virtualized I/O - Kshitij Sudan, Saisanthosh Balakrishnan, Sean Lie, Min Xu, Dhiraj Mallick, Rajeev Balasubramonian, Gary Lauterbach, under review.
      • Data Locality Optimization of Pthread Applications for Non-Uniform Cache Architectures - Gagan S. Sachdev, Kshitij Sudan, Rajeev Balasubramonian, Mary Hall, under review.
      Contd.

  26. Other Work (contd.)
      • Efficient Scrub Mechanisms for Error-Prone Emerging Memories - Manu Awasthi, Manjunath Shevgoor, Kshitij Sudan, Bipin Rajendran, Rajeev Balasubramonian, Viji Srinivasan, to appear at HPCA-18, February 2012.
      • Hadoop Jobs Require One-Disk-per-Core, Myth or Fact? - Kshitij Sudan, Min Xu, Sean Lie, Saisanthosh Balakrishnan, Gary Lauterbach, XLDB-5 Lightning Talk, October 2011.
      • Handling PCM Resistance Drift with Device, Circuit, Architecture, and System Solutions - Manu Awasthi, Manjunath Shevgoor, Kshitij Sudan, Rajeev Balasubramonian, Bipin Rajendran, Viji Srinivasan, Non-Volatile Memory Workshop, March 2011.
      • Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers - Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis, PACT, September 2010.
      • Improving Server Performance on Multi-Cores via Selective Off-loading of OS Functionality - David Nellans, Kshitij Sudan, Erik Brunvand, Rajeev Balasubramonian, WIOSCA, June 2010.
      • Hardware Prediction of OS Run-Length for Fine-Grained Resource Customization - David Nellans, Kshitij Sudan, Erik Brunvand, Rajeev Balasubramonian, ISPASS, March 2010.

  27. Iso-Power Memory Configurations
      An analytical model determines iso-power configurations for a given access rate to the active-mode (“hot”) DRAM ranks. Model inputs:
      • 8 active ranks in the baseline
      • ratio of active-idle to self-refresh power
      • fraction (u) of memory requests served by hot ranks
      • service rate
      • bandwidth
      [Figure: example iso-power configurations, e.g. 2 hot + 22 cold ranks gives 3x the baseline capacity; 4 hot + 12 cold gives 2x the baseline]
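
A sketch of what such a model might compute (the functional form below is my guess from the listed inputs, not the proposal's actual equations; service-rate and bandwidth constraints are omitted): hot ranks pay active-idle power plus their share of access power, cold ranks pay mostly self-refresh power, and a configuration is iso-power if it stays within the baseline budget.

```c
#include <stdbool.h>

/* Illustrative iso-power model, NOT the proposal's actual equations.
 * Power is normalized so that one active-idle rank = 1.0. */
struct model_params {
    double sr_ratio;      /* self-refresh / active-idle power (~0.15)   */
    double access_power;  /* extra power per unit access rate, per rank */
    double u;             /* fraction of requests served by hot ranks   */
    double access_rate;   /* total memory request rate (normalized)     */
};

static double config_power(int n_hot, int n_cold, const struct model_params *p) {
    /* Hot ranks: active-idle plus the traffic they absorb.              */
    double hot  = n_hot + p->access_power * p->u * p->access_rate;
    /* Cold ranks: mostly self-refresh, woken for the remaining traffic. */
    double cold = p->sr_ratio * n_cold
                + p->access_power * (1.0 - p->u) * p->access_rate;
    return hot + cold;
}

static bool is_iso_power(int n_hot, int n_cold, const struct model_params *p) {
    const int baseline_ranks = 8;   /* baseline from the slide */
    double budget = baseline_ranks
                  + p->access_power * p->access_rate;  /* baseline power */
    return config_power(n_hot, n_cold, p) <= budget;
}
```

The key dependence the slide names is visible here: the higher u is (more traffic captured by the hot tier), the more cold ranks a configuration can carry at iso-power.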

  28. Iso-Power Memory Configurations (contd.)
      [Figure: iso-power configurations determined by the analytical model for a given access rate to the active-mode (“hot”) DRAM ranks]

  29. Tiered Memory: An Iso-Power Memory Architecture to Address the Memory Power Wall
      • Build tiers out of DRAM ranks
      • Aggressively use low-power (LP) modes
      • Intelligent data placement to reduce the overheads of entering/exiting LP modes
      • Buffer requests to ranks in LP modes and service them in batches to amortize entry/exit costs
