[Paper Review] Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Dimitris Kaseridis+, Jeffrey Stuecheli*+, and Lizy Kurian John+ (MICRO'11)
+ The University of Texas at Austin, * IBM Corp.
Korea University, VLSI Signal Processing Lab.
Jinil Chung (정진일) (jinil_chung@korea.ac.kr)
Abstract
DRAM design is a balance between performance, power, and storage density.
-. To realize good performance, the structural and timing restrictions of the DRAM devices must be managed
-. Use of the "page-mode" feature can mitigate many DRAM constraints
-. Aggressive page-mode use results in many conflicts (e.g., bank conflicts) when multiple workloads in many-core systems map to the same DRAM banks [IEEE Spectrum (link)]
In this paper, a Minimalist approach:
-. "Just enough" page-mode accesses to get the benefits while avoiding unfairness
-. Proposed: address hashing + a data prefetch engine + per-request priority
1. Introduction
Row-buffer (or "page-mode") access: this paper proposes a combination of open- and closed-page policies based on two observations:
-. The page-mode gain is obtained with only a small number of page accesses per activation
 → propose a fair DRAM address mapping scheme: low RBL & high BLP
-. Page-mode hits come from spatial locality (NOT temporal locality!), which can be captured by prefetch engines
 → propose an intuitive criticality-based memory request priority scheme
RBL: Row-buffer Locality / BLP: Bank-level Parallelism
2. Background
DRAM timing constraints result in "dead time" before and after each random access; the MC (Memory Controller)'s job is to reduce these performance-limiting gaps using parallelism.
1) tRC (row cycle time; ACT-to-ACT @same bank): once the MC activates a page, it must wait tRC before the next ACT to the same bank → when multiple threads access different rows in the same bank, each access pays the tRC latency overhead
2) tRP (row precharge time; PRE-to-ACT @same bank): in an open-page policy, when the MC activates another page in the same bank it pays the tRP penalty (the current page must be closed before the new page is opened)
Example values (ACT → PRE → ACT @same bank): tRC = 48ns, tRAS = 36ns, tRP = 12ns. A sketch of these per-bank checks follows.
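To make these constraints concrete, here is a minimal sketch (not from the paper; the Bank class and its field names are illustrative assumptions) of the per-bank timing checks an MC must enforce, using the example values above:

```python
# Per-bank timing checks, using the slide's example values.
T_RC, T_RAS, T_RP = 48.0, 36.0, 12.0  # ns

class Bank:
    def __init__(self):
        self.last_act = float("-inf")  # time of the last ACT to this bank
        self.last_pre = float("-inf")  # time of the last PRE to this bank
        self.open_row = None           # row currently held in the row buffer

    def can_activate(self, now):
        # ACT-to-ACT on the same bank must wait tRC, and a precharged
        # bank must wait tRP after the PRE before the next ACT.
        return (self.open_row is None and
                now - self.last_act >= T_RC and
                now - self.last_pre >= T_RP)

    def can_precharge(self, now):
        # A row must stay open at least tRAS before it may be closed.
        return self.open_row is not None and now - self.last_act >= T_RAS
```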
3. Motivation
Use of "page-mode":
-. Latency effects: due to tRC & tRP, overall latency increases → a small number of accesses per activation already hides most of it
-. Power reduction: only activate power is reduced → a small number of accesses is enough
-. Bank utilization: drops off quickly as accesses per activation increase → a small number of accesses is enough; if bank utilization is high, the probability that a new request will conflict with a busy bank is greater
-. Other DRAM complexities: a small number of accesses per activation is enough to soften restrictions, e.g., tFAW (four-activate window; 30ns) with a cache-block transfer delay of 6ns:
   single access per ACT: peak utilization limited to 6ns*4/30ns = 80%
   two or more accesses per ACT: peak utilization not limited (12ns*4/30ns > 100%)
(The slide's figures compare each point against a closed-page policy; the snippet below reproduces the tFAW arithmetic.)
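The tFAW arithmetic can be checked directly; this snippet just reproduces the slide's numbers (tFAW = 30ns, 6ns per cache-block transfer, 4 ACTs per window):

```python
# With at most 4 ACTs per tFAW window, the data-bus time generated per
# window bounds peak utilization.
t_faw_ns = 30.0       # four-activate window
transfer_ns = 6.0     # one cache-block transfer on the data bus

for hits_per_act in (1, 2):
    busy = 4 * hits_per_act * transfer_ns  # bus time generated per window
    print(f"{hits_per_act} access/ACT -> {busy / t_faw_ns:.0%} peak utilization")
    # 1 access/ACT  -> 80%  (tFAW-limited)
    # 2 accesses/ACT -> 160% (>100%: tFAW is no longer the bottleneck)
```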
3. Motivation
3.1 Row-buffer Locality in Modern Processors
-. Current workstation/server-class designs have large last-level caches (e.g., IBM POWER7)
-. Temporal locality turns into hits in the large last-level cache → row buffers can exploit only spatial locality
-. Using prefetch engines, this spatial locality can be predicted
RBL: Row-buffer Locality
3. Motivation
3.2 Bank and Row-buffer Locality Interplay with Address Mapping
-. A DRAM device address is composed of row, column, and bank fields
-. Example: Workload A issues a long sequential access sequence; Workload B issues a single operation
-. All column bits in the low-order real address, e.g., FR-FCFS: Workload A gets higher priority → B0 is slowed
-. All column bits in the low-order real address, e.g., ATLAS, PAR-BS: Workload B gets higher priority → A4 is slowed
-. Column & bank bits in the low-order real address → high BLP (Bank-level Parallelism), e.g., Minimalist: B0 can be serviced without degrading the traffic of Workload A
4. Minimalist Open-page Mode
4.1 DRAM Address Mapping Scheme
-. The basic difference is that the column address bits (7 bits) are split in two places:
 +. the 2 LSB bits are located right after the block bits
 +. the 5 MSB bits are located just before the row bits
 → a sequential access of 4 cache lines stays within one row activation
-. (Not shown in the figure) higher-order address bits are XOR-ed with the bank bits to produce the actual bank selection bits, reducing row-buffer conflicts [Zhang et al., MICRO'00]. A sketch of the mapping follows.
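A minimal sketch of this mapping, assuming illustrative field widths (6 block-offset bits, 3 bank bits, 15 row bits; the 7 column bits split 5+2 as above). The decode function and widths are assumptions for illustration, not the paper's exact configuration:

```python
# Assumed bit layout, low to high:
#   [6 block offset][2 column LSB][3 bank][5 column MSB][15 row]
BLOCK, COL_LO, BANK, COL_HI, ROW = 6, 2, 3, 5, 15

def decode(addr):
    col_lo = (addr >> BLOCK) & ((1 << COL_LO) - 1)
    bank   = (addr >> (BLOCK + COL_LO)) & ((1 << BANK) - 1)
    col_hi = (addr >> (BLOCK + COL_LO + BANK)) & ((1 << COL_HI) - 1)
    row    = (addr >> (BLOCK + COL_LO + BANK + COL_HI)) & ((1 << ROW) - 1)
    # XOR higher-order (row) bits into the bank bits so that rows which
    # would otherwise collide in one bank are spread out [Zhang et al.].
    bank ^= row & ((1 << BANK) - 1)
    column = (col_hi << COL_LO) | col_lo
    return row, bank, column

# Sequential cache lines 0..3 share one row buffer (same row and bank);
# line 4 flips the bank bits instead of the row, giving BLP.
for line in range(5):
    print(line, decode(line << BLOCK))
```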
4. Minimalist Open-page Mode
4.2 Data Prefetch Engine [IBM POWER6]
-. Page-mode opportunities are predictable → an accurate prefetch engine is needed
-. Each core includes a HW prefetcher with a prefetch depth/distance predictor
1) Multi-line Prefetch Requests
-. A multi-line prefetch operation is a single request that indicates a specific sequence of cache lines (see the sketch below)
-. Reduces command bandwidth and queue resources
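A hypothetical encoding of such a request (field names are assumptions, not the paper's format): one queue entry carries the starting line and the number of sequential lines, instead of one entry per line:

```python
from dataclasses import dataclass

@dataclass
class MultiLinePrefetch:
    start_addr: int  # address of the first cache line in the sequence
    count: int       # number of sequential cache lines to fetch (e.g., up to 4)
    distance: int    # depth/distance hint from the prefetch predictor

    def lines(self, block_bytes=64):
        # Expand the single request into the cache-line addresses it covers.
        return [self.start_addr + i * block_bytes for i in range(self.count)]

# One command and one queue slot replace `count` separate prefetches.
req = MultiLinePrefetch(start_addr=0x4000, count=4, distance=2)
print([hex(a) for a in req.lines()])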
4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme
-. In OOO execution, the importance of each request can vary both between and within applications → a dynamic priority scheme is needed
1) DRAM Memory Request Priority Calculation
-. Each request gets a different priority based on its criticality to performance
-. The priority of every queued request is increased at each 100ns time interval → time-based aging
-. 2 categories: read (normal) and prefetch → a read request gets higher priority
-. MLP information from the MSHRs in each core: a read from a core with many outstanding misses is less important
-. Distance information from the prefetch engine (Section 4.2) is used for prefetch requests
MLP: Memory Level Parallelism / MSHR: Miss Status Holding Register
(A sketch of the computation follows.)
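A hedged sketch of how such a priority might be computed. The weights, levels, and field names are assumptions, not the paper's exact table, but they combine the inputs named above: read vs. prefetch, MLP from the MSHRs, prefetch distance, and 100ns aging:

```python
from dataclasses import dataclass

AGE_STEP_NS = 100  # every 100ns of waiting bumps priority (time-based aging)

@dataclass
class Request:
    is_prefetch: bool
    distance: int     # prefetch depth hint (prefetches only)
    mshr_misses: int  # outstanding misses in the issuing core's MSHRs
    arrival_ns: int

def request_priority(req, now_ns):
    if req.is_prefetch:
        # Near-distance prefetches are more urgent than deep ones.
        base = max(0, 3 - req.distance)
    else:
        # Reads outrank prefetches; a core with many outstanding misses
        # (high MLP) hides latency well, so each miss is less critical.
        base = 7 - min(req.mshr_misses, 3)
    aging = (now_ns - req.arrival_ns) // AGE_STEP_NS
    return base + aging

# A demand read from a low-MLP core outranks a deep prefetch:
print(request_priority(Request(False, 0, 1, arrival_ns=0), now_ns=250))  # 6 + 2 = 8
print(request_priority(Request(True, 4, 0, arrival_ns=0), now_ns=250))   # 0 + 2 = 2
```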
4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme (cont.)
2) DRAM Page Closure (Precharge) Policy
-. Using autoprecharge (read/write with implicit precharge) frees up command bandwidth
3) Overall Memory Request Scheduling Scheme (Priority Rules)
-. The same rules are used by all MCs → no need for communication among MCs
-. If an MC is servicing the multiple transfers of a multi-line prefetch request, it can be interrupted by a higher-priority request → a very critical request can be serviced with the smallest latency (see the sketch below)
4) Handling Write Operations
-. The dynamic priority scheme does not apply to writes
-. The VWQ (Virtual Write Queue) is used → minimal write-induced interference
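A sketch of the arbitration implied by rule 3): every MC runs the same stateless comparison (so no inter-MC coordination is needed), and an in-flight multi-line prefetch may be preempted by a more critical request. It reuses the request_priority sketch above; all names are illustrative:

```python
def pick_next(queue, in_flight, now_ns):
    # Highest-priority pending request (tie-breaking by arrival order
    # is omitted for brevity).
    best = max(queue, key=lambda r: request_priority(r, now_ns), default=None)
    if in_flight is not None and best is not None and \
       request_priority(best, now_ns) > request_priority(in_flight, now_ns):
        # Preempt the remaining transfers of the multi-line prefetch so
        # the critical request is serviced with the smallest latency.
        queue.append(in_flight)
        return best
    return in_flight if in_flight is not None else best
```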
5. Evaluation
-. 8-core CMP system using the Simics functional model extended with the GEMS toolset
-. Simulated DDR3-1333 DRAM, changing the memory controller policy for each experiment
-. The Minimalist open-page scheme is compared against three open-page policies (Table 5):
 1) PAR-BS (Parallelism-Aware Batch Scheduler)
 2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
 3) FR-FCFS (First-Ready, First-Come-First-Served): baseline
5. Evaluation
5.1 Throughput
-. Overall, "Minimalist Hash+Priority" demonstrates the best throughput, a 10% improvement over the FR-FCFS baseline
-. By comparison, ATLAS and PAR-BS achieve 3.2% and 2.8% throughput improvements over the whole workload suite
5. Evaluation
5.2 Fairness
-. Minimalist improves fairness by up to 15%, with overall improvements of 7.5%, 3.4%, and 2.5% over FR-FCFS, PAR-BS, and ATLAS, respectively
5. Evaluation
5.3 Row-buffer Accesses per Activation
-. The observed page-access rates for the aggressive open-page policies fall significantly short of the ideal: a high page-hit rate is simply not possible given the interleaving of requests between the eight executing programs
-. With the Minimalist scheme, the achieved page-access rate is close to 3.5, compared to the ideal rate of 4
5. Evaluation
5.4 Target Page-hit Count Sensitivity
-. The Minimalist scheme requires selecting a target number of page hits, i.e., the maximum number of page hits the scheme attempts to achieve per row activation
-. A target of 4 page hits provides the best results (a different system configuration may shift the optimal page-mode hit count)
5. Evaluation
5.5 DRAM Energy Consumption
-. Power consumption is estimated using the Micron power calculator
-. Energy is approximately the same as with FR-FCFS: PAR-BS, ATLAS, and "Minimalist Hash+Priority" provide a small decrease of approximately 5% in overall energy consumption
-. The energy results are essentially a balance between the decrease in page-mode hits (raising DRAM activation power) and the increase in system performance (decreasing runtime)
Conclusions
Minimalist Open-page memory scheduling policy
-. Captures the page-mode gain with a small number of page accesses per page activation
-. Assigns per-request priority using request-stream information from MLP and the data prefetch engine
Improved throughput and fairness:
-. Throughput increased by 10% on average (compared to FR-FCFS)
-. No need for thread-based priority information
-. No need for communication/coordination among multiple MCs or the OS