This study explores methods to improve the read performance of Phase Change Memory (PCM) by mitigating contention from slow writes through adaptive write cancellation and write pausing. The evaluation shows that effective read latency drops from 2365 to 1330 cycles, with a reported speedup of 1.45x.
Improving Read Performance of PCM via Write Cancellation and Write Pausing • Moinuddin Qureshi, Michele Franceschini, and Luis Lastras • IBM T. J. Watson Research Center, Yorktown Heights, NY • HPCA – 2010
Introduction • More cores in the system: more concurrency and a larger working set • DRAM-based memory systems are hitting the power, cost, and scaling walls • Phase Change Memory (PCM): an emerging technology, projected to be more scalable, denser, and more power-efficient
PCM Operation • Switching by heating using electrical pulses • RESET state: amorphous (high resistance) • SET state: crystalline (low resistance) • Read latency is 2x-4x that of DRAM; write latency is much higher • [Figure: memory element with access device, and temperature vs. time for the RESET pulse (large current) and the SET pulse (small current), with Tmelt and Tcryst marked. Photo courtesy: Bipin Rajendran, IBM]
Problem of Contention from Slow Writes • PCM writes are 4x-8x slower than reads • Writes are not latency critical; the typical response is to use large buffers and intelligent scheduling • But once a write is scheduled to a bank, a later-arriving read must wait • Write requests cause contention for reads, which increases read latency
Outline • Introduction • Quantifying the Problem • Adaptive Write Cancellation • Write Pausing • Combining Cancellation & Pausing • Summary
Configuration: Hybrid Memory • Processor chip, DRAM cache (256MB), PCM-based main memory • Each bank has a separate read queue (RDQ) and write queue (WRQ) (32-entry) • Baseline uses read-priority scheduling while the WRQ is less than 80% full • If the WRQ is more than 80% full, an oldest-first policy is used (a “forced write”; rare, <0.1%)
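The baseline scheduling rule on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' simulator: the function name `pick_next`, the tuple layout of a request, and the reading of "oldest-first" as "oldest request across both queues by arrival time" are all assumptions made for the example.

```python
from collections import deque

WRQ_CAPACITY = 32          # entries per bank, from the slide
FORCED_WRITE_FILL = 0.80   # fill level above which the oldest-first policy applies


def pick_next(rdq, wrq):
    """Pick the next request for one bank (hypothetical helper).

    rdq and wrq are deques of (arrival_time, address) tuples, oldest first.
    Returns (kind, request), or None when both queues are empty.
    """
    if len(wrq) > FORCED_WRITE_FILL * WRQ_CAPACITY:
        # Assumed reading of "oldest-first": compare the heads of both queues.
        if rdq and rdq[0][0] < wrq[0][0]:
            return ("read", rdq.popleft())
        # A write issued in this mode is a "forced write".
        return ("forced_write", wrq.popleft())
    if rdq:
        # Normal mode: reads always go ahead of writes.
        return ("read", rdq.popleft())
    if wrq:
        return ("write", wrq.popleft())
    return None
```

With a 32-entry WRQ, the forced mode in this sketch starts at 26 pending writes, since 80% of 32 is 25.6.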
Problem • Read latency = 1K cycles; write latency = 8K cycles (sensitivity in paper) • 12 workloads, each with 8 benchmarks from SPEC06 • [Figure: effective read latency (cycles) and normalized execution time for No Read Priority, Baseline, Write Latency=1K, and Write Latency=0] • Writes significantly increase read latency (a problem only for asymmetric memories)
Outline • Introduction • Problem: Writes Delaying Reads • Adaptive Write Cancellation • Write Pausing • Combining Cancellation & Pausing • Summary
Write Cancellation • Write cancellation: abort an ongoing write to improve read latency • A cancelled write leaves the line in a non-deterministic state, so a read request that matches the line is serviced from the WRQ • Perform write cancellation as soon as a read request arrives at a bank (as long as the write is not being done in forced mode)
Write Cancellation with Static Threshold • Canceling a write request that is close to completion is wasteful and causes episodes of forced writes (low performance) • WCST: cancel a write request only if less than K% of its service is done • [Figure: sweep of the threshold K from NeverCancel to AlwaysCancel, with the value 2365 marked]
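The WCST rule reduces to one comparison. The sketch below is illustrative only; the function and parameter names are hypothetical, and progress is assumed to be measured as elapsed cycles over total write latency.

```python
def should_cancel_static(elapsed_cycles, write_latency, threshold_pct, forced=False):
    """Return True if an in-flight write should be cancelled for an arriving read.

    threshold_pct is the K of WCST: cancel only if less than K% of the
    write's service is done. Forced writes are never cancelled.
    """
    if forced:
        return False
    percent_done = 100.0 * elapsed_cycles / write_latency
    return percent_done < threshold_pct
```

With an 8K-cycle write and K = 50, a read arriving 1000 cycles into the write cancels it (12.5% done), while one arriving at 6000 cycles waits (75% done). K = 0 behaves as NeverCancel and K = 100 as AlwaysCancel for any write still in flight.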
Adaptive Write Cancellation • Best threshold depends on the number of pending entries in the WRQ • Fewer entries: higher threshold (best read latency) • More entries: lower threshold (reduces forced writes) • Write Cancellation with Adaptive Threshold (WCAT): Threshold = 100 – (4*NumEntriesInWRQ) • [Figure: threshold (0% to 100%) and forced writes (low to high) vs. number of entries in WRQ (10 to 30)]
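The WCAT formula can be evaluated directly. The sketch below only restates the slide's formula in code; the function names and the way write progress is measured (elapsed cycles over total write latency) are assumptions for illustration.

```python
def wcat_threshold(num_entries_in_wrq):
    """Threshold = 100 - (4 * NumEntriesInWRQ), in percent of write service done."""
    return 100 - 4 * num_entries_in_wrq


def should_cancel_adaptive(elapsed_cycles, write_latency, num_entries_in_wrq, forced=False):
    """Cancel the in-flight write only if its progress is below the WCAT threshold."""
    if forced:
        return False
    percent_done = 100.0 * elapsed_cycles / write_latency
    return percent_done < wcat_threshold(num_entries_in_wrq)
```

An empty WRQ gives a threshold of 100 (always cancel), 10 entries give 60, and at 25 or more entries the threshold is 0 or negative, so no write is cancelled.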
Adaptivity of WCAT • All WRQs were sampled every 2M cycles to measure occupancy • WCAT uses a higher threshold initially, when the WRQ is empty, and a lower threshold later, which reduces the episodes of forced writes
Results for WCAT • Baseline: 2365 cycles; Ideal: 1K cycles • Adaptive threshold reduces latency and incurs half the overhead
Iterative Write in PCM Devices • In Multi-Level Cells (MLC), the programming precision requirement increases linearly with the number of levels • PCM cells respond differently to the same programming pulse • Acknowledged solution to address this uncertainty: iterative writes • Each iteration consists of three steps: write, read, verify • [Figure: flowchart Write → Read → Verify; if not done, repeat; otherwise done]
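The write-read-verify loop can be sketched with a toy cell. Everything about the cell here is invented for illustration: `ToyCell`, its `gain` parameter, and the rule that each pulse closes a fixed fraction of the remaining error are assumptions, not a model of real PCM devices or of the algorithm in the paper.

```python
class ToyCell:
    """Hypothetical cell: each pulse moves the stored value a fraction
    `gain` of the way toward the target. Different cells have different gains."""

    def __init__(self, gain):
        self.gain = gain
        self.value = 0.0

    def pulse(self, target):
        self.value += self.gain * (target - self.value)

    def read(self):
        return self.value


def iterative_write(cell, target, tolerance, max_iters=64):
    """Repeat write, read, verify until the cell is within tolerance.

    Returns the number of iterations used.
    """
    for iteration in range(1, max_iters + 1):
        cell.pulse(target)                      # write
        value = cell.read()                     # read
        if abs(value - target) <= tolerance:    # verify
            return iteration
    raise RuntimeError("cell did not converge within max_iters")
```

In this toy model a cell with gain 1.0 converges in one iteration, while a cell with gain 0.5, target 1.0, and tolerance 0.1 needs four, which mirrors the point that cells respond differently to the same pulse.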
Model for Iterative Writes • We develop an analytical model to capture the number of iterations, in terms of bits/cell, number of levels written in one shot, and learning • Time required to write a line is the worst case over all cells in the line • MLC with 3 bits/cell: average number of iterations is 8.3 (consistent with MLC literature)
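The worst-case statement can be written compactly. The symbols below are introduced here for illustration and are not taken from the paper: N_c is the number of iterations cell c needs, and T_iter is an assumed fixed duration per iteration.

```latex
N_{\text{line}} = \max_{c \,\in\, \text{line}} N_c,
\qquad
T_{\text{line}} = N_{\text{line}} \cdot T_{\text{iter}}
```

Under this reading, a single slow cell determines the write time of the whole line.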
Concept of Write Pausing • Iterative writes can be paused to service pending read requests • Reads can be performed at the end of each iteration (a potential pause point) • Better read latency with negligible write overhead • We extend the iterative write algorithm of Nirschl et al. [IEDM’07] to support Write Pausing • [Figure: timelines of a four-iteration write (Iter 1 to Iter 4) with potential pause points after each iteration; reads (Rd X) are serviced between iterations]
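A rough way to see the benefit is to compute when a read that arrives during a write can start service. The sketch below assumes equal-length iterations and integer cycle counts, and ignores the read's own service time; the function names and the example numbers are illustrative, not taken from the evaluation.

```python
def read_start_without_pausing(read_arrival, write_start, iter_cycles, num_iters):
    """Without pausing, a read arriving mid-write waits for the whole write."""
    write_end = write_start + iter_cycles * num_iters
    if write_start < read_arrival < write_end:
        return write_end
    return read_arrival


def read_start_with_pausing(read_arrival, write_start, iter_cycles, num_iters):
    """With pausing, the read waits only until the current iteration ends."""
    write_end = write_start + iter_cycles * num_iters
    if not (write_start < read_arrival < write_end):
        return read_arrival
    elapsed = read_arrival - write_start
    # Ceiling division: the iteration in progress must finish first.
    boundary_index = -(-elapsed // iter_cycles)
    return write_start + boundary_index * iter_cycles
```

For a write of 8 iterations of 1000 cycles starting at cycle 0, a read arriving at cycle 2300 starts at cycle 3000 with pausing, compared with cycle 8000 without it.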
Results for Write Pausing Write Pausing at end of iteration gets 85% of benefit of “Anytime” Pause
Write Pausing + WCAT • Only one iteration is cancelled • “Micro-cancellation” has low overhead • [Figure: timelines of a four-iteration write (Iter 1 to Iter 4) interleaved with reads (Rd X), including a case where Iter 2 is cancelled]
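One possible reading of the combination is that the cancellation threshold is applied to the iteration in flight rather than to the whole write. The sketch below follows that reading; it is an assumption about how the two techniques compose, and the function name, return format, and per-iteration use of the WCAT formula are hypothetical.

```python
def service_read_during_write(read_arrival, write_start, iter_cycles, num_entries_in_wrq):
    """Decide how a read arriving mid-write is handled (illustrative only).

    Returns (action, read_start_time, wasted_write_cycles). Assumes integer
    cycle counts and that the read arrives while the write is in progress.
    """
    elapsed = read_arrival - write_start
    iters_completed, into_iter = divmod(elapsed, iter_cycles)
    if into_iter == 0:
        # Exactly at an iteration boundary: pause, nothing is wasted.
        return ("pause", read_arrival, 0)
    threshold = 100 - 4 * num_entries_in_wrq     # WCAT formula, per iteration
    percent_done = 100.0 * into_iter / iter_cycles
    if percent_done < threshold:
        # Micro-cancellation: only the current iteration is redone later.
        return ("micro_cancel", read_arrival, into_iter)
    # Otherwise let the iteration finish and pause at its end.
    next_boundary = write_start + (iters_completed + 1) * iter_cycles
    return ("pause", next_boundary, 0)
```

In this sketch the work lost to a cancellation is bounded by one iteration, which is why the overhead stays small compared with cancelling the whole write.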
Results • Baseline: 2365 cycles; Ideal: 1K cycles • Write Pause + Micro Cancellation is very close to Anytime Pause (re-execution overhead of micro-cancellation: <4% extra iterations)
Impact of Write Queue Size • [Figure: speedup with respect to the baseline (32-entry) across write queue sizes] • Large buffers are needed to best exploit the benefit of Pausing
Summary • Slow writes increase the effective read latency (2.3x) • Write Cancellation: Cancel ongoing write to service read • Threshold based write cancellation • Adaptive Threshold: better performance, half the overhead • Write Pausing exploits iterative write to service pending reads • Write Pausing + Micro Cancellation close to optimal pause • Effective read latency: from 2365 to 1330 cycles (1.45x speedup) • We will need large write buffers to exploit the benefit of Pausing
Write Pausing in Iterative Algorithms (Nirschl+ IEDM’07)
Workloads and Figure of Merit • 12 memory-intensive workloads from SPEC 2006: 6 rate-mode (eight copies of the same benchmark) and 6 mix-mode (two copies of four benchmarks) • Key metric: Effective Read Latency • Tin = time at which a read request enters the RDQ • Tout = time at which the read request finishes service at memory • Effective Read Latency = Tout – Tin (average reported)
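The figure of merit is an average over per-request latencies. A minimal sketch, assuming each read is recorded as a (Tin, Tout) pair in cycles; the function name and the sample values are made up for illustration.

```python
def effective_read_latency(requests):
    """Average of (Tout - Tin) over all read requests.

    requests is a non-empty list of (t_in, t_out) pairs in cycles.
    """
    latencies = [t_out - t_in for t_in, t_out in requests]
    return sum(latencies) / len(latencies)
```

For example, one read that is serviced in 1000 cycles and another that waits behind a write for a total of 3000 cycles give an effective read latency of 2000 cycles.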
Sensitivity to Write Latency At WriteLatency=4K, the speedup is 1.35x instead of 1.45x (at 8K latency)