
Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM


Presentation Transcript


  1. Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé University of Pittsburgh http://www.cs.pitt.edu/PCM

  2. Introduction
  • DRAM main memory is not energy efficient: data centers are energy hungry, and DRAM memory consumes 20-40% of their energy
  • Apply PCM as main memory: energy efficient, but reads are slower, writes are much slower, and lifetime is shorter
  • Hybrid memory: add a DRAM cache (LLC) to improve performance (lower LLC miss rate) and extend lifetime (lower LLC writeback rate)
  • How should the shared resources be managed?
  [Figure: four cores with private L1/L2 caches, a shared DRAM LLC, and PCM main memory]

  3. Shared Resource Management
  • In CMP systems, the last-level cache and the memory bandwidth are shared resources
  • Unmanaged resources lead to interference and poor performance; partitioning the resources reduces interference and improves performance
  • Prior partitioning schemes, by resource and memory type:
  - DRAM main memory: UCP [Qureshi et al., MICRO 39] for cache partitioning, RBP [Liu et al., HPCA'10] for bandwidth partitioning
  - Hybrid main memory: WCP [Zhou et al., HiPEAC'12] for cache partitioning, this work for bandwidth partitioning
  • Utility-based Cache Partitioning (UCP): tracks utility (LLC hits/misses) and minimizes overall LLC misses
  • Writeback-aware Cache Partitioning (WCP): tracks and minimizes LLC misses and writebacks
  • Read-only Bandwidth Partitioning (RBP): partitions the bus bandwidth based on LLC miss information
  • Questions: 1. Is read-only (LLC miss) information enough? 2. Is bus bandwidth still the bottleneck?
  [Figure: four-core CMP with private L1/L2 caches sharing the LLC and the memory]

  4. Bandwidth Partitioning
  • An analytic model guides the run-time partitioning:
  - Use queuing theory to model delay (see the sketch below)
  - Monitor performance to estimate the parameters of the model
  - Find the partition that maximizes the system's performance
  - Enforce the partition at run time
  • DRAM vs. hybrid main memory: PCM writes are extremely slow and power hungry
  • Issues specific to hybrid main memory:
  - Is the bottleneck the bus bandwidth or the device bandwidth?
  - Can we ignore the bandwidth consumed by LLC writebacks?
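
The queuing-theory step above can be illustrated with a minimal sketch. Assuming an M/M/1-style queue (an assumption made here for illustration; the paper's exact model may differ), the average delay seen by one core's memory requests depends only on its request arrival rate and the bandwidth share allocated to it:

# Illustrative M/M/1-style estimate of memory queuing delay for one core.
# arrival_rate: memory requests (LLC misses) per cycle for this core
# service_rate: the share of device bandwidth allocated to it, in requests per cycle
def queuing_delay(arrival_rate, service_rate):
    if arrival_rate >= service_rate:
        return float("inf")   # the allocated bandwidth cannot keep up
    return 1.0 / (service_rate - arrival_rate)

# Example: 0.02 misses/cycle against a 0.05 requests/cycle bandwidth share
print(queuing_delay(0.02, 0.05))   # ~33.3 cycles of average memory delay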

  5. Device Bandwidth Utilization
  • DRAM memory: low device bandwidth utilization; memory reads (LLC misses) dominate
  • Hybrid memory: high device bandwidth utilization; memory writes (LLC writebacks) often dominate
  [Figure: device bandwidth utilization of DRAM vs. PCM]

  6. RBP on Hybrid Main Memory
  • RBP vs. SHARE:
  - RBP outperforms SHARE for workloads dominated by PCM reads (LLC misses)
  - RBP loses to SHARE for workloads dominated by PCM writes (LLC writebacks)
  • A new bandwidth partitioning scheme is necessary for hybrid memory
  [Figure: percentage of device bandwidth consumed by PCM writes (LLC writebacks), and RBP vs. SHARE results]

  7. Writeback-Aware Bandwidth Partitioning
  • Focuses on the collective bandwidth of the PCM devices
  • Considers LLC writeback information
  • Token bucket algorithm (a sketch follows below):
  - Device service units = tokens
  - Allocate tokens among applications every epoch (5 million cycles)
  • Analytic model:
  - Maximize weighted speedup
  - Model the contention on bandwidth as queuing delay
  - Difficulty: a write is blocking only when the write queue is full
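
A minimal sketch of the token-bucket idea described above, assuming one bucket per core that is refilled with its allocated device service units at every epoch boundary (class and method names are illustrative, not from the paper):

class TokenBucket:
    """One bucket per core; tokens model PCM device service units."""
    def __init__(self, tokens_per_epoch):
        self.tokens_per_epoch = tokens_per_epoch
        self.tokens = tokens_per_epoch

    def refill(self, tokens_per_epoch=None):
        # Called once per epoch (e.g., every 5 million cycles), possibly with a
        # new allocation computed by the partitioning model.
        if tokens_per_epoch is not None:
            self.tokens_per_epoch = tokens_per_epoch
        self.tokens = self.tokens_per_epoch

    def try_issue(self):
        # A request may use a device service unit only if a token is available;
        # otherwise it waits in the queue until the next refill.
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False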

  8. Analytic Model for Bandwidth Partitioning
  • For a single core, an additive CPI formula: CPI = CPI_LLC∞ + (LLC miss freq.) × (LLC miss penalty), where CPI_LLC∞ is the CPI with an infinite LLC
  • The memory behaves like a queue: the LLC miss rate λm is the request arrival rate, the allocated memory bandwidth α is the request service rate, and the memory service time is approximated by the queuing delay
  • For a CMP, each core i has its own LLC miss rate λm,i and memory bandwidth share αi; the model chooses the shares α1, …, αN that maximize weighted speedup (see the sketch below)
  [Figure: queuing model of the memory with per-core miss rates and bandwidth shares]
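
A sketch of the additive CPI formula and the weighted-speedup objective, using the queuing-delay estimate above as the per-miss penalty (parameter names follow the slide; the concrete functional form is an assumption for illustration):

def core_cpi(cpi_llc_inf, misses_per_inst, miss_arrival_rate, bandwidth_share):
    # CPI = CPI_LLC∞ + (LLC miss freq.) * (LLC miss penalty), with the penalty
    # approximated by the queuing delay of this core's bandwidth share.
    if bandwidth_share <= miss_arrival_rate:
        return float("inf")
    miss_penalty = 1.0 / (bandwidth_share - miss_arrival_rate)
    return cpi_llc_inf + misses_per_inst * miss_penalty

def weighted_speedup(cpi_alone, cpi_shared):
    # Weighted speedup = sum over cores of IPC_shared / IPC_alone
    #                  = sum over cores of CPI_alone / CPI_shared.
    return sum(a / s for a, s in zip(cpi_alone, cpi_shared))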

  9. Analytic Model for WBP
  • Takes the LLC writebacks into account: CPI = CPI_LLC∞ + (LLC miss freq.) × (LLC miss penalty) + P × (LLC writeback freq.) × (LLC writeback penalty), where P is the probability that writebacks are on the critical path
  • Each core i now has a read queue (RQ) and a write queue (WQ): its LLC miss rate λm,i is served by read bandwidth αi and its LLC writeback rate λw,i by write bandwidth βi
  • The model chooses the αi and βi that maximize weighted speedup (see the sketch below)
  • Open question: how to determine P?
  [Figure: queuing model with separate read and write queues per core]
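
The writeback-aware version of the CPI formula can be sketched the same way, adding a second queuing term for the write queue and scaling it by P, the probability that writebacks are on the critical path (again an illustrative form, not the paper's exact equations):

def core_cpi_wbp(cpi_llc_inf,
                 misses_per_inst, miss_arrival_rate, read_bw_share,
                 wbs_per_inst, wb_arrival_rate, write_bw_share,
                 p_critical):
    # CPI = CPI_LLC∞ + miss_freq * miss_penalty + P * wb_freq * wb_penalty
    if read_bw_share <= miss_arrival_rate or write_bw_share <= wb_arrival_rate:
        return float("inf")
    miss_penalty = 1.0 / (read_bw_share - miss_arrival_rate)
    wb_penalty = 1.0 / (write_bw_share - wb_arrival_rate)
    return (cpi_llc_inf
            + misses_per_inst * miss_penalty
            + p_critical * wbs_per_inst * wb_penalty)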

  10. Dynamic Weight Adjustment
  • Choose P based on the expected number of executed instructions (EEI)
  • Bandwidth Utilization ratio (BU): utilized bandwidth : allocated bandwidth
  • WBP is evaluated with several candidate weights p1, p2, …, pm; the EEI predicted for each candidate is compared against the actual EEI to select P (see the sketch below)
  [Figure: per-candidate bandwidth allocations and predicted EEIs feeding the choice of P]
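
A minimal sketch of how the weight could be selected from EEI, assuming DWA evaluates a handful of candidate weights and keeps the one whose predicted EEI best matches what was actually executed in the last epoch (this selection rule is an assumption based on the slide, not the paper's code):

def pick_weight(candidate_weights, predicted_eei, actual_eei):
    # candidate_weights[i] is a candidate P; predicted_eei[i] is the expected
    # number of executed instructions the model predicts under that weight.
    errors = [abs(pred - actual_eei) for pred in predicted_eei]
    best = errors.index(min(errors))
    return candidate_weights[best]

# Example: three candidates, the middle one predicted closest to reality.
print(pick_weight([0.25, 0.5, 0.75], [9.0e6, 9.8e6, 1.1e7], 1.0e7))  # -> 0.5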

  11. Architecture Overview
  • BUMon tracks bandwidth-usage information during an epoch
  • DWA and WBP compute the bandwidth partition for the next epoch
  • The Bandwidth Regulator enforces the configuration (the per-epoch flow is sketched below)
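
The per-epoch flow implied by this slide, written as a small Python sketch; BUMon, DWA, WBP and the Bandwidth Regulator are the components named above, while the object interfaces and method names are assumptions made for illustration:

def on_epoch_end(bumon, dwa, wbp, regulator):
    stats = bumon.collect()            # miss/writeback rates and bandwidth utilization
    weight = dwa.choose_weight(stats)  # pick P (e.g., from predicted vs. actual EEI)
    partition = wbp.compute_partition(stats, weight)  # per-core token allocations
    regulator.apply(partition)         # enforced during the next epoch
    bumon.reset()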

  12. Enforcing Bandwidth Partitioning

  13. Simulation Setup
  • Configurations:
  - 8-core CMP, 168-entry instruction window
  - Private 4-way 64KB L1, private 8-way 2MB L2
  - Partitioned 32MB LLC, 12.5 ns latency
  - 64GB PCM, 4 channels of 2 ranks each, 50 ns read latency, 1000 ns write latency
  • Benchmarks:
  - SPEC CPU2006, classified into 3 types (W, R, RW) based on whether PCM reads/writes dominate bandwidth consumption
  - 15 workloads created (Light, High)
  • Sensitivity study on write latency, number of channels, and number of cores

  14. Effective Read Latency
  1. Different workloads favor different static policies (partitioning weights)
  2. WBP+DWA matches the best static policy (partitioning weight)
  3. WBP+DWA reduces the effective read latency by 31.9% over RBP

  15. Throughput
  1. The best writeback weight varies across workloads
  2. WBP+DWA achieves performance comparable to the best static weight
  3. WBP+DWA improves throughput by 24.2% over RBP

  16. Fairness (Harmonic IPC)
  • WBP+DWA improves fairness by an average of 16.7% over RBP

  17. Conclusions
  • PCM device bandwidth is the bottleneck in hybrid memory
  • Writeback information is important: LLC writebacks consume a substantial portion of memory bandwidth
  • WBP better partitions the PCM bandwidth, outperforming RBP by an average of 24.9% in weighted speedup

  18. Thank you. Questions?
