
Presentation Transcript


  1. MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms Apr 9, 2012 Heechul Yun+, Gang Yao+, Rodolfo Pellizzoni*, Marco Caccamo+, Lui Sha+ +University of Illinois at Urbana-Champaign, *University of Waterloo

  2. Real-Time Applications • Resource-intensive real-time applications • Multimedia processing(*), real-time data analytics(**), object tracking • Requirements • Need more performance at lower cost → Commercial Off-The-Shelf (COTS) hardware • Performance guarantees (i.e., temporal predictability and isolation) (*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010 (**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012

  3. Modern System-on-Chip (SoC) • More cores • Freescale P4080 has 8 cores • More sharing • Shared memory hierarchy (LLC, MC, DRAM) • Shared I/O channels • More performance, less cost. But isolation?

  4. Problem: Shared Memory Hierarchy • Shared hardware resources • OS has little control [Figure: four partitions (Part 1 to Part 4) on Core1-Core4 share the Last Level Cache (LLC) (space contention) and the Memory Controller (MC) and DRAM (access contention)]

  5. Memory Performance Isolation • Q. How to guarantee worst-case performance? • Need to guarantee memory performance [Figure: partitions (Part 1 to Part 4) on Core1-Core4, each with its own LLC partition, all sharing the Memory Controller and DRAM]

  6. Inter-Core Memory Interference • Significant slowdown (up to 2x on 2 cores) • Slowdown is not proportional to memory bandwidth usage [Figure: runtime slowdown ratio due to interference for each foreground benchmark (x-axis) co-running with 470.lbm in the background; Intel Core2 with shared L2 and shared memory; per-task bandwidth usage of 1.4 to 1.6GB/s]

  7. Background: DRAM Chip • State-dependent access latency • Row miss: 19 cycles, Row hit: 9 cycles(*) [Figure: a DRAM chip with Banks 1-4; a READ (Bank 1, Row 3, Col 7) precharges the currently open row, activates Row 3 into the row buffer, then reads/writes Col 7] (*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
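To make the state-dependent latency concrete, here is a minimal C sketch (an illustration, not taken from the slides or the MemGuard code) that keeps one open row per bank and charges the row-hit or row-miss cost quoted above; the bank count and function names are assumptions.

    /* Minimal model of per-bank row-buffer state; 9/19 cycles are the
     * row-hit / row-miss figures quoted on the slide for PC6400-DDR2
     * with 5-5-5 timing. */
    #include <stdio.h>

    #define NUM_BANKS        4
    #define ROW_HIT_CYCLES   9    /* requested row already in the row buffer */
    #define ROW_MISS_CYCLES 19    /* precharge + activate + read/write       */

    static int open_row[NUM_BANKS] = { -1, -1, -1, -1 };  /* -1: no open row */

    static int dram_access(int bank, int row)
    {
        if (open_row[bank] == row)
            return ROW_HIT_CYCLES;
        open_row[bank] = row;     /* precharge the old row, activate the new one */
        return ROW_MISS_CYCLES;
    }

    int main(void)
    {
        printf("READ bank 1, row 3: %d cycles\n", dram_access(1, 3));  /* miss */
        printf("READ bank 1, row 3: %d cycles\n", dram_access(1, 3));  /* hit  */
        printf("READ bank 1, row 5: %d cycles\n", dram_access(1, 5));  /* miss */
        return 0;
    }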

  8. Background: Memory Controller (MC) • Request queue • Buffers read/write requests from CPU cores • Unpredictable queuing delay due to reordering (Figure from Bruce Jacob et al., "Memory Systems: Cache, DRAM, Disk", Fig. 13.1)

  9. Background: MC Queue Re-ordering • Improves row hit ratio and throughput • Unpredictable queuing delay • Initial queue (arrival order): Core1 READ Row 1, Col 1; Core2 READ Row 2, Col 1; Core1 READ Row 1, Col 2 → 2 row switches • Reordered queue: Core1 READ Row 1, Col 1; Core1 READ Row 1, Col 2; Core2 READ Row 2, Col 1 → 1 row switch
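The effect of the re-ordering can be reproduced with a small C sketch that simply counts row switches between consecutive requests in each queue order; the request values are the ones on the slide, while the structure and names are illustrative only.

    #include <stdio.h>

    struct req { int core, row, col; };

    /* Count how many consecutive request pairs target different rows. */
    static int row_switches(const struct req *q, int n)
    {
        int switches = 0;
        for (int i = 1; i < n; i++)
            if (q[i].row != q[i - 1].row)
                switches++;
        return switches;
    }

    int main(void)
    {
        struct req initial[] = {       /* arrival order       */
            { 1, 1, 1 }, { 2, 2, 1 }, { 1, 1, 2 }
        };
        struct req reordered[] = {     /* row-hit-first order */
            { 1, 1, 1 }, { 1, 1, 2 }, { 2, 2, 1 }
        };
        printf("initial queue:   %d row switches\n", row_switches(initial, 3));   /* 2 */
        printf("reordered queue: %d row switches\n", row_switches(reordered, 3)); /* 1 */
        return 0;
    }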

  10. Challenges for Real-Time Systems • Memory controller (MC) level queuing delay • Main source of interference • Unpredictable (re-ordering) • DRAM state-dependent latency • DRAM row open/close state • State of the art • Predictable DRAM controller h/w: [Akesson'07] [Paolieri'09] [Reineke'11] → not available in COTS [Figure: Core1-Core4 with the COTS memory controller replaced by a predictable memory controller, above DRAM]

  11. Our Approach • OS level approach • Works on commodity h/w • Guarantees the performance of each core • Maximizes memory performance when possible [Figure: Core1-Core4 with an OS control mechanism placed between the cores and the unmodified memory controller and DRAM]

  12. MemGuard • Memory bandwidth reservation and reclaiming [Figure: MemGuard inside the operating system: a reclaim manager plus per-core b/w regulators (0.9GB/s, 0.1GB/s, 0.1GB/s, 0.1GB/s in the example), each driven by a per-core performance monitoring counter (PMC) on the multicore processor, above the memory controller and DRAM DIMM]

  13. Memory Bandwidth Reservation • Idea • OS monitors and enforces each core's memory bandwidth usage [Figure: within each 1ms regulation period a core's budget (e.g., 2) is decremented on every memory fetch; when it reaches 0 the core's tasks are dequeued, and they are enqueued again at the next period boundary; core activity alternates between computation and memory fetches]
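As a rough illustration of this mechanism (a simplified userspace sketch, not the actual MemGuard kernel code), the snippet below keeps a per-core access budget that a PMC-style callback decrements on every memory access and a periodic timer replenishes; all type and function names here are assumptions.

    #include <stdio.h>
    #include <stdbool.h>

    struct regulator {
        long budget;       /* memory accesses allowed per period (derived from Bi) */
        long used;         /* accesses consumed in the current period              */
        bool throttled;    /* are the core's tasks dequeued until the next period? */
    };

    /* Called on every counted memory access (e.g., a PMC event per LLC miss). */
    static void on_memory_access(struct regulator *r)
    {
        if (r->throttled)
            return;
        if (++r->used >= r->budget) {
            r->throttled = true;            /* dequeue the core's tasks           */
            printf("budget exhausted: throttle the core\n");
        }
    }

    /* Called by the periodic timer at every period boundary (e.g., every 1ms). */
    static void on_period_boundary(struct regulator *r)
    {
        r->used = 0;
        r->throttled = false;               /* enqueue the core's tasks again     */
    }

    int main(void)
    {
        struct regulator r = { .budget = 3 };
        for (int i = 0; i < 5; i++)
            on_memory_access(&r);           /* the 3rd access exhausts the budget */
        on_period_boundary(&r);             /* new period: budget replenished     */
        return 0;
    }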

  14. Memory Bandwidth Reservation • Key Insight • B/W regulators control memory request rates • (Cores') request rate ≤ (DRAM) service rate → (Memory controller) minimal queuing delay • System-wide reservation rule: Σ Bi ≤ rmin, i.e., reserve up to the guaranteed bandwidth rmin • m: # of cores • Bi: Core i's b/w reservation

  15. Guaranteed Bandwidth: rmin • Worst-case DRAM performance (service rate) • All memory requests go to the same bank (no bank-level parallelism) and cause row misses • Example (PC6400-DDR2*) • Peak B/W: 6.4GB/s • 64-byte I/O = 10ns; command latency hidden by interleaving • Calculated guaranteed B/W: 1.3GB/s • PRE + ACT + RD + I/O (8 x 8 bytes) = 47.5ns • Measured guaranteed B/W: 1.2GB/s • Performance Isolation • Sum of memory b/w reservations ≤ guaranteed b/w (*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
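A quick back-of-the-envelope check of the calculated figure (just the arithmetic implied above, not new data): one worst-case access moves a 64-byte cache line in about 47.5ns, which works out to roughly 1.35GB/s, consistent with the 1.3GB/s calculated and 1.2GB/s measured values.

    #include <stdio.h>

    int main(void)
    {
        double line_bytes = 64.0;   /* one cache-line fetch                         */
        double worst_ns   = 47.5;   /* PRE + ACT + RD + I/O (8 x 8 bytes), row miss */
        double gbps = line_bytes / worst_ns;  /* bytes per ns equals GB per second  */
        printf("calculated guaranteed b/w: %.2f GB/s\n", gbps);  /* about 1.35      */
        return 0;
    }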

  16. Memory Access Pattern • Memory access patterns vary over time • Static reservation is inefficient [Figure: memory requests over time (ms) for two benchmarks, showing bursty, time-varying demand]

  17. Memory Bandwidth Reclaiming • Key objective • Redistribute excess bandwidth to demanding cores • Improve memory b/w utilization • Predictive bandwidth donation and reclaiming • Donate unneeded budget predictively • Reclaim on an on-demand basis

  18. Reclaim Example • Time 0 • Initial budget of 3 for both cores • Time 3, 4 • Budgets are decremented • Time 10 (period 1) • Predictive donation (4 in total) • Time 12, 15 • Core 1: reclaims • Time 16 • Core 0: reclaims • Time 17 • Core 1: no remaining budget; its tasks are dequeued • Time 20 (period 2) • Core 0 donates 2 • Core 1 does not donate (see the sketch below)
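The following C sketch mirrors the flow of this example in a simplified, single-threaded form (an illustration, not the kernel implementation): at each period boundary every core keeps only the budget it is predicted to need, here naively taken as last period's usage capped at its reservation, and donates the remainder to a global pool G; a core that exhausts its budget mid-period reclaims from G before being throttled.

    #include <stdio.h>

    #define NUM_CORES 2

    static long reserved[NUM_CORES] = { 3, 3 };  /* Bi per period               */
    static long budget[NUM_CORES];               /* budget left this period     */
    static long used_last[NUM_CORES];            /* usage in the previous period */
    static long G;                               /* global reclaim pool          */

    static void period_boundary(void)
    {
        G = 0;
        for (int i = 0; i < NUM_CORES; i++) {
            long predicted = used_last[i];       /* naive predictor             */
            if (predicted > reserved[i])
                predicted = reserved[i];
            budget[i] = predicted;
            G += reserved[i] - predicted;        /* donate the unneeded part    */
            used_last[i] = 0;
        }
    }

    /* Returns 0 if the core must be dequeued for the rest of the period. */
    static int charge_access(int core)
    {
        used_last[core]++;
        if (budget[core] > 0) { budget[core]--; return 1; }
        if (G > 0)            { G--;            return 1; }  /* reclaim from G */
        return 0;                                            /* throttle       */
    }

    int main(void)
    {
        used_last[0] = 3;             /* core 0 used its full budget last period */
        used_last[1] = 1;             /* core 1 used only 1, so it donates 2     */
        period_boundary();            /* budgets: core0=3, core1=1, G=2          */

        for (int k = 1; k <= 6; k++)  /* core 0 issues 6 accesses: 3 from its    */
            if (!charge_access(0))    /* budget, 2 reclaimed from G, then it is  */
                printf("core 0 throttled at access %d\n", k);  /* throttled      */
        return 0;
    }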

  19. Best-effort Bandwidth Sharing • Key objective • Utilize best-effort bandwidth whenever possible • Best-effort bandwidth • The bandwidth available after all cores have used their budgets (i.e., the guaranteed bandwidth has been delivered) and before the next period begins • Sharing policy • Maximize throughput • Broadcast to all cores to continue executing (see the sketch below)
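Continuing the same simplified model (again only an illustration, with assumed names), the sharing policy can be stated as: once every core has exhausted its budget for the period, i.e., the guaranteed bandwidth has been delivered, wake all throttled cores and let them run until the next period begins.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_CORES 4

    static bool budget_exhausted[NUM_CORES];
    static bool throttled[NUM_CORES];

    /* Called whenever some core exhausts its budget (and the reclaim pool is
     * empty): if every core is done with its guaranteed share, broadcast to
     * all throttled cores to continue executing until the next period. */
    static void maybe_start_best_effort(void)
    {
        for (int i = 0; i < NUM_CORES; i++)
            if (!budget_exhausted[i])
                return;               /* guaranteed b/w not yet fully delivered */
        for (int i = 0; i < NUM_CORES; i++)
            throttled[i] = false;     /* everyone may run best-effort           */
    }

    int main(void)
    {
        for (int i = 0; i < NUM_CORES; i++) {
            budget_exhausted[i] = true;   /* all budgets used this period */
            throttled[i]        = true;
        }
        maybe_start_best_effort();
        printf("core 0 throttled: %s\n", throttled[0] ? "yes" : "no");  /* no */
        return 0;
    }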

  20. Evaluation Platform • Intel Core2Quad Q8400, 4MB L2 cache, PC6400 DDR2 DRAM • Prefetchers were turned off for the evaluation • Also ported to the PowerPC-based P4080 (8 cores) and the ARM-based Exynos 4412 (4 cores) • Modified Linux kernel 3.6.0 + MemGuard kernel module • https://github.com/heechul/memguard/wiki/MemGuard • Used all 29 benchmarks from SPEC2006 plus synthetic benchmarks [Figure: four cores (Core 0 to Core 3) with per-core I/D caches, two shared L2 caches, a memory controller, and DRAM]

  21. Evaluation Results • 462.libquantum (foreground) co-running with memory hogs (background) • 53% performance reduction for the foreground [Figure: normalized IPC of 462.libquantum; Intel Core2 with shared L2 and shared memory]

  22. Evaluation Results • 462.libquantum (foreground, 1GB/s reservation) co-running with memory hogs (background, 0.2GB/s) • Reservation provides performance isolation [Figure: normalized IPC with the guaranteed performance level marked; Intel Core2 with shared L2 and shared memory]

  23. Evaluation Results • 462.libquantum (foreground, 1GB/s reservation) co-running with memory hogs (background, 0.2GB/s) • Reclaiming and sharing maximize performance [Figure: normalized IPC with the guaranteed performance level marked; Intel Core2 with shared L2 and shared memory]

  24. SPEC2006 Results • 470.lbm (background, 0.2GB/s) co-running with each SPEC2006 benchmark (foreground, x-axis, 1GB/s) • Guarantees (soft) performance of the foreground w.r.t. its 1.0GB/s memory b/w reservation [Figure: foreground IPC normalized to the guaranteed performance; Intel Core2 with shared L2 and shared memory]

  25. SPEC2006 Results • Improves overall throughput • background: 368%, foreground (x-axis): 6% [Figure: IPC normalized to the guaranteed performance for 470.lbm (background, 0.2GB/s) and each foreground benchmark (1GB/s); Intel Core2 with shared L2 and shared memory]

  26. Conclusion • Inter-core memory interference • A big challenge for multi-core based real-time systems • Sources: queuing delay in the MC, state-dependent latency in DRAM • MemGuard • An OS mechanism providing efficient per-core memory performance isolation on COTS h/w • Memory bandwidth reservation and reclaiming support • https://github.com/heechul/memguard/wiki/MemGuard

  27. Thank you.

  28. Effect of Reclaim • IPC improvement of background (lbm@0.2GB/s) is 3.8x • IPC reduction of foreground (SPEC@1.0GB/s) is 3%

  29. Reclaim Underrun Error

  30. Effect of Spare Sharing • IPC of the background (lbm@0.2GB/s) improves by 40% • IPC of the foreground (SPEC@1.0GB/s) also improves, by 9%

  31. Isolation and Throughput: Effect of rmin • 4-core configuration

  32. Isolation Effect of Reservation • Isolation: sum of b/w reservations < rmin (1.2GB/s) • Isolation: 1.0GB/s (x-axis) + 0.2GB/s (lbm) = rmin [Figure: Core 0 reserves 1.0GB/s for the x-axis benchmark, Core 2 reserves 0.2 to 2.0GB/s for lbm; foreground IPC compared against its solo IPC at 1.0GB/s]

  33. Effect of MemGuard • A soft real-time application runs on each core • Provides differentiated memory bandwidth • Per-core weights of 1:2:4:8 for the guaranteed b/w; spare bandwidth sharing is enabled

  34. Hard/Soft Reservation on MemGuard • Hard reservation (w/o reclaiming) • Can guarantee memory bandwidth Bi in each period regardless of other cores • Budget is wasted if not used • Soft reservation (w/ reclaiming) • Misprediction can cause a missed b/w guarantee in a period • The error rate is small: less than 5% • Selectively applicable on a per-core basis (see the sketch below)
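A minimal sketch of the per-core hard/soft distinction, reusing the global pool G from the earlier reclaim sketch (names and structure are assumptions, not the module's actual interface): a hard-reserved core never donates or reclaims, so its Bi is always available to it but wasted if unused, while a soft-reserved core may dip into the pool.

    #include <stdbool.h>
    #include <stdio.h>

    enum reserve_mode { HARD_RESERVE, SOFT_RESERVE };

    struct core_state {
        enum reserve_mode mode;
        long budget;              /* remaining budget in this period */
    };

    static long G = 2;            /* global reclaim pool (soft cores only) */

    /* Charge one memory access; returns false if the core must be throttled. */
    static bool charge(struct core_state *c)
    {
        if (c->budget > 0) { c->budget--; return true; }
        if (c->mode == SOFT_RESERVE && G > 0) { G--; return true; }
        return false;             /* hard core, or pool empty: throttle */
    }

    int main(void)
    {
        struct core_state hard = { HARD_RESERVE, 1 };
        struct core_state soft = { SOFT_RESERVE, 1 };

        charge(&hard); charge(&soft);                       /* both spend their Bi   */
        printf("hard core continues: %d\n", charge(&hard)); /* 0: throttled          */
        printf("soft core continues: %d\n", charge(&soft)); /* 1: reclaimed from G   */
        return 0;
    }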
