
Outline for Today




  1. Outline for Today • Objective • Physical Page Placement matters • Power-aware memory • Superpages • Announcements • Deadline extended (wrong kernel was our fault)

  2. Memory System Power Consumption
  (Charts: laptop power budget with a 9 Watt processor; handheld power budget with a 1 Watt processor.)
  • Laptop: memory is a small percentage of the total power budget
  • Handheld: with a low-power processor, memory is more important

  3. Opportunity: Power-Aware DRAM
  • Multiple power states: fast access at high power, slow access at low power
  • A new take on the memory hierarchy
  • How to exploit the opportunity?
  Rambus RDRAM power states (resynchronization time back to Active in parentheses):
  • Active: 300 mW
  • Standby: 180 mW (+6 ns)
  • Nap: 30 mW (+60 ns)
  • Power Down: 3 mW (+6000 ns)

  4. RDRAM as a Memory Hierarchy
  • Each chip can be independently put into the appropriate power mode
  • The number of chips at each “level” of the hierarchy can vary dynamically
  • Policy choices: initial page placement in an “appropriate” chip; dynamic movement of a page from one chip to another; transitioning of the power state of the chip containing a page
  (Diagram: chips in Active, Active, and Nap states.)

  5. RAMBUS RDRAM Main Memory Design
  • A single RDRAM chip provides high bandwidth per access; a novel signaling scheme transfers multiple bits on one wire
  • Many internal banks: many requests can go to one chip
  • Energy implication: activate only one chip to perform an access at the same high bandwidth as the conventional design
  (Diagram: CPU/$ fetches part of a cache block from one Active chip while the other chips sit in Standby or Power Down.)

  6. Conventional Main Memory Design
  • Multiple DRAM chips provide high bandwidth per access over a wide bus to the processor
  • Few internal banks
  • Energy implication: all of those chips must be activated to perform an access at high bandwidth
  (Diagram: CPU/$ fetches parts of a cache block from Chips 0-3, all Active.)

  7. Exploiting the Opportunity
  Interaction between the power state model and access locality:
  • How to manage the power state transitions? Memory controller policies; quantify the benefits of the power states
  • What role does software have? The energy impact of the allocation of data/text to memory

  8. Power-Aware DRAM Main Memory Design
  • Properties of PA-DRAM allow us to access and control each chip individually
  • Two dimensions to affect energy policy: the HW controller and the OS (page mapping/allocation)
  • Energy strategy: cluster accesses to already powered-up chips; exploit the interaction between power state transitions and data locality
  (Diagram: OS page mapping/allocation in software; per-chip controllers in hardware; Chip 0 Active, the others in Standby/Power Down.)

  9. Power State Transitioning
  Ideal case: assume we want no added latency. After the completion of the last request in a run there is a time gap before the next requests arrive. Transitioning down is worthwhile when
  (t_h->l + t_l->h + t_benefit) * p_high > t_h->l * p_h->l + t_l->h * p_l->h + t_benefit * p_low
  where t_h->l and t_l->h are the times to transition down and back up, p_h->l and p_l->h the power drawn during those transitions, p_high and p_low the powers of the two states, and t_benefit the time spent in the low-power state.
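To make the inequality concrete, here is a minimal sketch that evaluates it with the Active/Nap numbers from slide 3, assuming the transitions themselves draw power at the high rate:

```c
#include <stdio.h>
#include <stdbool.h>

/* Evaluates the slide's inequality: staying high for the whole gap
 * must cost more energy than transitioning down, idling low, and
 * transitioning back up. Units: ns and mW (consistent on both sides). */
static bool worth_transitioning(double t_down, double t_up, double t_benefit,
                                double p_high, double p_low, double p_trans)
{
    double stay_high = (t_down + t_up + t_benefit) * p_high;
    double go_low    = t_down * p_trans + t_up * p_trans + t_benefit * p_low;
    return stay_high > go_low;
}

int main(void)
{
    /* Active (300 mW) vs Nap (30 mW), 60 ns resynchronization each way;
     * transition power assumed equal to p_high. */
    for (double t = 0; t <= 200; t += 50)
        printf("t_benefit %3.0f ns: %s\n", t,
               worth_transitioning(60, 60, t, 300, 30, 300)
                   ? "transition down" : "stay active");
    return 0;
}
```

Under that assumption the transition terms cancel, so any positive time in the low state pays off in energy; the on-demand and threshold variants on the next two slides deal with the latency this energy-only model ignores.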

  10. Power State Transitioning
  On-demand case: after the completion of the last request in a run, the chip drops to the low-power state immediately; it transitions back up only when the next request arrives, which adds the latency t_l->h to that request.

  11. Power State Transitioning
  Threshold-based case: after the completion of the last request in a run, wait for a threshold period before transitioning down. This delays the transition down, trading some idle energy for not powering down during short gaps.

  12. Page Allocation Policies: Virtual-to-Physical Page Mapping
  • Random allocation – baseline policy; pages spread across chips
  • Sequential first-touch allocation – consolidate pages into the minimal number of chips; one shot
  • Frequency-based allocation – first-touch is not always best; allow (limited) movement after first-touch
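A toy sketch contrasting the first two policies; the chip and page counts are illustrative, not from the slides:

```c
#include <stdlib.h>

#define NCHIPS          16
#define PAGES_PER_CHIP  4096

static int used[NCHIPS];                /* pages placed on each chip */

/* Random allocation (baseline): spread pages across all chips,
 * keeping many chips powered; bookkeeping omitted. */
static int place_random(void)
{
    return rand() % NCHIPS;
}

/* Sequential first-touch: fill chip 0 before touching chip 1, so only
 * a minimal prefix of the chips ever needs to be powered up. */
static int place_first_touch(void)
{
    for (int c = 0; c < NCHIPS; c++) {
        if (used[c] < PAGES_PER_CHIP) {
            used[c]++;
            return c;
        }
    }
    return -1;                          /* all chips full */
}
```

Frequency-based allocation would additionally migrate frequently used pages into the low-numbered chips after first touch.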

  13. Power-Aware Virtual Memory Based on Context Switches
  Huang, Pillai, Shin, “Design and Implementation of Power-Aware Virtual Memory”, USENIX 2003.
  • Power state transitions are under SW control (not a HW controller)
  • Memory is treated explicitly as a hierarchy: a process’s active set of nodes is kept in a higher power state
  • The size of the active node set is kept small by grouping a process’s pages together in nodes – its “energy footprint”
  • Page mapping is viewed as a NUMA layer for implementation: the active set of pages, a_i, is put on preferred nodes, r_i
  • At context switch time, hide the latency of transitioning: transition the union of the active sets of the next-to-run and likely next-after-that processes from nap to standby (pre-charging), overlapping the transitions with other context switch overhead
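A minimal sketch of the context-switch hook this implies; the mask type and callbacks are illustrative assumptions, not PAVM's actual kernel interfaces:

```c
/* One bit per RDRAM node; 16 nodes fit easily in an unsigned int. */
typedef unsigned int nodemask_t;

void pavm_context_switch(nodemask_t next, nodemask_t after_next,
                         nodemask_t all_nodes,
                         void (*to_standby)(nodemask_t),
                         void (*to_nap)(nodemask_t))
{
    /* Pre-charge the union of the active sets of the next-to-run and
     * likely next-after-that processes, hiding the nap -> standby
     * latency behind the rest of the context-switch work. */
    nodemask_t wanted = next | after_next;
    to_standby(wanted);
    to_nap(all_nodes & ~wanted);        /* everything else can nap */
}
```

Pre-charging to standby rather than active caps the power cost while still making the nodes fast to access once the next process runs.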

  14. Power-Aware DRAM Main Memory Design
  • Properties of PA-DRAM allow us to access and control each chip individually
  • Two dimensions to affect energy policy: the HW controller and the OS (page mapping/allocation)
  • Energy strategy: cluster accesses to each process’s preferred memory nodes; OS-triggered power state transitions on context switch
  (Diagram: OS page mapping/allocation; per-chip controllers; Chip 0 Nap, Chip 1 Active, Chip n-1 Standby.)

  15. Rambus RDRAM Read/Write Transaction
  Rambus RDRAM power states (PAVM numbers): Active 313 mW, Standby 225 mW, Nap 11 mW, Power Down 7 mW. Exit latencies grow with the depth of the state: +3 ns, +20 ns, +225 ns, and +22510 ns steps appear in the figure, with Power Down by far the slowest to exit.

  16. RDRAM Active Components

  17. Determining Active Nodes
  • A node is active iff at least one page from the node is mapped into process i’s address space
  • The table is maintained whenever a page is mapped or unmapped in the kernel
  • Alternatives rejected due to overhead: extra page faults, page table scans
  • Overhead is only one increment/decrement per mapping/unmapping operation
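The bookkeeping this implies is tiny; a sketch (array sizes and names are illustrative):

```c
/* Per-(process, node) mapped-page counters. A node is active for
 * process i iff its counter is nonzero. */
#define MAX_PROCS 1024
#define NNODES    16

static int pages_on_node[MAX_PROCS][NNODES];

void page_mapped(int pid, int node)   { pages_on_node[pid][node]++; }
void page_unmapped(int pid, int node) { pages_on_node[pid][node]--; }
int  node_active(int pid, int node)   { return pages_on_node[pid][node] > 0; }
```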

  18. Implementation Details
  Problem: DLLs and files shared by multiple processes (the buffer cache) become scattered all over memory with a straightforward assignment of incoming pages to each process’s active nodes – large energy footprints after all.

  19. Implementation Details
  Solutions:
  • DLL aggregation – special-case DLLs by allocating them sequential first-touch in low-numbered nodes
  • Migration – a kernel thread, kmigrated, runs in the background when the system is idle (waking up every 3 s); it scans the pages used by each process and migrates a page if conditions are met: a private page not on the process’s preferred nodes r_i, or a shared page outside the shared (DLL) nodes
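A control-flow sketch of such a daemon; every helper below is a hypothetical stub standing in for kernel facilities, and only the control flow mirrors the slide:

```c
#include <unistd.h>
#include <stdbool.h>

struct page;
extern bool system_idle(void);
extern struct page *first_page(void);
extern struct page *next_page(struct page *);
extern bool page_is_shared(struct page *);
extern bool on_preferred_nodes(struct page *);  /* private page on r_i? */
extern bool on_shared_nodes(struct page *);     /* shared page on DLL nodes? */
extern void migrate(struct page *);

/* kmigrated-style background thread: wakes every 3 s and migrates
 * stray pages only while the system is idle. */
void kmigrated(void)
{
    for (;;) {
        sleep(3);
        if (!system_idle())
            continue;
        for (struct page *pg = first_page(); pg; pg = next_page(pg)) {
            if (!page_is_shared(pg) && !on_preferred_nodes(pg))
                migrate(pg);        /* private page off its process's r_i */
            else if (page_is_shared(pg) && !on_shared_nodes(pg))
                migrate(pg);        /* shared page off the shared nodes */
        }
    }
}
```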

  20. Evaluation Methodology
  • Linux implementation
  • Measurements/counts taken of events; energy results calculated, not measured
  • Metric: energy used by memory (only)
  • Workloads – 3 mixes: light (editing, browsing, MP3), poweruser (light + kernel compile), multimedia (playing an MPEG movie)
  • Platform: 16 nodes, 512 MB of RDRAM
  • Not considered: DMA and kernel maintenance threads

  21. Results
  • Base – standby when not accessing
  • On/Off – nap when the system is idle
  • PAVM

  22. Results
  • PAVM
  • PAVMr1 – DLL aggregation
  • PAVMr2 – both DLL aggregation and migration

  23. Results

  24. Conclusions
  • Multiprogramming environment
  • Basic PAVM: saves 34-89% of the energy of a 16-node RDRAM memory
  • With optimizations: an additional 20-50%
  • Works with other kinds of power-aware memory devices

  25. Discussion: What about page replacement policies? Should (or how could) they be power-aware?

  26. Related Work
  • Lebeck et al., ASPLOS 2000 – dynamic hardware controller policies and page placement
  • Fan et al., ISLPED 2001 and PACS 2002
  • Delaluz et al., DAC 2002

  27. Dual-state HW Power State Policies
  • All chips are in one base state (Standby/Nap/Powerdown)
  • An individual chip goes Active while it has pending requests
  • It returns to the base power state when no access is pending
  (Diagram: over time, a chip alternates between Active on access and the base state.)

  28. Quad-state HW Policies
  • Downgrade the state if there is no access for a threshold time: Active -> STBY after T_a-s, STBY -> Nap after T_s-n, Nap -> PDN after T_n-p; any access returns the chip to Active
  • Independent transitions based on the access pattern to each chip
  • Competitive analysis: rent-to-buy
  • Active to Nap: 100’s of ns; Nap to PDN: 10,000 ns
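A sketch of a per-chip threshold automaton in this spirit; only the Nap -> PDN value (10,000 ns) is from the slide, the other thresholds are illustrative stand-ins:

```c
#include <stdint.h>

enum state { ACTIVE, STBY, NAP, PDN };

struct chip {
    enum state st;
    uint64_t   last_access_ns;
};

#define T_A_TO_S    100     /* illustrative */
#define T_S_TO_N    500     /* reaches Nap within "100's of ns" */
#define T_N_TO_P  10000     /* Nap -> PDN, per the slide */

void on_access(struct chip *c, uint64_t now_ns)
{
    c->st = ACTIVE;                     /* any access powers the chip up */
    c->last_access_ns = now_ns;
}

void on_tick(struct chip *c, uint64_t now_ns)
{
    uint64_t idle = now_ns - c->last_access_ns;
    if      (c->st == ACTIVE && idle >= T_A_TO_S) c->st = STBY;  /* T_a-s */
    else if (c->st == STBY   && idle >= T_S_TO_N) c->st = NAP;   /* T_s-n */
    else if (c->st == NAP    && idle >= T_N_TO_P) c->st = PDN;   /* T_n-p */
}
```

The rent-to-buy framing picks each threshold so that the energy wasted idling at the higher state before the downgrade is comparable to the cost of the transition, as in the classic ski-rental bound.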

  29. Page Allocation Policies: Virtual-to-Physical Page Mapping
  • Random allocation – baseline policy; pages spread across chips
  • Sequential first-touch allocation – consolidate pages into the minimal number of chips; one shot
  • Frequency-based allocation – first-touch is not always best; allow (limited) movement after first-touch

  30. Summary of Results (Energy × Delay product, RDRAM, ASPLOS 2000)
  • Dual-state hardware + random allocation: Nap is the best dual-state policy (60%-85%)
  • Dual-state hardware + sequential allocation: an additional 10% to 30% over Nap
  • Quad-state hardware + random allocation: improvement not obvious; could be equal to dual-state
  • Quad-state hardware + sequential allocation: the best approach – 6% to 55% over dual-nap-sequential, 80% to 99% over all-active

  31. OS Support for Superpages (Juan Navarro, Sitaram Iyer, Peter Druschel, Alan Cox; OSDI 2002)
  • Increasing cost of TLB miss overhead: working sets are growing, but TLB size does not grow at the same pace
  • Processors now provide superpages: one TLB entry can map a large region
  • OSs have been slow to harness them: no transparent superpage support for applications
  • Proposed: a practical and transparent solution to support superpages

  32. Translation Look-aside Buffer
  • The TLB caches virtual-to-physical address translations
  • TLB coverage: the amount of memory mapped by the TLB, i.e., the amount of memory that can be accessed without TLB misses
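For instance (illustrative numbers, not from the slides): a 128-entry TLB with 8 KB base pages covers 128 × 8 KB = 1 MB, matching the "typical TLB coverage ≈ 1 MB" figure cited on slide 34.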

  33. TLB Coverage Trend
  (Plot: TLB coverage as a percentage of main memory, a factor-of-1000 decrease in 15 years; annotated TLB miss overheads: 5%, 5-10%, and 30%.)

  34. How to Increase TLB Coverage
  • Typical TLB coverage is about 1 MB
  • Use superpages! Mix large and small pages
  • Increases TLB coverage with no increase in TLB size and no internal fragmentation
  • (With only large pages: larger working sets, more I/O)

  35. What Are These Superpages Anyway?
  • Memory pages of larger sizes, supported by most modern CPUs
  • Otherwise, the same as normal pages: power-of-2 size, using only one TLB entry
  • Contiguous and aligned, both physically and virtually
  • Uniform protection attributes; one reference bit, one dirty bit

  36. A Superpage TLB
  • Example superpage sizes – Alpha: 8, 64, 512 KB; 4 MB. Itanium: 4, 8, 16, 64, 256 KB; 1, 4, 16, 64, 256 MB
  (Diagram: the TLB maps a base-page entry (size=1) and a superpage entry (size=4) from virtual addresses to physical memory.)

  37. The superpage problem

  38. Issue 1: Superpage Allocation
  • How / when / what size to allocate?
  (Diagram: pages A, B, C, D are contiguous in virtual memory but scattered across physical memory, crossing superpage boundaries.)

  39. Issue 2: Promotion
  • Promotion: create a superpage out of a set of smaller pages; mark the page table entry of each base page
  • When to promote? Wait for the app to touch pages? May lose the opportunity to increase TLB coverage. Forcibly populate pages? May cause internal fragmentation. Create small superpages? May waste overhead.

  40. Issue 3: demotion Demotion: convert a superpage into smaller pages • when page attributes of base pages of a superpage become non-uniform • during partial pageouts

  41. Issue 4: fragmentation • Memory becomes fragmented due to • use of multiple page sizes • persistence of file cache pages • scattered wired (non-pageable) pages • Contiguity: contended resource • OS must • use contiguity restoration techniques • trade off impact of contiguity restoration against superpage benefits

  42. Design

  43. Key Observation
  Once an application touches the first page of a memory object, it is likely to quickly touch every page of that object.
  • Example: array initialization
  • Opportunistic policies: make superpages as large and as soon as possible, as long as a wrong decision carries no penalty

  44. Superpage Allocation: Preemptible Reservations
  • How much do we reserve? Goal: good TLB coverage, without internal fragmentation
  (Diagram: virtual pages A-D map into reserved frames within superpage boundaries in physical memory.)

  45. Allocation: Reservation Size
  Opportunistic policy:
  • Go for the biggest superpage size that is no larger than the memory object (e.g., the file)
  • If that size is not available, try preempting an existing reservation before resigning to a smaller size – a preempted reservation had its chance
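A sketch of this policy in C; the helpers, the size ladder, and the fallback are illustrative assumptions, not the paper's actual implementation:

```c
#include <stddef.h>

/* Hypothetical helpers; a real kernel would consult its reservation
 * lists and physical-memory allocator here. */
extern int  have_contiguous(size_t npages);     /* free aligned run available? */
extern int  preempt_reservation(size_t npages); /* steal the unused part of one */
extern void reserve(size_t npages);

/* Superpage sizes in base pages, biggest first (machine-dependent). */
static const size_t sizes[] = { 8, 4, 2, 1 };

void reserve_for_object(size_t object_pages)
{
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        size_t s = sizes[i];
        if (s > object_pages)
            continue;               /* never reserve past the object's end */
        if (have_contiguous(s) || preempt_reservation(s)) {
            reserve(s);             /* a preempted reservation had its chance */
            return;
        }
    }
    reserve(1);                     /* resign to a single base page */
}
```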

  46. Allocation: Managing Reservations
  • Reservations are kept in lists keyed by the size of their largest unused (and aligned) chunk (4, 2, 1)
  • The best candidate for preemption sits at the front of each list: the reservation whose most recently populated frame was populated the least recently

  47. Incremental Promotions
  Promotion policy: opportunistic.
  (Diagram: a region is promoted in steps as it fills – to a superpage of 2, then 4; at 4+2 populated pages it stays at 4 until all 8 base pages are present, then becomes an 8.)
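A sketch of the incremental check, with hypothetical helpers:

```c
#include <stddef.h>

/* all_populated() tests whether every base page in the region is mapped;
 * promote() rewrites the PTEs so the region uses one superpage entry. */
extern int  all_populated(void *va, size_t npages);
extern void promote(void *va, size_t npages);

/* Promote a region one power-of-two step at a time (2, 4, 8, ...),
 * stopping at the largest fully populated size. */
void maybe_promote(void *va, size_t max_pages)
{
    for (size_t sz = 2; sz <= max_pages; sz *= 2) {
        if (!all_populated(va, sz))
            return;
        promote(va, sz);
    }
}
```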

  48. Speculative Demotions
  • One reference bit per superpage: how do we detect portions of a superpage that are no longer being referenced?
  • On memory pressure, demote superpages when resetting the reference bit
  • Re-promote (incrementally) as pages are referenced again

  49. Demotions: Dirty Superpages
  • One dirty bit per superpage: there is no way to tell what’s dirty and what’s not, so the entire superpage must be paged out
  • Demote on the first write to a clean superpage
  • Re-promote (incrementally) as other pages are dirtied
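A sketch of the corresponding write-fault path, with hypothetical helpers:

```c
/* Write-fault path for a clean superpage, per slide 49. */
extern void demote_to_base_pages(void *superpage);
extern void mark_base_page_dirty(void *va);

void on_write_to_clean_superpage(void *superpage, void *faulting_va)
{
    /* One dirty bit per superpage cannot say which base pages are dirty,
     * so demote on the first write and dirty only the faulting page... */
    demote_to_base_pages(superpage);
    mark_base_page_dirty(faulting_va);
    /* ...then re-promote incrementally as other base pages are dirtied. */
}
```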
