
Outline

Learn how to leverage remote memory and low latency networks to overcome I/O bottlenecks and improve data retrieval speed. Explore topics such as distributed shared memory, memory paging, and global memory services.

Presentation Transcript


  1. Outline • Objective for Lecture • Finish up memory issues: • Remote memory in the role of backing store • Power-aware memory (Duke research)

  2. Topic: Paging to Remote Memory

  3. Motivation: How to exploit remote memory? Low-latency networks make retrieving data across the network faster than going to local disk, easing the "I/O bottleneck." Two uses: Distributed Shared Memory (DSM), and remote memory as a backing store for VM paging and file caching. [Figure: processor (P) and cache ($) backed by local memory, with remote memory reachable over the network.]

  5. Pageout/pagein, peer-to-peer style. [Figure: compute nodes (processor/memory pairs) page out and page in over a switched LAN to "idle" workstations or dedicated memory servers.] Global Memory Service (GMS) • GMS kernel module for Digital Unix and FreeBSD • remote paging, cooperative caching

  6. What does it take to page out to remote memory (and back in)? • How to uniquely name and identify pages/blocks? • 128-bit Unique Identifiers (UIDs), used as Global Page Numbers (GPNs). • How to maintain consistent information about the location of pages in the network? • Each node keeps a Page Frame Directory (PFD) of pages cached locally. • A distributed Global Cache Directory (GCD) identifies the caching site(s) for each page or block. • Hash on page IDs to get the GCD site (see the sketch below). • How to keep a consistent view of the membership in the cluster and the status of participating nodes? • How to globally coordinate memory usage for the page cache? • "Think globally, act locally." – choice of replacement algorithm
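
A minimal sketch of the directory-lookup idea in C. The 64-bit fold of the 128-bit UID, the mixing constant, and the function names are illustrative assumptions, not GMS's actual interface; the point is only that every node computes the same GCD site from a page's name with no communication:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 128-bit global page number (UID). */
typedef struct { uint64_t hi, lo; } gpn_t;

/* Map a GPN to its directory (GCD) site by hashing. Every node runs
 * the same function over the same membership list, so all nodes agree
 * on the directory site without exchanging messages. */
static int gcd_site(gpn_t gpn, int num_nodes) {
    uint64_t h = gpn.hi ^ (gpn.lo * 0x9e3779b97f4a7c15ULL);  /* simple mix */
    return (int)(h % (uint64_t)num_nodes);
}

int main(void) {
    gpn_t page = { 0xa21f3ULL, 0x42ULL };
    printf("directory site for page: node %d\n", gcd_site(page, 8));
    return 0;
}
```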

  7. Global Memory System (GMS) in a Nutshell • Basic operations: putpage, getpage, movepage, discardpage. • Directory site is self for private pages, hashed for sharable pages. • Caching site is selected locally based on global information. [Figure: GMS clients send getpage/putpage requests through directory sites (GCD), which forward them to caching sites (PFD) and receive page updates. GCD: Global Cache Directory; PFD: Page Frame Directory; P: Processor; M: Memory.]

  8. GMS Page Fetch (getpage). Page fetches are requested with a getpage RPC call to the page's directory site. A getpage can result from a demand fault or from a prefetch policy decision (asynchronous). The directory site delegates the request to the correct caching site (getpage request -> getpage forward, a delegated RPC -> getpage reply), and the caching site sends the page directly to the requesting node.
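
A single-process sketch of that delegated-RPC flow; the node IDs, the gcd_lookup stub, and the message names are assumptions for illustration, not the real GMS kernel interface:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t hi, lo; } gpn_t;

/* Stub directory lookup: pretend the page is cached at node 5. */
static int gcd_lookup(gpn_t page) { (void)page; return 5; }

static void getpage_reply(gpn_t page, int caching_site, int requester) {
    /* Caching site sends the page directly to the requester,
     * bypassing the directory site on the data path. */
    printf("node %d -> node %d: getpage reply (page %llx)\n",
           caching_site, requester, (unsigned long long)page.lo);
}

static void getpage_forward(gpn_t page, int dir_site, int requester) {
    /* Directory site delegates the RPC to the caching site;
     * it never touches the page data itself. */
    int caching_site = gcd_lookup(page);
    printf("node %d -> node %d: getpage forward (delegated RPC)\n",
           dir_site, caching_site);
    getpage_reply(page, caching_site, requester);
}

static void getpage_request(gpn_t page, int requester, int dir_site) {
    /* A demand fault or prefetch decision triggers the request. */
    printf("node %d -> node %d: getpage request (page %llx)\n",
           requester, dir_site, (unsigned long long)page.lo);
    getpage_forward(page, dir_site, requester);
}

int main(void) {
    gpn_t page = { 0xa21f3ULL, 0x42ULL };
    getpage_request(page, /*requester=*/1, /*dir_site=*/3);
    return 0;
}
```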

  9. Global vs. Local Scenarios. In each case Node B faults on some virtual address. [Figure: each case shows Node A and Node B with their locally-LRU and globally-LRU pages; the faulted page displaces a victim chosen per case.] Case 1: Page is in another node's global memory. Case 2: Same as case 1, but B has no global pages. Case 3: Page is on disk (going active). Case 4: Shared page in the local memory of node A (the page is copied rather than moved).

  10. Global Replacement • Fantasy: on a page fetch, choose the globally "coldest" page in the cluster to replace (i.e., global LRU). • may demote a local page to global • Reality: approximate global LRU by making replacement choices across epochs of (at most) M evictions in T seconds (sketched below). • trade off accuracy for performance • nodes send a summary of page values to the initiator at the start of each epoch • the initiator determines which nodes will yield the epoch's evictions • favor local pages over global pages • victims are chosen probabilistically
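
A sketch of the epoch mechanism: each node reports a summary (here simply its count of "old" pages, an assumed metric), and the initiator spreads the epoch's M evictions across nodes probabilistically in proportion to those summaries. All names and values are illustrative:

```c
#include <stdio.h>
#include <stdlib.h>

#define NODES 4

int main(void) {
    int old_pages[NODES] = { 120, 30, 5, 45 };  /* per-node summaries   */
    int evictions[NODES] = { 0 };
    int M = 100;                                /* evictions this epoch */

    int total = 0;
    for (int i = 0; i < NODES; i++) total += old_pages[i];

    srand(42);
    for (int e = 0; e < M; e++) {
        /* Pick a node with probability old_pages[i] / total. */
        int r = rand() % total;
        for (int i = 0; i < NODES; i++) {
            if (r < old_pages[i]) { evictions[i]++; break; }
            r -= old_pages[i];
        }
    }
    for (int i = 0; i < NODES; i++)
        printf("node %d yields %d of the epoch's evictions\n",
               i, evictions[i]);
    return 0;
}
```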

  11. GMS Putpage/Movepage. Page/block evictions trigger an asynchronous putpage request from the client; the page is sent as a payload piggybacked on the message. The directory (GCD) site is selected by a hash on the global page number; every node is the GCD site for its own private pages. The caching site is selected locally based on global information distributed each epoch. [Figure: the client's putpage updates the directory site; a later movepage from caching site #1 to caching site #2 updates the directory again.]

  12. Topic: Energy Efficiency of Memory

  13. Motivation: What can the OS do about the energy consumption of memory? [Figure: power budget breakdowns for a laptop with a 9-watt processor and a handheld with a 1-watt processor.]

  14. Energy Efficiency Metrics • Power consumption in watts (or mW) • Battery lifetime in hours (or seconds, or months) • Energy consumption in joules (or mJ) • Energy * delay product • Watts per megabyte

  15. Physics Basics • Voltage is the energy transferred per unit of charge as it flows from a power source (e.g., a battery) through the circuit. • Power is the rate of energy transfer. • Current is the rate of flow of charge in the circuit.

  16. Terminology and Symbols

  Concept      Symbol   Unit       Abbrev.
  Current      I        amperes    A
  Charge       Q        coulombs   C
  Voltage      V        volts      V
  Energy       E        joules     J
  Power        P        watts      W
  Resistance   R        ohms       Ω

  17. Relationships • Energy (joules) = Power (watts) * Time (sec): E = P * t • Power (watts) = Voltage (volts) * Current (amps): P = V * I • Current (amps) = Voltage (volts) / Resistance (ohms): I = V / R
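
A quick worked example tying the relationships together, using numbers that appear elsewhere in this lecture (RDRAM active power, IBM Microdrive write current); purely illustrative arithmetic:

```c
#include <stdio.h>

int main(void) {
    /* E = P * t: an RDRAM chip draws 300 mW while active;
     * one read/write access takes 60 ns. */
    double p_active = 0.300;   /* watts   */
    double t_access = 60e-9;   /* seconds */
    printf("energy per access: %.1f nJ\n", p_active * t_access * 1e9);

    /* P = V * I: a Microdrive write draws 300 mA at 3.3 V. */
    double v = 3.3, i = 0.300; /* volts, amps */
    printf("write power: %.2f W\n", v * i);  /* ~1 W */
    return 0;
}
```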

  18. Power Management (RAMBUS memory) [Figure: per-chip power states and transition costs, shown for 2 chips. Active: 300 mW; a read or write takes 60 ns. Standby: 180 mW, +6 ns to return to active. Nap: 30 mW, +60 ns to return to active. Powered down: 3 mW, +6 µs to return to active.] The v.a. -> p.a. mapping could exploit page coloring ideas.
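
A sketch of the per-chip state table implied by the figure; the struct layout and names are illustrative, and the powerdown resynchronization time assumes the +6 µs figure above:

```c
#include <stdio.h>

/* Per-chip RDRAM power states: draw in each state and the extra
 * latency to get back to Active, per the slide above. */
typedef enum { ACTIVE, STANDBY, NAP, POWERDOWN } pstate_t;

static const struct {
    const char *name;
    double power_mw;   /* draw while in this state             */
    double resync_ns;  /* extra latency to return to Active    */
} rdram[] = {
    [ACTIVE]    = { "active",    300.0,    0.0 },
    [STANDBY]   = { "standby",   180.0,    6.0 },
    [NAP]       = { "nap",        30.0,   60.0 },
    [POWERDOWN] = { "powerdown",   3.0, 6000.0 },  /* 6 µs */
};

int main(void) {
    for (int s = ACTIVE; s <= POWERDOWN; s++)
        printf("%-10s %6.0f mW, +%6.0f ns to reactivate\n",
               rdram[s].name, rdram[s].power_mw, rdram[s].resync_ns);
    return 0;
}
```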

  21. RDRAM as a Memory Hierarchy. Each chip can be independently put into the appropriate power mode, and the number of chips at each "level" of the hierarchy can vary dynamically. Policy choices: • initial page placement in an "appropriate" chip • dynamic movement of a page from one chip to another • transitioning the power state of the chip containing a page [Figure: chips in Active, Active, and Nap states forming a hierarchy.]

  22. Two Dimensions to Control Energy [Figure: software control, where the OS sets the page mapping/allocation between the CPU/$ and memory; and hardware control, where a per-chip controller (ctrl) puts each of Chip 1..Chip n into Active, Standby, or Power Down.]

  23. Dual-state (Static) HW Power State Policies • All chips sit in one base state (Standby, Nap, or Powerdown) • An individual chip is Active while it has pending requests • It returns to the base power state when no access is pending [Figure: two-state machine, Active <-> base state, driven by access / no pending access; timeline alternates between Access (Active) and Base over time.]

  24. Quad-state (Dynamic) HW Policies • Downgrade state if no access for a threshold time (sketched below) • Independent transitions based on the access pattern to each chip • Competitive analysis: rent-to-buy (ski rental) • Active to Nap: 100s of ns • Nap to PDN: 10,000 ns [Figure: four-state machine Active -> STBY -> Nap -> PDN, downgrading after no access for Ta-s, Ts-n, Tn-p respectively; any access returns the chip to Active. Timeline shows a chip stepping down through the states between accesses.]
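
A sketch of the threshold-driven downgrade logic for one chip, using the threshold values evaluated later in the lecture (0 ns A->S, 750 ns S->N, 375,000 ns N->P); the function shapes and sample idle times are assumptions:

```c
#include <stdio.h>

typedef enum { ACTIVE, STANDBY, NAP, POWERDOWN } pstate_t;

/* Idle time (since last access) at which to leave each state. */
static const double downgrade_ns[] = {
    [ACTIVE]  = 0.0,       /* Ta-s: Active -> Standby immediately */
    [STANDBY] = 750.0,     /* Ts-n: Standby -> Nap                */
    [NAP]     = 375000.0,  /* Tn-p: Nap -> Powerdown              */
};

/* Called periodically with the chip's idle time so far. */
pstate_t tick(pstate_t s, double idle_ns) {
    if (s < POWERDOWN && idle_ns >= downgrade_ns[s])
        return (pstate_t)(s + 1);
    return s;
}

/* Any access returns the chip to Active. */
pstate_t on_access(void) { return ACTIVE; }

int main(void) {
    const char *names[] = { "active", "standby", "nap", "powerdown" };
    pstate_t s = ACTIVE;
    double idle[] = { 0, 800, 400000, 500000 };  /* sample idle times */
    for (int i = 0; i < 4; i++) {
        s = tick(s, idle[i]);
        printf("idle %.0f ns -> %s\n", idle[i], names[s]);
    }
    return 0;
}
```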

  25. Ski Rental Analogy • Week 1: "Am I gonna like skiing? I better rent skis." • Week 10: "I'm skiing black diamonds and loving it. I could've bought skis for all the rental fees I've paid. I should have bought them in the first place. Guess I'll buy now." • If he buys exactly when rentalsPaid == skiCost, then he pays at most 2 * optimal.
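
A tiny check of the 2-competitive claim; the prices and horizons are made up. The online policy rents until the fees paid equal the purchase price, then buys; the optimal offline policy knows the total number of ski weeks in advance:

```c
#include <stdio.h>

int main(void) {
    double rent = 50, buy = 500;         /* per week; one-time purchase */
    int breakeven = (int)(buy / rent);   /* weeks rented before buying  */

    for (int weeks = 1; weeks <= 29; weeks += 7) {
        double online = (weeks <= breakeven)
                      ? weeks * rent                /* never bought        */
                      : breakeven * rent + buy;     /* rented, then bought */
        double optimal = (weeks * rent < buy) ? weeks * rent : buy;
        printf("weeks=%2d  online=%6.0f  optimal=%6.0f  ratio=%.2f\n",
               weeks, online, optimal, online / optimal);
    }
    return 0;   /* the ratio never exceeds 2 */
}
```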

  26. Page Allocation Policies • Random Allocation • pages spread across chips • Sequential First-Touch Allocation (sketched below) • consolidate pages into the minimal number of chips • one shot: pages stay where first placed • Frequency-based Allocation • first-touch is not always best • allow movement after first-touch
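
A sketch of a sequential first-touch allocator: fill chip 0 completely before touching chip 1, so untouched chips can stay in a low-power state. Chip count, capacity, and the allocator shape are illustrative assumptions:

```c
#include <stdio.h>

#define CHIPS          4
#define PAGES_PER_CHIP 1024

static int used[CHIPS];   /* pages allocated on each chip */

/* Return the chip for a newly touched page, or -1 if memory is full. */
int alloc_page_sequential(void) {
    for (int c = 0; c < CHIPS; c++)
        if (used[c] < PAGES_PER_CHIP) {
            used[c]++;
            return c;
        }
    return -1;
}

int main(void) {
    for (int p = 0; p < 1500; p++)
        alloc_page_sequential();
    for (int c = 0; c < CHIPS; c++)
        printf("chip %d: %4d pages%s\n", c, used[c],
               used[c] == 0 ? "  (can stay powered down)" : "");
    return 0;
}
```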

  27. Dual-state + Random Allocation* • Two-state model: Active to perform the access, then return to the base state • Nap is the best base state: ~85% reduction in E*D over full power • Little change in run time; most gains are in energy/power (*NT traces)

  28. Quad-state HW + Sequential Allocation* • Baseline: dual-state Nap with sequential allocation • Thresholds: 0 ns A->S; 750 ns S->N; 375,000 ns N->P (subsequent analysis shows a 0 ns threshold S->N is better) • Quad-state + sequential: 30% to 55% additional improvement over dual-state Nap sequential • Sophisticated HW alone is not enough (*SPEC)

  29. The Design Space

                             Random Allocation             Sequential Allocation
  Dual-state HW (static)     Nap is the best dual-state    Additional 10% to 30%
                             policy: 60%-85%               over Nap
  Quad-state HW (dynamic)    No compelling benefits        Best approach: 6% to 55% over
                             over simpler policies         dual-nap-seq, 80% to 99%
                                                           over all active

  30. Spin-down Disk Model [Figure: disk state machine. Spinning & Ready <-> Spinning & Seek <-> Spinning & Access on each request; Spinning & Ready -> Spinning down on an inactivity timeout threshold*; Not Spinning -> Spinning up on a trigger (request, or predictive spin-up); Spinning up -> Spinning & Ready.]

  31. Spin-down Disk Model (annotated) [Figure: the same state machine labeled with the model's parameters. Each transition (spinning up or down) costs Etransition = Ptransition * Ttransition, with a ~1-3 s spin-up delay. The disk draws Pspin while spinning and Pdown while not spinning; an idle period Tidle splits into Tout (waiting out the inactivity timeout threshold*) followed by Tdown (spun down).]

  32. Reducing Energy Consumption. Energy = Σi Poweri * Timei, summed over power states i. To reduce the energy used for a task: • Reduce the power cost of power state i through better technology. • Reduce the time spent in the higher-cost power states. • Amortize transition states (spinning up or down) if significant. Spinning down wins when Pdown*Tdown + 2*Etransition + Pspin*Tout < Pspin*Tidle, where Tdown = Tidle - (Ttransition + Tout).
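
A worked check of that break-even inequality, plugging in ballpark IBM TravelStar figures from the next slide; the transition time, timeout, and idle-period values are assumptions:

```c
#include <stdio.h>

int main(void) {
    double p_spin  = 1.8;   /* W, spinning & ready                        */
    double p_down  = 0.25;  /* W, standby                                 */
    double p_trans = 4.7;   /* W, during spin-up/down (startup figure)    */
    double t_trans = 2.0;   /* s, per transition (assumed, ~1-3 s range)  */
    double t_out   = 5.0;   /* s, inactivity timeout (assumed)            */

    double e_trans = p_trans * t_trans;  /* Etransition = Ptransition * Ttransition */

    for (double t_idle = 10; t_idle <= 60; t_idle += 10) {
        double t_down = t_idle - (t_trans + t_out);
        double spin_down_cost = p_down * t_down + 2 * e_trans + p_spin * t_out;
        double stay_up_cost   = p_spin * t_idle;
        printf("Tidle=%4.0f s  spin-down %6.1f J  stay-up %6.1f J  %s\n",
               t_idle, spin_down_cost, stay_up_cost,
               spin_down_cost < stay_up_cost ? "spin down wins"
                                             : "stay spinning");
    }
    return 0;
}
```

With these numbers the crossover falls between 10 s and 20 s of idle time: shorter gaps do not amortize the two transitions, longer gaps do.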

  33. Power Specs

  IBM Microdrive (1 inch)
    writing    300 mA (3.3 V)   1 W
    standby     65 mA (3.3 V)   .2 W

  IBM TravelStar (2.5 inch)
    read/write       2 W
    spinning         1.8 W
    low-power idle   .65 W
    standby          .25 W
    sleep            .1 W
    startup          4.7 W
    seek             2.3 W

  34. Spin-down Disk Model (with TravelStar power numbers) [Figure: the same state machine annotated with power draws: Spinning & Seek 2.3 W, Spinning up 4.7 W, Spinning & Access 2 W, Spinning & Ready .65-1.8 W, Not Spinning .2 W.]

  35. Spin-Down Policies • Fixed thresholds • Tout set to the spin-down cost, s.t. 2*Etransition = Pspin*Tout • Adaptive thresholds: Tout = f(recent accesses) • exploit burstiness in Tidle • Minimizing bumps (user annoyance/latency) • predictive spin-ups • Changing access patterns (making burstiness) • caching • prefetching
