
Nested Paging in VM Replay for MPs


Presentation Transcript


  1. Nested Paging in VM Replay for MPs. Jaehyuk Huh, Computer Science, KAIST

  2. Address Translation in VM
  • Need to translate a guest VA (gVA) to a machine address
    • gVA (guest VA) → gPA (guest PA) → sPA (system PA) (see the sketch below)
  • Paravirtualization
    • Guest page table (managed by the guest OS) directly maps gVA to sPA
    • Hypervisor validates the guest page table
  • Full virtualization
    • SW technique: shadow paging
    • HW-assisted technique: nested paging
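A minimal C sketch of the two-stage mapping above, using flat single-level lookup tables as stand-ins for the real multi-level x86 structures (the gpt/npt arrays and page counts are illustrative assumptions): a gVA is first mapped to a gPA by the guest's table, and that gPA is then mapped to an sPA by the hypervisor-controlled table.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NPAGES     16   /* toy address-space size (assumption) */

    /* Hypothetical flat lookup tables: frame number per page, -1 = not mapped. */
    static int gpt[NPAGES];  /* guest page table:  gVA page -> gPA frame */
    static int npt[NPAGES];  /* host/nested table: gPA frame -> sPA frame */

    /* Translate gVA -> gPA -> sPA, or return -1 on a missing mapping. */
    static int64_t translate(uint64_t gva)
    {
        uint64_t gvpn = gva >> PAGE_SHIFT;
        if (gvpn >= NPAGES || gpt[gvpn] < 0) return -1;   /* guest-level fault */
        uint64_t gpfn = (uint64_t)gpt[gvpn];
        if (gpfn >= NPAGES || npt[gpfn] < 0) return -1;   /* VMM-level fault   */
        uint64_t spfn = (uint64_t)npt[gpfn];
        return (int64_t)((spfn << PAGE_SHIFT) | (gva & (PAGE_SIZE - 1)));
    }

    int main(void)
    {
        for (int i = 0; i < NPAGES; i++) { gpt[i] = -1; npt[i] = -1; }
        gpt[1] = 5;   /* guest maps gVA page 1 to gPA frame 5 */
        npt[5] = 9;   /* VMM maps gPA frame 5 to sPA frame 9  */
        printf("sPA = 0x%llx\n", (long long)translate(0x1234));  /* prints 0x9234 */
        return 0;
    }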

  3. X86 4KB page tables in long mode

  4. Shadow Page Table
  • Shadow page table (sPT)
    • translates from gVA to sPA
    • maintained by the VMM (hypervisor)
  • VMM intercepts updates of the page table base address
    • CR3 updates in x86
    • Sets CR3 to the sPT base address instead of the gPT base address (sketch below)
  • sPT must be consistent with the guest page table (gPT) → gPT updates must be reflected in the sPT
  • Any page fault must be intercepted by the VMM
    • VMM must tell guest-induced page faults from VMM-induced ones
    • Vectors guest-induced page faults to the guest OS
    • High overhead for page fault handling
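A toy sketch of the CR3-interception step, assuming hypothetical VMM helpers (lookup_or_build_spt, hw_set_cr3) rather than any real hypervisor interface: the guest's CR3 write is trapped, the VMM remembers the gPT base so the guest can read it back, and the hardware register is loaded with the sPT base instead.

    #include <stdio.h>

    typedef unsigned long paddr_t;

    typedef struct {
        paddr_t gpt_base;   /* what the guest thinks CR3 holds (gPT base)    */
        paddr_t spt_base;   /* what the hardware CR3 really holds (sPT base) */
    } vm_t;

    /* Stand-in for building or finding the shadow table that mirrors this gPT. */
    static paddr_t lookup_or_build_spt(paddr_t gpt_base) { return gpt_base ^ 0xA000; }

    /* Stand-in for loading the physical CR3 register. */
    static void hw_set_cr3(paddr_t base) { printf("hardware CR3 <- 0x%lx\n", base); }

    /* VM exit for a guest "mov %reg, %cr3": shadow the write. */
    static void vmm_on_guest_cr3_write(vm_t *vm, paddr_t new_gpt_base)
    {
        vm->gpt_base = new_gpt_base;                      /* value the guest will read back */
        vm->spt_base = lookup_or_build_spt(new_gpt_base); /* table the MMU will walk        */
        hw_set_cr3(vm->spt_base);
    }

    int main(void)
    {
        vm_t vm = {0, 0};
        vmm_on_guest_cr3_write(&vm, 0x3000);  /* guest context switch */
        printf("guest sees CR3 = 0x%lx\n", vm.gpt_base);
        return 0;
    }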

  5. How to make gPT and sPT consistent?
  • Write-protecting the gPT
    • Any modification of the gPT (adding or removing a translation) causes a fault
    • VMM updates the sPT accordingly
  • Exploiting page-fault behavior and TLB consistency rules (sketch below)
    • Adding a page translation
      • Guest OS can add a new translation to the gPT without interception by the VMM
      • A later access by the guest VM causes a page fault on the new translation
      • VMM updates the sPT on that page fault: it must inspect the gPT to find the new page
    • Deleting a page translation
      • Guest OS executes INVLPG to invalidate the TLB entry
      • VMM intercepts the execution and removes the entry from the sPT
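The fault-driven synchronization above can be sketched as two VM-exit handlers; the flat gpt/spt arrays and handler names are illustrative assumptions, not Xen or KVM code.

    #include <stdio.h>

    #define NPAGES 16
    static int gpt[NPAGES];   /* guest page table: gVA page -> sPA frame (-1 = unmapped) */
    static int spt[NPAGES];   /* shadow page table: what the hardware actually walks     */

    static void inject_guest_pf(int gvpn) { printf("guest #PF on page %d\n", gvpn); }

    /* Page-fault VM exit: is this the guest's fault or just a stale shadow entry? */
    static void vmm_on_page_fault(int gvpn)
    {
        if (gpt[gvpn] >= 0) {
            spt[gvpn] = gpt[gvpn];   /* guest added the mapping earlier; sync lazily */
            printf("sPT filled for page %d\n", gvpn);
        } else {
            inject_guest_pf(gvpn);   /* genuine guest-induced fault */
        }
    }

    /* INVLPG VM exit: the guest dropped a translation, so drop it from the sPT too. */
    static void vmm_on_invlpg(int gvpn) { spt[gvpn] = -1; }

    int main(void)
    {
        for (int i = 0; i < NPAGES; i++) { gpt[i] = spt[i] = -1; }
        gpt[3] = 7;              /* guest OS adds a translation without trapping    */
        vmm_on_page_fault(3);    /* first access faults; VMM copies it into the sPT */
        vmm_on_page_fault(4);    /* unmapped in the gPT too: forwarded to the guest */
        gpt[3] = -1;             /* guest removes the translation ...               */
        vmm_on_invlpg(3);        /* ... and its INVLPG is intercepted               */
        return 0;
    }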

  6. Overheads of Shadow Paging
  • Any page fault requires expensive VMM intervention
    • Guest-induced page faults
    • Hypervisor-induced page faults
  • Accessed and dirty bit updates
    • HW page walker sets the bits in the sPT (not the gPT)
    • Guest OS needs that information to make paging decisions
    • Dirty-bit example: set pages pointed to by the sPT read-only
  • Problems on MPs
    • What if a VM uses multiple processors?
    • Replicating the sPT for each processor → memory overhead
    • Sharing the sPT → must synchronize the sPT on any change

  7. Shadow Paging Overheads

  8. Nested Page Table
  • A source of address translation overhead in a traditional x86 VMM
    • a fixed hardware page walker handles a TLB miss
    • it can walk only one page table (the one pointed to by CR3)
  • Nested paging
    • Separate HW paging state (two copies of CR3, etc.) for the guest OS and the VMM
    • HW page walker can walk both the gPT and the VMM's nested page table (sketch below)
    • TLB can hold a translation directly from gVA to sPA
  • Benefit: no more traps on guest page table accesses
  • Drawback: extra page table steps add latency to a TLB miss
    • May add extra caching for page translation
      • Nested TLB
      • 2D page walk cache
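A toy sketch of the 2-D walk, assuming one-level guest and nested tables for brevity (real x86 uses four levels on each dimension): the point is that even the gPT entries live at guest-physical addresses, so every step of the guest walk needs a nested translation before the final gPA is itself translated to an sPA.

    #include <stdio.h>

    #define NPAGES 16

    static int npt[NPAGES];          /* nested table: gPA frame -> sPA frame */
    static int guest_mem[NPAGES][8]; /* toy "machine RAM", indexed by sPA frame */

    /* Nested step: translate a guest-physical frame to a system-physical frame. */
    static int npt_walk(int gpfn) { return (gpfn >= 0 && gpfn < NPAGES) ? npt[gpfn] : -1; }

    /* Full 2-D walk for a guest virtual page number, given the gPA of the gPT. */
    static int nested_walk(int gpt_gpfn, int gvpn)
    {
        int gpt_spfn = npt_walk(gpt_gpfn);      /* locate the gPT itself in machine memory */
        if (gpt_spfn < 0) return -1;
        int gpfn = guest_mem[gpt_spfn][gvpn];   /* guest walk step: read the gPT entry     */
        if (gpfn < 0) return -1;
        return npt_walk(gpfn);                  /* nested step: final gPA -> sPA           */
    }

    int main(void)
    {
        for (int i = 0; i < NPAGES; i++) npt[i] = -1;
        npt[2] = 6;                 /* the gPT page (gPA frame 2) lives in sPA frame 6 */
        npt[5] = 9;                 /* data page: gPA frame 5 -> sPA frame 9           */
        for (int i = 0; i < 8; i++) guest_mem[6][i] = -1;
        guest_mem[6][1] = 5;        /* gPT entry: gVA page 1 -> gPA frame 5            */
        printf("gVA page 1 -> sPA frame %d\n", nested_walk(2, 1));  /* prints 9 */
        return 0;
    }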

  9. Nested Paging

  10. Nested Paging

  11. Address Space IDs
  • Old x86 did not support address space IDs (ASIDs) in TLBs
    • must flush the TLBs on a VM switch
  • Assign an ASID to each VM (sketch below)
    • Still need to flush the TLBs on a context switch within a VM
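A small sketch of an ASID-tagged TLB lookup, with illustrative entry fields and sizes: a hit requires both the page number and the ASID to match, so entries from different VMs can coexist and a VM switch only changes the current ASID instead of flushing.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 8

    struct tlb_entry { bool valid; uint16_t asid; uint64_t vpn, pfn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    static bool tlb_lookup(uint16_t cur_asid, uint64_t vpn, uint64_t *pfn_out)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].asid == cur_asid && tlb[i].vpn == vpn) {
                *pfn_out = tlb[i].pfn;
                return true;                  /* hit only within the same VM */
            }
        }
        return false;                         /* miss -> (nested) page walk  */
    }

    int main(void)
    {
        tlb[0] = (struct tlb_entry){true, /*asid=*/1, /*vpn=*/0x10, /*pfn=*/0x99};
        uint64_t pfn;
        printf("VM1 hit: %d\n", tlb_lookup(1, 0x10, &pfn));  /* 1 */
        printf("VM2 hit: %d\n", tlb_lookup(2, 0x10, &pfn));  /* 0: other VM's entry is invisible */
        return 0;
    }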

  12. Replay Papers
  • VM-based replay
    • Execution Replay for Multiprocessor Virtual Machines, Dunlap et al.
  • HW-based replay
    • Rerun: Exploiting Episodes for Lightweight Memory Race Recording, Hower and Hill
    • ODR: Output-Deterministic Replay for Multicore Debugging, Altekar and Stoica
  • Slides adapted from the presentation slides by the paper authors

  13. Big ideas
  • Detection and replay of memory races is possible on commodity hardware
  • Overhead is high for some workloads…
  • …but surprisingly low for other workloads

  14. Execution Replay (diagram: a CPU and its memory, with sources of non-determinism: interrupts, network, keyboard/mouse, disk)

  15. Deterministic Replay
  • Deterministic replay: faithfully replay an execution such that all instructions appear to complete in the same order and produce the same result
  • Valuable for:
    • Debugging [LeBlanc et al., COMP ’87], e.g., time-travel debugging, rare bug replication
    • Fault tolerance [Bressoud et al., SIGOPS ’95], e.g., hot-backup virtual machines
    • Security [Dunlap et al., OSDI ’02], e.g., attack analysis
    • Tracing [Xu et al., WDDD ’07], e.g., unobtrusive replay tracing

  16. Single-processor Replay
  • Basic principles are well understood
    • Log all non-deterministic inputs
    • Log the timing of asynchronous events (sketch below)
  • Minimal overhead (Dunlap02)
    • 13% worst case
    • Can log for months or years
  • Available commercially
    • VMware Record/Replay
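A minimal sketch of the logging idea, with an assumed log-record layout: each asynchronous event is recorded together with the exact instruction count at which it was delivered, and replay re-injects it when execution reaches that same point.

    #include <stdint.h>
    #include <stdio.h>

    struct replay_event {
        uint64_t instr_count;  /* instruction count at delivery (the "when") */
        int      vector;       /* which interrupt/input arrived (the "what") */
        uint64_t payload;      /* e.g. the value read from a device register */
    };

    #define LOG_MAX 1024
    static struct replay_event log_buf[LOG_MAX];
    static int log_len, replay_pos;

    static void record_event(uint64_t icount, int vector, uint64_t payload)
    {
        if (log_len < LOG_MAX)
            log_buf[log_len++] = (struct replay_event){icount, vector, payload};
    }

    /* During replay: deliver the next logged event once execution reaches its spot. */
    static void maybe_replay_event(uint64_t icount)
    {
        if (replay_pos < log_len && log_buf[replay_pos].instr_count == icount) {
            struct replay_event *e = &log_buf[replay_pos++];
            printf("replaying vector %d (payload %llu) at icount %llu\n",
                   e->vector, (unsigned long long)e->payload,
                   (unsigned long long)e->instr_count);
        }
    }

    int main(void)
    {
        record_event(1000, 14, 42);          /* original run: input at icount 1000 */
        for (uint64_t ic = 990; ic <= 1010; ic++)
            maybe_replay_event(ic);          /* replay re-injects it at the same point */
        return 0;
    }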

  17. The Multiprocessor Challenge
  • Interleaved reads and writes
    • Fine-grained non-determinism
    • Much more difficult
  • Existing solutions
    • Hardware modification
    • Software instrumentation
  • SMP-ReVirt
    • Uses the hardware MMU to detect sharing

  18. Multiprocessor Replay (diagram: P1 and P2 access a shared variable n in memory; with writes n=5 and n=3 and a test "if (n<4)", the outcome depends on the interleaving)

  19. Ordering Memory Accesses
  • Preserving the order of accesses will reproduce the execution
    • a→b: “a happens-before b”
    • Ordering is transitive: a→b and b→c imply a→c
  • Two instructions must be ordered (see the sketch below) if:
    • they both access the same memory, and
    • one of them is a write
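The ordering rule above can be written as a one-line predicate; the struct and field names below are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct access { int cpu; uint64_t addr; bool is_write; };

    /* Two accesses must be ordered iff they touch the same memory and one is a write. */
    static bool must_order(const struct access *a, const struct access *b)
    {
        return a->addr == b->addr && (a->is_write || b->is_write);
    }

    int main(void)
    {
        struct access a = {0, 0x1000, true};   /* P1 writes n             */
        struct access b = {1, 0x1000, false};  /* P2 reads n              */
        struct access c = {1, 0x2000, false};  /* P2 reads something else */
        printf("a,b ordered? %d\n", must_order(&a, &b));  /* 1: write/read, same address */
        printf("a,c ordered? %d\n", must_order(&a, &c));  /* 0: different addresses      */
        return 0;
    }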

  20. Constraints: Enforcing order
  • P1 executes a then b; P2 executes c then d
  • To guarantee a→d, enforcing any one of a→d, b→d, a→c, or b→c suffices (the last three are stricter than needed, i.e., overconstrained)
  • Suppose we also need b→c
    • b→c is necessary
    • a→d then becomes redundant: it follows from a→b, b→c, and c→d by transitivity

  21. CREW Protocol
  • Each shared object is in one of two states:
    • Concurrent-Read: all processors can read, none can write
    • Exclusive-Write: one processor (the owner) can read and write; others have no access
  • Enforced with the hardware MMU page permissions: read/write, read-only, or none (sketch below)
  • CREW states change on demand
    • Fault, fix up, re-execute
  • CREW event
    • An increase or reduction of permissions due to a CREW state change
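A toy sketch of the CREW state machine and the per-CPU MMU permissions it implies; the two-CPU layout and handler names are assumptions for illustration, not SMP-ReVirt's actual code.

    #include <stdio.h>

    #define NCPU 2
    enum perm { PERM_NONE, PERM_RO, PERM_RW };

    struct crew_page {
        int owner;                 /* meaningful only in the Exclusive-Write state  */
        enum perm perm[NCPU];      /* per-CPU permission the MMU actually enforces  */
    };

    /* CPU `cpu` faulted wanting to write: move the page to Exclusive-Write. */
    static void crew_write_fault(struct crew_page *pg, int cpu)
    {
        for (int i = 0; i < NCPU; i++)
            pg->perm[i] = PERM_NONE;        /* privilege reduction on the others */
        pg->perm[cpu] = PERM_RW;            /* privilege increase on the faulter */
        pg->owner = cpu;
    }

    /* CPU `cpu` faulted wanting to read: move the page (back) to Concurrent-Read. */
    static void crew_read_fault(struct crew_page *pg, int cpu)
    {
        if (pg->owner >= 0)
            pg->perm[pg->owner] = PERM_RO;  /* reduce the old owner to read-only  */
        pg->perm[cpu] = PERM_RO;            /* increase the reader to read-only   */
        pg->owner = -1;
    }

    int main(void)
    {
        struct crew_page pg = { .owner = -1, .perm = { PERM_RO, PERM_RO } };
        crew_write_fault(&pg, 1);   /* P2 writes: P1 loses access, P2 gets RW */
        crew_read_fault(&pg, 0);    /* P1 reads back: both end up read-only   */
        printf("P1=%d P2=%d\n", pg.perm[0], pg.perm[1]);  /* prints 1 1 (both PERM_RO) */
        return 0;
    }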

  22. CREW Property
  • If two instructions on different processors:
    • access the same page, and
    • one of them is a write,
  • then there will be a CREW event on each processor between them.

  23. Generating Constraints
  • Initial state: Concurrent-Read (all processors read-only)
  • d*: a write by P2 causes a CREW fault
    • New state: P2 Exclusive-Write
    • r: privilege reduction on P1 (Read to None)
    • i: privilege increase on P2 (Read to Read/Write)
  • Log the timing of r and i
  • Resulting constraint: r → i

  24. Predicting Results
  • Key change: sharing attributes
    • 4096-byte sharing granularity
    • A “miss” (CREW fault) is very expensive
  • SPLASH-2
    • Good: high spatial locality / low false sharing
    • Bad: random access patterns / high false sharing
  • The Linux kernel
    • Tuned for 16-byte cache lines
    • Involving the kernel may be expensive

  25. Single-processor Xen guests

  26. 2-processor Xen guests

  27. 2-processor, cont’d

  28. 4-processor Xen guests

  29. HW Memory Race Recording
  • SW-only approaches
    • Too slow to be turned on all the time
    • SW instrumentation alters the execution path
  • What we want
    • Small log: record longer for the same state
    • Small hardware: reduce cost, especially when not in use
    • Unobtrusive: should not alter execution
  • Rerun: Exploiting Episodes for Lightweight Memory Race Recording

  30. Episodic Recording
  • Most code executes without races
    • Use race-free regions as the unit of ordering
  • Episodes: independent execution regions (sketch below)
    • Defined per thread
    • Identified passively → does not affect execution
    • Encompass every instruction
  • (diagram: interleaved load/store streams of threads T0–T2 divided into episodes)
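A small software sketch in the spirit of episodic recording: each thread accumulates the addresses it has read and written in its current episode, and a conflicting remote access closes the episode. Real Rerun keeps these sets in hardware Bloom filters; the plain arrays and all names below are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SET_MAX 64

    struct episode {
        uint64_t reads[SET_MAX], writes[SET_MAX];
        int nreads, nwrites, refs;      /* refs = accesses in the current episode */
    };

    static bool in_set(const uint64_t *set, int n, uint64_t addr)
    {
        for (int i = 0; i < n; i++) if (set[i] == addr) return true;
        return false;
    }

    /* Would a remote access conflict with this thread's current episode? */
    static bool conflicts(const struct episode *e, uint64_t addr, bool remote_is_write)
    {
        if (in_set(e->writes, e->nwrites, addr)) return true;         /* W vs anything */
        return remote_is_write && in_set(e->reads, e->nreads, addr);  /* R vs remote W */
    }

    /* Close the episode: log its size and start a fresh one. */
    static void end_episode(struct episode *e, int tid)
    {
        printf("T%d: episode of %d references logged\n", tid, e->refs);
        e->nreads = e->nwrites = e->refs = 0;
    }

    /* Local access: just grow the episode's read or write set. */
    static void local_access(struct episode *e, uint64_t addr, bool is_write)
    {
        if (is_write && e->nwrites < SET_MAX) e->writes[e->nwrites++] = addr;
        if (!is_write && e->nreads < SET_MAX) e->reads[e->nreads++] = addr;
        e->refs++;
    }

    int main(void)
    {
        struct episode t0 = {0};
        local_access(&t0, 0x10, false);
        local_access(&t0, 0x20, true);
        /* Another thread now writes 0x10, which T0 has read: T0's episode must end first. */
        if (conflicts(&t0, 0x10, true)) end_episode(&t0, 0);
        return 0;
    }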

  31. Capturing Causality
  • Via scalar Lamport clocks [Lamport ’78]
    • Assigns timestamps to events (update rule sketched below)
    • Timestamp order implies causality
  • Replay in timestamp order
    • Episodes with the same timestamp can be replayed in parallel
  • (diagram: per-thread episode timestamps advancing across threads T0–T2)
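A minimal sketch of the scalar Lamport-clock update, with assumed names: when an episode ends, its timestamp is made greater than both the thread's previous timestamp and the timestamp it observed on the conflicting data, so timestamp order respects happens-before.

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

    /* Close an episode: pick its timestamp and advance the thread clock. */
    static uint64_t end_episode_ts(uint64_t *thread_clock, uint64_t observed_ts)
    {
        uint64_t ts = max_u64(*thread_clock, observed_ts) + 1;
        *thread_clock = ts;
        return ts;   /* logged with the episode; replay runs episodes in ts order */
    }

    int main(void)
    {
        uint64_t t0_clock = 43, t1_clock = 22;
        /* T1's episode conflicts with data last touched by T0 at timestamp 43. */
        uint64_t ts = end_episode_ts(&t1_clock, t0_clock);
        printf("T1 episode timestamp = %llu\n", (unsigned long long)ts);  /* prints 44 */
        return 0;
    }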

  32. Episode Benefits
  • Multiple races can be captured by a single episode
    • Reduces the amount of information to be logged
  • Episodes are created passively
    • No speculation, no rollback
  • Episodes can end early
    • Eases implementation
  • Episode information is thread-local
    • Promotes scalability, avoids synchronization overheads

  33. Rerun Hardware
  • Rerun requirements:
    • Detect races → track read/write sets (Bloom-filter sketch below)
    • Mark episode boundaries
    • Maintain logical time
  • Added to the base system (16 cores with private L1s, a banked shared L2 with a directory coherence controller, DRAM, interconnect):
    • Per-core state: Write Filter (WF), Read Filter (RF), Timestamp (TS), and a References count (REFS); total state of 166 bytes per core
    • Per-L2-bank state: a Memory Timestamp (MTS)
  • (diagram: block diagram of the core pipeline and the L2/memory system with the Rerun additions and their sizes)
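A toy software analogue of the read/write-set filters: a Bloom filter may report false positives (which only end an episode early and is therefore safe) but never false negatives. The filter size and hash functions below are arbitrary assumptions, not the paper's configuration.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define FILTER_BITS 256   /* illustrative filter size */

    struct bloom { uint8_t bits[FILTER_BITS / 8]; };

    /* Two cheap, arbitrary hash functions over the address. */
    static void bloom_hash(uint64_t addr, unsigned *h1, unsigned *h2)
    {
        *h1 = (unsigned)((addr * 0x9E3779B97F4A7C15ull) >> 56) % FILTER_BITS;
        *h2 = (unsigned)((addr ^ (addr >> 13)) * 2654435761u) % FILTER_BITS;
    }

    static void bloom_add(struct bloom *b, uint64_t addr)
    {
        unsigned h1, h2;
        bloom_hash(addr, &h1, &h2);
        b->bits[h1 / 8] |= 1u << (h1 % 8);
        b->bits[h2 / 8] |= 1u << (h2 % 8);
    }

    static bool bloom_maybe_contains(const struct bloom *b, uint64_t addr)
    {
        unsigned h1, h2;
        bloom_hash(addr, &h1, &h2);
        return (b->bits[h1 / 8] >> (h1 % 8) & 1) && (b->bits[h2 / 8] >> (h2 % 8) & 1);
    }

    int main(void)
    {
        struct bloom write_set;
        memset(&write_set, 0, sizeof write_set);
        bloom_add(&write_set, 0x1000);   /* core wrote 0x1000 in this episode */
        printf("0x1000 in write set? %d\n", bloom_maybe_contains(&write_set, 0x1000)); /* 1 */
        printf("0x2000 in write set? %d\n", bloom_maybe_contains(&write_set, 0x2000)); /* expected 0 */
        return 0;
    }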

  34. HW Replay Summary
  • Requires some modification to existing HW
    • Will CPU manufacturers add the support any time soon? → not likely
  • Other low-overhead approaches use SW-based replay
    • ODR: Output-Deterministic Replay for Multicore Debugging, Altekar and Stoica, SOSP ’09
