Operating Systems Engineering Virtual Machines

Operating Systems Engineering: Virtual Machines. By Dan Tsafrir, 25/5/2011. What’s a virtual machine? A VM is a simulation of a full computer, with its disk & NIC & OS & user-level apps, …, running as an application on some “host” computer. The simulation is called a “guest”.

Presentation Transcript


  1. Operating Systems Engineering: Virtual Machines • By Dan Tsafrir, 25/5/2011

  2. What’s a virtual machine? • A VM is a simulation of a full computer • With its disk & NIC & OS & user-level apps, … • Running as an application • On some “host” computer • Simulation is called a “guest”

  3. VMs – requirements • Simulation needs to be accurate • Emulate HW faithfully, handle weird quirks of kernels & such • Reproduce bugs exactly • Simulation needs to be isolated • Guest must not break out of VM • SW inside guest might be faulty and/or malicious • Simulation needs to be fast • Well, as fast as possible… • Simulation needs to be believable • Guest shouldn’t be able to distinguish VM from real computer • The “blue pill” saga [ http://en.wikipedia.org/wiki/Blue_Pill_(malware) ] • In reality, if guests can accurately time stuff, they can tell they are virtualized • (And indeed, viruses often refuse to work when virtualized)

  4. VMs – origin • Late 1960s • IBM used VMs to share mainframes • Late 1990s • VMware re-popularized VMs (for x86 HW) • Economic boom: nowadays a multi-billion-dollar business • Everyone is playing • SW: Microsoft, IBM, Red Hat, Oracle, … • HW: Intel, AMD, ARM, IBM, Oracle, …

  5. VMs – why? • For developers & power users • One computer w/ multiple OSes • My Win 7 laptop also runs Ubuntu • My MacBook Pro @ home also runs XP (for office) • Kernel development • Like QEMU, but performs reasonably

  6. VMs – why? • Business case: saves money! • Server consolidation • Once we had one underutilized machine per service… • Reduces cost of HW, power consumption, cooling • Portability (why should Intel/AMD/IBM care about consolidation?) • Decouples OS from HW and makes upgrades easy • Increased robustness • Can back up entire machine + easily restore if HW breaks • No need to reinstall all SW • Can isolate important apps in their own VM (safety) • Makes cloud models possible • Such as Amazon’s EC2 (“Elastic Compute Cloud”) • Certain costly sys-admin chores made much easier • Provisioning a new machine (just clone ready image)

  7. What’s in a name • SW that runs the show (3 names referring to same thing): • VMM • Virtual machine monitor • Hypervisor • (Of IBM origin) • Sometimes denoted “HV” • ~Host • VMMs • Citrix Xen, KVM, VMware ESXi, MS Hyper-V, IBM pHyp, … • 2 possible settings • Next 2 slides…

  8. Hosted VMM (“type 2 hypervisor”) • Like VMware Workstation, Parallels, VirtualBox, QEMU, … • Typically personal use

  9. Bare metal / native VMM (“type 1 hypervisor”) • XenServer, VMware ESXi, MS Hyper-V, IBM pHyp, … • Typically for servers, data centers, clouds

  10. VMM multiplexes HW • Just like an OS… • Divides memory among guests • Related: de-duplication, ballooning • Time-shares CPU among guests • Related: notion of VCPU vs. PCPU (can hot-plug) • Simulates per-guest virtual devices • Disk • Network • …

  11. Virtualization refinement • Paravirtualization • Guest OS is aware it is being virtualized • For performance purposes • Paravirtualized devices • HW support • Intel VT-x • AMD-V

  12. How to virtualize x86… Assuming no HW Support

  13. VMs – how? • SW interpretation, instruction by instruction • Can do it, but much, much too slow • Idea 1: when possible, execute VM’s instructions on real CPU • Works fine for most instructions (e.g., add %eax, %ebx) • But what about isolation? (e.g., VM writes outside its memory) • Idea 2: run VMs at CPL=3 • Ordinary instructions work fine • Writing to %cr3 traps to VMM • VMM examines guest’s page table • VMM can manipulate page table if it wants • Only then set %cr3 and resume VM • This virtualization model is called: “trap & emulate”
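The trap & emulate loop on slide 13 can be illustrated with a toy model. This is only a sketch under stated assumptions: the "guest" is a list of symbolic instructions rather than real x86 code, and the names ToyVMM, GuestTrap and shadow_root_for are hypothetical, invented here for illustration.

```python
class GuestTrap(Exception):
    """Raised when the toy guest executes a privileged instruction at CPL=3."""
    def __init__(self, op, operand):
        super().__init__(op)
        self.op = op
        self.operand = operand


class ToyVMM:
    """Hypothetical VMM: runs ordinary instructions directly, emulates the rest."""

    def __init__(self):
        self.regs = {"eax": 0, "ebx": 0}
        self.virtual_cr3 = None  # what the guest believes cr3 holds
        self.real_cr3 = None     # what the VMM actually loads into cr3
        self.traps = 0

    def execute_at_cpl3(self, op, *args):
        if op == "add":
            # Ordinary instruction: stands in for direct execution on the CPU.
            src, dst = args
            self.regs[dst] = (self.regs[dst] + self.regs[src]) & 0xFFFFFFFF
        elif op == "mov_to_cr3":
            # Privileged instruction: the CPU would trap to the VMM.
            raise GuestTrap(op, args[0])
        else:
            raise ValueError(f"unknown toy instruction: {op}")

    def shadow_root_for(self, guest_root):
        # Placeholder for "examine/manipulate the guest's page table".
        # The fixed offset is an assumption of this sketch only.
        return guest_root + 0x1000000

    def emulate(self, trap):
        self.traps += 1
        if trap.op == "mov_to_cr3":
            self.virtual_cr3 = trap.operand
            self.real_cr3 = self.shadow_root_for(trap.operand)

    def run(self, program):
        for op, *args in program:
            try:
                self.execute_at_cpl3(op, *args)
            except GuestTrap as trap:
                self.emulate(trap)  # then resume the guest at the next instruction
```

Only the write to cr3 costs a trap; the add runs without VMM involvement, which is the whole point of Idea 2.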

  14. VMM hides real machine • Virtual vs. real resources • Virtual vs. real cr3 • Virtual cr3: the VM (thinks it) sets the real cr3 • Real cr3: exclusively managed (= virtualized) by VMM • Virtual vs. real machine-defined data structures • Virtual page table: VM thinks it’s real • Real page table: real page tables virtualized by VMM • VMM’s job • Make guest see only virtual machine state • Completely hide & protect real machine state • Problems • Trap-&-emulate is tricky on x86 • Not all privileged instructions trap at CPL=3 • All those traps can be slow…

  15. x86 state we must virtualize

  16. Terminology • Letters • H = host • G = guest • P = physical • V = virtual • A = address • Combinations • GVA = guest virtual address • GPA = guest physical address • HPA = host physical address • …

  17. Providing guest with illusion of physical memory (simplistic) • Guest view • Wants to start at PA=0 • Wants to use all “installed” DRAM • Host opposing view • Must support several guests, they can’t all start at 0 • Must protect one VM’s memory from the others • Idea • Fake a smaller DRAM size than real DRAM • Ensure paging is enabled • Rewrite guest’s PTEs

  18. Providing guest with illusion of physical memory (simplistic) • Example • VMM allocates a guest physical memory from 0x1000000 to 0x2000000 • VMM gets trap if guest changes cr3 (guest @ CPL=3) • VMM copies guest's page table to "shadow" page table • While copying, VMM adds 0x1000000 to each PA in shadow table • VMM checks that each resulting HPA is < 0x2000000 • Must copy the guest's page table • So guest doesn't see VMM's modifications to PAs
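The copy-and-offset step from slide 18 might look roughly like the sketch below. It assumes, for illustration only, that a page table is a flat dictionary from virtual page number to physical address; build_shadow and the two constants are hypothetical names.

```python
GUEST_BASE = 0x1000000   # start of the host memory given to this guest
GUEST_LIMIT = 0x2000000  # first host address past the guest's memory


def build_shadow(guest_page_table):
    """Copy a flat guest page table into a shadow table of host addresses.

    guest_page_table maps virtual page number -> guest physical address.
    The guest's own table is never modified, so the guest cannot see
    the VMM's changes.
    """
    shadow = {}
    for vpn, gpa in guest_page_table.items():
        hpa = gpa + GUEST_BASE
        if not (GUEST_BASE <= hpa < GUEST_LIMIT):
            raise ValueError(f"guest PA {gpa:#x} is outside the guest's memory")
        shadow[vpn] = hpa
    return shadow
```

The VMM would point the real cr3 at the shadow copy, while the guest keeps reading its own unmodified table.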

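  19. Address translation (reminder) • [Figure: 4-level page-table walk] • 48-bit VA = four 9-bit indices (p0, p1, p2, p3) + 12-bit offset • CR3 points to the root table; each index selects one of 512 entries (0..511) in the table at its level • The last level yields the PA • 4KB page-table page => 512 PTEs (8B each)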
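As a small worked example of the 9/9/9/9/12 layout on slide 19, the following sketch splits a 48-bit virtual address into its four table indices and page offset. The function name split_va is hypothetical.

```python
def split_va(va):
    """Split a 48-bit virtual address into ([p0, p1, p2, p3], offset).

    p0 is taken from the most significant 9 bits (it indexes the root table),
    p3 from the least significant 9 index bits; the offset is the low 12 bits.
    """
    offset = va & 0xFFF
    indices = [(va >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, offset
```

Each index is in the range 0..511, matching the 512 eight-byte PTEs that fit in one 4KB page-table page.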

  20. Providing guest with illusion of physical memory (realistic) • Host allocates N pages to guest • No need for them to be contiguous in phys mem • Host maintains a GPA_to_HPA mapping (say, using a hash) • GPAs are contiguous • What happens when guest changes cr3 • Assume guest assigns GPA1 to cr3 • A trap will occur and host will gain control • Host’s goal: • Generate, on the fly, the shadow page table hierarchy • From GVA to HPA • There’s only one such shadow hierarchy at any given time per core

  21. Providing guest with illusion of physical memory (realistic) • The host’s actions • Saves GPA1 internally • Allocates brand new zeroed page = root of the shadow hierarchy • Let base of new page be HPA1 • Assigns HPA1 to cr3 • Resumes guest, which immediately faults on GVA2 • GVA2 = virtual address of 1st fetched instruction of guest • Takes 9 most significant bits from GVA2 • Assume 48bit VA = 4 levels hierarchy (9bits each) + 4KB page • 8 bytes per PTE • Computes GPA_to_HPA(GPA1) + (those 9 bits) * 8 • = HPA of the guest PTE that points to the 2nd level of the guest’s hierarchy • …

  22. Providing guest with illusion of physical memory (realistic) • The host’s actions (cont.) • … • Continue like so with next 9bits, repeatedly, • Until reaching the HPA of the requested page = HPA2 • Now, there needs to be a GVA2=>HPA2 mapping in the shadow hierarchy • Adds the translation GVA2=>HPA2 to shadow hierarchy • Starting at HPA1 and allocating the rest of the levels in the hierarchy as needed • Resumes guest • Repeats same procedure when next fault occurs • This continues until all address space is mapped • Or until next context switch (=> need to start over)
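The fault-driven procedure of slides 20 to 22 can be sketched as follows. This is a simplified model, not a real VMM: host physical memory is a dictionary of 8-byte words, a PTE is just a page-aligned address with a present bit in bit 0, and the names Host, guest_walk and handle_shadow_fault are hypothetical.

```python
PAGE = 4096
PRESENT = 1
SHIFTS = (39, 30, 21, 12)  # bit positions of the four 9-bit indices


class Host:
    def __init__(self):
        self.mem = {}            # host physical memory: HPA of a PTE -> 64-bit value
        self.gpa_to_hpa = {}     # guest frame base -> host frame base (the "hash")
        self.next_free = 0x100000

    def alloc_page(self):
        """Return the HPA of a fresh page (zeroed: no entries stored for it)."""
        hpa = self.next_free
        self.next_free += PAGE
        return hpa

    def translate_gpa(self, gpa):
        return self.gpa_to_hpa[gpa & ~(PAGE - 1)] + (gpa & (PAGE - 1))

    def guest_walk(self, guest_cr3, gva):
        """Walk the guest's own tables in software; return the GPA of the page."""
        table_gpa = guest_cr3
        for shift in SHIFTS:
            index = (gva >> shift) & 0x1FF
            entry = self.mem.get(self.translate_gpa(table_gpa) + index * 8, 0)
            if not entry & PRESENT:
                # A real VMM would inject a page fault into the guest here.
                raise LookupError("not mapped by the guest")
            table_gpa = entry & ~(PAGE - 1)
        return table_gpa

    def handle_shadow_fault(self, guest_cr3, shadow_root, gva):
        """Add the GVA => HPA translation to the shadow hierarchy."""
        page_hpa = self.translate_gpa(self.guest_walk(guest_cr3, gva))
        table_hpa = shadow_root
        for shift in SHIFTS[:-1]:
            slot = table_hpa + ((gva >> shift) & 0x1FF) * 8
            if not self.mem.get(slot, 0) & PRESENT:
                self.mem[slot] = self.alloc_page() | PRESENT  # allocate level on demand
            table_hpa = self.mem[slot] & ~(PAGE - 1)
        self.mem[table_hpa + ((gva >> 12) & 0x1FF) * 8] = page_hpa | PRESENT
        return page_hpa
```

Here guest_cr3 plays the role of GPA1 and shadow_root the role of HPA1. On a guest context switch the VMM would start again from a fresh shadow_root, as the last bullet of slide 22 notes.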

  23. Providing guest with illusion of physical memory (realistic) • Building shadow page tables is costly • Can we cache? • Yes, but need to write-protect all pages involved • Will generate trap whenever pages are modified • Host would be able to respond accordingly • The problem • How do we know when to stop write-protecting? • Solution • Must employ some heuristic • Need not be perfect as long as it maintains correctness

  24. Not all sensitive reads/writes trap at CPL=3 • push %cs • Will show CPL=3 (not 0) if guest reads pushed value • sgdt (save gdtr) • Reveals real gdtr if guest reads it • pushf • Pushes real IF • Always on in guest mode (why?) • Host injects interrupts to guest as needed • popf • Ignores IF in CPL=3 • => no trap => host won’t know if guest wants interrupts • iret • Invoked, e.g., after handling a system call • No ring change => SS/ESP will not be restored
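The popf problem on slide 24 can be shown with a tiny model of how the interrupt flag is treated. This sketch deliberately ignores everything else popf does (other flags, the IOPL field itself); the function signature is hypothetical, while the IF bit position (bit 9 of EFLAGS) is the architectural one.

```python
IF_FLAG = 1 << 9  # interrupt-enable flag in EFLAGS


def popf(eflags, popped, cpl, iopl=0):
    """Return the new EFLAGS after a toy popf.

    If CPL <= IOPL the popped IF is honored. Otherwise the popped IF is
    silently dropped and the old IF is kept: no trap, so a VMM running
    the guest kernel at CPL=3 never learns what the guest asked for.
    """
    if cpl <= iopl:
        return popped
    return (popped & ~IF_FLAG) | (eflags & IF_FLAG)
```

For example, a guest kernel that pops a value with IF cleared (to disable interrupts) at CPL=3 ends up with IF still set, and nothing notifies the host.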

  25. How can we cope? • Solution: binary translation • Rewrite guest code • Change every problematic instruction to INT 3 • Keep track of original instructions + emulate in VMM • Note: INT 3 is 1-byte long => small enough to overwrite any instruction • Must be done dynamically at runtime • Need to know whether bytes are code or data • Need to know where instructions start (x86 is CISC) • Consequently, scan code only as executed

  26. Binary translation – example • Write INT3 in place of • Bad instructions (popf) • First jump (jnz) • Then start guest kernel • INT3 traps to host • Emulates popf • Look where jump could go • For each jump • Translate upon the 1st encounter of block • Keep track of translated code • Next time, replace INT3 with original instruction if target is known (when jump is direct) • Assume guest kernel starts like so: pushl %ebp … popf … jnz x … j?? y • x: … j?? z
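The patching step of slides 25 and 26 can be sketched on a byte buffer. The opcode bytes used are real x86 encodings (0xCC = INT3, 0x9D = popf, 0x55 = push %ebp, 0x90 = nop, 0x75 = jnz rel8), but the "decoder" is a toy table covering only those opcodes, and the Translator class is a hypothetical name.

```python
INT3 = 0xCC
LENGTH = {0x55: 1, 0x90: 1, 0x9D: 1, 0x75: 2}  # toy decoder: opcode -> length
SENSITIVE = {0x9D}                             # popf
JUMPS = {0x75}                                 # jnz rel8


class Translator:
    def __init__(self, code):
        self.code = bytearray(code)  # guest code, patched in place
        self.saved = {}              # address -> original first byte

    def patch(self, addr):
        """Overwrite the instruction's first byte with INT3, remembering it."""
        self.saved[addr] = self.code[addr]
        self.code[addr] = INT3

    def translate_block(self, start):
        """Scan from start up to and including the first jump.

        Returns the address of that jump. Code beyond it is left alone,
        because it is scanned only once execution actually reaches it.
        """
        addr = start
        while True:
            op = self.saved.get(addr, self.code[addr])  # see through old patches
            if op in JUMPS:
                if addr not in self.saved:
                    self.patch(addr)
                return addr
            if op in SENSITIVE and addr not in self.saved:
                self.patch(addr)
            addr += LENGTH[op]
```

When an INT3 traps, the host would look up saved[addr] to decide whether to emulate popf or to work out the jump's targets and translate the blocks they lead to.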

  27. BT: indirect jumps & ret • Same, but • Can’t replace INT3 with original jump • Since we’re not sure address will be the same next time • ret = indirect jump via pointer on the stack • Must take trap every time (slow!) • Can we speed up? • Yes, by writing our own code rather than hacking the original => more aggressive translation, addresses change • See VMware’s “A Comparison of Software and Hardware Techniques for x86 Virtualization”, by Adams & Agesen, in ASPLOS 2006 http://www.vmware.com/pdf/asplos235_adams.pdf • Read it to make sure you know how!

  28. Intel/AMD HW support for VMs • Much easier to implement VMM w/ reasonable performance • HW itself directly maintains per-guest virtual state • CS (w/ CPL), EFLAGS, idtr, etc. • In-memory HW struct can be loaded/unloaded like a context switch • HW knows it’s in guest mode • Instructions directly modify virtual state • Avoids lots of traps to VMM • HW basically adds a new privilege level • VMM mode, CPL=0, ..., CPL=3 • Guest-mode/CPL=0 isn’t fully privileged • No traps to VMM on system calls • HW handles CPL transition • No need for shadow page tables • Next slide…

  29. Nested paging • In guest mode, there are *2* page tables in effect • Guest page table & host page table • Guest memory refs go through multiple lookups • Guest tables hold GVA=>GPA translations • HW knows this, so in every level of the hierarchy • HW automatically translates GPA to HPA • Continues the table walk process • HW table walk can take ~20 memory refs • => There’s a new “page table cache” (in addition to the TLB), which caches partial translations of the GVA in an attempt to skip levels (shown to be very effective) • Thus, guest can directly modify its page table w/o VMM having to shadow it • No need for VMM to write-protect guest page tables • No need for VMM to track cr3 changes
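The "~20 memory refs" figure on slide 29 can be sanity-checked with a simple count. The sketch below assumes 4 guest levels, 4 host levels, and no TLB or page-table cache hits at all; the function name is hypothetical.

```python
def nested_walk_refs(guest_levels=4, host_levels=4):
    """Count memory references for one uncached two-dimensional table walk."""
    refs = 0
    for _ in range(guest_levels):
        refs += host_levels  # translate the guest table's GPA to an HPA
        refs += 1            # read the guest PTE itself
    refs += host_levels      # translate the final GPA of the data page
    return refs
```

Under these assumptions the count is 4 * (4 + 1) + 4 = 24 references for a single translation, in line with the slide's rough figure, which is why caching partial walks matters so much.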

  30. Nested paging • Is nested paging faster than shadow paging? • Depends… (on what?)

  31. Devices • Trap INB and OUTB • DMA addresses are physical • VMM must trust devices or utilize HW support (IOMMU / IOTLB) • Devices nowadays are typically shared (=> virtualized) • If you want to share between multiple guests • Each guest gets a part of the disk • Each guest looks like a distinct Internet host • Each guest gets an X window • VMM might mimic some standard (or legacy) devices • Regardless of actual HW on host computer • Guest might run paravirtualized drivers • Typically aggregate messages before switching to VMM • For high-performance I/O => device assignment • Sharing through SR-IOV (new standard)
