Operating Systems Engineering Virtual Machines. By Dan Tsafrir, 25/5/2011. What’s a virtual machine?. A VM is a simulation of a full computer With its disk & NIC & OS & user-level apps, … Running as an application On some “ host ” computer Simulation is called a “ guest ”.
Operating Systems Engineering Virtual Machines By Dan Tsafrir, 25/5/2011
What’s a virtual machine? • A VM is a simulation of a full computer • With its disk & NIC & OS & user-level apps, … • Running as an application • On some “host” computer • Simulation is called a “guest”
VMs – requirements • Simulation needs to be accurate • Emulate HW faithfully, handle weird quirks of kernels & such • Reproduce bugs exactly • Simulation needs to be isolated • Guest must not break out of VM • SW inside guest might be faulty and/or malicious • Simulation needs to be fast • Well, as fast as possible… • Simulation needs to be believable • Guest shouldn’t be able to distinguish VM from real computer • The “blue pill” saga [ http://en.wikipedia.org/wiki/Blue_Pill_(malware) ] • In reality, if guests can accurately time stuff, they can know • (And indeed, viruses often refuse to work when virtualized)
VMs – origin • Late 1960s • IBM used VMs to share mainframes • Late 1990s • VMware re-popularized VMs (for x86 HW) • Economic boom: nowadays a multi-billion-dollar business • Everyone is playing • SW: Microsoft, IBM, Red Hat, Oracle, … • HW: Intel, AMD, ARM, IBM, Oracle, …
VMs – why? • For developers & power users • One computer w/ multiple OSes • My Win 7 laptop also runs Ubuntu • My MacBook Pro @ home also runs XP (for office) • Kernel development • Like QEMU, but performs reasonably
VMs – why? • Business case: saves money! • Server consolidation • Once we had underutilized machines per service… • Reduces cost of HW, power consumption, cooling • Portability (why should Intel/AMD/IBM care about consolidation?) • Decouples OS from HW and makes upgrades easy • Increased robustness • Can backup entire machine + easily restore if HW breaks • No need to reinstall all SW • Can isolate important apps in their own VM (safety) • Makes cloud models possible • Such as Amazon’s EC2 (“elastic cloud”) • Certain costly sys-admin chores made much easier • Provisioning a new machine (just clone ready image)
What’s in a name • SW that runs the show (3 names referring to same thing): • VMM • Virtual machine monitor • Hypervisor • (Of IBM origin) • Sometimes denoted “HV” • ~Host • VMMs • Citrix Xen, KVM, VMware ESXi, MS Hyper-V, IBM pHyp, … • 2 possible settings • Next 2 slides…
Hosted VMM (“type 2 hypervisor”) • Like VMware Workstation, Parallels, VirtualBox, QEMU, … • Typically personal use
Bare metal / native VMM (“type 1 hypervisor”) • XenServer, VMware ESXi, MS Hyper-V, IBM pHyp, … • Typically for servers, data centers, clouds
VMM multiplexes HW • Just like an OS… • Divides memory among guests • Related: de-duplication, ballooning • Time-shares CPU among guests • Related: notion of VCPU vs. PCPU (can hot-plug) • Simulates per-guest virtual devices • Disk • Network • …
Virtualization refinement • Paravirtualization • Guest OS is aware it is being virtualized • For performance purposes • Paravirtualized devices • HW support • Intel-VT • AMD-V
How to virtualize x86… Assuming no HW Support
VMs – how? • SW interpretation, instruction by instruction • Can do it, but much, much too slow • Idea 1: when possible, execute VM’s instructions on real CPU • Works fine for most instructions (e.g., add %eax, %ebx) • But what about isolation? (e.g., VM writes outside its memory) • Idea 2: run VMs at CPL=3 • Ordinary instructions work fine • Writing to %cr3 traps to VMM • VMM examines guest’s page table • VMM can manipulate page table if it wants • Only then set %cr3 and resume VM • This virtualization model is called: “trap & emulate”
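The trap-and-emulate flow on this slide can be sketched as a toy model. Everything named here (`ToyVMM`, the tuple-based instruction format, `shadow_for`) is hypothetical and exists only for illustration; a real VMM runs guest code on the CPU and reacts to hardware traps rather than interpreting a list.

```python
class ToyVMM:
    """Toy model of trap & emulate: the guest runs at CPL=3, so a write to
    %cr3 traps to the VMM instead of reaching the real register."""

    def __init__(self):
        self.virtual_cr3 = None           # value the guest believes it loaded
        self.real_cr3 = None              # value the VMM actually installs
        self.regs = {"eax": 0, "ebx": 0}  # guest general-purpose registers
        self.traps = 0                    # number of traps taken so far

    def shadow_for(self, guest_cr3):
        # Hypothetical stand-in for "examine (and possibly manipulate) the
        # guest's page table"; the shadow root is just a tagged value here.
        return ("shadow", guest_cr3)

    def emulate_cr3_write(self, guest_value):
        # Record the guest's view, then install the VMM's own root.
        self.virtual_cr3 = guest_value
        self.real_cr3 = self.shadow_for(guest_value)

    def run(self, program):
        for op, *args in program:
            if op == "movi":              # ordinary: load immediate
                value, dst = args
                self.regs[dst] = value
            elif op == "add":             # ordinary: runs without VMM help
                src, dst = args
                self.regs[dst] += self.regs[src]
            elif op == "mov_to_cr3":      # privileged: traps at CPL=3
                self.traps += 1
                self.emulate_cr3_write(self.regs[args[0]])
            else:
                raise ValueError(f"unknown instruction {op!r}")
```

In this sketch only the `mov_to_cr3` pseudo-instruction costs a trap; the guest's idea of cr3 and the real one diverge, which is the point of the model.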
VMM hides real machine • Virtual vs. real resources • Virtual vs. real cr3 • Virtual cr3: the VM (thinks it) sets the real cr3 • Real cr3: exclusively managed (= virtualized) by VMM • Virtual vs. real machine-defined data structures • Virtual page table: VM thinks it’s real • Real page table: real page tables virtualized by VMM • VMM’s job • Make guest see only virtual machine state • Completely hide & protect real machine state • Problems • Trap-&-emulate is tricky on x86 • Not all privileged instructions trap at CPL=3 • All those traps can be slow…
Terminology • Letters • H = host • G = guest • P = physical • V = virtual • A = address • Combinations • GVA = guest virtual address • GP = guest physical • HP = host physical • …
Providing guest with illusion of physical memory (simplistic) • Guest view • Wants to start at PA=0 • Wants to use all “installed” DRAM • Host opposing view • Must support several guests, they can’t all start at 0 • Must protect one VM’s memory from the others • Idea • Fake a smaller DRAM size than real DRAM • Ensure paging is enabled • Rewrite guest’s PTEs
Providing guest with illusion of physical memory (simplistic) • Example • VMM allocates a guest phys mem 0x1000000 to 0x2000000 • VMM gets trap if guest changes cr3 (guest @ CPL=3) • VMM copies guest's page table to "shadow" page table • While copying, VMM adds 0x1000000 to each PA in shadow table • VMM checks that each resulting HPA is < 0x2000000 • Must copy the guest's page table • So guest doesn't see VMM's modifications to PAs
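The copy-and-relocate step in this example can be sketched as follows. The flat dictionary used as a page table and the names `GUEST_BASE`, `GUEST_LIMIT` and `build_shadow` are assumptions made for illustration; the constants are the ones from the slide.

```python
GUEST_BASE = 0x1000000   # start of the host memory given to this guest
GUEST_LIMIT = 0x2000000  # first host address past the guest's memory

def build_shadow(guest_page_table):
    """Copy a flat guest page table (GVA page -> guest PA) into a shadow
    table (GVA page -> host PA), relocating and bounds-checking each entry.
    The guest's own table is left untouched."""
    shadow = {}
    for gva, gpa in guest_page_table.items():
        hpa = gpa + GUEST_BASE
        if not (GUEST_BASE <= hpa < GUEST_LIMIT):
            raise ValueError(f"guest PA {gpa:#x} is outside its allocation")
        shadow[gva] = hpa
    return shadow
```

Because the shadow is a separate copy, the guest keeps seeing the PAs it wrote, while the hardware would only ever be pointed at the relocated ones.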
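Address translation (reminder) • [Figure: 4-level page-table walk. A 48bit VA is split into four 9bit indices (p0, p1, p2, p3) and a 12bit offset; CR3 points to the top-level table, each index selects one of the 512 entries (0..511) of the table at its level, and the last entry plus the offset gives the PA] • 4KB page-table page => 512 PTEs (8B each)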
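As a small worked example of the split described on this slide, the sketch below cuts a 48-bit VA into its four 9-bit indices and 12-bit offset. The function name `split_va` is made up for illustration.

```python
def split_va(va):
    """Split a 48-bit virtual address into four 9-bit page-table indices
    (top level first) and a 12-bit page offset."""
    offset = va & 0xFFF                                     # low 12 bits
    indices = [(va >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, offset
```

Each index is in the range 0..511, which matches the 512 eight-byte PTEs that fit in one 4KB page-table page.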
Providing guest with illusion of physical memory (realistic) • Host allocates N pages to guest • No need for them to be contiguous in phys mem • Host maintains a GPA_to_HPA mapping (say, using a hash) • GPAs are contiguous • What happens when guest changes cr3 • Assume guest assigns GPA1 to cr3 • A trap will occur and host will gain control • Host’s goal: • Generate, on the fly, the shadow page table hierarchy • From GVA to HPA • There’s only one such shadow hierarchy at any given time per core
Providing guest with illusion of physical memory (realistic) • The host’s actions • Saves GPA1 internally • Allocates brand new zeroed page = root of the shadow hierarchy • Let base of new page be HPA1 • Assigns HPA1 to cr3 • Resumes guest, which immediately faults on GVA2 • GVA2 = virtual address of 1st fetched instruction of guest • Takes 9 most significant bits from GVA2 • Assume 48bit VA = 4 levels hierarchy (9bits each) + 4KB page • 8 bytes per PTE • Computes GPA_to_HPA(GPA1) + (those 9 bits, as an index) * 8 • = HPA of the guest’s top-level PTE, which holds the GPA of the 2nd level of the guest’s hierarchy • …
Providing guest with illusion of physical memory (realistic) • The host’s actions (cont.) • … • Continue like so with next 9bits, repeatedly, • Until reaching the HPA of the requested page = HPA2 • Now, there needs to be a GVA2=>HPA2 mapping in the shadow hierarchy • Adds the translation GVA2=>HPA2 to shadow hierarchy • Starting at HPA1 and allocating the rest of the levels in the hierarchy as needed • Resumes guest • Repeats same procedure when next fault occurs • This continues until all address space is mapped • Or until next context switch (=> need to start over)
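The fault-handling procedure on the last two slides can be sketched as a software walk of the guest's tables followed by an insertion into the shadow hierarchy. This is a minimal sketch under loud assumptions: guest PTEs are modeled as plain GPAs with no flag bits, host memory is a dictionary from HPA to the 8-byte value stored there, the shadow hierarchy is a nested dictionary rather than real pages, and all names (`gpa_to_hpa`, `walk_guest`, `handle_shadow_fault`) are hypothetical.

```python
PAGE = 4096
SHIFTS = (39, 30, 21, 12)   # bit position of each 9-bit index, top level first

def gpa_to_hpa(gpa, frame_map):
    """Translate a guest-physical address using the host's per-guest map
    (guest frame number -> host frame number, e.g. a hash table)."""
    return frame_map[gpa // PAGE] * PAGE + gpa % PAGE

def walk_guest(gva, guest_cr3, frame_map, host_mem):
    """Walk the guest's 4-level page table in software and return the HPA
    of the page that backs `gva`."""
    table_gpa = guest_cr3                       # GPA1 saved at the cr3 trap
    for shift in SHIFTS:
        index = (gva >> shift) & 0x1FF
        pte_hpa = gpa_to_hpa(table_gpa, frame_map) + index * 8
        table_gpa = host_mem[pte_hpa]           # simplified PTE: a bare GPA
    return gpa_to_hpa(table_gpa, frame_map)     # HPA of the requested page

def handle_shadow_fault(gva, guest_cr3, frame_map, host_mem, shadow):
    """Add the GVA => HPA translation to the shadow hierarchy, creating
    intermediate levels as needed, and return the HPA."""
    page_hpa = walk_guest(gva, guest_cr3, frame_map, host_mem)
    indices = [(gva >> shift) & 0x1FF for shift in SHIFTS]
    node = shadow
    for index in indices[:-1]:
        node = node.setdefault(index, {})       # allocate missing levels
    node[indices[-1]] = page_hpa
    return page_hpa
```

A lookup that fails in `host_mem` or `frame_map` raises `KeyError` in this sketch; a real VMM would instead have to decide whether the fault belongs to the guest and inject it.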
Providing guest with illusion of physical memory (realistic) • Building shadow page tables is costly • Can we cache? • Yes, but need to write protect all pages involved • Will generate trap whenever pages are modified • Host would be able to respond accordingly • The problem • How do we know when to stop write-protecting? • Solution • Must employ some heuristic • Need not be perfect, as long as it maintains correctness
Not all sensitive CPL=3 reads/writes trap • Push CS • Will show CPL=3 (not 0) if guest reads pushed value • sgdt (save gdtr) • Reveals real gdtr if guest reads it • pushf • Pushes real IF • Always on in guest mode (why?) • Host injects interrupts to guest as needed • popf • Ignores IF in CPL=3 • => no trap => host won’t know if guest wants interrupts • iret • Invoked, e.g., after handling a system call • No ring change => SS/ESP will not be restored
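The popf problem on this slide can be illustrated with a deliberately simplified model of how the instruction treats the IF and IOPL bits in protected mode. The function `popf` and its arguments are hypothetical, and every other flag bit is simply copied from the popped value, which glosses over the full rules.

```python
IF_FLAG = 1 << 9        # interrupt-enable flag in EFLAGS
IOPL_MASK = 0b11 << 12  # I/O privilege level field in EFLAGS

def popf(eflags, popped, cpl):
    """Simplified popf: return the new EFLAGS after popping `popped`
    while running at privilege level `cpl`. No trap is ever raised."""
    iopl = (eflags & IOPL_MASK) >> 12
    new = popped
    if cpl > iopl:
        # Not privileged enough to change IF: the old value is silently kept.
        new = (new & ~IF_FLAG) | (eflags & IF_FLAG)
    if cpl != 0:
        # Only CPL=0 may change IOPL.
        new = (new & ~IOPL_MASK) | (eflags & IOPL_MASK)
    return new
```

With IOPL=0, a guest kernel running at CPL=3 that pops a value with IF cleared gets back EFLAGS with IF still set, and the VMM is never told that the guest wanted interrupts off.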
How can we cope? • Solution: binary translation • Rewrite guest code • Change every problematic instruction to INT 3 • Keep track of original instructions + emulate in VMM • Note: INT 3 is 1-byte long => small enough to overwrite any inst • Must be done dynamically at runtime • Need to know whether bytes are code or data • Need to know where instructions start (x86 is CISC) • Consequently, scan code only as executed
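The INT 3 patching idea can be sketched on a byte buffer. This is a minimal sketch, not how any particular VMM does it: it assumes the instruction start offsets are already known (finding them is exactly the hard part the slide mentions), it treats only the one-byte pushf and popf opcodes as problematic, and the names `patch_block` and `on_int3` are made up.

```python
INT3 = 0xCC    # one-byte breakpoint opcode
PUSHF = 0x9C   # one-byte pushf opcode
POPF = 0x9D    # one-byte popf opcode
SENSITIVE = {PUSHF, POPF}

def patch_block(code, starts, saved):
    """Overwrite the first byte of each sensitive instruction with INT3.
    `code` is a bytearray of guest code, `starts` lists offsets known to
    begin an instruction, `saved` records the original opcode per offset."""
    for off in starts:
        if code[off] in SENSITIVE:
            saved[off] = code[off]
            code[off] = INT3

def on_int3(off, saved):
    """Trap handler side: return the original opcode the VMM must emulate
    for the INT3 that fired at offset `off`."""
    return saved[off]
```

Bytes that are not listed in `starts` are never touched, so a data byte that happens to equal 0x9D is left alone.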
Binary translation – example • Assume guest kernel starts like so: pushl %ebp … popf … jnz x … j?? y • x: … j?? z • Write INT3 in place of • Bad instructions (popf) • First jump (jnz) • Then start guest kernel • INT3 traps to host • Emulates popf • Looks where jump could go • For each jump • Translate upon the 1st encounter of block • Keep track of translated code • Next time, replace INT3 with original instruction if target is known (when the jump is direct)
BT: indirect jumps & ret • Same, but • Can’t replace INT3 with original jump • Since we’re not sure address will be the same next time • ret is an indirect jump via a pointer on the stack • Must take trap every time (slow!) • Can we speed up? • Yes, by writing our own code rather than hacking the original => more aggressive translation, addresses change • See VMware’s “A Comparison of Software and Hardware Techniques for x86 Virtualization”, by Adams & Agesen, in ASPLOS 2006 http://www.vmware.com/pdf/asplos235_adams.pdf • Read it to make sure you know how!
Intel/AMD HW support for VMs • Much easier to implement VMM w/ reasonable performance • HW itself directly maintains per-guest virtual state • CS (w/ CPL), EFLAGS, idtr, etc. • In-memory HW struct can be loaded/unloaded like a context switch • HW knows it’s in guest mode • Instructions directly modify virtual state • Avoids lots of traps to VMM • HW basically adds a new privilege level • VMM mode, CPL=0, ..., CPL=3 • Guest-mode/CPL=0 isn’t fully privileged • No traps to VMM on system calls • HW handles CPL transition • No need for shadow page tables • Next slide…
Nested paging • In guest mode, there are *2* page tables in effect • Guest page table & host page table • Guest memory refs go through multiple lookups • Guest tables hold GVA=>GPA translations • HW knows this, so in every level of the hierarchy • HW automatically translates GPA to HPA • Continues the table walk process • HW table walk can take ~20 memory refs • => There’s a new “page table cache” (in addition to the TLB), which caches partial translations of the GVA in an attempt to skip levels (shown to be very effective) • Thus, guest can directly modify its page table w/o VMM having to shadow it • No need for VMM to write-protect guest page tables • No need for VMM to track cr3 changes
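The "~20 memory refs" figure can be reproduced with a short count. The sketch assumes 4-level guest and host tables and no hits in the TLB or in the page table cache; the function name `nested_walk_refs` is hypothetical.

```python
def nested_walk_refs(guest_levels=4, host_levels=4):
    """Worst-case memory references for one GVA -> HPA translation under
    nested paging, assuming every lookup misses all caches."""
    # At each guest level the HW first translates the guest table's GPA
    # through the host table (host_levels refs), then reads the guest PTE.
    per_guest_level = host_levels + 1
    # The GPA of the final data page needs one more host walk.
    final_translation = host_levels
    return guest_levels * per_guest_level + final_translation
```

Under these assumptions the count is 4 * 5 + 4 = 24 references, compared with 4 for a native 4-level walk, which is why caching partial walks matters so much.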
Nested paging • Is nested paging faster than shadow paging? • Depends… (on what?)
Devices • Trap INB and OUTB • DMA addresses are physical • VMM must trust devices or utilize HW support (IOMMU, with its IOTLB) • Devices nowadays are typically shared (=> virtualized) • If you want to share between multiple guests • Each guest gets a part of the disk • Each guest looks like a distinct Internet host • Each guest gets an X window • VMM might mimic some standard (or legacy) devices • Regardless of actual h/w on host computer • Guest might run paravirtualized drivers • Typically aggregate messages before switching to VMM • For high-performance I/O => device assignment • Sharing through SR-IOV (new standard)