This article discusses the concept of virtualization and explores its various types and advantages. It explains how virtual machines are created and how they run multiple operating systems and applications on the same hardware. The article also delves into the requirements of a hypervisor and the challenges of trap and emulation in classic virtualization.
Virtualization Operating Systems, 2016, Meni Adler, Danny Hendler & Amnon Meisels
What is virtualization? “The construction of an isomorphism between a guest system and a host” [Popek, Goldberg, ’74] Creating a virtual version of something • Hardware, operating system, application, network, memory, storage
Example: virtual disk • Partition a single hard disk to multiple virtual disks • Virtual disk has virtual tracks & sectors • Implement virtual disk by file • Map between virtual disk and real disk contents • Virtual disk write/read mapped to file write/read in host system
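The virtual-disk mapping above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular hypervisor stores disk images: the class name `VirtualDisk`, the 512-byte sector size, and the file name `vm1.img` are assumptions made for the example.

```python
import os
import tempfile

SECTOR_SIZE = 512  # bytes per virtual sector (assumed for this sketch)


class VirtualDisk:
    """A virtual disk whose sectors live in an ordinary host file."""

    def __init__(self, path, num_sectors):
        self.path = path
        self.num_sectors = num_sectors
        # Create the backing file, zero-filled to the full virtual disk size.
        with open(path, "wb") as f:
            f.truncate(num_sectors * SECTOR_SIZE)

    def _check(self, sector):
        if not 0 <= sector < self.num_sectors:
            raise ValueError(f"sector {sector} out of range")

    def write_sector(self, sector, data):
        self._check(sector)
        if len(data) != SECTOR_SIZE:
            raise ValueError("data must be exactly one sector")
        with open(self.path, "r+b") as f:
            f.seek(sector * SECTOR_SIZE)  # virtual sector -> host file offset
            f.write(data)

    def read_sector(self, sector):
        self._check(sector)
        with open(self.path, "rb") as f:
            f.seek(sector * SECTOR_SIZE)
            return f.read(SECTOR_SIZE)


def demo_virtual_disk():
    """Write one sector, then read it and an untouched sector back."""
    with tempfile.TemporaryDirectory() as d:
        disk = VirtualDisk(os.path.join(d, "vm1.img"), num_sectors=8)
        disk.write_sector(3, b"A" * SECTOR_SIZE)
        return disk.read_sector(3)[:4], disk.read_sector(0)[:4]
```

Every virtual read or write becomes a host file read or write at offset `sector * SECTOR_SIZE`, which is the "map between virtual disk and real disk contents" step.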
What is virtualization? (continued) A way to run multiple operating systems and applications on the same hardware (virtual machines) Only virtual machine manager (a.k.a. hypervisor) has full system control Virtual machines completely isolated from each other (or so we hope)
Basic concepts • Virtual Machine (VM) • Host • Guest • Hypervisor (type II) / Virtual Machine Monitor
Types of virtualization Our focus is on full virtualization Full virtualization – guest OS runs unmodified Para-virtualization – guest OS must be aware of virtualization, source-code modifications required Hardware virtualization support may be used for both
Virtualization advantages • Cost-effectiveness – less hardware • Multiple virtual machines / operating systems / services on single physical machine (server consolidation) • Various forms of computation as a service • Isolation • Good for security • Great for reliability and recovery: If VM crashes it can be rebooted, does not affect other services (fault containment) • VM migration • Development tool • Work on multiple OS in parallel • Develop and debug OS in user mode • Origins of VMware as a tool for developers
Virtualization vs. Multi-Processing [Figure: two layered stacks, top to bottom. Multi-processing: processes (Process1, Process2, …) / user space-kernel separation / OS / HW interface / HW (disk, NIC, …). Virtualization: processes (Pr1, Pr2, …) inside each VM / guest operating systems (OS1, OS2, …) / virtual HW interface / VMM/hypervisor / real HW interface / HW (disk, NIC, …).]
Type 1 and type 2 hypervisors • Type 1: VMware ESX, Microsoft Hyper-V, Xen • Type 2: VMware Workstation, Microsoft Virtual PC, Sun VirtualBox, QEMU, KVM • Figure 7-1. Location of type 1 and type 2 hypervisors.
Type 1 and type 2 hypervisors (continued) Figure 7-2. Examples of the various combinations of virtualization type and hypervisor. Type 1 hypervisors always run on the bare metal whereas type 2 hypervisors use the services of an existing host operating system.
What's required of a (classic) hypervisor • Hypervisor should provide the following: • Safety: have full control of virtualized resources • Fidelity: program behavior on VM should be identical to its behavior on bare hardware • Efficiency: As much as possible, run directly on hardware without hypervisor intervention • Full interpretation isn't efficient
Classic virtualization: trap and emulate • (1) A sensitive operation in a VM (VM1, VM2, …) traps to the hardware • (2) The interrupt handler passes control to the VMM, which performs HW emulation • (3) The VMM returns to the process • Emulation is the process of implementing the functionality/interface of one system on a system having different functionality/interface
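The trap, emulate, and return cycle can be sketched as a toy model. Everything here is hypothetical: the three-instruction "ISA" (`inc`, `cli`, `sti`), the `Trap` exception standing in for the hardware trap, and the dictionary used as virtual CPU state are assumptions chosen to keep the sketch short.

```python
class Trap(Exception):
    """Stands in for the hardware trap raised on a privileged instruction."""

    def __init__(self, instr):
        super().__init__(instr)
        self.instr = instr


PRIVILEGED = {"cli", "sti"}  # toy set of privileged instructions


def hardware_execute(instr, regs):
    """Run one guest instruction in (modeled) user mode."""
    if instr in PRIVILEGED:
        raise Trap(instr)          # (1) trap: guest is not allowed to do this
    if instr == "inc":
        regs["ax"] += 1            # innocuous instructions run directly


def vmm_emulate(instr, vcpu):
    """(2) The VMM applies the instruction's effect to the *virtual* CPU."""
    if instr == "cli":
        vcpu["interrupts_enabled"] = False
    elif instr == "sti":
        vcpu["interrupts_enabled"] = True


def run_guest(program):
    regs = {"ax": 0}
    vcpu = {"interrupts_enabled": True}
    traps = 0
    for instr in program:
        try:
            hardware_execute(instr, regs)
        except Trap as t:
            traps += 1
            vmm_emulate(t.instr, vcpu)
        # (3) return to the guest: the loop continues with the next instruction
    return regs, vcpu, traps
```

In this model only the privileged instruction costs a VMM round trip; the two `inc` instructions in `run_guest(["inc", "cli", "inc"])` execute without hypervisor intervention, which is the efficiency requirement in miniature.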
Trap and emulate: difficulties • Sensitive instructions: behave differently in kernel/supervisor mode and in user mode • I/O instructions, enable/disable interrupts, … • Privileged instructions: cause a trap if executed in user mode • Theorem [Popek and Goldberg, 1974]: A machine can be virtualized [using trap and emulate] if every sensitive instruction is privileged. • This condition was not met by x86 processors prior to 2005 • In 2005, Intel/AMD introduced virtualization HW support.
What is sensitive? • CPU – registers • MMU • Page table • Segments • Interrupts • Timers • IO devices
X86 virtualization problem I • The x86 architecture (w/o virtualization extensions) can't be virtualized by trap and emulate. • Some sensitive instructions are not privileged. • Example: the popf instruction • Pops 16 bits from stack to flags register • One of the flags masks (i.e. disables) interrupts • The instruction is not privileged • What happens if the OS of a VM runs popf?
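A small model makes the popf question concrete. The sketch below assumes a simplified rule (it ignores IOPL and all other flag details): privileged code may change the interrupt flag, while user-mode code silently keeps the old value and no trap occurs. The function name `popf` and the `privileged` parameter are illustrative only.

```python
IF_MASK = 0x0200  # bit 9 of FLAGS: interrupt-enable flag
CF_MASK = 0x0001  # bit 0 of FLAGS: carry flag


def popf(current_flags, popped_value, privileged):
    """Return the new FLAGS value after a (modeled) popf.

    Simplification: in user mode the IF bit is preserved and every
    other bit is taken from the popped value. No trap is raised.
    """
    if privileged:
        return popped_value & 0xFFFF
    return (popped_value & ~IF_MASK & 0xFFFF) | (current_flags & IF_MASK)


# A guest kernel, deprivileged by the hypervisor, tries to disable interrupts:
flags_after = popf(current_flags=IF_MASK, popped_value=0, privileged=False)
guest_believes_disabled = True
actually_disabled = (flags_after & IF_MASK) == 0
```

Here `actually_disabled` is `False`: the guest OS believes interrupts are masked, the hardware did nothing, and because there was no trap the hypervisor never learned about the attempt.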
X86 virtualization problem II • Some instructions (push, pop, mov) can take segment selectors (cs, ds, ss) as arguments even in user mode, so the selectors can be read • The selectors contain two bits that hold a privilege level • In x86 (beginning with 386), there are four privilege levels (ring 0 to ring 3) • Each resource is assigned a level. • The two lower bits of the cs register are the Current Privilege Level (CPL) of the code. • Guest OS thinks that it is in ring 0. • Guest OS is actually in ring 1 • Result: guest OS confusion.
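Reading the privilege level out of a selector is a two-bit mask. The selector values below (`0x08` and `0x09`) are made-up examples, and the helper name `privilege_level` is not from the slides.

```python
def privilege_level(selector):
    """Return the ring number stored in the two low bits of a selector."""
    return selector & 0b11


# Hypothetical selector values, chosen only for illustration.
cs_expected_by_guest_kernel = 0x08   # low bits 00 -> ring 0
cs_seen_when_deprivileged = 0x09     # low bits 01 -> ring 1

guest_confused = (
    privilege_level(cs_seen_when_deprivileged)
    != privilege_level(cs_expected_by_guest_kernel)
)
```

Because reading `cs` does not trap, the hypervisor gets no chance to substitute the value the guest kernel expects.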
Implementation options • Emulation • Full emulation – hypervisor executes code of VM step by step, testing each instruction – prohibitive overhead. • Trap and emulate – possible if all sensitive instructions are privileged instructions. • Change sensitive instructions • Interpretation – equivalent to emulation (BOCHS, JSLinux). • Binary translation – rewrite the guest binary (VMware, QEMU). • Para-virtualization – re-compile guest OS (XEN, Denali). • Hardware assistance – Intel VT-x and AMD-V (used by KVM, XEN, VMware).
Outline • Concepts, classical CPU virtualization • Basic interpretation • Memory virtualization
Binary translation • Binary translation is the process of translating one instruction set to another one. • Approach I: translate statically all code base. • In our case the result is para-virtualization. • Problems • Dynamically linked libraries are not known at compile time. • Self-modifying code, e.g. program generating code and running it, is not covered.
Dynamic binary translation • Approach II: translate code on the fly (Just In Time). • Simplest approach • Keep table mapping old instructions to new instructions. • Fetch old instruction. • Use table to translate. • Execute new instruction(s) • Problem: performance • Overhead for every instruction similarly to interpretation.
Dynamic BT with caching • Cache translated code region: • After translation run from cache. • Translation occurs only once. • Static translation cannot handle dynamic control transfer, when: • Jump depending on memory address. • Indirect function call (by function pointer). • Translation of dynamic control transfer must be done at execution time.
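A minimal sketch of translate-once, run-from-cache, assuming a hypothetical three-opcode guest language (`add`, `loop`, `halt`) in which each "translated block" is just a Python closure that returns the next guest PC. Real translators emit machine code; only the caching structure is the point here.

```python
def make_translator(guest_code):
    """Build a runner for guest_code, a list of (opcode, argument) pairs."""
    cache = {}                      # guest PC -> translated block
    stats = {"translations": 0}

    def translate(pc):
        stats["translations"] += 1
        op, arg = guest_code[pc]
        if op == "add":
            def block(state):
                state["acc"] += arg
                return pc + 1
        elif op == "loop":
            # Dynamic control transfer: the target depends on run-time state,
            # so it is resolved when the block executes, not when translated.
            def block(state):
                state["count"] -= 1
                return arg if state["count"] > 0 else pc + 1
        else:  # "halt"
            def block(state):
                return None
        return block

    def run(state):
        pc = 0
        while pc is not None:
            block = cache.get(pc)
            if block is None:               # first visit: translate and cache
                block = cache[pc] = translate(pc)
            pc = block(state)               # later visits run from the cache
        return stats["translations"]

    return run


guest = [("add", 2), ("loop", 0), ("halt", None)]
state = {"acc": 0, "count": 3}
translations = make_translator(guest)(state)
```

The loop body executes three times but each guest address is translated only once, so `translations` stays at 3 no matter how large `count` is.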
Virtualization prior to HW support Figure 7-4. The binary translation rewrites the guest operating system running in ring 1, while the hypervisor runs in ring 0
VMware binary translation: example • Invoking isPrime(49), logging all code translated • [Figure: C code, its 64-bit binary, and the binary (hex) representation.]
VMware binary translation: example • Translator reads guest memory at the address indicated by guest PC • Decodes instructions, creates Intermediate Representation (IR) objects • Accumulates IR objects into translation units (TUs) • A TU is a basic block (BB): it stops upon control flow • [Figure: the first TU and its compiled code fragment (CCF), annotated in successive steps: identical code, translation of jump BB, translation of fall-through BB.]
VMware binary translation: example • Invoking isPrime(49), logging all code translated • Which basic block will be translated next? • [Figure: C code and its 64-bit binary.]
VMware binary translation operation • Translation cache (TC) stores translations done so far • A hash table tracks the input to output correspondence • Chaining optimization allows one CCF to jump directly to another without calling out of the translation cache • As TC gradually captures guest's working set, proportion of translation decreases • User code does not have to be translated
Dealing with privileged instructions: example • The cli (clear interrupts) instruction is privileged • Translated to: “vcpu.flags.IF=0” • Much faster than source binary!
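The idea of replacing a privileged instruction with a plain update of virtual CPU state can be sketched as follows. The `VCPU` class and the `translate` function are hypothetical, and `sti` is included only as the natural counterpart of `cli`; it is not part of the slide's example.

```python
IF_MASK = 0x0200  # interrupt-enable bit in the virtual FLAGS register


class VCPU:
    """Virtual CPU state kept in ordinary memory by the hypervisor."""

    def __init__(self):
        self.flags = IF_MASK  # interrupts start enabled


def translate(instr):
    """Return a callable that has the instruction's effect on a VCPU."""
    if instr == "cli":
        def clear_if(vcpu):
            vcpu.flags &= ~IF_MASK   # no trap: just a memory update
        return clear_if
    if instr == "sti":
        def set_if(vcpu):
            vcpu.flags |= IF_MASK
        return set_if
    raise ValueError(f"no translation defined for {instr!r}")


vcpu = VCPU()
translate("cli")(vcpu)
interrupts_enabled_after_cli = bool(vcpu.flags & IF_MASK)
```

The translated form touches only the in-memory virtual flags, so it avoids the trap and mode switch that trap-and-emulate would need for the same instruction.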
Outline • Concepts, classical CPU virtualization • Basic interpretation • Memory virtualization
Memory allocation • Each VM usually receives a contiguous set of physical addresses. • 512 Mbyte – 4 Gbyte are typical values. • As far as VM is concerned, this is the physical memory of the machine. • The guest OS allocates pages or segments to guest processes.
Memory management • Assumptions of OS in VM: • Physical memory is a contiguous block of addresses from 0 to some n. • OS can map any virtual page to any page frame. • Hypervisor must: • Partition memory among VMs. • Ensure virtual pages are mapped only to assigned page frames. • TLB – a TLB miss in a HW-managed TLB (e.g. x86) causes the HW itself to fetch the translation from the page table. • Guest OS must not manage the real page table.
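The two hypervisor duties listed above, partitioning and confinement, can be sketched with a bump allocator over host page frames. The class `MemoryPartitioner` and its method names are hypothetical; a real hypervisor need not hand out contiguous host frames.

```python
class MemoryPartitioner:
    """Hands each VM a private, contiguous range of host page frames."""

    def __init__(self, total_frames):
        self.total_frames = total_frames
        self.next_free = 0
        self.regions = {}  # VM name -> (base host frame, number of frames)

    def create_vm(self, name, num_frames):
        if self.next_free + num_frames > self.total_frames:
            raise MemoryError("not enough host frames")
        self.regions[name] = (self.next_free, num_frames)
        self.next_free += num_frames

    def guest_to_host(self, name, guest_frame):
        """Map a guest 'physical' frame (0..n-1) to a host frame."""
        base, size = self.regions[name]
        if not 0 <= guest_frame < size:
            # Confinement: a VM can never reach frames outside its region.
            raise ValueError(f"{name}: guest frame {guest_frame} not assigned")
        return base + guest_frame


mem = MemoryPartitioner(total_frames=16)
mem.create_vm("vm1", 8)
mem.create_vm("vm2", 4)
```

Each guest sees frames numbered from 0, matching its assumption of contiguous physical memory, while `mem.guest_to_host("vm2", 0)` lands on host frame 8.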
Option 1: brute force [Figure: the guest OS holds the page directory and page table; these pages are defined as not R/W. An access raises an interrupt and the VMM corrects the address, using the VM memory layout kept in VMM software. Hardware: CPU, CR3, TLB.]
Brute force – description • Guest page tables are read and write protected in host system. • If the guest OS reads the page table (e.g. for page eviction), writes the page table (e.g. after page fault), or changes CR3, the system traps. • The hypervisor then uses a VM memory layout to: • Return answers to VM • Update the layout • Hypervisor switches VM memory layout when new VM is scheduled.
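Option 2: shadow page tables [Figure: the guest OS holds the page directory and page table, reached through the guest CR3 (G-CR3). The hypervisor (VMM software) holds the shadow page table, reached through the hardware CR3. On an interrupt the VMM corrects the page table. Hardware: CPU, CR3, TLB.]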
Shadow page tables – description • Hypervisor maintains “shadow page tables”. • Guest page tables map: Guest VA → Guest PA • Shadow tables map: Guest VA → Host PA. • Hypervisor does not trap guest updates to its page table. • Result – inconsistent guest page table and shadow page table. • When guest process accesses virtual address • The physical address is not in the guest page table, but in the shadow page table. • HW translates correctly, because it is aware only of shadow tables.
Shadow page tables – description (continued) • If address in TLB – TLB hit and no problem. • When guest process causes a page fault • Hypervisor begins execution. • Hypervisor updates guest page table with new page. • Hypervisor updates shadow page table. • Performance is as good as native execution as long as there are no page faults. • Shadow page tables should be cached so that once a VM is re-scheduled the page table does not have to be rebuilt from scratch.
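A simplified model of shadow paging, assuming page-granular dictionaries in place of real multi-level tables. The class `ShadowMMU`, the `gpa_to_hpa` map, and the `GuestPageFault` exception are hypothetical names; in this sketch the hypervisor fills a shadow entry lazily on the first fault by composing the guest's mapping with its own.

```python
class GuestPageFault(Exception):
    """A fault the guest OS itself must handle (page not mapped by guest)."""


class ShadowMMU:
    def __init__(self, guest_pt, gpa_to_hpa):
        self.guest_pt = guest_pt        # guest VA page -> guest PA page
        self.gpa_to_hpa = gpa_to_hpa    # guest PA page -> host PA page
        self.shadow = {}                # guest VA page -> host PA page
        self.hypervisor_entries = 0     # how often the hypervisor had to run

    def translate(self, gva_page):
        # The hardware only ever consults the shadow table.
        if gva_page in self.shadow:
            return self.shadow[gva_page]
        # Miss in the shadow table: page fault, hypervisor begins execution.
        self.hypervisor_entries += 1
        if gva_page not in self.guest_pt:
            raise GuestPageFault(gva_page)
        hpa = self.gpa_to_hpa[self.guest_pt[gva_page]]
        self.shadow[gva_page] = hpa     # update the shadow page table
        return hpa


mmu = ShadowMMU(guest_pt={0: 5, 1: 7}, gpa_to_hpa={5: 105, 7: 42})
first = mmu.translate(0)
second = mmu.translate(0)
```

Only the first access to a page involves the hypervisor; the repeat access is served from the shadow table, which is why performance is close to native as long as there are no page faults.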
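Option 3: nested page tables [Figure: the guest OS holds the page directory and page table, reached through CR3. The hypervisor (VMM software) holds the host page table, reached through EPTP. Hardware: CPU, CR3, EPTP, TLB.]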
Nested page tables – description • The name implies having page tables within page tables. • The essence of the idea is a hardware assist. • Hardware has an extra pointer and the ability to walk an extra set of page tables. • Idea is called Extended Page Tables (EPT) by Intel • Guest page tables hold the Guest VA → Guest PA mapping, accessed via the standard CR3 • Extended page tables hold the Host VA → Host PA mapping, accessed via EPTP (EPT pointer). • Host VA = Guest PA
Nested page tables – description (cont'd) • TLB as usual holds Guest VA → Host PA • On memory access • If found in TLB – no problem. • If not in TLB, but no page fault, hardware walks both tables and updates TLB. • If page fault, the hypervisor gets a host physical page and provides the corresponding host virtual page (guest physical) to the VM.
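The two-stage lookup can be sketched with flat dictionaries standing in for the two page-table trees. The function `nested_translate` and its status strings are illustrative assumptions; real hardware also walks the EPT for every level of the guest page table, which this model omits.

```python
def nested_translate(gva_page, guest_pt, ept, tlb):
    """Translate a guest VA page to a host PA page.

    guest_pt: guest VA page -> guest PA page (reached via CR3)
    ept:      guest PA page -> host PA page  (reached via EPTP)
    tlb:      cache of guest VA page -> host PA page
    Returns (host PA page or None, description of what happened).
    """
    if gva_page in tlb:
        return tlb[gva_page], "tlb hit"
    gpa_page = guest_pt.get(gva_page)
    if gpa_page is None:
        return None, "guest page fault"     # handled by the guest OS
    hpa_page = ept.get(gpa_page)
    if hpa_page is None:
        return None, "EPT violation"        # handled by the hypervisor
    tlb[gva_page] = hpa_page                # hardware fills the TLB itself
    return hpa_page, "walked both tables"


guest_pt = {0: 5}
ept = {5: 105}
tlb = {}
miss = nested_translate(0, guest_pt, ept, tlb)
hit = nested_translate(0, guest_pt, ept, tlb)
```

Unlike shadow paging, a miss that both tables can resolve never enters the hypervisor: the first call walks both tables and the second is a plain TLB hit.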
Sources • “Modern operating systems”, 4th edition, A. Tanenbaum and H. Bos • “Virtual machines”, J. E. Smith and R. Nair • A presentation by Niv Gilboa from CSE@BGU • “Formal requirements for virtualizable third generation architectures”, G. J. Popek and R. P. Goldberg, CACM, 1974 • “A comparison of software and hardware techniques for x86 virtualization”, K. Adams and O. Agesen, ASPLOS 2006