Xen and the Art of Virtualization

Xen and the Art of Virtualization University of Cambridge Presenter: Ashish Gupta

Features • An open infrastructure for global distributed computing • Run multiple services on a single Xenoserver • Envisage running up to 100 per server • Secure and accountable execution • Strong isolation, logging and auditing • Flexible: low-level execution environment • Economical: execute on commodity hardware • (x86)

Virtualization techniques • Single OS image (Ensim, VServers) • Group user processes into resource containers • Implement new schedulers in the OS to ensure isolation • Hard to retrofit isolation to conventional OSes • Full virtualization (VMware, Connectix, Bochs) • Run full OSes as unmodified guests • The VMM enforces resource isolation • But it’s hard to efficiently virtualize uncooperative architectures

Paravirtualization Goals • Low virtualization overhead • Performance isolation • Also: flexibility • Support full-featured multi-user multi-application OSes

System Performance

Para-virtualization – Principles? • Para-virtualization vs. full virtualization • Expose guest OS to “real resources” (time, MMU etc.) • Better support for time-sensitive tasks • Allows guest OS optimizations • Correctness issues • The downside

Para-virtualization Mechanisms

Three broad aspects • Memory Management • CPU • Device I/O

Memory Management • The VMware approach – shadow page tables

Modifications • Paravirtualization obviates the need for shadow page tables • Guest OSes allocate and manage their own page tables • How?

Mechanism • Updates to page tables must be passed to Xen for validation • Updates may be queued and processed in batches • Validation rules (applied to each PTE): • 1. Only map a page if it is owned by the requesting guest OS • 2. Only map a page containing PTEs for read-only access • Xen tracks page ownership and current use
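
The two validation rules and the batching step can be sketched as below. This is a minimal illustration in Python, not Xen's actual code: the names `PageInfo`, `PteUpdate`, `validate_update` and `apply_batch` are hypothetical, and the frame table is reduced to a plain dictionary.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    owner: int            # id of the guest OS that owns this machine frame
    is_page_table: bool   # True if the frame currently holds PTEs


@dataclass
class PteUpdate:
    frame: int            # machine frame the new PTE would map
    writable: bool        # access requested by the guest


def validate_update(guest, upd, frames):
    """Apply the two rules from the slide to a single PTE update."""
    info = frames.get(upd.frame)
    # Rule 1: only map a page owned by the requesting guest OS.
    if info is None or info.owner != guest:
        return False
    # Rule 2: a page containing PTEs may only be mapped read-only.
    if info.is_page_table and upd.writable:
        return False
    return True


def apply_batch(guest, batch, frames):
    """Validate a queued batch in one pass and return the accepted updates."""
    return [u for u in batch if validate_update(guest, u, frames)]
```

Queuing updates lets the guest pay the cost of entering Xen once per batch rather than once per PTE, which is the point of the "processed in batches" bullet.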

Memory Management • The Xen approach

Memory benchmarks

CPU • Efficient because x86 provides four privilege levels • Guest OS – Ring 1, Applications – Ring 3 • Privileged instructions must be validated and executed by Xen • Exceptions • Guest OS registers handlers with Xen • Para-virtualization → unchanged handlers • “Fast handlers” for most exceptions: Xen isn’t involved • Page faults – CR2 can only be read by Xen, so the fault must enter Xen

Xen uses the 4-ring model

VM ↔ VMM • Guest OS → Xen: Hypercalls • Like system calls • Xen → Guest OS: Events • Like UNIX signals

I/O Virtualization • Need to minimize cost of transferring bulk data via Xen • Copying costs time • Copying pollutes caches • Copying requires intermediate memory • Device classes • Net • Disk • Graphics

I/O Virtualization • Use rings of buffer descriptors • Descriptors are small: cheap to copy and validate • Descriptors refer to bulk data • No need to map or copy the data into Xen’s address space • Exception: checking network packet headers prior to TX • Use zero-copy DMA to transfer bulk data between hardware and guest OS • Net TX: DMA packet payload separately from validated packet header • Net RX: Page-flip receive buffers into guest address space
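
A descriptor ring of this kind can be sketched as a fixed-size circular buffer with separate producer and consumer counters. The sketch below is illustrative only: `DescriptorRing`, `put` and `get` are hypothetical names, and a descriptor is modelled as a small `(buffer_id, length)` tuple that refers to bulk data held elsewhere.

```python
class DescriptorRing:
    """Fixed-size ring of small descriptors shared by a producer and a consumer."""

    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.prod = 0   # total number of descriptors ever enqueued
        self.cons = 0   # total number of descriptors ever dequeued

    def put(self, desc):
        """Enqueue a descriptor; return False if the ring is full."""
        if self.prod - self.cons == self.size:
            return False
        self.slots[self.prod % self.size] = desc
        self.prod += 1
        return True

    def get(self):
        """Dequeue the oldest descriptor, or return None if the ring is empty."""
        if self.cons == self.prod:
            return None
        desc = self.slots[self.cons % self.size]
        self.cons += 1
        return desc


# Bulk data stays in guest-owned buffers; only the descriptor crosses the ring.
guest_buffers = {7: bytearray(b"payload")}
ring = DescriptorRing(4)
ring.put((7, len(guest_buffers[7])))
```

Only the tuple is copied and checked; the payload itself is never copied through the ring, which is what keeps descriptors cheap.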

TCP Benchmarks

Effect of I/O and OS interaction • SPEC INT2000 score: CPU intensive, little I/O and OS interaction • SPEC WEB99: 180Mb/s TCP traffic, disk read-write on 2GB dataset

Scalability

Performance Isolation • 4 domains • 2 PostgreSQL, SPECWEB99 workloads • 2 anti-social workloads • Disk bandwidth hog: huge number of small file creations • Fork Bomb • The Bad guys could not kill the Good guys • In Native Linux: Rendered the machine completely unusable !

Denali Isolation Kernel University of Washington

Motivation • Functionality pushed into the network: • Google, IMDB, Hotmail, Amazon, EBay, online banking, …lots! • Major players use dedicated hardware. • Lesser services find that cumbersome, expensive and limiting: • Hardware, rack space, bandwidth • Big deployment barrier for little services.

Virtual hosting • Third-party hardware, with small services multiplexed on machines. • Need the ability to run untrusted code. • Likewise for CDNs for dynamic content.

Goals: • strong security • resource control. • Don’t need: resource sharing. • Conventional OSs do not isolate enough • Spectrum of Ideas ! • #1: OSs with Perf isolation • #2: OSs and sandboxing • #3: Exo- / Micro- kernels • #4: Conventional VMs • #5: Isolation Kernel

Isolation Kernel • Focus here is on • Performance with Scaling and • Isolation/Security • Reconsider the exposed Virtual Architecture • Downside • (Linux port ?)

Scaling Arguments

Denali Mechanism

Overall Architecture

ISA • Biggest challenge for x86 virtualization: • Ambiguous instruction semantics • No support for ambiguous instructions • Two virtual Instructions • Idle-with-timeout • Terminate execution

Memory Architecture • Simple DOS-like architecture: No virtual MMU • Why ? • TLB Problems on x86 : Hardware mapped: Inflexible • Avoids TLB Flushes • Optional Virtual MMU ?

I/O and Interrupt Model • Simpler interfaces to NIC, disk, keyboard, console and timer • Avoid the “chatty” interfaces • Interrupt Model • Physical interrupts → virtual interrupts • Interrupt Dispatch Model • Delays and batches interrupts for non-running VMs • Timing-related interrupts? Real-time apps, games, etc.?
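
The dispatch model (deliver immediately to the running VM, delay and batch for the others) can be sketched as follows. This is an assumption-laden illustration, not Denali's code: `InterruptDispatcher`, `physical_interrupt` and `switch_to` are hypothetical names, and interrupts are represented as plain strings.

```python
from collections import defaultdict


class InterruptDispatcher:
    def __init__(self):
        self.pending = defaultdict(list)  # VM id -> queued virtual interrupts
        self.running = None               # VM currently on the CPU

    def physical_interrupt(self, vm, irq):
        """Turn a physical interrupt into a virtual interrupt for `vm`.

        Returns the interrupts delivered right now (empty if delayed).
        """
        if vm == self.running:
            return [irq]                  # target is running: deliver at once
        self.pending[vm].append(irq)      # target is not running: queue it
        return []

    def switch_to(self, vm):
        """Context-switch to `vm` and deliver its whole batch in one go."""
        self.running = vm
        return self.pending.pop(vm, [])
```

Batching trades latency for fewer context switches, which is why the slide raises the question about timing-sensitive guests.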

Implementation • Round robin scheduling • Idle-with-timeout compensated with a higher priority for next quantum. • Can use existing compilers (gcc) to generate code • VMs are paged in on demand. VMM always in core
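
One way to read "round robin, with idle-with-timeout compensated by a higher priority for the next quantum" is a two-queue scheduler, sketched below. The exact policy is not given on the slide, so this is an assumption: the `Scheduler` class and its method names are hypothetical, and the boost is modelled simply as a queue that is always served first.

```python
from collections import deque


class Scheduler:
    def __init__(self, vms):
        self.normal = deque(vms)   # plain round-robin order
        self.boosted = deque()     # VMs that gave up a quantum by idling

    def pick(self):
        """Choose the next VM to run: boosted VMs first, then round robin."""
        queue = self.boosted or self.normal
        return queue.popleft() if queue else None

    def quantum_expired(self, vm):
        """VM used its full quantum: back to the end of the normal queue."""
        self.normal.append(vm)

    def woke_from_idle(self, vm):
        """VM's idle-with-timeout ended: compensate with a priority boost."""
        self.boosted.append(vm)
```

A VM that idles early is not penalised for yielding the CPU: when it wakes it runs ahead of VMs that consumed their whole quantum.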

Memory • Virtualized 16MB of physical address space per VM (since no virtual MMU). • Recently they added a MIPS-style virtual MMU, so the guest OS can virtualize its apps’ address space. Overhead? • Pre-allocated, strided swap space. No sharing, so each VM’s space is contiguous.
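
Because each VM has a fixed 16MB space and nothing is shared, locating a VM's memory in swap reduces to arithmetic. The sketch below assumes the stride equals the 16MB VM size and that VMs are numbered from zero; both are assumptions for illustration, and `swap_offset` is a hypothetical name.

```python
VM_SPACE = 16 * 1024 * 1024   # 16MB of physical address space per VM (from the slide)


def swap_offset(vm_index, guest_phys_addr):
    """Byte offset in the swap area backing `guest_phys_addr` of VM `vm_index`.

    Assumes the stride between VMs equals VM_SPACE, so each VM's region
    is contiguous and never overlaps another VM's region.
    """
    if not 0 <= guest_phys_addr < VM_SPACE:
        raise ValueError("address outside the VM's 16MB space")
    return vm_index * VM_SPACE + guest_phys_addr
```

For example, `swap_offset(2, 4096)` lands 4096 bytes into the third 16MB region. No per-page lookup table is needed, at the cost of reserving swap for every VM up front.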

Networked IO • Ethernet driver moved from guest OS to Denali. Rest of TCP/IP stack stays. • This suffices for early-demuxing received packets into the appropriate VM. • Virtual packet send/recv is 1 PIO each

Guest OS • Guest OS: currently only a library, with no simulated protection boundary there. • Supports a POSIX subset. • Different from a traditional VM : OS more like a process: single user, single task OS ? • Flexibility ?

Evaluation • Network Latency

TCP, HTTP throughput • TCP: BSD-Linux 607Mb/s, Denali-Linux 569Mb/s

Fair comparison? • Denali with library kernel compared against BSD: both have one protection boundary • Denali-Linux will have one real and one simulated protection boundary: different ?

Batching • Reduction in context switching frequency

Idle-with-timeout

Scalability • In-core regime – constant performance • disk bound regime - problems

Scalability and block size • Internal fragmentation!

Evaluation summary • Good performance and scalability due to architectural modifications and various techniques • Is the lib OS representative of a real OS?
