An Introduction to the V3VEE Project and the Palacios Virtual Machine Monitor Peter A. Dinda Department of Electrical Engineering and Computer Science Northwestern University http://v3vee.org
Overview • We are building a new, public, open-source VMM for modern architectures • You can download it now! • We are leveraging it for research in HPC, systems, architecture, and teaching • You can join us
Outline • V3VEE Project • Why a new VMM? • Palacios VMM • Research leveraging Palacios • Virtualization for HPC • Virtual Passthrough I/O • Adaptation for Multicore and HPC • Overlay Networking for HPC • Alternative Paging • Participation
V3VEE Overview • Goal: Create a new, public, open source virtual machine monitor (VMM) framework for modern x86/x64 architectures (those with hardware virtualization support) that permits the compile-time/run-time composition of VMMs with structures optimized for… • High performance computing research and use • Computer architecture research • Experimental computer systems research • Teaching • Funding: community resource development project (NSF CRI) and X-Stack exascale systems software research (DOE)
Example Structures • [Figure: example compositions built from the VMM API: guest OSes and library OSes running directly over a VMM, over a VMM with a multiplexer (MUX), and over a VMM with MUX and drivers, each layered on the hardware]
V3VEE People and Organizations • Northwestern University • Peter Dinda, Jack Lange, Russ Joseph, Lei Xia, Chang Bae, Giang Hoang, Fabian Bustamante, Steven Jaconette, Andy Gocke, Mat Wojick, Peter Kamm, Robert Deloatch, Yuan Tang, Steve Chen, Brad Weinberger, Mahdav Suresh, and more… • University of New Mexico • Patrick Bridges, Zheng Cui, Philip Soltero, Nathan Graham, Patrick Widener, and more… • Sandia National Labs • Kevin Pedretti, Trammell Hudson, Ron Brightwell, and more… • Oak Ridge National Lab • Stephen Scott, Geoffroy Vallee, and more… • University of Pittsburgh • Jack Lange, and more…
Why a New VMM? • [Figure (build slide): a virtual machine monitor sits among several communities: research in systems, research and use in high performance computing, research in computer architecture, teaching, and an open source community with a broad user and developer base; the commercial space serves some of these well, while others are ill-served]
Why a New VMM? • Code Scale • Compact codebase by leveraging hardware virtualization assistance • Code decoupling • Make virtualization support self-contained • Linux/Windows kernel experience unneeded • Make it possible for researchers and students to come up to speed quickly
Why a New VMM? • Freedom • BSD license for flexibility • Raw performance at large scales
Palacios VMM • Under development since early 2007 • Third major release (1.2) in January 2010 • Public git access to current development branches • Successfully used on Cray XT supercomputers, clusters (Infiniband and Ethernet), servers, desktops, etc.
What Does It Do? • System virtualization • No paravirtualization necessary • Run an OS as an application • Run multiple OS environments on a single machine • Start, stop, pause • Trap and emulate model • Privileged instructions/events are trapped by the VMM using hardware mechanisms • Emulated in software • Palacios acts as an event handler loop • Minimal Linux boot: ~2 million events (“exits”) • [Figure: applications and guest OSes, with their page tables and CPU state, run above the host OS/VMM, which emulates privileged operations on the hardware]
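The event-handler-loop structure can be sketched in a few lines of C. This is an illustrative skeleton of the trap-and-emulate model described above, not the actual Palacios code or API: guest_ctx, hw_vm_enter, and the emulate_* helpers are placeholders for the hardware entry path (VMRUN/VMLAUNCH) and the software emulation routines.

```c
/* Illustrative trap-and-emulate loop (placeholder names, not the Palacios API). */
#include <stdint.h>

enum exit_reason { EXIT_IO, EXIT_CPUID, EXIT_PAGE_FAULT, EXIT_HALT };

struct guest_ctx {
    uint64_t rip;             /* guest instruction pointer              */
    enum exit_reason reason;  /* filled in by hardware on each VM exit  */
    int running;
};

/* Stand-ins for the hardware entry instruction and the emulation paths. */
extern void hw_vm_enter(struct guest_ctx *g);       /* VMRUN / VMLAUNCH */
extern void emulate_io(struct guest_ctx *g);
extern void emulate_cpuid(struct guest_ctx *g);
extern void handle_nested_fault(struct guest_ctx *g);

void vmm_run(struct guest_ctx *g)
{
    while (g->running) {
        hw_vm_enter(g);                 /* run the guest until the next exit    */
        switch (g->reason) {            /* a minimal Linux boot: ~2M such exits */
        case EXIT_IO:         emulate_io(g);          break;
        case EXIT_CPUID:      emulate_cpuid(g);       break;
        case EXIT_PAGE_FAULT: handle_nested_fault(g); break;
        case EXIT_HALT:       g->running = 0;         break;
        }
    }
}
```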
Leveraging Modern Hardware • Requires and makes extensive use of modern x86/x64 virtualization support • AMD SVM (AMD-V) • Intel VT-x • Nested paging support in addition to several varieties of shadow paging • Straightforward passthrough PCI • IOMMU/PCI-SIG in progress
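As a concrete example of the hardware dependence, the two extensions can be detected from user space with CPUID. The bit positions below come from the AMD and Intel manuals (CPUID.1:ECX bit 5 for VT-x/VMX, CPUID.8000_0001h:ECX bit 2 for SVM); the surrounding program is just an illustrative check, not part of Palacios.

```c
/* Detect the hardware virtualization extensions Palacios requires. */
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

static int has_intel_vtx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 5) & 1;               /* CPUID.1:ECX.VMX[bit 5]          */
}

static int has_amd_svm(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 2) & 1;               /* CPUID.8000_0001h:ECX.SVM[bit 2] */
}

int main(void)
{
    printf("Intel VT-x: %s\n", has_intel_vtx() ? "yes" : "no");
    printf("AMD SVM:    %s\n", has_amd_svm() ? "yes" : "no");
    return 0;
}
```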
Design Choices • Minimal interface • Suitable for use with a lightweight kernel (LWK) • Compile- and run-time configurability • Create a VMM tailored to specific environments • Low noise • No deferred work • Contiguous memory pre-allocation • No surprising memory system behavior • Passthrough resources and resource partitioning
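A sketch of what contiguous memory pre-allocation buys: if the guest's physical memory is reserved as one contiguous host region at creation time, guest-physical to host-physical translation is a constant offset, nothing is allocated while the guest runs, and memory behavior stays predictable. The names and layout here are illustrative assumptions, not the Palacios memory code.

```c
/* Illustrative constant-offset guest memory map (not the Palacios allocator). */
#include <stdint.h>

struct guest_mem_map {
    uint64_t base_hpa;   /* start of the pre-allocated contiguous host region */
    uint64_t size;       /* guest physical memory size in bytes               */
};

/* Translate a guest physical address to a host physical address.
 * Returns 0 for out-of-range addresses (never a valid mapping here). */
static uint64_t gpa_to_hpa(const struct guest_mem_map *m, uint64_t gpa)
{
    if (gpa >= m->size)
        return 0;
    return m->base_hpa + gpa;   /* constant offset: no run-time allocation or paging */
}
```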
Embeddability • Palacios compiles to a static library • Clean, well-defined host OS interface allows Palacios to be embedded in different Host OSes • Palacios adds virtualization support to an OS • Palacios + lightweight kernel = traditional “type-I” “bare metal” VMM • Current embeddings • Kitten, GeekOS, Minix 3, Linux [partial]
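The embedding works through a table of host-provided callbacks that the Palacios library calls for everything it needs from the kernel underneath it. The sketch below shows the general shape of such an interface; the field names are illustrative and do not reproduce the exact v3_os_hooks definition.

```c
/* Illustrative host-OS interface for an embeddable VMM library. */
#include <stddef.h>

struct host_os_hooks {
    void  (*print)(const char *fmt, ...);           /* console / log output        */
    void *(*allocate_pages)(int num, int align);    /* physically contiguous pages */
    void  (*free_pages)(void *addr, int num);
    void *(*malloc)(size_t size);                   /* kernel heap                 */
    void  (*free)(void *ptr);
    void *(*paddr_to_vaddr)(void *paddr);           /* address-space plumbing      */
    void *(*vaddr_to_paddr)(void *vaddr);
    void  (*interrupt_enable)(void);
    void  (*interrupt_disable)(void);
};

/* The host kernel (Kitten, GeekOS, Minix 3, Linux, ...) fills in the table
 * and passes it to the VMM library once at initialization. */
void vmm_library_init(struct host_os_hooks *hooks);
```

Since the library only calls the host through a table like this, embedding it in a new host OS amounts to implementing these hooks.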
Sandia Lightweight Kernel Timeline • 1991 – Sandia/UNM OS (SUNMOS), nCube-2 • 1991 – Linux 0.02 • 1993 – SUNMOS ported to Intel Paragon (1800 nodes) • 1993 – SUNMOS experience used to design Puma; first implementation of Portals communication architecture • 1994 – Linux 1.0 • 1995 – Puma ported to ASCI Red (4700 nodes); renamed Cougar, productized by Intel • 1997 – Stripped-down Linux used on Cplant (2000 nodes); difficult to port Puma to COTS Alpha server; included Portals API • 2002 – Cougar ported to ASC Red Storm (13000 nodes); renamed Catamount, productized by Cray; host- and NIC-based Portals implementations • 2004 – IBM develops LWK (CNK) for BG/L/P (106000 nodes) • 2005 – IBM & ETI develop LWK (C64) for Cyclops64 (160 cores/die)
Kitten: An Open Source LWK http://software.sandia.gov/trac/kitten • Better match for user expectations • Provides a mostly Linux-compatible user environment • Including threading • Supports unmodified compiler toolchains and ELF executables • Better match for vendor expectations • Modern code-base with familiar Linux-like organization • Drop-in compatible with Linux • Infiniband support • End goal is deployment on a future capability system
A Compact Type-I VMM • KVM: 50k–60k lines + kernel dependencies (??) + user-level devices (180k) • Xen: 580k lines (50k–80k core)
Outline • V3VEE Project • Why a new VMM? • Palacios VMM • Research leveraging Palacios • Virtualization for HPC • Virtual Passthrough I/O • Adaptation for Multicore and HPC • Overlay Networking for HPC • Alternative Paging • Participation
HPC Performance Evaluation • Virtualization is very useful for HPC, but only if it doesn’t hurt performance • Virtualizing a supercomputer? Are we crazy? • Virtualized Red Storm with Palacios • Evaluated with Sandia’s system evaluation benchmarks • Initial (48 node) evaluation in IPDPS 2010 paper • Large scale (>4096 node) evaluation in submission
Cray XT3 • 38208 cores • ~3500 sq ft • 2.5 megawatts • $90 million • 17th fastest supercomputer
Virtualized Performance (Catamount) • Within 5% • Scalable • HPCCG: conjugate gradient solver
Different Guests Need Different Paging Models • [Figure: HPCCG (conjugate gradient solver) performance under shadow paging for Compute Node Linux and Catamount guests]
Different Guests Need Different Paging Models • [Figure: CTH (multi-material, large deformation, strong shockwave simulation) performance under different paging models for Compute Node Linux and Catamount guests]
Large Scale Study • Evaluation on full Red Storm system • 12 hours of dedicated system time on the full machine • Largest virtualization performance scaling study to date • Measured performance at exponentially increasing scales, up to 4096 nodes • Publicity • New York Times • Slashdot • HPCWire • Communications of the ACM • PC World
Scalability at Large Scale (Catamount) • Within 3% • Scalable • CTH: multi-material, large deformation, strong shockwave simulation
Infiniband on Commodity Linux (Linux guest on IB cluster) • 2-node Infiniband ping-pong bandwidth measurement
Observations • Virtualization can scale on the fastest machines in the world running tightly coupled parallel applications • Best virtualization approach depends on the guest OS and the workload • Paging models, I/O models, scheduling models, etc… • Useful for the VMM to have asynchronous and synchronous access to guest knowledge • Symbiotic Virtualization (Lange’s thesis)
VPIO: Virtual Passthrough I/O • A modeling-based approach to high performance I/O virtualization for commodity devices • Intermediate option between passthrough I/O, self-virtualizing devices, and emulated I/O • Security of emulated I/O • …with some of the performance of self-virtualizing devices or passthrough I/O • …on commodity devices [WIOV 2008, Operating Systems Review 2009]
I/O virtualization – fully emulated I/O • No guest software change • All new device drivers • High performance overhead [Sugerman01]
I/O virtualization – paravirtualized I/O • Performance • Reuse Device Driver • Change guest device driver [Barham03, Levasseur04]
I/O virtualization – Self-virtualizing Devices • Native performance • Guest responsible for device driver • Specialized hardware support • Self-virtualizing devices [Liu06, Raj07,Shafer07]
I/O virtualization – Passthrough I/O • [Figure: VMM maps the physical device directly into the guest OS] • Reusability/multiplexing issues • Security issue • Makes sense in HPC contexts, not so much in other contexts
Core Idea of VPIO • VMM maintains a formal model of the device • Keeps track of the physical device status • Driven by guest/device interactions • Simpler than a device driver/virtual device • Model determines • Reusable state – whether the device is currently serially reusable • DMA – whether a DMA is about to start and what memory locations will be used • Other interesting states
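A sketch of what such a model might look like in C. The state fields, transition entry points, and checking functions below are hypothetical, chosen only to illustrate the idea of a small model that answers "is the device serially reusable?" and "is a DMA about to start, and to where?"; they are not the actual ~700-line NE2K model.

```c
/* Hypothetical VPIO-style device model (illustrative, not the paper's NE2K model). */
#include <stdint.h>

struct dma_check {
    int      imminent;     /* guest has armed a transfer                  */
    uint64_t guest_paddr;  /* guest physical region the device will touch */
    uint32_t len;
};

struct vpio_model {
    uint8_t page_start, page_stop;  /* shadowed device registers */
    uint8_t command;
    int     tx_in_flight;
};

/* Transitions: driven by the guest's accesses to hooked I/O ports
 * and by interrupts from the physical device. */
void model_io_write(struct vpio_model *m, uint16_t port, uint8_t val);
void model_io_read(struct vpio_model *m, uint16_t port, uint8_t val);
void model_interrupt(struct vpio_model *m, int irq);

/* Checking functions the VMM consults before reassigning the device
 * or letting a guest-programmed DMA proceed. */
int model_is_reusable(const struct vpio_model *m);
int model_dma_pending(const struct vpio_model *m, struct dma_check *out);
```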
VPIO in the Context of Palacios • [Figure: guest OSes with unmodified drivers reach the physical device through hooked and unhooked I/O, interrupts, and DMA; the Device Modeling Monitor (DMM) in the Palacios VMM observes these interactions]
Example Model: NE2K NIC • Model is small, ~700 LOC • Checking function • Transitions driven by guest’s interactions with hooked I/O resources and interrupts
Virtuoso Project (2002-2007) virtuoso.cs.northwestern.edu • “Infrastructure as a Service” distributed grid/cloud virtual computing system • Particularly for HPC and multi-VM scalable apps • First adaptive virtual computing system • Drives virtualization mechanisms to increase the performance of existing, unmodified apps running in collections of VMs • Focus on wide-area computation • R. Figueiredo, P. Dinda, J. Fortes, A Case for Grid Computing on Virtual Machines, Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS 2003), May 2003. Tech report version: August 2002.
Virtuoso: Adaptive Virtual Computing • Providers sell computational and communication bandwidth • Users run collections of virtual machines (VMs) that are interconnected by overlay networks • Replacement for buying machines • That continuously adapts…to increase the performance of your existing, unmodified applications and operating systems See virtuoso.cs.northwestern.edu for many papers, talks, and movies
Core Results (Patent Covered) • Monitor application traffic: Use application’s own traffic to automatically and cheaply produce a view of the application’s network and CPU demands and of parallel load imbalance • Monitor physical network using the application’s own traffic to automatically and cheaply probe it, and then use the probes to produce characterizations • Formalize performance optimization problems in clean, simple ways that facilitate understanding their asymptotic difficulty. • Adapt the application to the network to make it run faster or more cost-effectively with algorithms that make use of monitoring information to drive mechanisms like VM->host mapping, scheduling of VMs, overlay network topology and routing, etc. • Adapt the network to the application through automatic reservations of CPU (incl. gang scheduling) and optical net paths • Transparently add network services to unmodified applications and OSes to fix design problems
Extending Virtuoso: Adaptation at the Small Scale • Targeting current and future multicores • Example: specialized multicore guest environment suitable for parallel language run-time • Avoid general-purpose Intel SMP-spec model or NUMA models • Specific assists for run-times implemented in Palacios • Run parallel components on the “bare metal”, while sequential components run on top of the traditional stack • Example: joint software/hardware inference of communication/memory system activity • Extending hardware performance counters • Example: automatic configuration of routing/forwarding on a network-on-chip multicore processor • “Dial in” topology suited for application
Extending Virtuoso:Adaptation at the Large Scale • Targeting a parallel supercomputer • Extending Virtuoso-style adaptation into tightly coupled, performance critical computing environments • Example: seamless, performance-neutral, migration of parallel application from WAN to LAN to Cluster to Super depending on communication behavior • Leverage our VNET overlay network concept
Illustration of VNET Overlay Topology Adaptation in Virtuoso • [Figure: VMs on foreign host LANs (Host 1–4 + VNET) connect over the IP network through a proxy + VNET on the user’s LAN; a resilient star backbone is augmented with fast-path links between the VNETs hosting VMs, based on the merged traffic matrix inferred by VTTIF]
VNET Overlay Networking for HPC • Virtuoso project built “VNET/U” • User-level implementation • Overlay appears to host to be simple L2 network (Ethernet) • Ethernet in UDP encapsulation (among others) • Global control over topology and routing • Works with any VMM • Plenty fast for WAN (200 Mbps), but not HPC • V3VEE project is building “VNET/P” • VMM-level implementation in Palacios • Goal: Near-native performance on Clusters and Supers for at least 10 Gbps • Also: easily migrate existing apps/VMs to specialized interconnect networks
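For reference, the Ethernet-in-UDP encapsulation used by a VNET/U-style overlay is simple to sketch: each guest Ethernet frame is prefixed with a small overlay header and sent to the peer VNET daemon over an ordinary UDP socket. The header layout, magic value, and size limit below are illustrative assumptions, not the actual VNET wire format.

```c
/* Illustrative Ethernet-in-UDP encapsulation for a VNET-style overlay. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

struct vnet_hdr {
    uint32_t magic;       /* sanity check / demultiplexing tag */
    uint32_t frame_len;   /* length of the encapsulated frame  */
} __attribute__((packed));

/* Wrap one raw Ethernet frame and forward it to a peer overlay daemon. */
int vnet_send_frame(int udp_sock, const struct sockaddr_in *peer,
                    const uint8_t *frame, uint32_t frame_len)
{
    uint8_t pkt[sizeof(struct vnet_hdr) + 1514];   /* max standard Ethernet frame */
    struct vnet_hdr hdr = {
        .magic     = htonl(0x564E4554u),           /* "VNET" (assumed tag)        */
        .frame_len = htonl(frame_len),
    };

    if (frame_len > 1514)
        return -1;

    memcpy(pkt, &hdr, sizeof(hdr));
    memcpy(pkt + sizeof(hdr), frame, frame_len);

    return (int)sendto(udp_sock, pkt, sizeof(hdr) + frame_len, 0,
                       (const struct sockaddr *)peer, sizeof(*peer));
}
```

On the receive side the peer strips the header and injects the frame into the destination VM’s virtual NIC, which is why the overlay appears to the hosts as a simple L2 Ethernet.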
Current Performance (2 node, ttcp) • 4x speedup of VNET/P over VNET/U; native bandwidth on 1 Gbps Ethernet
Current Performance (2 node, latency) • <0.5 ms latency, but room for improvement
Current Performance (2 node, ttcp) • We are clearly not done yet!