Introduction to Virtual High Performance Computing Clusters • Thomas J. Hacker • Associate Professor, Computer & Information Technology • Co-Leader for Information Technology, Network for Earthquake Engineering Simulation (NEES) • July 30, 2012
Outline • Motivation for the use of virtualization • Overview of virtualization technology • Overview of cloud computing technology • Relation of cloud computing to HPC • Practical notes on virtualization and cloud computing • Virtual HPC clusters • How to get started
Motivation for virtualization Why virtualization, and when does it make sense? • Clock speed increases have ceased, even as Moore's law continues to increase transistor counts • Hardware is going multicore, with many cores per chip • E.g., the Intel MIC architecture, now branded Xeon Phi (codenamed Knights Corner), with 50+ cores • Memory capacity of systems is increasing • Up to 512 GB on large servers today
Motivation for virtualization • The traditional approach has been to tie a single application to a single server • An application runs in its own OS image on its own server for manageability and serviceability • This approach no longer makes sense when you have 50+ cores that a single application cannot use effectively • It is also difficult to run multiple applications on the same system when their OS and library version requirements conflict • VMs are being used to partition large-scale servers so that many OS instances run independently of each other
Motivation for virtualization • Virtualization is now commodity technology • The ideas were first developed in the 1960s at IBM for mainframe computers • Virtualization is used frequently for administrative applications to reduce the hardware footprint in the data center and reduce costs • Like other commodity trends, it is worth exploiting for HPC • It is an especially useful substitute for the small-scale lab clusters used early in the life cycle of a parallel application
Software ecosystem for applications Software requires a functional ecosystem (similar to Maslow's hierarchy of needs) • Basic “physiological” needs • A reliable computing platform • A functional operating system platform that the application needs • If software isn't kept up to date, it can conflict with OS upgrades • Adequate disk space, memory, and CPU cores • “Safety” needs • A secure computing environment – no attackers, compromised accounts, etc. • “A sense of security and predictability in the world” • Predictability is essential for replicating results and debugging • “Sense of community” • All of the nodes in the cluster need to be consistent • Same OS version, libraries, etc. • Especially critical for MPI applications • Meeting these basic needs ensures a consistent software ecosystem • A stable platform facilitates software development, testing, and validation of results • Developers and users can begin to trust the software and the results it produces • This provides a strong base for future growth and development of the application
Software ecosystem for applications • Problem: it is difficult for users to control their computing environment for scientific applications • Scientific applications used in projects such as CMS require many specific packages and versions, and it can be very difficult to get central IT organizations to customize and install the necessary software, since they must provide a generic and reliable system for the rest of the user base • Scientific applications go through a life cycle in which they evolve from a single processor, to a few workstations, to small-scale clusters, and finally scale up to very large systems • Building small-scale physical clusters as part of this life cycle is very expensive in equipment, time, and graduate student effort spent running these systems • Scientific users can really benefit from having root access on their own systems to get their codes working and install any necessary packages
Software ecosystem for applications • Virtual HPC clusters are an attractive and viable alternative to small-scale lab clusters while the applications that need these resources are still “young” and require a lot of customization • On larger systems, virtual clusters are a promising approach for providing system-level checkpointing for large-scale applications • Imagine using a virtualization system on your laptop to build a 2- or 3-VM virtual cluster with all the packages and optimizations you need, then transferring that VM image to a virtual cluster platform and instantiating dozens (or more) VM instances to run a virtual cluster • Fault tolerance is a critical problem for applications as they scale up • There are several levels of checkpointing: • Application level • “On the system” level (e.g., Condor, BLCR) • “Below the system” level, using live migration or checkpointing/saving VM images
Reliability • One of the “safety” needs of software in its ecosystem • Problems with reliability and techniques to improve reliability • Large systems can fail often • Failures severely affect large and/or long-running jobs • Very expensive to just restart computation from the beginning • Lots of wasted time on the computer system, and wasted power and cooling
Reliability • A technique to overcome this problem is to frequently save critical program data – this is called checkpointing • Your program will need to read the saved data when it is restarted and resume computation from the saved state • There is guidance on how often to checkpoint to strike a good balance between spending time saving state for “safety” and making forward progress in the computation • Daly's checkpoint formula is a good start
Reliability Daly checkpoint formula • Used to estimate the optimal compute time between writing checkpoints
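A minimal sketch of the first-order form of the estimate, assuming delta is the time to write one checkpoint and M is the system's mean time between failures (Daly's full derivation adds higher-order correction terms; the numbers in the example are illustrative):

    import math

    def daly_checkpoint_interval(delta, mtbf):
        # First-order estimate of the optimal compute time between checkpoints:
        # tau ~ sqrt(2 * delta * M) - delta, with delta and M in the same time units
        if delta >= 2 * mtbf:
            return mtbf                      # checkpointing costs more than it saves
        return math.sqrt(2 * delta * mtbf) - delta

    # Illustrative numbers: 5-minute checkpoint writes and a 24-hour MTBF
    # give an interval of roughly 1.9 hours between checkpoints
    print(daly_checkpoint_interval(5 * 60, 24 * 3600) / 3600.0)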
Reliability • Research is exploring alternative methods of performing checkpoint operations • System-level checkpointing – BLCR • MPI-level checkpointing • VM-level checkpointing and live migration • The idea is to periodically save the VM state, or to live-migrate the VM from a failing node to a healthy one (see the sketch below)
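As a rough illustration of the “below the system” approach, this sketch saves and restores a whole KVM guest with the libvirt Python bindings; the connection URI is the standard local qemu:///system, but the guest name and checkpoint path are illustrative assumptions, and live migration would use libvirt's separate migration calls instead:

    import libvirt

    # Connect to the local KVM/QEMU hypervisor
    conn = libvirt.open("qemu:///system")

    # "compute-node-1" is an illustrative guest name
    dom = conn.lookupByName("compute-node-1")

    # Save the complete VM state to a file and stop the guest
    # (a VM-level checkpoint taken "below the system")
    dom.save("/var/lib/libvirt/images/compute-node-1.sav")

    # Later, possibly after the host is repaired, restore the guest from the file
    conn.restore("/var/lib/libvirt/images/compute-node-1.sav")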
Reliability • Be aware of the need to integrate reliability practices into your application as you design and write your code • At a minimum, structure your code so that you can periodically save the current state of the computation, and provide the ability to resume from that saved state when your program is restarted (a skeleton is sketched below)
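A minimal sketch of that structure in Python, assuming the program state is small enough to pickle; the file name, loop body, and checkpoint interval are placeholders for real application logic:

    import os
    import pickle

    CHECKPOINT = "state.pkl"               # illustrative checkpoint file name

    def save_state(step, data):
        # Write to a temporary file first, then rename, so a crash during
        # the write cannot corrupt the previous checkpoint
        with open(CHECKPOINT + ".tmp", "wb") as f:
            pickle.dump({"step": step, "data": data}, f)
        os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

    def initial_data():
        return [0.0] * 1000                # placeholder for real program state

    def load_state():
        # Resume from the checkpoint if one exists, otherwise start fresh
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                state = pickle.load(f)
            return state["step"], state["data"]
        return 0, initial_data()

    step, data = load_state()
    while step < 10000:
        data = [x + 1.0 for x in data]     # placeholder for real computation
        step += 1
        if step % 500 == 0:                # checkpoint interval (see Daly formula)
            save_state(step, data)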
Overview of Virtualization Technologies • Virtualization is a technique that separates the operating system from the physical computer hardware, and interposes a layer of controlling software (hypervisor) between the hardware and operating system. • Different types of virtualization systems (from Goldberg) • Type 1: hypervisor between “bare metal” and guest operating systems • Type 2: hypervisor between host operating system and guest operating systems • Type 1 examples • VMware, Xen, KVM • OpenVZ • Type 2 examples • Virtual Box, VMware Workstation, Parallels for Mac
Type 1 Virtualization • VMware • High-quality commercial product • We use VMware extensively for NEES • Very useful for transitioning IT infrastructure from SDSC to Purdue for the NEES project • We simply created VM images for each service/server on a few physical servers • We were able to archive the VM images of the services/servers when NEES brought up the NEEShub cyberinfrastructure • Windows • Hyper-V
Type 1 Virtualization • Virtualization systems for Linux • Xen and KVM • Open source virtualization systems based on Linux • Xen • First major virtualization system • Older, seems to be less reliable • KVM • Kernel-based Virtual Machine • Newer, supported by RedHat • OpenVZ • Container based virtualization system
Xen • First version in 2003, and the first popular Linux hypervisor • Integrated into the Linux kernel • Uses paravirtualization • Guest OSs run a modified operating system to interact with the hypervisor • Different from VMware, which uses a custom kernel loaded on the bare hardware • The host OS runs as Domain0 (Dom0) • Guest OSs run as unprivileged domains (DomU) • Used to be supported in a limited form in RedHat and Ubuntu • Has been replaced with KVM in RedHat • Citrix has a commercial version of Xen • Personal experience using Xen • Works OK for simple virtualization • Complex operations didn't work as well
KVM • Kernel-based Virtual Machine (KVM) • Built into the Linux kernel • Supported by RedHat • More recent than Xen • Uses QEMU for virtual processor emulation • Allows you to emulate CPU architectures other than Intel • E.g., ARM and SPARC • Supports a wide variety of guest operating systems • Linux • Windows • Solaris • Provides a useful set of management utilities • Virtual Machine Manager • ConVirt
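Those management tools sit on top of libvirt, so the same information can also be scripted; a minimal sketch, assuming the libvirt Python bindings are installed and a local qemu:///system hypervisor is running:

    import libvirt

    # Read-only connection to the local KVM/QEMU hypervisor
    conn = libvirt.openReadOnly("qemu:///system")

    # List every defined guest and report its state, roughly what
    # Virtual Machine Manager shows in its main window
    for dom in conn.listAllDomains():
        state, maxmem_kb, mem_kb, vcpus, cputime = dom.info()
        status = "running" if dom.isActive() else "shut off"
        print(f"{dom.name():<20} {status:<10} vcpus={vcpus} mem={maxmem_kb // 1024} MB")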
OpenVZ • Container-based virtualization system • Secure, isolated Linux containers • Think of this as a “cage” for an application running in an OpenVZ container • OpenVZ terminology: Virtual Private Servers (VPS), Virtual Environments (VE) • Two major differences from Xen and KVM • The guest OS shares a kernel with the host OS • The file system of the guest OS is visible on the host OS and is part of the host's directory tree • Doesn't use a virtual disk drive (no 15 GB files to manage) • Benefits compared with Xen and KVM • Very fast container creation • Very fast live migration • Easy to externally modify the container file system (e.g., install software in the container) • Scales very well (no big virtual disk images) • Downsides • Must use the same OS as the host OS • Sharing a kernel with the host OS reduces isolation
Type 2 Examples • Oracle VirtualBox • Free VM environment that you can use on Windows, Linux, Mac OS X, and Solaris • Simple to use, a good way to get started • VM images can be exported • In theory… • Depends on the ability of the target virtualization system to import VM disk images • Exports in OVF (Open Virtualization Format) • My personal experience is that you often need a Linux utility to convert the disk image and VM metadata to a format another virtualization system will accept (often complex)
Type 2 Examples • VMware Workstation • Runs as an application on top of Windows • NOT VMware ESX (which is a hypervisor) • Another good way to get started in working with virtualization technology • Parallels for Mac • Can be used to run Windows on a Mac • Commercial software • Personal experience: Works “OK”, but Windows can be slow running on Parallels
OpenVZ vs. KVM • I am using OpenVZ and KVM for two different projects • NEES / NEEShub • Based on HUBzero • Using OpenVZ as a virtual container or “jail” in which to run applications that interface with the user through a VNC window on a web page • OpenNebula cluster to run parallel applications • Distributed rendering using Maya • Batch rendering of animations • OpenSees building simulation program for NEES • Parallel version that uses parallel solvers and MPI • Running on a virtual cluster on OpenNebula in my lab and on FutureGrid • The choice depends on the type of application you wish to run and the environment in which it will be run
Virtualization on Linux • Additional mechanisms in Linux • libvirt / virtio • A veneer library and utilities over the virtualization systems • brctl • Linux virtual network bridge control package • cgroups • Linux feature for controlling the resource use of processes (see the sketch below) • Network virtualization • Network control is a constant problem • VLANs are best, but hard to configure • OpenFlow aims to simplify network management and make it scalable
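A minimal sketch of the cgroups mechanism, assuming the legacy cgroup-v1 CPU controller is mounted at /sys/fs/cgroup/cpu and the script runs as root; the group name is an illustrative placeholder:

    import os

    CG = "/sys/fs/cgroup/cpu/vmlimit"       # illustrative cgroup name

    os.makedirs(CG, exist_ok=True)

    # Allow the group 50 ms of CPU time per 100 ms period (~half a core)
    with open(os.path.join(CG, "cpu.cfs_period_us"), "w") as f:
        f.write("100000")
    with open(os.path.join(CG, "cpu.cfs_quota_us"), "w") as f:
        f.write("50000")

    # Place the current process (and its future children) into the group
    with open(os.path.join(CG, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))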
Moving up from Virtualization • We talked about virtualization at the level of a single system • How can we manage a collection of virtual machines on a single system? • How can we manage a distributed network of computers that host virtual machines? • How can we manage the network and storage for this distributed network of virtual machines? • This is the basis for one aspect of what is called “cloud computing” today • Infrastructure-as-a-Service (IaaS) • The technology used for IaaS is the basis for building virtual HPC clusters – collections of virtual machines running on a distributed network of computers
Cloud Computing • Emerging technology that leverages virtualization • The distributed computing of the 2010s • The initial idea of a “computing utility” goes back to Multics in the 1960s • A computing utility provides services over a network • Computing • Storage • Pushes functionality from devices at the edge (e.g., laptops and mobile phones) to centralized servers
Cloud Computing Architecture • User interface • How users interact with the services running on the cloud • Very simple client hardware • Resources and services index • What services are in the cloud, and where they are located • System Management and Monitoring • Storage and servers
Types of cloud computing systems • Infrastructure as a Service (IaaS) • Software as a Service (SaaS) • Platform as a Service (PaaS) • There are some fundamental differences between these approaches that lead to confusion when talking about “cloud computing” • A cloud computing infrastructure can include one or all of these
Infrastructure as a Service (IaaS) • Virtualization environment • The cloud service provider offers the capability of hosting virtual machines as a service • Cloud computing infrastructure for IaaS focuses on the systems software needed to load, start, and manage virtual machines • Amazon EC2 is one example of IaaS (see the sketch below)
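As a hedged illustration of what “hosting virtual machines as a service” looks like to a programmer, this sketch asks EC2 for one instance using the boto3 Python SDK; boto3 is not mentioned in the slides, and the AMI ID, instance type, and key pair name are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Ask the provider for one VM; the image ID and key pair are placeholders
    response = ec2.run_instances(
        ImageId="ami-xxxxxxxx",          # illustrative AMI, not a real ID
        InstanceType="t2.micro",
        KeyName="my-keypair",
        MinCount=1,
        MaxCount=1,
    )

    instance_id = response["Instances"][0]["InstanceId"]
    print("Requested instance:", instance_id)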
IaaS • Enabling technologies used to provide IaaS • Virtualization layer • VMware • Xen/KVM • OpenVZ • Networking layer • Need to provide a VPN and network security for private VMs • Scheduling layer • Manages the mapping of IaaS requests onto physical and virtual infrastructure • Amazon EC2 provides this • OpenNebula, Eucalyptus, and Nimbus also provide scheduling services
IaaS Benefits • The user doesn't need to own infrastructure • No servers, data center, etc. required • Very low cost of entry • Pay-as-you-go computing • No upfront capital investments needed • Leasing a solution instead of a box • No systems administration or operations staff needed • The cloud computing provider leverages economies of scale
Examples of IaaS • BlueLock in Indianapolis • Commercial IaaS provider • Eucalyptus • Started as a research project at UCSB • Based on Java and Web Services • OpenNebula • Developed in Europe • Leverages standard Linux technologies • ssh, NFS, etc. • Uses a scheduler named Haizea • Nimbus • Research project at Argonne National Lab • Linked with Globus
Platform as a Service (PaaS) • Builds on the virtualization platform • Provides a software stack in addition to the virtualization service • OS, web server, authentication, etc. • APIs and middleware • Useful, for example, if you need a web server but don't want to install Apache, Linux, etc. yourself
Benefits of PaaS • Supported software stack • Don’t need to focus efforts on getting software infrastructure working • Pooled expertise in use of the software at the cloud computing provider • You can focus service and development efforts on just your product • Pay-as-you-go
Examples of PaaS • Amazon Web Services • WikiLeaks was using this • You buy a web service that runs on Amazon's virtualization infrastructure • Downside: outages can take out a lot of services • Netflix also uses Amazon EC2 • Other examples • Google App Engine • Microsoft Azure
Software as a Service (SaaS) • Provides access to software over the Internet • No download/installation of the software is needed • Users can lease or rent software • Was a big idea about a decade ago, and seems to be coming back • Software runs remotely and displays back to the user's computer • Think 'vnc' • NEEShub is an example of this • Researchers can run tools in a window without downloading or installing anything
Benefits of SaaS • No user download/install • Many corporate users don't have permission to install software on their computers • Easier to support • The computing environment is controlled centrally • Can be faster • As long as the server hardware is fast and users have a good network connection • Efficient use of centralized computing infrastructure
Relation of cloud computing to HPC • The use of cloud computing depends on how the HPC application is used • SaaS • NEEShub batchsubmit capability • Allows users to run parallel applications through the NEEShub as a service • Users don't need to be concerned about the underlying infrastructure • IaaS • HPC clusters at the infrastructure level • The problem here is to deploy, operate, and use a collection of VMs that constitute a virtual HPC cluster • The capabilities in this area are focused on VM image and network management and deployment
Relation of cloud computing to HPC From a user's perspective, what do you need to do to use the technology? • SaaS • Discover the application • Launch the application • Monitor execution • Collect and analyze the results • IaaS • Discover the resources needed • Provide a VM image, or create a new one built from provided VM images • Deploy the image on the cloud computing system • Set up the networking among the VM instances • Set up an MPI ring (see the test sketch below) • Launch your application • Monitor execution • Collect and analyze the results • SaaS is a lot simpler than IaaS for users • HUB-based systems such as NEEShub and nanoHUB provide a specific set of applications as a service • However, this limits what a user can do • The problem is how to establish a virtual HPC cluster that users can employ to develop, test, and prepare a parallel application for production use, or to eventually transition the application to a service (SaaS) that can be run in a HUB environment
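Once the VM instances and the MPI ring are in place, a tiny test like the sketch below, assuming mpi4py is installed and the script is launched with something like mpirun -np 4 python hello_mpi.py across the virtual nodes, confirms the virtual cluster is wired together:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()              # this process's ID within the job
    size = comm.Get_size()              # total number of MPI processes
    host = MPI.Get_processor_name()     # which virtual node this rank landed on

    print(f"Hello from rank {rank} of {size} on {host}")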
Example of NEEShub SaaS: Windows application
Example of NEEShub SaaS: Linux application • You can create an account on nees.org and try these tools
Practical notes on using virtualization on Linux • Use virt-manager to create and manage VM images • Images are usually stored in /var/lib/libvirt • Make sure you have enough storage for /var/lib • Or you can change the default location using virsh • Networking can be complicated due to the use of virtual network bridges in Linux • Networking can be very complex – be prepared to work on it to make it work • It is simplest to start with NAT to get your VM on the network • Be cautious about computer security
Practical notes on using virtualization • Managing the network can be tricky • The bridge-utils yum package provides the brctl utilities to create and manage virtual network switches and connections • The external interface connects to the virtual network switch • VMs connect to the virtual switch to share the connection • Virt-manager provides some functionality for this, but basically relies on what is created and managed by bridge-utils