
Introduction to Virtual High Performance Computing Clusters
Thomas J. Hacker, July 30, 2012


Presentation Transcript


  1. Introduction to Virtual High Performance Computing Clusters • Thomas J. Hacker • Associate Professor, Computer & Information Technology • Co-Leader for Information Technology, Network for Earthquake Engineering Simulation (NEES) • July 30, 2012

  2. Outline • Motivation for the use of virtualization • Overview of virtualization technology • Overview of cloud computing technology • Relation of cloud computing to HPC • Practical notes on virtualization and cloud computing • Virtual HPC clusters • How to get started

  3. Motivation for virtualization Why virtualization, and when does it make sense? • Clock speed increases have ceased, even though transistor counts continue to follow Moore’s law • Hardware is going multicore, with many cores per chip • E.g., the Intel MIC architecture underlies the new Xeon Phi (Knights Corner) with 50+ cores • Memory capacity of systems is increasing • Up to 512 GB on large systems today

  4. Motivation for virtualization • The traditional approach has been to tie a single application to a single server • An application runs in its own OS image on its own server for manageability and serviceability • This approach no longer makes sense when a server has 50+ cores that a single application cannot use effectively • It is also difficult to run multiple applications on the same system when their OS and library version requirements conflict • VMs are therefore used to partition large-scale servers so that many OS instances run independently of each other

  5. Motivation for virtualization • Virtualization is now commodity technology • The ideas were first developed in the 1960s at IBM for their mainframe computers • Virtualization is used frequently for administrative applications to reduce the hardware footprint in the data center and reduce costs • Like other commodity trends, this is one worth exploiting for HPC • It is an especially useful substitute for the small-scale lab clusters used early in the life cycle of a parallel application

  6. Software ecosystem for applications Software requires a functional ecosystem (similar to Maslow’s hierarchy of needs) • Basic “physiological” needs • Reliable computing platform • Functional operating system platform that is needed by the application • If software isn’t kept up to date, it can conflict with OS upgrades • Adequate disk space, memory, and CPU cores • “Safety” needs • Secure computing environment – no attackers, compromised accounts, etc. • “Sense of security and predictability in the world” • Predictability is essential for replicating results and debugging • “Sense of community” • All of the nodes in the cluster need to be consistent • Same OS version, libraries, etc. • Especially critical for MPI applications • Meeting these basic needs ensures a consistent software ecosystem • A stable platform facilitates software development, testing, and validation of results • Developers and users can begin to trust the software and the results it produces • Provides a strong base for future growth and development of the application

  7. Software ecosystem for applications • Problem: it is difficult for users to control their computing environment for scientific applications • Scientific apps used in projects such as CMS require many specific packages and versions, and it can be very difficult to get central IT organizations to customize and install the necessary software, because they must provide a generic and reliable system for the rest of the user base • Scientific applications go through a life cycle in which they evolve from a single processor, to a few workstations, to small-scale clusters, and finally scale up to very large systems • Building small-scale physical clusters as part of this life cycle is very expensive in equipment, time, and the grad student effort spent running these systems • Scientific users can really benefit from having root access on their own systems to get their codes working and to install any necessary packages

  8. Software ecosystem for applications • Virtual HPC clusters are an attractive and viable alternative to small-scale lab clusters while the applications that need these resources are still “young” and require a lot of customization • On larger systems, virtual clusters are a promising approach to providing system-level checkpointing for large-scale applications • Imagine using a virtualization system on your laptop to develop a 2 or 3 VM virtual cluster with all the packages and optimizations you need, then transferring that VM image to a virtual cluster platform and instantiating dozens (or more) VM images to run a virtual cluster • Fault tolerance is a critical problem for applications as they scale up • There are several levels of checkpointing: • Application level • “On the system” level (e.g., Condor, BLCR) • “Below the system” level using live migration or checkpointing/saving VM images

  9. Reliability • One of the “safety” needs of software in its ecosystem • Problems with reliability and techniques to improve reliability • Large systems can fail often • Severely affects large and/or long running jobs • Very expensive to just restart computation from the beginning • Lots of wasted time on the computer system, and wasted power and cooling

  10. Reliability • A technique to overcome this problem is to frequently save critical program data – called checkpointing • Your program will need to read the saved data when it is restarted and resume computation from the saved state • There is guidance on how often to checkpoint to find a good balance between spending time saving state for “safety” vs. making forward progress in your computation • Daly’s checkpoint formula is a good start

  11. Reliability • Daly checkpoint formula • Used to estimate the optimal compute time between writing checkpoints (a first-order form is given below)
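A commonly quoted first-order form of Daly's estimate (added here for reference, since the slide only names the formula; see Daly's 2006 paper for the higher-order version) is

    \tau_{opt} \approx \sqrt{2 \delta M} - \delta

where \delta is the time needed to write one checkpoint and M is the mean time between failures (MTBF) of the system. For example, with a 5-minute checkpoint write time and a 24-hour MTBF, this suggests checkpointing roughly every two hours of computation.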

  12. Reliability • Research is exploring alternative methods of performing checkpoint operations • System-level checkpointing – BLCR • MPI-level checkpointing • VM-level checkpointing and live migration • The idea is to periodically save the VM state, or to live migrate the VM from ailing to healthier systems

  13. Reliability • Be aware of the need to integrate reliability practices into your application as you design and write your code • At a minimum, structure your code so that it can periodically save the current state of the computation and resume from that saved state after a restart (a minimal sketch follows below)
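A minimal sketch of this structure in Python (illustrative only; the file name, checkpoint interval, and state layout are assumptions, not taken from the slides):

    import os
    import pickle
    import time

    CHECKPOINT_FILE = "state.pkl"   # hypothetical checkpoint path
    CHECKPOINT_INTERVAL = 600       # seconds between saves (assumed)
    TOTAL_STEPS = 1000              # placeholder problem size

    def advance(state):
        # Placeholder for one unit of real computation.
        state["data"].append(state["step"])

    def load_state():
        # Resume from the last checkpoint if one exists, else start fresh.
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "data": []}

    def save_state(state):
        # Write to a temporary file first so a crash mid-write cannot
        # corrupt the previous checkpoint, then atomically replace it.
        tmp = CHECKPOINT_FILE + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT_FILE)

    if __name__ == "__main__":
        state = load_state()
        last_save = time.time()
        while state["step"] < TOTAL_STEPS:
            advance(state)
            state["step"] += 1
            if time.time() - last_save > CHECKPOINT_INTERVAL:
                save_state(state)
                last_save = time.time()
        save_state(state)   # final save so a restart sees the finished state

The same pattern carries over to MPI codes, where each rank writes its share of the state (or rank 0 gathers and writes it) at the same step.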

  14. Overview of Virtualization Technologies • Virtualization is a technique that separates the operating system from the physical computer hardware, and interposes a layer of controlling software (hypervisor) between the hardware and operating system. • Different types of virtualization systems (from Goldberg) • Type 1: hypervisor between “bare metal” and guest operating systems • Type 2: hypervisor between host operating system and guest operating systems • Type 1 examples • VMware, Xen, KVM • OpenVZ • Type 2 examples • Virtual Box, VMware Workstation, Parallels for Mac

  15. Type 1 Virtualization • VMware • High-quality commercial product • We use VMware extensively for NEES • Very useful for transitioning IT infrastructure from SDSC to Purdue for the NEES project • We simply created VM images for each service/server on a few physical servers • We were able to archive the VM images of the services/servers when NEES brought up the NEEShub cyberinfrastructure • Microsoft Hyper-V (Windows)

  16. NEES VMware infrastructure

  17. Type 1 Virtualization • Virtualization systems for Linux • Xen and KVM • Open source virtualization systems based on Linux • Xen • First major virtualization system • Older, seems to be less reliable • KVM • Kernel-based Virtual Machine • Newer, supported by RedHat • OpenVZ • Container based virtualization system

  18. Xen • First version in 2003, and the first popular Linux hypervisor • Integrated into the Linux kernel • Uses paravirtualization • Guest OSs run a modified operating system to interact with the hypervisor • Different from VMware ESX, which uses a custom kernel loaded on the bare hardware • The host OS runs as Domain0 • Guest OSs run as unprivileged domains (DomU) • Used to be supported in a limited form in RedHat and Ubuntu • Has been replaced with KVM in RedHat • Citrix has a commercial version of Xen • Personal experiences using Xen • Works OK for simple virtualization • Complex operations didn’t work as well

  19. KVM • Kernel-based Virtual Machine (KVM) • Built into the Linux kernel • Supported by RedHat • More recent than Xen • Uses QEMU for virtual processor emulation • Allows you to emulate CPU architectures other than Intel • E.g., ARM and SPARC • Supports a wide variety of guest operating systems • Linux • Windows • Solaris • Provides a useful set of management utilities • Virtual Machine Manager • ConVirt

  20. OpenVZ • Container-based virtualization system • Secure, isolated Linux containers • Think of this as a “cage” for an application running in an OpenVZ container • OpenVZ terminology: Virtual Private Servers (VPS), Virtual Environments (VE) • Two major differences from Xen and KVM • The guest OS shares its kernel with the host OS • The file system of the guest OS is visible on the host OS and is part of the host’s directory tree • Doesn’t use a virtual disk drive (no 15 GB files to manage) • Benefits compared with Xen and KVM • Very fast container creation • Very fast live migration • Easy to modify the container file system externally (e.g., install software in the container) • Scales very well (no big virtual disk images) • Downsides • Guests must run Linux and share the host OS kernel (no Windows or other OSs)

  21. Type 2 Examples • Oracle VirtualBox • Free VM environment that you can use on Windows, Linux, Mac OS X, and Solaris • Simple to use, and a good way to get started • VM images can be exported • In theory… • Depends on the ability of the target virtualization system to import VM disk images • Exports in OVF (Open Virtualization Format) • My personal experience is that you often need a Linux utility to convert the disk image and VM metadata to a format the other virtualization system will accept (often complex)

  22. Type 2 Examples • VMware Workstation • Runs as an application on top of Windows • NOT VMware ESX (which is a hypervisor) • Another good way to get started in working with virtualization technology • Parallels for Mac • Can be used to run Windows on a Mac • Commercial software • Personal experience: Works “OK”, but Windows can be slow running on Parallels

  23. OpenVZ vs. KVM • I am using OpenVZ and KVM for two different projects • NEES / NEEShub • Based on HUBzero • Uses OpenVZ as a virtual container or “jail” in which to run applications that interface with the user through a VNC window on a webpage • OpenNebula cluster to run parallel applications • Distributed rendering using Maya • Batch rendering animations • OpenSees building simulation program for NEES • Parallel version that uses parallel solvers and MPI • Running on a virtual cluster on OpenNebula in my lab and on FutureGrid • The choice depends on the type of application you wish to run and the environment in which it will be run

  24. Virtualization on Linux • Additional mechanisms in Linux • libvirt / virtio • Veneer library and utilities over virtualization systems (a short example follows below) • brctl • Linux virtual network bridge control package • cgroups • Linux feature for controlling the resource use of processes • Network virtualization • Network control is a constant problem • VLANs are best, but hard to configure • OpenFlow aims to simplify network management and make it scalable
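As a small illustration of what the libvirt layer looks like from code (a sketch only; it assumes the python-libvirt bindings are installed and that a local KVM/QEMU hypervisor is reachable at qemu:///system):

    import libvirt

    # Connect to the local KVM/QEMU hypervisor managed by libvirt.
    conn = libvirt.open("qemu:///system")
    if conn is None:
        raise SystemExit("failed to connect to the hypervisor")

    # List every defined domain (VM) and report whether it is running.
    for dom in conn.listAllDomains():
        state, _reason = dom.state()
        running = (state == libvirt.VIR_DOMAIN_RUNNING)
        print(f"{dom.name():20s} running={running}")

    conn.close()

The same API can define, start, and stop domains, which is what graphical tools such as virt-manager build on.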

  25. Moving up from Virtualization • We talked about virtualization on a single-system level • How can we manage a collection of virtual machines on a single system? • How can we manage a distributed network of computers that host virtual machines? • How can we manage the network and storage for this distributed network of virtual machines? • This is the basis for one aspect of what is called “cloud computing” today • Infrastructure-as-a-Service (IaaS) • The technology used for IaaS is the basis for building virtual HPC clusters: a collection of virtual machines running on a distributed network of computers

  26. Overview of Cloud Computing Technologies

  27. Cloud Computing • Emerging technology that leverages virtualization • The distributed computing of the 2010s • The initial idea of a “computing utility” goes back to Multics in the 1960s • A computing utility provides services over a network • Computing • Storage • Pushes functionality from devices at the edge (e.g., laptops and mobile phones) to centralized servers

  28. Cloud Computing Architecture • User interface • How users interact with the services running on the cloud • Very simple client hardware • Resources and services index • What services are in the cloud, and where they are located • System Management and Monitoring • Storage and servers

  29. Types of cloud computing systems • Infrastructure as a Service (IaaS) • Software as a Service (SaaS) • Platform as a Service (PaaS) • There are some fundamental differences between these approaches that lead to confusion when talking about “cloud computing” • A cloud computing infrastructure can include one or all of these

  30. Infrastructure as a Service (IaaS) • Virtualization environment • The cloud service provider offers the capability of hosting virtual machines as a service • The cloud computing infrastructure for IaaS focuses on the systems software needed to load, start, and manage virtual machines • Amazon EC2 is one example of IaaS (see the sketch below)
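To make the IaaS idea concrete, here is a minimal sketch of provisioning a VM on EC2 using the boto3 Python library (boto3 postdates this presentation and is only one of several client libraries; the AMI ID, region, and instance type below are placeholders, not recommendations):

    import boto3

    # Connect to the EC2 service in a placeholder region.
    ec2 = boto3.resource("ec2", region_name="us-east-1")

    # Request a single small instance from a placeholder machine image (AMI).
    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
        InstanceType="t2.micro",           # placeholder instance type
        MinCount=1,
        MaxCount=1,
    )

    # Wait until the instance is running, then print its public address.
    instance = instances[0]
    instance.wait_until_running()
    instance.reload()
    print(instance.id, instance.public_ip_address)

The point is that the user asks for infrastructure (an image, a size, a count) and gets back running machines, with no data center of their own involved.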

  31. IaaS • Enabling technologies used to provide IaaS • Virtualization layer • VMware • Xen/KVM • OpenVZ • Networking layer • Need to provide a VPN and network security for private VMs • Scheduling layer • Manages the mapping of IaaS requests to physical and virtual infrastructure • Amazon EC2 provides this • OpenNebula, Eucalyptus, and Nimbus also provide scheduling services

  32. IaaS Benefits • The user doesn’t need to own infrastructure • No servers, data center, etc. required • Very low cost of entry • Pay-as-you-go computing • No upfront capital investments needed • Leasing a solution instead of a box • No systems administration or operations staff needed • The cloud computing provider leverages economies of scale

  33. Examples of IaaS • BlueLock in Indianapolis • Commercial IaaS provider • Eucalyptus • Started as a research project at UCSB • Based on Java and Web Services • OpenNebula • Developed in Europe • Leverages the usual Linux technologies • ssh, NFS, etc. • Uses a scheduler named Haizea • Nimbus • Research project at Argonne National Lab • Linked with Globus

  34. Platform as a Service (PaaS) • Builds on the virtualization platform • Provides a software stack in addition to the virtualization service • OS, web server, authentication, etc. • APIs and middleware • Useful, for example, if you need a web server but don’t want to install and maintain Apache, Linux, etc. yourself

  35. Benefits of PaaS • Supported software stack • Don’t need to focus efforts on getting software infrastructure working • Pooled expertise in use of the software at the cloud computing provider • You can focus service and development efforts on just your product • Pay-as-you-go

  36. Examples of PaaS • Amazon Web Services (AWS) • Wikileaks was using this • You buy a web service that runs on Amazon’s virtualization infrastructure • Downside: outages can take out a lot of services • Netflix also uses Amazon EC2 • Other examples • Google App Engine • Microsoft Azure

  37. Software as a Service (SaaS) • Provides access to software over the Internet • No download/installation of the software is needed • Users can lease or rent software • Was a big idea about a decade ago, and seems to be coming back • The software runs remotely and displays back to the user’s computer • Think ‘vnc’ • NEEShub is an example of this • Researchers can run tools in a window without download/install

  38. Benefits of SaaS • No user download/install • Many corporate users don’t have permission to install software on their computers • Easier to support • The computing environment is controlled centrally • Can be faster • As long as the server hardware is fast and users have a good network connection • Efficient use of centralized computing infrastructure

  39. Relation of cloud computing to HPC • Use of cloud computing depends on how the HPC application is used • SaaS • NEEShub batchsubmit capability • Allows users to run parallel applications through the NEEShub as a service • Users don’t need to be concerned about the underlying infrastructure • IaaS • HPC clusters at the infrastructure level • The problem here is to deploy, operate, and use a collection of VMs that constitute a virtual HPC cluster • The capabilities in this area are focused on VM image and network management and deployment

  40. Relation of cloud computing to HPC From a user’s perspective, what do you need to do to use the technology? • SaaS • Discover the application • Launch the application • Monitor execution • Collect and analyze the results • IaaS • Discover the resources needed • Provide a VM image or create a new one built from provided VM images • Deploy the image on the cloud computing system • Set up the networking among the VM instances • Set up an MPI ring • Launch your application (a minimal MPI test program is sketched below) • Monitor execution • Collect and analyze the results • SaaS is a lot simpler than IaaS for users • HUB-based systems such as NEEShub and nanoHUB provide a specific set of applications as a service • However, this limits what a user can do • The problem is how to establish a virtual HPC cluster that users can employ to develop, test, and prepare a parallel application for production use, or to eventually transition the application to a service (SaaS) that can be run in a HUB environment
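Once the VM images are deployed and the MPI ring is configured, the application itself can be something as simple as the following mpi4py sketch (an illustration, not part of the slides; it assumes mpi4py and an MPI library are installed in every VM image and that the job is launched with mpirun over a hostfile listing the VMs):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank (one per core across the virtual cluster) reports in,
    # confirming that the MPI ring spans all of the VM instances.
    name = MPI.Get_processor_name()
    print(f"rank {rank} of {size} on host {name}")

    # A trivial collective operation to verify communication between VMs.
    total = comm.allreduce(rank, op=MPI.SUM)
    if rank == 0:
        print(f"sum of ranks = {total}")

If this runs correctly across all of the VMs, the virtual cluster is ready for the real parallel application.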

  41. Example of NEEShub SaaS: Windows application

  42. Example of NEEShub SaaS: Linux application • You can create an account on nees.org and try these tools

  43. Practical notes on using virtualization on Linux • Use virt-manager to create and manage VM images • Images are usually stored in /var/lib/libvirt • Make sure you have enough storage for /var/lib, or change the default location using virsh (a quick way to check pool locations is sketched below) • Networking can be complicated due to the use of virtual network bridges in Linux – be prepared to spend time making it work • It is simplest to start with NAT to get your VM on the network • Be cautious about computer security
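If you are unsure where your images will land, a quick check with the python-libvirt bindings (a sketch under the same qemu:///system assumption as before) prints the on-disk target path of each storage pool:

    import xml.etree.ElementTree as ET
    import libvirt

    conn = libvirt.open("qemu:///system")

    # Print the target directory of every storage pool; the "default"
    # pool normally maps to /var/lib/libvirt/images.
    for pool in conn.listAllStoragePools():
        desc = ET.fromstring(pool.XMLDesc(0))
        path = desc.findtext("./target/path")
        print(pool.name(), "->", path)

    conn.close()

If the default pool sits on a small root partition, define a new pool on a larger filesystem (virsh pool-define-as and pool-start) before creating large VM disks.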

  44. Practical notes on using virtualization • Managing the network can be tricky • The bridge-utils yum package provides the brctl utilities to create and manage virtual network switches and connections • The external interface connects to the virtual network switch • VMs connect to the virtual switch to share the connection • virt-manager provides some functionality for this, but basically relies on what is created and managed by bridge-utils
