
FermiCloud Project: Integration, Development, Futures


Presentation Transcript


  1. FermiCloud Project: Integration, Development, Futures
     Gabriele Garzoglio, Associate Head, Grid & Cloud Computing Department, Fermilab
     Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359

  2. FermiCloud Development Goals
     • Goal: make virtual-machine-based workflows practical for scientific users:
       • Cloud bursting: send virtual machines from the private cloud to a commercial cloud when needed.
       • Grid bursting: expand grid clusters into the cloud based on demand for batch jobs in the queue.
       • Federation: let a set of users operate across different clouds.
       • Portability: move virtual machines from desktop to FermiCloud to commercial cloud and back.
       • Fabric studies: enable access to hardware capabilities via virtualization (100G, InfiniBand, ...).

  3. Overlapping Phases (shown on the slide as overlapping spans on a timeline, with "Today" marked)
     • Phase 1: "Build and Deploy the Infrastructure"
     • Phase 2: "Deploy Management Services, Extend the Infrastructure and Research Capabilities"
     • Phase 3: "Establish Production Services and Evolve System Capabilities in Response to User Needs & Requests"
     • Phase 4: "Expand the service capabilities to serve more of our user communities"

  4. FermiCloud Phase 4: "Expand the service capabilities to serve more of our user communities" (specifications being finalized)
     • Complete the deployment of the true multi-user filesystem on top of a distributed & replicated SAN.
     • Demonstrate interoperability and federation:
       • Accepting VMs as batch jobs,
       • Interoperation with other Fermilab virtualization infrastructures (GPCF, VMware),
       • Interoperation with the KISTI cloud, Nimbus, Amazon EC2, and other community and commercial clouds.
     • Participate in the Fermilab 100 Gb/s network testbed:
       • We have just taken delivery of 10 Gbit/s cards.
     • Perform more "virtualized MPI" benchmarks and run some real-world scientific MPI codes:
       • The priority of this work will depend on finding a scientific stakeholder interested in this capability.
     • Reevaluate available open-source cloud computing stacks:
       • Including OpenStack,
       • We will also reevaluate the latest versions of Eucalyptus, Nimbus and OpenNebula.

  5. FermiCloud Phase 5 (specifications under development)
     • See the program of work in the (draft) Fermilab-KISTI FermiCloud CRADA.
     • This phase will also incorporate any work or changes that arise out of the Scientific Computing Division strategy on clouds and virtualization.
     • Additional requests and specifications are also being gathered!

  6. Proposed KISTI Joint Project (CRADA)
     • Partners:
       • Grid and Cloud Computing Dept. @ FNAL,
       • Global Science Experimental Data hub Center @ KISTI.
     • Project title: Integration and Commissioning of a Prototype Federated Cloud for Scientific Workflows.
     • Status:
       • Finalizing the CRADA for DOE and KISTI approval,
       • Work intertwined with all goals of Phase 4 (and potentially future phases).
     • Three major work items:
       • Virtual Infrastructure Automation and Provisioning,
       • Interoperability and Federation of Cloud Resources,
       • High-Throughput Fabric Virtualization.
     • Note that this is potentially a multi-year program: the work during FY13 is a "proof of principle".

  7. Virtual Infrastructure Automation and Provisioning [CRADA Work Item 1]
     • Accepting virtual machines as batch jobs via cloud APIs,
     • Grid and cloud bursting,
     • Launch pre-defined user virtual machines based on workload demand,
     • Completion of idle machine detection and resource reclamation:
       • Detect when a VM is idle – DONE Sep '12,
       • Convey VM status and apply policy to affect the state of the cloud – TODO.

  8. Virtual Machines as Jobs
     • OpenNebula (like all other open-source IaaS stacks) provides an emulation of Amazon EC2.
     • The Condor team has added code to their "Amazon EC2" universe to support the X.509-authenticated protocol.
     • Planned use case: GlideinWMS running Monte Carlo on both public and private clouds.
     • The feature already exists; this is a testing/integration task only (a submit sketch follows below).
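To make the "VM as a batch job" idea concrete, here is a minimal sketch of submitting a VM through Condor's EC2-style grid interface against an EC2-compatible endpoint such as OpenNebula's econe service. The endpoint URL, AMI id and credential file paths are placeholders, and the X.509-authenticated variant mentioned on the slide is not shown here.

```python
# Illustrative only: launch a VM as a Condor job via the EC2 grid type.
import subprocess
import tempfile

submit_description = """\
universe      = grid
# Placeholder EC2-compatible endpoint (e.g. an OpenNebula econe service)
grid_resource = ec2 https://fermicloud.example.gov:8443/
ec2_ami_id            = ami-00000001
ec2_instance_type     = m1.small
ec2_access_key_id     = /path/to/access_key_file
ec2_secret_access_key = /path/to/secret_key_file
# For EC2 jobs the executable is only a label; the real payload is the VM image
executable = fermicloud-worker-vm
log        = vm_job.log
queue
"""

with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
    f.write(submit_description)
    submit_file = f.name

# Hand the description to condor_submit exactly like a regular batch job.
subprocess.run(["condor_submit", submit_file], check=True)
```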

  9. Grid Bursting
     • Seo-Young Noh, a KISTI visitor at FNAL, showed a proof of principle of "vCluster" in summer 2011:
       • Look ahead at the Condor batch queue,
       • Submit worker-node virtual machines of various VOs to FermiCloud or Amazon EC2 based on user demand,
       • Machines join the grid cluster and run grid jobs from the matching VO.
     • Need to strengthen the proof of principle, then make cloud slots available to FermiGrid (the look-ahead idea is sketched below).
     • Several other institutions have expressed interest in extending vCluster to other batch systems such as Grid Engine.
     • KISTI staff has a program of work for the development of vCluster.
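The following is a minimal sketch of the "look ahead at the batch queue" idea behind vCluster, not the actual vCluster implementation. The idle-job threshold, the worker template name and the use of the OpenNebula CLI to instantiate VMs are assumptions made for illustration.

```python
# Sketch: boot worker-node VMs when the Condor queue has a backlog of idle jobs.
import subprocess

IDLE_THRESHOLD = 10                    # assumed: boot VMs once this many jobs are idle
WORKER_TEMPLATE = "fermigrid-worker"   # hypothetical OpenNebula template name
MAX_NEW_VMS = 5

def count_idle_jobs():
    """Count idle (JobStatus == 1) jobs in the local Condor queue."""
    out = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 1",
         "-format", "%d\n", "ClusterId"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

def launch_worker_vms(n):
    """Instantiate n worker-node VMs from a pre-defined OpenNebula template."""
    for _ in range(n):
        subprocess.run(["onetemplate", "instantiate", WORKER_TEMPLATE], check=True)

if __name__ == "__main__":
    idle = count_idle_jobs()
    if idle > IDLE_THRESHOLD:
        # Scale the number of new VMs with the backlog, up to a cap.
        launch_worker_vms(min(MAX_NEW_VMS, idle // IDLE_THRESHOLD))
```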

  10. vCluster at SC2012

  11. Cloud Bursting
     • OpenNebula already has a built-in "cloud bursting" feature to send machines to Amazon EC2 if the OpenNebula private cloud is full.
     • We need to evaluate and test it, to see whether it meets our technical and business requirements or whether something else is necessary.
     • We also need to test interoperability against other stacks.

  12. True Idle VM Detection
     • In times of resource need, we want the ability to suspend or "shelve" idle VMs in order to free up resources for higher-priority usage.
     • This is especially important in the event of constrained resources (e.g. during a building or network failure).
     • Shelving of "9x5" and "opportunistic" VMs allows us to use FermiCloud resources for Grid worker-node VMs during nights and weekends; this is part of the draft economic model.
     • Giovanni Franzini (an Italian co-op student) has written extensible code for an "Idle VM Probe" that can be used to detect idle virtual machines based on CPU, disk I/O and network I/O (an illustrative probe is sketched below).
     • This is the biggest pure coding task left in the FermiCloud project; if the KISTI joint project is approved, it is a good candidate for a 3-month consultant.
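For illustration, here is a minimal probe in the spirit of the Idle VM Probe described above; the real probe is separate code and also considers disk and network I/O. The CPU-utilization threshold, sampling interval and libvirt URI below are arbitrary assumptions.

```python
# Sketch: flag VMs whose CPU utilization stays below a threshold over a window.
import time
import libvirt

CPU_IDLE_THRESHOLD = 0.02   # assumed: < 2% CPU over the sample window counts as idle
SAMPLE_SECONDS = 60

def idle_vms(uri="qemu:///system"):
    conn = libvirt.open(uri)
    doms = conn.listAllDomains()

    # dom.info() returns [state, maxMem, memory, nrVirtCpu, cpuTime(ns)]
    before = {d.name(): d.info()[4] for d in doms}
    time.sleep(SAMPLE_SECONDS)

    idle = []
    for d in doms:
        cpu_ns = d.info()[4] - before.get(d.name(), 0)
        ncpu = d.info()[3]
        utilization = cpu_ns / (SAMPLE_SECONDS * 1e9 * ncpu)
        if utilization < CPU_IDLE_THRESHOLD:
            idle.append(d.name())
    return idle

if __name__ == "__main__":
    print("Candidate idle VMs:", idle_vms())
```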

  13. Idle VM Information Flow
     [Diagram: VMs feed an Idle VM Collector, which populates a Raw VM State DB; within the Idle VM Management Process, the Idle VM Logic and Idle VM Trigger produce an Idle VM List that drives the Idle VM Shutdown step.]

  14. Interoperability and Federation [CRADA Work Item 2]
     • Driver: global scientific collaborations such as the LHC experiments will have to interoperate across facilities with heterogeneous cloud infrastructure.
     • European efforts:
       • EGI Cloud Federation Task Force: several institutional clouds (OpenNebula, OpenStack, StratusLab),
       • HelixNebula: a federation of commercial cloud providers.
     • Our goals:
       • Show a proof of principle: a federation including FermiCloud + the KISTI "G Cloud" + one or more commercial cloud providers + other research-institution community clouds if possible,
       • Participate in existing federations if possible.
     • Core competency: the FermiCloud project can contribute to these cloud federations given our expertise in X.509 authentication and authorization and our long experience in grid federation.

  15. Virtual Image Formats
     • Different clouds use different virtual machine image formats: file system layout, partition table, LVM volumes, kernel handling, etc.
     • We have to identify the differences and find tools to convert between formats where necessary (an example conversion is sketched below).
     • This is an integration/testing issue, not expected to involve new coding.
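As an example of the kind of conversion tool in question, the sketch below uses the standard qemu-img utility to convert between two common image formats. The file names are placeholders.

```python
# Illustrative image-format conversion with qemu-img.
import subprocess

# Convert a qcow2 image (common on KVM/OpenStack setups) to a raw image.
subprocess.run(
    ["qemu-img", "convert", "-f", "qcow2", "-O", "raw",
     "worker-node.qcow2", "worker-node.img"],
    check=True,
)

# Inspect the result (format, virtual size, etc.).
subprocess.run(["qemu-img", "info", "worker-node.img"], check=True)
```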

  16. Interoperability/Compatibility of APIs
     • The Amazon EC2 API is not open source; it is a moving target that changes frequently.
     • Open-source emulations have varying feature levels and accuracy of implementation:
       • Compare and contrast OpenNebula, OpenStack, and commercial clouds (a comparison harness is sketched below),
       • Identify the lowest common denominator(s) that work on all of them.
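One way to script such a comparison is to exercise the same EC2-style call against each back end and compare what comes back; the sketch below uses the present-day boto3 client for clarity. The endpoint URLs and credentials are placeholders, and the FermiCloud econe URL is hypothetical.

```python
# Sketch: probe an EC2-compatible call on several endpoints and compare results.
import boto3

endpoints = {
    "amazon":     "https://ec2.us-east-1.amazonaws.com",
    "fermicloud": "https://fermicloud.example.gov:8443/",   # hypothetical econe endpoint
}

for name, url in endpoints.items():
    client = boto3.client(
        "ec2",
        endpoint_url=url,
        region_name="us-east-1",
        aws_access_key_id="ACCESS_KEY",        # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )
    try:
        # A call most emulations implement; compare what each endpoint returns.
        images = client.describe_images(Owners=["self"])["Images"]
        print(name, "->", len(images), "images visible")
    except Exception as exc:
        print(name, "-> call failed:", exc)
```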

  17. VM Image Distribution
     • Investigate existing image marketplaces (HEPiX, U. of Victoria).
     • Investigate whether we need an Amazon S3-like storage/distribution method for OS images:
       • OpenNebula doesn't have one at present,
       • A GridFTP "door" to the OpenNebula VM library is a possibility (a client-side sketch follows below); this could be integrated with an automatic security-scan workflow using the existing Fermilab NESSUS infrastructure.
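Purely for illustration, this is what pushing an image through such a GridFTP door could look like from the client side using the standard globus-url-copy tool. The gsiftp:// URL and local path are hypothetical; no such door exists yet, it is only listed above as a possibility.

```python
# Sketch: transfer a VM image to a hypothetical GridFTP "door".
import subprocess

src = "file:///var/lib/one/images/sl6-worker.img"                            # placeholder local image
dst = "gsiftp://imagedoor.fermicloud.example.gov/vmlibrary/sl6-worker.img"   # hypothetical door URL

# globus-url-copy uses the caller's grid proxy for X.509 authentication.
subprocess.run(["globus-url-copy", src, dst], check=True)
```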

  18. High-Throughput Fabric Virtualization [CRADA Work Item 3]
     • Follow up on the earlier virtualized MPI work:
       • Use it in real scientific workflows,
       • Example: simulation of data acquisition systems (the existing FermiCloud InfiniBand fabric has already been used for this).
     • Will also use FermiCloud machines on the 100 Gbit Ethernet test bed:
       • Evaluate and optimize virtualization of 10G NICs for the use case of HEP data management applications,
       • Compare and contrast against InfiniBand.

  19. Long Term Vision [another look at some work in the FermiCloud project phases] (shown on the slide as a timeline with "Today" marked)
     • FY10-12:
       • Batch system support (CDF) at KISTI,
       • Proof of principle of "grid bursting" via vCluster on FermiCloud,
       • Demonstration of virtualized InfiniBand.
     • FY13:
       • Demonstration of VM image transfer across clouds,
       • Proof-of-principle launch of a VM as a job,
       • Investigation of options for federation interoperations.
     • FY14:
       • Integration of vCluster for dynamic resource provisioning,
       • Infrastructure development for federation interoperations,
       • Broadening of supported fabric utilization.
     • FY15:
       • Transition of the federated infrastructure to production,
       • Full support of established workflow fabric utilization,
       • Fully automated provisioning of resources for scientific workflows.

  20. High-level Architecture
     [Diagram: the planned architecture, showing vCluster (FY14) on top of a Federated Authorization Layer; provisioning through Cloud API 1 and Cloud API 2; an image repository and a job/VM queue; VM workflow submission; idle detection and resource reclamation; virtualized InfiniBand; 100 Gbps networking; and cloud federation.]

  21. OpenNebula Authentication
     • OpenNebula came with "pluggable" authentication, but few plugins were initially available.
     • The OpenNebula 2.0 web services by default used an "access key" / "secret key" mechanism similar to Amazon EC2; no HTTPS was available.
     • Four ways to access OpenNebula:
       • Command-line tools,
       • Sunstone web GUI,
       • "ECONE" web service emulation of the Amazon RESTful (Query) API,
       • OCCI web service.
     • The FermiCloud project wrote X.509-based authentication plugins:
       • Patches to OpenNebula to support this were developed at Fermilab and submitted back to the OpenNebula project in Fall 2011 (generally available in OpenNebula v3.2 onwards),
       • X.509 plugins are available for command-line and web-services authentication.

  22. Grid AuthZ Interoperability Profile
     • Uses XACML 2.0 to specify DN, CA, hostname, FQAN, FQAN signing entity, and more.
     • Developed in 2007; (still) used in the Open Science Grid and in EGEE.
     • Currently in the final stage of OGF standardization.
     • Java and C bindings are available for authorization clients:
       • The most commonly used C binding is LCMAPS,
       • Used to talk to GUMS, SAZ, and SCAS.
     • Allows one user to be part of different Virtual Organizations and have different groups and roles.
     • For cloud authorization we will configure GUMS to map back to individual user names, one per person:
       • Each personal account in OpenNebula is created in advance.

  23. X.509 Authorization
     • The OpenNebula authorization plugins are written in Ruby.
     • They use existing grid routines to call external GUMS and SAZ authorization servers:
       • Use a Ruby-C binding to call the C-based LCMAPS routines, or
       • Use a Ruby-Java bridge to call Java-based routines from the Privilege project.
     • GUMS returns uid/gid; SAZ returns yes/no.
     • Works with the OpenNebula command line and non-interactive web services.
     • Tricky part: how to load a user credential with extended attributes into a web browser?
       • It can be done, but with a high degree of difficulty (PKCS12 conversion of a VOMS proxy with the full certificate chain included),
       • We have a side project to make it easier/automated.
     • Currently we have a proof of principle running on our demo system, command line only.
     • We are in frequent communication with the LCMAPS developers at NIKHEF and the OpenNebula developers.

  24. Virtualized Storage Service Investigation
     • Motivation:
       • General-purpose systems from various vendors are being used as file servers,
       • These systems can have many more cores than are needed to perform the file service,
       • Cores go unused => inefficient use of power, space and cooling,
       • Custom configurations => complicated sparing issues.
     • Questions:
       • Can virtualization help here?
       • What (if any) is the virtualization penalty?

  25. Virtualized Storage Server Test Procedure
     • Evaluation: use IOzone and real physics ROOT-based analysis code (an example IOzone run is sketched below).
     • Phase 1:
       • Install the candidate filesystem on a "bare metal" server,
       • Evaluate performance using a combination of bare-metal and virtualized clients (varying the number),
       • Also run client processes on the "bare metal" server itself,
       • Determine the "bare metal" filesystem performance.
     • Phase 2:
       • Install the candidate filesystem on a virtual machine server,
       • Evaluate performance using a combination of bare-metal and virtualized clients (varying the number),
       • Also use client virtual machines hosted on the same physical machine as the virtual machine server,
       • Determine the virtual machine filesystem performance.
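As a hedged example of the kind of IOzone run used in such comparisons, the sketch below drives a sequential write/read throughput test from a client; the mount point, file sizes, record sizes and thread count are placeholders, not the actual benchmark parameters used in the study.

```python
# Sketch: a multi-threaded IOzone throughput run against the filesystem under test.
import subprocess

MOUNT = "/mnt/lustre"   # placeholder mount point of the filesystem under test

# Sequential write (-i 0) and read (-i 1) throughput with 4 client threads,
# 1 GB per file and 1 MB records; -F names one test file per thread.
subprocess.run(
    ["iozone", "-i", "0", "-i", "1", "-r", "1m", "-s", "1g", "-t", "4",
     "-F"] + [f"{MOUNT}/iozone.{n}" for n in range(4)],
    check=True,
)
```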

  26. FermiCloud Test Bed - "Bare Metal" Server
     [Diagram: FCL nodes (3 data and 1 name node; Dom0 with 8 CPUs and 24 GB RAM) and ITB / FCL external clients (7 nodes, 21 VMs) mounting a 2 TB, 6-disk Lustre server over Ethernet, with BlueArc storage also shown.]
     • Lustre server: Lustre 1.8.3 on SL5 (kernel 2.6.18-164.11.1), 3 OSS (different striping),
       • CPU: dual quad-core Xeon E5640 @ 2.67 GHz with 12 MB cache, 24 GB RAM,
       • Disk: 6 SATA disks in RAID 5 for 2 TB, plus 2 system disks (hdparm: 376.94 MB/sec),
       • 1 GbE + InfiniBand cards.
     • Clients: dual quad-core Xeon X5355 @ 2.66 GHz with 4 MB cache, 16 GB RAM; 3 Xen SL5 VMs per machine, 2 cores / 2 GB RAM each.

  27. FermiCloud Test Bed - Virtualized Server
     [Diagram: FCL nodes (3 data and 1 name node; Dom0 with 8 CPUs and 24 GB RAM) and ITB / FCL external clients (7 nodes, 21 VMs) mounting a 2 TB, 6-disk storage server VM, with BlueArc storage and optional on-board client VMs also shown.]
     • External clients: dual quad-core Xeon X5355 @ 2.66 GHz with 4 MB cache, 16 GB RAM; 3 Xen SL5 VMs per machine, 2 cores / 2 GB RAM each.
     • On-board client VMs (7x): 8 KVM VMs per machine, 1 core / 2 GB RAM each.

  28. Virtualized File Service Results Summary
     • See the ISGC '11 talk and proceedings for the details: http://indico3.twgrid.org/indico/getFile.py/access?contribId=32&sessionId=36&resId=0&materialId=slides&confId=44

  29. Summary
     • Collaboration:
       • Leveraging external work as much as possible,
       • Contributing our work back to external collaborations.
     • Using (and if necessary extending) existing standards: AuthZ, OGF UR, Gratia, etc.
     • A relatively small amount of development FTEs has placed FermiCloud in a potential leadership position!
     • CRADA with KISTI: could be the beginning of a beautiful relationship...

  30. Thank You! Any Questions?

  31. X.509 Authentication - How It Works
     • Command line:
       • The user creates an X.509-based token using the "oneuser login" command,
       • This makes a base64 hash of the user's proxy and certificate chain, combined with a username:expiration date, signed with the user's private key (a conceptual sketch follows below).
     • Web services:
       • The web services daemon contacts the OpenNebula XML-RPC core on the user's behalf, using the host certificate to sign the authentication token,
       • Apache mod_ssl or gLite's GridSite is used to pass the grid certificate DN (and optionally FQAN) to the web services.
     • Limitations: with web services, one DN can map to only one user.
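To illustrate the idea of the signed, base64-encoded login token described above (and only the idea: this is not the exact byte layout or wire format that OpenNebula's x509 driver produces), here is a conceptual sketch. The user name, key and proxy paths are placeholders, an unencrypted RSA key is assumed for brevity, and the one-hour validity is an arbitrary choice.

```python
# Conceptual sketch of a "username:expiration" token signed with the user's key.
import base64
import time

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

USERNAME = "onetestuser"                       # hypothetical OpenNebula user name
KEY_PEM = "/home/user/.globus/userkey.pem"     # hypothetical (unencrypted) RSA private key
CHAIN_PEM = "/tmp/x509up_u500"                 # hypothetical proxy / certificate chain

with open(KEY_PEM, "rb") as f:
    private_key = serialization.load_pem_private_key(f.read(), password=None)
with open(CHAIN_PEM, "rb") as f:
    chain = f.read()

expiration = int(time.time()) + 3600           # assumed one-hour validity
text = f"{USERNAME}:{expiration}".encode()

# Sign the username:expiration text with the user's private key.
signature = private_key.sign(text, padding.PKCS1v15(), hashes.SHA256())

# Bundle text, signature and certificate chain, then base64-encode the result.
token = base64.b64encode(b"\n".join([text, signature, chain]))
print(token.decode()[:60] + "...")
```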

  32. "Authorization" in OpenNebula
     • Note: OpenNebula has pluggable "authorization" modules as well.
     • These control access ACLs, namely which user can launch a virtual machine, create a network, store an image, etc.
     • They focus on privilege management rather than the typical grid-based notion of access authorization.
     • Therefore, we make our "authorization" additions in the authentication routines of OpenNebula.
