First Steps in the Clouds Kate Keahey keahey@mcs.anl.gov University of Chicago Argonne National Laboratory
Why Clouds? • Resource consumers • Individual users or Virtual Organizations • Requirements • Customized environments for their services/applications • Services/applications can be short-lived • New environments/services deployed quickly and often • Resource providers • Own and operate physical resources • Requirements • Ability to monitor and control their resources • Provide resources at reasonable operational cost • Protection from activities performed by resource consumers • Consumers need to be able to lease (potentially short-term) platforms that they can customize and control
Cloud Computing for Grid Communities: The STAR Application Use Case
The STAR Application • Complex experimental application codes • Developed over more than 10 years, by more than 100 scientists, comprises ~2 M lines of C++ and Fortran code • www.star.bnl.gov • Require complex, customized environments • Rely heavily on the right combination of compiler versions and available libraries • Dynamically load external libraries depending on the task to be performed • Environment validation • To ensure reproducibility and result uniformity across environments • Why do we need a cloud? • Resources with the right configuration are hard to find • A VM-based cloud gives us the required control
Running STAR in a Cloud • First Challenge: finding VM-enabled resources • Amazon Elastic Compute Cloud (EC2) • More Challenges: • Can we use X.509 certs to submit to a cloud? Can we use Grid access protocols? How much manual configuration do we need to do for a cluster that we need for 4 hours? How do we integrate the cluster into the Grid infrastructure? • Workspace Service • X.509 certificates are mapped to a project account • Grid access protocols • Creating a virtual cluster dynamically • Contextualization (cluster context): the cluster node VMs find out about each other and integrate that information at boot time • Integrating the cluster into the Grid • Contextualization (grid context): cluster is configured with appropriate host certs, gridmapfiles, etc.
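The two contextualization steps on this slide can be illustrated with a minimal Python sketch. It is not the Workspace Service implementation: the `Node` type, the function names, the example DN, and the `star` account are all hypothetical, and the code only renders file contents (an `/etc/hosts`-style list for the cluster context, grid-mapfile-style lines for the grid context) without writing anything to disk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    hostname: str  # hypothetical name assigned to the VM
    ip: str        # address the VM reports at boot


def render_hosts(nodes: List[Node]) -> str:
    """Cluster context: one hosts-style line per node, sorted by hostname."""
    ordered = sorted(nodes, key=lambda n: n.hostname)
    return "".join(f"{n.ip}\t{n.hostname}\n" for n in ordered)


def render_gridmap(dn_to_account: Dict[str, str]) -> str:
    """Grid context: map each certificate DN to a local account."""
    return "".join(
        f'"{dn}" {account}\n' for dn, account in sorted(dn_to_account.items())
    )


def contextualize(nodes: List[Node], dn_to_account: Dict[str, str]) -> Dict[str, str]:
    """Return the file contents each node would install at boot (assumed layout)."""
    return {
        "hosts": render_hosts(nodes),
        "grid-mapfile": render_gridmap(dn_to_account),
    }


if __name__ == "__main__":
    cluster = [Node("worker1", "10.0.0.11"), Node("head", "10.0.0.10")]
    files = contextualize(cluster, {"/O=Example/CN=Jane Doe": "star"})
    print(files["hosts"], end="")
    print(files["grid-mapfile"], end="")
```

Keeping the rendering separate from any file writing makes it easy to check what every node would see before a cluster is actually booted.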
[Figure: job monitoring display showing the number of running jobs at each site (VWS/EC2, BNL, WSU, Fermi, PDSF), with job completion and file recovery indicators. With thanks to Jerome Lauret and Doug Olson of the STAR project; presented at CHEP’07.]
[Figure: accelerated display of workflow job state at NERSC PDSF, EC2 (via Workspace Service), and WSU. Y = job number, X = job state.]
What Did We Learn? • Performance was not an issue • The real comparison is having a resource to run on vs not having a resource to run on • Contextualization is key for dynamic virtual cluster deployment • Next steps: a more challenging application
Cloud Computing for Grid Providers: Building the Science Cloud at the University of Chicago
Challenges • Virtualization adoption has been relatively slow among Grid Providers • Challenge: integrating VMs into current provisioning models • Integrate into a site without disrupting the current operation of resources • I.e., be able to run jobs as well as VMs • Non-invasive from the perspective of currently used tools • E.g., no modification to the currently used schedulers and resource managers • Can be used alongside the current mode of operation • Batch jobs • Represent as small a change as possible • Operate within familiar metaphors • Avoid error-generating complexity
Roll Your Own Cloud • The Workspace Pilot • Operates on resources that can support jobs as well as VMs • E.g., have been booted into Xen domain 0 • Non-invasive extension to batch schedulers (e.g., PBS) • Wrappers for submission operation, scheduler signals to operate on VMs • Glidein approach: submits a “pilot program” that prepares a resource slot for VM deployment • E.g., adjusts Xen domain 0 memory • Comes with administrator tools • E.g., kill-all
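To make the glidein idea concrete, here is a small Python sketch that renders a PBS job script whose payload is a pilot rather than a user job. The `#PBS -N` and `#PBS -l` directives are standard PBS; the pilot command `./pilot-placeholder` and its `--dom0-memory-mb` flag are placeholders invented for this illustration and are not the real workspace pilot interface.

```python
def render_pilot_job(nodes: int, walltime_hours: int, dom0_memory_mb: int,
                     pilot_cmd: str = "./pilot-placeholder") -> str:
    """Render a PBS script that reserves a slot and starts a pilot on it.

    pilot_cmd and its flag are hypothetical stand-ins for the real pilot.
    """
    if nodes < 1 or walltime_hours < 1 or dom0_memory_mb < 1:
        raise ValueError("nodes, walltime_hours and dom0_memory_mb must be positive")
    walltime = f"{walltime_hours:02d}:00:00"
    lines = [
        "#!/bin/sh",
        "#PBS -N workspace-pilot",
        f"#PBS -l nodes={nodes}",
        f"#PBS -l walltime={walltime}",
        # The pilot shrinks dom0 memory so that a guest VM fits on the node.
        f"{pilot_cmd} --dom0-memory-mb {dom0_memory_mb}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_pilot_job(nodes=1, walltime_hours=4, dom0_memory_mb=512), end="")
```

Because the scheduler only sees an ordinary batch job, nothing in the scheduler itself has to change, which is the non-invasive property the slide emphasizes.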
Workspace Pilot in Action • Level 1: provision raw resources • Level 2: provision VMs • VMs are decommissioned • Raw resources are decommissioned • [Diagram: Workspace Service, LRM/PBS, and three Xen dom0 nodes hosting VMs]
The Pilot Program • Uses the Xen balloon driver to reduce/restore domain 0 memory so that guest domains (VMs) can be deployed • Secure VM deployment • The pilot requires sudo privilege and thus can be used only with the site administrator’s approval • The workspace service provides fine-grained authorization for all requests • Signal handling • SIGTERM: pilot exceeded its allotted time • Notifies VWS, allows it to clean up • After a configurable time period, takes matters into its own hands • Default policy: one VM per physical node • Available for download • Workspace Release 1.3.1: • http://workspace.globus.org/downloads/index.html
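The SIGTERM behavior described on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the pilot's actual code: the four callbacks (`notify_service`, `service_cleaned_up`, `force_cleanup`, `restore_dom0_memory`) are hypothetical hooks, and restoring domain 0 memory as the final step is an assumption about ordering.

```python
import signal
from typing import Callable


def make_sigterm_handler(notify_service: Callable[[], None],
                         service_cleaned_up: Callable[[float], bool],
                         force_cleanup: Callable[[], None],
                         restore_dom0_memory: Callable[[], None],
                         grace_seconds: float):
    """Build a SIGTERM handler: notify, wait for cleanup, then fall back."""
    def handler(signum, frame):
        notify_service()                           # tell the service time is up
        if not service_cleaned_up(grace_seconds):  # wait up to the grace period
            force_cleanup()                        # pilot cleans up on its own
        restore_dom0_memory()                      # assumed final step
    return handler


if __name__ == "__main__":
    handler = make_sigterm_handler(
        notify_service=lambda: print("notified service"),
        service_cleaned_up=lambda timeout: False,  # pretend the service never answers
        force_cleanup=lambda: print("forced cleanup"),
        restore_dom0_memory=lambda: print("restored dom0 memory"),
        grace_seconds=30.0,
    )
    signal.signal(signal.SIGTERM, handler)  # register for real use
    handler(signal.SIGTERM, None)           # invoke directly to demonstrate
```

Passing the actions in as callbacks keeps the timeout policy separate from how the service is contacted or how VMs are destroyed, so the ordering can be checked without a Xen host.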
Nimbus @ UC • What is it? • The Science Cloud at the University of Chicago • UC TeraPort cluster configured with the workspace pilot • Currently 16 nodes • What can it do for me? • Allow you to “lease out” a cluster of VMs • Who can use it? • Members of the scientific community • Inasmuch as usage policies will allow • What do I need to do if I want to use it? • Contact us: keahey@mcs.anl.gov • You will need a VM image (we can help and know others who can), a certificate, and a simple client
Cloud Interoperability • Moving an app from a hardware platform to a cloud is relatively hard • Need to develop a VM image, learn about cloud computing, figure out logistics • Moving between clouds is very easy • E.g., STAR app EC2 -> Science Cloud and vice versa • Rough consensus on the interfaces needed to provision resources in the cloud • OGF gridvirt-wg • Chairs: Erol Bozak, Wolfgang Reichert • Define the requirements for integration of Grid architecture with system virtualization platforms • Exploring the impact of virtualization on Grid use cases • Exploring the relationship with standards (DMTF, etc.)