Enabling Cost-Effective Resource Leases with Virtual Machines
HPDC 2007 Hot Topics Session
Borja Sotomayor, University of Chicago, borja@cs.uchicago.edu
Kate Keahey, Argonne National Laboratory/University of Chicago, keahey@mcs.anl.gov
Ian Foster, Argonne National Laboratory/University of Chicago, keahey@mcs.anl.gov
Tim Freeman, Argonne National Laboratory/University of Chicago, tfreeman@mcs.anl.gov
Motivation
• Leasing resources for short periods of time can be of great value to many applications.
  • Workflows, real-time applications, and applications requiring resource co-scheduling.
• Leasing semantics
  • The glidein approach: Condor glideins, MyCluster, and Falkon
  • Advance reservations
  • Meta-scheduling, deadlines, demos
  • Utilization problems
• We argue that virtualization can make resource leasing cost-effective, despite the overhead of using VMs, thus:
  • Providing an incentive for resource providers to allow short-term leasing of resources.
  • Creating an opportunity for scientific applications (resource consumers) that require multi-level scheduling.
Approach
• Separate resource provisioning from execution management.
  • Resource provisioning is handled by a new component called the Lease Manager
  • Execution management can continue to be handled by a site's current scheduler (PBS/Maui, SGE, Condor, ...)
• All provisioning is handled via the use of VMs
  • Including provisioning resources for a batch job
• Use the VMs' suspend/resume mechanisms to backfill and suspend non-interactive/batch applications
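To make the last point concrete, here is a minimal sketch of how suspend/resume changes the completion time of one batch job on one node that also holds an advance reservation (AR). The function names, the single-node model, and the lumped `overhead` parameter are assumptions made for illustration; they are not taken from the Lease Manager itself.

```python
def finish_backfill_only(now, duration, ar_start, ar_end):
    """Completion time when a running job cannot be preempted.

    The job may start before the AR only if it also finishes before the
    AR begins; otherwise it has to wait until the AR is over.
    """
    if now + duration <= ar_start:
        return now + duration
    return max(now, ar_end) + duration


def finish_suspend_resume(now, duration, ar_start, ar_end, overhead=0):
    """Completion time when the job runs in a VM that can be suspended.

    The job starts right away, is suspended when the AR begins, and is
    resumed when the AR ends. `overhead` is a hypothetical lump cost for
    one suspend/resume cycle.
    """
    if now + duration <= ar_start or now >= ar_end:
        return now + duration            # the AR never gets in the way
    if now >= ar_start:
        return ar_end + duration         # AR already running: wait it out
    work_done = ar_start - now           # progress made before the AR
    return ar_end + overhead + (duration - work_done)


# A 30-minute job arriving at t=0 with an AR from t=20 to t=40 (minutes).
print(finish_backfill_only(0, 30, 20, 40))    # 70: node idles from 0 to 20
print(finish_suspend_resume(0, 30, 20, 40))   # 50: the gap before the AR is used
```

In this toy model the suspendable job finishes earlier because the 20 minutes before the reservation are not wasted; a real scheduler would also have to account for where the suspended VM's memory image is stored.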
[Figure: two schedules across Node 1, Node 2, and Node 3. Top: scheduling the short-term lease without using virtualization. Bottom: scheduling the short-term lease using virtualization.]
Experiment Setting
• Simulated testbed of 8 nodes connected by a 100 Mbps network, such that at most two VMs can run simultaneously on one node.
• We consider the best and worst cases
• Traces
  • Artificial traces, combining serial batch requests and ARs
  • Would require 10h to run on the testbed (assuming perfect utilization)
• VM runtime overhead assumed to be 10%
Experiments
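The figures in this setting can be turned into two quick numbers: how many VMs the testbed can host at once, and how long the 10h trace takes at best once the 10% runtime overhead is applied. The sketch below assumes the overhead acts as a uniform multiplicative slowdown on all work, which the slides do not state explicitly; the constant and function names are made up for this example.

```python
NODES = 8                      # simulated testbed size
MAX_VMS_PER_NODE = 2           # at most two VMs per node
IDEAL_TRACE_MINUTES = 10 * 60  # 10h at perfect utilization
VM_OVERHEAD_PERCENT = 10       # assumed VM runtime overhead


def vm_slots(nodes=NODES, per_node=MAX_VMS_PER_NODE):
    """Maximum number of VMs that can run at the same time."""
    return nodes * per_node


def inflated_minutes(minutes=IDEAL_TRACE_MINUTES,
                     overhead_percent=VM_OVERHEAD_PERCENT):
    """Running time if every minute of work is slowed by the overhead."""
    return minutes * (100 + overhead_percent) / 100


print(vm_slots())          # 16
print(inflated_minutes())  # 660.0 minutes, i.e. 11h
```

Under that assumption, 11h is the running time that suspend/resume has to beat or approach in the experiments that follow.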
Experiment I
• Is using VMs for suspend/resume backfill worth the overhead?
• Assumption: we are using only one VM image
• Prototype scheduler supporting batch serial requests and advance reservations, using backfilling or suspend/resume to plan around the ARs.
• A Resource Management Model for VM-Based Virtual Workspaces, B. Sotomayor, Master's paper, University of Chicago, February 2007.
Best-case trace
• Trace characteristics
  • Duration of batch requests: Avg=15 min.
  • AR resource consumption: 75% - 100%
  • Proportion of Batch/AR: 75%/25%
• Benefits from suspend/resume because the large number of relatively long batch requests limits the efficiency of backfilling.
One Image (best case): Baseline. Not using VMs (no runtime overhead), with backfilling instead of suspend/resume.
One Image (best case): Add Runtime Overhead. Running inside a VM adds runtime overhead, but it is not a big hit since images are predeployed.
One Image (best case): Use Suspend/Resume. Allows for better resource utilization than backfilling, even better than the baseline (because of the long batch requests).
Worst-case trace
• Same as the previous trace, but with shorter batch requests (avg=5 minutes)
  • This also entails that there are more batch requests, since the total running time of the trace is still 10h
• With a large number of relatively short requests, backfilling is already very effective, and little is gained from suspend/resume.
• Furthermore, many more images have to be deployed in this case, which increases the preparation overhead.
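The link between shorter requests and higher preparation overhead can be sketched as simple arithmetic. The total amount of batch work and the per-request preparation cost below are hypothetical placeholders, and the sketch assumes one deployment per request with no image reuse.

```python
def request_count(total_batch_minutes, avg_request_minutes):
    """Number of requests needed to fill a fixed amount of batch work."""
    return round(total_batch_minutes / avg_request_minutes)


def total_preparation_minutes(total_batch_minutes, avg_request_minutes,
                              prep_minutes_per_request):
    """Total preparation time if every request triggers one deployment."""
    return (request_count(total_batch_minutes, avg_request_minutes)
            * prep_minutes_per_request)


TOTAL_BATCH_MINUTES = 600  # hypothetical amount of batch work in the trace

print(request_count(TOTAL_BATCH_MINUTES, 15))  # 40 requests (best case)
print(request_count(TOTAL_BATCH_MINUTES, 5))   # 120 requests (worst case)
```

Whatever the real totals are, cutting the average request length from 15 to 5 minutes triples the number of requests for the same amount of work, so any fixed per-request preparation cost triples with it.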
One Image (worst case): Baseline. Not using VMs (no runtime overhead), with backfilling instead of suspend/resume.
One Image (worst case): Add Runtime Overhead. Running inside a VM adds runtime overhead, but it is not a big hit since images are predeployed.
One Image (worst case): Use Suspend/Resume. Doesn't provide any significant advantage over backfilling because of the short batch requests.
Experiment II
• How much do we pay for the added flexibility of operating in multiple virtualized environments?
• Assumption: we are using multiple images
  • Scheduler also has application-specific knowledge (i.e., it knows it is scheduling VMs), so it is also able to schedule timely VM image transfers.
  • Image reuse strategies: realistically, not all images will be different
• Modification of Experiment I
  • Use 37 possible 600 MB VM images.
  • 7 images account for 70% of requests.
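A rough feel for the deployment overhead comes from the image size and the 100 Mbps network of the testbed, and image reuse can be pictured as a per-node cache. The sketch below assumes 1 MB = 10^6 bytes, no protocol overhead, and no contention on the link; the `NodeImageCache` class and its least-recently-used eviction are illustrative choices, not the reuse strategy used in the experiments.

```python
from collections import OrderedDict

IMAGE_SIZE_MB = 600  # size of each VM image
LINK_MBPS = 100      # testbed network speed


def transfer_seconds(size_mb=IMAGE_SIZE_MB, link_mbps=LINK_MBPS):
    """Idealized time to copy one image to a node."""
    return size_mb * 8 / link_mbps


class NodeImageCache:
    """Hypothetical per-node store of already-deployed images (LRU)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._images = OrderedDict()

    def deploy(self, image_id):
        """Return the transfer time needed for this request (0 on reuse)."""
        if image_id in self._images:
            self._images.move_to_end(image_id)  # mark as recently used
            return 0.0
        if len(self._images) >= self.capacity:
            self._images.popitem(last=False)    # evict least recently used
        self._images[image_id] = True
        return transfer_seconds()


cache = NodeImageCache(capacity=2)
print(transfer_seconds())  # 48.0 seconds per image
print([cache.deploy(i) for i in ["a", "a", "b", "c", "a"]])
# [48.0, 0.0, 48.0, 48.0, 48.0]
```

Since 7 of the 37 images account for 70% of requests, even a small cache of popular images on each node would let a large share of requests skip the transfer entirely.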
Multiple Images (best case): Baseline. Not using VMs (no runtime overhead), with backfilling instead of suspend/resume.
Multiple Images (best case): Transferring Images. Adds deployment overhead, which delays the starting time of batch requests.
Multiple Images (best case): Adding Runtime Overhead. Makes the running time even longer.
Multiple Images (best case): Use Suspend/Resume. Better resource utilization compensates for the deployment overhead.
Multiple Images (best case): Image Reuse. Improves performance slightly.
Multiple Images (worst case): Baseline. Not using VMs (no runtime overhead), with backfilling instead of suspend/resume.
Multiple Images (worst case): Transferring Images. Adds deployment overhead, which delays the starting time of batch requests.
Multiple Images (worst case): Adding Runtime Overhead. Relatively small performance hit (the least of our concerns here).
Multiple Images (worst case): Use Suspend/Resume. Doesn't improve significantly over backfilling, which already does a good job thanks to the presence of small batch requests.
Multiple Images (worst case): Image Reuse. Compensates for the deployment overhead. Still not as good as the baseline, but the difference is relatively small.
Conclusions
• Using virtualization can make short-term leasing with interesting semantics cost-effective, even in the presence of runtime overhead.
• Given reasonable strategies for managing deployment overhead, the cost of using multiple images is acceptable.
• However, only artificial stress traces have been used so far.
• Preliminary results with real traces suggest that short-term leases can be integrated into real workloads and still be cost-effective (we will release these results as soon as they're solid).
Ongoing Work
• Develop a better scheduler
• Handle parallel batch submissions
• Integrate this virtualized resource manager with existing LRM
• This work is our top-down effort
• We also have a bottom-up effort
• Better modeling of traces
• Based on real world batch submissions
• Non-uniform overhead
• Understanding VM overhead in practice
• Virtualization in Practice: http://press.mcs.anl.gov/virtualization/
Questions?