
Predicting Queue Waiting Time For Individual User Jobs

Rich Wolski, Dan Nurmi, John Brevik, Graziano Obertelli, Ryan Garver. Computer Science Department, University of California, Santa Barbara.



Presentation Transcript


  1. Predicting Queue Waiting Time For Individual User Jobs
     Rich Wolski, Dan Nurmi, John Brevik, Graziano Obertelli, Ryan Garver
     Computer Science Department, University of California, Santa Barbara

  2. Problem: Predicting Delay in Batch Queues
     • Time in queue is experienced as application delay
     • Sounds like an easy problem, but
       • Distribution of load from users is a matter of some debate
       • Scheduling policy is partially hidden
       • Sites need to change the policies dynamically and without warning
       • Job execution times are difficult to predict
     • Much research in this area over the past 20 years, but few solutions
     • Current commercial systems provide high-variance estimates
       • On-line simulation based on max requested time
       • “expected” value predictions
       • Most sites simply disable these features

  3. Hard Problem

  4. For Scheduling: It’s all about the big Q
     • Predictions of the form
       • “What is the maximum time my job will wait with X% certainty?”
       • “What is the minimum time my job will wait with X% certainty?”
     • Requires two estimates if certainty is to be quantified
       • Estimate the (1-X) quantile for the distribution of availability => Qx
       • Estimate the upper or lower X% confidence bound on the statistic Qx => Q(x,b)
     • If the estimates are unbiased, and the distribution is stationary, future availability duration will be larger than Q(x,b) X% of the time, guaranteed

  5. Quantiles versus Moments
     • Quantiles permit quantifiable predictions for individual jobs
       • “expectation” in relation to the mean is a misnomer => the mean is useful for throughput
     • Example: 100 jobs, heavy tail, 6 orders of magnitude variation, random order
       • 95 jobs wait 10 seconds or less
       • 1 job waits 1000 seconds
       • 1 job waits 10000 seconds
       • 1 job waits 100000 seconds
       • 1 job waits 1000000 seconds
       • 1 job waits 10000000 seconds
     • mean wait time: about 111120 seconds
       • The “expected value”
     • 0.95 quantile: 10 seconds
       • “95% chance” job will wait 10 seconds or less
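The slide's example can be checked with a few lines of Python. This is a minimal sketch that assumes each of the 95 short jobs waits exactly 10 seconds (the slide only says "10 seconds or less"); the helper name `empirical_quantile` is introduced here for illustration.

```python
import math
from statistics import mean

# Wait times from the slide's example; the 95 short waits are taken as
# exactly 10 s, which is an assumption (the slide says "10 seconds or less").
waits = [10] * 95 + [1_000, 10_000, 100_000, 1_000_000, 10_000_000]

def empirical_quantile(samples, q):
    """Smallest sample value v such that at least a fraction q of samples are <= v."""
    ordered = sorted(samples)
    # Small epsilon guards against floating-point noise in q * n.
    k = max(1, math.ceil(q * len(ordered) - 1e-9))
    return ordered[k - 1]

mean_wait = mean(waits)
q95 = empirical_quantile(waits, 0.95)

print(f"mean wait:     {mean_wait} s")
print(f"0.95 quantile: {q95} s")
```

Under that assumption the mean is dominated by the five long waits, while the 0.95 quantile stays at 10 seconds, which is the contrast the slide draws.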

  6. BMBP: A New Predictive Methodology
     • New quantile estimator based on the Binomial distribution
       • Requires carefully engineered numerical system to deal with large-scale combinatorics
     • New changepoint detector
       • Binomial method in a time series context is difficult
       • Need a system to determine
         • Stationary regions in the data
         • Minimum statistically meaningful history in each region
     • New clustering methodology
       • More accurate estimates are possible if predictions are made from jobs with similar characteristics
       • Takes dynamic policy changes into account more effectively
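A binomial-based quantile bound can be sketched with the standard order-statistic argument: if the n observed waits are independent draws from one stationary distribution, the number of them that fall below the true q quantile is Binomial(n, q), so the k-th smallest observation is an upper bound on that quantile with confidence P(Binomial(n, q) <= k-1). The code below is a sketch under those assumptions, not the BMBP implementation; the function names are hypothetical, and the log-gamma form is just one simple way to avoid overflow in large binomial coefficients.

```python
import math

def binom_pmf(n, i, p):
    """P(Binomial(n, p) == i), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log1p(-p))
    return math.exp(log_pmf)

def upper_bound_index(n, q, confidence):
    """Smallest 1-based k with P(Binomial(n, q) <= k-1) >= confidence.

    Returns None when n samples are too few to reach the confidence level.
    Requires 0 < q < 1.
    """
    cdf = 0.0
    for k in range(1, n + 1):
        cdf += binom_pmf(n, k - 1, q)
        if cdf >= confidence:
            return k
    return None

def quantile_upper_bound(waits, q=0.95, confidence=0.95):
    """Upper confidence bound on the q quantile of wait time, or None."""
    ordered = sorted(waits)
    k = upper_bound_index(len(ordered), q, confidence)
    return None if k is None else ordered[k - 1]

# Hypothetical history: 100 observed waits of 1..100 seconds.
print(quantile_upper_bound(list(range(1, 101))))
# With too little history no bound can be given at this confidence level.
print(quantile_upper_bound(list(range(1, 59))))
```

The `None` case illustrates why a minimum statistically meaningful history matters: the largest order statistic reaches the confidence level only when 1 - q^n is at least that level, which for q = 0.95 and 95% confidence needs at least 59 observations.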

  7. Ten Years of Supercomputing

  8. See it In Action • http://nws.cs.ucsb.edu/batchq

  9. Predicting Things Upside Down
     • Deadline scheduling: My job needs to start in the next X seconds for the results to be meaningful.
       • Amitava Mujumdar, Tharaka Devaditha, Adam Birnbaum (SDSC)
       • Need to run a 4 minute image reconstruction that completes in the next 8 minutes
     • Given a
       • Machine
       • Queue
       • Processor count
       • Run time
       • Deadline
     • What is the probability that a job will meet the deadline?
     • http://nws.cs.ucsb.edu/batchq/invbqueue.php
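The inverse question can be illustrated with a deliberately naive estimate: a job meets its deadline if it starts within the slack (deadline minus run time), so the fraction of past waits no longer than the slack gives a rough probability. This sketch is not the method behind the web tool above; it ignores confidence bounds, changepoints, and clustering, and the function name and history values are hypothetical.

```python
def empirical_deadline_probability(past_waits_s, deadline_s, run_time_s):
    """Fraction of past queue waits short enough for the job to finish on time.

    Naive empirical estimate for illustration only; it carries no
    confidence bound.
    """
    if not past_waits_s:
        raise ValueError("need at least one historical wait time")
    slack = deadline_s - run_time_s  # latest acceptable start, in seconds
    if slack < 0:
        return 0.0  # the run alone exceeds the deadline
    hits = sum(1 for w in past_waits_s if w <= slack)
    return hits / len(past_waits_s)

# Slide's scenario: a 4 minute run that must complete within 8 minutes,
# so the job must start within 240 seconds.
history = [30, 120, 240, 600, 3600]  # hypothetical past waits, in seconds
p = empirical_deadline_probability(history, deadline_s=480, run_time_s=240)
print(p)
```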

  10. How Well Does it Work with an Application?
     [Figure: EMAN diagram; labels: Electron Micrograph, Particles, Preliminary 3D Model, Refine, Final 3D Model]
     • EMAN has been developed at Baylor College of Medicine by the research group of Wah Chiu and Steven Ludtke, {wah,sludtke}@bcm.tmc.edu

  11. VGrADS EMAN Batch Scheduler
     • EMAN emulator
       • Run the EMAN scheduler to determine a job launch sequence
       • Launch the jobs by submitting them to the queues specified by the scheduler
       • When an EMAN job acquires the processors, exit and “sleep” the emulator for the predicted execution time
         • Saves system allocation time
       • Record the overall makespan
     • Experiment:
       • Chicago TeraGrid, SDSC TeraGrid, NCSA TeraGrid and CNSI Dell at UCSB
       • 57 separate runs
     • Results: mean observed and mean predicted makespans are not significantly different at alpha = 0.05

  12. 95% Upper Bound on Median

  13. EMAN Turnaround Improvement

  14. Virtual Resource Reservations
     [Figure labels: 0.75, submit time, now]
     • 75% is the target probability
     • 356 total requests
     • 257 total batch submissions
     • 99 requests resulted in initial ‘not possible’ response
     • 192 slots successfully acquired
     • 257 × 0.75 ≈ 193

  15. Clustering
     • RMS ratio of BMBP with Clustering to without
     • Both achieve 95% correctness
     • Measures additional “tightness” improvement through clustering

  16. FAQ
     • What happens if everyone uses these predictions? Will it be stable?
       • Maybe
       • We do not consider jobs in queue
       • Automatic schedulers may cause destabilization
     • What about autocorrelation (you idiot)?
       • Difficult to compute in this space
         • Error-prone for non-stationary series
         • Queues reorder the series
       • Autocorrelation is and is not an issue
         • Quantile estimation and clustering algorithm are relatively robust to autocorrelation
         • Change-point detector uses the autocorrelation it computes on the fly
       • Not a guarantee since it can fail
         • All guarantees come with a failure probability
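For readers unfamiliar with the term, the textbook sample autocorrelation at a given lag can be computed as below. This is a generic sketch, not the change-point detector's on-the-fly computation, which the slides do not describe; as the FAQ notes, the estimate is error-prone when the series is non-stationary or reordered by the queue. The function name is hypothetical.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a sequence at the given lag (textbook estimator)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom

# A strictly alternating series is strongly negatively correlated at lag 1.
r1 = autocorrelation([1, -1, 1, -1], lag=1)
print(r1)
```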

  17. The Software
     • Requires no special privileges
       • Predictions are better and “burn-in” shorter if scheduler logs are available => retrofit the log history
     • Version 1 -- obsolete
       • NWS sensors run at each site
       • Prediction software runs at UCSB
       • Command-line tools and web page connect to UCSB
       • Stable, but does not support clustering
     • Version 2 -- beta version
       • Supports automatic clustering
       • Prediction software can be run locally or at UCSB
       • Command-line tools locally or at UCSB
       • Web support at UCSB only
       • No packaging
     • Version 3 -- end of the year

  18. Batch Queue Prediction for Grid Systems
     • A good point-valued prediction remains elusive
       • “expectation” sounds attractive but is really a misnomer
     • Grid users certainly can use bounds instead
       • Early job completion is okay, typically
       • Bounds give a good intuitive feel for which queue will be quickest
     • Deployment and integration underway
       • CDF FermiLab working (barely)
       • Condor integration
       • UCLA Grid tools
     • Automatic schedulers are coming
       • EMAN doesn’t use ranges…it should
       • VGrADS is developing new schedulers (workflow)
       • NEESGrid and ISI are in development (workflow)
       • LEAD integration is underway (workflow)
       • Large-scale sensor network simulation

  19. What’s Next?
     • Open questions:
       • Does the availability of predictions affect load?
         • Rolling out production tools now and we will be monitoring
         • Job cancellation does not affect results
         • If it does, will allocations be stable?
       • Grid economies
         • Reservations must be integrated
       • Virtual resource reservations (VGrADS)
       • Conditional prediction and resubmission
       • Replicated submissions (boost success probability)
       • Virtual Cluster??
     • Thanks
       • NSF SCI, NSF NGS, VGrADS, SDSC, TACC, NCSA, Argonne
     • rich@cs.ucsb.edu
