Online Prediction of Task Running Time with Confidence Intervals

Online Prediction of the Running Time of Tasks
Peter A. Dinda
Department of Computer Science, Northwestern University
http://www.cs.northwestern.edu/~pdinda

Overview
• Predict running time of task
  • Application supplies task size (0.1-10 seconds currently)
  • Task is compute-bound (current limit)
• Prediction is a confidence interval
  • Expresses prediction error
  • Statistically valid decision-making in scheduler
• Based on host load prediction
  • Homogeneous Digital Unix hosts (current limit)
  • System is portable to many operating systems
Everything in talk is publicly available

Outline
• Running time advisor
• Host load results
• Computing confidence intervals
• Performance evaluation
• Related work
• Conclusions

A Universal Challenge in High Performance Distributed Applications
Highly variable resource availability
• Shared resources
• No reservations
• No globally respected priorities
• Competition from other users - “background workload”
Running time can vary drastically
Adaptation
Example goal: soft real-time for interactivity
Example mechanism: server selection
Performance queries

Running Time Advisor (RTA)
App: “What will be the running time of this 3 second task if started now?”
RTA: “It will be 5.3 seconds”
(Figure labels: App, Host, background workload)
Nominal time: running time on empty host, task size
• Entirely user-level tool
• No reservations or admission control
• Query result is a prediction

Variability and Prediction
(Figure: a resource signal over time t with high resource availability variability feeds a predictor, which yields a prediction and an error signal over t with low prediction error variability; the variability is characterized by the ACF.)
Exchange high resource availability variability for low prediction error variability and a characterization of that variability

Running Time Advisor (RTA)
App: “With 95% confidence, what will be the running time of this 3 second task if started now?”
RTA: “It will be 4.1 to 6.3 seconds”
(Figure labels: App, Host, background workload)
CI captures prediction error to the extent the application is interested in it
Independent of prediction techniques

RTA API

Host Load Traces
• DEC Unix 5 second exponential average
• Full bandwidth captured (1 Hz sample rate)
• Long durations
• http://www.cs.northwestern.edu/~pdinda/LoadTraces

Host Load Properties
• Self-similarity
  • long-range dependence
• Epochal behavior
  • non-stationarity
• Complex correlation structure
[LCR ’98, Scientific Programming, 3:4, 1999]

Host Load Prediction
• Fully randomized study on traces
• MEAN, LAST, AR, MA, ARMA, ARIMA, ARFIMA models
• AR(16) models most appropriate
• Covariance matrix for prediction errors
• Low overhead: <1% CPU
[HPDC ’99, Cluster Computing, 3:4, 2000]

RPS Toolkit
• Extensible toolkit for implementing resource signal prediction systems
• Easy “buy-in” for users
  • C++ and sockets (no threads)
  • Prebuilt prediction components
  • Libraries (sensors, time series, communication)
• Users have bought in
  • Incorporated in CMU Remos, BBN QuO
[CMU-CS-99-138] http://www.cs.northwestern.edu/~RPS

A Model of the Unix Scheduler
tact = f(tnom, background workload)
(Figure: a task with nominal running time tnom and the background workload enter the Unix scheduler; the outputs are the task's actual running time tact and the actual load <zt>.)

A Model of the Unix Scheduler
texp = g(tnom, <ẑt>) = tact + Error
(Figure: a task with nominal running time tnom and the background workload enter the Unix scheduler; the outputs are the task's predicted running time texp and the predicted load <ẑt>.)

Available Time and Average Load
Available time from 0 to t: at(t)
Average load from 0 to t: al(t)
Load signal: replace with prediction of load signal
tact is the minimum t where at(t) = tnom
Fluid model, Processor Sharing, Idealized Round-Robin, …
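The slide's formulas are not preserved in this transcript. One formulation consistent with the text, assuming that a task running over an interval with average load al receives a 1/(1 + al) share of the CPU, is sketched below. The exact form is an assumption, not a transcription; the symbols mirror the slide's at, al, tnom and tact, and z(τ) denotes the load signal.

```latex
\begin{align*}
\mathit{al}(t) &= \frac{1}{t}\int_0^t z(\tau)\,d\tau
  && \text{average load from 0 to } t \\
\mathit{at}(t) &= \frac{t}{1+\mathit{al}(t)}
  && \text{available time from 0 to } t \\
t_{act} &= \min\{\, t : \mathit{at}(t) = t_{nom} \,\}
\end{align*}
```

Under this sketch, a constant load of 1 gives at(t) = t/2, so a task with tnom = 3 seconds would finish at tact = 6 seconds.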

Discrete Time
• No magic here – this is the obvious discretization
• Δ is the sample interval
• zt+j is replaced with its prediction
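A minimal Python sketch of such a discretization, under stated assumptions: `loads` holds the (predicted) load samples z at steps t+1, t+2, ..., `delta` is the sample interval in seconds, the average load over i steps is the mean of the first i samples, and the available time is i·delta / (1 + average load). The function name and the formula's exact form are illustrative, not taken from the slide.

```python
def predicted_running_time(t_nom, loads, delta=1.0):
    """Return the first i*delta at which available time reaches t_nom.

    Assumed model (hypothetical, for illustration):
      al_i = mean of the first i load samples
      at_i = i * delta / (1 + al_i)
    Returns None if the supplied samples do not reach t_nom.
    """
    total = 0.0
    for i, z in enumerate(loads, start=1):
        total += z
        al_i = total / i                  # average load over i steps
        at_i = i * delta / (1.0 + al_i)   # available time over i steps
        if at_i >= t_nom:
            return i * delta
    return None


if __name__ == "__main__":
    # A 3 second task on a host with constant load 1.0, sampled at 1 Hz.
    print(predicted_running_time(3.0, [1.0] * 20))
```

The result is quantized to the sample interval, so with 1 Hz samples a 0.1 second task is reported as finishing within the first second.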

Confidence Intervals
zt+j is replaced with ẑt+j in the prediction, giving âli, âti, ât(t)
A confidence interval for ât(t) is a CI for âli … prediction errors
Since this is a sum, the central limit theorem applies…
Then a 95% confidence interval is [equation not preserved]
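A form of this interval consistent with the slide text can be sketched as follows. It assumes zero-mean prediction errors a_{t+j} = z_{t+j} − ẑ_{t+j} whose sum is approximately normal, and it assumes the average load over i steps is the mean of the i samples; σ_i is introduced here for illustration and is not a symbol from the slide.

```latex
\begin{align*}
\widehat{\mathit{al}}_i &= \frac{1}{i}\sum_{j=1}^{i}\hat z_{t+j},
\qquad
\mathit{al}_i = \widehat{\mathit{al}}_i + \frac{1}{i}\sum_{j=1}^{i} a_{t+j} \\
\sigma_i^2 &= \operatorname{Var}\!\left[\sum_{j=1}^{i} a_{t+j}\right] \\
\mathit{al}_i &\in \left[\, \widehat{\mathit{al}}_i - 1.96\,\frac{\sigma_i}{i},\;
                           \widehat{\mathit{al}}_i + 1.96\,\frac{\sigma_i}{i} \,\right]
\quad \text{with approximately 95\% confidence}
\end{align*}
```

The factor 1.96 is the two-sided 95% quantile of the standard normal distribution.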

The Variance of the Sum
• Prediction errors at+j are not independent
• Predictor’s covariance matrix captures this
Predictor makes it possible to compute this variance and thus the CI
Important detail: load discounting
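A minimal sketch of how a covariance matrix yields the variance of a sum of correlated errors: the variance of a_{t+1} + ... + a_{t+i} is the sum of all entries in the leading i-by-i block of the matrix. The function names, the normal-approximation interval, and the clamping of the lower bound at zero (load cannot be negative) are assumptions made for illustration; load discounting is not modeled here.

```python
import math


def sum_variance(cov, i):
    """Variance of the sum of the first i prediction errors.

    cov[r][c] is the covariance between the errors at steps r+1 and c+1.
    Off-diagonal terms are included, so correlated errors are handled.
    """
    return sum(cov[r][c] for r in range(i) for c in range(i))


def average_load_ci(pred_loads, cov, i, z_crit=1.96):
    """Approximate CI for the average load over the next i steps.

    z_crit=1.96 gives a two-sided 95% interval under a normal approximation.
    """
    al_hat = sum(pred_loads[:i]) / i
    half_width = z_crit * math.sqrt(sum_variance(cov, i)) / i
    return max(0.0, al_hat - half_width), al_hat + half_width


if __name__ == "__main__":
    cov = [[1.0, 0.5],
           [0.5, 1.0]]                # hypothetical error covariances
    print(sum_variance(cov, 2))       # 3.0, versus 2.0 if errors were independent
    print(average_load_ci([1.0, 1.0], cov, 2))
```

Ignoring the off-diagonal terms here would understate the variance (2.0 instead of 3.0) and produce an interval that is too narrow.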

Experimental Setup
• Environment
  • Alphastation 255s, Digital Unix 4.0
  • Workload: host load trace playback [LCR 2000]
  • Prediction system on each host
    • AR(16), MEAN, LAST
• Tasks
  • Nominal time ~ U(0.1,10) seconds
  • Interarrival time ~ U(5,15) seconds
  • 95% confidence level
• Methodology
  • Predict CIs
  • Run task and measure
http://www.cs.northwestern.edu/~pdinda/LoadTraces/playload

Metrics
• Coverage
  • Fraction of test cases whose measured running time falls within the confidence interval
  • Ideally should equal the target 95%
• Span
  • Average length of confidence interval
  • Ideally as short as possible
• R² between texp and tact
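These three metrics can be sketched in a few lines of Python. The function names are illustrative, and R² is computed here as the squared Pearson correlation between predicted and actual running times, which is an assumption about the definition used in the talk.

```python
def coverage(intervals, actuals):
    """Fraction of test cases whose actual time lies inside its interval."""
    hits = sum(1 for (lo, hi), t in zip(intervals, actuals) if lo <= t <= hi)
    return hits / len(actuals)


def span(intervals):
    """Average length of the confidence intervals."""
    return sum(hi - lo for lo, hi in intervals) / len(intervals)


def r_squared(predicted, actual):
    """Squared Pearson correlation between predicted and actual times."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov * cov / (var_p * var_a)


if __name__ == "__main__":
    # Hypothetical test cases: (lower, upper) bounds and measured times.
    cis = [(1.0, 3.0), (2.0, 4.0), (5.0, 6.0), (0.0, 1.0)]
    measured = [2.0, 5.0, 5.5, 0.5]
    print(coverage(cis, measured), span(cis))
```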

General Picture of Results
• Five classes of behavior
  • I’ll show you two
• RTA works
  • Coverage near 95% in most cases is possible
• Predictor quality matters
  • Better predictors lead to smaller spans on lightly loaded hosts and to correct coverage on heavily loaded hosts
  • AR(16) >= LAST >= MEAN
• Performance is slightly dependent on nominal time

Most Common Coverage Behavior

Most Common Span Behavior

Uncommon Coverage Behavior

Uncommon Span Behavior

Related Work
• Distributed interactive applications
  • QuakeViz/Dv, Aeschlimann [PDPTA ’99]
• Quality of service
  • QuO, Zinky, Bakken, Schantz [TPOS, April ’97]
  • QRAM, Rajkumar, et al. [RTSS ’97]
• Distributed soft real-time systems
  • Lawrence, Jensen [assorted]
• Workload studies for load balancing
  • Mutka, et al. [PerfEval ’91]
  • Harchol-Balter, et al. [SIGMETRICS ’96]
• Resource signal measurement systems
  • Remos [HPDC ’98]
  • Network Weather Service [HPDC ’97, HPDC ’99]
• Host load prediction
  • Wolski, et al. [HPDC ’99] (NWS)
  • Samadani, et al. [PODC ’95]
  • Hailperin [’93]
• Application-level scheduling
  • Berman, et al. [HPDC ’96]
  • Stochastic Scheduling, Schopf [Supercomputing ’99]

Conclusions
• Predict running time of compute-bound task
  • Based on host load prediction
  • Prediction is a confidence interval
• Confidence interval algorithm
  • Covariance matrix
  • Load discounting
• Effective for domain
  • Digital Unix, 0.1-10 second tasks, 5-15 second interarrival
• Extensions in progress

For More Information
• All software and traces are available
  • RPS + RTA + RTSA: http://www.cs.northwestern.edu/~RPS
  • Load traces and playback: http://www.cs.northwestern.edu/~pdinda/LoadTraces
• Prescience Lab
  • Peter Dinda, Jason Skicewicz, Dong Lu
  • http://www.cs.northwestern.edu/~plab

A Universal Problem
Which host should the application send the task to so that its running time is appropriate?
Example: Real-time
Known resource requirements
What will the running time be if I...

Running Time Advisor
Application notifies advisor of task’s computational requirements (nominal time)
Advisor predicts running time on each host
Application assigns task to most appropriate host
(Figure labels: task, nominal time, predicted running time)

Real-time Scheduling Advisor
Application specifies task’s computational requirements (nominal time) and its deadline
Advisor acquires predicted task running times for all hosts
Advisor chooses one of the hosts where the deadline can be met
(Figure labels: task, nominal time, deadline, predicted running time)

Confidence Intervals to Characterize Variability
“3 to 5 seconds with 95% confidence”
Application specifies confidence level (e.g., 95%)
Running time advisor predicts running times as a confidence interval (CI)
Real-time scheduling advisor chooses host where CI is less than deadline
CI captures variability to the extent the application is interested in it
(Figure labels: task, nominal time, deadline, 95% confidence, predicted running time)
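The host choice described above can be sketched as follows. This assumes each host's prediction arrives as a (lower, upper) running-time interval in seconds at the confidence level the application requested, and it treats "CI is less than deadline" as "upper bound does not exceed the deadline". Breaking ties by the smallest upper bound is an illustrative policy, not something the slide specifies, and the host names are hypothetical.

```python
def choose_host(cis, deadline):
    """Pick a host whose whole confidence interval fits within the deadline.

    cis: dict mapping host name -> (lower, upper) predicted running time.
    Returns the feasible host with the smallest upper bound, or None if
    no host is predicted to meet the deadline.
    """
    feasible = {host: ci for host, ci in cis.items() if ci[1] <= deadline}
    if not feasible:
        return None
    return min(feasible, key=lambda host: feasible[host][1])


if __name__ == "__main__":
    # Hypothetical 95% intervals returned by a running time advisor.
    predictions = {
        "host-a": (3.0, 5.0),
        "host-b": (4.1, 6.3),
        "host-c": (2.0, 4.5),
    }
    print(choose_host(predictions, deadline=5.0))
```

Returning None when no interval fits lets the application decide whether to relax the deadline, lower the confidence level, or reject the task.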

Prototype System This Paper

Load Discounting Motivation
• I/O priority boost
• Short tasks are less affected by load

Load Discounting
• Apply before using load predictions
• tdiscount is an estimable machine property
