Automatic Statistical Evaluation of Resources for Condor Daniel Nurmi, John Brevik, Rich Wolski University of California, Santa Barbara
Motivation • Distributed System/Grid applications execute on a wide variety of architectures • Clusters • Large SMP systems • Interactive workstation networks • Condor provides a vast, easily accessible resource pool, but is best suited to Condor applications
Condor As Resource Pool • Provides many required features • Resource manager • Account manager • Scheduler • Resource availability very dynamic • Controlled by large number of variables including overall load, user priority, occupancy time, owner revocation, etc. • Resources free up and drop out frequently • Long running apps must be checkpointed
Checkpointing Schemes • Condor checkpointing • Standard Universe uses system call liftoff • Core file is used to capture process state for restart • Application-level checkpointing: • Application developer must generate checkpoints from within the application • Disk storage may be limited (none available locally)
Condor Checkpointing • Checkpointing is invisible to application developer, but… • No threads • No forking • Single architecture support • Must use compiler supported by Condor (e.g. no GMP)
Application-Level Checkpointing • No support from Condor for checkpointing in Vanilla universe • Left to the application • No restrictions on system calls or compilation • If it compiles it will run • No local disk storage • Checkpoints must traverse the network to a machine with stable storage • Checkpoint schedule major performance concern
Checkpoint Scheduling • Given a long running application and volatile resource, determine the amount of time to perform useful computation between checkpoints such that the overhead of checkpointing is minimized • Well studied • K. M. Chandy, C. V. Ramamoorthy. Rollback and recovery strategies for computer systems. • M. Elnozahy, L. Alvisi, Y. M. Wang, D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. • A. Duda. The effects of checkpointing on program execution time. • N. H. Vaidya. Impact of checkpoint latency on overhead ratio of a checkpointing scheme. • We use the Markov-model-based approach proposed by N. H. Vaidya.
Checkpoint Interval Selection • Model requires statistical distribution describing resource availability • Vaidya, and later Plank, assume exponential distributions
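The trade-off behind interval selection can be sketched under the exponential assumption. This is a minimal illustration, not Vaidya's Markov model: it assumes failures arrive at a constant rate, that a failure loses all work since the last completed checkpoint, and that recovery itself costs nothing. The names `expected_segment_time`, `efficiency`, and `best_interval` are hypothetical.

```python
import math


def expected_segment_time(work, checkpoint_cost, failure_rate):
    """Expected wall-clock time to finish one work-plus-checkpoint segment.

    Assumption: time to failure is exponential with the given rate, and
    every failure restarts the segment from its beginning.
    """
    span = work + checkpoint_cost
    return math.expm1(failure_rate * span) / failure_rate


def efficiency(work, checkpoint_cost, failure_rate):
    """Fraction of wall-clock time spent on useful computation."""
    try:
        return work / expected_segment_time(work, checkpoint_cost, failure_rate)
    except OverflowError:
        # The segment is so long relative to the failure rate that it
        # effectively never completes.
        return 0.0


def best_interval(checkpoint_cost, failure_rate, max_work=100_000, step=1):
    """Grid search for the work interval that maximizes efficiency."""
    best_work, best_eff = step, efficiency(step, checkpoint_cost, failure_rate)
    for work in range(step, max_work + 1, step):
        eff = efficiency(work, checkpoint_cost, failure_rate)
        if eff > best_eff:
            best_work, best_eff = work, eff
    return best_work, best_eff


if __name__ == "__main__":
    # Hypothetical numbers: 100 s per checkpoint, one failure per 10,000 s.
    work, eff = best_interval(checkpoint_cost=100.0, failure_rate=1e-4)
    print(f"compute for {work} s between checkpoints, efficiency {eff:.3f}")
```

A grid search is used instead of a closed form so that the efficiency function can be swapped for one derived from a different availability distribution.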
What is the Availability Distribution? • Weibull • T. Heath, P. M. Martin, T. D. Nguyen. The shape of failure • J. Xu, Z. Kalbarczyk, R. K. Iyer. Networked Windows NT system field failure data analysis • Hyperexponential • M. Mutka, M. Livny. Profiling workstations’ available capacity for remote execution. • I. Lee, D. Tang, R. K. Iyer, M. C. Hsueh. Measurement-based evaluation of operating system fault tolerance.
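For reference, the standard probability densities of the candidate families are written out below. The symbols (rate $\lambda$, shape $\alpha$, scale $\beta$, phase weights $p_i$, phase rates $\lambda_i$, number of phases $k$) are notation chosen here and do not come from the slides.

```latex
\begin{align*}
  f_{\mathrm{exp}}(x) &= \lambda e^{-\lambda x},
      & x \ge 0,\ \lambda > 0 \\
  f_{\mathrm{W}}(x) &= \frac{\alpha}{\beta}
      \left(\frac{x}{\beta}\right)^{\alpha - 1}
      e^{-(x/\beta)^{\alpha}},
      & x \ge 0,\ \alpha, \beta > 0 \\
  f_{\mathrm{H}}(x) &= \sum_{i=1}^{k} p_i \lambda_i e^{-\lambda_i x},
      & x \ge 0,\ p_i \ge 0,\ \sum_{i=1}^{k} p_i = 1
\end{align*}
```

Both alternatives contain the exponential as a special case: the Weibull with $\alpha = 1$ and the hyperexponential with $k = 1$. A Weibull with $\alpha < 1$ has a decreasing hazard rate, meaning a resource that has already been up for a while becomes less likely to fail in the next instant.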
Generating Statistical Models • Network Weather Service monitoring of Condor pool over a 2-year period • 708 machines observed • Automatic model fitting software • Takes as input distribution type and historical Condor uptime values • Outputs best fit parameters for given distribution • Design experiment to test overall work efficiency of checkpointing scheme using four different distributions
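The fitting step (distribution type and uptime history in, best-fit parameters out) can be illustrated with a small maximum-likelihood sketch. This is not the authors' software: the function names are hypothetical, only the exponential and Weibull cases are shown, and censored observations are ignored.

```python
import math


def fit_exponential(uptimes):
    """MLE of the exponential rate: number of samples over their sum."""
    return len(uptimes) / sum(uptimes)


def fit_weibull(uptimes, lo=0.01, hi=100.0, iters=100):
    """MLE of Weibull (shape, scale) by bisection on the shape equation.

    Assumption: all uptimes are strictly positive and not all identical.
    """
    if any(x <= 0 for x in uptimes):
        raise ValueError("uptimes must be strictly positive")
    n = len(uptimes)
    top = max(uptimes)
    # The shape equation is scale-invariant, so normalize to avoid overflow.
    scaled = [x / top for x in uptimes]
    logs = [math.log(x) for x in scaled]
    mean_log = sum(logs) / n

    def score(shape):
        # Zero of this function is the MLE shape; it increases with shape.
        powers = [x ** shape for x in scaled]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / shape - mean_log

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2.0
    scale = top * (sum(x ** shape for x in scaled) / n) ** (1.0 / shape)
    return shape, scale


if __name__ == "__main__":
    # Hypothetical uptime history in seconds for one machine.
    history = [120.0, 45.0, 3600.0, 900.0, 15.0, 7200.0, 300.0, 60.0]
    print("exponential rate:", fit_exponential(history))
    print("weibull shape, scale:", fit_weibull(history))
```

Bisection is used here because the shape equation is monotone in the shape parameter, which keeps the sketch free of external dependencies.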
Checkpoint Experiment • Test application submitted to Condor and when it runs… • Sends resource information to central server • Model fitting software estimates model parameters using MLE or EMpht methods • Checkpoint scheduler solves the Markov model using tested distribution • Application uses schedule, checkpoints its memory, and records performance • Test different distributions • Checkpointing to disks at UCSB
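For the hyperexponential case, parameters are usually estimated with an expectation-maximization (EM) iteration. The sketch below is a generic EM for a two-phase hyperexponential, written for illustration; it is not the EMpht package, and the function name, starting values, and fixed iteration count are assumptions.

```python
import math


def fit_hyperexp2(uptimes, iters=200):
    """EM fit of a two-phase hyperexponential.

    Returns (weight, fast, slow): the probability of the first phase and
    the rates of the two phases.
    """
    n = len(uptimes)
    mean = sum(uptimes) / n
    # Assumed starting point: one phase faster and one slower than the
    # single-exponential fit.
    weight, fast, slow = 0.5, 2.0 / mean, 0.5 / mean

    for _ in range(iters):
        # E-step: probability that each sample came from the first phase,
        # computed in the log domain to avoid underflow.
        resp = []
        for x in uptimes:
            log_a = math.log(weight) + math.log(fast) - fast * x
            log_b = math.log(1.0 - weight) + math.log(slow) - slow * x
            top = max(log_a, log_b)
            a = math.exp(log_a - top)
            b = math.exp(log_b - top)
            resp.append(a / (a + b))

        # M-step: re-estimate the weight and the per-phase rates.
        total = sum(resp)
        weight = min(max(total / n, 1e-12), 1.0 - 1e-12)
        fast = total / sum(r * x for r, x in zip(resp, uptimes))
        slow = (n - total) / sum((1.0 - r) * x for r, x in zip(resp, uptimes))

    return weight, fast, slow


if __name__ == "__main__":
    # Hypothetical uptimes in seconds: many short sessions, a few long ones.
    history = [30.0, 45.0, 60.0, 20.0, 90.0, 4000.0, 7500.0, 50.0, 12000.0]
    print(fit_hyperexp2(history))
```

One property worth noting: each M-step leaves the mean of the fitted mixture equal to the sample mean, which gives a quick sanity check on the output.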
Moral • We can determine optimal checkpoint schedules for Condor jobs automatically • Execution performance impact is about the same until checkpoint costs get big • Network load improvements are substantial (particularly useful in wide area) • Software is real, but non-NWS parts are in prototype • We want to bring them into the NWS release cycle • Paper in submission to HPDC
What’s Next • Better Models • Brevik Method: we can predict the percentiles of availability with provable confidence bounds using less data • Can’t use it (yet) for Markov model • Better Utility • Provide information to Condor itself • Automatic fault and anomaly detection • Better Information for users • Publish availability predictions in the matchmaker
Thanks • Rich Wolski • John Brevik • Miron Livny • NSF Next Generation Software program • VGrADS Project (NSF ITR, Ken Kennedy, PI) • NSF Middleware Initiative (NWS) • Questions?