Optimization of Google Cloud Task Processing with Checkpoint-Restart Mechanism

Speaker: Sheng Di • Coauthors: Yves Robert, Frédéric Vivien, Derrick Kondo, Franck Cappello

Outline • Background of Google Cloud Task Processing • System Overview • Research Formulation • Optimization of Fault-tolerance • Optimization of the Number of Checkpoints • Adaptive Optimization of Fault Tolerance • Local disk vs. Shared disk • Performance Evaluation • Conclusion and Future Work

Background • Google trace (released in November 2011): • 670,000 jobs, 2,500,000 tasks, 12,000 nodes • One-month period (29 days) • Various events, resource requests/allocations, job/task lengths, various attributes, etc. • Two types of jobs in the Google trace: • sequential-task jobs and Bag-of-Tasks jobs • 4,000 application types, such as map-reduce • Failure events occur often for some tasks! • Most task lengths are short (a few minutes to dozens of minutes), so task execution is sensitive to checkpointing cost.

System Overview • User Interface • Receive tasks • Task Scheduling • Coordinate resource competition among hosts • Resource Allocation • Coordinate resource usage within a particular host

System Overview (Cont’d) • Task Processing Procedure

Research Formulation • Analysis of Google trace: • Task failure intervals, task length, job structure • Equidistant checkpointing model • The checkpointing interval for a particular task is fixed • Task execution model (suppose k failures and x checkpointing intervals) • Tw(task) = Te(task) + C(x-1) + Σk{roll-back loss} + Σk{restart cost} • Objective: minimize E(Tw(task)) • Random variable: K (# of task failure events) • Compute the optimal # of checkpoints for a Google task • [Figure: timeline from task entry to task exit, dividing the task’s wall-clock time into productive time, roll-back loss, restart cost, and checkpoint cost]
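The task execution model above can be written out as a small function. This is a minimal sketch: the function name, argument names, and the sample numbers are assumptions made for illustration, and the per-failure losses are passed in explicitly rather than derived from a failure trace.

```python
from typing import Sequence


def wall_clock_time(te: float, c: float, x: int,
                    rollback_losses: Sequence[float],
                    restart_costs: Sequence[float]) -> float:
    """Tw = Te + C*(x-1) + sum(roll-back losses) + sum(restart costs).

    te: productive length, c: cost of one checkpoint,
    x: number of checkpointing intervals (so x-1 checkpoints are taken),
    rollback_losses / restart_costs: one entry per failure event.
    """
    if x < 1:
        raise ValueError("x must be at least 1")
    if len(rollback_losses) != len(restart_costs):
        raise ValueError("need one roll-back loss and one restart cost per failure")
    return te + c * (x - 1) + sum(rollback_losses) + sum(restart_costs)


# Hypothetical task: 18 s of work, 3 intervals, 2 failures that each
# lose 3 s of work and cost 1 s to restart.
tw = wall_clock_time(18.0, 2.0, 3, [3.0, 3.0], [1.0, 1.0])
print(tw)
```

With these made-up inputs the task spends 18 s on productive work, 4 s on two checkpoints, 6 s on roll-back loss, and 2 s on restarts.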

Optimization of the Number of Checkpoints: New formula • Theorem 1: • x*: the optimal number of checkpointing intervals • Te: task execution length (productive length) • E(Y): task’s expected # of failures (characterized by MNOF) • C: checkpoint cost (time increment per checkpoint) • Formula (3): x* = sqrt(Te*E(Y)/(2*C)) • Example: • A task’s productive length is 18 seconds, C = 2 sec, and the expected # of failures during its execution is 2 • Optimal # of checkpointing intervals = sqrt(18*2/(2*2)) = 3 • Optimal checkpointing interval = 18/3 = 6 seconds
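The worked example can be checked numerically. The closed form used below, x* = sqrt(Te*E(Y)/(2*C)), is read off the slide's example (sqrt(18*2/(2*2)) = 3). The cost model in `expected_wall_clock` is an assumption of this sketch, not something stated on the slide: it supposes that each failure loses half a checkpointing interval on average.

```python
import math


def optimal_intervals(te: float, expected_failures: float, c: float) -> float:
    """Closed form x* = sqrt(Te * E(Y) / (2 * C)), as in the slide's example."""
    return math.sqrt(te * expected_failures / (2.0 * c))


def expected_wall_clock(te: float, expected_failures: float, c: float,
                        x: int, restart_cost: float = 0.0) -> float:
    """Expected Tw, ASSUMING each failure loses half an interval on average."""
    interval = te / x
    return te + c * (x - 1) + expected_failures * (interval / 2.0 + restart_cost)


te, mnof, c = 18.0, 2.0, 2.0
x_star = optimal_intervals(te, mnof, c)

# Brute-force search over integer interval counts as a cross-check.
best_x = min(range(1, 19), key=lambda x: expected_wall_clock(te, mnof, c, x))

print(x_star, te / x_star, best_x)
```

Under that assumed cost model the brute-force minimum and the closed form agree for the slide's numbers. When x* is not an integer, comparing the two neighbouring integers with `expected_wall_clock` is one simple way to pick between them.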

Optimization of the Number of Checkpoints: Discussion • Formula (3) does not depend on the probability distribution of failures, unlike Young’s formula • Young’s formula (proposed in 1974) • Optimal checkpoint interval: Tc = sqrt(2*C*Tf) • C: checkpointing cost • Tf: mean time between failures (MTBF) • Conditions: • Task failure intervals follow an exponential distribution • Checkpoint cost C is far smaller than the checkpoint interval Tc • Both conditions come from the Taylor-series, second-order approximation used in its derivation
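Young's formula is the well-known Tc = sqrt(2*C*Tf). A minimal sketch follows, with a helper that flags when the "C far smaller than Tc" condition looks doubtful. The 10% threshold in `cost_is_small` is an arbitrary assumption for illustration, not a value from the talk.

```python
import math


def young_interval(c: float, mtbf: float) -> float:
    """Young's optimal checkpoint interval: Tc = sqrt(2 * C * Tf)."""
    if c <= 0 or mtbf <= 0:
        raise ValueError("c and mtbf must be positive")
    return math.sqrt(2.0 * c * mtbf)


def cost_is_small(c: float, interval: float, ratio: float = 0.1) -> bool:
    """Heuristic check of the 'C << Tc' condition (threshold is hypothetical)."""
    return c <= ratio * interval


c = 2.0
for mtbf in (9.0, 179.0, 4199.0):
    tc = young_interval(c, mtbf)
    print(f"MTBF={mtbf:7.1f}s  Tc={tc:7.2f}s  C small enough: {cost_is_small(c, tc)}")
```

For short tasks with frequent failures (small MTBF), the computed interval is only a few times larger than C, which is exactly the regime where the approximation behind Young's formula is weakest.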

Optimization of the Number of Checkpoints: Discussion • The exponential-distribution assumption makes Young’s formula unsuitable for Google task processing • [Figure: distribution of Google task failure intervals, grouped by priority]

Optimization of the Number of Checkpoints : Discussion • Corollary 1: Young’s formula is a special case • Two important conditions: • Task failure intervals follow exponential distribution • Checkpointing cost is small
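The special-case relationship can be illustrated numerically. The sketch below assumes that the expected number of failures and the MTBF are linked by E(Y) = Te / Tf, which is a simplification introduced here for illustration (reasonable when checkpoint cost is small), not a statement taken from the slides. The closed form x* = sqrt(Te*E(Y)/(2*C)) is the one implied by the Theorem 1 example.

```python
import math


def interval_from_formula3(te: float, expected_failures: float, c: float) -> float:
    """Checkpoint interval Te / x*, with x* = sqrt(Te * E(Y) / (2 * C))."""
    x_star = math.sqrt(te * expected_failures / (2.0 * c))
    return te / x_star


def interval_from_young(c: float, mtbf: float) -> float:
    """Young's interval Tc = sqrt(2 * C * Tf)."""
    return math.sqrt(2.0 * c * mtbf)


# (Te, E(Y), C) triples; the last two are made-up values.
cases = [(18.0, 2.0, 2.0), (600.0, 3.0, 5.0), (3600.0, 0.5, 1.0)]
for te, mnof, c in cases:
    mtbf = te / mnof  # ASSUMPTION: failures spread evenly over productive time
    a = interval_from_formula3(te, mnof, c)
    b = interval_from_young(c, mtbf)
    print(f"Te={te:7.1f}  formula (3): {a:8.3f}s  Young: {b:8.3f}s")
```

Under that substitution the two intervals coincide algebraically, since Te / sqrt(Te*E(Y)/(2*C)) = sqrt(2*C*Te/E(Y)).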

Optimization of the Number of Checkpoints: Discussion • Our formula (3) is easier to apply in practice than Young’s formula - Young’s formula depends on MTBF, which may not be easy to predict precisely because of: • Unsynchronized clocks across hosts • Inevitable influence of checkpointing cost • Significant delay in failure detection - By contrast, MNOF is easy to record accurately
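Recording MNOF only requires counting failure events per task, which involves no timestamps at all. The sketch below shows one way to do this per priority. The event tuple layout and the `"FAIL"` label are hypothetical placeholders, not the actual trace schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event record: (task_id, priority, event_type)
Event = Tuple[str, int, str]


def mnof_by_priority(events: Iterable[Event]) -> Dict[int, float]:
    """Mean number of failures per task, grouped by priority."""
    failures_per_task: Dict[Tuple[int, str], int] = defaultdict(int)
    for task_id, priority, event_type in events:
        # Touch every task so that tasks with zero failures are counted too.
        failures_per_task[(priority, task_id)] += 1 if event_type == "FAIL" else 0

    totals: Dict[int, int] = defaultdict(int)
    counts: Dict[int, int] = defaultdict(int)
    for (priority, _task_id), n_failures in failures_per_task.items():
        totals[priority] += n_failures
        counts[priority] += 1
    return {p: totals[p] / counts[p] for p in counts}


sample_events = [
    ("t1", 1, "SUBMIT"), ("t1", 1, "FAIL"), ("t1", 1, "FAIL"), ("t1", 1, "FINISH"),
    ("t2", 1, "SUBMIT"), ("t2", 1, "FINISH"),
    ("t3", 2, "SUBMIT"), ("t3", 2, "FAIL"), ("t3", 2, "FAIL"), ("t3", 2, "FINISH"),
]
print(mnof_by_priority(sample_events))
```

Because the count does not depend on when a failure was detected, clock skew and detection delay do not distort it the way they distort an MTBF estimate.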

Adaptive Optimization of Chpt Positions • Problem: what if the probability distribution of failure intervals (or the failure rate) changes over time? • This is possible because task priority can change • Objective: design an adaptive algorithm that dynamically adjusts to changing failure rates • Question: will the optimal checkpoint positions change as the remaining workload decreases over time? • Solution: • We only need to monitor MNOF, regardless of the decreasing remaining workload, because of Theorem 2 • [Figure: timeline marking the current time, the kth checkpoint, and the (k+1)th checkpoint, asking whether the optimal checkpoint intervals change later on]

Adaptive Optimization of Fault Tolerance (Cont’d) • Theorem 2: relates the optimal # of checkpointing intervals computed at the (k+1)th checkpoint position to the optimal # of checkpointing intervals computed at the kth checkpoint position
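The idea of re-planning at each checkpoint can be sketched as follows. This is an illustration, not a restatement of Theorem 2: it assumes the closed form x* = sqrt(Te*E(Y)/(2*C)) implied by the Theorem 1 example, and it further assumes that the expected number of failures in the remaining work is proportional to the amount of work left. All function names are hypothetical.

```python
import math
from typing import List


def next_interval(remaining_work: float, mnof_full_task: float,
                  full_length: float, c: float) -> float:
    """Length of the next checkpointing interval, recomputed at a checkpoint.

    ASSUMPTION: expected failures scale linearly with remaining work.
    """
    expected_failures = mnof_full_task * remaining_work / full_length
    x = math.sqrt(remaining_work * expected_failures / (2.0 * c))
    x = max(1, round(x))  # at least one interval, integer count
    return remaining_work / x


def plan_intervals(full_length: float, mnof: float, c: float) -> List[float]:
    """Re-plan at every checkpoint while MNOF stays unchanged."""
    intervals: List[float] = []
    remaining = full_length
    while remaining > 1e-9:
        step = next_interval(remaining, mnof, full_length, c)
        intervals.append(step)
        remaining -= step
    return intervals


# Stable MNOF: the re-planned interval stays the same at every checkpoint.
print(plan_intervals(18.0, 2.0, 2.0))

# MNOF estimate jumps from 2 to 8 after the first checkpoint (12 s left):
print(next_interval(12.0, 8.0, 18.0, 2.0))
```

Under these assumptions the interval only moves when the monitored MNOF moves, which is why an adaptive scheme can watch MNOF alone instead of re-deriving the plan from the shrinking workload.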

Local disk vs. Shared disk checkpointing • Characterization based on BLCR • Operation time cost in setting a checkpoint

Performance Evaluation • Experimental Setting • We build a testbed based on the Google trace, in a cluster with hundreds of VM instances running across 16 nodes (16*8 cores, 16*16GB memory, XEN 4.0, BLCR) • We call it GloudSim (Google based cloud simulation system) [under review by HiPC’13] • We reproduce Google task execution as closely as possible to the Google trace, e.g., • Task arrivals are based on the trace or on some distribution • Task memory usage is reproduced from the Google trace • Task failure events are reproduced from the Google trace • Each job is chosen from among all sample jobs in the trace

Performance Evaluation (Cont’d) • Experimental Results • Job’s Workload-Processing Ratio (WPR) • Checkpointing effect with precise prediction (on MNOF and MTBF)

Performance Evaluation (Cont’d) • Distribution of WPR with diff. C/R formulas

Performance Evaluation (Cont’d) • MNOF & MTBF w.r.t. priority in the Google trace • MNOF is stable across task lengths, while MTBF is not (it ranges from 179 to 4,199 seconds)

Performance Evaluation (Cont’d) • Min/Avg/Max WPR with respect to diff. Priorities • Our formula outperforms Young’s formula by 3-10%

Performance Evaluation (Cont’d) • Wall-clock lengths of 10,000 job executions • Conclusion: job wall-clock lengths are often 50-100 seconds longer under Young’s formula than under ours.

Performance Evaluation (Cont’d) • Adaptive Algorithm vs. Static Algorithm

Conclusion and Future Work • Selected conclusions: • Our formula (3) outperforms Young’s formula by 3-10 percent for Google task processing • Job wall-clock lengths are 50-100 seconds longer under Young’s formula than under ours • The worst WPR under the dynamic algorithm stays at about 0.8, compared to 0.5 under the static algorithm • Future work • Port our theorems to more cases, such as MPI over Cloud platforms

Thanks for your attention! Contact me at: disheng222@gmail.com
