Analysis of Checkpointing Schemes for Multiprocessor Systems

Avi Ziv, Jehoshua Bruck • Presentation By: Emre Chasan Moustafa

Outline • Introduction • Checkpointing • Execution Of A Task • Performance Analysis • Analysis Technique • Building The State Machine • Creating the Markov Chain • Analyzing the Scheme Using the MRM • Scheme Comparison • Average Execution Time • Average Work • Conclusion

Checkpointing • A technique for adding fault tolerance to distributed shared memory systems. • Reduces the time spent retrying a task after a failure • Hence reduces the average execution time of a task • Important in many applications • Real-time systems with hard deadlines, • Transaction systems, where high availability is required.

Checkpointing (2) • Basically serves two purposes: • Detecting faults that occurred during the execution of a task, • Reducing the time spent in recovering from faults. • Achieved by • Duplicating the task onto two or more processors • Comparing the states of the processors at the checkpoints.

Execution Of A Task • In each step, all the processors assigned to the task execute one interval of it. • The scheme then performs the operations necessary to achieve fault detection and recovery: • Store the states of the processors in the stable storage • Compare those states. • If no fault occurred • The execution of the task resumes with the next interval in the next step. • Otherwise • The checkpoint processor performs operations to recover from the fault.
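The step loop above can be sketched as a small simulation. This is a minimal sketch, not any of the schemes analyzed in the paper: it assumes a plain rollback policy (any mismatch repeats the interval), independent faults with a fixed per-interval probability, and that a faulty state never matches another state. The names `run_task`, `fault_prob` and `n_processors` are hypothetical.

```python
import random

def run_task(n_intervals, fault_prob, rng, n_processors=2):
    """Count the steps needed to finish a task under a plain rollback policy."""
    steps = 0
    verified = 0  # intervals whose checkpoints have been compared and matched
    while verified < n_intervals:
        steps += 1
        # Every processor executes the current interval; True marks a fault.
        faulty = [rng.random() < fault_prob for _ in range(n_processors)]
        # Assumption: the stored states match only if no processor was faulty.
        if not any(faulty):
            verified += 1  # checkpoint verified, continue with the next interval
        # Otherwise roll back: the same interval is executed again next step.
    return steps

rng = random.Random(0)
runs = [run_task(20, 0.1, rng) for _ in range(2000)]
mean_steps = sum(runs) / len(runs)
expected = 20 / (1 - 0.1) ** 2  # geometric retries per interval under these assumptions
print(mean_steps, expected)
```

Under these assumptions each interval needs a geometrically distributed number of steps, so the simulated mean should land close to the closed-form value printed next to it.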

Performance Analysis • Important when • Trying to evaluate and compare different schemes • Checking whether a scheme achieves its goals in a certain system. • Making simulations for performance evaluation • Leads to long and time-consuming evaluation • Using a simplified fault model • Provides only approximate results • This paper describes an analysis technique for studying the performance of checkpointing schemes for fault tolerance • Provides a way to compare various schemes and select optimal values for some parameters of the scheme, like the number of checkpoints.

Analysis Technique • Based on the analysis of a discrete-time Markov Reward Model (MRM) • Done in 3 steps • The analyzed scheme is modelled as a state-machine. • The edges of the state-machine are assigned transition probabilities according to the events that cause the transition and the fault model used. • The Markov chain, created by the first two steps, is analyzed, and values for the properties of interest are derived.

Analysis Technique (2) • An example using the DMR-B-1 scheme • Task is executed by two processors in parallel

Building The State Machine • Describes the behaviour of the scheme in the eyes of an external viewer, who can observe the faults that occurred during a step. • Each transition in the state-machine represents one step. • Each transition has associated with it a set of properties called rewards. • For the execution time of the schemes, we use two rewards • v_i: The amount of useful work that is done during the transition. • t_i: The time it takes to complete the step that corresponds to the transition.

Building The State Machine (2) • In the DMR-B-1 scheme, operation has two basic modes. • The first mode is the normal operation mode, where two processors are executing the task in parallel. • The second mode is the fault recovery mode, where a single processor tries to find a match to an unverified checkpoint. • The execution in the previous figure causes the following transitions (the numbers above the arrows are the edges that are used for the transitions)

Creating the Markov Chain • Involves assigning probabilities to each of the transitions in the state-machine constructed in the first step. • The probabilities assigned to the edges are determined by the fault model. • F is the probability that a processor will have a fault while executing an interval. • Transition description for the DMR-B-1 extended state-machine:
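As an illustration of how F turns into edge probabilities, the sketch below lists the events for one step executed by two processors. It assumes faults in the two processors are independent, which the slides do not state; the function name and event labels are hypothetical and do not reproduce the DMR-B-1 transition table.

```python
def step_event_probabilities(F):
    """Probabilities of the fault events in one step with two processors.

    Assumes each processor fails independently with probability F.
    """
    return {
        "no_fault": (1 - F) ** 2,     # both processors finish the interval correctly
        "one_fault": 2 * F * (1 - F), # exactly one of the two processors is faulty
        "two_faults": F ** 2,         # both processors are faulty
    }

events = step_event_probabilities(0.1)
print(events, sum(events.values()))
```

The three events are exhaustive and mutually exclusive, so their probabilities sum to 1, which is a quick sanity check for any row of a transition matrix built this way.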

Analyzing the Scheme Using the MRM • To solve the MRM, construct the transition matrix of the Markov chain. • Each entry p_{i,j} is the probability of transition from state i to state j. • Two ways to analyze a Markov chain • Transient analysis • We look at the state probabilities at each step, and from those probabilities get the desired quantities. • Steady-state or limiting analysis. • We look at the state probabilities in the limit as t→∞. • In this paper we use the steady-state analysis.
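A steady-state analysis of a small chain can be sketched as follows. The two-state chain here (state 0 for normal operation, state 1 for recovery) and its probabilities are hypothetical and are not the DMR-B-1 matrix. The sketch uses power iteration, which assumes the chain is irreducible and aperiodic, and it assumes the average time per unit of useful work is the ratio of the average time reward to the average work reward per step.

```python
def steady_state(P, iterations=1000):
    """Approximate the stationary distribution pi = pi * P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def average_reward(pi, P, R):
    """Expected reward per step: sum over edges of pi_i * p_ij * r_ij."""
    n = len(P)
    return sum(pi[i] * P[i][j] * R[i][j] for i in range(n) for j in range(n))

F = 0.1                  # per-interval fault probability (illustrative value)
t_step = 0.05 + 0.002    # interval time plus overhead (illustrative values)
a = (1 - F) ** 2         # hypothetical: normal step succeeds if both processors are fault-free
q = 1 - F                # hypothetical: recovery step succeeds if one processor is fault-free

P = [[a, 1 - a],         # from normal: stay normal, or enter recovery
     [q, 1 - q]]         # from recovery: return to normal, or stay in recovery
V = [[1, 0],             # useful work (in intervals) earned on each edge
     [1, 0]]
T = [[t_step, t_step],   # time spent on each edge
     [t_step, t_step]]

pi = steady_state(P)
v_bar = average_reward(pi, P, V)
t_bar = average_reward(pi, P, T)
time_per_interval = t_bar / v_bar
print(pi, time_per_interval)
```

Multiplying `time_per_interval` by the number of intervals would then give an estimate of the task's execution time under this toy model.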

Analysis of DMR-B-1 • Applying results to the DMR-B-1 scheme: • The transition matrix of the scheme is: • The steady-state probabilities are: • And the average execution time of a task:

Simulation Results • The comparison was made for a task of length 1 with 20 checkpoints (n = 20, t_l = 0.05), t_ck = 0.001 and t_ld = 0.003. • The simulation points fall on the line of the analytical plot. • For the other schemes as well, the analytical and simulation results match well. • Figure: Comparison between analytical and simulation results of the average execution time for the DMR-B-1 scheme

Scheme Comparison • TMR-F scheme • The task is executed by three processors, all of them executing the same interval. • A fault in a single processor can be recovered without a rollback because two processors with correct execution still agree on the checkpoint. • If faults occur in more than one processor, all the processors are rolled back and execute the same interval again. • DMR-B-2 scheme • Two processors execute the task. • Whenever a fault occurs, both processors are rolled back and execute the same interval again. • The difference between this scheme and simple rollback schemes, like TMR-F, is that all the unverified checkpoints are stored and compared, not just the checkpoints of the last step. • Two steps with a single fault are enough to verify an interval.
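The TMR-F rule described above (a step is verified when at most one of the three processors is faulty) gives a simple closed form for the per-step success probability. The sketch below assumes independent faults with probability F per processor and that faulty states never agree with each other; the function names are hypothetical.

```python
def tmr_f_success_probability(F):
    """Probability that a TMR-F step verifies its interval.

    The step succeeds with zero faults, or with exactly one fault
    (the two correct processors still agree on the checkpoint).
    """
    return (1 - F) ** 3 + 3 * F * (1 - F) ** 2

def expected_steps(p_success):
    """Mean number of steps per interval when each retry succeeds independently."""
    return 1.0 / p_success

p = tmr_f_success_probability(0.05)
print(p, expected_steps(p))
```

Because every failed step simply repeats the interval, the number of steps per interval is geometric, which is why its mean is the reciprocal of the success probability.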

Scheme Comparison (2) • DMR-F-1 scheme • Uses spare processors and the roll-forward recovery technique in order to avoid rollback • Two processors are used during fault-free steps. • Three additional spare processors are added for a single step after each fault to try to recover without a rollback. • Roll-Forward Checkpointing Scheme (RFCS) • A spare processor is used in fault recovery in order to avoid rollback. • The difference between the DMR-F-1 and RFCS schemes is that RFCS uses only one spare processor and the recovery takes two steps instead of one step in DMR-F-1.

Scheme Comparison (3) • Two properties are compared: • Average execution time • Important in real-time systems where fast response is desired • Average work used to complete the execution of a task • Important in transaction systems, where high availability of the system is required, and so the system should use as few resources as possible.

Simplified Model • To obtain general properties of the schemes without the influence of a specific implementation • The time to execute each step is • t_s + t_oh, where t_oh is the overhead time required by the scheme. • Using the simplified model, and a task with n intervals (t_l = 1/n) • The average execution time: • The total work of a task:

Average Execution Time • The average execution time of a task with n checkpoints is: where S is the average number of steps it takes to complete an interval. • The average execution time of the four schemes
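The trade-off behind choosing the number of checkpoints can be sketched numerically. One reading consistent with the simplified model is T(n) = n · S · (1/n + t_oh); this form, the Poisson fault model with rate `fault_rate`, and the plain two-processor rollback expression for S are all assumptions made for illustration, not formulas taken from the slides.

```python
import math

def average_execution_time(n, fault_rate, t_oh):
    """Assumed model: T(n) = n * S * (1/n + t_oh) for a task of length 1."""
    interval = 1.0 / n
    # Assumption: faults arrive as a Poisson process while an interval executes.
    F = 1.0 - math.exp(-fault_rate * interval)
    # Assumption: a step succeeds only if both processors are fault-free.
    S = 1.0 / (1.0 - F) ** 2
    return n * S * (interval + t_oh)

def best_checkpoint_count(fault_rate, t_oh, n_max=200):
    """Search n = 1..n_max for the smallest average execution time."""
    return min(range(1, n_max + 1),
               key=lambda n: average_execution_time(n, fault_rate, t_oh))

best_n = best_checkpoint_count(fault_rate=0.5, t_oh=0.002)
print(best_n, average_execution_time(best_n, 0.5, 0.002))
```

More checkpoints shorten each interval and so lower F and S, but every extra checkpoint adds t_oh, so the minimum sits at an intermediate n.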

Average Execution Time (2) • As seen from the figure: • TMR-F scheme has the lowest execution time. • Because it uses more processors than the other schemes • Has a much lower probability of failing to find two matching checkpoints. • DMR-B-2 scheme is the worst • Because it uses only two processors • Does not use spare processors to try to overcome the failure. • The RFCS and DMR-F-1 schemes use spare processors during fault recovery, and thus have better performance than DMR-B-2. • Figure: Average execution time with optimal checkpoints

Average Work • Applying the precise model, the four schemes give the following formulas: (Average work for a task of length 1 with overhead time t_oh = 0.002)

Average Work (2) • The results here are the reverse of the results in the average execution time. • The best scheme here is the DMR-B-2 because • it always uses only two processors. • The RFCS and DMR-F-1, which use 2 processors during normal execution and add spare processors during fault recovery, require more work. • The TMR-F scheme, which uses 3 processors, is the worst scheme.
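The reversal between execution time and work can be reproduced with a rough calculation. This sketch assumes work equals the number of processors times the execution time (every processor busy on every step), uses the assumed form T = n · S · (1/n + t_oh), and compares TMR-F against a hypothetical plain two-processor rollback scheme rather than DMR-B-2 itself; all values are illustrative.

```python
def time_and_work(n, S, processors, t_oh):
    """Assumed model: time = n * S * (1/n + t_oh), work = processors * time."""
    time = n * S * (1.0 / n + t_oh)
    return time, processors * time

F, n, t_oh = 0.05, 20, 0.002

# TMR-F: a step succeeds with at most one faulty processor out of three.
S_tmr = 1.0 / ((1 - F) ** 3 + 3 * F * (1 - F) ** 2)
# Hypothetical two-processor rollback: a step succeeds only with no faults.
S_dup = 1.0 / (1 - F) ** 2

time_tmr, work_tmr = time_and_work(n, S_tmr, 3, t_oh)
time_dup, work_dup = time_and_work(n, S_dup, 2, t_oh)
print(time_tmr, work_tmr)
print(time_dup, work_dup)
```

With these assumptions the three-processor scheme finishes sooner but consumes more processor time, matching the direction of the comparison on the slide.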

Conclusions • A novel technique to analyze the performance of checkpointing schemes is presented. • The technique is based on modeling the schemes under a given fault model with a Markov Reward Model • Results show that: • Generally the number of processors has a major effect on both quantities. • When a scheme uses more processors, its execution time decreases, while the total work increases. • The complexity of the scheme has only a minor effect on its performance. • The proposed technique is not limited to the schemes described in this paper, or to the fault model used here. • It can be used to analyze any checkpointing fault-tolerance scheme, with various fault models.
