Analysis of Checkpointing Schemes for Multiprocessor Systems Avi Ziv Jehoshua Bruck Presentation By: Emre Chasan Moustafa
Outline • Introduction • Checkpointing • Execution Of A Task • Performance Analysis • Analysis Technique • Analysis technique • Building The State Machine • Creating the Markov Chain • Analyzing the Scheme Using the MRM • Scheme Comparison • Average Execution Time • Average Work • Conclusion
Checkpointing • A technique for adding fault tolerance to distributed shared-memory systems. • Reduces the time spent retrying a task after a failure • Hence reduces the average execution time of a task • Important in many applications • Real-time systems with hard deadlines, • Transaction systems, where high availability is required.
Checkpointing (2) • Basically serves two purposes: • Detecting faults that occurred during the execution of a task, • Reducing the time spent in recovering from faults. • Achieved by • Duplicating the task onto two or more processors • Comparing the states of the processors at the checkpoints.
Execution Of A Task • Execute one interval of the task on all the processors that are assigned to it. • Perform the operations necessary to achieve fault detection and recovery: • Store the states of the processors in stable storage • Compare those states. • If no fault occurred • Execution of the task resumes with the next interval in the next step. • Otherwise • The checkpoint processor performs operations to recover from the fault.
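The step loop above can be sketched as a small simulation. This is only an illustration under stated assumptions, not one of the schemes analyzed in the paper: it uses two processors with plain rollback, faults are independent with a fixed per-interval probability, and a faulty state never matches any other state. The function name `run_duplex_task` and its parameters are hypothetical.

```python
import random

def run_duplex_task(n_intervals, fault_prob, rng):
    """Simulate a duplicated task with plain rollback; return the steps used.

    Assumptions (not from the source): independent faults, and a faulty
    processor state never matches the other processor's state.
    """
    steps = 0
    verified = 0
    while verified < n_intervals:
        steps += 1
        # Both processors execute the same interval; each may suffer a fault.
        fault_a = rng.random() < fault_prob
        fault_b = rng.random() < fault_prob
        # Store and compare the two states: they match only if neither faulted.
        if not (fault_a or fault_b):
            verified += 1  # checkpoint verified, move on to the next interval
        # Otherwise: roll back and repeat the same interval in the next step.
    return steps

rng = random.Random(0)
steps = run_duplex_task(n_intervals=20, fault_prob=0.05, rng=rng)
```

With `fault_prob=0.0` the task finishes in exactly `n_intervals` steps; any fault adds one repeated step. A `fault_prob` of 1.0 would never terminate, so the sketch assumes a value below 1.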
Performance Analysis • Important when • Trying to evaluate and compare different schemes • Checking if a scheme achieves its goals in a certain system. • Making simulations for performance evaluation • Leads to long and time-consuming evaluation • Using a simplified fault model • Provides only approximate results • This paper describes an analysis technique for studying the performance of checkpointing schemes for fault tolerance • Provides a way to compare various schemes and select optimal values for some parameters of the scheme, like the number of checkpoints.
Analysis Technique • Based on the analysis of a discrete-time Markov Reward Model (MRM) • Done in 3 steps • The analyzed scheme is modelled as a state-machine. • The edges of the state-machine are assigned transition probabilities according to the events that cause the transition and the fault model used. • The Markov chain, created by the first two steps, is analyzed, and values for the properties of interest are derived.
Analysis Technique (2) • An example using the DMR-B-1 scheme • Task is executed by two processors in parallel
Building The State Machine • Describes the behaviour of the scheme in the eyes of an external viewer, who can observe the faults that occurred during a step. • Each transition in the state-machine represents one step. • Each transition has associated with it a set of properties called rewards. • For the execution time of the schemes, we use two rewards • vi - The amount of useful work that is done during the transition. • ti - The time it takes to complete the step that corresponds to the transition.
Building The State Machine (2) • In the DMR-B-1 scheme, operation has two basic modes. • The first mode is the normal operation mode, where two processors execute the task in parallel. • The second mode is the fault recovery mode, where a single processor tries to find a match to an unverified checkpoint. • The execution in the previous figure causes the following transitions (the numbers above the arrows are the edges that are used for the transitions)
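To make the two rewards concrete, here is a minimal sketch that attaches a useful-work reward and a time reward to each transition and sums them along a path of steps. The class and function names are hypothetical, and the interval length and overhead are illustrative values rather than the paper's parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    useful_work: float  # reward v_i: useful work done during the transition
    step_time: float    # reward t_i: time needed to complete the step

def accumulate(path):
    """Sum both rewards over a sequence of transitions (one per step)."""
    total_work = sum(tr.useful_work for tr in path)
    total_time = sum(tr.step_time for tr in path)
    return total_work, total_time

# Illustrative values only: interval length and per-step overhead.
INTERVAL = 0.05
OVERHEAD = 0.002

# A successful step verifies one interval; a rollback step does no useful work.
success = Transition(useful_work=INTERVAL, step_time=INTERVAL + OVERHEAD)
rollback = Transition(useful_work=0.0, step_time=INTERVAL + OVERHEAD)

work, time_spent = accumulate([success, rollback, success])
```

A rollback step costs the same time as a successful one but contributes nothing to `work`, which is how failed steps lengthen the execution time of a task.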
Creating the Markov Chain • Involves assigning probabilities to each of the transitions in the state-machine constructed in the first step. • The probabilities assigned to the edges are determined by the fault model. • F is the probability that a processor will have a fault while executing an interval. • Transition description for the DMR-B-1 extended state-machine:
Analyzing the Scheme Using the MRM • To solve the MRM, construct the transition matrix of the Markov chain. • Each entry pi,j is the probability of a transition from state i to state j. • Two ways to analyze a Markov chain • Transient analysis • We look at the state probabilities at each step, and from those probabilities get the desired quantities. • Steady-state or limiting analysis. • We look at the state probabilities in the limit as t→∞. • In this paper we use the steady-state analysis.
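As a sketch of how edge probabilities follow from F, consider a hypothetical two-state machine with a normal mode and a recovery mode. This is a deliberate simplification for illustration and is not the DMR-B-1 extended state-machine from the paper. It assumes independent faults, that the normal mode succeeds only when both processors are fault-free, and that recovery succeeds when the single recovering processor is fault-free.

```python
def build_edges(F):
    """Return {(source, destination): probability} for a toy two-state machine.

    Hypothetical simplification, not the paper's DMR-B-1 state-machine.
    F is the probability that a processor has a fault during one interval.
    """
    both_ok = (1 - F) ** 2
    return {
        ("normal", "normal"): both_ok,        # neither processor faulted
        ("normal", "recovery"): 1 - both_ok,  # at least one processor faulted
        ("recovery", "normal"): 1 - F,        # recovering processor was fault-free
        ("recovery", "recovery"): F,          # recovering processor faulted again
    }

def outgoing_sums(edges):
    """Sum the probabilities leaving each state; each sum should equal 1."""
    sums = {}
    for (src, _dst), prob in edges.items():
        sums[src] = sums.get(src, 0.0) + prob
    return sums

edges = build_edges(0.1)
sums = outgoing_sums(edges)
```

Checking that the outgoing probabilities of every state sum to 1 is a quick sanity test before the chain is analyzed.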
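A steady-state analysis can be sketched in a few lines. The transition matrix and reward matrices below are made up for illustration and are not those of DMR-B-1. The sketch uses power iteration, which is one common way to find the limiting distribution of a small ergodic chain, and then forms the ratio of average time per step to average useful work per step, which is assumed here to be the quantity of interest.

```python
def steady_state(P, iterations=10_000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # One step of the chain: new_pi[j] = sum_i pi[i] * P[i][j]
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

def average_reward(pi, P, R):
    """Expected reward per step: sum over i, j of pi[i] * P[i][j] * R[i][j]."""
    n = len(P)
    return sum(pi[i] * P[i][j] * R[i][j] for i in range(n) for j in range(n))

# Illustrative two-state chain (hypothetical values).
P = [[0.9, 0.1],
     [0.5, 0.5]]
V = [[1.0, 0.0],   # useful work: one unit whenever the chain enters state 0
     [1.0, 0.0]]
T = [[1.0, 1.0],   # every step takes one time unit
     [1.0, 1.0]]

pi = steady_state(P)
time_per_unit_work = average_reward(pi, P, T) / average_reward(pi, P, V)
```

For this toy chain the stationary distribution is (5/6, 1/6), so each unit of useful work costs 1.2 steps on average.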
Analysis of DMR-B-1 • Applying results to the DMR-B-1 scheme: • The transition matrix of the scheme is: • The steady-state probabilities are: • And the average execution time of a task:
Simulation Results • The comparison was made for a task of length 1 with 20 checkpoints (n = 20, tl = 0.05), tck = 0.001 and tld = 0.003. • The simulation points fall on the line of the analytical plot. • For the other schemes, too, the analytical and simulation results match well. Comparison between analytical and simulation results of the average execution time for the DMR-B-1 scheme
Scheme Comparison • TMR-F scheme • The task is executed by three processors, all of them executing the same interval. • A fault in a single processor can be recovered without a rollback because two processors with correct execution still agree on the checkpoint. • If faults occur in more than one processor, all the processors are rolled back and execute the same interval again. • DMR-B-2 scheme • Two processors execute the task. • Whenever a fault occurs, both processors are rolled back and execute the same interval again. • The difference between this scheme and simple rollback schemes, like TMR-F, is that all the unverified checkpoints are stored and compared, not just the checkpoints of the last step. • Two steps with a single fault are enough to verify an interval.
Scheme Comparison (2) • DMR-F-1 scheme • Uses spare processors and the roll-forward recovery technique in order to avoid rollback • Two processors are used during fault-free steps. • Three additional spare processors are added for a single step after each fault to try to recover without a rollback. • Roll-Forward Checkpointing Scheme (RFCS) • A spare processor is used in fault recovery in order to avoid rollback. • The difference between the DMR-F-1 and RFCS schemes is that RFCS uses only one spare processor and the recovery takes two steps instead of one step in DMR-F-1.
Scheme Comparison (3) • Two properties are compared: • Average execution time • Important in real-time systems where fast response is desired • Average work used to complete the execution of a task • Important in transaction systems, where high availability of the system is required, and so the system should use as few resources as possible.
Simplified Model • To obtain general properties of the schemes without the influence of a specific implementation • The time to execute each step is ts + toh, where toh is the overhead time required by the scheme. • Using the simplified model, and a task with n intervals (tl = 1/n) • The average execution time: • The total work of a task:
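The TMR-F description above implies a simple per-step success probability, sketched here under two assumptions that are not stated in the source: faults in different processors are independent, and two faulty processors never produce matching checkpoints. A step then succeeds when at most one of the three processors faults, and the number of steps per interval is geometric. The two-processor function is a plain rollback baseline for contrast, not DMR-B-2, which also compares older unverified checkpoints.

```python
def tmr_f_success_prob(F):
    """Probability that at least two of three processors are fault-free."""
    return (1 - F) ** 3 + 3 * F * (1 - F) ** 2

def plain_duplex_success_prob(F):
    """Baseline only: two processors, success requires both to be fault-free."""
    return (1 - F) ** 2

def expected_steps(success_prob):
    """Mean of a geometric distribution: average steps until the first success."""
    return 1.0 / success_prob

F = 0.1  # illustrative per-interval fault probability
tmr_steps = expected_steps(tmr_f_success_prob(F))
duplex_steps = expected_steps(plain_duplex_success_prob(F))
```

For F = 0.1 the triplicated step succeeds with probability 0.972, against 0.81 for the plain two-processor baseline, which illustrates why the third processor reduces the number of repeated steps.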
Average Execution Time • The average execution time of a task with n checkpoints is: where S is the average number of steps it takes to complete an interval. • The average execution time of the four schemes
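One way to see how the number of checkpoints trades overhead against repeated work is to combine the simplified step time with an assumed fault model. The sketch below assumes that the task has length 1, so each of the n intervals has length 1/n and each step costs 1/n plus the overhead, giving an execution time of n * S * (1/n + toh). It further assumes, purely for illustration, Poisson faults with a hypothetical rate `fault_rate` and a plain two-processor rollback scheme, so that S = 1 / (1 - F)^2 with F = 1 - exp(-fault_rate / n). None of these modelling choices is confirmed by the slides.

```python
import math

def average_execution_time(n, overhead, fault_rate):
    """Illustrative T(n) = n * S * (1/n + overhead) for a task of length 1.

    Assumed model (not from the source): Poisson faults with rate fault_rate,
    two processors, plain rollback, so a step succeeds with prob (1 - F)**2.
    """
    interval = 1.0 / n
    F = 1.0 - math.exp(-fault_rate * interval)  # per-interval fault probability
    steps_per_interval = 1.0 / (1.0 - F) ** 2   # S: geometric mean step count
    return n * steps_per_interval * (interval + overhead)

def best_checkpoint_count(overhead, fault_rate, max_n=200):
    """Scan n = 1..max_n and return the n with the smallest average time."""
    return min(range(1, max_n + 1),
               key=lambda n: average_execution_time(n, overhead, fault_rate))

best_n = best_checkpoint_count(overhead=0.002, fault_rate=0.5)
```

Few checkpoints make each rollback expensive, while many checkpoints pay the overhead too often, so under this model the execution time has a minimum at an intermediate n.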
Average Execution Time (2) • As seen from the figure: • The TMR-F scheme has the lowest execution time • Because it uses more processors during normal execution than the other schemes • And so has a much lower probability of failing to find two matching checkpoints. • The DMR-B-2 scheme is the worst • Because it uses only two processors • And does not use spare processors to try to overcome the failure. • The RFCS and DMR-F-1 schemes use spare processors during fault recovery, and thus have better performance than DMR-B-2. Average execution time with optimal checkpoints
Average Work • Applying the precise model, the four schemes give the following formulas: (Average work for a task of length 1 with an overhead time of toh = 0.002)
Average Work (2) • The results here are the reverse of the results in the average execution time. • The best scheme here is the DMR-B-2 because • it always uses only two processors. • The RFCS and DMR-F-1, which use 2 processors during normal execution and add spare processors during fault recovery, require more work. • The TMR-F scheme, which uses 3 processors, is the worst scheme.
Conclusions • A novel technique to analyze the performance of checkpointing schemes is presented. • The technique is based on modeling the schemes under a given fault model with a Markov Reward Model • Results show that: • Generally, the number of processors has a major effect on both quantities. • When a scheme uses more processors, its execution time decreases, while the total work increases. • The complexity of the scheme has only a minor effect on its performance. • The proposed technique is not limited to the schemes described in this paper, or to the fault model used here. • It can be used to analyze any checkpointing fault-tolerance scheme, with various fault models.