Chapter 7: roadmap

7.1 Superstabilization
7.2 Self-Stabilizing Fault-Containing Algorithms
7.3 Error-Detection Codes and Repair
Chapter 7 - Local Stabilization

Introduction
We present a scheme that can be used to correct the state of algorithms for ongoing, long-lived tasks. The scheme converts a non-stabilizing algorithm for such a task into a self-stabilizing algorithm for the same task.

The Malicious Fault Model
Starting from a safe configuration c, k processors experience transient faults and a new configuration c’ is reached. The states of the faulty processors can be chosen to be the states that result in the longest convergence time.

The Malicious Fault Model (2)
• Designing for this worst-case measure minimizes the convergence time in the worst-case scenario
• However, algorithms designed for the worst-case measure may have a larger average convergence time than other algorithms

The Non-malicious Fault Model
• In this model, a transient fault assigns to a processor a state chosen with equal probability from the state space of the processor

Average Convergence Time
• Pr(c, k, c’): the probability of reaching a particular configuration c’ from a safe configuration c due to the occurrence of k faults
• WorstCase(c’): the maximal number of cycles before the system reaches a safe configuration when it starts in c’

Average Convergence Time (2)
• The average convergence time following the occurrence of k non-malicious transient faults is:
Σ [Pr(c, k, c’) · WorstCase(c’)]
computed over all possible configurations c’
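
The sum above can be evaluated directly once the two quantities are tabulated. The following is a minimal sketch; the configuration names and all numbers are hypothetical and chosen only to illustrate the weighted sum.

```python
from typing import Dict


def average_convergence_time(pr: Dict[str, float], worst_case: Dict[str, int]) -> float:
    """Sum Pr(c, k, c') * WorstCase(c') over all configurations c'."""
    return sum(p * worst_case[c_prime] for c_prime, p in pr.items())


# Hypothetical toy data: three configurations reachable from a safe
# configuration c after k faults, with their probabilities and the
# worst-case number of cycles needed to converge from each of them.
pr = {"c1": 0.5, "c2": 0.25, "c3": 0.25}
worst_case = {"c1": 0, "c2": 4, "c3": 8}

print(average_convergence_time(pr, worst_case))
```

A configuration with a long worst-case convergence time contributes little to the average when it is unlikely to be reached, which is the motivation for the non-malicious model.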

Error Detection Codes
• We use error-detection codes to reduce the average convergence time
• For each processor we maintain a variable ErrorDetect holding the error-detection code ed of its current state s
• The error-detecting function computes a pair <s, ed> given s

Converting the Algorithm
Replace every step a by a step a’ that does the following:
1. Examine whether the value of ErrorDetect fits the current state
2. If (1) holds, execute a
3. Otherwise, execute a special repair step a’’
4. Compute the new ed’ by applying the error-detecting function to the resulting state s’
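
The conversion of a step a into a step a’ can be sketched as a wrapper. This is only an illustration: using CRC-32 as the error-detecting function is an assumption, and `step` and `repair` are hypothetical stand-ins for a and a’’.

```python
import zlib
from typing import Callable, Tuple


def error_detect(state: bytes) -> int:
    """Stand-in error-detecting function (assumption: CRC-32 of the state)."""
    return zlib.crc32(state)


def converted_step(
    state: bytes,
    ed: int,
    step: Callable[[bytes], bytes],
    repair: Callable[[bytes], bytes],
) -> Tuple[bytes, int]:
    """Step a': run `step` if ed fits the state, otherwise run `repair`."""
    if error_detect(state) == ed:      # (1) ErrorDetect fits the current state
        new_state = step(state)        # execute the original step a
    else:
        new_state = repair(state)      # execute the repair step a''
    return new_state, error_detect(new_state)  # new pair <s', ed'>


def step(state: bytes) -> bytes:
    """Hypothetical original step: increment a one-byte counter."""
    return bytes([(state[0] + 1) % 256])


def repair(state: bytes) -> bytes:
    """Hypothetical repair step: fall back to an initial state."""
    return b"\x00"


good = b"\x05"
print(converted_step(good, error_detect(good), step, repair))
# A corrupted state paired with a stale code takes the repair branch.
print(converted_step(b"\x07", error_detect(good), step, repair))
```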

Converting the Algorithm (2)
• A transient fault can corrupt all the memory bits of a processor
• Thus, the probability that the value of ErrorDetect will fit the state of the faulty processor decreases as the number of bits in ErrorDetect increases
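
To make the dependence on the number of bits concrete, the sketch below enumerates every possible memory content of a small processor and counts how often ErrorDetect happens to fit the state. It assumes the non-malicious model (every bit pattern equally likely); the function `code` is a hypothetical stand-in for the error-detecting function.

```python
def match_probability(state_bits: int, ed_bits: int) -> float:
    """Fraction of all (state, ed) bit patterns in which ed fits the state."""

    def code(s: int) -> int:
        # Hypothetical error-detecting function: any deterministic map
        # from states to ed_bits-bit values gives the same count.
        return (s * 2654435761) % (2 ** ed_bits)

    matches = 0
    total = 0
    for s in range(2 ** state_bits):
        for ed in range(2 ** ed_bits):
            total += 1
            if ed == code(s):
                matches += 1
    return matches / total


for bits in (1, 2, 3, 8):
    print(bits, match_probability(4, bits))
```

Exactly one value of ed fits each state, so under this assumption the fraction halves with every bit added to ErrorDetect.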

Pyramids
A pyramid ∆i = vi[0], vi[1], vi[2], …, vi[d] of views is maintained by every processor Pi, where vi[h] is a view of all the processors that are within a distance of no more than h from Pi, h time units ago. In particular, vi[d] is a view of the entire system, d time units ago.

V1[0]: view of V1 now.

V1[1]: view of the colored vertices, one time unit ago.

V1[2]: view of the colored vertices, two time units ago.

V1[3]: view of the colored vertices, three time units ago.

V1[4]: view of the entire system, four time units ago.

V1[5] and V1[6] are views of the entire system as well; they differ only in the time at which the views were taken.

Neighboring Pyramids
• Neighboring processors exchange their pyramids and check agreement on the shared portions
• If the shared portions are equal, then all the v[d] views are equal
• In addition, every processor checks that vi[d] is a consistent configuration for the input algorithm AL and the current task (the configuration is reachable from the initial state of AL)
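
The agreement check on shared portions can be sketched as follows. The representation is an assumption made for illustration: a view is a dictionary from processor id to recorded state, and a pyramid is a list of views indexed by h.

```python
from typing import Dict, List

View = Dict[int, str]   # processor id -> state recorded in this view (assumed layout)


def pyramids_agree(p_i: List[View], p_j: List[View]) -> bool:
    """True iff the two pyramids record the same state for every processor
    that appears in both views at the same level h."""
    for v_i, v_j in zip(p_i, p_j):
        for proc in v_i.keys() & v_j.keys():   # shared portion of level h
            if v_i[proc] != v_j[proc]:
                return False
    return True


# Hypothetical pyramids of two neighbors on the path 0 - 1 - 2 (levels 0 and 1).
p0 = [{0: "a1"}, {0: "a0", 1: "b0"}]
p1 = [{1: "b1"}, {0: "a0", 1: "b0", 2: "c0"}]
p0_corrupted = [{0: "a1"}, {0: "a0", 1: "XX"}]

print(pyramids_agree(p0, p1))
print(pyramids_agree(p0_corrupted, p1))
```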

Checking Consistent Configuration
• Pi checks that its state in the view vi[h], for 0 ≤ h ≤ d-1, is obtained by executing AL using the state of Pi and its neighbors in vi[h+1].
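
A minimal sketch of this check, under the assumption that a view maps processor ids to states and that AL is available as a function of a processor's state and its neighbors' states. The transition function `max_step` is a hypothetical toy AL, not the input algorithm of the text.

```python
from typing import Callable, Dict, List

View = Dict[int, int]   # processor id -> state (assumed layout)


def is_consistent(
    i: int,
    neighbors: List[int],
    pyramid: List[View],
    al_step: Callable[[int, List[int]], int],
) -> bool:
    """Check that P_i's state in v[h] follows from v[h+1] for 0 <= h <= d-1."""
    for h in range(len(pyramid) - 1):
        older = pyramid[h + 1]
        expected = al_step(older[i], [older[j] for j in neighbors])
        if pyramid[h][i] != expected:
            return False
    return True


def max_step(own: int, neighbor_states: List[int]) -> int:
    """Hypothetical toy AL: adopt the maximum value seen in the neighborhood."""
    return max([own] + neighbor_states)


# Pyramid of P_1 on the path 0 - 1 - 2 with d = 2.
pyramid_1 = [{1: 3}, {0: 3, 1: 3, 2: 2}, {0: 3, 1: 1, 2: 2}]
print(is_consistent(1, [0, 2], pyramid_1, max_step))
```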

Updating the Pyramids
• In every time unit, Pi receives the pyramid ∆j = vj[0], vj[1], vj[2], …, vj[d] of every neighbor Pj, and uses the values of vj[d-1] to construct the value of the new vi[d]
• The values of vj[d-1] contain information about every processor at distance d from Pi, d-1 time units ago
• In the same way, Pi uses the received values of vj[k-1], for 1 ≤ k ≤ d-1, (together with vi[k-1]) to compute vi[k]
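
The update rule can be sketched as a merge of views one level down. The representation (a view as a dictionary from processor id to state) and the state labels are assumptions for illustration; level 0 is taken to be the processor's own current state, and levels 1 through d are built from level k-1 of the processor and its neighbors.

```python
from typing import Dict, List

View = Dict[int, str]    # processor id -> state recorded in this view
Pyramid = List[View]     # pyramid[h]: radius-h view, h time units old


def update_pyramid(i: int, state: str, own: Pyramid,
                   received: List[Pyramid], d: int) -> Pyramid:
    """Build the new pyramid of P_i from its old pyramid and its neighbors'."""
    new_pyramid: Pyramid = [{i: state}]            # v_i[0]: own state now
    for k in range(1, d + 1):
        view = dict(own[k - 1])                    # start from v_i[k-1]
        for neighbor_pyramid in received:
            view.update(neighbor_pyramid[k - 1])   # merge each v_j[k-1]
        new_pyramid.append(view)
    return new_pyramid


# Demo on the path 0 - 1 - 2 with d = 2; the state of processor x at time t
# is the hypothetical label "a<t>", "b<t>" or "c<t>".
neighbors = {0: [1], 1: [0, 2], 2: [1]}
d = 2
names = "abc"
pyramids = {i: [{i: f"{names[i]}0"}, {}, {}] for i in neighbors}
for t in (1, 2):
    pyramids = {
        i: update_pyramid(i, f"{names[i]}{t}", pyramids[i],
                          [pyramids[j] for j in neighbors[i]], d)
        for i in neighbors
    }

print(pyramids[0][1])   # neighborhood of P_0, one time unit ago
print(pyramids[0][2])   # entire system, two time units ago
```

After d rounds the top view of every processor covers the whole path, which matches the description of vi[d] as a view of the entire system d time units ago.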

The Repair Scheme
• First, we assume that the error-detection code identifies all the faults
• In general, the faulty processors initialize their states and collect state information from non-faulty processors to reconstruct their pyramids

The Repair Scheme (2)
• Let c’ be a configuration reached after several faults
• Three groups of processors: Faulty, Border-non-faulty, Operating
• A processor that identifies an error assigns faulty to its local status variable and resets its pyramid

Border-Non-Faulty and Operating
• The pyramid of a non-faulty processor that is a neighbor of a faulty processor has almost all the information stored in the faulty processor before the fault
• Such a processor assigns the value border-non-faulty to its local status variable
• The remaining non-faulty processors are defined as operating

[Figure legend: Faulty, Border-non-faulty, Operating]

Freezing the Pyramids
• A border-non-faulty processor does not change its pyramid until all the faulty processors have finished reconstructing theirs
• The Topology Collection procedure is used to verify that

Topology Collection
• Every faulty or border-non-faulty processor sends the topology it knows at that moment to its neighbors
• After several rounds (the diameter of the corrupted region + 1), all the information in the pyramids of processors next to a faulty one has arrived

Topology Collection (2)
• Every processor checks whether there exists a faulty processor that has an edge connected to a processor with an unknown state
• When this test returns false, the pyramids of the faulty processors can be reconstructed
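
The test can be sketched over the collected topology. This is one possible reading, stated as an assumption: a processor's state counts as known once its information has been collected, and the names `edges`, `faulty` and `collected` are hypothetical. The function returns the negation of the test described above, so a result of True means reconstruction can start.

```python
from typing import Iterable, Set, Tuple


def has_all_topology(edges: Iterable[Tuple[int, int]],
                     faulty: Set[int],
                     collected: Set[int]) -> bool:
    """True iff no faulty processor has an edge to a processor whose
    information has not been collected yet."""
    for u, v in edges:
        for a, b in ((u, v), (v, u)):      # edges are undirected
            if a in faulty and b not in collected:
                return False
    return True


# Hypothetical path 0 - 1 - 2 - 3 in which processors 1 and 2 are faulty.
edges = [(0, 1), (1, 2), (2, 3)]
faulty = {1, 2}

print(has_all_topology(edges, faulty, {0, 1, 2}))      # processor 3 still unknown
print(has_all_topology(edges, faulty, {0, 1, 2, 3}))
```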

Reconstruction
• The faulty processors reconstruct their pyramids using the collected information from the other pyramids and the transition functions of the processors

Back to Operating
• Using a local counter and the collected topology, the faulty and border-non-faulty processors conclude when the rest have finished reconstructing their pyramids
• At the end of the repair process, all the processors change their status to operating

The algorithm
State variables:
• Status = {operating, faulty, border-non-faulty}
• Topology = {V, E}
• Pyramid (explained earlier)
• RoundCounter: counts the number of rounds since the occurrence of the most recent fault

The algorithm (cont.)
Upon a clock tick:
if (status = operating)
    if (DetectError())
        status = faulty
        Pyramid = nil
        RoundCounter = 0
    else if (HaveFaultyNeighbor())
        status = border-non-faulty
        RoundCounter = 0
    else UpdatePyramid()
else
    ExchangeLocalTopologyInformation()
    if (HasAllTopology() & status = faulty)
        ReconstructPyramid()
    RoundCounter++
    if (Diameter(Topology) = RoundCounter)
        status = operating
Notes:
• DetectError(): detects whether a transient error occurred, using the error-detection codes
• HaveFaultyNeighbor(): true if one of the neighbors is faulty
• ExchangeLocalTopologyInformation(): sends information to the immediate neighbors and receives information from them
• HasAllTopology(): returns true iff there is no edge from a faulty processor to a processor with an unknown state
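
The clock-tick handler above can be turned into a small runnable sketch. Everything outside the control flow is an assumption: the four predicates are injected as hypothetical callbacks, pyramid updates and topology exchange are replaced by log entries, and the reconstructed pyramid is a placeholder value.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Processor:
    detect_error: Callable[[], bool]          # stands in for DetectError()
    have_faulty_neighbor: Callable[[], bool]  # stands in for HaveFaultyNeighbor()
    has_all_topology: Callable[[], bool]      # stands in for HasAllTopology()
    diameter: Callable[[], int]               # stands in for Diameter(Topology)
    status: str = "operating"
    pyramid: Optional[list] = field(default_factory=list)
    round_counter: int = 0
    log: List[str] = field(default_factory=list)

    def tick(self) -> None:
        if self.status == "operating":
            if self.detect_error():
                self.status = "faulty"
                self.pyramid = None
                self.round_counter = 0
            elif self.have_faulty_neighbor():
                self.status = "border-non-faulty"
                self.round_counter = 0
            else:
                self.log.append("update_pyramid")
        else:
            self.log.append("exchange_topology")
            if self.has_all_topology() and self.status == "faulty":
                self.pyramid = ["reconstructed"]   # placeholder result
                self.log.append("reconstruct_pyramid")
            self.round_counter += 1
            if self.diameter() == self.round_counter:
                self.status = "operating"


# Hypothetical run: an error is detected on the first tick, the topology is
# complete immediately, and the collected topology has diameter 2.
p = Processor(detect_error=lambda: True,
              have_faulty_neighbor=lambda: False,
              has_all_topology=lambda: True,
              diameter=lambda: 2)
statuses = []
for _ in range(3):
    p.tick()
    statuses.append(p.status)
print(statuses)
```

In this run the processor spends two rounds in the repair branch and returns to operating when RoundCounter reaches the diameter of the collected topology.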

Undetected Faults
• What happens if the faults are not detected?
• Transient fault detectors and watchdog counters are used in this situation
• When an error is detected by the transient fault detector, the faulty processor starts counting while letting the repair scheme try to fix the problem

Undetected Faults (2)
• When the counter reaches its upper bound, the system is examined again
• If the repair failed, a reset of the system is triggered
