Designing Modular Services in the Scattered Byzantine Failure Model*
Emmanuelle Anceaume (IRISA / CNRS)
Joint work with Michel Hurfin (IRISA), Carole Delporte-Gallet (LIAFA), Hugues Fauconnier (LIAFA), Gérard Le Lann (INRIA)
*This work has been supported by the French Space Agency.
Fault Tolerant Distributed Applications
• Fault tolerance is a critical issue
• To tolerate failures (from benign to malicious ones), physical redundancy is mandatory
• Redundancy increases the overall reliability of the computing system
• Replication techniques and agreement algorithms are needed
Correct / Faulty Processes
• Classically, redundant processes are classified into two categories:
  • Correct processes: behave according to their specification during the whole application
  • Faulty processes: all the others
• To correctly design fault tolerant applications, a maximal subset of faulty processes must be assumed
• Once these faulty processes have failed, no more failures can be tolerated
Context of the study
Space domain context
• Radiation and power supply glitches may cause transient faults (e.g., bit flips) in electronic systems
• Running times of the applications are extremely long
• Drastic limitations are imposed on the computer system
• Most of the failures are recoverable and accidental
• Physical phenomena can arbitrarily affect the behavior of a processor (altering the executed code, registers, …)
• Checking procedures or reconfiguration are available; they bring a processor back to an operational state, which may not be semantically correct
Outline
• Formalize the scattered byzantine failure model
• Solve the clock synchronization problem and the timed atomic broadcast problem in this model
• Characterize the post-fault period, i.e., the minimal period of time needed for a processor to recover
• For non atomic services, characterize the fore-fault period, i.e., the completion time of the service
The Scattered Byzantine Failure Model
• A processor can alternate correct and faulty periods
  • Good period: a processor behaves according to its specification
  • Faulty period: a processor behaves arbitrarily
• No limitation on the number of faulty processors
• Frees the application designer from the recurrent question:
  • “What happens if the quorum of processors that were supposed to fail is exceeded?”
• Extension of the classical byzantine failure model
Model of the System (1)
• Computational model:
  • Finite set of processes {p1, …, pn} modeled as automata
  • Synchronous
    • Durations of computation steps are bounded
    • Local hardware clock Ri with a bounded drift rate ρ with respect to real time:
      (1+ρ)⁻¹(t2 − t1) ≤ Ri(t2) − Ri(t1) ≤ (1+ρ)(t2 − t1)
    • Transmission delays are upper bounded (δ)
  • Communication links are reliable
    • The communication network does not lose, falsify, or duplicate messages
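The drift bound on hardware clocks can be checked mechanically. The following is a minimal sketch, assuming the drift rate is called `rho` (the symbol itself is lost on the slide) and that two clock readings and the corresponding real times are available; the function name and the sample values are hypothetical.

```python
def within_drift_bound(r1, r2, t1, t2, rho):
    """Check the bounded-drift condition for one pair of clock readings.

    r1, r2: hardware clock readings R(t1), R(t2)
    t1, t2: the corresponding real times, with t1 <= t2
    rho:    assumed name for the drift rate bound
    """
    if t2 < t1:
        raise ValueError("t2 must not precede t1")
    elapsed_real = t2 - t1
    elapsed_clock = r2 - r1
    # (1+rho)^-1 (t2-t1) <= R(t2)-R(t1) <= (1+rho)(t2-t1)
    return elapsed_real / (1 + rho) <= elapsed_clock <= (1 + rho) * elapsed_real


# Illustrative values only: a clock that gains 5 ms over 100 s, then one that gains 1 s.
print(within_drift_bound(0.0, 100.005, 0.0, 100.0, rho=1e-4))
print(within_drift_bound(0.0, 101.0, 0.0, 100.0, rho=1e-4))
```

In this reading, a clock whose drift leaves the envelope is one of the reasons a process is considered to be in a bad period.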
Model of the System (2)
• Scattered byzantine failure model:
  • At any time, all processes can alternate correct and faulty periods
  • At any time, at most t processes are in a faulty period
[Figure: timelines of processes p1, p2, …, pn, each alternating between correct and faulty periods.]
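The "at most t processes in a faulty period at any time" condition can be illustrated with a small sweep over faulty intervals. This is only a sketch: the interval representation (half-open `[start, end)` pairs per process) and all names are assumptions made for the example.

```python
def max_simultaneously_faulty(faulty_periods):
    """Return the largest number of processes that are faulty at the same instant.

    faulty_periods: dict mapping a process name to a list of (start, end)
    faulty intervals, treated as half-open [start, end). The intervals of a
    single process are assumed not to overlap each other.
    """
    events = []
    for periods in faulty_periods.values():
        for start, end in periods:
            events.append((start, 1))   # a faulty period begins
            events.append((end, -1))    # a faulty period ends
    # At equal times, handle ends (-1) before starts (+1): intervals are half-open.
    events.sort(key=lambda e: (e[0], e[1]))
    current = worst = 0
    for _, change in events:
        current += change
        worst = max(worst, current)
    return worst


# Hypothetical run: every process is faulty at some point, but never more than two at once.
run = {
    "p1": [(0, 5), (20, 25)],
    "p2": [(3, 8)],
    "p3": [(8, 12)],
}
print(max_simultaneously_faulty(run) <= 2)
```

The example run shows the difference with the classical model: all three processes fail at some point, yet the bound t = 2 on simultaneous faults is respected.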
Scattered Byzantine Failure Model
Faulty periods:
• Bad period: byzantine failures
  • A bad period ends when an operational state is reached
• Post-fault period: from operational state to safe state
  • Consistent with the state of correct processes
  • Purge of logs, validity of critical variables
  • Maximal duration D_p^s is computable
[Figure: layered services. Atomic broadcast service (level k) above clock synchronization service (level k−1); at each level a faulty period is split into a bad period followed by a post-fault period, before the next correct period.]
Scattered Byzantine Failure Model
Correct period: to exactly identify completed activities from uncompleted ones:
• Good period
• Fore-fault period: reflects the WCET of a long lasting service s (D_f^s)
  • Ensures the completion of a service
  • Maximal duration D_f^s is computable
[Figure: layered services. Atomic broadcast service (level k) above clock synchronization service (level k−1); the correct period of the atomic broadcast service is split into a good period and a fore-fault period.]
Clock Synchronization Service (1)
• Makes it possible to overcome the effects of drifts and failures
• Guarantees that:
  • The maximal deviation between all logical clocks is bounded
    Agreement property: there is a constant Dmax such that
    | Ci(τ) − Cj(τ) | ≤ Dmax
  • Logical clocks are within a linear envelope of real time
    Accuracy property: there exists a constant γ such that
    τ/(1+γ) + a ≤ Ci(τ) ≤ (1+γ)τ + b
• A process is in a bad period if it deviates from its algorithm or if the rate of drift of its physical clock is not bounded
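The two guarantees can be written as simple predicates over observed clock values. This is a sketch under assumptions: the envelope constant is called `gamma` and real time `tau` because the slide's own symbols are lost, and the sample values are made up.

```python
def agreement_holds(clock_values, d_max):
    """Agreement: logical clocks read at the same real time differ by at most d_max."""
    return max(clock_values) - min(clock_values) <= d_max


def accuracy_holds(c, tau, gamma, a, b):
    """Accuracy: logical clock value c at real time tau lies in a linear envelope.

    gamma, a and b are the envelope constants (names assumed for this sketch).
    """
    return tau / (1 + gamma) + a <= c <= (1 + gamma) * tau + b


# Hypothetical readings of three logical clocks taken at the same real time.
print(agreement_holds([10.0, 10.2, 9.9], d_max=0.5))
print(accuracy_holds(c=100.0, tau=100.0, gamma=1e-4, a=-1.0, b=1.0))
```

Both checks apply only to processes in a correct period; nothing is required of the clock of a process in a bad period.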
Clock Synchronization Service (2)
• Principles of the algorithm of Srikanth and Toueg [ST87]
• Classical failure model
At process i:
  if C(τ) = kP
    send (Sync-init, k) to all the processes
  upon receipt of (Sync-init, k) from t+1 processes
    relay (Sync-echo, k) to all processes
  if (2t+1) (Sync-echo, k, j) have been received
    accept (Synchro, k)
  if (accept(Synchro, k)) then C(τ) = kP + α
Constraints:
  α ≥ ((1+ρ)Dmax + 2δ)(1+ρ)
  Dmax ≥ (P(1+ρ) + 2δ)dr + 2δ(1+ρ)
  P > 2δ(1+ρ) + Dmax
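The thresholds in the steps above can be sketched as per-round bookkeeping at one process. This follows only the steps listed on the slide and is not a complete rendition of [ST87]; the class, its method names, and the name `alpha` for the adjustment added to kP are assumptions made for the example. Message transport is left out.

```python
class SyncRound:
    """Bookkeeping at one process for resynchronization round k (illustrative)."""

    def __init__(self, k, t, period, alpha):
        self.k = k                # round number
        self.t = t                # maximal number of faulty processes
        self.period = period      # resynchronization period P
        self.alpha = alpha        # assumed name for the adjustment added to k*P
        self.init_from = set()    # distinct senders of (Sync-init, k)
        self.echo_from = set()    # distinct senders of (Sync-echo, k)
        self.echo_sent = False
        self.accepted = False

    def on_init(self, sender):
        """Record a Sync-init; return True when a Sync-echo should be relayed."""
        self.init_from.add(sender)
        if not self.echo_sent and len(self.init_from) >= self.t + 1:
            self.echo_sent = True
            return True
        return False

    def on_echo(self, sender):
        """Record a Sync-echo; return the new clock value on acceptance, else None."""
        self.echo_from.add(sender)
        if not self.accepted and len(self.echo_from) >= 2 * self.t + 1:
            self.accepted = True
            return self.k * self.period + self.alpha
        return None


# Hypothetical round with t = 1: two distinct inits trigger the echo,
# three distinct echoes trigger acceptance.
rnd = SyncRound(k=3, t=1, period=10.0, alpha=0.5)
print([rnd.on_init(s) for s in ("p1", "p1", "p2")])
print([rnd.on_echo(s) for s in ("p1", "p2", "p3")])
```

Counting distinct senders in sets means a single process cannot reach a threshold by repeating its own message.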
Clock Synchronization Service (3)
• Extension of this algorithm to ensure that:
  • Local structures of the processes in correct periods are never corrupted by the recovering processes
  • Faulty processes recover by synchronizing their local clocks within a bounded delay (i.e., Dp is bounded)
Clock Synchronization Service (4)
if C(t) = kP
  broadcast (Sync, k, i) to all the other processes
if (Sync, m, j) is received at time T = C(t) from l
  if (l = j) and (−Dmax)(1+ρ) ≤ T − mP(1+ρ) ≤ (δ+Dmax)(1+ρ)
    then relay this message to all the processes
    otherwise discard it
  else if (l ≠ j) then add (sync, m, j, l) to Buff-rec
  if (∃ l′ : l′ ≠ l s.t. (sync, m, j, l′) ∈ Buff-rec)
    then add (sync, m, j) to Buff-accepted
  if (∃ j′ : j′ ≠ j s.t. (sync, m, j′) ∈ Buff-accepted) and (k … m+1)
    then C(t) := mP + α
      Buff-rec = Buff-accepted := Ø
      k := k+1
Callouts on the slide: Validity test 1, Validity test 2, Clean local structures
Clock Synchronization Service (5)
• Proposition: Suppose that process p recovers an operational state at time t (i.e., enters a post-fault period at time t). Then by time t + 2((P−α)(1+ρ) + 2δ), p is resynchronized with all the processes in correct periods.
• Post-fault period duration = 2((P−α)(1+ρ) + 2δ) time units
• Fore-fault period duration = 0 time units
• Similarly to [ST87], achieves optimal accuracy
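As a worked example, the post-fault bound of the proposition can be evaluated numerically. This sketch reads the bound as 2((P − α)(1 + ρ) + 2δ); the symbols α (clock adjustment), ρ (drift rate) and δ (maximal transmission delay) are assumed stand-ins for the ones lost on the slide, and the parameter values are invented.

```python
def post_fault_duration(period, alpha, rho, delta):
    """Evaluate 2((P - alpha)(1 + rho) + 2*delta), the assumed reading of the bound.

    period: resynchronization period P
    alpha:  assumed name for the clock adjustment
    rho:    assumed name for the drift rate
    delta:  assumed name for the maximal transmission delay
    """
    return 2 * ((period - alpha) * (1 + rho) + 2 * delta)


# Hypothetical parameters, all in seconds except rho (dimensionless).
bound = post_fault_duration(period=10.0, alpha=0.5, rho=1e-4, delta=0.01)
print(f"resynchronized at most {bound:.4f} s after reaching an operational state")
```

With these values the bound is roughly two resynchronization periods, which matches the factor 2 in front of the expression.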
Δ-Atomic Broadcast
• Powerful communication paradigm
• Agreement on the set of received messages and their order
• 2 primitives: broadcast and deliver
• Revisited properties
  • Validity: if process pi broadcasts (m,i) at time t during its good period, then every process in a good period at time t delivers (m,i) exactly once during the corresponding correct period
  • Agreement: if process pj delivers (m,i) at time t during its good period, then every process in a good period at time t delivers (m,i)
  • Δ-timeliness: if process pi delivers (m,j) at time t during its good period, then pj broadcast (m,j) between time t−Δ and t
  • Total order: if two processes pi and pj deliver two messages (m,k1) and (m′,k2) during a correct period, then both messages are delivered in the same order by pi and pj
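Two of the revisited properties lend themselves to a check over recorded traces. The sketch below assumes a hypothetical trace format (message identifiers, a map of broadcast times, per-process delivery logs) and calls the timeliness bound `delta_bound`; it only inspects logs and is not a broadcast implementation.

```python
def timeliness_holds(broadcast_time, deliveries, delta_bound):
    """Each delivered message was broadcast within delta_bound before its delivery.

    broadcast_time: dict mapping message id to its broadcast time
    deliveries:     list of (message id, delivery time) at one process
    delta_bound:    assumed name for the timeliness bound
    """
    return all(
        t - delta_bound <= broadcast_time[m] <= t
        for m, t in deliveries
    )


def same_relative_order(log_a, log_b):
    """Messages delivered by both processes appear in the same order in both logs.

    Each log lists message ids in delivery order, each id at most once.
    """
    common = set(log_a) & set(log_b)
    return [m for m in log_a if m in common] == [m for m in log_b if m in common]


# Hypothetical traces.
sent = {"m1": 1.0, "m2": 2.0}
print(timeliness_holds(sent, [("m1", 1.4), ("m2", 2.3)], delta_bound=0.5))
print(same_relative_order(["m1", "m2", "m3"], ["m2", "m3"]))
```

Comparing only the common messages reflects that a process may miss deliveries while it is in a faulty period.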
Δ-Atomic Broadcast
[Figure: layered services. Atomic broadcast service (level k) above clock synchronization service (level k−1), with correct, good, fore-fault, and faulty periods. Annotations: Δ(1+ρ); Δ = 2δ(1+ρ) + 2δ + 3Dmax; 2((P−α)(1+ρ) + 2δ).]
Conclusion and future work
• Byzantine recovery problem:
  • Formalization of the scattered byzantine failure model
• Two fundamental agreement problems:
  • Clock synchronization problem
  • Timed bounded atomic broadcast problem
  • Revisited their specifications
  • Designed simple and efficient solutions, and computed Df and Dp
• Designing independent services
• Asynchronous model: self-stabilization techniques?