A Fusion-based Approach for Tolerating Faults in Finite State Machines. Vinit Ogale, Bharath Balasubramanian, Parallel and Distributed Systems Lab, Electrical and Computer Engineering Dept., University of Texas at Austin. Vijay K. Garg, IBM India Research Lab.
Outline • Motivation • Related Work • Questions and Issues Addressed • Model • Partition Lattice • Fault Graphs • Fault Tolerance in FSMs and (f,m)-fusion • Algorithms: Generating Backups and Recovery • Implementation Results • Conclusion and Future Work
Motivation • Many real applications are modeled as FSMs • Embedded systems: traffic controllers, home appliances • Sensor networks • E.g., hundreds of sensors (temperature, pressure, etc.) need to be backed up
Problem • Given a set of finite state machines (FSMs), some FSMs may either crash (fail-stop faults) or lie about their execution state (Byzantine faults) • [Figure: two example FSMs, a counter counting 'a's (states a0, a1, a2) and a counter counting 'b's (states b0, b1, b2)]
Existing Solution - Replicate • n·f extra FSMs to tolerate f crash faults; 2·n·f extra FSMs to tolerate f Byzantine faults (where n is the number of original FSMs) • [Figure: 1-crash fault tolerant setup obtained by replicating the counter counting 'a's and the counter counting 'b's]
Related Work • Traditional approach: redundancy • n·k backup machines to tolerate k faults in n machines • Fault Tolerance in Finite State Machines using Fusion (Balasubramanian, Ogale, Garg 08) • Exponential algorithm for generating machines that can tolerate crash faults • Number of faults = number of machines • Fusible Data Structures (Garg, Ogale 06) • Fuse common data structures such as linked lists, hash tables, etc.; the fused structure is smaller than the sum of the original structures • Erasure coding • Fault tolerance in data
Reachable Cross Product • A: counter counting 'a's (states a0, a1, a2) • B: counter counting 'b's (states b0, b1, b2) • R(A, B): reachable cross product of {A, B}, with states <a0, b0>, <a0, b1>, <a0, b2>, <a1, b0>, <a1, b1>, <a1, b2>, <a2, b0>, <a2, b1>, <a2, b2>
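The reachable cross product can be computed with a breadth-first search over tuples of component states. The sketch below is only an illustration: the dictionary encoding of a machine, the function name `reachable_cross_product`, and the assumption that both counters wrap around modulo 3 and ignore the other counter's event are choices made here, not taken from the slides.

```python
from collections import deque

def reachable_cross_product(machines, events):
    """machines: list of (initial_state, delta), with delta[state][event] -> state."""
    start = tuple(init for init, _ in machines)
    seen = {start}
    transitions = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for e in events:
            # Every machine acts on the same event (see the Model slide).
            nxt = tuple(delta[s][e] for s, (_, delta) in zip(state, machines))
            transitions[(state, e)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, transitions

# Assumed encoding: counter A advances on 'a' and ignores 'b'; B does the opposite.
A = ("a0", {f"a{i}": {"a": f"a{(i + 1) % 3}", "b": f"a{i}"} for i in range(3)})
B = ("b0", {f"b{i}": {"a": f"b{i}", "b": f"b{(i + 1) % 3}"} for i in range(3)})

states, transitions = reachable_cross_product([A, B], "ab")
print(len(states))  # 9
```

For these two independent counters every pair of states is reachable, which is why the cross product has all 9 states.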
Can We Do Better? • A: counter counting 'a's (mod 3) • B: counter counting 'b's (mod 3) • F1: (a + b) modulo 3 • [Figure: state diagrams of A, B and the fused machine F1, with example input "a a b"]
Can We Do Better? • A: counter counting 'a's (mod 3) • B: counter counting 'b's (mod 3) • F1: (a + b) modulo 3 • F2: (a - b) modulo 3 • A, B, F1 and F2 together form a 2-crash fault tolerant setup
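A small experiment makes the 2-crash claim concrete: if any two of the four machines survive, the states of A and B can still be recovered. The sketch below is an illustration under assumptions made here: each machine is modeled as a function of the event counts, and recovery is a brute-force search over the 9 possible (a mod 3, b mod 3) pairs rather than the recovery algorithm presented later.

```python
from itertools import combinations, product

# Each machine's state as a function of the number of 'a' and 'b' events seen.
MACHINES = {
    "A":  lambda a, b: a % 3,
    "B":  lambda a, b: b % 3,
    "F1": lambda a, b: (a + b) % 3,
    "F2": lambda a, b: (a - b) % 3,
}

def run(events):
    a, b = events.count("a"), events.count("b")
    return {name: fn(a, b) for name, fn in MACHINES.items()}

def recover(surviving):
    """All (a mod 3, b mod 3) pairs consistent with the surviving machines."""
    return [(a, b) for a, b in product(range(3), repeat=2)
            if all(MACHINES[name](a, b) == s for name, s in surviving.items())]

full = run("aabab")  # three 'a's and two 'b's, so A = 0 and B = 2

# Crash any two machines: the remaining two always pin down a unique pair.
for pair in combinations(MACHINES, 2):
    survivors = {name: full[name] for name in pair}
    assert recover(survivors) == [(0, 2)]
```

Plain replication would need four extra machines for the same guarantee; here two fused machines suffice.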
Questions and Issues Addressed • Can we do better than the cross product? • How many faults can be tolerated? What is the minimum number of machines required to tolerate f crash faults? • Can these machines tolerate Byzantine faults? (For example, in the previous slide, DFSMs A and B along with F1 and F2 can tolerate one Byzantine fault) • Main aims: • Develop theory to understand and define this problem • Efficient algorithms based on this theory to generate backup machines
Application Scenario: Sensor Network • 1000 sensors (simple counters), each recording a parameter (temperature, pressure, etc.). Sensors will be collected later and their data analyzed offline • 10 sensors are expected to crash • Replication requires 1000 × 10 = 10,000 backup sensors to ensure fault-tolerant operation • Can we use just 10 extra sensors instead of 10,000?
Model • FSMs (machines) execute independently (in parallel) • The inputs to an FSM are not determined by any other FSM • FSMs act concurrently on the same set of events • Fail-stop (crash) faults • Loss of current state, underlying FSM intact • Byzantine faults • Machines can "lie" about their current state
Join of Two FSMs • Join (⊔): reachable cross product; 4 states in this case instead of 9
Less Than or Equal To Relation (≤) • Given FSMs A and B • A ≤ B ⇔ A ⊔ B = B • Given the state of B, we can determine the current state of A
Partitions • Given any FSM, we can partition the states into blocks such that the transitions for all states in a block are consistent • E.g., suppose states t0 and t3 have to be combined into one block • [Figure: FSM with states t0, t1, t2, t3 and transitions on Input 0 and Input 1]
Largest Consistent Partition Containing {t0,t3} • Blocks: {t0,t3}, {t1}, {t2}
Largest Consistent Partition Containing {t0,t1} • Blocks: {t0,t1,t2}, {t3}
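One way to compute such a partition is to merge the requested states and then keep merging the successors of merged states until the transitions are consistent. The sketch below uses a union-find structure for this. The transition table `delta` is hypothetical: the slides do not give the transitions of T, so this table was chosen here only so that the two merges reproduce the blocks listed above.

```python
def largest_consistent_partition(states, events, delta, merge):
    """Finest partition with all of `merge` in one block and consistent transitions."""
    parent = {s: s for s in states}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    work = []

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry
            work.append((x, y))  # successors of x and y must be merged too

    first, *rest = merge
    for s in rest:
        union(first, s)
    while work:
        x, y = work.pop()
        for e in events:
            union(delta[x][e], delta[y][e])

    blocks = {}
    for s in states:
        blocks.setdefault(find(s), []).append(s)
    return sorted(blocks.values())

# Hypothetical 4-state machine over inputs 0 and 1 (not taken from the slides).
delta = {
    "t0": {0: "t1", 1: "t0"},
    "t1": {0: "t2", 1: "t1"},
    "t2": {0: "t0", 1: "t2"},
    "t3": {0: "t1", 1: "t3"},
}
states = ["t0", "t1", "t2", "t3"]
print(largest_consistent_partition(states, [0, 1], delta, ["t0", "t3"]))
print(largest_consistent_partition(states, [0, 1], delta, ["t0", "t1"]))
```

Merging t0 and t1 forces t2 into the same block because their successors on input 0 are t1 and t2.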
Partition Lattice • The set of all FSMs corresponding to partitions of a given FSM (say T) forms a lattice with respect to the ≤ relation [HarSte66] • I.e., for any two FSMs A and B formed by partitioning T, there exists a unique C ≤ T such that • C = A ⊔ B (join, ⊔): A ≤ C and B ≤ C and C is the smallest such FSM • C = A ⊓ B (meet, ⊓): C ≤ A and C ≤ B and C is the largest such FSM
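When machines are represented as partitions of the states of T, the join and the ≤ test are short set computations. The sketch below assumes that representation (a machine is a set of blocks) and uses partitions that appear in the lattice figure; the helper names `join`, `leq` and `part` are introduced here for illustration. The meet is not shown, since it additionally requires closing the merged blocks under the transitions.

```python
def join(P, Q):
    """Join of two partitions: every nonempty intersection of a block of P with a block of Q."""
    return {p & q for p in P for q in Q if p & q}

def leq(P, Q):
    """P <= Q iff P join Q == Q, i.e. the state of Q determines the state of P."""
    return join(P, Q) == Q

def part(*blocks):
    return {frozenset(b) for b in blocks}

F1 = part({"t0", "t3"}, {"t1"}, {"t2"})
F2 = part({"t2", "t3"}, {"t0"}, {"t1"})
S = part({"t0", "t3"}, {"t1", "t2"})          # a two-block machine from the lattice
top = part({"t0"}, {"t1"}, {"t2"}, {"t3"})

print(join(F1, F2) == top)  # True: F1 and F2 together determine the state of T
print(leq(S, F1))           # True: S only merges blocks of F1
print(leq(F2, F1))          # False: F1 cannot tell t0 from t3
```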
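[Figure: partition lattice of T. Top element ⊤: {t0}, {t1}, {t2}, {t3}. Three-block machines F1 (A), F2 (B), F3, F4: {t0,t3},{t1},{t2}; {t2,t3},{t0},{t1}; {t0,t2},{t1},{t3}; {t0},{t1,t2},{t3}. Two-block machines S1, S2, S3, S4: {t0},{t1,t2,t3}; {t0,t3},{t1,t2}; {t0,t1,t2},{t3}; {t0,t2,t3},{t1}. Bottom element: {t0,t1,t2,t3}]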
Top Element (⊤) • Given a set of FSMs 𝒜 = {A1, …, An}: ⊤ = A1 ⊔ A2 ⊔ … ⊔ An • All FSMs we consider henceforth are less than or equal to ⊤ • Intuitively, ⊤ has information about the state of every machine in the original set 𝒜
Bottom Element of Lattice (⊥) • Single-state FSM • Contains one block with all the states • On any input it transitions to itself • Conveys no information about the current state of any machine
[Figure: partition lattice of T, repeated, with top element ⊤, machines F1 to F4 and S1 to S4, and bottom element {t0,t1,t2,t3}]
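Tolerating Faults • [Figure: machines F1 and F2]
Tolerating Faults • [Figure: machines F1 and F2, one of them crashed (marked X), together with ⊤ = T, the reachable cross product with states t0, t1, t2, t3]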
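Fault Graph: Fault Tolerance Indicator • T: reachable cross product (⊤) with states t0, t1, t2, t3 • 𝒜 = {F1, F2}: original machines, with F1 = {t0,t3}, {t1}, {t2} and F2 = {t2,t3}, {t0}, {t1} • Fault graph G(𝒜, T): one node per state of T; the edge (ti, tj) is weighted by the number of machines in 𝒜 that place ti and tj in different blocks • Here the edges (t0,t3) and (t2,t3) have weight 1 and all other edges have weight 2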
[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, next to the fault graph with edge weights 1, 1, 2, 2, 2, 2]
Hamming Distance • Hamming distance d(ti, tj): weight of the edge separating the states (ti, tj) in the fault graph, e.g. d(t0, t1) = 2 • Minimum Hamming distance dmin(T, 𝒜): the weight of the weakest edge in the fault graph, e.g. dmin(T, 𝒜) = 1 • [Figure: fault graph with edge weights 1, 1, 2, 2, 2, 2]
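Both quantities are easy to compute once each machine is written as a partition of the states of T. The sketch below assumes that representation and that an edge weight counts the machines that place the two states in different blocks; the function names are chosen here for illustration. It reproduces d(t0, t1) = 2 and dmin = 1 for 𝒜 = {F1, F2}.

```python
from itertools import combinations

def fault_graph(states, machines):
    """Weight of edge (s, t) = number of machines with s and t in different blocks."""
    return {(s, t): sum(1 for m in machines
                        if not any(s in block and t in block for block in m))
            for s, t in combinations(states, 2)}

def dmin(states, machines):
    """Weight of the weakest edge in the fault graph."""
    return min(fault_graph(states, machines).values())

states = ["t0", "t1", "t2", "t3"]
F1 = [{"t0", "t3"}, {"t1"}, {"t2"}]
F2 = [{"t2", "t3"}, {"t0"}, {"t1"}]

graph = fault_graph(states, [F1, F2])
print(graph[("t0", "t1")], dmin(states, [F1, F2]))  # 2 1
```

The weakest edges are (t0, t3) and (t2, t3): each is separated by only one of the two machines, so a single crash can make those states indistinguishable.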
Fault Tolerance in FSMs (crash faults) • Theorem 1: A set of machines 𝒜 can tolerate up to f crash faults iff dmin(T(𝒜), 𝒜) > f • E.g. 𝒜 = {A, B, M1, M2}: dmin(T(𝒜), 𝒜) = 3, so it can tolerate 2 crash faults • [Figure: fault graph with edge weights 3, 4, 4, 3, 3, 3]
Fault Tolerance in FSMs (Byzantine faults) • Theorem 2: A set of machines 𝒜 can tolerate up to f Byzantine faults iff dmin(T(𝒜), 𝒜) > 2f • E.g. 𝒜 = {A, B, M1, M2} • Let the machines be in the following states: • A = {t0, t3}, B = {t0}, M1 = {t0, t2}, M2 = {t3} • B and M1 are lying about their state (f = 2) • Since dmin(T(𝒜), 𝒜) = 3 < 4 = 2f, we cannot determine the state of T • [Figure: fault graph with edge weights 3, 4, 4, 3, 3, 3]
Fault Tolerance in FSMs (Byzantine faults) • Let the machines be in the following states: • A = {t0, t3}, B = {t0}, M1 = {t3}, M2 = {t3} • Only B is lying about its state (f = 1) • Since dmin(T(𝒜), 𝒜) = 3 > 2 = 2f, we can determine the state of T as t3 • Henceforth, dmin(T(𝒜), 𝒜) is abbreviated as dmin(𝒜) • [Figure: fault graph with edge weights 3, 4, 4, 3, 3, 3]
Fault Tolerance and (f,m)-fusion • Given a set of n machines 𝒜, a set of m machines ℱ is an (f,m)-fusion of 𝒜 if dmin(𝒜 ∪ ℱ) > f • The machines in 𝒜 ∪ ℱ can tolerate f crash faults or ⌊f/2⌋ Byzantine faults • E.g. 𝒜 = {A, B}, ℱ = {M1, M2}, dmin(𝒜 ∪ ℱ) = 3 • ℱ = {M1, M2} is a (2,2)-fusion of 𝒜
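The definition translates directly into a check on the minimum Hamming distance. The sketch below is an illustration with machines written as partitions of T's states; it uses F1 and F2 as the original machines and the two-block machine {t0,t1,t2}, {t3} from the lattice as the single backup. The function names are introduced here, not taken from the slides.

```python
from itertools import combinations

def dmin(states, machines):
    """Minimum, over all pairs of states, of the number of machines separating them."""
    def separates(m, s, t):
        return not any(s in block and t in block for block in m)
    return min(sum(separates(m, s, t) for m in machines)
               for s, t in combinations(states, 2))

def is_fusion(states, originals, backups, f):
    """True if `backups` is an (f, len(backups))-fusion of `originals`."""
    return dmin(states, originals + backups) > f

states = ["t0", "t1", "t2", "t3"]
F1 = [{"t0", "t3"}, {"t1"}, {"t2"}]
F2 = [{"t2", "t3"}, {"t0"}, {"t1"}]
S4 = [{"t0", "t1", "t2"}, {"t3"}]

print(is_fusion(states, [F1, F2], [S4], f=1))  # True: dmin rises from 1 to 2
print(is_fusion(states, [F1, F2], [S4], f=2))  # False: one backup is not enough
```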
Minimal Fusion • Given a set of machines 𝒜, an (f, m)-fusion ℱ is minimal if there does not exist another (f, m)-fusion ℱ′ such that • ∀ F ∈ ℱ, ∃ F′ ∈ ℱ′ : F′ ≤ F, and • ∃ (F ∈ ℱ, F′ ∈ ℱ′) : F′ < F
[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, n = 2; a (1,1)-fusion and the minimal (1,1)-fusion are marked in the lattice]
Minimal Fusion: Example • F1 = {t0,t3}, {t1}, {t2} • F2 = {t2,t3}, {t0}, {t1} • S4 = {t0,t1,t2}, {t3} • [Figure: ⊤ with states t0, t1, t2, t3 and the fault graph G(𝒜, T) with edge weights 2, 2, 3, 2, 2, 2]
Algorithm: Generating Backups • Aim: add the least possible number of machines that tolerate f faults • Input: set of machines 𝒜, number of faults f • Output: minimal fusion set with the least size • If |T| = N and the size of the event set is |E|, the time complexity of the algorithm is O(N³ · |E| · f)
Algorithm overview • f: number of faults, 𝒜: given set of machines, ℱ: fusion set (initially empty) • While dmin(𝒜 ∪ ℱ) ≤ f • M := ⊤ • Repeat • Compute the lower cover of M, i.e. LC(M) • If ∃ machine F ∈ LC(M) : dmin({F} ∪ 𝒜 ∪ ℱ) > dmin(𝒜 ∪ ℱ), then M := F • Else ℱ := {M} ∪ ℱ and leave the inner loop • Return ℱ
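The loop above can be sketched in code under a few assumptions made here: machines are partitions of T's states, the lower cover of M is obtained by merging each pair of blocks of M, closing the result under the transitions and keeping the finest candidates, and the inner loop stops as soon as no machine in the lower cover raises dmin. It is an illustrative reading of the slide, not the authors' implementation, and makes no attempt to match the stated complexity. The example uses the two mod-3 counters, whose reachable cross product has 9 states.

```python
from itertools import combinations

def close(blocks, delta, events):
    """Coarsen `blocks` until states in a block have successors in a common block."""
    parent = {s: s for b in blocks for s in b}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    work = []
    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry
            work.append((x, y))
    for b in blocks:
        b = sorted(b)
        for s in b[1:]:
            union(b[0], s)
    while work:
        x, y = work.pop()
        for e in events:
            union(delta[x][e], delta[y][e])
    groups = {}
    for s in parent:
        groups.setdefault(find(s), set()).add(s)
    return frozenset(frozenset(g) for g in groups.values())

def dmin(states, machines):
    return min(sum(1 for m in machines if not any(s in b and t in b for b in m))
               for s, t in combinations(states, 2))

def lower_cover(M, delta, events):
    """Finest closed partitions strictly coarser than M."""
    cands = set()
    for b1, b2 in combinations(M, 2):
        merged = [b1 | b2] + [b for b in M if b not in (b1, b2)]
        cands.add(close(merged, delta, events))
    def finer(P, Q):  # P strictly refines Q
        return P != Q and all(any(p <= q for q in Q) for p in P)
    return [C for C in cands if not any(finer(D, C) for D in cands)]

def gen_fusion(originals, states, delta, events, f):
    top = frozenset(frozenset([s]) for s in states)
    fusion = []
    while dmin(states, originals + fusion) <= f:
        M = top
        while True:
            base = dmin(states, originals + fusion)
            better = [C for C in lower_cover(M, delta, events)
                      if dmin(states, [C] + originals + fusion) > base]
            if not better:
                break
            M = better[0]  # descend to a smaller machine that still helps
        fusion.append(M)
    return fusion

# T for the two mod-3 counters: state (i, j) = (count of 'a', count of 'b').
states = [(i, j) for i in range(3) for j in range(3)]
delta = {(i, j): {"a": ((i + 1) % 3, j), "b": (i, (j + 1) % 3)} for i, j in states}
A = frozenset(frozenset(s for s in states if s[0] == k) for k in range(3))
B = frozenset(frozenset(s for s in states if s[1] == k) for k in range(3))

fusion = gen_fusion([A, B], states, delta, "ab", f=1)
print(len(fusion), len(fusion[0]))  # 1 3
```

For f = 1 the sketch returns a single three-state backup, corresponding to a counter of (a + b) or (a - b) modulo 3, depending on which candidate is visited first.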
[Figure: algorithm step on the partition lattice, 𝒜 = {FSMs in yellow region}; fault graph edge weights 1, 1, 2, 2, 2, 2; w = 1]
[Figure: algorithm step on the partition lattice; fault graph edge weights 2, 2, 3, 3, 3, 3; w = 2]
[Figure: algorithm step on the partition lattice; fault graph edge weights 2, 2, 3, 3, 3, 2; w = 2]
[Figure: algorithm step on the partition lattice; fault graph edge weights 1, 2, 2, 3, 3, 2; w = 1]
[Figure: algorithm step on the partition lattice; fault graph edge weights 2, 2, 3, 2, 2, 2; w = 2]
Algorithm: Recovery • Aim: recover the state of the faulty machines for f crash or ⌊f/2⌋ Byzantine faults, given the state of the remaining machines • Input: current states of all available machines in 𝒜 ∪ ℱ • Output: correct state of T • The time complexity of the algorithm is O((n + m) · f)
Algorithm overview • S: set of current states of machines in 𝒜 ∪ ℱ • count: vector of size |T|, initialized to 0 • For all (s in S) do • For all (ti in s) do • ++count[i] • Return tc : 1 ≤ c ≤ N and count[c] is the maximal element in count
Algorithm: Example • Consider machines A, B, M1, M2 • dmin({A, B, M1, M2}) = 3; they can tolerate one Byzantine fault • Let the machines be in the following states: • A = {t0, t3}, B = {t0}, M1 = {t1, t2, t3}, M2 = {t0} • M1 is lying about its state • The recovery algorithm will return t0, since count[0] = 3 is greater than count[1] = 1, count[2] = 1 and count[3] = 2
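The recovery step is a vote: every available machine votes for each state of T contained in its current block, and the state with the most votes wins. The sketch below follows the overview and reruns the example; representing states by their indices 0 to 3 and the name `recover` are choices made here for illustration.

```python
def recover(reported, n_states):
    """Return the index of the state of T with the most votes, plus the vote counts."""
    count = [0] * n_states
    for block in reported:   # current block of each available machine
        for i in block:      # every state of T inside that block gets one vote
            count[i] += 1
    best = max(range(n_states), key=count.__getitem__)
    return best, count

# A = {t0, t3}, B = {t0}, M1 = {t1, t2, t3} (lying), M2 = {t0}
reported = [{0, 3}, {0}, {1, 2, 3}, {0}]
state, count = recover(reported, 4)
print(state, count)  # 0 [3, 1, 1, 2]
```

A crashed machine simply contributes no block to `reported`, so the same vote handles crash faults.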
Conclusion/Future Work • It is not always necessary to have n·f backups to tolerate f faults • Polynomial time algorithm to generate the smallest minimal set that tolerates f faults • Implementation of this algorithm shows that many complex state machines have efficient fusions • Will machines outside the lattice give better results? • Backup machines need to be given all events; can we do better?