Local Tolerance to Unbounded Byzantine Faults
Mikhail Nesterenko, Kent State University
Anish Arora, Ohio State University
Faults in a System of Large Scale
• large system size presents unique challenges and opportunities for ensuring dependability
• problem
  • faults occur often, affect multiple components, and interact unpredictably
  • asynchronous execution model
  • faults are spatially/temporally unbounded, complex & undetectable
• opportunity
  • a fault directly affects a region rather than the whole system
  • if faults are contained, the rest of the system continues to function
[Figure: system regions labeled faulty, affected, unaffected]
Difficulties Containing Unbounded Faults
• lack of spatial bound
  • an arbitrary number of processes can be faulty
  • cannot rely on a limited scope of fault or a limited number of faulty processes
• lack of temporal bound
  • a faulty process behaves incorrectly for arbitrarily long
  • cannot wait until the fault stops
• contain correctness and tolerance instead of faults
• use execution models that simplify such containment
Outline
• containing correctness and tolerance: strict fault containment and strict stabilization
• execution models and example programs
  • reactive program: dining philosophers
  • transformational execution models and programs
    • output dependent: k-independent set selection
    • output independent: lightweight spanner construction
Containing Correctness
• address specification first
  • what does it mean for a system to be correct when an arbitrary portion of it is faulty?
  • spec defines correct sequences for each process P
  • a sequence involves states of P and possibly of other processes
• a program is locally containing of faults of class F if there exists a constant l (containment radius) such that
  • every P conforms to its spec if faulty processes are at least l hops away from P
• problem: correctness of P depends on every process in the system conforming to spec or F
[Figure: fault of class F, containment radius l, containment locality]
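The definition of a locally containing program can be written compactly. This is only a notational sketch: the symbols \mathit{Faulty}_F (the set of processes affected by a fault of class F), \mathrm{dist} (hop distance), and \mathit{Spec}_P (the spec of P) are names assumed here, not taken from the slides.

```latex
\exists\, l \;\; \forall P:\quad
\bigl(\forall Q \in \mathit{Faulty}_F:\ \mathrm{dist}(P, Q) \ge l\bigr)
\;\Longrightarrow\; P \models \mathit{Spec}_P
```

The constant l must not depend on the system size or on the number of faulty processes; this is what makes the containment local.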
Strict Fault Containment
• a strict fault containing (SFC) program is locally containing of unbounded Byzantine faults
• a process satisfies its spec regardless of the actions of processes outside its locality
• an SFC program is containing of bounded and unbounded faults of any class
• for each P the spec can only mention processes inside the locality
• a problem lacking such specs (e.g. routing) does not have SFC solutions
[Figure: Byzantine fault outside the containment locality]
Strict Stabilization
• additional tolerance properties to faults within the locality for a strictly fault containing program
• strict stabilization: stabilization from transient faults; regardless of actions outside the locality, each P eventually satisfies its spec
Dining Philosophers
Problem definition
• network of processes, each may request to eat
• properties
  • mutual exclusion: no two neighbors eat together
  • liveness: each requesting process eats eventually
Execution model
• interleaving
• communication via shared registers
• high atomicity
[Figure: cycle for a requesting process: thinking (T), hungry (H), eating (E)]
Solution to Dining Philosophers
Priority-based actions
• if T & all higher-priority neighbors are thinking: become hungry
• if H & no neighbors are eating: eat (ensures mutual exclusion)
• if E & done: think & give priority to neighbors (ensures liveness)
Properties
• waiting chain ≤ 3
• optimal containment radius of 2
[Figure: neighbor states E, H, T, any, arranged by decreasing priority]
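The three priority-based actions can be sketched as a small fault-free simulation. This is a minimal sketch under stated assumptions, not the authors' program: every process always wants to eat, a round-robin scheduler stands in for the interleaving model, each edge stores which endpoint currently has priority, and the names `step` and `simulate` are hypothetical.

```python
T, H, E = "T", "H", "E"  # thinking, hungry, eating


def step(p, nbrs, state, prio):
    """Execute the one enabled action of process p, if any."""
    # Neighbors that currently hold priority over p.
    higher = [q for q in nbrs[p] if prio[frozenset((p, q))] == q]
    if state[p] == T and all(state[q] == T for q in higher):
        state[p] = H                      # become hungry
    elif state[p] == H and all(state[q] != E for q in nbrs[p]):
        state[p] = E                      # eat: no neighbor is eating
    elif state[p] == E:
        state[p] = T                      # done: think again
        for q in nbrs[p]:
            prio[frozenset((p, q))] = q   # give priority to every neighbor


def simulate(edges, rounds):
    """Run round-robin rounds; return (meals per process, exclusion held)."""
    nodes = sorted({v for e in edges for v in e})
    nbrs = {v: set() for v in nodes}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    state = {v: T for v in nodes}
    # Assumption: the smaller id initially has priority on each edge.
    prio = {frozenset(e): min(e) for e in edges}
    meals = {v: 0 for v in nodes}
    exclusive = True
    for _ in range(rounds):
        for p in nodes:
            before = state[p]
            step(p, nbrs, state, prio)
            if before == H and state[p] == E:
                meals[p] += 1
            # Mutual exclusion: no edge may have both endpoints eating.
            if any(state[a] == E and state[b] == E for a, b in edges):
                exclusive = False
    return meals, exclusive
```

Calling `simulate([(0, 1), (1, 2), (2, 3)], rounds=8)` runs four processes on a line. The sketch does not model Byzantine neighbors, so it illustrates the actions only, not the containment radius.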
Fault Containment and Information Propagation
• fault containment leverages the limit on information propagation
• idea: abstract from the process of information propagation and highlight the result
[Figure: process: a sends info to b, b sends a's info to c, c sends a's info to d; result: d reads from a]
Execution Models
• transformation program: given input, computes output (e.g. leader election)
• models for transformation programs: each process reads from processes within range (finite distance)
  • output dependent: each process reads all information within range, input and (atomically) output
  • output independent: each process reads only input within range
    • every program in this model is strictly fault containing
[Figure: range of P; P reads input & output (output dependent); P reads input only (output independent)]
k-Independent Set Selection (cf. [HHJS01])
problem: select a maximal subset of processes S such that
• for each process in S, every other process of S is at least k hops away
solution actions
• if no member of S is less than k hops away: join S
• if a member of S is less than k hops away: leave S
observe:
• only a faulty node P can make another process Q leave S
• if Q leaves S, it can make another process R join S
• containment radius is 2k
[Figure: 1-independent set; processes P, Q, R at distances k and k; P joins S, Q leaves S, R joins S]
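The join and leave actions can be sketched as a sequential simulation. This is a minimal sketch, assuming a fault-free system, a fixed round-robin schedule, and precomputed hop distances in place of reads within range; the names `hop_distances` and `select_k_independent` are hypothetical.

```python
from collections import deque


def hop_distances(adj, src):
    """Breadth-first search: hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def select_k_independent(adj, k, initial=()):
    """Apply the join/leave actions round-robin until no action is enabled."""
    dist = {p: hop_distances(adj, p) for p in adj}
    in_s = set(initial)
    changed = True
    while changed:
        changed = False
        for p in sorted(adj):
            # Is some other member of S less than k hops away from p?
            conflict = any(q != p and dist[p].get(q, k) < k for q in in_s)
            if p in in_s and conflict:
                in_s.remove(p)            # leave S
                changed = True
            elif p not in in_s and not conflict:
                in_s.add(p)               # join S
                changed = True
    return in_s
```

On a path of seven processes numbered 0 to 6, starting from an empty S with k = 2, this schedule selects {0, 2, 4, 6}. Starting from the inconsistent state S = {0, 1}, process 0 leaves and the run settles on {1, 3, 5}.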
Outline
• containing correctness and tolerance: strict fault containment and strict stabilization
• execution models and example programs
  • reactive program: dining philosophers
  • transformational execution models and programs
    • output dependent: k-independent set selection
    • output independent: lightweight spanner construction
      • practical problem: fast routing tree construction in sensor networks
      • spanner construction with double range
      • spanner optimization with larger ranges
Experimental Platform: Wireless Sensors
• 4 MHz Atmel processor
• 8 KB of program memory
• 512 B of data memory
• 916 MHz single-channel, low-power radio
• 10 Kbps of raw bandwidth
• uniform antenna length & orientation
• TinyOS as the runtime system
• fresh AA batteries
Experiment: Fast Routing Tree Construction By Flooding [G+02]
• 156 nodes are arranged in a 13x12 grid on an open parking lot, with grid spacing of 2 feet
• the base station is placed in the middle of the base of the grid and starts the flooding
• each receiving node rebroadcasts the flood message immediately upon receipt and then squelches further broadcasts
• the sender is selected as parent; thus a routing tree to the base station is formed
• expectation: a routing tree with relatively regular structure
  • number of children, link length, path size, etc.
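The expected, regular outcome of the flood can be sketched with an idealized model. This is a sketch under strong assumptions that the real radios do not satisfy: a perfect disc radio range measured in grid units, no message loss or collisions, and first-come parent selection in breadth-first order. The function `flood_tree` and its parameters are hypothetical names.

```python
from collections import deque


def flood_tree(cols=13, rows=12, base=(6, 0), radio_range=1.0):
    """Idealized flood: each node takes the first sender it hears as parent."""
    nodes = [(x, y) for x in range(cols) for y in range(rows)]

    def in_range(a, b):
        # Assumption: perfect disc model, distance in grid units.
        dx, dy = a[0] - b[0], a[1] - b[1]
        return a != b and dx * dx + dy * dy <= radio_range ** 2

    parent = {base: None}
    depth = {base: 0}
    queue = deque([base])
    while queue:
        sender = queue.popleft()          # sender rebroadcasts exactly once
        for node in nodes:
            if node not in parent and in_range(sender, node):
                parent[node] = sender     # first sender heard becomes parent
                depth[node] = depth[sender] + 1
                queue.append(node)
    return parent, depth
```

With the default range of one grid unit, every node's depth equals its grid (Manhattan) distance from the base station, which is the kind of regular structure the experiment expected.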
[Figures: flood propagation after 1 hop, 2 hops, 3 hops, and the final tree; observed anomalies labeled Long Link, Backward Link, Straggler, Clustering]
Problems and Solution Approach
problem: a routing tree constructed fast over the "raw" topology is inadequate
• uneven clustering (some nodes have too many neighbors)
• long links (possibly unreliable)
• suboptimal paths (backward links)
idea: pre-process the topology to mitigate the problem
• weigh links (by length, error rate, node degree, etc.)
• locally construct a connected but lightweight spanner
• link weight may be reflexive (depend on the spanner, e.g. node degree)
Lightweight Spanner Construction Using 2k-Range
• spanner: connected subgraph that includes all nodes (e.g. spanning tree)
• k-local spanner: there is a path within distance ≤ k to each neighbor
problem: given a weighted graph (all weights unique) and 2k-range, build a lightweight k-local spanner
solution: each process P computes the minimum spanning tree for each process Q at distance no more than k and selects the union of incident edges
[Figure: P and Q at distance k; MST for Q's region of radius k; P can compute the MST for each process Q in this region]
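The construction above can be sketched centrally by letting every process P run the same local computation. This is a minimal sketch, assuming that Q's region is the subgraph induced by the nodes within k hops of Q and that Kruskal's algorithm computes each regional MST; the names `ball`, `mst_edges`, and `local_spanner` are hypothetical.

```python
from collections import deque


def ball(adj, center, radius):
    """Nodes within `radius` hops of `center` (bounded breadth-first search)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def mst_edges(nodes, weights):
    """Kruskal's algorithm on the subgraph induced by `nodes`."""
    root = {v: v for v in nodes}

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]   # path halving
            v = root[v]
        return v

    chosen = set()
    local = [(w, e) for e, w in weights.items() if e <= nodes]
    for _, e in sorted(local, key=lambda item: item[0]):
        a, b = tuple(e)
        ra, rb = find(a), find(b)
        if ra != rb:
            root[ra] = rb
            chosen.add(e)
    return chosen


def local_spanner(weights, k):
    """weights maps frozenset({u, v}) to a unique edge weight."""
    adj = {}
    for e in weights:
        a, b = tuple(e)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    spanner = set()
    for p in adj:
        for q in ball(adj, p, k):         # every Q within k hops of P
            region = ball(adj, q, k)      # lies inside P's 2k-range
            # P keeps only its own incident edges of Q's regional MST.
            spanner |= {e for e in mst_edges(region, weights) if p in e}
    return spanner
```

On a triangle with weights 1, 2, 3 and k = 1, every region is the whole triangle, so the heaviest edge is dropped. On a 4-cycle, k = 1 keeps all four edges because no region contains a full cycle, while k = 2 drops the heaviest edge.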
Spanner Optimization Using Ranges > 2k
• each P computes the spanner's topology in the neighborhood with radius range-k
• P knows the complete spanner in this region
• P iteratively repeats the procedure on the resultant spanner
[Figure: P and Q with regions of radius k; P can compute the MST for each process Q in this region]
Conclusion
• complexity and scale of large systems force unorthodox approaches to faults
• we explored the spatial dimension of fault tolerance to complex unbounded faults, using the lack of global information propagation
• stated necessary conditions and impossibility results
• gave first examples of programs
• question: how to solve problems that do have global information propagation? is it possible to contain problems before they spread?