Distributed Algorithms for Failure Detection and Consensus in Crash, Crash-Recovery and Omission Environments
Mikel Larrea
Distributed Systems Group
University of the Basque Country, UPV/EHU
Context and Seminal Papers
• In the Consensus problem, all correct processes propose a value and must reach a unanimous and irrevocable decision on some proposed value
• [FLP85] M. Fischer, N. Lynch, M. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 1985
• [CT96] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 1996
• [CHT96] T. Chandra, V. Hadzilacos, S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 1996
Mikel Larrea − Mannheim, May 2011
Motivation
Motivation++ (Zurich, July 2010)
Crash Failure Detectors [CT96]
Strengthening Completeness
Guest Stars: P and Omega
• P (here the Eventually Perfect class, ◇P): strong completeness, eventual strong accuracy
  • Eventually every process that crashes is permanently suspected by every correct process
  • There is a time after which correct processes are not suspected by any correct process
• Omega satisfies the following property:
  • There is a time after which all the correct processes always trust the same correct process
• What is a correct process?
  • It depends on the failure model :-)
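The two properties of P can be illustrated with a heartbeat-based detector. This is only a minimal sketch under assumptions that are not in the slides: processes exchange periodic heartbeats, message delays are eventually bounded, and a timeout is doubled after each false suspicion. The class name `EventuallyPerfectDetector`, its methods, and the doubling policy are hypothetical choices made for illustration. The `leader()` helper shows the well-known way to obtain an Omega-style output from such a detector in the crash model: trust the smallest identifier that is not currently suspected.

```python
class EventuallyPerfectDetector:
    """Illustrative heartbeat detector (hypothetical names, not from the slides)."""

    def __init__(self, me, peers, initial_timeout=1.0, now=0.0):
        self.me = me
        # Per-peer timeout, enlarged whenever a suspicion turns out to be wrong.
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heard = {p: now for p in peers}
        self.suspected = set()

    def on_heartbeat(self, sender, now):
        self.last_heard[sender] = now
        if sender in self.suspected:
            # False suspicion: retract it and be more patient from now on.
            # If delays are eventually bounded, this happens finitely often,
            # which is what eventual strong accuracy asks for.
            self.suspected.discard(sender)
            self.timeout[sender] *= 2

    def check(self, now):
        # A crashed peer stops sending heartbeats, so it is eventually
        # suspected and never retracted (strong completeness).
        for p, heard in self.last_heard.items():
            if now - heard > self.timeout[p]:
                self.suspected.add(p)
        return set(self.suspected)

    def leader(self):
        # Omega-style output: smallest id among non-suspected processes.
        # A process never suspects itself, so the candidate set is non-empty.
        candidates = [self.me] + [p for p in self.timeout if p not in self.suspected]
        return min(candidates)


fd = EventuallyPerfectDetector(me=2, peers=[1, 3])
fd.on_heartbeat(1, now=0.5)
fd.on_heartbeat(3, now=0.5)
fd.check(now=2.0)            # both peers silent for longer than their timeout
fd.on_heartbeat(3, now=2.1)  # peer 3 was only slow: suspicion retracted
```

Once every false suspicion has been retracted and timeouts exceed the real delay bound, all correct processes suspect exactly the crashed ones, so `leader()` returns the same correct process everywhere.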
FD-based Consensus
Fault-tolerant Architecture
Outline
• Part I: Crash Environments
  • (Near-) Communication-efficient algorithms for P
  • Communication-optimal algorithms for P
• Part II: Crash-Recovery Environments
  • Implementing Omega with/without stable storage
  • Communication-efficient algorithms for Omega
  • From Omega to P
  • Fault-tolerant aggregator election and data aggregation in wireless sensor networks
• Part III: Omission Environments
  • Secure failure detection and consensus in TrustedPals
  • Communication-efficient algorithm for P
Part I: P in Crash Environments
Joint work with Roberto Cortiñas, Alberto Lafuente, Iratxe Soraluze, Joachim Wieland
The First P Algorithm [CT96]
Part I. Summary of Results
• Efficient implementations of P
  • Nearly communication-efficient algorithms (n+C links are used forever)
    • Q-based, transformations
  • Communication-efficient algorithms (n links)
    • Pure ring-based, optimizations
• Optimal implementations of P
  • Communication-optimal algorithms (C links)
    • RBcast-based, one-to-one, one-to-all
Reliable Broadcast [CT96]
“All correct processes deliver the same set of messages”
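One common way to obtain this guarantee in the crash model is message diffusion: a process relays a message to everyone the first time it receives it, and only then delivers it. The sketch below simulates that idea in memory. It assumes reliable links and crash failures only; the `Process` and `Network` classes, the FIFO queue standing in for the network, and the `(origin, seq, payload)` message format are all hypothetical and chosen for illustration.

```python
from collections import deque


class Process:
    def __init__(self, pid, network):
        self.pid = pid
        self.network = network
        self.next_seq = 0
        self.seen = set()       # messages already received once
        self.delivered = []     # messages in delivery order
        self.crashed = False

    def r_broadcast(self, payload):
        # Tag with (origin, sequence number) so every message is unique.
        msg = (self.pid, self.next_seq, payload)
        self.next_seq += 1
        self.network.send_to_all(msg)

    def on_receive(self, msg):
        if self.crashed or msg in self.seen:
            return
        self.seen.add(msg)
        if msg[0] != self.pid:
            # Relay before delivering: if the origin crashed halfway through
            # its broadcast, the relay still reaches every correct process.
            self.network.send_to_all(msg)
        self.delivered.append(msg)


class Network:
    """In-memory stand-in for reliable links (assumption for this sketch)."""

    def __init__(self, n):
        self.procs = {i: Process(i, self) for i in range(n)}
        self.queue = deque()

    def send_to_all(self, msg):
        for dest in self.procs:
            self.queue.append((dest, msg))

    def run(self):
        while self.queue:
            dest, msg = self.queue.popleft()
            self.procs[dest].on_receive(msg)


# Process 0 manages to send its message only to process 1, then crashes.
net = Network(4)
net.queue.append((1, (0, 0, "x")))
net.procs[0].crashed = True
net.run()
```

After `net.run()`, processes 1, 2 and 3 have all delivered the message even though its origin reached only one of them, at the cost of a quadratic number of messages per broadcast.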
P in Crash Environments
• [WLL07] J. Wieland, M. Larrea, A. Lafuente. An evaluation of ring-based algorithms for the Eventually Perfect failure detector class. 15th International Conference on Parallel, Distributed and Network-based Processing, 2007
• [LSCL08] M. Larrea, I. Soraluze, R. Cortiñas, A. Lafuente. An Evaluation of Communication-Optimal P Algorithms. 16th International Conference on Parallel, Distributed and Network-based Processing, 2008
Part II: Omega in Crash-Recovery Environments
Joint work with José Javier Astrain, Ernesto Jiménez, Cristian Martín, Iratxe Soraluze
Part II. Summary of Results
• Redefinition of Omega
  • Take into account unstable processes
  • Take into account the availability of stable storage
• Implementation of Omega
  • With and without stable storage
  • Efficient algorithms
• From Omega to P
• Fault-tolerant aggregator election and data aggregation in wireless sensor networks
From Omega to P
Part III: P in Omission Environments
Joint work with Roberto Cortiñas, Felix Freiling, Marjan Ghajar-Azadanlou, Alberto Lafuente, Lucia Penso, Iratxe Soraluze
Part III. Summary of Results
• Reduction from Byzantine to omission
  • Processes are equipped with tamper-proof security modules (e.g., smartcards)
  • Actually, omission + buffering/timing attacks
• Omission models
  • send | receive | general
  • permanent | transient
  • non-selective | selective
Part III. Summary of Results (continued)
• Impossibility result
  • P is impossible to implement in the (transient) general omission model
• Redefinition and implementation of P
  • In-connected and out-connected processes
  • All-to-all communication, sequence numbers, connectivity matrix
• P-based Consensus
  • Termination: every in-connected process eventually decides
  • Adaptation of Chandra-Toueg’s algorithm
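The slide does not define in-connected and out-connected, so the following is only one plausible reading, stated as an assumption: a process is in-connected if messages from some correct process can reach it over links that never drop messages (possibly relayed), and out-connected if its messages can reach some correct process in the same way. Under that assumption, both sets are reachability queries over a boolean connectivity matrix. The names `good_link`, `reachable_from` and `classify` are hypothetical, and the matrix here is given as input rather than built from sequence numbers.

```python
def reachable_from(sources, link, n):
    """Processes reachable from `sources` along links marked True in `link`."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        p = stack.pop()
        for q in range(n):
            if link[p][q] and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def classify(good_link, correct):
    """good_link[p][q] is True if messages from p to q are never dropped.

    Returns (in_connected, out_connected) under the assumed definitions.
    """
    n = len(good_link)
    # In-connected: some correct process can reach p.
    in_connected = reachable_from(correct, good_link, n)
    # Out-connected: p can reach some correct process, i.e. p is reachable
    # from a correct process in the graph with every link reversed.
    reversed_link = [[good_link[q][p] for q in range(n)] for p in range(n)]
    out_connected = reachable_from(correct, reversed_link, n)
    return in_connected, out_connected


# Processes 0 and 1 are correct. Process 2 suffers send omissions (nothing it
# sends gets through); process 3 suffers receive omissions.
good_link = [
    [True,  True,  True,  False],  # from 0
    [True,  True,  False, False],  # from 1
    [False, False, True,  False],  # from 2
    [False, True,  False, True],   # from 3
]
in_connected, out_connected = classify(good_link, correct={0, 1})
```

In this example process 2 is in-connected but not out-connected, and process 3 is the reverse, which matches the intuition that a process with only send omissions can still learn a decision while one with only receive omissions can still be heard.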
Thank you!
mikel.larrea@ehu.es