Distributed Computing:

Fault-Tolerant Distributed Computing: Atomic Broadcast

Outline • Why distributed computing? • Atomic Broadcast • The atom system • Relevance for e-textiles • What’s next? • Q&A

Why Distributed Computing? • Spread and balance the computational weight of applications • Solve bigger problems • Deal with problems locally instead of centralizing all the data

Example • Space filtering vs. raw consensus • Acoustic Beam Forming: master collects information from slaves and decides according to the relevance of data • Consensus: no master, all processes decide upon one common value

Atomic Broadcast: Definition (1) • Atomic Broadcast = the same set of messages is delivered by all the processes in the same order • Consensus = all processes decide upon one common value among those proposed

Atomic Broadcast: Definition (2) • Validity: If a correct process broadcasts a message m, it eventually delivers m • Uniform agreement: If a process delivers a message m, then every correct process eventually delivers m • Uniform integrity: Every message m is delivered at most once, and only if it was previously broadcast by sender(m) • Total order: If two correct processes p and q both deliver two messages m and m’, then p delivers m before m’ iff q delivers m before m’
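The four properties can be checked mechanically on a finished run. The sketch below is illustrative only and is not part of Atom: the function names, the `logs` layout (delivery log per correct process), and the `sent` layout (message mapped to its sender) are assumptions, and "eventually" is approximated by "by the end of the recorded trace".

```python
from itertools import combinations

def check_validity(logs, sent):
    """Every message broadcast by a correct process appears in that process's own log."""
    return all(m in logs[p] for m, p in sent.items() if p in logs)

def check_integrity(logs, sent):
    """No message is delivered twice, and every delivered message was broadcast."""
    return all(len(log) == len(set(log)) and set(log) <= set(sent)
               for log in logs.values())

def check_agreement(logs):
    """All correct processes end the trace with the same set of delivered messages."""
    return len({frozenset(log) for log in logs.values()}) <= 1

def check_total_order(logs):
    """Any two processes deliver their common messages in the same relative order."""
    for a, b in combinations(logs.values(), 2):
        common = set(a) & set(b)
        if [m for m in a if m in common] != [m for m in b if m in common]:
            return False
    return True

def check_atomic_broadcast(logs, sent):
    """logs: {process: [delivered messages]} for correct processes only.
    sent: {message: sender}."""
    return (check_validity(logs, sent) and check_integrity(logs, sent)
            and check_agreement(logs) and check_total_order(logs))
```

For example, two processes that both deliver `["m1", "m2"]` pass the check, while logs `["m1", "m2"]` and `["m2", "m1"]` fail on total order.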

Atomic Broadcast: Bad News • Impossible to achieve deterministically in a totally asynchronous system if even one process may crash [Fischer, Lynch, Paterson 85]

Atomic Broadcast: Good News • Can be done using unreliable failure detectors • Based on a Consensus algorithm described in [Chandra, Toueg 96]
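A minimal sketch of an unreliable failure detector, assuming a heartbeat scheme with per-peer timeouts that grow after each false suspicion. The class name, method names, and the doubling policy are illustrative assumptions, not Atom's implementation; the caller supplies timestamps explicitly so the behaviour is deterministic.

```python
class HeartbeatFailureDetector:
    """Suspects a peer when no heartbeat arrives within that peer's timeout.
    The detector may be wrong: a slow peer gets suspected, then trusted again."""

    def __init__(self, peers, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in peers}
        self.last_seen = {p: 0.0 for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; undo a false suspicion and become more patient."""
        self.last_seen[peer] = now
        if peer in self.suspected:
            self.suspected.discard(peer)
            self.timeout[peer] *= 2  # assumed policy: double after each mistake

    def check(self, now):
        """Return the set of peers currently suspected at time `now`."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)
```

The growing timeout is the design point: if message delays are eventually bounded, each correct peer can only be falsely suspected a finite number of times.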

Atom • Open source Atomic Broadcast system

[Figure: Atom architecture. Components: Producer, Consumer, Atom, AB task 1, AB task 2, AB task 3, One_run, RB, FD. Labeled interactions: A-broadcast, A-deliver, R-broadcast, do_Consensus, do_decide, start, cancel, transmission, FD suspect, FD trust.]
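The labels R-broadcast, do_Consensus and A-deliver match the usual way of building atomic broadcast on top of consensus: processes repeatedly agree on the next batch of undelivered messages and deliver each batch in a deterministic order. The single-threaded sketch below illustrates that idea only. Whether Atom is structured exactly this way is an assumption, `Process` is a hypothetical name, and `ToyConsensus` is a centralized stand-in that is not fault tolerant.

```python
class ToyConsensus:
    """Stand-in for a real consensus service: instance k decides the first
    value proposed to it. Centralized, NOT fault tolerant; illustration only."""

    def __init__(self):
        self._decided = {}

    def propose(self, k, value):
        return self._decided.setdefault(k, value)


class Process:
    def __init__(self, name, consensus):
        self.name = name
        self.consensus = consensus
        self.r_delivered = set()  # messages received via reliable broadcast
        self.a_delivered = []     # messages delivered in total order
        self.k = 0                # number of consensus instances joined so far

    def r_deliver(self, msg):
        self.r_delivered.add(msg)

    def step(self):
        """Run one consensus instance if some message is still undelivered."""
        undelivered = self.r_delivered - set(self.a_delivered)
        if not undelivered:
            return False
        self.k += 1
        decided = self.consensus.propose(self.k, frozenset(undelivered))
        # Deliver the agreed batch in a deterministic (sorted) order.
        for msg in sorted(decided - set(self.a_delivered)):
            self.a_delivered.append(msg)
        return True


def run_demo():
    consensus = ToyConsensus()
    p, q = Process("p", consensus), Process("q", consensus)
    p.r_deliver("b"); p.step()   # instance 1 decides {"b"}
    q.r_deliver("a"); q.step()   # q proposes {"a"} but adopts the decision {"b"}
    q.step()                     # instance 2 decides {"a"}
    p.r_deliver("a"); q.r_deliver("b")
    p.step(); q.step()
    return p.a_delivered, q.a_delivered
```

In `run_demo`, p and q receive the messages in opposite orders, yet both end with the delivery sequence `['b', 'a']`, because the order is fixed by the sequence of consensus decisions rather than by arrival order.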

Relevance to E-textiles • Synchronization of data • Coordination of decisions and actions • Light-weight process • Buffer sizes can be predicted

What’s Next? • Scalability is a problem for classic fault-tolerant distributed algorithms • Bimodal Multicast [Ken Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu, Yaron Minsky, 1998] • Gossip protocol • Relaxes the “strong” reliability guarantees, replacing them with probabilistic guarantees • Converges to “strong” reliability in the absence of failures • Scalable with steady throughput
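A rough feel for how gossip spreads a message can be had from a small simulation. The sketch below is plain push gossip and is not the Bimodal Multicast protocol itself. The function name and the parameters `fanout`, `loss` and `seed` are assumptions chosen for illustration.

```python
import random

def gossip_rounds(n, fanout, rounds, loss=0.0, seed=0):
    """Simulate push gossip among n nodes, starting from node 0.
    Each round, every node that has the message sends it to `fanout`
    randomly chosen nodes; each send is dropped with probability `loss`.
    Returns the number of nodes holding the message after each round."""
    rng = random.Random(seed)
    have = {0}
    history = [len(have)]
    for _ in range(rounds):
        newly_reached = set()
        for _node in have:
            for peer in rng.sample(range(n), min(fanout, n)):
                if rng.random() >= loss:  # the send survived
                    newly_reached.add(peer)
        have |= newly_reached
        history.append(len(have))
    return history
```

Calling `gossip_rounds(50, 3, 8)` returns the per-round coverage. Raising `loss` slows the spread without stopping it unless every send is dropped, which is the kind of probabilistic guarantee gossip offers in place of a strict one.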

Questions …
