DISTRIBUTED SYSTEMS
Department of Computing Science, Umea University
Fundamental Concepts
About Distributed Computing
• devising algorithms for a set of processes that seek to achieve some form of a 'cooperative goal'
• quoting Leslie Lamport: 'a distributed system is one in which the failure of a computer you did not even know existed can render your own computer unusable'
Distributed Algorithm
• has no shared global information: each process decides based only on its local state and the messages it receives
• has no shared global time frame: observes the progress of the computation through, at best, a partial order of events
• non-deterministic behaviour: the exact sequence of global states cannot be predicted from a study of the algorithm
Design challenges from a systems perspective
• heterogeneity: in hardware, OS, mode of interaction (client-server, p2p etc.), middleware provisioning for developers
• security: involves eavesdropping, deliberate corruption, process compromise, denial of service etc.
• scalability: robustness, performance bottlenecks
• process failures: detecting/suspecting, masking, tolerating, recovery, redundancy in the presence of partial process failures
• concurrency
transparency:
• access (local and remote resources accessed through identical operations)
• location (resources accessed independently of physical location)
• concurrency (process concurrency on shared resources)
• replication (maintaining replicas with consistency)
• failure (concealment of failures)
• mobility (movement of resources and clients)
Role of middleware
• software layer with services provided to the application designer
• consisting of processes and objects
• mechanisms: Remote Method Invocation, object brokering, Service Oriented Architecture, event notification, distributed shared memory, …
[Figure: layering, top to bottom: Applications/Services | Middleware | Operating System | Computer & Network H/W]
Motivating application domains
• information dissemination (publish-subscribe paradigm): by event registration and notification with a time-space decoupling property, based on reliable broadcast and agreement abstractions
• process control in automation, in industrial systems etc., where consensus may have to be reached on a multitude of sensor inputs
• cooperative work: multi-user cooperation in editing etc., based on the shared persistent space paradigm employing ordered broadcast abstractions
• distributed databases: need for an atomic commitment abstraction on the acceptance or rejection of serialized transactions
Motivating application domains
• software-based fault tolerance through replication: uses the so-called state machine replication paradigm
• used when a centralized server must be made highly available by executing several copies of it, whose consistency is guaranteed by a total order broadcast abstraction
Modeling of distributed systems
• abstraction:
• to capture properties that are common to a large range of systems, so that we can distinguish the fundamental from the accessory
• to avoid reinventing the wheel for every minor variant of the problem
• a model captures the key components and the way they interact, abstracting away the rest
• purpose:
• to make explicit all relevant assumptions about the system
• to express behaviour through algorithms
• to make impossibility observations etc. through logical analysis, including proofs
Modeling of distributed systems
• abstracting the physical model: processes, links and failure detectors (the latter an indirect measurement of time)
[Figure: a set of processes connected by links]
Modeling of distributed systems
• component properties:
• channel (a communication resource) - message delays, message loss
• process (a computational resource, has only local state) – can incur process failure, be infinitely slow or corrupt
• low-level models of interaction: synchronous message passing, asynchronous message passing
[Figure: a process with internal computation modules, an incoming message channel (receive) and an outgoing message channel (send)]
Modeling of distributed systems
• failure detector abstraction: a possible way to capture the notion of process and link failures based on their timing behaviour
• incorporates a failure detector, a specialized module within each process, which exchanges heartbeats with the others
• a failure detector can be considered an indirect abstraction of time; simply, a timeout is taken as an indication of a failure, and the outcome (suspected or unsuspected) is mostly unreliable (a timeout-based sketch follows below)
• a synchronous system => a 'perfect failure detector'
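To make the timeout idea concrete, here is a minimal sketch (not from the slides; the class name, timeout value and callback names are illustrative assumptions) that keeps a last-heartbeat timestamp per peer and suspects any peer that has been silent for too long, giving exactly the unreliable, revisable verdict described above.

import time

# Minimal sketch of a timeout-based failure detector (illustrative only).
# Each process runs one of these; peers are assumed to send periodic heartbeats.
class TimeoutFailureDetector:
    def __init__(self, peers, timeout=2.0):
        self.timeout = timeout                        # seconds of silence before suspecting a peer
        self.last_seen = {p: time.time() for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer):
        """Called whenever a heartbeat message from `peer` is delivered."""
        self.last_seen[peer] = time.time()
        self.suspected.discard(peer)                  # un-suspect on a late heartbeat (unreliable detector)

    def check(self):
        """Periodically called; returns the current set of suspected peers."""
        now = time.time()
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.suspected.add(peer)
        return set(self.suspected)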
Modeling of distributed systems
• clock: physical and logical
• abstracting a process: by the process failure model
[Figure: nested process failure classes: Crashes, within Omissions, within Crashes & Recoveries, within Arbitrary]
Modeling of distributed systems
• crashes: a faulty process, as opposed to a correct process (which executes an infinite number of steps), does no further local computation, generates no messages and does not respond to messages
• a crash does not preclude a recovery later, but that is considered a separate category
• also, the correctness of any algorithm may depend on a maximum admissible number of faulty processes
arbitrary faults: a process deviates arbitrarily from the algorithm assigned to it
• also known as malicious or Byzantine faults, or may in fact be due to a bug in the program
• under such conditions some algorithmic abstractions may be 'impossible'
Modeling of distributed systems
• omission failure: due to network congestion or buffer overflow, resulting in a process being unable to send messages
• crash-recovery: a process either simply crashes (fail-stop) or crashes and recovers infinitely many times
• every process that recovers is assumed to have a stable storage (also called a log), accessible through some primitives, which stores the most recent local state with timestamps (sketched below)
• alternatively, processes which never crash could also act as virtual stable storage
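A minimal sketch of the stable-storage (log) primitives such a crash-recovery process is assumed to have; the file name, JSON encoding and method names are illustrative assumptions, not part of the slides.

import json, os

# Minimal sketch of the stable-storage (log) abstraction assumed by the
# crash-recovery model: the most recent local state survives a crash.
class StableStorage:
    def __init__(self, path="process_state.log"):
        self.path = path

    def store(self, state, timestamp):
        """Persist the latest local state together with a timestamp."""
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"timestamp": timestamp, "state": state}, f)
            f.flush()
            os.fsync(f.fileno())           # force the write to disk before renaming
        os.replace(tmp, self.path)         # atomic rename: old or new state, never a torn write

    def retrieve(self):
        """On recovery, return the last persisted state, or None if none exists."""
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)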
Modeling of distributed systems
• abstracting communication: by loss or corruption of messages, also known as communication omission
• usually resolved through end-to-end network protocol support, unless of course there is a network partition
• desirable properties for 'reliable' delivery of messages (one way to obtain them is sketched below):
• liveness: any message in the outgoing buffer of the sender is 'eventually' delivered to the incoming message buffer of the receiver
• safety: the message received is identical to the one sent, and no messages are delivered twice
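A minimal sketch of how these two properties are typically obtained over a lossy channel: retransmission until acknowledgement for liveness, sequence-number de-duplication for safety. The `unreliable_send` primitive and all names are illustrative assumptions.

# Minimal sketch of 'reliable' delivery on top of a lossy channel (illustrative):
# liveness via retransmission, safety via de-duplication on sequence numbers.
class ReliableLink:
    def __init__(self, unreliable_send, deliver):
        self.unreliable_send = unreliable_send   # assumed fair-loss primitive: may lose messages, but not forever
        self.deliver = deliver                   # application-level delivery callback
        self.next_seq = 0
        self.unacked = {}                        # seq -> (dest, message), awaiting acknowledgement
        self.delivered = set()                   # (sender, seq) pairs already delivered

    def send(self, dest, msg):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (dest, msg)
        self.unreliable_send(dest, ("DATA", seq, msg))

    def on_timer(self):
        # periodically retransmit everything not yet acknowledged (liveness)
        for seq, (dest, msg) in self.unacked.items():
            self.unreliable_send(dest, ("DATA", seq, msg))

    def on_receive(self, sender, packet):
        kind, seq, payload = packet
        if kind == "DATA":
            self.unreliable_send(sender, ("ACK", seq, None))
            if (sender, seq) not in self.delivered:   # never deliver a message twice (safety)
                self.delivered.add((sender, seq))
                self.deliver(sender, payload)
        elif kind == "ACK":
            self.unacked.pop(seq, None)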
Abstracting other higher-level interactions
• e.g., capturing recurring patterns of interaction in the form of:
• distributed agreement (on an event, a sequence of events etc.)
• atomic commitment (whether to take an irrevocable step or not)
• total order broadcast (i.e., agreeing on the order of actions)
• this leads to a wide range of algorithms
Modeling of distributed systems
• predicting impossibility results in higher-level interactions
• due, in some cases, to the indistinguishability of network failures from process failures, or of a slow process from a network delay
• e.g., agreement in the presence of message loss, agreement in the presence of process failures in asynchronous settings
• impossibility of agreement in the presence of message loss
• leads to a widely used assumption in almost all models
• the typical two-army problem; formal model described below
Formal model of the two-army problem
• processes A and B communicate by sending and receiving messages on a bidirectional channel; A sends a message to B, then B sends a message to A, and so on
• A and B can each execute one of two actions
• neither process can fail, but the channel can lose messages
• the desired outcome is that both processes take the same action, and neither takes both actions
proof by contradiction: let P be a protocol that solves the problem using the fewest rounds, with the last message sent by A being m
• observe that the action taken by A cannot depend on m, since A can never learn whether m was received
• the action taken by B cannot depend on m, because B must take the same choice of action as A even if m is lost
• since the actions of both A and B do not depend on m, m can be discarded
• so m is not the last message, and P does not use the fewest rounds: a contradiction
Formal models for message passing algorithms
• processes and channels: channels can be unidirectional or bidirectional
• topology represented by an undirected graph G(V, E)
[Figure: example topology graph over processes P0…P4]
Formal models for message passing algorithms
• the system has n processes, p0 to pn-1, where i is the index of the process
• the algorithm run by each pi is modeled as a process automaton, a formal description of a sequential algorithm, and is associated with a node in the topology
Formal models for message passing algorithms
• a process automaton is a description of the process state machine
• it consists of a 5-tuple: {message alphabet, process states, initial states, message generation function, state transition function} (a minimal interface sketch follows below)
• message_alphabet: the content of messages exchanged
• process_states: the finite set of states that a process can be in
• initial_state: the start state of a process
• message_gen_function: given the current process state, how the next message is generated
• state_trans_function: on receipt of messages, and based on the current state, the next state to which the process should transit
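A minimal sketch of this 5-tuple as a Python interface (illustrative; the class and method names are assumptions). The message alphabet is implicit in the values the two functions produce and consume; concrete algorithms such as LCR below would subclass it.

# Minimal sketch of the process automaton 5-tuple as a Python interface (illustrative).
class ProcessAutomaton:
    def __init__(self, initial_state):
        self.state = initial_state            # one of process_states, starting in initial_state

    def generate_messages(self):
        """Message generation function: from the current state, produce the
        messages (drawn from the message alphabet) to place on outgoing channels."""
        raise NotImplementedError

    def transition(self, received):
        """State transition function: given the received messages and the
        current state, move to the next state."""
        raise NotImplementedError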
Description of system state
• a configuration is a vector C = (q0, …, qn-1), where qi is a state of pi
• in message passing systems two kinds of events can take place: a computation event of process pi (application of the so-called state transition function), and a delivery event, the delivery of a message m from process pi to process pj, consisting of a message sending event and a corresponding receiving event
• each message is uniquely identified by its sender process, a sequence number and possibly a local clock value
• the behaviour of the system over time is modeled as an execution, which is a sequence of configurations alternating with events
Formal models for message passing algorithms
• all possible executions of a distributed abstraction must satisfy two conditions: safety and liveness
Formal models for message passing algorithms
• safety: 'nothing bad has/can happen (yet)'
• e.g., 'every step by a process pi immediately follows a step by process p0', or 'no process should receive a message unless the message was indeed sent'
• a safety property, once violated at some time t, can never be satisfied thereafter; doing nothing will also ensure safety!
Formal models for message passing algorithms
• liveness: 'eventually something good happens'
• a condition that must hold a number of times (possibly infinitely often), e.g., 'eventually p1 terminates' => p1's termination happens once; or, liveness for a perfect link requires that if a correct process (one which is alive and well behaved) sends a message to a correct destination process, then the destination process should eventually deliver the message
• liveness is a property such that, for any time t, there is some hope that the property can be satisfied at some time t' ≥ t
Asynchronous systems
• there is no fixed upper bound on message delivery time, or on the time elapsed between consecutive steps of a process
• the notions of ordering of events (local computation, message send or message receive) are based on logical clocks
• an execution of an asynchronous message passing system is a finite or infinite sequence of the form C0, φ1, C1, φ2, C2, …, where Ck is a configuration of process states, C0 is an initial configuration and φk is an event that captures message send, computation and message receive events
• a schedule σ is the sequence of events in the execution, e.g., φ1, φ2, …; if the local processes are deterministic, then the execution is uniquely defined by (C0, σ) (sketched below)
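A minimal illustrative sketch of this definition: treat each event as a function from one configuration to the next, so an execution is just a fold of the schedule over C0. The representation and names are assumptions.

# Minimal sketch (illustrative): an execution C0, phi1, C1, phi2, C2, ... obtained by
# folding a schedule of events over an initial configuration. With deterministic
# processes the execution is fully determined by (C0, schedule).
def execution(initial_config, schedule):
    configs = [initial_config]
    for event in schedule:
        configs.append(event(configs[-1]))          # C_k = event_k(C_{k-1})
    out = [configs[0]]
    for event, config in zip(schedule, configs[1:]):
        out.extend([event, config])                 # interleave: C0, phi1, C1, phi2, C2, ...
    return out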
Synchronous systems
• there is a known upper bound on message transmission and processing delays
• processes execute in lock step; the execution is partitioned into 'rounds': C0, φ1, C1, φ2, C2, …, where each φk comprises the events of round k (a lock-step simulation sketch follows below)
• very convenient for designing algorithms, but not very practical
• leads to some useful possibilities, e.g., timed failure detection (every process crash can be detected by all correct processes) and a lease abstraction
• in a synchronous system with no failures, only C0 matters for a given algorithm, but in an asynchronous system there can be many executions for a given algorithm
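A minimal sketch of this lock-step model (illustrative; it reuses the ProcessAutomaton-style interface sketched earlier): in each round every process first produces its messages, then all messages are delivered and every process takes exactly one transition.

# Minimal sketch of lock-step synchronous rounds (illustrative).
def run_rounds(processes, topology, num_rounds):
    # processes: dict pid -> automaton exposing generate_messages() and transition()
    # topology:  dict pid -> list of neighbour pids
    for _ in range(num_rounds):
        outbox = {pid: proc.generate_messages() for pid, proc in processes.items()}
        inbox = {pid: [] for pid in processes}
        for pid, msgs in outbox.items():
            for neighbour in topology[pid]:
                inbox[neighbour].extend(msgs)       # all messages delivered within the same round
        for pid, proc in processes.items():
            proc.transition(inbox[pid])             # one lock-step state transition per round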
[Figure: synchronous message passing between processes P, Q and R: send()/recv() and a state transition from the current state to a new state in each of rounds 1, 2, 3, every round bounded by an upper bound on time]
Properties of algorithms
• validity and agreement: specific to the objective of the algorithm
• termination: an algorithm has terminated when all processes have terminated and there are no messages in transit
• an execution can still be infinite, but once terminated, a process stays in that state taking 'dummy' steps
• complexity: message complexity (the maximum number of messages sent over all possible executions) and time complexity (equal to the maximum number of rounds if synchronous; in the asynchronous case this is less straightforward)
Properties of algorithms
• interaction algorithms are possible for each process failure model:
• fail-stop – processes can fail by crashing, but the crashes can be reliably detected by all other processes
• fail-silent – where process crashes can never be reliably detected
• fail-noisy – processes can fail by crashing, and the crashes can be detected, but not always in a reliable manner
• fail-recovery – where processes can crash and later recover and still participate in the algorithm
• Byzantine – processes deviate from the intended behaviour in an unpredictable manner
• solutions do not exist for all models in all interaction abstractions
Coordination and Agreement
Under this broad topic we will discuss:
• Leader election
• Consensus
• Distributed mutual exclusion
• common or uniform decisions by participating processes, in response to various internal and external stimuli, are often required in the presence of failures and under varying synchrony assumptions
Leader election (LE)
• a process that is correct and acts as the coordinator in some steps of a distributed algorithm is a leader; e.g., the commit manager in a distributed database, the central server in distributed mutual exclusion
• the LE abstraction can be straightforwardly implemented using a perfect failure detector (that is, in a synchronous setting)
• hierarchical LE (sketched below): assumes the existence of a ranking order agreed among processes a priori, s.t. a function O associates, with every process, those that precede it in the ranking, i.e., O(p1) = ∅, p1 is leader by default; O(p2) = {p1}, if p1 dies p2 becomes leader; O(p3) = {p1, p2}; etc.
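A minimal sketch of this hierarchical rule (illustrative; it assumes the suspicions come from a perfect failure detector, such as the timeout sketch earlier behaving accurately): the leader is simply the highest-ranked process that is not currently suspected.

# Minimal sketch of hierarchical leader election (illustrative).
def elect_leader(ranking, suspected):
    # ranking:   list of process ids ordered by priority, e.g. ["p1", "p2", "p3"]
    # suspected: set of process ids the (perfect) failure detector reports as crashed
    for pid in ranking:
        if pid not in suspected:
            return pid            # p1 by default; p2 if p1 has crashed; and so on
    return None                   # all processes crashed

For example, elect_leader(["p1", "p2", "p3"], {"p1"}) returns "p2", matching O(p2) = {p1} above.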
Leader election (LE)
LCR algorithm (LeLann-Chang-Roberts): a simple ring-based algorithm
• assumptions: n processes, each with a hard-coded uid, in a logical ring topology; unidirectional message passing from pi to p(i+1) mod n; processes are not aware of the ring size; asynchronous; no process failures; no message loss
• the leader is defined to be the process with the highest uid
Leader election (LE)
algorithm in prose:
• each process forwards its uid to its neighbour
• if received uid < own uid, then discard; else if received uid > own uid, forward the received uid to the neighbour; else if received uid = own uid, then declare self as leader
[Figure: unidirectional ring of processes with uids uid1, uid2, …, uidn]
Leader election (LE)
• process automaton:
message_alphabet: the set U of uids
statei for each pi is defined by three state variables:
  u ∈ U, initially uidi
  send ∈ U ∪ {null}, initially uidi
  status ∈ {leader, unknown}, initially unknown
msgi: place the value of send on the output channel
transi: { send = null;
  receive v ∈ U on the input channel;
  if v = null or v < u then exit;
  if v > u then send = v;
  if v = u then status = leader; }
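The sketch below (illustrative, not from the slides) simulates this automaton on a ring in lock-step rounds; the list-based representation and variable names are assumptions.

# Minimal sketch of the LCR ring algorithm (illustrative): each process forwards the
# largest uid seen so far around a unidirectional ring; the process that receives its
# own uid declares itself leader.
def lcr(uids):
    n = len(uids)
    u = list(uids)                  # u[i]: own uid of process i
    send = list(uids)               # send[i]: value process i forwards this round
    status = ["unknown"] * n
    for _ in range(n):              # at most n rounds until the leader is discovered
        received = [send[(i - 1) % n] for i in range(n)]   # message from ring predecessor
        for i in range(n):
            v = received[i]
            if v is None or v < u[i]:
                send[i] = None                     # swallow smaller uids
            elif v > u[i]:
                send[i] = v                        # forward larger uids
            else:                                  # v == u[i]: own uid came all the way round
                status[i] = "leader"
                send[i] = None
    return [uids[i] for i in range(n) if status[i] == "leader"]

# e.g. lcr([3, 7, 2, 5]) returns [7]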
Leader election (LE)
• expected properties:
• validity – if a process decides, then the decided value is the largest uid of a process
• termination – every correct process eventually decides
• agreement – no two correct processes decide differently
• message complexity: O(n²)
• time complexity: if synchronous, then n rounds until the leader is discovered and 2n rounds until termination
• other possible scenarios: synchronous and processes aware of the ring size n (useful if processes fail), bidirectional ring (for a more efficient version of the algorithm)
Leader election (LE)
• an O(n log n) message complexity algorithm (Hirschberg-Sinclair)
• assumptions: bidirectional ring, where for every i, 0 ≤ i ≤ n-1, pi has a channel to the left to p(i+1) mod n and a channel to the right to p(i-1) mod n; n processes, each with a hard-coded uid, in a logical ring topology; processes are not aware of the ring size; asynchronous; no process failures; no message loss
[Figure: bidirectional ring of processes with uids uid1, uid2, …, uidk]
Leader election (LE)
algorithm in prose:
• as before, a process sends its identifier around the ring, and the message of the process with the highest identifier traverses the whole ring and returns
• define the k-neighbourhood of a process pi to be the set of processes at distance at most k from pi in either direction, left and right
• the algorithm operates in phases, starting from 0
• in the kth phase a process tries to become a winner for that phase, where it must have the largest uid in its 2^k-neighbourhood
• only processes that are winners in the kth phase can go on to the (k+1)th phase
• to start with, in phase 0, each process attempts to become a phase 0 winner and sends probe messages to its left and right neighbours
• if the identifier of the neighbour receiving the probe is higher, then it swallows the probe; otherwise it sends back a reply message if it is at the edge of the neighbourhood, or else forwards the probe to the next in line
• a process that receives replies from both its neighbours is a winner in phase 0
• similarly, in a 2^k-neighbourhood the kth-phase winner will receive replies from the farthest two processes in either direction
• a process which receives its own probe message declares itself the winner
Leader election (LE)
pseudo code for pi:
initially phase = 0 and hop_count = 1;
send <probe, uidi, phase, hop_count> to left and to right;
upon receiving <probe, j, k, d> from left (or right) {
  if j = uidi then terminate as leader;
  if j > uidi and d < 2^k then send <probe, j, k, d+1> to right (or left);  // forward msg and increase hop count
  if j > uidi and d ≥ 2^k then send <reply, j, k> to left (or right); }     // if reached edge, do not forward but reply; if j < uidi, msg is swallowed
upon receiving <reply, j, k> from left (or right) {
  if j ≠ uidi then send <reply, j, k> to right (or left)                    // forward
  else                                                                      // reply is for own probe
    if already received <reply, j, k> from right (or left) then
      send <probe, uidi, k+1, 1> to left and to right; }                    // phase k winner
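A minimal sketch (illustrative, not the slides' code) that simulates the above pseudo code on a bidirectional ring using a single FIFO of in-flight messages; the message tuples and helper names are assumptions.

from collections import deque

# Minimal sketch of the Hirschberg-Sinclair election (illustrative), assuming unique uids.
def hirschberg_sinclair(uids):
    n = len(uids)
    replies = [set() for _ in range(n)]            # directions ('L'/'R') replied to in the current phase
    queue = deque()

    def send(src, direction, msg):
        # direction 'L' means towards index src-1, 'R' towards index src+1 (mod n)
        dst = (src - 1) % n if direction == 'L' else (src + 1) % n
        queue.append((dst, direction, msg))

    for i in range(n):                             # phase 0: probe both neighbours at distance 1
        send(i, 'L', ('probe', uids[i], 0, 1))
        send(i, 'R', ('probe', uids[i], 0, 1))

    while queue:
        i, direction, msg = queue.popleft()
        came_from = 'R' if direction == 'L' else 'L'        # side of pi the message arrived on
        if msg[0] == 'probe':
            _, j, k, d = msg
            if j == uids[i]:
                return uids[i]                     # own probe travelled the whole ring: leader
            if j < uids[i]:
                continue                           # swallow the probe
            if d < 2 ** k:
                send(i, direction, ('probe', j, k, d + 1))  # forward and increase hop count
            else:
                send(i, came_from, ('reply', j, k))         # edge of the 2^k neighbourhood: reply
        else:
            _, j, k = msg
            if j != uids[i]:
                send(i, direction, ('reply', j, k))         # relay the reply back towards its owner
            else:
                replies[i].add(came_from)
                if replies[i] == {'L', 'R'}:       # replies from both directions: phase-k winner
                    replies[i] = set()
                    send(i, 'L', ('probe', uids[i], k + 1, 1))
                    send(i, 'R', ('probe', uids[i], k + 1, 1))

# e.g. hirschberg_sinclair([3, 7, 2, 5]) returns 7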
Leader election (LE)
• other possible scenarios:
• synchronous, with alternative 'swallowing' rules (e.g., anything higher than the minimum uid seen so far) and with tweaking of uid usage
• this leads to a synchronous leader election algorithm whose message complexity is at most 4n
DME (Distributed Mutual Exclusion)
• shared-memory mutual exclusion is a well-known topic in operating systems, arising when concurrent threads need to access a shared variable or object for read/write purposes
• the shared resource is made a critical section, with access to it controlled by atomic lock or semaphore operations
• the lock or semaphore variable is seen by all threads consistently
• asynchronous shared memory is an alternative possibility: say, P1, P2 and P3 share M1, and P2 and P3 share M2
DME
• in a distributed system there is no shared lock variable to look at
• processes have to agree, by message passing, on the process eligible to access the shared resource at any given time
• assumptions: a system of n processes, pi, i = 1..n; a process wishing to access an external shared resource must obtain permission to enter the critical section (CS); asynchronous; processes do not fail; messages are reliably delivered
correctness properties
• ME1 safety: at most one process may execute in the CS at any given time
• ME2 liveness: requests to enter and exit the CS eventually succeed
• ME3 ordering: if one request to enter the CS 'happened-before' another, then entry to the CS is granted in that order
• ME2 ensures freedom from both starvation and deadlock
DME
• several algorithms exist: central server version, ring-based, Ricart-Agrawala
[Figure: central server version: the server keeps a queue of requests; a process 1. requests the token and 2. releases the token; the server 3. grants the token to the next queued process]
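A minimal sketch of the server side of the central server version (illustrative; the send callback and message names are assumptions): the server holds the token, grants it to one requester at a time, and queues the remaining requests in FIFO order.

from collections import deque

# Minimal sketch of the central-server mutual exclusion algorithm, server side (illustrative).
class CentralServer:
    def __init__(self, send):
        self.send = send                 # assumed message-sending primitive: send(pid, msg)
        self.holder = None               # process currently holding the token (in the CS)
        self.queue = deque()             # waiting requests, served in FIFO order

    def on_request(self, pid):           # "1. request token"
        if self.holder is None:
            self.holder = pid
            self.send(pid, "GRANT")      # "3. grant token": pid may enter the CS
        else:
            self.queue.append(pid)       # someone is in the CS: queue the request

    def on_release(self, pid):           # "2. release token": pid leaves the CS
        self.holder = None
        if self.queue:
            nxt = self.queue.popleft()
            self.holder = nxt
            self.send(nxt, "GRANT")      # grant to the longest-waiting process

Under the stated assumptions (no failures, reliable delivery) this keeps at most one token holder (ME1) and eventually serves every request (ME2); note that the FIFO queue orders requests by arrival at the server rather than by 'happened-before'.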