DISTRIBUTED SYSTEMS
Department of Computing Science, Umea University
Fundamental Concepts
About Distributed Computing
• devising algorithms for a set of processes that seek to achieve some form of a 'cooperative goal'
• quoting Leslie Lamport: 'a distributed system is one in which the failure of a computer you did not even know existed can render your own computer unusable'
Distributed Algorithm
• has no shared global information: each process decides based only on its local state and the messages it receives
• has no shared global time frame: observes the progress of the computation through, at best, a partial order of events
• non-deterministic behaviour: the exact sequence of global states cannot be predicted from a study of the algorithm
Design challenges from a systems perspective
• heterogeneity: in hardware, OS, mode of interaction (client-server, p2p etc.), middleware provisioning for developers
• security: involves eavesdropping, deliberate corruption, process compromise, denial of service etc.
• scalability: robustness, performance bottlenecks
• process failures: detecting/suspecting, masking, tolerating, recovery, redundancy in the presence of partial process failures
• concurrency
transparency:
• access (local and remote resources accessed through identical operations)
• location (resources accessed independently of physical location)
• concurrency (process concurrency on shared resources)
• replication (maintaining replicas with consistency)
• failure (concealment of failures)
• mobility (movement of resources and clients)
Role of middleware
• software layer with services provided to the application designer
• consisting of processes and objects
• mechanisms: Remote Method Invocation, object brokering, Service Oriented Architecture, event notification, distributed shared memory, …
[Figure: layering, top to bottom: Applications/Services | Middleware | Operating System | Computer & Network H/W]
Motivating application domains
• information dissemination (publish-subscribe paradigm): by event registration and notification with a time-space decoupling property, based on reliable broadcast and agreement abstractions
• process control in automation, in industrial systems etc., where consensus may have to be reached on a multitude of sensor inputs
• cooperative work: multi-user cooperation in editing etc., based on the shared persistent space paradigm employing ordered broadcast abstractions
• distributed databases: need for an atomic commitment abstraction on the acceptance or rejection of serialized transactions
Motivating application domains
• software-based fault tolerance through replication: uses the so-called state machine replication paradigm
• used when a centralized server must be made highly available by executing several copies of it, whose consistency is guaranteed by a total order broadcast abstraction
Modeling of distributed systems
• abstraction:
• to capture properties that are common to a large range of systems, so that we can distinguish the fundamental from the accessory
• to avoid reinventing the wheel for every minor variant of the problem
• a model captures the key components and the way they interact, abstracting away the rest
• purpose:
• to make explicit all relevant assumptions about the system
• to express behaviour through algorithms
• to make impossibility observations etc. through logical analysis, including proofs
Modeling of distributed systems
• abstracting the physical model: processes, links and failure detectors (the latter an indirect measurement of time)
[Figure: a set of processes connected by links]
Modeling of distributed systems
• component properties:
• channel (a communication resource) - message delays, message loss
• process (a computational resource, has only local state) – can incur process failure, be infinitely slow or corrupt
• low-level models of interaction: synchronous message passing, asynchronous message passing
[Figure: a process with internal computation modules, an incoming message channel (receive) and an outgoing message channel (send)]
Modeling of distributed systems
• failure detector abstraction: a possible way to capture the notion of process and link failures based on their timing behaviour
• incorporates a failure detector, a specialized module within each process, which exchanges heartbeats with the others
• a failure detector can be considered an indirect abstraction of time; simply, a timeout is taken as an indication of a failure, and the outcome (suspected or unsuspected) is mostly unreliable (a timeout-based sketch follows below)
• a synchronous system => a 'perfect failure detector'
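To make the timeout idea concrete, here is a minimal sketch (not from the slides; the class name, timeout value and callback names are illustrative assumptions) that keeps a last-heartbeat timestamp per peer and suspects any peer that has been silent for too long, giving exactly the unreliable, revisable verdict described above.

import time

# Minimal sketch of a timeout-based failure detector (illustrative only).
# Each process runs one of these; peers are assumed to send periodic heartbeats.
class TimeoutFailureDetector:
    def __init__(self, peers, timeout=2.0):
        self.timeout = timeout                        # seconds of silence before suspecting a peer
        self.last_seen = {p: time.time() for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer):
        """Called whenever a heartbeat message from `peer` is delivered."""
        self.last_seen[peer] = time.time()
        self.suspected.discard(peer)                  # un-suspect on a late heartbeat (unreliable detector)

    def check(self):
        """Periodically called; returns the current set of suspected peers."""
        now = time.time()
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.suspected.add(peer)
        return set(self.suspected)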
Modeling of distributed systems
• clock: physical and logical
• abstracting a process: by the process failure model
[Figure: nested process failure classes: Crashes, within Omissions, within Crashes & Recoveries, within Arbitrary]
Modeling of distributed systems
• crashes: a faulty process, as opposed to a correct process (which executes an infinite number of steps), does no further local computation, generates no messages and does not respond to messages
• a crash does not preclude a recovery later, but that is considered a separate category
• also, the correctness of any algorithm may depend on a maximum admissible number of faulty processes
arbitrary faults: a process deviates arbitrarily from the algorithm assigned to it
• also known as malicious or Byzantine faults, or may in fact be due to a bug in the program
• under such conditions some algorithmic abstractions may be 'impossible'
Modeling of distributed systems
• omission failure: due to network congestion or buffer overflow, resulting in a process being unable to send messages
• crash-recovery: a process either simply crashes (fail-stop) or crashes and recovers infinitely many times
• every process that recovers is assumed to have a stable storage (also called a log), accessible through some primitives, which stores the most recent local state with timestamps (sketched below)
• alternatively, processes which never crash could also act as virtual stable storage
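A minimal sketch of the stable-storage (log) primitives such a crash-recovery process is assumed to have; the file name, JSON encoding and method names are illustrative assumptions, not part of the slides.

import json, os

# Minimal sketch of the stable-storage (log) abstraction assumed by the
# crash-recovery model: the most recent local state survives a crash.
class StableStorage:
    def __init__(self, path="process_state.log"):
        self.path = path

    def store(self, state, timestamp):
        """Persist the latest local state together with a timestamp."""
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"timestamp": timestamp, "state": state}, f)
            f.flush()
            os.fsync(f.fileno())           # force the write to disk before renaming
        os.replace(tmp, self.path)         # atomic rename: old or new state, never a torn write

    def retrieve(self):
        """On recovery, return the last persisted state, or None if none exists."""
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)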
Modeling of distributed systems
• abstracting communication: by loss or corruption of messages, also known as communication omission
• usually resolved through end-to-end network protocol support, unless of course there is a network partition
• desirable properties for 'reliable' delivery of messages (one way to obtain them is sketched below):
• liveness: any message in the outgoing buffer of the sender is 'eventually' delivered to the incoming message buffer of the receiver
• safety: the message received is identical to the one sent, and no messages are delivered twice
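A minimal sketch of how these two properties are typically obtained over a lossy channel: retransmission until acknowledgement for liveness, sequence-number de-duplication for safety. The `unreliable_send` primitive and all names are illustrative assumptions.

# Minimal sketch of 'reliable' delivery on top of a lossy channel (illustrative):
# liveness via retransmission, safety via de-duplication on sequence numbers.
class ReliableLink:
    def __init__(self, unreliable_send, deliver):
        self.unreliable_send = unreliable_send   # assumed fair-loss primitive: may lose messages, but not forever
        self.deliver = deliver                   # application-level delivery callback
        self.next_seq = 0
        self.unacked = {}                        # seq -> (dest, message), awaiting acknowledgement
        self.delivered = set()                   # (sender, seq) pairs already delivered

    def send(self, dest, msg):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (dest, msg)
        self.unreliable_send(dest, ("DATA", seq, msg))

    def on_timer(self):
        # periodically retransmit everything not yet acknowledged (liveness)
        for seq, (dest, msg) in self.unacked.items():
            self.unreliable_send(dest, ("DATA", seq, msg))

    def on_receive(self, sender, packet):
        kind, seq, payload = packet
        if kind == "DATA":
            self.unreliable_send(sender, ("ACK", seq, None))
            if (sender, seq) not in self.delivered:   # never deliver a message twice (safety)
                self.delivered.add((sender, seq))
                self.deliver(sender, payload)
        elif kind == "ACK":
            self.unacked.pop(seq, None)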
Abstracting other higher-level interactions
• e.g., capturing recurring patterns of interaction in the form of:
• distributed agreement (on an event, a sequence of events etc.)
• atomic commitment (whether to take an irrevocable step or not)
• total order broadcast (i.e., agreeing on the order of actions)
• this leads to a wide range of algorithms
Modeling of distributed systems
• predicting impossibility results in higher-level interactions
• due, in some cases, to the indistinguishability of network failures from process failures, or of a slow process from a network delay
• e.g., agreement in the presence of message loss, agreement in the presence of process failures in asynchronous settings
• impossibility of agreement in the presence of message loss
• leads to a widely used assumption in almost all models
• the typical two-army problem; formal model described below
Formal model of the two-army problem
• processes A and B communicate by sending and receiving messages on a bidirectional channel; A sends a message to B, then B sends a message to A, and so on
• A and B can each execute one of two actions
• neither process can fail, but the channel can lose messages
• the desired outcome is that both processes take the same action, and neither takes both actions
proof by contradiction: let P be a protocol that solves the problem using the fewest rounds, with the last message sent by A being m
• observe that the action taken by A cannot depend on m, since A can never learn whether m was received
• the action taken by B cannot depend on m, because B must take the same choice of action as A even if m is lost
• since the actions of both A and B do not depend on m, m can be discarded
• so m is not the last message, and P does not use the fewest rounds: a contradiction
Formal models for message passing algorithms
• processes and channels: channels can be unidirectional or bidirectional
• topology represented by an undirected graph G(V, E)
[Figure: example topology graph over processes P0…P4]
Formal models for message passing algorithms
• the system has n processes, p0 to pn-1, where i is the index of the process
• the algorithm run by each pi is modeled as a process automaton, a formal description of a sequential algorithm, and is associated with a node in the topology
Formal models for message passing algorithms
• a process automaton is a description of the process state machine
• it consists of a 5-tuple: {message alphabet, process states, initial states, message generation function, state transition function} (a minimal interface sketch follows below)
• message_alphabet: the content of messages exchanged
• process_states: the finite set of states that a process can be in
• initial_state: the start state of a process
• message_gen_function: given the current process state, how the next message is generated
• state_trans_function: on receipt of messages, and based on the current state, the next state to which the process should transit
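A minimal sketch of this 5-tuple as a Python interface (illustrative; the class and method names are assumptions). The message alphabet is implicit in the values the two functions produce and consume; concrete algorithms such as LCR below would subclass it.

# Minimal sketch of the process automaton 5-tuple as a Python interface (illustrative).
class ProcessAutomaton:
    def __init__(self, initial_state):
        self.state = initial_state            # one of process_states, starting in initial_state

    def generate_messages(self):
        """Message generation function: from the current state, produce the
        messages (drawn from the message alphabet) to place on outgoing channels."""
        raise NotImplementedError

    def transition(self, received):
        """State transition function: given the received messages and the
        current state, move to the next state."""
        raise NotImplementedError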
Description of system state
• a configuration is a vector C = (q0, …, qn-1), where qi is a state of pi
• in message passing systems two kinds of events can take place: a computation event of process pi (application of the so-called state transition function), and a delivery event, the delivery of a message m from process pi to process pj, consisting of a message sending event and a corresponding receiving event
• each message is uniquely identified by its sender process, a sequence number and possibly a local clock value
• the behaviour of the system over time is modeled as an execution, which is a sequence of configurations alternating with events
Formal models for message passing algorithms
• all possible executions of a distributed abstraction must satisfy two conditions: safety and liveness
Formal models for message passing algorithms
• safety: 'nothing bad has/can happen (yet)'
• e.g., 'every step by a process pi immediately follows a step by process p0', or 'no process should receive a message unless the message was indeed sent'
• a safety property, once violated at some time t, can never be satisfied thereafter; doing nothing will also ensure safety!
Formal models for message passing algorithms
• liveness: 'eventually something good happens'
• a condition that must hold a number of times (possibly infinitely often), e.g., 'eventually p1 terminates' => p1's termination happens once; or, liveness for a perfect link requires that if a correct process (one which is alive and well behaved) sends a message to a correct destination process, then the destination process should eventually deliver the message
• liveness is a property such that, for any time t, there is some hope that the property can be satisfied at some time t' ≥ t
Asynchronous systems
• there is no fixed upper bound on message delivery time, or on the time elapsed between consecutive steps of a process
• the notions of ordering of events (local computation, message send or message receive) are based on logical clocks
• an execution of an asynchronous message passing system is a finite or infinite sequence of the form C0, φ1, C1, φ2, C2, …, where Ck is a configuration of process states, C0 is an initial configuration and φk is an event that captures message send, computation and message receive events
• a schedule σ is the sequence of events in the execution, e.g., φ1, φ2, …; if the local processes are deterministic, then the execution is uniquely defined by (C0, σ) (sketched below)
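A minimal illustrative sketch of this definition: treat each event as a function from one configuration to the next, so an execution is just a fold of the schedule over C0. The representation and names are assumptions.

# Minimal sketch (illustrative): an execution C0, phi1, C1, phi2, C2, ... obtained by
# folding a schedule of events over an initial configuration. With deterministic
# processes the execution is fully determined by (C0, schedule).
def execution(initial_config, schedule):
    configs = [initial_config]
    for event in schedule:
        configs.append(event(configs[-1]))          # C_k = event_k(C_{k-1})
    out = [configs[0]]
    for event, config in zip(schedule, configs[1:]):
        out.extend([event, config])                 # interleave: C0, phi1, C1, phi2, C2, ...
    return out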
Synchronous systems
• there is a known upper bound on message transmission and processing delays
• processes execute in lock step; the execution is partitioned into 'rounds': C0, φ1, C1, φ2, C2, …, where each φk comprises the events of round k (a lock-step simulation sketch follows below)
• very convenient for designing algorithms, but not very practical
• leads to some useful possibilities, e.g., timed failure detection (every process crash can be detected by all correct processes) and a lease abstraction
• in a synchronous system with no failures, only C0 matters for a given algorithm, but in an asynchronous system there can be many executions for a given algorithm
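A minimal sketch of this lock-step model (illustrative; it reuses the ProcessAutomaton-style interface sketched earlier): in each round every process first produces its messages, then all messages are delivered and every process takes exactly one transition.

# Minimal sketch of lock-step synchronous rounds (illustrative).
def run_rounds(processes, topology, num_rounds):
    # processes: dict pid -> automaton exposing generate_messages() and transition()
    # topology:  dict pid -> list of neighbour pids
    for _ in range(num_rounds):
        outbox = {pid: proc.generate_messages() for pid, proc in processes.items()}
        inbox = {pid: [] for pid in processes}
        for pid, msgs in outbox.items():
            for neighbour in topology[pid]:
                inbox[neighbour].extend(msgs)       # all messages delivered within the same round
        for pid, proc in processes.items():
            proc.transition(inbox[pid])             # one lock-step state transition per round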
[Figure: synchronous message passing between processes P, Q and R: send()/recv() and a state transition from the current state to a new state in each of rounds 1, 2, 3, every round bounded by an upper bound on time]
Properties of algorithms
• validity and agreement: specific to the objective of the algorithm
• termination: an algorithm has terminated when all processes have terminated and there are no messages in transit
• an execution can still be infinite, but once terminated, a process stays in that state taking 'dummy' steps
• complexity: message complexity (the maximum number of messages sent over all possible executions) and time complexity (equal to the maximum number of rounds if synchronous; in the asynchronous case this is less straightforward)
Properties of algorithms
• interaction algorithms are possible for each process failure model:
• fail-stop – processes can fail by crashing, but the crashes can be reliably detected by all other processes
• fail-silent – where process crashes can never be reliably detected
• fail-noisy – processes can fail by crashing, and the crashes can be detected, but not always in a reliable manner
• fail-recovery – where processes can crash and later recover and still participate in the algorithm
• Byzantine – processes deviate from the intended behaviour in an unpredictable manner
• solutions do not exist for all models in all interaction abstractions
Coordination and Agreement
Under this broad topic we will discuss:
• Leader election
• Consensus
• Distributed mutual exclusion
• common or uniform decisions by participating processes, in response to various internal and external stimuli, are often required in the presence of failures and under varying synchrony assumptions
Leader election (LE)
• a process that is correct and acts as the coordinator in some steps of a distributed algorithm is a leader; e.g., the commit manager in a distributed database, the central server in distributed mutual exclusion
• the LE abstraction can be straightforwardly implemented using a perfect failure detector (that is, in a synchronous setting)
• hierarchical LE (sketched below): assumes the existence of a ranking order agreed among processes a priori, s.t. a function O associates, with every process, those that precede it in the ranking, i.e., O(p1) = ∅, p1 is leader by default; O(p2) = {p1}, if p1 dies p2 becomes leader; O(p3) = {p1, p2}; etc.
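A minimal sketch of this hierarchical rule (illustrative; it assumes the suspicions come from a perfect failure detector, such as the timeout sketch earlier behaving accurately): the leader is simply the highest-ranked process that is not currently suspected.

# Minimal sketch of hierarchical leader election (illustrative).
def elect_leader(ranking, suspected):
    # ranking:   list of process ids ordered by priority, e.g. ["p1", "p2", "p3"]
    # suspected: set of process ids the (perfect) failure detector reports as crashed
    for pid in ranking:
        if pid not in suspected:
            return pid            # p1 by default; p2 if p1 has crashed; and so on
    return None                   # all processes crashed

For example, elect_leader(["p1", "p2", "p3"], {"p1"}) returns "p2", matching O(p2) = {p1} above.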
Leader election (LE)
LCR algorithm (LeLann-Chang-Roberts): a simple ring-based algorithm
• assumptions: n processes, each with a hard-coded uid, in a logical ring topology; unidirectional message passing from pi to p(i+1) mod n; processes are not aware of the ring size; asynchronous; no process failures; no message loss
• the leader is defined to be the process with the highest uid
Leader election (LE)
algorithm in prose:
• each process forwards its uid to its neighbour
• if received uid < own uid, then discard; else if received uid > own uid, forward the received uid to the neighbour; else if received uid = own uid, then declare self as leader
[Figure: unidirectional ring of processes with uids uid1, uid2, …, uidn]
Leader election (LE)
• process automaton:
message_alphabet: the set U of uids
statei for each pi is defined by three state variables:
  u ∈ U, initially uidi
  send ∈ U ∪ {null}, initially uidi
  status ∈ {leader, unknown}, initially unknown
msgi: place the value of send on the output channel
transi: { send = null;
  receive v ∈ U on the input channel;
  if v = null or v < u then exit;
  if v > u then send = v;
  if v = u then status = leader; }
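The sketch below (illustrative, not from the slides) simulates this automaton on a ring in lock-step rounds; the list-based representation and variable names are assumptions.

# Minimal sketch of the LCR ring algorithm (illustrative): each process forwards the
# largest uid seen so far around a unidirectional ring; the process that receives its
# own uid declares itself leader.
def lcr(uids):
    n = len(uids)
    u = list(uids)                  # u[i]: own uid of process i
    send = list(uids)               # send[i]: value process i forwards this round
    status = ["unknown"] * n
    for _ in range(n):              # at most n rounds until the leader is discovered
        received = [send[(i - 1) % n] for i in range(n)]   # message from ring predecessor
        for i in range(n):
            v = received[i]
            if v is None or v < u[i]:
                send[i] = None                     # swallow smaller uids
            elif v > u[i]:
                send[i] = v                        # forward larger uids
            else:                                  # v == u[i]: own uid came all the way round
                status[i] = "leader"
                send[i] = None
    return [uids[i] for i in range(n) if status[i] == "leader"]

# e.g. lcr([3, 7, 2, 5]) returns [7]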
Leader election (LE)
• expected properties:
• validity – if a process decides, then the decided value is the largest uid of a process
• termination – every correct process eventually decides
• agreement – no two correct processes decide differently
• message complexity: O(n²)
• time complexity: if synchronous, then n rounds until the leader is discovered and 2n rounds until termination
• other possible scenarios: synchronous and processes aware of the ring size n (useful if processes fail), bidirectional ring (for a more efficient version of the algorithm)
Leader election (LE)
• an O(n log n) message complexity algorithm (Hirschberg-Sinclair)
• assumptions: bidirectional ring, where for every i, 0 ≤ i ≤ n-1, pi has a channel to the left to p(i+1) mod n and a channel to the right to p(i-1) mod n; n processes, each with a hard-coded uid, in a logical ring topology; processes are not aware of the ring size; asynchronous; no process failures; no message loss
[Figure: bidirectional ring of processes with uids uid1, uid2, …, uidk]
Leader election (LE)
algorithm in prose:
• as before, a process sends its identifier around the ring, and the message of the process with the highest identifier traverses the whole ring and returns
• define the k-neighbourhood of a process pi to be the set of processes at distance at most k from pi in either direction, left and right
• the algorithm operates in phases, starting from 0
• in the kth phase a process tries to become a winner for that phase, where it must have the largest uid in its 2^k-neighbourhood
• only processes that are winners in the kth phase can go on to the (k+1)th phase
• to start with, in phase 0, each process attempts to become a phase 0 winner and sends probe messages to its left and right neighbours
• if the identifier of the neighbour receiving the probe is higher, then it swallows the probe; otherwise it sends back a reply message if it is at the edge of the neighbourhood, or else forwards the probe to the next in line
• a process that receives replies from both its neighbours is a winner in phase 0
• similarly, in a 2^k-neighbourhood the kth-phase winner will receive replies from the farthest two processes in either direction
• a process which receives its own probe message declares itself the winner
Leader election (LE)
pseudo code for pi:
initially phase = 0 and hop_count = 1;
send <probe, uidi, phase, hop_count> to left and to right;
upon receiving <probe, j, k, d> from left (or right) {
  if j = uidi then terminate as leader;
  if j > uidi and d < 2^k then send <probe, j, k, d+1> to right (or left);  // forward msg and increase hop count
  if j > uidi and d ≥ 2^k then send <reply, j, k> to left (or right); }     // if reached edge, do not forward but reply; if j < uidi, msg is swallowed
upon receiving <reply, j, k> from left (or right) {
  if j ≠ uidi then send <reply, j, k> to right (or left)                    // forward
  else                                                                      // reply is for own probe
    if already received <reply, j, k> from right (or left) then
      send <probe, uidi, k+1, 1> to left and to right; }                    // phase k winner
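A minimal sketch (illustrative, not the slides' code) that simulates the above pseudo code on a bidirectional ring using a single FIFO of in-flight messages; the message tuples and helper names are assumptions.

from collections import deque

# Minimal sketch of the Hirschberg-Sinclair election (illustrative), assuming unique uids.
def hirschberg_sinclair(uids):
    n = len(uids)
    replies = [set() for _ in range(n)]            # directions ('L'/'R') replied to in the current phase
    queue = deque()

    def send(src, direction, msg):
        # direction 'L' means towards index src-1, 'R' towards index src+1 (mod n)
        dst = (src - 1) % n if direction == 'L' else (src + 1) % n
        queue.append((dst, direction, msg))

    for i in range(n):                             # phase 0: probe both neighbours at distance 1
        send(i, 'L', ('probe', uids[i], 0, 1))
        send(i, 'R', ('probe', uids[i], 0, 1))

    while queue:
        i, direction, msg = queue.popleft()
        came_from = 'R' if direction == 'L' else 'L'        # side of pi the message arrived on
        if msg[0] == 'probe':
            _, j, k, d = msg
            if j == uids[i]:
                return uids[i]                     # own probe travelled the whole ring: leader
            if j < uids[i]:
                continue                           # swallow the probe
            if d < 2 ** k:
                send(i, direction, ('probe', j, k, d + 1))  # forward and increase hop count
            else:
                send(i, came_from, ('reply', j, k))         # edge of the 2^k neighbourhood: reply
        else:
            _, j, k = msg
            if j != uids[i]:
                send(i, direction, ('reply', j, k))         # relay the reply back towards its owner
            else:
                replies[i].add(came_from)
                if replies[i] == {'L', 'R'}:       # replies from both directions: phase-k winner
                    replies[i] = set()
                    send(i, 'L', ('probe', uids[i], k + 1, 1))
                    send(i, 'R', ('probe', uids[i], k + 1, 1))

# e.g. hirschberg_sinclair([3, 7, 2, 5]) returns 7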
Leader election (LE)
• other possible scenarios:
• synchronous, with alternative 'swallowing' rules (e.g., anything higher than the minimum uid seen so far) and with tweaking of uid usage
• this leads to a synchronous leader election algorithm whose message complexity is at most 4n
DME (Distributed Mutual Exclusion)
• shared-memory mutual exclusion is a well-known topic in operating systems, arising when concurrent threads need to access a shared variable or object for read/write purposes
• the shared resource is made a critical section, with access to it controlled by atomic lock or semaphore operations
• the lock or semaphore variable is seen by all threads consistently
• asynchronous shared memory is an alternative possibility: say, P1, P2 and P3 share M1, and P2 and P3 share M2
DME
• in a distributed system there is no shared lock variable to look at
• processes have to agree, by message passing, on the process eligible to access the shared resource at any given time
• assumptions: a system of n processes, pi, i = 1..n; a process wishing to access an external shared resource must obtain permission to enter the critical section (CS); asynchronous; processes do not fail; messages are reliably delivered
correctness properties
• ME1 safety: at most one process may execute in the CS at any given time
• ME2 liveness: requests to enter and exit the CS eventually succeed
• ME3 ordering: if one request to enter the CS 'happened-before' another, then entry to the CS is granted in that order
• ME2 ensures freedom from both starvation and deadlock
DME
• several algorithms exist: central server version, ring-based, Ricart-Agrawala
[Figure: central server version: the server keeps a queue of requests; a process 1. requests the token and 2. releases the token; the server 3. grants the token to the next queued process]
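A minimal sketch of the server side of the central server version (illustrative; the send callback and message names are assumptions): the server holds the token, grants it to one requester at a time, and queues the remaining requests in FIFO order.

from collections import deque

# Minimal sketch of the central-server mutual exclusion algorithm, server side (illustrative).
class CentralServer:
    def __init__(self, send):
        self.send = send                 # assumed message-sending primitive: send(pid, msg)
        self.holder = None               # process currently holding the token (in the CS)
        self.queue = deque()             # waiting requests, served in FIFO order

    def on_request(self, pid):           # "1. request token"
        if self.holder is None:
            self.holder = pid
            self.send(pid, "GRANT")      # "3. grant token": pid may enter the CS
        else:
            self.queue.append(pid)       # someone is in the CS: queue the request

    def on_release(self, pid):           # "2. release token": pid leaves the CS
        self.holder = None
        if self.queue:
            nxt = self.queue.popleft()
            self.holder = nxt
            self.send(nxt, "GRANT")      # grant to the longest-waiting process

Under the stated assumptions (no failures, reliable delivery) this keeps at most one token holder (ME1) and eventually serves every request (ME2); note that the FIFO queue orders requests by arrival at the server rather than by 'happened-before'.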