Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms • Authors: Ozalp Babaoglu and Keith Marzullo • Distributed Systems: 526 U1580 • Professor: Ching-Chi Hsu
Introduction • Many problems in distributed computing can be cast as executing some notification or reaction when the state of the system satisfies a particular condition • Global Predicate Evaluation (GPE): establishing the truth of a Boolean expression whose variables may refer to the global system state • A global state constructed from individually collected local states may not be consistent • Asynchronous system: • no bounds on the relative speeds of processes or on message delays • impossible to maintain synchronized local clocks • communication remains the only possible mechanism for synchronization • channels are reliable but may deliver messages out of order
Outline • Two classes of solutions to the GPE problem: • A reactive architecture: each process, when executing an event, notifies the monitor p0 by sending it a message describing the event • A snapshot architecture: the monitor p0 sends each process a 'state enquiry' message and assembles the replies into a global state
Definitions (1) • distributed system: a collection of sequential processes p1, p2, ..., pn networked by unidirectional communication channels • events: the activity of each sequential process, which can be internal events or communication events send(m) or receive(m) with another process • local history of process p_i: h_i = e_i^1 e_i^2 ... • global history: H = h_1 ∪ h_2 ∪ ... ∪ h_n • cause-effect relation '→': • If e_i^k, e_i^l ∈ h_i and k < l, then e_i^k → e_i^l • If e_i = send(m) and e_j = receive(m), then e_i → e_j • If e → e' and e' → e'', then e → e'' • Concurrent e || e': neither e → e' nor e' → e
Definitions (2) • distributed computation: a partially ordered set defined by the pair (H, →) • space-time diagram: a graphical representation of a distributed computation • [Figure: space-time diagram of three processes; p1 executes e11 to e16, p2 executes e21 to e23, p3 executes e31 to e36, where eik labels the k-th event of process pi]
Definitions (3) • the local state of p_i immediately after executing event e_i^k is denoted by σ_i^k • global state: Σ = (σ_1, ..., σ_n) • a cut C = (c_1, ..., c_n) is a subset of the global history H that contains an initial prefix of each of the local histories, i.e. C = h_1^{c_1} ∪ ... ∪ h_n^{c_n} • a run R is a total ordering of all events in H that is consistent with each local history • Example: pp6 • Note that a single distributed computation may have many runs
Example • Inconsistent cut and phantom deadlock • [Figure: space-time diagram of p1, p2 and p3 exchanging req and resp messages, with two cuts C and C' drawn across the processes]
Consistency • A consistent cut C is such that • ∀ e, e': (e ∈ C) ∧ (e' → e) ⇒ e' ∈ C • A consistent global state is one corresponding to a consistent cut • A consistent run R is such that • ∀ e, e': (e → e') ⇒ e appears before e' in R • Example: pp6 • If the run is consistent, then all the global states in the sequence will be consistent as well
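The two consistency definitions above can be checked mechanically on small examples. The following is a minimal sketch, assuming events are plain strings and messages are given as (send event, receive event) pairs; the function names are illustrative and not part of the original slides.

```python
from itertools import product

def happened_before(histories, messages):
    """Return the set of pairs (e, e2) with e -> e2.

    histories: dict mapping a process name to its events in local order.
    messages:  list of (send_event, receive_event) pairs.
    """
    closure = set(messages)
    for events in histories.values():
        closure.update(zip(events, events[1:]))
    # Naive transitive closure; adequate for small illustrative examples.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

def is_consistent_cut(cut, hb):
    """A cut is consistent iff it contains every causal predecessor of its events."""
    return all(a in cut for (a, b) in hb if b in cut)

def is_consistent_run(run, hb):
    """A run is consistent iff e -> e2 implies e appears before e2."""
    pos = {e: i for i, e in enumerate(run)}
    return all(pos[a] < pos[b] for (a, b) in hb)

# Hypothetical two-process computation: e11 sends a message received as e22.
hb = happened_before({"p1": ["e11", "e12"], "p2": ["e21", "e22"]},
                     [("e11", "e22")])
```

Because the local order of each process is part of the relation, closure under causal predecessors also guarantees that the cut is a prefix of every local history.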
Observing Distributed Computations • A monitor p0 assumes a passive role in that it does not send any messages of its own • The application processes notify p0 by sending it a message whenever they execute an event • The monitor p0 constructs an observation of the underlying distributed computation from the order in which the notifications arrive • Due to the variability of message delays, an observation can correspond to a consistent run, an inconsistent run or no run at all • O1 = e21 e11 e31 e32 e34 e12 e22 e33 e13 e14 e35 ... ⇒ not a run (e34 appears before e33) • O2 = e11 e31 e21 e32 e12 e33 e34 e13 e22 e35 e36 ... ⇒ inconsistent run • O3 = e31 e21 e11 e12 e32 e33 e13 e34 e14 e22 e15 ... ⇒ consistent run • Message order can be restored by defining a delivery rule that decides when received messages are presented to the application process
FIFO delivery • First-In-First-Out (FIFO) delivery • for all messages m and m' from p_i to p_j: • send_i(m) → send_i(m') ⇒ deliver_j(m) → deliver_j(m') • FIFO can be implemented by adding sequence numbers to messages • While FIFO delivery is sufficient to guarantee that observations correspond to runs, it is not sufficient to guarantee consistent observations
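The sequence-number idea mentioned above can be sketched as follows. This is an illustrative receiver only, assuming each sender numbers its messages 0, 1, 2, ... per destination; the class name `FifoReceiver` is hypothetical.

```python
class FifoReceiver:
    """Buffers out-of-order messages and delivers them in per-sender send order."""

    def __init__(self):
        self.next_seq = {}  # sender -> next sequence number expected
        self.pending = {}   # sender -> {sequence number: payload}

    def receive(self, sender, seq, payload):
        """Accept a message from the network; return the payloads deliverable now."""
        self.pending.setdefault(sender, {})[seq] = payload
        expected = self.next_seq.get(sender, 0)
        delivered = []
        # Deliver the longest gap-free run starting at the expected number.
        while expected in self.pending[sender]:
            delivered.append(self.pending[sender].pop(expected))
            expected += 1
        self.next_seq[sender] = expected
        return delivered
```

Receiving message 1 before message 0 from the same sender returns nothing until message 0 arrives, at which point both are delivered in order.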
Observing Distributed Computations with Real-Time Clocks • Environment: • message delays are bounded by δ • channels are FIFO • existence of a global real-time clock • each message includes RC(e), the value of the global real-time clock when event e occurs, as its timestamp • DR1: • At time t, deliver all received messages with timestamps up to t − δ in increasing timestamp order • The observation is consistent iff the following is satisfied • Clock condition: e → e' ⇒ RC(e) < RC(e')
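Delivery rule DR1 can be sketched with a priority queue. This is a minimal illustration, assuming the caller supplies the current global clock reading `now` and the delay bound `delta`; the class and method names are hypothetical.

```python
import heapq

class RealTimeMonitor:
    """Sketch of DR1: hold messages until their timestamp is at most now - delta."""

    def __init__(self, delta):
        self.delta = delta  # assumed upper bound on message delay
        self.buffer = []    # min-heap of (timestamp, event)

    def receive(self, timestamp, event):
        heapq.heappush(self.buffer, (timestamp, event))

    def deliver(self, now):
        """Return buffered events with timestamp <= now - delta, oldest first."""
        out = []
        while self.buffer and self.buffer[0][0] <= now - self.delta:
            out.append(heapq.heappop(self.buffer)[1])
        return out
```

Waiting δ time units past a timestamp ensures that no message with a smaller timestamp can still be in transit, which is what makes delivery in timestamp order safe.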
Observing Distributed Computations with Logical Clocks • Environment: • channels are FIFO • asynchronous communication • implementation of logical clocks • each message includes LC(e), the value of the logical clock when event e occurs, as its timestamp • DR2: • Deliver all messages that are stable at p0 in increasing timestamp order • Note: a message m is stable at p if no future message with a timestamp smaller than TS(m) can be received by p • Given FIFO channels, m is stable at p0 when p0 has received at least one message with timestamp > TS(m) from all other processes
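The stability test behind DR2 can be sketched as follows. This is an illustration under the slide's assumptions (FIFO channels, increasing logical timestamps per process); ties between equal timestamps are broken by process name, which is a choice made here and not taken from the slides.

```python
import heapq

class LogicalClockMonitor:
    """Sketch of DR2: deliver stable messages in increasing timestamp order."""

    def __init__(self, processes):
        self.latest = {p: 0 for p in processes}  # largest timestamp seen per process
        self.buffer = []                          # min-heap of (ts, process, event)

    def receive(self, process, ts, event):
        self.latest[process] = max(self.latest[process], ts)
        heapq.heappush(self.buffer, (ts, process, event))
        return self._deliver_stable()

    def _is_stable(self, ts, sender):
        # Stable once every other process has sent something with a larger timestamp.
        return all(seen > ts for p, seen in self.latest.items() if p != sender)

    def _deliver_stable(self):
        out = []
        while self.buffer and self._is_stable(self.buffer[0][0], self.buffer[0][1]):
            out.append(heapq.heappop(self.buffer)[2])
        return out
```

Note that a process that stops sending notifications blocks delivery indefinitely, since no message can then become stable with respect to it.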
Logical Clocks • Logical Clock • each process p_i maintains a local variable LC_i • when a new event e_i occurs, p_i sets LC_i to • LC_i + 1 if e_i is an internal or send event • max{LC_i, TS(m)} + 1 if e_i = receive(m) • [Figure: space-time diagram annotated with logical clock values; p1: 1 2 4 5 6 7; p2: 1 5 6; p3: 1 2 3 4 5 7]
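The update rule above fits in a few lines. This is a minimal sketch; the class name `LamportClock` and its method names are illustrative.

```python
class LamportClock:
    """Logical clock following the two update rules on the slide."""

    def __init__(self):
        self.value = 0

    def tick(self):
        """Internal or send event: LC = LC + 1. The result timestamps a send."""
        self.value += 1
        return self.value

    def receive(self, ts):
        """Receive event: LC = max(LC, TS(m)) + 1."""
        self.value = max(self.value, ts) + 1
        return self.value

# Two hypothetical processes exchanging one message.
p1, p2 = LamportClock(), LamportClock()
p1.tick()               # internal event at p1
ts = p1.tick()          # send event at p1, message carries ts
after = p2.receive(ts)  # receive event at p2
```

The receive rule guarantees that the receive event is timestamped later than the matching send, which is what establishes the clock condition e → e' ⇒ LC(e) < LC(e').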
Observing Distributed Computations with Causal Delivery • Causal Delivery (CD): • send_i(m) → send_j(m') ⇒ deliver_k(m) → deliver_k(m') • If p0 uses a delivery rule satisfying CD, then all of its observations will be consistent
Efficient Delivery • For implementing causal delivery, what is really needed is an effective procedure for deciding: • given events e, e' that are causally related and their clock values, does there exist some other event e'' such that e → e'' → e' • Given RC(e) < RC(e') (or LC(e) < LC(e')), it may be that • e → e' or e || e', i.e. ¬(e' → e) • The above observations suggest a timing mechanism TC whereby causal precedence relations between events can be deduced from their timestamps • Strong Clock Condition: • e → e' ≡ TC(e) < TC(e')
Causal History (1) • Causal history of event e • θ(e) = { e' ∈ H | e' → e } ∪ {e} • That is, θ(e) is the smallest consistent cut that includes e • [Figure: the space-time diagram of p1, p2 and p3 with the causal history of event e14 highlighted]
Causal Histories (2) • Maintaining Causal History • Each process p_i initializes a local variable θ_i to ∅ • Each message m contains a timestamp TS(m), which is the causal history of its send event • Scheme • If e_i is an internal or send event, • then θ_i = {e_i} ∪ the causal history of the previous local event • If e_i is the receive of message m by process p_i from p_j, • then θ_i = {e_i} ∪ the causal history of the previous local event of p_i ∪ the causal history of the corresponding send event at p_j • The strong clock condition is satisfied if clock comparison is interpreted as set inclusion • e → e' ≡ θ(e) ⊂ θ(e'), or equivalently e → e' ≡ e ∈ θ(e') if e ≠ e' • Problem: the causal histories grow rapidly
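The scheme above can be sketched directly with sets. In this illustration an event is represented as a (process name, local index) pair, which is an assumption made here for convenience; the class and function names are hypothetical.

```python
class CausalHistoryProcess:
    """Maintains theta_i, the causal history of the latest local event."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.theta = frozenset()  # initially empty

    def _new_event(self, inherited=frozenset()):
        self.count += 1
        event = (self.name, self.count)
        self.theta = self.theta | inherited | {event}

    def local_or_send(self):
        """Internal or send event; the returned set is the timestamp of a send."""
        self._new_event()
        return self.theta

    def receive(self, ts):
        """Receive event: also merge the causal history of the matching send."""
        self._new_event(ts)
        return self.theta

def precedes(theta_e, theta_f):
    """e -> f iff theta(e) is a proper subset of theta(f)."""
    return theta_e < theta_f

p1, p2 = CausalHistoryProcess("p1"), CausalHistoryProcess("p2")
a = p1.local_or_send()  # send at p1
b = p2.local_or_send()  # unrelated event at p2
c = p2.receive(a)       # p2 receives the message sent at a
```

Each timestamp carries every event in the sender's past, which makes the growth problem noted on the slide easy to see.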
Vector Clocks • The causal history of an event can be represented as a fixed-dimensional vector VC(e)[1..n] rather than a set, where • VC(e)[i] = k iff θ_i(e) = h_i^k for i = 1, 2, ..., n, with θ_i(e) the projection of θ(e) onto process p_i and h_i^k the first k events of p_i • [Figure: space-time diagram annotated with vector clocks; p1: (1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3); p2: (0,1,0) (1,2,4) (4,3,4); p3: (0,0,1) (1,0,2) (1,0,3) (1,0,4) (1,0,5) (1,0,6)]
Maintaining Vector Clocks • Maintaining Vector Clocks • Each process p_i maintains a local vector VC_i[1..n] • Each message m contains a timestamp TS(m), which is the vector clock value VC(e) of its send event e • Scheme • if e_i is an internal or send event • VC_i[i] = VC_i[i] + 1, and VC(e_i) = VC_i • if e_i = receive(m) • VC_i = max{VC_i, TS(m)} (component-wise) • VC_i[i] = VC_i[i] + 1 • VC(e_i)[j] = number of events of p_j that causally precede event e_i of p_i • V < V' ≡ (V ≠ V') ∧ (∀k: 1 ≤ k ≤ n: V[k] ≤ V'[k])
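The update scheme and the vector comparison above can be sketched as follows. Process indices are 0-based here, unlike the 1-based indices on the slides, and the names are illustrative.

```python
class VectorClockProcess:
    """Process p_i maintaining a vector clock of dimension n (0-based index i)."""

    def __init__(self, index, n):
        self.i = index
        self.vc = [0] * n

    def local_or_send(self):
        """Internal or send event: increment the own component."""
        self.vc[self.i] += 1
        return tuple(self.vc)  # also serves as TS(m) for a send

    def receive(self, ts):
        """Receive event: component-wise maximum, then increment own component."""
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        self.vc[self.i] += 1
        return tuple(self.vc)

def vc_less(v, w):
    """V < W iff V != W and V[k] <= W[k] for every k."""
    return v != w and all(a <= b for a, b in zip(v, w))

p1, p2 = VectorClockProcess(0, 3), VectorClockProcess(1, 3)
e11 = p1.local_or_send()
e21 = p2.local_or_send()  # p2 sends a message to p1
e12 = p1.receive(e21)     # p1 receives it
```

Unlike a scalar logical clock, comparing two of these vectors distinguishes causal precedence from concurrency.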
Properties of Vector Clocks • Strong Clock Condition: e → e' ≡ VC(e) < VC(e') • Simple Strong Clock Condition: for i ≠ j, e_i → e_j ≡ VC(e_i)[i] ≤ VC(e_j)[i] • Concurrent: e_i || e_j ≡ (VC(e_i)[i] > VC(e_j)[i]) ∧ (VC(e_j)[j] > VC(e_i)[j]) • Pairwise Inconsistent: for i ≠ j, e_i and e_j are pairwise inconsistent ≡ (VC(e_i)[i] < VC(e_j)[i]) ∨ (VC(e_j)[j] < VC(e_i)[j]) • Consistent Cut: (c_1, c_2, ..., c_n) is consistent iff • ∀ i, j: 1 ≤ i, j ≤ n: VC(e_i^{c_i})[i] ≥ VC(e_j^{c_j})[i] • Counting: the number of events that causally precede e is given by #(e) • #(e) = (Σ_{j=1..n} VC(e)[j]) − 1 • Weak Gap-Detection: given e_i and e_j, • if VC(e_i)[k] < VC(e_j)[k] for some k ≠ j, • then there exists an event e_k such that ¬(e_k → e_i) ∧ (e_k → e_j)
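Three of the properties above translate directly into small predicates. This sketch uses 0-based process indices and represents a cut by its frontier, the vector clock of the last event of each process included in the cut; both conventions are assumptions made here, and the sample vectors are illustrative values.

```python
def concurrent(vc_ei, i, vc_ej, j):
    """e_i || e_j for events of distinct processes i and j."""
    return vc_ei[i] > vc_ej[i] and vc_ej[j] > vc_ei[j]

def is_consistent_cut(frontier):
    """frontier[i] is the vector clock of the last event of process i in the cut
    (the zero vector if the cut contains no event of process i)."""
    n = len(frontier)
    return all(frontier[i][i] >= frontier[j][i]
               for i in range(n) for j in range(n))

def events_before(vc):
    """#(e): number of events that causally precede e."""
    return sum(vc) - 1

# Illustrative vector clocks for events of three processes.
e11, e12 = (1, 0, 0), (2, 1, 0)
e21 = (0, 1, 0)
e31, e32 = (0, 0, 1), (1, 0, 2)
zero = (0, 0, 0)
```

The cut test says that no process in the cut knows of more events of process i than process i itself has contributed to the cut.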
Implementing Causal Delivery with Vector Clocks • Babaoglu & Marzullo • monitor p0 maintains an array D[1..n] where D[i] contains TS(m_i)[i], with m_i the last message delivered from process p_i • DR3: • Deliver message m from process p_j when both of the following are satisfied • D[j] = TS(m)[j] − 1 ⇒ guarantees FIFO • D[k] ≥ TS(m)[k], ∀k ≠ j ⇒ guarantees the causal relation • DR4: • Monitor p0 maintains a counter D • Deliver message m of event e_i as soon as • D = #(e_i) − 1
Causal Delivery with Vector Clocks: Example • [Figure: monitor p0 starts with D = [0,0] and delivers messages timestamped (1,0), (1,1), (1,2), (2,2), (3,2); p1 starts at (0,0) and executes events (1,0), (2,2), (3,2); p2 starts at (0,0) and executes events (1,1), (1,2)]
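Delivery rule DR3 can be sketched as a monitor that buffers notifications until both conditions hold. This is an illustration only, assuming every event is reported to p0 with its vector timestamp and using 0-based process indices; the class name `CausalMonitor` is hypothetical.

```python
class CausalMonitor:
    """Sketch of DR3 at the monitor p0."""

    def __init__(self, n):
        self.D = [0] * n   # D[i] = own component of the last message delivered from p_i
        self.pending = []  # received but not yet deliverable: (sender, ts, event)

    def _deliverable(self, j, ts):
        fifo = self.D[j] == ts[j] - 1
        causal = all(self.D[k] >= ts[k] for k in range(len(ts)) if k != j)
        return fifo and causal

    def receive(self, sender, ts, event):
        """Buffer the message, then deliver everything that has become deliverable."""
        self.pending.append((sender, ts, event))
        delivered = []
        progress = True
        while progress:
            progress = False
            for msg in list(self.pending):
                j, ts_m, ev = msg
                if self._deliverable(j, ts_m):
                    self.pending.remove(msg)
                    self.D[j] = ts_m[j]
                    delivered.append(ev)
                    progress = True
        return delivered
```

Delivering one message can unblock others, which is why the pending list is rescanned until no further progress is made.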
Distributed Snapshots • In this strategy, p0 requests the states of the other processes and then combines them into a global state • Definition: • channel state: for each channel from p_i to p_j, • χ_{i,j} = the messages sent by p_i to p_j (as recorded in σ_i) minus the messages received by p_j from p_i (as recorded in σ_j) • incoming channels of process p_i: IN_i • outgoing channels of process p_i: OUT_i • Snapshot Protocols • Chandy and Lamport [1985] • Morgan [1985]
Snapshot Protocol 1 • Assumption: • existence of a global real-time clock RC • each message carries a timestamp • message delays are bounded • global clock algorithm • p0 sends [take snapshot at t_ss] to all processes • When clock RC reads t_ss, each process p_i does the following • records its local state σ_i, • sends an empty message over all its outgoing channels, • and starts recording all messages received over each of its incoming channels • The first time p_i receives a message from p_j with timestamp greater than or equal to t_ss, p_i stops recording messages for that channel
Snapshot Protocol 2 • Assumption: • Bounded message delays • Channels are FIFO • Chandy & Lamport • p0 sends [take snapshot] to itself • For each process p_i receiving [take snapshot] • If it is the first time • records its local state σ_i • sends [take snapshot] on each of its outgoing channels • starts recording messages from its other incoming channels (the channel on which [take snapshot] arrived is recorded as empty) • If it is not the first time • stops recording messages from that incoming channel
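The marker rules above can be exercised in a small single-threaded simulation. This is a sketch under assumptions that are not in the slides: local state is a single integer balance, application messages are integer transfers, and the caller decides when each channel delivers its next message. All names are hypothetical.

```python
from collections import deque

MARKER = "take snapshot"

class Network:
    """Toy simulation of the marker-based snapshot over FIFO channels."""

    def __init__(self, names, balance):
        self.channels = {(a, b): deque() for a in names for b in names if a != b}
        self.state = {p: balance for p in names}  # current local states
        self.recorded_state = {}  # process -> local state recorded for the snapshot
        self.recording = {}       # channel -> messages recorded so far
        self.channel_state = {}   # channel -> final recorded channel state

    def send(self, src, dst, amount):
        self.state[src] -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, p):
        self._record(p)  # the initiator behaves as if it received the marker from itself

    def _record(self, p, marker_from=None):
        self.recorded_state[p] = self.state[p]
        for (src, dst) in self.channels:
            if src == p:
                self.channels[(src, dst)].append(MARKER)
            if dst == p:
                if src == marker_from:
                    self.channel_state[(src, dst)] = []  # marker arrived first: empty
                else:
                    self.recording[(src, dst)] = []

    def deliver(self, src, dst):
        """Deliver the oldest message on the FIFO channel src -> dst."""
        msg = self.channels[(src, dst)].popleft()
        if msg == MARKER:
            if dst not in self.recorded_state:
                self._record(dst, marker_from=src)
            else:
                self.channel_state[(src, dst)] = self.recording.pop((src, dst))
        else:
            self.state[dst] += msg
            if (src, dst) in self.recording:
                self.recording[(src, dst)].append(msg)
```

In such a transfer scenario the recorded local states plus the recorded channel contents add up to the total balance in the system, a convenient sanity check for a consistent snapshot.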
Chandy & Lamport (1985) • [Figure: space-time diagram with the monitor p0, process p1 (events e11 to e16 and snapshot event e1*) and process p2 (events e21 to e25 and snapshot event e2*)] • Real computation R = e21 e11 e12 e13 e22 e14 e23 e24 e15 e25 e16 • in terms of global states: Σ^{00} Σ^{01} Σ^{11} Σ^{21} Σ^{31} Σ^{32} Σ^{42} Σ^{43} Σ^{44} Σ^{54} Σ^{55} Σ^{65}
Properties of Snapshots • Definition • Σ^a: the global state in which the snapshot protocol is initiated • Σ^f: the global state in which the protocol terminates • Σ^s: the global state constructed • ei* denotes the event in which pi receives [take snapshot] for the first time, causing pi to record its state • let ti be the time at which ei* occurs • ei is a prerecording event if ei → ei*; otherwise it is a post-recording event • Properties • There exists a run R' in which Σ^a ⇝ Σ^s ⇝ Σ^f, where ⇝ denotes reachability • That is to say, Σ^s could have happened
Argumentation (1) • Chandy & Lamport (1985) • consider any adjacent (post-recording, prerecording) pair (e, e') in R • then ¬(e → e'), so the two events can be swapped • swapping all such pairs results in another consistent run R' • swap (e13, e22): r1 = e21 e11 e12 e22 e13 e14 e23 e24 e15 e25 e16 • swap (e14, e23): r2 = e21 e11 e12 e22 e13 e23 e14 e24 e15 e25 e16 • swap (e13, e23): R' = e21 e11 e12 e22 e23 e13 e14 e24 e15 e25 e16 • the global state after executing the last prerecording event (e23) in R' is Σ^s (= Σ^{23}), the constructed global state • If the computation had proceeded as in this run, Σ^s would have occurred
Argumentation (2) • Lai & Yang (1987) • Let GSN(t_i : p_i ∈ P) be a snapshot taken between τ1 and τ2 during the computation R • Let δ = τ2 − τ1 and construct R' as follows: • R' is the same as R except that every post-recording event in R is postponed by δ units of time, that is • R'(t) = R(t) if R(t) is an event at p_i and t ≤ t_i • R'(t) = R(t − δ) if R(t − δ) is an event at p_i and t − δ > t_i • R'(t) is undefined otherwise • Example
Properties of Global Predicates • Stable Predicates • Many system properties one wishes to detect have the characteristic that once they become true, they remain true • If F is a stable predicate, then since Σ^a ⇝ Σ^s ⇝ Σ^f: • (F is true in Σ^s) ⇒ (F is true in Σ^f) • (F is false in Σ^s) ⇒ (F is false in Σ^a) • Nonstable Predicates • the condition encoded by the predicate may not persist long enough for it to be true when the predicate is evaluated • if a predicate F is found to be true by the monitor, we do not know whether F ever held during the actual run
Nonstable Predicates • Two problems • The condition encoded by the predicate may not persist long enough for it to be true when the predicate is evaluated • If a predicate F is found to be true by the monitor, we do not know whether F ever held during the actual run • The predicate may have held even if it is not detected, and even if it is detected it may never have held • Extending nonstable global predicates so that they apply to the entire distributed computation rather than to a single global state: • Possibly(F): there exists a consistent observation of the computation in which F holds in some global state • Definitely(F): in every consistent observation of the computation, F holds in some global state
Detecting Possibly(F) and Definitely(F) • Σ_min(σ_i^k): the global state with the smallest level in the lattice containing σ_i^k • Σ_max(σ_i^k): the global state with the largest level in the lattice containing σ_i^k • Examples: Σ_min(σ_1^3) = Σ^{31}, Σ_max(σ_1^3) = Σ^{33} • Σ_min(σ_i^k) = (σ_1^{c_1}, σ_2^{c_2}, …, σ_n^{c_n}): ∀j: VC(σ_j^{c_j})[j] = VC(σ_i^k)[j] • Σ_max(σ_i^k) = (σ_1^{c_1}, σ_2^{c_2}, …, σ_n^{c_n}): ∀j: VC(σ_j^{c_j})[i] ≤ VC(σ_i^k)[i] and ((σ_j^{c_j} = σ_j^f) or (VC(σ_j^{c_j+1})[i] > VC(σ_i^k)[i])), where σ_j^f is the final state of p_j • The minimum level containing σ_i^k is the sum of the components of the vector timestamp VC(σ_i^k) • An algorithm for detecting Definitely(F): O(k^n), where k is the maximum number of events a monitored process has executed and n is the number of processes
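Both modalities can be evaluated by walking the lattice of consistent global states level by level. The following is a minimal sketch, assuming the monitor already holds the vector clock of every event and that the predicate is given as a function of the cut (c_1, ..., c_n), with 0-based process indices; the function names are illustrative.

```python
def consistent(cut, clocks):
    """cut[i] = number of events of process i included;
    clocks[i][k-1] = vector clock of the k-th event of process i."""
    n = len(cut)
    return all(clocks[j][cut[j] - 1][i] <= cut[i]
               for i in range(n) for j in range(n) if cut[j] > 0)

def successors(cut, clocks):
    """Consistent global states one level up in the lattice."""
    for i in range(len(cut)):
        if cut[i] < len(clocks[i]):
            nxt = cut[:i] + (cut[i] + 1,) + cut[i + 1:]
            if consistent(nxt, clocks):
                yield nxt

def possibly(pred, clocks):
    """True iff some consistent global state satisfies pred."""
    level = {tuple(0 for _ in clocks)}
    while level:
        if any(pred(c) for c in level):
            return True
        level = {s for c in level for s in successors(c, clocks)}
    return False

def definitely(pred, clocks):
    """True iff every path through the lattice meets a state satisfying pred."""
    level = {tuple(0 for _ in clocks)}
    final = tuple(len(c) for c in clocks)
    while level:
        # Keep only states reached so far without ever satisfying pred.
        level = {c for c in level if not pred(c)}
        if final in level:
            return False
        level = {s for c in level for s in successors(c, clocks)}
    return True

# Two hypothetical processes with two events each and no messages.
independent = [[(1, 0), (2, 0)], [(0, 1), (0, 2)]]
# One message: the only event of the second process receives from the first.
with_message = [[(1, 0)], [(1, 1)]]
```

The number of states visited can grow as k^n, which matches the cost noted on the slide.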