Logical Clocks Ken Birman
Time: A major issue in distributed systems • We tend to casually use temporal concepts • Example: “p suspects that q has failed” • Implies a notion of time: first q was believed correct, later q is suspected faulty • Challenge: relating the local notion of time in a single process to a global notion of time • We'll discuss this issue before developing practical tools for dealing with other aspects, such as system state
Time in Distributed Systems • Three notions of time: • Time seen by external observer. A global clock of perfect accuracy • Time seen on clocks of individual processes. Each has its own clock, and clocks may drift out of sync. • Logical notion of time: event a occurs before event b and this is detectable because information about a may have reached b.
External Time • The “gold standard” against which many protocols are defined • Not implementable: no system can avoid uncertain details that limit temporal precision! • Use of external time is also risky: many protocols that seek to provide properties defined by external observers are extremely costly and, sometimes, are unable to cope with failures
Time seen on internal clocks • Most workstations have reasonable clocks • Clock synchronization is the big problem (we will visit this topic later in the course): clocks can drift apart, and resynchronization, in software, is inaccurate • Unpredictable speeds are a feature of all computing systems, hence we can’t predict how long events will take (e.g. how long it will take to send a message and be sure it was delivered to the destination)
Logical notion of time • Has no clock in the sense of “real-time” • Focus is on definition of the “happens before” relationship: “a happens before b” if: • both occur at same place and a finished before b started, or • a is the send of message m, b is the delivery of m, or • a and b are linked by a chain of such events
Logical time as a time-space picture • (Figure: time-space diagram with process lines p0–p3) • a and b are concurrent • c happens after a and b • d happens after a, b, and c
Notation • Use an arrow “→” to represent the happens-before relation • For the previous slide: • a → c, b → c, c → d • hence, a → d, b → d • a, b are concurrent • Also called the “potential causality” relation
Logical clocks • Proposed by Lamport to represent causal order • Write: LT(e) to denote the logical timestamp of an event e, LT(m) for a timestamp on a message, LT(p) for the timestamp associated with process p • Algorithm ensures that if a → b, then LT(a) < LT(b)
Algorithm • Each process maintains a counter, LT(p) • For each event other than message delivery: set LT(p) = LT(p)+1 • When sending message m, set LT(m) = LT(p) • When delivering message m to process q, set LT(q) = max(LT(m), LT(q))+1
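The three rules above can be sketched as a small class. This is a minimal illustration, not Lamport's original presentation; the class and method names are invented for this example.

```python
# A minimal sketch of a Lamport logical clock, following the three rules
# on this slide. Names here are illustrative.

class LamportClock:
    def __init__(self):
        self.time = 0  # LT(p)

    def local_event(self):
        # Any event other than a message delivery: LT(p) = LT(p) + 1
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the message carries LT(m) = LT(p)
        self.time += 1
        return self.time  # LT(m)

    def deliver(self, lt_m):
        # On delivery at q: LT(q) = max(LT(m), LT(q)) + 1
        self.time = max(lt_m, self.time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
p.local_event()     # LT(p) becomes 1
lt_m = p.send()     # LT(m) = 2
q.deliver(lt_m)     # LT(q) = max(2, 0) + 1 = 3
```

Notice that the delivery rule is what propagates time across processes: q's clock jumps forward past the sender's, so LT(send) < LT(deliver) always holds.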
Illustration of logical timestamps • (Figure: time-space diagram with process lines p0–p3; each event is labeled with its logical timestamp, which increases along each process line and jumps forward on message delivery)
Concurrent events • If a, b are concurrent, LT(a) and LT(b) may have arbitrary values! • Thus, logical time lets us determine that a potentially happened before b, but not that a definitely did so! • Example: processes p and q never communicate. Both will have events 1, 2, ... but even if LT(e)<LT(e’) e may not have happened before e’
Vector timestamps • Extend logical timestamps into a list of counters, one per process in the system • Again, each process keeps its own copy • Event e occurs at process p: p increments VT(p)[p] (the p’th entry in its own vector clock) • q delivers a message m from p: q sets VT(q) = max(VT(q), VT(m)) (element-by-element), then increments VT(q)[q] for the delivery event
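These update rules can be sketched as follows, assuming a static membership of n processes identified by indices 0..n-1 (the class and method names are invented for illustration):

```python
# A sketch of the vector-timestamp rules, assuming static membership
# of n processes, each identified by an index 0..n-1.

class VectorClock:
    def __init__(self, n, pid):
        self.vt = [0] * n   # VT(p), one counter per process
        self.pid = pid

    def local_event(self):
        # Event at p: increment p's own entry VT(p)[p]
        self.vt[self.pid] += 1
        return list(self.vt)

    def send(self):
        # Sending is a local event; the message carries a copy of VT(p)
        self.vt[self.pid] += 1
        return list(self.vt)  # VT(m)

    def deliver(self, vt_m):
        # On receipt at q: element-by-element maximum with the message's
        # timestamp, then count the delivery itself as an event at q
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        self.vt[self.pid] += 1
        return list(self.vt)

p0 = VectorClock(4, 0)
p0.local_event()          # [1, 0, 0, 0]
m = p0.send()             # message carries [2, 0, 0, 0]
p1 = VectorClock(4, 1)
p1.deliver(m)             # [2, 1, 0, 0]
```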
Illustration of vector timestamps • (Figure: time-space diagram with process lines p0–p3; successive events carry vector timestamps [1,0,0,0], [2,0,0,0], [2,1,1,0], [2,2,1,0], [0,0,1,0], [0,0,0,1])
Vector timestamps accurately represent happens-before relation • Define VT(e) < VT(e’) if: • for all i, VT(e)[i] ≤ VT(e’)[i], and • for some j, VT(e)[j] < VT(e’)[j] • Example: if VT(e)=[2,1,1,0] and VT(e’)=[2,3,1,0] then VT(e) < VT(e’) • Notice that not all VT’s are “comparable” under this rule: consider [4,0,0,0] and [0,0,0,4]
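The ordering rule is short enough to write down directly (the function names are illustrative):

```python
# The VT ordering rule from this slide: VT(e) < VT(e') iff every entry
# is <= and at least one entry is strictly <. Incomparable, unequal
# vectors correspond to concurrent events.

def vt_less(a, b):
    return (all(x <= y for x, y in zip(a, b)) and
            any(x < y for x, y in zip(a, b)))

def concurrent(a, b):
    return a != b and not vt_less(a, b) and not vt_less(b, a)

assert vt_less([2, 1, 1, 0], [2, 3, 1, 0])       # the slide's example
assert concurrent([4, 0, 0, 0], [0, 0, 0, 4])    # not comparable
```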
Vector timestamps accurately represent happens-before relation • Now can show that VT(e) < VT(e’) if and only if e → e’: • If e → e’, then there exists a chain e0 → e1 → ... → en on which vector timestamps increase “hop by hop” • If VT(e) < VT(e’), it suffices to look at VT(e’)[proc(e)], where proc(e) is the place where e occurred. By definition, we know that VT(e’)[proc(e)] is at least as large as VT(e)[proc(e)], and by construction, this implies a chain of events from e to e’
Examples of VT’s and happens-before • Example: suppose that VT(e)=[2,1,0,1] and VT(e’)=[2,3,0,1], so VT(e)<VT(e’) • How did e’ “learn” about the 3 and the 1? • Either these events occurred at the same place as e’, or • Some chain of send/receive events carried the values! • If VT’s are not comparable, the corresponding events are concurrent
Notice that vector timestamps require a static notion of system membership • For vector to make sense, must agree on the number of entries • Later will see that vector timestamps are useful within groups of processes • Will also find ways to compress them and to deal with dynamic group membership changes
What about “real-time” clocks? • Accuracy of clock synchronization is ultimately limited by uncertainty in communication latencies • These latencies are “large” compared with speed of modern processors (typical latency may be 35us to 500us, time for thousands of instructions) • Limits use of real-time clocks to “coarse-grained” applications
Interpretations of temporal terms • Understand now that “a happens before b” means that information can flow from a to b • Understand that “a is concurrent with b” means that there is no information flow between a and b • What about the notion of an “instant in time”, over a set of processes?
Neither clock is appropriate • Problem is that with both clocks, there can be many events that are concurrent with a given event • Leads to a philosophical question: • Event e has happened at process p • Which events are “really” simultaneous with p?
Perspectives on logical time • One view is based on intuition from physics • Imagine a time-space diagram • Cones of causality define past and future • “Now” is any consistent cut across the system: a cut that never includes an event while excluding something in that event’s causal past • Next Tuesday will see algorithms based on this
Causal notions of past, future • (Figure: time-space diagram with process lines p0–p3 and events a–g; a cut through the diagram separates the causal PAST from the causal FUTURE)
Issues raised by time • Time is a tool • Typical uses of time? • To put events into some sort of order • Example: the order of updates on a replicated data item • With one item, logical time may make sense • With multiple items, consider VT with one element per item
Ways to extend time to a total order • Often extend a logical timestamp or vector timestamp with actual clock time when the event occurred and process id where it occurred • Combination breaks any possible ties • Or can use event “names”
An example • Suppose we are broadcasting messages • Atomic broadcast is • Fault-tolerant: unless every process with a copy fails, the message is delivered everywhere (often expressed as all or nothing delivery) • Ordered: if p, q both receive m, n, either both receive m before n, or both receive n before m • How should we implement this policy?
Easy case • In many systems there is really just one source of broadcasts • Typically we see this pattern when there is really one reference copy of a replicated object and the replicas are viewed as cached copies • Accordingly, we can use a FIFO-ordered broadcast and reduce the problem to fault-tolerance • FIFO ordering simply requires a counter at the sender
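The single-sender FIFO case can be sketched in a few lines: the sender stamps each message with a counter, and the receiver delivers strictly in counter order, buffering anything that arrives early. This is an illustrative sketch only; fault-tolerance is, as noted, a separate problem.

```python
# A sketch of FIFO delivery from a single sender. The sender numbers
# messages 1, 2, 3, ...; the receiver buffers out-of-order arrivals
# and delivers them in sequence-number order.

class FifoReceiver:
    def __init__(self):
        self.next_seq = 1
        self.buffer = {}   # seq -> message, held until its turn

    def receive(self, seq, msg):
        delivered = []
        self.buffer[seq] = msg
        # Deliver every message whose turn has come
        while self.next_seq in self.buffer:
            delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return delivered

r = FifoReceiver()
r.receive(2, "b")      # arrives early: buffered, nothing delivered
r.receive(1, "a")      # now both deliver, in order: ["a", "b"]
```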
A more complex example • Sender-ordered multicast • Sender places a timestamp in the broadcast • Receiver waits until it has full set of messages • Orders them by logical timestamp, breaks ties with sender-id • Then delivers in this order • How can it tell when it has the “full set”?
A more complex example • (Figure: two messages m and n arrive concurrently at multiple receivers; should each receiver deliver m,n or n,m?)
A more complex example • Solution implicitly depends upon membership • In fact, most distributed systems depend upon membership • Membership is “the most fundamental” idea in many systems for this reason • Receiver can simply wait until all members have sent one message • System ends up running in rounds, where each member contributes zero or one messages per round • Use a “null” message if you have nothing to send
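The round-based scheme above can be sketched as follows: in each round the receiver waits for exactly one message (possibly null) from every member, then delivers the non-null ones sorted by (timestamp, sender id). The member names and function name are invented for illustration, and membership is assumed static, as the slide emphasizes.

```python
# A sketch of round-based total ordering: one message (or null) per
# member per round; deliver the non-null ones by (timestamp, sender id).

MEMBERS = ["p", "q", "r"]   # illustrative static membership

def deliver_round(messages):
    """messages: dict mapping sender -> (timestamp, payload or None)."""
    # The round is complete only once every member has contributed
    assert set(messages) == set(MEMBERS), "round not yet complete"
    batch = [(ts, sender, payload)
             for sender, (ts, payload) in messages.items()
             if payload is not None]   # drop the null messages
    batch.sort()                       # order by timestamp, ties by sender id
    return [payload for _, _, payload in batch]

round1 = {"p": (1, "m"), "q": (1, "n"), "r": (2, None)}
deliver_round(round1)   # ["m", "n"]: tie on timestamp broken by p < q
```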
Optimizations • We could agree in advance on “permission to send” • Now, perhaps only p, q have permission • We treat their messages in rounds but others must get permission before sending • Avoids all the null messages and ensures fairness if p, q send at same rate • Dolev: explored extensions for varied rates, gets quite elaborate…
Optimizations • In the limit, we end up with a token scheme • While holding the token, p has permission to send • If q requests the token p must release it (perhaps after a small delay) • Token carries the sequence number to use
A more complex example • (Figure: token-based ordering; messages m and n carry sequence numbers m:1 and n:2 taken from the token)
An example • Such solutions are expressed in many ways • With a ring: Chang and Maxemchuk; messages are like a “train” with new messages tacked onto the end and old ones delivered from the front • Direct all-to-all broadcast • Like a token moving around the ring, but it carries the messages with it (inspired by FDDI) • Tree structured in various ways
More examples • Old Isis system uses logical clocks • Sender says “here is a message” • Receivers maintain logical clocks. Each proposes a delivery time • Sender gathers votes, picks maximum, says “commit delivery at time t” • Receivers deliver committed messages in timestamp order from front of a queue
More examples • (Figure, three stages: receivers p, q, and r each propose logical delivery times for messages m and n, e.g. m:[1,p], n:[1,q], m:[1,r]; the sender commits the maximum proposed time for each message; each receiver then delivers m followed by n — “m! n!”)
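The old Isis propose/commit scheme described above can be sketched as follows. This is a simplified illustration with invented names: each receiver proposes a delivery time from its logical clock, the sender commits the maximum, and a receiver delivers a committed message only once no uncommitted message could still be ordered ahead of it.

```python
# A sketch of the old-Isis ordering protocol: receivers propose delivery
# times, the sender commits the maximum, receivers deliver committed
# messages from the front of a queue ordered by (time, proposer id).

class Receiver:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.pending = {}   # msg -> (delivery time, committed?)

    def propose(self, msg):
        # Vote a tentative delivery time (clock value, process id)
        self.clock += 1
        t = (self.clock, self.pid)
        self.pending[msg] = (t, False)
        return t

    def commit(self, msg, final_time):
        self.pending[msg] = (final_time, True)

    def deliverable(self):
        # Deliver committed messages in time order, but stop at the
        # first uncommitted one: its final time might still be smaller
        out = []
        for msg, (t, done) in sorted(self.pending.items(),
                                     key=lambda kv: kv[1][0]):
            if not done:
                break
            out.append(msg)
        for msg in out:
            del self.pending[msg]
        return out

recvrs = [Receiver(i) for i in range(3)]
votes = [r.propose("m") for r in recvrs]   # each receiver proposes a time
final = max(votes)                         # sender picks the maximum
for r in recvrs:
    r.commit("m", final)                   # "commit delivery at time t"
recvrs[0].deliverable()                    # ["m"]
```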
More examples • Later versions of Isis used vector times • Membership is handled separately • Each message is assigned a vector time • Delivered in vector time order, with ties broken using process id of the sender
Totem and Transis • These systems represent time using partial order information • Message m arrives and includes ordering fields: • Deliver m after n and o • By transitivity, if n is after p, then m is after p • Break ties using process id number
Totem and Transis • (Figure: partial-order graph over messages m, n, o, p showing the transitive ordering constraints)
Things to notice • Time is just a programming tool • But membership and message atomicity are very fundamental • Waiting for m won’t work if m never arrives • And VT is only meaningful if we can agree on the meaning of the indices • With failures, these algorithms get surprisingly complicated: suppose p fails while sending m?
Major uses of time • To order updates on replicated data • To define versions of objects • To deal with processes that come and go in dynamic networked applications • Processes that joined earlier often have more complete knowledge of system state • A process that leaves and rejoins often needs some form of incrementing “incarnation number” • To prove correctness of complex protocols