1. Distributed Synchronization
2. Clock Synchronization When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.
3. Physical Clocks Clock Synchronization Maximum resolution desired for global time keeping determines the maximum difference which can be tolerated between “synchronized” clocks
The timekeeping of a clock, its tick rate dC/dt, should satisfy: 1 - ρ ≤ dC/dt ≤ 1 + ρ, where ρ is the maximum drift rate
The worst possible divergence d between two clocks drifting in opposite directions over an interval Δt is thus: d = 2ρΔt
So the maximum time Δt between clock synchronization operations that can ensure a divergence of at most d is: Δt ≤ d / 2ρ
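A worked example of the bound above (the drift rate and tolerance are illustrative values, not taken from the slides):

```latex
% Clocks satisfy 1 - \rho \le dC/dt \le 1 + \rho, so two clocks drifting in
% opposite directions diverge by at most 2\rho\,\Delta t over an interval \Delta t.
% Illustrative values: \rho = 10^{-5}, tolerated divergence d = 1\,\mathrm{ms}.
\Delta t \;\le\; \frac{d}{2\rho} \;=\; \frac{10^{-3}\,\mathrm{s}}{2 \times 10^{-5}} \;=\; 50\,\mathrm{s}
```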
4. Physical Clocks Clock Synchronization Cristian’s Algorithm
Periodically poll the machine with access to the reference time source
Estimate round-trip delay with a time stamp
Estimate interrupt processing time
figure 3-6, page 129 Tanenbaum
Take a series of measurements to estimate the time it takes for a timestamp to make it from the reference machine to the synchronization target
This allows the synchronization to converge within d with a certain degree of confidence
Probabilistic algorithm and guarantee
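A minimal sketch (not from the slides) of the poll and round-trip estimate just described; get_server_time is an assumed placeholder for the network call to the reference machine, and the fake server in the usage example is purely illustrative.

```python
import time

def cristian_sync(get_server_time):
    """Estimate the offset to a reference clock, Cristian-style.

    get_server_time() is assumed to return the reference machine's current
    time; the one-way delay is estimated as half the measured round trip.
    Repeating the poll and keeping the smallest round trip tightens the bound.
    """
    t0 = time.time()                  # local time when the request leaves
    server_time = get_server_time()   # reference timestamp
    t1 = time.time()                  # local time when the reply arrives
    round_trip = t1 - t0
    estimated_now = server_time + round_trip / 2   # assume symmetric delay
    offset = estimated_now - t1                    # correction to apply locally
    return offset, round_trip

if __name__ == "__main__":
    fake_server = lambda: time.time() + 0.25       # "reference" 250 ms ahead
    offset, rtt = cristian_sync(fake_server)
    print(f"apply offset {offset:+.3f} s (round trip {rtt * 1000:.1f} ms)")
```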
5. Physical Clocks Clock Synchronization Wide availability of hardware and software to keep clocks synchronized within a few milliseconds across the Internet is a recent development
Network Time Protocol (NTP) discussed in papers by David Mills
GPS receiver in the local network synchronizes other machines
What if all machines have GPS receivers?
Increasing deployment of distributed system algorithms depending on synchronized clocks
Supply and demand constantly in flux
6. Physical Clocks (1) Computation of the mean solar day.
7. Physical Clocks (2) TAI seconds are of constant length, unlike solar seconds. Leap seconds are introduced when necessary to keep in phase with the sun.
8. Clock Synchronization Algorithms The relation between clock time and UTC when clocks tick at different rates.
9. Cristian's Algorithm Getting the current time from a time server.
10. The Berkeley Algorithm The time daemon asks all the other machines for their clock values
The machines answer
The time daemon tells everyone how to adjust their clock
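A sketch of the averaging step behind the Berkeley algorithm; polling, network-delay compensation, and message transport are omitted, and the clock readings in the example are made up.

```python
def berkeley_adjustments(daemon_time, reported_times):
    """Return the adjustment each machine should apply (daemon's own first).

    The time daemon polls the other machines, averages all readings
    (including its own), and tells everyone how far to move their clock.
    """
    clocks = [daemon_time] + list(reported_times)
    average = sum(clocks) / len(clocks)
    return [round(average - c, 2) for c in clocks]

if __name__ == "__main__":
    # Daemon reads 3:00; the others report 3:05 and 2:50 (minutes past noon).
    print(berkeley_adjustments(180.0, [185.0, 170.0]))
    # -> [-1.67, -6.67, 8.33]: everyone converges on the average, 2:58:20.
```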
11. Lamport Timestamps Three processes, each with its own clock. The clocks run at different rates.
Lamport's algorithm corrects the clocks.
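A minimal Lamport logical clock sketching the correction rule the figure illustrates; the two-process usage at the bottom is illustrative, not taken from the slides.

```python
class LamportClock:
    """Lamport logical clock: local events tick, receives merge and tick."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event (or message send): advance the clock."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """On receipt, jump past the sender's timestamp if it is ahead."""
        self.time = max(self.time, msg_time) + 1
        return self.time

if __name__ == "__main__":
    fast, slow = LamportClock(), LamportClock()
    for _ in range(5):
        fast.tick()             # the "fast" process runs more local events
    ts = fast.tick()            # timestamp an outgoing message: 6
    print(slow.receive(ts))     # the slow clock is corrected forward to 7
```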
12. Example: Totally-Ordered Multicasting Updating a replicated database and leaving it in an inconsistent state.
13. Global State (1) A consistent cut
An inconsistent cut
14. Global State (2) Organization of a process and channels for a distributed snapshot
15. Global State (3) Process Q receives a marker for the first time and records its local state
Q records all incoming messages
Q receives a marker for its incoming channel and finishes recording the state of the incoming channel
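A sketch of the marker rule these slides describe, assuming FIFO channels; how markers and application messages actually travel is abstracted into the two callbacks passed to the constructor, which are assumptions of this sketch.

```python
class SnapshotProcess:
    """Marker handling for a distributed snapshot (Chandy-Lamport style)."""

    def __init__(self, incoming_channels, read_local_state, send_markers):
        self.incoming = set(incoming_channels)
        self.read_local_state = read_local_state   # callback: capture local state
        self.send_markers = send_markers           # callback: marker on all outgoing
        self.local_state = None
        self.channel_state = {}                    # channel -> recorded messages
        self.marker_seen = set()

    def on_marker(self, channel):
        if self.local_state is None:
            # First marker: record local state, flood markers, and start
            # recording on every OTHER incoming channel.
            self.local_state = self.read_local_state()
            self.send_markers()
            for c in self.incoming - {channel}:
                self.channel_state[c] = []
        # A marker closes that channel's recording; the first channel's
        # recorded state is implicitly empty.
        self.marker_seen.add(channel)

    def on_message(self, channel, msg):
        # Messages arriving between our snapshot and that channel's marker
        # belong to the recorded channel state.
        if channel in self.channel_state and channel not in self.marker_seen:
            self.channel_state[channel].append(msg)

    def complete(self):
        return self.local_state is not None and self.marker_seen == self.incoming
```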
16. The Bully Algorithm (1) The bully election algorithm
Process 4 holds an election
Processes 5 and 6 respond, telling 4 to stop
Now 5 and 6 each hold an election
17. Mutual Exclusion Distributed components still need to coordinate their actions, including but not limited to access to shared data
Mutual exclusion to some limited set of operations and data is thus required
Consider several approaches and compare and contrast their advantages and disadvantages
Centralized Algorithm
The single central process is essentially a monitor
Central server becomes a semaphore server
Three messages per use: request, grant, release
Centralized performance constraint and point of failure
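A sketch (not from the slides) of the coordinator side of the centralized scheme; grant_fn stands in for sending a GRANT message, and the demo at the bottom is illustrative.

```python
from collections import deque

class Coordinator:
    """Centralized mutual exclusion coordinator: a semaphore server."""

    def __init__(self, grant_fn):
        self.grant = grant_fn        # stands in for sending a GRANT message
        self.holder = None
        self.waiting = deque()

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.grant(pid)          # CS free: grant immediately
        else:
            self.waiting.append(pid) # CS busy: queue the request, no reply yet

    def release(self, pid):
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.grant(self.holder)  # hand the CS to the next waiter

if __name__ == "__main__":
    # Three messages per use: REQUEST and RELEASE (the calls) plus the GRANT.
    c = Coordinator(lambda pid: print("GRANT ->", pid))
    c.request(1); c.request(2); c.release(1); c.release(2)
```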
18. Mutual Exclusion Distributed Algorithm Factors Functional Requirements
1) Freedom from deadlock
2) Freedom from starvation
3) Fairness
4) Fault tolerance
Performance Evaluation
Number of messages
Latency
Throughput of the semaphore system
Synchronization is always overhead and must be accounted for as a cost
19. Mutual Exclusion Distributed Algorithm Factors Performance should be evaluated under a variety of loads
Cover a reasonable range of operating conditions
We care about several types of performance
Best case
Worst case
Average case
Different aspects of performance are important for different reasons and in different contexts
20. Mutual Exclusion Lamport’s Algorithm Every site keeps a request queue sorted by logical time stamp
Uses Lamport’s logical clocks to impose a total global order on events associated with synchronization
Algorithm assumes ordered message delivery between every pair of communicating sites
Messages sent from site Si to site Sj in a particular order arrive at Sj in the same order
Note: Since messages arriving at a given site come from many sources the delivery order of all messages can easily differ from site to site
21. Lamport’s Algorithm Request Resource r
Thus, each site has a request queue containing resource use requests and replies
Note that the requests and replies for any given pair of sites must be in the same order in queues at both sites
Because of message order delivery assumption
22. Lamport’s Algorithm Entering CS for Resource r Site Si enters the CS protecting the resource when (1) its own request is at the head of its request_queue and (2) it has received a message timestamped later than its request from every other site
Condition (2), together with ordered delivery, ensures that no message from any site with a smaller timestamp could ever arrive
Condition (1) ensures that no other site will enter the CS
Recall that requests to all potential users of the resource and replies from them go into request queues of all processes including the sender of the message
23. Lamport’s Algorithm Releasing the CS The site holding the resource releases it (call that site Si) by removing its request from its own request_queue and sending a timestamped RELEASE message to every other site, each of which removes Si’s request from its queue
Note that the request for resource r had to be at the head of the request_queue at the site holding the resource or it would never have entered the CS
Note that the request may or may not have been at the head of the request_queue at the receiving site
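A sketch of the per-site bookkeeping for slides 20-23; the send callback (destination None meaning "all other sites") and the strict-inequality timestamp test are assumptions of this sketch, while FIFO delivery is taken from the slides' assumption.

```python
import heapq

class LamportMutexSite:
    """Per-site state for Lamport's distributed mutual exclusion (sketch)."""

    def __init__(self, site_id, all_sites, send):
        # send(msg_type, timestamp, sender, dest); dest None means broadcast.
        self.id = site_id
        self.others = [s for s in all_sites if s != site_id]
        self.send = send
        self.clock = 0
        self.queue = []                                 # heap of (ts, site)
        self.last_seen = {s: 0 for s in self.others}    # latest ts from each site

    def _tick(self, ts=0):
        self.clock = max(self.clock, ts) + 1

    def request(self):
        self._tick()
        heapq.heappush(self.queue, (self.clock, self.id))
        self.send("REQUEST", self.clock, self.id, None)

    def on_request(self, ts, sender):
        self._tick(ts)
        heapq.heappush(self.queue, (ts, sender))
        self.last_seen[sender] = max(self.last_seen[sender], ts)
        self._tick()
        self.send("REPLY", self.clock, self.id, sender)

    def on_reply(self, ts, sender):
        self._tick(ts)
        self.last_seen[sender] = max(self.last_seen[sender], ts)

    def can_enter(self):
        # Enter when our request heads the queue AND every other site has sent
        # us something timestamped after our request, so nothing older can
        # still be in transit (given FIFO channels).
        if not self.queue or self.queue[0][1] != self.id:
            return False
        my_ts = self.queue[0][0]
        return all(seen > my_ts for seen in self.last_seen.values())

    def release(self):
        heapq.heappop(self.queue)       # our request must be at the head
        self._tick()
        self.send("RELEASE", self.clock, self.id, None)

    def on_release(self, ts, sender):
        self._tick(ts)
        self.last_seen[sender] = max(self.last_seen[sender], ts)
        self.queue = [e for e in self.queue if e[1] != sender]
        heapq.heapify(self.queue)
```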
24. Lamport ME Example
25. Lamport’s Algorithm Comments Performance: 3(N-1) messages per CS invocation, since each invocation requires (N-1) REQUEST, (N-1) REPLY, and (N-1) RELEASE messages
Observation: Some REPLY messages are not required
If Sj sends a REQUEST and then receives a REQUEST from Si with a timestamp smaller than its own REQUEST
Sj need not send a REPLY to Si, because Si will receive Sj’s later-timestamped REQUEST and so already has enough information to make a decision
This reduces the messages to between 2(N-1) and 3(N-1)
As a distributed algorithm there is no single point of failure but there is increased overhead
26. Ricart and Agrawala Refine Lamport’s mutual exclusion by merging the REPLY and RELEASE messages
Assumption: total ordering of all events in the system implying the use of Lamport’s logical clocks with tie breaking
Request CS (P) operation:
1) The site requesting the CS creates a timestamped REQUEST message and sends it to all processes using the CS, including itself
Messages are assumed to be reliably delivered in order
Group communication support can play an obvious role
27. Ricart and Agrawala Receive a CS Request If the receiver is not currently in the CS and does not have a pending request for it in its request_queue
Send REPLY
If the receiver is already in the CS
Queue the request, sending no reply
If the receiver desires the CS but has not entered
Compare the TS of its request to that just received
Send a REPLY if the received request has the earlier (smaller) timestamp
Queue the received request if its own pending request has the earlier timestamp
28. Ricart and Agrawala Enter a CS
A process enters the CS when it receives a REPLY from every member of the group that can use the CS
Leave a CS
When the process leaves the CS it sends a REPLY to the senders of all pending requests on its queue
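A sketch of the per-site decision rule from slides 27-28; send_reply stands in for the REPLY message, and requests are (timestamp, site id) pairs so ties break deterministically, per the total-ordering assumption. This is an illustrative sketch, not the slides' own code.

```python
class RicartAgrawalaSite:
    """Per-site decision logic for Ricart & Agrawala mutual exclusion."""

    def __init__(self, site_id, group_size, send_reply):
        self.id = site_id
        self.group_size = group_size
        self.send_reply = send_reply  # stands in for sending REPLY to a site
        self.clock = 0
        self.in_cs = False
        self.my_request = None        # (ts, id) while requesting, else None
        self.deferred = []            # requests held until we leave the CS
        self.replies = 0

    def request_cs(self):
        self.clock += 1
        self.my_request = (self.clock, self.id)
        self.replies = 0
        return self.my_request        # caller multicasts REQUEST(my_request)

    def on_request(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        wanting = self.my_request is not None
        if self.in_cs or (wanting and self.my_request < (ts, sender)):
            # We hold the CS, or our pending request is older: defer the reply.
            self.deferred.append(sender)
        else:
            self.send_reply(sender)

    def on_reply(self):
        self.replies += 1
        if self.replies == self.group_size - 1:
            self.in_cs = True         # a REPLY from everyone: enter the CS

    def release_cs(self):
        self.in_cs = False
        self.my_request = None
        for site in self.deferred:    # answer everything we deferred
            self.send_reply(site)
        self.deferred = []
```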
29. Ricart and Agrawala Example 1
30. Ricart and Agrawala Example 2
31. Ricart and Agrawala Observations The algorithm works because the global logical clock ensures a global total ordering on events
This ensures, in turn, that the decision about who enters the CS is unambiguous
Single point of failure is now N points of failure
A crashed group member cannot be distinguished from a busy CS
Distributed and “optimized” version is N times more vulnerable than the centralized version!
An explicit message denying entry helps reliability but converts blocking into a busy wait
32. Ricart and Agrawala Observations Either group communication support is used, or each user of the CS must keep track of all other potential users correctly
Powerful motivation for standard group communication primitives
The argument against a centralized server was that a single process involved in each CS decision was bad
Now we have N processes involved in each decision
Improvements: get a majority - Maekawa’s algorithm
Bottom Line: a distributed algorithm is possible
Shows theoretical and practical challenges of designing distributed algorithms that are useful
33. Token Passing Mutex General structure
One token per CS → token denotes permission to enter
Only process with token allowed in CS
Token passed from process to process → logical ring
Mutex
Pass token to process (i + 1) mod N
Received token gives permission to enter CS
hold token while in CS
Must pass token after exiting CS
Fairness ensured: each process waits at most N-1 entries to get CS
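A toy simulation of one circulation of the token; lost-token detection, receipts, and real message passing are deliberately omitted, so this is a structural sketch only.

```python
def token_ring_round(n, wants_cs):
    """Circulate the token once around a logical ring of n processes.

    wants_cs(i) says whether process i currently wants the critical section.
    Only the token holder may enter; everyone else just forwards the token.
    """
    holder = 0
    for _ in range(n):                       # one full trip of the token
        if wants_cs(holder):
            print(f"P{holder} enters the CS, exits, then passes the token")
        holder = (holder + 1) % n            # pass token to (i + 1) mod N

if __name__ == "__main__":
    # On this circulation, only processes 1 and 3 want the CS.
    token_ring_round(5, lambda i: i in (1, 3))
```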
34. Token Passing Mutex Correctness is obvious
No starvation since passing is in strict order
Difficulties with token passing mutex
Idle case of no process entering CS pays overhead of constantly passing the token
Lost tokens: diagnosis and creating a new token
Duplicate tokens: ensure generation of only one token
Crashes: require a receipt to detect dead destinations
Receipts double the message overhead
Design challenge: holding time for unneeded token
Too short → high overhead, too long → high CS latency
35. Mutex Comparison Centralized
Simplest and most efficient
Centralized coordinator crashes create the need to detect crash and choose a new coordinator
M/use: 3; Entry Latency: 2
Distributed
3(N-1) messages per CS use (Lamport)
2(N-1) messages per CS use (Ricart & Agrawala)
If any process crashes with a non-empty queue, algorithm won’t work
M/use: 2(N-1); Entry Latency: 2(N-1)
36. Mutex Comparison Token Ring
Ensures fairness
Overhead is subtle → no longer linked to CS use
M/use: 1 → ∞ ; Entry Latency: 0 → N-1
This algorithm pays overhead when idle
Need methods for re-generating a lost token
Design Principle: building fault handling into algorithms for distributed systems is hard
Crash recovery is subtle and introduces overhead in normal operation
Performance Metrics: M/use and Entry Latency
37. Election Algorithms Centralized approaches often necessary
Best choice in mutex, for example
Need method of electing a new coordinator when it fails
General assumptions
Give processes unique system/global numbers (e.g. PID)
Elect process using a total ordering on the set
All processes know process number of members
All processes agree on new coordinator
Not all processes know whether the coordinator is up or down → the election algorithm is responsible for determining this
Design challenge: distinguishing network delay from a crashed peer
38. Bully Algorithm Suppose the coordinator doesn’t respond to P1’s request
P1 holds an election by sending an election message to all processes with higher numbers
If P1 receives no responses, P1 is the new coordinator
If any higher numbered process responds, P1 ends its election
When a process receives an election request
It replies to the sender, telling it that it has lost the election
And holds an election of its own
Eventually all but highest surviving process give up
Process recovering from a crash takes over if highest
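A sketch that collapses the bully algorithm's message exchange into set operations: `alive` is the set of processes that would answer an ELECTION message. The outcome (highest surviving id wins) matches the slides; the actual message traffic is not modeled.

```python
def bully_election(initiator, alive):
    """Return the coordinator chosen when `initiator` starts a bully election."""
    higher = [p for p in alive if p > initiator]   # ELECTION goes to higher ids
    if not higher:
        return initiator        # nobody higher answered: the initiator wins
    # Each responder tells the initiator to stop and runs its own election;
    # the net effect is that the highest surviving process becomes coordinator.
    return max(higher)

if __name__ == "__main__":
    alive = {0, 1, 2, 3, 4, 5, 6}                    # old coordinator 7 crashed
    print("coordinator:", bully_election(4, alive))  # 4 starts; 6 wins
```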
39. Bully Algorithm Example: Processes 0-7; 4 detects that 7 has crashed
4 holds election and loses
5 holds election and loses
6 holds election and wins
Message overhead variable
Who starts an election matters
Solid lines say “Am I leader?”
Dotted lines say “you lose”
Hollow lines say “I won”
6 becomes the coordinator
When 7 recovers it is a bully and sends “I win” to all
40. Ring Algorithm Processes have a total order known by all
Each process knows its successor ? forming a ring
Ring: mod N
So the successor of Pi is P(i+1) mod N
No token involved
Any process Pi noticing that the coordinator is not responding
Sends an election message to its successor P(i+1) mod N
If the successor is down (detected by timeout), send to the next member
Receiving process adds its number to the message and passes it along
41. Ring Algorithm When the election message gets back to the election initiator
Change message to coordinator
Circulate to all members
Coordinator is highest process in the total order
All processes know the order and thus all will agree no matter how the election started
Strength
Only one coordinator chosen
Weakness
Scalability: latency increases with N because the algorithm is sequential
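A sketch of one ring election; skipping dead successors stands in for the timeout mentioned on slide 40, and the COORDINATOR circulation is reduced to returning the winner. The process numbers in the example are illustrative.

```python
def ring_election(start, alive, n):
    """Run one ring election started by `start` over processes 0..n-1."""
    collected = []
    p = start
    while True:
        if p in alive:
            collected.append(p)    # each live process appends its own number
        p = (p + 1) % n            # pass the ELECTION message to the successor
        if p == start:             # the message is back at the initiator
            break
    # The initiator would now circulate a COORDINATOR message naming the winner.
    return max(collected)          # highest number in the total order wins

if __name__ == "__main__":
    # Process 7 has crashed; process 2 notices and starts the election.
    print(ring_election(start=2, alive={0, 1, 2, 3, 4, 5, 6}, n=8))   # -> 6
```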
42. Ring Algorithm What if more than one process detects a crashed coordinator?
More than one election will be produced: message storm
All messages will contain the same information: member process numbers and order of members
Same coordinator is chosen (highest number)
Refinement might include filtering duplicate messages
Some duplicates will happen
Consider two elections chasing each other
Eliminate the election initiated by the lower-numbered process
Duplicates persist until the lower-numbered election’s message reaches the initiator of the higher one
43. The Bully Algorithm (2) Process 6 tells 5 to stop
Process 6 wins and tells everyone
44. A Ring Algorithm Election algorithm using a ring.
45. Mutual Exclusion: A Centralized Algorithm Process 1 asks the coordinator for permission to enter a critical region. Permission is granted
Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
When process 1 exits the critical region, it tells the coordinator, which then replies to 2
46. A Distributed Algorithm Two processes want to enter the same critical region at the same moment.
Process 0 has the lowest timestamp, so it wins.
When process 0 is done, it sends an OK also, so 2 can now enter the critical region.
47. A Token Ring Algorithm An unordered group of processes on a network.
A logical ring constructed in software.
48. Comparison A comparison of three mutual exclusion algorithms.
49. Deadlocks Definition: Each process in a set is waiting for a resource to be released by another process in the set
The set is some subset of all processes
Deadlock only involves the processes in the set
Remember the necessary conditions for DL
Remember that methods for handling DL are based on preventing or detecting and fixing one or more necessary conditions
50. Deadlocks Necessary Conditions Mutual exclusion
Process has exclusive use of resource allocated to it
Hold and Wait
Process can hold one resource while waiting for another
No Preemption
Resources are released only by explicit action by controlling process
Requests cannot be withdrawn (i.e. request results in eventual allocation or deadlock)
Circular Wait
Every process in the DL set is waiting for another process in the set, forming a cycle in the SR graph
51. Deadlock Handling Strategies No strategy
Prevention
Make it structurally impossible to have a deadlock
Avoidance
Allocate resources so deadlock can’t occur
Detection
Let deadlock occur, detect it, recover from it
52. No Strategy The “Ostrich Algorithm” Assumes deadlock rarely occurs
Becomes more probable with more processes
Catastrophic consequences when it does occur
May need to re-boot all or some machines in system
Fairly common and works well when
DL is rare and
Other sources of instability are more common
How many reboots of Windows or MacOS are prompted by DL?
53. Deadlock Prevention Ordered resource allocation most common example
Consider the link with two-phase locking’s growing and shrinking phases
Works but requires global view of all resources
A total order on resources must exist for the system
Process code must allocate resources in order
Underutilizes resources when the period of use of a resource conflicts with the total resource order
Consider process Pi and Pk using resources R1 and R2
Pi uses R1 90% of its execution time and R2 10%
Pk uses R2 90% of its execution time and R1 10%
One of them must hold a rarely used resource far too long
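A sketch of ordered resource allocation with threads and locks; the RANK table stands in for the assumed system-wide total order, and the lock and worker names are illustrative.

```python
import threading

# Every code path acquires locks in ascending rank, so a circular wait is
# structurally impossible.  RANK is the assumed system-wide total order.
R1, R2 = threading.Lock(), threading.Lock()
RANK = {id(R1): 1, id(R2): 2}

def acquire_in_order(*locks):
    ordered = sorted(locks, key=lambda lock: RANK[id(lock)])
    for lock in ordered:
        lock.acquire()
    return ordered

def release(locks):
    for lock in reversed(locks):
        lock.release()

def worker(name):
    held = acquire_in_order(R2, R1)     # still locks R1 before R2
    print(name, "holds R1 and R2")
    release(held)

if __name__ == "__main__":
    threads = [threading.Thread(target=worker, args=(f"P{i}",)) for i in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```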
54. Deadlock Avoidance General method: Refuse allocations that may lead to deadlock
Method for keeping track of states
Need to know resources required by a process
Banker’s algorithm
Must know the maximum resources that may be allocated to Pi
Keep track of resources available
For each request, make sure maximum need will not exceed total available
Underutilizes resources
Never used
Advance knowledge is often not available and the check is CPU-intensive
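A sketch of the Banker's-algorithm safety check, the part that decides whether the system can still reach a state where every process finishes; the matrices in the usage example are illustrative only.

```python
def is_safe(available, allocation, maximum):
    """Banker's algorithm safety check.

    available: free units of each resource type.
    allocation[i], maximum[i]: process i's current allocation and declared
    maximum need.  Returns True if some completion order exists.
    """
    work = list(available)
    need = [[m - a for m, a in zip(maximum[i], allocation[i])]
            for i in range(len(allocation))]
    finished = [False] * len(allocation)
    progress = True
    while progress:
        progress = False
        for i, done in enumerate(finished):
            if not done and all(n <= w for n, w in zip(need[i], work)):
                # Process i can run to completion and return its allocation.
                work = [w + a for w, a in zip(work, allocation[i])]
                finished[i] = True
                progress = True
    return all(finished)

if __name__ == "__main__":
    # One resource type with 3 free units; illustrative numbers only.
    print(is_safe([3], allocation=[[1], [4], [2]], maximum=[[4], [6], [7]]))
```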
55. Deadlock Detection and Resolution Attractive for two main reasons
Prevention and avoidance are hard, have significant overhead, and require information difficult or impossible to obtain
Deadlock is comparatively rare in most systems so a form of the argument for optimistic concurrency control applies: detect and fix comparatively rare situations
Availability of transactions helps
DL resolution requires us to kill some participant(s)
Transactions are designed to be rolled back and restarted
56. Centralized Deadlock Detection General method: Construct a resource graph and analyze it
Analyze through resource reductions
If cycle exists after analysis, deadlock has occurred
Processes in cycle are deadlocked
Local graphs on each machine
Pi requests R1
R1’s machine places request in local graph
If cycle exists in local graph, perform reductions to detect deadlock
Need to calculate union of all local graphs
Deadlock cycle may transcend machine boundaries
57. Cycles don’t always mean deadlock! Graph Reduction
58. Waits-For Graphs (WFGs) Based on Resource Allocation Graph (SR)
An edge from Pi to Pj
means Pi is waiting for Pj to release a resource
Replaces two edges in SR graph
Pi → R
R → Pj
Deadlocked when a cycle is found
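A sketch of deadlock detection on a waits-for graph using a standard depth-first search for a cycle; the graph in the example is illustrative.

```python
def find_cycle(wfg):
    """Return one deadlocked cycle in a waits-for graph, or None.

    wfg maps each process to the set of processes it waits for.
    """
    visited, on_stack = set(), []

    def dfs(p):
        visited.add(p)
        on_stack.append(p)
        for q in wfg.get(p, ()):
            if q in on_stack:                       # back edge: cycle found
                return on_stack[on_stack.index(q):] + [q]
            if q not in visited:
                cycle = dfs(q)
                if cycle:
                    return cycle
        on_stack.pop()
        return None

    for p in list(wfg):
        if p not in visited:
            cycle = dfs(p)
            if cycle:
                return cycle
    return None

if __name__ == "__main__":
    # P1 waits for P2, P2 for P3, P3 for P1: a deadlock cycle.
    print(find_cycle({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}}))
```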
59. Centralized Deadlock Detection All hosts communicate resource state to coordinator
Construct global resource graph on coordinator
Coordinator must be reliable and fast
When to construct the graph is an important choice
Report every resource operation (request, acquire, release)
Large overhead and significant use latency
Periodically send set of operations
Lower overhead and use latency, higher detection latency
Whenever a need for cycle detection is indicated
Central or local decision
All have drawbacks b/c of false deadlocks
60. False Deadlock Problem: messages may not arrive in a timely fashion
Inconsistent and out-of-date world view at a particular machine
In particular, out-of-order arrival
Assume two processes on two machines and two resources
P2 releases R2 (message A)
P1 requests instance of R2 (message B)
61. False Deadlock Problem: the coordinator detects a false deadlock if message B arrives before message A
62. False Deadlock Lack of global message delivery order causes false DL
Could apply Lamport’s global virtual clock
Expensive
Coordinator detects potential DL
Requests all outstanding messages with lower timestamp
Aim is to establish a common global message order
Establishes a total order on resource operations
Establishes a common world view and thus common decision making
Fixes some false deadlocks, but others are harder
63. Distributed Deadlock Detection Chandy-Misra-Haas algorithm
Processes can request more than one resource with a single message → a process can wait on several resources
Amortize message overhead
Speeds up the growing (resource acquisition) phase
Use waits-for graph to represent system state
Dependencies across machine boundaries make looking for cycles hard
A process sends probe messages when it has to wait
If message gets back, deadlock has occurred
64. Distributed Deadlock Detection When process has to wait
Send message to process holding resources
Recipient forwards to all processes it is waiting on
Creates concurrent probe of wait-for graph for cycles
If message gets back to originator
Cycle exists in wait-for graph so deadlock has occurred
Note that first field of message will always be the initiator
Many messages every time a process blocks
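A sketch of the probe computation; the (initiator, sender, receiver) triples mirror the messages in the example on the next slide, but the waits-for relation and blocked set are supplied directly instead of being discovered through message passing.

```python
def chandy_misra_haas(blocked, waits_for, initiator):
    """Propagate (initiator, sender, receiver) probes through the waits-for graph.

    blocked: the set of blocked processes; waits_for[p]: processes p waits on.
    Deadlock is declared when a probe comes back to the initiator.
    """
    probes = [(initiator, initiator, q) for q in waits_for.get(initiator, ())]
    seen = set()
    while probes:
        init, sender, receiver = probes.pop()
        if receiver == init:
            return True                      # probe returned: cycle, deadlock
        if receiver in blocked and (sender, receiver) not in seen:
            seen.add((sender, receiver))
            # A blocked recipient forwards the probe to everyone it waits on.
            probes.extend((init, receiver, q) for q in waits_for.get(receiver, ()))
    return False

if __name__ == "__main__":
    waits_for = {0: [1], 1: [2], 2: [3], 3: [4, 5], 4: [6], 6: [8], 8: [0]}
    print(chandy_misra_haas(blocked={0, 1, 2, 3, 4, 6, 8},
                            waits_for=waits_for, initiator=0))   # True: deadlock
```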
65. Distributed Deadlock Detection An Example P0 gets blocked, resource held by P1
Initial message from P0 to P1 : (0, 0, 1)
P1 waiting on P2
P1 sends message (0, 1, 2) to P2
P2 waiting on P3: (0, 2, 3)
P3 waiting on P4 and P5: (0, 3, 4) and (0, 3, 5)
P5 chain ends, but P4 → P6 → P8
But P8 is waiting on P0:
P0 gets message, sees itself as the initiator: (0, 8, 0)
A cycle thus exists
P0 knows there is deadlock
66. Distributed Deadlock Resolution Some process in the cycle must be killed
Structuring resource use as transactions makes this better behaved and easier to understand
Race Condition:
Two processes block at the same time and send probes
Both discover the cycle in parallel
Damping is difficult as it is hard to tell which messages may be killed → the killing process must know the cycle
In practice, emphasize the simplest and cheapest resolution
Most cycles are between two processes
Example of importance of gathering performance data
67. Distributed Deadlock Prevention Prevention
Careful design to make deadlocks structurally impossible
Make sure at least one of the 4 necessary conditions for deadlock cannot hold
Process can only hold one resource at a time
Request all resources initially
Process releases all resources before requesting new one
Resource ordering
All are cumbersome in practice
Distribution opens some new possibilities
Lamport clocks create total order preventing cycles
68. Summary We began with clocks and saw how relaxing the semantic requirement for real-time made Lamport’s logical clocks possible
Given global clocks, virtual or real, we considered mutual exclusion
Centralized algorithms keep information in one place effectively becoming a monitor
Distribution handles mutual exclusion in parallel at the cost of O(N) messages per CS use
Token algorithm reduced messages under some circumstances but introduced heartbeat overhead
Each has strengths and weaknesses
69. Summary Many distributed algorithms require a coordinator
Creating the need to select, monitor, and replace the coordinator as required
Election algorithms provide a way to select a coordinator
Bully algorithm
Ring algorithm
Transactions provide a high level abstraction with significant power for organizing, expressing, and implementing distributed algorithms
Mutual Exclusion
Locking
Deadlock
70. Summary Transactions are useful because they can be aborted
Concurrency control issues were considered
Locking
Optimistic
Deadlock
Detection
Prevention
Yet again
Distributed systems have the same problems
Only more so