Distributed Operating Systems

Distributed Operating Systems Neeraj Suri www.deeds.informatik.tu-darmstadt.de

Mutual Exclusion (ME) • TSL, semaphores, monitors… (single OS) • Do they work in a DS given timing delays, ordering issues, and … ? • Last lecture: atomicity (2PC, 3PC) • Continue with • concurrency control • single coordinator • majority voting • synchronization

Distributed Mutual Exclusion • Solution #1: Build a lock server in a centralized manner (generally simple) • Lock server solution problems • Server is a single point of failure • Server is a performance bottleneck • Failure of a client holding the lock also causes problems: no UNLOCK is sent • Similar to garbage collection problems in DSs … validity conditions etc. • What is the state of the lock server? For stateless servers? • Works? Under what assumptions?
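The centralized lock server can be sketched in a few lines. This is a minimal single-process sketch, not a networked implementation. The lease timeout is an assumption added here to illustrate one way of coping with a client that crashes without sending UNLOCK; the slide does not prescribe it. The names `LockServer`, `lease_seconds`, and `clock` are hypothetical.

```python
import time
from collections import deque


class LockServer:
    """Centralized lock server: one holder at a time, FIFO waiters."""

    def __init__(self, lease_seconds=5.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds  # hypothetical lease length (assumption)
        self.clock = clock                  # injectable so the sketch can be tested
        self.holder = None
        self.expires_at = 0.0
        self.waiting = deque()

    def _grant_next(self):
        # Hand the free lock to the longest-waiting client, if any.
        if self.holder is None and self.waiting:
            self.holder = self.waiting.popleft()
            self.expires_at = self.clock() + self.lease_seconds

    def _expire_if_stale(self):
        # A crashed holder never sends UNLOCK; the lease reclaims the lock.
        if self.holder is not None and self.clock() >= self.expires_at:
            self.holder = None
            self._grant_next()

    def lock(self, client):
        """Return True if `client` holds the lock after this call, else queue it."""
        self._expire_if_stale()
        if client != self.holder and client not in self.waiting:
            self.waiting.append(client)
        self._grant_next()
        return self.holder == client

    def unlock(self, client):
        self._expire_if_stale()
        if self.holder == client:
            self.holder = None
            self._grant_next()
```

The lease only works if clients and server agree roughly on elapsed time, which is exactly the kind of assumption the slide asks about. All lock state lives in the server's memory, so the server remains a single point of failure.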

Distributed Mutual Exclusion (cont.) • Solution #2: decentralized algorithm • Replicate state in central server on all processes • Requesting process sends LOCK message to all others • Then waits for LOCK_GRANTED from all • To release critical region, send UNLOCK to all • Works? Under what assumptions?

Distributed ME (all ACK) • To request CS: send REQ msg. M to ALL; enQ M in local Q • Upon receiving M from Pi • if it does not want (or have) CS, send ACK • if it has CS, enQ request Pi : M • if it wants CS, enQ/ACK based on lowest ID (time-stamps would be much nicer, but lack of time → no time basis for timestamps) • To release CS: send ACK to all in Q, deQ [diff. from 2PC] • To enter CS: enter CS when ACK received from all • [Figure: processes A, B, C exchanging ACKs for requests 8 and 12; queue states {8}, {8,12}, {12}; two "enters CS" events]
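The request/ACK rules above can be sketched as a small deterministic simulation. Two things here are assumptions rather than the slide's method. Requests are ordered by a (logical timestamp, process ID) pair in the style of Ricart and Agrawala instead of by ID alone, because ordering by ID alone can let a process enter on a stale ACK. The network is also modelled as one reliable FIFO queue. The names `Process` and `run` are hypothetical.

```python
from collections import deque


class Process:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.clock = 0          # logical clock (assumption, see lead-in)
        self.state = "idle"     # "idle", "wanted", or "held"
        self.req = None         # (timestamp, pid) of our pending request
        self.acks = 0
        self.deferred = []      # requesters answered only after we release

    def request(self):
        # Send REQ to all other processes.
        self.clock += 1
        self.state, self.req, self.acks = "wanted", (self.clock, self.pid), 0
        return [(q, ("REQ", self.req)) for q in range(self.n) if q != self.pid]

    def receive(self, msg):
        kind, payload = msg
        if kind == "REQ":
            ts, sender = payload
            self.clock = max(self.clock, ts) + 1
            ours_first = self.state == "wanted" and self.req < payload
            if self.state == "held" or ours_first:
                self.deferred.append(sender)      # enqueue, ACK later
                return []
            return [(sender, ("ACK", self.pid))]  # not competing: ACK now
        self.acks += 1                            # kind == "ACK"
        if self.state == "wanted" and self.acks == self.n - 1:
            self.state = "held"                   # ACK from all: enter CS
        return []

    def release(self):
        # Leave the CS and ACK every queued request.
        self.state, self.req = "idle", None
        out = [(q, ("ACK", self.pid)) for q in self.deferred]
        self.deferred = []
        return out


def run(n, requesters):
    """Deliver messages in FIFO order; return the order of CS entries."""
    procs = [Process(p, n) for p in range(n)]
    net = deque()
    for p in requesters:
        net.extend(procs[p].request())
    order = []
    while net:
        dest, msg = net.popleft()
        net.extend(procs[dest].receive(msg))
        if procs[dest].state == "held":
            order.append(dest)                    # CS is entered and left at once
            net.extend(procs[dest].release())
    return order
```

For example, `run(3, [2, 0])` lets process 0 enter before process 2, since both requests carry the same timestamp and the lower ID breaks the tie.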

Token Based ME: Leader Election • Approach: • Allow any process to assume role of centralized server • Choose with a distributed leader election algorithm • Algorithm: • Everyone requests a lock and adds “timestamp”; first granted becomes leader • Don’t immediately grant: wait for all lock requests and grant to one with lowest “timestamp” • Break ties based on alternate dimension (# of prior accesses, …) • Works? Under which assumptions?

Leader Election (cont.) • Another scheme • Optimize: try to take charge, but check with higher ranks first

Leader Election: Bully Alg. • When a process P notices that the coordinator is no longer responding, it initiates an election • P sends ELECTION message to all processes with higher IDs • If nobody responds, P wins the election and becomes coordinator • If one of the processes with higher IDs responds, P’s job is done • If a process Q gets an ELECTION message from process P • Q sends OK message back • Q starts an election • A failed process that restarts initiates an ELECTION
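The election rules above can be traced with a short simulation. It is a sketch under simplifying assumptions: messages are never lost, and a process is either responsive for the whole election or not at all. The function name `bully_election` and the message label `"LEADER"` (borrowed from the figure label) are illustrative choices, not part of a fixed protocol definition.

```python
def bully_election(initiator, all_ids, alive):
    """Return (coordinator, log) for an election started by `initiator`.

    all_ids: every process ID in the system
    alive:   IDs that currently respond (must include the initiator)
    log:     list of (sender, receiver, message) tuples
    """
    log = []
    started = set()
    pending = [initiator]
    coordinator = None
    while pending:
        p = pending.pop(0)
        if p in started:          # each process runs at most one election
            continue
        started.add(p)
        responders = []
        for q in sorted(i for i in all_ids if i > p):
            log.append((p, q, "ELECTION"))
            if q in alive:
                log.append((q, p, "OK"))   # higher process takes over
                responders.append(q)
        if responders:
            pending.extend(responders)     # each responder starts its own election
        else:
            coordinator = p                # nobody higher answered: p wins
            for q in sorted(i for i in alive if i != p):
                log.append((p, q, "LEADER"))
    return coordinator, log
```

With IDs 0 to 7, process 7 unresponsive, and process 4 starting the election, the sketch ends with process 6 as coordinator: the "biggest" live process wins.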

Leader Election 1: Bully Algorithm • [Figure: processes 0 to 7 exchanging Election, OK, and Leader messages over several rounds] • “Bully”: “biggest” one wins

Leader Election 2: Ring Algorithm • Assumptions • Assume processes are physically or logically ordered • i.e., each process knows its successor and can communicate with it • Note: no “token” used here (unlike many rings) • Process P notices coordinator not functioning • Builds ELECTION msg with its PID & sends to successor (higher PID) • If successor failed, send to next successor etc. until it gets through • At each step in the ring • Sender adds its own PID to list (makes it a candidate) • ELECTION msg gets back to initiator, P • Recognizes it initiated it, changes msg to COORDINATOR, re-circulates with the highest PID in the message • Q: what happens if there are two simultaneous ELECTIONs?
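One pass of the ELECTION message around the ring can be sketched as follows. The sketch assumes that the set of live processes does not change while the message circulates and that failed successors are simply skipped. The name `ring_election` is hypothetical.

```python
def ring_election(initiator, ring, alive):
    """Circulate one ELECTION message and return (coordinator, candidates).

    ring:  list of PIDs in ring order
    alive: PIDs that currently respond (must include the initiator)
    """
    n = len(ring)
    candidates = [initiator]              # initiator puts its own PID in the msg
    i = (ring.index(initiator) + 1) % n
    while ring[i] != initiator:           # stop when the msg is back at the initiator
        if ring[i] in alive:              # failed successors are skipped
            candidates.append(ring[i])    # live process adds its own PID
        i = (i + 1) % n
    return max(candidates), candidates    # COORDINATOR carries the highest PID
```

Running it from two different initiators over the same live set gives the same coordinator, which hints at why two simultaneous elections are wasteful but not harmful under these assumptions.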

Ring Algorithm (cont.) • Setting: P2 and P5 both conclude that P7 has crashed, so both initiate ELECTION • Q: how does this terminate?

All or nothing? Maekawa’s ME • Select a “subset” of nodes to give permission to enter CS • Is it sufficient for a majority to give permission? • For n processes, define subsets G of size (sqrt(n) + 1) such that sqrt(n) subsets exist • Basically, there is always a node in the intersection of any 2 subsets • To request CS: send request to all processes in G • To release CS: send release to all processes in G
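The key property, that any two subsets share at least one node, can be checked with a simple grid construction. This is not Maekawa's own construction and does not reproduce the subset size stated on the slide: it arranges n processes in a sqrt(n) by sqrt(n) grid and gives each process its row plus its column, so each subset has 2·sqrt(n) − 1 members. The name `grid_quorums` is hypothetical, and n is assumed to be a perfect square.

```python
import math
from itertools import combinations


def grid_quorums(n):
    """Map each process 0..n-1 to the set of processes in its row and column."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("this sketch assumes n is a perfect square")
    quorums = {}
    for p in range(n):
        row, col = divmod(p, k)
        same_row = {row * k + c for c in range(k)}
        same_col = {r * k + col for r in range(k)}
        quorums[p] = same_row | same_col
    return quorums


def all_pairs_intersect(quorums):
    """True if every two subsets have at least one member in common."""
    return all(a & b for a, b in combinations(quorums.values(), 2))
```

The row of one process always crosses the column of any other, so two requesters can never both collect permission from their whole subset at the same time, provided each node grants permission to one requester at a time.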

Operations Via Vote Collection • Two or more processes accessing CS • E.g., voting on read / write operations • Voting: concurrency control to ensure conflicting operations don’t occur • Operation proceeds only if a quorum of processes yield access (they vote “yes” to do it) • Define quorums so conflicting operations intersect in at least one replica

Static Voting (~ ME) • Basic scheme (read-one; write-all or CS access) • Read operations require quorum = 1 • Write operations require quorum = All • Read operations: high availability and low cost • Write operations: high cost, and they block if even one replica fails

Weighted Voting • Extend basic scheme: • Each replica gets a number of votes • Quorums based on # votes • Sum of quorums of conflicting ops must exceed total votes • Quorum definitions • V total votes in the system • R votes is the read quorum • W votes is the write quorum • Quorum requirements • W > V/2 • W+R > V
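The two quorum requirements translate directly into a small check. The function name `valid_quorums` is hypothetical.

```python
def valid_quorums(total_votes, read_quorum, write_quorum):
    """Check the weighted-voting requirements W > V/2 and W + R > V."""
    write_write_ok = 2 * write_quorum > total_votes            # two writes must overlap
    read_write_ok = read_quorum + write_quorum > total_votes   # a read must see the last write
    return write_write_ok and read_write_ok
```

With V = 5, the pairs (R = 1, W = 5) and (R = 3, W = 3) both pass, while (R = 2, W = 3) fails because a read quorum and a write quorum could miss each other.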

Quorum Voting • r + w > v [5 total votes] • w > v/2 [5/2 = 3 (ceiling)] • r = 1 (simplex), w = 5: “strict system”, all or none • What happens if a node fails? • [Figure: replicas A (2 votes), B (1 vote), C (1 vote), D (1 vote)]

FT Quorum Voting • r + w > v • w > v/2 • r = 3, w = 3: “tolerates” faults • Degradability by dynamically re-assigning “votes” to a node • [Figure: replicas A (2 votes), B (1 vote), C (1 vote), D (1 vote)]
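The difference between the strict and the fault-tolerant setting shows up once a replica fails. The sketch below assumes the vote assignment A = 2, B = 1, C = 1, D = 1, as read from the slide figure. The names `VOTES`, `can_proceed`, and `availability` are hypothetical.

```python
VOTES = {"A": 2, "B": 1, "C": 1, "D": 1}  # assumed assignment, 5 votes in total


def can_proceed(live_replicas, quorum, votes=VOTES):
    """An operation proceeds only if the live replicas hold enough votes."""
    return sum(votes[r] for r in live_replicas) >= quorum


def availability(failed, r, w, votes=VOTES):
    """Report which operations are still possible after `failed` replicas are lost."""
    live = set(votes) - set(failed)
    return {"read": can_proceed(live, r, votes),
            "write": can_proceed(live, w, votes)}
```

With r = 1 and w = 5, losing replica B alone blocks all writes. With r = 3 and w = 3, reads and writes both survive the loss of any single replica, including the two-vote replica A.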

Asynchronous to Synchronous! • Time and Clocks • Local clocks • Global clocks • Clock synchronization

Achieving Synchrony • If time helps, why not just make all systems synchronous? • Need to provide “timeliness” properties of all resources involved in a computation: • Specified “at”, “within”, “before”, “every”, “after”, … • Synchrony + “ordering” instead of synchrony?

Time, Clocks & Ordering “A man with one watch knows what time it is, but one with two watches never knows.” • Time → sequencing + synchronization in computer systems • Record and observe the place of events in the timeline • Enforce the future positioning of events in the timeline • Real + Logical time: abstract, monotonically increasing continuous function that marks the passage of time (not “Real-Time”) • Timestamping an event: associate the event to a point in the timeline (for comparison to other events) • Use of time in computers provided through • Timers • Local clocks

Issues • File updates ~ “make” • If file fB updated later than file fA, we assume time(fB) > time(fA) • We still want this to hold if different computers update the files (timestamp the update events) • Measure latency from A→B • This is a distributed duration, i.e., computed with different local clock values • Measure roundtrip delay A→B→A (ping) • Closed chain: we can use A’s local clock • Global time? • physical clocks? • logical clocks?

Local Clocks • Local physical clock used as time source • Model: a hardware clock is a monotonically increasing discrete function that maps real time T(real) into a logical clock T(logical) or T(virtual) • Characteristics: rate of drift (ρ), granularity (g) ... • Drift rate (~ 1-10 μs/s) • [Figure: logical time plotted against real time]

Global Clocks • Global clock built by synchronizing all local clocks to the same initial value • Creates at each process p a virtual clock (vcp) • Must periodically resynchronize due to drift • A drift of 10 μs/sec seems small, but two clocks drifting in opposite directions are 2 × 10 μs/sec × 3600 sec = 72 msecs apart after an hour! • Need clock sync. algorithm to do the re-synch. • Global clock is the set of {vcp} for all processes p

Clock Synchronization Alg’s • Goal: keep all clocks in a DS synchronized • Approximately in sync. to within X: Convergence • Exactly in sync. with each other: Consensus • System model (for all algorithms) • Each host has a timer that interrupts H times/second (H = 1/g) • Interrupt handler increments a SW (logical) clock, C • Counts ticks from some agreed-upon time • Cp(t) ≡ value of C on host p when UTC is t (physical time) • Definitions • UTC ≡ Coordinated Universal Time • WWV ≡ call sign of NIST radio station broadcasting UTC

Convergence • Drift rate = ρ • In sync: skew |C2(t) − C1(t)| < δ • Re-sync. period = R • Max. skew δ = 2ρR → |C2(t) − C1(t)| ≤ δ
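The relation between drift rate, re-sync period, and maximum skew can be evaluated directly. The sketch assumes two clocks that start in sync and drift in opposite directions at the same rate ρ, which is the worst case behind the factor 2. The function names are hypothetical.

```python
def max_skew(rho, resync_period):
    """Worst-case skew in seconds between two clocks: delta = 2 * rho * R."""
    return 2 * rho * resync_period


def resync_period_for(rho, delta):
    """Longest re-sync period R in seconds that keeps the skew within delta."""
    return delta / (2 * rho)
```

For ρ = 10 μs/s, an hour without re-sync allows about 72 ms of skew, and keeping the skew within 1 ms requires re-synchronizing roughly every 50 seconds.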

Anything more to worry about? • [Figure labels: X: first msg rcvd; Z: last msg rcvd]

Basic (interactive) Sync • At each re-sync interval • (a) Each node broadcasts its value to all • (b) Each node “locally” collects all inputs • (c) Each node determines its local ref. value from the set of collected values, then computes and adopts its local correction • Repeat → convergence
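Steps (a) to (c) can be sketched as a round-based simulation. Several things are assumptions made for illustration: broadcasts are instantaneous and lossless, no node is faulty, and the local reference value is the median of the collected values (the slide leaves that choice open). The names `resync_round` and `simulate` are hypothetical.

```python
from statistics import median


def resync_round(clocks):
    """Return each node's correction, given all broadcast clock values."""
    collected = list(clocks.values())   # (a) + (b): every node sees every value
    reference = median(collected)       # (c): assumed choice of reference value
    return {node: reference - value for node, value in clocks.items()}


def simulate(clocks, drift, period, rounds):
    """Let clocks drift for `period` seconds, re-sync, repeat.

    Returns the skew observed just before each re-sync.
    """
    skews = []
    for _ in range(rounds):
        clocks = {n: c + period * (1 + drift[n]) for n, c in clocks.items()}
        skews.append(max(clocks.values()) - min(clocks.values()))
        corrections = resync_round(clocks)
        clocks = {n: clocks[n] + corrections[n] for n in clocks}
    return skews
```

Starting from clocks several seconds apart, the skew seen before the second re-sync is only what drift accumulated during one period. With message delays, nodes would no longer collect identical values, so the corrections would differ slightly from node to node.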

Cristian’s Algorithm • Suitable for DS where one host has a WWV receiver • “NIST WWV server” keeping Coordinated Universal Time (UTC) • Architecture • Periodically each host (“client”) asks WWV server for C_UTC • WWV server sends C_UTC to client

Cristian’s Algorithm (cont.) • Simplistic approach • When client gets C_UTC, it just sets its C to that value (?) • Q: what can go wrong? • Time must never run backwards (or zoom forward a gap instantaneously) on a computer • Latencies: takes nonzero time for each request/reply to travel to/from WWV server • Solution • Introduce correction “slowly” • Cristian’s technique: try to measure latencies
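Introducing a correction "slowly" can be sketched as follows: instead of jumping, the clock advances by slightly more or slightly less than one tick per timer interrupt until the offset is absorbed. The tick length of 10 ms and the adjustment of 1 ms per tick are hypothetical values, and the name `slewed_ticks` is invented for this sketch. The clock stays monotonic as long as the adjustment is smaller than the tick.

```python
def slewed_ticks(clock, offset, tick=0.010, adjust=0.001):
    """Absorb `offset` seconds gradually and return the clock value after each tick.

    offset > 0: the clock is behind and must gain time
    offset < 0: the clock is ahead and must lose time
    """
    if adjust >= tick:
        raise ValueError("adjust must be smaller than tick to keep time monotonic")
    remaining = offset
    values = []
    while abs(remaining) > 1e-12:
        step = max(-adjust, min(adjust, remaining))  # bounded share of the offset
        clock += tick + step                         # normal tick plus correction
        remaining -= step
        values.append(clock)
    return values
```

A clock that is 3.5 ms ahead is corrected over four ticks, and every returned value is larger than the one before it, so time never runs backwards.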

Cristian’s Algorithm (cont.) • Compensating for latencies • Client notes • T0 ≡ time request sent • T1 ≡ time reply received • Set C = C_UTC + ((T1 − T0) / 2) • Limitations? • Additional suggestion: take series of measurements and use … • Q: What makes sense here?
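The latency compensation is a one-line computation. The sketch below also returns a worst-case error bound, which follows from the fact that the server read its clock at some point between T0 and T1: the reply is between 0 and T1 − T0 old, so assuming the midpoint is off by at most half the round trip. Client clock drift during the exchange is ignored. The name `cristian_estimate` is hypothetical.

```python
def cristian_estimate(c_utc, t0, t1):
    """Return (estimated current time, worst-case error) for one exchange.

    c_utc: time value carried in the server's reply
    t0:    client clock when the request was sent
    t1:    client clock when the reply was received
    """
    round_trip = t1 - t0
    estimate = c_utc + round_trip / 2   # assume the reply took half the round trip
    max_error = round_trip / 2          # true one-way delay lies in [0, round_trip]
    return estimate, max_error
```

The estimate is exact only if request and reply take equally long, which is one of the limitations the slide asks about.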

Berkeley Algorithm • Berkeley Unix uses active time servers • Algorithm • Time server polls each client periodically • Server computes an average • Server gives client an amount to (gradually) add or subtract from its C • Works with no WWV server • “Correct” time can drift from UTC!
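One polling round of the algorithm can be sketched as follows. The sketch assumes that the server includes its own clock in the average and that polling takes no time. The name `berkeley_round` and the host label `"server"` are hypothetical.

```python
def berkeley_round(server_time, client_times):
    """Return the amount each host should add to its clock (negative: subtract).

    client_times: dict mapping client name -> polled clock value
    """
    readings = {"server": server_time, **client_times}  # assumed: server counts too
    average = sum(readings.values()) / len(readings)
    return {host: average - t for host, t in readings.items()}
```

With the server at 180 and two clients at 205 and 170 (minutes past midnight, say), the average is 185: the server adds 5, the first client subtracts 20, and the second adds 15. No host is consulted about UTC, so the agreed time can still drift away from it.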

Berkeley Algorithm (cont.) real (physical) time or logical time?

Network Time Protocol (NTP) • Synchronizes clocks of hosts and routers in the Internet. • 10-20 million NTP servers and clients deployed in the Internet. Every Windows/XP has an NTP client. • NTP provides nominal accuracies of low tens of milliseconds on WANs, submilliseconds on LANs, and submicroseconds using a precision time source such as a cesium oscillator or GPS receiver. • 231 radio/satellite/modem primary sources • 47 GPS satellite (worldwide), GOES satellite (western hemisphere) • 57 WWVB radio (US) • 17 WWV radio (US) • 63 DCF77 radio (Europe) • 6 MSF radio (UK) • 5 CHU radio (Canada) • 7 modem time service (NIST and USNO (US), PTB (Germany), NPL (UK)) • 25 other (precision PPS sources, etc.) • 1,502 local clock backup sources (used only if all other sources fail)

NTP Summary • Primary (stratum 1) servers synchronize to national time standards via radio, satellite and modem. • Secondary (stratum 2, ...) servers and clients synchronize to primary servers via hierarchical subnet. • Clients and servers operate in master/slave, symmetric and multicast modes with or without cryptographic authentication. • Reliability by redundant servers and diverse network paths. • Engineered algorithms reduce jitter, mitigate multiple sources and avoid improperly operating servers. • The system clock is disciplined in time and frequency using an adaptive algorithm responsive to network time jitter and clock oscillator frequency wander.
