Cloud Computing Fundamentals: Distributed Time and Clocks

Dr. Wen Shiting (文世挺), Ningbo Institute of Technology, Zhejiang University (SL605-2) wensht@nit.zju.edu.cn 15058033236 2014.2.21

Physical Time

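What are timestamps used for in distributed systems? • Measuring system performance accurately • Keeping data up to date and correct • Logically ordering timed events across concurrent programs • Synchronizing message senders and receivers • Coordinating joint activities • Serializing concurrent access to shared objects • …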

Physical time • Solar time • 1 sec = 1 day / 86400 • Problem: days are of different lengths (due to tidal friction, etc.) • Mean solar second: averaged over many days • Greenwich Mean Time (GMT) • The mean solar time at the Royal Observatory in Greenwich, London • Greenwich is located at longitude 0, the line that divides east and west

Coordinated Universal Time (UTC) International Atomic Time (TAI) • 1 second = 9,192,631,770 state transitions of the Cesium-133 atom • TAI counts Cesium-133 transitions since midnight on Jan 1, 1958; dividing the count by 9,192,631,770 gives seconds • Accuracy: better than 1 second in six million years • Problem: atomic clocks do not keep in step with solar time Coordinated Universal Time (UTC) • Based on atomic time (TAI) and introduced on 1 Jan 1972 • A leap second is occasionally inserted or deleted to keep in step with solar time, when the accumulated difference between solar time and atomic time exceeds 800 ms

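Computer Clocks • A quartz oscillator drives the CMOS clock circuit • When the computer is powered off, a battery keeps the CMOS clock circuit running • The clock circuit has two main parts: a counter and a register. Each oscillation of the quartz crystal decrements the counter by 1; when the counter reaches 0 it raises an interrupt and is reloaded from the register. The cycle then repeats. • The operating system (OS) catches the interrupts to maintain the computer clock • e.g., 60 or 100 interrupts per second • Programmable Interrupt Controller (PIC) [Figure: CPU, counter, register]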

Clock drift and clock skew • Clock drift • Clocks tick at different rates • Ordinary quartz clocks drift by ~1 sec in 11-12 days ($10^{-6}$ secs/sec) • High-precision quartz clocks have a drift rate of ~$10^{-7}$ or $10^{-8}$ secs/sec • Drift creates an ever-widening gap in perceived time • Clock skew (offset) • Difference between two clocks at one point in time
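The drift figures above can be checked with a little arithmetic. The sketch below is illustrative only; the function names `days_to_drift` and `max_skew_after` are made up for this example, and the factor of 2 assumes the worst case of two clocks drifting in opposite directions.

```python
SECONDS_PER_DAY = 86_400

def days_to_drift(drift_rate: float, target_error_s: float = 1.0) -> float:
    """Days until a clock drifting at drift_rate (secs/sec) is off by target_error_s."""
    return target_error_s / drift_rate / SECONDS_PER_DAY

def max_skew_after(drift_rate: float, elapsed_s: float) -> float:
    """Worst-case skew between two clocks that drift in opposite directions."""
    return 2 * drift_rate * elapsed_s

# An ordinary quartz clock (1e-6 secs/sec) needs between 11 and 12 days to lose 1 second.
print(round(days_to_drift(1e-6), 1))
# Two such clocks can be up to 7.2 ms apart one hour after being synchronized.
print(max_skew_after(1e-6, 3600))
```

This is also why resynchronization intervals are chosen from the drift rate: to keep skew under a bound, two clocks must resynchronize at least every bound / (2 × drift rate) seconds.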

Perfect clock

Drift with a slow computer clock

Drift with a fast computer clock

Dealing with drift • No good to set a clock backward • Illusion of time moving backwards can confuse message ordering and software development environments • Go for gradual clock correction • If fast: Make clock run slower until it synchronizes • If slow: Make clock run faster until it synchronizes

Linear compensating function • The operating system can adjust the interrupt frequency • e.g., if the system generates an interrupt every 17 ms but the clock is too slow, generate an interrupt every 15 ms (for example) instead • The adjustment changes the slope of system time: linear compensating function
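Gradual correction can be sketched as a software clock that stretches or shrinks each tick slightly until a pending correction is absorbed. This is a simplified model, not how any particular kernel implements it; the class `SlewingClock`, its parameters, and the 10 ms tick are assumptions made for illustration.

```python
class SlewingClock:
    """Software clock that absorbs a correction gradually, never stepping backward."""

    def __init__(self, tick_ms: float = 10.0, max_adjust_ms: float = 1.0):
        # The per-tick adjustment must stay below the tick so time always advances.
        assert 0 < max_adjust_ms < tick_ms
        self.time_ms = 0.0
        self.tick_ms = tick_ms
        self.max_adjust_ms = max_adjust_ms
        self.pending_ms = 0.0  # > 0: clock is behind; < 0: clock is ahead

    def adjust(self, delta_ms: float) -> None:
        """Request a correction of delta_ms, to be spread over future ticks."""
        self.pending_ms += delta_ms

    def on_interrupt(self) -> None:
        """Advance by one tick, plus or minus a bounded slice of the correction."""
        step = max(-self.max_adjust_ms, min(self.max_adjust_ms, self.pending_ms))
        self.pending_ms -= step
        self.time_ms += self.tick_ms + step

clock = SlewingClock()
clock.adjust(-5.0)  # the clock is 5 ms fast, so it runs slower for a while
for _ in range(10):
    clock.on_interrupt()
print(clock.time_ms)  # 95.0 after 10 ticks instead of 100.0
```

Because each tick still adds a positive amount, the local time stays monotonic while the slope of the clock changes.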

Resynchronization • After the synchronization period is reached • Resynchronize periodically, or • Resynchronize when the skew exceeds a threshold • Actively adjusting the time • UNIX adjtime system call: int adjtime(const struct timeval *delta, struct timeval *olddelta) • Adjusts the system's notion of the current time, advancing or retarding it, by the amount of time specified in the struct timeval pointed to by delta. The output parameter olddelta returns the time left uncorrected since the last call of adjtime

Getting UTC from Top Sources • Install a GPS receiver on every computer • Error: ± 1 ms of UTC • Attach a WWV (http://tf.nist.gov) radio receiver • Receives time broadcasts from Boulder or DC • Error: ± 3 ms of UTC (depending on distance) • Install a GOES receiver (Geostationary Operational Environmental Satellites, http://www.goes.noaa.gov/) • Error: ± 0.1 ms of UTC Not practical for every machine: cost, size, convenience, environment

Getting UTC for Client Computers • The client computer synchronizes its time with a time server • One with a more accurate clock, or • One connected to a UTC time source • Also called external clock synchronization

Synchronizing Clocks by using RPC [Figure: the client asks "What's the time?" and the server replies 10:25:18] • Simplest synchronization technique • Make an RPC to obtain time from the server • Set the local clock to the server time • Does not account for network delay

Cristian's algorithm Compensate for network delays (assuming symmetric) • client sends a request at T0 • server replies with the current clock value Tserver • client receives response at T1 • client sets its clock to: $T_{new} = T_{server} + \frac{T_1 - T_0}{2}$

Cristian's algorithm: example • Send request at 5:08:15.100 (T0) • Receive response at 5:08:15.900 (T1) • Response contains 5:09:25.300 (Tserver) • Round-trip time is T1 − T0: 5:08:15.900 − 5:08:15.100 = 800 ms • Best guess: timestamp was generated 400 ms ago • Set the local time to Tserver + round-trip-time/2: 5:09:25.300 + 400 ms = 5:09:25.700 • Accuracy: ± round-trip-time/2 [Figure: client sends at T0, server replies with Tserver, client receives at T1]

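Cristian's algorithm: error bound • Tmin: minimum message travel time • The request reaches the server no earlier than T0 + Tmin and the reply leaves the server no later than T1 − Tmin, so the accuracy is $\pm\left(\frac{T_1 - T_0}{2} - T_{min}\right)$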
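The computation in Cristian's algorithm can be sketched as follows, using the numbers from the example slide. The helper names `hms_to_ms` and `cristian` are hypothetical, and the error bound used is the commonly stated one, ±((T1 − T0)/2 − Tmin), taken here as an assumption.

```python
def hms_to_ms(h: int, m: int, s: int, ms: int) -> int:
    """Convert a time of day to milliseconds since midnight."""
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def cristian(t0_ms: int, t1_ms: int, t_server_ms: int, t_min_ms: int = 0):
    """Return (new local time, error bound) in ms, assuming symmetric delays."""
    rtt = t1_ms - t0_ms
    new_time = t_server_ms + rtt / 2   # server time plus one-way delay estimate
    error = rtt / 2 - t_min_ms         # shrinks when a minimum travel time is known
    return new_time, error

t0 = hms_to_ms(5, 8, 15, 100)
t1 = hms_to_ms(5, 8, 15, 900)
t_server = hms_to_ms(5, 9, 25, 300)

new_time, error = cristian(t0, t1, t_server)
print(new_time == hms_to_ms(5, 9, 25, 700), error)  # True 400.0
```

If the minimum one-way travel time is known to be, say, 100 ms, passing `t_min_ms=100` narrows the bound to ±300 ms without changing the estimate itself.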

Problems with Cristian’s algorithm • Server might fail • Subject to malicious interference

Berkeley Algorithm • Proposed by Gusella & Zatti, 1989 and implemented in the BSD version of UNIX • Aim: synchronize the clocks of a group of machines as closely as possible (also called internal synchronization) • Assumes no machine has an accurate time source (i.e., no differentiation of client and server) • Obtains an average from the participating computers • Synchronizes all clocks to the average

Berkeley Algorithm • One machine is elected (or designated) as the master; the others are slaves • Master polls all slaves periodically, asking for their time • Cristian's algorithm can be used to obtain more accurate clock values from other machines by accounting for network latency • When results are collected, compute the average • Including the master's time • Send each slave the offset by which its clock needs to be adjusted • Avoids problems with network delays by sending an "offset" instead of a "timestamp"

Berkeley Algorithm • The algorithm has provisions for ignoring readings from clocks whose skew is too large • Compute a fault-tolerant average • Any slave can take over as master if the master fails

Berkeley Algorithm: example

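Berkeley Algorithm: example [Figure: four clocks read 3:00, 3:25, 2:50 and 9:10; the outlier 9:10 is ignored, so the average is 3:05] 3. Send offset to each client: +0:05 to the clock at 3:00, -0:20 to the clock at 3:25, +0:15 to the clock at 2:50, -6:05 to the clock at 9:10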
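The averaging step of the Berkeley algorithm can be sketched with the clock values from this example. The slides do not say how outliers are detected, so the rule used here (drop readings more than `max_dev` minutes from the median) is an assumption, as are the function name and the machine labels.

```python
from statistics import median

def berkeley_offsets(readings: dict, max_dev: float) -> dict:
    """Return the offset each machine should apply, given clock readings in minutes.

    Readings further than max_dev from the median are excluded from the
    average (an assumed outlier rule) but still receive an offset.
    """
    med = median(readings.values())
    trusted = [t for t in readings.values() if abs(t - med) <= max_dev]
    average = sum(trusted) / len(trusted)
    return {name: average - t for name, t in readings.items()}

# 3:00, 3:25, 2:50 and 9:10 expressed as minutes since midnight.
readings = {"m0": 180, "m1": 205, "m2": 170, "m3": 550}
print(berkeley_offsets(readings, max_dev=60))
# {'m0': 5.0, 'm1': -20.0, 'm2': 15.0, 'm3': -365.0}
```

The offsets correspond to +0:05, -0:20, +0:15 and -6:05, so every clock ends up at 3:05. In practice a negative offset would be applied by slowing the clock down rather than stepping it backward.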

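Network Time Protocol (NTP) • NTP is a very widely used Internet time protocol, and it is also highly accurate (RFC 1305, http://tf.nist.gov/service/its.htm). • The computer's operating system needs NTP software installed. The client software periodically obtains time updates from one or more servers (and averages them). • The time server listens for NTP on port 123 and responds over UDP/IP with an NTP data packet (which carries a 64-bit timestamp in UTC seconds since Jan 1, 1900 with a resolution of about 200 picoseconds: $2^{-32}$ s ≈ 233 ps). • Many NTP clients for PCs get the time from a single server only (no averaging). Such a client uses SNTP (Simple Network Time Protocol, RFC 2030), a simplified version of NTP.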
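The 64-bit timestamp format can be illustrated with a small conversion sketch: the upper 32 bits hold whole seconds since Jan 1, 1900 and the lower 32 bits hold the fraction of a second. The function names are made up for this example; the constant 2,208,988,800 is the number of seconds between 1900-01-01 and the Unix epoch 1970-01-01. The sketch assumes non-negative Unix times and ignores the wrap-around of the 32-bit seconds field.

```python
NTP_UNIX_DELTA = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01

def ntp_to_unix(ntp64: int) -> float:
    """Convert a 64-bit NTP timestamp to Unix seconds."""
    seconds = ntp64 >> 32             # upper 32 bits: whole seconds since 1900
    fraction = ntp64 & 0xFFFFFFFF     # lower 32 bits: units of 2**-32 s
    return seconds - NTP_UNIX_DELTA + fraction / 2**32

def unix_to_ntp(unix_s: float) -> int:
    """Convert non-negative Unix seconds to a 64-bit NTP timestamp."""
    whole = int(unix_s)
    fraction = int((unix_s - whole) * 2**32)
    return ((whole + NTP_UNIX_DELTA) << 32) | fraction

print(ntp_to_unix(unix_to_ntp(1.5)))  # 1.5
print(1 / 2**32)                      # resolution in seconds, about 2.33e-10
```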

NTP synchronization subnet Stratum 1: machines connected directly to an accurate time source. Stratum 2: machines that synchronize with stratum 1 machines, and so on.

NTP goals • Enable clients across the Internet to synchronize accurately with UTC (despite message delays) • Use statistical techniques to filter data and improve quality of results • Provide reliable service • Survive lengthy losses of connectivity • Redundant paths • Redundant servers • Enable clients to synchronize frequently • Adjustment of clocks by using offset (for symmetric mode) • Provide protection against interference • Authenticate source of data

NTP Synchronization Modes • Multicast (for quick LANs, low accuracy) • server periodically multicasts its time to its clients in the subnet • Remote Procedure Call (medium accuracy) • server responds to client requests with its actual timestamp • like Cristian’s algorithm • Symmetric mode (high accuracy) • used to synchronize between the time servers (peer-peer) All messages delivered unreliably with UDP

Symmetric mode [Figure: server A sends m at $T_{i-3}$; server B receives it at $T_{i-2}$ and sends m' at $T_{i-1}$; server A receives m' at $T_i$] • The delay between the arrival of a request (at server B) and the dispatch of the reply is NOT negligible • Delay = total transmission time of the two messages: $d_i = (T_i - T_{i-3}) - (T_{i-1} - T_{i-2})$ • Estimated offset of clock B relative to clock A (the correction to add to A): $o_i = \frac{(T_{i-2} - T_{i-3}) + (T_{i-1} - T_i)}{2}$ • Set clock A: $T_i + o_i$ • Accuracy bound: $d_i / 2$

Symmetric mode (another expression) [Figure: server A sends m at $T_{i-3}$; server B receives it at $T_{i-2}$ and sends m' at $T_{i-1}$; server A receives m' at $T_i$] • Delay = total transmission time of the two messages: $d_i = (T_i - T_{i-3}) - (T_{i-1} - T_{i-2})$ • Clock A should set its time to the best estimate of B's time at $T_i$: $T_{i-1} + d_i/2$, which is the same as $T_i + o_i$

Symmetric NTP example [Figure: $T_{i-3} = 1100$ and $T_i = 1200$ on server A; $T_{i-2} = 800$ and $T_{i-1} = 850$ on server B] • Offset: $o_i = ((800 - 1100) + (850 - 1200))/2 = -325$ • Set clock A to: $T_i + o_i = 1200 - 325 = 875$ • Note: server A needs to adjust its current clock (1200 ms) by gradually slowing its pace until the -325 ms is compensated.
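The symmetric-mode arithmetic can be checked with a short sketch that takes the four timestamps of one exchange. The function name and argument names are chosen for this example only.

```python
def ntp_offset_delay(t_i3: float, t_i2: float, t_i1: float, t_i: float):
    """Return (offset, delay) for one request/reply exchange.

    t_i3: A sends the request      (A's clock)
    t_i2: B receives the request   (B's clock)
    t_i1: B sends the reply        (B's clock)
    t_i:  A receives the reply     (A's clock)
    """
    delay = (t_i - t_i3) - (t_i1 - t_i2)          # time spent on the network
    offset = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2   # correction to add to A's clock
    return offset, delay

offset, delay = ntp_offset_delay(1100, 800, 850, 1200)
print(offset, delay)                              # -325.0 50
print(1200 + offset == 850 + delay / 2)           # True: both give 875
```

The last line confirms that the two expressions for A's new time, $T_i + o_i$ and $T_{i-1} + d_i/2$, agree on this example.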

Improving accuracy • Data filtering from a single source • Retain the most recent pairs <oi, di> • Filter dispersion: choose the oj corresponding to the smallest dj • Peer selection: synchronize with lower-stratum servers • Lower stratum numbers mean lower synchronization dispersion • The stratum of a server changes dynamically, depending on which server it synchronizes with
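The single-source filtering rule can be sketched as a sliding window of recent <offset, delay> pairs from which the offset with the smallest delay is chosen. The class name and the default window of 8 samples are assumptions for illustration; the slides only say that several recent pairs are retained.

```python
from collections import deque

class OffsetFilter:
    """Keep the most recent (offset, delay) pairs and pick the lowest-delay offset."""

    def __init__(self, window: int = 8):
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, offset: float, delay: float) -> None:
        self.samples.append((offset, delay))

    def best_offset(self) -> float:
        """Offset measured with the smallest delay, i.e. the tightest error bound."""
        offset, _delay = min(self.samples, key=lambda s: s[1])
        return offset

f = OffsetFilter()
f.add(-325, 50)
f.add(-300, 120)
f.add(-330, 40)
print(f.best_offset())  # -330
```

The rationale is that the accuracy bound of each measurement is half its delay, so the sample with the smallest delay is the most trustworthy.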

Simple Network Time Protocol (SNTP) RFC 2030 • Targeted at machines that do not need a full NTP implementation, particularly machines at the end of the synchronization subnet (client nodes) • SNTP operates in one of the following modes: • Unicast mode: the client sends a request to a designated server • Multicast mode: the server periodically broadcasts/multicasts its time to the subnet and does not serve any requests from clients • Anycast mode: the client broadcasts/multicasts a request to the local subnet and takes the first response for time synchronization

Logical Time

Motivation for logical time • Physical clocks cannot be synchronized perfectly in distributed systems [Lamport 1978] • Main function of computer clocks: ordering events • If two processes don't interact, there is no need to sync their clocks • This observation leads to "causality"

Causality • Order events with the happened-before (→) relation • a → b • a could have affected the outcome of b • a || b • a and b take place in different processes that don't exchange data • Their relative ordering does not matter (they are concurrent)

Definition of happened-before Definition of the "→" relationship: • If a and b take place in the same process • and a comes before b, then a → b • If a and b take place in different processes • and a is a "send" and b is the corresponding "receive", then a → b • Transitive: if a → b and b → c, then a → c Partial ordering: unordered events are concurrent

Logical Clocks • A logical clock is a monotonically increasing software counter. It need not relate to a physical clock. • Corrections to a clock must be made by adding, not subtracting • Rule for assigning "time" values to events • if a → b then clock(a) < clock(b)

Event counting example • Three processes: P0, P1, P2, with events a, b, c, … • A local event counter in each process • Processes occasionally communicate with each other, which is where inconsistency occurs, … Bad ordering: e → h, f → k

Lamport's algorithm, 1978 Each process Pi has a logical clock Li. Clock synchronization algorithm: • Li is initialized to 0 • Update Li: • LC1: Li is incremented by 1 for each new event that happens in Pi • LC2: when Pi sends message m, it attaches t = Li to m • LC3: when Pj receives (m, t), it sets Lj := max{Lj, t}, and then applies LC1 to increment Lj for the event receive(m)
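Rules LC1 to LC3 can be sketched as a small class. This is a minimal illustration rather than a complete messaging layer; the class and method names are chosen for this example, and a send is treated as an event that is counted before the timestamp is attached.

```python
class LamportClock:
    """Logical clock following rules LC1-LC3."""

    def __init__(self):
        self.time = 0  # L_i is initialized to 0

    def tick(self) -> int:
        """LC1: increment the clock for a new local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """LC2: a send is an event; return the timestamp t to attach to the message."""
        return self.tick()

    def receive(self, t: int) -> int:
        """LC3: take the max of local and received time, then count the receive event."""
        self.time = max(self.time, t)
        return self.tick()

p, q = LamportClock(), LamportClock()
p.tick()                 # local event in P: L = 1
t = p.send()             # P sends m with t = 2
print(q.receive(t))      # Q receives m: max(0, 2) + 1 = 3
```

The receive always gets a larger timestamp than the matching send, which is exactly the guarantee that a → b implies L(a) < L(b).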

Problem: Identical timestamps Concurrent events (e.g., a, g) may have the same timestamp

Make timestamps unique Append the process ID (or system ID) to the clock value after the decimal point: • e.g. if P1, P2 both have L1 = L2 = 40, make L1 = 40.1, L2 = 40.2
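In code, the same tie-breaking idea is often expressed as a (clock, process ID) pair compared lexicographically rather than as a decimal number. This is a design choice for the sketch below, not something the slides prescribe; it avoids the ambiguity of decimals when process IDs have more than one digit (40.1 and 40.10 are the same number). The event names are hypothetical.

```python
def total_order_key(clock: int, pid: int):
    """Unique, totally ordered timestamp: compare clock first, then process ID."""
    return (clock, pid)

# (event name, Lamport clock value, process ID)
events = [("x", 40, 2), ("y", 40, 1), ("z", 39, 3)]
ordered = sorted(events, key=lambda e: total_order_key(e[1], e[2]))
print([name for name, _, _ in ordered])  # ['z', 'y', 'x']
```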

Problem: Detecting causal relations • If a → b, then L(a) < L(b); however: • If L(a) < L(b), we cannot conclude that a → b • So Lamport timestamps alone cannot detect causality, which limits their usefulness in distributed systems • Solution: use vector clocks • Example: L(g) < L(c), but g || c

Vector of Timestamps Suppose there is a group of people and each needs to keep track of events that happened to the others. Requirement: given two events, you need to tell whether they are sequential or concurrent. Solution: keep a vector of timestamps, one element for each member. [Figure: vectors (?,?,?) and (3,0,0)]

Vector clocks Each process Pi keeps a clock Vi, which is a vector of N integers • Initialization: for 1 ≤ i ≤ N and 1 ≤ k ≤ N, Vi[k] := 0 • Update Vi: • VC1: when there is a new event in Pi, it sets Vi[i] := Vi[i] + 1 • VC2: when Pi sends a message m, it attaches t = Vi to m • VC3: when Pj receives (m, t), for 1 ≤ k ≤ N it sets Vj[k] := max{Vj[k], t[k]}, then applies VC1 to increment Vj[j] for the event receive(m, t) Note: Vi[j] is a timestamp indicating that Pi knows of all events that happened in Pj up to that time.
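Rules VC1 to VC3, together with the comparison that vector timestamps make possible, can be sketched as follows. The names are chosen for this example, processes are numbered from 0 rather than 1, and a send is counted as an event before the vector is attached.

```python
class VectorClock:
    """Vector clock for process `pid` in a system of `n` processes (0-based)."""

    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.v = [0] * n

    def tick(self) -> tuple:
        """VC1: count a new local event; return its timestamp."""
        self.v[self.pid] += 1
        return tuple(self.v)

    def send(self) -> tuple:
        """VC2: a send is an event; return the vector to attach to the message."""
        return self.tick()

    def receive(self, t: tuple) -> tuple:
        """VC3: element-wise max with the received vector, then count the receive."""
        self.v = [max(a, b) for a, b in zip(self.v, t)]
        return self.tick()

def happened_before(a: tuple, b: tuple) -> bool:
    """a -> b iff a <= b in every component and a != b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def concurrent(a: tuple, b: tuple) -> bool:
    """a || b iff neither happened before the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)

p0, p1, p2 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
a = p0.tick()          # (1, 0, 0)
m = p0.send()          # (2, 0, 0)
b = p1.receive(m)      # (2, 1, 0)
g = p2.tick()          # (0, 0, 1)
print(happened_before(a, b), concurrent(g, b))  # True True
```

Unlike Lamport timestamps, the comparison works in both directions: if the vectors are ordered, the events are causally related, and if they are not, the events are concurrent.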

Vector timestamps: example
