Cloud Computing Fundamentals: Distributed Time and Clocks. Dr. Wen Shiting (文世挺), Ningbo Institute of Technology, Zhejiang University (SL605-2) wensht@nit.zju.edu.cn 15058033236 2014.2.21
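What are timestamps used for in distributed systems? • Measuring system performance accurately • Keeping data “up-to-date” and correct • Logically ordering timed events in concurrent programs • Synchronizing message senders and receivers • Coordinating joint activities • Serializing concurrent access to shared objects • ……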
Physical time • Solar time • 1 sec = 1 day / 86400 • Problem: days are of different lengths (due to tidal friction, etc.) • Mean solar second: averaged over many days • Greenwich Mean Time (GMT) • The mean solar time at the Royal Observatory in Greenwich, London • Greenwich is located at longitude 0, the line that divides east and west
Coordinated Universal Time (UTC) • International Atomic Time (TAI) • 1 second = 9,192,631,770 state transitions of the Cesium-133 atom • TAI is the number of Cesium-133 transitions since midnight on Jan 1, 1958, divided by 9,192,631,770 • Accuracy: better than 1 second in six million years • Problem: atomic clocks do not keep in step with solar time • Coordinated Universal Time (UTC) • Based on atomic time (TAI) and introduced on 1 Jan 1972 • A leap second is occasionally inserted or deleted to keep UTC in step with solar time, when the accumulated difference between solar time and atomic time exceeds 800 ms
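Computer Clocks • A quartz oscillator drives the CMOS clock circuit • When the computer is powered off, a battery keeps the CMOS clock circuit running • The clock circuit has two main parts: a counter and a register. Each oscillation of the quartz crystal decrements the counter by 1; when the counter reaches 0 it raises an interrupt and is reloaded from the register. This cycle repeats indefinitely • The operating system (OS) handles the interrupts to maintain the computer clock • e.g., 60 or 100 interrupts per second • Programmable Interrupt Controller (PIC) [Figure: CPU, counter, register]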
Clock drift and clock skew • Clock drift • Clocks tick at different rates • Ordinary quartz clocks drift by ~1 sec in 11-12 days ($10^{-6}$ secs/sec) • High-precision quartz clocks have a drift rate of ~$10^{-7}$ or $10^{-8}$ secs/sec • Drift creates an ever-widening gap in perceived time • Clock skew (offset) • Difference between two clocks at one point in time
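The drift figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the resynchronization bound (two clocks drifting in opposite directions diverge at twice the drift rate) is a common worst-case assumption that the slide itself does not state.

```python
SECONDS_PER_DAY = 86_400

def days_to_drift(error_s, drift_rate):
    """Days until a clock with the given drift rate (secs/sec) is off by error_s."""
    return error_s / drift_rate / SECONDS_PER_DAY

def max_resync_interval(max_skew_s, drift_rate):
    """Seconds between resynchronizations that keep two clocks within max_skew_s.

    Assumption (not from the slide): both clocks drift at drift_rate in
    opposite directions, so they diverge at 2 * drift_rate secs/sec.
    """
    return max_skew_s / (2 * drift_rate)

# Ordinary quartz clock at 1e-6 secs/sec: about 11.6 days to drift by 1 second.
print(round(days_to_drift(1.0, 1e-6), 1))
```

With the same drift rate, keeping two such clocks within 1 ms of each other would need a resynchronization roughly every 500 seconds under that worst-case assumption.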
Dealing with drift • No good to set a clock backward • Illusion of time moving backwards can confuse message ordering and software development environments • Go for gradual clock correction • If fast: Make clock run slower until it synchronizes • If slow: Make clock run faster until it synchronizes
Linear compensating function • The OS can adjust the rate at which clock interrupts are generated • e.g.: if the system generates an interrupt every 17 ms but the clock is too slow, generate an interrupt every (e.g.) 15 ms instead • The adjustment changes the slope of system time: a linear compensating function
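One way to picture gradual correction is as a list of per-interrupt increments applied to a software clock. The following is a minimal sketch, not how any particular OS implements it: `slew_increments` and its parameters are hypothetical names, and it adjusts the amount of time added per interrupt rather than the interrupt period itself.

```python
def slew_increments(offset_ms, nominal_tick_ms, max_adjust_ms):
    """Return the time added to the software clock at each interrupt
    until offset_ms has been absorbed.

    offset_ms > 0: the clock is behind, so each tick adds a bit more.
    offset_ms < 0: the clock is ahead, so each tick adds a bit less.
    Every increment stays positive, so time never moves backward.
    """
    if not 0 < max_adjust_ms < nominal_tick_ms:
        raise ValueError("max_adjust_ms must be in (0, nominal_tick_ms)")
    increments = []
    remaining = offset_ms
    while abs(remaining) > 1e-9:
        # Clamp the per-tick correction to +/- max_adjust_ms.
        step = max(-max_adjust_ms, min(max_adjust_ms, remaining))
        increments.append(nominal_tick_ms + step)
        remaining -= step
    return increments

# A clock that is 5 ms fast, 10 ms nominal tick, at most 2 ms correction per tick.
print(slew_increments(-5, 10, 2))
```

Because the correction per tick is bounded, a large offset simply takes more ticks to absorb; the clock keeps advancing monotonically the whole time.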
Resynchronization • After the synchronization period is reached • Resynchronize periodically, or • Resynchronize when the skew exceeds a threshold • Adjusting the time actively • UNIX adjtime system call: int adjtime(const struct timeval *delta, struct timeval *olddelta) • Adjusts the system's notion of the current time, advancing or retarding it gradually by the amount of time specified in the struct timeval pointed to by delta. The output parameter olddelta returns the amount of time still left uncorrected from any earlier call to adjtime
Getting UTC from Top Sources • Install a GPS receiver on every computer • Error: ± 1 ms of UTC • Attach a WWV (http://tf.nist.gov) radio receiver • Receives time broadcasts from Boulder or DC • Error: ± 3 ms of UTC (depending on distance) • Install a GOES receiver (Geostationary Operational Environmental Satellites, http://www.goes.noaa.gov/) • Error: ± 0.1 ms of UTC • Not practical for every machine: cost, size, convenience, environment
Getting UTC for Client Computers • The client computer synchronizes its time with a time server • that has a more accurate clock, or • that is connected to a UTC time source • Also called external clock synchronization
Synchronizing Clocks by using RPC • Simplest synchronization technique • Make an RPC to obtain the time from the server • Set the local clock to the server time • Does not account for network delay [Figure: the client asks the server “What’s the time?”; the server replies 10:25:18]
Cristian’s algorithm • Compensate for network delays (assuming they are symmetric) • client sends a request at $T_0$ • server replies with the current clock value $T_{server}$ • client receives the response at $T_1$ • client sets its clock to: $T_{new} = T_{server} + \frac{T_1 - T_0}{2}$
Cristian’s algorithm: example • Send request at 5:08:15.100 ($T_0$) • Receive response at 5:08:15.900 ($T_1$) • Response contains 5:09:25.300 ($T_{server}$) • Round-trip time is $T_1 - T_0$: 5:08:15.900 - 5:08:15.100 = 800 ms • Best guess: the timestamp was generated 400 ms ago • Set the local time to $T_{server}$ + round-trip-time/2: 5:09:25.300 + 400 ms = 5:09:25.700 • Accuracy: ± round-trip-time/2 [Figure: client sends at $T_0$, server replies with $T_{server}$, client receives at $T_1$]
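Cristian’s algorithm: error bound • $T_{min}$: minimum message travel time • The request cannot reach the server before $T_0 + T_{min}$, and the reply cannot leave it after $T_1 - T_{min}$ • So the server timestamp was generated within a window of width $T_1 - T_0 - 2T_{min}$ • Accuracy: $\pm\left(\frac{T_1 - T_0}{2} - T_{min}\right)$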
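The calculation in Cristian’s algorithm fits in a few lines. This is a minimal sketch with hypothetical names (`cristian_estimate`, `to_ms`); times are plain milliseconds since midnight, and the error bound uses the usual form round-trip/2 minus the minimum one-way travel time, which defaults to 0 when that minimum is unknown.

```python
def to_ms(hours, minutes, seconds, millis):
    """Convert a time of day to milliseconds since midnight."""
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis

def cristian_estimate(t0, t1, t_server, t_min=0.0):
    """Return (new_local_time, error_bound) for one request/response pair.

    t0, t1: local send and receive times; t_server: time in the reply.
    Assumes the network delay is symmetric.
    """
    round_trip = t1 - t0
    new_time = t_server + round_trip / 2
    error_bound = round_trip / 2 - t_min
    return new_time, error_bound

t0 = to_ms(5, 8, 15, 100)
t1 = to_ms(5, 8, 15, 900)
t_server = to_ms(5, 9, 25, 300)
new_time, bound = cristian_estimate(t0, t1, t_server)
print(new_time == to_ms(5, 9, 25, 700), bound)  # True 400.0
```

In practice a client would not jump to `new_time` if that moved the clock backward; it would slew toward it gradually.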
Problems with Cristian’s algorithm • Server might fail • Subject to malicious interference
Berkeley Algorithm • Proposed by Gusella & Zatti, 1989, and implemented in the BSD version of UNIX • Aim: synchronize the clocks of a group of machines as closely as possible (also called internal synchronization) • Assumes no machine has an accurate time source (i.e., no differentiation of client and server) • Obtains an average from the participating computers • Synchronizes all clocks to the average
Berkeley Algorithm • One machine is elected (or designated) as the master; the others are slaves: • Master polls all slaves periodically, asking for their time • Cristian’s algorithm can be used to obtain more accurate clock values from the other machines by accounting for network latency • When the results are collected, compute the average • Including the master’s time • Send each slave the offset by which its clock needs to be adjusted • Sending an “offset” instead of a “timestamp” avoids problems with network delays
Berkeley Algorithm • The algorithm has provisions for ignoring readings from clocks whose skew is too large • Compute a fault-tolerant average • Any slave can take over as master if the master fails
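Berkeley Algorithm: example • Step 3: send the offset to each client • Clock readings: 3:00, 3:25, 2:50, 9:10 • Offsets sent: 3:00 → +0:05, 3:25 → -0:20, 2:50 → +0:15, 9:10 → -6:05 • Every clock ends up at 3:05, the average of 3:00, 3:25 and 2:50 (the 9:10 reading is left out of the average)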
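The averaging step can be sketched as follows. This is an illustration under stated assumptions, not the BSD implementation: `berkeley_offsets` is a hypothetical name, clock values are minutes since midnight, and readings are treated as outliers when they lie more than `max_skew` from the median, which is just one possible way to build a fault-tolerant average.

```python
import statistics

def berkeley_offsets(readings, max_skew):
    """Map each machine to the offset it should apply to its clock.

    readings: dict of machine name -> clock value (master included).
    Readings farther than max_skew from the median are excluded from the
    average (assumed outlier rule) but still receive an offset.
    """
    median = statistics.median(readings.values())
    trusted = [t for t in readings.values() if abs(t - median) <= max_skew]
    average = sum(trusted) / len(trusted)
    return {name: average - t for name, t in readings.items()}

# Readings from the example slide, in minutes: 3:00, 3:25, 2:50, 9:10.
readings = {"m0": 180, "m1": 205, "m2": 170, "m3": 550}
print(berkeley_offsets(readings, max_skew=60))
```

With the slide’s numbers the offsets come out as +5, -20, +15 and -365 minutes, i.e. +0:05, -0:20, +0:15 and -6:05.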
Network Time Protocol (NTP) • NTP is a very widely used Internet time protocol, and it is highly accurate (RFC 1305, http://tf.nist.gov/service/its.htm). • The operating system needs NTP client software installed. The client periodically obtains time updates from one or more servers (and averages them) • The time server listens for NTP requests on port 123 and replies over UDP/IP with an NTP data packet (whose timestamps are 64-bit values counting UTC seconds since Jan 1, 1900, with a resolution of about 200 picoseconds). • Many NTP clients for PCs get the time from only a single server (no averaging). Such a client implements SNTP (Simple Network Time Protocol, RFC 2030), a simplified version of NTP.
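The 64-bit timestamp format can be unpacked with integer arithmetic. The sketch below assumes the usual layout (upper 32 bits are whole seconds since Jan 1, 1900, lower 32 bits are a binary fraction of a second); `ntp_to_unix` is a hypothetical helper name, and 2,208,988,800 is the number of seconds between the 1900 and 1970 epochs.

```python
NTP_UNIX_DELTA = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01

def ntp_to_unix(timestamp64):
    """Convert a 64-bit NTP timestamp to Unix time in seconds (float)."""
    seconds = timestamp64 >> 32            # upper 32 bits: whole seconds
    fraction = timestamp64 & 0xFFFFFFFF    # lower 32 bits: fraction of a second
    return (seconds - NTP_UNIX_DELTA) + fraction / 2**32

# Resolution of the fractional part: 2**-32 s, a little over 200 picoseconds.
RESOLUTION_S = 2**-32

# Half a second after the Unix epoch.
print(ntp_to_unix((NTP_UNIX_DELTA << 32) | (1 << 31)))  # 0.5
```

A Python float cannot hold picosecond precision for present-day timestamps, so code that needs the full resolution would keep the two 32-bit fields as integers.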
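NTP synchronization subnet • First level (stratum 1): machines connected directly to an accurate time source. • Second level (stratum 2): machines that synchronize with stratum-1 machines. • ...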
NTP goals • Enable clients across the Internet to synchronize accurately with UTC (despite message delays) • Use statistical techniques to filter data and improve quality of results • Provide reliable service • Survive lengthy losses of connectivity • Redundant paths • Redundant servers • Enable clients to synchronize frequently • Adjustment of clocks by using offset (for symmetric mode) • Provide protection against interference • Authenticate source of data
NTP Synchronization Modes • Multicast (for quick LANs, low accuracy) • server periodically multicasts its time to its clients in the subnet • Remote Procedure Call (medium accuracy) • server responds to client requests with its actual timestamp • like Cristian’s algorithm • Symmetric mode (high accuracy) • used to synchronize between the time servers (peer-peer) All messages delivered unreliably with UDP
Symmetric mode • The delay between the arrival of a request (at server B) and the dispatch of the reply is NOT negligible • Delay = total transmission time of the two messages: $d_i = (T_i - T_{i-3}) - (T_{i-1} - T_{i-2})$ • Offset to add to clock A (estimate of B’s clock minus A’s clock): $o_i = \frac{(T_{i-2} - T_{i-3}) + (T_{i-1} - T_i)}{2}$ • Set clock A to: $T_i + o_i$ • Accuracy bound: $d_i/2$ [Figure: Server A sends m at $T_{i-3}$; Server B receives it at $T_{i-2}$ and sends m’ at $T_{i-1}$; Server A receives m’ at $T_i$]
Symmetric mode (another expression) • Delay = total transmission time of the two messages: $d_i = (T_i - T_{i-3}) - (T_{i-1} - T_{i-2})$ • Clock A should set its time to the best estimate of B’s time at $T_i$: $T_{i-1} + d_i/2$, which is the same as $T_i + o_i$ [Figure: Server A sends m at $T_{i-3}$; Server B receives it at $T_{i-2}$ and sends m’ at $T_{i-1}$; Server A receives m’ at $T_i$]
Symmetric NTP example • $T_{i-3}$ = 1100, $T_{i-2}$ = 800, $T_{i-1}$ = 850, $T_i$ = 1200 • Offset $o_i$ = ((800 - 1100) + (850 - 1200))/2 = -325 • Set clock A to: $T_i + o_i$ = 1200 - 325 = 875 • Note: Server A adjusts its current clock (1200 ms) by gradually slowing its pace until the -325 ms offset is compensated.
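The symmetric-mode arithmetic can be checked with a short function. This is a sketch with a hypothetical name (`ntp_offset_delay`); the four arguments follow the slide’s timestamps in the order A sends, B receives, B sends, A receives.

```python
def ntp_offset_delay(t_i3, t_i2, t_i1, t_i):
    """Return (offset, delay) for one message exchange.

    t_i3: A sends m      t_i2: B receives m
    t_i1: B sends m'     t_i:  A receives m'
    offset is the amount to add to A's clock; delay is the total
    transmission time of the two messages.
    """
    delay = (t_i - t_i3) - (t_i1 - t_i2)
    offset = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2
    return offset, delay

offset, delay = ntp_offset_delay(1100, 800, 850, 1200)
print(offset, delay)               # -325.0 50
print(1200 + offset)               # 875.0, A's corrected time
print(850 + delay / 2)             # 875.0, the equivalent expression
```

Both expressions give 875 for the example, and the accuracy bound is delay/2 = 25.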
Improving accuracy • Data filtering from a single source • Retain several of the most recent pairs <oi, di> • Filter dispersion: choose the oj corresponding to the smallest dj • Peer selection: synchronize with lower-stratum servers • Lower stratum numbers mean lower synchronization dispersion • The stratum of a server changes dynamically, depending on which server it synchronizes with
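The single-source filter amounts to keeping a short window of <offset, delay> pairs and trusting the one with the smallest delay. The class below is a minimal sketch: `OffsetFilter` is a hypothetical name and the window size of 8 is an assumed default, not something the slide specifies.

```python
from collections import deque

class OffsetFilter:
    """Keep the most recent (offset, delay) pairs and pick the best offset."""

    def __init__(self, window=8):
        # deque with maxlen drops the oldest pair automatically.
        self.samples = deque(maxlen=window)

    def add(self, offset, delay):
        self.samples.append((offset, delay))

    def best_offset(self):
        """Offset measured with the smallest delay, or None if empty."""
        if not self.samples:
            return None
        return min(self.samples, key=lambda pair: pair[1])[0]

f = OffsetFilter(window=3)
for offset, delay in [(-325, 50), (-310, 20), (-400, 90)]:
    f.add(offset, delay)
print(f.best_offset())  # -310
```

The reasoning is that a small round-trip delay leaves less room for asymmetry between the two directions, so the matching offset has the tightest error bound.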
Simple Network Time Protocol (SNTP), RFC 2030 • Targeted at machines that do not need a full NTP implementation, particularly machines at the end of the synchronization subnet (client nodes) • SNTP operates in one of the following modes: • Unicast mode: the client sends a request to a designated server • Multicast mode: the server periodically broadcasts/multicasts its time to the subnet and does not serve any requests from clients • Anycast mode: the client broadcasts/multicasts a request to the local subnet and uses the first response for time synchronization
Motivation for logical time • Physical clocks cannot be synchronized perfectly in distributed systems. [Lamport 1978] • Main function of computer clocks: ordering events • If two processes don’t interact, there is no need to sync their clocks. • This observation leads to “causality”
Causality • Order events with the happened-before (→) relation • a → b • a could have affected the outcome of b • a || b • a and b take place in different processes that don’t exchange data • Their relative ordering does not matter (they are concurrent)
Definition of happened-before • Definition of the “→” relation: • If a and b take place in the same process • and a comes before b, then a → b • If a and b take place in different processes • and a is a “send” and b is the corresponding “receive”, then a → b • Transitive: if a → b and b → c, then a → c • Partial ordering: events not ordered by → are concurrent
Logical Clocks • A logical clock is a monotonically increasing software counter. It need not relate to a physical clock. • Corrections to a clock must be made by adding, not subtracting • Rule for assigning “time” values to events • if a → b then clock(a) < clock(b)
Event counting example • Three processes: P0, P1, P2; events a, b, c, … • A local event counter in each process • Processes occasionally communicate with each other, and that is where inconsistencies appear • Bad ordering: e → h and f → k, but the local counts do not reflect it
Lamport’s algorithm, 1978 • Each process Pi has a logical clock Li. Clock synchronization algorithm: • Li is initialized to 0 • Update Li: • LC1: Li is incremented by 1 for each new event that happens in Pi • LC2: when Pi sends message m, it attaches t = Li to m • LC3: when Pj receives (m, t), it sets Lj := max{Lj, t}, and then applies LC1 to increment Lj for the event receive(m)
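Rules LC1 to LC3 translate almost directly into code. The class below is a minimal single-threaded sketch; `LamportClock` and its method names are hypothetical, and sending is treated as an event, so LC1 is applied before the timestamp is attached.

```python
class LamportClock:
    """Logical clock following rules LC1-LC3."""

    def __init__(self):
        self.time = 0  # Li is initialized to 0

    def local_event(self):
        """LC1: increment the clock for a new local event."""
        self.time += 1
        return self.time

    def send(self):
        """LC2: sending is an event (LC1); return the timestamp t to attach."""
        return self.local_event()

    def receive(self, t):
        """LC3: take the max of local and received time, then apply LC1."""
        self.time = max(self.time, t)
        return self.local_event()

p, q = LamportClock(), LamportClock()
p.local_event()          # p: 1
t = p.send()             # p: 2, message carries t = 2
print(q.receive(t))      # 3, even though q had seen no events before
```

The receive event is always stamped later than the matching send, which is exactly the clock condition: if a → b then L(a) < L(b).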
Problem: Identical timestamps Concurrent events (e.g., a, g) may have the same timestamp
Make timestamps unique Append the process ID (or system ID) to the clock value after the decimal point: • e.g. if P1, P2 both have L1 = L2 = 40, make L1 = 40.1, L2 = 40.2
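In code, a (clock value, process ID) pair does the same job as the decimal suffix and avoids one pitfall of the decimal notation. The helper name `unique_timestamp` is hypothetical; the point is only that tuples compare by clock value first and break ties by process ID.

```python
def unique_timestamp(lamport_time, process_id):
    """Pair the logical clock value with the process ID to break ties."""
    return (lamport_time, process_id)

events = [
    unique_timestamp(40, 2),
    unique_timestamp(40, 1),
    unique_timestamp(39, 3),
]
ordered = sorted(events)
print(ordered)  # [(39, 3), (40, 1), (40, 2)]
```

A literal decimal encoding would make process 1 and process 10 collide (40.1 and 40.10 are the same number), which is why the pair form is used here.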
Problem: Detecting causal relations • If a → b, then L(a) < L(b); however: • If L(a) < L(b), we cannot conclude that a → b • So Lamport timestamps alone cannot tell us whether two events are causally related or concurrent • Example: L(g) < L(c), but g || c • Solution: use vector clocks
Vector of Timestamps • Suppose there is a group of people, and each one needs to keep track of the events that have happened to the others. • Requirement: given two events, you need to tell whether they are sequential or concurrent. • Solution: keep a vector of timestamps, one element for each member. [Figure: example vectors (?,?,?) and (3,0,0)]
Vector clocks • Each process Pi keeps a clock Vi, which is a vector of N integers • Initialization: for 1 ≤ i ≤ N and 1 ≤ k ≤ N, Vi[k] := 0 • Update Vi: • VC1: when there is a new event in Pi, it sets Vi[i] := Vi[i] + 1 • VC2: when Pi sends a message m, it attaches t = Vi to m • VC3: when Pj receives (m, t), for 1 ≤ k ≤ N it sets Vj[k] := max{Vj[k], t[k]}, then applies VC1 to increment Vj[j] for the event receive(m, t) • Note: Vi[j] is a timestamp indicating that Pi knows of all events that happened in Pj up to that time.
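Rules VC1 to VC3 can be sketched in the same style as a Lamport clock. The names below are hypothetical and process indices are 0-based, unlike the 1-based indices on the slide. The comparison helpers use the standard rule for vector timestamps, which this section does not spell out: u happened before v when every component of u is less than or equal to the matching component of v and the vectors differ.

```python
class VectorClock:
    """Vector clock for process `index` in a group of `n` processes."""

    def __init__(self, n, index):
        self.v = [0] * n
        self.index = index

    def local_event(self):
        """VC1: increment this process's own component."""
        self.v[self.index] += 1
        return list(self.v)

    def send(self):
        """VC2: sending is an event; return a copy of the vector to attach."""
        return self.local_event()

    def receive(self, t):
        """VC3: component-wise max with the received vector, then VC1."""
        self.v = [max(a, b) for a, b in zip(self.v, t)]
        return self.local_event()

def happened_before(u, v):
    """Standard rule (assumed): u <= v component-wise and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def concurrent(u, v):
    return u != v and not happened_before(u, v) and not happened_before(v, u)

p0, p1, p2 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
a = p0.local_event()      # [1, 0, 0]
m = p0.send()             # [2, 0, 0]
b = p1.receive(m)         # [2, 1, 0]
c = p2.local_event()      # [0, 0, 1]
print(happened_before(a, b), concurrent(b, c))  # True True
```

Unlike Lamport timestamps, comparing two vectors answers both questions: whether one event causally precedes the other, and whether they are concurrent.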