The Chubby Lock Service for Loosely-coupled Distributed Systems Mike Burrows, Google Inc Presented by Xin (Joyce) Zhan
Outline • Design • System structure • Locks, caching, failovers • Scaling mechanism • Use and observations • As name service • Failover problems
Lock service for distributed systems • Synchronize access to shared resources • Other uses • Primary election, meta-data storage, name service • Reliability, availability
System Structure • Set of replicas • Periodically elected master • Master lease • Paxos protocol • All client requests are directed to master • updates propagated to replicas • Replace failed replicas • master periodically polls DNS
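The structure above can be sketched as a toy model: replicas elect a master that holds a time-based lease, all client requests are directed to the current master, and failed replicas are swapped for fresh machines. Class and method names are illustrative assumptions, not the real implementation (in particular, the election here just picks a live replica, standing in for a Paxos round).

```python
import time

class Cell:
    """Toy model of a Chubby cell (illustrative, not the real system)."""

    LEASE_SECONDS = 12.0  # hypothetical master-lease length

    def __init__(self, replicas):
        self.replicas = set(replicas)
        self.master = None
        self.lease_expiry = 0.0

    def elect_master(self, now):
        # In Chubby, Paxos elects the master; here we simply pick a
        # live replica deterministically and grant it a lease.
        self.master = sorted(self.replicas)[0]
        self.lease_expiry = now + self.LEASE_SECONDS
        return self.master

    def handle_request(self, now):
        # All client requests go to the master; a re-election happens
        # when the lease has lapsed.
        if self.master is None or now >= self.lease_expiry:
            self.elect_master(now)
        return self.master

    def replace_failed_replica(self, failed, fresh):
        # The master periodically polls DNS for machines and replaces
        # failed replicas with fresh ones.
        self.replicas.discard(failed)
        self.replicas.add(fresh)

cell = Cell({"r1", "r2", "r3"})
print(cell.handle_request(now=0.0))   # first request triggers election
cell.replace_failed_replica("r3", "r4")
print(sorted(cell.replicas))
```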
Design • Store small files • Event notification mechanism • Consistent caching • Advisory locks (vs. mandatory) • conflict only when others attempt to acquire the same lock • Coarse-grained locks • survive lock server failures
Design - File Interface • Ease distribution • /ls/foo/wombat/pouch • Node meta-data includes Access Control Lists • Handles • analogous to UNIX file descriptors • support use across master changes
Design - Sequencers for locks • Delayed / out-of-order messages • introduce sequence numbers into interactions that use locks • lock holder requests a sequencer and passes it to the file server, which validates it • Alternative • lock-delay
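The sequencer mechanism can be sketched as follows: a lock holder requests a sequencer (lock name, mode, and a generation count) and hands it to the file server, which rejects any request carrying a stale one, so a delayed message cannot clobber newer state. All names and the exact sequencer contents here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sequencer:
    lock_name: str
    mode: str        # "exclusive" or "shared"
    generation: int  # incremented on each acquisition of the lock

class LockServer:
    def __init__(self):
        self.generations = {}  # lock name -> current generation

    def acquire(self, lock_name, mode="exclusive"):
        gen = self.generations.get(lock_name, 0) + 1
        self.generations[lock_name] = gen
        return Sequencer(lock_name, mode, gen)

    def is_current(self, seq):
        return self.generations.get(seq.lock_name) == seq.generation

class FileServer:
    """A server that guards writes by validating sequencers."""
    def __init__(self, lock_server):
        self.lock_server = lock_server

    def write(self, seq, data):
        # A delayed message carrying an old sequencer is rejected.
        if not self.lock_server.is_current(seq):
            raise PermissionError("stale sequencer")
        return f"wrote {data}"

locks = LockServer()
fs = FileServer(locks)
old = locks.acquire("/ls/cell/file")   # first holder's sequencer
new = locks.acquire("/ls/cell/file")   # lock re-acquired later
print(fs.write(new, "v2"))             # current sequencer: accepted
try:
    fs.write(old, "v1")                # delayed message: rejected
except PermissionError as e:
    print(e)
```

The lock-delay alternative instead makes the lock unavailable for a period after its holder fails, protecting unmodified servers without requiring them to check sequencers.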
Design - Events • Client subscribes when creating handle • Delivered async via up-call from client library • Event types • file contents modified • child node added / removed / modified • Chubby master failed over • handle / lock have become invalid • lock acquired / conflicting lock request (rarely used)
Design - Caching • Clients cache file data and meta-data • Consistent, write-through • Invalidation • master keeps a list of what clients may have cached • master sends invalidations piggybacked on KeepAlives • clients flush changed data, acknowledge via KeepAlive • master proceeds with the modification only after invalidations complete • Clients also cache open handles and locks
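A minimal sketch of that invalidation flow: the master remembers which clients may hold each file in cache and clears those caches before applying a write (the real system piggybacks the invalidations on KeepAlive replies). Class and method names are illustrative.

```python
class Master:
    def __init__(self):
        self.data = {}      # file path -> contents
        self.cachers = {}   # file path -> set of clients that may cache it

    def read(self, client, path):
        # Record that this client may now hold the file in cache.
        self.cachers.setdefault(path, set()).add(client)
        client.cache[path] = self.data.get(path)
        return client.cache[path]

    def write(self, path, value):
        # Invalidate every client that may hold the file, then proceed.
        for client in self.cachers.pop(path, set()):
            client.cache.pop(path, None)   # client drops its cached copy
        self.data[path] = value

class Client:
    def __init__(self):
        self.cache = {}

m = Master()
a, b = Client(), Client()
m.write("/ls/cell/conf", "v1")
m.read(a, "/ls/cell/conf")
m.read(b, "/ls/cell/conf")
m.write("/ls/cell/conf", "v2")     # both caches invalidated first
print(a.cache, b.cache)            # -> {} {}
print(m.read(a, "/ls/cell/conf"))  # -> v2
```

Because a file is never served from a cache the master has not confirmed to be current, clients see a consistent view without polling.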
Design - Sessions • Session maintained through KeepAlives • handles, locks, and cached data remain valid • lease • Lease timeout advanced • on session creation • when a master fail-over occurs • when the master responds to a KeepAlive RPC
Design - KeepAlive • Master responds close to lease timeout • Client sends another KeepAlive immediately • Client maintains local lease timeout • conservative approximation • When local lease expires • disable cache • session in jeopardy, client waits in grace period • cache enabled on reconnect • Application informed about session changes • Jeopardy/safe/expired event
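The client-side lease logic above can be sketched as a small state machine: the client tracks a conservative local lease timeout; when it expires the cache is disabled and the session is in jeopardy for a grace period, after which it is considered expired. The 45-second grace period follows the paper's default; the class and method names are illustrative.

```python
class SessionClock:
    GRACE = 45.0  # grace period in seconds (the paper's default)

    def __init__(self, lease_end):
        self.lease_end = lease_end   # conservative local approximation
        self.cache_enabled = True

    def on_keepalive_reply(self, new_lease_end):
        # The master responds close to lease timeout; the reply
        # extends the lease and re-enables caching on reconnect.
        self.lease_end = new_lease_end
        self.cache_enabled = True
        return "safe"

    def state(self, now):
        if now < self.lease_end:
            return "safe"
        self.cache_enabled = False   # local lease expired: disable cache
        if now < self.lease_end + self.GRACE:
            return "jeopardy"        # wait out the grace period
        return "expired"

s = SessionClock(lease_end=10.0)
print(s.state(5.0))                 # -> safe
print(s.state(20.0))                # -> jeopardy; cache now disabled
print(s.on_keepalive_reply(60.0))   # reply arrives: safe again
print(s.state(110.0))               # lease + grace both passed -> expired
```

The jeopardy/safe/expired transitions are the events the client library delivers to the application, so it can quiesce during a fail-over instead of treating it as a crash.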
Design - Failovers • In-memory state discarded • sessions, handles, locks, etc. • Lease timer “stops” • Fast master election • clients reconnect before lease expires • Slow master election • clients flush caches, enter grace period • New master reconstructs an approximation of the previous master's in-memory state
Design - Failovers Steps of newly-elected master: • Pick new epoch number • Respond only to master location requests • Build in-memory state for sessions / locks from database • Respond to KeepAlives • Emit fail-over events to sessions, flush caches • Wait for acknowledgements / session expire • Allow all operations to proceed • Allow clients to use handles created before fail-over • Delete ephemeral files w/o open handles after an interval
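Steps 1 and 8 hinge on the epoch number: handles carry the epoch of the master that created them, so a new master can recognize pre-fail-over handles and recreate their in-memory state instead of failing the client. A sketch under illustrative names (the real handle contents and checks differ):

```python
class NewMaster:
    def __init__(self, old_epoch):
        self.epoch = old_epoch + 1   # step 1: pick a new epoch number
        self.known_handles = set()   # in-memory state lost in fail-over

    def create_handle(self, name):
        handle = (name, self.epoch)  # handles are tagged with the epoch
        self.known_handles.add(handle)
        return handle

    def use_handle(self, handle):
        name, epoch = handle
        if epoch < self.epoch:
            # step 8: handle predates the fail-over; recreate its
            # in-memory representation rather than rejecting it.
            self.known_handles.add((name, self.epoch))
            return "recreated"
        if handle in self.known_handles:
            return "ok"
        raise ValueError("unknown handle / delayed packet from old epoch")

m = NewMaster(old_epoch=3)
h_old = ("/ls/cell/f", 3)      # created by the previous master
print(m.use_handle(h_old))     # -> recreated
h_new = m.create_handle("/ls/cell/g")
print(m.use_handle(h_new))     # -> ok
```

The epoch number also lets the new master ignore packets addressed to a previous master, which may still be in flight.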
Design - Backup and Mirroring • Master writes snapshots every few hours • GFS server in different building • Collection of files mirrored across cells • /ls/global/master mirrored to /ls/cell/slave • Mostly for configuration files • Chubby’s own ACLs • Files advertising presence / location • pointers to Bigtable cells
Design - Scaling Mechanisms • 90,000 clients communicate with one cell • Regulate the number of Chubby cells • clients use the nearby cell • Increase lease time • Client caching • Protocol-conversion servers
Scaling - Proxies • Proxies pass requests from clients to a cell • Reduce KeepAlive and read traffic • not writes, but writes are << 1% of the workload • KeepAlive traffic is by far the most dominant • Overheads: • additional RPC for writes / first-time reads • increased probability of unavailability
Scaling - Partitioning • Namespace of a cell partitioned between servers • N partitions, each with a master and replicas • Node D/C stored on partition P(D/C) = hash(D) mod N • meta-data for D may be on a different partition • Little cross-partition communication • Reduces R/W traffic, but not necessarily KeepAlive traffic
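The rule P(D/C) = hash(D) mod N places a node by hashing its parent directory's name, so all children of a directory land on the same partition. A sketch, with CRC32 standing in for an unspecified hash function:

```python
import zlib

N = 4  # number of partitions in the cell (illustrative)

def partition(path):
    # For node D/C, hash the directory part D only.
    directory = path.rsplit("/", 1)[0]
    return zlib.crc32(directory.encode()) % N

# Siblings share a partition; the directory node itself may live on
# another partition, since its placement hashes *its* parent's name.
print(partition("/ls/cell/dir/a") == partition("/ls/cell/dir/b"))  # True
```

Cross-partition traffic arises only for operations that touch both a node and its parent's meta-data, such as ACL checks and directory modifications.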
Use and Observations • Many files used for naming • Config, ACL, and meta-data files common • 10 clients use each cached file, on avg. • Few locks held, no shared locks • KeepAlives dominate RPC traffic
Use as Name Service • DNS uses TTL values • entries must be refreshed within that time • huge (and variable) load on DNS server • Chubby’s caching uses invalidations, no polling • client builds up needed entries in cache • name entries further grouped in batches
Failover problems • Master writes sessions to DB when created • database overloaded when many processes start at once • Instead, record session at first modification / lock acquisition etc. • Active read-only sessions recorded with some probability on each KeepAlive • spreads the writes out in time • a young read-only session may be discarded in a fail-over
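The mitigation above can be sketched as follows: sessions are recorded eagerly on their first modification, while read-only sessions are recorded only with some probability on each KeepAlive, spreading the database writes out in time. The probability value and all names are illustrative assumptions.

```python
import random

RECORD_PROB = 0.1  # hypothetical per-KeepAlive recording probability

class SessionStore:
    def __init__(self, rng=random.random):
        self.recorded = set()
        self.rng = rng

    def on_modify(self, session):
        # Always recorded at first modification / lock acquisition.
        self.recorded.add(session)

    def on_keepalive(self, session):
        # Read-only sessions are recorded lazily and probabilistically.
        if session not in self.recorded and self.rng() < RECORD_PROB:
            self.recorded.add(session)

    def survives_failover(self, session):
        # A young read-only session may not be recorded yet, so it can
        # be (incorrectly) discarded when the master fails over.
        return session in self.recorded

store = SessionStore(rng=lambda: 0.5)  # draw above threshold for the demo
store.on_keepalive("reader")           # not recorded this time
store.on_modify("writer")              # recorded immediately
print(store.survives_failover("reader"), store.survives_failover("writer"))
```

This trades a bounded risk of dropping young read-only sessions for avoiding the write burst when many processes start at once.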
Failover problems • New design – do not record sessions in database • recreate them like handles after fail-over • new master waits full lease time before operations proceed
Lessons learnt • Developers rarely consider availability • should plan for short Chubby outages • Fine-grained locking not essential • Poor API choices • handles acquiring locks cannot be shared • RPC use affects transport protocols • forced to send KeepAlives by UDP for timeliness