This article discusses the naming and networking aspects of distributed computing, including low-level interfaces, translating hostname to IP address, ARP protocol, client-server architecture, replication, and Grapevine services.
Grapevine: An Exercise in Distributed Computing Landon Cox February 16, 2016
Naming other computers • Low-level interface • Provide the destination MAC address • 00:13:20:2E:1B:ED • Middle-level interface • Provide the destination IP address • 152.3.140.183 • High-level interface • Provide the destination hostname • www.cs.duke.edu
Translating hostname to IP addr • Hostname → IP address • Performed by the Domain Name System (DNS) • Used to be a central server • /etc/hosts at SRI • What’s wrong with this approach? • Doesn’t scale to the global Internet
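Today this lookup is exposed through the OS resolver, which consults DNS (or a local /etc/hosts, echoing the slide's history). A minimal sketch in Python:

```python
import socket

def resolve(hostname):
    """Resolve a hostname to an IPv4 address via the system resolver
    (DNS, falling back to /etc/hosts entries)."""
    return socket.gethostbyname(hostname)

# "localhost" is resolved locally, so this works even without a network.
print(resolve("localhost"))
```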
DNS • Centralized naming doesn’t scale • Server has to learn about all changes • Server has to answer all lookups • Instead, split up data • Use a hierarchical database • Hierarchy allows local management of changes • Hierarchy spreads lookup work across many computers
Where is www.wikipedia.org?
Example: linux.cs.duke.edu • nslookup in interactive mode
Translating IP to MAC addrs • IP address → MAC address • Performed by ARP protocol within a LAN • How does a router know the MAC address of 152.3.140.183? • ARP (Address Resolution Protocol) • If it doesn’t know the mapping, broadcast through switch • “Whoever has this IP address, please tell me your MAC address” • Cache the mapping • “/sbin/arp” • Why is broadcasting over a LAN ok? • Number of computers connected to a switch is relatively small
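The broadcast-then-cache cycle above can be sketched as a toy simulation. The class name, hosts, and addresses here are made up for illustration; real ARP is an Ethernet-frame protocol, not Python objects:

```python
# Toy model of ARP on one LAN segment: every host hears the broadcast,
# but only the host owning the IP replies; the requester caches the answer.

class Host:
    def __init__(self, ip, mac):
        self.ip, self.mac = ip, mac
        self.arp_cache = {}          # ip -> mac, like /sbin/arp output

    def who_has(self, ip):
        # Answer an ARP broadcast only if the queried IP is ours.
        return self.mac if ip == self.ip else None

def arp_lookup(requester, lan, target_ip):
    # Check the cache first; otherwise broadcast to every host on the switch.
    if target_ip in requester.arp_cache:
        return requester.arp_cache[target_ip]
    for host in lan:
        mac = host.who_has(target_ip)
        if mac is not None:
            requester.arp_cache[target_ip] = mac   # cache the mapping
            return mac
    return None   # nobody on this LAN owns the address
```

Broadcasting to every host is cheap precisely because, as the slide notes, a switch serves relatively few machines.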
Broadcast on local networks • On a wired Ethernet switch • ARP requests/replies are broadcast • For the most part, IP communication is not broadcast (w/ caveats) • What about on a wireless network? • Everything is broadcast • Means hosts can see all unencrypted traffic • Why might this be dangerous? • Any unencrypted traffic is visible to others • Open WiFi access points + non-SSL web requests and pages • Many sites send cookie credentials in the clear … • Use secure APs and SSL!
High-level network overview • [Diagram: workstations and servers on several Ethernets, joined by gateways]
Client-server • Classic and convenient structure for distributed systems • How do clients and servers differ? • Servers have more physical resources (disk, RAM, etc.) • Servers are trusted by all clients • Why are servers more trustworthy? • Usually have better, more reliable hardware • Servers are better administered (paid staff watch over them) • Servers are kind of like the kernel of a distributed system • Centralized concentration of trust • Support coordinated activity of mutually distrusting clients
Client-server • Why not put everything on one server? • Scalability problems (server becomes overloaded) • Availability problems (server becomes single point of failure) • Want to retain organizational control of some data (some distrust) • How do we address these issues? • Replicate servers • Place multiple copies of server in network • Allow clients to talk to any server with appropriate functionality • What are some drawbacks to replication? • Data consistency (need sensible answers from servers) • Resource discovery (which server should I talk to?)
Client-server • Kernels are centralized too • Subject to availability, scalability problems • Does it make sense to replicate kernels? • Perhaps for multi-core machines • Assign a kernel to each core • Separate address spaces of each kernel • Coordinate actions via message passing • Multi-core starts to look a lot like a distributed system
Grapevine services • Message delivery • Send data to specified users • Access control • Only allow specified users to access name • Resource discovery • Where can I find a printer? • Authentication • How do I know who I am talking to?
Registration servers • What logical data structure is replicated? • The registry • RName → Group entry | Individual entry • What does an RName look like? • Character string F.R • F is a name (individual or group) • R is a registry corresponding to a data partition • At what grain is registration data replicated? • Servers contain copies of whole registries • Individual server unlikely to have copy of all registries
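One hedged way to render the registry's two entry types in code; the Python names and the sample data are illustrative assumptions, not Grapevine's actual representation:

```python
from dataclasses import dataclass, field

# Hypothetical rendering of the two registry entry types.
@dataclass
class Individual:
    authenticator: str                              # password
    inbox_sites: list = field(default_factory=list) # where messages are queued
    connect_site: str = ""                          # where to contact a server

@dataclass
class Group:
    members: set = field(default_factory=set)       # RNames of members

# A registry partitions the namespace: RName "F.R" names F within registry R.
registry = {"ms": {"MailDrop": Group({"server1.ms", "server2.ms"})}}

def lookup(rname):
    f, r = rname.split(".", 1)
    return registry[r][f]
```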
RNames • RName = name.registry • Group → {RName1, …, RNameN} • Individual → Authenticator (password), Inbox sites, Connect site • What two entities are represented by an individual entry? • Users and servers
RNames • How does an individual entry allow communication with a user? • Inbox sites for users
RNames • How does an individual entry allow communication with a server? • Connect site for servers
Namespace • RNames provide a symbolic namespace • Similar to file-system hierarchy or DNS • Autonomous control of names within a registry • What is the most important part of the namespace? • *.gv (for Grapevine) • *.gv is replicated at every registration server • Who gets to define the other registries? • All other registries must have group entry under *.gv • Owners of *.gv have complete control over other registries • In what way do file systems and DNS operate similarly? • ICANN’s root DNS servers decide top-level domains • Root user controls root directory “/”
Resource discovery • How do clients locate server replicas? • Get list of all registries via “gv.gv” • Find registry name for service (e.g., “ms”) • Lookup group ms.gv at registration server • ms.gv returns a list of available servers (e.g., *.ms) • At this point control is transferred to service • Service has autonomous control of its namespace • Service can define its own namespace conventions
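The discovery steps above can be walked through with a toy dict of group entries. Only the gv.gv and ms.gv names follow the scheme described here; the server names are invented:

```python
# Toy walk of the discovery steps: gv.gv lists all registries, and the
# group <service>.gv lists that service's server replicas.
groups = {
    "gv.gv": ["gv", "ms"],                        # list of all registries
    "ms.gv": ["cabernet.ms", "zinfandel.ms"],     # replicas of the mail service
}

def discover(service):
    # Step 1-2: confirm the registry exists via gv.gv.
    assert service in groups["gv.gv"], "unknown registry"
    # Step 3-4: look up the group <service>.gv to get available servers.
    return groups[f"{service}.gv"]

print(discover("ms"))
```

From this point on, control passes to the service, which defines its own naming conventions within its registry.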
Implementing services • Mail servers are replicated • Any message server accepts any delivery request • All message servers can forward to others • An individual may have inboxes on many servers • How does a client find a server to send a message to? • Find well-known name “MailDrop.ms” in *.ms • MailDrop.ms maps to mail servers • Any mail server can accept a message • Mail servers forward message to servers hosting users’ inboxes • Note that the mail service makes “MailDrop.ms” special • Grapevine only defines semantics of *.gv • Grapevine delegates control of semantics of *.ms to mail service • Similar to imap.cs.duke.edu or www.google.com
Resource discovery • Bootstrapping resource discovery • Rely on lower-level methods • Broadcast to name lookup server on Ethernet • Broadcast to registration server on Ethernet • What data does the name lookup server store? • Simple string to Internet address mappings • Infrequently updated (minimal consistency issues) • Well-known GrapevineRServer entry holds addrs of registration servers • What does this remind you of on today’s networks? • Dynamic host configuration protocol (DHCP) • Clients broadcast DHCP request on Ethernet • DHCP server (usually on gateway) responds with IP addr, DNS info
Updating replicated servers • At some point need to update registration database • Want to add new machines • Want to reconfigure server locations • Why not require updates to be atomic at all servers? • Requires that most servers be accessible to even start • All kinds of reasons why this might not be true • Trans-Atlantic phone line might be down • Servers might be offline for maintenance • Servers might be offline due to failure • Instead embrace the chaos of eventual consistency • Might have transient differences between server state • Eventually everything will look the same (probably!)
Updating the database • Information included in timestamps • Time + server address • Timestamps are guaranteed to be unique • Provides a total order on updates from a server • Does the entry itself need a timestamp (a version)? • Not really, can just compute as the max of item timestamps • Entry version is a convenient optimization • [Registration entry: List 1 = Active items {str1|t1, …, strn|tn} and Deleted items {str1|t1, …, strm|tm}; List 2 = same structure]
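Rendered as (time, server-address) pairs, these timestamps order naturally as tuples, which is exactly the total order the slide describes; the concrete values below are made up:

```python
# Timestamps are (time, server address) pairs: lexicographic tuple
# comparison gives a total order, and including the server address
# breaks ties between updates issued at the same instant.
t1 = (1640000000, "152.3.140.183")
t2 = (1640000000, "152.3.140.184")   # same instant, different server
t3 = (1640000001, "152.3.140.183")

assert t1 < t2 < t3

# The entry "version" need not be stored: it is just the max item timestamp.
version = max([t1, t2, t3])
```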
Updating the database • Operations on an entry • Can add/delete items from lists • Can merge lists • Operations update item timestamps, modify list content
Updating the database • How are updates propagated? • Asynchronously via the messaging service (i.e., *.ms) • Does not require all servers to be online • Updates can be buffered and ordered
Updating the database • How fast is convergence? • Registration servers check their inbox every 30 seconds • If all are online, state will converge in ~30 seconds • If a server is offline, may take longer
Updating the database • What happens if two admins update concurrently? • “it is hard to predict which one of them will prevail.” • “acceptable” because admins aren’t talking to each other • Anyone make sense of this?
Updating the database • Why not just use a distributed lock? • What if a replica is offline during acquire, but reappears? • What if lock owner crashes? • What if lock maintainer crashes?
Updating the database • What if clients get different answers from servers? • Clients just have to deal with it (•_•) ( •_•)>⌐■-■ (⌐■_■) • Inconsistencies are guaranteed to be transient • May not be good enough for some applications
Updating the database • What happens if a change message is lost during propagation? • Could lead to permanent inconsistency • Periodic replica comparisons and mergers if needed • Not perfect since partitions can prevent propagation
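The periodic comparison-and-merge can be sketched as a last-writer-wins merge over the active/deleted item lists; this is a simplification under assumed data shapes ({item: timestamp} dicts), not the protocol's real wire format:

```python
# Sketch of the periodic merge: for each item, the newer timestamp wins
# (last writer wins), including across the active/deleted boundary.

def merge_lists(a, b):
    """Merge two {item: timestamp} maps, keeping the newer timestamp."""
    out = dict(a)
    for item, ts in b.items():
        if item not in out or ts > out[item]:
            out[item] = ts
    return out

def merge_entry(e1, e2):
    active = merge_lists(e1["active"], e2["active"])
    deleted = merge_lists(e1["deleted"], e2["deleted"])
    # An item seen both as active and deleted stays wherever its newer
    # timestamp places it.
    for item in set(active) & set(deleted):
        if deleted[item] > active[item]:
            del active[item]
        else:
            del deleted[item]
    return {"active": active, "deleted": deleted}
```

Because the merge is deterministic and commutative, any two replicas that compare state end up identical, which is why periodic comparisons can repair lost change messages.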
Updating the database • What happens if namespace is modified concurrently? • Use timestamps to pick a winner (last writer wins) • Why is this potentially dangerous? • A later update could be trapped on an offline machine • Meanwhile, updates to the first namespace accumulate • When the offline machine comes back online, all of that accumulated work is thrown out
Updating the database • What was the solution? • “Shouldn’t happen in practice.” • Humans should coordinate out-of-band • Probably true, but a little unsatisfying
Why read Grapevine? • Describes many fundamental problems • Performance and availability • Caching and replication • Consistency problems • We still deal with many of these issues
Keeping replicas consistent • Requirement: members of write set agree • Write request only returns if WS members agree • Problem: things fall apart • What do we do if something fails in the middle? • This is why we had multiple replicas in first place • Need agreement protocols that are robust to failures
Two-phase commit • Two phases • Voting phase • Completion phase • During the voting phase • Coordinator proposes value to rest of group • Other replicas tentatively apply update, reply “yes” to coordinator • During the completion phase • Coordinator tallies votes • Success (entire group votes “yes”): coordinator sends “commit” message • Failure (some “no” votes or no reply): coordinator sends “abort” message • On success, group member commits update, sends “ack” to coordinator • On failure, group member aborts update, sends “ack” to coordinator • Coordinator aborts/applies update when all “acks” have been received
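The two phases above can be sketched as a single-process simulation; the class and function names are made up, and this models the message flow rather than real networking or timeouts:

```python
# Toy two-phase commit: voting phase, then completion phase.

class Replica:
    def __init__(self, vote=True):
        self.vote = vote           # how this replica will answer a proposal
        self.value = None          # committed state
        self.tentative = None      # state tentatively applied during voting

    def propose(self, value):
        if self.vote:
            self.tentative = value # tentatively apply the update
        return self.vote

    def commit(self):
        self.value = self.tentative
        return "ack"

    def abort(self):
        self.tentative = None
        return "ack"

def two_phase_commit(value, replicas):
    # Voting phase: coordinator proposes the value and tallies votes.
    votes = [r.propose(value) for r in replicas]
    # Completion phase: commit only if the entire group voted "yes".
    if all(votes):
        acks = [r.commit() for r in replicas]
        return "committed" if len(acks) == len(replicas) else "unknown"
    for r in replicas:             # any "no" vote aborts everywhere
        r.abort()
    return "aborted"
```

A single No vote is enough to abort the whole update, which is what makes the unanimity requirement expensive but safe.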
Two-phase commit • Phase 1 • [Diagram: coordinator and three replicas, no messages yet]
Two-phase commit • Phase 1 • Coordinator sends “Propose: X=1” to all three replicas
Two-phase commit • Phase 1 • Each replica tentatively applies X=1 and replies “Yes”
Two-phase commit • Phase 2 • Coordinator tallies 3 Yes votes
Two-phase commit • Phase 2 • Coordinator sends “Commit: X=1” to all replicas
Two-phase commit • Phase 2 • Replicas commit X=1
Two-phase commit • Phase 2 • Each replica sends “ACK” to the coordinator
Two-phase commit • Phase 1 • What if fewer than 3 Yes votes? • One replica votes No, so the coordinator tallies only 2 Yes votes
Two-phase commit • Phase 1 • What if fewer than 3 Yes votes? • Replicas do not commit • Coordinator sends “Abort: X=1” to all replicas
Two-phase commit • Phase 1 • Why might a replica vote No? • One replica votes No, so the coordinator tallies only 2 Yes votes
Two-phase commit • Phase 1 • Why might a replica vote No? • Might not be able to acquire local write lock • Might be committing w/ another coordinator
Two-phase commit • Phase 2 • What if coord. fails after vote msg, before decision msg? • Replicas will time out and assume update is aborted
Two-phase commit • Phase 2 • What if coord. fails after decision messages are sent? • Replicas commit update