15-440 Distributed Systems Lecture 1 – Introduction to Distributed Systems
What Is A Distributed System? “A collection of independent computers that appears to its users as a single coherent system.” • Features: • No shared memory – message-based communication • Each runs its own local OS • Heterogeneity • Expandability • Ideal: to present a single-system image: • The distributed system “looks like” a single computer rather than a collection of separate computers.
Definition of a Distributed System Figure 1-1. A distributed system organized as middleware. The middleware layer runs on all machines, and offers a uniform interface to the system
Distributed Systems: Goals • Resource Availability: remote access to resources • Distribution Transparency: single system image • Access, Location, Migration, Replication, Failure,… • Openness: services according to standards (RPC) • Scalability: size, geographic, admin domains, … • Example of a Distributed System? • Web search on Google • DNS: decentralized, scalable, robust to failures, ... • ...
15-440 Distributed Systems Lecture 2 & 3 – 15-441 in 2 Days
Packet Switching – Statistical Multiplexing • Switches arbitrate between inputs • Can send from any input that’s ready • Links are never idle when there is traffic to send • (Efficiency!)
Model of a communication channel • Latency - how long does it take for the first bit to reach the destination? • Capacity - how many bits/sec can we push through? (often termed “bandwidth”) • Jitter - how much variation in latency? • Loss / Reliability - can the channel drop packets? • Reordering - can packets arrive in a different order than they were sent?
Packet Switching • Source sends information as self-contained packets that have an address • Source may have to break up a single message into multiple packets • Each packet travels independently to the destination host • Switches use the address in the packet to determine how to forward the packets • Store and forward • Analogy: a letter in surface mail
Internet • An inter-net: a network of networks • Networks are connected using routers that support communication in a hierarchical fashion • Often need other special devices at the boundaries for security, accounting, etc. • The Internet: the interconnected set of networks of the Internet Service Providers (ISPs) • About 17,000 different networks make up the Internet
Network Service Model • What is the service model for an inter-network? • Defines what promises the network gives for any transmission • Defines what types of failures to expect • Ethernet/Internet: best-effort – packets can get lost, etc.
Possible Failure models • Fail-stop: • When something goes wrong, the process stops / crashes / etc. • Fail-slow or fail-stutter: • Performance may vary on failures as well • Byzantine: • Anything that can go wrong, will. • Including malicious entities taking over your computers and making them do whatever they want. • These models are useful for proving things; • The real world typically has a bit of everything. • Deciding which model to use is important!
What is Layering? • Modular approach to network functionality • Example stack (top to bottom): Application; Application-to-application channels; Host-to-host connectivity; Link hardware
IP Layering • Relatively simple • Layers, top to bottom: Application, Transport, Network, Link, Physical • Hosts implement all five layers; a bridge/switch operates only up through the link layer, and a router/gateway only up through the network layer
Protocol Demultiplexing • Multiple choices at each layer • A demux key in each header selects the protocol above: the link-layer Type field picks the network protocol (IP vs. IPX), the IP Protocol field picks the transport (TCP vs. UDP), and the TCP/UDP port number picks the application (FTP, HTTP, NV, TFTP) • IP itself runs over many underlying networks (NET1, NET2, …, NETn)
Goals [Clark88] • Connect existing networks: initially the ARPANET and the ARPA packet radio network • Survivability: ensure communication service even in the presence of network and router failures • Support multiple types of services • Must accommodate a variety of networks • Allow distributed management • Allow host attachment with a low level of effort • Be cost effective • Allow resource accountability
Goal 1: Survivability • If network is disrupted and reconfigured… • Communicating entities should not care! • No higher-level state reconfiguration • How to achieve such reliability? • Where can communication state be stored?
CIDR IP Address Allocation • The provider is given the block 201.10.0.0/21 and sub-allocates it to customers: 201.10.0.0/22, 201.10.4.0/24, 201.10.5.0/24, 201.10.6.0/23
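To make prefix matching concrete, here is a minimal Go sketch using the standard net package with the prefixes from the slide (the host address 201.10.5.77 is made up for illustration; a real router would pick the longest matching prefix):

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // The customer sub-allocations of the provider's 201.10.0.0/21 block.
        prefixes := []string{"201.10.0.0/22", "201.10.4.0/24", "201.10.5.0/24", "201.10.6.0/23"}
        addr := net.ParseIP("201.10.5.77") // hypothetical host address

        for _, p := range prefixes {
            _, ipnet, err := net.ParseCIDR(p)
            if err != nil {
                panic(err)
            }
            if ipnet.Contains(addr) {
                fmt.Printf("%s matches %s\n", addr, p) // longest match wins in a real forwarding table
            }
        }
    }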
Ethernet Frame Structure (cont.) • Addresses: 6 bytes • Each adapter is given a globally unique address at manufacturing time • Address space is allocated to manufacturers • 24 bits identify the manufacturer • E.g., 0:0:15:* => a 3Com adapter • A frame is received by all adapters on a LAN and dropped if the address does not match • Special addresses • Broadcast – FF:FF:FF:FF:FF:FF is “everybody” • A range of addresses is allocated to multicast • The adapter maintains a list of multicast groups the node is interested in
End-to-End Argument • Deals with where to place functionality • Inside the network (in switching elements) • At the edges • Argument • If you have to implement a function end-to-end anyway (e.g., because it requires the knowledge and help of the end-point host or application), don’t implement it inside the communication system • Unless there’s a compelling performance enhancement • Key motivation for the split of functionality between TCP, UDP and IP Further Reading: “End-to-End Arguments in System Design.” Saltzer, Reed, and Clark.
User Datagram Protocol (UDP): An Analogy • UDP • Single socket to receive messages • No guarantee of delivery • Not necessarily in-order delivery • Datagram – independent packets • Must address each packet • Postal Mail • Single mailbox to receive letters • Unreliable • Not necessarily in-order delivery • Letters sent independently • Must address each letter • Example UDP applications: multimedia, voice over IP
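As a sketch of the “single socket, independently addressed datagrams” model in Go (the peer address and payload are made up):

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // One unconnected socket; each datagram is addressed individually,
        // and none is guaranteed to arrive, or to arrive in order.
        conn, err := net.ListenPacket("udp", ":0")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        dst, _ := net.ResolveUDPAddr("udp", "127.0.0.1:9999") // hypothetical peer
        if _, err := conn.WriteTo([]byte("letter #1"), dst); err != nil {
            fmt.Println("send failed:", err) // even success implies no delivery guarantee
        }
    }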
Transmission Control Protocol (TCP): An Analogy TCP • Reliable – guarantee delivery • Byte stream – in-order delivery • Connection-oriented – single socket per connection • Setup connection followed by data transfer Telephone Call • Guaranteed delivery • In-order delivery • Connection-oriented • Setup connection followed by conversation Example TCP applications Web, Email, Telnet
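For contrast, a minimal Go sketch of the connection-oriented model: dial once (the “call setup”), then treat the socket as a reliable, in-order byte stream (the server address and request are illustrative):

    package main

    import (
        "bufio"
        "fmt"
        "net"
    )

    func main() {
        // Connection setup first, then a reliable, in-order byte stream.
        conn, err := net.Dial("tcp", "example.com:80") // hypothetical server
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        fmt.Fprintf(conn, "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
        status, _ := bufio.NewReader(conn).ReadString('\n')
        fmt.Println("first line of reply:", status)
    }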
15-440 Distributed Systems Lecture 5 – Classical Synchronization
Classic synchronization primitives • Basics of concurrency • Correctness (achieves Mutex, no deadlock, no livelock) • Efficiency, no spinlocks or wasted resources • Fairness • Synchronization mechanisms • Semaphores (P() and V() operations) • Mutex (binary semaphore) • Condition Variables (allows a thread to sleep) • Must be accompanied by a mutex • Wait and Signal operations • Work through examples again + GO primitives
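Since the slide points at Go’s primitives, here is a minimal sketch pairing a mutex with a condition variable to guard a one-slot hand-off (all names are illustrative):

    package main

    import (
        "fmt"
        "sync"
    )

    var (
        mu    sync.Mutex
        cond  = sync.NewCond(&mu) // a condition variable must be paired with a mutex
        ready bool
        data  int
    )

    func producer() {
        mu.Lock()
        data, ready = 42, true
        mu.Unlock()
        cond.Signal() // wake one waiting consumer
    }

    func consumer() {
        mu.Lock()
        for !ready { // always re-check the condition after waking
            cond.Wait() // atomically releases mu and sleeps; reacquires mu on wake
        }
        fmt.Println("got", data)
        mu.Unlock()
    }

    func main() {
        var wg sync.WaitGroup
        wg.Add(1)
        go func() { defer wg.Done(); consumer() }()
        producer()
        wg.Wait()
    }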
15-440 Distributed Systems Lecture 6 – RPC
RPC Goals • Ease of programming • Hide complexity • Automates the task of implementing distributed computation • Familiar model for programmers (just make a function call) Historical note: Seems obvious in retrospect, but RPC was only invented in the ‘80s. See Birrell & Nelson, “Implementing Remote Procedure Call” ... or Bruce Nelson, Ph.D. Thesis, Carnegie Mellon University: Remote Procedure Call, 1981 :)
Passing Value Parameters (1) • The steps involved in doing a remote computation through RPC.
Stubs: obtaining transparency • Compiler generates from API stubs for a procedure on the client and server • Client stub • Marshals arguments into machine-independent format • Sends request to server • Waits for response • Unmarshals result and returns to caller • Server stub • Unmarshals arguments and builds stack frame • Calls procedure • Server stub marshals results and sends reply
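Go’s standard net/rpc package plays the role of these stubs, marshaling arguments and results over the wire; a minimal sketch (the Adder service and its types are made up for illustration):

    package main

    import (
        "fmt"
        "net"
        "net/rpc"
    )

    type Args struct{ A, B int }

    // Adder's exported method follows the net/rpc convention:
    // func (t T) Method(args *Args, reply *Reply) error
    type Adder struct{}

    func (Adder) Add(args *Args, reply *int) error {
        *reply = args.A + args.B
        return nil
    }

    func main() {
        rpc.Register(Adder{})
        ln, _ := net.Listen("tcp", "127.0.0.1:0")
        go rpc.Accept(ln) // server stub: unmarshal args, call procedure, marshal reply

        client, _ := rpc.Dial("tcp", ln.Addr().String())
        var sum int
        // Client stub: marshal args, send request, block, unmarshal the result.
        if err := client.Call("Adder.Add", &Args{2, 3}, &sum); err != nil {
            panic(err)
        }
        fmt.Println(sum) // 5
    }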
Real solution: break transparency • Possible semantics for RPC: • Exactly-once: impossible in practice • At-least-once: only for idempotent operations • At-most-once: zero, don’t know, or once • Zero-or-once: transactional semantics
Asynchronous RPC (3) • A client and server interacting through two asynchronous RPCs.
Important Lessons • Procedure calls • Simple way to pass control and data • Elegant, transparent way to distribute applications • Not the only way… • Hard to provide true transparency • Failures • Performance • Memory access • Etc. • How to deal with a hard problem: give up and let the programmer deal with it • “Worse is better”
15-440 Distributed Systems Lecture 7,8 – Distributed File Systems
Why DFSs? • Why Distributed File Systems: • Data sharing among multiple users • User mobility • Location transparency • Backups and centralized management • Examples: NFS (v1 – v4), AFS, CODA, LBFS • Idea: provide file system interfaces to remote FSs • Challenge: heterogeneity, scale, security, concurrency,.. • Non-challenge: AFS meant for a campus community • Virtual File Systems: pluggable file systems • Use RPCs
NFS vs AFS Goals • AFS: global distributed file system • “One AFS”, like “one Internet” • LARGE numbers of clients and servers (1000s of clients cache files) • Global namespace (organized as cells) => location transparency • Clients w/ disks => on-disk cache; write sharing rare; callbacks • Open-to-close consistency (session semantics) • NFS: very popular Network File System • NFSv4 meant for wide area • Naming: per-client view (/home/yuvraj/…) • Cache data in memory, not on disk; write-through cache • Consistency model: buffer data (eventual, ~30 seconds) • Requires significant resources as users scale
DFS Important bits (1) • Distributed filesystems almost always involve a tradeoff among consistency, performance, and scalability. • We’ve learned a lot since NFS and AFS (and can implement faster, etc.), but the general lesson holds. Especially in the wide-area. • We’ll see a related tradeoff, also involving consistency, in a while: the CAP tradeoff – Consistency, Availability, Partition-resilience.
DFS Important Bits (2) • Client-side caching is a fundamental technique to improve scalability and performance • But raises important questions of cache consistency • Timeouts and callbacks are common methods for providing (some forms of) consistency. • AFS picked close-to-open consistency as a good balance of usability (the model seems intuitive to users), performance, etc. • AFS authors argued that apps with highly concurrent, shared access, like databases, needed a different model
Coda Summary • Distributed File System built for mobility • Disconnected operation is the key idea • Puts scalability and availability before data consistency • Unlike NFS • Assumes that inconsistent updates are very infrequent • Introduced the disconnected operation mode, file hoarding, and the idea of “reintegration”
Low Bandwidth File System: Key Ideas • A network file system for slow or wide-area networks • Exploits similarities between files • Avoids sending data that can be found in the server’s file system or the client’s cache • Uses Rabin fingerprints on file content (file chunks) • Can deal with byte offsets when parts of a file change • Also uses conventional compression and caching • Requires 90% less bandwidth than traditional network file systems
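A toy Go sketch of the content-defined chunking idea, using a simple rolling sum rather than the actual Rabin polynomial fingerprints LBFS uses; the window size and boundary mask are made-up parameters:

    package main

    import "fmt"

    // chunk splits data wherever a rolling sum over the last `window` bytes
    // hits a boundary pattern, so an insertion early in a file only shifts
    // boundaries locally instead of re-chunking the whole file.
    func chunk(data []byte, window int, mask uint32) [][]byte {
        var chunks [][]byte
        var sum uint32
        start := 0
        for i, b := range data {
            sum += uint32(b)
            if i >= window {
                sum -= uint32(data[i-window]) // slide the window forward
            }
            if sum&mask == mask { // declare a chunk boundary
                chunks = append(chunks, data[start:i+1])
                start = i + 1
            }
        }
        if start < len(data) {
            chunks = append(chunks, data[start:])
        }
        return chunks
    }

    func main() {
        data := []byte("the quick brown fox jumps over the lazy dog")
        for _, c := range chunk(data, 8, 0x1F) {
            fmt.Printf("%q\n", c) // LBFS would index each chunk by a hash of its content
        }
    }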
15-440 Distributed Systems Lecture 9 – Time Synchronization
Impact of Clock Synchronization • When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.
Clocks in a Distributed System • Computer clocks are not generally in perfect agreement • Skew: the difference between the times on two clocks (at any instant) • Computer clocks are subject to clock drift (they count time at different rates) • Clock drift rate: the difference per unit of time from some ideal reference clock • Ordinary quartz clocks drift by about 1 sec in 11-12 days (about 10^-6 secs/sec) • High-precision quartz clocks have a drift rate of about 10^-7 or 10^-8 secs/sec
Perfect networks • Messages always arrive, with propagation delay exactly d • Sender sends time T in a message • Receiver sets clock to T+d • Synchronization is exact
Cristian’s Time Sync • A time server S receives signals from a UTC source • Process p requests the time in message mr and receives the time t in message mt from S • p sets its clock to t + RTT/2 • Accuracy ± (RTT/2 - min): • the earliest time S could have put t in message mt is min after p sent mr • the latest time was min before mt arrived at p • so the time by S’s clock when mt arrives is in the range [t + min, t + RTT - min] • RTT (Tround) is the round trip time recorded by p; min is an estimated minimum one-way transmission time
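The arithmetic as a small Go sketch (in a real run, t and the RTT would come from an actual request/reply exchange; the values here are made up):

    package main

    import (
        "fmt"
        "time"
    )

    // cristian returns the clock estimate t + RTT/2 and the accuracy
    // bound ±(RTT/2 - min) derived on the slide.
    func cristian(t time.Time, rtt, min time.Duration) (time.Time, time.Duration) {
        return t.Add(rtt / 2), rtt/2 - min
    }

    func main() {
        t := time.Now() // stand-in for the server's timestamp
        est, bound := cristian(t, 20*time.Millisecond, 4*time.Millisecond)
        fmt.Println(est, "accurate to within ±", bound)
    }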
Berkeley algorithm • Cristian’s algorithm: • a single time server might fail, so they suggest the use of a group of synchronized servers • it does not deal with faulty servers • Berkeley algorithm (also 1989): • an algorithm for internal synchronization of a group of computers • a master polls to collect clock values from the others (slaves) • the master uses round trip times to estimate the slaves’ clock values • it takes an average (eliminating readings with above-average round trip times or from faulty clocks) • it sends the required adjustment to the slaves (better than sending the time, which depends on the round trip time) • Measurements: 15 computers, clock synchronization to within 20-25 millisecs, drift rate < 2×10^-5 • If the master fails, a new master can be elected to take over (not in bounded time)
NTP Protocol • All modes use UDP • Each message bears timestamps of recent events: the local times of Send and Receive of the previous message m, and the local time of Send of the current message m' • The recipient notes its time of receipt T3, so it ends up with all four timestamps: T0 (client sends m), T1 (server receives m), T2 (server sends m'), T3 (client receives m')
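From those four timestamps the client can form the standard NTP offset and delay estimates; a small Go sketch with made-up timestamps:

    package main

    import (
        "fmt"
        "time"
    )

    // ntpOffsetDelay implements the usual NTP estimates:
    //   offset = ((T1-T0) + (T2-T3)) / 2   (how far the server's clock leads the client's)
    //   delay  = (T3-T0) - (T2-T1)         (total network round trip, excluding server time)
    func ntpOffsetDelay(t0, t1, t2, t3 time.Time) (offset, delay time.Duration) {
        offset = (t1.Sub(t0) + t2.Sub(t3)) / 2
        delay = t3.Sub(t0) - t2.Sub(t1)
        return
    }

    func main() {
        base := time.Now()
        t0 := base                            // client sends m
        t1 := base.Add(15 * time.Millisecond) // server receives m (server clock 5ms fast here)
        t2 := base.Add(16 * time.Millisecond) // server sends m'
        t3 := base.Add(21 * time.Millisecond) // client receives m'
        offset, delay := ntpOffsetDelay(t0, t1, t2, t3)
        fmt.Println("offset:", offset, "delay:", delay) // offset: 5ms, delay: 20ms
    }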
Logical time and logical clocks (Lamport 1978) • Instead of synchronizing clocks, event ordering can be used • If two events occurred at the same process pi (i = 1, 2, … N), then they occurred in the order observed by pi; this is the definition of the per-process order “→i” • When a message m is sent between two processes, send(m) happens before receive(m) • The happened-before relation is transitive • The happened-before relation is the relation of causal ordering
Total-order Lamport clocks • Many systems require a total-ordering of events, not a partial-ordering • Use Lamport’s algorithm, but break ties using the process ID • L(e) = M * Li(e) + i • M = maximum number of processes • i = process ID
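A minimal sketch of this tie-breaking formula in Go (the bound M and the example timestamps are illustrative):

    package main

    import "fmt"

    const M = 10 // maximum number of processes (assumed bound)

    // total maps a Lamport timestamp Li at process i to a globally unique
    // value L = M*Li + i, so equal Lamport times are ordered by process ID.
    func total(li, i int) int { return M*li + i }

    func main() {
        // Two events with the same Lamport time 7 at processes 2 and 5:
        fmt.Println(total(7, 2), total(7, 5)) // 72 < 75, so p2's event is ordered first
    }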
Vector Clocks • Note that e → e’ implies V(e) < V(e’); the converse is also true • Can you see a pair of parallel events? • c || e (parallel), because neither V(c) <= V(e) nor V(e) <= V(c)
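A minimal Go sketch of that comparison (the example vectors are made up, chosen so the two events come out concurrent):

    package main

    import "fmt"

    // leq reports whether V(e) <= V(e') componentwise (vectors of equal length).
    func leq(a, b []int) bool {
        for i := range a {
            if a[i] > b[i] {
                return false
            }
        }
        return true
    }

    // concurrent: neither vector dominates the other, so neither event
    // happened-before the other.
    func concurrent(a, b []int) bool { return !leq(a, b) && !leq(b, a) }

    func main() {
        c := []int{2, 0, 0} // hypothetical V(c)
        e := []int{1, 2, 0} // hypothetical V(e)
        fmt.Println(concurrent(c, e)) // true: c || e
    }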
15-440 Distributed Systems Lecture 10 – Mutual Exclusion
Example: Totally-Ordered Multicasting • A San Francisco customer adds $100 to an account (balance $1,000) while the New York bank site adds 1% interest • Applied in different orders, San Francisco’s replica ends with $1,111 and New York’s with $1,110 • Updating a replicated database and leaving it in an inconsistent state • Can use Lamport clocks to totally order the updates
Mutual Exclusion: A Centralized Algorithm (1)

    @ Client, Acquire:
        Send (Request, i) to coordinator
        Wait for reply

    @ Server:
        while true:
            m = Receive()
            if m == (Request, i):
                if Available():
                    Send (Grant) to i
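A Go sketch of the coordinator using channels; the queueing of requests while the lock is held is an assumed extension (this slide’s part 1 only shows the grant case):

    package main

    import "fmt"

    // coordinator grants the lock if it is free, otherwise queues the request
    // and replies when the current holder releases.
    func coordinator(requests <-chan int, grants map[int]chan struct{}, releases <-chan int) {
        held := false
        var queue []int
        for {
            select {
            case i := <-requests:
                if !held {
                    held = true
                    grants[i] <- struct{}{} // lock free: grant immediately
                } else {
                    queue = append(queue, i) // lock busy: defer the reply
                }
            case <-releases:
                if len(queue) > 0 {
                    i := queue[0]
                    queue = queue[1:]
                    grants[i] <- struct{}{} // hand the lock to the next waiter
                } else {
                    held = false
                }
            }
        }
    }

    func main() {
        requests := make(chan int)
        releases := make(chan int)
        grants := map[int]chan struct{}{1: make(chan struct{}, 1), 2: make(chan struct{}, 1)}
        go coordinator(requests, grants, releases)

        requests <- 1
        <-grants[1] // client 1 acquires
        requests <- 2
        releases <- 1
        <-grants[2] // client 2 acquires only after 1 releases
        fmt.Println("both clients took their turn")
    }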