CIS 620 Advanced Operating Systems Lecture 11 – Distributed File Systems, Consistency and Replication Prof. Timothy Arndt BU 331
Distributed File Systems • File service vs. file server • The file service is the specification. • A file server is a process running on a machine to implement the file service for (some) files on that machine. • A typical distributed system would have one file service but perhaps many file servers. • If the file systems are of very different kinds, we might not be able to have a single file service, since some functions may not be available on all of them.
Distributed File Systems • File Server Design • File • Sequence of bytes • Unix • MS-DOS • Windows • Sequence of Records • Mainframes • Keys • We do not cover these file systems. They are often discussed in database courses.
Distributed File Systems • File attributes • rwx and perhaps a (append) • This is really a subset of what is called an ACL (access control list) or a capability. • You get ACLs by reading the columns of the access matrix and capabilities by reading its rows. • owner, group, various dates, size • dump, autocompress, immutable
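To make the access matrix remark concrete, here is a minimal sketch. It assumes the usual layout with users as rows and files as columns; the user names, file names, and helper functions are made up for illustration.

```python
# Hypothetical access matrix: rows are users (subjects), columns are files (objects).
access_matrix = {
    "alice": {"notes.txt": "rw", "a.out": "rx"},
    "bob": {"notes.txt": "r"},
}


def acl_for(matrix, obj):
    """Read one column: every user's rights on a single file (its ACL)."""
    return {user: rights[obj] for user, rights in matrix.items() if obj in rights}


def capabilities_for(matrix, user):
    """Read one row: every file a single user may access (a capability list)."""
    return dict(matrix.get(user, {}))
```

The same matrix is stored either per file (ACLs) or per user (capabilities); the rwx bits on a Unix file are a compressed form of the per-file view.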
Distributed File Systems • Upload/download vs. remote access. • Upload/download means the only file services supplied are read file and write file. • All modifications done on a local copy of file. • Conceptually simple at first glance. • Whole file transfers are efficient (assuming you are going to access most of the file) when compared to multiple small accesses. • Not an efficient use of bandwidth if you access only a small part of a large file. • Requires storage on client.
Distributed File Systems • What about concurrent updates? • What if one client reads and "forgets" to write for a long time and then writes back the "new" version overwriting newer changes from others? • Remote access means direct individual reads and writes to the remote copy of the file. • File stays on the server. • Issue of (client) buffering • Good to reduce number of remote accesses. • But what about semantics when a write occurs?
Distributed File Systems • Note that metadata (e.g. the access time) is written even for a read, so if you want faithful semantics, either every client read must modify metadata on the server or all requests for metadata (e.g. ls or dir commands) must go to the server. • Cache consistency question. • Directories • Mapping from names to files/directories. • Contains rules for names of files and (sub)directories. • Hierarchy i.e. tree • (hard) links
Distributed File Systems • With hard links the filesystem becomes a Directed Acyclic Graph instead of a simple tree. • Symbolic links • Symbolic not symmetric. Indeed asymmetric. • Consider: cd ~; mkdir dir1; touch dir1/file1; ln -s dir1/file1 file2
Distributed File Systems • file2 has a new inode; it is a new type of file called a symlink, and its "contents" are the name of the file dir1/file1. • When accessed, file2 returns the contents of file1, but it is not equal to file1. • If file1 is deleted, file2 "exists" but is invalid. • If a new dir1/file1 is created, file2 now points to it. • Symbolic links can point to directories as well. • With symbolic links pointing to directories, the file system becomes a general graph, i.e. directed cycles are permitted.
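The symlink behaviour described above can be checked with a short script. This is a sketch that assumes a POSIX system where unprivileged processes may create symlinks; it works in a scratch directory rather than in ~, and the function name and result keys are made up for illustration.

```python
import os
import tempfile


def symlink_demo():
    """Recreate the dir1/file1 and file2 example inside a scratch directory."""
    results = {}
    with tempfile.TemporaryDirectory() as home:
        os.mkdir(os.path.join(home, "dir1"))
        target = os.path.join(home, "dir1", "file1")
        link = os.path.join(home, "file2")
        with open(target, "w") as f:
            f.write("first")
        # Equivalent of: ln -s dir1/file1 file2
        os.symlink(os.path.join("dir1", "file1"), link)

        # The "contents" of the symlink are just the stored name.
        results["contents"] = os.readlink(link)
        with open(link) as f:
            results["read_through"] = f.read()
        # The symlink has its own inode, distinct from the target's.
        results["distinct_inode"] = os.lstat(link).st_ino != os.stat(target).st_ino

        # Delete the target: the link still exists but no longer resolves.
        os.remove(target)
        results["link_still_exists"] = os.path.lexists(link)
        results["link_resolves"] = os.path.exists(link)

        # Create a new dir1/file1: the old link now points to it.
        with open(target, "w") as f:
            f.write("second")
        with open(link) as f:
            results["after_recreate"] = f.read()
    return results
```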
Distributed File Systems • Imagine hard links pointing to directories (Unix does not permit this): cd ~; mkdir B; mkdir C; mkdir B/D; mkdir B/E; ln B B/D/oh-my • Now you have a loop with honest looking links. • Normally you can't remove a directory (i.e. unlink it from its parent) unless it is empty. • But when you can have multiple hard links to a directory, you should permit removing (i.e. unlinking) one even if the directory is not empty.
Distributed File Systems • So in the above example you could unlink B from its parent A (here the home directory). • Now you have garbage (unreachable, i.e. unnamable) directories B, D, and E. • For a centralized system you need a conventional garbage collector. • For a distributed system you need a distributed garbage collector, which is much harder. • Transparency • Location transparency • Path name (i.e. full name of file) does not say where the file is located.
Distributed File Systems • Location Independence • Path name is independent of the server. Hence you can move a file from server to server without changing its name. • Have a namespace of files and then have some (dynamically) assigned to certain servers. This namespace would be the same on all machines in the system. • Root transparency • made up name • / is the same on all systems • This would ruin some conventions like /tmp
Distributed File Systems • Examples • Machine + path naming • /machine/path • machine:path • Mounting remote file system onto local hierarchy • When done intelligently we get location transparency • Single namespace looking the same on all machines
Distributed File Systems • Two level naming • We said above that a directory is a mapping from names to files (and subdirectories). • More formally, the directory maps the user name /home/me/class-notes.html to the OS name for that file 143428 (the Unix inode number). • These two names are sometimes called the symbolic and binary names. • For some systems the binary names are available.
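On Unix the binary name is visible to user programs as the inode number. The sketch below assumes a POSIX filesystem that supports hard links; the alias file name and the function name are made up for illustration. It shows two symbolic names mapping to the same binary name.

```python
import os
import tempfile


def binary_names():
    """Return the inode numbers of two symbolic names for one file, plus its link count."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "class-notes.html")
        second = os.path.join(d, "alias.html")   # hypothetical second name
        open(first, "w").close()
        os.link(first, second)                   # hard link: a second directory entry
        return os.stat(first).st_ino, os.stat(second).st_ino, os.stat(first).st_nlink
```

Both directory entries translate to the same inode number, which is why a hard link, unlike a symbolic link, is indistinguishable from the original name.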
Distributed File Systems • The binary name could contain the server name so that one could directly reference files on other filesystems/machines • Unix doesn't do this • We could have symbolic names contain the server name • Unix doesn't do this either • VMS did something like this. Symbolic name was something like nodename::filename • Could have the name lookup yield multiple binary names.
Distributed File Systems • Redundant storage of files for availability • Naturally must worry about updates • When visible? • Concurrent updates? • Whenever you hear of a system that keeps multiple copies of something, an immediate question should be "are these immutable?". If the answer is no, the next question is "what are the update semantics?" • Sharing semantics • Unix semantics - A read returns the value stored by the last write.
Distributed File Systems • Actually Unix doesn't quite do this. • If a write is large (several blocks), do seeks for each • During a seek, the process sleeps (in the kernel) • Another process can be writing a range of blocks that intersects the blocks for the first write. • Depending on disk scheduling, the final contents could mix blocks from both writes, so that there is no single last write. • Perhaps Unix semantics means - A read returns the value stored by the last write, providing one exists. • Perhaps Unix semantics means - A write syscall should be thought of as a sequence of write-block syscalls and similar for reads. A read-block syscall returns the value of the last write-block syscall for that block.
Distributed File Systems • Easy to get this same semantics for systems with file servers providing • No client side copies (Upload/download) • No client side caching • Session semantics • Changes to an open file are visible only to the process (machine???) that issued the open. When the file is closed the changes become visible to all. • If you are using client caching you cannot flush dirty blocks until close. (What if you run out of buffer space?)
Distributed File Systems • May mess up file-pointer semantics • The file pointer is shared across the fork so all the children of a parent share it. • But if the children run on another machine with session semantics, the file pointer can't be shared since the other machine does not see the effect of the writes done by the parent. • Immutable files • Then there is "no problem" • Fine if you don't want to change anything
Distributed File Systems • Can have "version numbers" • Old version may become inaccessible (at least under the current name) • With version numbers if you use name without number you get the highest numbered version • But really you do have the old (full) name accessible • VMS definitely did this • Note that directories are still mutable • Otherwise no create-file is possible
Distributed File Systems • Distributed File System Implementation • File Usage characteristics • Measured under Unix at a university • Not obvious that the same results would hold in a different environment • Findings • 1. Most files are small (< 10K) • 2. Reading dominates writing • 3. Sequential accesses dominate • 4. Most files have a short lifetime
Distributed File Systems • 5. Sharing is unusual • 6. Most processes use few files • 7. File classes with different properties exist • Some conclusions • 1 suggests whole-file transfer may be worthwhile (except for really big files). • 2+5 suggest client caching and dealing with multiple writers somehow, even if the latter is slow (since it is infrequent). • 4 suggests doing creates on the client
Distributed File Systems • Not so clear. Possibly the short lifetime files are temporaries that are created in /tmp or /usr/tmp or /somethingorother/tmp. These would not be on the server anyway. • 7 suggests having multiple mechanisms for the several classes. • Implementation choices • Servers & clients homogeneous? • Common Unix+NFS: any machine can be a server and/or a client
Distributed File Systems • User-mode implementation: Servers for files and directories are user programs so can configure some machines to offer the services and others not to • Fundamentally different: Either the hardware or software is fundamentally different for clients and servers. • In Unix some server code is in the kernel but other code is a user program (run as root) called nfsd • File and directory servers together?
Distributed File Systems • If yes, less communication • If no, more modular, "cleaner" • Looking up a/b/c when a, a/b, and a/b/c are on different servers • Natural solution is for server-a to return the name of server-a/b • Then the client contacts server-a/b, gets the name of server-a/b/c, etc. • Alternatively server-a forwards the request to server-a/b, which forwards it to server-a/b/c. • Natural method takes 6 communications (3 RPCs)
Distributed File Systems • Alternative is 4 communications (three forwarded requests and one reply to the client) but is not RPC • Name caching • The translation from a/b/c to the inode (i.e. symbolic to binary name) is expensive even for centralized systems. • Called namei in Unix and was once measured to be a significant percentage of all kernel activity. • Later Unix added "namei caching" • Potentially an even greater time saver for distributed systems since communication is expensive. • Must worry about obsolete entries.
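A minimal sketch of client-side name caching, and of how an entry becomes obsolete. The NameCache class, its method names, and the dictionary standing in for the remote directory servers are all assumptions made for illustration; this is not how namei caching is implemented in any particular kernel.

```python
class NameCache:
    """Cache translations from symbolic names (paths) to binary names."""

    def __init__(self, lookup):
        self._lookup = lookup      # expensive function: path -> binary name
        self._entries = {}         # path -> cached binary name
        self.remote_lookups = 0    # how many times we had to ask the servers

    def resolve(self, path):
        if path not in self._entries:
            self.remote_lookups += 1
            self._entries[path] = self._lookup(path)
        return self._entries[path]

    def invalidate(self, path):
        """Drop an entry that is known (or suspected) to be obsolete."""
        self._entries.pop(path, None)
```

If the file is renamed or recreated on the server, resolve keeps returning the old binary name until something invalidates the entry, which is exactly the obsolete-entry problem noted above.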
Distributed File Systems • Stateless vs. Stateful • Should the server keep information between requests from a user, i.e. should the server maintain state? • What state? • Recall that the open returns an integer called a file descriptor that is subsequently used in read/write. • With a stateless server, the read/write must be self contained, i.e. cannot refer to the file descriptor. • Why?
Distributed File Systems • Advantages of stateless • Fault tolerant - No state to be lost in a crash • No open/close needed (saves messages) • No space used for tables (state requires storage) • No limit on number of open files (no tables to fill up) • No problem if client crashes (no state to be confused by) • Advantages of stateful • Shorter read/write (descriptor shorter than name)
Distributed File Systems • Better performance • Since we keep track of what files are open, we know to keep those inodes in memory • But stateless could keep a memory cache of inodes as well (evict via LRU instead of close, not as good) • Blocks can be read in advance (read ahead) • Of course stateless can read ahead. • Difference is that with stateful we can better decide when accesses are sequential. • Idempotency easier (keep sequence numbers) • File locking possible (the lock is state) • Stateless can write a lock file by convention. • Stateless can call a lock server
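The contrast can be sketched with two toy in-memory servers. The class and method names are made up for illustration and no real protocol is modelled. The sketch also suggests an answer to the "Why?" above: a file descriptor is only an index into a table held by the server, and that table is precisely the state a stateless server does not keep.

```python
class StatefulServer:
    """Keeps an open-file table; requests refer to it by descriptor."""

    def __init__(self, files):
        self.files = files
        self.open_table = {}   # descriptor -> [name, offset]; lost in a crash
        self.next_fd = 0

    def open(self, name):
        fd = self.next_fd
        self.next_fd += 1
        self.open_table[fd] = [name, 0]
        return fd

    def read(self, fd, count):
        name, offset = self.open_table[fd]
        data = self.files[name][offset:offset + count]
        self.open_table[fd][1] = offset + len(data)
        return data

    def crash(self):
        self.open_table = {}   # the files survive, the table does not


class StatelessServer:
    """Keeps nothing between requests; each read is self-contained."""

    def __init__(self, files):
        self.files = files

    def read(self, name, offset, count):
        return self.files[name][offset:offset + count]
```

The stateful read message is shorter (descriptor and count only), but every descriptor becomes meaningless after a server crash. The stateless read carries the name and offset each time, so it can simply be retried.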
Caching • There are four places to store a file supplied by a file server (these are not mutually exclusive): • Server's disk • always done • Server's main memory • normally done • Standard buffer cache • Clear performance gain • Little if any semantics problems
Caching • Client's main memory • Considerable performance gain • Considerable semantic considerations • The one we will study • Client's disk • Not so common now with cheaper memory • Unit of caching • File vs. block • Tradeoff of fewer accesses vs. storage efficiency
Caching • What eviction algorithm? • Exact LRU feasible because we can afford the time to do it (via linked lists) since access rate is low. • Where in client's memory to put cache? • The user's process • The cache will die with the process • No cache reuse among distinct processes • Not done for normal OS. • Big deal in databases • Cache management is a well studied DB problem
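Returning to the eviction question on this slide, exact LRU over cached blocks can be sketched as follows. The sketch uses Python's OrderedDict (a hash table over a doubly linked list) as a stand-in for the linked lists mentioned above; the class name and the (file, block number) key format are assumptions for illustration.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Exact LRU: every hit moves the block to the most-recently-used end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()   # (file, block number) -> data, oldest first

    def get(self, key):
        if key not in self._blocks:
            return None                # miss: caller must fetch from the server
        self._blocks.move_to_end(key)  # record the access
        return self._blocks[key]

    def put(self, key, data):
        self._blocks[key] = data
        self._blocks.move_to_end(key)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)   # evict the least recently used block
```

Updating the list on every access is affordable here because file block accesses are far less frequent than memory references.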
Caching • The kernel (i.e. the client's kernel) • System call required for cache hit • Quite common • Another process • "Cleaner" than in kernel • Easier to debug • Slower • Might get paged out by kernel! • Cache consistency • Big question
Caching • Write-through • All writes are sent to the server (as well as the client cache) • Hence does not lower traffic for writes • Does not by itself fix values in other caches • We need to invalidate or update other caches • Can have the client cache check with server whenever supplying a block to ensure that the block is not obsolete • Hence still need to reach server for all accesses but at least the reads that hit in the cache only need to send tiny message (timestamp not data).
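A sketch of write-through with validation on every read. It assumes a per-block version counter in place of the timestamp mentioned above, and the BlockServer and WriteThroughClient classes are made up for illustration.

```python
class BlockServer:
    def __init__(self):
        self.blocks = {}   # block id -> (version, data)

    def write(self, block, data):
        version = self.blocks.get(block, (0, None))[0] + 1
        self.blocks[block] = (version, data)
        return version

    def version(self, block):
        return self.blocks[block][0]   # the "tiny message": no data is sent

    def read(self, block):
        return self.blocks[block]


class WriteThroughClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}    # block id -> (version, data)

    def write(self, block, data):
        # Every write goes to the server as well as to the local cache.
        version = self.server.write(block, data)
        self.cache[block] = (version, data)

    def read(self, block):
        cached = self.cache.get(block)
        if cached is not None and cached[0] == self.server.version(block):
            return cached[1]           # still current: only the version check was sent
        self.cache[block] = self.server.read(block)   # missing or obsolete: refetch
        return self.cache[block][1]
```

Note that another client's write does not touch this client's cache; the stale copy is only detected at the next read, when the versions disagree.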
Caching • Delayed write • Wait a while (30 seconds is used in some NFS implementations) and then send a bulk write message. • This is more efficient than a bunch of small write messages. • If file is deleted quickly, you might never write it. • Semantics are now time dependent (and ugly).
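A sketch of delayed write, with the clock passed in explicitly so that the time dependence is visible. The class and method names are assumptions for illustration, and the sketch ignores the case of deleting a file that already exists on the server.

```python
class DelayedWriteCache:
    def __init__(self, delay=30.0):
        self.delay = delay
        self.dirty = {}           # file name -> (data, time of oldest unsent write)
        self.bulk_messages = []   # each entry is one bulk write sent to the server

    def write(self, name, data, now):
        # Keep the time of the oldest unsent write so the delay is bounded.
        first = self.dirty[name][1] if name in self.dirty else now
        self.dirty[name] = (data, first)

    def delete(self, name):
        # A file deleted before its flush never generates any traffic.
        self.dirty.pop(name, None)

    def flush(self, now):
        """Send everything older than the delay in a single bulk message."""
        due = {n: d for n, (d, t) in self.dirty.items() if now - t >= self.delay}
        if due:
            self.bulk_messages.append(due)
            for n in due:
                del self.dirty[n]
        return due
```

What another client sees on the server depends on whether it looks before or after the flush, which is the time-dependent semantics noted above.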
Caching • Write on close • Session semantics • Fewer messages since more writes than closes. • Not beautiful (think of two files simultaneously opened). • Not much worse than normal (uniprocessor) semantics. The difference is that it appears to be much more likely to hit the bad case. • Delayed write on close • Combines the advantages and disadvantages of delayed write and write on close.
Caching • Doing it "right". • Multiprocessor caching (of central memory) is well studied and many solutions are known. • Use cache consistency (a.k.a. cache coherence) methods which are well-known. • Centralized solutions are possible. • But none are cheap. • Perhaps NFS is good enough and there is not enough reason to change (NFS predates cache coherence work).
Replication • Some issues are similar to (client) caching. • Why? • Because whenever you have multiple copies of anything, ask • Are they immutable? • What is update policy? • How do you keep copies consistent? • Purposes of replication • Reliability • A "backup" is available if data is corrupted on one server.
Replication • Availability • Only need to reach any of the servers to access the file (at least for queries). • Not the same as reliability • Performance • Each server handles less than the full load (for a query-only system much less). • Can use closest server - lowering network delays. • Not important for distributed system on one physical network. • Very important for web mirror sites.
Replication • Transparency • If we can't tell files are replicated, we say the system has replication transparency • Creation can be completely opaque • i.e. fully manual • users use copy commands • if directory supports multiple binary names for a single symbolic name, • use this when making copies • presumably subsequent opens will try the binary names in order (so they are not opaque)
Replication • Creation can use lazy replication. • User creates original • system later makes copies • subsequent opens can be (re)directed at any copy • Creation can use group communication. • User directs requests at a group. • Hence creation happens to all copies in the group at once.
Replication • Update protocols • Primary copy • All updates are done to the primary copy. • This server writes the update to stable storage and then updates all the other (secondary) copies. • After a crash, the server looks at stable storage and sees if there are any updates to complete. • Reads are done from any copy. • This is good for reads (read any one copy). • Writes are not so good. • Can't write if primary copy is unavailable.
Replication • Semantics • The update can take a long time (some of the secondaries can be down) • While the update is in progress, reads are concurrent with it. That is, a reader might get the old or the new value depending on which copy it reads. • Voting • All copies are equal (symmetric) • To write you must write at least WQ of the copies (a write quorum). Set the version number of all these copies to 1 + max of current version numbers. • To read you must read at least RQ copies and use the value with the highest version.
Replication • Require WQ+RQ > number of copies • Hence any write quorum and read quorum intersect. • Hence the highest version number in any read quorum is the highest version number there is. • Hence always read the current version • Consider the extremes: WQ=1 (so RQ must be all copies) and RQ=1 (so WQ must be all copies) • Fine points • To write, you must first read all the copies in your WQ to get the version number. • Must prevent races • Let N=2, WQ=2, RQ=1. Both copies (A and B) have version number 10.
Replication • Two updates start. U1 wants to write 1234, U2 wants to write 6789. • Both read version numbers and add 1 (get 11). • U1 writes A and U2 writes B at roughly the same time. • Later U1 writes B and U2 writes A. • Now both are at version 11 but A=6789 and B=1234. • Voting with ghosts • Often reads dominate writes so we choose RQ=1 (or at least RQ very small so WQ very large).
Replication • This makes it hard to write. E.g. RQ=1 so WQ=n and hence can't update if any machine is down. • When one detects that a server is down, a ghost is created. • Ghost cannot participate in read quorum, but can in write quorum • write quorum must have at least one non-ghost • Ghost throws away value written to it • Ghost always has version 0 • When crashed server reboots, it accesses a read quorum to update its value
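The basic voting scheme described above (without ghosts) can be sketched as follows. The Replica class and function names are made up for illustration. The sketch is single-threaded, so it does nothing about the race shown earlier, and the quorums used with it should also satisfy 2*WQ > N so that the maximum version seen inside a write quorum is the global maximum; that extra condition is an assumption of this sketch rather than something stated in the slides.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None


def quorums_ok(n, wq, rq):
    """The intersection requirement: any read quorum overlaps any write quorum."""
    return wq + rq > n


def quorum_write(replicas, write_quorum, value):
    """Write value to the replicas whose indices are listed in write_quorum."""
    # First read the version numbers in the quorum, then write 1 + max to all of them.
    new_version = 1 + max(replicas[i].version for i in write_quorum)
    for i in write_quorum:
        replicas[i].version = new_version
        replicas[i].value = value
    return new_version


def quorum_read(replicas, read_quorum):
    """Return the value held by the highest-versioned replica in read_quorum."""
    newest = max((replicas[i] for i in read_quorum), key=lambda r: r.version)
    return newest.value
```

For example, with N=5 and WQ=RQ=3, a write to copies 0, 1, 2 followed by a read of copies 2, 3, 4 sees the new value through copy 2, the one copy the two quorums share.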
Structured Peer-to-Peer Systems • Balancing load in a peer-to-peer system by replication.
Handling Byzantine Failures • The different phases in Byzantine fault tolerance.
High Availability in Peer-to-Peer Systems • The ratio r_rep/r_ec as a function of node availability a.
NFS • NFS - Sun Microsystems' Network File System. • "Industry standard", dominant system. • Machines can be (and often are) both clients and servers. • Basic idea is that servers export directories and clients mount them. • When a server exports a directory, the subtree rooted there is exported. • In Unix exporting is specified in /etc/exports