410 likes | 530 Views
COMP 734 -- Distributed File Systems. With Case Studies: NFS, Andrew and Google. Mobility (user & data). Sharing. Administration Costs. Content Management. Security. Backup. Distributed File Systems Were Phase 0 of “Cloud” Computing. Phase 0: Centralized Data, Distributed Processing.
E N D
COMP 734 -- Distributed File Systems With Case Studies: NFS, Andrew and Google COMP 734 – Fall 2011
Mobility (user & data) Sharing Administration Costs Content Management Security Backup Distributed File Systems Were Phase 0 of “Cloud” Computing Phase 0: Centralized Data, Distributed Processing Performance??? COMP 734 – Fall 2011
response (e.g., file block) request (e.g., read) Distributed File System Clients and Servers Client Server network Client Client Most Distributed File Systems Use Remote Procedure Calls (RPC) COMP 734 – Fall 2011
RPC runtime client stub RPC runtime server stub local call call packet pack args unpack args call transmit receive work wait unpack result pack result return receive transmit result packet exporter importer interface RPC Structure (Birrell & Nelson)11Fig. 1 (slightly modified) Birrell, A. D. and B. J. Nelson, Implementing Remote Procedure Calls, ACM Transactions on Computer Systems, Vol. 2, No. 1, February 1984, pp. 39-59 Callee machine Caller machine client server network local return exporter importer interface COMP 734 – Fall 2011
happens before happens before B A B A Unix Local File Access Semantics – Multiple Processes Read/Write a File Concurrently read by C write by A write by B read by C write by A write by B read by C write by A writes are “atomic” reads always get the atomic result of the most recently completed write write by B COMP 734 – Fall 2011
What Happens If Clients Cache File/Directory Content? Do the Cache Consistency Semantics Match Local Concurrent Access Semantics? Client Server write() client cache read() network response Client write() client cache Client read() COMP 734 – Fall 2011
File Usage Observation #1:Most Files are Small Source: Mary Baker, et at, “Measurements of a Distributed File System,” Proceedings 13th ACM SOSP, 1991, pp. 198-212. Source: Andrew W. Leung, et at, “Measurement and Analysis of Large-Scale Network File System Workloads,” Proceedings USENIX Annual Technical Conference, 2008, pp. 213-226. Unix, 1991 Windows, 2008 80% of files < 10 KB 80% of files < 30 KB COMP 734 – Fall 2011
File Usage Observation #2:Most Bytes Are Transferred from Large Files (and Large Files Are Larger) Source: Mary Baker, et at, “Measurements of a Distributed File System,” Proceedings 13th ACM SOSP, 1991, pp. 198-212. Source: Andrew W. Leung, et at, “Measurement and Analysis of Large-Scale Network File System Workloads,” Proceedings USENIX Annual Technical Conference, 2008, pp. 213-226. Windows, 2008 Unix, 1991 80% of bytes from files > 10 KB 70% of bytes from files > 30 KB COMP 734 – Fall 2011
File Usage Observation #3:Most Bytes Are Transferred in Long Sequential Runs – Most Often the Whole File Source: Mary Baker, et at, “Measurements of a Distributed File System,” Proceedings 13th ACM SOSP, 1991, pp. 198-212. Source: Andrew W. Leung, et at, “Measurement and Analysis of Large-Scale Network File System Workloads,” Proceedings USENIX Annual Technical Conference, 2008, pp. 213-226. Windows, 2008 Unix, 1991 85% of sequential bytes from runs > 100 KB 60% of sequential bytes from runs > 100 KB COMP 734 – Fall 2011
Chronology of Early Distributed File Systems – Designs for Scalability open source (AFS) product Clients supported by one installation Time COMP 734 – Fall 2011
NFS-2 Design Goals • Transparent File Naming • Scalable -- O(100s) • Performance approximating local files • Fault tolerant (client, server, network faults) • Unix file-sharing semantics (almost) • No change to Unix C library/system calls COMP 734 – Fall 2011
usr joe sam doc bin proj readme.doc source tools report.doc usr bin proj main.c data.h source tools joe sam main.c data.h doc readme.doc report.doc NFS File Naming – Exporting Names / host#1 exportfs /usr/proj host#3 / usr tools exportfs /usr exportfs /bin local fred / host#2 usr local bob COMP 734 – Fall 2011
usr joe sam doc bin proj readme.doc source tools report.doc usr bin proj tools main.c data.h source tools joe sam main.c data.h doc readme.doc report.doc joe source sam main.c data.h doc readme.doc report.doc NFS File Naming – Import (mount) Names mount host#3:/bin /tools / host#1 exportfs /usr/proj host#3 / usr tools exportfs /usr exportfs /bin local fred / host#2 usr local mount host#3:/usr /usr bob mount host#1:/usr/proj /local COMP 734 – Fall 2011
NFS-2 Remote Procedure Calls • NFS server is Stateless • each call is self-contained and independent • avoids complex state recovery issues in the event of failures • each call has an explicit file handle parameter COMP 734 – Fall 2011
NFS server file system NFS-2 Read/Write with Client Cache application read(fh, buffer, length) file system NFS client network OS kernel Buffer cache blocks OS kernel COMP 734 – Fall 2011
NFS-2 Cache Operation and Consistency Semantics • Application writes modify blocks in buffer cache and on the server with write() RPC (“write-through”) • Consistency validations of cached data compare the last-modified timestamp in the cache with the value on the server • If server timestamp is later, the cached data is discarded • Note the need for (approximately) synchronized clocks on client and server • Cached file data is validated (a getattr() RPC) each time the file is opened. • Cached data is validated (a getattr() RPC) if an application accesses it and a time threshold has passed since it was last validated • validations are done each time a last-modified attribute is returned on RPC for read, lookup, etc.) • 3 second threshold for files • 30 second threshold for directories • If the time threshold has NOT passed, the cached file/directory data is assumed valid and the application is given access to it. COMP 734 – Fall 2011
Andrew Design Goals • Transparent file naming in single name space • Scalable -- O(1000s) • Performance approximating local files • Easy administration and operation • “Flexible” protections (directory scope) • Clear (non-Unix) consistency semantics • Security (authentication) COMP 734 – Fall 2011
proj home pkg tools / / / / reiter smithfd bin bin bin bin etc etc etc etc win32 cache cache cache cache doc Users Share a Single Name Space afs COMP 734 – Fall 2011
home home home / / bin bin etc etc reiter carol bob smithfd alice ted cache cache doc doc doc Server “Cells” Form a “Global” File System /afs/cs.unc.edu /afs/cs.cmu.edu /afs/cern.ch COMP 734 – Fall 2011
(1) - open request passed to Andrew client (2) - client checks cache for valid file copy (3) - if not in cache, fetch whole file from server and write to cache; else (4) Server (4) - when file is in cache, return handle to local file (4) (2) (1) (3) file system Andrew File Access (Open) application open(/afs/usr/fds/foo.c) file system Client network OS kernel OS kernel On-disk cache COMP 734 – Fall 2011
(5) - read and write operations take place on local cache copy Server file system (5) Andrew File Access (Read/Write) application read(fh, buffer, length) write(fh, buffer, length) file system Client network OS kernel OS kernel On-disk cache COMP 734 – Fall 2011
(6) - close request passed to Andrew client (7) - client writes whole file back to server from cache (8) - server copy of file is entirely replaced Server (6) file system (7) (8) Andrew File Access (Close-Write) application close(fh) file system Client network OS kernel OS kernel On-disk cache COMP 734 – Fall 2011
Andrew Cache Operation and Consistency Semantics • Directory lookup uses valid cache copy; directory updates (e.g., create or remove) to cache “write-through” to server. • When file or directory data is fetched, the server “guarantees” to notify (callback) the client before changing the server’s copy • Cached data is used without checking until a callback is received for it or 10 minutes has elapsed without communication with its server • On receiving a callback for a file or directory, the client invalidates the cached copy • Cached data can also be revalidated (and new callbacks established) by the client with an RPC to the server • avoids discarding all cache content after network partition or client crash COMP 734 – Fall 2011
Andrew Benchmark -- Server CPU Utilization Source: Howard, et al, “Scale and Performance in a Distributed File System”, ACM TOCS, vol. 6, no. 1, February 1988, pp. 51-81. NFS server utilization saturates quickly with load Andrew server utilization increases slowly with load COMP 734 – Fall 2011
Google is Really Different…. The Dalles, OR (2006) • Huge Datacenters in 30+ Worldwide Locations • Datacenters house multiple server clusters • Even nearby in Lenior, NC each > football field 4 story cooling towers 2008 2007 COMP 734 – Fall 2011
Google is Really Different…. • Each cluster has hundreds/thousands of Linux systems on Ethernet switches • 500,000+ total servers COMP 734 – Fall 2011
Custom Design Servers Clusters of low-cost commodity hardware • Custom design using high-volume components • SATA disks, not SAS (high capacity, low cost, somewhat less reliable) • No “server-class” machines Battery power backup COMP 734 – Fall 2011
Facebook Enters the Custom Server Race (April 7, 2011) Announces the Open Compute Project (the Green Data Center) COMP 734 – Fall 2011
Google File System Design Goals • Familiar operations but NOT Unix/Posix Standard • Specialized operation for Google applications • record_append() • Scalable -- O(1000s) of clients per cluster • Performance optimized for throughput • No client caches (big files, little cache locality) • Highly available and fault tolerant • Relaxed file consistency semantics • Applications written to deal with consistency issues COMP 734 – Fall 2011
File and Usage Characteristics • Many files are 100s of MB or 10s of GB • Results from web crawls, query logs, archives, etc. • Relatively small number of files (millions/cluster) • File operations: • Large sequential (streaming) reads/writes • Small random reads (rare random writes) • Files are mostly “write-once, read-many.” • File writes are dominated by appends, many from hundreds of concurrent processes (e.g., web crawlers) process Appended file process process COMP 734 – Fall 2011
GFS Basics • Files named with conventional pathname hierarchy (but no actual directory files) • E.g., /dir1/dir2/dir3/foobar • Files are composed of 64 MB “chunks” (Linux typically uses 4 KB blocks) • Each GFS cluster has many servers (Linux processes): • One primary Master Server • Several “Shadow” Master Servers • Hundreds of “Chunk” Servers • Each chunk is represented by a normal Linux file • Linux file system buffer provides caching and read-ahead • Linux file system extends file space as needed to chunk size • Each chunk is replicated (3 replicas default) • Chunks are check-summed in 64KB blocks for data integrity COMP 734 – Fall 2011
GFS Protocols for File Reads Minimizes client interaction with master: - Data operations directly with chunk servers. - Clients cache chunk metadata until new open or timeout Ghemawat, S., H. Gobioff, and S-T. Leung, The Google File System, Proceedings of ACM SOSP 2003, pp. 29-43 COMP 734 – Fall 2011
Master Server Functions • Maintain file name space (atomic create, delete names) • Maintain chunk metadata • Assign immutable globally-unique 64-bit identifier • Mapping from files name to chunk(s) • Current chunk replica locations • Refresh dynamically from chunk servers • Maintain access control data • Manage replicas and other chunk-related actions • Assign primary replica and version number • Garbage collect deleted chunks and stale replicas • Migrate chunks for load balancing • Re-replicate chunks when servers fail • Heartbeat and state-exchange messages with chunk servers COMP 734 – Fall 2011
GFS Relaxed Consistency Model • Writes that are large or cross chunk boundaries may be broken into multiple smaller ones by GFS • Sequential writes (successful): • One copy semantics*, writes serialized. • Concurrent writes (successful): • One copy semantics • Writes not serialized in overlapping regions • Sequential or concurrent writes (failure): • Replicas may differ • Application should retry All replicas equal *Informally, there exists exactly one current value at all replicas and that value is returned for a read of any replica COMP 734 – Fall 2011
GFS Applications Deal with Relaxed Consistency • Writes • Retry in case of failure at any replica • Regular checkpoints after successful sequences • Include application-generated record identifiers and checksums • Reads • Use checksum validation and record identifiers to discard padding and duplicates. COMP 734 – Fall 2011
GFS Chunk Replication (1/2) LRU buffers at chunk servers Master 1. Client contacts master to get replica state and caches it FindLocation C1,C2(primary),C3 1 2 C1 ACK Client primary Client 1 2 C2 2. Client picks any chunk server and pushes data. Servers forward data along “best” path to others. ACK 1 2 C3 ACK COMP 734 – Fall 2011
1 2 1 2 1 2 GFS Chunk Replication (2/2) Master 4. Primary assigns write order and forwards to replicas 3. Client sends write request to primary 1 2 C1 ACK Client Write write order Write Client success/failure C2 success/failure write order ACK 5. Primary collects ACKs and responds to client. Applications must retry write if there is any failure. 1 2 C3 COMP 734 – Fall 2011
GFS record_append() • Client specifies only data content and region size; server returns actual offset to region • Guaranteed to append at least once atomically • File may contain padding and duplicates • Padding if region size won’t fit in chunk • Duplicates if it fails at some replicas and client must retry record_append() • If record_append() completes successfully, all replicas will contain at least one copy of the region at the same offset COMP 734 – Fall 2011
GFS Record Append (1/3) LRU buffers at chunk servers Master 1. Client contacts master to get replica state and caches it FindLocation C1,C2(primary),C3 1 2 C1 ACK Client primary Client 1 2 C2 2. Client picks any chunk server and pushes data. Servers forward data along “best” path to others. ACK 1 2 C3 ACK COMP 734 – Fall 2011
1@ 2@ 1@ 2@ 1 2 GFS Record Append (2/3) Master 4. If record fits in last chunk, primary assigns write order and offset and forwards to replicas 3. Client sends write request to primary 1 2 C1 ACK Client Write write order Write Client offset/failure C2 success/failure write order ACK 1 2 5. Primary collects ACKs and responds to client with assigned offset. Applications must retry write if there is any failure. C3 COMP 734 – Fall 2011
1 2 GFS Record Append (3/3) 4. If record overflows last chunk, primary and replicas pad last chunk and offset points to next chunk Master 3. Client sends write request to primary 1 2 C1 Client Write Pad to next chunk Retry on next chunk C2 Pad to next chunk 1 2 5. Client must retry write from beginning C3 COMP 734 – Fall 2011