Distributed File Systems
DFS
• A distributed file system is a module that implements a common file system shared by all nodes in a distributed system
• A DFS should offer
  • network transparency
  • high availability
• key DFS services
  • file server (stores and reads/writes files)
  • name server (maps names to stored objects)
  • cache manager (file caching at clients or servers)
DFS Mechanisms
• Mounting
• Caching
• Hints
• Bulk data transfers
DFS Mechanisms
• mounting
  • name space = a collection of names of stored objects, which may or may not share a common name resolution mechanism
  • mounting binds a name space to a name (the mount point) in another name space
  • mount tables maintain the map of mount points to stored objects
  • mount tables can be kept at clients or at servers (see the sketch below)
• caching
  • amortizes the access cost of remote or disk data over many references
  • can be done at clients and/or servers
  • caches can be in main memory or on disk
  • helps to reduce delays (disk or network) in accessing stored objects
  • helps to reduce server loads and network traffic
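The following is a minimal, illustrative sketch (not taken from any particular DFS) of a client-side mount table that maps mount points to file servers: resolving a pathname means finding the longest matching mount point and rewriting the rest of the path relative to that server's exported directory. All server names and paths are hypothetical.

```python
# Illustrative client-side mount table: mount point -> (file server, path on that server).
MOUNT_TABLE = {
    "/":            ("localfs",     "/"),
    "/home":        ("fileserver1", "/export/home"),
    "/home/shared": ("fileserver2", "/vol/shared"),
}

def resolve(path):
    """Pick the longest mount point that is a prefix of the path and
    rewrite the path relative to that server's exported directory."""
    matches = [mp for mp in MOUNT_TABLE
               if path == mp or path.startswith(mp.rstrip("/") + "/")]
    mount_point = max(matches, key=len)
    server, export = MOUNT_TABLE[mount_point]
    suffix = path[len(mount_point):].lstrip("/")
    return server, export.rstrip("/") + ("/" + suffix if suffix else "") or "/"

print(resolve("/home/shared/notes.txt"))   # ('fileserver2', '/vol/shared/notes.txt')
```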
DFS Mechanisms
• hints
  • caching introduces the problem of cache consistency
  • ensuring cache consistency is expensive
  • cached information can instead be treated as a hint (e.g., the mapping of a name to a stored object)
• bulk data transfers
  • the overhead of executing network protocols is high, while network transit delays are small
  • solution: amortize protocol-processing overhead, disk seek times, and latencies over many file blocks
Name Resolution Issues
• naming schemes
  • host:filename
    • simple and efficient
    • no location transparency
  • mounting
    • single global name space
    • uniqueness of names requires cooperating servers
  • context-aware
    • partition the name space into contexts
    • name resolution is always performed with respect to a given context
• name servers
  • single name server
  • different name servers for different parts of the name space
Caching Issues
• main memory caches
  • faster access
  • diskless clients can also use caching
  • a single design works for both client and server caches
  • compete with the virtual memory manager for physical memory
  • cannot completely cache large stored objects
  • block-level caching is complex to implement
  • cannot be used by portable clients
• disk caches
  • remove some of the drawbacks of main memory caches
Caching Issues
• writing policy (see the sketch below)
  • write-through
    • every client write request is performed at the server immediately
  • delayed writing
    • a client's writes are reflected in the stored objects at the servers after some delay
    • many writes are absorbed in the cache, and writes to short-lived objects are never sent to the servers (20-30% of new data are deleted within 30 secs)
    • data lost on a client crash is an issue
  • delayed writing until file close
    • most files are open only for a short time
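Below is a minimal, hypothetical sketch contrasting a write-through cache with a delayed-write cache that flushes only at file close; the Server stub and all names are made up for illustration.

```python
class Server:
    """Stand-in for a file server (illustrative only)."""
    def write_block(self, block_id, data):
        print(f"server stores block {block_id}")

class WriteThroughCache:
    """Every write reaches the server immediately: nothing is lost on a
    client crash, but every write pays the full network round trip."""
    def __init__(self, server):
        self.server, self.blocks = server, {}

    def write(self, block_id, data):
        self.blocks[block_id] = data
        self.server.write_block(block_id, data)     # synchronous update

class DelayedWriteCache:
    """Writes stay in the client cache and are flushed only at file close:
    short-lived data may never reach the server, at the risk of losing
    recent writes if the client crashes first."""
    def __init__(self, server):
        self.server, self.dirty = server, {}

    def write(self, block_id, data):
        self.dirty[block_id] = data                 # absorbed locally

    def close(self):
        for block_id, data in self.dirty.items():   # flush everything on close
            self.server.write_block(block_id, data)
        self.dirty.clear()

cache = DelayedWriteCache(Server())
cache.write(1, b"draft")      # nothing sent yet
cache.write(1, b"final")      # overwrites the draft in the cache
cache.close()                 # only the final version reaches the server
```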
Caching Issues
• approaches to the cache consistency problem
  • server-initiated (see the sketch below)
    • servers inform client cache managers whenever their cached data become stale
    • servers need to keep track of who cached which file blocks
  • client-initiated
    • clients validate cached data with the servers before using it
    • partially negates the benefits of caching
  • disable caching when concurrent-write sharing is detected
    • concurrent-write sharing: multiple clients have the file open, with at least one of them having opened it for writing
  • avoid concurrent-write sharing by using locking
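A minimal sketch of the server-initiated approach, using a toy in-process client/server pair (all classes and names are hypothetical): the server remembers which clients cache each file and calls them back when a write makes their copies stale.

```python
from collections import defaultdict

class Server:
    """Server-initiated consistency (illustrative): the server remembers which
    clients cached each file and tells them when their copies become stale."""
    def __init__(self):
        self.data = {}
        self.cachers = defaultdict(set)          # filename -> clients holding a copy

    def read(self, client, name):
        self.cachers[name].add(client)
        return self.data.get(name)

    def write(self, client, name, value):
        self.data[name] = value
        for other in self.cachers[name] - {client}:
            other.invalidate(name)               # callback: "your copy is stale"
        self.cachers[name] = {client}

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def read(self, name):
        if name not in self.cache:               # miss: fetch and cache
            self.cache[name] = self.server.read(self, name)
        return self.cache[name]

    def write(self, name, value):
        self.cache[name] = value
        self.server.write(self, name, value)

    def invalidate(self, name):
        self.cache.pop(name, None)

srv = Server()
a, b = Client(srv), Client(srv)
a.write("f", 1); b.read("f")                     # b caches f
a.write("f", 2)                                  # server invalidates b's copy
print(b.read("f"))                               # 2 (refetched from the server)
```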
More Cache Consistency Issues
• the sequential-write sharing problem
  • occurs when a client opens a file that has recently been modified and closed by another client
  • problems it causes
    • the opening client may still have outdated file blocks in its cache
    • the other client may not yet have written its modified cached blocks back to the file server
  • solutions (see the sketch below)
    • associate file timestamps with all cached file blocks; at file open, request the current file timestamp from the file server and discard blocks cached under an older timestamp
    • when another client opens the file, the file server asks the client holding the modified cached blocks to flush them to the server
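A minimal sketch of the first solution, with hypothetical FileServer/ClientCache classes: every cached block is tagged with the file timestamp the client knew when it cached the block, and at open the client compares that with the server's current timestamp and drops stale blocks.

```python
class FileServer:
    """Toy server state: a per-file modification timestamp (illustrative)."""
    def __init__(self):
        self.mtime = {}                   # filename -> timestamp of last write

    def get_mtime(self, name):
        return self.mtime.get(name, 0)

class ClientCache:
    """At file open, compare the server's current timestamp for the file with
    the timestamp attached to the cached blocks; on a mismatch (sequential-write
    sharing: someone else modified and closed the file) drop all cached blocks."""
    def __init__(self, server):
        self.server = server
        self.blocks = {}                  # (filename, block#) -> data
        self.cached_mtime = {}            # filename -> timestamp blocks were cached under

    def open(self, name):
        current = self.server.get_mtime(name)
        if self.cached_mtime.get(name) != current:
            self.blocks = {k: v for k, v in self.blocks.items() if k[0] != name}
        self.cached_mtime[name] = current

srv = FileServer()
cache = ClientCache(srv)
cache.blocks[("f", 0)] = b"old"; cache.cached_mtime["f"] = 1
srv.mtime["f"] = 2                        # another client wrote and closed the file
cache.open("f")
print(cache.blocks)                       # {} -- outdated blocks were discarded
```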
Availability Issues
• replication can help increase data availability
  • but it is expensive, due to the extra storage for replicas and the overhead of keeping the replicas consistent
• main problems
  • maintaining replica consistency
  • detecting replica inconsistencies and recovering from them
  • handling network partitions
  • placing replicas where they are needed
  • keeping the rate of deadlocks small and availability high
Availability Issues
• unit of replication
  • complete file or file block
    • allows replication of only the data that are needed
    • replica management is harder (locating replicas, ensuring file protection, etc.)
  • volume (group) of files
    • wasteful if many of the files are not needed
    • replica management is simpler
  • pack: a subset of the files in a user's primary pack
• mutual consistency among replicas (see the voting sketch below)
  • let the most current replica = the replica with the highest timestamp in a quorum
  • use voting to read/write replicas and keep at least one replica current
  • only votes from the most current replicas are valid
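A minimal sketch of quorum voting, under the usual assumption that read and write quorums overlap (r + w greater than the number of replicas); the Replica class and the quorum sizes are illustrative.

```python
import random

class Replica:
    def __init__(self):
        self.timestamp, self.value = 0, None

def quorum_read(replicas, r):
    """Poll a read quorum of r replicas; the one with the highest timestamp is
    taken as the most current (r + w > len(replicas) guarantees overlap with
    the last write quorum)."""
    quorum = random.sample(replicas, r)
    return max(quorum, key=lambda rep: rep.timestamp)

def quorum_write(replicas, w, value):
    """Write to a write quorum of w replicas with a timestamp higher than any
    timestamp seen in that quorum (overlapping write quorums make this the
    highest timestamp overall)."""
    quorum = random.sample(replicas, w)
    new_ts = max(rep.timestamp for rep in quorum) + 1
    for rep in quorum:
        rep.timestamp, rep.value = new_ts, value
    return new_ts

replicas = [Replica() for _ in range(5)]
quorum_write(replicas, w=3, value="v1")
print(quorum_read(replicas, r=3).value)   # 'v1' -- quorums of size 3 always overlap
```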
Scalability & Semantic Issues
• caching & cache consistency
  • take advantage of file usage patterns
    • many widely used and shared files are accessed in read-only mode
    • the data a client needs are often found in another client's cache
  • organize the client caches and file servers of each file into a hierarchy
  • implement file servers, name servers, and cache managers as multithreaded processes
• common file system semantics: each read operation returns the data written by the most recent write operation
  • providing these semantics in a DFS is difficult and expensive
NFS
• interfaces
  • file system
  • virtual file system (VFS)
    • vnodes uniquely identify objects in the FS
    • vnodes contain mount table info (pointers to the parent FS and to the mounted FS)
  • RPC and XDR (external data representation)
NFS Naming and Location
• filenames are mapped to the objects they represent at first use
• the mapping is done at the servers by sequentially resolving each element of the pathname, using the vnode information, until a file handle is obtained
NFS Caching
• file caching (see the sketch below)
  • read-ahead and 8KB file blocks are used
  • files or file blocks are cached together with the timestamp of their last update
  • cached blocks are assumed valid for a preset time period
  • block validation is performed at the server at file open and after the timeout expires
  • upon detecting an invalid block, all cached blocks of that file are discarded
  • delayed writing policy, with modified blocks flushed to the server upon file close
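A minimal, hypothetical sketch of this validation scheme (the 3-second trust window and the server stub are illustrative choices, not parameters taken from an NFS specification): cached blocks are trusted for a preset interval, and at open or after the timeout the file's modification time is rechecked at the server; on a mismatch every cached block of that file is discarded.

```python
import time

FRESHNESS = 3.0   # seconds a cached block is trusted without revalidation (illustrative)

class StubServer:
    """Stand-in for an NFS server exposing attributes and block reads."""
    def __init__(self):
        self.mtimes, self.blocks = {}, {}
    def getattr(self, name):
        return self.mtimes.get(name, 0)
    def read_block(self, name, blockno):
        return self.blocks.get((name, blockno))

class NfsLikeClientCache:
    def __init__(self, server):
        self.server = server
        self.blocks = {}      # (name, block#) -> data
        self.meta = {}        # name -> (mtime when cached, time of last validation)

    def _validate(self, name):
        mtime, checked = self.meta.get(name, (None, 0.0))
        if time.time() - checked < FRESHNESS:
            return                                  # still inside the trust window
        current = self.server.getattr(name)         # revalidate at the server
        if current != mtime:                        # file changed: drop all its blocks
            self.blocks = {k: v for k, v in self.blocks.items() if k[0] != name}
        self.meta[name] = (current, time.time())

    def read(self, name, blockno):
        self._validate(name)                        # at open / after the timeout
        key = (name, blockno)
        if key not in self.blocks:
            self.blocks[key] = self.server.read_block(name, blockno)
        return self.blocks[key]
```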
NFS Caching
• directory name lookup caching
  • maps directory names ==> vnodes
  • cached entries are updated upon lookup failure or when new info is received
• file/directory attribute cache (see the sketch below)
  • access to file/dir attributes accounts for 90% of file requests
  • file attributes are discarded after 3 secs
  • dir attributes are discarded after 30 secs
  • dir changes are performed at the server
• NFS servers are stateless
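A minimal sketch of an attribute cache with the per-type discard intervals mentioned above (3 seconds for files, 30 seconds for directories); the class and the server interface are hypothetical.

```python
import time

TTL = {"file": 3.0, "dir": 30.0}      # discard intervals from the slide

class AttributeCache:
    """Caches file and directory attributes and drops them after a
    type-specific time-to-live, since attribute requests dominate the load."""
    def __init__(self, server):
        self.server = server          # assumed to expose getattr(name)
        self.entries = {}             # name -> (attrs, kind, time cached)

    def getattr(self, name, kind):
        entry = self.entries.get(name)
        if entry is not None:
            attrs, k, cached_at = entry
            if time.time() - cached_at < TTL[k]:
                return attrs          # still fresh
        attrs = self.server.getattr(name)      # expired or missing: refetch
        self.entries[name] = (attrs, kind, time.time())
        return attrs
```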
Sprite File System
• the name space is a single hierarchy of domains
• each server stores one or more domains
• domains have unique prefixes
• mount points link domains into the single hierarchy
• clients maintain a prefix table
Sprite FS - Prefix Tables
• locating files in Sprite (see the sketch below)
  • each client finds the longest prefix match in its prefix table and sends the remainder of the pathname to the matching server, together with the domain token from its prefix table
  • the server replies with a file token, or with a new pathname if the "file" is a remote link
  • each client request contains the filename and the domain token
• when a client fails to find a matching prefix, or a file open fails
  • the client broadcasts the pathname, and the server with the matching domain replies with the domain/file token
• entries in the prefix table are hints
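A minimal sketch of prefix-table lookup with made-up servers, tokens, and paths; the broadcast fallback and the treatment of entries as hints are only indicated in comments.

```python
# Illustrative prefix table kept by a Sprite client: domain prefix -> (server, domain token).
prefix_table = {
    "/":            ("rootserver", "tok-root"),
    "/users":       ("server-a",   "tok-users"),
    "/users/proj":  ("server-b",   "tok-proj"),
}

def open_file(path):
    """Find the longest matching prefix, then send the remaining pathname plus
    the domain token to that server.  Entries are hints: on failure the client
    would drop the entry and broadcast the pathname to rebuild it (not shown)."""
    matches = [p for p in prefix_table
               if path == p or path.startswith(p.rstrip("/") + "/")]
    if not matches:
        raise LookupError("no prefix matches: broadcast the pathname to locate the domain")
    prefix = max(matches, key=len)
    server, token = prefix_table[prefix]
    remainder = path[len(prefix):].lstrip("/")
    return server, token, remainder        # the server replies with a file token

print(open_file("/users/proj/src/main.c"))
# ('server-b', 'tok-proj', 'src/main.c')
```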
Sprite FS - Caching
• client caches are kept in main memory
  • the file block size is 4KB
  • cache entries are addressed with (file token, block#), which allows
    • blocks to be added without contacting the server
    • blocks to be accessed without consulting the file's disk map to get the block's disk address
  • clients do not cache directories, to avoid inconsistencies
• servers have main memory caches as well
• a delayed writing policy is used
Sprite FS - Cache Writing Policy
• observations
  • BSD
    • 20-30% of new data live less than 30 secs
    • 75% of files are open for less than 0.5 secs
    • 90% of files are open for less than 10 secs
  • a more recent study
    • 65-80% of files are open for less than 30 secs
    • 4-27% of new data are deleted within 30 secs
• traffic can be reduced by
  • not updating the servers immediately at file close
  • not updating the servers when caches are updated
Sprite Cache Writing Policy
• delayed writing policy (see the sketch below)
  • every 5 secs, flush the client's cached (modified) blocks to the server if they haven't been modified within the last 30 secs
  • the server flushes blocks from its cache to disk within 30-60 secs afterwards
• replacement policy: LRU
  • 80% of the time blocks are ejected to make room for other blocks
  • 20% of the time to return memory to VM
  • cache blocks are unreferenced for about 1 hour before being ejected
• cache misses
  • 40% on reads and 1% on writes
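A minimal sketch of the periodic flusher implied by this policy; the class and the server interface are illustrative, while the 5-second scan interval and 30-second age threshold are the values from the slide.

```python
import time

SCAN_INTERVAL = 5.0     # how often the flusher runs (from the slide)
AGE_THRESHOLD = 30.0    # flush blocks untouched for this long (from the slide)

class SpriteLikeCache:
    """Illustrative delayed-write policy: a periodic pass flushes to the server
    every dirty block that has not been modified within the last 30 seconds."""
    def __init__(self, server):
        self.server = server               # assumed to expose write_block(id, data)
        self.dirty = {}                    # block id -> (data, time of last modification)

    def write(self, block_id, data):
        self.dirty[block_id] = (data, time.time())

    def flush_pass(self):
        now = time.time()
        for block_id, (data, modified) in list(self.dirty.items()):
            if now - modified >= AGE_THRESHOLD:
                self.server.write_block(block_id, data)
                del self.dirty[block_id]

    def run(self):                         # the background flusher
        while True:
            self.flush_pass()
            time.sleep(SCAN_INTERVAL)
```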
Sprite Cache Consistency
• server-initiated
  • avoid concurrent-write sharing by disabling caching for files that are open concurrently for reading and writing
    • ask the client writing the file to flush its blocks
    • inform all other clients that the file is not cacheable
    • the file becomes cacheable again once all clients have closed it
  • solve sequential-write sharing using version numbers
    • each client keeps the version# of each file whose blocks it caches
    • the server increments the version# each time the file is opened for writing
    • the client is informed of the file's version# at file open
    • the server keeps track of the last writer; the server asks the last writer to flush its cached blocks when the file is opened by another client
Sprite VM and FS Cache Contention
• VM and FS compete for physical memory
• VM and FS negotiate over physical memory usage (see the sketch below)
  • separate pools of blocks, using the time of last access to determine the winner; VM is given a slight preference (it loses only if a block hasn't been referenced for 20 mins)
• double caching is a problem
  • FS marks the blocks of newly compiled code with an infinite time of last reference
  • backing files = swapped-out pages (including process state and data segments)
  • clients bypass the FS cache when reading/writing backing files
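A minimal, hypothetical sketch of that negotiation: each side reports the last-access time of its least recently used block, and VM gives up a page only if that page is both older than the FS block and unreferenced for more than 20 minutes. Sprite's actual mechanism is more involved; this only illustrates the preference rule stated above.

```python
import time

VM_GRACE = 20 * 60   # VM loses only if its page is unreferenced this long (from the slide)

def pick_victim(vm_lru_time, fs_lru_time, now=None):
    """Decide which module gives up a physical page when memory is tight.
    Both report the last-access time of their least recently used block;
    VM is slightly preferred: it loses only if its oldest page has been
    unreferenced for more than 20 minutes and is older than the FS block."""
    now = time.time() if now is None else now
    vm_idle, fs_idle = now - vm_lru_time, now - fs_lru_time
    if vm_idle > VM_GRACE and vm_idle > fs_idle:
        return "VM"          # virtual memory gives up its oldest page
    return "FS"              # otherwise the file cache shrinks

now = time.time()
print(pick_victim(now - 5 * 60, now - 60 * 60, now))    # 'FS': VM page too recent
print(pick_victim(now - 90 * 60, now - 60 * 60, now))   # 'VM': very old VM page
```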
CODA
• goals
  • scalability
  • availability
  • disconnected operation
• volume = a collection of files and directories on a single server
  • the unit of replication
• FS objects have a unique FID, which consists of
  • a 32-bit volume number
  • a 32-bit vnode number
  • a 32-bit uniquifier
• replicas of a FS object have the same FID
CODA Location
• volume location database
  • replicated at each server
• volume replication database
  • replicated at each server
• Volume Storage Group (VSG): the set of servers holding replicas of a volume
• Venus
  • the client cache manager
  • caches on the client's local disk
• AVSG = the nodes of the VSG currently accessible to the client
  • the client designates a preferred server in the AVSG
CODA Caching & Replication
• Venus caches files/dirs on demand
  • from the server in the AVSG with the most up-to-date data
  • on file access
  • users can indicate caching priorities for files/dirs
  • users can bracket action sequences
• Venus establishes callbacks at the preferred server for each cached FS object
• server callbacks
  • the server tells the client that a cached object is invalid
  • lost callbacks can happen
CODA AVSG Maintenance
• Venus tracks changes in the AVSG
  • detects nodes in the VSG that should or should not be in its AVSG by periodically probing every node in the VSG
  • removes a node from the AVSG if an operation on it fails
  • chooses a new preferred server if needed
• Coda Version Vector (CVV)
  • kept both for volumes and for files/dirs
  • a vector with one entry per node in the VSG, indicating the number of updates to the volume or FS object seen by that node
Coda Replica Management
• state of an object or replica
  • each modification is tagged with a storeid
  • update history = the sequence of storeids
  • the state is a truncated update history, consisting of
    • the latest storeid (LSID)
    • the CVV
Coda Replica Management
• comparing the states of replicas A & B leads to one of four cases (see the sketch below)
  • LSID-A = LSID-B and CVV-A = CVV-B => strong equality
  • LSID-A = LSID-B and CVV-A != CVV-B => weak equality
  • LSID-A != LSID-B and CVV-A >= CVV-B => A dominates B
  • otherwise => inconsistent
• when a server S receives an update based on a client's copy C of an object
  • S compares the state of its replica with the state of C; the check succeeds if
    • for files, it yields strong equality or dominance
    • for dirs, it yields strong equality
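A minimal sketch of the four-way comparison (the symmetric case in which B dominates A is included for completeness); the storeids and vectors in the example are made up.

```python
def compare_replicas(lsid_a, cvv_a, lsid_b, cvv_b):
    """Illustrative comparison of two replica states (latest storeid + Coda
    version vector), following the cases on the slide."""
    if lsid_a == lsid_b:
        return "strong equality" if cvv_a == cvv_b else "weak equality"
    if all(x >= y for x, y in zip(cvv_a, cvv_b)):
        return "A dominates B"
    if all(y >= x for x, y in zip(cvv_a, cvv_b)):
        return "B dominates A"          # symmetric case, implied by the slide
    return "inconsistent"

# Two servers in a 3-node VSG; A has seen one more update than B.
print(compare_replicas("s2", [2, 1, 1], "s1", [1, 1, 1]))   # A dominates B
print(compare_replicas("s2", [2, 1, 1], "s3", [1, 1, 2]))   # inconsistent
```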
Coda Replica Management
• when a client C wants to update a replicated object
  • phase I
    • C sends the update to every node in its AVSG
    • each node checks the replica states (of the client's cached object and of its own replica), informs the client of the result, and performs the update if the check succeeds
    • if the check fails, the client pauses and the server tries to resolve the problem automatically; if it cannot, the client aborts, otherwise the client resumes
  • phase II
    • the client sends the updated object state to every site in its AVSG
Coda Replica Management
• force operation (between servers)
  • happens when Venus informs the AVSG of weak consistency within the AVSG
  • the server with the dominant replica overwrites the data and state of the dominated server
  • for directories, this is done with the help of locking, one directory at a time
• repair operation
  • automatic; proceeds in two phases, as in an update
• migrate operation
  • moves inconsistent data to a covolume for manual repair
Conflict Resolution
• conflicts between files
  • are resolved by the user with the repair tool, which bypasses the Coda update rules; inconsistent files are inaccessible through CODA
• conflicts between directories
  • resolution uses the fact that a directory is a list of files
  • non-automated conflicts
    • update/update (for attributes)
    • remove/update
    • create/create (adding identical files)
  • all other conflicts can be resolved easily
• inconsistent objects and objects without automatic conflict resolution are placed in covolumes