Distributed File System

Distributed File System • File system spread over multiple, autonomous computers. • A distributed file system should provide: • Network transparency: hide the details of where a file is located. • High availability: ease of accessibility irrespective of the physical location of the file. • This objective is difficult to achieve because the distributed file system is vulnerable to problems in underlying networks as well as crashes of systems that are the “file sources”. • Replication / mirroring can be used to alleviate the above problem. • However, replication/mirroring introduces additional issues such as consistency. B. Prabhakaran

DFS: Architecture • In general, files in a DFS can be located in “any” system. We call the “source(s)” of files to be servers and those accessing them to be clients. • Potentially, a server for a file can become a client for another file. • However, most distributed systems distinguish between clients and servers in more strict way: • Clients simply access files and do not have/share local files. • Even if clients have disks, they (disks) are used for swapping, caching, loading the OS, etc. • Servers are the actual sources of files. • In most cases, servers are more powerful machines (in terms of CPU, physical memory, disk bandwidth, ..) B. Prabhakaran

DFS: Architecture … … … … Server Server …. Server Computer Network Client Client B. Prabhakaran

DFS Data Access Request to Access data Load data to client cache Load server cache Return data to client Check client cache Data present Issue disk read Data Not present Data Not present Check Local disk (if any) Check Server cache Data present Data present Data Not present Send request to File server Network B. Prabhakaran

Mechanisms for DFS • Mounting: to help in combining files/directories in different systems and form a single file system structure. • Caching: to reduce the response time in bringing data from remote machines. • Hints: modified caching • Bulk data transfer: helps in reducing the delay due to transfer of files over the network. Bulk: • Obtain multiple number of blocks with a single seek • Format, transfer large number of packets in a single context switch. • Reduce the number of acknowledgements to be sent. • (e.g.,) useful when downloading OS onto a diskless client. • Encryption: Establish a key for encryption with the help of an authentication server. B. Prabhakaran

Mounting • Mounting helps to build a hierarchy of file directories. • A collection of files can be mounted at an internal node of the hierarchy. • Node at which this collection of files is mounted: mount point. • Operating systems kernel maintains a structure called the mount table, mapping mount points to appropriate storage devices. • Mount table can be maintained at: • Each client. Employed in Sun Network File System (NFS). • Servers. All clients see the same file system structure. Employed in Sprite file system. B. Prabhakaran

Name Space Hierarchy Server X Root (/) Mount Points a c b Server Z Server Y g e h d f i B. Prabhakaran

Caching • Performance of distributed file system, in terms of response time, depends on the ability to “get” the files to the user. • When files are in different servers, caching might be needed to improve the response time. • A copy of data (in files) is brought to the client (when referenced). Subsequent data accesses are made on the client cache. • Client cache can be on disk or main memory. • Data cached may include future blocks that may be referenced too. • Caching implies DFS needs to guarantee consistency of data. B. Prabhakaran

Hints • Hints can be used when cached data need not be completely be accurate. • Example: Mapping of the name of a file/directory to the actual physical device. The address/name of device can be stored as a hint. • If this address fails to access the requested file, the cached data can be purged. • The file server can refer to a name server, determine the actual location of file/directory, and update the cache. • In hints, a cache is neither updated nor invalidated when a change occurs to the content. B. Prabhakaran

Design Issues • Naming: Locating the file/directory in a DFS based on name. • Location of cache: disk, main memory, both. • Writing policy: Updating original data source when cache content gets modified. • Cache consistency: Modifying cache when data source gets modified. • Availability: More copies of files/resources. • Scalability: Ability to handle more clients/users. • Semantics: Meaning of different operations (read, write,…) B. Prabhakaran

Naming • Name space: (e.g.,) /home/students/jack, /home/staff/jill. • Name space is a collection of names. • Location transparency: file names do not indicate their physical locations. • Name resolution: mapping name space to an object/device/file/directory. • Naming approaches: • Simple Concatenation: add hostname to file names. • Guarantees unique names. • No transparency. Moving a file to another host involves a file name change. B. Prabhakaran

Naming: Approaches ... • Naming approaches: • ..... • Mounting: mount remote directories to local ones. Location transparent after mounting. (followed in Sun NFS). • Example: /students is mounted at /home. • Remember: different clients in the system can mount in different ways. (e.g.,) In client 1: mount /students at /. i.e., /students/jack, /students/jill. In client 2: mount /students at /usr, i.e., /usr/students/jack, /usr/students/jill. • Single Global Directory:all files in the system belong to a single name space. (followed in Sprite OS). • System wide unique names, i.e., all clients mount the same way. • Difficult to enforce this restriction. Can work only among (highly) cooperating systems (or system administrators !) B. Prabhakaran

Naming: Context • Context: identifying the name space within which name resolution is to be done. • Example: context using ~ (tilde). • ~jill/t: /home/staff/jill/t • ~john/t: /home/students/john/t • ~name: represents the directory structure associated with a person or a project. • Whenever file “t” is accessed, it is interpreted with reference to ~’s environment. • ~ helps when different clients mount in different ways, still sharing the same of users and their home directories. • (e.g.,) ~john may be mapped to /home/students/john in client 1 and to /usr/students/john in client 2. B. Prabhakaran

Name Resolution • Done by name servers that map file names to actual files. • Centralized name server: send names to the server and get the path of servers+devices that lead to the requested file. • Name server becomes a bottle neck. • Distributed name server: (e.g.,) consider access to a file /a/b/c/d/e • Local name server identifies the remote server that handles the part /b/c/d/e • This procedure may be recursively done till ../e is resolved. B. Prabhakaran

Caching • In main memory: • Faster than disks. • Diskless workstations can also cache. • Server-cache is in main memory -> same design can be used in clients also. • Disadvantage: clients need main memory for virtual memory management too. • In disks: • Large files can be cached. • Virtual memory management is straight forward. • After caching the necessary files, the client can get disconnected from network (if needed, for instance, to help its mobility). B. Prabhakaran

Writing Policy • When should a modified cache content be transferred to the server? • Write-through policy: • Immediate writing at server when cache content is modified. • Advantage: reliability, crash of cache (client) does not mean loss of data. • Disadvantage: Several writes for each small change. • Delayed writing policy: • Write at the server, after a delay. • Advantage: small/frequent changes do not increase network traffic. • Disadvantage: less reliable, susceptible to client crashes. • Write at the time of file closing. B. Prabhakaran

Cache Consistency • When should a modified source content be transferred to the cache? • Server-initiated policy: • Server cache manager informs client cache managers that can then retrieve the data. • Client-initiated policy: • Client cache manager checks the freshness of data before delivering to users. Overhead for every data access. • Concurrent-write sharing policy: • Multiple clients open the file, at least one client is writing. • File server asks other clients to purge/remove the cached data for the file, to maintain consistency. • Sequential-write sharing policy: a client opens a file that was recently closed after writing. B. Prabhakaran

Cache Consistency ... • Sequential-write sharing policy: a client opens a file that was recently closed after writing. • This client may have outdated cache blocks of the file (since the other client might have modified the file contents). • Use time stamps for both cache and files. Compare the time stamps to know the freshness of blocks. • The other client (which was writing previously) may still have modified data in its cache that has not yet been updated on server. (e.g.,) due to delayed writing. • Server can force the previous client to flush its cache whenever a new client opens the file. B. Prabhakaran

Availability • Intention: overcome the failure of servers or network links. • Solution: replication, i.e., maintain copies of files at different servers. • Issues: • Maintaining consistency • Detecting inconsistencies, if they happen despite best efforts. Possible reasons for such inconsistencies: • Replica is not updated due to a server failure or a broken network link. • Inconsistency problems and their recovery may reduce the benefit of replication. B. Prabhakaran

Availability: Replication • Unit of replication: is mostly a file. • Replicas of a file in a directory may be handled by different servers, requiring extra name resolutions to locate the replicas. • Replication unit: group of files: • Advantage: process of name resolution, etc., to locate replicas can be done for a set of files and not for individual files. • Disadvantage: wasteful of disk space if only very few of this group of files is needed by users often. B. Prabhakaran

Replica Management • Two-phase commit protocols can be used to update all replicas. • Other schemes: • Weighted votes: • A certain number of votes r or w is to be obtained before reading or writing. • Current synchronization site (CSS): • Designate a process/site to control the modifications. • File open/close are done through CSS. • CSS can become a bottleneck. B. Prabhakaran

Scalability • Ease of adding more servers and clients with respect to the problems / design issues discussed before such as caching, replication management, etc. • Server-initiated cache invalidation scales up better. • Using the clients cache: • A server serves only X clients. • New clients (after the first X) are informed of the X clients from whom they can get the data (sort of chaining/hierarchy). • Cache misses & invalidations are propagated up and down this hierarchy, i.e., each node serves as a mini-file server for its children. • Structure of a server: • I/O operations through threads (light weight processes) can help in handling more clients. B. Prabhakaran

Semantics • What is the effect / meaning of an operation? • (e.g.,) read returns the data due to latest write operation. • Guaranteeing the above semantics in the presence of caching can be difficult. • We saw techniques for these under caching. B. Prabhakaran

Case Study: Sun NFS • Major goal: keep the distributed file system independent of underlying hardware and operating system. • NFS (Network File System): uses the Remote Procedure Call (RPC) for remote file operations. • Virtual file system (VFS) interface: provides uniform, virtual file operations that are mapped to the actual file system. (e.g.,) VFS can be mapped to DOS, so NFS can work with PCs. • VFS uses a structure called vnode (virtual node) that is unique in a NFS. • Each vnode has a mount table that provides a pointer to its parent file system and to the system over which it is mounted. B. Prabhakaran

Sun NFS... • A vnode can be a mount point. • Using mount tables, VFS interface can distinguish between local and remote file systems. • Requests to remote files are routed to the NFS by the VFS interface. • RPCs are used to reach remote VFS interface. • Remote VFS invokes appropriate local file operation. B. Prabhakaran

Sun NFS Architecture Client Kernel OS Interface Server Server Routines VFS Interface VFS Interface Others Unix NFS Disks Disk RPC/XDR RPC/XDR Network B. Prabhakaran

NFS: Naming & Location • Each client can configure its file system independent of others. i.e., different clients can see different name spaces. • Name resolution example: • Look up for a/b/c. a corresponds to vnode1 (assume). • Look up on vnode1/b returns vnode2 that might say the object is on server X. • Look up on vnode2/c is sent to X. X returns a file handle (if the file exists, permission matches, etc). • File handle is used for subsequent file operations. • Name resolution in NFS is an iterative process (slow). • Name space information is not maintained at each server as the servers in NFS are stateless (to be discussed later). B. Prabhakaran

NFS: Caching • NFS Client Cache: • File blocks: cached on demand. • Employs read ahead. Large block sizes (8 Kbytes) for data transfer to improve the sequential read performance. • Entire files cached, if they are small. Timestamps of files are also cached. • Cached blocks are valid for certain period after which validation is needed from server. Validation done by comparing time stamps of file at server. • Delayed writing policy used. Modified files are flushed after closing to handle sequential-write sharing. • File name to vnode translations: directory name lookup cache holds the vnodes for remote directory names. • Cache updated when lookup fails (cache acts as hints). • Attributes of files & directories: B. Prabhakaran

NFS: Caching • NFS Client Cache: .... • Attributes of files & directories: • Attribute inquiries form 90% of calls made to servers. • Cache entries are updated every time new attributes are received from server. • File attributes are discarded after 3 seconds and directory attributes after 30 seconds. B. Prabhakaran

NFS: Stateless Server • NFS servers are stateless to help crash recovery. • Stateless: no record of past requests (e.g., whether file is open, position of file pointer, etc.,). • Client requests contain all the needed information. No response, client simply re-sends the request. • After a crash, a stateless server simply restarts. No need to: • Restore previous transaction records. • Update clients or negotiate with clients on file status. • Disadvantages: • Client message sizes are larger. • Server cache management difficult since server has no idea on which files have been opened/closed. • Server can provide little information for file sharing. B. Prabhakaran

Un/mounting in NFS • Mounting of files in Unix is done by using a mount table stored in a file: /etc/mnttab. • mnttab is read by programs using procedures such as getmntent. • mount command adds an entry in mnttab, i.e., every time a file system is mounted in the system. • umount command removes an entry in mnttab, i.e., every time a file system is unmounted from the system. B. Prabhakaran

Un/Mounting • First entry in mnttab: file system that was mounted first. • Usually, file systems get mounted at boot time. • Mount: term used for mounting tapes onto systems, I guess. • Each entry is a line of fields separated by spaces in the form: <special> <mount_point> <fstype> <options> <time> • <special>: The name of the resource to be mounted. • <mount_point> : pathname of the directory on which the filesystem is mounted. • <fstype> : file system type of the mounted file system. • <options> : mount options. • <time> : time at which the file system was mounted. • Entries for <special>: path-name of a block-special device (e.g., /dev/fd0), the name of a remote filesystem (casa:/export/home, i.e., host:pathname), or the name of a swap file. B. Prabhakaran

Sharing Filesystems • In SunOS, share command is used to specify the file systems that can be mounted by other systems. • (e.g.), share [ -F FSType ] [ -o specific_options ] [-d description ] [ pathname ] • Share command makes a resource available to remote system, through a file system of FSType. • <specific_options> : control access of the shared resource. • rw pathname is shared read/write to all clients. This is also the default behavior. • rw=client[:client]...pathname is shared read/write only to the listed clients. No other systems can access pathname. • ro pathname is shared read-only to all clients. • ro=client[:client]... pathname is shared read-only only to the listed clients. No other systems can access pathname. B. Prabhakaran

Sharing Filesystems… • <-d description>: -d flag may be used to provide a description of the resource being shared. • Example : To share the /disk file system read-only at boot time. • share -F nfs -o ro /disk • share -F nfs -o rw=usera:userb /somefs • Multiple share commands on same file system? : Last command supersedes. • Try: • /etc/dfs/dfstab: list of share commands to be executed at boot time • /etc/dfs/fstypes: list of file system types, NFS by default • /etc/dfs/sharetab: system record of shared file systems. B. Prabhakaran

Automounting • mount a remote file system only when it is accessed, perhaps for a guessed duration of time. • automount utility: installs autofs mount points and associates an automount map with each mount point. • autofs file system monitors attempts to access directories within it and notifies the automountd daemon. • automountd uses the map to locate a file system. Then mounts at the point of reference within the autofs file system. • A map can be assigned to an autofs mount using an entry in the /etc/auto_master map or a direct map. • File system is not accessed within an appropriate interval (10 minutes by default) ? : the automountd daemon unmounts the file system. B. Prabhakaran

Cluster File System • System Model: a set of storage devices that can be accessed by a set of workstations. System 1 System n Very High Speed Network RAID RAID Tapes/CDs RAID: Redundant Array of Inexpensive Disks B. Prabhakaran

Cluster File System • Storage devices can be viewed as a “pool of centralized resources”. • Storage devices are shared by a set of workstations/systems or a cluster as it is called. • Both the pool of storage and the cluster are attached to very high speed networks (typically optical networks). • Devices can be mounted to different systems: e.g., Raid1 to system n, Raid 2 to system 1 etc. • Features: • Mirroring: replication of entire disks • Striping: data (e.g., multimedia) spread over multiple disks • Online reconfiguration: add/delete storage devices dynamically • Assign/remove devices to applications/systems dynamically B. Prabhakaran

Storage Virtualization • Means logical representation of the physical resources: storage devices & workstations • Virtualization specifies details such as which devices are meant for which host, how they can be shared, etc. • Possible places for virtualization: (each choice has its own advantages and disadvantages) • Workstations or hosts • Volume managers (software) are run on hosts, providing control over how data is stored and accessed over the different devices. B. Prabhakaran

Storage Virtualization... • Possible places for virtualization: • In storage subsystem • Associated with large-scale RAID large subsystems (many terabytes). Virtualization services embedded on storage controllers. • In special appliances: “in-band” or “out-of-band” • Special, intelligent appliances are used to provide virtualization • Appliance name: NAS (Network Attached Storage) • In-band: NAS is part of storage pool • Out-of-band: NAS not a part of storage pool B. Prabhakaran

Veritas Volume Manager • Works on both Unix and Windows • Builds a diskgroup spanning multiple devices. • Dynamic diskgroups management • Striping of data on multiple RAIDs. • Striping distributes data on multiple disks and hence increases the disk bandwidth for retrieval. Suitable for multimedia data. • Cluster Volume Manager: • Allows a volume to be simultaneously mounted for use across multiple servers for both reads and writes. B. Prabhakaran

Veritas Cluster Server • Cluster server handles upto 32 systems. • It monitors, controls and restarts applications in response to a variety of task. • (e.g.,) application A1 may be started on system n is system 1 fails. Disk group D1 will be automatically assigned to system n. • (e.g.) Disk group D2 may be assigned to system 1 if D1 fails and the application A1 will continue. Sn S1 D2 D1 B. Prabhakaran

Service Groups • A set of resources working together to provide application services to clients. • Service group example: • Disk groups having data • Volume built using disk group • File system (directories) using the volume • Servers/systems providing the application • Application program + libraries • Types of Service Groups: • Failover Groups: runs on 1 system in a cluster at a time. Used for applications that are not designed to maintain data consistency on multiple copies. • Cluster server monitors the heart beat of the system. If it fails, the backup is brought on-line. B. Prabhakaran

Service Groups... • Types of Service Groups...: • Parallel groups: run concurrently on more than 1 system. • Time-to-recovery: • On a failure, an application service is moved to another server in the cluster. • Disk groups are de-imported from the crashed server and imported by the back-up server. • Volume manager helps to manage the disk group ownership and accelerate recovery process of the cluster. • New ownership properties are broadcast to the cluster to ensure data security. • Time to take to bring the back-up online. B. Prabhakaran

Disaster Tolerance • More than 1 cluster connected by very high speed networks over a wide area network. • Cluster 1 and 2 geographically distributed. Very High Speed Link Over a Wide Area Network Cluster 1 Cluster 2 B. Prabhakaran

Veritas Volume Replicator • Redundant copy of application in another cluster must be kept up-to-date. • Volume Replicator allows a disk group to be replicated at 1 or more remote clusters. • Initialization of replication: entire disk group is replicated. • Runtime: only modifications to data are communicated. Conserves network bandwidth. • Disk groups at the remote cluster are not usually active. • Identical instance of application is run on the remote cluster in idle mode. • Disaster is identified by volume replicator using heart beats. • Puts remote cluster on-line for the applications. • Time-to-recovery: less than 1 minute. B. Prabhakaran

Distributed File System