Distributed File System

Distributed File System: implements a common file system that can be shared by autonomous computers. Goals: Network transparency: hide the details of where a file is located. High availability: files remain easily accessible irrespective of their physical location. High availability is difficult to achieve because a distributed file system is vulnerable to problems in the underlying network as well as to crashes of the systems that are the “file sources”. Replication/mirroring can alleviate this problem, but it introduces additional issues such as consistency.

DFS: Architecture In general, files in a DFS can be located in “any” system. We call the sources of files servers and the systems accessing them clients. Potentially, a server for one file can be a client for another file. However, most distributed systems distinguish between clients and servers more strictly: Clients simply access files and do not have or share local files. Even if clients have disks, the disks are used for swapping, caching, loading the OS, etc. Servers are the actual sources of files. In most cases, servers are more powerful machines (in terms of CPU, physical memory, disk bandwidth, etc.).

DFS: Architecture [Figure: several servers and several clients, each client with its own cache, connected by a computer network.]

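DFS Data Access [Figure: flowchart of a data access. A request to access data first checks the client cache; if the data is present, it is returned to the client. If not, the local disk (if any) is checked; if the data is present there, it is loaded into the client cache. If not, a request is sent over the network to the file server, which checks the server cache; if the data is not present there, the server issues a disk read and loads the server cache. The data is then loaded into the client cache and returned to the client.]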
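The access path in the figure above can be sketched roughly as follows. This is only an illustration: the Store class, the read_block function, and the block name "blk0" are hypothetical, and a real DFS would add eviction, error handling, and actual network calls.

```python
class Store:
    """A toy block store backed by a dict (hypothetical helper)."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})


def read_block(name, client_cache, local_disk, server_cache, server_disk):
    """Return (data, where_found), filling caches on the way back."""
    if name in client_cache.blocks:                      # client cache hit
        return client_cache.blocks[name], "client cache"
    if local_disk is not None and name in local_disk.blocks:
        data = local_disk.blocks[name]                   # found on local disk
        client_cache.blocks[name] = data                 # load client cache
        return data, "local disk"
    # From here on the request crosses the network to the file server.
    if name in server_cache.blocks:
        data, where = server_cache.blocks[name], "server cache"
    else:
        data, where = server_disk.blocks[name], "server disk"  # disk read
        server_cache.blocks[name] = data                 # load server cache
    client_cache.blocks[name] = data                     # load client cache
    return data, where


# A diskless client (local_disk=None) reads the same block twice.
client, server_cache = Store(), Store()
server_disk = Store({"blk0": b"hello"})
first = read_block("blk0", client, None, server_cache, server_disk)
second = read_block("blk0", client, None, server_cache, server_disk)
```

The first read goes all the way to the server disk; the second is served from the client cache, which is the response-time benefit that caching is meant to provide.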

Mechanisms for DFS Mounting: helps in combining files/directories in different systems to form a single file system structure. Caching: reduces the response time in bringing data from remote machines. Hints: a modified form of caching. Bulk data transfer: helps in reducing the delay due to transfer of files over the network. With bulk transfer: obtain multiple blocks with a single seek; format and transfer a large number of packets in a single context switch; reduce the number of acknowledgements to be sent. Useful, e.g., when downloading the OS onto a diskless client. Encryption: establish a key for encryption with the help of an authentication server.

Mounting Mounting helps to build a hierarchy of file directories. A collection of files can be mounted at an internal node of the hierarchy. The node at which this collection of files is mounted is called the mount point. The operating system kernel maintains a structure called the mount table, which maps mount points to the appropriate storage devices. The mount table can be maintained at: Each client: employed in Sun Network File System (NFS). Servers: all clients see the same file system structure; employed in the Sprite file system.
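A mount table lookup can be sketched as a longest-prefix match on the path. This is a minimal illustration, not how any particular kernel implements it; the function name resolve and the device strings in the table are hypothetical.

```python
def resolve(path, mount_table):
    """Map a path to (device, path relative to the chosen mount point)."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/") + "/"
        # A mount point matches if it is the path itself or a parent directory.
        if path == mount_point or (path + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best):
                best = mount_point          # keep the longest matching prefix
    if best is None:
        raise LookupError(f"no file system mounted for {path}")
    rest = path[len(best.rstrip("/")):] or "/"
    return mount_table[best], rest


# Hypothetical per-client mount table: mount point -> storage device.
mount_table = {
    "/": "local:/dev/sda1",
    "/usr/students": "serverY:/export/students",
}
remote = resolve("/usr/students/jack", mount_table)
local = resolve("/etc/passwd", mount_table)
```

Because the longest matching mount point wins, a remote subtree mounted at /usr/students shadows the local file system for every path below it.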

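Name Space Hierarchy [Figure: a name space hierarchy rooted at / on Server X, with two mount points at which the subtrees held by Server Y and Server Z are attached; the nodes are labelled a through i.]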

Caching Performance of distributed file system, in terms of response time, depends on the ability to “get” the files to the user. When files are in different servers, caching might be needed to improve the response time. A copy of data (in files) is brought to the client (when referenced). Subsequent data accesses are made on the client cache. Client cache can be on disk or main memory. Data cached may include future blocks that may be referenced too. Caching implies DFS needs to guarantee consistency of data.

Hints Hints can be used when cached data need not be completely accurate. Example: mapping of the name of a file/directory to the actual physical device. The address/name of the device can be stored as a hint. If this address fails to access the requested file, the cached data is purged. The file server can then refer to a name server, determine the actual location of the file/directory, and update the cache. With hints, a cache is neither updated nor invalidated when a change occurs to the content.
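The hint mechanism described above can be sketched as a cache that is validated only when it is used. The class HintCache, its fields, and the server names are hypothetical; the sketch assumes that a failed access is detectable (here, by checking whether the hinted server still holds the file).

```python
class HintCache:
    """Caches name -> location hints; nobody invalidates them on change."""

    def __init__(self, name_server, servers):
        self.name_server = name_server   # name -> authoritative location
        self.servers = servers           # location -> set of names stored there
        self.hints = {}                  # name -> possibly stale location
        self.lookups = 0                 # number of name server consultations

    def locate(self, name):
        loc = self.hints.get(name)
        if loc is not None and name in self.servers.get(loc, set()):
            return loc                   # the hint was still correct
        self.hints.pop(name, None)       # purge the stale (or missing) hint
        self.lookups += 1
        loc = self.name_server[name]     # fall back to the name server
        self.hints[name] = loc           # remember the new location as a hint
        return loc


servers = {"X": {"/a/f"}, "Y": set()}
name_server = {"/a/f": "X"}
cache = HintCache(name_server, servers)
cache.locate("/a/f")                     # consults the name server
cache.locate("/a/f")                     # served from the hint

# The file migrates from X to Y; the hint cache is not told about it.
servers["X"].discard("/a/f")
servers["Y"].add("/a/f")
name_server["/a/f"] = "Y"
moved = cache.locate("/a/f")             # stale hint fails, then is refreshed
```

The cost of a wrong hint is one failed access plus one name server lookup, which is acceptable when locations change rarely.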

Bulk Data Transfer • Delay in transfer is due to high cost of executing communication protocol. • Transferring of data in bulk reduces the protocol processing at both client and server • Bulk Transfer Reduces the File access overhead

Encryption • To enforce Security • Public key / Private Key Cryptography can be employed

Design Issues Naming: Locating the file/directory in a DFS based on name. Location of Cache: disk, main memory, both. Writing Policy: Updating original data source when cache content gets modified. Cache Consistency: Modifying cache when data source gets modified. Availability: More copies of files/resources. Scalability: Ability to handle more clients/users. Semantics: Meaning of different operations (read, write,…)

Naming Name : Name associated with files, directories or with any other resource Name space: (e.g.,) /home/students/jack, /home/staff/jill. Name space is a collection of names. Location transparency: file names do not indicate their physical locations. Name resolution: mapping name space to an object/device/file/directory. Naming approaches: Simple Concatenation: add hostname to file names. Mount Remote Directories onto local ones Have a Single Global Directory

Naming Approaches: 1. Simple Concatenation: add the hostname to file names. Advantages: file names are unique throughout the system; name resolution is simple. Disadvantages: violates network transparency; names are location dependent (moving a file from one host to another requires changing the file name).

Naming Approaches: 2. Mount remote directories onto local directories. Advantages: location transparent after mounting (followed in Sun NFS). Different clients in the system can mount in different ways, e.g., client 1 mounts /students at /, giving /students/jack and /students/jill, while client 2 mounts /students at /usr, giving /usr/students/jack and /usr/students/jill. File names can be resolved without consulting any host. Disadvantages: the host of the directory must be known in order to mount it.

Naming Approaches: 3. Single Global Directory. Advantages: all files in the system belong to a single name space (followed in the Sprite file system). Names are unique system wide, i.e., all clients mount the same way. File names can be resolved without consulting any host. Disadvantages: impractical for distributed systems that encompass heterogeneous environments; works only among (highly) cooperating systems (or system administrators!).

Naming: Context Context: identifies the name space within which name resolution is to be done, i.e., it partitions the name space. Example: context using ~ (tilde). ~jill/t: /home/staff/jill/t ~john/t: /home/students/john/t ~name represents the directory structure associated with a person or a project. Whenever file “t” is accessed, it is interpreted with reference to ~’s environment. ~ helps when different clients mount in different ways while still sharing the same set of users and their home directories, e.g., ~john may be mapped to /home/students/john in client 1 and to /usr/students/john in client 2.
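The tilde context can be illustrated with a small expansion function in which each client keeps its own map from user names to home directories. The function expand and the two dictionaries are hypothetical; the paths are the ones used in the example above.

```python
def expand(path, home_dirs):
    """Expand a leading ~user using this client's own home-directory map."""
    if not path.startswith("~"):
        return path                      # not a context-relative name
    user, sep, rest = path[1:].partition("/")
    return home_dirs[user] + sep + rest


# Two clients that mounted the same home directories in different places.
client1 = {"john": "/home/students/john", "jill": "/home/staff/jill"}
client2 = {"john": "/usr/students/john"}

on_client1 = expand("~john/t", client1)
on_client2 = expand("~john/t", client2)
```

The same name ~john/t resolves to a different absolute path on each client, yet refers to the same file in john's home directory.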

Name Resolution Done by name servers that map file names to actual files. A name server is a process that maps names specified by clients to stored objects. Approaches: 1. Centralized name server: send names to the server and get the path of servers and devices that leads to the requested file. The name server becomes a bottleneck. 2. Distributed name server: e.g., consider access to a file /a/b/c/d/e. The local name server identifies the remote server that handles the part /b/c/d/e. This procedure may be repeated recursively until ../e is resolved.
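The distributed approach can be sketched as a chain of referrals: each name server resolves the longest prefix it knows and hands the remainder to the next server. The function resolve_name, the table layout, and the server and object names are all hypothetical.

```python
def resolve_name(path, servers, start):
    """Follow referrals between name servers until the path is resolved."""
    server, remaining, hops = start, path, [start]
    while True:
        table = servers[server]
        matches = [p for p in table
                   if remaining == p or remaining.startswith(p + "/")]
        if not matches:
            raise LookupError(f"{server} cannot resolve {remaining}")
        prefix = max(matches, key=len)   # longest prefix this server knows
        kind, target = table[prefix]
        if kind == "object":
            return target, hops          # fully resolved
        # Referral: another server handles the rest of the path.
        server, remaining = target, remaining[len(prefix):]
        hops.append(server)


# Each name server knows only its own part of /a/b/c/d/e.
servers = {
    "local": {"/a": ("refer", "S1")},
    "S1": {"/b/c": ("refer", "S2")},
    "S2": {"/d/e": ("object", "S2:disk0:file42")},
}
location, hops = resolve_name("/a/b/c/d/e", servers, "local")
```

No single server sees every request in full, which removes the central bottleneck at the price of extra hops per lookup.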

Caching In main memory: faster than disks. Diskless workstations can also cache. The server cache is in main memory, so the same design can be used in clients too. Disadvantage: clients need main memory for virtual memory management as well. On disk: large files can be cached. Virtual memory management is straightforward. After caching the necessary files, the client can disconnect from the network (if needed, for instance, to support mobility).

Writing Policy When should modified cache content be transferred to the server? Write-through: write to the server immediately when cache content is modified. Advantage: reliability; a crash of the cache (client) does not mean loss of data. Disadvantage: several writes for each small change. Delayed writing: write to the server after a delay. Advantage: small/frequent changes do not increase network traffic. Disadvantage: less reliable, susceptible to client crashes. Write-on-close: write to the server at the time the file is closed.
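The three policies differ only in when the modified data reaches the server, which a small sketch can make concrete. The class CachedFile and the policy strings are hypothetical, the server is modelled as a plain dict, and the periodic timer of delayed writing is replaced by an explicit flush() call.

```python
class CachedFile:
    """Client-side cached copy of one file under a given writing policy."""

    def __init__(self, server, name, policy):
        self.server, self.name, self.policy = server, name, policy
        self.data = server.get(name, b"")
        self.dirty = False
        self.server_writes = 0           # number of writes sent to the server

    def write(self, data):
        self.data = data
        self.dirty = True
        if self.policy == "write-through":
            self.flush()                 # every change goes to the server

    def flush(self):
        """Send modified data to the server (a timer would call this
        periodically under the 'delayed' policy)."""
        if self.dirty:
            self.server[self.name] = self.data
            self.server_writes += 1
            self.dirty = False

    def close(self):
        if self.policy == "on-close":
            self.flush()                 # a single write when the file closes


server = {}
wt = CachedFile(server, "f", "write-through")
oc = CachedFile(server, "g", "on-close")
for version in (b"v1", b"v2", b"v3"):
    wt.write(version)
    oc.write(version)
visible_before_close = "g" in server
oc.close()
```

Three small changes cost three server writes under write-through but only one under write-on-close; the price is that the on-close data is lost if the client crashes before close().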

Cache Consistency When should a modified source content be transferred to the cache? Server-initiated Server cache manager informs client cache managers that can then retrieve the data. Client-initiated Client cache manager checks the freshness of data before delivering to users. Overhead for every data access. Concurrent-write sharing Multiple clients open the file, at least one client is writing. File server asks other clients to purge/remove the cached data for the file, to maintain consistency. Sequential-write sharing a client opens a file that was recently closed after writing.

Cache Consistency ... Sequential-write sharing : a client opens a file that was recently closed after writing. This client may have outdated cache blocks of the file (since the other client might have modified the file contents). Use time stamps for both cache and files. Compare the time stamps to know the freshness of blocks. The other client (which was writing previously) may still have modified data in its cache that has not yet been updated on server. (e.g.,) due to delayed writing. Server can force the previous client to flush its cache whenever a new client opens the file.
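The timestamp comparison for sequential-write sharing can be sketched as a check made when a file is opened. The Server and Client classes are hypothetical, and a logical counter stands in for real file modification times.

```python
class Server:
    def __init__(self):
        self.files = {}                  # name -> (data, timestamp)
        self.clock = 0                   # logical clock, bumped on every write

    def write(self, name, data):
        self.clock += 1
        self.files[name] = (data, self.clock)

    def timestamp(self, name):
        return self.files[name][1]

    def read(self, name):
        return self.files[name]


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}                  # name -> (data, timestamp when cached)

    def open(self, name):
        cached = self.cache.get(name)
        # Cached blocks are fresh only if their timestamp matches the server's.
        if cached is None or cached[1] != self.server.timestamp(name):
            self.cache[name] = self.server.read(name)
        return self.cache[name][0]


server = Server()
server.write("f", b"v1")
reader = Client(server)
before = reader.open("f")                # caches v1
server.write("f", b"v2")                 # another client wrote and closed f
after = reader.open("f")                 # stale copy detected and replaced
```

This handles outdated blocks at the opening client; it does not cover the second problem above, where the previous writer still holds unflushed data, which needs the server to force a flush.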

Availability Intention: overcome the failure of servers or network links. Solution: replication, i.e., maintain copies of files at different servers. Issues: Maintaining consistency Detecting inconsistencies, if they happen despite best efforts. Possible reasons for such inconsistencies: Replica is not updated due to a server failure or a broken network link. Inconsistency problems and their recovery may reduce the benefit of replication.

Availability: Replication Unit of replication: mostly a file [Roe, Sprite, Cedar]. Replicas of the files in a directory may be handled by different servers, requiring extra name resolutions to locate the replicas. Unit of replication: a group of files [Coda]. Advantage: name resolution and related work to locate replicas is done once for a set of files rather than for each individual file. Disadvantage: wasteful of disk space if users frequently need only a few files of the group.

Replica Management Deals with the maintenance of replicas for better availability. Consistency among replicas (mutual consistency) should be guaranteed. Two-phase commit protocols can be used to update all replicas. Other schemes: Weighted votes [Roe File System]: a certain number of votes, r or w, must be obtained before reading or writing. Current synchronization site (CSS) [Harp File System]: designate a process/site to control the modifications. File open/close operations are done through the CSS. The CSS can become a bottleneck.
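The weighted-vote scheme can be illustrated with a short quorum check. The sketch assumes the usual constraints of weighted voting, namely that r + w exceeds the total number of votes (so every read quorum overlaps every write quorum) and that 2w exceeds the total (so two writes cannot proceed on disjoint sets); the function names and the vote assignment are hypothetical.

```python
def valid_quorums(votes, r, w):
    """Check that read/write quorum sizes r and w guarantee overlap."""
    total = sum(votes.values())
    return r + w > total and 2 * w > total


def has_quorum(votes, reachable, needed):
    """True if the reachable replicas together hold at least `needed` votes."""
    return sum(votes[site] for site in reachable) >= needed


# Three replicas; site A is given more weight than B and C.
votes = {"A": 2, "B": 1, "C": 1}
r, w = 2, 3
ok = valid_quorums(votes, r, w)
can_read_from_a_alone = has_quorum(votes, {"A"}, r)
can_write_without_a = has_quorum(votes, {"B", "C"}, w)
```

With this assignment a read can be served by A alone, while a write needs A plus at least one other replica.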

Scalability The ease of adding more servers and clients without making the design issues above worse. Server-initiated cache invalidation scales up better. Using the clients' caches: a server serves only X clients. New clients (after the first X) are informed of the X clients from whom they can get the data (a sort of chaining/hierarchy). Cache misses and invalidations are propagated up and down this hierarchy, i.e., each node serves as a mini file server for its children. Structure of a server: performing I/O operations through threads (lightweight processes) can help in handling more clients.

Semantics What is the effect / meaning of an operation? (e.g.,) read returns the data due to latest write operation. Guaranteeing the above semantics in the presence of caching can be difficult.

Case Study: Sun NFS [http://en.wikipedia.org/wiki/Network_File_System_(protocol)] Major goal: keep the distributed file system independent of the underlying hardware and operating system. NFS (Network File System) uses Remote Procedure Calls (RPC) for remote file operations. Virtual file system (VFS) interface: provides uniform, virtual file operations that are mapped to the actual file system, e.g., VFS can be mapped to DOS, so NFS can work with PCs. VFS uses a structure called a vnode (virtual node) that is unique in an NFS. Each vnode has a mount table that provides a pointer to its parent file system and to the system over which it is mounted.

Sun NFS... A vnode can be a mount point. Using mount tables, VFS interface can distinguish between local and remote file systems. Requests to remote files are routed to the NFS by the VFS interface. RPCs are used to reach remote VFS interface. Remote VFS invokes appropriate local file operation.

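Sun NFS Architecture [Figure: on the client, the kernel's OS interface sits above the VFS interface, which dispatches to the local Unix file system (disk), to other file systems, or to NFS. NFS uses RPC/XDR over the network to reach the server, where RPC/XDR and the server routines call the server's VFS interface, which accesses the server's disks.]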

NFS: Naming & Location Each client can configure its file system independent of others. i.e., different clients can see different name spaces. Name resolution example: Look up for a/b/c. a corresponds to vnode1 (assume). Look up on vnode1/b returns vnode2 that might say the object is on server X. Look up on vnode2/c is sent to X. X returns a file handle (if the file exists, permission matches, etc). File handle is used for subsequent file operations. Name resolution in NFS is an iterative process (slow). Name space information is not maintained at each server as the servers in NFS are stateless (to be discussed later).
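The iterative lookup for a/b/c can be sketched as one lookup call per path component, each taking the handle returned by the previous one. The function lookup_path, the handle strings, and the toy directory tree are hypothetical; in NFS each step on a remote directory would be a separate RPC.

```python
def lookup_path(path, root_handle, lookup):
    """Resolve a path one component at a time; return (handle, lookup count)."""
    handle, calls = root_handle, 0
    for component in path.strip("/").split("/"):
        handle = lookup(handle, component)   # one lookup per component
        calls += 1
    return handle, calls


# Toy directory tree: directory handle -> {entry name: handle of the entry}.
directories = {
    "root": {"a": "vnode1"},
    "vnode1": {"b": "vnode2"},
    "vnode2": {"c": "filehandle-c"},
}


def lookup(dir_handle, name):
    return directories[dir_handle][name]


file_handle, calls = lookup_path("a/b/c", "root", lookup)
```

The number of lookups grows with the depth of the path, which is why the slide calls the iterative process slow and why NFS clients cache name-to-vnode translations.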

NFS: Caching NFS Client Cache: File blocks: cached on demand. Employs read-ahead. Large block sizes (8 Kbytes) are used for data transfer to improve sequential read performance. Entire files are cached if they are small. Timestamps of files are also cached. Cached blocks are valid for a certain period, after which validation is needed from the server. Validation is done by comparing the cached timestamp with the timestamp of the file at the server. A delayed writing policy is used. Modified files are flushed after closing to handle sequential-write sharing. File name to vnode translations: the directory name lookup cache holds the vnodes for remote directory names. The cache is updated when a lookup fails (the cache acts as hints).

NFS: Caching (continued) NFS Client Cache: Attributes of files & directories: attribute inquiries form 90% of the calls made to servers. Cache entries are updated every time new attributes are received from the server. File attributes are discarded after 3 seconds and directory attributes after 30 seconds.
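The attribute cache with its two lifetimes can be sketched as a time-to-live cache. The class AttrCache and its parameters are hypothetical; the 3 second and 30 second lifetimes are the ones stated above, and a fake clock is injected so that the behavior does not depend on real time.

```python
class AttrCache:
    """Caches attributes; entries expire after a per-kind time to live."""

    TTL = {"file": 3.0, "dir": 30.0}     # seconds, as stated above

    def __init__(self, fetch, clock):
        self.fetch = fetch               # asks the server for attributes
        self.clock = clock               # returns the current time in seconds
        self.entries = {}                # name -> (attributes, time cached)
        self.server_calls = 0

    def getattr(self, name, kind):
        now = self.clock()
        entry = self.entries.get(name)
        if entry is not None and now - entry[1] < self.TTL[kind]:
            return entry[0]              # still fresh, no server call
        attrs = self.fetch(name)         # expired or missing: ask the server
        self.server_calls += 1
        self.entries[name] = (attrs, now)
        return attrs


now = [0.0]                              # fake clock, advanced by hand
cache = AttrCache(fetch=lambda name: {"size": 10}, clock=lambda: now[0])
cache.getattr("f", "file")               # miss: fetched from the server
now[0] = 2.0
cache.getattr("f", "file")               # within 3 s: served from the cache
calls_while_fresh = cache.server_calls
now[0] = 3.5
cache.getattr("f", "file")               # expired: fetched again
```

Since attribute inquiries dominate the calls made to servers, even a lifetime of a few seconds removes a large share of the traffic.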

NFS: Stateless Server NFS servers are stateless to help crash recovery. Stateless: the server keeps no record of past requests (e.g., whether a file is open, the position of the file pointer, etc.). Client requests contain all the needed information. If there is no response, the client simply re-sends the request. After a crash, a stateless server simply restarts. There is no need to: restore previous transaction records; update clients or negotiate with clients on file status. Disadvantages: client message sizes are larger. Server cache management is difficult since the server has no idea which files have been opened/closed. The server can provide little information for file sharing.
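A stateless read can be sketched as a request that carries the file handle, offset, and count, with the client keeping track of the file position. The names serve_read, ClientFile, and the handle "fh-1" are hypothetical and the request format is a simplification, not the NFS wire protocol.

```python
FILES = {"fh-1": b"0123456789"}          # file handle -> contents (toy storage)


def serve_read(request):
    """Stateless read: everything needed is in the request itself."""
    data = FILES[request["handle"]]
    start = request["offset"]
    return data[start:start + request["count"]]


class ClientFile:
    """The client, not the server, remembers the file position."""

    def __init__(self, handle):
        self.handle, self.offset = handle, 0

    def read(self, count, send=serve_read):
        request = {"handle": self.handle, "offset": self.offset, "count": count}
        reply = send(request)            # on a timeout, re-send the same request
        self.offset += len(reply)
        return reply


f = ClientFile("fh-1")
chunks = [f.read(4), f.read(4), f.read(4)]
# Re-sending an old request gives the same answer: the read is idempotent.
retry = serve_read({"handle": "fh-1", "offset": 4, "count": 4})
```

Because the request names the offset explicitly, a server that crashed and restarted between two reads can answer the second one without knowing anything about the first. The larger request is the message-size cost mentioned above.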

Un/mounting in NFS Mounting of files in Unix is done by using a mount table stored in a file: /etc/mnttab. mnttab is read by programs using procedures such as getmntent. mount command adds an entry in mnttab, i.e., every time a file system is mounted in the system. umount command removes an entry in mnttab, i.e., every time a file system is unmounted from the system.

Un/Mounting First entry in mnttab: the file system that was mounted first. Usually, file systems get mounted at boot time. The term “mount” comes from mounting tapes onto systems. Each entry is a line of fields separated by spaces in the form: <special> <mount_point> <fstype> <options> <time> <special>: the name of the resource to be mounted. <mount_point>: pathname of the directory on which the file system is mounted. <fstype>: file system type of the mounted file system. <options>: mount options. <time>: time at which the file system was mounted. Entries for <special>: the pathname of a block-special device (e.g., /dev/fd0), the name of a remote file system (casa:/export/home, i.e., host:pathname), or the name of a swap file.
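The five-field entry format can be parsed with a few lines of code. This is a sketch only: the function parse_mnttab_line and the sample line are hypothetical (the mount point, options, and time value are made up), it assumes the options field is comma separated and the time field is an integer, and it is not a substitute for getmntent.

```python
from collections import namedtuple

MntEntry = namedtuple("MntEntry", "special mount_point fstype options time")


def parse_mnttab_line(line):
    """Split one mnttab line into its five whitespace-separated fields."""
    special, mount_point, fstype, options, time = line.split()
    # Assumption: options are comma separated and time is an integer.
    return MntEntry(special, mount_point, fstype, options.split(","), int(time))


# Hypothetical entry for a remote file system given as host:pathname.
entry = parse_mnttab_line("casa:/export/home /home nfs rw,soft 1012345678")
host, _, remote_path = entry.special.partition(":")
```

For a remote entry, splitting <special> at the first colon yields the host and the pathname exported by that host.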

Sharing Filesystems In SunOS, the share command is used to specify the file systems that can be mounted by other systems:
share [ -F FSType ] [ -o specific_options ] [ -d description ] [ pathname ]
The share command makes a resource available to remote systems through a file system of type FSType. <specific_options> control access to the shared resource:
rw: pathname is shared read/write to all clients. This is also the default behavior.
rw=client[:client]...: pathname is shared read/write only to the listed clients. No other systems can access pathname.
ro: pathname is shared read-only to all clients.
ro=client[:client]...: pathname is shared read-only only to the listed clients. No other systems can access pathname.

Sharing Filesystems (continued) <-d description>: the -d flag may be used to provide a description of the resource being shared. Examples:
share -F nfs -o ro /disk (shares the /disk file system read-only at boot time)
share -F nfs -o rw=usera:userb /somefs (shares /somefs read/write only to the clients usera and userb)
Multiple share commands on the same file system: the last command supersedes the earlier ones. Files to look at:
/etc/dfs/dfstab: list of share commands to be executed at boot time.
/etc/dfs/fstypes: list of file system types, NFS by default.
/etc/dfs/sharetab: system record of shared file systems.

Automounting Mount a remote file system only when it is accessed, perhaps for a guessed duration of time. automount utility: installs autofs mount points and associates an automount map with each mount point. The autofs file system monitors attempts to access directories within it and notifies the automountd daemon. automountd uses the map to locate a file system, then mounts it at the point of reference within the autofs file system. A map can be assigned to an autofs mount using an entry in the /etc/auto_master map or a direct map. If the file system is not accessed within an appropriate interval (10 minutes by default), the automountd daemon unmounts it.

Cluster File System System Model: a set of storage devices that can be accessed by a set of workstations. [Figure: systems 1 through n connected by a very high speed network to a pool of storage devices consisting of RAID arrays and tapes/CDs.] RAID: Redundant Array of Inexpensive Disks.

Cluster File System Storage devices can be viewed as a “pool of centralized resources”. Storage devices are shared by a set of workstations/systems or a cluster as it is called. Both the pool of storage and the cluster are attached to very high speed networks (typically optical networks). Devices can be mounted to different systems: e.g., Raid1 to system n, Raid 2 to system 1 etc. Features: Mirroring: replication of entire disks Striping: data (e.g., multimedia) spread over multiple disks Online reconfiguration: add/delete storage devices dynamically Assign/remove devices to applications/systems dynamically
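Striping can be illustrated by the mapping from a logical block number to a disk and a block on that disk. The sketch assumes simple round-robin placement with a configurable stripe unit; the function locate_block and its parameters are hypothetical and do not describe any specific product.

```python
def locate_block(logical_block, num_disks, stripe_unit=1):
    """Map a logical block to (disk index, block index on that disk)."""
    stripe, within = divmod(logical_block, stripe_unit)
    disk = stripe % num_disks                        # round-robin over disks
    physical = (stripe // num_disks) * stripe_unit + within
    return disk, physical


# Eight consecutive logical blocks spread over three disks, one block each.
layout = [locate_block(block, num_disks=3) for block in range(8)]
disks_used = {disk for disk, _ in layout}
```

Consecutive blocks land on different disks, so a large sequential read (e.g., of multimedia data) can be served by all disks in parallel.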

Storage Virtualization Means logical representation of the physical resources: storage devices & workstations Virtualization specifies details such as which devices are meant for which host, how they can be shared, etc. Possible places for virtualization: (each choice has its own advantages and disadvantages) Workstations or hosts Volume managers (software) are run on hosts, providing control over how data is stored and accessed over the different devices.

Storage Virtualization... Possible places for virtualization: In the storage subsystem: associated with large-scale RAID subsystems (many terabytes). Virtualization services are embedded in the storage controllers. In special appliances, “in-band” or “out-of-band”: special, intelligent appliances are used to provide virtualization. Appliance name: NAS (Network Attached Storage). In-band: the NAS is part of the storage pool. Out-of-band: the NAS is not part of the storage pool.

Veritas Volume Manager Works on both Unix and Windows Builds a diskgroup spanning multiple devices. Dynamic diskgroups management Striping of data on multiple RAIDs. Striping distributes data on multiple disks and hence increases the disk bandwidth for retrieval. Suitable for multimedia data. Cluster Volume Manager: Allows a volume to be simultaneously mounted for use across multiple servers for both reads and writes.

Veritas Cluster Server The cluster server handles up to 32 systems. It monitors, controls, and restarts applications in response to a variety of events. E.g., application A1 may be started on system n if system 1 fails; disk group D1 will then be automatically assigned to system n. E.g., disk group D2 may be assigned to system 1 if D1 fails, and application A1 will continue. [Figure: systems S1 and Sn with disk groups D1 and D2.]

Service Groups A set of resources working together to provide application services to clients. Service group example: Disk groups having data Volume built using disk group File system (directories) using the volume Servers/systems providing the application Application program + libraries Types of Service Groups: Failover Groups: runs on 1 system in a cluster at a time. Used for applications that are not designed to maintain data consistency on multiple copies. Cluster server monitors the heart beat of the system. If it fails, the backup is brought on-line.
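The heartbeat-based failover of a failover group can be sketched as follows. This is an illustration only and not the Veritas Cluster Server interface: the function pick_host, the timeout value, and the system names are hypothetical.

```python
def pick_host(systems, last_heartbeat, now, timeout, current):
    """Return the system that should run the service group at time `now`."""

    def alive(system):
        return now - last_heartbeat[system] <= timeout

    if alive(current):
        return current                   # no failover needed
    for system in systems:
        if system != current and alive(system):
            return system                # bring the backup online
    raise RuntimeError("no live system available for failover")


systems = ["S1", "Sn"]
last_heartbeat = {"S1": 100.0, "Sn": 109.0}   # time each system was last heard
healthy = pick_host(systems, last_heartbeat, now=103.0, timeout=5.0, current="S1")
failed_over = pick_host(systems, last_heartbeat, now=110.0, timeout=5.0, current="S1")
```

The service group runs on exactly one system at a time: it stays on S1 while S1's heartbeat is recent and moves to Sn once the heartbeat has been missing for longer than the timeout.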

Service Groups... Types of Service Groups...: Parallel groups: run concurrently on more than 1 system. Time-to-recovery: the time taken to bring the back-up online. On a failure, an application service is moved to another server in the cluster. Disk groups are deported from the crashed server and imported by the back-up server. The volume manager helps to manage disk group ownership and to accelerate the recovery process of the cluster. New ownership properties are broadcast to the cluster to ensure data security.

Disaster Tolerance More than 1 cluster connected by very high speed networks over a wide area network. Clusters 1 and 2 are geographically distributed. [Figure: Cluster 1 and Cluster 2 joined by a very high speed link over a wide area network.]

Veritas Volume Replicator Redundant copy of application in another cluster must be kept up-to-date. Volume Replicator allows a disk group to be replicated at 1 or more remote clusters. Initialization of replication: entire disk group is replicated. Runtime: only modifications to data are communicated. Conserves network bandwidth. Disk groups at the remote cluster are not usually active. Identical instance of application is run on the remote cluster in idle mode. Disaster is identified by volume replicator using heart beats. Puts remote cluster on-line for the applications. Time-to-recovery: less than 1 minute.
