Mobile File System <AFS, Coda, Bayou>
Byung Chul Tak
AFS
• Andrew File System
  • a distributed computing environment developed at CMU
  • provides transparent access to remote shared files
• The most important design goal: scalability
• Allows existing UNIX programs to access AFS files without modification or recompilation
AFS
• Two design characteristics
  • Whole-file serving
    • the entire contents of directories and files are transmitted to client computers by AFS servers
  • Whole-file caching
    • a copy of the file is stored in a cache on the local disk
    • the file cache is permanent
AFS
• Usage scenario (see the sketch after this list)
  • A client issues an open system call for a file
  • If there is no copy in the local cache
    • the server is located
    • a request for a copy of the file is made
    • the copy is stored in the local UNIX file system and opened
  • Subsequent read and write operations are applied to the local copy
  • When the client issues a close system call
    • if the local copy has been updated, its contents are sent back to the server
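To make this flow concrete, here is a minimal Python sketch of client-side whole-file caching. The server stub with fetch/store methods and the cache directory path are illustrative assumptions, not actual AFS interfaces.

    import os

    CACHE_DIR = "/var/afs-cache"  # local disk partition used as the cache (assumed path)

    class WholeFileClient:
        """Illustrative sketch of AFS-style whole-file caching, not real AFS code."""

        def __init__(self, server):
            self.server = server      # hypothetical stub exposing fetch() and store()
            self.dirty = set()        # files modified since they were fetched
            os.makedirs(CACHE_DIR, exist_ok=True)

        def _local(self, path):
            return os.path.join(CACHE_DIR, path.strip("/").replace("/", "_"))

        def open(self, path):
            local = self._local(path)
            if not os.path.exists(local):       # no copy in the local cache:
                data = self.server.fetch(path)  # fetch the entire file from the server
                with open(local, "wb") as f:
                    f.write(data)               # store it in the local file system
            return local                        # reads and writes go to the local copy

        def write(self, path, data):
            with open(self.open(path), "wb") as f:
                f.write(data)
            self.dirty.add(path)

        def close(self, path):
            if path in self.dirty:              # only updated files are written back
                with open(self._local(path), "rb") as f:
                    self.server.store(path, f.read())
                self.dirty.discard(path)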
AFS
• Assumptions
  • Most files are small
  • Reads are much more common than writes
  • Sequential access is common, and random access is rare
  • Most files are read and written by only one user
    • when a file is shared, it is usually only one user who modifies it
  • Files are referenced in bursts, and there is high temporal locality
AFS
• Distribution of processes in AFS
[Figure: each workstation runs user programs and Venus on top of the UNIX kernel; each server runs Vice; workstations and servers are connected by the network]
AFS
• Two software components
  • Venus
    • a user-level process that runs on each client computer
  • Vice
    • the server software that runs as a user-level UNIX process on each server computer
AFS
• System call interception in AFS
  • BSD UNIX is modified to intercept file system calls
  • Venus manages the cache
    • a partition on the local disk is used as the cache
[Figure: a user program issues UNIX file system calls; the UNIX kernel serves local operations through the UNIX file system on the local disk and passes non-local file operations to Venus]
AFS
• File identifier
  • files and directories in the shared file space are identified by a 96-bit fid
  • Venus translates file pathnames into fids
  • a fid has three 32-bit fields:
    • volume number: in AFS, files are grouped into volumes
    • file handle: identifies the file within the volume
    • uniquifier: ensures that file identifiers are not reused
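A small sketch of packing and unpacking the 96-bit fid described above; the bit layout (volume number in the high bits) is an assumption for illustration, and the function names are invented.

    def pack_fid(volume, handle, uniquifier):
        """Pack the three 32-bit fields into a single 96-bit fid."""
        for part in (volume, handle, uniquifier):
            assert 0 <= part < 2**32, "each fid component is a 32-bit value"
        return (volume << 64) | (handle << 32) | uniquifier

    def unpack_fid(fid):
        """Split a 96-bit fid back into (volume number, file handle, uniquifier)."""
        mask = 2**32 - 1
        return (fid >> 64) & mask, (fid >> 32) & mask, fid & mask

    # The uniquifier distinguishes a recycled file handle from its predecessor.
    fid = pack_fid(volume=7, handle=1234, uniquifier=2)
    assert unpack_fid(fid) == (7, 1234, 2)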
AFS
• Cache consistency
  • based on the callback promise
• Callback promise
  • ensures that cached copies of files are updated when another client closes the same file after updating it
  • Vice supplies a copy of a file to Venus together with a callback promise
    • a token issued by Vice with two states: valid and cancelled
  • When Venus receives a callback, it sets the callback promise token to cancelled
  • Venus checks the callback promise when the user issues an open
    • if it is cancelled, a fresh copy must be fetched
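The callback-promise bookkeeping can be sketched as a small state machine; the class names and the break_callback entry point are illustrative, not the actual Vice/Venus RPC interface.

    from enum import Enum

    class Promise(Enum):
        VALID = "valid"
        CANCELLED = "cancelled"

    class VenusCache:
        """Illustrative client-side callback-promise bookkeeping."""

        def __init__(self, server):
            self.server = server
            self.entries = {}                 # path -> (data, Promise)

        def open(self, path):
            entry = self.entries.get(path)
            if entry is None or entry[1] is Promise.CANCELLED:
                # no cached copy, or the promise was cancelled:
                # fetch a fresh copy, which arrives with a new callback promise
                data = self.server.fetch(path)
                self.entries[path] = (data, Promise.VALID)
            return self.entries[path][0]

        def break_callback(self, path):
            # invoked when Vice sends a callback after another client's update
            if path in self.entries:
                data, _ = self.entries[path]
                self.entries[path] = (data, Promise.CANCELLED)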
CODA
• Evolution from AFS
• Mechanisms for high availability
  • Disconnected operation
    • a mode of operation in which a client continues to use data during network failure
    • while disconnected, the client relies on the local cache
    • a cache miss is reported as a failure
  • Server replication
    • allows volumes to have read-write replicas at more than one server
CODA
• Venus states
  • Hoarding state
    • hoards useful data in anticipation of disconnection
  • Emulation state
    • entered upon disconnection
    • Venus assumes full responsibility for file operations
  • Reintegration state
    • Venus propagates changes made during emulation to the server
    • validates all cached objects before use
CODA
• Design philosophies for extending CODA
  • Don't punish strongly-connected clients
    • it is unacceptable to degrade the performance of strongly-connected clients on account of the weakly-connected ones
  • Don't make life worse than when disconnected
    • users will not tolerate substantial performance degradation
  • Do it in the background if you can
    • e.g., move network delay from the foreground to the background
  • When in doubt, seek user advice
    • as connectivity weakens, the price of misjudgment increases
CODA
• CODA extensions
  • Transport protocol refinements
    • code separation of the RPC2 and SFTP protocols
  • Rapid cache validation
    • raising the granularity of cache validation
  • Trickle reintegration
    • propagating updates to servers asynchronously
  • User-assisted miss handling
    • asking for user input before large file fetches
CODA
• Rapid cache validation
  • Under the previous implementation
    • the reintegration process shows low performance
    • every cached object is validated after reconnection
  • Solution adopted
    • track server state at multiple levels of granularity
    • a version stamp for each volume
    • if the volume version stamp is invalid, each cached object is validated as usual
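A sketch of volume-granularity validation under the scheme above. The volume_stamp and validate_object calls are assumed stand-ins; the point is that one stamp comparison per volume vouches for every cached object in it.

    def revalidate(cache, server):
        """Validate cached objects after reconnection, one volume at a time.

        cache:  dict mapping volume_id -> (cached_stamp, list of cached objects)
        server: stub exposing volume_stamp() and validate_object() (assumed API)
        """
        for volume_id, (cached_stamp, objects) in cache.items():
            if server.volume_stamp(volume_id) == cached_stamp:
                # fast path: nothing in this volume changed, so every
                # cached object in it is still valid
                continue
            # slow path: the volume changed, fall back to per-object validation
            for obj in objects:
                if not server.validate_object(obj):
                    obj.invalidate()          # hypothetical invalidation hook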
CODA
• Trickle Reintegration
  • State modification
    • a Write Disconnected state is added
  • Updates are logged and propagated via trickle reintegration
  • Reintegration runs in the background
  • A user can force a full reintegration
[Figure: Venus state transitions among Hoarding, Emulating, and Write Disconnected; Hoarding enters Emulating on disconnection and Write Disconnected on weak connection; Emulating enters Write Disconnected on connection; Write Disconnected returns to Hoarding on strong connection and enters Emulating on disconnection]
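The transitions in the state diagram can be encoded as a small table; this is only a sketch of the control flow, with event names taken from the diagram labels, not Venus's actual implementation.

    # state -> {event: next state}; events follow the labels in the state diagram
    TRANSITIONS = {
        "Hoarding": {
            "disconnection": "Emulating",
            "weak connection": "Write Disconnected",
        },
        "Emulating": {
            "connection": "Write Disconnected",
        },
        "Write Disconnected": {
            "strong connection": "Hoarding",
            "disconnection": "Emulating",
        },
    }

    def step(state, event):
        """Return the next Venus state, staying put on irrelevant events."""
        return TRANSITIONS.get(state, {}).get(event, state)

    state = "Hoarding"
    for event in ["weak connection", "disconnection", "connection", "strong connection"]:
        state = step(state, event)
    assert state == "Hoarding"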
CODA
• Log optimization
  • key to reducing the volume of reintegration data
  • basic concept
    • in the emulation state, Venus logs updates
    • when a log record is appended to the CML (Client Modify Log), Venus checks whether it cancels or overrides earlier records
  • Trickle reintegration reduces the opportunity for optimization
    • records should spend enough time in the CML for optimizations to be effective
CODA
• Log optimization
  • Aging (see the sketch after this list)
    • a record is not eligible for reintegration until it has spent a minimal amount of time, the aging window, in the CML
[Figure: CML during reintegration; records older than the aging window A lie between the log head and the reintegration barrier, with newer records behind it toward the log tail]
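A simplified sketch of CML bookkeeping: a later store of the same file cancels the earlier record at append time, and only records older than the aging window are eligible for trickle reintegration. The record layout and the store-cancels-store rule are deliberate simplifications of the real CML.

    import time

    AGING_WINDOW = 600  # seconds a record must age before reintegration (placeholder)

    class CML:
        """Toy Client Modify Log with append-time cancellation and aging."""

        def __init__(self):
            self.records = []  # each record: (timestamp, op, path)

        def append(self, op, path):
            if op == "store":
                # a new store of the same file overrides any earlier store,
                # so the earlier record can be dropped from the log
                self.records = [r for r in self.records
                                if not (r[1] == "store" and r[2] == path)]
            self.records.append((time.time(), op, path))

        def eligible(self, now=None):
            """Records old enough to cross the reintegration barrier."""
            now = now if now is not None else time.time()
            return [r for r in self.records if now - r[0] >= AGING_WINDOW]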
CODA
• Seeking User Advice
  • Transparency is not always acceptable
    • under low bandwidth, a file fetch can take very long, which is annoying to the user
    • in some cases, a user is willing to wait for a long delay when the file is important
  • Patience threshold
    • the maximum time a user is willing to wait for a particular file, or the equivalent file size
    • a function of the hoard priority P and the bandwidth
    • hoard priority: the user-perceived importance of a file, specified by the user
CODA
• Seeking User Advice (cont'd)
  • Patience threshold model
    • τ = α + β·e^(δ·P)
    • τ: threshold; β, δ: scaling parameters; α: lower bound; P: hoard priority
  • Handling misses
    • in the status walk, Venus obtains status for missing objects and decides whether to fetch
    • in the data walk, Venus fetches the contents from the server
    • if the file size is above the patience threshold, a screen is shown to collect the user's decision
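A sketch of the patience model as reconstructed above, τ = α + β·e^(δ·P); the constants here are placeholders, not the values Coda actually uses.

    import math

    ALPHA = 6.0    # lower bound on patience (placeholder value)
    BETA = 1.0     # scaling parameter (placeholder value)
    DELTA = 0.01   # scaling parameter (placeholder value)

    def patience_threshold(hoard_priority):
        """tau = alpha + beta * e^(delta * P), in whatever unit thresholds are
        compared in (the slide treats waiting time and the equivalent file
        size interchangeably)."""
        return ALPHA + BETA * math.exp(DELTA * hoard_priority)

    def handle_miss(file_size, hoard_priority, ask_user):
        """Fetch silently if under the threshold; otherwise ask the user."""
        if file_size <= patience_threshold(hoard_priority):
            return True                 # important enough: fetch without asking
        return ask_user(file_size)      # above threshold: collect the user's decision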
BAYOU
• Bayou
  • a replicated, weakly consistent storage system for mobile computing environments
• Design philosophy
  • applications must be aware that they may read inconsistent data
  • applications must be aware that there may be conflicts
  • clients can read and write to any replica without the need for coordination
  • the definition of a conflict depends on the application semantics
BAYOU
• System model
  • Each data collection is replicated in full at a number of servers
  • Bayou provides two basic operations
    • read and write
  • Clients can use any server's data
    • a client can read from and submit writes to any server
    • once a write is accepted, the client has no further responsibilities
    • the client does not wait for the write to propagate
  • Anti-entropy sessions
    • Bayou servers propagate writes during pair-wise contact
BAYOU
• Conflict Detection and Resolution
  • supports application-specific, per-write conflict detection and resolution
  • Two mechanisms permit clients to indicate how to detect conflicts and how to resolve them
    • dependency checks
    • merge procedures
BAYOU
• Dependency checks
  • Each write operation includes a dependency check
    • a SQL-like query is used
  • A conflict is detected if the expected value is not returned
BAYOU
• Merge procedures
  • Each write operation includes a merge procedure
    • written in a high-level, interpreted language
  • When an automatic merge is impossible, the procedure runs to completion and produces a log
    • the user can later resolve the conflict manually
BAYOU
• Bayou write implementation
  • a Bayou write call bundles the update, the dependency check, and the merge procedure (see the sketch below)
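The write-call example referred to above can be sketched as follows, loosely modeled on the meeting-room scheduling example in the Bayou paper; the dictionary layout and field names here are illustrative, not Bayou's actual API.

    # One Bayou write, carrying its own conflict detection and resolution logic.
    bayou_write = {
        # the update itself: insert a new meeting record
        "update": ("insert", "Meetings",
                   {"day": "12/18/95", "start": "1:30pm", "length": "60min",
                    "title": "Budget Meeting"}),

        # dependency check: a SQL-like query plus the result the writer expects;
        # any other result at a server means the write conflicts there
        "dependency_check": {
            "query": "SELECT key FROM Meetings WHERE day = '12/18/95' "
                     "AND start < '2:30pm' AND end > '1:30pm'",
            "expected_result": [],      # i.e., no overlapping meeting exists
        },

        # merge procedure: run by the server when the check fails;
        # here it would try alternate time slots for the meeting
        "mergeproc": {
            "alternates": [("12/18/95", "3:00pm"), ("12/19/95", "9:30am")],
        },
    }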
BAYOU
• Replica Consistency
  • Eventual consistency
    • Bayou guarantees that all servers eventually receive all writes
  • Consistency is maintained via a pair-wise anti-entropy process
BAYOU
• Anti-entropy process
  • brings two replicas up to date
  • Accept-stamp
    • a monotonically increasing number assigned by a server when it accepts a write
    • defines a total order over all writes accepted by that server
    • defines a partial order over all writes in the system
  • Basic design
    • a one-way operation between pairs of servers
    • works via the propagation of write operations
    • write propagation is constrained by the accept-order
BAYOU
• Pair-wise anti-entropy
  • a unidirectional process
    • one server brings the other up to date by propagating writes unknown to it
  • Prefix property
    • a server R that holds a write stamped W_i that was initially accepted by another server X also holds all writes accepted by X prior to W_i
    • the accept-stamp is used to achieve this property in Bayou
BAYOU
• Basic anti-entropy algorithm
  • The sending server S gets the version vector from the receiving server R
  • It traverses its write-log and sends the writes not covered by the version vector
[Figure: the sending server's write-log s1…s6 alongside the receiving server R's version vector with entries for servers x, y, z]

    anti-entropy(S, R) {
        Get R.V from receiving server R
        # now send all the writes unknown to R
        w = first write in S.write-log
        WHILE (w) DO
            IF R.V(w.server-id) < w.accept-stamp THEN
                # w is new for R
                SendWrite(R, w)
            w = next write in S.write-log
        END
    }
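For concreteness, here is a runnable Python rendering of the pseudocode above; the Write and Server structures are illustrative stand-ins for Bayou's actual data structures.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Write:
        server_id: str      # server that originally accepted this write
        accept_stamp: int   # monotonically increasing per accepting server
        payload: str

    @dataclass
    class Server:
        name: str
        write_log: list = field(default_factory=list)       # in acceptance order
        version_vector: dict = field(default_factory=dict)  # server_id -> max stamp

        def accept(self, payload):
            stamp = self.version_vector.get(self.name, 0) + 1
            self.receive(Write(self.name, stamp, payload))

        def receive(self, w):
            self.write_log.append(w)
            self.version_vector[w.server_id] = max(
                self.version_vector.get(w.server_id, 0), w.accept_stamp)

    def anti_entropy(sender, receiver):
        """One-way session: send every write the receiver's vector does not cover."""
        rv = dict(receiver.version_vector)   # get R.V from the receiving server
        for w in sender.write_log:           # traverse S's write-log in order
            if rv.get(w.server_id, 0) < w.accept_stamp:
                receiver.receive(w)          # w is new for R

    s, r = Server("S"), Server("R")
    s.accept("write-1"); s.accept("write-2")
    anti_entropy(s, r)
    assert [w.payload for w in r.write_log] == ["write-1", "write-2"]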
BAYOU
• Anti-entropy process
  • A receiving server may receive a write that precedes some writes already on the server
    • the server must undo the effects of the later writes and redo them together with the new write
  • Each server maintains a log of all write operations it has received
  • The write log may become excessively long
    • log truncation is necessary, especially for mobile systems
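Continuing the toy Server above, a sketch of the undo/redo step: if an incoming write sorts before the tail of the log, roll the log back to the insertion point and reapply the suffix. The global ordering key (accept_stamp, server_id) is an assumption for illustration.

    def insert_write(server, w):
        """Insert w in global order, undoing and redoing later writes as needed."""
        key = lambda x: (x.accept_stamp, x.server_id)  # illustrative total order
        log = server.write_log
        i = len(log)
        while i > 0 and key(log[i - 1]) > key(w):
            i -= 1                  # everything past i must be undone and redone
        suffix = log[i:]
        del log[i:]                 # undo the effects of the later writes
        log.append(w)               # apply the new write in its proper place
        log.extend(suffix)          # redo the undone writes on top of it
        server.version_vector[w.server_id] = max(
            server.version_vector.get(w.server_id, 0), w.accept_stamp)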
BAYOU
• Write-log management
  • Log truncation
    • when two servers engage in anti-entropy, one server may have discarded writes that the other still needs
    • in such cases, a full database transfer is required
  • Write stability
    • committed writes are introduced to allow log management
    • a committed write is one whose position in the write-log will never change and which will never be re-executed
BAYOU
• Write stability
  • Primary-commit protocol
    • one replica server is designated as the primary replica
    • the primary replica commits the position of a write in the log
  • CSN (Commit Sequence Number)
    • a monotonically increasing number assigned to committed writes
    • the CSN is propagated back to all other servers during the anti-entropy process
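A minimal sketch of primary commit layered on the toy Server above: only the designated primary assigns CSNs, so committed writes form a stable prefix of the log. The commit method and its bookkeeping are invented for illustration.

    class PrimaryServer(Server):
        """Toy primary replica that assigns Commit Sequence Numbers (CSNs)."""

        def __init__(self, name):
            super().__init__(name)
            self.next_csn = 0
            self.commit_order = {}   # (server_id, accept_stamp) -> CSN

        def commit(self, w):
            # fix the final log position of w; a committed write is never
            # reordered or re-executed
            wid = (w.server_id, w.accept_stamp)
            if wid not in self.commit_order:
                self.commit_order[wid] = self.next_csn
                self.next_csn += 1
            return self.commit_order[wid]

    # CSNs assigned by the primary propagate to the other servers during later
    # anti-entropy sessions, making those writes stable everywhere.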
BAYOU
• Anti-entropy protocol extensions
  • server reconciliation using transportable media
  • support for session guarantees and eventual consistency
  • light-weight server creation and retirement