This Microsoft Research paper discusses a distributed file system that operates without a central server, using client machines to store and maintain files and directories. The system ensures availability and security through encryption, replication, and Byzantine-fault tolerance.
FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, R. P. Wattenhofer Microsoft Research
Paper highlights • Paper discusses a distributed file system lacking a central server • Files and directories reside on client machines • Files are encrypted and replicated • Directory metadata are maintained by Byzantine-replicated finite state machines
Serverless file systems • Idea is not new • xFS (Anderson et al. SOSP 1995) • Objective is to utilize free disk space and processing power of client machines • Two major issues are • Availability of files • Security
Design assumptions (I) • Farsite is intended to run on the desktops of a large corporation or a university: • Maximum scale of ~10^5 machines • Interconnected by a high-bandwidth, low-latency network • Most machines up most of the time • Uncorrelated machine failures
Design assumptions (II) • No files are both • Read by many users and • Frequently updated by at least one user (very infrequent in Windows NT file system) • Small but significant fraction of users will maliciously attempt to destroy or corrupt file data and metadata
Design assumptions (III) • Large fraction of users may independently attempt unauthorized accesses • Each machine is under the control of its immediate user • Cannot be subverted by other people • No user-sensitive data persists after logout or system reboot • Not true for any commodity OS
Enabling technology trends (I) • General increase in unused disk capacity, measured for 4800 desktops at Microsoft Research: • 1998: 49% • 1999: 50% • 2000: 58%
Enabling technology trends (II) • Lowered cost of cryptographic operations: • Can now encrypt data at 72 MB/s • Faster than disk sequential I/O bandwidth (32 MB/s)
Namespace roots • Farsite provides hierarchical directory namespaces • Each namespace has its own root • Each root has a unique root name • Each root is managed by a designated set of machines forming a Byzantine-fault-tolerant group • No need for a protected set of machines
Trust and certification (I) • Basic Requirements • Users must trust the machines that offer to present data or metadata • Machines must trust the validity of requests from remote users • The system must ensure that machines claiming to be distinct are truly distinct • To prevent Sybil attacks
Sybil attacks • (Douceur 2002) • Possible whenever redundancy is used to increase security • Single rogue entity can • Pretend to be many and • End up controlling a large part of the system • Cannot prevent them without a logically centralized authority certifying identities
Trust and certification (II) • Farsite manages trust through public-key cryptographic certificates • Namespace certificates • User certificates • Machine certificates
Trust and certification (III) • Bootstrapped by fiat: • Machines told to accept certificates that can be authenticated with some public keys • The holders of the associated private keys are called Certification Authorities (CAs) • Certificates created either by CAs themselves or by users authorized to create certificates
Trust and certification (IV) • User private keys are • Encrypted with a symmetric key derived from user password • Stored in a globally-readable directory in Farsite • Does not require users to modify their behavior • User or machine keys can be revoked
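A minimal sketch of this password-protection scheme in stdlib Python. The PBKDF2 derivation matches the slide's "symmetric key derived from user password"; the XOR keystream cipher, the iteration count, and all names are illustrative stand-ins, not the paper's actual algorithms.

```python
import hashlib
import os

def derive_symmetric_key(password: str, salt: bytes) -> bytes:
    # Derive a symmetric key from the user's password (PBKDF2, stdlib).
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Toy SHA-256 counter-mode keystream; a real system would use AES.
    out = bytearray()
    for i in range(0, len(plaintext), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(x ^ k for x, k in zip(plaintext[i:i + 32], block))
    return bytes(out)

# The encrypted private key can be stored in a globally readable Farsite
# directory: only the password holder can recover it.
salt = os.urandom(16)
key = derive_symmetric_key("correct horse battery staple", salt)
user_private_key = os.urandom(32)              # placeholder key material
blob = toy_encrypt(key, user_private_key)
assert toy_encrypt(key, blob) == user_private_key  # XOR stream is symmetric
```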
Handling malicious behaviors • Most fault-tolerant file systems do not protect users’ files against malicious behaviors of hosts • They assume that a host will eitherbehave correctly or crash • Malicious behaviors are often calledByzantine failures • One or more hosts act as if they were controlled by very clever traitors
System architecture (I) • Each Farsite client deals with two different sets of hosts • A set of machines constituting a directory group • A set of machines acting as file hosts • In practice, all three roles (client, directory group member, file host) are shared by all machines
System architecture (II) • [Diagram: a client interacts with a single directory group composed of several members; the directory group and the client both interact with multiple file hosts] • Client sees one directory group
The directory group (I) • Replicates directories on directory members • Directory integrity enforced through a Byzantine-fault-tolerant protocol • Works as long as fewer than one-third of the hosts misbehave in any manner (“traitors”) • Requires a minimum of four hosts to tolerate one misbehaving host
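The sizing arithmetic behind this slide, spelled out: a Byzantine-fault-tolerant group of R replicas tolerates f traitors only if R ≥ 3f + 1.

```python
def max_tolerable_faults(group_size: int) -> int:
    # Largest f such that group_size >= 3f + 1.
    return (group_size - 1) // 3

def min_group_size(faults: int) -> int:
    # Smallest group that tolerates the given number of traitors.
    return 3 * faults + 1

assert min_group_size(1) == 4         # four hosts tolerate one traitor
assert max_tolerable_faults(7) == 2   # seven hosts tolerate two
```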
The directory group (II) • Decisions for all operations that are not determined by the client request are made through a cryptographically secure distributed random number generator • Issues leases on files to clients • Promise not to allow any incompatible access to the file during the duration of the lease without notifying the client
The directory group (III) • Directory groups can split: • Randomly select a group of machines they know • Tell them to form a new directory group • Delegate a portion of their namespace to new group • Both user and directory group mutually authenticate themselves
The file hosts (I) • Farsite stores encrypted replicas of each file to ensure file integrity and file availability • Continuously monitors host availability and relocates replicas whenever necessary • Does not allow all replicas of a given file to reside on hosts owned by the same user • Files that were recently accessed by a client are cached locally (for “roughly one week”)
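A sketch of the placement rule from this slide, under an assumption the paper does not require: for simplicity this variant is stricter and gives every replica a distinct owner, whereas Farsite only forbids all replicas sharing one owner. Host and owner names are invented.

```python
def place_replicas(hosts: dict[str, str], copies: int) -> list[str]:
    """hosts maps host name -> owning user; returns chosen host names."""
    chosen: list[str] = []
    owners: set[str] = set()
    for host, owner in hosts.items():
        if owner not in owners:          # one replica per distinct owner
            chosen.append(host)
            owners.add(owner)
        if len(chosen) == copies:
            return chosen
    raise RuntimeError("not enough distinct owners for requested replicas")

hosts = {"pc-alice1": "alice", "pc-alice2": "alice",
         "pc-bob": "bob", "pc-carol": "carol"}
print(place_replicas(hosts, 3))  # ['pc-alice1', 'pc-bob', 'pc-carol']
```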
The file hosts (II) • Farsite does not use voting: • Correct replicas are identified by the directory group • Farsite does not update at once all replicas of a file: • Would be too slow • Uses instead a background update mechanism
Semantic differences • Unlike NTFS, Farsite • Puts a limit on the number of clients that can have a file open for write • Allows a directory to be renamed even if there is an open handle on a file in the directory or any of its descendants • Uses background (“lazy”) propagation of directory updates
Reliability and availability (I) • Through redundancy • Metadata stored in a directory group of RD members remain accessible if no more than ⌊(RD − 1) / 3⌋ members fail • Data replicated on RF file hosts remain accessible as long as one of these hosts remains alive
Reliability and availability (II) • Farsite migrates duties of machines that have been unavailable for a long period of time to new machines (regeneration) • More aggressive approach to directory migration than to file-host migration • Farsite continuously monitors host availability and relocates replicas whenever necessary • Clients cache files for a week after last access
Security (I) • Write access control enforced through Access Control Lists managed by directory group • Requires Byzantine agreement • Read access control achieved through strong cryptography • File is encrypted with symmetric file key • File key is encrypted with public keys of all authorized users
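A sketch of this read-access scheme: a random symmetric key encrypts the file, and one copy of that key is wrapped under each authorized user's public key. This uses the third-party `cryptography` package; the two-user setup and key sizes are illustrative, not taken from the paper.

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

file_key = os.urandom(32)  # symmetric key that encrypts the file contents

# Stand-in key pairs for two authorized readers.
users = {name: rsa.generate_private_key(public_exponent=65537, key_size=2048)
         for name in ("alice", "bob")}

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Metadata stores one wrapped copy of the file key per authorized reader.
wrapped = {name: priv.public_key().encrypt(file_key, oaep)
           for name, priv in users.items()}

# Any authorized user can unwrap the file key with their private key.
assert users["alice"].decrypt(wrapped["alice"], oaep) == file_key
```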
Security (II) • Same technique is applied to directory names • Members of directory group cannot read them • To ensure file integrity, Farsite stores a copy of a Merkle hash tree over the file data blocks in the directory group that manages the file’s metadata
What is a Merkle hash tree? (I) • Consider a file made up of four blocks: A, B, C and D • We successively compute: • a = leaf_hash(A), …, d = leaf_hash(D) • p = inner_hash(a, b), q = inner_hash(c, d) • r = inner_hash(p, q) • Recomputing r (the root hash) and comparing it with its supposed value will detect any tampering
What is a Merkle hash tree? (II) • [Diagram: blocks A, B, C, D; a = leaf_hash(A), b = leaf_hash(B), c = leaf_hash(C), d = leaf_hash(D); p = inner_hash(a, b), q = inner_hash(c, d); r = inner_hash(p, q)]
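The four-block tree from these two slides as runnable stdlib Python. The SHA-256 choice and the "leaf"/"node" domain-separation prefixes are assumptions for the sketch; the structure follows the slide's definitions exactly.

```python
import hashlib

def leaf_hash(block: bytes) -> bytes:
    return hashlib.sha256(b"leaf" + block).digest()

def inner_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"node" + left + right).digest()

A, B, C, D = b"A-data", b"B-data", b"C-data", b"D-data"
a, b, c, d = map(leaf_hash, (A, B, C, D))
p, q = inner_hash(a, b), inner_hash(c, d)
r = inner_hash(p, q)   # root hash, stored by the directory group

# Tampering with any block changes the recomputed root.
a_tampered = leaf_hash(b"A-tampered")
assert inner_hash(inner_hash(a_tampered, b), q) != r
```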
Durability (I) • File creations, deletions and renames are not immediately forwarded to directory group • High cost of Byzantine protocol • First stored in a log on client • Much as in Coda disconnected mode • Log is pushed back to directory group • At fixed intervals • Whenever a lease is recalled
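A sketch of the client-side update log described above, with all names invented (in particular `apply_batch`, which stands in for the directory group's Byzantine-agreement round): operations accumulate locally and are pushed in batches at fixed intervals or when a lease is recalled.

```python
import time

class UpdateLog:
    """Client-side log of directory operations, pushed in batches."""

    def __init__(self, directory_group, push_interval: float = 60.0):
        self.group = directory_group
        self.interval = push_interval
        self.entries: list[tuple[str, str]] = []
        self.last_push = time.monotonic()

    def record(self, op: str, path: str) -> None:
        # e.g. record("create", "/docs/report.txt")
        self.entries.append((op, path))
        if time.monotonic() - self.last_push >= self.interval:
            self.push()                 # fixed-interval push

    def on_lease_recall(self) -> None:
        self.push()                     # flush before losing the lease

    def push(self) -> None:
        if self.entries:
            # One (expensive) Byzantine round covers the whole batch.
            self.group.apply_batch(self.entries)
            self.entries.clear()
        self.last_push = time.monotonic()
```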
Durability (II) • When a client reboots, it needs to send its committed updates to the directory group and have them accepted as authentic • Client will generate an authenticator key which it will distribute among members of the directory group • Can use this key to sign each committed update
Consistency (I) • Directory group uses a lease mechanism: • Data read/write leases • Data read-only leases • Concurrent write accesses are handled by redirecting them to a single client machine • Guarantees correctness • Not scalable
Consistency (II) • Leases have variable granularity • Single file • Entire subtree • No good way to handle read/write lease expiration on a disconnected client The fundamental paper on leases is C. G. Gray, D. R. Cheriton: Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency. SOSP 1989, pp. 202-210.
Consistency (III) • Special name leases for files and directories • A name lease on a directory allows holder to create files and subdirectories under that directory with any non-extant name • More special-purpose leases were introduced to implement Windows file sharing semantics
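A sketch of the name-lease check, with invented class and field names: holding the name lease on a directory lets a client create any non-extant name locally, without another round trip to the directory group.

```python
class Directory:
    """A directory's view of its children and its current name lease."""

    def __init__(self):
        self.children: set[str] = set()
        self.name_lease_holder: str | None = None

    def can_create_locally(self, client: str, name: str) -> bool:
        # Lease holder may create any name not already in use.
        return self.name_lease_holder == client and name not in self.children

d = Directory()
d.name_lease_holder = "client-42"
assert d.can_create_locally("client-42", "new-file.txt")
d.children.add("new-file.txt")
assert not d.can_create_locally("client-42", "new-file.txt")  # name extant
```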
Scalability • Ensured through • Hint-based pathname translation:Hints are data items that are useful when they are correct and cause no harm when they are incorrect • Think of a phone number • Delayed-directory change notification
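A sketch of hint-based pathname translation as the slide characterizes it (the cache structure and the `manages`/`lookup` methods are invented): a correct hint short-circuits resolution; a stale hint merely triggers a fallback to the authoritative lookup, causing no harm.

```python
class PathTranslator:
    """Client-side cache of pathname-prefix -> directory-group hints."""

    def __init__(self, root_group):
        self.root = root_group
        self.hints: dict[str, object] = {}   # path prefix -> directory group

    def resolve(self, path: str):
        # Try the longest cached prefix first.
        for prefix in sorted(self.hints, key=len, reverse=True):
            if path.startswith(prefix):
                group = self.hints[prefix]
                if group.manages(path):       # hint correct: fast path
                    return group
                del self.hints[prefix]        # stale hint: drop, no harm done
        group = self.root.lookup(path)        # authoritative slow path
        self.hints[path] = group              # remember for next time
        return group
```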
Efficiency • Space efficiency: • Almost 50% of disk space could be reclaimed by eliminating duplicate files • Farsite detects files with duplicate contents and co-locates them in same set of file hosts • Performance: • Achieved through caching and delaying updates
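A minimal stdlib sketch of the duplicate-detection idea: files whose contents hash to the same digest are candidates for co-location on one set of file hosts. File names and contents are illustrative.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file paths by content hash; return groups with duplicates."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for path, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

files = {"/a/report.doc": b"same bytes", "/b/copy.doc": b"same bytes",
         "/c/other.doc": b"different"}
print(find_duplicates(files))  # report.doc and copy.doc share one digest
```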
Evaluation • Designed to scale up to 10^5 machines • Roughly 300 new machines per day • Andrew benchmark two times slower than NTFS • Still to do • Implement disk quotas • Have mechanism to measure machine availability