Explore Farsite, a reliable storage system for untrusted environments, enabling read/write sharing and trust management for scalable data replication. Learn about performance considerations and security measures such as access control, encryption, and scalability. Discover the structure, replica set management, and update propagation in the system.
Farsite: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment Microsoft Research, appeared in OSDI ’02
Design Assumptions • 100,000 machines in a large corporation or university, interconnected by a high-bandwidth, low-latency network • Allow large-scale read-only sharing • Allow small-scale read/write sharing • A small fraction of users misbehave
Enabling Technology Trends • Large amount of unused disk space enables the use of replication for reliability • Relatively low cost of strong cryptography enables distributed security
Problems • Namespace roots • A file system is a hierarchical directory namespace that originates at a root • Farsite allows multiple roots, each of which can be regarded as a virtual file server • A root corresponds to a set of participating machines • Trust and certification • The security of any distributed system is an issue of trust • Manage trust using public-key-cryptographic certificates • A namespace certificate • A user certificate • A machine certificate
Basic System • Each machine performs three roles: a client, a member of a directory group, and a file host • A directory group: a set of machines that collectively manage file information using a Byzantine-fault-tolerant protocol • A file host: a machine used to store file data replicas
Performance Considerations • Problems • All file-system metadata operations involve a Byzantine-fault-tolerant (BFT) protocol • BFT is expensive • Solutions • Local caching improves read performance (using content leases) • Logged updates are batched (write-back caching pays off because many writes are deleted or overwritten shortly after they occur)
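The write-back idea above can be sketched as a small in-memory log that coalesces updates before they are sent to the directory group. This is a minimal illustration, not Farsite's actual log format: the class `WriteBackLog`, its methods, and the rule that deleting a locally created file cancels it outright are all assumptions made for the example.

```python
class WriteBackLog:
    """Hypothetical client-side log that batches and coalesces updates."""

    def __init__(self):
        self.pending = {}          # path -> ("write", data) or ("delete", None)
        self.created_here = set()  # paths first created inside this batch

    def create(self, path, data=b""):
        # Only treat the file as "new" if nothing is already pending for it.
        if path not in self.pending:
            self.created_here.add(path)
        self.pending[path] = ("write", data)

    def write(self, path, data):
        # A later write simply replaces the earlier pending one.
        self.pending[path] = ("write", data)

    def delete(self, path):
        if path in self.created_here:
            # Created and deleted within one batch: nothing needs to be sent.
            self.created_here.discard(path)
            self.pending.pop(path, None)
        else:
            self.pending[path] = ("delete", None)

    def flush(self):
        """Return the coalesced batch (sorted by path) and clear the log."""
        batch = sorted(self.pending.items())
        self.pending.clear()
        self.created_here.clear()
        return batch


log = WriteBackLog()
log.create("/tmp/a", b"1")
log.write("/tmp/a", b"2")
log.delete("/tmp/a")        # short-lived file never reaches the directory group
log.write("/doc", b"v1")
log.write("/doc", b"v2")    # overwrites the pending v1
batch = log.flush()
```

Five local operations collapse into a single update for `/doc`, which is the saving the slide attributes to write-back caching.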
Security • Access control by ACL • Privacy • Convergent encryption to protect the file data • Exclusive encryption to protect directory or file names • Integrity by a Merkle hash tree
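Convergent encryption derives the encryption key from the file's own contents, so identical files encrypt to identical ciphertext and can be detected as duplicates without being read. The sketch below only illustrates that property. The keystream built from SHA-256 in counter mode is a toy stand-in chosen to keep the example dependency-free, not a vetted cipher and not the construction used by Farsite; all function names are hypothetical.

```python
import hashlib
from typing import Tuple


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 over key + counter). Illustrative only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(plaintext: bytes) -> Tuple[bytes, bytes]:
    """Return (key, ciphertext); the key is the hash of the plaintext."""
    key = hashlib.sha256(plaintext).digest()
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return key, ciphertext


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """XOR with the same keystream recovers the plaintext."""
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


# Two users holding the same file produce the same ciphertext independently.
key_a, ct_a = convergent_encrypt(b"quarterly report")
key_b, ct_b = convergent_encrypt(b"quarterly report")
_, ct_other = convergent_encrypt(b"something else!!")
```

In a real deployment the per-file key would itself be protected, for example by encrypting it under each authorized reader's public key; that step is omitted here.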
Scalability • When a directory group becomes overloaded, it can delegate part of its namespace to another group • When opening a file or directory with a particular pathname, a client needs to determine which group of machines is responsible for that name • Hint-based pathname translation (caching), as in Sprite
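Hint-based translation can be sketched as a client-side cache from pathname prefixes to directory groups, consulted by longest-prefix match and corrected whenever a contacted group reports a delegation. The classes `DirectoryGroup` and `Client` and their methods are hypothetical names for this illustration, and handling of stale hints (a group that no longer owns a prefix) is left out.

```python
def is_prefix(prefix, path):
    """True if `prefix` covers `path` on whole path components."""
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


class DirectoryGroup:
    def __init__(self, name):
        self.name = name
        self.delegations = {}  # prefix -> group that now manages it

    def delegate(self, prefix, group):
        self.delegations[prefix] = group

    def next_hop(self, path):
        """Return (group, prefix) for the longest delegation, or (self, None)."""
        matches = [p for p in self.delegations if is_prefix(p, path)]
        if not matches:
            return self, None
        best = max(matches, key=len)
        return self.delegations[best], best


class Client:
    def __init__(self, root_group):
        self.hints = {"/": root_group}  # cached prefix -> group hints
        self.contacts = 0               # number of groups contacted so far

    def lookup(self, path):
        # Start from the most specific cached hint.
        best = max((p for p in self.hints if is_prefix(p, path)), key=len)
        group = self.hints[best]
        while True:
            self.contacts += 1
            nxt, prefix = group.next_hop(path)
            if nxt is group:
                return group
            self.hints[prefix] = nxt    # remember the delegation
            group = nxt


root, g2 = DirectoryGroup("root"), DirectoryGroup("g2")
root.delegate("/proj", g2)
client = Client(root)
first = client.lookup("/proj/a.txt")    # contacts root, then g2
after_first = client.contacts
second = client.lookup("/proj/b.txt")   # hint sends the client straight to g2
```

After the first lookup the cached hint lets later requests under `/proj` skip the root group entirely.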
Taming aggressive replication in the Pangaea wide-area file system HP Labs
Design Goals • Speed: hide the wide-area networking latency • Availability and autonomy • Network economy: transfer data between nodes in physical proximity, thereby reducing latency and bandwidth use
Structure of a file system • Gold replicas • The directory entry of a file lists the file’s gold replicas • Form a clique • Bronze replicas
Replica set management • Pervasive replication: a replica is created whenever a file is accessed by a user • File creation • Replica addition: the new replica S must be added to the graph (with m edges) • Adds an edge to a random gold replica (from a different region than S) • Asks a random gold replica P to pick the replica (among P’s immediate graph neighbors) closest to S • Asks P to choose m-2 random replicas using a random walk • Name-space containment
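The three replica-addition steps can be sketched over an undirected graph stored as a dictionary of neighbor sets. This is a simplified reading of the slide, not Pangaea's implementation: the function names, the `region` and `distance` inputs, the fixed walk length, and the bounded retry count are all assumptions made for the example.

```python
import random


def random_walk(graph, start, steps, rng):
    """Follow `steps` random edges from `start` and return the end node."""
    node = start
    for _ in range(steps):
        neighbors = sorted(graph[node])
        if not neighbors:
            break
        node = rng.choice(neighbors)
    return node


def add_replica(graph, gold, new, m, region, distance, rng,
                walk_len=3, max_tries=100):
    """Connect `new` to up to m existing replicas and return the chosen peers."""
    golds = sorted(gold)
    peers = set()
    # Step 1: a random gold replica, preferably from a different region.
    far = [g for g in golds if region[g] != region[new]]
    peers.add(rng.choice(far or golds))
    # Step 2: a random gold replica P picks its neighbor closest to `new`.
    p = rng.choice(golds)
    near = sorted(graph[p] - peers)
    if near:
        peers.add(min(near, key=lambda r: distance(new, r)))
    # Step 3: P fills the remaining slots (m - 2) using random walks.
    target = min(m, len(graph))
    tries = 0
    while len(peers) < target and tries < max_tries:
        peers.add(random_walk(graph, p, walk_len, rng))
        tries += 1
    # Record the edges in both directions.
    graph[new] = set()
    for r in peers:
        graph[new].add(r)
        graph[r].add(new)
    return peers


# Gold replicas A, B, C form a clique; D is a bronze replica attached to A.
graph = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
gold = {"A", "B", "C"}
region = {"A": "eu", "B": "us", "C": "us", "D": "eu", "S": "eu"}
position = {"A": 0, "B": 10, "C": 12, "D": 1, "S": 2}
peers = add_replica(graph, gold, "S", 3, region,
                    lambda x, y: abs(position[x] - position[y]),
                    random.Random(0))
```

Because the walks are random, the number of distinct peers found can fall short of m on small graphs; the sketch accepts that rather than retrying indefinitely.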
Propagating updates • Efficient and reliable update propagation • Delta propagation, harbingers, and using a spanning tree to exploit physical topology • Conflict resolution: combining version vectors and last-writer-wins rules • Lack of strong consistency guarantees: consistency is achieved eventually
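Combining version vectors with a last-writer-wins rule can be sketched as follows: version vectors decide whether one update supersedes another, and a timestamp breaks the tie only when the two updates are concurrent. This is a generic illustration of the technique, not Pangaea's exact rules; the record layout (`vv`, `ts`, `data`) and the choice to keep the local copy on equal timestamps are assumptions.

```python
def compare(a, b):
    """Compare two version vectors: 'equal', 'before', 'after', or 'concurrent'."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "concurrent"
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge_vectors(a, b):
    """Element-wise maximum of two version vectors."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def resolve(local, remote):
    """Pick the surviving replica state for one file."""
    order = compare(local["vv"], remote["vv"])
    if order in ("after", "equal"):
        return local               # local already reflects the remote update
    if order == "before":
        return remote              # remote strictly supersedes local
    # Concurrent updates: last writer wins, local kept on a timestamp tie.
    winner = remote if remote["ts"] > local["ts"] else local
    return {"vv": merge_vectors(local["vv"], remote["vv"]),
            "ts": winner["ts"],
            "data": winner["data"]}


local = {"vv": {"A": 2, "B": 1}, "ts": 10, "data": "x"}
remote = {"vv": {"A": 1, "B": 2}, "ts": 12, "data": "y"}
merged = resolve(local, remote)    # concurrent, so the later timestamp wins
```

The merged vector covers both histories, so replicas that later receive either original update recognize it as already applied.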
Questions? • Graph-based replica for each file, too much metadata to maintain • Like a multicast-based file system, updates are propagated using multicast
Discussion • Metadata and data management in a distributed file system • Either mutable, in which case some machines have to be trusted, as in xFS, or as in Farsite, which uses a Byzantine-fault-tolerant protocol so that only part of the machines must be trusted to serialize updates • Or immutable, using logged updates, which relies on each individual user to form the image of the file system • The replication factors of metadata and data may differ according to their usage?