This article provides an overview of distributed file systems, including their features, advantages, and different architectures. It also discusses the Google File System (GFS), Ceph, Andrew File System (AFS), and Hadoop. Topics covered include communication protocols, naming strategies, consistency and replication, security, and fault tolerance.
Latest Relevant Techniques and Applications for Distributed File Systems Ela Sharda esharda1@student.gsu.edu
Overview: What is a Distributed File System? - Features - The Google File System (GFS) - Ceph - Andrew File System (AFS) - Coda - Hadoop - References
What is a Distributed File System? A distributed file system stores files on one or more computers called servers and makes them accessible to other computers called clients, where they appear as normal files. Advantages of using file servers: - Files are more widely available, since many computers can access the servers, and sharing files from a single location is easier than distributing copies to individual clients. - Backups and safety of the information are easier to arrange. - The servers can provide large storage space, which might be costly or impractical to supply to every client.
cont... Since more than one client may access the same data simultaneously, the server must have a mechanism to organize updates so that clients always receive the most current version of the data and conflicts do not arise. A DFS typically uses file or database replication (distributing copies of data across multiple servers) to protect against data access failures. Files distributed across multiple servers appear to users as if they reside in one place on the network, so users no longer need to know or specify the physical location of a file in order to access it.
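One way a server could organize concurrent updates is to version each file and reject writes that were based on an outdated copy. The following is a minimal sketch of that idea, not a mechanism taken from any particular DFS; the names `FileServer` and `StaleWriteError` are hypothetical.

```python
class StaleWriteError(Exception):
    """Raised when a client writes on top of an outdated version."""


class FileServer:
    """Hypothetical in-memory server that versions every file."""

    def __init__(self):
        self._files = {}  # path -> (version, data)

    def read(self, path):
        # Unknown files are reported as version 0 with empty contents.
        return self._files.get(path, (0, b""))

    def write(self, path, data, based_on_version):
        current_version, _ = self.read(path)
        if based_on_version != current_version:
            # Another client updated the file first: refuse the write so the
            # caller can re-read the current version and retry.
            raise StaleWriteError(f"{path}: expected version {current_version}")
        self._files[path] = (current_version + 1, data)
        return current_version + 1
```

A client reads a file together with its version, edits it, and passes that version back in `write`; only the first of two competing writers succeeds.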
Features [2] Architecture: Different DFS architectures exist. 1. Client-Server Architecture: for example, Sun Microsystems' Network File System (NFS), which provides a standardized view of its local file system. 2. Cluster-Based Distributed File System: such as GFS. It consists of a single master along with multiple chunk servers, and files are divided into multiple chunks. 3. Symmetric Architecture: based on peer-to-peer technology. In this kind of file system, the clients also host the metadata manager code, so all nodes understand the disk structures.
cont... 4. Asymmetric Architecture: one or more dedicated metadata managers maintain the file system and its associated disk structures. Examples include Lustre and traditional NFS file systems. 5. Parallel Architecture: data blocks are striped, in parallel, across multiple storage devices on multiple storage servers, which supports concurrent reads and writes.
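The striping used by the parallel architecture can be sketched as a round-robin assignment of blocks to servers. This is an illustrative sketch only: the functions `stripe` and `reassemble` are hypothetical, and each "server" is just an in-memory list.

```python
def stripe(data: bytes, num_servers: int, block_size: int):
    """Split data into fixed-size blocks and deal them out round-robin."""
    servers = [[] for _ in range(num_servers)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        # Block i is stored on server i mod num_servers.
        servers[block_index % num_servers].append(data[offset:offset + block_size])
    return servers


def reassemble(servers):
    """Read the blocks back in their original order."""
    blocks = []
    longest = max(len(stored) for stored in servers)
    for row in range(longest):
        for stored in servers:
            if row < len(stored):
                blocks.append(stored[row])
    return b"".join(blocks)
```

Because consecutive blocks live on different servers, a large read or write can be served by several servers at once.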
cont... Communication: DFSs communicate using Remote Procedure Calls (RPC), which make the system independent of the underlying OS, networks and transport protocols. - In the RPC approach, there are two transport protocols to consider, TCP and UDP. - TCP is used by most DFSs. - UDP is considered for improving performance in Hadoop.
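As a rough illustration of the RPC style described above, the sketch below uses Python's standard `xmlrpc` modules, which carry calls over HTTP and therefore TCP. It is not the protocol of any DFS named here; the `FILES` table and the functions `read_file` and `fetch_over_rpc` are hypothetical.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical contents held by the file server.
FILES = {"/docs/readme.txt": "hello from the file server"}


def read_file(path):
    """Procedure exposed to remote callers: return a file's contents."""
    return FILES[path]


def fetch_over_rpc(path):
    """Start a one-shot server on a free local port and call it remotely."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.timeout = 5  # do not wait forever if no request arrives
    server.register_function(read_file)
    port = server.server_address[1]
    worker = threading.Thread(target=server.handle_request)  # serve one call
    worker.start()
    try:
        with ServerProxy(f"http://127.0.0.1:{port}") as proxy:
            # Looks like a local call, but runs in the server thread.
            return proxy.read_file(path)
    finally:
        worker.join()
        server.server_close()
```

The client code never deals with sockets or message formats, which is the independence from the underlying transport that the RPC approach is meant to provide.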
cont... Naming: Two approaches are currently common. 1. A central metadata server manages the file namespace. Decoupling metadata from data improves namespace management and relieves the synchronization problem. 2. Metadata is distributed across all nodes, so all nodes understand the disk structure.
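The first naming approach, a central metadata server that is decoupled from the data, might look like the sketch below. All names (`MetadataServer`, `DataServer`, `read_file`) are hypothetical and the servers are plain in-memory objects.

```python
class MetadataServer:
    """Holds only the namespace: which blocks make up each file, and where."""

    def __init__(self):
        self._namespace = {}  # path -> list of (server name, block id)

    def create(self, path, locations):
        self._namespace[path] = list(locations)

    def lookup(self, path):
        return list(self._namespace[path])


class DataServer:
    """Holds only block contents and knows nothing about file names."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes


def read_file(path, metadata, data_servers):
    # Step 1: ask the metadata server where the blocks live.
    # Step 2: fetch each block directly from its data server.
    return b"".join(
        data_servers[name].blocks[block_id]
        for name, block_id in metadata.lookup(path)
    )
```

The metadata server is consulted once per lookup, while the bulk data flows straight between client and data servers.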
cont... Consistency and Replication: Most DFSs employ checksums to validate data after it has been sent over the communication network. - Caching and replication play an important role in DFSs designed to operate over wide area networks. - They can be done in many ways, such as client-side caching and server-side replication. - Two types of data need to be considered for replication: metadata and data objects.
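Checksum validation after a transfer can be sketched with CRC-32 from Python's standard `zlib` module. The choice of CRC-32 and the helper names `package` and `verify` are assumptions made for illustration; the systems discussed here may use different checksums.

```python
import zlib


def package(payload: bytes):
    """Sender side: attach a CRC-32 checksum to the payload."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, checksum: int) -> bool:
    """Receiver side: recompute the checksum and compare."""
    return zlib.crc32(payload) == checksum
```

If `verify` returns `False`, the receiver can discard the block and request it again, possibly from another replica.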
cont... Security: Authentication and access control are among the important security issues in DFSs that need to be analyzed. - Most DFSs provide security through authentication, authorization and privacy. - Some special-purpose DFSs, such as GFS and Hadoop, rely on trust between all nodes and clients.
cont... Fault Tolerance: This is closely related to replication, because replication exists to provide availability and to make failures transparent to users. - There are two approaches to fault tolerance: failure as exception and failure as norm. - Failure-as-exception systems isolate the failed node or recover the system from the last normal running state. - Failure-as-norm systems replicate all kinds of data.
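How replication hides a failure from the user can be sketched as a read that falls through to the next replica. This is a simplified illustration under assumed names (`Replica`, `ReplicaDown`, `read_with_failover`), not the recovery logic of a specific system.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot serve a request."""


class Replica:
    """Hypothetical storage node holding a full copy of the data."""

    def __init__(self, name, data, alive=True):
        self.name = name
        self.data = dict(data)
        self.alive = alive

    def read(self, key):
        if not self.alive:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in turn; the caller never sees a single failure."""
    for replica in replicas:
        try:
            return replica.read(key)
        except ReplicaDown:
            continue  # treat the failure as normal and move on
    raise ReplicaDown("all replicas unavailable")
```

Only when every replica is down does the failure become visible to the caller.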
The Google File System [1] A scalable distributed file system for large data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. It is widely deployed within Google as the storage platform for the generation and processing of data, as well as for research and development efforts that require large data sets. GFS is optimized for Google's core data storage and usage needs, which can generate enormous amounts of data that need to be retained. Its architecture is a cluster-based distributed file system.
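In the GFS paper [1], files are divided into fixed-size 64 MB chunks, and a client turns a byte offset into a chunk index before asking the master where that chunk lives. The arithmetic can be sketched as follows; the function names are hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given in the GFS paper


def chunk_index(offset: int) -> int:
    """Index of the chunk that contains the given byte offset."""
    return offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int):
    """All chunk indexes touched by a read of `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))
```

A read that straddles a chunk boundary touches two chunks, which may be stored on different chunk servers.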
Ceph [6] A distributed file system that provides excellent performance and reliability while promising unparalleled scalability. It was developed at the University of California, Santa Cruz (UCSC). Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of object storage devices (OSDs). The primary goals of the architecture are scalability (to hundreds of petabytes and beyond), performance, and reliability.
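The idea of computing an object's location instead of looking it up in an allocation table can be illustrated with rendezvous hashing. This is a stand-in chosen for brevity and is not the CRUSH algorithm itself, which also accounts for device weights and the cluster hierarchy. The names `place` and `_score` and the OSD labels are hypothetical.

```python
import hashlib


def _score(object_name: str, osd: str) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, osds, replicas: int = 2):
    """Pick the OSDs with the highest scores; no lookup table is needed."""
    ranked = sorted(osds, key=lambda osd: _score(object_name, osd), reverse=True)
    return ranked[:replicas]
```

Any client that knows the list of OSDs computes the same placement on its own, and removing an OSD that does not hold an object leaves that object's placement unchanged.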
Andrew File System [5] The Andrew File System is a distributed networked file system which uses a set of trusted servers to present a homogeneous, location-transparent file name space to all the client workstations. It was developed at Carnegie Mellon University (CMU) as part of the Andrew Project. AFS uses Kerberos for authentication, and implements access control lists on directories for users and groups. Kerberos is a computer network authentication protocol developed at MIT, which allows individuals communicating over a non-secure network to prove their identity to one another in a secure manner. It provides mutual authentication: both the user and the server verify each other's identity.
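A directory-level access control list for users and groups can be sketched as below. The class `DirectoryACL` and the right names are simplified, hypothetical stand-ins and do not reproduce the actual AFS rights or its Kerberos integration; the sketch assumes the user has already been authenticated.

```python
class DirectoryACL:
    """Hypothetical ACL attached to one directory."""

    def __init__(self):
        self._entries = {}  # principal (user or group) -> set of rights

    def grant(self, principal, rights):
        self._entries.setdefault(principal, set()).update(rights)

    def allows(self, user, groups, right):
        # Access is allowed if the user or any of the user's groups
        # has been granted the requested right on this directory.
        principals = [user, *groups]
        return any(right in self._entries.get(p, ()) for p in principals)
```

Because the list is attached to the directory, every file inside it is governed by the same entries.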
Coda [4] Coda is a distributed file system developed at CMU with its origin in AFS2. It has many features that are very desirable for network file systems. Disconnected operation for clients: - reintegration of data from disconnected clients - bandwidth adaptation. Failure resilience: - read/write replication servers - handling of network failures that partition the servers. Performance and scalability: - client-side persistent caching of files, directories and attributes for high performance - write-back caching.
cont... Some more features: Security: - Kerberos-like authentication - access control lists (ACLs). Well-defined semantics of sharing. Freely available source code.
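Disconnected operation with later reintegration, listed among Coda's features, can be sketched as a client that logs its writes while offline and replays them on reconnection. This is a heavily simplified illustration with hypothetical names (`Server`, `Client`, `replay_log`); it writes through while connected and omits conflict detection, which a real system needs.

```python
class Server:
    """Hypothetical file server holding the authoritative copies."""

    def __init__(self):
        self.files = {}  # path -> data


class Client:
    """Hypothetical client with a local cache and a log of offline writes."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.connected = True
        self.replay_log = []  # writes made while disconnected

    def write(self, path, data):
        self.cache[path] = data
        if self.connected:
            self.server.files[path] = data
        else:
            self.replay_log.append((path, data))  # remember for later

    def read(self, path):
        if self.connected and path in self.server.files:
            self.cache[path] = self.server.files[path]
        return self.cache[path]  # served from the cache when offline

    def reconnect(self):
        """Reintegrate: replay logged writes to the server in order."""
        self.connected = True
        replayed = len(self.replay_log)
        for path, data in self.replay_log:
            self.server.files[path] = data
        self.replay_log.clear()
        return replayed
```

While disconnected, reads and writes are served entirely from the cache, so the user can keep working until the connection returns.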
Hadoop [7] Apache Hadoop is a free Java software framework that supports data-intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System papers. It is a top-level Apache project; Yahoo has been the largest contributor to the project and uses Hadoop extensively in its Web Search and Advertising businesses. IBM and Google have announced a major initiative to use Hadoop to support university courses in distributed computer programming.
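The MapReduce model that inspired Hadoop can be illustrated with a word count run in a single process. This sketch is plain Python and does not use the Hadoop API; the function names are hypothetical, and a real job would run the map and reduce phases on many nodes.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))
```

Each document can be mapped independently and each key reduced independently, which is what allows the work to be spread across many nodes.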
References
[1] labs.google.com/papers/gfs-sosp2003.pdf
[2] Tran Doan Thanh et al., A Taxonomy and Survey on Distributed File Systems, Fourth International Conference on Networked Computing and Advanced Information Management
[3] http://technet.microsoft.com/en-us/library/cc738688.aspx
[4] http://www.coda.cs.cmu.edu/ljpaper/lj.html
[5] http://en.wikipedia.org/wiki/Andrew_File_System
[6] Sage A. Weil et al., Ceph: A Scalable, High-Performance Distributed File System
[7] http://en.wikipedia.org/wiki/Hadoop