
Large Scale File Systems


Presentation Transcript


  1. Large Scale File Systems by Dale Denis

  2. The need for Large Scale File Systems. Big Data. The Network File System (NFS). The use of inexpensive commodity hardware. Large Scale File Systems. The Google File System (GFS). The Hadoop Distributed File System (HDFS). Outline Dale Denis

  3. International Data Corporation (IDC): The digital universe will grow to 35 zettabytes globally by 2020. The New York Stock Exchange generates over one terabyte of new trade data per day. The Large Hadron Collider near Geneva, Switzerland, will produce approximately 15 petabytes of data per year. Facebook hosts approximately 10 billion photos, taking up 2 petabytes of data. Big Data Dale Denis

  4. 85% of the data being stored is unstructured data. The data does not require frequent updating once it is written, but it is read often. This scenario is complementary to data that is more suitable for an RDBMS. Relational Database Management Systems are good at storing structured data: Microsoft SQL Server, Oracle, MySQL. Big Data Dale Denis

  5. NFS: The ubiquitous distributed file system. Developed by Sun Microsystems in the early 1980s. While its design is straightforward, it is also very constrained: The files in an NFS volume must all reside on a single machine. All clients must go to this machine to retrieve their data. NFS Dale Denis

  6. The bottleneck of reading data from a drive: Transfer speeds have not kept up with storage capacity. 1990: A typical drive with 1370 MB capacity had a transfer speed of 4.4 MB/s, so it would take about 5 minutes to read all of the drive's data. 2010: A terabyte drive with a typical transfer speed of 100 MB/s takes over two and a half hours to read all of the data. NFS Dale Denis
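
To make the arithmetic concrete, the following sketch computes the read times quoted above (the capacities and transfer speeds are the slide's figures; the class and method names are just for illustration):

    // Back-of-the-envelope drive read times, using the figures quoted above.
    public class DriveReadTime {
        static double minutesToRead(double capacityMB, double transferMBps) {
            return capacityMB / transferMBps / 60.0;
        }

        public static void main(String[] args) {
            // 1990: 1370 MB drive at 4.4 MB/s -> roughly 5 minutes
            System.out.printf("1990: %.1f minutes%n", minutesToRead(1370, 4.4));
            // 2010: ~1,000,000 MB drive at 100 MB/s -> roughly 167 minutes (~2.8 hours)
            System.out.printf("2010: %.1f minutes%n", minutesToRead(1_000_000, 100));
        }
    }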

  7. Cost-effective. One Google server rack: 176 processors, 176 GB of memory, $278,000. Commercial grade server: 8 processors, 1/3 the memory, a comparable amount of disk space, $758,000. Scalable. Failure is to be expected. Inexpensive Commodity Hardware Dale Denis

  8. Apache Nutch: Doug Cutting created Apache Lucene. Apache Nutch was a spin-off of Lucene. Nutch was an open source web search engine. Development started in 2002. The architecture had scalability issues due to the very large files generated as part of the web crawl and indexing process. A Solution is Born Dale Denis

  9. In 2003 the paper “The Google File System” was published. In 2004 work began on an open source implementation of the Google File System (GFS) for the Nutch web search engine. The project was called the Nutch Distributed File System (NDFS). In 2004 Google published a paper introducing MapReduce. By early 2005 the Nutch developers had a working implementation of MapReduce, and by the end of the year all of the major algorithms in Nutch had been ported to run MapReduce on NDFS. A Solution is Born Dale Denis

  10. In early 2006 they realized that the MapReduce implementation and NDFS had potential beyond web search. The project was moved out of Nutch and was renamed Hadoop. In 2006 Doug Cutting was hired by Yahoo!, and Hadoop became an open source project at Yahoo!. In 2008 Yahoo! announced that its production search index was running on a 10,000-core Hadoop cluster. A Solution is Born Dale Denis

  11. A scalable distributed file system for large distributed data-intensive applications. Provides fault tolerance while running on inexpensive commodity hardware. Delivers high aggregate performance to a large number of clients. The design was driven by observations at Google of their application workloads and technological environment. The file system API and the applications were co-designed. The Google File System (GFS) Dale Denis

  12. The system is built from many inexpensive commodity components that often fail. The system must tolerate, detect, and recover from component failures. The system stores a modest number of large files. Small files must also be supported, but the system doesn’t need to be optimized for them. The workloads primarily consist of large streaming reads and small random reads. GFS Design Goals and Assumptions Dale Denis

  13. The workloads also have many large, sequential writes that append data to files. Small writes at arbitrary positions in a file are to be supported but do not have to be efficient. Multiple clients must be able to concurrently append to the same file. High sustained bandwidth is more important than low latency. The system must provide a familiar file system interface. GFS Design Goals and Assumptions Dale Denis

  14. Supports operations to create, delete, open, close, read, and write files. Has snapshot and record append operations. Record append operations allow multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each client’s append. GFS Interface Dale Denis

  15. A GFS cluster consists of: A single master. Multiple chunk servers. The GFS cluster is accessed by multiple clients. Files are divided into fixed-size chunks. The chunks are 64 MB by default; this is configurable. Chunk servers store the chunks on local disks. Each chunk is replicated on multiple servers, with a standard replication factor of 3. GFS Architecture Dale Denis
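
As a rough sketch of how fixed-size chunking works (the class, constants, and method names below are illustrative, not GFS code):

    // Illustrative sketch of fixed-size chunking as described above.
    // ChunkLocator, CHUNK_SIZE, and REPLICATION are hypothetical names.
    public class ChunkLocator {
        static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB default chunk size
        static final int REPLICATION = 3;                 // standard replication factor

        // Which chunk of the file holds this byte offset?
        static long chunkIndex(long fileOffset) {
            return fileOffset / CHUNK_SIZE;
        }

        // Where inside that chunk does the byte live?
        static long offsetWithinChunk(long fileOffset) {
            return fileOffset % CHUNK_SIZE;
        }
    }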

  16. Two files being stored on three chunk servers with a replication factor of 2. GFS Architecture Dale Denis

  17. Maintains all file system metadata: Namespace information. Access control information. Mapping from files to chunks. The current location of the chunks. Controls system-wide activities: Executes all namespace operations. Chunk lease management. Garbage collection. Chunk migration between the servers. Communicates with each chunk server in heartbeat messages. GFS Single Master Dale Denis

  18. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunk servers. A client sends the master a request for a file and the master responds with the locations of all of the chunks. The client then requests the data from one of the chunk servers holding a replica. GFS Single Master Dale Denis
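
A minimal sketch of that read path, assuming hypothetical Master and ChunkServer interfaces (none of these types or method names come from GFS itself):

    // Hypothetical sketch of the read path described above.
    interface Master {
        // Returns the chunk handle and replica locations for (file, chunk index).
        ChunkInfo lookup(String path, long chunkIndex);
    }

    interface ChunkServer {
        byte[] read(long chunkHandle, long offset, int length);
    }

    record ChunkInfo(long handle, java.util.List<ChunkServer> replicas) {}

    class GfsClientSketch {
        byte[] read(Master master, String path, long fileOffset, int length) {
            long chunkSize = 64L * 1024 * 1024;
            // 1. The metadata request goes to the master...
            ChunkInfo info = master.lookup(path, fileOffset / chunkSize);
            // 2. ...but the data itself is fetched from one of the chunk servers.
            ChunkServer replica = info.replicas().get(0); // e.g. pick the closest replica
            return replica.read(info.handle(), fileOffset % chunkSize, length);
        }
    }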

  19. The chunk size is large. Advantages: Reduces the client’s need to interact with the master. Helps to keep the master from being a bottleneck. Reduces the size of the metadata stored on the master. The master is able to keep all of the metadata in memory. Reduces the network overhead by keeping persistent TCP connections to the chunk server over an extended period of time. Disadvantages: Hotspots with small files if too many clients are accessing the same file. Chunks are stored on local disks as Linux files. GFS Chunks Dale Denis

  20. The master stores three types of metadata: File and chunk namespaces. The mapping from files to chunks. The location of each chunk’s replicas. The replica locations are not stored persistently; instead, the master asks each chunk server about its chunks at master startup and when a chunk server joins the cluster. The master also includes rudimentary support for permissions and quotas. GFS Metadata Dale Denis
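
A data-structure sketch of these three kinds of metadata, with the replica-location map rebuilt from chunk server reports rather than persisted (all names are hypothetical):

    // Illustrative sketch of the master's in-memory metadata.
    import java.util.*;

    class MasterMetadataSketch {
        // 1. File and chunk namespaces (persisted via the operation log).
        Set<String> namespace = new HashSet<>();
        // 2. Mapping from file path to the ordered list of chunk handles (also persisted).
        Map<String, List<Long>> fileToChunks = new HashMap<>();
        // 3. Current replica locations per chunk handle: NOT persisted;
        //    rebuilt by asking chunk servers at startup and when they join.
        Map<Long, List<String>> chunkToServers = new HashMap<>();
    }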

  21. The operation log is central to GFS! The operation log contains a historical record of critical metadata changes. Files and chunks are uniquely identified by the logical times at which they were created. The log is replicated on multiple remote machines. The master recovers its state by replaying the operation log. Monitoring infrastructure outside of GFS starts a new master process if the old master fails. Shadow masters provide read-only access when the primary master is down. GFS Metadata Dale Denis

  22. A mutation is an operation that changes the contents or metadata of a chunk. Leases are used to maintain a consistent mutation order across replicas. The master grants a lease to one of the replicas, which is called the primary. The primary picks a serial order for all mutations to the chunk. The lease mechanism is designed to minimize the management overhead at the master. GFS Leases and Mutations Dale Denis

  23. The client asks the master which chunk server holds the current lease for a chunk and the locations of the other replicas. The client pushes the data to all of the replicas. When all replicas acknowledge receiving the data the client sends a write request to the primary. The primary serializes the mutations and applies the changes to its own state. GFS The anatomy of a mutation Dale Denis

  24. The primary forwards the write request to all of the secondary replicas. The secondaries apply the mutations in the same serial order assigned by the primary. The secondaries reply to the primary that they have completed. The primary replies to the client. GFS The anatomy of a mutation Dale Denis

  25. The data flow and the control flow have been decoupled. The data is pushed linearly along a carefully picked chain of chunk servers in a pipeline fashion. Each chunk server forwards the data to the next nearest chunk server in the chain. The goal is to fully utilize each machine’s network bandwidth and avoid bottlenecks. GFS The anatomy of a mutation Dale Denis
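
Putting the last three slides together, a hypothetical sketch of the write flow might look like the following (the Primary/Secondary interfaces and method names are illustrative, not GFS RPCs):

    // Hypothetical sketch of the write flow described above.
    import java.util.List;

    interface Replica { void push(byte[] data); }              // data flow (pipelined in reality)
    interface Secondary extends Replica { void apply(long serial); }
    interface Primary extends Replica {
        long assignSerial();                                    // primary picks the mutation order
    }

    class WriteSketch {
        void write(byte[] data, Primary primary, List<Secondary> secondaries) {
            // 1. Push the data to all replicas (in reality, linearly along a chain of chunk servers).
            primary.push(data);
            secondaries.forEach(s -> s.push(data));
            // 2. Send the write request to the primary, which serializes the mutation...
            long serial = primary.assignSerial();
            // 3. ...and forwards it to every secondary, which applies it in the same order.
            secondaries.forEach(s -> s.apply(serial));
            // 4. The secondaries acknowledge the primary; the primary replies to the client.
        }
    }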

  26. Write Control and Data Flow. GFS The anatomy of a mutation Dale Denis

  27. Record appends are atomic. The client specifies the data, and GFS appends it to the file atomically at an offset of GFS’s choosing. In a traditional write, the client specifies the offset at which data is to be written. The primary replica checks to see if appending to the current chunk would exceed the maximum size. If so, the primary pads the current chunk and replies to the client that the operation should be retried on the next chunk. GFS Record Append Dale Denis
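
A sketch of the boundary check the primary performs, assuming a 64 MB chunk size (the names and return codes are illustrative):

    // Illustrative sketch of the record-append check performed by the primary.
    class RecordAppendSketch {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;

        enum Result { APPENDED, RETRY_ON_NEXT_CHUNK }

        Result append(long currentChunkUsed, byte[] record) {
            if (currentChunkUsed + record.length > CHUNK_SIZE) {
                // Pad the remainder of the current chunk and tell the client to retry,
                // so a record never straddles a chunk boundary.
                return Result.RETRY_ON_NEXT_CHUNK;
            }
            // Otherwise GFS picks the offset (the current end of the chunk) and
            // applies the append at that offset on every replica.
            return Result.APPENDED;
        }
    }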

  28. Namespace Management and Locking. Locks allow multiple operations to be active at the same time. Locks over regions of the namespace ensure proper serialization. Each master operation acquires a set of locks before it runs. The centralized server approach was chosen in order to simplify the design. Note: GFS does not have a per-directory data structure that lists all the files in that directory. GFS Master Operations Dale Denis
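
A minimal sketch of this locking scheme, following the convention from the GFS paper of taking read locks on the ancestor directories and a write lock on the leaf being created (the class and helper names are hypothetical):

    // Sketch of namespace locking over path regions, as described above.
    import java.util.*;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class NamespaceLockSketch {
        private final Map<String, ReentrantReadWriteLock> locks = new HashMap<>();

        private ReentrantReadWriteLock lockFor(String path) {
            return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
        }

        // e.g. creating /home/user/file takes read locks on /home and /home/user
        // and a write lock on /home/user/file.
        void lockForCreate(String fullPath) {
            for (String dir : ancestorsOf(fullPath)) lockFor(dir).readLock().lock();
            lockFor(fullPath).writeLock().lock();
        }

        private List<String> ancestorsOf(String path) {
            List<String> result = new ArrayList<>();
            int idx = path.indexOf('/', 1);
            while (idx > 0) {
                result.add(path.substring(0, idx));
                idx = path.indexOf('/', idx + 1);
            }
            return result; // acquired in a consistent order to avoid deadlock
        }
    }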

  29. Replica Placement. The dual goals of the replica placement policy: Maximize data reliability and availability. Maximize network bandwidth utilization. Chunks must not only be spread across machines, they must also be spread across racks: For fault tolerance. To exploit the aggregate bandwidth of multiple racks. GFS Master Operations Dale Denis

  30. The master rebalances replicas periodically. Replicas are removed from chunk servers with below-average free space. Through this process the master gradually fills up a new chunk server. Chunks are re-replicated when the number of replicas falls below a user-specified goal, due to failure or data corruption. Garbage collection is done lazily at regular intervals. GFS Master Operations Dale Denis

  31. When a file is deleted, the file is renamed to a hidden name and given a deletion timestamp. After three days the file is removed from the namespace; the time interval is configurable. Hidden files can be undeleted. In the regular heartbeat message the chunk server reports a subset of the chunks that it has. The master replies with the IDs of the chunks that are no longer in the namespace. The chunk server is free to delete chunks that are not in the namespace. GFS Garbage Collection Dale Denis
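
The heartbeat exchange can be sketched as a simple set difference (all names here are illustrative):

    // Sketch of the garbage-collection exchange described above.
    import java.util.*;

    class GarbageCollectionSketch {
        // Chunk handles the master still references in its namespace.
        Set<Long> liveChunks = new HashSet<>();

        // On each heartbeat the chunk server reports a subset of its chunks;
        // the master replies with the ones it no longer knows about, and the
        // chunk server is then free to delete those replicas.
        Set<Long> orphansIn(Set<Long> reportedByChunkServer) {
            Set<Long> orphans = new HashSet<>(reportedByChunkServer);
            orphans.removeAll(liveChunks);
            return orphans;
        }
    }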

  32. Each chunk server uses checksumming to detect the corruption of stored data. Each chunk is broken into 64 KB blocks, and each block has a 32-bit checksum. During idle periods the chunk servers are scanned to verify the contents of inactive chunks. GFS servers generate diagnostic logs that record many significant events: Chunk servers going online and offline. All RPC requests and replies. GFS Data Integrity Dale Denis
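
A sketch of per-block checksumming, using Java's CRC32 as a stand-in 32-bit checksum (the slide does not name the actual checksum algorithm GFS uses):

    // Sketch of 64 KB-block checksumming as described above.
    import java.util.zip.CRC32;

    class ChecksumSketch {
        static final int BLOCK_SIZE = 64 * 1024; // 64 KB blocks

        static long[] checksumChunk(byte[] chunk) {
            int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
            long[] checksums = new long[blocks];
            for (int i = 0; i < blocks; i++) {
                int from = i * BLOCK_SIZE;
                int to = Math.min(from + BLOCK_SIZE, chunk.length);
                CRC32 crc = new CRC32();
                crc.update(chunk, from, to - from);
                checksums[i] = crc.getValue(); // 32-bit value, kept per block
            }
            return checksums;
        }
    }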

  33. Experiment 1: One chunk server with approx. 15,000 chunks containing 600 GB of data was taken off-line. The number of concurrent clonings was restricted to 40% of the total number of chunk servers. All chunks were restored in 23.3 minutes, at an effective replication rate of 440 MB/s. GFS Recovery Time Dale Denis

  34. Experiment 2: Two chunk servers with approx. 16,000 chunks and 660 GB of data were taken off-line. Cloning was set to a high priority. All chunks were restored to 2x replication within 2 minutes, putting the cluster back in a state where it could tolerate another chunk server failure without data loss. GFS Recovery Time Dale Denis

  35. Test Environment. 16 client machines. 19 GFS servers. 16 chunk servers. 1 master, 2 master replicas. All machines had the same configuration. Each machine had a 100 Mbps full-duplex Ethernet connection. 2 HP 2524 10/100 switches. All 19 servers were connected to one switch and all 16 clients were connected to the other. A 1 Gbps link connected the two switches. GFS Measurements Dale Denis

  36. N clients reading simultaneously from the file system. The theoretical limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the switches is saturated. The theoretical per-client limit is 12.5 MB/s when the client's 100 Mbps network interface is saturated. The observed read rate was 10 MB/s when one client was reading. GFS Measurements - Reads Dale Denis

  37. N clients writing simultaneously to N distinct files. The theoretical limit peaks at an aggregate of 67 MB/s because each byte has to be written to 3 of the 16 chunk servers. The observed write rate was 6.3 MB/s. The slow rate was attributed to issues in the network stack that interacted poorly with the pipelining scheme GFS uses to push data between replicas. In practice this has not been a problem. GFS Measurements - Writes Dale Denis

  38. N clients append simultaneously to a single file. The performance is limited by the network bandwidth of the chunk servers that store the last chunk of the file. As the number of clients increases, the congestion on those chunk servers also increases. GFS Measurements - Appends Dale Denis

  39. The Hadoop Distributed File System: the open source distributed file system for large data sets that is based upon the Google File System. As with GFS, HDFS is a distributed file system that is designed to run on commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS is not a general-purpose file system. The Hadoop Distributed File System (HDFS) Dale Denis

  40. Hardware failure is the norm: the detection of faults and quick, automatic recovery is a core architectural goal. Large data sets: applications that run on HDFS have large data sets. A typical file is gigabytes to terabytes in size, and the system should support tens of millions of files. Streaming data access: applications that run on HDFS need streaming access to their data sets. The emphasis is on high throughput of data access rather than low latency. HDFS Design Goals and Assumptions Dale Denis

  41. Simple coherency model: once a file has been written it cannot be changed. There is a plan to support appending writes in the future. Portability across heterogeneous hardware and software platforms, in order to facilitate adoption of HDFS; Hadoop is written in Java. Provide interfaces for applications to move themselves closer to where the data is: “Moving computation is cheaper than moving data.” HDFS Design Goals and Assumptions Dale Denis

  42. An HDFS cluster consists of: A Name Node. Multiple Data Nodes. Files are divided into fixed-size blocks. The blocks are 64 MB by default; this is configurable. The goal is to minimize the cost of seeks: seek time should be about 1% of transfer time. As transfer speeds increase, the block size can be increased. Block sizes that are too big will cause MapReduce jobs to run slowly. HDFS Architecture Dale Denis
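
A rough illustration of the seek-cost rule: assuming a typical 10 ms seek time and a 100 MB/s transfer rate (assumed figures, not values from the slide), the block must be on the order of 100 MB for the seek to cost about 1% of the transfer:

    // Rough block-size estimate from the "seek time = 1% of transfer time" rule.
    // The seek time and transfer rate below are assumed typical values.
    public class BlockSizeEstimate {
        public static void main(String[] args) {
            double seekTimeSec = 0.010;       // ~10 ms average seek
            double transferRateMBps = 100.0;  // ~100 MB/s sustained transfer
            // For the seek to cost only 1% of the transfer, the transfer must take
            // 100x the seek time (1 second), which moves about 100 MB.
            double blockSizeMB = transferRateMBps * (seekTimeSec / 0.01);
            System.out.printf("Block size on the order of %.0f MB%n", blockSizeMB);
        }
    }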

  43. Writing data to the HDFS. No control messages to or from the data nodes. No concern for serialization. HDFS Architecture Dale Denis

  44. Ideally, bandwidth between nodes would be used to determine distance, but in practice measuring bandwidth between nodes is difficult. HDFS assumes that bandwidth becomes progressively smaller in each of the following scenarios: Processes on the same node. Different nodes on the same rack. Nodes on different racks in the same data center. Nodes in different data centers. HDFS Network Topology Dale Denis

  45. By default HDFS assumes that all nodes are on the same rack in the same data center. A script, referenced from the XML configuration, is used to map nodes to locations. HDFS Network Topology Dale Denis

  46. There is a trade-off between reliability, write bandwidth, and read bandwidth. Placing all replicas on nodes in different data centers provides high redundancy at the cost of high write bandwidth. HDFS Replica Placement Dale Denis

  47. The first replica goes on the same node as the client. The second replica goes on a different rack, selected at random. The third replica is placed on the same rack as the second, but a different node is chosen. Further replicas are placed on nodes selected at random from the cluster. Nodes that are too busy or too full are avoided, and the system avoids placing too many replicas on the same rack. HDFS Replica Placement Dale Denis
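
A sketch of this placement policy (the Node type and helper methods are hypothetical; HDFS implements this logic inside the name node):

    // Illustrative sketch of the default replica placement described above.
    import java.util.*;

    class ReplicaPlacementSketch {
        record Node(String name, String rack) {}

        private final Random rnd = new Random();

        List<Node> place(Node client, List<Node> cluster, int replication) {
            List<Node> chosen = new ArrayList<>();
            chosen.add(client);                                        // 1st: same node as the writer
            Node second = randomOnOtherRack(cluster, client.rack());   // 2nd: a different rack
            chosen.add(second);
            chosen.add(randomOnRack(cluster, second.rack(), second));  // 3rd: same rack as 2nd, other node
            while (chosen.size() < replication) {                      // further replicas: random nodes
                Node n = random(cluster);
                if (!chosen.contains(n)) chosen.add(n);
            }
            return chosen;
        }

        private Node random(List<Node> nodes) { return nodes.get(rnd.nextInt(nodes.size())); }
        private Node randomOnOtherRack(List<Node> nodes, String rack) {
            return random(nodes.stream().filter(n -> !n.rack().equals(rack)).toList());
        }
        private Node randomOnRack(List<Node> nodes, String rack, Node exclude) {
            return random(nodes.stream().filter(n -> n.rack().equals(rack) && !n.equals(exclude)).toList());
        }
    }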

  48. Based upon the POSIX model, but does not provide strong security for HDFS files; it is designed to prevent accidental corruption or misuse of information. Each file and directory is associated with an owner and a group. For files there are separate permissions to read, write, or append to the file. For directories there are separate permissions to create or delete files or directories. Permissions are new to HDFS; mechanisms such as Kerberos authentication to establish user identity are planned for the future. HDFS Permissions Dale Denis

  49. HDFS provides a Java API. A JNI-based wrapper, libhdfs, has been developed that allows C/C++ programs to use the Java API. Work is underway to expose HDFS through the WebDAV protocol. HDFS Accessibility Dale Denis
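
For example, reading a file through the Java API looks roughly like this (the NameNode URI and file path are placeholders for your own cluster):

    // Minimal example of reading a file through the HDFS Java API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import java.net.URI;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode URI and file path.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            try (FSDataInputStream in = fs.open(new Path("/user/example/input.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false); // stream the file to stdout
            }
        }
    }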

  50. The main web interface is exposed on the NameNode at port 50070. It contains an overview of the health, capacity, and usage of the cluster. Each data node also has a web interface, at port 50075. Log files generated by the Hadoop daemons can be accessed through this interface, which is very useful for distributed debugging and troubleshooting. HDFS Web Interface Dale Denis
