Introduction to Distributed File Systems and the Google File System
Distributed file system • A distributed file system is a client/server-based application that lets clients access and process data stored on a server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server.
Distributed File System • Single Namespace for entire cluster • Data Coherency – Write-once-read-many access model – Client can only append to existing files • Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes • Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
Hadoop • Apache Hadoop is an open-source software framework written in Java. It is primarily used for storing and processing large data sets, better known as big data. It comprises several components that allow large data volumes to be stored and processed in a clustered environment. The two main components are the Hadoop Distributed File System (HDFS) and the MapReduce programming model.
Who uses Hadoop? • Amazon/A9 • Yahoo! • Facebook • Google • IBM • Joost • Last.fm • New York Times • PowerSet • Veoh
Main components of Hadoop • Hadoop includes a wide range of technologies that help solve complex business problems. • Core components of a Hadoop application are: 1) Hadoop Common 2) HDFS 3) Hadoop MapReduce 4) YARN
1) Hadoop Common • Provides the Java libraries, utilities, OS-level abstractions, and the Java files and scripts needed to run Hadoop. • 2) HDFS • The default storage layer for Hadoop. • 3) Hadoop MapReduce • The framework used to process large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process. • 4) YARN • Hadoop YARN is a framework for job scheduling and cluster resource management.
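The two-step map and reduce process can be illustrated with a word count. This is a minimal single-machine sketch in plain Python, not Hadoop's Java API; the function names `map_phase` and `reduce_phase` and the sample input are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input; on a real cluster each line could be mapped on a different node.
lines = ["big data big cluster", "data node"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(pairs)
print(word_counts)
```

On a cluster the map calls run in parallel next to the data, and the framework groups the emitted pairs by key before the reduce step.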
Other components • Data Access Components are - Pig and Hive • Data Storage Component is - HBase • Data Integration Components are - Apache Flume, Sqoop, Chukwa • Data Management and Monitoring Components are - Ambari, Oozie and Zookeeper. • Data Serialization Components are - Thrift and Avro • Data Intelligence Components are - Apache Mahout and Drill
Other components • Spark- Used on top of HDFS, Spark promises speeds up to 100 times faster than the two-step MapReduce function in certain applications. It allows data to be loaded in memory and queried repeatedly, making it particularly apt for machine learning algorithms. • Hive- Originally developed by Facebook, Hive is a data warehouse infrastructure built on top of Hadoop. Hive provides a simple, SQL-like language called HiveQL, whilst maintaining full support for MapReduce. This means SQL programmers with little prior Hadoop experience can use the system more easily, and it integrates better with certain analytics packages such as Tableau. Hive also provides indexes, making querying faster. • HBase- A NoSQL columnar database designed to run on top of HDFS. It is modelled after Google’s BigTable and written in Java. It was designed to provide BigTable-like capabilities to Hadoop, such as the columnar data storage model and storage for sparse data.
Other components • Flume- Flume collects data (typically logs) from ‘agents’, then aggregates it and moves it into Hadoop. In essence, Flume takes the data from the source (say a server or mobile device) and delivers it to Hadoop. • Mahout- Mahout is a machine learning library. It collects key algorithms for clustering, classification and collaborative filtering and implements them on top of distributed data systems, like MapReduce. Mahout primarily set out to collect algorithms for implementation on the MapReduce model, but has begun implementing them on other systems that are more efficient for data mining, such as Spark. • Sqoop- Sqoop is a tool that helps move data from other database systems (such as relational databases) into Hadoop.
Hadoop Distributed File System • HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. • Designed to reliably store very large files across machines in a large cluster • Data Model • Data is organized into files and directories • In the storage layer, files are divided into uniform-sized blocks (64 MB, 128 MB, 256 MB) and distributed across cluster nodes • Blocks are replicated to handle hardware failure • The file system keeps checksums of data for corruption detection and recovery • HDFS exposes block placement so that computation can be moved to the data
HDFS Components Master-Slave architecture • HDFS Master “Namenode” - Manages the file system namespace - Controls read/write access to files - Manages block replication - Maps a file name to a set of blocks • HDFS Workers “Datanodes” - Serve read/write requests from clients - Perform replication tasks upon instruction by Namenode. - Report blocks and system state. • HDFS Namespace Backup “Secondary Namenode”
HDFS Blocks • HDFS has a large block size: default 64 MB (128 MB in Hadoop 2 and later); typical 128 MB, 256 MB, 512 MB… Normal file system blocks are a few kilobytes. • Unlike in a file system for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block: a 10 MB file needs only 10 MB of space on the local drive, not the full block size. • A file is stored in blocks on various nodes in the Hadoop cluster. • Provides a complete abstraction view to the client.
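The block arithmetic can be checked with a short sketch. It assumes a 128 MB block size (one of the typical values) and a hypothetical helper `split_into_blocks`; it only computes sizes and does not talk to HDFS.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size for this example

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file of file_size occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block uses only the space it needs
    return sizes

print([s // MB for s in split_into_blocks(300 * MB)])  # [128, 128, 44]
print([s // MB for s in split_into_blocks(10 * MB)])   # [10]
```

A 300 MB file therefore occupies two full blocks plus one 44 MB block, and a 10 MB file occupies a single 10 MB block.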
NameNode Metadata • Metadata in Memory – The entire metadata is in main memory – No demand paging of FS metadata • Types of Metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. access time, replication factor • A Transaction Log – Records file creations, file deletions, etc.
DataNode • A Block Server – Stores data in the local file system – Stores meta-data of a block – Serves data and meta-data to Clients • Block Report – Periodically sends a report of all existing blocks to the NameNode • Facilitates Pipelining of Data – Forwards data to other specified DataNodes
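A minimal sketch of how block reports could keep the NameNode's "list of DataNodes for each block" current. The class `NameNodeBlockMap` and its methods are hypothetical names for illustration, not Hadoop classes.

```python
from collections import defaultdict

class NameNodeBlockMap:
    """Hypothetical in-memory map from block ID to the DataNodes holding it."""

    def __init__(self):
        self.locations = defaultdict(set)

    def process_block_report(self, datanode, block_ids):
        """Replace what is known about one DataNode with its latest report."""
        for holders in self.locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.locations[block_id].add(datanode)

    def replicas(self, block_id):
        """Return the DataNodes currently known to hold block_id."""
        return sorted(self.locations.get(block_id, set()))

nn = NameNodeBlockMap()
nn.process_block_report("dn1", ["blk_1", "blk_2"])
nn.process_block_report("dn2", ["blk_1"])
nn.process_block_report("dn1", ["blk_2"])  # dn1 no longer reports blk_1
print(nn.replicas("blk_1"))  # ['dn2']
```

Because each report replaces the previous one for that DataNode, a block that disappears from a node stops being listed there after the next report.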
Block Placement • Current Strategy -- One replica on a random node on the local rack -- Second replica on a random remote rack -- Third replica on a different node on the same remote rack -- Additional replicas are randomly placed • Clients read from the nearest replica • Would like to make this policy pluggable
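The placement strategy can be sketched as follows. This is an illustration under stated assumptions, not Hadoop's placement code: `place_replicas`, the `nodes_by_rack` layout and the node names are hypothetical, and the sketch assumes at least two racks with at least two nodes on the remote rack.

```python
import random

def place_replicas(writer_rack, nodes_by_rack, replication=3, rng=random):
    """Pick DataNodes for one block following the strategy listed above.

    nodes_by_rack maps a rack name to a list of node names (hypothetical layout).
    """
    first = rng.choice(nodes_by_rack[writer_rack])            # local rack
    remote_rack = rng.choice([r for r in nodes_by_rack if r != writer_rack])
    second, third = rng.sample(nodes_by_rack[remote_rack], 2)  # same remote rack
    chosen = [first, second, third][:replication]
    # Any additional replicas are placed at random on nodes not yet used.
    remaining = [n for nodes in nodes_by_rack.values() for n in nodes
                 if n not in chosen]
    extra = max(0, replication - len(chosen))
    chosen += rng.sample(remaining, extra)
    return chosen

cluster = {"rack1": ["a", "b"], "rack2": ["c", "d"], "rack3": ["e", "f"]}
print(place_replicas("rack1", cluster))
```

Keeping two replicas on one remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.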
Replication Engine • NameNode detects DataNode failures – Chooses new DataNodes for new replicas – Balances disk usage – Balances communication traffic to DataNodes
Data Pipelining • Client retrieves a list of DataNodes on which to place replicas of a block • Client writes block to the first DataNode • The first DataNode forwards the data to the next DataNode in the Pipeline • When all replicas are written, the Client moves on to the next block in file
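A toy model of the pipeline, assuming in-memory storage instead of network transfers. The names `write_block_through_pipeline`, `write_file` and the `choose_pipeline` callback are hypothetical.

```python
def write_block_through_pipeline(block, pipeline, storage):
    """Store block on every DataNode in pipeline, each node forwarding to the next.

    storage is a hypothetical dict: DataNode name -> list of stored blocks.
    """
    if not pipeline:
        return  # end of the pipeline reached
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)          # node stores its replica
    write_block_through_pipeline(block, rest, storage)  # then forwards downstream

def write_file(blocks, choose_pipeline):
    """Client loop: finish all replicas of one block before starting the next."""
    storage = {}
    for index, block in enumerate(blocks):
        write_block_through_pipeline(block, choose_pipeline(index), storage)
    return storage

storage = write_file([b"aa", b"bb"], lambda index: ["dn1", "dn2", "dn3"])
print(storage)
```

The client sends each block only once; the DataNodes carry the remaining copies along the pipeline.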
Secondary NameNode • Copies FsImage and Transaction Log from NameNode to a temporary directory • Merges FsImage and Transaction Log into a new FsImage in the temporary directory • Uploads new FsImage to the NameNode – Transaction Log on NameNode is purged
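The checkpoint cycle can be sketched with dictionaries standing in for the FsImage and the Transaction Log. The record format (`"create"` / `"delete"` tuples) and the function names are assumptions for illustration; the real on-disk formats are different.

```python
def apply_log(fsimage, transaction_log):
    """Return a new FsImage with every logged operation applied in order."""
    merged = dict(fsimage)  # work on a copy, as the merge happens off the NameNode
    for op, path in transaction_log:
        if op == "create":
            merged[path] = {"blocks": []}
        elif op == "delete":
            merged.pop(path, None)
    return merged

def checkpoint(namenode):
    """Merge the NameNode's FsImage and log, upload the result, purge the log."""
    new_image = apply_log(namenode["fsimage"], namenode["log"])
    namenode["fsimage"] = new_image
    namenode["log"] = []
    return new_image

namenode = {
    "fsimage": {"/a": {"blocks": ["blk_1"]}},
    "log": [("create", "/b"), ("delete", "/a")],
}
print(checkpoint(namenode))  # {'/b': {'blocks': []}}
```

After the checkpoint the log is empty, so a restarting NameNode has fewer logged operations to replay.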
Google file system • The most influential development for Hadoop was GFS. • The concept was first introduced in a white paper in 2003. • This was the time when the internet was booming and people were beginning to see very large datasets. • Google needed a file system that was both reliable and scalable without sacrificing performance or cost.
Google file system • Large, distributed, highly fault-tolerant file system. • Provides fault tolerance while serving a large number of clients with high aggregate performance. • Google's business extends beyond search. • Google stores its data on more than 15,000 commodity machines.
4 Key Observations • Component failures should be treated as the norm, not the exception. • The paper argues that the quantity and quality of the components guarantee failures in large deployments, so the authors set out to build a file system that accommodates these failures easily. • Files are getting larger: multi-GB files were normal and some datasets were already multiple TB in 2003; some now reach petabytes.
4 Key Observations (cont.) • Most files are mutated by appending new data rather than overwriting. • This includes data streams and many data analysis workloads. With Twitter data, for example, you do not need to overwrite anything; you analyze the data and perhaps occasionally append to it. • Google's point was that much of the new data coming in does not require the ability to write within an existing file. • Co-designing the applications and the API benefits the overall system. • If you design a file system with a use case in mind, the file system will benefit.
DESIGN OVERVIEW • Assumptions • Built from many inexpensive commodity components that often fail. • It stores a modest number of large files. • Workloads consist of large streaming reads and small random reads. • Workloads also have many large, sequential writes that append data to files. • Must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. • High sustained bandwidth is more important than low latency.
Google file system Architecture • Google developed a file system that can process large datasets and accommodates hardware failure by using many inexpensive commodity components instead of a few expensive units. • The use of inexpensive commodity components also enabled linear scalability: • Increasing the storage capacity was as simple as adding more inexpensive units • A GFS cluster consists of a single master and multiple chunkservers. • The basic roles in GFS are master, client and chunkservers.
Google file system Architecture • Files are divided into fixed-size chunks. • Chunk servers store chunks on local disks as Linux files. • Master maintains all file system metadata.
This metadata includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. • Clients interact with the master for metadata operations. • Chunkservers need not cache file data.
ChunkServer: houses the actual data sets • A chunk is similar to the concept of a block in file systems. • Compared to file system blocks, a chunk is much larger: 64 MB. • Fewer chunks mean less chunk metadata in the master. • Each chunk is stored on a chunkserver as a file and is identified by a chunk handle, i.e., the chunk file name. • Chunks are replicated 3 times and the replicas are put on different chunkservers.
Metadata • Master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the location of each chunk’s replicas. • The first two types are also kept persistent by logging mutations to an operation log stored on the master’s local disk. • The third type is not stored persistently; it is kept up to date through regular heartbeat messages from the chunkservers. • Because metadata is stored in memory, master operations are fast. • It is also easy and efficient for the master to periodically scan its entire state. • Periodic scanning is used to implement chunk garbage collection, re-replication and chunk migration.
Master • Single process, running on a separate machine, that stores all metadata. • Load management: chunk management and load balancing. Client • Clients contact the master to get the metadata they need to contact the chunkservers.
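With fixed-size 64 MB chunks, a byte range in a file maps to chunk indices by integer division. A minimal sketch, assuming a hypothetical helper `chunk_requests` that is not part of any GFS client library:

```python
MB = 1024 * 1024
CHUNK_SIZE = 64 * MB  # 64 MB chunks

def chunk_requests(offset, length, chunk_size=CHUNK_SIZE):
    """Translate a file byte range into (chunk index, offset in chunk, length) tuples."""
    requests = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, chunk_size)
        span = min(chunk_size - within, end - offset)  # stop at the chunk boundary
        requests.append((index, within, span))
        offset += span
    return requests

# A 10 MB read starting at byte offset 60 MB crosses from chunk 0 into chunk 1.
for index, within, span in chunk_requests(60 * MB, 10 * MB):
    print(index, within // MB, span // MB)
```

The example read touches the last 4 MB of chunk 0 and the first 6 MB of chunk 1, so the client needs the handles of two chunks.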
SYSTEM INTERACTION • Read Algorithm 1. Application originates the read request 2. GFS client translates the request from (filename, byte range) -> (filename, chunk index), and sends it to master 3. Master responds with chunk handle and replica locations (i.e. chunkservers where the replicas are stored)
4. Client picks a location and sends the (chunk handle, byte range) request to the location 5. Chunkserver sends requested data to the client 6. Client forwards the data to the application • Write Algorithm 1. Application originates the request 2. GFS client translates request from (filename, data) -> (filename, chunk index), and sends it to master 3. Master responds with chunk handle and (primary + secondary) replica locations
4. Client pushes write data to all locations. Data is stored in chunkservers’ internal buffers
5. Client sends write command to primary 6. Primary determines serial order for data instances stored in its buffer and writes the instances in that order to the chunk 7. Primary sends the serial order to the secondaries and tells them to perform the write 8. Secondaries respond to the primary 9. Primary responds back to the client
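Steps 4 to 9 of the write algorithm can be sketched with in-memory replicas. The `Replica` class, the data IDs and the use of sorting to choose the serial order are assumptions for illustration; the point shown is that every replica applies the mutations in the one order the primary chose.

```python
class Replica:
    """Hypothetical chunk replica with a buffer for pushed data and the chunk itself."""

    def __init__(self):
        self.buffer = {}   # data id -> bytes pushed by clients (step 4)
        self.chunk = []    # mutations applied so far, in order

    def push(self, data_id, data):
        self.buffer[data_id] = data

    def apply(self, serial_order):
        for data_id in serial_order:
            self.chunk.append(self.buffer.pop(data_id))

def write(primary, secondaries, pending_ids):
    """Steps 5-9: the primary picks one order and every replica applies it."""
    serial_order = sorted(pending_ids)  # any fixed order works; sorting is an assumption
    primary.apply(serial_order)
    for secondary in secondaries:
        secondary.apply(serial_order)
    return all(s.chunk == primary.chunk for s in secondaries)

primary, sec1, sec2 = Replica(), Replica(), Replica()
for replica in (primary, sec1, sec2):
    replica.push("w2", b"second")  # data may arrive in any order
    replica.push("w1", b"first")
print(write(primary, [sec1, sec2], ["w2", "w1"]))  # True
```

Separating the data push from the write command lets the data flow to the replicas in any order while the primary alone decides the order of mutations.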
Record Append Algorithm 1. Application originates record append request. 2. GFS client translates the request and sends it to master. 3. Master responds with chunk handle and (primary + secondary) replica locations. 4. Client pushes write data to all replicas of the last chunk of the file. 5. Primary checks if the record fits in the specified chunk. 6. If the record doesn’t fit, the primary: pads the chunk, tells the secondaries to do the same, and informs the client; the client then retries the append with the next chunk. 7. If the record fits, the primary: appends the record, tells the secondaries to write the data at the exact same offset, receives responses from the secondaries, and sends the final response to the client.
MASTER OPERATION • Namespace management and locking • Multiple master operations can be active at once; they use locks over regions of the namespace to ensure proper serialization. • GFS does not have a per-directory data structure. • GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. • Each master operation acquires a set of locks before it runs. • Replica placement • A GFS cluster is highly distributed. • The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. • Chunk replicas are also spread across racks.
Creation, Re-replication and Rebalancing of Chunks • Factors for choosing where to place the initially empty replicas: (1) Place new replicas on chunkservers with below-average disk space utilization. (2) Limit the number of “recent” creations on each chunkserver. (3) Spread replicas of a chunk across racks. • The master re-replicates a chunk as soon as the number of available replicas falls below the replication goal. • Each chunk that needs to be re-replicated is prioritized based on how far it is from its replication goal. • Finally, the master rebalances replicas periodically.
GARBAGE COLLECTION • Garbage collection happens at both the file and chunk levels. • When a file is deleted by the application, the master logs the deletion immediately. • The file is not removed right away; it is just renamed to a hidden name. • Until it is removed, the file can still be read under the new, special name and can be undeleted. • When the hidden file is finally removed from the namespace, its in-memory metadata is erased.
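The fit check and padding in steps 5 to 7 can be sketched on a single replica. The 64-byte chunk size is deliberately tiny so the example is easy to follow, and `record_append` is a hypothetical helper, not a GFS API.

```python
CHUNK_SIZE = 64  # tiny chunk size in bytes, chosen only for readability

def record_append(chunks, record, chunk_size=CHUNK_SIZE):
    """Append record to the last chunk, padding and moving on if it does not fit.

    chunks is a list of bytearrays; returns (chunk index, offset) of the record.
    """
    if len(record) > chunk_size:
        raise ValueError("record larger than a chunk")
    last = chunks[-1]
    if len(last) + len(record) > chunk_size:
        last.extend(b"\0" * (chunk_size - len(last)))  # pad the chunk to its end
        chunks.append(bytearray())                      # the retry goes to the next chunk
        last = chunks[-1]
    offset = len(last)
    last.extend(record)
    return len(chunks) - 1, offset

chunks = [bytearray()]
print(record_append(chunks, b"a" * 40))  # (0, 0)
print(record_append(chunks, b"b" * 40))  # (1, 0): did not fit, chunk 0 was padded
print(record_append(chunks, b"c" * 10))  # (1, 40)
```

The offset is chosen by the file system rather than by the client, which is what lets the secondaries write the record at exactly the same offset as the primary.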
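Two of the decisions above can be sketched in a few lines: ordering chunks for re-replication and choosing a chunkserver for a new replica. The data layout (`disk_used`, `recent_creations`) and the function names are assumptions; rack spreading is left out to keep the sketch short.

```python
def rereplication_order(chunks, goal=3):
    """Sort chunks so those furthest below the replication goal come first.

    chunks maps a chunk handle to its current number of live replicas.
    """
    under = {handle: n for handle, n in chunks.items() if n < goal}
    return sorted(under, key=lambda handle: goal - under[handle], reverse=True)

def pick_chunkserver(servers):
    """Prefer below-average disk utilization, then the fewest recent creations.

    servers maps a name to a dict with 'disk_used' (0..1) and 'recent_creations'.
    """
    average = sum(s["disk_used"] for s in servers.values()) / len(servers)
    candidates = [name for name, s in servers.items() if s["disk_used"] <= average]
    return min(candidates,
               key=lambda name: (servers[name]["recent_creations"],
                                 servers[name]["disk_used"]))

print(rereplication_order({"a": 2, "b": 1, "c": 3}))  # ['b', 'a']
servers = {
    "s1": {"disk_used": 0.9, "recent_creations": 0},
    "s2": {"disk_used": 0.3, "recent_creations": 5},
    "s3": {"disk_used": 0.4, "recent_creations": 1},
}
print(pick_chunkserver(servers))  # s3
```

Limiting recent creations matters because a newly created chunk usually attracts heavy write traffic right away.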
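Lazy deletion can be sketched as a rename followed by a periodic scan. The `Namespace` class, the hidden-name format and the `grace_period` parameter are assumptions made for illustration.

```python
class Namespace:
    """Hypothetical master namespace with lazy deletion."""

    def __init__(self, grace_period):
        self.files = {}    # visible name -> chunk handles
        self.hidden = {}   # hidden name -> (original name, handles, deletion time)
        self.grace_period = grace_period

    def delete(self, name, now):
        """Rename the file to a hidden name instead of removing it."""
        hidden_name = f".deleted.{name}.{now}"
        self.hidden[hidden_name] = (name, self.files.pop(name), now)
        return hidden_name

    def undelete(self, hidden_name):
        """Restore a hidden file under its original name."""
        name, handles, _ = self.hidden.pop(hidden_name)
        self.files[name] = handles

    def scan(self, now):
        """Periodic scan: drop hidden files older than the grace period."""
        expired = [h for h, (_, _, t) in self.hidden.items()
                   if now - t > self.grace_period]
        for hidden_name in expired:
            del self.hidden[hidden_name]  # in-memory metadata is erased here
        return expired

ns = Namespace(grace_period=3)
ns.files["/logs/a"] = ["chunk_1"]
hidden = ns.delete("/logs/a", now=0)
ns.undelete(hidden)
print(ns.files)  # {'/logs/a': ['chunk_1']}
```

Deferring the actual removal to a background scan keeps deletion cheap and gives a safety net against accidental deletes.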
FAULT TOLERANCE • High Availability • Fast Recovery • Chunk Replication • Master Replication • Data Integrity • Each chunkserver uses checksumming to detect corrupted data. • Each chunk is broken up into 64 KB blocks, each with its own checksum.
CHALLENGES • Storage size. • Bottleneck for the clients. • Time.
CONCLUSION • Supports large-scale data processing. • Provides fault tolerance. • Tolerates chunkserver failures. • Delivers high throughput. • Serves as a storage platform for research and development.
Data consistency • Eric Brewer’s CAP theorem for Distributed Systems: • There are three core systemic requirements that exist in a special relationship when it comes to designing and deploying applications in a distributed environment: • Consistency [across nodes in a cluster] • Availability • Partition Tolerance • A distributed system can achieve only two of these systemic requirements, but not all three.
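Per-block checksumming can be sketched with Python's standard `zlib.crc32`. CRC-32 is an assumption here (the slide only says checksumming is used), and the helper names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks

def checksum_blocks(data, block_size=BLOCK_SIZE):
    """Return one CRC-32 checksum per 64 KB block of data."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def corrupted_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return the indices of blocks whose checksum no longer matches."""
    actual = checksum_blocks(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

chunk = bytearray(200 * 1024)        # 200 KB of zeros -> 4 blocks
expected = checksum_blocks(bytes(chunk))
chunk[70 * 1024] ^= 0xFF             # flip one byte inside block 1
print(corrupted_blocks(bytes(chunk), expected))  # [1]
```

Checksumming small blocks means a read only needs to verify the blocks it touches, and a mismatch pinpoints which part of the chunk must be restored from another replica.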
Terms used • Consistency: A system operates fully, or not at all, e.g. transaction commit/rollback. • Availability: A system is always able to answer a request, e.g. cluster failover. • Partition Tolerance: If data is distributed (e.g. partitioned to different servers in an MPP RDBMS) and one or more nodes fail, the system can continue to function.
SQL vs Big Data (non-SQL) Prioritization • Both sides rank the same three properties: Global Consistency, Partition Tolerance, Availability • SQL Server: Atomicity, Consistency, Isolation, Durability (ACID) • Hadoop: Basically Available, Soft State, Eventual Consistency (BASE)
ACID (SQL) vs BASE (HADOOP) • Atomicity, Consistency, Isolation, Durability • Data distributed across nodes must be consistent before it is released for subsequent queries • This is usually achieved via two-phase commit • Yet immediate consistency across distributed partitions (nodes) limits scale-out performance. • Basically Available, Soft State, Eventual Consistency • Eventual consistency is acceptable, so it’s not necessary to hold subsequent queries until updates are fully written to distributed partitions (nodes) • Scale-out performance is greatly enhanced • This is fine when the nature of the data can tolerate some imprecision in query results.
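The BASE trade-off can be illustrated with a toy two-replica store in which a write is acknowledged before it reaches the second replica. The class and method names are hypothetical and the sketch ignores failures and conflicts.

```python
class EventuallyConsistentStore:
    """Hypothetical two-replica store: writes hit one replica, sync happens later."""

    def __init__(self):
        self.replicas = [{}, {}]
        self.pending = []  # updates not yet propagated to replica 1

    def write(self, key, value):
        self.replicas[0][key] = value
        self.pending.append((key, value))  # acknowledged before replication

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def sync(self):
        """Propagate outstanding updates; afterwards both replicas agree."""
        for key, value in self.pending:
            self.replicas[1][key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("x", 1)
print(store.read("x", replica=0), store.read("x", replica=1))  # 1 None (stale read)
store.sync()
print(store.read("x", replica=0), store.read("x", replica=1))  # 1 1
```

An ACID system would hold the second read until both replicas had the update, for example via two-phase commit; the BASE approach answers immediately and accepts a stale result.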
Consistency Issues – Access/Update Ratio • [Figure: timeline comparing user accesses to the page … with updates to the Web page]
Consistency Model • The consistency model determines when and how modifications are made: • Weak versus strong consistency model
Consistency Models (cont.) • [Figure: the general organization of a logical data store, physically distributed and replicated across multiple processes.]
Consistency Models (cont.) • A process that performs a read operation on a data item expects the operation to return a value that shows the result of the last write operation on that data • Without a global clock, it is difficult to define which write operation was the last • Consistency models provide other definitions • Different consistency models place different restrictions on the values that a read operation can return
Summary of Consistency Models • Consistency models not using synchronization operations. • Models with synchronization operations.