The Hadoop Distributed File System: Architecture and Design by Dhruba Borthakur. Presented by Bryant Yao
Introduction • What is it? It’s a file system! • Supports most of the operations a normal file system would. • Open source implementation of GFS (Google File System). • Written in Java • Designed primarily for GNU/Linux • Some support for Windows
Design Goals • HDFS is designed to store large files (think TB or PB). • HDFS is designed for computer clusters made up of racks of machines. • Write once, read many model • Useful for reading many files in bulk, but not for fast access to individual files. • Streaming access to data • Data flows to you continuously rather than in waves. • Make use of commodity computers • Expect hardware to fail • “Moving computation is cheaper than moving data” • (Diagram: a cluster made up of Rack 1 and Rack 2)
Master/Slave Architecture • (Diagram: a single Namenode connected to multiple Datanodes)
Master/Slave Architecture cont. • 1 master, many slaves • The master manages the file system namespace and regulates access to files by clients. • Data is distributed across the slaves, which store it as “blocks”. • What is a block? • A portion of a file. • Files are broken down into and stored as a sequence of blocks, as in the sketch below. • (Diagram: File 1 is broken down into blocks A, B, and C.)
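A minimal sketch of the block idea, assuming a hypothetical 64 MB block size and a plain local file; the class and method names here are illustrative, not HDFS APIs:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows how a file maps onto a sequence of fixed-size blocks.
public class BlockSplitter {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, the default mentioned later

    // Returns the (offset, length) of each block a file would be split into.
    static List<long[]> splitIntoBlocks(String path) throws IOException {
        List<long[]> blocks = new ArrayList<>();
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long size = f.length();
            for (long offset = 0; offset < size; offset += BLOCK_SIZE) {
                long length = Math.min(BLOCK_SIZE, size - offset);
                blocks.add(new long[] { offset, length });
            }
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        for (long[] b : splitIntoBlocks(args[0])) {
            System.out.printf("block at offset %d, length %d%n", b[0], b[1]);
        }
    }
}
```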
Namenode • Master • Handles metadata operations • These are recorded in a transaction log called the EditLog • Manages datanodes • Passes I/O requests to datanodes • Informs the datanodes when to perform block operations. • Maintains a BlockMap which keeps track of which blocks each datanode is responsible for. • Stores all files’ metadata in memory • File attributes, number of replicas, each file’s blocks, block locations, and block checksums. • Stores a copy of the namespace in the FsImage on disk.
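A toy sketch of the in-memory bookkeeping described above; the class and field names are made up for illustration and are not the namenode's actual data structures:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: which datanodes hold which blocks, plus per-file metadata.
public class ToyNamenodeState {
    // blockId -> datanode ids currently holding a replica (a "BlockMap"-like view)
    final Map<String, Set<String>> blockToDatanodes = new HashMap<>();
    // file path -> ordered list of block ids making up the file
    final Map<String, String[]> fileToBlocks = new HashMap<>();
    // file path -> desired replication factor
    final Map<String, Short> fileReplication = new HashMap<>();

    void addReplica(String blockId, String datanodeId) {
        blockToDatanodes.computeIfAbsent(blockId, k -> new HashSet<>()).add(datanodeId);
    }

    // Blocks whose replica count has dropped below the target need re-replication.
    boolean isUnderReplicated(String file, String blockId) {
        int have = blockToDatanodes.getOrDefault(blockId, Set.of()).size();
        return have < fileReplication.getOrDefault(file, (short) 3);
    }
}
```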
Datanode • Slave • Handles data I/O. • Handles block creation, deletion, and replication. • Local storage is optimized: block files are spread over multiple directories rather than all being stored in a single directory (see the sketch below).
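A minimal sketch of spreading block files across several local directories instead of one; the directory and file naming is hypothetical, real datanodes use their own on-disk layout:

```java
import java.io.File;

// Illustrative: pick a subdirectory for each block so no single directory
// accumulates all block files.
public class BlockDirectoryLayout {
    private final File[] dirs;

    BlockDirectoryLayout(File root, int numSubdirs) {
        dirs = new File[numSubdirs];
        for (int i = 0; i < numSubdirs; i++) {
            dirs[i] = new File(root, "subdir" + i);
            dirs[i].mkdirs();
        }
    }

    // Hash the block id to choose a directory deterministically.
    File fileForBlock(String blockId) {
        int idx = Math.floorMod(blockId.hashCode(), dirs.length);
        return new File(dirs[idx], "blk_" + blockId);
    }
}
```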
Data Replication • Makes copies of the data! • The replication factor determines the number of copies. • Specified per file at creation time (with a default configured on the namenode). • Replication is pipelined!
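A short sketch using the Hadoop Java client API to set the replication factor, assuming the Hadoop client library is on the classpath; the path and values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create a file with an explicit replication factor, then change it later.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up HDFS settings if available
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/tmp/example.txt");    // hypothetical path
        short replication = 3;
        long blockSize = 64L * 1024 * 1024;       // 64 MB blocks

        try (FSDataOutputStream out =
                 fs.create(p, true, 4096, replication, blockSize)) {
            out.writeBytes("hello hdfs\n");
        }

        // The replication factor of an existing file can be changed as well.
        fs.setReplication(p, (short) 2);
    }
}
```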
Pipelining Data Replication • Blocks are sent in small portions (4 KB). • (Diagram: a block split into portions A, B, and C is streamed through datanodes 1, 2, and 3; each datanode writes a portion locally and forwards it to the next, so later portions follow right behind earlier ones.)
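A minimal sketch of the pipelining idea, assuming plain streams stand in for the network connections between datanodes; this is not the real HDFS transfer protocol:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative: each node in the pipeline writes a 4 KB portion locally and
// immediately forwards it to the next node, so the copies overlap in time.
public class ReplicationPipeline {
    static final int PORTION = 4 * 1024; // 4 KB portions, as in the slide

    // 'downstream' is the connection to the next datanode, or null for the last one.
    static void receiveAndForward(InputStream from, OutputStream local,
                                  OutputStream downstream) throws IOException {
        byte[] buf = new byte[PORTION];
        int n;
        while ((n = from.read(buf)) != -1) {
            local.write(buf, 0, n);            // persist this portion locally
            if (downstream != null) {
                downstream.write(buf, 0, n);   // and pipeline it to the next replica
            }
        }
        local.flush();
        if (downstream != null) downstream.flush();
    }
}
```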
Replication Policy • Communication bandwidth between computers in the same rack is greater than between computers in different racks. • We could replicate data across racks… but this would consume the most bandwidth. • We could replicate data across all computers in one rack… but if the rack dies we’re in the same position as before.
Replication Policy cont. • Assume only three replicas are created. • Split the replicas between 2 racks. • Rack failure is rare, so we still maintain good data reliability while minimizing bandwidth cost (see the placement sketch below). • Version 0.18.0 • 2 replicas in the current rack (on 2 different nodes) • 1 replica in a remote rack • Version 0.20.3.x • 1 replica in the current rack • 2 replicas in a remote rack (on 2 different nodes) • What happens if the replication factor is 2 or > 3? • No answer in this paper. • Some other papers state that the minimum is 3. • The author wrote a separate paper stating every replica after the 3rd is placed randomly.
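An illustrative placement function for a replication factor of 3 split across two racks, assuming a recent Java version; the selection logic is a simplified stand-in, not HDFS's actual placement code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative: keep replicas on two racks so a whole-rack failure still leaves
// a copy, while most replication traffic stays within a single remote rack.
public class ReplicaPlacement {
    record Node(String name, String rack) {}

    static List<Node> placeThreeReplicas(Node writer, List<Node> cluster) {
        List<Node> chosen = new ArrayList<>();
        chosen.add(writer);                                   // replica 1: the writer's node
        for (Node n : cluster) {                              // replica 2: same rack, different node
            if (n.rack().equals(writer.rack()) && !n.equals(writer)) {
                chosen.add(n);
                break;
            }
        }
        for (Node n : cluster) {                              // replica 3: a node on a remote rack
            if (!n.rack().equals(writer.rack())) {
                chosen.add(n);
                break;
            }
        }
        return chosen; // may be shorter than 3 on tiny clusters
    }
}
```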
Reading Data • Read the data that’s closest to you! • If the block/replica you want is on the datanode/rack/data center you’re on, read it from there! • Read from datanodes directly. • Can be done in parallel. • The namenode is used to get the list of datanodes hosting a requested file’s blocks, as well as the checksums used to validate the blocks retrieved from the datanodes.
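A small sketch of "read the closest replica", assuming a simplified node/rack distance; the ranking here is illustrative and ignores data-center level distance:

```java
import java.util.Comparator;
import java.util.List;

// Illustrative: rank replica locations by "distance" from the reader
// (same node < same rack < different rack) and read from the closest one.
public class ClosestReplica {
    record Location(String node, String rack) {}

    static Location pickClosest(Location reader, List<Location> replicas) {
        return replicas.stream()
            .min(Comparator.comparingInt(r -> distance(reader, r)))
            .orElseThrow();
    }

    static int distance(Location a, Location b) {
        if (a.node().equals(b.node())) return 0;  // local replica
        if (a.rack().equals(b.rack())) return 1;  // same rack
        return 2;                                  // remote rack (or farther)
    }
}
```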
Writing Data • Data is written once. • Split into blocks, typically 64 MB in size. • The larger the block size, the less metadata the namenode must store. • Data is written to a temporary local block on the client side and flushed to a datanode once the block is full. • If a file is closed while the temporary block isn’t full, the remaining data is flushed to a datanode. • If the namenode dies during file creation, the file is lost!
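A minimal sketch of the client-side buffering just described; `flushBlock()` is a hypothetical stand-in for shipping a full block to a datanode:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Illustrative: buffer writes client-side and "flush" a block's worth at a time.
public class BlockBufferingWriter implements AutoCloseable {
    static final int BLOCK_SIZE = 64 * 1024 * 1024;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    void write(byte[] data) throws IOException {
        buffer.write(data);
        if (buffer.size() >= BLOCK_SIZE) {
            flushBlock();
        }
    }

    private void flushBlock() {
        // In HDFS this is where a full block would be sent to a datanode
        // chosen by the namenode; here we simply drop the buffered bytes.
        buffer.reset();
    }

    @Override
    public void close() {
        if (buffer.size() > 0) {
            flushBlock(); // a partially filled final block is flushed on close
        }
    }
}
```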
Hardware Failure • Imagine a file is broken into 3 blocks (A, B, C) spread over three datanodes (1, 2, 3). • If the third datanode died, we would have no access to block C and could not retrieve the file. • (Diagram: blocks A, B, and C stored on datanodes 1, 2, and 3; datanode 3 fails, taking block C with it.)
Designing for Hardware Failure • Data replication • Safemode • Heartbeat • Block report • Checkpoints • Re-replication
Checkpoints • (Diagram: FsImage + EditLog = File System Namespace)
Checkpoints • The FsImage is a copy of the namespace taken before any further changes have occurred. • The EditLog is a log of all changes to the namespace since the namenode’s startup. • Upon startup, the namenode applies every change in the EditLog to the FsImage to create an up-to-date version of the namespace (see the sketch below). • The resulting FsImage is the checkpoint. • If either the FsImage or the EditLog is corrupt, HDFS will not start!
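A toy sketch of replaying an edit log over an image to produce a checkpoint, assuming a recent Java version; the namespace is modeled as a simple map, which is far simpler than the real FsImage:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy checkpoint: an "FsImage" is a snapshot of the namespace, an "EditLog" is a
// list of changes since that snapshot; replaying the log yields the new image.
public class ToyCheckpoint {
    record Edit(String op, String path) {}   // e.g. ("create", "/a") or ("delete", "/a")

    static Map<String, Boolean> applyEdits(Map<String, Boolean> fsImage, List<Edit> editLog) {
        Map<String, Boolean> namespace = new HashMap<>(fsImage);
        for (Edit e : editLog) {
            switch (e.op()) {
                case "create" -> namespace.put(e.path(), true);
                case "delete" -> namespace.remove(e.path());
                default -> throw new IllegalArgumentException("unknown op: " + e.op());
            }
        }
        return namespace; // this merged view is what gets written out as the new FsImage
    }
}
```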
Heartbeat and Blockreport • A heartbeat is a message sent from the datanode to the namenode. • Periodically sent to the namenode, letting the namenode know it’s “alive.” • If heartbeats stop arriving, the namenode marks the datanode as dead and stops using it. • Blockreport • A list of the blocks the datanode is storing.
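A small sketch of a datanode periodically reporting liveness and its block list; the interface, intervals, and message shapes are assumptions, not the real HDFS protocol:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative: a frequent, cheap heartbeat plus a less frequent, larger blockreport.
public class HeartbeatSender {
    interface NamenodeClient {
        void heartbeat(String datanodeId);
        void blockReport(String datanodeId, List<String> blockIds);
    }

    static void start(String datanodeId, NamenodeClient namenode, List<String> localBlocks) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // "I'm alive" signal sent often.
        scheduler.scheduleAtFixedRate(() -> namenode.heartbeat(datanodeId),
                0, 3, TimeUnit.SECONDS);
        // Full list of blocks held, sent much less often.
        scheduler.scheduleAtFixedRate(() -> namenode.blockReport(datanodeId, localBlocks),
                0, 60, TimeUnit.MINUTES);
    }
}
```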
Safemode • Upon startup, the namenode enters “safemode” to check the health of the cluster. Only done once. • Heartbeats are used to ensure all datanodes are available. • Blockreports are used to check data integrity. • If the number of replicas found differs from the number of replicas expected, there is a problem (see the check sketched below). • (Diagram: comparing the replicas expected for block A against the replicas actually found.)
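A minimal sketch of the safemode comparison, where the threshold and data shapes are assumptions made for illustration:

```java
import java.util.Map;

// Illustrative safemode check: only leave safemode once a high enough fraction
// of blocks has reached its expected replica count.
public class SafemodeCheck {
    static boolean canLeaveSafemode(Map<String, Integer> expectedReplicas,
                                    Map<String, Integer> foundReplicas,
                                    double threshold) {
        long healthy = expectedReplicas.entrySet().stream()
            .filter(e -> foundReplicas.getOrDefault(e.getKey(), 0) >= e.getValue())
            .count();
        return expectedReplicas.isEmpty()
            || (double) healthy / expectedReplicas.size() >= threshold;
    }

    public static void main(String[] args) {
        Map<String, Integer> expected = Map.of("blockA", 3, "blockB", 3);
        Map<String, Integer> found = Map.of("blockA", 3, "blockB", 2);
        System.out.println(canLeaveSafemode(expected, found, 0.999)); // false: blockB is short
    }
}
```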
Other • Can view the file system through the FS Shell or the web interface. • Communicates over TCP/IP. • File deletes are a move operation to a “trash” folder, which auto-deletes files after a specified time (default is 6 hours). • A rebalancer moves data off datanodes which are close to filling up their local storage.
Relation with Search Engines • Originally built for Nutch. • Intended to be the backbone for a search engine. • HDFS is the file system used by Hadoop. • Hadoop also contains a MapReduce framework which has many applications, like indexing the web! • Analyzing large amounts of data. • Used by many, many companies • Google, Yahoo!, Facebook, etc. • It can store the web! • Just kidding.
“Pros/Cons” • The goal of this paper is to describe the system, not analyze it. It gives a great beginning overview. • Probably could’ve been condensed/organized better. • Some information is missing • SecondaryNameNode • CheckpointNode • Etc.
Pros/Cons of HDFS: In and Beyond the Paper • Pros • It accomplishes everything it set out to do. • Horizontally scalable – just add a new datanode! • Cheap, cheap, cheap to build. • Good for reading and storing large amounts of data. • Cons • Security • No redundancy of the namenode • Single point of failure • The namenode is not scalable • Doesn’t handle small files well • Still in development, many features missing
Questions? Thank you for listening!