Hadoop Distributed File System Vartika Verma
Outline • Introduction • Architecture • File I/O Operations • Replica Management • Comparison with other File Systems
HDFS • Scalable, reliable, fault-tolerant, distributed storage system designed to work closely with MapReduce • Distributed: Data is growing at a faster pace than single-disk capacity, hence a cluster of disks distributed over a network is necessary. • Scalable: Extends to handle growing data requirements. • Reliable: Safeguards against data loss, achieved by data replication. • Fault-Tolerant: Protects against the increased failure probability that comes with a large number of disks.
Origin • Based on the Google File System • Developed by Doug Cutting and Mike Cafarella to support Nutch, an open source web search engine • Now under the Apache License • Many abstractions are built on top of HDFS for specific use cases, such as Hive, HBase, etc. • Hundreds of companies use it today, including Facebook, Yahoo, Netflix, Twitter, Amazon, etc.
Architecture Overview • [Diagram] A single Name Node holds the in-memory image of inodes, persisted as a checkpoint plus a journal; multiple HDFS clients talk to it, and it communicates with Data Node 1 … Data Node n over TCP. The Data Nodes store the data blocks and their replicas.
Name Node • Single Name Node per cluster • Inodes: Hold metadata about the entire namespace – creation timestamp, permissions, size, etc. • Image: The collection of inodes for the file system constitutes the image, which is held in memory • The image is persisted on disk in the form of a checkpoint and a journal • Checkpoint: snapshot of the image at some instant • Journal: write-ahead log of changes applied on top of the checkpoint • For improved durability, the checkpoint and journal are replicated in multiple locations (different disks or network locations). A minimal sketch of the checkpoint + journal idea follows.
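The sketch below illustrates the checkpoint + write-ahead journal pattern described above: every namespace mutation is appended to the journal before the in-memory image is updated, and a checkpoint is a full snapshot of the image. These are not the actual NameNode classes; all names and the serialization format are illustrative assumptions.

```java
// Conceptual sketch of the checkpoint + write-ahead journal idea
// (hypothetical classes, not the real Hadoop NameNode implementation).
import java.io.*;
import java.util.*;

class NamespaceImage {
    // In-memory "image": inode metadata keyed by path (simplified to a string).
    private final Map<String, String> inodes = new HashMap<>();
    private final DataOutputStream journal;

    NamespaceImage(File journalFile) throws IOException {
        this.journal = new DataOutputStream(new FileOutputStream(journalFile, true));
    }

    // Every mutation is appended to the journal *before* the in-memory
    // image is updated, so it can be replayed after a crash.
    void mkdir(String path, String meta) throws IOException {
        journal.writeUTF("MKDIR " + path + " " + meta);
        journal.flush();                 // force the edit to stable storage
        inodes.put(path, meta);          // then apply it to the image
    }

    // A checkpoint is a full snapshot of the image; once it is written,
    // the journal can be truncated and restarted empty.
    void checkpoint(File checkpointFile) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(checkpointFile))) {
            out.writeObject(new HashMap<>(inodes));
        }
    }
}
```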
Name Node (cont.) • Multi-threaded system that serves requests from multiple clients simultaneously • Each request is logged to the journal. To avoid too many disk seeks, requests are batched together and then flushed to the journal. • Single point of failure (SPOF) – backup nodes come to the rescue; ZooKeeper can be used for automated recovery.
Data Node • A file is stored as a collection of data blocks (default size 64 MB) • Each data block is represented by two files – one containing the data itself, the other containing metadata such as checksums (illustrated in the sketch below) • Blocks are spread across racks: Image reference [2]
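As a rough illustration of the "second file holds checksums" idea, the sketch below computes one CRC32 per fixed-size chunk of a block file. The real HDFS on-disk format differs in detail; the chunk size and checksum algorithm here are assumptions made for the example.

```java
// Illustrative sketch: write a checksum metadata file alongside a block file,
// one CRC32 per fixed-size chunk (chunk size and format are assumptions).
import java.io.*;
import java.util.zip.CRC32;

public class BlockChecksums {
    static final int CHUNK = 512;   // bytes covered by each checksum

    public static void writeChecksums(File block, File meta) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(block));
             DataOutputStream out = new DataOutputStream(new FileOutputStream(meta))) {
            byte[] buf = new byte[CHUNK];
            int n;
            while ((n = in.read(buf)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(buf, 0, n);
                out.writeLong(crc.getValue());   // one checksum per chunk
            }
        }
    }
}
```

A reader can recompute the same per-chunk checksums and compare them against the metadata file to detect corrupted replicas.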
Data Node – Name Node communication • Handshake: Performed with the Name Node when a Data Node starts. Confirms the Namespace ID (a unique identifier associated with the file system when it is formatted) and software version, and registers the node's Storage ID • Block Report: Information about all the block replicas the Data Node holds, shared with the Name Node every hour • Heartbeat: Sent every 3 seconds to confirm the Data Node is operational. Contains information used for space allocation and load balancing (see the sketch below). The Name Node can reply with commands such as replicate a data block, remove a local replica of a data block, etc.
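The sketch below is a simplified picture of what a heartbeat message carries; the field names are illustrative, not the actual Hadoop protocol classes.

```java
// Simplified, hypothetical representation of a Data Node heartbeat payload.
public class Heartbeat {
    final String storageId;        // registered with the Name Node at handshake time
    final long capacityBytes;      // total disk capacity on this Data Node
    final long usedBytes;          // space currently used by block replicas
    final int activeTransfers;     // in-flight block transfers, used for load balancing

    Heartbeat(String storageId, long capacityBytes, long usedBytes, int activeTransfers) {
        this.storageId = storageId;
        this.capacityBytes = capacityBytes;
        this.usedBytes = usedBytes;
        this.activeTransfers = activeTransfers;
    }
}
// A Data Node sends one of these every 3 seconds; the Name Node's reply can
// piggyback commands such as "replicate block X" or "delete replica Y".
```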
HDFS Client • Provides an API to read/write/delete files: • READ: Get the list of Data Nodes holding each block from the Name Node, sorted in topological order, then read the data directly from the Data Nodes. • WRITE: For each block of data, set up a pipeline of Data Nodes to write to. A minimal usage example follows. Image reference [2]
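A minimal read/write sketch against the public HDFS client API (org.apache.hadoop.fs.FileSystem) is shown below. The NameNode URI and file path are placeholders chosen for illustration.

```java
// Minimal HDFS client example using the org.apache.hadoop.fs.FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // example NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");

        // WRITE: the client obtains target Data Nodes from the Name Node and
        // streams each block through the resulting pipeline.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // READ: the client gets block locations from the Name Node, then
        // reads the data directly from the Data Nodes.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```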
Checkpoint Node • The journal can grow very large • The Checkpoint Node periodically combines the existing checkpoint and journal from the Name Node to create a new checkpoint and an empty journal • A checkpoint is never overwritten; it is replaced by the newly created one
File I/O Operations • Single-writer, multiple-reader model WRITE • A file, once written, cannot be modified, but it can be appended to (see the sketch below) • The writer gets a lease on the file • Soft limit: the writer has exclusive access to the file and can renew the lease • Hard limit: 1 hour – the writer continues to have access unless another client pre-empts it; after the hard limit expires, the file is closed READ • During a read, the checksum is validated; if it does not match, the corrupt replica is reported to the Name Node, which marks it for deletion • On an error while reading a block, the next replica is used to read it
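The sketch below shows the append path that is consistent with this write-once model: the client holds the file's lease while the output stream is open and only adds bytes at the end. The path is a placeholder, and append support must be enabled on the cluster for this to work.

```java
// Appending to an existing HDFS file via the FileSystem API (path is illustrative).
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Existing data is never modified; new bytes are only added at the end.
        try (FSDataOutputStream out = fs.append(new Path("/user/demo/log.txt"))) {
            out.write("one more line\n".getBytes(StandardCharsets.UTF_8));
        }   // closing the stream releases the writer's lease
    }
}
```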
Data Block Write Pipeline Image reference [2]
Replica Management • Data is replicated across multiple nodes spanning multiple racks • The Name Node ensures a data block is neither over- nor under-replicated, and that not all replicas are on the same rack • Extra replicas are deleted while ensuring that the number of racks hosting replicas does not decrease • Under-replicated blocks are put in a replication priority queue; blocks with only one replica left are given the highest priority • Balancer: HDFS block placement does not take free disk space into account, so a cluster admin can run this tool to keep disk utilization across the cluster balanced • Block Scanner: Goes through all the blocks on a node and verifies their checksums; on error it notifies the Name Node, which marks them for deletion. A small example of changing a file's replication target follows.
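As a small illustration of replica management from the client side, the sketch below asks for a different replication factor on one file; the Name Node then schedules re-replication (or deletion of extra replicas) to meet the new target. The path and the target factor of 5 are placeholders.

```java
// Changing and inspecting a file's replication factor via the FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/demo/hello.txt");

        fs.setReplication(path, (short) 5);   // request 5 replicas for this file
        short target = fs.getFileStatus(path).getReplication();
        System.out.println("Target replication: " + target);
    }
}
```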
References • HDFS Architecture Guide, http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html • Chapter 3 – Hadoop: The Definitive Guide, Tom White. • Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10). IEEE Computer Society, Washington, DC, USA, 1-10. DOI=10.1109/MSST.2010.5496972 http://dx.doi.org/10.1109/MSST.2010.5496972 • Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. 2008. Data-intensive file systems for Internet services: A rose by any other name ... Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA, USA • Monali Mavani, "Comparative Analysis of Andrew Files System and Hadoop Distributed File System," Lecture Notes on Software Engineering, vol. 1, no. 2, pp. 122-125, 2013