Lei Xu Hadoop Distributed File System
Brief Introduction • Hadoop • An Apache project for data-intensive applications • Typical application: MapReduce (OSDI’04), a distributed programming model for massive-data computation • Crawl and index web pages (Yahoo!) • Analyze popular topics and trends (Twitter) • Led by Yahoo!/Facebook/Cloudera
Brief Introduction (cont’d) • Hadoop Distributed File System (HDFS) • A scalable distributed file system that serves Hadoop MapReduce applications • Borrows its essential ideas from the Google File System • Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles (SOSP’03) • Shares the same design assumptions
Google File System • A scalable distributed file system designed for: • Data-intensive applications (mainly MapReduce) • Web page indexing • It has since spread to other applications • E.g. Gmail, Bigtable, App Engine • Fault-tolerant • Low-cost hardware • High throughput
Google File System (cont’d) • Departures from traditional file system assumptions • Runs on commodity hardware • Component failures are common • Files are huge • Basic block size is 64~128 MB • 1~64 KB in traditional file systems (Ext3, NTFS, etc.) • Massive-data/data-intensive processing • Large streaming reads and small random reads • Large, sequential writes • Few or no random writes
Hadoop DFS Assumptions • In addition to the Google File System assumptions, HDFS assumes: • Simple Coherency Model • Write-once-read-many • Once a file has been created, written, and closed, it cannot be changed • Moving Computation Is Cheaper than Moving Data • “Semi-location-aware” computation • Tries its best to schedule computations close to the related data • Portability Across Heterogeneous Hardware and Software Platforms • Written in Java, with multi-platform support • Google File System was written in C++ and runs on Linux • Stores data on top of existing file systems (NTFS/Ext4/Btrfs…)
HDFS Architecture • Master/Slave Architecture • NameNode • Metadata server • File locations (file name -> DataNodes) • File attributes (atime/ctime/mtime, size, number of replicas, etc.) • DataNode • Manages the storage attached to the node it runs on • Client • Producers and consumers of data
NameNode • Metadata server • Only one NameNode per cluster • Single point of failure • Potential performance bottleneck • Manages the file system namespace • Traditional hierarchical namespace • Keeps all file metadata in memory for fast access • The NameNode’s memory size determines how many files can be supported • Executes file system namespace operations: • Open/close/rename/create/unlink… • Returns the locations of data blocks
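Since the NameNode keeps all metadata in memory, its heap size bounds the number of files. The sketch below is a rough back-of-the-envelope estimate only. It assumes about 150 bytes of heap per namespace object (one object per file plus one per block), which is a commonly quoted rule of thumb and not a figure from these slides; the function name `max_files` is hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumed rule-of-thumb heap cost per file or block object


def max_files(heap_bytes, blocks_per_file=1, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate how many files fit in a NameNode heap of heap_bytes.

    Each file is charged one file object plus one object per block.
    """
    per_file = bytes_per_object * (1 + blocks_per_file)
    return heap_bytes // per_file


# Example: a hypothetical 16 GiB heap holding single-block files.
estimate = max_files(16 * 1024**3)
```

The estimate shrinks as files span more blocks, and many small files cost as much metadata as the same number of large ones.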
NameNode (cont’d) • Maintains system-wide activities • E.g. creating new replicas of file data, garbage collection, load balancing, etc. • Periodically communicates with DataNodes to collect their status • Is the DataNode alive? • Is the DataNode overloaded?
DataNode • Storage server • Stores fixed-size data blocks on local file systems (ext4/zfs/btrfs) • Serves read/write operations from clients • Creates, deletes, and replicates data blocks upon instruction from the NameNode • Block size = 64 MB
Client • Application-level implementation • Does not provide a POSIX API • Hadoop has a FUSE interface • FUSE: Filesystem in Userspace • Has limited functionality (e.g., no random write support) • Queries the NameNode for file locations and metadata • Contacts the corresponding DataNodes for file I/O
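The two-step read path (metadata from the NameNode, bytes from the DataNodes) can be sketched with toy in-memory classes. This is only an illustration: `ToyNameNode`, `ToyDataNode`, and `read_file` are hypothetical names, not part of the Hadoop client API, and no networking is involved.

```python
class ToyNameNode:
    """Holds only metadata: path -> ordered list of (block_id, [DataNode names])."""

    def __init__(self):
        self.files = {}

    def get_block_locations(self, path):
        return self.files[path]


class ToyDataNode:
    """Holds only block contents, keyed by block id."""

    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then fetch bytes from DataNodes."""
    data = bytearray()
    for block_id, holders in namenode.get_block_locations(path):
        # Use the first listed replica; a real client would prefer the closest one.
        data += datanodes[holders[0]].read_block(block_id)
    return bytes(data)
```

File data never passes through the NameNode in this sketch, which is the point of separating metadata from storage.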
Data Replication • Files are stored as a sequence of blocks • The blocks (typically 64 MB) are replicated for fault tolerance • The replication factor is configurable per file • Can be specified at creation time and changed later • The NameNode decides how to replicate blocks. It periodically receives: • Heartbeat, which implies the DataNode is alive • Blockreport, which contains a list of all blocks on a DataNode • When a DataNode goes down, the NameNode re-replicates the blocks it held onto other active DataNodes until each block again has enough replicas
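How the NameNode could combine Blockreports with liveness information to find blocks that need re-replication is sketched below. This is a simplified illustration: it uses one replication factor for all blocks (the slides note it is configurable per file), and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def under_replicated(block_reports, live_nodes, replication_factor=3):
    """Return {block_id: missing_replica_count} for blocks below the target.

    block_reports: {datanode_name: set of block ids it reported}
    live_nodes:    set of DataNode names with a recent heartbeat
    """
    live_copies = defaultdict(int)
    for node, blocks in block_reports.items():
        for block_id in blocks:
            # Register the block even if its holder is dead, so lost blocks show up.
            live_copies[block_id] += 1 if node in live_nodes else 0
    return {
        block_id: replication_factor - count
        for block_id, count in live_copies.items()
        if count < replication_factor
    }
```

Each entry in the result tells the NameNode how many extra copies to schedule on other active DataNodes.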
Data Replication (cont’d) • Rack Awareness • A Hadoop instance runs on a cluster of computers spread across many racks: • Nodes in the same rack are connected by one switch • Communication between two nodes in different racks goes through multiple switches • Slower than between nodes in the same rack • A whole rack may fail due to network/power issues • Rack-aware placement improves data reliability, availability, and network bandwidth utilization
Data Replication (cont’d) • Rack Awareness (cont’d) • In the common case, the replication factor is three • Two replicas are placed on two different nodes in the same rack • The third replica is placed on a node in a remote rack • Improves write performance • 2/3 of the writes stay within one rack, which is faster • Without compromising data reliability
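The placement rule on this slide (two replicas in one rack, the third in a remote rack) can be sketched as follows. The topology dictionary, the random choice among candidates, and the name `place_replicas` are assumptions made for illustration, not the actual HDFS placement code.

```python
import random


def place_replicas(topology, writer_rack, rng=random):
    """Pick three DataNodes: two in the writer's rack, one in another rack.

    topology: {rack_name: [datanode_name, ...]}
    """
    local_nodes = topology[writer_rack]
    if len(local_nodes) < 2:
        raise ValueError("need at least two nodes in the local rack")
    remote_racks = [r for r in topology if r != writer_rack and topology[r]]
    if not remote_racks:
        raise ValueError("need at least one non-empty remote rack")
    first, second = rng.sample(local_nodes, 2)  # two distinct local nodes
    third = rng.choice(topology[rng.choice(remote_racks)])
    return [first, second, third]
```

Only one of the three copies crosses the inter-rack switches, yet the block still survives the loss of an entire rack.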
Replica Selection • For READ operations: • Minimize bandwidth consumption and latency • Prefer the nearest replica: • If there is a replica on the same node, it is preferred • The cluster may span multiple data centers; replicas in the same data center are preferred
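A minimal sketch of nearest-replica selection, assuming each location is described by a hypothetical `(datacenter, rack, node)` tuple and a simple four-level distance. The distance values are illustrative and are not taken from HDFS.

```python
def distance(a, b):
    """Toy network distance between two (datacenter, rack, node) locations."""
    if a == b:
        return 0  # same node
    if a[:2] == b[:2]:
        return 1  # same rack
    if a[0] == b[0]:
        return 2  # same data center
    return 3  # different data center


def pick_replica(reader, replicas):
    """Return the replica location closest to the reader."""
    return min(replicas, key=lambda loc: distance(reader, loc))
```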
Filesystem Metadata • HDFS stores all file metadata on the NameNode • An EditLog • Records every change made to filesystem metadata • For failure recovery • Similar to the journal in journaling file systems (Ext3/NTFS) • An FSImage • Stores the mapping of blocks to files, and file attributes • EditLog and FSImage are stored locally on the NameNode
Filesystem Metadata (cont’d) • A DataNode has no knowledge of HDFS files • It only stores data blocks as regular files on local file systems • With a checksum for data integrity • It periodically sends the NameNode a Blockreport listing all blocks stored on this DataNode • Only the DataNode knows whether a given block replica is available
Filesystem Metadata (cont’d) • When the NameNode starts up • Loads the FSImage and EditLog from the local file system • Applies the latest EditLog entries to the FSImage • Creates a new FSImage as the latest checkpoint and stores it permanently on the local file system
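The startup sequence amounts to replaying a log onto a snapshot. The sketch below models the FSImage as a plain dictionary and the EditLog as a list of operation tuples. Both representations, the three operation names, and the emptying of the log after the checkpoint are assumptions of this illustration, not the on-disk HDFS formats.

```python
def replay(fsimage, editlog):
    """Apply logged operations to a copy of the checkpointed namespace."""
    namespace = dict(fsimage)
    for op, *args in editlog:
        if op == "create":
            path, blocks = args
            namespace[path] = list(blocks)
        elif op == "delete":
            (path,) = args
            namespace.pop(path, None)
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(fsimage, editlog):
    """Return the (fsimage, editlog) pair after startup: merged image, empty log."""
    return replay(fsimage, editlog), []
```

Replaying onto a copy keeps the old image intact until the new checkpoint is complete.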
Communication Protocol • A Hadoop-specific RPC on top of TCP/IP • The NameNode is simply a server that only responds to requests issued by DataNodes or clients • ClientProtocol.java – client protocol • DatanodeProtocol.java – DataNode protocol
Robustness • Primary objective of HDFS: • Stay reliable despite component failures • In a typical large cluster (>1K nodes), component failures are common • Three common types of failures: • NameNode failures • DataNode failures • Network failures
Robustness (cont’d) • Heartbeats • Each DataNode periodically sends heartbeats to the NameNode • System status and block reports • The NameNode marks DataNodes without recent heartbeats as dead • Does not forward I/O to them • Marks all data blocks on these DataNodes as unavailable • Re-replicates these blocks if necessary (according to the replication factor) • Detects both network failures and DataNode failures
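Heartbeat-based failure detection reduces to remembering the last time each DataNode was heard from. A minimal sketch, in which the class name, the explicit `now` argument, and the timeout value are all hypothetical:

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per DataNode and reports the silent ones."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record that `node` was heard from at time `now` (in seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}
```

The monitor cannot tell a crashed DataNode from an unreachable one, which is why the same mechanism covers both DataNode and network failures.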
Robustness (cont’d) • Re-Balancing • Automatically moves data from one DataNode to another • If the free space falls below a threshold • Data Integrity • A block of data may be corrupted • Disk faults, network faults, buggy software • The client computes checksums for each block and stores them in a separate hidden file in the HDFS namespace • Verifies the data against the checksums when reading it
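Checksum verification can be sketched with CRC32 from the Python standard library. The 512-byte chunk granularity is an assumption of this example; the slides only say that checksums are computed per block and stored separately from the data.

```python
import zlib

CHUNK_SIZE = 512  # assumed number of bytes covered by each checksum


def compute_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one CRC32 value per chunk of `data`."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def verify(data, checksums, chunk_size=CHUNK_SIZE):
    """Recompute the checksums on read and compare with the stored ones."""
    return compute_checksums(data, chunk_size) == checksums
```

When verification fails, a client can fall back to another replica of the same block.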
Robustness (cont’d) • Metadata failures • FSImage and EditLog are the central data structures • Once they are corrupted, HDFS cannot rebuild the namespace or access data • The NameNode can be configured to maintain multiple copies of the FSImage and EditLog • E.g. one FSImage/EditLog on the local machine, another on a mounted remote NFS server • Reduces metadata update performance • Once the NameNode is down, the cluster must be restarted manually
Data Organization • Data Blocks • HDFS is designed to support very large files and streaming I/O • A file is chopped up into 64 MB blocks • Reduces the number of connection establishments and speeds up TCP transfers • If possible, each block of a file resides on a different DataNode • Enables parallel I/O and computation (MapReduce)
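The block arithmetic can be made concrete with a short sketch. `block_ranges` splits a file size into (offset, length) pairs, and `assign_round_robin` is a deliberately naive, hypothetical way to spread blocks over DataNodes; the slides do not state how HDFS actually distributes them.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, as on the slide


def block_ranges(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def assign_round_robin(num_blocks, datanodes):
    """Naively assign block i to DataNode i modulo the number of DataNodes."""
    return [datanodes[i % len(datanodes)] for i in range(num_blocks)]
```

For example, a 150 MB file yields two full 64 MB blocks and one 22 MB block; only the last block of a file can be shorter than the block size.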
Data Organization (cont’d) • Staging • When writing a new file • The client first caches the file data in a temporary local file until it accumulates a full HDFS block of data • Then the client contacts the NameNode, which assigns a DataNode • The client flushes the cached data to the chosen DataNode • Fully utilizes the network bandwidth
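Client-side staging behaves like a buffered writer that hands off one full block at a time. In the sketch below an in-memory buffer stands in for the temporary local file, and the `flush_block` callback stands in for "ask the NameNode for a DataNode and send the block"; both are simplifying assumptions, and `StagingWriter` is a hypothetical name.

```python
class StagingWriter:
    """Buffers writes locally and hands off one full block at a time."""

    def __init__(self, block_size, flush_block):
        self.block_size = block_size
        self.flush_block = flush_block  # called with the bytes of each block
        self.buffer = bytearray()

    def write(self, data):
        """Append data; flush every complete block that has accumulated."""
        self.buffer += data
        while len(self.buffer) >= self.block_size:
            self.flush_block(bytes(self.buffer[:self.block_size]))
            del self.buffer[:self.block_size]

    def close(self):
        """Flush whatever remains as a final, possibly short, block."""
        if self.buffer:
            self.flush_block(bytes(self.buffer))
            self.buffer.clear()
```

Small application writes therefore never reach the NameNode or the network individually; only block-sized transfers do.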
Data Organization (cont’d) • Replication Pipeline • A client obtains a list of DataNodes to flush one block to • The client first flushes the data to the first DataNode • The first DataNode receives the data in small portions (4 KB), writes each portion to local storage, and immediately forwards it to the next DataNode in the list • The second DataNode acts the same way as the first • The total transfer time for one block (64 MB) with three replicas is: • T(64 MB) + 2 * T(4 KB), with pipelining • 3 * T(64 MB), without pipelining
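The two transfer-time formulas can be evaluated numerically. This sketch assumes every hop has the same bandwidth and ignores latency and disk time, so T(x) is simply x divided by the bandwidth; the function names are hypothetical.

```python
def pipelined_time(block_bytes, packet_bytes, replicas, bandwidth):
    """T(block) + (replicas - 1) * T(packet): each later node lags by one packet."""
    return block_bytes / bandwidth + (replicas - 1) * packet_bytes / bandwidth


def sequential_time(block_bytes, replicas, bandwidth):
    """replicas * T(block): each copy is sent only after the previous one finishes."""
    return replicas * block_bytes / bandwidth
```

With a 64 MB block, 4 KB portions, and three replicas, the pipelined time exceeds a single-copy transfer by a factor of only 1 + 2 * 4 KB / 64 MB, while the sequential approach takes three times as long.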
Replication Pipeline • The client asks the NameNode where to put data • The client pushes data to the DataNodes linearly to fully utilize network bandwidth • The secondary replicas acknowledge to the primary, and the primary then reports success to the client * The figure on this slide is from “The Google File System” paper
See also • HBase – a Bigtable implementation on Hadoop • Key-value storage • Pig – a high-level language for running data analysis on Hadoop • ZooKeeper • “ZooKeeper: Wait-free Coordination for Internet-scale Systems”, ATC’10, Best Paper • CloudStore (KFS, previously Kosmosfs) • A C++ implementation of the Google File System • Parallels the Hadoop project
Known Issues and Research Interests • The NameNode is a single point of failure • It also limits the total number of files HDFS can support • RAM limitation • Google has changed its one-master architecture to a multiple-master cluster • However, the details have not been published
Known Issues and Research Interests (cont’d) • Uses replication to provide data reliability • Same problems as RAID-1? • Apply RAID techniques to HDFS? • “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW’09
Known Issues and Research Interests (cont’d) • Energy Efficiency • DataNodes must stay powered on to keep data available • However, there may be no MapReduce computations running on them • Wasted energy
Conclusion • Hadoop Distributed File System is designed to serve MapReduce computations • Provides highly reliable storage • Supports massive amounts of data • Optimizes data placement based on the topology of data centers • Large companies build their core businesses on top of this infrastructure • Google: GFS/MapReduce/Bigtable • Yahoo!/Facebook/Amazon/Twitter/NY Times: Hadoop/HBase/Pig
Reference • HDFS Architecture Guide: http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html
Questions? Thank you!