
HDFS (Hadoop Distributed File System)

2011-10-10 Taejoong Chung, MMLAB

Presentation Transcript


  1. HDFS (Hadoop Distributed File System) 2011-10-10 Taejoong Chung, MMLAB

  2. Contents • Introduction • Hadoop Distributed File System? • Assumption & Goals • Mechanism • Structure • Data Management • Maintenance • Pros and Cons

  3. HDFS • Hadoop Distributed File System • Started from ‘Nutch’ (an open-source search engine project) in 2005 • Java-based, Apache top-level project • Stores massive amounts of data at low cost • Characteristics • User-level distributed file system • Fault-tolerant • Can be deployed on low-cost hardware

  4. Assumption & Goals 1) Protection against Failure • Detection of faults and quick, automatic recovery • Considers both hardware and software failures 2) Streaming Data Access • Batch processing rather than interactive use • High throughput of data access rather than low latency of data access

  5. Assumption & Goals - contd 3) Large Data Sets • A typical file in HDFS is gigabytes to terabytes in size • High aggregate data bandwidth, scaling to hundreds of nodes 4) Simple Coherency Model • Write-once-read-many access • Once created, a file is not allowed to be modified

  6. Assumption & Goals - contd 5) Moving Computation to the Data • Provides interfaces for applications to move themselves closer to where the data is located 6) Portability • Easily portable from one platform to another • Java-based

  7. Structure • Master / Slave architecture • NameNode (Master) • Manages the file system namespace • Regulates access to files by clients • Does not contain any data files • Unique (one per cluster) • DataNode (Slave) • Actual repository of the data • Multiple nodes are required

  8. Conceptual Diagram [Figure: conceptual diagram] • Namespace (headquarters): directory service • Block: a piece of data • DataNode: contains multiple blocks of data

  9. Operation • A file is split into multiple blocks, and each block is replicated across the DataNodes • A file is cut into blocks of 64 MB each (default) • Each block is replicated over the DataNodes (number of replicas: 3, default) • Placement scheme • Aims to maximize fault ‘tolerance’ • Local tolerance • Replicas inside the same rack • Global tolerance • Replicas outside the rack
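
The block arithmetic on this slide can be sketched in a few lines. This is only an illustration, not HDFS code: the function names are hypothetical, and the 200 MB file is an assumed example. The 64 MB block size and the replica count of 3 are the defaults named on the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size named on the slide
REPLICATION = 3                # default number of replicas named on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block, for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def raw_storage(file_size, replication=REPLICATION):
    """Bytes stored across the whole cluster once every block is replicated."""
    return file_size * replication


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)  # hypothetical 200 MB file
print(len(blocks), "blocks; last block is", blocks[-1][1] // MB, "MB")
print("raw storage:", raw_storage(200 * MB) // MB, "MB")
```

A 200 MB file therefore occupies three full blocks plus one 8 MB block, and three times its size in raw cluster storage.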

  10. Example: Rack Awareness [Figure: the NameNode issues a command to save a file; replicas are placed on DataNodes in Rack 1, Rack 2 and Rack 3] • Local tolerance: in the same rack • Global tolerance: outside of the rack
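
A toy version of the placement idea in this example, assuming three replicas: two on different DataNodes in the writer's rack (local tolerance) and one in another rack (global tolerance). The rack layout, node names and `place_replicas` function are hypothetical, and the real HDFS placement policy is more involved than this sketch.

```python
def place_replicas(racks, writer_rack):
    """Pick three DataNodes: two in writer_rack (local), one in another rack (global).

    racks maps a rack name to a list of DataNode names (a hypothetical layout).
    """
    local_nodes = racks[writer_rack]
    if len(local_nodes) < 2:
        raise ValueError("need at least two DataNodes in the local rack")
    # Consider the other racks in name order so the choice is deterministic.
    remote_racks = [name for name in sorted(racks)
                    if name != writer_rack and racks[name]]
    if not remote_racks:
        raise ValueError("need at least one other non-empty rack")
    remote_node = racks[remote_racks[0]][0]
    return [local_nodes[0], local_nodes[1], remote_node]


racks = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5"],
}
replicas = place_replicas(racks, "rack1")
print(replicas)
```

Losing a single DataNode leaves a copy in the same rack, and losing the whole rack still leaves the copy in the remote rack.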

  11. Data Maintenance • Each DataNode periodically sends ‘Heartbeat’ and ‘Blockreport’ messages to the NameNode • Blockreport • A list of all blocks on a DataNode • Heartbeat • A kind of ‘ping’ (I’m alive!) • Receipt of a Heartbeat implies that the DataNode is functioning properly
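
How a NameNode could use Heartbeats to judge liveness can be sketched as follows. This is a minimal model, not HDFS code: the class, its method names and the 30-second timeout are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Tracks the last Heartbeat time of each DataNode (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds  # assumed value, not an HDFS default
        self.last_seen = {}                     # DataNode name -> last heartbeat time

    def heartbeat(self, datanode, now):
        """Record an 'I'm alive!' message received at time now (in seconds)."""
        self.last_seen[datanode] = now

    def live_nodes(self, now):
        """DataNodes whose latest Heartbeat is within the timeout."""
        return sorted(dn for dn, seen in self.last_seen.items()
                      if now - seen <= self.timeout_seconds)

    def dead_nodes(self, now):
        """DataNodes that have been silent for longer than the timeout."""
        return sorted(dn for dn, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)  # dn2 stays silent after time 0
print(monitor.live_nodes(now=40), monitor.dead_nodes(now=40))
```

Blocks held by a node reported as dead would then be candidates for re-replication.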

  12. Data Management • The NameNode manages all metadata • EditLog • Every transaction (change to the file system metadata) is recorded by the NameNode • FsImage (File System Image) • Stores the file system namespace, including the mapping of files to data blocks • Key metadata is kept in memory • Information from the DataNodes’ Heartbeat messages is also kept there
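
The relationship between the two structures can be sketched under one assumption: the FsImage is treated as a checkpoint of the namespace (file path to block ids), and the EditLog as the list of transactions recorded since that checkpoint. The class and method names are hypothetical, and nothing here is persisted to disk as the real NameNode would do.

```python
class NameNodeMetadata:
    """Toy model of FsImage plus EditLog (hypothetical, in-memory only)."""

    def __init__(self):
        self.fs_image = {}  # checkpointed namespace: path -> list of block ids
        self.edit_log = []  # transactions recorded since the last checkpoint

    def create_file(self, path, block_ids):
        self.edit_log.append(("create", path, list(block_ids)))

    def delete_file(self, path):
        self.edit_log.append(("delete", path, None))

    def current_namespace(self):
        """Replay the EditLog on top of the FsImage to get the live namespace."""
        namespace = {path: list(ids) for path, ids in self.fs_image.items()}
        for op, path, block_ids in self.edit_log:
            if op == "create":
                namespace[path] = list(block_ids)
            elif op == "delete":
                namespace.pop(path, None)
        return namespace

    def checkpoint(self):
        """Fold the EditLog into a new FsImage and start an empty log."""
        self.fs_image = self.current_namespace()
        self.edit_log = []


meta = NameNodeMetadata()
meta.create_file("/logs/a", [1, 2])
meta.create_file("/logs/b", [3])
meta.delete_file("/logs/a")
print(meta.current_namespace())
meta.checkpoint()
print(meta.fs_image, meta.edit_log)
```

Keeping a log of small transactions avoids rewriting the whole image on every change; the checkpoint bounds how much log must be replayed on restart.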

  13. Data Integrity (1) • Safemode • On startup, the NameNode receives Heartbeat and Blockreport messages from the DataNodes • Each block has a specified minimum number of replicas • Blocks under this threshold are re-replicated after the NameNode leaves Safemode • No replication of data blocks occurs while in Safemode • The replica check happens regularly
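
The replica check can be illustrated with a small sketch. It assumes the NameNode has already tallied replica counts from the Blockreports, and that Safemode ends once a chosen fraction of blocks meets the minimum. The function names, the block ids and the threshold values are assumptions for illustration only.

```python
def under_replicated(block_replicas, min_replicas):
    """Blocks whose reported replica count is below the minimum."""
    return sorted(block for block, count in block_replicas.items()
                  if count < min_replicas)


def can_leave_safemode(block_replicas, min_replicas, threshold):
    """True once the fraction of sufficiently replicated blocks reaches threshold."""
    if not block_replicas:
        return True
    safe = sum(1 for count in block_replicas.values() if count >= min_replicas)
    return safe / len(block_replicas) >= threshold


# Replica counts tallied from Blockreports (hypothetical values).
reported = {"blk_1": 3, "blk_2": 1, "blk_3": 0, "blk_4": 2}
print(under_replicated(reported, min_replicas=1))
print(can_leave_safemode(reported, min_replicas=1, threshold=0.75))
print(can_leave_safemode(reported, min_replicas=1, threshold=0.9))
```

Here three of the four blocks meet the minimum, so the first threshold is satisfied and the stricter one is not.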

  14. Data Integrity (2) • Data fetched from a DataNode could be corrupted • Checksum algorithms are implemented • Operation • When a client creates an HDFS file, it also computes and stores a checksum • When a client receives a file, it also downloads the checksum • By comparing the downloaded checksum with a checksum calculated from the received file, the client can verify the content
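
The write-then-verify flow can be sketched with a CRC32 checksum from Python's standard `zlib` module. The choice of CRC32, the function names and the sample payload are assumptions for illustration; the slide does not say which checksum algorithm is used.

```python
import zlib


def write_with_checksum(data):
    """On write: return the data together with its checksum (CRC32 assumed)."""
    return data, zlib.crc32(data)


def verify(data, stored_checksum):
    """On read: recompute the checksum and compare it with the stored one."""
    return zlib.crc32(data) == stored_checksum


payload, checksum = write_with_checksum(b"hello hdfs")
print(verify(payload, checksum))        # intact copy
print(verify(b"hello hdfz", checksum))  # copy with one corrupted byte
```

When verification fails, a client could fetch the block from another DataNode holding a replica.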

  15. Robustness • Data disk failure, heartbeats and re-replication • From Heartbeat messages, the NameNode can check the liveness of each DataNode • Cluster rebalancing • If a DataNode holds much more data than the others, blocks are redistributed • Data integrity • Checksum • Metadata disk failure • Copies of the FsImage and EditLog are kept
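
Cluster rebalancing can be pictured as repeatedly moving one block from the fullest DataNode to the emptiest until the gap is small. This greedy loop is a toy illustration only, not the HDFS balancer; the `rebalance` function, the `max_gap` parameter and the block counts are hypothetical.

```python
def rebalance(block_counts, max_gap=1):
    """Move blocks one at a time from the fullest to the emptiest DataNode.

    Returns the new block counts and the list of (source, target) moves.
    """
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1")
    counts = dict(block_counts)
    moves = []
    if not counts:
        return counts, moves
    while True:
        names = sorted(counts)  # name order breaks ties deterministically
        fullest = max(names, key=lambda dn: counts[dn])
        emptiest = min(names, key=lambda dn: counts[dn])
        if counts[fullest] - counts[emptiest] <= max_gap:
            return counts, moves
        counts[fullest] -= 1
        counts[emptiest] += 1
        moves.append((fullest, emptiest))


balanced, moves = rebalance({"dn1": 10, "dn2": 2, "dn3": 3})
print(balanced, len(moves), "moves")
```

A real balancer also has to respect replica placement, so that moving a block does not put two copies on the same node or collapse all copies into one rack.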

  16. Pros and Cons • Pros • Powerful fault-tolerance mechanism • Easy to deploy • Free • Cons • Single point of failure: the NameNode • Not an optimized solution • Same degree of replication for each block • Not that fast

  17. Download & More Information • Official site • http://hadoop.apache.org/ • Last build in March 2011 • Korean Dev. • http://www.hadoop.co.kr/ • Materials last uploaded in Oct. 2011

  18. QnA
