HDFS: Hadoop Distributed File System
100062123 柯懷貿 100062139 王建鑫 101062401 彭偉慶
Outline
• Introduction
• HDFS – How it works
• Pros and Cons
• Conclusion
柯懷貿
Introduction to HDFS
Hadoop Distributed File System
• Cloud computing, written in Java, for processing PB-level data
• Distributed computing environment: allows files to be shared over the Internet
• Write-once-read-many access model with restricted access
• Replication & fault tolerance
• Mapping between logical objects & physical objects
• Originated from Doug Cutting's Nutch project
• File system for the Hadoop framework
• Remote Procedure Call (RPC), master/slave architecture
• Yahoo! ran a 10,000-core Hadoop cluster in 2008
• Related projects: HDFS, Hadoop MapReduce, HBase
柯懷貿
MapReduce 柯懷貿
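To make the programming model concrete, here is the classic word-count example written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce): the map phase emits (word, 1) pairs and the reduce phase sums the counts per word. This is a minimal sketch of the standard example; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: split each input line into words and emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts for the same word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```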
HBase
• NoSQL database built on top of HDFS
• Uses many servers to store PB-level data
柯懷貿
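A minimal sketch of writing and reading one cell through the HBase Java client API; the table name `users` and column family `info` are assumptions for illustration and would have to exist in the cluster already.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the same cell back
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```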
HDFS
• Distributed, scalable, and portable
• File replication (default: 3 copies per block)
• Replication improves read efficiency: clients can read from the nearest replica
柯懷貿
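As a sketch of how the replication factor is controlled: `dfs.replication` sets the cluster-wide default, and `FileSystem.setReplication` changes it for an individual file. The file path below is just a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster-wide default replication factor
    // (normally configured in hdfs-site.xml as dfs.replication)
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);
    // Raise the replication factor of one existing file to 5;
    // the NameNode schedules the extra copies asynchronously.
    fs.setReplication(new Path("/data/important.log"), (short) 5);
    fs.close();
  }
}
```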
HDFS major roles
• Client (user) – reads/writes data from/to the file system
• Name node (master) – oversees and coordinates data storage, handles requests from clients
• Data nodes (slaves) – store data and run computations, receive instructions from the name node
王建鑫
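A minimal client-side sketch of this read/write flow using the org.apache.hadoop.fs.FileSystem API; the NameNode address and the file path are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points at the NameNode; host and port here are placeholders.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/demo/hello.txt");

    // Write once: the client asks the NameNode for target DataNodes,
    // then streams the data to those DataNodes directly.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: the NameNode returns the block locations and the client
    // reads from a DataNode holding a replica.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}
```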
Rack Awareness 王建鑫
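Hadoop's default rack-aware placement policy puts the first replica on the writer's own node, the second replica on a node in a different rack, and the third on another node in that same remote rack, so a whole-rack failure cannot destroy all copies. A toy sketch of that rule follows; the Node type and pickFrom helper are invented for illustration and are not the real NameNode classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

public class RackAwarePlacement {
  record Node(String name, String rack) {}

  static List<Node> chooseTargets(Node writer, List<Node> cluster, Random rnd) {
    List<Node> targets = new ArrayList<>();
    targets.add(writer);                                           // replica 1: local node

    Node remote = pickFrom(cluster, rnd,
        n -> !n.rack().equals(writer.rack()));                     // replica 2: node in a different rack
    targets.add(remote);

    Node sameRemoteRack = pickFrom(cluster, rnd,
        n -> n.rack().equals(remote.rack()) && !n.equals(remote)); // replica 3: same rack as replica 2
    targets.add(sameRemoteRack);
    return targets;
  }

  static Node pickFrom(List<Node> nodes, Random rnd, Predicate<Node> ok) {
    List<Node> candidates = nodes.stream().filter(ok).toList();
    return candidates.get(rnd.nextInt(candidates.size()));
  }

  public static void main(String[] args) {
    List<Node> cluster = List.of(
        new Node("dn1", "rack1"), new Node("dn2", "rack1"),
        new Node("dn3", "rack2"), new Node("dn4", "rack2"));
    System.out.println(chooseTargets(cluster.get(0), cluster, new Random(42)));
  }
}
```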
HDFS fault tolerance
• Node failure – a data node or the name node dies
• Communication failure – data cannot be sent or retrieved
• Data corruption – data is corrupted while being sent over the network or while stored on disk
• Write failure – the data node about to be written to is dead
• Read failure – the data node about to be read from is dead
王建鑫
Detecting network failure
• Whenever data is sent, the receiver replies with an ACK
• If the ACK is not received (after several retries), the sender assumes the host is dead or the network has failed
• A checksum is also sent along with the transmitted data, so corruption during transfer can be detected
王建鑫
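A simplified illustration of the checksum idea: HDFS stores a CRC checksum for every fixed-size chunk of data (512 bytes per checksum by default), and the receiver recomputes and compares it. This sketch is not the real DataNode transfer code.

```java
import java.util.zip.CRC32;

public class ChunkChecksum {
  static final int BYTES_PER_CHECKSUM = 512;

  // Compute a CRC-32 checksum over one chunk of the data
  static long checksumOf(byte[] data, int off, int len) {
    CRC32 crc = new CRC32();
    crc.update(data, off, len);
    return crc.getValue();
  }

  // Receiver side: recompute each chunk's checksum and compare with the
  // checksums shipped alongside the data; a mismatch means corruption in transit.
  static boolean verify(byte[] data, long[] expected) {
    int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    for (int i = 0; i < chunks; i++) {
      int off = i * BYTES_PER_CHECKSUM;
      int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
      if (checksumOf(data, off, len) != expected[i]) {
        return false; // the receiver would reply with a failed ACK / request a resend
      }
    }
    return true;
  }
}
```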
Handling write/read failure
• The client writes a block in smaller data units (typically 64 KB) called packets
• Each data node in the pipeline replies with an ACK for every packet to confirm it got the packet
• If the client does not get ACKs from some node, that node is considered dead
• The client then adjusts the pipeline to skip the dead node; the name node later re-replicates the under-replicated block (see the next slide)
• Handling a read failure: simply read the block from another data node that holds a replica
王建鑫
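A rough sketch of the pipeline adjustment described above, under heavy simplification: DataNodeStub and sendPacket are invented stand-ins, not the real DFSOutputStream internals.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineWriteSketch {
  static class DataNodeStub {
    final String name;
    boolean alive = true;
    DataNodeStub(String name) { this.name = name; }
    // Returns true if an ACK for the packet came back
    boolean sendPacket(byte[] packet) { return alive; }
  }

  static void writePacket(List<DataNodeStub> pipeline, byte[] packet) {
    List<DataNodeStub> dead = new ArrayList<>();
    for (DataNodeStub dn : pipeline) {
      if (!dn.sendPacket(packet)) {   // no ACK after retries
        dead.add(dn);                 // mark the node as dead
      }
    }
    pipeline.removeAll(dead);         // rebuild the pipeline without the dead nodes
    // The block is now under-replicated; the NameNode notices this from its
    // block map and schedules re-replication on healthy DataNodes.
  }
}
```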
Handling write failure (cont'd)
• The name node maintains two tables:
• List of blocks – block A is on dn1, dn2, dn8; block B is on dn3, dn7, dn9; …
• List of data nodes – dn1 holds blocks A and D; dn2 holds blocks E and G; …
• The name node checks the list of blocks to see whether any block is not properly replicated
• If so, it asks other data nodes to copy the block from the data nodes that still hold a replica
王建鑫
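A toy model of the two name node tables and the under-replication check; the class, field, and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BlockMapSketch {
  static final int REPLICATION = 3;

  // Table 1: block -> data nodes that hold a replica
  Map<String, Set<String>> blockToNodes = new HashMap<>();
  // Table 2: data node -> blocks it holds
  Map<String, Set<String>> nodeToBlocks = new HashMap<>();

  // When a data node dies, drop it from both tables and report every
  // block that fell below the target replication factor.
  List<String> handleDeadNode(String deadNode) {
    List<String> underReplicated = new ArrayList<>();
    for (String block : nodeToBlocks.getOrDefault(deadNode, Set.of())) {
      Set<String> holders = blockToNodes.get(block);
      holders.remove(deadNode);
      if (holders.size() < REPLICATION) {
        underReplicated.add(block);  // ask a surviving holder to copy it to another node
      }
    }
    nodeToBlocks.remove(deadNode);
    return underReplicated;
  }
}
```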
Pros
• Very large files – file sizes ranging up to GB, TB, even PB
• Streaming data access – write-once, read-many; efficient when reading the whole dataset
• Commodity hardware – high reliability and availability without requiring expensive, highly reliable hardware
彭偉慶
Cons 彭偉慶
Conclusion
• HDFS is an Apache Hadoop subproject
• Highly fault-tolerant and designed to be deployed on low-cost hardware
• Provides high throughput, but not low latency
彭偉慶