This presentation introduces Apache Hadoop HDFS. It describes the HDFS file system in the context of Hadoop and big data, and looks at its architecture and resilience.
Apache Hadoop HDFS • What is it? • What is it for? • Architecture • Resilience • Administration • Data access • Future changes?
HDFS – What is it? • HDFS = Hadoop Distributed File System • A distributed file system • Runs on low-cost hardware • Open source • Written in Java • Fault tolerant • Designed for very large data sets • Tuned for high throughput
HDFS – What is it for? • Designed for batch processing • Streaming access to data (see the sketch below) • Large data sizes, e.g. terabytes • Highly reliable via data replication • Supports very large node clusters • Supports large files • Supports millions of files
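To illustrate the streaming access model, here is a minimal Java sketch that reads a large file sequentially through the Hadoop FileSystem API. It assumes a reachable HDFS cluster configured via core-site.xml, and the path /data/logs/big.log is purely hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the Hadoop configuration on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical example path; replace with a real file
        Path file = new Path("/data/logs/big.log");

        // open() returns a stream; here we simply read it end to end,
        // which is the sequential, high-throughput pattern HDFS is tuned for
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        try (FSDataInputStream in = fs.open(file)) {
            int read;
            while ((read = in.read(buffer)) > 0) {
                total += read;
            }
        }
        System.out.println("Read " + total + " bytes");
    }
}
```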
HDFS – Architecture • Has a master / slave architecture • A master NameNode • Controls file system operations • Maps data blocks to DataNodes • Logs all changes • Slave DataNodes • Store file blocks • Store replicated data
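The NameNode's mapping of file blocks to DataNodes is visible from a client through a metadata query. A minimal sketch, assuming the same cluster setup as above and a hypothetical file /data/example.dat:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.dat");   // hypothetical path

        // The NameNode answers this metadata query; no file data is read
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block reports the DataNodes holding a replica of it
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d: offset=%d length=%d hosts=%s%n",
                    i, blocks[i].getOffset(), blocks[i].getLength(),
                    String.join(",", blocks[i].getHosts()));
        }
    }
}
```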
HDFS – Resilience • Data is replicated across DataNodes • Nodes may fail but the data remains available • DataNodes report their state via heartbeat messages • The master NameNode is a single point of failure • Data integrity via checksums
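Replication and checksums can also be inspected per file from a client. A minimal sketch, again using a hypothetical path and an example target replication factor of 3:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Replication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.dat");   // hypothetical path

        // Current replication factor as recorded by the NameNode
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());

        // Request a new replication factor; the cluster adds or removes
        // block replicas in the background to reach it
        fs.setReplication(file, (short) 3);

        // File-level checksum derived from the stored block checksums
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println("checksum = " + checksum);
    }
}
```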
HDFS – Administration • Access via the Java API • FS shell command language • HTTP browser interface • C wrapper for the Java API • Space reclamation • Via control of the replication factor • Deleted files are moved to a trash folder • The trash folder is emptied after a configurable time
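As a sketch of everyday administration through the Java API, the following creates a directory, uploads a local file, lists the directory, and deletes it again. The paths are hypothetical, and the FS shell equivalents are noted in comments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdminOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical working paths for the example
        Path dir = new Path("/user/demo/reports");
        Path local = new Path("file:///tmp/report.csv");

        fs.mkdirs(dir);                           // hdfs dfs -mkdir -p /user/demo/reports
        fs.copyFromLocalFile(local, dir);         // hdfs dfs -put /tmp/report.csv /user/demo/reports

        for (FileStatus s : fs.listStatus(dir)) { // hdfs dfs -ls /user/demo/reports
            System.out.println(s.getPath() + "  " + s.getLen() + " bytes");
        }

        // Note: delete() via the Java API removes the data immediately,
        // whereas the shell's -rm uses the trash folder when trash is enabled
        fs.delete(dir, true);                     // recursive delete
    }
}
```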
HDFS – Future changes • Features that may be considered for HDFS • File append • User quotas • File links • Standby nodes
Other Areas • Want to know more about • Big Data • Nutch • Solr • See my other presentations
Contact Us • Feel free to contact us at • www.semtech-solutions.co.nz • info@semtech-solutions.co.nz • We offer IT project consultancy • We are happy to hear about your problems • You pay only for the hours you need to solve them