http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com
Excel Online Classes offers the following services:
• Online Training
• Development
• Testing
• Job support
• Technical Guidance
• Job Consultancy
• Any needs of the IT sector
Nagarjuna K
HDFS
HDFS
• Distributed file system designed to run on commodity hardware
• Provides high-throughput access to application data; suited to applications with large datasets
Assumptions & Goals
• Hardware Failure
• Streaming Data Access
• Large Datasets
• Simple Coherency Model
• Moving computation is cheaper than moving data
Assumptions & Goals: Hardware Failure
• An HDFS instance spans many machines
• Each machine stores part of the data
• The chance that some machine goes down can’t be avoided
• Detecting faults and recovering automatically is a core architectural goal of HDFS
Assumptions & Goals: Streaming Data Access
• HDFS is designed for batch processing rather than interactive use
• Emphasis is on data throughput
• Not on low-latency data access
Assumptions & Goals: Streaming Data Access
• HDFS is built on the idea of a “write once, read many times” pattern
• Over time, a dataset is generated and placed in HDFS
• Analysis is done on a large part of the data, rather than on the first few records
• The time to read the whole dataset matters more than the latency of retrieving the first or the last record
Assumptions & Goals: Large Datasets
• A typical file ranges from GB to TB
Assumptions & Goals: Simple Coherency Model
• HDFS is built on the idea of a “write once, read many times” pattern
• This assumption enables high-throughput access
Assumptions & Goals: Moving Computation or Data?
• Computation-intensive programming
• Data-intensive programming
Where HDFS doesn’t fit
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
Where HDFS doesn’t fit
• Low-latency data access
• Lots of small files
  • High latency
  • Each file (say, 10 KB in size) takes up a block in HDFS
  • All the metadata is held in the NameNode’s memory
Compress
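The cost of many small files can be sketched with some rough arithmetic. This is only an illustration: the figure of about 150 bytes of NameNode memory per file or block object is an assumed rule of thumb, not a number from these slides, and `namenode_bytes` is a hypothetical helper. The 64 MB block size is the default quoted in the Blocks slides.

```python
# Rough estimate of NameNode memory consumed by file metadata.
BYTES_PER_OBJECT = 150           # ASSUMPTION: rule-of-thumb cost per file or block object
BLOCK_SIZE = 64 * 1024 * 1024    # 64 MB default block size


def namenode_bytes(num_files, file_size):
    """Estimate metadata bytes for num_files files of file_size bytes each."""
    # Ceiling division; even a tiny file needs one block entry.
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))
    # One object for the file itself plus one per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Roughly 10 GB stored as a million 10 KB files, or as ten 1 GB files.
small_files = namenode_bytes(1_000_000, 10 * 1024)
large_files = namenode_bytes(10, 1024 ** 3)
print(small_files, large_files)
```

Under these assumptions the small-file layout needs several orders of magnitude more NameNode memory for about the same amount of data.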
Where HDFS doesn’t fit
• Multiple writers, arbitrary file modifications
  • A file in HDFS has a single writer, and data is appended only at the end of the file
  • Multiple writers to the same file, and writes at arbitrary offsets, are not supported (currently)
Blocks
• A disk has a block size
  • The minimum amount of data that can be read or written
  • Typically 512 bytes
• File system blocks are a small multiple of the disk block size
  • Typically a few KB
Blocks
• In a classical FS, a single block may contain data of only a single file
• This leads to internal fragmentation
• Newer file systems solve this problem by
  • block suballocation
  • tail merging
Blocks
• HDFS also has a block size
  • 64 MB
• Unlike a normal FS, a file smaller than 64 MB doesn’t occupy 64 MB of underlying storage
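A minimal sketch of how a file divides into blocks, assuming the 64 MB block size above. `block_lengths` is a hypothetical helper, not part of any Hadoop API; it only illustrates that the last block holds just the remaining bytes.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB block size


def block_lengths(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block a file would be split into."""
    full_blocks, tail = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if tail:
        # The final block stores only the leftover bytes, not a full block.
        lengths.append(tail)
    return lengths


MB = 1024 * 1024
print(block_lengths(10 * 1024))  # a 10 KB file: one block of 10 KB
print(block_lengths(150 * MB))   # a 150 MB file: two full blocks and a 22 MB tail
```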
Why BIG BLOCK size?
• Throughput vs latency
  • Time to seek to the start of a block
  • Time to read the whole block
Why BIG BLOCK size?
• Seek time = 10 ms
• Transfer rate (throughput) = 100 MB/s
• To make the seek time 1% of the transfer time, block size = 100 MB
• Default is 64 MB
• As transfer rates increase, the block size can be increased
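The arithmetic behind the 100 MB figure can be worked through as a small sketch. The function name and parameters are illustrative; the inputs are the numbers from the slide.

```python
def block_size_mb(seek_ms, transfer_mb_per_s, seek_fraction):
    """Block size (MB) at which seek time is seek_fraction of transfer time."""
    # Transfer time needed so that the seek is only seek_fraction of it.
    transfer_s = (seek_ms / 1000) / seek_fraction
    # Data that can be moved in that time at the given transfer rate.
    return transfer_s * transfer_mb_per_s


# 10 ms seek, 100 MB/s transfer, seek time held to 1% of transfer time.
size = block_size_mb(seek_ms=10, transfer_mb_per_s=100, seek_fraction=0.01)
print(size)
```

With a 10 ms seek, the transfer must last 1 s, and 1 s at 100 MB/s moves 100 MB. Doubling the transfer rate doubles the block size for the same seek overhead.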
hadoop fsck / -files -blocks
• Gives information about all the files and blocks in the file system
• Replication status
  • under-replicated
  • over-replicated, etc.
• Corrupt blocks, etc.
HDFS Architecture
[Diagram: the NameNode holds the Name Space (NS) and Block Management; the DataNodes provide Storage; Block Management and Storage together form Block Storage]
HDFS Architecture -- NameSpace
• Name Space
  • Consists of directories, files, and blocks
  • Supports create/delete/modify/list operations on files and directories
HDFS Architecture -- Block Storage
• Block Storage
  • Block Management
    • DataNode cluster membership
    • Supports create/delete/modify/get-block-location operations
    • Manages replicas and their placement
  • Storage
    • Provides read and write access to blocks
HDFS Architecture
• NameSpace Volume = NameSpace + Blocks
• Implemented using the NameNode (NN) and DataNodes (DNs)
• The NameNode supports
  • Name Space
  • Block Management
  • Both are collocated in the NameNode
• DataNodes store the block replicas
• Block files are stored on the local file system
Metadata in NameNode
NameNode
• Two main storage structures
  • fsimage
  • edit logs
• A new write request
  • is recorded in the edits log
  • updates the in-memory metadata
• The in-memory metadata is used to serve read requests
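The write path described above can be illustrated with a toy model. This is a minimal sketch, not the real NameNode: `TinyNameNode`, its methods, and the use of a Python list as the edits log are all illustrative assumptions.

```python
class TinyNameNode:
    """Toy model: log each change first, then update in-memory metadata."""

    def __init__(self):
        self.edits = []      # stand-in for the on-disk edits log
        self.namespace = {}  # in-memory metadata: path -> replication level

    def create(self, path, replication=3):
        self.edits.append(("create", path, replication))  # record in the edits log
        self.namespace[path] = replication                # update in-memory metadata

    def delete(self, path):
        self.edits.append(("delete", path))
        self.namespace.pop(path, None)

    def exists(self, path):
        # Reads are served from memory; the log is not consulted.
        return path in self.namespace


nn = TinyNameNode()
nn.create("/data/a.txt")
nn.create("/data/b.txt", replication=2)
nn.delete("/data/a.txt")
print(nn.exists("/data/b.txt"), len(nn.edits))
```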
NameNode -- fsimage
• Serialized form of all the directory and file inodes in the system
• An inode is the internal representation of a file’s metadata
  • file replication level
  • modification/access times
  • access permissions
  • block size
  • blocks the file is made up of
NameNode -- fsimage
• Doesn’t record the DataNodes on which blocks are stored
• The NameNode keeps this mapping in memory
• The NameNode obtains block lists from the DataNodes periodically
• Hence the NameNode stays up to date
DataNode
• Periodically sends to the NameNode
  • Heartbeat
  • Block report
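How block reports could be turned into an in-memory block-to-DataNode mapping can be sketched as follows. The data shapes and the name `build_block_map` are assumptions made for illustration; the real report format is not described in these slides.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Build block id -> set of DataNodes from per-DataNode block reports.

    block_reports maps a DataNode name to the block ids it says it holds.
    """
    locations = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            locations[block_id].add(datanode)
    return dict(locations)


# Hypothetical reports from three DataNodes.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1"],
    "dn3": ["blk_2", "blk_3"],
}
block_map = build_block_map(reports)
print(sorted(block_map["blk_1"]))
```

Because the mapping is rebuilt from reports, it does not need to be persisted in the fsimage.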
NameNode -- EditLogs
• EditLogs keep on growing
• So what?
• EditLogs are stored on physical disk
Secondary NameNode
• Asks the NN for the edits and fsimage files
• Loads the fsimage into memory
• Applies every operation in the edits file to the fsimage, producing a consolidated fsimage file
• Sends this fsimage back to the NN
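The merge step can be sketched as replaying logged operations over a snapshot. This is a simplified illustration: the fsimage is modelled as a plain dict and the edits as tuples, both of which are assumptions rather than the real on-disk formats.

```python
def checkpoint(fsimage, edits):
    """Replay edits over an fsimage snapshot and return the merged image.

    fsimage: dict of path -> replication level (toy stand-in for the real image)
    edits:   list of ("create", path, replication) or ("delete", path) tuples
    """
    merged = dict(fsimage)  # work on a copy, as the checkpoint runs off the NN
    for op in edits:
        if op[0] == "create":
            _, path, replication = op
            merged[path] = replication
        elif op[0] == "delete":
            merged.pop(op[1], None)
    return merged


old_image = {"/a": 3}
edit_log = [("create", "/b", 2), ("delete", "/a"), ("create", "/c", 3)]
new_image = checkpoint(old_image, edit_log)
print(new_image)
```

Once the merged image is handed back, the edits that were replayed no longer need to be kept, which is why the edits file stays small.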
NN & SNN
• Thus the edits file on the NN stays small
• The NN doesn’t carry the burden of merging the edit logs with the existing image
Communication b/w NN and DN/client
• A DN or client connects to the NameNode through its configured TCP port
• An RPC abstraction wraps the client and DN protocols
• RPC: Remote Procedure Call
Communication b/w NN and DN/client
• The NameNode doesn’t initiate any RPCs
• It only responds to RPCs
Robustness of HDFS
• DataNode failures, heartbeats, replication
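A minimal sketch of how missed heartbeats could lead to re-replication. The timeout value, the target replication of 3, and both function names are assumptions chosen for illustration, not values taken from these slides.

```python
def dead_datanodes(last_heartbeat, now, timeout):
    """DataNodes whose last heartbeat is older than timeout seconds."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen > timeout}


def under_replicated(block_locations, dead, target=3):
    """Map each block to the number of extra replicas it needs."""
    needed = {}
    for block_id, datanodes in block_locations.items():
        live = set(datanodes) - dead
        if len(live) < target:
            needed[block_id] = target - len(live)
    return needed


# Hypothetical state: dn3 has been silent for 90 seconds.
heartbeats = {"dn1": 100, "dn2": 98, "dn3": 10, "dn4": 99}
locations = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}

dead = dead_datanodes(heartbeats, now=100, timeout=30)
print(dead, under_replicated(locations, dead))
```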
Robustness of HDFS
• Cluster Rebalancing
  • Free space runs low on one node of the cluster
  • High demand for particular data
Robustness of HDFS
• Data Integrity
  • Checksums are kept for the data stored on DataNodes
  • If the client doesn’t receive correct data, it can read the data from another replica
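The fallback to another replica can be sketched as follows. CRC32 from Python's `zlib` is used here only as a convenient stand-in checksum, and `read_block` is a hypothetical helper; neither is claimed to match the checksum scheme HDFS actually uses.

```python
import zlib


def read_block(replicas, expected_checksum):
    """Return the first replica whose checksum matches, trying each in turn."""
    for data in replicas:
        if zlib.crc32(data) == expected_checksum:
            return data
        # Checksum mismatch: this copy is corrupt, so try the next replica.
    raise IOError("no replica passed checksum verification")


original = b"hello hdfs"
checksum = zlib.crc32(original)  # computed when the block was written

corrupt = b"hellx hdfs"          # simulated bit rot on the first replica
result = read_block([corrupt, original], checksum)
print(result == original)
```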
Robustness of HDFS
• Metadata Disk Failure
  • NameNode
  • Secondary NameNode
Data Organization
• Data Blocks
  • Write once / read many
  • Apt for large datasets
  • Files are chopped into 64 MB blocks
  • Each block resides on a different node if possible
Data Organization
• Staging
  • The client caches the data before writing it to a block
  • The NameNode inserts the file name into its metadata and allocates a block
  • The client flushes the cached data to a block on the DataNode specified by the NN
Data Organization
• Staging
  • Once the file is closed, the client informs the NN that no more data is coming
  • The NameNode commits the file creation operation to its persistent store
  • If the NameNode dies during this process…?
Data Organization
• Replication Pipelining
  • The first DataNode receives data from the client in small portions (say, 4 KB),
  • writes each portion to its disk and forwards it to DN2
  • DN2 does the same with DN3, which ultimately flushes the data out
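The pipeline can be sketched with in-memory buffers standing in for each DataNode's disk. This is a simplified, sequential illustration: `pipeline_write` is a hypothetical function, and the forwarding here happens in a loop rather than over the network.

```python
def pipeline_write(data, datanode_disks, portion_size=4096):
    """Send data in small portions through a chain of DataNodes.

    datanode_disks is an ordered list of bytearrays, one per DataNode
    in the pipeline (DN1 first).
    """
    for start in range(0, len(data), portion_size):
        portion = data[start:start + portion_size]
        # Each DataNode writes the portion locally, then passes it on
        # to the next DataNode in the chain.
        for disk in datanode_disks:
            disk.extend(portion)


dn1, dn2, dn3 = bytearray(), bytearray(), bytearray()
payload = bytes(range(256)) * 40  # 10,240 bytes, i.e. three portions
pipeline_write(payload, [dn1, dn2, dn3])
print(len(dn1), len(dn2), len(dn3))
```

Because each portion is forwarded as soon as it is written, the client only has to send the data once, regardless of the number of replicas.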
File Permissions on HDFS
• The client’s identity is determined by
  • the user name and groups under which it operates
• A shared FS shouldn’t be used in a hostile environment
• Going forward
  • Kerberos authentication
Hadoop File Systems
• HDFS is just one implementation of a Hadoop FileSystem
• org.apache.hadoop.fs.FileSystem
  • represents a file system in Hadoop
DFShell
The HDFS shell can be invoked by: bin/hadoop dfs <args>
• cat • chgrp • chmod • chown • copyFromLocal • copyToLocal • cp • du • dus
• expunge • get • getmerge • ls • lsr • mkdir • moveFromLocal • mv • put
• rm • rmr • setrep • stat • tail • test • text • touchz
nagarjuna@outlook.com
Link Files in HDFS
• No hard links
• No soft links