http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com
Excel Online Classes offers the following services:
• Online Training
• Development
• Testing
• Job support
• Technical Guidance
• Job Consultancy
• Any needs of IT Sector
Nagarjuna K
HDFS
HDFS
• Distributed file system designed to run on commodity hardware
• Provides high-throughput access to application data; suitable for applications with large datasets
Assumptions & Goals
• Hardware Failure
• Streaming Data Access
• Large Datasets
• Simple Coherency Model
• Moving computation is cheaper than moving data
Assumptions & Goals: Hardware Failure
• An HDFS instance consists of many machines, each storing part of the data
• With that many machines, the chance that some machine goes down can’t be avoided
• Detecting faults and recovering from them automatically is a core architectural goal of HDFS
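To see why failure has to be treated as the normal case, here is a rough sketch of the chance that at least one machine in a cluster is down. The per-machine failure probability of 1% and the independence of failures are illustrative assumptions, not figures from the slides.

```python
def prob_any_failure(n_machines: int, p_fail: float) -> float:
    """Probability that at least one of n machines fails,
    assuming each fails independently with probability p_fail."""
    return 1.0 - (1.0 - p_fail) ** n_machines


if __name__ == "__main__":
    # p_fail = 0.01 is a made-up value used only for illustration.
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} machines: {prob_any_failure(n, 0.01):.4f}")
```

Even with a small per-machine probability, the result approaches 1 as the cluster grows, which is why fault detection and recovery are built into the design.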
Assumptions & Goals: Streaming Data Access
• HDFS is designed for batch processing rather than interactive use
• Emphasis is on data throughput
• Not on low-latency data access
Assumptions & Goals: Streaming Data Access
• HDFS is built on the idea of a “write once, read many times” pattern
• Over time, a dataset is generated and placed in HDFS
• Analysis is done on a large part of the data, rather than on the first few records
• The time to read the whole dataset matters more than the time to retrieve the first or the last record
Assumptions & Goals: Large Datasets
• A typical file ranges from gigabytes to terabytes
Assumptions & Goals: Simple Coherency Model
• HDFS is built on the idea of a “write once, read many times” pattern
• This assumption enables high-throughput data access
Assumptions & Goals: Moving Computation or Data?
• Computation-intensive programming
• Data-intensive programming
Where HDFS doesn’t fit
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
Where HDFS doesn’t fit
• Low-latency data access
  • HDFS has high latency
• Lots of small files
  • Each file (say, 10 KB in size) takes up a block in HDFS
  • Compress
  • All the metadata is stored in memory on the NameNode
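A back-of-the-envelope sketch of why many small files are a problem when all metadata is held in memory. The figure of roughly 150 bytes per file or block object is a commonly quoted rule of thumb and an assumption here, not a number from the slides.

```python
# Rule-of-thumb assumption: each file or block object costs about
# 150 bytes of metadata memory. This figure is not from the slides.
BYTES_PER_OBJECT = 150


def metadata_memory_bytes(n_files: int, blocks_per_file: int = 1) -> int:
    """Estimate metadata memory: one object per file plus one per block."""
    objects_per_file = 1 + blocks_per_file
    return n_files * objects_per_file * BYTES_PER_OBJECT


if __name__ == "__main__":
    for n in (1_000_000, 100_000_000):
        mb = metadata_memory_bytes(n) / 1_000_000
        print(f"{n:>11} small files: about {mb:,.0f} MB of metadata")
```

The memory cost depends on the number of files and blocks, not on how much data they hold, so the same data packed into fewer, larger files needs far less metadata.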
Where HDFS doesn’t fit
• Multiple writers, arbitrary file modifications
  • A file in HDFS is written by a single writer
  • Data can only be appended at the end of the file
  • Multiple writers to the same file, or writes at arbitrary offsets, are not supported (currently)
Blocks
• A disk has a block size
  • The minimum amount of data that can be read or written
  • Typically 512 bytes
• File system blocks are a few multiples of the disk block size
  • Typically a few KB
Blocks
• In a classical FS, a single block may contain data of only a single file
• This leads to internal fragmentation
• Newer file systems solve this problem by
  • block suballocation
  • tail merging
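Internal fragmentation can be made concrete with a small calculation: the space lost is whatever remains unused in a file's last block. The 4 KB block size in the usage lines is an illustrative assumption.

```python
def wasted_bytes(file_size: int, block_size: int) -> int:
    """Unused bytes in the last block when a block holds only one file."""
    remainder = file_size % block_size
    return 0 if remainder == 0 else block_size - remainder


if __name__ == "__main__":
    block = 4096  # assumed file system block size, for illustration only
    for size in (1, 10 * 1024, 8192):
        print(f"{size:>6}-byte file wastes {wasted_bytes(size, block)} bytes")
```

Block suballocation and tail merging reduce this loss by letting the tails of several files share a block.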
Blocks
• HDFS also has a block size
  • 64 MB
• Unlike in a normal FS, a file smaller than 64 MB doesn’t occupy a full 64 MB of underlying storage
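A minimal sketch of how a file maps onto 64 MB blocks, showing that the last block only takes as much space as the data it holds. The function name and the sample file sizes are hypothetical.

```python
MB = 1024 * 1024


def block_sizes(file_size: int, block_size: int = 64 * MB) -> list[int]:
    """Sizes of the blocks a file of file_size bytes is split into.
    The final block holds only the leftover bytes, not a full block."""
    full_blocks, tail = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if tail:
        sizes.append(tail)
    return sizes


if __name__ == "__main__":
    for size in (10 * 1024, 150 * MB):
        print(size, "bytes ->", [s / MB for s in block_sizes(size)], "MB blocks")
```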
Why BIG BLOCK size?
• Throughput vs latency
  • Time to seek to the start of the block
  • Time to read the whole block
Why BIG BLOCK size?
• Seek time = 10 ms
• Transfer rate (throughput) = 100 MB/s
• To make the seek time 1% of the transfer time, block size ≈ 100 MB
• Default is 64 MB
• As the transfer rate increases, the block size can be increased
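The arithmetic behind the 100 MB figure can be sketched as follows, using the seek time and transfer rate from the slide. The function name is hypothetical.

```python
def block_size_mb(seek_time_s: float, transfer_rate_mb_s: float,
                  seek_fraction: float) -> float:
    """Block size (MB) at which seek time is seek_fraction of transfer time."""
    # Transfer time needed so that the seek is only the given fraction of it.
    transfer_time_s = seek_time_s / seek_fraction
    # Data moved in that time at the given transfer rate.
    return transfer_rate_mb_s * transfer_time_s


if __name__ == "__main__":
    # 10 ms seek, 100 MB/s transfer, seek limited to 1% of transfer time.
    print(block_size_mb(0.010, 100.0, 0.01), "MB")
```

Doubling the transfer rate doubles the result, which matches the point that faster disks justify larger blocks.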
hadoop fsck / -files -blocks
• Gives information about all the files and blocks in the file system
• Replication status
  • under-replicated
  • over-replicated, etc.
• Corrupt blocks, etc.
File Permissions on HDFS
• The client’s identity is determined by
  • the user name and groups under which it operates
• A shared FS shouldn’t be used in a hostile environment
• Going forward
  • Kerberos authentication
Hadoop File Systems
• HDFS is just one implementation of the Hadoop FileSystem abstraction
• org.apache.hadoop.fs.FileSystem
  • represents a file system in Hadoop