Programming on Hadoop
Outline • Different perspective of Cloud Computing • The Anatomy of Data Center • The Taxonomy of Computation • Computation intensive • Data intensive • The Hadoop Eco-system • Limitations of Hadoop
Cloud Computing • From the user perspective • A service which enables users to run their applications on the Internet • From the service provider perspective • A resource pool which is used to deliver cloud services through the Internet • The resource pool is hosted in an on-premise data center • What does the data center (DC) look like?
An Example of DC • Google’s Data Center, circa 2009. From Jeffrey Dean’s talk at WSDM 2009
A Closer Look at DC – Overview Figure is copied from [4]
A Closer Look at DC – Cooling Figure is copied from [4]
A Closer Look at DC – Computing Resources Figure is copied from [4]
The Commodity Server • A commodity server is NOT a low-end server • Standard components vs. proprietary hardware • Common configuration in 2008 • Processor: 2 quad-core Intel Xeon 2.0GHz CPUs • Memory: 8 GB ECC RAM • Storage: 4 × 1TB SATA disks • Network: Gigabit Ethernet
Approaches to Deliver Service • The dedicated approach • Serve each customer with dedicated computing resources • The shared approach (multi-tenant architecture) • Serve customers with the shared resource pool
The Dedicated Approach • Pros: • Easy to implement • Performance & security guarantees • Cons: • Painful for customers to scale their applications • Poor resource utilization
The Shared Approach • Pros: • No pain for customers to scale their applications • Better resource utilization • Better performance in some cases • Low service cost per customer • Cons: • Needs a complicated software layer • Performance isolation/tuning may be complicated • To achieve better performance, customers should be familiar with the software/hardware architecture to some degree
The Hadoop Eco-system • A software infrastructure to deliver a DC as a service through the shared-resource approach • Customers can use Hadoop to develop/deploy certain data-intensive applications on the cloud • We focus on the Hadoop core in this lecture • Hadoop == Hadoop core hereafter • The stack: Extensions – HBase, Chukwa, Hive, Pig; Core – Hadoop Distributed File System (HDFS), MapReduce
The Taxonomy of Computations • Computation-intensive tasks • Small data (in-memory), many CPU cycles per data item • Example: machine learning • Data-intensive tasks • Large-volume data (on-disk), relatively few CPU cycles per data item • Example: DBMS
The Data-intensive Tasks • Streaming-oriented data access • Read/write a large portion of the dataset in a streaming manner (sequentially) • Characteristics: • No seeks, high throughput • Optimized for a high data transfer rate • Random-oriented data access • Read/write a small number of data items randomly located in the dataset • Characteristics: • Seek-oriented • Optimized for low-latency access to each data item
What Hadoop does & doesn’t • Hadoop can do • High-throughput streaming data access • Limited low-latency random data access through HBase • Large-scale analysis through MapReduce • Hadoop cannot do • Transactions • Certain time-critical applications
Hadoop Quick Start • Very simple • Download the Hadoop package from Apache • http://hadoop.apache.org/ • Unpack into a folder • Do some configuration in hadoop-site.xml • fs.default.name selects the default file system (e.g., HDFS) • mapred.job.tracker points to the JobTracker of the MapReduce cluster • Start • Format the file system only once (in a fresh installation) • bin/hadoop namenode -format • Launch the HDFS & MapReduce cluster • bin/start-all.sh
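A minimal hadoop-site.xml sketch for a single-node (pseudo-distributed) setup; the host/port values below are typical defaults of that era, not prescribed by the slides:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>  <!-- default file system: HDFS on this host (assumed port) -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>  <!-- JobTracker of the MapReduce cluster (assumed port) -->
  </property>
</configuration>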
The Hadoop Distributed Filesystem • Wraps the DC as a resource pool and provides a set of APIs that let users read/write data from/into the DC sequentially
A Closer Look at the API • Aha, writing “Hello World!” • bin/hadoop jar test.jar

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Main {
  public static void main(String[] args) throws Exception {
    // Obtain the default file system (HDFS, as configured in hadoop-site.xml)
    FileSystem fs = FileSystem.get(new Configuration());
    // Create a file and write into it sequentially
    FSDataOutputStream fsOut = fs.create(new Path("testFile"));
    fsOut.writeBytes("Hello Hadoop");
    fsOut.close();
  }
}
A Closer Look at the API (cont.) • Reading data from HDFS

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Open the file written in the previous example and stream it back
    FSDataInputStream fsIn = fs.open(new Path("testFile"));
    byte[] buf = new byte[1024];
    int len = fsIn.read(buf);
    System.out.println(new String(buf, 0, len));
    fsIn.close();
  }
}
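The same API also supports the limited random access described in the taxonomy slide, via seek(); a minimal sketch (SeekDemo is our invented name; it re-reads the file from the write example, and the offset 6 is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream fsIn = fs.open(new Path("testFile"));
    fsIn.seek(6);                  // jump to byte offset 6 ("Hello Hadoop" -> "Hadoop")
    byte[] buf = new byte[16];
    int len = fsIn.read(buf);      // read starting at the new position
    System.out.println(new String(buf, 0, len));
    fsIn.close();
  }
}

Random access is possible, but as the earlier slide notes, HDFS is optimized for streaming, not for low-latency seeks.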
Inside HDFS • A single-NameNode, multiple-DataNode architecture (see [5] for reference) • Chops each file into a set of fixed-size blocks (64 MB by default) and stores those blocks across all available DataNodes • The NameNode hosts all file system metadata (file-to-block mapping, block locations, etc.) in memory • The DataNodes host all file data for reading/writing
Inside HDFS – Architecture • Figure is copied from http://hadoop.apache.org/common/docs/current/hdfs_design.html
Inside HDFS – Writing Data Figure is copied from [2]
Inside HDFS – Reading Data • What is the problem with reading/writing? Figure is copied from [2]
The HDFS Cons • A single reader/writer • Reads or writes a single block at a time • Touches only ONE DataNode • Data transfer rate == disk bandwidth of a SINGLE node • Too slow for a large file • Suppose disk bandwidth == 100MB/sec • Reading/writing a 1TB file takes ~3 hrs (1 TB ÷ 100 MB/s = 10,000 s ≈ 2.8 hours) • How to fix it?
Multiple Readers/Writers • Read/write a large data set using multiple processes • Each process reads/writes a subset of the whole data set and materializes that subset as a file • A file collection for the whole data set • Typically, the file collection is stored in a directory named after the data set
Multiple Readers/Writers (cont.) • Question – what is the proper number of readers and writers? • The pattern (a code sketch follows below):
Data set A → directory /root/datasetA
  Sub-set 1 → Process 1 → part-0001
  Sub-set 2 → Process 2 → part-0002
  Sub-set 3 → Process 3 → part-0003
  Sub-set 4 → Process 4 → part-0004
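A minimal sketch of this pattern, assuming four local threads stand in for the four writer processes (MultiWriterDemo, the path /root/datasetA, and the payload strings are illustrative, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultiWriterDemo {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    Thread[] writers = new Thread[4];
    for (int i = 0; i < writers.length; i++) {
      final int id = i + 1;
      writers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            // Each writer materializes its sub-set as one part file
            Path part = new Path(String.format("/root/datasetA/part-%04d", id));
            FSDataOutputStream out = fs.create(part);
            out.writeBytes("data of sub-set " + id);
            out.close();
          } catch (Exception e) { throw new RuntimeException(e); }
        }
      });
      writers[i].start();
    }
    for (Thread t : writers) t.join(); // manual coordination -- the pain the next slide describes
  }
}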
Multiple Readers/Writers (cont.) • Reading/writing a large data set with multiple readers/writers and materializing the data set as a collection of files is a common pattern in HDFS • But it is too painful! • Invoking multiple readers/writers across the cluster • Coordinating those readers/writers • Handling machine failures • …. • Rescue: MapReduce
The MapReduce System • MapReduce is a programming model and its associated implementation for processing and generating large data sets [1] • The computation performs key/value-oriented operations and consists of two functions • Map: transforms the input key/value pair into a set of intermediate key/value pairs • Reduce: merges the intermediate values with the same key and produces another key/value pair
The MapReduce Programming Model • Map: (k0, v0) -> [(k1, v1)] • Reduce: (k1, [v1]) -> (k2, v2)
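A concrete trace in this notation, using word count as the illustration (ours, not from the slides):

Map:     ("doc1", "to be or not to be") -> [("to",1), ("be",1), ("or",1), ("not",1), ("to",1), ("be",1)]
Shuffle: the framework groups intermediate pairs by key: ("to", [1,1]), ("be", [1,1]), ("or", [1]), ("not", [1])
Reduce:  ("to", [1,1]) -> ("to", 2), and likewise for every other key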
The System Architecture • One JobTracker for job submission • Multiple TaskTrackers for invoking mappers and reducers Figure is from Google Images
The Mapper Interface • Mapper/Reducer is defined as a generic Java interface in Hadoop

public interface Mapper<K1, V1, K2, V2> {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output,
           Reporter reporter) throws IOException;
}

public interface Reducer<K2, V2, K3, V3> {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output,
              Reporter reporter) throws IOException;
}
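A sketch of the canonical word-count job written against these interfaces (old org.apache.hadoop.mapred API; the class names WordCountMapper/WordCountReducer are ours):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: (byte offset, line of text) -> [(word, 1)]
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE); // emit one intermediate pair per word
    }
  }
}

// Reduce: (word, [1, 1, ...]) -> (word, count)
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) sum += values.next().get();
    output.collect(key, new IntWritable(sum)); // final count for this word
  }
}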
The Data Types of MapReduce • MapReduce makes no assumptions about data types • It does not know what constitutes a key/value pair • Users must decide the appropriate input/output data types • The runtime data-interpreting pattern • Achieved by implementing two Hadoop interfaces • RecordReader<K, V> for parsing input key/value pairs • RecordWriter<K, V> for serializing output key/value pairs
The RecordReader/Writer Interface

interface RecordReader<K, V> {
  // Other functions omitted
  boolean next(K key, V value);
}

interface RecordWriter<K, V> {
  // Other functions omitted
  void write(K key, V value);
}
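A toy sketch of the parsing side, assuming records are "key<TAB>value" text lines (TabRecordReader is our invented name; error handling and split management are simplified):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class TabRecordReader implements RecordReader<Text, Text> {
  private final BufferedReader in;
  private long pos = 0;

  public TabRecordReader(InputStream stream) {
    in = new BufferedReader(new InputStreamReader(stream));
  }

  // Parse the next line into (key, value); return false at end of input
  public boolean next(Text key, Text value) throws IOException {
    String line = in.readLine();
    if (line == null) return false;
    pos += line.length() + 1;
    int tab = line.indexOf('\t');
    key.set(tab < 0 ? line : line.substring(0, tab));
    value.set(tab < 0 ? "" : line.substring(tab + 1));
    return true;
  }

  public Text createKey() { return new Text(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return pos; }
  public float getProgress() { return 0.0f; } // unknown without the split length
  public void close() throws IOException { in.close(); }
}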
The Overall Picture • The data set is split into many parts (InputSplits) • Each part is processed by one mapper • The intermediate results are shuffled/merged and processed by reducers • Each reducer writes its results as a file part-000n • Pipeline: InputSplit-n → RecordReader → map → shuffle/merge → reduce → RecordWriter → part-000n
Performance Tuning • A lot of factors … • At the architecture level • Record parsing, map-side sorting, …, see [3] • Shuffling – see many research papers in VLDB, SIGMOD • Parameter tuning (a sketch follows below) • Memory buffers for mappers/reducers • The rule of thumb for concurrent mappers and reducers • Map: one mapper per file block • Reduce: a small multiple of the number of available TaskTrackers
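A sketch of these knobs in the old JobConf API (TuningDemo and all the numbers are placeholders, not recommendations from the slides):

import org.apache.hadoop.mapred.JobConf;

public class TuningDemo {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Rule of thumb: reducers == a small multiple of the TaskTracker count
    int numTaskTrackers = 10;              // placeholder cluster size
    conf.setNumReduceTasks(2 * numTaskTrackers);
    // Map-side sort buffer in MB (io.sort.mb; 100 MB was the default of that era)
    conf.set("io.sort.mb", "200");
    // The map count is only a hint; the framework runs roughly one map per file block
    conf.setNumMapTasks(100);
  }
}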
Limitations of Hadoop • HDFS • No reliable append yet • Files are immutable • MapReduce • Basically row-oriented • Support for complicated computations is not strong
Reference • [1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters • [2] Tom White. Hadoop: The Definitive Guide • [3] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu. The Performance of MapReduce: An In-depth Study • [4] Luiz André Barroso, Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines • [5] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System