Programming on Hadoop
Outline • Different perspective of Cloud Computing • The Anatomy of Data Center • The Taxonomy of Computation • Computation intensive • Data intensive • The Hadoop Eco-system • Limitations of Hadoop
Cloud Computing • From the user perspective • A service which enables users to run their applications on the Internet • From the service provider perspective • A resource pool which is used to deliver cloud services through the Internet • The resource pool is hosted in an on-premise data center • What does the data center (DC) look like?
An Example of DC • Google’s Data Center, circa 2009. From Jeffrey Dean’s talk at WSDM 2009
A Closer Look at DC – Overview Figure is copied from [4]
A Closer Look at DC – Cooling Figure is copied from [4]
A Closer Look at DC – Computing Resources Figure is copied from [4]
The Commodity Server • A commodity server is NOT a low-end server • Standard components vs. proprietary hardware • Common configuration in 2008 • Processor: 2 quad-core Intel Xeon 2.0GHz CPUs • Memory: 8 GB ECC RAM • Storage: 4 × 1TB SATA disks • Network: Gigabit Ethernet
Approaches to Deliver Service • The dedicated approach • Serve each customer with dedicated computing resources • The shared approach (multi-tenant architecture) • Serve customers with the shared resource pool
The Dedicated Approach • Pros: • Easy to implement • Performance & security guarantees • Cons: • Painful for customers to scale their applications • Poor resource utilization
The Shared Approach • Pros: • No pain for customers to scale their applications • Better resource utilization • Better performance in some cases • Low service cost per customer • Cons: • Needs a complicated software layer • Performance isolation/tuning may be complicated • To achieve better performance, customers should be familiar with the software/hardware architecture to some degree
The Hadoop Eco-system • A software infrastructure to deliver a DC as a service through the shared-resource approach • Customers can use Hadoop to develop/deploy certain data-intensive applications on the cloud • We focus on the Hadoop core in this lecture • Hadoop == Hadoop core hereafter • The stack: Extensions – HBase, Chukwa, Hive, Pig; Core – Hadoop Distributed File System (HDFS), MapReduce
The Taxonomy of Computations • Computation-intensive tasks • Small data (in-memory), many CPU cycles per data item • Example: machine learning • Data-intensive tasks • Large-volume data (on-disk), relatively few CPU cycles per data item • Example: DBMS
The Data-intensive Tasks • Streaming-oriented data access • Read/write a large portion of the dataset in a streaming manner (sequentially) • Characteristics: • No seeks, high throughput • Optimized for a high data transfer rate • Random-oriented data access • Read/write a small number of data items randomly located in the dataset • Characteristics: • Seek-oriented • Optimized for low-latency access to each data item
What Hadoop does & doesn’t • Hadoop can do • High-throughput streaming data access • Limited low-latency random data access through HBase • Large-scale analysis through MapReduce • Hadoop cannot do • Transactions • Certain time-critical applications
Hadoop Quick Start • Very simple • Download the Hadoop package from Apache • http://hadoop.apache.org/ • Unpack into a folder • Do some configuration in hadoop-site.xml • fs.default.name selects the default file system (e.g., HDFS) • mapred.job.tracker points to the JobTracker of the MapReduce cluster • Start • Format the file system only once (in a fresh installation) • bin/hadoop namenode -format • Launch the HDFS & MapReduce cluster • bin/start-all.sh
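A minimal hadoop-site.xml sketch for a single-node (pseudo-distributed) setup; the host/port values below are typical defaults of that era, not prescribed by the slides:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>  <!-- default file system: HDFS on this host (assumed port) -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>  <!-- JobTracker of the MapReduce cluster (assumed port) -->
  </property>
</configuration>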
The Hadoop Distributed Filesystem • Wraps the DC as a resource pool and provides a set of APIs that let users read/write data from/into the DC sequentially
A Closer Look at the API • Aha, writing “Hello World!” • bin/hadoop jar test.jar

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Main {
  public static void main(String[] args) throws Exception {
    // Obtain the default file system (HDFS, as configured in hadoop-site.xml)
    FileSystem fs = FileSystem.get(new Configuration());
    // Create a file and write into it sequentially
    FSDataOutputStream fsOut = fs.create(new Path("testFile"));
    fsOut.writeBytes("Hello Hadoop");
    fsOut.close();
  }
}
A Closer Look at the API (cont.) • Reading data from HDFS

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Open the file written in the previous example and stream it back
    FSDataInputStream fsIn = fs.open(new Path("testFile"));
    byte[] buf = new byte[1024];
    int len = fsIn.read(buf);
    System.out.println(new String(buf, 0, len));
    fsIn.close();
  }
}
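The same API also supports the limited random access described in the taxonomy slide, via seek(); a minimal sketch (SeekDemo is our invented name; it re-reads the file from the write example, and the offset 6 is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream fsIn = fs.open(new Path("testFile"));
    fsIn.seek(6);                  // jump to byte offset 6 ("Hello Hadoop" -> "Hadoop")
    byte[] buf = new byte[16];
    int len = fsIn.read(buf);      // read starting at the new position
    System.out.println(new String(buf, 0, len));
    fsIn.close();
  }
}

Random access is possible, but as the earlier slide notes, HDFS is optimized for streaming, not for low-latency seeks.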
Inside HDFS • A single-NameNode, multiple-DataNode architecture (see [5] for reference) • Chops each file into a set of fixed-size blocks (64 MB by default) and stores those blocks across all available DataNodes • The NameNode hosts all file system metadata (file-to-block mapping, block locations, etc.) in memory • The DataNodes host all file data for reading/writing
Inside HDFS – Architecture • Figure is copied from http://hadoop.apache.org/common/docs/current/hdfs_design.html
Inside HDFS – Writing Data Figure is copied from [2]
Inside HDFS – Reading Data • What is the problem with reading/writing? Figure is copied from [2]
The HDFS Cons • A single reader/writer • Reads or writes a single block at a time • Touches only ONE DataNode • Data transfer rate == disk bandwidth of a SINGLE node • Too slow for a large file • Suppose disk bandwidth == 100MB/sec • Reading/writing a 1TB file takes ~3 hrs (1 TB ÷ 100 MB/s = 10,000 s ≈ 2.8 hours) • How to fix it?
Multiple Readers/Writers • Read/write a large data set using multiple processes • Each process reads/writes a subset of the whole data set and materializes that subset as a file • A file collection for the whole data set • Typically, the file collection is stored in a directory named after the data set
Multiple Readers/Writers (cont.) • Question – what is the proper number of readers and writers? • The pattern (a code sketch follows below):
Data set A → directory /root/datasetA
  Sub-set 1 → Process 1 → part-0001
  Sub-set 2 → Process 2 → part-0002
  Sub-set 3 → Process 3 → part-0003
  Sub-set 4 → Process 4 → part-0004
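A minimal sketch of this pattern, assuming four local threads stand in for the four writer processes (MultiWriterDemo, the path /root/datasetA, and the payload strings are illustrative, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultiWriterDemo {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    Thread[] writers = new Thread[4];
    for (int i = 0; i < writers.length; i++) {
      final int id = i + 1;
      writers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            // Each writer materializes its sub-set as one part file
            Path part = new Path(String.format("/root/datasetA/part-%04d", id));
            FSDataOutputStream out = fs.create(part);
            out.writeBytes("data of sub-set " + id);
            out.close();
          } catch (Exception e) { throw new RuntimeException(e); }
        }
      });
      writers[i].start();
    }
    for (Thread t : writers) t.join(); // manual coordination -- the pain the next slide describes
  }
}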
Multiple Readers/Writers (cont.) • Reading/writing a large data set with multiple readers/writers and materializing the data set as a collection of files is a common pattern in HDFS • But it is too painful! • Invoking multiple readers/writers across the cluster • Coordinating those readers/writers • Handling machine failures • …. • Rescue: MapReduce
The MapReduce System • MapReduce is a programming model and its associated implementation for processing and generating large data sets [1] • The computation performs key/value-oriented operations and consists of two functions • Map: transforms the input key/value pair into a set of intermediate key/value pairs • Reduce: merges the intermediate values with the same key and produces another key/value pair
The MapReduce Programming Model • Map: (k0, v0) -> [(k1, v1)] • Reduce: (k1, [v1]) -> (k2, v2)
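A concrete trace in this notation, using word count as the illustration (ours, not from the slides):

Map:     ("doc1", "to be or not to be") -> [("to",1), ("be",1), ("or",1), ("not",1), ("to",1), ("be",1)]
Shuffle: the framework groups intermediate pairs by key: ("to", [1,1]), ("be", [1,1]), ("or", [1]), ("not", [1])
Reduce:  ("to", [1,1]) -> ("to", 2), and likewise for every other key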
The System Architecture • One JobTracker for job submission • Multiple TaskTrackers for invoking mappers and reducers Figure is from Google Images
The Mapper Interface • Mapper/Reducer is defined as a generic Java interface in Hadoop

public interface Mapper<K1, V1, K2, V2> {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output,
           Reporter reporter) throws IOException;
}

public interface Reducer<K2, V2, K3, V3> {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output,
              Reporter reporter) throws IOException;
}
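A sketch of the canonical word-count job written against these interfaces (old org.apache.hadoop.mapred API; the class names WordCountMapper/WordCountReducer are ours):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: (byte offset, line of text) -> [(word, 1)]
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE); // emit one intermediate pair per word
    }
  }
}

// Reduce: (word, [1, 1, ...]) -> (word, count)
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) sum += values.next().get();
    output.collect(key, new IntWritable(sum)); // final count for this word
  }
}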
The Data Types of MapReduce • MapReduce makes no assumptions about data types • It does not know what constitutes a key/value pair • Users must decide the appropriate input/output data types • The runtime data-interpreting pattern • Achieved by implementing two Hadoop interfaces • RecordReader<K, V> for parsing input key/value pairs • RecordWriter<K, V> for serializing output key/value pairs
The RecordReader/Writer Interface

interface RecordReader<K, V> {
  // Other functions omitted
  boolean next(K key, V value);
}

interface RecordWriter<K, V> {
  // Other functions omitted
  void write(K key, V value);
}
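A toy sketch of the parsing side, assuming records are "key<TAB>value" text lines (TabRecordReader is our invented name; error handling and split management are simplified):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class TabRecordReader implements RecordReader<Text, Text> {
  private final BufferedReader in;
  private long pos = 0;

  public TabRecordReader(InputStream stream) {
    in = new BufferedReader(new InputStreamReader(stream));
  }

  // Parse the next line into (key, value); return false at end of input
  public boolean next(Text key, Text value) throws IOException {
    String line = in.readLine();
    if (line == null) return false;
    pos += line.length() + 1;
    int tab = line.indexOf('\t');
    key.set(tab < 0 ? line : line.substring(0, tab));
    value.set(tab < 0 ? "" : line.substring(tab + 1));
    return true;
  }

  public Text createKey() { return new Text(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return pos; }
  public float getProgress() { return 0.0f; } // unknown without the split length
  public void close() throws IOException { in.close(); }
}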
The Overall Picture • The data set is split into many parts (InputSplits) • Each part is processed by one mapper • The intermediate results are shuffled/merged and processed by reducers • Each reducer writes its results as a file part-000n • Pipeline: InputSplit-n → RecordReader → map → shuffle/merge → reduce → RecordWriter → part-000n
Performance Tuning • A lot of factors … • At the architecture level • Record parsing, map-side sorting, …, see [3] • Shuffling – see many research papers in VLDB, SIGMOD • Parameter tuning (a sketch follows below) • Memory buffers for mappers/reducers • The rule of thumb for concurrent mappers and reducers • Map: one mapper per file block • Reduce: a small multiple of the number of available TaskTrackers
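A sketch of these knobs in the old JobConf API (TuningDemo and all the numbers are placeholders, not recommendations from the slides):

import org.apache.hadoop.mapred.JobConf;

public class TuningDemo {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Rule of thumb: reducers == a small multiple of the TaskTracker count
    int numTaskTrackers = 10;              // placeholder cluster size
    conf.setNumReduceTasks(2 * numTaskTrackers);
    // Map-side sort buffer in MB (io.sort.mb; 100 MB was the default of that era)
    conf.set("io.sort.mb", "200");
    // The map count is only a hint; the framework runs roughly one map per file block
    conf.setNumMapTasks(100);
  }
}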
Limitations of Hadoop • HDFS • No reliable append yet • Files are immutable • MapReduce • Basically row-oriented • Support for complicated computations is not strong
Reference • [1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters • [2] Tom White. Hadoop: The Definitive Guide • [3] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu. The Performance of MapReduce: An In-depth Study • [4] Luiz André Barroso, Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines • [5] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System