Learn how to develop Hadoop applications using Java with this comprehensive Hortonworks University training course for developers. Gain knowledge and skills in various topics such as MapReduce, HBase, Pig, and Hive programming.
Developing Hadoop Applications in Java: A Hortonworks University Hadoop Training Course for Developers
Introductions • Your name • Job responsibilities • Previous Hadoop experience (if any) • What brought you here
Course Outline • Day 1 • Unit 1: Understanding Hadoop and MapReduce • Unit 2: Writing MapReduce Applications • Unit 3: Map Aggregation • Day 2 • Unit 4: Partitioning and Sorting • Unit 5: Input and Output Formats • Day 3 • Unit 6: Optimizing MapReduce Jobs • Unit 7: Advanced MapReduce Features • Unit 8: Unit Testing • Unit 9: Defining Workflow • Day 4 • Unit 10: HBase Programming • Unit 11: Pig Programming • Unit 12: Hive Programming
Who is Developing Apache Hadoop? • Hortonworks has the largest PMC and committer base of any single organization • The project is governed by the Apache bylaws • See http://hadoop.apache.org/who.html (as of 5/2012)
Balancing Innovation & Stability • Be aggressive: ship early and often • Be predictable: ship when stable
Features of Hadoop • Hadoop = HDFS + MapReduce, plus ecosystem projects such as Hive, Pig, HBase, HCatalog, Mahout, and ZooKeeper
Hortonworks Data Platform: a Fully Supported, Integrated Platform • Challenge: integrating, managing, and supporting changes across a wide range of open source Hadoop projects is time-intensive, complex, and expensive • Solution: Hortonworks Data Platform • Integrated, certified platform distributions • Extensive QA process • Industry-leading support with clear service levels for updates and patches • Multi-year support and maintenance policy • Technical guidance support for Universe and Multiverse components • Platform components: Hadoop Core, Pig, ZooKeeper, Hive, HCatalog, HBase
The Hadoop Distributed File System • NameNode • The “master” node of HDFS • Determines and maintains how the chunks of data are distributed across the DataNodes • DataNode • Stores the chunks of data, and is responsible for replicating the chunks across other DataNodes
Diagram: putting Big Data into HDFS. The data is broken into chunks that are distributed across the DataNodes (DataNode 1, 2, and 3), with the NameNode tracking where each chunk lives; the DataNodes then replicate the chunks among themselves.
The JobTracker and TaskTrackers • JobTracker • the “master” daemon of the TaskTrackers • clients submit MapReduce jobs to the JobTracker • distributes the tasks to available TaskTrackers • TaskTracker • runs on DataNodes • executes the actual map and reduce tasks
Diagram: MapReduce job execution. 1. The client submits a job to the JobTracker. 2. The JobTracker distributes tasks to TaskTracker 1, TaskTracker 2, and TaskTracker 3 based on availability and where the data resides. 3. Each TaskTracker spawns a JVM to execute its task. 4. The TaskTrackers send task status back to the JobTracker.
Job Schedulers • Fair Scheduler • all jobs get, on average, an equal share of resources over time • Capacity Scheduler • jobs are submitted to queues, and queues are allocated a fraction of the total resource capacity • Use mapred.jobtracker.taskScheduler to configure the scheduler (a configuration sketch follows below)
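A minimal sketch of selecting a scheduler, assuming a Hadoop 1.x (MRv1) cluster; in practice this property is normally set in mapred-site.xml on the JobTracker host rather than in client code:

import org.apache.hadoop.conf.Configuration;

// Illustrative only: point the JobTracker at the Fair Scheduler implementation.
Configuration conf = new Configuration();
conf.set("mapred.jobtracker.taskScheduler",
         "org.apache.hadoop.mapred.FairScheduler");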
Hadoop Modes • Diagram: in a fully-distributed cluster, the NameNode, Secondary NameNode, and JobTracker each run on their own machine, while each remaining machine in the cluster runs a DataNode/TaskTracker pair.
HDFS Filesystem Commands • hadoop fs -ls counties • hadoop fs -lsr counties • hadoop fs -mkdir population_data • hadoop fs -put data/*.txt population_data/ • hadoop fs -cat population_data/population_1.txt
The HDFS API

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
Path dir = new Path("results");
FileSystem fs = FileSystem.get(conf);
if (!fs.exists(dir)) {
    dir.getFileSystem(conf).mkdirs(dir);
}
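A short sketch of copying a local file into HDFS with the same API, along the lines of what Lab 1.2 asks for (the local and HDFS paths here are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Copy a file from the local filesystem into the HDFS directory created above.
fs.copyFromLocalFile(new Path("data/population_1.txt"),
                     new Path("results/population_1.txt"));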
Lab 1.1: Configuring a Hadoop Development Environment Lab 1.2: Putting Files in HDFS with Java
Diagram: the MapReduce pipeline. In the map phase, a Mapper runs on each DataNode (DataNode 1, 2, and 3); during the shuffle/sort, the map output is shuffled across the network and sorted; in the reduce phase, Reducers run on a subset of the DataNodes.
Diagram: inside a map task on a DataNode. The InputFormat reads the input split and generates <k1,v1> pairs; the map method outputs <k2,v2> pairs into the MapOutputBuffer; records are sorted and spilled to disk when the buffer reaches a threshold; the spill files are then merged into a single file, which becomes the Reducer's input (Mapper output = Reducer input).
Diagram: the reduce side. 1. The Reducer fetches the map output (Mapper output = Reducer input) from each DataNode. 2. The fetched records go into an in-memory buffer that spills to disk as spill files. 3. The spill files are merged into a single input. 4. The Reducer processes the merged input. 5. The Reducer writes its results to HDFS.
The Key/Value Pairs of MapReduce. The Mapper consumes <K1,V1> pairs and emits <K2,V2> pairs; the shuffle/sort groups them into <K2, (V2,V2,V2,V2)>; the Reducer consumes those groups and emits <K3,V3> pairs.
The MapReduce API • Develop Java MapReduce applications using the org.apache.hadoop packages • Prior to Hadoop 0.20: the old API • org.apache.hadoop.mapred package • As of Hadoop 0.20: the new API • org.apache.hadoop.mapreduce package
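As a quick illustration of the difference, the same class name lives in two different packages:

// Old API (prior to Hadoop 0.20) - Mapper is an interface here
import org.apache.hadoop.mapred.Mapper;
// New API (Hadoop 0.20 and later) - Mapper is a class, used throughout this course
import org.apache.hadoop.mapreduce.Mapper;
// (a real source file would import only one of these)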
WordCountMapper

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String currentLine = value.toString();
        String[] words = currentLine.split(" ");
        for (String word : words) {
            Text outputKey = new Text(word);
            context.write(outputKey, new IntWritable(1));
        }
    }
}
WordCountReducer

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        IntWritable outputValue = new IntWritable(sum);
        context.write(key, outputValue);
    }
}
WordCountJob

Job job = new Job(getConf(), "WordCountJob");
Configuration conf = job.getConfiguration();
job.setJarByClass(getClass());

Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

return job.waitForCompletion(true) ? 0 : 1;
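The getConf() and getClass() calls above suggest this code lives inside a driver that follows the Tool/ToolRunner pattern; a minimal sketch of such a wrapper (the class name WordCountJob is carried over from the slide):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ...the job configuration shown above goes here...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCountJob(), args);
        System.exit(exitCode);
    }
}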
Running a MapReduce Job • To run a job, perform the following steps: • Put the input files into HDFS. • If the output directory exists, delete it. • Use hadoop to execute the job. • View the output files. • hadoop jar wordcount.jar WordCountJob input/file.txt result
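Putting those steps together for the word count job, assuming the file and directory names used in the example command above (with the default TextOutputFormat, the results land in files named part-r-00000 and so on under the output directory):

hadoop fs -mkdir input
hadoop fs -put file.txt input/
hadoop fs -rmr result        # only needed if a previous run left this directory behind
hadoop jar wordcount.jar WordCountJob input/file.txt result
hadoop fs -cat result/part-r-00000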
Lab 2.1: Word Count Lab 2.2: Distributed Grep Lab 2.3: Inverted Index
Without aggregation: the Mapper simply outputs every word without performing any computations (e.g. <"by",1> <"the",1> <"people",1> <"for",1> <"the",1> <"people",1> <"of",1> <"the",1> <"people",1>), and the Reducer processes a large number of records fetched over HTTP across the network. With aggregation: the Mapper combines records in a manner that does not affect the algorithm (e.g. <"by",1> <"the",3> <"people",3> <"for",1> <"of",1>), so the expensive network traffic is decreased.
Overview of Combiners. Diagram: 1. When the MapOutputBuffer is full, a spill to disk occurs. 2. The Combiner is invoked on the <k2,v2> records in an attempt to reduce file I/O before each spill file is written. 3. The result is fewer records output by the Mapper (Mapper output = Reducer input).
Details of a Combiner. Diagram: 1. When the MapOutputBuffer is full, a spill to disk occurs. 2. If a Combiner is used, the <k2,v2> output is first sent to lists in memory, with one list for each key. 3. After a certain number of <key,value> pairs have been written to the lists, the lists are sent to the Combiner. 4. The combined records are then spilled to disk as a spill file.
Reduce-side Combining. Diagram: the Combiner is also used in the reduce phase when merging the intermediate <key,value> pairs fetched from different Mappers, as the in-memory buffer and spill files are merged into the Reducer's input; the Reducer then writes its output to HDFS.
Example of a Combiner

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outputValue = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        outputValue.set(sum);
        context.write(key, outputValue);
    }
}
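A Combiner is wired into a job in the driver; assuming the WordCountJob driver shown earlier, a one-line sketch:

// The Combiner runs on map output before it is shuffled to the Reducers.
job.setCombinerClass(WordCountCombiner.class);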
In-Map Aggregation • The Mapper combines records as they are being processed • The Mapper stores the partial results in memory • If there are many distinct keys and holding them all in memory is prohibitive, in-map aggregation may not work for you (a sketch follows below)
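A minimal sketch of in-map aggregation for word count, assuming the counts for all distinct words fit in the Mapper's memory (this is an illustration, not the TopResultsMapper example that follows):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapAggregationMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Partial counts are accumulated here instead of being written once per word.
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split(" ")) {
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one record per distinct word after all input has been processed.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }
}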
Example: TopResultsMapper. Diagram: the Mapper reads text such as “We the People of the United States, in Order to form a more perfect union...” and accumulates word counts in an ArrayList (e.g. "We",1; "the",2; "People",1; "of",1; ...). After the entire input is processed, the list is converted to a PriorityQueue ordered by frequency (e.g. "the",726; "of",493; "shall",293; ...), and only the top 10 results are sent to the Reducer.
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] input = StringUtils.split(value.toString(), '\\', ' ');
    for (String word : input) {
        Word currentWord = new Word(word, 1);
        if (words.contains(currentWord)) {
            // increment the existing Word's frequency
            for (Word w : words) {
                if (w.equals(currentWord)) {
                    w.frequency++;
                    break;
                }
            }
        } else {
            words.add(currentWord);
        }
    }
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    Text outputKey = new Text();
    IntWritable outputValue = new IntWritable();
    queue = new PriorityQueue<Word>(words.size());
    queue.addAll(words);
    for (int i = 1; i <= maxResults; i++) {
        Word tail = queue.poll();
        if (tail != null) {
            outputKey.set(tail.value);
            outputValue.set(tail.frequency);
            context.write(outputKey, outputValue);
        }
    }
}
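The Word helper class is not shown on the slides; a minimal sketch consistent with how it is used above (equality is based on the word text so words.contains() matches existing entries, and the natural ordering puts the highest frequency first so PriorityQueue.poll() returns the most frequent remaining word):

public class Word implements Comparable<Word> {

    public String value;
    public int frequency;

    public Word(String value, int frequency) {
        this.value = value;
        this.frequency = frequency;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Word && value.equals(((Word) other).value);
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }

    @Override
    public int compareTo(Word other) {
        // Descending by frequency, so the head of the PriorityQueue is the top result.
        return other.frequency - this.frequency;
    }
}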
User-defined Counters • Write an enum: public enum MyCounters { GOOD_RECORDS, BAD_RECORDS } • Use getCounter to increment a counter: context.getCounter(MyCounters.GOOD_RECORDS).increment(1);
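Counter values can also be read back in the driver after the job finishes; a short sketch, assuming the job object from the WordCountJob driver:

// Retrieve the final value of a user-defined counter once the job completes.
long goodRecords = job.getCounters()
                      .findCounter(MyCounters.GOOD_RECORDS)
                      .getValue();
System.out.println("Good records: " + goodRecords);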
Lab 3.1: Using a Combiner Lab 3.2: Computing an Average
Diagram: the Partitioner determines which records get sent to which Reducer. The Mapper's output on one DataNode is divided by the Partitioner among the Reducers running on the other DataNodes.
Diagram: 1. The Mapper outputs <key,value> pairs (e.g. <key1,value>, <key6,value>, <key2,value>, ...). 2. Each <key,value> pair is passed to the Partitioner's getPartition() method. 3. The Partitioner returns an int between 0 (inclusive) and the number of Reducers (exclusive), which selects the target Reducer (Reducer 0, 1, 2, or 3).
The Default Partitioner

public class HashPartitioner<K, V> extends Partitioner<K, V> {

    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
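A custom Partitioner follows the same shape; a minimal sketch (the class name and routing rule here are hypothetical, not from the course) that routes records by the first character of the key, together with how it would be registered on the job:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Keys that start with the same character always go to the same Reducer.
        return (key.charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
    }
}

In the driver:

job.setPartitionerClass(FirstLetterPartitioner.class);
job.setNumReduceTasks(4);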