Learn how to develop Hadoop applications using Java with this comprehensive Hortonworks University training course for developers. Gain knowledge and skills in various topics such as MapReduce, HBase, Pig, and Hive programming.
Developing Hadoop Applications in Java: A Hortonworks University Hadoop Training Course for Developers
Introductions • Your name • Job responsibilities • Previous Hadoop experience (if any) • What brought you here
Course Outline • Day 1 • Unit 1: Understanding Hadoop and MapReduce • Unit 2: Writing MapReduce Applications • Unit 3: Map Aggregation • Day 2 • Unit 4: Partitioning and Sorting • Unit 5: Input and Output Formats • Day 3 • Unit 6: Optimizing MapReduce Jobs • Unit 7: Advanced MapReduce Features • Unit 8: Unit Testing • Unit 9: Defining Workflow • Day 4 • Unit 10: HBase Programming • Unit 11: Pig Programming • Unit 12: Hive Programming
Who is Developing Apache Hadoop? • Hortonworks has the largest PMC and committer base of any single organization • The project is governed by the Apache bylaws • See http://hadoop.apache.org/who.html (as of 5/2012)
Balancing Innovation & Stability • Be aggressive: ship early and often • Be predictable: ship when stable
Features of Hadoop • Hadoop = HDFS + MapReduce, plus ecosystem projects such as Hive, Pig, HBase, HCatalog, Mahout, and ZooKeeper
Hortonworks Data Platform: a Fully Supported, Integrated Platform • Challenge: integrating, managing, and supporting changes across a wide range of open source Hadoop projects is time-intensive, complex, and expensive • Solution: Hortonworks Data Platform • Integrated, certified platform distributions • Extensive QA process • Industry-leading support with clear service levels for updates and patches • Multi-year support and maintenance policy • Technical guidance support for Universe and Multiverse components • Platform components: Hadoop Core, Pig, ZooKeeper, Hive, HCatalog, HBase
The Hadoop Distributed File System • NameNode • The “master” node of HDFS • Determines and maintains how the chunks of data are distributed across the DataNodes • DataNode • Stores the chunks of data, and is responsible for replicating the chunks across other DataNodes
Diagram: putting Big Data into HDFS. The data is broken into chunks that are distributed across the DataNodes (DataNode 1, 2, and 3), with the NameNode tracking where each chunk lives; the DataNodes then replicate the chunks among themselves.
The JobTracker and TaskTrackers • JobTracker • the “master” daemon of the TaskTrackers • clients submit MapReduce jobs to the JobTracker • distributes the tasks to available TaskTrackers • TaskTracker • runs on DataNodes • executes the actual map and reduce tasks
Diagram: MapReduce job execution. 1. The client submits a job to the JobTracker. 2. The JobTracker distributes tasks to TaskTracker 1, TaskTracker 2, and TaskTracker 3 based on availability and where the data resides. 3. Each TaskTracker spawns a JVM to execute its task. 4. The TaskTrackers send task status back to the JobTracker.
Job Schedulers • Fair Scheduler • all jobs get, on average, an equal share of resources over time • Capacity Scheduler • jobs are submitted to queues, and queues are allocated a fraction of the total resource capacity • Use mapred.jobtracker.taskScheduler to configure the scheduler (a configuration sketch follows below)
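A minimal sketch of selecting a scheduler, assuming a Hadoop 1.x (MRv1) cluster; in practice this property is normally set in mapred-site.xml on the JobTracker host rather than in client code:

import org.apache.hadoop.conf.Configuration;

// Illustrative only: point the JobTracker at the Fair Scheduler implementation.
Configuration conf = new Configuration();
conf.set("mapred.jobtracker.taskScheduler",
         "org.apache.hadoop.mapred.FairScheduler");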
Hadoop Modes • Diagram: in a fully-distributed cluster, the NameNode, Secondary NameNode, and JobTracker each run on their own machine, while each remaining machine in the cluster runs a DataNode/TaskTracker pair.
HDFS Filesystem Commands • hadoop fs -ls counties • hadoop fs -lsr counties • hadoop fs -mkdir population_data • hadoop fs -put data/*.txt population_data/ • hadoop fs -cat population_data/population_1.txt
The HDFS API

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
Path dir = new Path("results");
FileSystem fs = FileSystem.get(conf);
if (!fs.exists(dir)) {
    dir.getFileSystem(conf).mkdirs(dir);
}
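A short sketch of copying a local file into HDFS with the same API, along the lines of what Lab 1.2 asks for (the local and HDFS paths here are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Copy a file from the local filesystem into the HDFS directory created above.
fs.copyFromLocalFile(new Path("data/population_1.txt"),
                     new Path("results/population_1.txt"));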
Lab 1.1: Configuring a Hadoop Development Environment Lab 1.2: Putting Files in HDFS with Java
Diagram: the MapReduce pipeline. In the map phase, a Mapper runs on each DataNode (DataNode 1, 2, and 3); during the shuffle/sort, the map output is shuffled across the network and sorted; in the reduce phase, Reducers run on a subset of the DataNodes.
Diagram: inside a map task on a DataNode. The InputFormat reads the input split and generates <k1,v1> pairs; the map method outputs <k2,v2> pairs into the MapOutputBuffer; records are sorted and spilled to disk when the buffer reaches a threshold; the spill files are then merged into a single file, which becomes the Reducer's input (Mapper output = Reducer input).
Diagram: the reduce side. 1. The Reducer fetches the map output (Mapper output = Reducer input) from each DataNode. 2. The fetched records go into an in-memory buffer that spills to disk as spill files. 3. The spill files are merged into a single input. 4. The Reducer processes the merged input. 5. The Reducer writes its results to HDFS.
The Key/Value Pairs of MapReduce. The Mapper consumes <K1,V1> pairs and emits <K2,V2> pairs; the shuffle/sort groups them into <K2, (V2,V2,V2,V2)>; the Reducer consumes those groups and emits <K3,V3> pairs.
The MapReduce API • Develop Java MapReduce applications using the org.apache.hadoop packages • Prior to Hadoop 0.20: the old API • org.apache.hadoop.mapred package • As of Hadoop 0.20: the new API • org.apache.hadoop.mapreduce package
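As a quick illustration of the difference, the same class name lives in two different packages:

// Old API (prior to Hadoop 0.20) - Mapper is an interface here
import org.apache.hadoop.mapred.Mapper;
// New API (Hadoop 0.20 and later) - Mapper is a class, used throughout this course
import org.apache.hadoop.mapreduce.Mapper;
// (a real source file would import only one of these)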
WordCountMapper

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String currentLine = value.toString();
        String[] words = currentLine.split(" ");
        for (String word : words) {
            Text outputKey = new Text(word);
            context.write(outputKey, new IntWritable(1));
        }
    }
}
WordCountReducer

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        IntWritable outputValue = new IntWritable(sum);
        context.write(key, outputValue);
    }
}
WordCountJob

Job job = new Job(getConf(), "WordCountJob");
Configuration conf = job.getConfiguration();
job.setJarByClass(getClass());

Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

return job.waitForCompletion(true) ? 0 : 1;
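The getConf() and getClass() calls above suggest this code lives inside a driver that follows the Tool/ToolRunner pattern; a minimal sketch of such a wrapper (the class name WordCountJob is carried over from the slide):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ...the job configuration shown above goes here...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCountJob(), args);
        System.exit(exitCode);
    }
}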
Running a MapReduce Job • To run a job, perform the following steps: • Put the input files into HDFS. • If the output directory exists, delete it. • Use hadoop to execute the job. • View the output files. • hadoop jar wordcount.jar WordCountJob input/file.txt result
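Putting those steps together for the word count job, assuming the file and directory names used in the example command above (with the default TextOutputFormat, the results land in files named part-r-00000 and so on under the output directory):

hadoop fs -mkdir input
hadoop fs -put file.txt input/
hadoop fs -rmr result        # only needed if a previous run left this directory behind
hadoop jar wordcount.jar WordCountJob input/file.txt result
hadoop fs -cat result/part-r-00000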
Lab 2.1: Word Count Lab 2.2: Distributed Grep Lab 2.3: Inverted Index
Without aggregation: the Mapper simply outputs every word without performing any computations (e.g. <"by",1> <"the",1> <"people",1> <"for",1> <"the",1> <"people",1> <"of",1> <"the",1> <"people",1>), and the Reducer processes a large number of records fetched over HTTP across the network. With aggregation: the Mapper combines records in a manner that does not affect the algorithm (e.g. <"by",1> <"the",3> <"people",3> <"for",1> <"of",1>), so the expensive network traffic is decreased.
Overview of Combiners. Diagram: 1. When the MapOutputBuffer is full, a spill to disk occurs. 2. The Combiner is invoked on the <k2,v2> records in an attempt to reduce file I/O before each spill file is written. 3. The result is fewer records output by the Mapper (Mapper output = Reducer input).
Details of a Combiner. Diagram: 1. When the MapOutputBuffer is full, a spill to disk occurs. 2. If a Combiner is used, the <k2,v2> output is first sent to lists in memory, with one list for each key. 3. After a certain number of <key,value> pairs have been written to the lists, the lists are sent to the Combiner. 4. The combined records are then spilled to disk as a spill file.
Reduce-side Combining. Diagram: the Combiner is also used in the reduce phase when merging the intermediate <key,value> pairs fetched from different Mappers, as the in-memory buffer and spill files are merged into the Reducer's input; the Reducer then writes its output to HDFS.
Example of a Combiner

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outputValue = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        outputValue.set(sum);
        context.write(key, outputValue);
    }
}
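A Combiner is wired into a job in the driver; assuming the WordCountJob driver shown earlier, a one-line sketch:

// The Combiner runs on map output before it is shuffled to the Reducers.
job.setCombinerClass(WordCountCombiner.class);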
In-Map Aggregation • The Mapper combines records as they are being processed • The Mapper stores the partial results in memory • If there are many distinct keys and holding them all in memory is prohibitive, in-map aggregation may not work for you (a sketch follows below)
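A minimal sketch of in-map aggregation for word count, assuming the counts for all distinct words fit in the Mapper's memory (this is an illustration, not the TopResultsMapper example that follows):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapAggregationMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Partial counts are accumulated here instead of being written once per word.
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split(" ")) {
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one record per distinct word after all input has been processed.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }
}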
Example: TopResultsMapper. Diagram: the Mapper reads text such as “We the People of the United States, in Order to form a more perfect union...” and accumulates word counts in an ArrayList (e.g. "We",1; "the",2; "People",1; "of",1; ...). After the entire input is processed, the list is converted to a PriorityQueue ordered by frequency (e.g. "the",726; "of",493; "shall",293; ...), and only the top 10 results are sent to the Reducer.
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] input = StringUtils.split(value.toString(), '\\', ' ');
    for (String word : input) {
        Word currentWord = new Word(word, 1);
        if (words.contains(currentWord)) {
            // increment the existing Word's frequency
            for (Word w : words) {
                if (w.equals(currentWord)) {
                    w.frequency++;
                    break;
                }
            }
        } else {
            words.add(currentWord);
        }
    }
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    Text outputKey = new Text();
    IntWritable outputValue = new IntWritable();
    queue = new PriorityQueue<Word>(words.size());
    queue.addAll(words);
    for (int i = 1; i <= maxResults; i++) {
        Word tail = queue.poll();
        if (tail != null) {
            outputKey.set(tail.value);
            outputValue.set(tail.frequency);
            context.write(outputKey, outputValue);
        }
    }
}
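The Word helper class is not shown on the slides; a minimal sketch consistent with how it is used above (equality is based on the word text so words.contains() matches existing entries, and the natural ordering puts the highest frequency first so PriorityQueue.poll() returns the most frequent remaining word):

public class Word implements Comparable<Word> {

    public String value;
    public int frequency;

    public Word(String value, int frequency) {
        this.value = value;
        this.frequency = frequency;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Word && value.equals(((Word) other).value);
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }

    @Override
    public int compareTo(Word other) {
        // Descending by frequency, so the head of the PriorityQueue is the top result.
        return other.frequency - this.frequency;
    }
}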
User-defined Counters • Write an enum: public enum MyCounters { GOOD_RECORDS, BAD_RECORDS } • Use getCounter to increment a counter: context.getCounter(MyCounters.GOOD_RECORDS).increment(1);
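Counter values can also be read back in the driver after the job finishes; a short sketch, assuming the job object from the WordCountJob driver:

// Retrieve the final value of a user-defined counter once the job completes.
long goodRecords = job.getCounters()
                      .findCounter(MyCounters.GOOD_RECORDS)
                      .getValue();
System.out.println("Good records: " + goodRecords);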
Lab 3.1: Using a Combiner Lab 3.2: Computing an Average
Diagram: the Partitioner determines which records get sent to which Reducer. The Mapper's output on one DataNode is divided by the Partitioner among the Reducers running on the other DataNodes.
Diagram: 1. The Mapper outputs <key,value> pairs (e.g. <key1,value>, <key6,value>, <key2,value>, ...). 2. Each <key,value> pair is passed to the Partitioner's getPartition() method. 3. The Partitioner returns an int between 0 (inclusive) and the number of Reducers (exclusive), which selects the target Reducer (Reducer 0, 1, 2, or 3).
The Default Partitioner

public class HashPartitioner<K, V> extends Partitioner<K, V> {

    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
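A custom Partitioner follows the same shape; a minimal sketch (the class name and routing rule here are hypothetical, not from the course) that routes records by the first character of the key, together with how it would be registered on the job:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Keys that start with the same character always go to the same Reducer.
        return (key.charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
    }
}

In the driver:

job.setPartitionerClass(FirstLetterPartitioner.class);
job.setNumReduceTasks(4);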