
Introduction to Hadoop



  1. Introduction to Hadoop Prabhaker Mateti

  2. ACK • Thanks to all the authors who left their slides on the Web. • I own the errors, of course.

  3. What Is Hadoop? • Distributed computing framework • For clusters of computers • Thousands of compute nodes • Petabytes of data • Open source, Java • Google’s MapReduce inspired Yahoo’s Hadoop. • Now part of the Apache group

  4. What Is Hadoop? • The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes: • Hadoop Common utilities • Avro: A data serialization system with integration for scripting languages. • Chukwa: A data collection system for managing large distributed systems. • HBase: A scalable, distributed database for large tables. • HDFS: A distributed file system. • Hive: Data summarization and ad hoc querying. • MapReduce: Distributed processing on compute clusters. • Pig: A high-level data-flow language for parallel computation. • ZooKeeper: A coordination service for distributed applications.

  5. The Idea of Map Reduce

  6. Map and Reduce • The ideas of Map and Reduce are 40+ years old • Present in all functional programming languages. • See, e.g., APL, Lisp, and ML • An alternate name for Map: Apply-All • Higher-order functions • take function definitions as arguments, or • return a function as output • Map and Reduce are higher-order functions.

  7. Map: A Higher Order Function • F(x: int) returns r: int • Let V be an array of integers. • W = map(F, V) • W[i] = F(V[i]) for all i • i.e., apply F to every element of V

  8. Map Examples in Haskell • map (+1) [1,2,3,4,5] == [2, 3, 4, 5, 6] • map toLower "abcDEFG12!@#" == "abcdefg12!@#" • map (`mod` 3) [1..10] == [1, 2, 0, 1, 2, 0, 1, 2, 0, 1]

  9. reduce: A Higher Order Function • reduce also known as fold, accumulate, compress or inject • Reduce/fold takes in a function and folds it in between the elements of a list.

  10. Fold-Left in Haskell • Definition • foldl f z [] = z • foldl f z (x:xs) = foldl f (f z x) xs • Examples • foldl (+) 0 [1..5] == 15 • foldl (+) 10 [1..5] == 25 • foldl div 7 [34,56,12,4,23] == 0

  11. Fold-Right in Haskell • Definition • foldr f z [] = z • foldr f z (x:xs) = f x (foldr f z xs) • Example • foldr div 7 [34,56,12,4,23] == 8

  12. Examples of the Map Reduce Idea

  13. Word Count Example • Read text files and count how often words occur. • The input is text files • The output is a text file • each line: word, tab, count • Map: Produce pairs of (word, count) • Reduce: For each word, sum up the counts. • E.g., for the input "to be or not to be", Map produces (to,1), (be,1), (or,1), (not,1), (to,1), (be,1), and Reduce emits be 2, not 1, or 1, to 2.

  14. Grep Example • Search input files for a given pattern • Map: emits a line if the pattern is matched • Reduce: copies results to output (see the Mapper sketch below)
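A hedged sketch of the Map side (not Hadoop's bundled Grep example), in the same old mapred API as the wordCount code later in these slides; the grep.pattern configuration key and the class name are made up for illustration:

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Emits each input line that matches the pattern; the value is empty.
    public class GrepMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {
      private Pattern pattern;

      public void configure(JobConf job) {
        // Assumption: the pattern is passed in via the job configuration.
        pattern = Pattern.compile(job.get("grep.pattern"));
      }

      public void map(LongWritable offset, Text line,
          OutputCollector<Text, NullWritable> output, Reporter reporter)
          throws IOException {
        if (pattern.matcher(line.toString()).find()) {
          output.collect(line, NullWritable.get());
        }
      }
    }

Pairing this with the stock IdentityReducer gives the "Reduce copies results to output" behavior described above.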

  15. Inverted Index Example • Generate an inverted index of words from a given set of files • Map: parses a document and emits <word, docId> pairs • Reduce: takes all pairs for a given word, sorts the docId values, and emits a <word, list(docId)> pair (a sketch of both phases follows)
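A minimal sketch of both phases, again in the old mapred API and written as the nested classes a driver like WordCount would hold. Class names are illustrative, and using the map.input.file property to recover the document name is an assumption that holds for file-based splits:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.SortedSet;
    import java.util.StringTokenizer;
    import java.util.TreeSet;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Map: tokenizes a document and emits <word, docId> pairs.
    public static class IndexMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private Text word = new Text();
      private Text docId = new Text();

      public void configure(JobConf job) {
        // Assumption: map.input.file names the file behind this task's split.
        docId.set(job.get("map.input.file"));
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, docId);
        }
      }
    }

    // Reduce: sorts and de-duplicates the docIds seen for each word.
    public static class IndexReduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text word, Iterator<Text> docIds,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        SortedSet<String> ids = new TreeSet<String>();
        while (docIds.hasNext()) {
          ids.add(docIds.next().toString());
        }
        output.collect(word, new Text(ids.toString()));
      }
    }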

  16. Map/Reduce Implementation Idea

  17. Execution on Clusters • Input files are split (M splits) • Master and workers are assigned • Map tasks run • Intermediate data is written to disk (R regions) • Intermediate data is read and sorted • Reduce tasks run • Results are returned

  18. Map/Reduce Cluster Implementation • [Diagram: input files (split 0 … split 4) feed M map tasks, which write intermediate files; R reduce tasks read them and write output files (Output 0, Output 1)] • Several map or reduce tasks can run on a single computer • Each intermediate file is divided into R partitions by a partitioning function • Each reduce task corresponds to one partition

  19. Execution

  20. Fault Recovery • Workers are pinged by the master periodically • Non-responsive workers are marked as failed • All tasks in-progress or completed by a failed worker become eligible for rescheduling • The master could periodically checkpoint its state • Current implementations abort on master failure

  21. Component Overview

  22. http://hadoop.apache.org/ • Open source, Java • Scales to thousands of nodes and petabytes of data • 27 December 2011: release 1.0.0 • but already used by many

  23. Hadoop • MapReduce and Distributed File System framework for large commodity clusters • Master/Slave relationship • JobTracker handles all scheduling & data flow between TaskTrackers • TaskTracker handles all worker tasks on a node • Individual worker task runs map or reduce operation • Integrates with HDFS for data locality

  24. Hadoop Supported File Systems • HDFS: Hadoop's own file system. • Amazon S3 file system. • Targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure • Not rack-aware • CloudStore • previously Kosmos Distributed File System • like HDFS, this is rack-aware. • FTP Filesystem • stored on remote FTP servers. • Read-only HTTP and HTTPS file systems.

  25. "Rack awareness" • optimization which takes into account the geographic clustering of servers • network traffic between servers in different geographic clusters is minimized.

  26. HDFS: Hadoop Distributed File System • Designed to scale to petabytes of storage, and to run on top of the file systems of the underlying OS. • The master (“NameNode”) handles replication, deletion, and creation • Slaves (“DataNodes”) handle data retrieval • Files are stored in many blocks • Each block has a block Id • A block Id is associated with several nodes, hostname:port (depending on the level of replication)
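A minimal client-side sketch, assuming a reachable cluster whose default file system is HDFS (configured via the usual site files); the NameNode/DataNode machinery above stays invisible behind the FileSystem API:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints an HDFS file to stdout. The NameNode maps the file to block Ids
    // and DataNodes; the client only sees a stream.
    public class HdfsCat {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // loads core-site.xml, etc.
        FileSystem fs = FileSystem.get(conf);      // the configured default FS
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path(args[0]))));
        for (String line; (line = in.readLine()) != null; ) {
          System.out.println(line);
        }
        in.close();
      }
    }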

  27. Hadoop v. ‘MapReduce’ • MapReduce is also the name of a framework developed by Google • Hadoop was initially developed by Yahoo and is now part of the Apache group. • Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.

  28. MapReduce v. Hadoop

  29. wordCount: A Simple Hadoop Example • http://wiki.apache.org/hadoop/WordCount

  30. Word Count Example • Read text files and count how often words occur. • The input is text files • The output is a text file • each line: word, tab, count • Map: Produce pairs of (word, count) • Reduce: For each word, sum up the counts.

  31. WordCount Overview

     3  import ...
    12  public class WordCount {
    13
    14    public static class Map extends MapReduceBase implements Mapper ... {
    17
    18      public void map ...
    26    }
    27
    28    public static class Reduce extends MapReduceBase implements Reducer ... {
    29
    30      public void reduce ...
    37    }
    38
    39    public static void main(String[] args) throws Exception {
    40      JobConf conf = new JobConf(WordCount.class);
    41      ...
    53      FileInputFormat.setInputPaths(conf, new Path(args[0]));
    54      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    55
    56      JobClient.runJob(conf);
    57    }
    58
    59  }

  32. wordCount Mapper

    14  public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
    15    private final static IntWritable one = new IntWritable(1);
    16    private Text word = new Text();
    17
    18    public void map(LongWritable key, Text value,
              OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
    19      String line = value.toString();
    20      StringTokenizer tokenizer = new StringTokenizer(line);
    21      while (tokenizer.hasMoreTokens()) {
    22        word.set(tokenizer.nextToken());
    23        output.collect(word, one);
    24      }
    25    }
    26  }

  33. wordCount Reducer

    28  public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
    29
    30    public void reduce(Text key, Iterator<IntWritable> values,
              OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
    31      int sum = 0;
    32      while (values.hasNext()) {
    33        sum += values.next().get();
    34      }
    35      output.collect(key, new IntWritable(sum));
    36    }
    37  }

  34. wordCount JobConf

    40  JobConf conf = new JobConf(WordCount.class);
    41  conf.setJobName("wordcount");
    42
    43  conf.setOutputKeyClass(Text.class);
    44  conf.setOutputValueClass(IntWritable.class);
    45
    46  conf.setMapperClass(Map.class);
    47  conf.setCombinerClass(Reduce.class);
    48  conf.setReducerClass(Reduce.class);
    49
    50  conf.setInputFormat(TextInputFormat.class);
    51  conf.setOutputFormat(TextOutputFormat.class);

  35. WordCount main

    39  public static void main(String[] args) throws Exception {
    40    JobConf conf = new JobConf(WordCount.class);
    41    conf.setJobName("wordcount");
    42
    43    conf.setOutputKeyClass(Text.class);
    44    conf.setOutputValueClass(IntWritable.class);
    45
    46    conf.setMapperClass(Map.class);
    47    conf.setCombinerClass(Reduce.class);
    48    conf.setReducerClass(Reduce.class);
    49
    50    conf.setInputFormat(TextInputFormat.class);
    51    conf.setOutputFormat(TextOutputFormat.class);
    52
    53    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    54    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    55
    56    JobClient.runJob(conf);
    57  }

  36. Invocation of wordcount • /usr/local/bin/hadoop dfs -mkdir <hdfs-dir> • /usr/local/bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir> • /usr/local/bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>

  37. Mechanics of Programming Hadoop Jobs

  38. Job Launch: Client • Client program creates a JobConf • Identify classes implementing Mapper and Reducer interfaces • setMapperClass(), setReducerClass() • Specify inputs, outputs • setInputPath(), setOutputPath() • Optionally, other options too: • setNumReduceTasks(), setOutputFormat()…

  39. Job Launch: JobClient • Pass the JobConf to • JobClient.runJob() // blocks • JobClient.submitJob() // does not block (see the polling sketch below) • JobClient: • Determines the proper division of input into InputSplits • Sends job data to the master JobTracker server
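A hedged sketch of the non-blocking path, assuming conf is a JobConf built as in the wordCount example; runJob() effectively performs this polling for you:

    // Old mapred API; fragment of a driver's main() that throws Exception.
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);  // returns immediately
    while (!job.isComplete()) {               // poll the JobTracker
      Thread.sleep(5000);
    }
    if (!job.isSuccessful()) {
      System.err.println("Job " + job.getID() + " failed");
    }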

  40. Job Launch: JobTracker • JobTracker: • Inserts jar and JobConf (serialized to XML) in shared location • Posts a JobInProgress to its run queue

  41. Job Launch: TaskTracker • TaskTrackers running on slave nodes periodically query JobTracker for work • Retrieve job-specific jar and config • Launch task in separate instance of Java • main() is provided by Hadoop

  42. Job Launch: Task • TaskTracker.Child.main(): • Sets up the child TaskInProgress attempt • Reads XML configuration • Connects back to necessary MapReduce components via RPC • Uses TaskRunner to launch user process

  43. Job Launch: TaskRunner • TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch Mapper • Task knows ahead of time which InputSplits it should be mapping • Calls Mapper once for each record retrieved from the InputSplit • Running the Reducer is much the same

  44. Creating the Mapper • Your instance of Mapper should extend MapReduceBase • One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress • Exists in separate process from all other instances of Mapper – no data sharing!

  45. Mapper • void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)

  46. What is Writable? • Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc. • All values are instances of Writable • All keys are instances of WritableComparable (a custom Writable sketch follows)
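A hypothetical box class of one's own, to show the interface; only the two methods below are required for values (keys would implement WritableComparable and add compareTo()):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Illustrative value type; Hadoop serializes it with write()/readFields()
    // rather than Java serialization.
    public class PointWritable implements Writable {
      private int x, y;

      public PointWritable() { }  // no-arg constructor: instantiated by reflection

      public void set(int x, int y) { this.x = x; this.y = y; }

      public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
      }

      public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
      }
    }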

  47. Writing For Cache Coherency

    while (more input exists) {
        myIntermediate = new intermediate(input);
        myIntermediate.process();
        export outputs;
    }

  48. Writing For Cache Coherency

    myIntermediate = new intermediate(junk);
    while (more input exists) {
        myIntermediate.setupState(input);
        myIntermediate.process();
        export outputs;
    }

  49. Writing For Cache Coherency • Running the GC takes time • Reusing locations allows better cache usage • Speedup can be as much as two-fold • All serializable types must be Writable anyway, so make use of the interface • The wordCount Mapper above already follows this pattern: its word and one objects are allocated once and reused on every call to map()

  50. Getting Data To The Mapper
