Hadoop
Textbooks
• 湯秉翰 (2013), 雲端網頁程式設計-Google App Engine應用實作 (2nd ed.), 博碩文化, ISBN 978-986-201-824-8 (catalog no. PG31356)
• 鍾葉青、鍾武君 (2013), 雲端計算, 東華書局, ISBN 978-986-157-903-0 (catalog no. CL009)
• 許清榮、林奇暻、買大誠 (2012), 掌握Hadoop翱翔雲端-Windoop應用實作指南, 博碩文化, ISBN 978-986-201-673-2 (catalog no. PG21241)
• 鍾葉青、李冠憬、許慶賢、賴冠州 (2011), 雲端程式設計: 入門與應用實務, 東華書局, ISBN 978-986-157-812-5 (catalog no. CL008)
Outline
• Introduction to Hadoop
• HDFS
• MapReduce Programming Model
• HBase
Hadoop
• An Apache project
• A distributed computing platform
• A software framework
• Well suited to processing massive amounts of data
The Hadoop stack (figure): Cloud Applications at the top, MapReduce and HBase in the middle, the Hadoop Distributed File System (HDFS) beneath them, all running on a cluster of machines.
History (2002-2004)
• Founder: Doug Cutting
• Lucene
  • A high-performance, full-text search engine library written in pure Java
  • Builds an inverted index
• Nutch
  • Built on the Lucene library
  • Web-search software
History (the turning point)
• Nutch ran into storage problems
• Google published its search-engine papers
  • SOSP 2003: "The Google File System"
  • OSDI 2004: "MapReduce: Simplified Data Processing on Large Clusters"
  • OSDI 2006: "Bigtable: A Distributed Storage System for Structured Data"
History (2004-now)
• Doug Cutting drew on the papers Google had published
  • Implemented GFS & MapReduce inside Nutch
• As of Nutch 0.8, Hadoop became an independent project
• Yahoo! hired Doug Cutting to build its web search engine
  • Nutch DFS → Hadoop Distributed File System (HDFS)
Hadoop's features
• Efficiency
  • Processes data in parallel across data nodes
• Robustness
  • Automatically maintains multiple copies of data and automatically re-deploys computing tasks on failures
• Cost efficiency
  • Distributes the data and processing across clusters of commodity computers
• Scalability
  • Reliably stores and processes massive data
HDFS
• Introduction to HDFS
• HDFS operations
• Development environment
What is HDFS?
• Hadoop Distributed File System
• Modeled on the Google File System
• A distributed file system suited to analyzing massive amounts of data
• Built on commodity hardware, with fault tolerance
• The storage layer of the Hadoop stack shown earlier
HDFS architecture (figure): a NameNode manages the file system namespace; DataNodes store the blocks, and each block is replicated across machines in Rack #1 and Rack #2.
HDFS client block diagram (figure): on the client computer, an HDFS-aware application sits above both the POSIX API and the HDFS API; the regular VFS serves local and NFS-supported files, while a separate HDFS view goes through HDFS-specific drivers and the network stack to the HDFS NameNode and DataNodes.
HDFS operations
• Shell commands
• HDFS common APIs
Example
• In <HADOOP_HOME>/
  • bin/hadoop fs -ls
    • Lists the contents of the directory at the given HDFS path
  • ls
    • Lists the contents of the directory at the given local file system path
HDFS Common APIs
• Configuration
• FileSystem
• Path
• FSDataInputStream
• FSDataOutputStream
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class HelloHDFS {

  public static final String theFilename = "hello.txt";
  public static final String message = "Hello HDFS!\n";

  public static void main(String[] args) throws IOException {

    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);

    Path filenamePath = new Path(theFilename);

    try {
      if (hdfs.exists(filenamePath)) {
        // remove the file first
        hdfs.delete(filenamePath, true);
      }

      FSDataOutputStream out = hdfs.create(filenamePath);
      out.writeUTF(message);
      out.close();

      FSDataInputStream in = hdfs.open(filenamePath);
      String messageIn = in.readUTF();
      System.out.print(messageIn);
      in.close();
    } catch (IOException ioe) {
      System.err.println("IOException during operation: " + ioe.toString());
      System.exit(1);
    }
  }
}

Note: FSDataOutputStream extends the java.io.DataOutputStream class, and FSDataInputStream extends the java.io.DataInputStream class.
Configuration
• Provides access to configuration parameters.
• Configuration conf = new Configuration()
  • A new configuration.
• … = new Configuration(Configuration other)
  • A new configuration with the same settings cloned from another.
• Methods: e.g., get(name), set(name, value) (see the sketch below)
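A minimal sketch of Configuration in use; the property key fs.default.name is the default-file-system key in Hadoop 0.20 (later versions renamed it fs.defaultFS), so treat the exact key and URI as assumptions for your setup:

import org.apache.hadoop.conf.Configuration;

public class ConfExample {
  public static void main(String[] args) {
    // Loads core-default.xml and core-site.xml from the classpath.
    Configuration conf = new Configuration();

    // Override a parameter programmatically (key and URI are illustrative).
    conf.set("fs.default.name", "hdfs://localhost:9000");

    // Read it back; the second argument is a fallback default.
    System.out.println(conf.get("fs.default.name", "file:///"));
  }
}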
FileSystem
• An abstract base class for a fairly generic file system.
• Ex:
  Configuration conf = new Configuration();
  FileSystem hdfs = FileSystem.get(conf);
• Methods: e.g., exists(), create(), open(), delete() (see the sketch below)
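A short sketch of common FileSystem calls; the directory name demo is made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsOps {
  public static void main(String[] args) throws Exception {
    FileSystem hdfs = FileSystem.get(new Configuration());

    Path dir = new Path("demo");  // hypothetical directory
    if (!hdfs.exists(dir)) {
      hdfs.mkdirs(dir);           // like 'hadoop fs -mkdir demo'
    }

    // List the directory, like 'hadoop fs -ls demo'.
    for (FileStatus status : hdfs.listStatus(dir)) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }

    hdfs.delete(dir, true);       // recursive delete
  }
}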
Path
• Names a file or directory in a FileSystem.
• Ex:
  Path filenamePath = new Path("hello.txt");
• Methods: e.g., getName(), getParent() (see the sketch below)
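A small sketch of Path manipulation; the paths are illustrative:

import org.apache.hadoop.fs.Path;

public class PathExample {
  public static void main(String[] args) {
    Path p = new Path("/user/hadoop/input/hello.txt");

    System.out.println(p.getName());    // hello.txt
    System.out.println(p.getParent());  // /user/hadoop/input

    // A Path can also be composed from a parent and a child name.
    Path q = new Path(p.getParent(), "other.txt");
    System.out.println(q);              // /user/hadoop/input/other.txt
  }
}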
FSDataInputStream
• Utility that wraps an FSInputStream in a DataInputStream and buffers input through a BufferedInputStream.
• Inherits from java.io.DataInputStream
• Ex:
  FSDataInputStream in = hdfs.open(filenamePath);
FSDataInputStream
• Methods: e.g., seek(pos) and getPos(), plus the read methods inherited from DataInputStream (readUTF(), readInt(), …); see the sketch below
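A minimal sketch of random-access reads with FSDataInputStream, reusing the hello.txt file created by HelloHDFS above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem hdfs = FileSystem.get(new Configuration());

    FSDataInputStream in = hdfs.open(new Path("hello.txt"));
    String first = in.readUTF();  // read the string written earlier

    in.seek(0);                   // unlike a plain stream, we can rewind
    String again = in.readUTF();  // and read the same bytes a second time

    System.out.println(first.equals(again));  // true
    in.close();
  }
}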
FSDataOutputStream
• Utility that wraps an OutputStream in a DataOutputStream, buffers output through a BufferedOutputStream, and creates a checksum file.
• Inherits from java.io.DataOutputStream
• Ex:
  FSDataOutputStream out = hdfs.create(filenamePath);
FSDataOutputStream
• Methods: e.g., getPos() and close(), plus the write methods inherited from DataOutputStream (writeUTF(), writeInt(), …); see the sketch below
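A matching write sketch; create() overwrites any existing file, and the file name follows the HelloHDFS example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    FileSystem hdfs = FileSystem.get(new Configuration());

    FSDataOutputStream out = hdfs.create(new Path("hello.txt"));
    out.writeUTF("Hello HDFS!\n");     // inherited from DataOutputStream
    System.out.println(out.getPos());  // bytes written so far
    out.close();                       // flush and finalize the file
  }
}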
Development environment
• A Linux environment
  • On a physical or virtual machine
  • Ubuntu 10.04
• Hadoop environment
  • See the Hadoop setup guide
  • user/group: hadoop/hadoop
  • Single or multiple node(s); the latter is preferred.
• Eclipse 3.7M2a with the hadoop-0.20.2 plugin
MapReduce
• Introduction to MapReduce
• Sample Code
• Program Prototype
• Programming using Eclipse
• Lab Requirement
What is MapReduce?
• A programming model for expressing distributed computations at a massive scale
• A patented software framework introduced by Google
  • Processes 20 petabytes of data per day
• Popularized by the open-source Hadoop project
  • Used at Yahoo!, Facebook, Amazon, …
Nodes, Trackers, Tasks
• JobTracker
  • Runs on the master node
  • Accepts job requests from clients
• TaskTracker
  • Runs on slave nodes
  • Forks a separate Java process for each task instance
Example - Wordcount (figure): three mappers read the input lines "Hello Cloud", "Hello TA", and "TA cool cool" and emit a <word, 1> pair per token; after the sort/copy and merge phases, one reducer receives <Hello, [1 1]> and <TA, [1 1]> while the other receives <Cloud, [1]> and <cool, [1 1]>, producing the final output <Hello, 2>, <TA, 2>, <Cloud, 1>, <cool, 2>.
MapReduce
• Introduction to MapReduce
• Sample Code
• Program Prototype
• Programming using Eclipse
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(wordcount.class);
  job.setMapperClass(mymapper.class);
  job.setCombinerClass(myreducer.class);
  job.setReducerClass(myreducer.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Mapper (cont.) (figure): for the input line "Hi Cloud TA say Hi" (read from /user/hadoop/input/hi), value.toString() yields the line, the StringTokenizer walks its tokens, and the while loop emits <Hi, 1>, <Cloud, 1>, <TA, 1>, <say, 1>, <Hi, 1>.
Reducer

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Reducer (cont.) (figure): the reducer receives the grouped pairs <Hi, [1 1]>, <Cloud, [1]>, <TA, [1]>, <say, [1]> and writes out <Hi, 2>, <Cloud, 1>, <TA, 1>, <say, 1>.
MapReduce術語 • Job • A "full program" - an execution of a Mapper and Reducer across a data set • Task • An execution of a Mapper or a Reducer on a slice of data • Task Attempt • A particular instance of an attempt to execute a task on a machine
Main Class

class MR {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "job name");
    job.setJarByClass(thisMainClass.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Job
• Identify the classes implementing the Mapper and Reducer interfaces
  • Job.setMapperClass(), setReducerClass()
• Specify inputs, outputs
  • FileInputFormat.addInputPath()
  • FileOutputFormat.setOutputPath()
• Optionally, other options too (see the sketch below):
  • Job.setNumReduceTasks(),
  • Job.setOutputFormatClass()…
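A sketch of these optional settings applied to the wordcount job above; mymapper and myreducer come from the sample code, while the reduce-task count of 2 is an arbitrary illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordcountWithOptions {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordcountWithOptions.class);
    job.setMapperClass(mymapper.class);    // from the sample code
    job.setReducerClass(myreducer.class);  // from the sample code
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setNumReduceTasks(2);  // optional: two reducers instead of one
    job.setOutputFormatClass(TextOutputFormat.class);  // optional: the default, set explicitly

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}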
Class Mapper • Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Maps input key/value pairs to a set of intermediate key/value pairs. • Ex:
Class Mapper

class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
  // global variables
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // local variables
    ....
    context.write(key', value');  // emit the transformed (key, value) pair
  }
}

The first two type parameters (Object, Text) are the input key/value classes; the last two (Text, IntWritable) are the output key/value classes.
Text, IntWritable, LongWritable, …
• Hadoop defines its own "box" classes
  • Strings: Text
  • Integers: IntWritable
  • Longs: LongWritable
• Any (WritableComparable, Writable) pair can be sent to the reducer
  • All keys are instances of WritableComparable
  • All values are instances of Writable
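A quick sketch of the box classes in action; the values are arbitrary:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class BoxClasses {
  public static void main(String[] args) {
    Text word = new Text("cloud");                // boxes a String
    IntWritable count = new IntWritable(7);       // boxes an int
    LongWritable offset = new LongWritable(42L);  // boxes a long

    System.out.println(word.toString());  // unbox: "cloud"
    System.out.println(count.get());      // unbox: 7
    System.out.println(offset.get());     // unbox: 42

    // WritableComparable keys can be compared, as during the sort phase.
    System.out.println(word.compareTo(new Text("hadoop")));  // negative
  }
}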
Mappers • Upper-case Mapper • Ex: let map(k, v) = emit(k.toUpper(), v.toUpper()) • ("foo", "bar") → ("FOO", "BAR") • ("Foo", "other") → ("FOO", "OTHER") • ("key2", "data") → ("KEY2", "DATA") • Explode Mapper • let map(k, v) = for each char c in v: emit(k, c) • ("A", "cats") → ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s") • ("B", "hi") → ("B", "h"), ("B", "i") • Filter Mapper • let map(k, v) = if (isPrime(v)) then emit(k, v) • ("foo", 7) → ("foo", 7) • ("test", 10) → (nothing)
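As a concrete version of the Upper-case Mapper, a minimal sketch; it assumes an input format that delivers Text key/value pairs:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Upper-case Mapper: map(k, v) = emit(k.toUpper(), v.toUpper())
public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
  public void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(key.toString().toUpperCase()),
                  new Text(value.toString().toUpperCase()));
  }
}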
Class Reducer • Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Reduces a set of intermediate values which share a key to a smaller set of values. • Ex:
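A minimal concrete example in the same shape as the myreducer sample earlier: a summing reducer, with an illustrative class name:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all values that share a key, e.g. for wordcount.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}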