
A Hadoop Overview



  1. A Hadoop Overview

  2. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

  3. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

  4. Progress • The Hadoop setup has been completed. • Version 0.19.0, running in standalone mode. • The HBase setup has been completed. • Version 0.19.3, running without the assistance of HDFS. • A simple demonstration of MapReduce. • A simple word count program.

  5. Testing Platform • Fedora 10 • JDK 1.6.0_18 • Hadoop 0.19.0 • HBase 0.19.3 • One can connect to the machine using PieTTY or PuTTY. • Host: 140.112.90.180 • Account: labuser • Password: robot3233 • Port: 3385 (using an SSH connection)

  6. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

  7. MapReduce • A computing framework consisting of a map phase, a shuffle phase, and a reduce phase. • The map function and the reduce function are provided by the user. • Key-Value Pair (KVP) • map is invoked once for each input KVP, and may output any number of KVPs. • reduce is invoked once for each key and its corresponding values, and may output any number of KVPs.
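  A worked trace of this contract (the input line and counts are invented for illustration): feeding the line "hello world hello" to a word count map yields (hello, 1), (world, 1), (hello, 1); the shuffle phase groups these by key into (hello, [1, 1]) and (world, [1]); reduce then sums each list and outputs (hello, 2) and (world, 1).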

  8. MapReduce (cont.)

  9. What does the user have to do? • Specify the input/output format • Specify the output key/value type • Specify the input/output location • Specify the mapper/reducer class • Specify the number of reduce tasks • Specify the partitioner class (discussed later)

  10. What does the user have to do? (cont.) • Specify the input/output format • An input/output format is a class that translates between raw data and KVPs. • It has to inherit from the InputFormat/OutputFormat class. • The input format is required. • The most common choices are the KeyValueTextInputFormat class and the SequenceFileInputFormat class. • The output format is optional; the default is the TextOutputFormat class.

  11. What does the user have to do? (cont.) • Specify the output key/value type • The KVP type output by the reducer. • The key type has to implement the WritableComparable interface. • The value type has to implement the Writable interface. • Specify the input/output location • The directories for the input files/output files. • The input directory should exist and contain at least one file. • The output directory should not already exist.

  12. What does the user have to do? (cont.) • Specify the mapper/reducer class • The two classes should extend the MapReduceBase class. • The map/reduce class should implement the Mapper<K1, V1, K2, V2>/Reducer<K1, V1, K2, V2> interface. • Specify the number of reduce tasks • Usually approximately the number of compute nodes. • 1 if we want a single output file. • 0 if we don't need the reduce phase. • Note that in this case the result will not be sorted. • The reducer class is not required in this case.

  13. Map Phase Configuration

  14. Reduce Phase Configuration

  15. MapReduceIntro.java

  public class MapReduceIntro {
    protected static Logger logger = Logger.getLogger(MapReduceIntro.class);

    public static void main(final String[] args) {
      try {
        // Initial configuration
        final JobConf conf = new JobConf(MapReduceIntro.class);
        conf.set("hadoop.tmp.dir", "/tmp");

        // Map phase configuration
        conf.setInputFormat(KeyValueTextInputFormat.class);
        FileInputFormat.setInputPaths(conf, MapReduceIntroConfig.getInputDirectory());
        conf.setMapperClass(IdentityMapper.class);

        // Reduce phase configuration
        FileOutputFormat.setOutputPath(conf, MapReduceIntroConfig.getOutputDirectory());
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setNumReduceTasks(1);
        conf.setReducerClass(IdentityReducer.class);

        // Job running
        final RunningJob job = JobClient.runJob(conf);
        if (!job.isSuccessful()) {
          logger.error("The job failed.");
          System.exit(1);
        }
        System.exit(0);
      } catch (final IOException e) {
        // JobClient.runJob throws a checked IOException
        logger.error("The job failed.", e);
        System.exit(1);
      }
    }
  }

  16. IdentityMapper.java

  public class IdentityMapper<K, V> extends MapReduceBase
      implements Mapper<K, V, K, V> {  // input types K, V; output types K, V

    public void map(K key, V val, OutputCollector<K, V> output,
        Reporter reporter)  // Reporter is discussed later
        throws IOException {
      // Collect the output KVPs
      output.collect(key, val);
    }
  }

  17. IdentityReducer.java

  public class IdentityReducer<K, V> extends MapReduceBase
      implements Reducer<K, V, K, V> {

    public void reduce(K key, Iterator<V> values,
        OutputCollector<K, V> output, Reporter reporter) throws IOException {
      // Note: the input value is an Iterator<V>!
      while (values.hasNext()) {
        output.collect(key, values.next());
      }
    }
  }

  18. Compiling • Use the default Java compiler. • Note that we have to supply the -classpath parameter so that the compiler can find the Hadoop core library (the hadoop-0.19.0-core.jar entry) and the other class files needed (the . entry). • $ javac -classpath $HADOOP_HOME/hadoop-0.19.0-core.jar:. -d . MyClass.java

  19. Creating a jar file • To create an executable jar file: • Create a file "manifest.mf":

  Main-Class: myclass
  Class-Path: MyExample.jar

  • Main-Class names the driver class. Note the white space after each colon, the return carriage ending the file, and that Class-Path is a white-space-separated list. • Type the command: • $ jar -cmf manifest.mf MyExample.jar <list of classes> • The wildcard character * is also accepted.

  20. Run the jar file • Use the hadoop command: • $ hadoop jar MyExample.jar <param list> • Remember that the output path should not exist. • If the path exists, remove it first with the rm -r path command.

  21. A Simple Demonstration • A simple word count program.
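  The deck does not reproduce the word count source, so here is a minimal sketch of what such a program could look like with the Hadoop 0.19 mapred API used in the earlier slides; the class names and job name are illustrative, not taken from the demonstration itself.

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCount {
    // Mapper: emits (word, 1) for every token in the input line
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, ONE);
        }
      }
    }

    // Reducer: sums the counts collected for each word
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      conf.setReducerClass(Reduce.class);
      conf.setInputFormat(TextInputFormat.class);    // plain text lines in
      conf.setOutputFormat(TextOutputFormat.class);  // "word<TAB>count" lines out
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }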

  22. Reporter
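  The Reporter handle passed to map() and reduce() lets a task report liveness, status, and statistics back to the framework. A minimal sketch of typical use inside a map() body (the enum and its counter name are hypothetical):

  // Custom counters are grouped by enum; this one is illustrative.
  public enum Counters { MALFORMED_RECORDS }

  reporter.incrCounter(Counters.MALFORMED_RECORDS, 1);  // bump a custom counter
  reporter.setStatus("processing " + key);              // human-readable task status
  reporter.progress();                                  // keep a long-running task from timing out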

  23. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

  24. Hadoop • Full name: the Apache Hadoop project. • An open source implementation of reliable, scalable distributed computing. • An aggregation of the following subprojects (built around the Hadoop core): • Avro • Chukwa • HBase • HDFS • Hive • MapReduce • Pig • ZooKeeper

  25. Virtual Machine (VM) • Virtualization • All services are delivered through VMs. • Allows for dynamic configuration and management. • There can be multiple VMs running on a single commodity machine. • VMware

  26. HDFS (Hadoop Distributed File System) • The highly scalable distributed file system of Hadoop. • Resembles the Google File System (GFS). • Provides reliability through replication. • NameNode & DataNode • NameNode • Maintains file system metadata and namespace. • Provides management and control services. • Usually one instance. • DataNode • Provides data storage and retrieval services. • Usually several instances.

  27. MapReduce • The sophisticated distributed computing service of Hadoop. • A computation framework. • Usually resides on HDFS. • JobTracker & TaskTracker • JobTracker • Manages the distribution of tasks to the TaskTrackers. • Provides job monitoring and control, and the submission of jobs. • TaskTracker • Runs individual map or reduce tasks on a compute node.

  28. Cluster Makeup • A Hadoop cluster is usually made up of: • Real machines. • Not required to be homogeneous. • Homogeneity helps maintainability. • Server processes. • Multiple processes can run on a single VM. • Master & Slave • The node/machine running the JobTracker or NameNode is a master node. • The ones running the TaskTracker or DataNode are slave nodes.

  29. Cluster Makeup (cont.)

  30. Administrator Scripts • The administrator can use the following script files to start or stop server processes. • They are located in $HADOOP_HOME/bin. • start-all.sh / stop-all.sh • start-mapred.sh / stop-mapred.sh • start-dfs.sh / stop-dfs.sh • slaves.sh • hadoop • The cluster-wide scripts find the machines to contact by reading host lists from the conf directory, as sketched below.
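  A minimal sketch of conf/slaves, the file slaves.sh and the start-*/stop-* scripts read to ssh to each worker (the hostnames are illustrative):

  # conf/slaves — one slave hostname per line
  slave1
  slave2
  slave3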

  31. Configuration • By default, each Hadoop Core server loads its configuration from several files. • These files are located in $HADOOP_HOME/conf. • Usually identical copies of those files are maintained on every machine in the cluster.
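  A minimal sketch of the site-specific configuration file (hadoop-site.xml in the 0.19 line; the hostnames, ports, and replication value are illustrative):

  <configuration>
    <property>
      <name>fs.default.name</name>        <!-- URI of the NameNode -->
      <value>hdfs://master:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>     <!-- host:port of the JobTracker -->
      <value>master:9001</value>
    </property>
    <property>
      <name>dfs.replication</name>        <!-- HDFS block replication factor -->
      <value>3</value>
    </property>
  </configuration>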

  32. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

  33. HBase • The Hadoop scalable distributed database. • Resembles Google BigTable. • Not a relational database. • Resides in HDFS. • Master & RegionServer • Master • For bootstrapping and RegionServer recovery. • Assigns regions to RegionServers. • RegionServer • Holds 0 or more regions. • Responsible for data transactions.

  34. HBase (cont.)

  35. Row, Column, Timestamp • A data cell is the intersection of an individual row key and a column. • Cells store an uninterpreted array of bytes. • Cell data is versioned by timestamp.

  36. Row • The row (key) is the primary key of the database. • It can consist of an arbitrary byte array. • Strings, binary data. • Each row key has to be distinct. • The table is sorted by row key. • Any mutation of a single row is atomic.

  37. Column/Column Family • Columns are grouped into families, which share a common prefix. • Ex: temperature:air and temperature:dew_point • The prefix has to be a printable string. • The column name can also be an arbitrary byte array. • Column family members can be dynamically added or dropped. • Column families must be pre-specified in the table schema. • HBase is in fact a column-family-oriented store. • Members of the same column family are stored together in the file system.
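  An illustrative view of this model using the temperature family from the slide above (the row key, timestamps, and values are invented for the example):

  row key     column                  timestamp  value
  station42   temperature:air         t2         "27.5"
  station42   temperature:air         t1         "26.9"   (older version, kept by timestamp)
  station42   temperature:dew_point   t2         "16.2"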

  38. Region • A table is automatically horizontally partitioned into regions. • That is, a region is a subset of the data rows. • Regions are stored in separate RegionServers. • A region is defined by its first row, its last row, and a randomly generated identifier. • The partitioning is completed by the master automatically.

  39. Administrator Scripts • The administrator can use the following script files to start or stop server processes. • They are located in $HBASE_INSTALL/bin. • start-hbase.sh / stop-hbase.sh • hbase • hbase shell to initiate a command-line interface. • hbase master / hbase regionserver

  40. HBase shell command line • Type the command help to get information. • create 'table', 'column family1', 'column family2', … • put 'table', 'row', 'column', 'value' • get 'table', 'row', {COLUMN=>…} • alter 'table', {NAME=>'...'} • To modify a table schema, we have to disable the table first! • scan 'table' • disable 'table' • drop 'table' • To drop a table, we have to disable it first!! • list

  41. A Simple Demonstration • Command-line operation.
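  A possible session for this demonstration, built only from the shell commands listed on the previous slide (the table, family, and values are illustrative):

  $ hbase shell
  hbase> create 'test', 'cf'
  hbase> put 'test', 'row1', 'cf:a', 'value1'
  hbase> get 'test', 'row1'
  hbase> scan 'test'
  hbase> disable 'test'
  hbase> drop 'test'
  hbase> list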

  42. Operations • Create a table (and its schema) • Shell • create 'table', 'cf1', 'cf2', … • create 'table', {NAME=>'cf1'}, {NAME=>'cf2'}, … • API

  HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
  HTableDescriptor table = new HTableDescriptor("table");
  table.addFamily(new HColumnDescriptor("cf1:"));
  table.addFamily(new HColumnDescriptor("cf2:"));
  admin.createTable(table);

  43. Operations (cont.) • Modify a table (and its schema) • Shell • alter 'table', {NAME=>'cf', KEY=>'value', …} • API • Note that there will be exceptions if the table is not disabled.

  HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
  admin.modifyColumn("table", "cf", new HColumnDescriptor(…));
  admin.modifyTable(new HTableDescriptor(…));

  44. Operations (cont.) • Write data • Shell • put 'table', 'row', 'cf:name', 'value', ts • API

  HTable table = new HTable("table");
  BatchUpdate update = new BatchUpdate("row");
  update.put("cf:name", "value");
  table.commit(update);

  45. Operations (cont.) • Retrieve data • Shell • get 'table', 'row', {COLUMN=>'cf:name', …} • API

  HTable table = new HTable("table");
  RowResult row = table.getRow("row");
  Cell data = table.get("row", "cf:name");

  • If we don't know a priori which rows to retrieve, we can use a Scanner object instead. • Scanner scanner = table.getScanner("cf:name");

  46. Operations (cont.) • Delete a cell • Shell • delete 'table', 'row', 'cf:name' • API

  HTable table = new HTable("table");
  BatchUpdate update = new BatchUpdate("row");
  update.delete("cf:name");
  table.commit(update);

  47. Operations (cont.) • Enable/Disable a table • Shell • enable/disable 'table' • API

  HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
  admin.disableTable("table");
  admin.enableTable("table");

  48. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

  49. Q & A • Hadoop 0.19.0 API • http://hadoop.apache.org/common/docs/r0.19.0/api/index.html • HBase 0.19.3 API • http://hadoop.apache.org/hbase/docs/r0.19.3/api/index.html • Any questions?
