A Hadoop Overview
Outline • Progress Report • MapReduce Programming • Hadoop Cluster Overview • HBase Overview • Q & A
Progress • Hadoop setup has been completed. • Version 0.19.0, running in standalone mode. • HBase setup has been completed. • Version 0.19.3, running without HDFS support. • A simple demonstration of MapReduce. • A simple word count program.
Testing Platform • Fedora 10 • JDK 1.6.0_18 • Hadoop-0.19.0 • HBase-0.19.3 • One can connect to the machine using PieTTY or PuTTY. • Host: 140.112.90.180 • Account: labuser • Password: robot3233 • Port: 3385 (SSH connection)
Outline • Progress Report • MapReduce Programming • Hadoop Cluster Overview • HBase Overview • Q & A
MapReduce • A computing framework consisting of a map phase, a shuffle phase and a reduce phase. • The map function and the reduce function are provided by the user. • Key-Value Pair (KVP) • map is invoked once for each input KVP and may output any number of KVPs. • reduce is invoked once for each key, together with all of that key's values, and may output any number of KVPs.
What does the user have to do? • Specify the input/output format • Specify the output key/value type • Specify the input/output location • Specify the mapper/reducer class • Specify the number of reduce tasks • Specify the partitioner class (discussed later)
What does the user have to do? (cont.) • Specify the input/output format • An "input/output format" is a class that translates between raw data and KVPs. • It has to inherit from the InputFormat/OutputFormat class. • The input format is required. • The most common choices are the KeyValueTextInputFormat class and the SequenceFileInputFormat class. • The output format is optional; the default is the TextOutputFormat class.
What does the user have to do? (cont.) • Specify the output key/value type • The KVP type output by the reducer. • The key type has to implement the WritableComparable interface. • The value type has to implement the Writable interface. • Specify the input/output location • The directories for input files/output files. • The input directory should exist and contain at least one file. • The output directory must not already exist.
What does the user have to do? (cont.) • Specify the mapper/reducer class • The two classes should extend the MapReduceBase class. • The mapper/reducer class should implement the Mapper<K1, V1, K2, V2>/Reducer<K1, V1, K2, V2> interface. • Specify the number of reduce tasks • Usually about the number of computing nodes. • 1 if we want a single output file. • 0 if we do not need the reduce phase. • Note that in this case the result will not be sorted. • The reducer class is not required in this case.
MapReduceIntro.java
public class MapReduceIntro {
  protected static Logger logger = Logger.getLogger(MapReduceIntro.class);

  public static void main(final String[] args) {
    try {
      // Initial configuration
      final JobConf conf = new JobConf(MapReduceIntro.class);
      conf.set("hadoop.tmp.dir", "/tmp");

      // Map phase configuration
      conf.setInputFormat(KeyValueTextInputFormat.class);
      FileInputFormat.setInputPaths(conf, MapReduceIntroConfig.getInputDirectory());
      conf.setMapperClass(IdentityMapper.class);

      // Reduce phase configuration
      FileOutputFormat.setOutputPath(conf, MapReduceIntroConfig.getOutputDirectory());
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      conf.setNumReduceTasks(1);
      conf.setReducerClass(IdentityReducer.class);

      // Job running
      final RunningJob job = JobClient.runJob(conf);
      if (!job.isSuccessful()) {
        logger.error("The job failed.");
        System.exit(1);
      }
      System.exit(0);
    } catch (final IOException e) {
      logger.error("The job failed.", e);
      System.exit(1);
    }
  }
}
IdentityMapper.java
// Mapper<input key, input value, output key, output value>
public class IdentityMapper<K, V> extends MapReduceBase
    implements Mapper<K, V, K, V> {

  public void map(K key, V val,
      OutputCollector<K, V> output,   // collects the output KVPs
      Reporter reporter)              // discussed later
      throws IOException {
    output.collect(key, val);
  }
}
IdentityReducer.java
public class IdentityReducer<K, V> extends MapReduceBase
    implements Reducer<K, V, K, V> {

  // Note: the input values come as an Iterator<V>!
  public void reduce(K key, Iterator<V> values,
      OutputCollector<K, V> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}
Compiling • Using the default java compiler. • Note that we have to supply the -classpath parameter so that the compiler can find the Hadoop core libraries and the other classes needed. • $ javac -classpath $HADOOP_HOME/hadoop-0.19.0-core.jar:. -d . MyClass.java • $HADOOP_HOME/hadoop-0.19.0-core.jar is the Hadoop core library; the trailing "." is the location of the other class files.
Creating a jar file • To create an executable jar file: • Create a file manifest.mf with the following content (note the white space after each colon, that Class-Path is a whitespace-separated list, and that the file must end with a carriage return):
Main-Class: myclass
Class-Path: MyExample.jar
• myclass is the driver class. • Type the command: • $ jar -cfm MyExample.jar manifest.mf <list of classes> • The wildcard character * is also accepted.
Running the jar file • Use the hadoop command: • $ hadoop jar MyExample.jar <param list> • Remember that the output path must not exist. • If the path exists, remove it first with rm -r <path>.
A simple demonstration A simple word count program.
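The demo code itself is not reproduced in these slides; below is a minimal word-count sketch written against the same 0.19-era org.apache.hadoop.mapred API as the listings above. The class names and tokenizing details are illustrative assumptions, not the exact demo code.
// Word-count sketch (each class in its own .java file; imports from java.io,
// java.util, org.apache.hadoop.io and org.apache.hadoop.mapred).
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Split the input line into tokens and emit (word, 1) for each token.
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, ONE);
    }
  }
}

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Sum the counts emitted by the mappers for this word.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
A driver for it looks like MapReduceIntro.java above, using these two classes and Text / IntWritable as the output key/value types.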
Outline • Progress Report • MapReduce Programming • Hadoop Cluster Overview • HBase Overview • Q & A
Hadoop • The full name is the Apache Hadoop project. • An open-source framework for reliable, scalable, distributed computing. • An umbrella for the following subprojects (including its core): • Avro • Chukwa • HBase • HDFS • Hive • MapReduce • Pig • ZooKeeper
Virtual Machine (VM) • Virtualization • All services are delivered through VMs. • Allows for dynamic configuration and management. • Multiple VMs can run on a single commodity machine. • VMware
HDFS (Hadoop Distributed File System) • The highly scalable distributed file system of Hadoop. • Resembles the Google File System (GFS). • Provides reliability through replication. • NameNode & DataNode • NameNode • Maintains the file system metadata and namespace. • Provides management and control services. • Usually one instance per cluster. • DataNode • Provides data storage and retrieval services. • Usually several instances.
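As a rough illustration of how a client program reads and writes files on HDFS, the sketch below uses the org.apache.hadoop.fs.FileSystem API; the path /user/labuser/demo.txt is a made-up example, and error handling is omitted.
// Minimal HDFS client sketch (imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.*).
Configuration conf = new Configuration();        // picks up the cluster configuration files
FileSystem fs = FileSystem.get(conf);            // metadata operations go to the NameNode
Path file = new Path("/user/labuser/demo.txt");  // example path only

FSDataOutputStream out = fs.create(file);        // file data is stored on the DataNodes
out.writeUTF("hello hdfs");
out.close();

FSDataInputStream in = fs.open(file);
System.out.println(in.readUTF());
in.close();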
MapReduce • The sophisticated distributed computing service of Hadoop. • A computation framework. • Usually resides on HDFS. • JobTracker & TaskTracker • JobTracker • Manages the distribution of tasks to the TaskTrackers. • Provides job monitoring and control, and handles the submission of jobs. • TaskTracker • Runs individual map or reduce tasks on a compute node.
Cluster Makeup • A Hadoop cluster is usually made up of: • Real machines. • Not required to be homogeneous. • Homogeneity helps maintainability. • Server processes. • Multiple processes can run on a single VM. • Master & Slave • The node/machine running the JobTracker or NameNode is the master node. • The ones running the TaskTracker or DataNode are the slave nodes.
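For example, the slave machines are usually listed in conf/slaves, one hostname per line; the start/stop scripts on the master (next slide) use this list to launch the DataNode and TaskTracker processes remotely over SSH. The hostnames below are made up.
$HADOOP_HOME/conf/slaves (example):
slave01
slave02
slave03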
Administrator Scripts • Administrators can use the following script files to start or stop the server processes. • They are located in $HADOOP_HOME/bin. • start-all.sh / stop-all.sh • start-mapred.sh / stop-mapred.sh • start-dfs.sh / stop-dfs.sh • slaves.sh • hadoop
Configuration • By default, each Hadoop Core server loads its configuration from several files. • These files are located in $HADOOP_HOME/conf. • Usually identical copies of these files are maintained on every machine in the cluster.
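A minimal sketch of the site-specific file hadoop-site.xml for a small cluster follows. The host name master01 and the port numbers are placeholders; the property names (fs.default.name, mapred.job.tracker, dfs.replication) are the 0.19-era ones.
<?xml version="1.0"?>
<!-- $HADOOP_HOME/conf/hadoop-site.xml : site-specific overrides (example values only) -->
<configuration>
  <property>
    <name>fs.default.name</name>          <!-- URI of the NameNode -->
    <value>hdfs://master01:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>       <!-- host:port of the JobTracker -->
    <value>master01:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>          <!-- number of block replicas -->
    <value>3</value>
  </property>
</configuration>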
Outline • Progress Report • MapReduce Programming • Hadoop Cluster Overview • HBase Overview • Q & A
HBase • The scalable distributed database of Hadoop. • Resembles Google BigTable. • Not a relational database. • Resides on HDFS. • Master & RegionServer • Master • Responsible for bootstrapping and for recovering RegionServers. • Assigns regions to RegionServers. • RegionServer • Holds zero or more regions. • Responsible for data transactions.
Row, Column, Timestamp • A data cell is the intersection of an individual row key and a column. • A cell stores an uninterpreted array of bytes. • Cell data is versioned by timestamp.
Row • The row (key) is the primary key of the database. • It can be an arbitrary byte array: strings, binary data. • Each row key has to be distinct. • The table is sorted by row key. • Any mutation of a single row is atomic.
Column/Column Family • Columns are grouped into families, which share a common prefix. • Ex: temperature:air and temperature:dew_point • The prefix has to be a printable string. • The column name can also be an arbitrary byte array. • Column family members can be dynamically added or dropped. • Column families must be specified in advance as part of the table schema. • HBase is in fact column-family-oriented storage. • Members of the same column family are stored together in the file system.
Region • The table is automatically partitioned horizontally into regions. • That is, a region is a subset of the data rows. • Regions are hosted on separate RegionServers. • A region is defined by its first row, its last row, and a randomly generated identifier. • The partitioning is carried out by the master automatically.
Administrator Scripts • Administrators can use the following script files to start or stop the server processes. • They are located in $HBASE_INSTALL/bin. • start-hbase.sh / stop-hbase.sh • hbase • hbase shell starts a command-line interface. • hbase master / hbase regionserver
HBase shell command line • Type the command help to get information. • create 'table', 'column family1', 'column family2', … • put 'table', 'row', 'column', 'value' • get 'table', 'row', {COLUMN=>…} • alter 'table', {NAME=>'...'} • To modify a table schema, we have to disable the table first! • scan 'table' • disable 'table' • drop 'table' • To drop a table, we have to disable it first! • list
A Simple Demonstration Command line operation
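A possible transcript of such a command-line session follows; the table name, row keys and values are invented for the example.
hbase(main):001:0> create 'scores', 'course'
hbase(main):002:0> put 'scores', 'student1', 'course:math', '90'
hbase(main):003:0> put 'scores', 'student1', 'course:art', '85'
hbase(main):004:0> get 'scores', 'student1'
hbase(main):005:0> scan 'scores'
hbase(main):006:0> disable 'scores'
hbase(main):007:0> drop 'scores'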
Operations • Create a table (and its schema) • Shell • create 'table', 'cf1', 'cf2', … • create 'table', {NAME=>'cf1'}, {NAME=>'cf2'}, … • API
HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
HTableDescriptor table = new HTableDescriptor("table");
table.addFamily(new HColumnDescriptor("cf1:"));
table.addFamily(new HColumnDescriptor("cf2:"));
admin.createTable(table);
Operations (cont.) • Modify a table (and its schema) • Shell • alter 'table', {NAME=>'cf', KEY=>'value', …} • API • Note that there will be exceptions if the table is not disabled.
HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
admin.modifyColumn("table", "cf", new HColumnDescriptor(…));
admin.modifyTable(new HTableDescriptor(…));
Operations (cont.) • Write data • Shell • put 'table', 'row', 'cf:name', 'value', ts • API
HTable table = new HTable("table");
BatchUpdate update = new BatchUpdate("row");
update.put("cf:name", Bytes.toBytes("value"));   // cell values are byte arrays
table.commit(update);
Operations (cont.) • Retrieve data • Shell • get 'table', 'row', {COLUMN=>'cf:name', …} • API
HTable table = new HTable("table");
RowResult row = table.getRow("row");
Cell data = table.get("row", "cf:name");
• If we do not know in advance which rows to retrieve, we can use a Scanner object instead. • Scanner scanner = table.getScanner("cf:name");
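A rough sketch of iterating over a whole table with a Scanner follows. It assumes the 0.19-era client API (HTable, Scanner, RowResult, Cell, and the Bytes utility); the column cf:name is the example column from above, and the exact method overloads may differ.
// Scan every row and print its cf:name cell (sketch only; error handling trimmed).
HTable table = new HTable("table");
Scanner scanner = table.getScanner(new String[] { "cf:name" });
try {
  RowResult row;
  while ((row = scanner.next()) != null) {                  // null means the scan is done
    Cell cell = row.get(Bytes.toBytes("cf:name"));
    String value = (cell == null) ? "<no value>" : Bytes.toString(cell.getValue());
    System.out.println(Bytes.toString(row.getRow()) + " => " + value);
  }
} finally {
  scanner.close();                                          // always release the scanner
}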
Operations (cont.) • Delete a cell • Shell • delete 'table', 'row', 'cf:name' • API
HTable table = new HTable("table");
BatchUpdate update = new BatchUpdate("row");
update.delete("cf:name");
table.commit(update);
Operations (cont.) • Enable/Disable a table • Shell • enable/disable 'table' • API
HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
admin.disableTable("table");
admin.enableTable("table");
Outline • Progress Report • MapReduce Programming • Hadoop Cluster Overview • HBase Overview • Q & A
Q & A • Hadoop 0.19.0 API • http://hadoop.apache.org/common/docs/r0.19.0/api/index.html • HBase 0.19.3 API • http://hadoop.apache.org/hbase/docs/r0.19.3/api/index.html • Any questions?