Introduction to Apache Hadoop

CSCI 572: Information Retrieval and Search Engines, Summer 2010

Outline
• What is Hadoop?
• Where did it come from?
• What are the current versions of Hadoop?
• What can it do?

Apache Hadoop
• The brainchild of Doug Cutting
• Built out by brilliant engineers and contributors from Yahoo, Facebook, Cloudera, and other companies
• Started in 2006 when code was spun out of Nutch; became a top-level Apache project in 2008
• Has grown into a very large project at Apache with a significant ecosystem

How to get started
• Hadoop (0.20.0/0.20.2)
• Put your Java hat on
• Go here:
  • http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
• If you want to do this on Windows, get Cygwin, or use VMware or another way to run Linux
• Run the Map Reduce examples in local mode
• Check on the data generated in your HDFS
• Scaling it out
  • Amazon Elastic MapReduce
  • Setting it up on your own cluster: DataNodes and TaskTrackers/JobTracker

Basic Operations
• Listing files
  • ./bin/hadoop fs -ls
• Writing files
  • ./bin/hadoop fs -put
• Running Map Reduce Jobs
  • mkdir input
  • cp conf/*.xml input
  • ./bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
  • cat output/*
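To picture what the grep job computes, here is a minimal sketch in plain Python, not Hadoop code. It assumes the job emits every regex match with a count of 1 in the map step and sums the counts per match in the reduce step. The function names and sample lines are made up for illustration, and the sketch skips distribution across nodes and any sorting of the final output.

```python
import re
from collections import defaultdict

def map_grep(line, pattern):
    """Map step: emit (match, 1) for every regex match in one input line."""
    for match in re.findall(pattern, line):
        yield match, 1

def reduce_sum(key, values):
    """Reduce step: add up the counts collected for one key."""
    return key, sum(values)

def run_grep(lines, pattern):
    """Run map, group values by key (the shuffle), then reduce, all in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_grep(line, pattern):
            grouped[key].append(value)
    return dict(reduce_sum(key, values) for key, values in grouped.items())

# Hypothetical lines standing in for the copied conf/*.xml files.
sample_lines = [
    "<name>dfs.replication</name>",
    "<name>dfs.name.dir</name>",
    "<description>dfs.replication sets the block copy count</description>",
]
grep_counts = run_grep(sample_lines, r"dfs[a-z.]+")
print(grep_counts)
```

The pattern is the same one passed on the command line; each distinct match becomes one key in the result.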

Advanced Topics
• Writing your Mappers and Reducers
• Check out the Map Reduce Tutorial here:
  • http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html
  • Code for several examples including Word Count
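Before reading the Java code in the tutorial, it can help to see the shape of Word Count without any Hadoop classes. The sketch below is an in-memory Python simulation, not the tutorial's code. The names wc_map, wc_reduce, and word_count are assumptions chosen for illustration. The mapper turns one line into (word, 1) pairs, the framework's shuffle is imitated by grouping pairs by word, and the reducer sums each group.

```python
from collections import defaultdict

def wc_map(line):
    """Mapper: split a line on whitespace and emit (word, 1) per token."""
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    """Reducer: total the counts that arrived for one word."""
    return word, sum(counts)

def word_count(lines):
    """Simulate map -> shuffle -> reduce for a small list of lines."""
    shuffled = defaultdict(list)
    for line in lines:
        for word, one in wc_map(line):
            shuffled[word].append(one)  # shuffle: group values by key
    # Reducers see keys in sorted order, so iterate that way here too.
    return dict(wc_reduce(word, shuffled[word]) for word in sorted(shuffled))

totals = word_count(["hello world", "hello hadoop"])
print(totals)
```

In a real job the mapper and reducer run on different machines and only the grouped pairs travel between them, which is why each function here looks at nothing but its own arguments.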

Other Hadoop ecosystem projects
• HBase
  • Column-oriented store modeled on Google's Bigtable
• Hive
  • Built at Facebook, provides a SQL-like interface to data in HDFS
• Chukwa
  • Log collection and processing
• Pig
  • High-level data analysis language (Pig Latin) on top of M/R and HDFS
• ZooKeeper
  • Coordination service for distributed systems

No releases in a while
• Stick with 0.20.x

Wrapup
• Lots more information at
  • http://hadoop.apache.org
  • http://hadoop.apache.org/mapreduce/
  • http://hadoop.apache.org/hdfs/
• Project ideas
  • Implement a GIS or geometric algorithm in Map Reduce
  • Write a REST interface to control HDFS and M/R
  • Add new Writable input data formats
  • Integrate Solr and Hadoop

Acknowledgements
• Material inspired by talks and discussions on the Apache mailing lists for Hadoop and by discussions with the rest of the Hadoop community
