Hadoop Tutorial
Jian Wang
Based on "Meet Hadoop! Open Source Grid Computing" by Devaraj Das, Yahoo! Inc. Bangalore & Apache Software Foundation
Why should we use Hadoop?
• Need to process 10TB datasets
• On 1 node: scanning @ 50MB/s = 2.3 days
• On a 1000-node cluster: scanning @ 50MB/s = 3.3 min
• Need an efficient, reliable and usable framework
• Google File System (GFS) paper
• Google's MapReduce paper
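As a quick sanity check on those numbers, a short Python sketch of the arithmetic (names are illustrative):

# Scan-time estimate for the 10TB example above
DATASET_BYTES = 10e12   # 10 TB
RATE_PER_NODE = 50e6    # 50 MB/s per node

def scan_seconds(nodes):
    return DATASET_BYTES / (RATE_PER_NODE * nodes)

print(scan_seconds(1) / 86400)    # ~2.3 days on a single node
print(scan_seconds(1000) / 60)    # ~3.3 minutes on 1000 nodes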
HDFS - Hadoop Distributed FS
• Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
• Files are divided into large blocks (64MB) and distributed across the cluster
• Blocks are replicated to handle hardware failure; the default replication factor is 3 (configurable)
• HDFS cannot be directly mounted by an existing operating system
• Once you use the DFS (put something in it), relative paths resolve from /user/{your usr id}. E.g. if your id is jwang30, your "home dir" is /user/jwang30
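A short hypothetical session illustrating the relative-path rule, assuming the jwang30 id from the slide:

bin/hadoop dfs -mkdir example
bin/hadoop dfs -ls /user/jwang30

The first command creates /user/jwang30/example, which the second command then lists by its absolute path.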
Hadoop Architecture
• Master-Slave architecture
• Master node (irkm-1) runs the HDFS Namenode and the MapReduce Jobtracker
• The Jobtracker accepts MR jobs submitted by users, assigns Map and Reduce tasks to Tasktrackers, monitors task and tasktracker status, and re-executes tasks upon failure
• Slave nodes (irkm-1 to irkm-6) run HDFS Datanodes and MapReduce Tasktrackers
• Tasktrackers run Map and Reduce tasks upon instruction from the Jobtracker, and manage storage and transmission of intermediate output
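The start scripts find the slave machines through the conf/slaves file; a minimal sketch, assuming the six irkm hosts above (one hostname per line):

irkm-1
irkm-2
irkm-3
irkm-4
irkm-5
irkm-6

bin/start-all.sh reads this file and starts a Datanode and a Tasktracker on each listed host.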
Hadoop Paths
• Hadoop is locally "installed" on each machine
• Version 0.19.2
• Installed location is /home/tmp/hadoop
• Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)
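"Configurable" here refers to the hadoop.tmp.dir property; a sketch of the relevant entry in conf/hadoop-site.xml, shown with its default value:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>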
Format Namenode
• If it is the first time that you use it, you need to format the namenode:
• - log in to irkm-1
• - cd /home/tmp/hadoop
• - bin/hadoop namenode -format
• Most commands follow the same pattern: bin/hadoop "some command" options
• Typing bin/hadoop with no arguments lists all possible commands (including undocumented ones)
Using HDFS
hadoop dfs
[-ls <path>]
[-du <path>]
[-cp <src> <dst>]
[-rm <path>]
[-put <localsrc> <dst>]
[-copyFromLocal <localsrc> <dst>]
[-moveFromLocal <localsrc> <dst>]
[-get [-crc] <src> <localdst>]
[-cat <src>]
[-copyToLocal [-crc] <src> <localdst>]
[-moveToLocal [-crc] <src> <localdst>]
[-mkdir <path>]
[-touchz <path>]
[-test -[ezd] <path>]
[-stat [format] <path>]
[-help [cmd]]
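A hypothetical session exercising a few of these commands (the paths are illustrative):

bin/hadoop dfs -mkdir input
bin/hadoop dfs -put /etc/hosts input/hosts
bin/hadoop dfs -ls input
bin/hadoop dfs -cat input/hosts
bin/hadoop dfs -rm input/hosts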
Starting / Stopping Hadoop
• bin/start-all.sh - starts the daemons on the master node and all slave nodes
• bin/stop-all.sh - stops the daemons on the master node and all slave nodes
• Run jps to check the status: on the master you should typically see NameNode, SecondaryNameNode and JobTracker; on each slave, DataNode and TaskTracker
Copying Local Files to HDFS
• Log in to irkm-1
• rm -fr /tmp/hadoop/$userID
• cd /home/tmp/hadoop
• bin/hadoop dfs -ls
• bin/hadoop dfs -copyFromLocal example example
• After that:
• bin/hadoop dfs -ls
Wordcount in Python • Mapper.py (a minimal sketch follows below)
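A minimal sketch of what a Streaming word-count mapper.py looks like: it reads raw text from stdin and emits one word<TAB>1 pair per word (the exact code in the slide may differ):

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write("%s\t1\n" % word)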
Wordcount in Python • Reducer.py (a minimal sketch follows below)
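Likewise, a minimal reducer.py sketch: Streaming delivers the mapper output sorted by key, so consecutive lines with the same word can simply be summed (again, the slide's exact code may differ):

#!/usr/bin/env python
# reducer.py: sum counts per word; input arrives sorted by word
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word = word
        current_count = int(count)

# flush the last word
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))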
Execution Code
• bin/hadoop dfs -ls
• bin/hadoop dfs -copyFromLocal example example
• bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py -input example -output java-output
• bin/hadoop dfs -cat java-output/part-00000
• bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
Web Interface
• Hadoop job tracker: http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
• Hadoop task tracker: http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
• Hadoop dfs health checker: http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp