Hadoop Tutorial
Jian Wang
Based on "Meet Hadoop! Open Source Grid Computing" by Devaraj Das, Yahoo! Inc. Bangalore & Apache Software Foundation
Why should we use Hadoop?
• Need to process 10TB datasets
• On 1 node: scanning @ 50MB/s = 2.3 days
• On a 1000-node cluster: scanning @ 50MB/s = 3.3 min
• Need an efficient, reliable and usable framework
• Google File System (GFS) paper
• Google's MapReduce paper
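As a quick sanity check on those numbers, a short Python sketch of the arithmetic (names are illustrative):

# Scan-time estimate for the 10TB example above
DATASET_BYTES = 10e12   # 10 TB
RATE_PER_NODE = 50e6    # 50 MB/s per node

def scan_seconds(nodes):
    return DATASET_BYTES / (RATE_PER_NODE * nodes)

print(scan_seconds(1) / 86400)    # ~2.3 days on a single node
print(scan_seconds(1000) / 60)    # ~3.3 minutes on 1000 nodes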
HDFS - Hadoop Distributed FS
• Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
• Files are divided into large blocks (64MB) and distributed across the cluster
• Blocks are replicated to handle hardware failure; the default replication factor is 3 (configurable)
• HDFS cannot be directly mounted by an existing operating system
• Once you use the DFS (put something in it), relative paths resolve from /user/{your usr id}. E.g. if your id is jwang30, your "home dir" is /user/jwang30
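A short hypothetical session illustrating the relative-path rule, assuming the jwang30 id from the slide:

bin/hadoop dfs -mkdir example
bin/hadoop dfs -ls /user/jwang30

The first command creates /user/jwang30/example, which the second command then lists by its absolute path.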
Hadoop Architecture
• Master-Slave architecture
• Master node (irkm-1) runs the HDFS Namenode and the MapReduce Jobtracker
• The Jobtracker accepts MR jobs submitted by users, assigns Map and Reduce tasks to Tasktrackers, monitors task and tasktracker status, and re-executes tasks upon failure
• Slave nodes (irkm-1 to irkm-6) run HDFS Datanodes and MapReduce Tasktrackers
• Tasktrackers run Map and Reduce tasks upon instruction from the Jobtracker, and manage storage and transmission of intermediate output
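The start scripts find the slave machines through the conf/slaves file; a minimal sketch, assuming the six irkm hosts above (one hostname per line):

irkm-1
irkm-2
irkm-3
irkm-4
irkm-5
irkm-6

bin/start-all.sh reads this file and starts a Datanode and a Tasktracker on each listed host.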
Hadoop Paths
• Hadoop is locally "installed" on each machine
• Version 0.19.2
• Installed location is /home/tmp/hadoop
• Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)
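"Configurable" here refers to the hadoop.tmp.dir property; a sketch of the relevant entry in conf/hadoop-site.xml, shown with its default value:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>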
Format Namenode
• If it is the first time that you use it, you need to format the namenode:
• - log in to irkm-1
• - cd /home/tmp/hadoop
• - bin/hadoop namenode -format
• Most commands follow the same pattern: bin/hadoop "some command" options
• Typing bin/hadoop with no arguments lists all possible commands (including undocumented ones)
Using HDFS
hadoop dfs
[-ls <path>]
[-du <path>]
[-cp <src> <dst>]
[-rm <path>]
[-put <localsrc> <dst>]
[-copyFromLocal <localsrc> <dst>]
[-moveFromLocal <localsrc> <dst>]
[-get [-crc] <src> <localdst>]
[-cat <src>]
[-copyToLocal [-crc] <src> <localdst>]
[-moveToLocal [-crc] <src> <localdst>]
[-mkdir <path>]
[-touchz <path>]
[-test -[ezd] <path>]
[-stat [format] <path>]
[-help [cmd]]
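A hypothetical session exercising a few of these commands (the paths are illustrative):

bin/hadoop dfs -mkdir input
bin/hadoop dfs -put /etc/hosts input/hosts
bin/hadoop dfs -ls input
bin/hadoop dfs -cat input/hosts
bin/hadoop dfs -rm input/hosts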
Starting / Stopping Hadoop
• bin/start-all.sh - starts the daemons on the master node and all slave nodes
• bin/stop-all.sh - stops the daemons on the master node and all slave nodes
• Run jps to check the status: on the master you should typically see NameNode, SecondaryNameNode and JobTracker; on each slave, DataNode and TaskTracker
Copying Local Files to HDFS
• Log in to irkm-1
• rm -fr /tmp/hadoop/$userID
• cd /home/tmp/hadoop
• bin/hadoop dfs -ls
• bin/hadoop dfs -copyFromLocal example example
• After that:
• bin/hadoop dfs -ls
Wordcount in Python • Mapper.py (a minimal sketch follows below)
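A minimal sketch of what a Streaming word-count mapper.py looks like: it reads raw text from stdin and emits one word<TAB>1 pair per word (the exact code in the slide may differ):

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write("%s\t1\n" % word)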
Wordcount in Python • Reducer.py (a minimal sketch follows below)
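Likewise, a minimal reducer.py sketch: Streaming delivers the mapper output sorted by key, so consecutive lines with the same word can simply be summed (again, the slide's exact code may differ):

#!/usr/bin/env python
# reducer.py: sum counts per word; input arrives sorted by word
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word = word
        current_count = int(count)

# flush the last word
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))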
Execution Code
• bin/hadoop dfs -ls
• bin/hadoop dfs -copyFromLocal example example
• bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py -input example -output java-output
• bin/hadoop dfs -cat java-output/part-00000
• bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
Web Interface
• Hadoop job tracker: http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
• Hadoop task tracker: http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
• Hadoop dfs health checker: http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp