Hadoop: Introduction, Installation and Configuration
数据挖掘研究组 Data Mining Group @ Xiamen University

A Distributed, Data-Intensive Programming Framework
• Hadoop combines distributed storage with parallel computing

Introduction to HDFS
• Hadoop Distributed File System (HDFS)
• An open-source implementation of the GFS design
• It has many similarities with existing distributed file systems
• However, it also differs from them in important ways
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware
• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets

A Key Design Feature of HDFS
• Data is never moved through the NameNode
• Instead, all data transfer occurs directly between clients and DataNodes

MapReduce? We will talk about it next time.

What does "Running Hadoop" mean?
• "Running Hadoop" means running a set of daemons:
• NameNode
• DataNode
• Secondary NameNode
• JobTracker
• TaskTracker

Who Works for Whom?
• NameNode
• Secondary NameNode
• JobTracker
• DataNode
• TaskTracker

NameNode
• Hadoop employs a master/slave architecture for both distributed storage and distributed computation
• The NameNode is the master of HDFS; it directs the slave DataNode daemons to perform the low-level I/O tasks
• The NameNode is the bookkeeper of HDFS:
• it keeps track of how your files are broken down into file blocks
• it keeps track of the overall health of the distributed filesystem

DataNode
• Reads and writes HDFS blocks for clients
• Communicates with other DataNodes to replicate its data blocks for redundancy

Secondary NameNode
• The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster's HDFS
• It differs from the NameNode in that it doesn't receive or record any real-time changes to HDFS
• It communicates with the NameNode to take snapshots of the HDFS metadata
• Recovery: what if the NameNode fails? We can reconfigure the cluster to use the SNN as the primary NameNode

JobTracker
• The liaison between your application and Hadoop
• Once you submit your code to the cluster, the JobTracker determines the execution plan:
• it determines which files to process
• it assigns nodes to different tasks
• it monitors all tasks as they run
• What if a task fails? The JobTracker relaunches the task, possibly on a different node

TaskTracker
• Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns

Installation and Configuration
• Pseudo-distributed mode: all daemons run on one machine
• Fully distributed mode: what is different?

Installation for Pseudo-distributed Mode
• Prerequisites
• Ubuntu Linux
• Hadoop 0.20.2
• Sun Java 6
$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk

Configuring SSH
• Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it
$ sudo apt-get install openssh-server
$ ssh-keygen -t rsa -P ""
• The second command creates an RSA key pair with an empty passphrase. An empty passphrase is generally not recommended, but here it is needed so the key can be unlocked without your interaction: you don't want to enter the passphrase every time Hadoop interacts with its nodes.

Configuring SSH (continued)
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]

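Running the `cat ... >> authorized_keys` step twice leaves duplicate entries. The sketch below shows one possible way to make that step repeatable. The function name `authorize_key` and the demo key are hypothetical, and the demo works in a temporary directory rather than in a real `~/.ssh`.

```shell
#!/bin/sh
# Sketch only: authorize_key is a hypothetical helper, not part of Hadoop or OpenSSH.

# authorize_key PUBKEY_FILE AUTH_FILE
# Appends the public key to AUTH_FILE unless that exact line is already
# present, then restricts permissions to the owner.
authorize_key() {
    pubkey_file=$1
    auth_file=$2
    auth_dir=$(dirname "$auth_file")
    mkdir -p "$auth_dir"
    chmod 700 "$auth_dir"          # note: tightens the directory holding AUTH_FILE
    touch "$auth_file"
    key=$(cat "$pubkey_file")
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}

# Demo in a scratch directory with a placeholder (not real) public key.
demo_dir=$(mktemp -d)
printf 'ssh-rsa AAAAB3PLACEHOLDER dm@ubuntu\n' > "$demo_dir/id_rsa.pub"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/ssh/authorized_keys"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/ssh/authorized_keys"   # second call adds nothing
cat "$demo_dir/ssh/authorized_keys"
```

On a real machine the equivalent call would be `authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"`.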
Extract the Hadoop Package
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo chown -R dm:dm hadoop-0.20.2

Update ~/.bashrc
$ vim ~/.bashrc
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop-0.20.2
# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

hadoop.tmp.dir
• Create /app/hadoop/tmp
• Hadoop's default configuration uses hadoop.tmp.dir as the base temporary directory for both the local file system and HDFS
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown dm:dm /app/hadoop/tmp

Configuring hadoop-env.sh
• Set the JAVA_HOME environment variable for Hadoop
• Change
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
• to
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

Key Step
• Configure the key properties for the Hadoop daemons
• These properties are set in XML files located in /usr/local/hadoop-0.20.2/conf:
• core-site.xml
• mapred-site.xml
• hdfs-site.xml

Key Properties for the Hadoop Daemons
• fs.default.name (core-site.xml)
• hadoop.tmp.dir (core-site.xml)
• mapred.job.tracker (mapred-site.xml)
• dfs.data.dir (hdfs-site.xml)
• dfs.replication (hdfs-site.xml)

Configuring core-site.xml
• Add the following lines between the <configuration> ... </configuration> tags
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

Configuring mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

Configuring hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>

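The three files share one shape: a `<configuration>` element wrapping `<property>` entries. As a rough sketch, they could be generated for pseudo-distributed mode like this. The helper names (`write_property`, `write_site_file`, `write_pseudo_distributed_conf`) are hypothetical, descriptions are omitted for brevity, and values are written without XML escaping, which is only safe because these values contain no special characters.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; property names and values come from the slides.

# write_property NAME VALUE: print one <property> element
write_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# write_site_file FILE NAME VALUE [NAME VALUE ...]
# Writes a complete site file containing the given name/value pairs.
write_site_file() {
    out=$1
    shift
    {
        printf '<?xml version="1.0"?>\n<configuration>\n'
        while [ "$#" -ge 2 ]; do
            write_property "$1" "$2"
            shift 2
        done
        printf '</configuration>\n'
    } > "$out"
}

# write_pseudo_distributed_conf CONF_DIR
write_pseudo_distributed_conf() {
    conf_dir=$1
    mkdir -p "$conf_dir"
    write_site_file "$conf_dir/core-site.xml" \
        hadoop.tmp.dir /app/hadoop/tmp \
        fs.default.name hdfs://localhost:54310
    write_site_file "$conf_dir/mapred-site.xml" \
        mapred.job.tracker localhost:54311
    write_site_file "$conf_dir/hdfs-site.xml" \
        dfs.replication 1
}

# Demo: write into a scratch directory instead of the real conf/ directory.
demo_conf=$(mktemp -d)
write_pseudo_distributed_conf "$demo_conf"
cat "$demo_conf/core-site.xml"
```

Writing into a scratch directory first makes it possible to compare the result with the existing files before copying anything into the real conf directory.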
Formatting the NameNode
• Format the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster"
$ bin/hadoop namenode -format
• Installation done!

Networking
• Assign a static IP to every host
• Update /etc/hosts on both machines (master AND slave) with the following lines:
192.168.0.1 master
192.168.0.2 slave

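Since the same two lines go into the hosts file on every machine, a small repeatable helper can be handy. This is a minimal sketch: `add_host_entry` is a hypothetical name, and the demo edits a scratch file rather than the real /etc/hosts, which would need root.

```shell
#!/bin/sh
# Sketch only: add_host_entry is a hypothetical helper.

# add_host_entry HOSTS_FILE IP NAME
# Appends "IP<TAB>NAME" unless a non-comment line already lists NAME.
add_host_entry() {
    hosts_file=$1
    ip=$2
    name=$3
    touch "$hosts_file"
    if awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
        return 0                     # name already mapped, leave the file alone
    fi
    printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
}

# Demo on a scratch file that stands in for /etc/hosts.
demo_hosts=$(mktemp)
printf '127.0.0.1\tlocalhost\n' > "$demo_hosts"
add_host_entry "$demo_hosts" 192.168.0.1 master
add_host_entry "$demo_hosts" 192.168.0.2 slave
add_host_entry "$demo_hosts" 192.168.0.1 master   # repeated call changes nothing
cat "$demo_hosts"
```

The helper only checks whether the name is present; it does not detect a name that is already mapped to a different IP address.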
SSH Access
• Add hduser@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave (in that user's $HOME/.ssh/authorized_keys)

Masters vs. Slaves
• One machine in the cluster is designated as the NameNode and another machine (possibly the same one) as the JobTracker. These are the actual "masters".
• The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the "slaves".

Masters vs. Slaves (continued)
• conf/masters (master only)
master
• conf/slaves (master only)
master
slave

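Both files are plain lists with one hostname per line. A minimal sketch of writing them follows, assuming the master also works as a slave as in the slide. `write_node_lists` is a hypothetical helper and the demo writes into a scratch directory.

```shell
#!/bin/sh
# Sketch only: write_node_lists is a hypothetical helper.

# write_node_lists CONF_DIR MASTER [SLAVE ...]
# Writes CONF_DIR/masters and CONF_DIR/slaves, one hostname per line.
# The master is listed in slaves too, so it also runs DataNode/TaskTracker.
write_node_lists() {
    conf_dir=$1
    master=$2
    shift 2
    printf '%s\n' "$master" > "$conf_dir/masters"
    printf '%s\n' "$master" "$@" > "$conf_dir/slaves"
}

# Demo with the two hostnames used in the slides.
demo_conf=$(mktemp -d)
write_node_lists "$demo_conf" master slave
cat "$demo_conf/slaves"
```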
conf/*-site.xml (all machines)
• How?

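One plausible answer, offered as a sketch rather than as the presenters' own procedure: take the pseudo-distributed files and point them at the master host instead of localhost, keeping the same ports, then copy the result to every machine. The helper names are hypothetical. The replication factor of 2 is an assumption based on the two DataNodes in this example, and the script expects each `<value>` to follow its `<name>` as in the earlier slides.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; replication value is an assumption.

# rewrite_file FILE COMMAND [ARGS...]
# Filters FILE through COMMAND and replaces FILE with the result.
rewrite_file() {
    target=$1
    shift
    "$@" < "$target" > "$target.tmp" && mv "$target.tmp" "$target"
}

# point_conf_at_master CONF_DIR MASTER_HOST REPLICATION
point_conf_at_master() {
    conf_dir=$1
    master=$2
    replication=$3
    # fs.default.name: hdfs://localhost:PORT -> hdfs://MASTER:PORT
    rewrite_file "$conf_dir/core-site.xml" sed "s#hdfs://localhost:#hdfs://$master:#"
    # mapred.job.tracker: localhost:PORT -> MASTER:PORT
    rewrite_file "$conf_dir/mapred-site.xml" sed "s#<value>localhost:#<value>$master:#"
    # dfs.replication: replace the value that follows the property name
    rewrite_file "$conf_dir/hdfs-site.xml" awk -v r="$replication" '
        /<name>dfs\.replication<\/name>/ { in_prop = 1 }
        in_prop && /<value>/ {
            sub(/<value>[^<]*<\/value>/, "<value>" r "</value>")
            in_prop = 0
        }
        { print }'
}

# Demo on scratch copies that hold only the relevant lines.
demo_conf=$(mktemp -d)
printf '<name>fs.default.name</name>\n<value>hdfs://localhost:54310</value>\n' > "$demo_conf/core-site.xml"
printf '<name>mapred.job.tracker</name>\n<value>localhost:54311</value>\n' > "$demo_conf/mapred-site.xml"
printf '<name>dfs.replication</name>\n<value>1</value>\n' > "$demo_conf/hdfs-site.xml"
point_conf_at_master "$demo_conf" master 2
grep '<value>' "$demo_conf"/*-site.xml
```

Every machine needs the same values, because the slaves use these files to locate the NameNode and the JobTracker.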
Formatting the NameNode and Starting the Cluster
$ bin/hadoop namenode -format
$ bin/start-all.sh
$ jps
$ bin/stop-all.sh

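`jps` lists the running Java processes, so after `start-all.sh` it is the quickest check that the daemons came up. The sketch below compares `jps` output with the five daemon names from earlier in the talk. `missing_daemons` is a hypothetical helper, and the sample input is made up for illustration, not captured from a real cluster.

```shell
#!/bin/sh
# Sketch only: missing_daemons is a hypothetical helper.

# missing_daemons: reads `jps` output on stdin and prints every expected
# daemon that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # jps prints "<pid> <name>", so compare the second field exactly;
        # a substring match would let SecondaryNameNode pass for NameNode.
        if ! printf '%s\n' "$jps_output" |
                awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Demo with made-up jps output in which the TaskTracker is absent.
missing_daemons <<'EOF'
4825 NameNode
4956 DataNode
5093 SecondaryNameNode
5176 JobTracker
5412 Jps
EOF
```

On a pseudo-distributed machine the real check would be `jps | missing_daemons`, where empty output means all five daemons are running. In fully distributed mode each machine runs only some of the daemons, so the expected list would need adjusting per host.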
Thank you! Any questions?