Setting up a Hadoop multi-node cluster on Ubuntu can be challenging. In my case I used my laptop and ran 2 VMs with 2GB of RAM each, which made everything a bit slow…thanks to my new Apple MacBook Pro with 8GB of RAM I had no worries.
I will break this tutorial into a few parts just to make it more organized and so you can track your progress. Remember you will have to follow each of these parts twice, once on each of your machines (master and slave).
- Part 1: Setting up your Ubuntu Environment
- Part 2: Configure the /etc/hosts file
- Part 3: SSH Setup
- Part 4: Download and configuring Hadoop
- Part 5: Configure Master Slave Settings
- Part 6: Starting Master Slave Setup
- Part 7: Running first Map Reduce on Multi-Node Setup
Part 1: Setting up your Ubuntu Environment
By default Ubuntu does not come with Sun Java installed, so you will have to install it yourself. This is an easy way to install it via the command line:
> sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
> sudo apt-get update
> sudo apt-get install sun-java6-jdk
Java is installed; let’s now export JAVA_HOME:
> export JAVA_HOME=/usr/lib/jvm/java-6-sun
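The export above only applies to your current shell session (we will make it permanent in $HOME/.bashrc in Part 4). To check that the right Java is being picked up, run:
> java -version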
By default Ubuntu will not have ssh installed, so let’s install it from the command line:
> sudo apt-get install ssh
It is recommended not to run Hadoop under your current user/group, so we will create a new user and group. We will call the user hduser and the group hd. The commands look as follows:
> sudo addgroup hd
> sudo adduser --ingroup hd hduser
The last thing we need to do is disable IPv6 for Hadoop. After you have downloaded Hadoop, add this line to conf/hadoop-env.sh:
> export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
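If you prefer to disable IPv6 system-wide instead (an alternative approach; the Hadoop option above is usually sufficient on its own), you can add these lines to /etc/sysctl.conf and reboot:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1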
Part 2: Configure the /etc/hosts file
You need to set up the /etc/hosts file with the IP details of the master and slave machines. Run the following command to edit the hosts file:
> sudo vi /etc/hosts (use gedit if you don't know vi)
Add the following lines:
172.*.*.* master
172.*.*.* slave
Run the command ifconfig on both the master and slave machines to determine their IP addresses, then fill them in where I have put *.
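On older Ubuntu releases the address is shown on a line starting with ‘inet addr:’, so a quick way to pick it out is:
> ifconfig | grep "inet addr"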
Part 3: SSH Setup
Let’s configure ssh. Run:
> su - hduser
> ssh-keygen -t rsa -P ""
> cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
On the master machine, run the following to copy the public key to the slave:
> hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
To test that ssh works for the master and slave, run:
> ssh master
> ssh slave
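If ssh still prompts for a password, the usual culprit is permissions on the .ssh directory; tightening them on both machines normally fixes it:
> chmod 700 $HOME/.ssh
> chmod 600 $HOME/.ssh/authorized_keys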
Part 4: Download and configuring Hadoop
First we need to download the latest Hadoop release and extract it to our local filesystem. Download the latest Hadoop from: http://www.reverse.net/pub/apache//hadoop/common/
Extract Hadoop:
> tar -xvf hadoop*.tar.gz
Now we need to change ownership of the extracted Hadoop folder to hduser. We can do that with the following command:
> sudo chown -R hduser:hd /home/user/Downloads/hadoop
Best to move the hadoop folder out of the Downloads folder; you can do that with the following command:
> sudo mv /home/user/Downloads/hadoop /usr/local/
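Note that the tarball usually extracts into a versioned folder (for example hadoop-1.0.0); if so, rename it so it matches the paths used in the rest of this tutorial (adjust the version to whatever you downloaded):
> sudo mv /usr/local/hadoop-1.0.0 /usr/local/hadoop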
Now we need to configure $HOME/.bashrc with the Hadoop variables. Enter the following commands:
> cd ~
> sudo vi .bashrc (if you don't know vi, you can type: sudo gedit .bashrc)
Add the following lines to the end
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
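Reload .bashrc and check that the hadoop command is now on your path:
> source ~/.bashrc
> hadoop version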
Now we are going to create a folder which Hadoop will use to store its data files:
> sudo mkdir -p /app/hadoop/tmp
> sudo chown hduser:hd /app/hadoop/tmp
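You may also want to tighten the permissions on this folder so only hduser can write to it:
> sudo chmod 750 /app/hadoop/tmp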
Good, now we can edit the *-site.xml files in hadoop/conf. We will add properties to 3 files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml
Add the following property tags to core-site.xml (in each file, the property tags go inside the existing <configuration> element):
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>Temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>Default file system.</description>
</property>
Add the following property tags to mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>MapReduce job tracker.</description>
</property>
Add the following property tags to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Part 5: Configure Master Slave Settings
We will configure the following 2 files on both the master and slave machines (a shell shortcut for writing them is sketched after the lists below).
Let’s start with the Master machine:
- Open the following file: conf/masters and change ‘localhost’ to ‘master’:
master
- Open the following file: conf/slaves and change ‘localhost’ to ‘master’ and ‘slave’
master
slave
Now on the Slave machine:
- Open the following file: conf/masters and change ‘localhost’ to ‘slave’:
slave
- Open the following file: conf/slaves and change ‘localhost’ to ‘slave’
slave
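For reference, since Hadoop now lives in /usr/local/hadoop, you can also write these files straight from the shell; on the master machine that would look like:
> echo "master" > /usr/local/hadoop/conf/masters
> printf "master\nslave\n" > /usr/local/hadoop/conf/slaves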
Part 6: Starting your Master Slave Setup
Note: all of the steps below are done on the master machine.
The first thing we need to do is format the Hadoop namenode. Run:
> hadoop namenode -format
Starting a multi-node cluster takes two steps:
- Start the HDFS daemons by running the following command in hadoop/bin:
>./start-dfs.sh
Run the following command on the master:
> jps
14399 NameNode
16244 DataNode
16312 SecondaryNameNode
12215 Jps
Run the following command on the slave:
> jps
11501 DataNode
11612 Jps
- Start the Map Reduce daemons by running the following command in hadoop/bin:
>./start-mapred.sh
Run the following command on the master:
> jps
14399 NameNode
16244 DataNode
16312 SecondaryNameNode
18215 Jps
17102 JobTracker
17211 TaskTracker
Run the following command on the slave:
> jps
11501 DataNode
11712 Jps
11695 TaskTracker
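At this point you can also ask the namenode for a cluster report to confirm that both datanodes have registered:
> hadoop dfsadmin -report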
Part 7: Running first Map Reduce on Multi-Node Setup
If everything was successful, you can now run your first multi-node map-reduce job.
Let’s get an ebook in UTF-8 format:
http://www.gutenberg.org/ebooks/118
Now we need to push the book to HDFS. Run the following command, adjusting the path and filename to wherever you saved the book:
> hadoop dfs -copyFromLocal /home/user/Downloads/*.txt /user/hduser/hdinput
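You can verify the file made it into HDFS by listing the input directory:
> hadoop dfs -ls /user/hduser/hdinput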
Let’s run the map-reduce example that counts the number of words in the document:
> hadoop jar ../hadoop-examples-1.0.0.jar wordcount /user/hduser/hdinput /user/hduser/hdinput_result
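When the job finishes you can read the output straight from HDFS (the exact part file names can vary between Hadoop versions, so a glob is safest):
> hadoop dfs -cat /user/hduser/hdinput_result/part*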
Check the following logs (on the master for the jobtracker, on the slave for the tasktracker and datanode) to see which map-reduce jobs were completed:
> hadoop-hduser-tasktracker-ubuntu.log
> hadoop-hduser-jobtracker-ubuntu.log
> hadoop-hduser-datanode-ubuntu.log
If you get stuck or get an error, check my other blog post with tips for running Hadoop on Ubuntu:
https://thysmichels.com/2012/02/11/tips-running-hadoop-on-ubuntu/
Hope this was helpful; if you have any questions please leave a comment.