Blog Archives

Tips for running Hadoop on Ubuntu

Below are some tips for running Hadoop on Ubuntu. If you hit other errors running Hadoop on Ubuntu, please comment with the problem and how you solved it.

When you get this Warning: $HADOOP_HOME is deprecated

Solution: add export HADOOP_HOME_WARN_SUPPRESS="TRUE" to hadoop-env.sh.

Cannot create directory `/usr/local/hadoop/libexec/../logs

Solution: sudo chown -R hduser:hadoop /usr/local/hadoop/

Enter passphrase when running ./start-all.sh

Solution: ssh-keygen -t rsa -P ""     This creates an SSH key with an empty passphrase.

Warning: <property>/<configuration> not set

Solution: make sure the <property> and <configuration> tags are populated in core-site.xml, mapred-site.xml and hdfs-site.xml.
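
One quick way to confirm the files are populated is to parse them and list what they contain. The sketch below assumes Python 3 and the usual layout of a <configuration> root holding <property> elements with <name> and <value> children. The helper name read_properties and the sample property value are illustrative only, not taken from any particular setup.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    if root.tag != "configuration":
        raise ValueError("root element is <%s>, expected <configuration>" % root.tag)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name:
            props[name.strip()] = (value or "").strip()
    return props


# Sample content; the property and its value are assumptions for illustration.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    props = read_properties(SAMPLE)
    if not props:
        print("No <property> entries found")
    for name, value in sorted(props.items()):
        print("%s = %s" % (name, value))
```

To check a real file, read its text with open(path).read() and pass it to read_properties. An empty result means the file has no populated properties.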

Send files to and retrieve files from HDFS

Solution:

Send file to HDFS > bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

Get file from HDFS > bin/hadoop dfs -get foo localFile
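
If you want to script these transfers, the same two commands can be wrapped in Python. This is a minimal sketch, assuming Python 3 and that it is run from the Hadoop install directory so that bin/hadoop resolves. The function names put_command, get_command and run are made up for this example.

```python
import subprocess

# Path to the hadoop launcher, relative to the install dir as in the commands above.
HADOOP = "bin/hadoop"


def put_command(local_path, hdfs_dir):
    """Build the argument list for copying a local file into HDFS."""
    return [HADOOP, "dfs", "-put", local_path, hdfs_dir]


def get_command(hdfs_path, local_path):
    """Build the argument list for copying an HDFS file to local disk."""
    return [HADOOP, "dfs", "-get", hdfs_path, local_path]


def run(command):
    """Run a command and return its exit code (0 means success)."""
    return subprocess.call(command)


if __name__ == "__main__":
    # Only print the commands here; call run(...) on a machine with Hadoop installed.
    print(" ".join(put_command("/home/someone/interestingFile.txt", "/user/yourUserName/")))
    print(" ".join(get_command("foo", "localFile")))
```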

ssh: connect to host localhost port 22: Connection refused

Solution: Ubuntu does not have an SSH server installed by default, so you have to install and start it.

Install > sudo apt-get install ssh

Start > sudo service ssh start
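
To confirm that something is now listening on port 22 without going through ssh itself, a small socket check can help. This is a sketch assuming Python 3. The helper name port_is_open is hypothetical, and it only tells you that a TCP connection is accepted, not that key-based login works.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 22):
        print("Something is accepting connections on port 22")
    else:
        print("Nothing is listening on port 22; install and start ssh")
```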

hadoop Incompatible namespaceIDs in /app/hadoop/tmp/*

Solution: 

  1. Stop the cluster: ./stop-dfs.sh
  2. Delete the directory specified on the DataNode: rm -r /app/hadoop/tmp/*
  3. Reformat the NameNode: hadoop namenode -format

OR

  1. Stop the DataNode: ./stop-dfs.sh
  2. Edit the value of namespaceID in /current/VERSION to match the value of the current NameNode.
  3. Restart the DataNode: ./start-dfs.sh
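
The second option can be done by hand in an editor, but the edit itself is simple enough to sketch in code. The example below assumes Python 3 and that the VERSION file is a plain list of key=value lines with optional # comments. The helper names and the sample IDs are invented for illustration. Stop the DataNode before changing the real file.

```python
def read_version(text):
    """Parse key=value lines from VERSION-style text, skipping comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def set_namespace_id(text, new_id):
    """Return the text with the namespaceID line replaced by new_id."""
    out = []
    for line in text.splitlines():
        if line.strip().startswith("namespaceID="):
            line = "namespaceID=%s" % new_id
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up sample contents; real files hold the IDs from your own cluster.
NAMENODE_VERSION = "#sample\nnamespaceID=222\nstorageType=NAME_NODE\n"
DATANODE_VERSION = "#sample\nnamespaceID=111\nstorageType=DATA_NODE\n"

if __name__ == "__main__":
    wanted = read_version(NAMENODE_VERSION)["namespaceID"]
    print(set_namespace_id(DATANODE_VERSION, wanted))
```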

hadoop java.net.UnknownHostException: ubuntu: ubuntu

Solution: 

1. Add ubuntu as your localhost IP to your /etc/hosts file: sudo vi /etc/hosts

2. Restart your network: sudo /etc/init.d/networking restart

So your /etc/hosts file on your master machine will look something like this:

172.16.62.152      master
172.16.62.151      slave
172.16.62.152      ubuntu

On your slave machine

172.16.62.152      master
172.16.62.151      slave
172.16.62.151      ubuntu

If none of this works, replace the master/localhost hostname with the IP address in core-site.xml and mapred-site.xml.
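
Before editing the XML files, it can be worth double-checking that every hostname really has an entry in the hosts file. The sketch below assumes Python 3 and parses hosts-style text into a name-to-IP mapping. The helper name parse_hosts is made up here, and the sample text is the master machine example from above.

```python
def parse_hosts(text):
    """Map each hostname in /etc/hosts-style text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        ip, names = parts[0], parts[1:]
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


MASTER_HOSTS = """\
172.16.62.152      master
172.16.62.151      slave
172.16.62.152      ubuntu
"""

if __name__ == "__main__":
    hosts = parse_hosts(MASTER_HOSTS)
    for name in ("master", "slave", "ubuntu"):
        print(name, "->", hosts.get(name, "MISSING"))
```

To check the real file, pass open("/etc/hosts").read() instead of the sample text.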

Python script for reading and writing large datasets

Below is a script to read and write large datasets saved as CSV files.

Writing datasets to a CSV file.

import csv
#writing data into csv file
writer = csv.writer(open('dataset.csv', 'wb', buffering=0))
writer.writerows([
('GOOG', 'Google Inc.', 123.44, 0.32, 0.09),
('YHOO', 'Yahoo! Inc.', 2.33, 99.23, 0.123),
('IBM', 'IBM Inc.', 223.44, 212.32, 6.42)
])

Reading large datasets from CSV files

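import csv
# Read the CSV file row by row so the whole dataset never has to fit in memory
dataset = csv.reader(open('dataset.csv', 'rb'))
status_labels = {-1: 'down', 0: 'unchanged', 1: 'up'}
for ticker, name, price, change, pct in dataset:
    # cmp() returns -1, 0 or 1, which maps onto the labels above
    status = status_labels[cmp(float(change), 0.0)]
    print '%s is %s (%s%%)' % (name, status, pct)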

This script is good for importing large datasets for your Hadoop jobs.
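
The snippets above are Python 2 (print statement, cmp(), binary file modes). For anyone on Python 3, a rough equivalent might look like the sketch below. The function names write_dataset and describe are made up for this example, and the sign check stands in for cmp(), which Python 3 no longer has.

```python
import csv

ROWS = [
    ("GOOG", "Google Inc.", 123.44, 0.32, 0.09),
    ("YHOO", "Yahoo! Inc.", 2.33, 99.23, 0.123),
    ("IBM", "IBM Inc.", 223.44, 212.32, 6.42),
]


def write_dataset(path, rows):
    """Write rows to a CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def describe(path):
    """Yield one status line per row without loading the whole file."""
    with open(path, newline="") as handle:
        for ticker, name, price, change, pct in csv.reader(handle):
            delta = float(change)
            status = "up" if delta > 0 else "down" if delta < 0 else "unchanged"
            yield "%s is %s (%s%%)" % (name, status, pct)


if __name__ == "__main__":
    write_dataset("dataset.csv", ROWS)
    for line in describe("dataset.csv"):
        print(line)
```

Because describe is a generator, rows are processed one at a time, which is what keeps memory use flat on large files.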

Running local mrjob streaming hadoop jobs

Follow the steps below to run a local mrjob. In this example I run an mrjob to calculate word frequency.

Prereq: Python 2.6 or 2.7 must be installed for this to work.

Step 1. Download mrjob:

https://github.com/Yelp/mrjob

Step 2. Navigate to Yelp/mrjob/examples in your terminal

Step 3: Create a dataset, or download one from http://www.infochimps.com.

Step 4: Test your environment and make sure mrjob works. In a Python shell, run:

import mrjob

This should produce no errors or dependency issues.

Step 5: Run your mrjob

python mr_word_freq_count.py log1 > counts

The log1 input was (note that each line was tab delimited):

test
one
two
three
four
five
one
two
test

Result:

"five"	1
"four"	1
"one"	2
"test"	2
"three"	1
"two"	2
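
To see what the job computes without installing anything, here is a plain-Python imitation of the map and reduce steps. It does not import mrjob and is not the code of mr_word_freq_count.py. The word regex and the lowercasing are assumptions about how such a job would typically split its input. Assumes Python 3.

```python
import re
from collections import defaultdict

# Assumed word pattern: runs of letters, digits, underscores and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word on a line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all the counts emitted for one word."""
    return word, sum(counts)


def word_freq(lines):
    """Group mapper output by word, then reduce each group, sorted by word."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return [reducer(word, counts) for word, counts in sorted(grouped.items())]


# Same words as the log1 input above.
LOG1 = ["test", "one", "two", "three", "four", "five", "one", "two", "test"]

if __name__ == "__main__":
    for word, count in word_freq(LOG1):
        print('"%s"\t%d' % (word, count))
```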