Running Hadoop on Ubuntu

Below are the steps to run your first Hadoop job after you have installed Hadoop.

Step 1: Format the NameNode. This initializes the directory specified by the dfs.name.dir property.

 sudo -u hdfs hadoop namenode -format

Output:

12/01/30 11:51:33 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.205.0.2
STARTUP_MSG:   build =  -r ; compiled by 'jenkins' on Thu Nov  3 02:51:06 UTC 2011
************************************************************/
12/01/30 11:51:34 INFO util.GSet: VM type       = 64-bit
12/01/30 11:51:34 INFO util.GSet: 2% max memory = 19.33375 MB
12/01/30 11:51:34 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/01/30 11:51:34 INFO util.GSet: recommended=2097152, actual=2097152
12/01/30 11:51:34 INFO namenode.FSNamesystem: fsOwner=hdfs
12/01/30 11:51:34 INFO namenode.FSNamesystem: supergroup=supergroup
12/01/30 11:51:34 INFO namenode.FSNamesystem: isPermissionEnabled=false
12/01/30 11:51:34 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/01/30 11:51:34 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/01/30 11:51:34 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/01/30 11:51:35 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/01/30 11:51:35 INFO common.Storage: Storage directory /var/lib/hadoop/cache/hadoop/dfs/name has been successfully formatted.
12/01/30 11:51:35 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
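
The formatted path corresponds to the dfs.name.dir property in hdfs-site.xml. As a quick sanity check (a sketch assuming the Debian-style config directory /etc/hadoop/conf and the storage path shown in the output above), confirm the property and the freshly written image:

grep -A1 dfs.name.dir /etc/hadoop/conf/hdfs-site.xml
# after a successful format, current/ should hold fsimage, edits, fstime and VERSION
sudo -u hdfs ls /var/lib/hadoop/cache/hadoop/dfs/name/current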

Step 2: Start the necessary Hadoop services. Below we start hadoop-namenode, hadoop-datanode, hadoop-jobtracker and hadoop-tasktracker:

for i in hadoop-namenode hadoop-datanode hadoop-jobtracker hadoop-tasktracker ; do sudo service $i start ; done

Or run:

bin/start-all.sh 

Output:

Starting Hadoop namenode daemon: starting namenode, logging to /var/log/hadoop/hadoop-hadoop-namenode-ubuntu.out
hadoop-namenode.
Starting Hadoop datanode daemon: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-ubuntu.out
hadoop-datanode.
Starting Hadoop jobtracker daemon: starting jobtracker, logging to /var/log/hadoop/hadoop-hadoop-jobtracker-ubuntu.out
hadoop-jobtracker.
Starting Hadoop tasktracker daemon: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-ubuntu.out
hadoop-tasktracker.
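
Before moving on, it is worth confirming that all four daemons are actually up. A quick check, assuming your JDK ships the jps tool:

# run via sudo so JVMs owned by the hdfs and mapred users are visible
sudo jps
# you should see NameNode, DataNode, JobTracker and TaskTracker listed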

Step 3: Once the Hadoop cluster is running, it is a good idea to create a home directory on HDFS.
Create the directory:

sudo -u hdfs hadoop fs -mkdir /user/$USER

Change ownership:

sudo -u hdfs hadoop fs -chown $USER /user/$USER
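
To verify the home directory exists and is now owned by you, list /user:

hadoop fs -ls /user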

Step 4: Run the following command to list all the files currently stored in HDFS:

hadoop fs -lsr /

Output:

drwxr-xr-x   - hdfs   supergroup          0 2012-01-30 12:46 /user
drwxr-xr-x   - thysmichels supergroup          0 2012-01-30 12:46 /user/thysmichels
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred
drwx------   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system
-rw-------   1 mapred supergroup          4 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system/jobtracker.info
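
You can also ask the NameNode for a health report at this point; run it as the hdfs user, as with the commands above:

sudo -u hdfs hadoop fsck /
# the report should end with: The filesystem under path '/' is HEALTHY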

Step 5: Now we can run a Hadoop example. The bundled examples jar, which includes a word count among other programs, can be found at /usr/lib/hadoop/hadoop-examples.jar.
If you want to learn to code your own MapReduce job for Hadoop, see this link:
https://thysmichels.com/2012/01/30/write-your-own…reduce-in-java/

Here we will run hadoop-examples.jar to estimate Pi using the Monte Carlo method:

hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 1000

Output:

Number of Maps  = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/01/30 13:08:39 INFO mapred.FileInputFormat: Total input paths to process : 10
12/01/30 13:08:41 INFO mapred.JobClient: Running job: job_201201301242_0001
12/01/30 13:08:42 INFO mapred.JobClient:  map 0% reduce 0%
12/01/30 13:09:10 INFO mapred.JobClient:  map 10% reduce 0%
12/01/30 13:09:13 INFO mapred.JobClient:  map 20% reduce 0%
12/01/30 13:09:19 INFO mapred.JobClient:  map 30% reduce 0%
12/01/30 13:09:22 INFO mapred.JobClient:  map 40% reduce 0%
12/01/30 13:09:28 INFO mapred.JobClient:  map 50% reduce 10%
12/01/30 13:09:34 INFO mapred.JobClient:  map 60% reduce 10%
12/01/30 13:09:41 INFO mapred.JobClient:  map 60% reduce 16%
12/01/30 13:09:44 INFO mapred.JobClient:  map 60% reduce 20%
12/01/30 13:09:48 INFO mapred.JobClient:  map 70% reduce 20%
12/01/30 13:09:54 INFO mapred.JobClient:  map 90% reduce 20%
12/01/30 13:09:57 INFO mapred.JobClient:  map 90% reduce 23%
12/01/30 13:10:00 INFO mapred.JobClient:  map 100% reduce 23%
12/01/30 13:10:03 INFO mapred.JobClient:  map 100% reduce 66%
12/01/30 13:10:09 INFO mapred.JobClient:  map 100% reduce 100%
12/01/30 13:10:14 INFO mapred.JobClient: Job complete: job_201201301242_0001
12/01/30 13:10:14 INFO mapred.JobClient: Counters: 30
12/01/30 13:10:14 INFO mapred.JobClient:   Job Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/30 13:10:14 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=116847
12/01/30 13:10:14 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient:     Launched map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient:     Data-local map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=58610
12/01/30 13:10:14 INFO mapred.JobClient:   File Input Format Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Bytes Read=1180
12/01/30 13:10:14 INFO mapred.JobClient:   File Output Format Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Bytes Written=97
12/01/30 13:10:14 INFO mapred.JobClient:   FileSystemCounters
12/01/30 13:10:14 INFO mapred.JobClient:     FILE_BYTES_READ=226
12/01/30 13:10:14 INFO mapred.JobClient:     HDFS_BYTES_READ=2410
12/01/30 13:10:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=240099
12/01/30 13:10:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
12/01/30 13:10:14 INFO mapred.JobClient:   Map-Reduce Framework
12/01/30 13:10:14 INFO mapred.JobClient:     Map output materialized bytes=280
12/01/30 13:10:14 INFO mapred.JobClient:     Map input records=10
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce shuffle bytes=280
12/01/30 13:10:14 INFO mapred.JobClient:     Spilled Records=40
12/01/30 13:10:14 INFO mapred.JobClient:     Map output bytes=180
12/01/30 13:10:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=1574019072
12/01/30 13:10:14 INFO mapred.JobClient:     CPU time spent (ms)=6470
12/01/30 13:10:14 INFO mapred.JobClient:     Map input bytes=240
12/01/30 13:10:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1230
12/01/30 13:10:14 INFO mapred.JobClient:     Combine input records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce input records=20
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce input groups=20
12/01/30 13:10:14 INFO mapred.JobClient:     Combine output records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1775329280
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce output records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5800841216
12/01/30 13:10:14 INFO mapred.JobClient:     Map output records=20
Job Finished in 95.861 seconds
Estimated value of Pi is 3.14800000000000000000
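
The word count example mentioned in Step 5 runs the same way. A minimal sketch, assuming you have a local text file named sample.txt (the file name and HDFS paths here are illustrative):

# copy some input into HDFS, run the job, then print the result
hadoop fs -put sample.txt /user/$USER/wordcount-in
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/$USER/wordcount-in /user/$USER/wordcount-out
hadoop fs -cat /user/$USER/wordcount-out/part-*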

Navigate to the following URLs to see the Hadoop web interfaces.

Job Tracker:

http://localhost:50030/

Task Tracker:

http://localhost:50060/

HDFS Name Node:

http://localhost:50070/
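
If you prefer to check these from the shell, a HEAD request against each port works too (assuming curl is installed):

# each should return an HTTP status line if the daemon's web server is up
curl -sI http://localhost:50030/ | head -1
curl -sI http://localhost:50070/ | head -1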

If you would like to play around a bit more, below are all the example programs available in hadoop-examples.jar:

aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that counts the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using the Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Check the results of terasort
wordcount: A map/reduce program that counts the words in the input files.
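
As a slightly bigger exercise, the three tera* programs above chain together. A sketch with a deliberately small dataset (teragen writes 100-byte rows, so 100000 rows is roughly 10 MB; all paths are illustrative):

hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen 100000 /user/$USER/tera-in
hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort /user/$USER/tera-in /user/$USER/tera-out
# writes a small report confirming the output is globally sorted
hadoop jar /usr/lib/hadoop/hadoop-examples.jar teravalidate /user/$USER/tera-out /user/$USER/tera-report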