Running Hadoop on Ubuntu

Below are the steps to run your first Hadoop job once you have installed Hadoop.

Step 1: Format the NameNode. This initializes the directory specified by the dfs.name.dir property (a sample configuration follows the output below).

 sudo -u hdfs hadoop namenode -format

Output:

12/01/30 11:51:33 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.205.0.2
STARTUP_MSG:   build =  -r ; compiled by 'jenkins' on Thu Nov  3 02:51:06 UTC 2011
************************************************************/
12/01/30 11:51:34 INFO util.GSet: VM type       = 64-bit
12/01/30 11:51:34 INFO util.GSet: 2% max memory = 19.33375 MB
12/01/30 11:51:34 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/01/30 11:51:34 INFO util.GSet: recommended=2097152, actual=2097152
12/01/30 11:51:34 INFO namenode.FSNamesystem: fsOwner=hdfs
12/01/30 11:51:34 INFO namenode.FSNamesystem: supergroup=supergroup
12/01/30 11:51:34 INFO namenode.FSNamesystem: isPermissionEnabled=false
12/01/30 11:51:34 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/01/30 11:51:34 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/01/30 11:51:34 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/01/30 11:51:35 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/01/30 11:51:35 INFO common.Storage: Storage directory /var/lib/hadoop/cache/hadoop/dfs/name has been successfully formatted.
12/01/30 11:51:35 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
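
The storage directory shown in the output comes from the dfs.name.dir property, which is typically set in hdfs-site.xml. A minimal sketch; the value below simply mirrors the path reported in the log above:

<property>
  <name>dfs.name.dir</name>
  <value>/var/lib/hadoop/cache/hadoop/dfs/name</value>
</property>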

Step 2: Start the necessary Hadoop services. Below we start hadoop-namenode, hadoop-datanode, hadoop-jobtracker and hadoop-tasktracker:

for i in hadoop-namenode hadoop-datanode hadoop-jobtracker hadoop-tasktracker ; do sudo service $i start ; done

Or run:

bin/start-all.sh 

Output:

Starting Hadoop namenode daemon: starting namenode, logging to /var/log/hadoop/hadoop-hadoop-namenode-ubuntu.out
hadoop-namenode.
Starting Hadoop datanode daemon: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-ubuntu.out
hadoop-datanode.
Starting Hadoop jobtracker daemon: starting jobtracker, logging to /var/log/hadoop/hadoop-hadoop-jobtracker-ubuntu.out
hadoop-jobtracker.
Starting Hadoop tasktracker daemon: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-ubuntu.out
hadoop-tasktracker.
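
To confirm the daemons came up, you can list the running Java processes with jps (it ships with the JDK); you should see NameNode, DataNode, JobTracker and TaskTracker among them:

sudo jps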

Step 3: Once the Hadoop cluster is running, it is a good idea to create a home directory on HDFS.
Create the directory:

sudo -u hdfs hadoop fs -mkdir /user/$USER

Change ownership:

sudo -u hdfs hadoop fs -chown $USER /user/$USER
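
To verify the directory exists and the ownership change took effect:

hadoop fs -ls /user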

Step 4: Run the following command to list all the files currently stored in HDFS:

hadoop fs -lsr /

Output:

drwxr-xr-x   - hdfs   supergroup          0 2012-01-30 12:46 /user
drwxr-xr-x   - thysmichels supergroup          0 2012-01-30 12:46 /user/thysmichels
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred
drwx------   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system
-rw-------   1 mapred supergroup          4 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system/jobtracker.info

Step 5: Now we can run a Hadoop example. The example jar, hadoop-examples.jar, ships with several sample MapReduce jobs (including a word count) and can be found under /usr/lib/hadoop/. If you want to learn to write your own MapReduce job for Hadoop, see this link:
https://thysmichels.com/2012/01/30/write-your-own…reduce-in-java/
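
The pi example scatters sample points over a unit square and counts how many land inside the inscribed quarter circle; since that fraction approaches Pi/4, the job estimates Pi as 4 * (points inside) / (total points). The first argument is the number of map tasks and the second is the number of samples per map.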

Here we will run hadoop-examples.jar to estimate Pi using the Monte Carlo method:

hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 100

Output:

Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/01/30 13:08:39 INFO mapred.FileInputFormat: Total input paths to process : 10
12/01/30 13:08:41 INFO mapred.JobClient: Running job: job_201201301242_0001
12/01/30 13:08:42 INFO mapred.JobClient:  map 0% reduce 0%
12/01/30 13:09:10 INFO mapred.JobClient:  map 10% reduce 0%
12/01/30 13:09:13 INFO mapred.JobClient:  map 20% reduce 0%
12/01/30 13:09:19 INFO mapred.JobClient:  map 30% reduce 0%
12/01/30 13:09:22 INFO mapred.JobClient:  map 40% reduce 0%
12/01/30 13:09:28 INFO mapred.JobClient:  map 50% reduce 10%
12/01/30 13:09:34 INFO mapred.JobClient:  map 60% reduce 10%
12/01/30 13:09:41 INFO mapred.JobClient:  map 60% reduce 16%
12/01/30 13:09:44 INFO mapred.JobClient:  map 60% reduce 20%
12/01/30 13:09:48 INFO mapred.JobClient:  map 70% reduce 20%
12/01/30 13:09:54 INFO mapred.JobClient:  map 90% reduce 20%
12/01/30 13:09:57 INFO mapred.JobClient:  map 90% reduce 23%
12/01/30 13:10:00 INFO mapred.JobClient:  map 100% reduce 23%
12/01/30 13:10:03 INFO mapred.JobClient:  map 100% reduce 66%
12/01/30 13:10:09 INFO mapred.JobClient:  map 100% reduce 100%
12/01/30 13:10:14 INFO mapred.JobClient: Job complete: job_201201301242_0001
12/01/30 13:10:14 INFO mapred.JobClient: Counters: 30
12/01/30 13:10:14 INFO mapred.JobClient:   Job Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/30 13:10:14 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=116847
12/01/30 13:10:14 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient:     Launched map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient:     Data-local map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=58610
12/01/30 13:10:14 INFO mapred.JobClient:   File Input Format Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Bytes Read=1180
12/01/30 13:10:14 INFO mapred.JobClient:   File Output Format Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Bytes Written=97
12/01/30 13:10:14 INFO mapred.JobClient:   FileSystemCounters
12/01/30 13:10:14 INFO mapred.JobClient:     FILE_BYTES_READ=226
12/01/30 13:10:14 INFO mapred.JobClient:     HDFS_BYTES_READ=2410
12/01/30 13:10:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=240099
12/01/30 13:10:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
12/01/30 13:10:14 INFO mapred.JobClient:   Map-Reduce Framework
12/01/30 13:10:14 INFO mapred.JobClient:     Map output materialized bytes=280
12/01/30 13:10:14 INFO mapred.JobClient:     Map input records=10
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce shuffle bytes=280
12/01/30 13:10:14 INFO mapred.JobClient:     Spilled Records=40
12/01/30 13:10:14 INFO mapred.JobClient:     Map output bytes=180
12/01/30 13:10:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=1574019072
12/01/30 13:10:14 INFO mapred.JobClient:     CPU time spent (ms)=6470
12/01/30 13:10:14 INFO mapred.JobClient:     Map input bytes=240
12/01/30 13:10:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1230
12/01/30 13:10:14 INFO mapred.JobClient:     Combine input records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce input records=20
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce input groups=20
12/01/30 13:10:14 INFO mapred.JobClient:     Combine output records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1775329280
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce output records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5800841216
12/01/30 13:10:14 INFO mapred.JobClient:     Map output records=20
Job Finished in 95.861 seconds
Estimated value of Pi is 3.14800000000000000000

Navigate to the following URLs to see the Hadoop web interfaces.
Job Tracker:

http://localhost:50030/

Task Tracker:

http://localhost:50060/

HDFS Name Node:

http://localhost:50070/
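
On a headless machine you can check that an interface responds from the command line (assuming curl is installed), for example:

curl -s http://localhost:50070/ | head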

If you would like to play around a bit more, below are all the example programs available in hadoop-examples.jar (a sample word-count invocation follows the list):

aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that counts the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using the Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generates data for the terasort.
terasort: Runs the terasort.
teravalidate: Checks the results of the terasort.
wordcount: A map/reduce program that counts the words in the input files.
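
For example, a minimal word-count run might look like this. The input and output paths are illustrative, the output directory must not exist beforehand, and the local config files just serve as handy sample input (/etc/hadoop/conf is the usual location for a packaged install; adjust if yours differs):

# Stage some text files in HDFS as input
hadoop fs -mkdir /user/$USER/wordcount-input
hadoop fs -put /etc/hadoop/conf/*.xml /user/$USER/wordcount-input
# Run the word count and print the results
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/$USER/wordcount-input /user/$USER/wordcount-output
hadoop fs -cat /user/$USER/wordcount-output/part-*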
