Below are the steps to run your first Hadoop job after you have installed Hadoop.
Step 1. Format the NameNode: this initializes the directory specified by the dfs.name.dir property.
sudo -u hdfs hadoop namenode -format
Output
12/01/30 11:51:33 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.205.0.2
STARTUP_MSG: build = -r ; compiled by 'jenkins' on Thu Nov 3 02:51:06 UTC 2011
************************************************************/
12/01/30 11:51:34 INFO util.GSet: VM type = 64-bit
12/01/30 11:51:34 INFO util.GSet: 2% max memory = 19.33375 MB
12/01/30 11:51:34 INFO util.GSet: capacity = 2^21 = 2097152 entries
12/01/30 11:51:34 INFO util.GSet: recommended=2097152, actual=2097152
12/01/30 11:51:34 INFO namenode.FSNamesystem: fsOwner=hdfs
12/01/30 11:51:34 INFO namenode.FSNamesystem: supergroup=supergroup
12/01/30 11:51:34 INFO namenode.FSNamesystem: isPermissionEnabled=false
12/01/30 11:51:34 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/01/30 11:51:34 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/01/30 11:51:34 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/01/30 11:51:35 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/01/30 11:51:35 INFO common.Storage: Storage directory /var/lib/hadoop/cache/hadoop/dfs/name has been successfully formatted.
12/01/30 11:51:35 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
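For reference, dfs.name.dir is configured in hdfs-site.xml (where that file lives depends on your install, commonly conf/ under the Hadoop home or /etc/hadoop/conf). A minimal sketch, using the directory shown in the output above:

<property>
  <name>dfs.name.dir</name>
  <value>/var/lib/hadoop/cache/hadoop/dfs/name</value>
</property>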
Step 2. Start the necessary Hadoop services. Below we start hadoop-namenode, hadoop-datanode, hadoop-jobtracker and hadoop-tasktracker.
for i in hadoop-namenode hadoop-datanode hadoop-jobtracker hadoop-tasktracker ; do sudo service $i start ; done
OR run
bin/start-all.sh
Output:
Starting Hadoop namenode daemon: starting namenode, logging to /var/log/hadoop/hadoop-hadoop-namenode-ubuntu.out
hadoop-namenode.
Starting Hadoop datanode daemon: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-ubuntu.out
hadoop-datanode.
Starting Hadoop jobtracker daemon: starting jobtracker, logging to /var/log/hadoop/hadoop-hadoop-jobtracker-ubuntu.out
hadoop-jobtracker.
Starting Hadoop tasktracker daemon: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-ubuntu.out
hadoop-tasktracker.
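To confirm the daemons actually came up, you can run the JDK's jps tool; with all four services running you should see NameNode, DataNode, JobTracker and TaskTracker JVMs in the list (use sudo if the daemons run under the hdfs and mapred users; the exact output will vary by setup):

sudo jps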
Step 3: Once the Hadoop cluster is running it is a good idea to create a home directory on HDFS.
Create the directory:
sudo -u hdfs hadoop fs -mkdir /user/$USER
Change ownership:
sudo -u hdfs hadoop fs -chown $USER /user/$USER
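To confirm the new home directory exists and is owned by your user, you can list /user (the listing will vary by environment):

hadoop fs -ls /user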
Step 4: Run the following command to recursively list all files and directories currently in HDFS:
hadoop fs -lsr /
Output:
drwxr-xr-x - hdfs supergroup 0 2012-01-30 12:46 /user
drwxr-xr-x - thysmichels supergroup 0 2012-01-30 12:46 /user/thysmichels
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop/cache
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred
drwx------ - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system
-rw------- 1 mapred supergroup 4 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system/jobtracker.info
Step 5: Now we can run one of the bundled Hadoop examples. They are packaged in hadoop-examples.jar, which can be found at the following path:
/usr/lib/hadoop/hadoop-examples.jar
If you want to learn how to write your own MapReduce job for Hadoop, see this link:
https://thysmichels.com/2012/01/30/write-your-own…reduce-in-java/
Here we will run hadoop-examples.jar to estimate Pi using the Monte Carlo method: each map task scatters sample points over a unit square and counts how many land inside the inscribed circle, and that fraction approximates Pi/4. The two arguments are the number of maps and the number of samples per map:
hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 1000
Output:
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/01/30 13:08:39 INFO mapred.FileInputFormat: Total input paths to process : 10
12/01/30 13:08:41 INFO mapred.JobClient: Running job: job_201201301242_0001
12/01/30 13:08:42 INFO mapred.JobClient: map 0% reduce 0%
12/01/30 13:09:10 INFO mapred.JobClient: map 10% reduce 0%
12/01/30 13:09:13 INFO mapred.JobClient: map 20% reduce 0%
12/01/30 13:09:19 INFO mapred.JobClient: map 30% reduce 0%
12/01/30 13:09:22 INFO mapred.JobClient: map 40% reduce 0%
12/01/30 13:09:28 INFO mapred.JobClient: map 50% reduce 10%
12/01/30 13:09:34 INFO mapred.JobClient: map 60% reduce 10%
12/01/30 13:09:41 INFO mapred.JobClient: map 60% reduce 16%
12/01/30 13:09:44 INFO mapred.JobClient: map 60% reduce 20%
12/01/30 13:09:48 INFO mapred.JobClient: map 70% reduce 20%
12/01/30 13:09:54 INFO mapred.JobClient: map 90% reduce 20%
12/01/30 13:09:57 INFO mapred.JobClient: map 90% reduce 23%
12/01/30 13:10:00 INFO mapred.JobClient: map 100% reduce 23%
12/01/30 13:10:03 INFO mapred.JobClient: map 100% reduce 66%
12/01/30 13:10:09 INFO mapred.JobClient: map 100% reduce 100%
12/01/30 13:10:14 INFO mapred.JobClient: Job complete: job_201201301242_0001
12/01/30 13:10:14 INFO mapred.JobClient: Counters: 30
12/01/30 13:10:14 INFO mapred.JobClient: Job Counters
12/01/30 13:10:14 INFO mapred.JobClient: Launched reduce tasks=1
12/01/30 13:10:14 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=116847
12/01/30 13:10:14 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient: Launched map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient: Data-local map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=58610
12/01/30 13:10:14 INFO mapred.JobClient: File Input Format Counters
12/01/30 13:10:14 INFO mapred.JobClient: Bytes Read=1180
12/01/30 13:10:14 INFO mapred.JobClient: File Output Format Counters
12/01/30 13:10:14 INFO mapred.JobClient: Bytes Written=97
12/01/30 13:10:14 INFO mapred.JobClient: FileSystemCounters
12/01/30 13:10:14 INFO mapred.JobClient: FILE_BYTES_READ=226
12/01/30 13:10:14 INFO mapred.JobClient: HDFS_BYTES_READ=2410
12/01/30 13:10:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=240099
12/01/30 13:10:14 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
12/01/30 13:10:14 INFO mapred.JobClient: Map-Reduce Framework
12/01/30 13:10:14 INFO mapred.JobClient: Map output materialized bytes=280
12/01/30 13:10:14 INFO mapred.JobClient: Map input records=10
12/01/30 13:10:14 INFO mapred.JobClient: Reduce shuffle bytes=280
12/01/30 13:10:14 INFO mapred.JobClient: Spilled Records=40
12/01/30 13:10:14 INFO mapred.JobClient: Map output bytes=180
12/01/30 13:10:14 INFO mapred.JobClient: Total committed heap usage (bytes)=1574019072
12/01/30 13:10:14 INFO mapred.JobClient: CPU time spent (ms)=6470
12/01/30 13:10:14 INFO mapred.JobClient: Map input bytes=240
12/01/30 13:10:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=1230
12/01/30 13:10:14 INFO mapred.JobClient: Combine input records=0
12/01/30 13:10:14 INFO mapred.JobClient: Reduce input records=20
12/01/30 13:10:14 INFO mapred.JobClient: Reduce input groups=20
12/01/30 13:10:14 INFO mapred.JobClient: Combine output records=0
12/01/30 13:10:14 INFO mapred.JobClient: Physical memory (bytes) snapshot=1775329280
12/01/30 13:10:14 INFO mapred.JobClient: Reduce output records=0
12/01/30 13:10:14 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5800841216
12/01/30 13:10:14 INFO mapred.JobClient: Map output records=20
Job Finished in 95.861 seconds
Estimated value of Pi is 3.14800000000000000000
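The final line reports the estimate (3.148 here). Since the last argument controls the number of samples per map, the estimate should get closer to 3.14159 if you rerun the job with a larger sample count, for example:

hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 100000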
Navigate to the following URLs to see the Hadoop web interfaces:
Job Tracker:
http://localhost:50030/
Task Tracker:
http://localhost:50060/
HDFS Name Node:
http://localhost:50070/
If you would like to play around a bit more, below are all the example programs available in hadoop-examples.jar:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets.
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort.
terasort: Run the terasort.
teravalidate: Checking results of terasort.
wordcount: A map/reduce program that counts the words in the input files.
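For example, to try wordcount, copy a local text file into HDFS and point the job at an input and output directory (the paths below are just illustrative):

hadoop fs -mkdir /user/$USER/wordcount-input
hadoop fs -put /etc/hosts /user/$USER/wordcount-input
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/$USER/wordcount-input /user/$USER/wordcount-output
hadoop fs -cat /user/$USER/wordcount-output/part-*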