Below are the steps to run your first Hadoop job after you have installed Hadoop.
Step 1: Format the NameNode. This initializes the directory specified by the dfs.name.dir property.
sudo -u hdfs hadoop namenode -format
Output
12/01/30 11:51:33 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.205.0.2
STARTUP_MSG: build = -r ; compiled by 'jenkins' on Thu Nov 3 02:51:06 UTC 2011
************************************************************/
12/01/30 11:51:34 INFO util.GSet: VM type = 64-bit
12/01/30 11:51:34 INFO util.GSet: 2% max memory = 19.33375 MB
12/01/30 11:51:34 INFO util.GSet: capacity = 2^21 = 2097152 entries
12/01/30 11:51:34 INFO util.GSet: recommended=2097152, actual=2097152
12/01/30 11:51:34 INFO namenode.FSNamesystem: fsOwner=hdfs
12/01/30 11:51:34 INFO namenode.FSNamesystem: supergroup=supergroup
12/01/30 11:51:34 INFO namenode.FSNamesystem: isPermissionEnabled=false
12/01/30 11:51:34 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/01/30 11:51:34 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/01/30 11:51:34 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/01/30 11:51:35 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/01/30 11:51:35 INFO common.Storage: Storage directory /var/lib/hadoop/cache/hadoop/dfs/name has been successfully formatted.
12/01/30 11:51:35 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
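The directory that was just formatted comes from the dfs.name.dir property. A minimal sketch of how it might be set in conf/hdfs-site.xml; the path below is the one shown in the output above, so treat it as an example value, not a requirement:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where the NameNode stores the HDFS namespace image and edit log. -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/cache/hadoop/dfs/name</value>
  </property>
</configuration>
```

If the property is not set, Hadoop falls back to a default under hadoop.tmp.dir, so packaged installs usually pin it down explicitly as above.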
Step 2: Start the necessary Hadoop services. Below we start hadoop-namenode, hadoop-datanode, hadoop-jobtracker and hadoop-tasktracker:
for i in hadoop-namenode hadoop-datanode hadoop-jobtracker hadoop-tasktracker ; do sudo service $i start ; done
OR run
bin/start-all.sh
Output:
Starting Hadoop namenode daemon: starting namenode, logging to /var/log/hadoop/hadoop-hadoop-namenode-ubuntu.out
hadoop-namenode.
Starting Hadoop datanode daemon: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-ubuntu.out
hadoop-datanode.
Starting Hadoop jobtracker daemon: starting jobtracker, logging to /var/log/hadoop/hadoop-hadoop-jobtracker-ubuntu.out
hadoop-jobtracker.
Starting Hadoop tasktracker daemon: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-ubuntu.out
hadoop-tasktracker.
Step 3: Once the Hadoop cluster is running, it is a good idea to create a home directory on HDFS.
Create the directory:
sudo -u hdfs hadoop fs -mkdir /user/$USER
Change ownership:
sudo -u hdfs hadoop fs -chown $USER /user/$USER
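The mkdir-then-chown pattern above can be illustrated locally without touching HDFS at all; here a temp directory stands in for the HDFS root and "alice" is a placeholder username:

```shell
# Local illustration only (no HDFS involved): create a per-user home
# directory under a stand-in root, mirroring /user/$USER on HDFS.
base=$(mktemp -d)
mkdir -p "$base/user/alice"
test -d "$base/user/alice" && echo "home directory created"
rm -rf "$base"
```

On a real cluster the chown step matters because the hdfs superuser creates the directory, but your own user needs to own it to write job output there.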
Step 4: Run the following command to recursively list all the files currently in HDFS:
hadoop fs -lsr /
Output:
drwxr-xr-x - hdfs supergroup 0 2012-01-30 12:46 /user
drwxr-xr-x - thysmichels supergroup 0 2012-01-30 12:46 /user/thysmichels
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop/cache
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred
drwxr-xr-x - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred
drwx------ - mapred supergroup 0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system
-rw------- 1 mapred supergroup 4 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system/jobtracker.info
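Each line of the -lsr output ends with the path, so standard shell tools can post-process the listing; a small sketch, using sample lines copied from the output above:

```shell
# The path is the last whitespace-separated field of each -lsr line,
# so awk's $NF extracts just the path column.
listing='drwxr-xr-x - hdfs supergroup 0 2012-01-30 12:46 /user
drwxr-xr-x - thysmichels supergroup 0 2012-01-30 12:46 /user/thysmichels'
echo "$listing" | awk '{ print $NF }'
```

The same pipe works directly on a live cluster, e.g. `hadoop fs -lsr / | awk '{ print $NF }'`.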
Step 5: Now we can run a Hadoop example. The examples jar, which bundles several sample jobs (including a word count), can be found at the following path:
/usr/lib/hadoop/hadoop-examples.jar
If you want to learn to write your own MapReduce job for Hadoop, see this link:
https://thysmichels.com/2012/01/30/write-your-own…reduce-in-java/
We will run hadoop-examples.jar to estimate Pi using the Monte Carlo method:
hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 1000
Output:
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/01/30 13:08:39 INFO mapred.FileInputFormat: Total input paths to process : 10
12/01/30 13:08:41 INFO mapred.JobClient: Running job: job_201201301242_0001
12/01/30 13:08:42 INFO mapred.JobClient: map 0% reduce 0%
12/01/30 13:09:10 INFO mapred.JobClient: map 10% reduce 0%
12/01/30 13:09:13 INFO mapred.JobClient: map 20% reduce 0%
12/01/30 13:09:19 INFO mapred.JobClient: map 30% reduce 0%
12/01/30 13:09:22 INFO mapred.JobClient: map 40% reduce 0%
12/01/30 13:09:28 INFO mapred.JobClient: map 50% reduce 10%
12/01/30 13:09:34 INFO mapred.JobClient: map 60% reduce 10%
12/01/30 13:09:41 INFO mapred.JobClient: map 60% reduce 16%
12/01/30 13:09:44 INFO mapred.JobClient: map 60% reduce 20%
12/01/30 13:09:48 INFO mapred.JobClient: map 70% reduce 20%
12/01/30 13:09:54 INFO mapred.JobClient: map 90% reduce 20%
12/01/30 13:09:57 INFO mapred.JobClient: map 90% reduce 23%
12/01/30 13:10:00 INFO mapred.JobClient: map 100% reduce 23%
12/01/30 13:10:03 INFO mapred.JobClient: map 100% reduce 66%
12/01/30 13:10:09 INFO mapred.JobClient: map 100% reduce 100%
12/01/30 13:10:14 INFO mapred.JobClient: Job complete: job_201201301242_0001
12/01/30 13:10:14 INFO mapred.JobClient: Counters: 30
12/01/30 13:10:14 INFO mapred.JobClient: Job Counters
12/01/30 13:10:14 INFO mapred.JobClient: Launched reduce tasks=1
12/01/30 13:10:14 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=116847
12/01/30 13:10:14 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient: Launched map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient: Data-local map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=58610
12/01/30 13:10:14 INFO mapred.JobClient: File Input Format Counters
12/01/30 13:10:14 INFO mapred.JobClient: Bytes Read=1180
12/01/30 13:10:14 INFO mapred.JobClient: File Output Format Counters
12/01/30 13:10:14 INFO mapred.JobClient: Bytes Written=97
12/01/30 13:10:14 INFO mapred.JobClient: FileSystemCounters
12/01/30 13:10:14 INFO mapred.JobClient: FILE_BYTES_READ=226
12/01/30 13:10:14 INFO mapred.JobClient: HDFS_BYTES_READ=2410
12/01/30 13:10:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=240099
12/01/30 13:10:14 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
12/01/30 13:10:14 INFO mapred.JobClient: Map-Reduce Framework
12/01/30 13:10:14 INFO mapred.JobClient: Map output materialized bytes=280
12/01/30 13:10:14 INFO mapred.JobClient: Map input records=10
12/01/30 13:10:14 INFO mapred.JobClient: Reduce shuffle bytes=280
12/01/30 13:10:14 INFO mapred.JobClient: Spilled Records=40
12/01/30 13:10:14 INFO mapred.JobClient: Map output bytes=180
12/01/30 13:10:14 INFO mapred.JobClient: Total committed heap usage (bytes)=1574019072
12/01/30 13:10:14 INFO mapred.JobClient: CPU time spent (ms)=6470
12/01/30 13:10:14 INFO mapred.JobClient: Map input bytes=240
12/01/30 13:10:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=1230
12/01/30 13:10:14 INFO mapred.JobClient: Combine input records=0
12/01/30 13:10:14 INFO mapred.JobClient: Reduce input records=20
12/01/30 13:10:14 INFO mapred.JobClient: Reduce input groups=20
12/01/30 13:10:14 INFO mapred.JobClient: Combine output records=0
12/01/30 13:10:14 INFO mapred.JobClient: Physical memory (bytes) snapshot=1775329280
12/01/30 13:10:14 INFO mapred.JobClient: Reduce output records=0
12/01/30 13:10:14 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5800841216
12/01/30 13:10:14 INFO mapred.JobClient: Map output records=20
Job Finished in 95.861 seconds
Estimated value of Pi is 3.14800000000000000000
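The Monte Carlo idea behind the pi example can be sketched locally with awk, no cluster needed: each sample is a random point in the unit square, and the fraction landing inside the quarter circle approximates pi/4. This is just the idea in miniature, not the Hadoop job itself, which spreads the sampling across map tasks and sums the hits in the reducer.

```shell
# Single-machine Monte Carlo sketch of the pi example's math:
# sample random (x, y) points in [0,1)x[0,1) and count how many fall
# inside the quarter circle x^2 + y^2 <= 1; that fraction is ~pi/4.
awk 'BEGIN {
  srand(1); n = 100000; hits = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x * x + y * y <= 1) hits++
  }
  printf "estimated pi = %.4f\n", 4 * hits / n
}'
```

With 100,000 samples the estimate lands near 3.14, which matches why the job output above gets more accurate as the map count (and thus total samples) grows.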
Navigate to the following URLs to see the Hadoop web interfaces:
Job Tracker:
http://localhost:50030/
Task Tracker:
http://localhost:50060/
HDFS Name Node:
http://localhost:50070/
If you would like to play around a bit more, below are all the arguments available for hadoop-examples.jar:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that counts the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets.
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using the Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort.
terasort: Run the terasort.
teravalidate: Check the results of the terasort.
wordcount: A map/reduce program that counts the words in the input files.
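As a taste of what the wordcount argument computes, here is a local stand-in built from plain shell tools: the same word-to-count result, with pipes playing the role of the map (split into words) and reduce (count per word) phases.

```shell
# Local sketch of wordcount's output: split text into one word per line
# (the "map"), then group and count identical words (the "reduce").
printf 'hello hadoop\nhello world\n' | tr ' ' '\n' | sort | uniq -c | sort -rn
```

On the cluster the equivalent would be `hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount <input> <output>` with HDFS paths of your choosing.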