Hadoop in Practice

I had the privilege to get an early release of the Hadoop in Practice book from Manning Publishers. The book has the following chapters:

Table of Contents
  1: Getting started – FREE

Part I: Data Logistics
  2: Moving Data in and out of Hadoop
  3: Data Serialization: Working with Text and Beyond

Part II: Big Data Patterns
  4: Applying MapReduce Patterns to Big Data
  5: Streamlining HDFS for Big Data
  6: Measuring and Optimizing Performance

Part III: Data Science
  7: Utilizing Data Structures and Algorithms
  8. Applying Statistics
  9. Machine Learning

Part IV: Taming the Elephant
10. Hive
11. Pig
12. Crunch and Other Technologies
13. Testing and Debugging
14: Job Coordination
15. Proficient Administration

Appendixes
  A: Related Technologies
  B: Hadoop Built-in Ingress and Egress Tools
  C: HDFS Dissected
  D: Optimized MapReduce Join Frameworks

If you are new to Hadoop, or a manager who wants to learn how Hadoop can help solve your big data challenges, then this book is for you.

You can purchase the book online here:

http://www.manning.com/holmes/

Great investment and lots of great content.

Hadoop Multi-node setup on Ubuntu

Setting up a Hadoop multi-node instance on Ubuntu can be challenging. In my case I used my laptop, which was tricky as I ran 2 VMs with 2GB RAM each and everything was a bit slow…thanks to my new Apple MacBook Pro with 8GB RAM I had no worries.

I will break this tutorial into a few parts just to make it more organized and so you can track your progress. Remember that you will have to follow each of these parts twice, once on each of your machines (master and slave).

  • Part 1: Setting up your Ubuntu Environment 
  • Part 2: Configure the /etc/hosts file 
  • Part 3: SSH Setup 
  • Part 4: Downloading and configuring Hadoop
  • Part 5: Configure Master Slave Settings
  • Part 6: Starting Master Slave Setup
  • Part 7: Running first Map Reduce on Multi-Node Setup

Part 1: Setting up your Ubuntu Environment:

By default Ubuntu does not come with Sun Java installed, so you will have to install it. An easy way to do this from the command line:

> sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
> sudo apt-get update
> sudo apt-get install sun-java6-jdk

Once Java is installed, export JAVA_HOME:

> export JAVA_HOME=/usr/lib/jvm/java-6-sun

By default Ubuntu does not come with SSH installed, so let’s install it from the command line:

> sudo apt-get install ssh

It is not recommended to run Hadoop under your current user/group, so we will create a new user and group. We will call the user hduser and the group hd. The commands look as follows:

> sudo addgroup hd
> sudo adduser --ingroup hd hduser

The last thing we need to do is disable IPv6 for Hadoop. After you have downloaded Hadoop, add this line to conf/hadoop-env.sh:

> export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Part 2: Configure the /etc/hosts file

You need to set up the /etc/hosts file with the master and slave IP addresses. Run the following command to edit the hosts file:

> sudo vi /etc/hosts (use gedit if you don't know vi)

Add the following lines:

172.*.*.*       master
172.*.*.*       slave

Run the ifconfig command on both the master and slave machines to determine their IP addresses, then fill the addresses in where I have *.

Part 3: SSH Setup

Let’s configure SSH; run:

> su - hduser
> ssh-keygen -t rsa -P ""
> cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

On the master machine, run the following:

> hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave

To test that SSH works for both master and slave, run:

> ssh master
> ssh slave

Part 4: Downloading and configuring Hadoop

First we need to download the latest Hadoop release and extract it to our local filesystem. Download the latest Hadoop from: http://www.reverse.net/pub/apache//hadoop/common/

Extract Hadoop: tar -xvf hadoop*.tar.gz

Now we need to change ownership of the extracted Hadoop folder to hduser. We can do that with the following command:

> sudo chown -R hduser:hd /home/user/Downloads/hadoop

It is best to move the hadoop folder out of the Downloads folder. You can do that with the following command:

mv /home/user/Downloads/hadoop /usr/local/

Now we need to configure $HOME/.bashrc with the Hadoop variables. Enter the following commands:

> cd ~
> sudo vi .bashrc (if you don't know vi, you can type: sudo gedit .bashrc)

Add the following lines to the end

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Now we are going to create a folder which Hadoop will use to store its data files:

> sudo mkdir -p /app/hadoop/tmp
> sudo chown hduser:hd /app/hadoop/tmp

Good, now we can edit the *-site.xml files in hadoop/conf. We will add properties to 3 files:

  • conf/core-site.xml
  • conf/hdfs-site.xml
  • conf/mapred-site.xml

Add the following property tags to core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>Temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>Default file system.</description>
</property>

Add the following property tags to mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>MapReduce job tracker.</description>
</property>

Add the following property tags to hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
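
As an optional sanity check (my own addition, not part of the original setup), below is a minimal Java sketch that loads these configuration files and prints the values a client will actually use. The class name ConfCheck is hypothetical, and the config paths assume Hadoop was moved to /usr/local/hadoop as described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class to confirm which NameNode the client configuration points at.
public class ConfCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Explicitly add the cluster config files; adjust the paths if Hadoop lives elsewhere.
    conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));

    // Should print hdfs://master:54310 and 2 if the files above were picked up.
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    System.out.println("dfs.replication = " + conf.get("dfs.replication"));

    // Once the cluster is up, this resolves to the distributed file system for that URI.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("File system URI: " + fs.getUri());
    fs.close();
  }
}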

Part 5: Configure Master Slave Settings
We will configure the following 2 files on both the master and slave machines.

  • conf/masters
  • conf/slaves

Let’s start with the Master machine:

  • Open the following file: conf/masters and change ‘localhost’ to ‘master’:
master
  • Open the following file: conf/slaves and change ‘localhost’ to ‘master’ and ‘slave’
master
slave

Now on the Slave machine:

  • Open the following file: conf/masters and change ‘localhost’ to ‘slave’:
slave
  • Open the following file: conf/slaves and change ‘localhost’ to ‘slave’
slave

Part 6: Starting your Master Slave Setup
Note: all of the steps below are done on the master machine.
The first thing we need to do is format the Hadoop NameNode; run:

> hadoop namenode -format

Starting a multi-node cluster takes two steps:

  • Start HDFS daemons, run the following command in hadoop/bin
>./start-dfs.sh

Run the following command on the master > jps

14399 NameNode
16244 DataNode
16312 SecondaryNameNode
12215 Jps

Run the following command on the slave > jps

11501 DataNode
11612 Jps

  • Start MapReduce daemons, run the following command in hadoop/bin
>./start-mapred.sh

Run the following command on the master > jps

14399 NameNode
16244 DataNode
16312 SecondaryNameNode
18215 Jps
17102 JobTracker
17211 TaskTracker

Run the following command on the slave > jps

11501 DataNode
11712 Jps
11695 TaskTracker

Part 7: Running first Map Reduce on Multi-Node Setup

If everything was successful, you can now run your first multi-node MapReduce job.

Let’s get some ebooks in UTF-8 format:

http://www.gutenberg.org/ebooks/118

Now we need to push the book to HDFS. Run the following command, editing the path and filename to where you saved the book:

> hadoop dfs -copyFromLocal /home/user/Downloads/*.txt /user/hduser/hdinput

Let’s run the MapReduce example that counts the number of words in the document:

> hadoop jar ../hadoopexamples-1.0.0.jar wordcount /user/hduser/hdinput /user/hduser/hdinput_result

Check the following logs on the slave machine to see which MapReduce tasks were completed:
> hadoop-hduser-tasktracker-ubuntu.log
> hadoop-hduser-jobtracker-ubuntu.log
> hadoop-hduser-datanode-ubuntu.log
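
If you would rather write the word count job yourself instead of using the bundled examples jar, below is a minimal sketch using the standard org.apache.hadoop.mapreduce API. This is my own illustration of the pattern, not the code inside the jar above; the class name MyWordCount is hypothetical. Package it as a jar and run it with the same input and output HDFS paths as arguments:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "my word count");
    job.setJarByClass(MyWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/hduser/hdinput
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/hduser/hdinput_result
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}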

If you get stuck or hit an error, check my other blog post with tips for running Hadoop on Ubuntu:

https://thysmichels.com/2012/02/11/tips-running-hadoop-on-ubuntu/

Hope this was helpful. If you have any questions, please leave a comment.

Working with HDFS Java Example

This is a Java example that shows how we can work with the Hadoop File System (HDFS).

A prerequisite for using this code in Eclipse is that you download and add the following jars to your project libraries:

  • hadoop-core-0.20.2.jar
  • commons-logging-*.jar

See comments in code:

import java.io.IOException;
//hadoop imports
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

/**
 * @author thysmichels
 *
 */
public class HDFSWordCounter {

    // change this to a String arg in main
    public static final String inputfile = "hdfsinput.txt";
    public static final String inputmsg = "Count the amount of words in this sentence!\n";

    /**
     * @param args
     */
    public static void main(String[] args) throws IOException {
        // Create a default hadoop configuration
        Configuration config = new Configuration();
        // Pass the created config to the HDFS FileSystem
        FileSystem fs = FileSystem.get(config);
        // Specifies a new file in HDFS.
        Path filenamePath = new Path(inputfile);

        try
        {
            // if the file already exists, delete it
            if (fs.exists(filenamePath))
            {
                // remove the file
                fs.delete(filenamePath, true);
            }

            // FSDataOutputStream to write the inputmsg into the HDFS file
            FSDataOutputStream fin = fs.create(filenamePath);
            fin.writeUTF(inputmsg);
            fin.close();

            // FSDataInputStream to read out of the filenamePath file
            FSDataInputStream fout = fs.open(filenamePath);
            String msgIn = fout.readUTF();
            // Print to screen
            System.out.println(msgIn);
            fout.close();
        }
        catch (IOException ioe)
        {
            System.err.println("IOException during operation " + ioe.toString());
            System.exit(1);
        }
    }
}

In this example we created an HDFS Configuration, specified a Path for our file, wrote a string to the file, and read the string back out of the file using the HDFS library.

Play around with this to solve more intricate problems.
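
As one direction to explore, here is a small follow-on sketch (my own example, using the same FileSystem API; the class name HDFSListFiles is hypothetical) that lists the contents of an HDFS directory. The /user/hduser path is simply the home directory used earlier on this blog:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSListFiles {
  public static void main(String[] args) throws Exception {
    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(config);

    // Directory to list; defaults to the hduser home directory used earlier in this post.
    Path dir = new Path(args.length > 0 ? args[0] : "/user/hduser");

    // listStatus returns one FileStatus per entry in the directory.
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println((status.isDir() ? "dir  " : "file ")
          + status.getPath() + "\t" + status.getLen() + " bytes");
    }
    fs.close();
  }
}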

Tips running Hadoop on Ubuntu

Below are some tips for running Hadoop on Ubuntu. If you find other errors running Hadoop on Ubuntu, please comment with the problem and how you solved it.

When you get this Warning: $HADOOP_HOME is deprecated

Solution: add export HADOOP_HOME_WARN_SUPPRESS="TRUE" to hadoop-env.sh.

Cannot create directory `/usr/local/hadoop/libexec/../logs

Solution: sudo chown -R hduser:hadoop /usr/local/hadoop/

Enter passphrase when running ./start-all.sh

Solution: ssh-keygen -t rsa -P ""     (creates an SSH key without a passphrase).

Warning: <property>/<configuration> not set

Solution: make sure the <property> and <configuration> tags are populated in core-site.xml, mapred-site.xml and hdfs-site.xml.

Send or retrieve file to and from HDFS

Solution:

Send file to HDFS > bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

Get file from HDFS > bin/hadoop dfs -get foo localFile

ssh: connect to host localhost port 22: Connection refused
Solution: By default Ubuntu will not have ssh installed so you will have to install and start it.

Install > sudo apt-get install ssh

Start > sudo service ssh start

hadoop Incompatible namespaceIDs in /app/hadoop/tmp/*

Solution: 

  1. Stop the cluster: ./stop-dfs.sh
  2. Delete the directory specified on the DataNode: rm -r /app/hadoop/tmp/*
  3. Reformat the NameNode: hadoop namenode -format

OR

  1. Stop the DataNode: ./stop-dfs.sh
  2. Edit the value of namespaceID in the DataNode's current/VERSION file (under dfs.data.dir) to match the value on the current NameNode.
  3. Restart the DataNode: ./start-dfs.sh

hadoop java.net.UnknownHostException: ubuntu: ubuntu

Solution: 

1. Add ubuntu as your localhost IP to your /etc/hosts file: sudo vi /etc/hosts

2. Restart your network: sudo /etc/init.d/networking restart

So your /etc/hosts file on your master machine will look something like this:

172.16.62.152      master
172.16.62.151      slave
172.16.62.152      ubuntu

On your slave machine

172.16.62.152      master
172.16.62.151      slave
172.16.62.151      ubuntu

If none of this works, you can change the master/localhost hostname to the IP address in core-site.xml and mapred-site.xml.

Installing Hadoop on Windows

Below are the steps you can follow to install Hadoop on Windows:

Step 1. Download the following file: http://www.poolsaboveground.com/apache//hadoop/core/hadoop-0.23.0/hadoop-0.23.0.tar.gz/

Step 2. Copy it into the C:/Cygwin/home folder.

Step 3. Extract: tar -xvf hadoop-0.23.0.tar.gz

Step 4. Open up /hadoop/conf/yarn-site.xml. Copy the following between the <configuration></configuration> tags:

<!-- Site specific YARN configuration properties -->
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9100</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9101</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

Step 5. Create log folder: hadoop> mkdir logs
Step 6. Format the NameNode:

hadoop>bin/hadoop namenode -format 

Output:

Formatting using clusterid: CID-e8adf4f5-d339-40aa-8845-3dea10a28701
12/01/30 19:20:03 INFO util.HostsFileReader: Refreshing hosts (include/exclude) list
12/01/30 19:20:03 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
12/01/30 19:20:03 INFO util.GSet: VM type       = 64-bit
12/01/30 19:20:03 INFO util.GSet: 2% max memory = 17.77875 MB
12/01/30 19:20:03 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/01/30 19:20:03 INFO util.GSet: recommended=2097152, actual=2097152
12/01/30 19:20:03 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
12/01/30 19:20:03 INFO blockmanagement.BlockManager: defaultReplication = 3
12/01/30 19:20:03 INFO blockmanagement.BlockManager: maxReplication     = 512
12/01/30 19:20:03 INFO blockmanagement.BlockManager: minReplication     = 1
12/01/30 19:20:03 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
12/01/30 19:20:03 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
12/01/30 19:20:03 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
12/01/30 19:20:04 INFO namenode.FSNamesystem: fsOwner=thys_michels
12/01/30 19:20:04 INFO namenode.FSNamesystem: supergroup=supergroup
12/01/30 19:20:04 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/01/30 19:20:05 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/01/30 19:20:06 INFO namenode.NNStorage: Storage directory \test\dfs\name has been successfully formatted.
12/01/30 19:20:06 INFO namenode.FSImage: Saving image file \test\dfs\name\current\fsimage.ckpt_0000000000000000000 using no compression
12/01/30 19:20:06 INFO namenode.FSImage: Image file of size 127 saved in 0 seconds.
12/01/30 19:20:06 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
12/01/30 19:20:06 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at us-lap085/64.9.237.126
************************************************************/

Step 7. Start Cluster

bin/start-dfs.sh

Java-based HDFS API Tutorial

In this tutorial I show how to use Java to interact with your Hadoop Distributed File System (HDFS) using libHDFS.

This Java program creates a file named hadoop.txt, writes a short message into it, then reads it back and prints it to the screen. If the file already existed, it is deleted first.

 

  import java.io.IOException;

  //Import LibHDFS Packages
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.Path;

  // Create Class
  public class HDFSExample {
    public static final String FileName = "hadoop.txt";
    public static final String message = "My First Hadoop API call!\n";

    public static void main(String[] args) throws IOException {
      // Initialize a new default Hadoop Configuration
      Configuration conf = new Configuration();
      // Initialize the abstract Hadoop FileSystem
      FileSystem fs = FileSystem.get(conf);
      // Specify the file path on the Hadoop File System
      Path filenamePath = new Path(FileName);

      try {
        // Check if the file already exists
        if (fs.exists(filenamePath)) {
          // if the file exists, remove it first
          fs.delete(filenamePath, true);
        }
        // Write the message to the file
        FSDataOutputStream out = fs.create(filenamePath);
        out.writeUTF(message);
        out.close();

        // Open the file and read the message back
        FSDataInputStream in = fs.open(filenamePath);
        String messageIn = in.readUTF();
        System.out.print(messageIn);
        in.close();
      } catch (IOException ioe) {
        System.err.println("IOException during operation: " + ioe.toString());
        System.exit(1);
      }
    }
  }
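
Building on the same pattern, here is a short sketch (my own addition; the class name HDFSCopyExample and the /tmp paths are placeholders) that copies a local file into HDFS and back again, the programmatic equivalent of the hadoop dfs -put and -get commands shown elsewhere on this blog:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSCopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder paths; adjust to files that exist on your machine and cluster.
    Path localFile = new Path("/tmp/interestingFile.txt");
    Path hdfsFile = new Path("/user/hduser/interestingFile.txt");

    // Equivalent of: hadoop dfs -put /tmp/interestingFile.txt /user/hduser/
    fs.copyFromLocalFile(localFile, hdfsFile);

    // Equivalent of: hadoop dfs -get /user/hduser/interestingFile.txt /tmp/copyBack.txt
    fs.copyToLocalFile(hdfsFile, new Path("/tmp/copyBack.txt"));

    fs.close();
  }
}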

For more Information:

Complete JavaDoc for the HDFS API is provided at http://wiki.apache.org/hadoop/LibHDFS

Hadoop Distributed File System (HDFS) Tutorial

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications.

One of the primary advantages of HDFS is its transparency. Clients do not need to be particularly aware that they are working on files stored remotely. The existing standard library methods like open(), close(), fread(), etc. will work on files hosted over NFS.

Configuring HDFS

The HDFS configuration files can be found in the conf/ folder of your Hadoop installation. The conf/hadoop-defaults.xml file contains default values for every parameter in Hadoop; this file is read-only. You override this configuration by setting new values in conf/hadoop-site.xml. This file should be replicated consistently across all machines in the cluster.

Configuration settings are a set of key-value pairs of the format:

  <property>
    <name>property-name</name>
    <value>property-value</value>
  </property>

The following settings are necessary to configure HDFS:
fs.default.name : This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster. e.g. hdfs://thys.michels.com:9000
dfs.data.dir : This is the path on the local file system in which the DataNode instance should store its data. e.g. /home/username/hdfs/data
dfs.name.dir : This is the path on the local file system of the NameNode instance where the NameNode metadata is stored. e.g. /home/username/hdfs/name

The result will look as follows:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>

The master node needs to know the addresses of all the machines to use as DataNodes; the startup scripts depend on this. Also in the conf/ directory, edit the file slaves so that it contains a list of fully-qualified hostnames for the slave instances, one host per line. On a multi-node setup, the master node (e.g., localhost) is not usually present in this file.

The next step is to create the data and name directories:

user@machine$ mkdir -p $HOME/hdfs/data

user@namenode$ mkdir -p $HOME/hdfs/name

These folders need read/write access (chmod +rw) for everyone who will use this node. Best practice is to create a dedicated hadoop user and group; it is not recommended to run Hadoop as root.

Running Hadoop on Ubuntu

Below are the steps to run your first Hadoop job after you have installed Hadoop.

Step 1. Format the NameNode. This initializes the directory specified by the dfs.name.dir variable.

 sudo -u hdfs hadoop namenode -format

Output

12/01/30 11:51:33 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.205.0.2
STARTUP_MSG:   build =  -r ; compiled by 'jenkins' on Thu Nov  3 02:51:06 UTC 2011
************************************************************/
12/01/30 11:51:34 INFO util.GSet: VM type       = 64-bit
12/01/30 11:51:34 INFO util.GSet: 2% max memory = 19.33375 MB
12/01/30 11:51:34 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/01/30 11:51:34 INFO util.GSet: recommended=2097152, actual=2097152
12/01/30 11:51:34 INFO namenode.FSNamesystem: fsOwner=hdfs
12/01/30 11:51:34 INFO namenode.FSNamesystem: supergroup=supergroup
12/01/30 11:51:34 INFO namenode.FSNamesystem: isPermissionEnabled=false
12/01/30 11:51:34 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/01/30 11:51:34 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/01/30 11:51:34 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/01/30 11:51:35 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/01/30 11:51:35 INFO common.Storage: Storage directory /var/lib/hadoop/cache/hadoop/dfs/name has been successfully formatted.
12/01/30 11:51:35 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

Step 2. Start the necessary Hadoop services. Below we start: hadoop-namenode, hadoop-datanode, hadoop-jobtracker and hadoop-tasktracker.

for i in hadoop-namenode hadoop-datanode hadoop-jobtracker hadoop-tasktracker ; do sudo service $i start ; done

OR run

bin/start-all.sh 

Output:

Starting Hadoop namenode daemon: starting namenode, logging to /var/log/hadoop/hadoop-hadoop-namenode-ubuntu.out
hadoop-namenode.
Starting Hadoop datanode daemon: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-ubuntu.out
hadoop-datanode.
Starting Hadoop jobtracker daemon: starting jobtracker, logging to /var/log/hadoop/hadoop-hadoop-jobtracker-ubuntu.out
hadoop-jobtracker.
Starting Hadoop tasktracker daemon: starting tasktracker, logging to /var/log/hadoop/hadoop-hadoop-tasktracker-ubuntu.out
hadoop-tasktracker.

Step 3: Once the Hadoop cluster is running, it is a good idea to create a home directory on HDFS.
Create the directory:

sudo -u hdfs hadoop fs -mkdir /user/$USER

Change ownership:

sudo -u hdfs hadoop fs -chown $USER /user/$USER

Step 4: Run the following command to list all the files currently in HDFS:

hadoop fs -lsr /

Output:

drwxr-xr-x   - hdfs   supergroup          0 2012-01-30 12:46 /user
drwxr-xr-x   - thysmichels supergroup          0 2012-01-30 12:46 /user/thysmichels
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred
drwxr-xr-x   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred
drwx------   - mapred supergroup          0 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system
-rw-------   1 mapred supergroup          4 2012-01-30 12:42 /var/lib/hadoop/cache/mapred/mapred/system/jobtracker.info

Step 5: Now we can run a Hadoop example. The examples jar contains several example jobs (including word count) and can be found at the following path:
/usr/lib/hadoop/hadoop-examples.jar
If you want to learn to code your own MapReduce job for Hadoop, see this link:
https://thysmichels.com/2012/01/30/write-your-own…reduce-in-java/

So we will run hadoop-examples.jar to estimate Pi using the Monte Carlo method:

hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 1000

Output:

Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/01/30 13:08:39 INFO mapred.FileInputFormat: Total input paths to process : 10
12/01/30 13:08:41 INFO mapred.JobClient: Running job: job_201201301242_0001
12/01/30 13:08:42 INFO mapred.JobClient:  map 0% reduce 0%
12/01/30 13:09:10 INFO mapred.JobClient:  map 10% reduce 0%
12/01/30 13:09:13 INFO mapred.JobClient:  map 20% reduce 0%
12/01/30 13:09:19 INFO mapred.JobClient:  map 30% reduce 0%
12/01/30 13:09:22 INFO mapred.JobClient:  map 40% reduce 0%
12/01/30 13:09:28 INFO mapred.JobClient:  map 50% reduce 10%
12/01/30 13:09:34 INFO mapred.JobClient:  map 60% reduce 10%
12/01/30 13:09:41 INFO mapred.JobClient:  map 60% reduce 16%
12/01/30 13:09:44 INFO mapred.JobClient:  map 60% reduce 20%
12/01/30 13:09:48 INFO mapred.JobClient:  map 70% reduce 20%
12/01/30 13:09:54 INFO mapred.JobClient:  map 90% reduce 20%
12/01/30 13:09:57 INFO mapred.JobClient:  map 90% reduce 23%
12/01/30 13:10:00 INFO mapred.JobClient:  map 100% reduce 23%
12/01/30 13:10:03 INFO mapred.JobClient:  map 100% reduce 66%
12/01/30 13:10:09 INFO mapred.JobClient:  map 100% reduce 100%
12/01/30 13:10:14 INFO mapred.JobClient: Job complete: job_201201301242_0001
12/01/30 13:10:14 INFO mapred.JobClient: Counters: 30
12/01/30 13:10:14 INFO mapred.JobClient:   Job Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/30 13:10:14 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=116847
12/01/30 13:10:14 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/30 13:10:14 INFO mapred.JobClient:     Launched map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient:     Data-local map tasks=10
12/01/30 13:10:14 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=58610
12/01/30 13:10:14 INFO mapred.JobClient:   File Input Format Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Bytes Read=1180
12/01/30 13:10:14 INFO mapred.JobClient:   File Output Format Counters 
12/01/30 13:10:14 INFO mapred.JobClient:     Bytes Written=97
12/01/30 13:10:14 INFO mapred.JobClient:   FileSystemCounters
12/01/30 13:10:14 INFO mapred.JobClient:     FILE_BYTES_READ=226
12/01/30 13:10:14 INFO mapred.JobClient:     HDFS_BYTES_READ=2410
12/01/30 13:10:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=240099
12/01/30 13:10:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
12/01/30 13:10:14 INFO mapred.JobClient:   Map-Reduce Framework
12/01/30 13:10:14 INFO mapred.JobClient:     Map output materialized bytes=280
12/01/30 13:10:14 INFO mapred.JobClient:     Map input records=10
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce shuffle bytes=280
12/01/30 13:10:14 INFO mapred.JobClient:     Spilled Records=40
12/01/30 13:10:14 INFO mapred.JobClient:     Map output bytes=180
12/01/30 13:10:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=1574019072
12/01/30 13:10:14 INFO mapred.JobClient:     CPU time spent (ms)=6470
12/01/30 13:10:14 INFO mapred.JobClient:     Map input bytes=240
12/01/30 13:10:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1230
12/01/30 13:10:14 INFO mapred.JobClient:     Combine input records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce input records=20
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce input groups=20
12/01/30 13:10:14 INFO mapred.JobClient:     Combine output records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1775329280
12/01/30 13:10:14 INFO mapred.JobClient:     Reduce output records=0
12/01/30 13:10:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5800841216
12/01/30 13:10:14 INFO mapred.JobClient:     Map output records=20
Job Finished in 95.861 seconds
Estimated value of Pi is 3.14800000000000000000
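
For intuition on what the pi example is doing, here is a small plain-Java sketch of the Monte Carlo idea (my own illustration, not the distributed implementation inside hadoop-examples.jar): sample random points in the unit square, count how many land inside the quarter circle, and multiply the ratio by 4.

import java.util.Random;

public class MonteCarloPi {
  public static void main(String[] args) {
    long samples = 1000000;  // more samples give a better estimate
    long inside = 0;
    Random rnd = new Random();

    for (long i = 0; i < samples; i++) {
      double x = rnd.nextDouble();
      double y = rnd.nextDouble();
      // (x, y) lies inside the quarter circle of radius 1 when x^2 + y^2 <= 1
      if (x * x + y * y <= 1.0) {
        inside++;
      }
    }

    // The ratio inside/samples approximates pi/4, so multiply by 4.
    System.out.println("Estimated Pi = " + 4.0 * inside / samples);
  }
}

The MapReduce version distributes the sampling across map tasks and combines the counts in the reduce step, which is why more maps and more samples per map improve the estimate.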

Navigate to the following URLs to see the Hadoop web interfaces.
Job Tracker:

http://localhost:50030/

Task Tracker:

http://localhost:50060/

HDFS Name Node:

http://localhost:50070/

If you would like to play around a bit more, below are all the arguments listed for hadoop-examples.jar:

aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.

Install Apache Big Top on Ubuntu

Below are the steps you can follow to install Apache Bigtop on Ubuntu. For those that don't know, Apache Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.

Step 1: First you have to install the Bigtop GPG key

wget -O- http://www.apache.org/dist/incubator/bigtop/stable/repos/GPG-KEY-bigtop | sudo apt-key add -

Step 2: Do a wget to retrieve the Bigtop repo file and add it to bigtop.list

sudo wget -O /etc/apt/sources.list.d/bigtop.list http://www.apache.org/dist/incubator/bigtop/stable/repos/ubuntu/bigtop.list

Step 3: Now we have all the Bigtop repo entries and just need to select the one closest to your geography:

sudo vi /etc/apt/sources.list.d/bigtop.list

Uncomment (remove the #) in front of the deb and deb-src lines for the repo that is closest to you. Uncomment one and only one pair of deb/deb-src lines.

Step 4: Now we can update apt

sudo apt-get update 

Step 5: Let's see if Hadoop is part of the repo we selected

apt-cache search hadoop

You will see all the hadoop packages

Step 6: Install Sun Java; see this post on how to install Sun Java from the command line:
https://thysmichels.com/2012/01/30/install-sun-java-on-ubuntu-using-command-lin/

Step 7: export JAVA_HOME=XXXX

Step 8: Install the Hadoop stack

sudo apt-get install hadoop\* flume-* mahout\* oozie\* whirr-*