Running local mrjob streaming Hadoop jobs

Follow the steps below to run a local mrjob. In this example I run an mrjob that calculates word frequency.

Prerequisite: Python 2.6 or 2.7 must be installed for this to work.

Step 1. Download mrjob:

https://github.com/Yelp/mrjob

Step 2: Navigate to the Yelp/mrjob/examples directory in your terminal.

Step 3: Create a dataset, or download one from http://www.infochimps.com.

Step 4: Test your environment and make sure mrjob works. In a Python shell, run:

import mrjob

If mrjob is installed correctly, this import will complete with no errors or dependency issues.
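
As a quick sanity check you can also put the test in a small script, something like the sketch below (check_env.py is just an illustrative name; it assumes mrjob was installed into the Python 2.6/2.7 environment you are using):

#check_env.py - confirm the Python version and that mrjob imports cleanly
import sys
import mrjob

print "Python version:", sys.version
print "mrjob imported successfully"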

Step 5: Run your mrjob:

python mr_word_freq_count.py log1 > counts

The log1 input was (note: each line is tab-delimited):

test	
one	
two	
three	
four	
five	
one	
two	
test	

Result:

"five"	1
"four"	1
"one"	2
"test"	2
"three"	1
"two"	2

Map and Reduce Python Script Example

Below is an example of your first Mapper, Reducer and a sample dataset.

Let’s look at the Mapper.py file:

import sys
from numpy import mat, mean, power

#generator that reads a file line by line
def read_input(file):
    for line in file:
        #return each line with trailing characters removed (same as Trim())
        yield line.rstrip()

#read the input lines from stdin
input = read_input(sys.stdin)

#cast each line to a float
input = [float(line) for line in input]

#determine the number of inputs
numInputs = len(input)

#convert the list to a matrix
input = mat(input)

#form a vector of squared values
sqInput = power(input, 2)

#output size, mean and mean of the squared values - the three values passed to the reducer
print numInputs, mean(input), mean(sqInput)

#status message so Hadoop Streaming knows the task is still alive
print >> sys.stderr, "report: still alive"

if __name__ == '__main__':
    pass

Now for Reducer.py. Three elements per line are passed from the Mapper to the Reducer:
numInputs, mean(input), mean(sqInput)

import sys
from numpy import mat, mean, power

#generator that reads a file line by line, stripping trailing characters
def read_input(file):
    for line in file:
        yield line.rstrip()

#read the mapper output lines from stdin
input = read_input(sys.stdin)

#split each line into its 3 fields and store them as a list of lists
mapperOut = [instance.split() for instance in input]

#cumulative sample count (cumN), cumulative sum (cumVal) and cumulative sum of squares (cumSumSq)
cumVal = 0.0
cumSumSq = 0.0
cumN = 0.0

for instance in mapperOut:
    #number of samples reported by this mapper
    nj = float(instance[0])
    #add this mapper's sample count to the running total
    cumN = cumN + nj
    #weight the mapper's mean and mean-of-squares by its sample count
    cumVal = cumVal + nj*float(instance[1])
    cumSumSq = cumSumSq + nj*float(instance[2])

#calculate the overall mean
mean = cumVal/cumN
#calculate the overall mean of the squared values
meanSq = cumSumSq/cumN

#output size, mean and mean of the squared values
print cumN, mean, meanSq
print >> sys.stderr, "report: still alive"

if __name__ == '__main__':
    pass
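
To try the pair without Hadoop, you can emulate the streaming pipeline locally by piping a dataset through the mapper and then the reducer. Below is a minimal sketch using subprocess; Mapper.py and Reducer.py are the files above, while inputFile.txt is an illustrative name for wherever you saved the sample dataset:

#test_pipeline.py - emulate the Hadoop Streaming pipeline locally
import subprocess

with open('inputFile.txt') as data:
    #dataset -> Mapper.py -> Reducer.py, the same chain Hadoop Streaming builds
    mapper = subprocess.Popen(['python', 'Mapper.py'],
                              stdin=data, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(['python', 'Reducer.py'],
                               stdin=mapper.stdout, stdout=subprocess.PIPE)
    #close our handle so the reducer sees EOF once the mapper finishes
    mapper.stdout.close()
    output, _ = reducer.communicate()

#the reducer prints: count, mean, mean of the squared values
print output

(A real Hadoop run also sorts and groups the mapper output before the reduce phase; with a single mapper emitting a single line, that step makes no difference here.)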

See the sample dataset:

0.865670009848
0.240464946103
0.38583753445
0.851896046359
0.56613365811
0.901353547484
0.47530934886
0.903698474043
0.690057722624
0.549349071622
0.374166366825
0.63335531551
0.607434274558
0.1626603772
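
As a quick cross-check, you can compute the same three numbers directly with numpy in a single process (again assuming the sample dataset is saved one value per line as inputFile.txt, an illustrative name):

#verify_stats.py - compute the same statistics as the map/reduce pair in one process
from numpy import loadtxt, mean, power

#one value per line
data = loadtxt('inputFile.txt')
sqData = power(data, 2)

#count, mean and mean of the squared values - should match the reducer output
print len(data), mean(data), mean(sqData)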

Running MapReduce on Amazon Elastic MapReduce

Below are the steps to write your first MapReduce job on Amazon EMR.

Step 1: Register on Amazon: aws.amazon.com/free

After you log in, we will complete the following two parts:

  • Part 1: Create the buckets for our input file, map and reduce file
  • Part 2: Create our map reduce job

Part 1: Create the buckets for our input file, map and reduce file:

Step 2. Navigate to the S3 tab inside your console.

Step 3: Click on the ‘Create Bucket’ button:

Make sure your bucket name is lower-case and contains no special symbols. Bucket names must be globally unique, so you may have to try a few names, as many are already taken.

Step 4: Create two folders inside your bucket by clicking on the ‘Create Folder’ button.

Your folder structure will look as follows:

s3://mlbucket/mlcode

s3://mlbucket/mlinput

Don’t worry about any extra folders you see; they will make sense after you have created your MapReduce job flow.

Step 5: Upload your MapReduce dataset into the following folder:

s3://mlbucket/mlinput

Step 6: Upload your Mapper and Reducer Python files into the following folder:

s3://mlbucket/mlcode

Part 2: Create our map reduce job

Step 7: Log in and navigate to the Elastic MapReduce tab in your AWS Management Console.

Step 8: Select the region in which to run your MapReduce job. It is very important that you create your MapReduce job in the same region where your buckets and key pairs were created.

Step 9: Click on the ‘Create new job flow’ button to start the MapReduce wizard.

Step 10: Name your MapReduce job and specify whether it is your own application or a sample application. In this instance we will be creating our own MapReduce job and selecting Streaming as the Job Type.

Why Streaming: A Streaming job flow runs a single Hadoop job consisting of map and reduce functions that you have uploaded to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.

Step 11: Click on ‘Continue’ to specify the parameters for your MapReduce job:

Here is an explanation of the parameters (example values follow the list):

Input location: the location of your dataset.

Output location: where you want your MapReduce output to be written.

Mapper: the location of your Mapper; make sure to add the quotation marks around your statement.

Reducer: the location of your Reducer; make sure you add the quotation marks around your statement.

Extra Args: no extra arguments are needed in this case.
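
For example, with the bucket layout from Part 1, the fields might be filled in roughly like this (purely an illustration; the output folder name and the exact syntax the wizard expects may differ, so check the wizard's inline help):

Input location:  s3://mlbucket/mlinput
Output location: s3://mlbucket/meanVar001Log
Mapper:          "s3://mlbucket/mlcode/Mapper.py"
Reducer:         "s3://mlbucket/mlcode/Reducer.py"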

Step 12: Configure your Amazon Elastic MapReduce instance

Note: In my case I specified a Small instance type with an instance count of 1, as this job is small. For larger jobs, select the Large or XLarge instance types.

Note on usage: Amazon bills per hour of usage, so if your job runs for 2 minutes you will still be billed for a full hour. It is therefore best to run a few jobs within the same hour.

Going forward, look at the three pricing options: On-Demand Instances, Reserved Instances and Spot Instances. Depending on your usage and the demand on AWS, a different option may be the best choice for you.

Step 13: Specify advanced options

If you have created any specific key pairs you can select them here. Also add a directory for AWS to place your log files in; this directory will be created automatically.

Step 14: Proceed with no Bootstrap Actions, or configure your Bootstrap Actions

Note: Bootstrap Actions is a feature in Amazon Elastic MapReduce that provides users a way to run custom set-up prior to the execution of their job flow. Bootstrap Actions can be used to install software or configure instances before running your job flow.

In this case we don’t need any bootstrap action to take place.

Step 15: Review your configuration

Click the ‘Create Job Flow’ button to start your MapReduce job.

You will see your job’s state move from STARTING, through SHUTTING DOWN as the cluster shuts down, and finally to COMPLETED.

Step 16: Check the result of your MapReduce job: open the meanVar001Log folder you specified as the Output Location.

Open part-00000 to see the result: