# Running local mrjob streaming Hadoop jobs

Follow the steps below to run a local mrjob. In this example I run an mrjob that calculates word frequency.

Prerequisite: Python 2.6 or 2.7 must be installed for this to work.

Step 1: Clone or download mrjob from GitHub:

https://github.com/Yelp/mrjob

Step 2: Navigate to Yelp/mrjob/examples in your terminal.

Step 3: Test your environment and make sure mrjob works. In a Python shell, run:

```
import mrjob
```

If mrjob is installed correctly, this shows no errors or dependency issues.

Step 4: Run the word frequency example:

```
python mr_word_freq_count.py log1 > counts
```

The `log1` input file contained one word per line:

```
test
one
two
three
four
five
one
two
test
```

Result (note each output line is tab-delimited):

```
"five"	1
"four"	1
"one"	2
"test"	2
"three"	1
"two"	2
```
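The job above can also be understood without Hadoop at all: the mapper emits a (word, 1) pair for every word, and the reducer sums the counts per word. Here is a minimal pure-Python sketch of that logic (an illustration, not the actual mr_word_freq_count.py source):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit a (word, 1) pair for every whitespace-separated word
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # sum all the 1s emitted for this word
    yield (word, sum(counts))

def run_local(lines):
    # map phase
    pairs = [pair for line in lines for pair in mapper(line)]
    # simulate the shuffle/sort phase: order pairs by word
    pairs.sort(key=itemgetter(0))
    result = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for w, total in reducer(word, (count for _, count in group)):
            result[w] = total
    return result

# the same input as the log1 file above
print(run_local(["test", "one", "two", "three",
                 "four", "five", "one", "two", "test"]))
```

Running this reproduces the counts shown in the result block above.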

# Map and Reduce Python Script Example

Below is an example of your first map script, reduce script and a sample dataset.

Let's look at the Mapper.py file:

```
import sys
from numpy import mat, mean, power

def read_input(file):
    #read input line by line
    for line in file:
        #yield each line with the trailing characters removed (same as Trim())
        yield line.rstrip()

#creates a generator of input lines from stdin
input = read_input(sys.stdin)

#cast to floats
input = [float(line) for line in input]

#determine number of inputs
numInputs = len(input)

#convert list to matrix
input = mat(input)

#form a vector of squares
sqInput = power(input, 2)

#output size, mean and mean(square values), tab-delimited, for the reducer
print "%d\t%f\t%f" % (numInputs, mean(input), mean(sqInput))

#write a status message to stderr so Hadoop knows the mapper is alive
print >> sys.stderr, "report: still alive"
```

Now for the Reducer.py. Three tab-delimited values per line are passed from the mapper to the reducer: numInputs, mean(input) and mean(sqInput).

```
import sys
from numpy import mat, mean, power

def read_input(file):
    #read mapper output line by line
    for line in file:
        yield line.rstrip()

#creates a generator of input lines from the mapper
input = read_input(sys.stdin)

#split the 3 inputs into separate items and store in a list of lists
mapperOut = [instance.split() for instance in input]

#initialize total number of samples (cumN), overall sum (cumVal)
#and overall sum of squares (cumSumSq) to 0
cumVal = 0.0
cumSumSq = 0.0
cumN = 0.0

for instance in mapperOut:
    #cast this mapper's sample count to float
    nj = float(instance[0])
    #accumulate the total sample count
    cumN = cumN + nj
    #weight each mapper's mean and mean(square values) by its sample count
    cumVal = cumVal + nj*float(instance[1])
    cumSumSq = cumSumSq + nj*float(instance[2])

#calculate the overall mean
mean = cumVal/cumN
#calculate the overall mean of squared values
meanSq = cumSumSq/cumN

#output size, mean, mean(square values)
print "%d\t%f\t%f" % (cumN, mean, meanSq)
print >> sys.stderr, "report: still alive"
```
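Why the reducer's weighting is correct: a mean of means is only valid when each partial mean is weighted by its sample count. A small pure-Python check (separate from the scripts above) that combining per-chunk statistics this way reproduces the global statistics:

```python
def chunk_stats(values):
    # what each mapper reports: count, mean, mean of squares
    n = len(values)
    return (n, sum(values) / n, sum(v * v for v in values) / n)

def combine(stats_list):
    # what the reducer does: weight each chunk's mean and
    # mean-of-squares by that chunk's sample count
    cumN = cumVal = cumSumSq = 0.0
    for nj, mean_j, meanSq_j in stats_list:
        cumN += nj
        cumVal += nj * mean_j
        cumSumSq += nj * meanSq_j
    return (cumN, cumVal / cumN, cumSumSq / cumN)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# pretend two mappers each saw half of the data
combined = combine([chunk_stats(data[:3]), chunk_stats(data[3:])])
# computing over all the data at once gives the same mean and mean(sq)
direct = chunk_stats(data)
```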

See the sample dataset:

```
0.865670009848
0.240464946103
0.38583753445
0.851896046359
0.56613365811
0.901353547484
0.47530934886
0.903698474043
0.690057722624
0.549349071622
0.374166366825
0.63335531551
0.607434274558
0.1626603772
```
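Assuming the two scripts above are saved as Mapper.py and Reducer.py, and the sample dataset is saved as inputFile.txt (these file names are just placeholders), the whole pipeline can be tested locally with a Unix pipe before moving to the cloud:

```
python Mapper.py < inputFile.txt | python Reducer.py
```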

# Running Map Reduce on Amazon Elastic MapReduce

Below are the steps to run your first MapReduce job on Amazon EMR.

Step 1: Register on Amazon: aws.amazon.com/free

After you log in we will complete the following two parts:

• Part 1: Create the buckets for our input file, map and reduce file
• Part 2: Create our map reduce job

Part 1: Create the buckets for our input file, map and reduce file:

Step 2. Navigate to the S3 tab inside your console.

Step 3: Click on Create Bucket button:

Make sure your bucket name contains only lowercase letters, numbers and hyphens, and starts with a letter or number. Bucket names must be globally unique, so you may have to try a few names, as many may already be taken.

Step 4: Create two folders inside your bucket by clicking on the ‘Create Folder’ button.

Your folder structure will look as follows:

s3://mlbucket/mlcode

s3://mlbucket/mlinput

Don’t worry about any extra folders; they will make sense after you have created your MapReduce job flow.

Step 5: Upload your dataset file into the following folder:

s3://mlbucket/mlinput

Step 6: Upload your map and reduce Python files into the following folder:

s3://mlbucket/mlcode
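If you prefer the command line, the same bucket and uploads can be created with the AWS CLI (assuming it is installed and configured; the bucket and file names below are the example names used in this guide, so substitute your own):

```
aws s3 mb s3://mlbucket
aws s3 cp Mapper.py s3://mlbucket/mlcode/Mapper.py
aws s3 cp Reducer.py s3://mlbucket/mlcode/Reducer.py
aws s3 cp inputFile.txt s3://mlbucket/mlinput/inputFile.txt
```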

Part 2: Create our map reduce job

Step 7: Log in and navigate to the Elastic MapReduce tab in your AWS Management Console.

Step 8: Select the region in which to run your MapReduce job. It is very important that you create your job flow in the same region where your buckets and key pairs were created.

Step 9: Click on ‘Create new job flow’ button to start the MapReduce wizard

Step 10: Name your MapReduce job and specify whether it is your own application or a sample application. In this instance we will be creating our own MapReduce job, and we select Streaming as the Job Type.

Why Streaming: A Streaming job flow runs a single Hadoop job consisting of map and reduce functions that you have uploaded to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.

Step 11: Click on ‘Continue’ to specify the parameters for your MapReduce job:

Here is an explanation of the parameters:

Input location: the location of your dataset.

Output location: where you want your MapReduce output to be written. This folder must not already exist, or the job will fail.

Extra Args: no extra arguments are needed in this case.
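For reference, the Streaming job type corresponds to a Hadoop Streaming invocation along these lines (a rough sketch using the example bucket paths from Part 1, not the literal command EMR runs; `mloutput` is a hypothetical output folder):

```
hadoop jar hadoop-streaming.jar \
  -input s3://mlbucket/mlinput \
  -output s3://mlbucket/mloutput \
  -mapper Mapper.py \
  -reducer Reducer.py \
  -file Mapper.py -file Reducer.py
```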

Step 12: Configure your Amazon Elastic MapReduce instances.

Note: In my case I specified the small instance type with an instance count of 1, as this job is small. For large jobs, select the Large or Extra Large instance types.

Note on usage: Amazon bills per full hour of usage, so if your job runs for 2 minutes you will still be billed for an hour. It is therefore best to run a few jobs within the same hour.

Going forward, look at the three pricing options: On-Demand Instances, Reserved Instances and Spot Instances. Depending on your usage and the demand on AWS, a different option may be the best fit for you.

Step 13: If you have created any specific key pairs, you can select them here. Also add a directory for AWS to place your log files in; this directory will be created automatically.

Step 14: Proceed with no Bootstrap Actions, or configure your Bootstrap Actions.

Note: Bootstrap Actions are a feature of Amazon Elastic MapReduce that lets you run custom set-up steps prior to the execution of your job flow. They can be used to install software or configure instances before running your job.

In this case we don’t need any bootstrap action to take place.
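As an example of when you would need one: the mapper and reducer above import numpy, so if your chosen machine image did not already ship with it, a bootstrap action could install it on every node. A hypothetical bootstrap script:

```
#!/bin/bash
# hypothetical bootstrap action: install numpy on each node before the job runs
sudo easy_install numpy
```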