Running local mrjob streaming Hadoop jobs

Follow the steps below to run a local mrjob job. In this example I run an mrjob job to calculate word frequency.

Prereq: Python 2.6 or 2.7 must be installed for this to work.

Step 1. Download mrjob:

https://github.com/Yelp/mrjob

Step 2. Navigate to Yelp/mrjob/examples in your terminal

Step 3: Get a dataset. Create your own, or download one from http://www.infochimps.com.

Step 4: Test your environment and make sure mrjob works. In a Python interpreter, run:

import mrjob

If mrjob is installed correctly, this completes with no errors or dependency issues.
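The same check can be scripted so that a missing package gives a readable message instead of a traceback. This is only a small sketch: the helper name module_available is my own and is not part of mrjob.

```python
def module_available(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# Hypothetical helper, not part of mrjob: report whether the package imports.
if module_available("mrjob"):
    print("mrjob imported without errors")
else:
    print("mrjob is not importable; check the download from Step 1")
```

Which message you see depends on your own environment, so treat it as a quick sanity check rather than a full dependency test.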

Step 5: Run your mrjob job

python mr_word_freq_count.py log1 > counts

The log1 input was (note that each line is tab-delimited):

test	
one	
two	
three	
four	
five	
one	
two	
test	
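If you want to recreate this input file rather than type it by hand, something like the following should do it. The names sample_text and write_sample are my own, and I am assuming each line is simply the word followed by a tab, as shown above.

```python
# The nine sample words from this post, in the order they appear in log1.
WORDS = ["test", "one", "two", "three", "four", "five", "one", "two", "test"]


def sample_text():
    """Return the file contents: one word per line, each followed by a tab."""
    return "".join(word + "\t\n" for word in WORDS)


def write_sample(path):
    """Write the sample contents to the given path and return the path."""
    with open(path, "w") as handle:
        handle.write(sample_text())
    return path
```

Calling write_sample("log1") from the examples directory creates the input file used in the command above.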

Result:

"five"	1
"four"	1
"one"	2
"test"	2
"three"	1
"two"	2
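
To show roughly what the job does with this input, here is a plain-Python sketch of the map, shuffle and reduce steps. It is not the code of mr_word_freq_count.py and it uses neither mrjob nor Hadoop. The function names and the word-matching regular expression are my own assumptions, chosen so that the sketch reproduces the result above.

```python
import json
import re
from itertools import groupby
from operator import itemgetter

# Assumed tokenizer: runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word on the line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce phase: add up all the counts seen for one word."""
    return word, sum(counts)


def word_freq(lines):
    """Run map, then sort (standing in for the shuffle), then reduce."""
    pairs = sorted(pair for line in lines for pair in mapper(line))
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(pairs, key=itemgetter(0))]


# The log1 input from this post, one tab-terminated word per line.
LOG1 = ["test\t", "one\t", "two\t", "three\t", "four\t",
        "five\t", "one\t", "two\t", "test\t"]

for word, count in word_freq(LOG1):
    # JSON-encode key and value, separated by a tab, to match the result format.
    print("%s\t%s" % (json.dumps(word), json.dumps(count)))
```

Sorting the pairs before grouping matters here because groupby only merges adjacent items with the same key, which is the same reason a shuffle sits between the map and reduce phases.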