Running local mrjob streaming Hadoop jobs

Follow the steps below to run a local mrjob job. In this example I run an mrjob job to calculate word frequency.

Prereq: Python 2.6 or 2.7 must be installed for this to work.

Step 1. Download mrjob:

https://github.com/Yelp/mrjob

Step 2. Navigate to Yelp/mrjob/examples in your terminal

Step 3: Create a dataset, or download one from http://www.infochimps.com.

Step 4: Test your environment and make sure mrjob works. In a Python interpreter, run:

import mrjob

This should show no errors or dependency issues.
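
The same check can be scripted instead of typed into the interpreter. The sketch below is only an illustration: the helper name can_import is made up for this post, and it assumes Python 2.7 or later, where importlib is part of the standard library.

```python
import importlib


def can_import(module_name):
    """Return True if module_name imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # "mrjob" is the package downloaded in Step 1;
    # this reports False if it is not installed or a dependency is missing.
    print("mrjob importable: %s" % can_import("mrjob"))
```

If this reports False, running import mrjob directly in the interpreter will show the full error message.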

Step 5: Run your mrjob job:

python mr_word_freq_count.py log1 > counts

The log1 input was (note: each line was tab delimited):

test
one
two
three
four
five
one
two
test

Result:

"five"	1
"four"	1
"one"	2
"test"	2
"three"	1
"two"	2
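
To see why the counts come out this way, here is a rough plain-Python sketch of the map and reduce steps behind a word-frequency job. It does not use mrjob and is not the code in mr_word_freq_count.py. The names mapper, reducer and word_freq, the tokenizer regex, and the lowercasing of words are all my own assumptions for illustration.

```python
import json
import re
from collections import defaultdict

# Assumed tokenizer: runs of word characters and apostrophes
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word found on one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_freq(lines):
    """Map, group by key, and reduce in memory; return (word, count) pairs sorted by word."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return [reducer(word, grouped[word]) for word in sorted(grouped)]


if __name__ == "__main__":
    sample = ["test", "one", "two", "three", "four", "five", "one", "two", "test"]
    for word, count in word_freq(sample):
        # JSON-encoded key, a tab, then the JSON-encoded value
        print("%s\t%s" % (json.dumps(word), json.dumps(count)))
```

Run on the nine sample words above, this prints the same six lines as the result. A real Hadoop streaming job does the grouping step across machines instead of in one dictionary.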

About Thys Michels

Certified IBM Specialist in United States, California. Focus on WebSphere Integration and Middleware, Cloud Computing, Big Data, Mobile Development, Software Development.

Posted on January 27, 2012, in Amazon EC2.
