Below are the steps to write your first MapReduce job on Amazon EMR (Elastic MapReduce).
Step 1: Register on Amazon: aws.amazon.com/free
After you log in, we will complete the following two parts:
- Part 1: Create the bucket and folders for our input file and our map and reduce files
- Part 2: Create our MapReduce job
Part 1: Create the bucket and folders for our input file and our map and reduce files:
Step 2: Navigate to the S3 tab inside your AWS Management Console.
Step 3: Click on the 'Create Bucket' button:
Make sure your bucket name is all lower case; names may contain lower-case letters, numbers, and hyphens, but no other symbols (for example, ml-bucket-001 is valid, while MLBucket and ml_bucket are not). Bucket names must also be globally unique, so you may have to try a few names, as many are already taken.
Step 4: Create two folders inside your bucket by clicking on the 'Create Folder' button.
Your folder structure will look as follows:
s3://mlbucket/mlcode
s3://mlbucket/mlinput
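If you prefer to script this part rather than clicking through the console, the following is a minimal sketch using boto3, the AWS SDK for Python (my assumption; the console steps above achieve exactly the same thing, and mlbucket is only the illustrative name used in this guide):

import boto3

# Assumes your AWS credentials are configured locally
s3 = boto3.client('s3')

# Create the bucket; outside us-east-1 you must also pass
# CreateBucketConfiguration={'LocationConstraint': '<your-region>'}
s3.create_bucket(Bucket='mlbucket')

# S3 'folders' are just key prefixes: zero-byte objects whose keys
# end in '/' show up as folders in the console
s3.put_object(Bucket='mlbucket', Key='mlcode/')
s3.put_object(Bucket='mlbucket', Key='mlinput/')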
Don't worry about any extra folders you may see in your bucket; they will make sense after you have created your MapReduce job flow.
Step 5: Upload your MapReduce dataset into the following folder:
s3://mlbucket/mlinput
Step 6: Upload your map and reduce Python files into the following folder:
s3://mlbucket/mlcode
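If you do not have map and reduce scripts yet, here is a minimal word-count sketch that illustrates the Streaming contract (an illustrative stand-in, not the mean/variance job shown in the screenshots). The mapper reads raw text lines from standard input and writes tab-separated key/value pairs to standard output:

#!/usr/bin/env python
# mapper.py: emit '<word><TAB>1' for every word on standard input
import sys

for line in sys.stdin:
    for word in line.split():
        print('{}\t{}'.format(word, 1))

Hadoop sorts the mapper output by key before it reaches the reducer, so the reducer can total each word's counts in a single pass:

#!/usr/bin/env python
# reducer.py: sum the counts per word; input arrives sorted by key
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('{}\t{}'.format(current_word, current_count))
        current_word = word
        current_count = int(count)
if current_word is not None:
    print('{}\t{}'.format(current_word, current_count))

You can test the pair locally before uploading with cat input.txt | python mapper.py | sort | python reducer.py, which mimics what Hadoop does at scale, and then upload each file with boto3's upload_file, e.g. s3.upload_file('mapper.py', 'mlbucket', 'mlcode/mapper.py').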
Part 2: Create our MapReduce job
Step 7: Log in and navigate to the Elastic MapReduce tab in your AWS Management Console.
Step 8: Select the region in which to run your MapReduce job. It is very important that you create your job flow in the same region where your buckets and key pairs were created.
Step 9: Click on the 'Create new job flow' button to start the MapReduce wizard.
Step 10: Name your MapReduce job and specify whether it is your own application or a sample application. In this instance we will be creating our own MapReduce job, so select 'Streaming' as the Job Type.
Why Streaming: A Streaming job flow runs a single Hadoop job consisting of map and reduce functions that you have uploaded to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.
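To make the Streaming data flow concrete, here is how a single input line would pass through the illustrative word-count scripts from Step 6.
The mapper turns the input line 'the quick brown fox the' into one tab-separated pair per word:
the	1
quick	1
brown	1
fox	1
the	1
Hadoop then sorts the pairs by key before handing them to the reducer, which outputs one total per word:
brown	1
fox	1
quick	1
the	2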
Step 11: Click on ‘Continue’ to specify the parameters for your MapReduce job:
Here is an explanation of the parameters:
Input location: The S3 location of your dataset.
Output location: The S3 location where you want your MapReduce output written. This folder must not already exist; Hadoop creates it and will fail the job if it is already there.
Mapper: The location of your mapper; make sure to add quotation marks around the statement.
Reducer: The location of your reducer; again, make sure to add quotation marks around the statement.
Extra Args: No extra arguments are needed in this case.
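As an illustration, with the bucket layout from Part 1 and scripts named mapper.py and reducer.py (the file names are my assumption), the filled-in parameters would look something like this:
Input location: s3://mlbucket/mlinput
Output location: s3://mlbucket/meanVar001Log
Mapper: "s3://mlbucket/mlcode/mapper.py"
Reducer: "s3://mlbucket/mlcode/reducer.py"
Extra Args: (left empty)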
Step 12: Configure your Amazon Elastic MapReduce instance
Note: In my case I specified the Small instance type with an instance count of 1, as this job is small. For larger jobs, select the Large or Extra Large instance types.
Note on usage: Amazon bills per instance hour, so if your job runs for 2 minutes you will still be billed for a full hour. It is therefore best to run several jobs within the same hour.
Going forward, look at the three pricing options: On-Demand Instances, Reserved Instances, and Spot Instances. Depending on your usage and the current demand on AWS, a different one may be the best option for you.
Step 13: Specify advanced options
If you have created any specific key pairs, you can select them here. Also specify a directory for AWS to place your log files in (for example, s3://mlbucket/logs); this directory will be created automatically.
Step 14: Proceed with no Bootstrap or Configure your Bootstrap Actions
Note: Bootstrap Actions are a feature in Amazon Elastic MapReduce that gives users a way to run custom set-up steps prior to the execution of their job flow. They can be used to install software or configure instances before your job flow runs.
In this case we don’t need any bootstrap action to take place.
Step 15: Review your configuration
Click the 'Create Job Flow' button to start your MapReduce job.
You will see your job move from the STARTING state to RUNNING, and finally to COMPLETED (or FAILED if something went wrong).
Step 16: Check the result of your MapReduce job: open the meanVar001Log folder that you specified as the Output Location.
Open the part-00000 file to see the result.
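Each line of a Streaming job's output is a key and its final value separated by a tab, one line per key emitted by the reducer. The file is named part-00000 because it comes from the first reducer; a job with more reducers would also produce part-00001, part-00002, and so on.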