Running mrjob on Amazon Elastic MapReduce

Below are the steps to run your first Amazon Elastic MapReduce job on Amazon EC2 with mrjob.

The first step is to make sure you have completed the steps in my previous post:
https://thysmichels.com/2012/01/27/running-local-mrjob-streaming-hadoop-jobs/

OK, let’s start:
Step 1: Create a new file called mrjob.conf. The location of the file is important, because mrjob looks for it in the following places:

  • The location specified by MRJOB_CONF
  • ~/.mrjob.conf
  • ~/.mrjob (deprecated)
  • mrjob.conf in any directory in PYTHONPATH (deprecated)
  • /etc/mrjob.conf

I created mine at /home/thys_michels/.mrjob.conf
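
The lookup can be sketched in Python as follows. This is only an illustration, assuming the list above is searched top to bottom and the first existing file wins. It skips the two deprecated locations, and the function name find_mrjob_conf is hypothetical, not part of mrjob.

```python
import os


def find_mrjob_conf(environ=None, home=None):
    """Return the first config path that exists, or None.

    Illustrative only: mirrors the search order listed in this post for the
    non-deprecated locations. It is not mrjob's own lookup code.
    """
    environ = os.environ if environ is None else environ
    home = os.path.expanduser("~") if home is None else home

    candidates = []
    if environ.get("MRJOB_CONF"):
        candidates.append(environ["MRJOB_CONF"])  # explicit override
    candidates.append(os.path.join(home, ".mrjob.conf"))  # per-user file
    candidates.append("/etc/mrjob.conf")  # system-wide file

    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_mrjob_conf())
```

Running the script prints the path that would be picked up on your machine, or None if no config file exists yet.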

Step 2: Below is the mrjob.conf explained. Make sure you comment out the lines you don't need and modify the parameters where necessary.

Note: A sample mrjob.conf can be downloaded from: https://github.com/Yelp/mrjob/blob/master/mrjob.conf.example

Below is my .mrjob.conf. I changed the values to match my own setup and commented out a lot of the lines (the ones prefixed with ##).

# This is basically the config file we use in production at Yelp, with some
# strategic edits. ;)
#
# If you don't have the yaml module installed, you'll have to use JSON instead,
# which would look something like this:
#
# {"runners": {
# "emr": {
# "aws_access_key_id": "HADOOPHADOOPBOBADOOP",
# "aws_region": "us-west-1",
# "aws_secret_access_key": "MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP",
# "base_tmp_dir": "/scratch/$USER",
# "bootstrap_python_packages": [
# "$BT/aws/python-packages/*.tar.gz"
# ],
# ...
#
runners:
  emr:
    aws_access_key_id: ### 
    # See Step 3 on how to create an AWS Access key
    # We run in the west region because we're located on the west coast,
    # and there are no eventual consistency issues with newly created S3 keys.
    aws_region: us-west-1
    # make sure your keys are created in the same aws_region
    aws_secret_access_key: ###
    # see Step 3 on how to create an access key and find your secret key
    # alternate tmp dir
    base_tmp_dir: /scratch/$USER
    # make sure you have write privileges on the /scratch directory
    # $BT is the path to our source tree. This lets us add modules to
    # install on EMR by simply dumping them in this dir.
    ##bootstrap_python_packages:
    ## $BT/aws/python-packages/*.tar.gz
    # specifying an ssh key pair allows us to ssh tunnel to the job tracker
    # and fetch logs via ssh
    ec2_key_pair: mrjobkey2 
    ec2_key_pair_file: /home/thys_michels/Documents/mrjobkey2.pem 
    # See Step 4 to create key_pairs
    # use beefier instances in production
    ec2_instance_type: m1.small
    # change this from c1.xlarge to m1.small if you are running small
    # MapReduce jobs, as you will be charged more for an xlarge instance
    # but only use one unless overridden
    num_ec2_instances: 1
    # use our local time zone (this is important for deciding when
    # days start and end, for instance)
    cmdenv:
      TZ: America/Los_Angeles
    # TZ only sets the time zone for your tasks. Your keys and buckets are
    # tied to aws_region, so confirm they are created in that region.
    # we create the src-tree.tar.gz tarball with a Makefile. It only contains
    # a subset of our code
    ##python_archives: &python_archives
    ##- $BT/aws/src-tree.tar.gz
    # our bucket also lives in the us-west region
    s3_log_uri: s3://mrbucket1/
    s3_scratch_uri: s3://mrbucket1/tmp/
    # Create this bucket and a tmp folder inside it. Make sure your bucket
    # is in the same region as your keys.
    ##setup_cmds: &setup_cmds
    # these files are different between dev and production, so they're
    # uploaded separately. copying them into place isn't safe because
    # src-tree.tar.gz is actually shared between several mappers/reducers.
    # Another safe approach would be to add a rule to Makefile.emr that
    # copies these files if they haven't already been copied (setup_cmds
    # from two mappers/reducers won't run simultaneously on the same machine)
    ##- ln -sf $(readlink -f config.py) src-tree.tar.gz/config/config.py
    ##- ln -sf $(readlink -f secret.py) src-tree.tar.gz/config/secret.py
    # run Makefile.emr to compile C code (EMR has a different architecture,
    # so we can't just upload the .so files)
    ##- cd src-tree.tar.gz; make -f Makefile.emr
    # generally, we run jobs on a Linux server separate from our desktop
    # machine. So the SSH tunnel needs to be open so a browser on our
    # desktop machine can connect to it.
    ssh_tunnel_is_open: true
    ssh_tunnel_to_job_tracker: true
    # upload these particular files on the fly because they're different
    # between development and production
    ##upload_files: &upload_files
    ##- $BT/config/config.py
    ##- $BT/config/secret.py
  hadoop:
    # Note the use of YAML references to re-use parts of the EMR config.
    # We don't currently run our own hadoop cluster, so this section is
    # pretty boring.
    base_tmp_dir: /scratch/$USER
    ##python_archives: *python_archives
    ##setup_cmds: *setup_cmds
    ##upload_files: *upload_files
  local:
    # We don't have gcc installed in production, so if we have to run an
    # MRJob in local mode in production, don't run the Makefile
    # and whatnot; just fall back on the original copy of the code.
    base_tmp_dir: /scratch/$USER
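
The comments at the top of the config mention that JSON can be used instead if the yaml module is not installed. As a rough sketch of that route, the script below builds a minimal config from only the EMR options set in this post and prints it as JSON. The helper name build_emr_conf and the YOUR_... placeholders are my own assumptions for illustration; check the option names against the mrjob version you have installed.

```python
import json


def build_emr_conf(access_key_id, secret_access_key, key_pair, key_pair_file,
                   bucket, region="us-west-1"):
    """Build a minimal config dict using only options shown in this post."""
    return {
        "runners": {
            "emr": {
                "aws_access_key_id": access_key_id,
                "aws_region": region,
                "aws_secret_access_key": secret_access_key,
                "ec2_key_pair": key_pair,
                "ec2_key_pair_file": key_pair_file,
                "ec2_instance_type": "m1.small",
                "num_ec2_instances": 1,
                # log and scratch locations live in the same bucket
                "s3_log_uri": "s3://%s/" % bucket,
                "s3_scratch_uri": "s3://%s/tmp/" % bucket,
            }
        }
    }


if __name__ == "__main__":
    conf = build_emr_conf(
        access_key_id="YOUR_ACCESS_KEY_ID",          # placeholder
        secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
        key_pair="mrjobkey2",
        key_pair_file="/home/thys_michels/Documents/mrjobkey2.pem",
        bucket="mrbucket1",
    )
    print(json.dumps(conf, indent=2, sort_keys=True))
```

Redirect the output to ~/.mrjob.conf once you have replaced the placeholders with your own keys.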

Step 3: Create your AWS access key. Log in to AWS and navigate to My Account > Security Credentials.
Click on ‘Create New Access’. This creates an Access Key ID and a Secret Access Key. Assign them in your mrjob.conf as follows:
aws_access_key_id: Access Key ID
aws_secret_access_key: Secret Access Key

Step 4: Create a key pair. Navigate to AWS Management Console > EC2 tab.
Confirm you are in the right region before you create your key pair.
Click the ‘Create Key Pair’ button and give your key a name.
Important: you will only have one chance to download your key pair. Download the .pem file after it has been created and save it somewhere safe.

Specify your .pem file in your mrjob.conf as follows:
ec2_key_pair: name of key pair
ec2_key_pair_file: location of .pem file

Make sure you have read access to your .pem file. Run chmod 0400 on it if you are not sure.
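
If you prefer to set the permission from Python, the same chmod 0400 can be sketched like this. The function name restrict_key_file is hypothetical, and the script assumes a POSIX system such as Linux.

```python
import os
import stat
import sys


def restrict_key_file(path):
    """Make the key file readable by its owner only (mode 0400).

    Returns the resulting permission bits so the caller can verify them.
    """
    os.chmod(path, stat.S_IRUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__" and len(sys.argv) > 1:
    # usage: python restrict_key.py /path/to/key.pem
    print("mode is now %04o" % restrict_key_file(sys.argv[1]))
```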

After you have done all of this, it is time to run your mrjob job. Use the following command:
python mr_word_freq_count.py log1 -r emr > counts
You will see the following output:

using configs in /home/thys_michels/.mrjob.conf
Uploading input to s3://mrbucket1/tmp/mr_word_freq_count.root.20120127.202504.109284/input/
creating tmp directory /scratch/root/mr_word_freq_count.root.20120127.202504.109284
writing master bootstrap script to /scratch/root/mr_word_freq_count.root.20120127.202504.109284/b.py
Copying non-input files into s3://mrbucket1/tmp/mr_word_freq_count.root.20120127.202504.109284/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-2Y1JXJT4FKQ7Y
Job launched 30.3s ago, status STARTING: Starting instances
Job launched 61.8s ago, status STARTING: Starting instances
Job launched 92.1s ago, status STARTING: Starting instances
Job launched 122.4s ago, status STARTING: Starting instances
Job launched 152.7s ago, status STARTING: Starting instances
Job launched 183.0s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 213.3s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 243.8s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 274.1s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 304.6s ago, status RUNNING: Running step (mr_word_freq_count.root.20120127.202504.109284: Step 1 of 1)
Opening ssh tunnel to Hadoop job tracker
Connect to job tracker at: http://ubuntu:40053/jobtracker.jsp
Job launched 336.3s ago, status RUNNING: Running step (mr_word_freq_count.root.20120127.202504.109284: Step 1 of 1)
 map 100% reduce 100%
Job launched 367.0s ago, status RUNNING: Running step (mr_word_freq_count.root.20120127.202504.109284: Step 1 of 1)
 map 100% reduce 100%
Job completed.
Running time was 52.0s (not counting time spent waiting for the EC2 instances)
Fetching counters from S3...
Waiting 5.0s for S3 eventual consistency
Counters from step 1:
  FileSystemCounters:
    FILE_BYTES_READ: 90
    FILE_BYTES_WRITTEN: 281
    S3_BYTES_READ: 78
    S3_BYTES_WRITTEN: 53
  Job Counters :
    Launched map tasks: 2
    Launched reduce tasks: 1
    Rack-local map tasks: 2
  Map-Reduce Framework:
    Combine input records: 9
    Combine output records: 9
    Map input bytes: 51
    Map input records: 9
    Map output bytes: 78
    Map output records: 9
    Reduce input groups: 6
    Reduce input records: 9
    Reduce output records: 6
    Reduce shuffle bytes: 127
    Spilled Records: 18
Streaming final output from s3://mrbucket1/tmp/mr_word_freq_count.root.20120127.202504.109284/output/
removing tmp directory /scratch/root/mr_word_freq_count.root.20120127.202504.109284
Removing all files in s3://mrbucket1/tmp/mr_word_freq_count.root.20120127.202504.109284/
Removing all files in s3://mrbucket1/j-2Y1JXJT4FKQ7Y/
Killing our SSH tunnel (pid 17859)
Terminating job flow: j-2Y1JXJT4FKQ7Y
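
Most of the waiting in the log above is spent starting and bootstrapping the instances. If you have saved the log lines to a file or list, a small script can show when each status first appeared. This is a sketch that assumes the "Job launched ...s ago, status ..." line format shown above; the function name first_seen_per_status is my own.

```python
import re

# Matches lines such as:
#   Job launched 30.3s ago, status STARTING: Starting instances
STATUS_RE = re.compile(r"^Job launched (\d+(?:\.\d+)?)s ago, status (\w+):")


def first_seen_per_status(log_lines):
    """Map each status to the elapsed seconds at which it first appeared."""
    first_seen = {}
    for line in log_lines:
        match = STATUS_RE.match(line)
        if match:
            elapsed = float(match.group(1))
            status = match.group(2)
            # keep only the earliest time for each status
            first_seen.setdefault(status, elapsed)
    return first_seen


if __name__ == "__main__":
    sample = [
        "Job launched 30.3s ago, status STARTING: Starting instances",
        "Job launched 61.8s ago, status STARTING: Starting instances",
        "Job launched 183.0s ago, status BOOTSTRAPPING: Running bootstrap actions",
        "Job launched 304.6s ago, status RUNNING: Running step "
        "(mr_word_freq_count.root.20120127.202504.109284: Step 1 of 1)",
    ]
    for status, elapsed in first_seen_per_status(sample).items():
        print("%s first seen at %.1fs" % (status, elapsed))
```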

Cool! You can open your job tracker at the address shown in the logs: http://ubuntu:40053/jobtracker.jsp
There you will see a nice breakdown of your MapReduce job.

One Comment on “Running mrjob on Amazon Elastic MapReduce”

  1. I am using windows OS. Can you specify where to create the mrjob.conf file? In C drive?