Running mrjob on Amazon Elastic MapReduce
Below are the steps to run your first Amazon Elastic MapReduce job on Amazon EC2.
The first step is to make sure you have completed the steps in my previous post:
http://thysmichels.com/2012/01/27/running-local-mrjob-streaming-hadoop-jobs/
OK, let’s start:
Step 1: Create a new file called mrjob.conf. The location of the file is important. mrjob looks for it in:
- The location specified by MRJOB_CONF
- ~/.mrjob.conf
- ~/.mrjob (deprecated)
- mrjob.conf in any directory in PYTHONPATH (deprecated)
- /etc/mrjob.conf
I created mine at /home/thys_michels/.mrjob.conf
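The list of locations above can be sketched in a few lines of Python. This is only an illustration of the list, assuming it is checked from top to bottom; it is not mrjob's own lookup code, and the function names candidate_conf_paths and find_conf are made up for this post.

```python
import os


def candidate_conf_paths(environ=None):
    """Yield candidate mrjob.conf paths in the order listed above."""
    environ = os.environ if environ is None else environ
    # 1. The location specified by MRJOB_CONF
    if environ.get("MRJOB_CONF"):
        yield environ["MRJOB_CONF"]
    home = environ.get("HOME") or os.path.expanduser("~")
    # 2. ~/.mrjob.conf
    yield os.path.join(home, ".mrjob.conf")
    # 3. ~/.mrjob (deprecated)
    yield os.path.join(home, ".mrjob")
    # 4. mrjob.conf in any directory in PYTHONPATH (deprecated)
    for entry in environ.get("PYTHONPATH", "").split(os.pathsep):
        if entry:
            yield os.path.join(entry, "mrjob.conf")
    # 5. /etc/mrjob.conf
    yield "/etc/mrjob.conf"


def find_conf(environ=None, exists=os.path.exists):
    """Return the first candidate path that exists, or None."""
    for path in candidate_conf_paths(environ):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    print(find_conf())
```

Running it prints the first of these files that exists on your machine, or None if you have not created one yet.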
Step 2: Below is the mrjob.conf explained. Comment out the lines you do not need and modify the parameters where necessary.
Note: A sample mrjob.conf can be downloaded from: https://github.com/Yelp/mrjob/blob/master/mrjob.conf.example
Below is my .mrjob.conf. The settings I changed carry my own notes, and the lines I commented out start with ##.
# This is basically the config file we use in production at Yelp, with some
# strategic edits.
#
# If you don't have the yaml module installed, you'll have to use JSON instead,
# which would look something like this:
#
# {"runners": {
#     "emr": {
#         "aws_access_key_id": "HADOOPHADOOPBOBADOOP",
#         "aws_region": "us-west-1",
#         "aws_secret_access_key": "MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP",
#         "base_tmp_dir": "/scratch/$USER"
#         "bootstrap_python_packages": [
#             "$BT/aws/python-packages/*.tar.gz"
#         ],
#         ...
#
runners:
  emr:
    aws_access_key_id: ###  # See Step 3 on how to create an AWS access key
    # We run on in the west region because we're located on the west coast,
    # and there are no eventual consistency issues with newly created S3 keys.
    aws_region: us-west-1  # make sure your keys are created in the same aws_region
    aws_secret_access_key: ###  # See Step 3 on how to create an access key and find your secret key
    # alternate tmp dir
    base_tmp_dir: /scratch/$USER  # make sure you have privileges on the /scratch directory
    # $BT is the path to our source tree. This lets us add modules to
    # install on EMR by simply dumping them in this dir.
    ##bootstrap_python_packages:
    ## $BT/aws/python-packages/*.tar.gz
    # specifying an ssh key pair allows us to ssh tunnel to the job tracker
    # and fetch logs via ssh
    ec2_key_pair: mrjobkey2
    ec2_key_pair_file: /home/thys_michels/Documents/mrjobkey2.pem  # See Step 4 to create key pairs
    # use beefier instances in production
    ec2_instance_type: m1.small  # change this from c1.xlarge to m1.small if you are running small MapReduce jobs, as you will be charged more for an xlarge instance
    # but only use one unless overridden
    num_ec2_instances: 1
    # use our local time zone (this is important for deciding when
    # days start and end, for instance)
    cmdenv:
      TZ: America/Los_Angeles  # Confirm your keys and images are created in this time zone
    # we create the src-tree.tar.gz tarball with a Makefile. It only contains
    # a subset of our code
    ##python_archives: &python_archives
    ##- $BT/aws/src-tree.tar.gz
    # our bucket also lives in the us-west region
    s3_log_uri: s3://mrbucket1/
    s3_scratch_uri: s3://mrbucket1/tmp/  # Create these two buckets and one tmp folder inside the bucket. Make sure your bucket is in the same region as your keys.
    ##setup_cmds: &setup_cmds
    # these files are different between dev and production, so they're
    # uploaded separately. copying them into place isn't safe because
    # src-tree.tar.gz is actually shared between several mappers/reducers.
    # Another safe approach would be to add a rule to Makefile.emr that
    # copies these files if they haven't already been copied (setup_cmds
    # from two mappers/reducers won't run simultaneously on the same machine)
    ##- ln -sf $(readlink -f config.py) src-tree.tar.gz/config/config.py
    ##- ln -sf $(readlink -f secret.py) src-tree.tar.gz/config/secret.py
    # run Makefile.emr to compile C code (EMR has a different architecture,
    # so we can't just upload the .so files)
    ##- cd src-tree.tar.gz; make -f Makefile.emr
    # generally, we run jobs on a Linux server separate from our desktop
    # machine. So the SSH tunnel needs to be open so a browser on our
    # desktop machine can connect to it.
    ssh_tunnel_is_open: true
    ssh_tunnel_to_job_tracker: true
    # upload these particular files on the fly because they're different
    # between development and production
    ##upload_files: &upload_files
    ##- $BT/config/config.py
    ##- $BT/config/secret.py
  hadoop:
    # Note the use of YAML references to re-use parts of the EMR config.
    # We don't currently run our own hadoop cluster, so this section is
    # pretty boring.
    base_tmp_dir: /scratch/$USER
    ##python_archives: *python_archives
    ##setup_cmds: *setup_cmds
    ##upload_files: *upload_files
  local:
    # We don't have gcc installed in production, so if we have to run an
    # MRJob in local mode in production, don't run the Makefile
    # and whatnot; just fall back on the original copy of the code.
    base_tmp_dir: /scratch/$USER
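The comment at the top of the config mentions that JSON can be used when the yaml module is not installed. As a rough sketch, the settings I kept active could be written out as JSON like this. The two credential values are placeholders that you must replace with your own, and the variable names are only for illustration.

```python
import json

# Active (uncommented) EMR settings from the config above.
emr_options = {
    "aws_access_key_id": "YOUR_ACCESS_KEY_ID",          # placeholder, see Step 3
    "aws_region": "us-west-1",
    "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",  # placeholder, see Step 3
    "base_tmp_dir": "/scratch/$USER",
    "ec2_key_pair": "mrjobkey2",
    "ec2_key_pair_file": "/home/thys_michels/Documents/mrjobkey2.pem",
    "ec2_instance_type": "m1.small",
    "num_ec2_instances": 1,
    "cmdenv": {"TZ": "America/Los_Angeles"},
    "s3_log_uri": "s3://mrbucket1/",
    "s3_scratch_uri": "s3://mrbucket1/tmp/",
    "ssh_tunnel_is_open": True,
    "ssh_tunnel_to_job_tracker": True,
}

config = {
    "runners": {
        "emr": emr_options,
        "hadoop": {"base_tmp_dir": "/scratch/$USER"},
        "local": {"base_tmp_dir": "/scratch/$USER"},
    }
}

# Serialize to JSON text; this is what would go into the conf file.
conf_text = json.dumps(config, indent=2, sort_keys=True)
print(conf_text)
```

The printed text has the same shape as the JSON fragment shown in the comment block at the top of the config.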
Step 3: Create your AWS access key. Log in to AWS and navigate to My Account > Security Credentials.
Click on ‘Create New Access’. It will create an Access Key ID and a Secret Access Key. Assign them as follows in your mrjob.conf:
aws_access_key_id = Access Key ID
aws_secret_access_key = Secret Access Key
Step 4: Create a key pair. Navigate to AWS Management Console > EC2 tab.
Confirm you are in the right region before you create your key pair.
Click the ‘Create Key Pair’ button and give your key a name.
Important: you will only have one chance to download your key pair. Download the .pem file after it has been created and save it somewhere safe.
Specify your .pem file in your mrjob.conf as follows:
ec2_key_pair = name of key pair
ec2_key_pair_file = location of .pem file
Make sure you have read access to your .pem file. Run chmod 0400 on it if you are not sure.
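If you prefer to check the key file permissions from Python rather than the shell, a minimal sketch could look like this. The function names are made up for this post, and it assumes a Unix-like system where file modes behave as chmod expects.

```python
import os
import stat


def key_file_is_private(path):
    """Return True if group and others have no permission bits on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def lock_down_key_file(path):
    """Equivalent of `chmod 0400 path`: read-only for the owner."""
    os.chmod(path, stat.S_IRUSR)
```

Call lock_down_key_file on the path you put in ec2_key_pair_file, then key_file_is_private on the same path should return True.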
After you have done all of this it is time to run your mrjob. Use the following command:
python mr_word_freq_count.py log1 -r emr > counts
You will see the following output:
using configs in /home/thys_michels/.mrjob.conf
Uploading input to s3://mrbucket1/tmp/mr_word_freq_count.root.20120127.202504.109284/input/
creating tmp directory /scratch/root/mr_word_freq_count.root.20120127.202504.109284
writing master bootstrap script to /scratch/root/mr_word_freq_count.root.20120127.202504.109284/b.py
Copying non-input files into s3://mrbucket1/tmp/mr_word_freq_count.root.20120127.202504.109284/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-2Y1JXJT4FKQ7Y
Job launched 30.3s ago, status STARTING: Starting instances
Job launched 61.8s ago, status STARTING: Starting instances
Job launched 92.1s ago, status STARTING: Starting instances
Job launched 122.4s ago, status STARTING: Starting instances
Job launched 152.7s ago, status STARTING: Starting instances
Job launched 183.0s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 213.3s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 243.8s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 274.1s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 304.6s ago, status RUNNING: Running step (mr_word_freq_count.root.20120127.202504.109284: Step 1 of 1)
Opening ssh tunnel to Hadoop job tracker
Connect to job tracker at: http://ubuntu:40053/jobtracker.jsp
Job launched 336.3s ago, status RUNNING: Running step (mr_word_freq_count.root.20120127.202504.109284: Step 1 of 1)
 map 100% reduce 100%
Job launched 367.0s ago, status RUNNING: Running step (mr_word_freq_count.root.20120127.202504.109284: Step 1 of 1)
 map 100% reduce 100%
Job completed.
Running time was 52.0s (not counting time spent waiting for the EC2 instances)
Fetching counters from S3...
Waiting 5.0s for S3 eventual consistency
Counters from step 1:
  FileSystemCounters:
    FILE_BYTES_READ: 90
    FILE_BYTES_WRITTEN: 281
    S3_BYTES_READ: 78
    S3_BYTES_WRITTEN: 53
  Job Counters :
    Launched map tasks: 2
    Launched reduce tasks: 1
    Rack-local map tasks: 2
  Map-Reduce Framework:
    Combine input records: 9
    Combine output records: 9
    Map input bytes: 51
    Map input records: 9
    Map output bytes: 78
    Map output records: 9
    Reduce input groups: 6
    Reduce input records: 9
    Reduce output records: 6
    Reduce shuffle bytes: 127
    Spilled Records: 18
Streaming final output from s3://mrbucket1/tmp/mr_word_freq_count.root.20120127.202504.109284/output/
removing tmp directory /scratch/root/mr_word_freq_count.root.20120127.202504.109284
Removing all files in s3://mrbucket1/tmp/mr_word_freq_count.root.20120127.202504.109284/
Removing all files in s3://mrbucket1/j-2Y1JXJT4FKQ7Y/
Killing our SSH tunnel (pid 17859)
Terminating job flow: j-2Y1JXJT4FKQ7Y
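To see where the time goes in a run like this, the status lines can be parsed with a small script. This is just a sketch that assumes the line format shown in my output; the regular expression and the function name first_seen are my own and are not part of mrjob.

```python
import re

# Matches lines such as:
# "Job launched 30.3s ago, status STARTING: Starting instances"
STATUS_RE = re.compile(r"Job launched ([\d.]+)s ago, status (\w+):")


def first_seen(log_lines):
    """Map each status to the first elapsed time (in seconds) it was reported at."""
    seen = {}
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            seconds = float(match.group(1))
            status = match.group(2)
            seen.setdefault(status, seconds)
    return seen


sample = [
    "Job launched 30.3s ago, status STARTING: Starting instances",
    "Job launched 152.7s ago, status STARTING: Starting instances",
    "Job launched 183.0s ago, status BOOTSTRAPPING: Running bootstrap actions",
    "Job launched 304.6s ago, status RUNNING: Running step "
    "(mr_word_freq_count.root.20120127.202504.109284: Step 1 of 1)",
]
print(first_seen(sample))
```

In my run the job only reached RUNNING about 304 seconds after launch, while the step itself took 52 seconds, so most of the wait was instance start-up and bootstrapping.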
Cool, you can now open your job tracker at the URL shown in the log: http://ubuntu:40053/jobtracker.jsp
You will see a nice breakdown of your map reduce job:
Running local mrjob streaming hadoop jobs
Follow the steps below to run a local mrjob. In this example I run an mrjob to calculate word frequency.
Prereq: Python 2.6 or 2.7 must be installed for this to work.
Step 1. Download mrjob:
https://github.com/Yelp/mrjob
Step 2. Navigate to Yelp/mrjob/examples in your terminal
Step 3: Create a dataset, or download one from http://www.infochimps.com.
Step 4: Test your environment and make sure mrjob works. In a Python shell, run:
import mrjob
This should complete with no errors or dependency issues.
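If you would rather run this check from a script than from the interactive shell, a minimal sketch could look like the following. The helper name check_module is made up for this post; it only relies on importlib from the standard library.

```python
import importlib


def check_module(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    if check_module("mrjob"):
        print("mrjob is importable")
    else:
        print("mrjob is missing; install it before continuing")
```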
Step 5: Run your mrjob:
python mr_word_freq_count.py log1 > counts
The log1 input was (note that each line was tab delimited):
test one two three four five one two test
Result:
"five" 1 "four" 1 "one" 2 "test" 2 "three" 1 "two" 2
Map and Reduce Python Script Example
Below is an example of a first mapper, a reducer and a data sample.
Let’s look at the Mapper.py file:
import sys
from numpy import mat, mean, power

#read input line by line
def read_input(file):
    for line in file:
        #returns the line with trailing characters removed (same as Trim())
        yield line.rstrip()

#creates a list of input lines
input = read_input(sys.stdin)
#cast to floats
input = [float(line) for line in input]
#determine number of inputs
numInputs = len(input)
#convert list to matrix
input = mat(input)
#form a vector of squares
sqInput = power(input,2)
#pass size, mean and mean(square values) to the reducer
print numInputs, mean(input), mean(sqInput)
#report progress on stderr
print >> sys.stderr, "report: still alive"
if __name__ == '__main__':
    pass
Now for Reducer.py. The mapper passes 3 elements per line to the reducer:
numInputs, mean(input), mean(sqInput)
import sys
from numpy import mat, mean, power

def read_input(file):
    for line in file:
        yield line.rstrip()

#creates a list of input lines from mapper
input = read_input(sys.stdin)
#split each line into its 3 items and store in a list of lists
mapperOut = [instance.split() for instance in input]
#set total number of samples (cumN), overall sum (cumVal) and overall sum of squares (cumSumSq) to 0
cumVal=0.0
cumSumSq=0.0
cumN=0.0
for instance in mapperOut:
    #number of samples seen by this mapper, cast to float
    nj = float(instance[0])
    #increase cumN by this sample count
    cumN = cumN + nj
    #weight the mapper's mean (instance[1]) and mean of squares (instance[2]) by nj and add to cumVal and cumSumSq
    cumVal = cumVal + nj*float(instance[1])
    cumSumSq = cumSumSq + nj*float(instance[2])
#calculate overall mean
mean = cumVal/cumN
#calculate overall mean of squares
meanSq = cumSumSq/cumN
#output size, mean, mean(square values)
print cumN, mean, meanSq
print >> sys.stderr, "report: still alive"
if __name__ == '__main__':
    pass
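The reason the reducer weights every mapper's output by its sample count is that the combined mean is sum(n_j * mean_j) / sum(n_j), and the same holds for the mean of squares. The sketch below checks this on the first four values of the sample dataset, split unevenly over two pretend mappers. It uses plain Python instead of numpy, and the function names are made up for this post. The variance at the end is my own addition, computed as mean of squares minus squared mean.

```python
def summarize(values):
    """What one mapper emits: count, mean and mean of squares."""
    n = len(values)
    mean = sum(values) / n
    mean_sq = sum(v * v for v in values) / n
    return n, mean, mean_sq


def combine(partials):
    """What the reducer does: weight each partial result by its count."""
    total_n = sum(n for n, _, _ in partials)
    mean = sum(n * m for n, m, _ in partials) / total_n
    mean_sq = sum(n * sq for n, _, sq in partials) / total_n
    return total_n, mean, mean_sq


data = [0.865670009848, 0.240464946103, 0.38583753445, 0.851896046359]

# Two "mappers" with unequal shares of the data.
partials = [summarize(data[:1]), summarize(data[1:])]
n, mean, mean_sq = combine(partials)

# Population variance follows from the two means.
variance = mean_sq - mean ** 2
print(n, mean, mean_sq, variance)
```

The combined values agree with computing the mean and mean of squares over all four numbers at once, which is why the mappers can work on separate chunks.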
See the sample dataset:
0.865670009848
0.240464946103
0.38583753445
0.851896046359
0.56613365811
0.901353547484
0.47530934886
0.903698474043
0.690057722624
0.549349071622
0.374166366825
0.63335531551
0.607434274558
0.1626603772
Running Map Reduce on Amazon Elastic MapReduce
Below are the steps to write your first MapReduce job on Amazon EMR.
Step 1: Register on Amazon: aws.amazon.com/free
After you log in, we will complete the following two parts:
- Part 1: Create the buckets for our input file, map and reduce file
- Part 2: Create our map reduce job
Part 1: Create the buckets for our input file, map and reduce file:
Step 2. Navigate to the S3 tab inside your console.
Step 3: Click on Create Bucket button:
Make sure your bucket name does not contain upper-case letters or symbols. Your bucket name needs to be unique, so you may have to try a few names, as many names have already been taken.
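Before trying a name in the console you can run a quick check like the one below. It is a deliberately conservative sketch: it accepts only lower-case letters, digits and hyphens, 3 to 63 characters long, starting and ending with a letter or digit. Digits are accepted here, as in the mrbucket1 bucket used in my mrjob post. S3's real rules allow a few more forms, and the check cannot tell you whether a name is already taken. The function name is made up for this post.

```python
import re

# Conservative subset of S3 bucket naming rules (see the note above).
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def looks_like_safe_bucket_name(name):
    """Return True if the name fits the conservative pattern above."""
    return bool(BUCKET_RE.match(name))


for candidate in ["mlbucket", "mrbucket1", "MyBucket", "my_bucket"]:
    print(candidate, looks_like_safe_bucket_name(candidate))
```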
Step 4: Create two folders inside your bucket by clicking the ‘Create Folder’ button.
Your folder structure will look as follows:
s3://mlbucket/mlcode
s3://mlbucket/mlinput
It will look as follows:
Don’t worry about the extra folders; they will make sense after you have created your MapReduce job flow.
Step 5: Upload your MapReduce dataset to the following folder:
s3://mlbucket/mlinput
Step 6: Upload your Map and Reduce Python files into the following folder:
s3://mlbucket/mlcode
Part 2: Create our map reduce job
Step 7: Login and navigate to Elastic MapReduce Tab on your AWS Management Console
Step 8: Select the region to run your MapReduce job in. It is very important that you create your job in the same region where your buckets and key pairs were created.
Step 9: Click on ‘Create new job flow’ button to start the MapReduce wizard
Step 10: Name your MapReduce job and specify whether it is your own application or a sample application. In this case we will be creating our own MapReduce job, so select Streaming as the Job Type.
Why Streaming: A Streaming job flow runs a single Hadoop job consisting of map and reduce functions that you have uploaded to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.
Step 11: Click on ‘Continue’ to specify the parameters for your MapReduce job:
Here is an explanation of the parameters:
Input location: The location of your dataset.
Output location: Where you want your MapReduce output to be put.
Mapper: Location of your Mapper; make sure to add the quotation marks around your statement.
Reducer: Location of your Reducer; make sure to add the quotation marks around your statement.
Extra Args: No need for extra arguments in this case.
Step 12: Configure your Amazon Elastic MapReduce instance
Note: In my case I specified the small instance type with an instance count of 1, as this job is small. For large jobs select the Large or XLarge instances.
Note on usage: Amazon bills you per hour of usage, so if your job runs for 2 minutes you will be billed for a full hour. It is therefore best to run a few jobs within the same hour.
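The rounding described in the usage note can be written as a small calculation. This is a sketch under the per-hour billing described above; the function name is made up, and it counts instance-hours only, not dollars, since prices depend on the instance type and region.

```python
import math


def billed_instance_hours(runtime_minutes, instance_count=1):
    """Round the runtime up to whole hours and multiply by the instance count."""
    hours = max(1, int(math.ceil(runtime_minutes / 60.0)))
    return hours * instance_count


print(billed_instance_hours(2))      # a 2 minute job on one instance
print(billed_instance_hours(61, 3))  # just over an hour on three instances
```

A 2 minute job on one instance is billed as 1 instance-hour, and a 61 minute job on three instances as 6 instance-hours.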
Going forward, look at the 3 options: On Demand Instance, Reserved Instance or Spot Instance. Depending on your usage and the demand on AWS, a different option may suit you best.
Step 13: Specify advanced options
If you have created any key pairs you can select them here. Also add a directory for AWS to place your log files in. This directory will be created automatically.
Step 14: Proceed with no Bootstrap or Configure your Bootstrap Actions
Note: Bootstrap Actions is a feature in Amazon Elastic MapReduce that provides users a way to run custom set-up prior to the execution of their job flow. Bootstrap Actions can be used to install software or configure instances before running your job flow.
In this case we don’t need any bootstrap action to take place.
Step 15: Review your configuration
Click ‘Create Job Flow’ button to start your mapreduce job.
You will see your job will move from State: STARTING
Step 16: Check the result of your MapReduce job: open meanVar001Log, which was specified as the Output Location.
Open part-00000 to see the result:
Salesforce Business Processes, Drivers and Solutions
Below is a very nice chart on the Salesforce Business Process, Drivers and Solutions that you can implement for each Business Driver.
| Business Drivers | Business Processes | Sales Cloud Solution |
| Build Stronger Pipeline | Lead generation | Automated Lead Capture and Import |
| | Lead Qualification | Lead Scoring & Routing, Lead Conversion, Alerts & Monitoring |
| Manage the funnel | Sales Methodology | Opportunity Management |
| | Visibility of Sales forecast | Customizable Forecasting |
| Improve Sales Rep productivity | Account and Contact Management | 360 Degree View Approvals |
| | Activity Management | Activity Sharing & Tracking |
| Drive more business | Demand Generation | Campaign Management Segmentation |
| | Search Marketing | Website integration and Google Adwords |
| | Lead Handoff | Feedback from Sales |
| Align Sales and Marketing | Brand Management | Email Templates, Communication |
| | Brand Collateral Management | Collateral & Documents |
Salesforce Reports and Analytics
Standard Report – out-of-the-box reports which may be used as a starting point for custom reports. Folders may be hidden but cannot be removed or deleted.
Reports Wizard – an easy-to-use, multi-step wizard used to create a custom report. The number of wizard steps depends on the Report Type selected.
Custom Reports – created with your specific criteria; may be edited or deleted, and can be searched for in Custom Report search.
Tabular Reports – provides a simple listing of data without subtotals.
Summary Reports – provides a listing of data just like a Tabular report PLUS:
- Sorting
- Subtotals of data
Matrix Reports – summarizes data in a grid against horizontal and vertical criteria. The matrix report is used for comparing related totals, similar to a pivot table in Excel.
Export Report to Excel – 256 columns and 65,536 rows of data in one report.
Printing Report Results – Report format is lost when you export directly to Excel.
Scheduling and Email Reports – specify a running user, frequency and start and end date. The email contains: report information, link to reports, data in HTML that links back to Salesforce.
Data components – used in reports when selecting groupings. Date values can be selected in Matrix and Summary reports when grouping.
Advanced filter criteria – used in reports where and/or logic is needed. Enables the use of “and”, “or” and “not” operators. Use parentheses to specify evaluation priority. Use up to 10 advanced filters per report.
Trend Reports – report opportunity history data by filtering on “as of” date. Between 2 “as of” dates specify the interval.
Charts – graphical representation of data of a single Summary or Matrix report. Types of charts can be:
- Pie
- Line
- Horizontal or Vertical
Summary and Matrix charts can be grouped or stacked.
Relative dates are used in:
- Views
- Reports
which specify relative dates like: this/next month, this/next quarter.
Custom Report Types – allow users to create and customize reports using the report wizard. Custom report types are built off the relationships (master-detail and lookup) between objects, and let you:
1. Choose which standard and custom objects to display to users creating and customizing reports.
2. Select which object fields can be used as columns in reports.
3. Define the relationships between objects displayed to the users creating and customizing reports.
Conditional Highlighting – see the threshold for report analysis.
Custom Summary Formulas – Calculations on Summary fields.
Dashboard – a visual representation of key business information that shows multiple reports. Dashboards are made up of Components that use matrix or summary reports as their source. Refresh of Dashboard data can be scheduled, and emailing a dashboard is allowed.
Running User – allow users to view summarized data they might not normally have access to.
Dashboard Components – Chart, Table, Metric, Gauge.
Campaigns – a specific marketing program or marketing tactic built to create awareness and generate leads.
Campaign Member – a Lead or Contact who is associated with the Campaign by responding to it.
Anyone in your organization can view campaigns, but they can only be edited, deleted or saved by marketing users with appropriate permissions.
Lead – a prospective user who shows interest in your company; all of his information is captured on the lead. Ownership is assigned either manually or via an assignment rule.
Contact – an individual who is associated with an Account.
Converting Leads – lead qualification depends on your organization’s specific business process. All the lead information is mapped to the appropriate business objects:
- Account
- Contact
- Opportunity
Integrated Campaign Builder – cannot filter a campaign by more than one campaign at a time. The maximum number of leads that can be added at one time is 50 000, and with the wizard it is 250.
Campaign Hierarchy - view the entire hierarchy of campaigns
EyeOS 2.5 Error – Cloud OS
Parse error: syntax error, unexpected T_PAAMAYIM_NEKUDOTAYIM in /home/orion/public_html/cloudbootstrap.com/eyeos/system/kernel/services/Meta/implementations/MetaDataConverter.php on line 62
Replace line 62 with:
return eval('return ' . $handlerToLoad . '::getInstance();');
Happy New 2012 to all my subscribers and blog fans
HAPPY NEW YEAR
╭━━━╮╭━━━╮┈┏┓╭━━━╮
┗━━╮┃┃╭━╮┃┏╯┃┗━━╮┃
╭━━╯┃┃┃┈┃┃┗┓┃╭━━╯┃
┃╭━━╯┃┃┈┃┃┈┃┃┃╭━━╯
┃┗━━┓┃╰━╯┃┈┃┃┃┗━━┓
┗━━━┛╰━━━╯┈┗┛┗━━━┛
Apatar Salesforce data integration runs on 64 bit
If you want to run Apatar on 64-bit Windows 7, all you need to do is install jre6 and add its path to your environment variables.
This stays the same:
start javaw -Djpf.boot.config="C:\Program Files (x86)\Apatar\boot1.properties" -Xms128m -Xmx512m -cp "C:\Program Files (x86)\Apatar\plugins\core\core\lib\ibmpkcs.jar";"C:\Program Files (x86)\Apatar\plugins\core\core\lib\mail.jar";"C:\Program Files (x86)\Apatar\plugins\connectors\eloqua\lib\webservices-rt.jar";"C:\Program Files (x86)\Apatar\plugins\connectors\eloqua\lib\apatarEloquaAuth.jar";"C:\Program Files (x86)\Apatar\lib\jpf-boot.jar";"C:\Program Files (x86)\Apatar/plugins/core/core/lib/jdic.jar"; org.java.plugin.boot.Boot setting="C:/Program Files (x86)/Apatar/boot1.properties" debug="false" %1 %2 %3 %4 %5
exit
Add this to your Windows PATH variable:
C:\Program Files (x86)\Java\jre6\bin














