Author Archives: Thys Michels

Hadoop in Practice

I had the privilege of getting an early release of the Hadoop in Practice book from Manning Publications. The book has the following chapters:

Table of Contents
  1: Getting started - FREE

Part I: Data Logistics
  2: Moving Data in and out of Hadoop
  3: Data Serialization: Working with Text and Beyond

Part II: Big Data Patterns
  4: Applying MapReduce Patterns to Big Data
  5: Streamlining HDFS for Big Data
  6: Measuring and Optimizing Performance

Part III: Data Science
  7: Utilizing Data Structures and Algorithms
  8: Applying Statistics
  9: Machine Learning

Part IV: Taming the Elephant
  10: Hive
  11: Pig
  12: Crunch and Other Technologies
  13: Testing and Debugging
  14: Job Coordination
  15: Proficient Administration

Appendixes
  A: Related Technologies
  B: Hadoop Built-in Ingress and Egress Tools
  C: HDFS Dissected
  D: Optimized MapReduce Join Frameworks

If you are new to Hadoop, or a manager who wants to learn how Hadoop can help solve your big data challenges, then this book is for you.

You can purchase the book online here:

http://www.manning.com/holmes/

Great investment and lots of great content.

Salesforce Integration with Pervasive Data Cloud, WebSphere Cast Iron, Informatica Cloud Services and Dell Boomi

The need for integration

With the accelerating growth of Cloud Computing, many companies are getting stuck migrating, integrating and extracting information between their on-premise and cloud environments.

Given these concerns regarding integration, it is not surprising that many IT organizations have hesitated over Cloud applications, if not rejected them outright. According to a Gartner study conducted in 2009 on why many CIOs were actually transitioning away from Cloud deployments, 56% responded that it was due to the impact of integration requirements on their systems. The complexity of integrating applications has clearly been a deciding factor in the adoption and implementation of Cloud solutions.

To realize the full power of force.com, integration with on-premise databases and other Cloud-based solutions must be simple yet complete. It must be possible to implement solutions in days, not weeks or months. At the same time, the solution needs the sophistication required to harmonize business processes across multiple cloud and on-premise applications. The integration solution should be able to run anywhere, connect applications deployed anywhere, be managed from anywhere and require limited specialist integration skills or IT infrastructure. These solutions must be easily configurable, flexible and scalable, without requiring any coding.

Integration challenges

Below are some of the integration challenges that customers face when integrating on-premise applications with Cloud applications:

  • Firewall mediation – How do you open up the firewall for integrating force.com applications with on-premise apps?
  • Security - How do you encrypt and otherwise protect sensitive information, stored or on the move, Cloud or on-premise?
  • Semantic mediation – How do you account for differences in data structure between the source and target?
  • Performance required - How fast do you need to move data, and how quickly do the data transformation and routing mechanisms need to function?
  • Data integrity – How do you make sure the right data is delivered to the right target at the right time? This includes ensuring that the data is clean when it arrives at the target database.
  • Maintenance and upgrades – How do you support new Cloud or enterprise system interfaces as they evolve?
  • Governance – How do you monitor all points of integration and log the data being synchronized?

Integration options to consider

The following options are available:

Option: Custom development
  Pros: If an organization has enough IT resources and programmers to create a one-off custom integration, this can often be a viable solution.
  Cons: A number of resource-intensive hidden costs in maintenance, support, and any future changes should the need arise to grow the solution to integrate more applications.

Option: On-demand solution specializing in cloud-to-cloud connectivity
  Pros: Low-cost alternative for simple cloud integration projects.
  Cons: Lacks the scalability and functionality to address on-premise or hybrid scenarios. Pure on-demand point solutions are not equipped to handle complex processes and back office applications.

Option: Traditional on-premise solution
  Pros: Based on a more classic ETL architecture, designed for extracting, processing and storing large quantities of data.
  Cons: Longer install and implementation time as well as a much larger IT footprint. Companies may end up purchasing and maintaining two or more complex systems to solve one problem.

Specific integration tool examples

The following are examples of the force.com integration tools available:

  • Pervasive Data Integrator
  • WebSphere Cast Iron Appliance
  • Informatica Cloud Services
  • Dell Boomi

Pervasive Data Cloud

Pervasive offers highly efficient and reliable force.com integration with ERP, HR and other systems without custom coding, extensive software libraries or a big price tag.

Pervasive can connect force.com to all your data: it integrates with your accounting, ERP, SaaS, MIS and any other mission-critical business application. Pervasive can connect, migrate and integrate your data, and it provides flexible delivery options.

When to choose Pervasive:

  • Data migration
  • Batch/Real-time processing integration
  • Advanced workflow

When not to choose Pervasive:

  • Limited deployment models, as the Pervasive cloud offering is not multi-tenant
  • Limited design capability in the cloud, and no secure connector to on-premise applications
  • Limited APIs available for third-party product integration

WebSphere Cast Iron

Many enterprises need to synchronize a master list of their current customers, products, prices and all their transaction history between force.com and corporate on-premise systems. WebSphere Cast Iron integration for force.com is a fast, simple solution specifically for integrating force.com with other applications. With a few clicks and configurations you can migrate and integrate your current on-premise applications with force.com.

WebSphere Cast Iron has multiple force.com adapters that provide the capability to migrate, integrate or extract information in a fast and secure manner. WebSphere Cast Iron provides the following options for data migration and data quality:

  • Data profiling – Assess the quality of the data before migrating it.
  • Data cleansing – Remove duplicates from various sources and set up validation rules.
  • Data enrichment – Perform lookups with external providers to enrich data.

It provides the following options for integration and extraction.

  • Connectivity – Configurable connectivity between on-premise applications and Salesforce.
  • Transformation – Drag and drop user interface for data transformation.
  • Workflow – Visual interface for designing workflow rules.
  • Management – Easy manageability through a single web-based console.

When to use WebSphere Cast Iron:

  • Implementing multiple deployment models
  • Batch/Real time process integration
  • Enterprise connectivity – multiple enterprise adapters for popular enterprise software.
  • UI Mashup
  • Template Development Kit

When not to use WebSphere Cast Iron:

  • Limited data migration functionality: it can only migrate data to and from specific software, not all
  • Limited data quality functionality to clean data to and from the cloud

Informatica Cloud Services

Informatica Cloud Services are specifically designed to meet the data integration needs of line-of-business users. They are based entirely “in the cloud”, which makes integrating cloud-based data with force.com quick and easy. They can also be used to synchronize and replicate data between local databases and files.

Informatica Cloud Services can be used to integrate SaaS applications with a variety of common on-premise systems and databases. The tools for building a cloud service require very little training to set up and administer.

When to use Informatica Cloud Services:

  • Batch processing
  • High availability
  • Web Service Integration
  • Advanced Workflow

When not to use Informatica Cloud Services:

  • Real-time integration is needed
  • Security agent is necessary for secure cloud connection.
  • Multiple deployment models
  • UI Mashup
  • Multiple environments

Dell Boomi

The Dell IT group used the Boomi AtomSphere® application to unify its force.com instances, enabling fully integrated and synchronized customer information across sales groups and business processes.

When you choose Dell Boomi you get:

  • Batch Process Integration
  • Real-time Process Integration
  • Data Migration
  • Advanced Workflow
  • Web Services
  • High Availability

When not to use Boomi:

  • Multiple Deployment models
  • UI Mashup
  • Multiple environments
  • Template Development Kit

Integration Vendor Comparison Chart

The table below compares the different force.com integration software products that are available:

Capabilities Pervasive WebSphere Cast Iron Informatica Cloud Services Boomi

Multiple Deployment Models

 x

Data Migration

 x  x x x

Batch Process Integration

x x x x

Real-time Process Integration

x  x x x

Workflow

x x  x

Enterprise Connectivity

x x  x

Data Quality

x x x x

UI Mashup

x x  x

Multiple environments

 x x

Template Development Kit

x

Web Service API Gateway

 x x  x

Management APIs

 x  x

High Availability

 x x x x

What to look for when evaluating a cloud integration solution

Below are some of the evaluation factors to consider when choosing an integration solution:

Cloud Area

Pervasive Data Cloud

WebSphere Cast Iron

Informatica Cloud Service

Boomi

Design in the Cloud

 x x x

Manage in the Cloud

 x x x  x

Run in the Cloud

 x x  x x

Multi-tenant Cloud platform

 x x x

Cloud to on-premise data

 x  x

APIs for the Cloud

x  x

Connector kit

 x x

Summary

To get the most value out of your current IT investments, you need to be able to integrate existing applications with force.com applications.

The solutions described above meet the needs of today’s businesses by providing a simplified, fast, and low-cost approach to integration projects, with the flexibility to deploy integrations in the cloud or on premise, and the option to change form factors if needed.

Salesforce Debug Logs, Audit Trail and Field History Comparison

Below is a comparison of three Salesforce audit tools: Debug Logs, Audit Trail and Field History.

Debug Logs
  Tracks: Automated actions – activity performed and results generated by end user or code
  Example: Execution of an Apex trigger

Audit Trail
  Tracks: Configuration changes by administrators or developers
  Example: Change to a workflow rule

Field History
  Tracks: Changes of data values for fields
  Example: Update to the Status or Pay Grade field

Visualforce Standard vs Custom Apex Chart

The Visualforce chart below is key to understanding how customization and development work in Salesforce.

Standard Look and Feel
  Standard Behavior: Application Framework and Default UI (Page Layout). Example: Standard Opportunity page with a default page layout.
  Custom Behavior: Page Layouts and Custom Apex Classes. Example: Standard Opportunity page that validates the opportunity stage.

Custom Look and Feel
  Standard Behavior: Visualforce pages with standard controllers. Example: Visualforce page containing opportunity information.
  Custom Behavior: Visualforce pages with custom Apex controllers. Example: Opportunity search portal to search for specific opportunities in Salesforce.

Salesforce Data Management Decision Chart

Below is an easy way to see when to use the Salesforce Import Wizard or the Data Loader for different scenarios.

Considerations for Import Tool

Import Wizard

Data Loader

De-duplicate data

X

Schedule data loads

X

Load two related objects at once

X

Load less than 50,000 records

X

Export data that needs to be used as a backup

X

Load data into an object not supported by the Import Wizard

X

Use the upsert or upsert with relationship functionality.

X

Note: X marks the best tool for each consideration.

Hadoop Multi-node setup on Ubuntu

Setting up a Hadoop multi-node instance on Ubuntu can be challenging. In my case I did it on my laptop, which can be tricky because running 2 VMs with 2GB RAM makes everything a bit slow… thanks to my new Apple MacBook Pro with 8GB RAM I had no worries.

I will break this tutorial into a few parts to keep it organized and so you can track your progress. Remember that you will have to follow each of these parts twice, once on each of your machines (master and slave).

  • Part 1: Setting up your Ubuntu Environment 
  • Part 2: Configure the /etc/hosts file 
  • Part 3: SSH Setup 
  • Part 4: Download and configuring Hadoop
  • Part 5: Configure Master Slave Settings
  • Part 6: Starting Master Slave Setup
  • Part 7: Running first Map Reduce on Multi-Node Setup

Part 1: Setting up your Ubuntu Environment:

By default Ubuntu does not come with Sun Java installed so you will have to install it. This is an easy way to install it via the command line.

> sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
> sudo apt-get update
> sudo apt-get install sun-java6-jdk

Now that Java is installed, let’s export JAVA_HOME:

> export JAVA_HOME=/usr/lib/jvm/java-6-sun

By default Ubuntu will not have ssh installed, so let’s install it from the command line

> sudo apt-get install ssh

It is recommended not to run Hadoop under your current user and group, so we will create a new user and group. We will call the user hduser and the group hd. The commands are as follows:

> sudo addgroup hd
> sudo adduser --ingroup hd hduser

The last thing we need to do is disable IPv6 for Hadoop. After you have downloaded Hadoop, add this line to conf/hadoop-env.sh:

> export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Part 2: Setup /etc/hosts file

You need to set up the /etc/hosts file with the IP addresses of the master and slave. Run the following command to edit the hosts file:

> sudo vi /etc/hosts (use gedit if you don't know vi)

Add the following lines:

172.*.*.*       master
172.*.*.*       slave

Run ifconfig on both your master and slave machines to determine their IP addresses, then fill in each address where I have the * placeholders.
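As a quick sanity check, here is a minimal sketch that parses hosts-file text and looks up the two names used in this tutorial. The class HostsFileCheck and its parse method are hypothetical helpers of my own, and the addresses in the sample are only examples, so substitute the ones your machines report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: parses /etc/hosts-style text into a hostname -> IP map.
final class HostsFileCheck {

    static Map<String, String> parse(String hostsText) {
        Map<String, String> byName = new HashMap<String, String>();
        for (String line : hostsText.split("\n")) {
            // Drop comments and surrounding whitespace.
            int hash = line.indexOf('#');
            String entry = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (entry.isEmpty()) {
                continue;
            }
            // The first token is the IP address, the rest are names for it.
            String[] tokens = entry.split("\\s+");
            for (int i = 1; i < tokens.length; i++) {
                byName.put(tokens[i], tokens[0]);
            }
        }
        return byName;
    }

    public static void main(String[] args) {
        // Example addresses only; use the ones ifconfig reports on your machines.
        String sample = "127.0.0.1 localhost\n"
                + "172.16.62.152   master\n"
                + "172.16.62.151   slave\n";
        Map<String, String> hosts = parse(sample);
        System.out.println("master -> " + hosts.get("master"));
        System.out.println("slave  -> " + hosts.get("slave"));
    }
}
```

If either lookup prints null, the corresponding line is missing from the hosts file and the later ssh and Hadoop steps will not be able to reach that machine by name.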

Part 3: SSH Setup

Let’s configure ssh, run

> su - hduser
> ssh-keygen -t rsa -P ""
> cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

On the master machine, run the following as hduser:

> ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave

To test that ssh works for both master and slave, run:

> ssh master
> ssh slave

Part 4: Download and configuring Hadoop

First we need to download the latest hadoop and extract to our local filesystem. Download the latest hadoop from: http://www.reverse.net/pub/apache//hadoop/common/

Extract Hadoop: tar -xvf hadoop*.tar.gz

Now we need to change ownership of the extracted Hadoop folder to hduser. We can do that with the following command:

> sudo chown -R hduser:hd /home/user/Downloads/hadoop

It is best to move the hadoop folder out of the Downloads folder, which you can do with the following command:

> sudo mv /home/user/Downloads/hadoop /usr/local/

Now we need to configure $HOME/.bashrc with the Hadoop variables. Enter the following commands:

> cd ~
> sudo vi .bashrc (if you don't know vi, you can type: sudo gedit .bashrc)

Add the following lines to the end

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Now we are going to create a folder which Hadoop will use to store its data files:

> sudo mkdir -p /app/hadoop/tmp
> sudo chown hduser:hd /app/hadoop/tmp

Good, now we can edit the *-site.xml files in hadoop/conf. We will add properties to 3 files:

  • conf/core-site.xml
  • conf/hdfs-site.xml
  • conf/mapred-site.xml

Add the following property tags to core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>Temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>Default file system.</description>
</property>

Add the following property tags to mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>MapReduce job tracker.</description>
</property>

Add the following property tags to hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
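One thing that is easy to miss: the property tags above go inside the single <configuration> root element of each file. The sketch below, assuming nothing beyond the standard Java XML parser, reads the properties back and splits the fs.default.name value into host and port. SiteXmlSketch and readProperties are names I made up for illustration, not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads <property> name/value pairs from *-site.xml text.
final class SiteXmlSketch {

    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            Map<String, String> props = new LinkedHashMap<String, String>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // Malformed XML or a property without name/value ends up here.
            throw new IllegalStateException("Could not read configuration", e);
        }
    }

    public static void main(String[] args) {
        // The property elements sit inside one <configuration> root element.
        String coreSite = "<configuration>"
                + "<property><name>hadoop.tmp.dir</name><value>/app/hadoop/tmp</value></property>"
                + "<property><name>fs.default.name</name><value>hdfs://master:54310</value></property>"
                + "</configuration>";
        Map<String, String> props = readProperties(coreSite);
        URI fs = URI.create(props.get("fs.default.name"));
        System.out.println("NameNode host: " + fs.getHost() + ", port: " + fs.getPort());
    }
}
```

The host printed here has to be a name that both machines can resolve, which is why the /etc/hosts entries from Part 2 matter.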

Part 5: Configure Master Slave Settings
We will configure the following 2 files on both the master and slave machines.

  • conf/masters
  • conf/slaves

Let’s start with the Master machine:

  • Open the following file: conf/masters and change ‘localhost’ to ‘master’:
master
  • Open the following file: conf/slaves and change ‘localhost’ to ‘master’ and ‘slave’
master
slave

Now on the Slave machine:

  • Open the following file: conf/masters and change ‘localhost’ to ‘slave’:
slave
  • Open the following file: conf/slaves and change ‘localhost’ to ‘slave’
slave

Part 6: Starting your Master Slave Setup
Note: all of the steps below are done on the master machine.
The first thing we need to do is format the Hadoop namenode. Run:

> hadoop namenode -format

Starting a multi-node cluster takes two steps:

  • Start the HDFS daemons by running the following command in hadoop/bin
> ./start-dfs.sh

Run the following command on the master: jps

14399 NameNode
16244 DataNode
16312 SecondaryNameNode
12215 Jps

Run the following command on the slave: jps

11501 DataNode
11612 Jps

  • Start the Map Reduce daemons by running the following command in hadoop/bin
> ./start-mapred.sh

Run the following command on the master: jps

14399 NameNode
16244 DataNode
16312 SecondaryNameNode
18215 Jps
17102 JobTracker
17211 TaskTracker

Run the following command on the slave: jps

11501 DataNode
11712 Jps
11695 TaskTracker
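If you want to check the jps listings by program rather than by eye, a minimal sketch could look like the one below. JpsCheck and its method are hypothetical names of mine, and the expected daemon lists are simply taken from the listings above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports which expected Hadoop daemons are absent from jps output.
final class JpsCheck {

    static final List<String> MASTER_DAEMONS = Arrays.asList(
            "NameNode", "DataNode", "SecondaryNameNode", "JobTracker", "TaskTracker");
    static final List<String> SLAVE_DAEMONS = Arrays.asList("DataNode", "TaskTracker");

    static List<String> missing(String jpsOutput, List<String> expected) {
        // Each jps line is "<pid> <name>"; compare whole names so that
        // "SecondaryNameNode" is not mistaken for "NameNode".
        Set<String> running = new HashSet<String>();
        for (String line : jpsOutput.split("\n")) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length >= 2) {
                running.add(tokens[1]);
            }
        }
        List<String> absent = new ArrayList<String>();
        for (String daemon : expected) {
            if (!running.contains(daemon)) {
                absent.add(daemon);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        String slaveOutput = "11501 DataNode\n11712 Jps\n11695 TaskTracker\n";
        System.out.println("Missing on slave: " + missing(slaveOutput, SLAVE_DAEMONS));
    }
}
```

An empty list means every daemon you expect on that machine is running. Anything listed as missing is a good reason to look at that daemon's log file before going on to Part 7.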

Part 7: Running first Map Reduce on Multi-Node Setup

If everything was successful you can run your multi-node map reduce job.

Let’s get some ebooks in UTF-8 format:

http://www.gutenberg.org/ebooks/118

Now we need to push the book to our HDFS. Run the following command, editing the path and filename to match where you saved the book:

> hadoop dfs -copyFromLocal /home/user/Downloads/*.txt /user/hduser/hdinput

Let’s run our map-reduce example that counts the number of times each word occurs in the document:

> hadoop jar ../hadoop-examples-1.0.0.jar wordcount /user/hduser/hdinput /user/hduser/hdinput_result
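To get a feel for what the job produces, here is a single-machine sketch of the same idea in plain Java. It is not the Hadoop example itself: LocalWordCount is a name I made up, and I am assuming that words are simply split on whitespace, so punctuation stays attached to the word next to it.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-machine sketch of what a word count computes; not the Hadoop job itself.
final class LocalWordCount {

    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        // "Map" step: split the text into whitespace-separated tokens.
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            // "Reduce" step: sum the occurrences of each distinct token.
            Integer seen = counts.get(word);
            counts.put(word, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {cat=2, other=1, saw=1, the=2}
        System.out.println(count("the cat saw the other cat"));
    }
}
```

On the cluster the splitting and the summing run on different machines, but the result has the same shape: one line per distinct word with its count.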

Check the following logs on the slave machine to see which map-reduce jobs were completed:
> hadoop-hduser-tasktracker-ubuntu.log
> hadoop-hduser-jobtracker-ubuntu.log
> hadoop-hduser-datanode-ubuntu.log

If you get stuck or get an error check my other blog post with tips when running hadoop on ubuntu:

http://thysmichels.com/2012/02/11/tips-running-hadoop-on-ubuntu/

Hope this was helpful. If you have any questions, please leave me a comment.

Working with HDFS Java Example

This Java example shows how we can work with the Hadoop file system (HDFS).

Prerequisite for using the code in Eclipse is that you download and add the following jars to your project libraries:

  •  hadoop-core-0.20.2.jar
  • commons-logging-*.jar

See comments in code:


import java.io.IOException;
// Hadoop imports
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

/**
 * Writes a short message to a file in HDFS and reads it back.
 *
 * @author thysmichels
 */
public class HDFSWordCounter {

    // Change this to a string argument in main
    public static final String inputfile = "hdfsinput.txt";
    public static final String inputmsg = "Count the amount of words in this sentence!\n";

    /**
     * @param args not used
     */
    public static void main(String[] args) throws IOException {
        // Create a default Hadoop configuration
        Configuration config = new Configuration();
        // Pass the configuration in to get a handle on the file system
        FileSystem fs = FileSystem.get(config);
        // Specify a new file in HDFS
        Path filenamePath = new Path(inputfile);

        try {
            // If the file already exists, delete it
            if (fs.exists(filenamePath)) {
                fs.delete(filenamePath, true);
            }

            // FSDataOutputStream to write inputmsg into the HDFS file
            FSDataOutputStream out = fs.create(filenamePath);
            out.writeUTF(inputmsg);
            out.close();

            // FSDataInputStream to read the message back out of the file
            FSDataInputStream in = fs.open(filenamePath);
            String msgIn = in.readUTF();
            // Print to screen
            System.out.println(msgIn);
            in.close();
        } catch (IOException ioe) {
            System.err.println("IOException during operation " + ioe.toString());
            System.exit(1);
        }
    }
}

In this example we created a Hadoop Configuration, specified a Path for our file, wrote a string to the file and read the string back out of it using the HDFS library.
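One detail worth knowing: writeUTF does not write plain text. The HDFS stream classes extend java.io.DataOutputStream and java.io.DataInputStream, so the string is stored with a 2-byte length prefix and has to be read back with readUTF. The sketch below shows this with in-memory streams standing in for HDFS; UtfRoundTrip is a name of my own choosing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch using in-memory streams in place of HDFS to show the writeUTF/readUTF format.
final class UtfRoundTrip {

    static byte[] encode(String msg) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(msg); // 2-byte length prefix, then the encoded characters
            out.close();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static String decode(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            try {
                return in.readUTF();
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = encode("Count the amount of words in this sentence!\n");
        System.out.println("bytes stored: " + stored.length);
        System.out.print(decode(stored));
    }
}
```

If you want a file that other tools can read as ordinary text, writing the bytes of the string directly is probably a better fit than writeUTF.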

Play around with this to solve more intricate problems.

Arduino Blinking Light

Yeah, my first Arduino Blinking Light project is done, check it out. Place an LED in the breadboard, connect wires to the Arduino board, and add a few lines of code, seen below:

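int ledPin = 13;  // pin the LED is connected to

void setup()
{
  pinMode(ledPin, OUTPUT);     // configure the LED pin as an output
}

void loop()
{
  digitalWrite(ledPin, HIGH);  // LED on
  delay(1000);                 // wait one second
  digitalWrite(ledPin, LOW);   // LED off
  delay(1000);                 // wait one second
}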

And you have a blinking light…AMAZING…maybe just for geeks!!!:)

 

Tips running Hadoop on Ubuntu

Below are some tips for running Hadoop on Ubuntu. If you run into other errors running Hadoop on Ubuntu, please leave a comment with the problem and how you solved it.

When you get this Warning: $HADOOP_HOME is deprecated

Solution: add export HADOOP_HOME_WARN_SUPPRESS="TRUE" to hadoop-env.sh.

Cannot create directory `/usr/local/hadoop/libexec/../logs

Solution: sudo chown -R hduser:hadoop /usr/local/hadoop/

Enter passphrase when running ./start-all.sh

Solution: ssh-keygen -t rsa -P ""     Create an ssh key without a passphrase.

Warning: <property>/<configuration> not set

Solution: make sure the <property> and <configuration> tags are populated in core-site.xml, mapred-site.xml and hdfs-site.xml

Send or retrieve file to and from HDFS

Solution:

Send file to HDFS > bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

Get file from HDFS > bin/hadoop dfs -get foo localFile

ssh: connect to host localhost port 22: Connection refused
Solution: By default Ubuntu will not have ssh installed so you will have to install and start it.

Install > sudo apt-get install ssh

Start > sudo service ssh start

hadoop Incompatible namespaceIDs in /app/hadoop/tmp/*

Solution: 

  1. Stop the cluster: ./stop-dfs.sh
  2. Delete the directory specified on the DataNode: rm -r /app/hadoop/tmp/*
  3. Reformat the NameNode: hadoop namenode -format

OR

  1. Stop the DataNode: ./stop-dfs.sh
  2. Edit the value of namespaceID in /current/VERSION to match the value of the current NameNode.
  3. Restart the DataNode: ./start-dfs.sh
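Before editing anything by hand, you can confirm that the IDs really differ. The sketch below assumes the VERSION files are plain key=value text with a namespaceID entry, which is how they looked on my setup. NamespaceIdCheck is a hypothetical helper, and the IDs in the sample are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper: compares the namespaceID entries of two VERSION files.
final class NamespaceIdCheck {

    static String namespaceId(String versionFileText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(versionFileText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty("namespaceID");
    }

    static boolean sameNamespace(String nameNodeVersion, String dataNodeVersion) {
        String nn = namespaceId(nameNodeVersion);
        return nn != null && nn.equals(namespaceId(dataNodeVersion));
    }

    public static void main(String[] args) {
        // Made-up IDs for illustration; real values come from your own VERSION files.
        String nameNode = "namespaceID=1111\nstorageType=NAME_NODE\n";
        String dataNode = "namespaceID=2222\nstorageType=DATA_NODE\n";
        System.out.println("IDs match: " + sameNamespace(nameNode, dataNode));
    }
}
```

When the two IDs already match, the error has some other cause and neither of the fixes above is likely to help.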

hadoop java.net.UnknownHostException: ubuntu: ubuntu

Solution: 

1. Add ubuntu as your localhost IP to your /etc/hosts file: sudo vi /etc/hosts

2. Restart your network: sudo /etc/init.d/networking restart

So your /etc/hosts file on your master machine will look something like this:

172.16.62.152      master
172.16.62.151      slave
172.16.62.152      ubuntu

On your slave machine

172.16.62.152      master
172.16.62.151      slave
172.16.62.151      ubuntu

If none of this works, you can replace the master/localhost hostname with the IP address in core-site.xml and mapred-site.xml.

Force.com Contact Apex HTML5 iPhone App

This is part 2 of the tutorial so make sure you have completed Part 1:

http://thysmichels.com/2012/02/07/force-com-contact-apex-html5-tutorial/

We will now create a hybrid iPhone app.

Step 1. Create a new remote access setting by navigating to Setup-> Developer -> Remote Access. Create a remote access by specifying a callback URL:

sfdc://success

Note: Remember the newly created consumer key, because you will use it in the bootstrap.js file.

Step 2. Download and Install the Salesforce iOS SDK

Download:

https://github.com/forcedotcom/SalesforceMobileSDK-iOS

Install:

./install.sh

Step 3. Open Xcode and create a new Force.com Hybrid App

Step 4. Delete the www folder and replace it with the new www folder, which you can download from:

https://github.com/forcedotcom/SalesforceMobileSDK-Samples

Step 5. Navigate to the bootstrap.js file and edit the following settings

var debugMode = true;

// The client ID value specified for your remote access object that defines
// your application in Salesforce.
var remoteAccessConsumerKey = "###";

// The redirect URI value specified for your remote access object that defines
// your application in Salesforce.
var oauthRedirectURI = "sfdc://success";

// The authorization/access scope(s) you wish to define for your application.
var oauthScopes = ["visualforce","api"];

// The start page of the application.  This is the [pagePath] portion of
// http://[host]/[pagePath].  Leave blank to use the local index.html page.
var startPage = "apex/SalesKing";  // Visualforce start page; leave blank for a local REST-based "index.html" PhoneGap app.
//var startPage = "apex/BasicVFPage"; //used for Visualforce-based apps

// Whether the container app should automatically refresh our oauth session on app foreground:
// generally a good idea.
var autoRefreshOnForeground = true;

Step 6. Run your application. It will prompt you to log in to Salesforce using your Salesforce username and password.

Step 7. It will ask to allow or deny access to the application you specified in step 1. Click ‘Allow’ to access the application.
