Getting Salesforce Data into Hadoop
Daan Debie / June 07, 2015
4 min read
A recent announcement about a partnership between Salesforce and, among others, Cloudera and Hortonworks, seems to be mostly about getting data from your Cloudera or Hortonworks Hadoop clusters into the Salesforce Analytics Cloud (a.k.a. Salesforce Wave). What would be far more useful to me (and hopefully lots of other people) is the ability to go the other way around: getting your Salesforce data into your Hadoop cluster. That would open up endless possibilities to combine and analyse your Salesforce data with domain-specific data, log data, click data and all the other data that you collect in the operation of your business.
At The New Motion we were struggling with that exact problem. The New Motion operates tens of thousands of charge points and facilitates many more customers with charge cards, all with the goal of making electric driving as seamless an experience as possible. We record all charge session data, and eventually store it in our Hadoop cluster, together with other data generated by the business. A problem emerged when we moved from using our own homegrown CRM and Ordering system, to Salesforce. Proper analysis of our data requires us to have CRM and order data in our Hadoop cluster, next to the data generated by our own backend software.
My colleague and I set out to create a tool that allowed us to periodically import all data from our company’s Salesforce organisation into our Hadoop cluster, and we’re excited to have now open sourced that tool.
Going by the imaginative and exciting name Salesforce2hadoop, our tool comes in the form of a command line utility that can do either a full import or an incremental import of data of one or more types from Salesforce into HDFS (or your local filesystem). You import only what you want to import. Not only does it support standard Salesforce data types, such as Account, Opportunity, etc., but it also supports your own custom types (often recognisable by the __c suffix in the API). Salesforce2hadoop can be found on GitHub, along with a comprehensive user manual.
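The difference between a full and an incremental import essentially comes down to remembering a watermark: the timestamp of the last successful import, so the next run only fetches records modified since then. A minimal sketch of that idea in Scala (the `Record` type and the watermark bookkeeping are illustrative, not Salesforce2hadoop’s actual internals):

```scala
import java.time.Instant

// Illustrative stand-in for a Salesforce record; the real tool works
// with SOAP results described by the Enterprise WSDL.
case class Record(id: String, lastModified: Instant)

object IncrementalImport {
  // A full import (no watermark) takes everything; an incremental
  // import only takes records modified after the stored watermark.
  def selectForImport(records: Seq[Record], watermark: Option[Instant]): Seq[Record] =
    watermark match {
      case None            => records                                   // full import
      case Some(threshold) => records.filter(_.lastModified.isAfter(threshold)) // incremental
    }

  // After a successful import the watermark advances to the newest
  // record seen, so the next run picks up where this one left off.
  def newWatermark(imported: Seq[Record], previous: Option[Instant]): Option[Instant] =
    (imported.map(_.lastModified) ++ previous)
      .reduceOption((a, b) => if (a.isAfter(b)) a else b)
}
```

The watermark must be persisted between runs; lose it and the only safe option is a fresh full import.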
Salesforce2hadoop has been built on the shoulders of giants. We chose Scala as the programming language for this tool, firstly because it’s the language we’re most familiar and comfortable with, but also because of the many great JVM libraries we could leverage for interacting with Hadoop/HDFS.
The most important library in that regard is KiteSDK. KiteSDK is a great library/toolkit built by the guys over at Cloudera that provides some nice abstractions for working with data in Hadoop/HDFS. It makes it really easy to create datasets with a specific schema, and to read and write records from and to those datasets, without having to deal with the underlying low-level APIs. We use it to write and update datasets in HDFS, serialized in Avro format.
We use Apache Avro to serialize data when we write it to HDFS. One great advantage of that is the ability to evolve the schema without having to reimport everything. With each import, Salesforce2hadoop updates the Avro schema to reflect the contents of the Enterprise WSDL of your Salesforce organisation.
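To make the schema-evolution point concrete: Avro can read data written with an older schema as long as newly added fields carry a default value. A hypothetical example (the `Region__c` custom field is illustrative, not part of the standard Account type):

```json
{
  "type": "record",
  "name": "Account",
  "fields": [
    {"name": "Id",        "type": "string"},
    {"name": "Name",      "type": ["null", "string"], "default": null},
    {"name": "Region__c", "type": ["null", "string"], "default": null}
  ]
}
```

Records written before `Region__c` was added to your Salesforce organisation are simply read back with the default `null`, so no reimport is needed.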
Data is extracted from Salesforce using WSC, a Java library for interacting with Salesforce using SOAP. WSC is a higher-level abstraction on top of the regular Salesforce SOAP interface.
Most of the heavy lifting of converting data from the XML that comes out of the Salesforce SOAP interface into Avro is done by another tool that we created: WSDL2Avro (yet another marvellous name). WSDL2Avro allows you to convert any WSDL file (SOAP web service definition) into a set of corresponding Avro schemas, keeping as much of the original data types and structure as possible. When you use Salesforce2hadoop, the Enterprise WSDL of your Salesforce organisation is converted to Avro schemas using WSDL2Avro, and the data imported from Salesforce is then converted to Avro format using those schemas.
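To illustrate the kind of mapping involved (a simplified sketch; the exact rules live in the WSDL2Avro repository), a type definition in the WSDL such as:

```xml
<complexType name="Account">
  <sequence>
    <element name="Id" type="xsd:string"/>
    <element name="NumberOfEmployees" type="xsd:int" minOccurs="0"/>
  </sequence>
</complexType>
```

would translate into an Avro record along the lines of:

```json
{
  "type": "record",
  "name": "Account",
  "fields": [
    {"name": "Id", "type": "string"},
    {"name": "NumberOfEmployees", "type": ["null", "int"], "default": null}
  ]
}
```

Optional XSD elements naturally become nullable union types in Avro, which is also what makes the schema evolution described above possible.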
WSDL2Avro is available on GitHub and can be used in any Maven- or SBT-powered project.
With the release of these tools, we hope to help the many people who want to combine their Salesforce data with all the other data in their Hadoop cluster. This should empower businesses to make the most of their Enterprise Data Warehouses, leverage more data and create more actionable insights.
Feel free to comment and/or contribute!