Don’t drown in the big data lake

Program Director, Digital Experience, IBM

Having been around the big data space for a number of years now, I hear a lot of questions that revolve around “What do I do with this data that I’ve stored in Hadoop?” or, more importantly, “How do I use this data that is now sitting here?” The fact is, there's a lot of hype surrounding big data, and Hadoop in particular.

On average, Hadoop vendors have been pushing Hadoop technology as the be-all and end-all for every big data challenge. The latest illustration of this is the emergence of the data lake terminology: “Just pour all of your data into our data lake and figure out what you want to do later. Granted, you will be paying us based on the total amount that you put in, but heck, you don’t want to lose that potential business insight, do you?” Add to this even more fluff about turning this data lake into some sort of enterprise data hub and, sheesh, nobody has yet shown us how to use this data.


Enter the world of analytics, another relatively overhyped (and incorrectly used) concept. Analytics is often positioned as the magic that you can apply to any data set to “turn coal into diamonds.” Yet the vast majority of the time, this is not the case.

There are a number of risks that surface once you want to tap into big data stores for analytics. Issues that organizations face include:

  • Incompatible architectures. The architecture of traditional analytic products is not suited to distributed computation.
  • Incompatible algorithms. Out-of-the-box statistical algorithms are not designed to work with big data (these algorithms expect the data to come to them, but big data is too costly to move).
  • Lack of skills. Data scientists and unicorns, anyone? Performing state-of-the-art analytics on big data requires new skills and intimate knowledge of big data systems. Very few analysts have these skills.
  • Scale. In-memory solutions work for medium-size problems, but do not scale well to truly big data.

Doom and gloom, right? Not so fast.

One of the benefits of working at IBM is that we have been facing these challenges for years (think way before big data and Hadoop became buzzwords) and have been building solutions to facilitate analytics on massive data sets. One solution that addresses the Hadoop analytics challenge is our SPSS Analytic Server.

Analytic Server provides a data-centric architecture that leverages big data systems, such as Hadoop MapReduce with data in the Hadoop Distributed File System (HDFS). Baked right into the solution is a defined interface for incorporating new statistical algorithms designed to go to the data (rather than bringing the data to the analytics). And while you might still be learning the nuances of Hadoop and the many new open source tools and projects associated with it, Analytic Server maintains the recognizable SPSS user interface, which hides the details of big data environments so that your analysts can focus on analyzing the data, not on some oozie zookeeper swimming with jaqls in the data lake.
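To make "algorithms that go to the data" a bit more concrete, here is a minimal Python sketch of the general idea. It is not Analytic Server's interface or code; the names `Summary`, `summarize`, `merge`, and `distributed_mean_variance` are hypothetical, and plain Python lists stand in for HDFS partitions. Each partition is reduced to a tiny summary where it lives, and only those summaries are combined to get the overall mean and variance.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Summary:
    """Small, fixed-size statistics computed where a data partition lives."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the partition mean


def summarize(partition: Iterable[float]) -> Summary:
    """'Map' step: a single pass over local data (Welford's update)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in partition:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return Summary(count, mean, m2)


def merge(a: Summary, b: Summary) -> Summary:
    """'Reduce' step: combine two summaries without touching raw rows."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    count = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / count
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / count
    return Summary(count, mean, m2)


def distributed_mean_variance(partitions: List[List[float]]) -> Tuple[float, float]:
    """Return (mean, population variance) from per-partition summaries."""
    total = Summary(0, 0.0, 0.0)
    for part in partitions:
        # In a real cluster, summarize() would run next to each data block.
        total = merge(total, summarize(part))
    variance = total.m2 / total.count if total.count else 0.0
    return total.mean, variance


if __name__ == "__main__":
    # Three lists stand in for three blocks of a much larger file.
    blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
    print(distributed_mean_variance(blocks))
```

The point of the design is that only three numbers per partition ever cross the network, no matter how many rows each partition holds. That is the property an algorithm needs before it can sensibly run inside a MapReduce-style system.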

The server sits between your application and the Hadoop data store. You essentially just point your program (think SPSS Modeler or even SPSS Analytic Catalyst) at the Hadoop data; Analytic Server orchestrates the job to run in Hadoop and then sends the results back to that analytic application. Voilà! You have Hadoop analytics, and you can refine and analyze further as your heart desires.

I’m not going to dive deep into the nuances of Hadoop here (we can save that for another day), but I wanted to point out that bringing analytics to the data is a key aspect of IBM InfoSphere BigInsights for Hadoop. BigInsights and SPSS offer a clear answer to “How do I use this data that is now sitting here?” The magic of analytics comes across when you start using the two together to:

  • Perform sentiment analysis to know more about your customers
  • Gain operational efficiency by analyzing machine logs
  • Automate predictive maintenance based on analysis of sensor data

So if you want to get hands-on (I’m not sure why you wouldn’t!), I'll direct you to a few areas.

  • If you have not played around with Hadoop yet and want to know what the fuss is all about, we have a non-production version of our Hadoop offering: BigInsights QuickStart. The nice thing about QuickStart is that it not only comes with all of the cool technologies, like text analytics, that come with our Enterprise Edition, but it is also downloadable as a virtual image, which means no configuration! Yeah, the download is a beast, but you get the full experience of a full-fledged Hadoop environment, with tutorials to boot!
  • If you have already done that Hadoop thing, or just want to fast-forward to the analytics, you can opt for one of two routes:
    1. SPSS Statistics is going to drop you headfirst into one of SPSS’ keystone products (SPSS Statistics defines statistics).
    2. Otherwise you can opt to check out SPSS Analytic Catalyst in our Analytics Zone. The hosted (yet live) demo there lets you upload .csv data and specify a column to analyze; it then comes back with a visual summary of all the relationships that the data set has with the column you selected. Very cool stuff, and very helpful in showing you patterns that you might not know exist, right in front of your nose.

And if you were wondering, the SPSS Analytic Catalyst demo uses our own InfoSphere BigInsights as the Hadoop data store in the background.

Don’t take my word for it: get your hands on it and try it for yourself! Start to see what big data and analytics can do for you today.