Big data analytics supercharging enterprise interest in R

General Manager & VP Analytic Solutions, Netezza, an IBM Company

The buzz around big data is driving further interest in the entire analytics market. Applying analytics to big data is the driver behind creating new, game-changing business value for enterprises. New analytic techniques and tools are being introduced into the enterprise to help meet big data analytic challenges. At a market buzz level, many of these tools and approaches appear equivalent, but when you look into the details there are distinct differences, both in what the tools offer today and in the direction they are taking, that will either extend or constrain your big data analytic capabilities.

R is one of the tools on a fast path to adoption by enterprises because it is the most comprehensive open source tool and has similar capabilities to traditional enterprise analytic tools. However, my experience is that R is not readily understood in the marketplace, and the variety of ways in which vendors support R adds to the confusion. In this post, I’ll describe R and clearly differentiate the options available, to minimize market confusion and to help our customers make the best decision for their company.

R is an open source statistical language that was launched in the 1990s. In addition to the language, there are also several open source analytic development environments (ADEs), or front-end GUI tools, for creating analytics and R-based models. Often when people refer to R they are referring to the huge library of open source analytics called CRAN (Comprehensive R Archive Network). There are thousands of analytics bundled into packages in this library, which can be overwhelming. The CRAN library includes general interest categories such as statistics, time series, econometrics, machine learning, high-performance computing and natural language processing. The library also includes deep, industry-specific analytics such as:

  1. Clinical trial design, monitoring and analysis
  2. Chemometrics and computational physics

One of the more appealing features of R is that when researchers today invent new analytic techniques they are often simultaneously developing them in R and placing them into this open source repository. That means that the best and latest thinking is available almost instantly to the marketplace.

The R language started out with one deployment approach - running on the desktop - and has evolved from there into several common deployment approaches:

Single server approach - In a single-server implementation all three major R components – the language, the GUI and the analytics – run on one machine. For smaller data sets, this is a simple, easy configuration where all three components are tightly integrated. However, as data sets get larger, this configuration starts to fail. The analytics in most R packages are designed to run in-memory, which means that the data to be processed by the analytic is loaded into memory. As the data to be processed gets larger, it no longer fits into the memory of a single server. Yes, you can add memory, and people do that, but eventually this becomes a choke point.

There are also “big data packages” in the CRAN library. The big data packages include a variety of approaches to mitigate or minimize the memory choke point. For example, one package (the “foreach” package) extends the R language to include a looping construct so that a subset of the data to be processed is brought into memory, processed and then moved out of memory to make room for more data to be processed. Another package (the Rmpi package) provides an interface to the Message Passing Interface (MPI), a technique used in high-performance computing for sharing data between independent nodes in a parallel, shared-nothing environment. While each of these enables processing of larger data sets on a single server, none of them scales to the big data sizes that the marketplace is demanding today.
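The chunk-at-a-time idea described above can be sketched in a few lines. The example is written in Python purely for illustration; it is not the API of the foreach package or of any CRAN package, and the helper names `chunks` and `chunked_mean` are hypothetical. It assumes the analytic (here, a mean) can be computed from a small running state.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` items from `values`."""
    it = iter(values)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def chunked_mean(values: Iterable[float], size: int = 1000) -> float:
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for block in chunks(values, size):
        # Only the running total and count survive between chunks;
        # each block becomes garbage once it has been processed.
        total += sum(block)
        count += len(block)
    if count == 0:
        raise ValueError("no data to average")
    return total / count


# Usage: the input can be a generator, so the full data set never has
# to exist in memory at once.
result = chunked_mean((float(x) for x in range(1, 101)), size=7)
```

This relieves the memory choke point, but the work is still done by one machine, which is why the approach does not by itself scale to very large data sets.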

Pass-through approach – The pass-through approach is often used to integrate other software with R. In this approach the other software package collects the data and then passes the data off to R for analytics processing. This effectively moves the data to the analytics. As an example, SAS has taken this approach with the SAS/IML Studio integration with R. In this approach, data movement (I/O) becomes the bottleneck as the data scales up to larger data sets.

Two-tier architecture - This is a client-server configuration. The client is primarily used to run the R GUI front-end tool or a web service for the R front-end visualization tool. The server runs the R language and executes the analytics that have been pushed down from the client. By moving the analytics processing closer to the data, this approach reduces the I/O impact of data movement and lets the analytic benefit from the compute power of a large server. As applications scale to big data, parallel compute environments, such as a massively parallel processing (MPP) architecture or a distributed compute environment (grid) running Hadoop, become the prevalent server technologies. With both MPP and Hadoop, the R language runs on each of the nodes in the parallel environment and takes advantage of the parallel data distribution that each provides.

The analytics in the CRAN repository (which are not “automatically” converted to parallelized analytics) still place the data into memory for the analytics processing. But now each node works simultaneously on a smaller data set, and the results are brought together after every worker node has completed its task. This helps to minimize the memory choke point and takes advantage of parallel computational processing. So, R running on an MPP or Hadoop environment can have significant scale and performance benefits.

However, there are many commonly used analytics in the CRAN repository that can’t take advantage of MPP or distributed processing because the computational steps in the analytic are not parallel, either inherently or by design. A simple example of this issue is illustrated by comparing the calculations of “average” and “median” in parallel. The workload for the analytic function “average” can be split into completely independent pieces. When you have a group of numbers and want to compute the average, you can divide and conquer the workload by averaging random subsets of the group. You then take a weighted average of the subset averages, weighting each by the size of its subset, and you’ll get the correct average for the entire group. However, if you try to distribute the calculation of median by determining the “middles” for random subsets of the data in parallel, and then take the middle of all the “middles,” you will in general not get the median of the full list (try it!).
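The "try it" above can be made concrete with a small sketch. It is written in Python for illustration only; the data values and the three-way split are made up, and the partitions simply stand in for the slices of data that separate worker nodes would hold.

```python
from statistics import mean, median
from typing import List, Sequence


def combined_mean(partitions: Sequence[Sequence[float]]) -> float:
    """Average each partition independently, then weight by partition size."""
    partials = [(len(p), mean(p)) for p in partitions]
    total = sum(n for n, _ in partials)
    return sum(n * m for n, m in partials) / total


def median_of_medians(partitions: Sequence[Sequence[float]]) -> float:
    """The naive 'middle of the middles'; in general NOT the true median."""
    return median([median(p) for p in partitions])


# Hypothetical data, split unevenly across three "nodes".
data: List[float] = [1, 2, 3, 4, 100, 200, 300, 400, 500]
partitions = [data[0:2], data[2:4], data[4:9]]

true_mean = mean(data)                        # 1510 / 9
parallel_mean = combined_mean(partitions)     # matches true_mean up to rounding

true_median = median(data)                    # 100
naive_median = median_of_medians(partitions)  # 3.5, far from the true median
```

The weighted combination reproduces the overall average, while the median of the partition medians lands on 3.5 instead of 100. A correct parallel median needs a different algorithm, which is the point the paragraph above is making.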

For this category of analytics, the analytic must be designed and programmed to process in a parallel environment. This type of parallel programming is a highly specialized form of mathematics and software development. Whether programming with MapReduce in a Hadoop environment or using user-defined extensions in an MPP environment, the analytic must be written using parallel programming principles to truly benefit from these environments.
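As a rough illustration of what "designed for a parallel environment" means, the sketch below computes a sample variance in a map-then-reduce style. It is plain Python, not Hadoop MapReduce code or an MPP user-defined extension, and the function names are hypothetical. The merge step uses the well-known pairwise update for combining counts, means and sums of squared deviations; the key design choice is that each partition is reduced to a small summary that can be merged in any order.

```python
from functools import reduce
from statistics import variance
from typing import Sequence, Tuple

# (count, mean, sum of squared deviations from the mean)
Summary = Tuple[int, float, float]
EMPTY: Summary = (0, 0.0, 0.0)


def summarize(partition: Sequence[float]) -> Summary:
    """'Map' step: reduce one partition to a small, mergeable summary."""
    n = len(partition)
    if n == 0:
        return EMPTY
    m = sum(partition) / n
    m2 = sum((x - m) ** 2 for x in partition)
    return (n, m, m2)


def merge(a: Summary, b: Summary) -> Summary:
    """'Reduce' step: combine two summaries without revisiting raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return EMPTY
    delta = mean_b - mean_a
    mean_ab = mean_a + delta * n_b / n
    m2_ab = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean_ab, m2_ab)


def parallel_variance(partitions: Sequence[Sequence[float]]) -> float:
    """Sample variance computed from per-partition summaries only."""
    n, _, m2 = reduce(merge, map(summarize, partitions), EMPTY)
    if n < 2:
        raise ValueError("need at least two values")
    return m2 / (n - 1)


# Usage: compare against a single-machine computation on the same data.
parts = [[1.0, 2.0], [3.0, 4.0, 5.0]]
flat = [x for p in parts for x in p]
distributed = parallel_variance(parts)
single_node = variance(flat)
```

In a real cluster the `summarize` calls would run on the worker nodes and only the three-number summaries would travel over the network; here they run sequentially just to show the structure.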

This need to “parallelize” analytics carries over directly to those CRAN packages that you may want to use in Hadoop and MPP environments. So, even though vendors offer the ability to execute the CRAN packages in these environments, to fully benefit from the distributed server environments the CRAN packages need to be rewritten for parallel execution or replaced with equivalent analytics that are designed for parallel environments. Understanding these needs, vendors offer a breadth of alternatives to close the gap created by moving CRAN to distributed, parallel environments, each of which is optimized for that vendor’s platform.

For example, IBM® InfoSphere® BigInsights, which is a commercial distribution of Hadoop, includes pre-built, parallelized MapReduce analytics that are usable with Hadoop and also through R. These libraries of parallelized text and machine learning analytics can be used with other open source CRAN analytics to create and deploy R models in Hadoop.

The IBM Netezza® data warehouse appliance is a purpose-built MPP analytic appliance that uses hardware acceleration to eliminate the I/O bottleneck and optimizes memory and CPU processing for analytic workloads. The IBM Netezza data warehouse appliances include a large and growing library of pre-built, in-database analytics that can be used with the R language or with many other languages and tools, such as SPSS Modeler and Microsoft Excel. This library covers a broad range of in-database analytics including:

  • Transformations
  • Statistics
  • Data mining
  • Predictive analytics
  • Spatial analytics

In addition to the pre-built analytics, IBM Netezza data warehouse appliances include a parallelized linear algebra capability that helps mathematicians create in-database analytics using matrix math that can be accessed through R without any explicit parallel programming.

For R, the IBM Netezza data warehouse appliance uses Revolution R Enterprise from Revolution® Analytics, and BigInsights can as well. Revolution R Enterprise is the enterprise-ready, commercial distribution of R that includes performance enhancements and increased security and reliability. For enterprise customers, it is critical to have a commercial distribution of R that maintains continuity with the open source distribution in order to quickly obtain the new innovations available in the open source distribution. The Enterprise R support that Oracle recently announced for 11g is a fork, or divergence, from the open source path, which will make it difficult for Oracle to maintain interoperability with the open source updates.

The combination of Revolution R Enterprise, open source CRAN analytics and the rich, powerful in-database and MapReduce analytics available on IBM Netezza Analytics and IBM BigInsights, respectively, makes it easier and faster to deploy production-ready R-based analytic models.

As you evaluate the use of R for your organization, it is important that you consider the following:

  1. How much data do I need/want to use when building R-based models?
  2. How quickly do I need to go from model build to production deployment?
  3. Do I want tools that help me with “quick to fail” analytic model building strategies?
  4. Does the environment enable me to create analytic models faster by supporting parallel analytics in the build stage?
  5. How does the environment mitigate the big data processing bottlenecks for I/O, memory and CPU?
  6. How do I get support for R?
  7. How difficult is it for me to create my own parallel analytics for use in R?
  8. How quickly will the platform support new versions of R?
  9. What is the strategy for maintaining continuity with the open source version of R?

The IBM Big Data platform, including IBM Netezza data warehouse appliances and IBM InfoSphere BigInsights, supports R, enabling our customers to leverage the strengths of each of these parallel computing environments. Our Big Data platform provides flexibility to address increasing data volumes, velocity and variety of data.
