Are you ready for big data? R is.

May 9, 2012

Further to news of SUNY’s exploration of big data to understand possible causes of multiple sclerosis, I spoke with David Smith, VP of Marketing at Revolution Analytics, for a briefing on some advantages of R for analysis of large data sets.

Mike: Dr. Murali Ramanathan, from the SUNY research team at University at Buffalo, identifies the open source R project’s flexibility as contributing to the team’s ability to investigate large numbers of data sets. Can you give us an overview of R’s flexibility?

David: The big difference between R and other data analysis software is that it’s a fully-fledged language, not a collection of black-box procedures or a point-and-click tool. That means you have unlimited flexibility to combine data sources, select and transform variables, and visualize data. The only limit is your imagination.

In addition, R is extraordinarily comprehensive. Not only is every standard data analysis procedure built into the language, but you can also download even more capabilities from online open-source repositories like CRAN and Bioconductor, all for free. The Bioconductor project extends R to provide cutting-edge tools for the analysis of genomic data, which is critical for researchers like the SUNY Buffalo team working with genetic sequences.

Mike: The SUNY scientists research multiple variables among very large data sets as they investigate significant interactions between thousands of genetic and environmental factors. What makes R so well-suited to analysing big data?

David: While the R language itself is designed to work with in-memory data, several extensions exist for R to work with out-of-memory data, which is essential for the analysis of Big Data. Revolution Analytics extends R to multi-gigabyte data sets, allowing you to create, manipulate and analyze big data objects in the R language. You can also connect Revolution R to Big Data infrastructures like Hadoop and IBM Netezza appliances to analyze terabyte-class data. For data analysts, this is the best of both worlds: the flexible R language combined with high-performance Big Data analytics.
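The general idea behind out-of-memory analysis can be shown with a small sketch. It is written in Python rather than R, and it is not the Revolution R API: the function names `chunks` and `streaming_mean_var` are hypothetical, chosen only for illustration. The sketch reads data one block at a time and updates a running mean and variance (Welford's method), so only a single block ever needs to fit in memory.

```python
from typing import Iterable, Iterator, List, Tuple


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` items.

    Hypothetical stand-in for reading a large file block by block.
    """
    block: List[float] = []
    for v in values:
        block.append(v)
        if len(block) == size:
            yield block
            block = []
    if block:  # final, possibly shorter, block
        yield block


def streaming_mean_var(blocks: Iterable[List[float]]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance), holding one block at a time."""
    n, mean, m2 = 0, 0.0, 0.0
    for block in blocks:
        for x in block:
            # Welford's update: adjust the running mean and the running
            # sum of squared deviations (m2) with each new value.
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance


if __name__ == "__main__":
    # A generator stands in for a data set too large to load at once.
    data = (float(i) for i in range(1, 1_000_001))
    print(streaming_mean_var(chunks(data, 10_000)))
```

The same pattern (process a block, update a small summary, discard the block) is what lets a language designed for in-memory data work with files larger than RAM.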

Mike: From your knowledge of the work of the SUNY team and other R users, what are some benefits of R’s flexibility?

David: The flexibility of the R language allowed the SUNY team to incorporate many different kinds of data into the analysis, which increased the possible scope of discoveries of factors related to MS. R made it easy to mix and match discrete and continuous data and use different statistical distributions (like Poisson and Gaussian) just by modifying a few lines of code. (Without R, they would have had to rewrite the entire algorithm from scratch, a process that can take weeks.) And because there are so many variables to test, running R within the IBM Netezza appliance means that many dozens of studies can be processed simultaneously, instead of waiting for them to complete one by one. This cuts down the overall processing time by a factor of 100 or more.
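To make the "few lines of code" point concrete, here is a minimal sketch of how a statistical distribution can be treated as a swappable component. It is in Python rather than R and is not the SUNY team's algorithm: the names `FAMILIES` and `log_likelihood` are hypothetical, and the Gaussian case assumes a fixed standard deviation of 1 for simplicity. Changing the model from count data (Poisson) to continuous data (Gaussian) means changing a single argument.

```python
import math
from typing import Callable, Dict, Sequence


def poisson_logpmf(y: float, mu: float) -> float:
    """Log-probability of count y under a Poisson distribution with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def gaussian_logpdf(y: float, mu: float, sigma: float = 1.0) -> float:
    """Log-density of y under a Gaussian with mean mu (sigma assumed fixed)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)


# Hypothetical registry: adding a distribution means adding one entry here.
FAMILIES: Dict[str, Callable[[float, float], float]] = {
    "poisson": poisson_logpmf,
    "gaussian": gaussian_logpdf,
}


def log_likelihood(ys: Sequence[float], mu: float, family: str) -> float:
    """Total log-likelihood of the observations under the chosen family."""
    logp = FAMILIES[family]
    return sum(logp(y, mu) for y in ys)


if __name__ == "__main__":
    observations = [2, 3, 1, 4]
    for family in FAMILIES:
        # Only the `family` argument changes between the two analyses.
        print(family, log_likelihood(observations, 2.5, family))
```

In R itself, the analogous switch is typically the `family` argument of a model-fitting function such as `glm`; the sketch above only mirrors that design in a language-neutral way.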

Mike: For the SUNY team, the combination of Netezza and Revolution R is a breakthrough, disruptive technology, allowing them to use new algorithms and add multiple variables that were previously unthinkable. Can you cite other examples of R’s disruptive nature creating new value?

David: The R language is one of the driving forces behind the Data Science movement: combining the talents of computer scientists, statisticians and domain experts to make sense of Big Data. Data Science and R (often in combination with other technologies at the data layer and the presentation layer) are revolutionizing how just about every industry deals with Big Data, from government to finance and insurance, life sciences and even journalism.

For More Information:

  • SUNY Buffalo, IBM Netezza, IBM Big Data & MS Society Tweet Jam: 5/10/12 at 12pm EST - Follow #IBMDataChat on Twitter