Hadoop and the big data opportunity

Program Director, Marketing, IBM Analytics

According to Forrester Research in its recent Hadoop Solutions Wave, “Hadoop is a solution to the problem of big data.” I have a slightly different take on this statement: “Hadoop is a solution to the opportunity of big data.” Hadoop is effective for managing vast and varied data, and, when combined with data warehousing, security and visualization capabilities, it supports a modern data management strategy.

Whether your business objective is to monitor log files, understand customer sentiment, analyze clickstream data or address any number of other use cases, Hadoop may be the missing link in your architecture to capitalize on the opportunities presented by big data.

The skills gap

For many organizations, one of the key barriers to entry for Hadoop has been the skills gap. Every organization wants to leverage the skills it already has in-house for new projects. Dealing with new types of data, such as machine and sensor data or social media data, adds to the complexity. The good news for developers, and ultimately for the business leaders who make decisions based on all this data, is that common tools and languages such as SQL and R are making their way into Hadoop distributions.

Let’s take the latest version of IBM InfoSphere BigInsights, which is all about analytics and usability, enabling more organizations to get started with Hadoop or build on their existing projects. Released on March 28, BigInsights v2.1.2 adds:

  1. A broader and deeper analytics ecosystem with Big R
  2. Data management improvements to increase usability
  3. Two-click movement of data in and out of Hadoop
  4. Enhanced administrative capabilities to take the complexity out of Hadoop
  5. Continued commitment to open source

Deep analytic ecosystem

Deep analytics requires the right data to be in the right place at the right time. Moving data around can be costly and time consuming, slowing down business operations. Performing analytics on big data where it sits without having to move it first is easy with IBM’s in-Hadoop analytics. In BigInsights v2.1.2, we’re expanding our analytics ecosystem with a new feature called Big R.  

Big R is an end-to-end integration of R into BigInsights. R is a free programming language and software environment for statistical computing and graphics, commonly used by statisticians and data miners for data analysis and for developing statistical software. Big R empowers users who are familiar with the R language, allowing them to explore, visualize, transform and model big data right from their R environment, without any explicit programming in MapReduce or Jaql.

Need more info on R? Check out the R courses at Big Data University.


To glean value from the data stored in Hadoop, the platform needs to perform at the levels typical of traditional data management solutions. BigInsights has enterprise features that enhance performance and, in v2.1.2, includes enhancements designed to strengthen our data warehouse modernization use case with high-performance data import and export. By importing from remote file systems and supporting generic JDBC sources, not just DB2, Netezza and Teradata, BigInsights becomes an even more attractive part of a modern warehouse infrastructure.

BigInsights 2.1.2 also includes enhancements to HBase, making it easier and faster for users to manage the backup and restore process as well as recover data and visualize replication details. In addition, Big SQL enhancements improve usability (auto memory management), data warehouse modernization (referential constraints) and performance (pre-split).

Information integration

Getting data in and out of Hadoop quickly and in an automated fashion means more time analyzing and less time managing. One of the differentiating factors for BigInsights has been its broad set of integrations, which extend the power and possibilities of Hadoop. In BigInsights v2.1.2, we’ve added key information integration and governance features that help you more easily find and integrate the data you need, connecting your big data project with the rest of your organization and making it more successful.

  • InfoSphere Data Click allows you to move your data into and out of InfoSphere BigInsights with simplicity and speed.
  • The Information Governance Catalog is akin to a data dictionary: it centrally manages metadata about information sources and assets, and it captures and catalogs both big data assets and traditional data assets. Collecting and managing this metadata in one place makes it easy for anyone in the organization to search for and access data that’s normally scattered.

Closing the skills gap

As I mentioned, one of the main barriers to using Hadoop is the skills gap. To make Hadoop easier to work with, BigInsights comes with administrative features that take the complexity out of Hadoop, and v2.1.2 adds to those capabilities with enhancements that include logging and alert management as well as active file management (AFM).

Commitment to open source

IBM’s BigInsights Hadoop distribution is built on 100 percent open source Apache Hadoop, continuing IBM’s long and successful history with open source (Linux, Apache web server, Eclipse). In BigInsights v2.1.2, we include the following GA versions of open source components:

Component      Version
Hadoop HDFS    2.2.0
HBase          0.96.0
Hive           0.12.0
Pig            0.12.0
Oozie          3.3.2
Avro           1.7.4
Flume          1.3.1
Sqoop          1.4.3
Zookeeper      3.4.5
Chukwa         0.5.0
Lucene         3.3.0

With Big R, HDFS v2, data management improvements and enhanced enterprise integration, InfoSphere BigInsights v2.1.2 distances itself from other Hadoop offerings. How do you think it stacks up?

Want to know more? Check out what’s new with InfoSphere BigInsights and review the FAQs there (published by IBM Support) for more info. 

Ready to get started? Download your free trial today!