Bringing Big Data Up to the Big Leagues

Hadoop is integral to strategic analytics, but enterprise-scale data integration is required for confidence in big data

Product Strategy & Marketing - InfoSphere, IBM

Not so long ago, data grew slowly over time, and it grew in a linear manner. Today, data volumes are exploding in every facet of our lives. Business leaders are eager to harness the power of big data. However, before setting out into the big data world, they should understand that as the opportunity increases, ensuring that source information is trustworthy becomes exponentially more challenging than it used to be. If this trustworthiness issue is not addressed directly, end users may lose confidence in the insights generated from their data, which can result in a failure to act on opportunities or against threats.

To make the most of big data, organizations have to start with data they do trust. But the sheer volume and complexity of big data means that the traditional, manual methods of discovering, integrating, governing, and correcting information are no longer feasible. Information integration and governance (IIG) needs to be implemented to support big data applications, data warehouse, and data warehouse augmentation initiatives to provide appropriate governance and rapid integration from the very start. When organizations get behind, this effort is like trying to catch a wolf by its tail.

Recognizing Hadoop’s place in the stack

Today, performing big data analytics is strategic, and being able to supplement—or augment—data warehouses with this key information is essential. Apache Hadoop technology absolutely is an integral part of this process. Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer. Hadoop clearly changes the economics and the dynamics of large-scale computing.

While Hadoop and Hadoop-based solutions clearly have their advantages when it comes to addressing big data volumes, Hadoop is not designed to be a data integration solution. A recent Gartner research paper states the following on this point: “As use of the Hadoop stack continues to grow, organizations are asking if it is a suitable solution for data integration. Today, the answer is no. Not only are many key data integration capabilities immature or missing from the stack, but many have not been addressed in current projects.”*

Using Hadoop to build out a data integration solution is like building a house of straw. The basic shape may be right. But after factoring in the costs to develop all the necessary data integration functionality—broad connectivity, complex transformation logic, metadata management, appropriate data delivery styles, data quality, and governance—this house may as well blow away in the wind.

Embracing IT agility for performance and scalability

Several best practices for integration and governance help organizations take strategic advantage of big data. Big data streams in at a high velocity—so performance is key. Data changes rapidly, and it must be fed to various applications in the system quickly so that business leaders and decision makers can react to changing market conditions as soon as possible. To successfully handle big data, organizations need an enterprise-class data integration solution that has the following characteristics:

  • Dynamic and agile, to meet current and future performance requirements
  • Extendable and partitioned, for fast and easy scalability
  • Hadoop-leveraged, as part of an integration architecture—because Hadoop itself is not an integration platform

Scalability is one of the most challenging big data integration requirements because business requirements can evolve very quickly (see Figure 1). Consequently, when tackling big data integration, it is important to have a product that delivers the same function across all hardware architectures and scales with linear speedup, so that running a job n-way cuts its run time roughly n-fold.

Figure 1. Data scalability across hardware architectures
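
To make the idea of linear speedup concrete, here is a minimal sketch of how it is usually quantified: speedup is the single-partition run time divided by the n-way run time, and efficiency is that speedup divided by n. The function names and the timings are illustrative assumptions, not measurements from any product.

```python
def speedup(t_one: float, t_n: float) -> float:
    """Speedup of an n-way run relative to the single-partition run."""
    if t_one <= 0 or t_n <= 0:
        raise ValueError("run times must be positive")
    return t_one / t_n


def efficiency(t_one: float, t_n: float, n: int) -> float:
    """Fraction of ideal linear speedup achieved; 1.0 means perfectly linear."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return speedup(t_one, t_n) / n


# Made-up run times in seconds, for illustration only (not measurements).
t_one = 800.0
timings = {2: 400.0, 4: 200.0, 8: 125.0}

for n, t_n in timings.items():
    print(f"{n}-way: speedup={speedup(t_one, t_n):.2f}, "
          f"efficiency={efficiency(t_one, t_n, n):.2f}")
```

In this made-up series the 2-way and 4-way runs scale linearly, while the 8-way run falls short of the ideal. That kind of drop-off is what an integration platform has to avoid as data volumes grow.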


Working smarter, not harder

Employee time is a valuable and costly resource. An integration solution for big data that supports employee productivity and efficiency helps to improve the enterprise’s bottom line, eliminate bottlenecks, and enhance agility.

For IT departments, service-level agreements (SLAs) are often impacted by inefficiencies. As data volume, variety, and velocity grow, the time required to process data integration jobs frequently exceeds the window allowed by SLAs, meaning that IT can no longer meet the needs of internal customers.

To help improve productivity, create design logic for Hadoop-oriented data integration efforts that uses the same interface, concepts, and logic constructs used for any other deployment method. This logic can eliminate the need to invest time and resources in learning new coding languages as they evolve, or to fall back on legacy methods of performing manual coding for data integration work (see Figure 2).

Figure 2. Enhanced productivity through streamlined design logic for data integration
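
The idea of designing integration logic once and reusing it across deployment methods can be sketched in a few lines. In the example below, the job is an ordered list of record-level steps, and two hypothetical runners execute the same job: one in a single process and one over partitions, which stands in for a distributed engine such as Hadoop. All names here (`JOB`, `run_local`, `run_partitioned`, and the record fields) are assumptions made for illustration and do not come from any specific product.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Step = Callable[[Record], Record]


def trim_name(rec: Record) -> Record:
    """Remove stray whitespace around the name field."""
    return {**rec, "name": rec["name"].strip()}


def upper_country(rec: Record) -> Record:
    """Standardize country codes to upper case."""
    return {**rec, "country": rec["country"].upper()}


# The job is designed once, as an ordered list of steps.
JOB: List[Step] = [trim_name, upper_country]


def apply_job(records: List[Record], job: List[Step]) -> List[Record]:
    """Apply every step, in order, to every record."""
    out = []
    for rec in records:
        for step in job:
            rec = step(rec)
        out.append(rec)
    return out


def run_local(records: List[Record], job: List[Step]) -> List[Record]:
    """Deployment method 1: run the job in a single process."""
    return apply_job(records, job)


def run_partitioned(records: List[Record], job: List[Step],
                    partitions: int) -> List[Record]:
    """Deployment method 2: split round-robin, run the same job per partition."""
    chunks = [records[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        results = list(pool.map(lambda chunk: apply_job(chunk, job), chunks))
    merged = [rec for part in results for rec in part]
    return sorted(merged, key=lambda r: r["id"])  # restore a stable order


RECORDS = [
    {"id": 1, "name": " Ada ", "country": "gb"},
    {"id": 2, "name": "Grace", "country": "us"},
    {"id": 3, "name": " Linus", "country": "fi"},
]

print(run_local(RECORDS, JOB) == run_partitioned(RECORDS, JOB, partitions=3))
```

Because the steps never refer to how or where they run, switching the deployment method changes only the runner, not the design logic.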


Supporting a variety of big data sources and types

Organizations exploring big data analytics and using technologies such as Hadoop for data at rest, or streaming technology for data in motion, face many of the same challenges as in other analytical environments. These challenges include determining the location of the information sources needed for analysis, how that information can be moved into the analytical environment, and how it should be reformatted so that it becomes easy and efficient to explore. Determining what data should be persisted to quickly get to the next level of analysis can also be challenging. Getting stuck trying to resolve any of these issues manually is not the way to go.

To achieve vastly enhanced efficiency, the information integration platform should be able to effectively handle the wide and growing complexity of heterogeneous enterprise information sources and types within a common, seamless architecture. Toward this end, supporting new and emerging big data source types is essential. The supported sources should range from the Hadoop Distributed File System (HDFS) for massively scalable and resilient storage to NoSQL for read- or write-optimized record storage to IBM® InfoSphere® Streams software for supporting massive-scale, real-time analytics.

As new applications begin leveraging these technologies, organizations need to ensure that their information integration platform supports the associated systems and data types as well. These include Apache Cassandra, Apache HBase, Apache Hive, Java Message Service (JMS), MongoDB, JavaScript Object Notation (JSON), and so on.
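
As a small illustration of what handling heterogeneous source types within a common architecture means in practice, the sketch below reformats a JSON document and a delimited text extract into one shared record shape. The field names (`id`, `amount`, `source`) and the sample inputs are hypothetical, and the sketch uses only the Python standard library. A real integration platform would add connectivity, metadata, and data quality handling on top of this.

```python
import csv
import io
import json
from typing import Dict, List, Union

CommonRecord = Dict[str, Union[str, float]]


def from_json_line(line: str) -> CommonRecord:
    """Map one JSON document to the common record shape."""
    doc = json.loads(line)
    return {"id": str(doc["id"]), "amount": float(doc["amount"]), "source": "json"}


def from_csv_text(text: str) -> List[CommonRecord]:
    """Map each row of a delimited extract to the common record shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"id": row["id"], "amount": float(row["amount"]), "source": "csv"}
        for row in reader
    ]


# Hypothetical sample inputs, for illustration only.
json_line = '{"id": 17, "amount": "12.50"}'
csv_text = "id,amount\n18,7.25\n"

records = [from_json_line(json_line)] + from_csv_text(csv_text)
print(records)
```

Once every source lands in the same shape, downstream exploration and analysis no longer need to know where a record came from.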

Gaining confidence to act on big data

While the term big data has only recently come into vogue, IBM has designed solutions capable of handling very large quantities of data for decades. It has provided a range of leading-edge data integration, management, security, and analytics solutions that are designed to be highly reliable, flexible, and scalable.

InfoSphere Information Server offers end-to-end information integration capabilities designed to help data analytics professionals understand, cleanse, monitor, transform, and deliver data as well as collaborate to bridge the gap between business and IT. InfoSphere Information Server enables organizations to be confident that the information driving their business and strategic initiatives is trusted, consistent, governed, and available when and where it’s needed, depending on their business requirements. These initiatives range from big data and point-of-impact analytics to master data management (MDM) and data warehousing.

* “Hadoop is not a data integration solution,” Gartner research report, ID #G00249138, January 2013.