Part IV: A Flexible Platform-Based Approach to Big Data
In Part I of this series, we looked at the key considerations for an analytic enterprise to stay competitive in today’s world, and in Part II we discussed how those translated into imperatives for a supporting big data platform. In Part III we covered how IBM applied those considerations and imperatives to create its big data platform.
In this concluding post of the series, we will introduce the IBM big data platform, a solution that provides you with a flexible adoption plan for big data that’s aligned with specific business challenges. The real benefit of the platform is the ability to leverage reusable components (analytics, accelerators and policies) as you adopt new capabilities from one implementation to the next. The platform gives you the ability to manage all of your enterprise data with a single integrated platform, provides multiple data engines (each optimized for a specific workload), and uses a consistent set of tools and utilities to operate on big data. The pre-integrated components within the platform also reduce implementation time and costs. IBM is the only vendor with this broad and balanced view of big data and the needs of a big data platform.
IBM has delivered its big data platform through a "Lego block" approach. For example, if a customer has challenges with an existing data warehouse and is unable to process structured data at scale, the natural starting point is a purpose-built data warehousing appliance. That foundation can then be expanded to include either Hadoop, to analyze raw data at rest, or stream computing, to analyze data in motion. The following components make up the IBM big data platform:
Visualization and Discovery: IBM InfoSphere Data Explorer Powered by Velocity
For organizations that need to understand the scope and content of their data sources, IBM InfoSphere Data Explorer (Data Explorer) is a good starting point. It unlocks the data that they have available, both inside and outside the enterprise, through federated search, discovery and navigation. It enables everyone, from management to knowledge workers to front-line employees, to access all of the information that they need in a single view, regardless of format or location. Rather than wasting time accessing each silo separately, Data Explorer lets users discover and navigate seamlessly across all available sources, with the added advantage of cross-repository visibility. It secures the information, so that users see only the content that they are permitted to see, as if they were logged directly into the target application. In addition, Data Explorer gives users the ability to comment on, tag and rate content, as well as to create folders for content that they would like to share with other users. All of this user feedback and social content is then fed back into Data Explorer's relevance analytics to ensure that the most valuable content is presented to users. Enterprises like Procter & Gamble have streamlined their support and operations by giving employees visibility into more than 30 data repositories with this technology.
Data Explorer is a great place to start the big data journey, because you can quickly discover and examine the data assets that you have on hand to determine what other parts of the platform you will need next.
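The federated view described above can be sketched generically. This toy example (hypothetical repository names and relevance scores, not Data Explorer's actual API, which this excerpt does not describe) merges hits from several silos into one relevance-ranked result list:

```python
# Toy federated search: query several silos and merge by a relevance score.
# Repository names and documents below are invented for illustration.
repositories = {
    "crm": [("Q3 account review", 0.9), ("Old memo", 0.2)],
    "wiki": [("Account onboarding guide", 0.7)],
    "email": [("Re: account review notes", 0.8)],
}

def federated_search(repos):
    # Gather (source, title, score) hits from every repository in one pass...
    hits = [(src, title, score)
            for src, docs in repos.items()
            for title, score in docs]
    # ...then present a single, relevance-ranked view across all silos.
    return sorted(hits, key=lambda h: h[2], reverse=True)

for source, title, score in federated_search(repositories):
    print(f"{score:.1f}  [{source}] {title}")
```

A real deployment would add per-user security trimming, so each searcher sees only what the source systems would show them.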
Hadoop System: IBM InfoSphere BigInsights
Hadoop is an ideal technology for organizations that want to combine a variety of data types, both structured and unstructured, in one place for deep analysis. It also enables organizations to reduce the cost of their data management infrastructure by offloading data and workloads. IBM InfoSphere BigInsights builds on open-source Hadoop and augments it with the capabilities that enterprises require: optimizations that automatically tune Hadoop workloads and resources for faster performance, an intuitive spreadsheet-style user interface that lets data scientists quickly examine, explore and discover data relationships, and security and governance capabilities that keep sensitive data protected. BigInsights is also prepackaged with development tooling that makes it easier for technical teams to create applications without first undergoing exhaustive training to become Hadoop experts.
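To illustrate the kind of combined analysis Hadoop enables, here is a minimal MapReduce-style sketch in plain Python (not BigInsights tooling; the records and terms are invented for illustration). A map phase emits key-value pairs from records of any original format, and a reduce phase aggregates them per key:

```python
from collections import Counter

# Hypothetical mixed-format records: log lines, social posts, database rows.
records = [
    {"type": "log", "text": "payment failed payment retried"},
    {"type": "tweet", "text": "great service fast payment"},
    {"type": "row", "text": "payment succeeded"},
]

def map_phase(record):
    # Emit (term, 1) pairs regardless of the record's original format.
    return [(word, 1) for word in record["text"].split()]

def reduce_phase(pairs):
    # Sum counts per term, as a Hadoop reducer would for each key.
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return dict(totals)

pairs = [p for r in records for p in map_phase(r)]
term_counts = reduce_phase(pairs)
print(term_counts["payment"])  # "payment" appears 4 times across all sources
```

On a real cluster, the map and reduce phases run in parallel across many nodes, which is what makes this pattern scale to very large data volumes.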
Stream Computing: IBM InfoSphere Streams
Stream computing enables organizations to respond immediately to changing events, especially when analyzing stored data isn't fast enough. It also makes them more efficient at pre-filtering and selectively storing high-velocity data. IBM InfoSphere Streams (Streams) delivers this capability by enabling organizations to analyze streaming data in real time. Streams has a modular, highly scalable design that can process millions of events per second, analyze many data types simultaneously, and perform complex calculations in real time.
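The pre-filter-and-store pattern above can be sketched with plain Python generators (a toy stand-in for the Streams programming model, which this excerpt does not cover): each event is examined the moment it arrives, and only readings above a threshold are kept for storage or action:

```python
def sensor_stream():
    # Hypothetical high-velocity source; in practice this would be a live feed.
    for reading in [12.1, 98.6, 45.0, 101.3, 99.9, 7.4]:
        yield reading

def filter_and_store(stream, threshold=95.0):
    # Analyze each event in motion; pass through only what matters.
    for reading in stream:
        if reading > threshold:
            yield reading  # selectively stored / acted upon immediately

stored = list(filter_and_store(sensor_stream()))
print(stored)  # [98.6, 101.3, 99.9]
```

Because the filter runs before anything is written to disk, the bulk of the raw stream never has to be stored at all.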
Data Warehouse Appliance: IBM PureData System for Analytics Powered by Netezza Technology
We often find enterprises struggling with the complexity of their data warehousing environments. Their data warehouses tend to be glutted with data and not suited to any one particular task, and gaining deep analytical insight can be too complex or expensive. IBM addresses this pain point with the analytics-based IBM PureData System. The IBM PureData System for Analytics (the new name for the latest-generation Netezza appliance) is a purpose-built appliance for complex analytical workloads on large volumes of structured data. It is designed with simplicity in mind, requiring minimal administration and no performance tuning, and it uses a unique hardware-assisted query processing mechanism that lets users run complex analytics at blistering speeds.
Analytic Accelerators
One of the objectives of IBM’s platform-based approach is to reduce time to value for big data projects. The IBM big data platform achieves this by packaging prebuilt analytics, visualization and industry-specific applications as part of the platform. The analytic accelerators contain a library of pre-built functions that analyze data in its native format, where it lives, and with the appropriate engine: InfoSphere BigInsights, InfoSphere Streams or an analytics-based IBM PureData System. The library of prebuilt functions includes algorithms for processing text, image, video, acoustic, time series, geospatial and social data. The functions span mathematical, statistical, predictive and machine-learning algorithms. The accelerators also include the appropriate development tools to enable the market to develop analytic applications for the big data platform.
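As a small illustration of the kind of reusable, prebuilt analytic function such libraries bundle, here is a generic time-series moving average in plain Python (the actual accelerator APIs are not documented in this excerpt):

```python
def moving_average(series, window=3):
    # A reusable time-series function of the sort an accelerator library ships:
    # average each consecutive window of readings.
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical daily readings, invented for illustration.
prices = [10.0, 11.0, 13.0, 12.0, 15.0]
print(moving_average(prices))  # three overlapping 3-day window averages
```

The point of packaging such functions is that application teams call them rather than re-implement them, which is how prebuilt accelerators shorten time to value.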
Information Integration and Governance
The IBM big data platform includes purpose-built connectors for multiple data sources. It has adapters for common file types, databases and sources of unstructured and streaming data. It also has deep integration between its data processing engines, which enables data to be moved seamlessly between an analytics-based IBM PureData System, InfoSphere BigInsights and InfoSphere Streams.
Security and governance are key aspects of big data management. The big data platform might contain sensitive data that needs to be protected, retention policies that need to be enforced, and data quality that needs to be governed. IBM has a strong portfolio of information lifecycle management, master data management, and data quality and governance services. These services are integrated and available for use within the big data platform.
This series is an abridged excerpt from our new book, Harness the Power of Big Data.
Read Part I: Key Considerations of the Analytic Enterprise
Read Part II: The Big Data Platform Manifesto
Read Part III: IBM’s Strategy for Big Data and Analytics