In Part I of this series, we looked at the key considerations for an analytic enterprise to stay competitive in today’s world, and in Part II we discussed how those translated into imperatives for a supporting big data platform. In this post we will cover how IBM has applied those considerations and imperatives to create the IBM big data platform.
Sustained Investments in Research and Acquisitions
IBM believes that big data has the potential to significantly change how organizations use data and run analytics. Business analytics and big data are strategic bets for IBM, which recognizes the huge potential to demonstrate leadership in this space and create shareholder value.
IBM is driving its commitment to big data and analytics through sustained investments and strategic acquisitions. In 2011, IBM committed $100 million to the research and development of services and solutions that facilitate big data analytics, and over the last five years it has spent more than $16 billion on 30 analytics-based acquisitions. Big data is about analytics, and no other solution provider is investing in analytics as strategically as IBM. IBM’s two most recent acquisitions, Netezza and Vivisimo, had developed market-leading innovative technologies that have now been integrated into the IBM big data platform.
IBM also has the largest commercial research organization on earth, with hundreds of mathematicians and data scientists developing leading-edge analytics. Finally, IBM holds the largest patent portfolio in the world, nearly as large as the combined portfolios of the next four biggest patent-receiving companies! Many of the research themes and innovations that pertain to unstructured data management, text analytics, image feature extraction, and large-scale data processing have been incorporated into IBM’s big data platform.
Strong Commitment to Open Source Efforts and a Fostering of Ecosystem Development
The open source community has been a major driving force for innovation in big data technologies. The most notable of these is Hadoop, a software framework that enables data-intensive computational tasks to be processed in parallel and at scale. The Hadoop ecosystem consists of other related open source projects that provide supporting utilities and tools. These projects provide specialized functions that enable better access to data in the Hadoop Distributed File System (HDFS), facilitate workflow and the coordination of jobs, support data movement between Hadoop and other systems, implement scalable machine learning and data mining algorithms, and so on. These technologies are all part of the Apache Software Foundation (ASF) and are distributed under a commercially friendly licensing model.
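To make the processing model concrete, here is a toy, single-process sketch of the map/shuffle/reduce pattern that Hadoop runs in parallel across a cluster. The function names and the word-count task are illustrative only; they are not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a Hadoop mapper would for word counting.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort step does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, as a Hadoop reducer would.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big platforms", "data at rest and data in motion"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["data"])  # -> 3
```

In a real cluster, the map and reduce phases run on many machines at once over blocks of an HDFS file, and the shuffle moves intermediate pairs across the network; the logic per record, however, stays this simple.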
Apache Hadoop is still in the early stages of its evolution. While it provides a scalable and reliable foundation for big data, many enterprises find that it lacks specific capabilities they need, or that adopting it for their purposes requires specialized skills.
Hence, technology solution providers, including IBM, are making efforts to bridge the gap and make Apache Hadoop easier for enterprises to adopt. These providers can take one of two approaches to achieving this goal. The first approach is to take the Apache Hadoop open source code base as a starting point and then modify it to address gaps and limitations. In software development parlance, this process is known as forking. Vendors adopting this approach effectively create a vendor-specific proprietary Hadoop distribution that is somewhat closed and insulated from the innovations and improvements that the community applies to the open source components. This also makes interoperability with other complementary technologies much more difficult.
The second approach is to retain the open source Apache Hadoop components as is, without modifying the code base, while adding other layers and optional components that augment and enrich the open source distribution. IBM has taken this second approach with its InfoSphere BigInsights product, which treats the open source components of Apache Hadoop as a “kernel” layer, and builds value-added components around it. This enables IBM to quickly adopt any innovations or changes to the core open source projects in its distribution. It also makes it easy for IBM to certify third-party technologies that integrate with open source Apache Hadoop.
This modular strategy of building around, rather than forking, the open source Hadoop components enables IBM both to maintain the integrity of those components and to address their limitations. BigInsights is an IBM-certified version of Apache Hadoop. Moreover, many of the value-added components that BigInsights offers (such as BigSheets and the Advanced Text Analytics Toolkit) are supported for use on Cloudera’s distribution of Hadoop components.
IBM also has a very strong commitment to the open source movement. A number of IBM’s engineers contribute code to the Apache Hadoop open source project and its associated ecosystem. IBM has a long history of inventing technologies and donating them to the open source community as well; examples include Apache Derby, Apache Geronimo, Apache Jakarta, DRDA, and Xerces. The de facto tool set for open source development, Eclipse, came from IBM. Text analytics is a major use case around big data, and IBM contributed the Unstructured Information Management Architecture (UIMA). Search is another key enabler for big data, and IBM is a major contributor to the Lucene search technology. Needless to say, IBM is 100 percent committed to open source and 100 percent committed to Hadoop.
IBM has also built a strong ecosystem of solution providers in the big data space. Currently, its partners – including technology vendors and system integrators that are trained and certified on the IBM big data platform – number in the triple digits.
Support Multiple Entry Points to Big Data
Big data technologies can solve multiple business problems for an organization. As a result, organizations often grapple with the best approach to adoption. As practitioners, we sometimes see IT organizations embark on a big data initiative as though it were a science experiment in search of a problem. Lack of focus and unclear expectations typically lead to an unfavorable outcome. In our opinion, the most successful big data projects start with the clear identification of a business problem, or a pain point, followed by the application of appropriate technology to address that problem. Getting started in the right place is crucial. In fact, the success of your first big data project can determine how widely and how quickly this technology is adopted across the rest of your organization.
We’ve identified some of the most common pain points (which effectively act as big data project “triggers”) that we come across in our client engagements. The IBM big data platform provides quantifiable benefits as you move from an entry point to a second and third project, because it’s built on a set of shared components with integration at its core. We typically find that assets used in one big data project not only increase your chances of success with downstream projects, but accelerate their delivery as well. For example, you might develop a sentiment analysis package for an application that crunches millions of emails for a service desk application. A subsequent project could take this asset (developed against data at rest) and transparently deploy it to a data-in-motion application that assesses a live Twitter stream for trending sentiment.
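The reuse pattern described above can be sketched in a few lines of toy Python. The lexicon-based scorer below is entirely hypothetical (it is not an IBM component); the point is that the same analytic asset runs unchanged against a batch of archived emails (data at rest) and against a message stream (data in motion).

```python
import re

# Toy sentiment lexicon -- illustrative only, not a real analytics asset.
POSITIVE = {"great", "love", "happy"}
NEGATIVE = {"slow", "broken", "angry"}

def sentiment(text):
    # Score: +1 per positive word, -1 per negative word.
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Data at rest: score an archived batch of service-desk emails.
emails = ["The new release is great, I love it", "App is slow and broken"]
batch_scores = [sentiment(e) for e in emails]
print(batch_scores)  # -> [2, -2]

# Data in motion: reuse the identical function on a live-looking stream.
def stream_scores(stream):
    for message in stream:
        yield sentiment(message)

live = stream_scores(iter(["Happy with support", "So angry right now"]))
print(list(live))  # -> [1, -1]
```

In a real deployment the batch path might run as a Hadoop job and the streaming path inside a stream-processing engine, but the shared analytic logic is what makes the second project faster to deliver.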
Finally, it’s worth noting that the IBM big data platform is not an all-or-nothing proposition. Quite the contrary: you can get started with a single product, a subset of products, or the entire platform, giving you the flexibility and agility that you need to successfully deliver that initial project, and then incrementally “up” your big data IQ from there. The following are the IBM big data platform entry points.
Unlock Big Data
How many times have you not been able to find content on your laptop that you knew you had? How many times have you stumbled across one file that you had completely forgotten about while you were looking for another file? If you can’t really get a handle on the data assets that are on your own laptop, imagine this pain point for a large enterprise! To be honest, we find that enterprises are often guilty of not knowing what they could already know because they don’t know what they have. In other words, they have the data, but they can’t get to it. This is the pain point that’s associated with “unlock big data.”
This problem is compounded for organizations that have access to multiple sources of data, but don’t have the infrastructure to get all of that data to a central location, or the resources to develop analytical models to gain insights from it. In these cases, the most critical need might be to quickly unlock the value resident in this data without moving it anywhere, and to use the big data sources in new information-centric applications. This type of implementation can yield significant business value, from reducing the manual effort to search and retrieve big data, to gaining a better understanding of existing big data sources before further analysis. The payback period is often short. This entry point enables you to discover, navigate, view and search big data in a federated manner.
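A minimal sketch of the federated-discovery idea, under the assumption that each source exposes some search interface: the query fans out to each source in place and only the matching references come back, so the underlying data never moves. All function names and records here are hypothetical.

```python
# Two stand-in data sources, each searched where the data lives.
def search_crm(term):
    records = ["Acme Corp renewal notes", "Globex support ticket"]
    return [("crm", r) for r in records if term.lower() in r.lower()]

def search_filesystem(term):
    files = ["q3_report_acme.pdf", "budget_2024.xlsx"]
    return [("files", f) for f in files if term.lower() in f.lower()]

def federated_search(term, sources):
    # Fan the query out to every source and merge the hits; no data
    # is copied to a central store, only matching references return.
    hits = []
    for source in sources:
        hits.extend(source(term))
    return hits

print(federated_search("acme", [search_crm, search_filesystem]))
# -> [('crm', 'Acme Corp renewal notes'), ('files', 'q3_report_acme.pdf')]
```

A production federation layer would also handle source-specific connectors, security, and ranking, but the shape of the interaction is the same: discover and navigate in place, move data only when a later project justifies it.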
Reduce Costs with Hadoop
Organizations might have a particular pain point around reducing the overall cost of their data warehouse. They might be retaining certain groups of data that are seldom used but take up valuable storage capacity. These data groups could be possible candidates for extension (sometimes called queryable archive) to a lower-cost platform that provides storage and retrieval, albeit with poorer performance. Moreover, certain operations, such as transformations, could be offloaded to a more cost-efficient platform to improve the efficiency of either the ETL (extract, transform, and load) process or the data warehouse environment. An entry point for this type of problem might be to start with Hadoop, a data-at-rest engine. The primary area of value creation here is cost savings. By selectively pushing workloads and data sets onto a Hadoop platform, organizations are able to preserve their queries and take advantage of Hadoop’s cost-effective processing capabilities for the right kinds of data and workloads.
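The offload decision behind a queryable archive can be expressed as a simple policy. The sketch below is a hypothetical illustration (the 90-day threshold, partition names, and dates are invented): warehouse partitions that have not been queried recently are flagged for movement to the lower-cost Hadoop tier, where they remain queryable, albeit more slowly.

```python
from datetime import date, timedelta

# Hypothetical policy: partitions untouched for 90+ days move to Hadoop.
RETENTION = timedelta(days=90)

def plan_offload(partitions, today):
    # partitions: list of (name, last_accessed) tuples.
    keep, offload = [], []
    for name, last_accessed in partitions:
        if today - last_accessed > RETENTION:
            offload.append(name)  # cold: candidate for the cheap tier
        else:
            keep.append(name)     # hot: stays in the warehouse
    return keep, offload

partitions = [
    ("sales_2024_q4", date(2025, 1, 10)),  # queried recently
    ("sales_2022_q1", date(2022, 6, 1)),   # untouched for years
]
keep, offload = plan_offload(partitions, today=date(2025, 2, 1))
print(offload)  # -> ['sales_2022_q1']
```

Because the archived partitions stay queryable, existing reports need not change; only their latency does, which is exactly the trade organizations make to reclaim premium warehouse capacity.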
Analyze Raw Data
There might be situations in which an enterprise wants to extend the value of its data warehouse by bringing in new types of data and driving new types of analysis. Their primary need might be to analyze unstructured data from one or multiple sources. They might want to overcome the prohibitively high cost of converting unstructured data sources to a structured format by analyzing data in its raw format. A Hadoop-based analytics system could be the right entry point for this type of problem. Enterprises often gain significant value with this approach, because they can unlock insights that were previously unknown. Those insights can be the key to retaining a valuable customer, identifying previously undetected fraud, or discovering a game-changing efficiency in operational processes.
Simplify Your Warehouse
We often come across situations in which business users are hampered by the poor performance of analytics in a general-purpose enterprise warehouse because their queries take hours to run. The cost associated with improving the performance of the data warehouse can be prohibitively high. The enterprise might want to simplify its warehouse and get it up and running quickly. In such cases, moving to a purpose-built, massively parallel, data warehousing and analytics appliance could be the perfect entry point to big data. Many organizations realize a 10- to 100-fold performance boost on deep analytics, with reduced cost of ownership and improved employee efficiency, by using our technology.
Analyze Streaming Data
An organization might have multiple sources of streaming data and want to quickly process and analyze perishable data, and take action, in real time. Often they will be unable to take full advantage of this data simply because there might be too much of it to collect and store before analysis. Or the latency associated with storing it on disk and then analyzing it might be unacceptable for the type of real-time decisions that they want to drive. The ability to harness streaming data and turn it into actionable insight could be another big data entry point. The benefits would be the ability to make real-time decisions and to drive cost savings by analyzing data in motion and only storing what’s necessary.
This series is an abridged excerpt from our new book, Harness the Power of Big Data. In the next and concluding part of this series, we will look at IBM's flexible, platform-based approach to big data and the key components of the offering.
Read Part 1: Key Considerations of the Analytic Enterprise
Read Part 2: The Big Data Platform Manifesto