IBM in the Hadoop world

Product Strategist, IBM Business Intelligence, IBM

For many people, big data is synonymous with Hadoop. Certainly, the ability to store and process vast amounts of data on commodity hardware has fueled a new generation of applications. The synergies of Hadoop, fast and reliable internet access and the growing appetite for all things mobile have channeled heady investments into software as a service (SaaS) startups.

IBM understands the opportunity Hadoop presents for addressing big data challenges. Early and continued investment in this area is why IBM has been consistently rated a leader in numerous analyst reports, including the recent Forrester Wave on Big Data Hadoop Solutions.

The starting point is to get the data

In the big data realm, value comes from capturing data as close to the source as possible, then analyzing it for immediate action. IBM InfoSphere Streams enables tapping into streaming data, or data in motion, at throughput rates from thousands to millions of events per second. Whether that data originates from medical devices monitoring neonatal infants or from sensors predicting manufacturing yields, InfoSphere Streams has opened the door to exciting new applications of analytics.
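Streams applications are built from its own operators and toolkits, but the sliding-window idea behind much streaming analytics can be sketched in a few lines of plain Python. Everything here (class name, thresholds, readings) is illustrative, not the InfoSphere Streams API:

```python
from collections import deque

class SlidingWindowMonitor:
    """Keep the last `size` sensor readings and flag sharp deviations.

    Illustrative sketch only: real InfoSphere Streams applications are
    composed from stream operators, not from a class like this.
    """
    def __init__(self, size=10, threshold=2.0):
        self.window = deque(maxlen=size)  # oldest readings fall off automatically
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` deviates from the window mean by more than the threshold."""
        if len(self.window) >= 3:
            mean = sum(self.window) / len(self.window)
            anomalous = abs(value - mean) > self.threshold
        else:
            anomalous = False  # not enough history yet
        self.window.append(value)
        return anomalous

# E.g. body temperatures from a neonatal monitor, in degrees Celsius.
monitor = SlidingWindowMonitor(size=5, threshold=2.0)
readings = [36.5, 36.6, 36.4, 36.5, 41.0, 36.5]
flags = [monitor.observe(r) for r in readings]
print(flags)  # [False, False, False, False, True, False]
```

The point of a streaming engine is that this kind of per-event logic runs continuously, at millions of events per second, without ever landing the raw data first.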

[Figure: Hadoop marketecture]

Once data becomes “at rest”, IBM InfoSphere BigInsights, IBM’s commercial distribution of Hadoop, provides the landing zone for collecting data, structured or unstructured, for integration, cleansing and analysis. Its Big SQL interface allows seamless access from existing SQL-based analytic tools, and its built-in query accelerators make it quick to tap into unstructured data. This Hadoop offering can be installed as on-premises software or bought as an appliance, sold as PureData System for Hadoop.
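The value of Big SQL is that existing tools keep speaking ordinary SQL even when the data lives in Hadoop. As a rough sketch of that idea, the snippet below uses Python's built-in sqlite3 as a stand-in engine (the table, columns and data are invented); the query itself is the kind of standard SQL an analytic tool would issue unchanged:

```python
import sqlite3

# Stand-in for a Big SQL connection: in practice a JDBC/ODBC driver
# would point at the BigInsights cluster rather than an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, visits INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("home", 120), ("pricing", 45), ("home", 80)])

# The analytic tool issues plain SQL, unaware of where the data rests.
rows = conn.execute(
    "SELECT page, SUM(visits) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('home', 200), ('pricing', 45)]
```

Because the interface is standard SQL, swapping the connection from a relational database to the Hadoop landing zone does not require rewriting the queries.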

Of course, all of this software needs to run on some type of hardware and system infrastructure. IBM Power Systems and System x are optimized for large-scale deployments and cluster management to achieve high-performance computing (HPC).

Next, clean and assess

Data wrangling is a term coined to convey the challenge of sifting through the morass of big data to glean insights. IBM SPSS Modeler provides the advanced and predictive analytic capabilities needed to assess data stored not only in Hadoop but also in relational databases and files. SPSS Modeler models can also be imported into InfoSphere Streams, giving time-pressed data scientists a single interface for defining analytical models for both streaming data and data at rest.

To accelerate the analytic process, SPSS Analytic Catalyst automates portions of data preparation, automatically interpreting results and presenting analyses as interactive visuals with plain-language summaries. This product is aimed at those who understand how to interpret statistical results but lack the time or skills to define the analytic models.

When the data contains large amounts of unstructured text, IBM Content Analytics automates text analytics, rapidly classifying text and providing industry-specific metrics.

Then, share results

Analytic results are typically shared in two ways:

  • Through decisioning interfaces that drive an immediate response in business processes
  • Through dashboards and reports, enabling human decision making

For automated responses, SPSS Decision Management injects predictive analytics into business processes to drive offers, prioritize maintenance tasks or identify potential fraud.
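The pattern behind this kind of decisioning, a predictive score feeding a business rule, can be sketched briefly. The score function, fields and thresholds below are invented for illustration; they are not the SPSS Decision Management API:

```python
def fraud_score(transaction):
    """Toy stand-in for a predictive model's output in [0, 1]."""
    score = 0.0
    if transaction["amount"] > 1000:
        score += 0.5  # unusually large amount
    if transaction["country"] != transaction["home_country"]:
        score += 0.4  # transaction far from home
    return score

def decide(transaction, threshold=0.7):
    """Business rule layered on top of the model score."""
    return "review" if fraud_score(transaction) >= threshold else "approve"

tx = {"amount": 2500, "country": "BR", "home_country": "US"}
print(decide(tx))  # review
```

The decisioning layer is what turns a raw model score into an action the business process can execute automatically, such as routing a transaction for review.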

For human decision making, Cognos Business Intelligence delivers reports, both offline and on demand, and interactive dashboards to the web and to mobile devices. InfoSphere Data Explorer enables content-driven views for business people, combining the power of enterprise search with business intelligence.

When needed, manage, audit and optimize

Often, projects involving Hadoop start off as pilots, with open access for a few individuals experimenting with tools and exploring data. Once success takes off, the challenges of managing access and understanding where the data originates grow as more data and more users are added. In addition, regulated or large, complex environments require traditional enterprise capabilities for managing the information environment, such as tracing data through its transformations, managing the impact of changes and auditing access.

IBM InfoSphere Information Server addresses these information management and governance needs across the enterprise, including databases and Hadoop.

As success grows, so generally does the need to optimize data access for performance. PureData System for Analytics is an optimized data appliance that uses a massively parallel processing architecture with patented data filtering, over 200 in-database analytic functions and workload balancing to deliver lightning-fast performance on large analytic workloads. For DB2 customers, DB2 with BLU Acceleration extends DB2 with a dynamic, in-memory columnar store to support interactive reporting and analysis. BLU Acceleration is simply load and go: it requires no SQL or schema changes.
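The columnar idea behind BLU Acceleration can be sketched in a few lines: storing values column by column lets an aggregate touch only the column it needs rather than entire rows. This is a pure illustration of the concept, not the DB2 storage engine:

```python
# Row store: each record is kept together, so a SUM over one column
# still drags every field of every row through memory.
rows = [
    {"id": 1, "region": "EU", "revenue": 100},
    {"id": 2, "region": "US", "revenue": 250},
    {"id": 3, "region": "EU", "revenue": 175},
]

# Column store: one array per column; a SUM(revenue) scans just this list,
# which also compresses well and stays cache-friendly.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [100, 250, 175],
}

total = sum(columns["revenue"])  # touches one column, not whole rows
print(total)  # 525
```

For analytic queries that aggregate a few columns over many rows, which is exactly the interactive-reporting workload BLU targets, this layout is why a columnar store can be dramatically faster than a row store.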

IBM is the only vendor that provides end-to-end capabilities for Hadoop, with the flexibility to fit into any enterprise data environment. No wonder IBM is rated a leader by so many industry analysts!
