Business Intelligence in the Hadoop Era

Drilling through petabytes of actionable intelligence

Big Data Evangelist, IBM

To date, only a few visionary users have begun to connect Hadoop directly to their companies' business intelligence (BI) strategies. However, their numbers will certainly grow as Hadoop matures into a more robust platform for traditional operational BI applications, while continuing to serve core applications in advanced analytics.

Interactive data exploration: The soul of the new knowledge worker

It's not far-fetched to start thinking of Hadoop platforms, such as IBM InfoSphere BigInsights, as the nucleus of your next-generation enterprise data warehouse (EDW). Likewise, you should view Hadoop's primary users and developers, data scientists, as the harbingers of the next generation of BI users: business analysts who demand the ability to explore an order of magnitude more data and to build, score, and deploy more complex analytic models in a growing range of mission-critical applications. The new generation of knowledge workers insists on having their own "sandboxes" of deep data to explore as they see fit.

Traditional BI tools are not vanishing, but assuming a niche in big data

Even by the end of this decade, the majority of traditional BI requirements will be adequately served by traditional BI tools and supporting EDWs, cubes, marts, and other analytic databases such as IBM Netezza. Even in the year 2020, your primary Hadoop uses may be for operational applications that are either outside the traditional BI realm, such as customer experience management and marketing campaign optimization, or in a supporting role, such as unstructured content transformation.

But Hadoop will almost certainly creep into a broader range of leading-edge BI requirements—especially those that involve statistical analysis, predictive analytics, or natural language processing. You'll probably also incorporate other big data tools and platforms—such as massively parallel in-memory, document, and graph databases—into hybrid environments where Hadoop plays an important (but not paramount) role.

Scope of traditional BI will expand to include advanced analytics

Clearly, the definition of "traditional BI" will continue to grow as more big data-centric approaches come into the mainstream. People often use the term "business analytics" to allude to the growing functional scope of mainstream BI tools.

Several types of emerging requirements will drive adoption of Hadoop and kindred approaches throughout the rest of this decade, with traditional business analytics—i.e., BI that provides decision support—as a key focus:

  • Whole-population analytics: Any application that requires interactive access to the entire population of analytical data, rather than just to convenience samples or slices, is a strong candidate for big data. Most notably, microsegmentation for determining next best offers thrives on access to a 360-degree view of the entire customer population that is being targeted.
  • Multistructured analytics: Any application that requires unified access to structured, unstructured, and other data types requires a big data platform that can discover, acquire, store, and analyze any kind of data with equal agility. For example, customer influence analysis often needs to mine unstructured social media alongside semi-structured call-center logs, structured transaction data, and various geospatial coordinates. These and other data sources can help you build a more powerful relationship graph model for behavioral segmentation.
  • Omnitemporal analytics: Any application that requires a converged view across all time-horizons—historical, current, and predictive—requires a big data platform with the storage and horsepower to process these disparate workloads. For example, multichannel customer experience optimization applications require decision automation infrastructure that leverages historical transactions, real-time portal clickstreams, and predictive behavioral models to support continuous tuning of customer interfaces and interactions.
  • Multivariate analytics: Any application that requires detailed, interactive, multidimensional statistical analysis and correlation requires a big data platform that can execute these models in a massively parallel manner. Regression analysis, market basket analysis, and other mainstays of advanced analytics all fall into this category.
  • Multi-scenario analytics: Any application that requires you to model and simulate alternate scenarios, engage in free-form what-if analysis, and forecast alternative future states requires a big data platform that supports fluid exploration without needing to define data models up front. This is the proverbial "spreadsheet on steroids" use case. In the Hadoop context, this involves aggregating disparate data sources in a file system such as HDFS and then delivering the data downstream to in-memory clients with flexible, ad-hoc, client-centric visualization.
  • Semantic analytics: Any application that requires semantic exploration of unstructured data, streaming data, and other sources demands a big data platform with a rich metadata layer. One of the key features to look for is triple-store capabilities for managing semantic metadata in the Resource Description Framework (RDF) standard. This RDF triple-store is a feature of DB2 v10, which is a key component of the IBM big data portfolio, along with InfoSphere BigInsights and InfoSphere Streams.
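To make the triple-store idea in the last bullet concrete, here is a minimal sketch in Python of how semantic metadata can be held as (subject, predicate, object) triples and queried by pattern. It is a toy in-memory illustration only: the class name TinyTripleStore, its methods, and the customer and product identifiers are all hypothetical, and it does not represent the RDF interface of DB2 or any other product.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TinyTripleStore:
    """Toy in-memory triple store (hypothetical; not a real product API)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Record one statement as a (subject, predicate, object) triple."""
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples that fit the pattern; None acts as a wildcard."""
        for s, p, o in sorted(self._triples):
            if subject is not None and s != subject:
                continue
            if predicate is not None and p != predicate:
                continue
            if obj is not None and o != obj:
                continue
            yield (s, p, o)


# Made-up sample statements about customers and a product.
store = TinyTripleStore()
store.add("customer:42", "mentions", "product:tablet")
store.add("customer:42", "follows", "customer:7")
store.add("customer:7", "mentions", "product:tablet")

# Pattern query: who mentions the tablet product?
tablet_fans = [
    s for s, _, _ in store.match(predicate="mentions", obj="product:tablet")
]
```

The point of the pattern-matching query is that new kinds of relationships can be added as plain triples without changing a schema, which is what makes this style of metadata layer useful for exploring loosely structured sources.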

Likely evolution of the next-generation BI toolset

Clearly, we're well beyond old-school BI here. The world has evolved into the age of petabyte-scale advanced analytics at our fingertips, and a new BI future is rapidly emerging.

However, that doesn't mean you need to jettison established tools and approaches if they're still addressing your organization’s needs. If your needs in these areas are specialized and demand the full arsenal of the professional data scientist, you will almost certainly need to use power tools such as IBM SPSS. But if you need basic features in any or all of these areas that work out of the box with your reporting, query, and other traditional BI tools, the next-generation Hadoop-enabled BI platform is for you.

And let's not forget the simplicity equation, without which big data may very well become a haystack for buried (albeit golden) nuggets of intelligence. One potential downside of big data is that the sheer volume, velocity, and variety of data can easily overwhelm the poor analyst who is trying to find an actionable kernel of intelligence. Humans can't easily navigate petabytes, and information overload is always a very real risk when you're simply dumping data indiscriminately into your Hadoop clusters.

As you implement the next generation of Hadoop-enabled BI environments, you must take pains to ensure a simple, seamless, and productive experience for the average knowledge worker. Line-of-business users will balk at big data if you don’t deliver targeted intelligence to their tablets, smartphones, and other devices for fast consumption.

Many of the usability features of today's top BI platforms, such as IBM Cognos, will be fundamental to this new era. The new era of Hadoop-centric BI will rely on self-service, in-memory, predictive, portable, and personalizable client tools. The emphasis will be on interactive visualization, semantic search, and data virtualization to ensure simple but rich exploratory experiences.

Visual, collaborative BI development will be the order of the day in big data

Don't worry. Your average user won't need to learn how to program in MapReduce, Pig, or any of the other Hadoop programming frameworks. All of this big data "plumbing" will be submerged in a highly visual next-generation BI experience similar to what you've grown accustomed to on Cognos and other analytics tools. And it's a fair bet that your next-generation BI platform will come with productivity accelerators: in other words, embedded MapReduce and other big data models, views, and tools geared to common analytical needs.
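For a sense of the plumbing that stays hidden, here is a minimal sketch of the MapReduce pattern in plain Python, applied to a simple market basket count. It runs in a single process and only imitates the map, shuffle, and reduce phases; it is not Hadoop code, and the function names and sample baskets are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def map_basket(basket: Iterable[str]) -> Iterator[Tuple[Pair, int]]:
    """Map: emit ((item_a, item_b), 1) for each unordered item pair."""
    for pair in combinations(sorted(set(basket)), 2):
        yield pair, 1


def shuffle(emitted: Iterable[Tuple[Pair, int]]) -> Dict[Pair, List[int]]:
    """Shuffle: group all emitted values under their key."""
    grouped: Dict[Pair, List[int]] = defaultdict(list)
    for key, value in emitted:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped: Dict[Pair, List[int]]) -> Dict[Pair, int]:
    """Reduce: sum the values for each key to get co-purchase counts."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up sample data: each inner list is one customer's basket.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "milk"],
    ["diapers", "milk"],
]

emitted = (kv for basket in baskets for kv in map_basket(basket))
pair_counts = reduce_counts(shuffle(emitted))
```

On a real cluster the map and reduce steps would run in parallel across many nodes, with the framework handling the shuffle. A productivity accelerator of the kind described above would package logic like this behind a visual interface.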

The next-generation Hadoop-enabled BI platform will also be extensible. These environments will support collaborative development of MapReduce and other analytic models by data scientists, business analysts, and other knowledge workers working in social collaboration contexts. Developer productivity will grow as next-generation big data platforms automate more of the grunt work of data discovery, preparation, aggregation, segmentation, modeling, and scoring.

In other words, Hadoop and other new big data technologies will be the foundation for an evolved massively parallel processing (MPP) EDW, not dissimilar from IBM Smart Analytics System or IBM Netezza.

Taken together, all of the new approaches that we've discussed are transforming your BI environment, now and through the rest of this decade, into an ever more powerful brilliance infrastructure.
