For many people, big data = Hadoop. The most common question I get from customers is some form of “Do you support [BigInsights, Cloudera, AWS or other Hadoop-based offering]?”
Saying “yes” is incomplete. The ability to access a data store does not imply that all business intelligence capabilities are readily available or even appropriate.
Over the past 20 years, a number of different data structures and technologies have been introduced to increase performance or enable a BI capability; many of these are self-service oriented, and they all deliver different levels of capabilities depending on the problem they are intended to solve.
For example, the decision to move and transform operational data to an operational data store (ODS), to an enterprise data warehouse (EDW) or to some variation of OLAP is often made to improve performance or enhance broad consumability by business people, particularly for interactive analysis. Business rules are needed to interpret data and to enable BI capabilities such as drill up/drill down. The more business rules built into the data stores, the less modelling effort needed between the curated data and the BI deliverable.
Figure 1 - The relative BI modelling effort needed for an ODS, EDW and OLAP data store.
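To make the idea of a "business rule" concrete, here is a minimal Python sketch of drill up/drill down. It assumes a hypothetical geography hierarchy (city, region, country) and made-up sales figures; none of these names or numbers come from a real data store. The point is only that the hierarchy has to be defined somewhere: either in the curated data store or in the BI model.

```python
from collections import defaultdict

# Hypothetical business rule: how each city rolls up to a region and country.
GEOGRAPHY = {
    "Boston": ("Northeast", "USA"),
    "New York": ("Northeast", "USA"),
    "Seattle": ("Northwest", "USA"),
    "Toronto": ("Ontario", "Canada"),
}

# Hierarchy levels ordered from most summarized to most detailed.
LEVELS = ("country", "region", "city")

# Hypothetical sales facts: (city, amount).
SALES = [("Boston", 100.0), ("New York", 250.0), ("Seattle", 50.0), ("Toronto", 75.0)]


def path_for(city):
    """Return the member of every hierarchy level that a city belongs to."""
    region, country = GEOGRAPHY[city]
    return {"country": country, "region": region, "city": city}


def roll_up(sales, level):
    """Drill up: total the sales at the requested hierarchy level."""
    totals = defaultdict(float)
    for city, amount in sales:
        totals[path_for(city)[level]] += amount
    return dict(totals)


def drill_down(sales, level, member):
    """Drill down: break one member into totals at the next level of detail."""
    child_level = LEVELS[LEVELS.index(level) + 1]
    subset = [(city, amount) for city, amount in sales
              if path_for(city)[level] == member]
    return roll_up(subset, child_level)


if __name__ == "__main__":
    print(roll_up(SALES, "country"))
    print(drill_down(SALES, "country", "USA"))
```

Without the `GEOGRAPHY` mapping, the raw sales rows cannot be drilled at all, which is why a store that already carries such rules needs less modelling effort in the BI layer.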
Hadoop is another data storage choice in this technology continuum. The Hadoop Distributed File System (HDFS) or Hive is often used to store transactional data in its “raw state.” The MapReduce processing supported by these Hadoop frameworks can deliver great performance, but it does not support the same specialized query optimization that mature relational database technologies do. Improving query performance, at this time, requires acquiring query accelerators or writing code. In other words, retrieving a list of transactions for specific dates, geography and so forth may be fast and simple, but aggregate-oriented calculations – average same-store sales or sales per square foot, for example – will likely require programming skills to obtain the desired performance.
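The contrast between a simple retrieval and an aggregate calculation can be sketched in plain Python. This is an in-memory imitation of the map, shuffle and reduce steps, not Hadoop code; the transactions, store IDs and floor areas are hypothetical values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical raw transactions: (store_id, date, amount).
TRANSACTIONS = [
    ("s1", "2013-01-05", 120.0),
    ("s1", "2013-01-06", 80.0),
    ("s2", "2013-01-05", 300.0),
    ("s2", "2013-01-07", 100.0),
]

# Hypothetical reference data: floor area of each store in square feet.
STORE_SQUARE_FEET = {"s1": 1000.0, "s2": 2000.0}


def filter_by_date(transactions, date):
    """Simple retrieval: a single pass that keeps matching rows."""
    return [t for t in transactions if t[1] == date]


def map_phase(transactions):
    """Map: emit a (store_id, amount) pair for every transaction."""
    for store_id, _date, amount in transactions:
        yield store_id, amount


def shuffle(pairs):
    """Shuffle: group the emitted amounts by store_id."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, square_feet):
    """Reduce: total each store's sales and divide by its floor area."""
    return {store: sum(amounts) / square_feet[store]
            for store, amounts in grouped.items()}


def sales_per_square_foot(transactions, square_feet):
    """Chain the three phases to compute the aggregate metric."""
    return reduce_phase(shuffle(map_phase(transactions)), square_feet)


if __name__ == "__main__":
    print(filter_by_date(TRANSACTIONS, "2013-01-05"))
    print(sales_per_square_foot(TRANSACTIONS, STORE_SQUARE_FEET))
```

The filter is one line, while the aggregate needs explicit map, grouping and reduce logic plus a lookup against reference data. In a relational database the same metric would typically be a single `GROUP BY` query that the optimizer plans for you.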
Hadoop-based data tends to be limited to reporting capabilities in a business intelligence application due to its batch-oriented processing. Good performance for interactive capabilities may be achieved for specific areas, but performance for general ad hoc queries may not be satisfactory due to the overhead in setting up jobs for processing. Contributions, such as Impala, to the Apache open source project establish a starting point for delivering better performance for interactivity, but this technology needs to evolve and mature before broad adoption is feasible.
Leveraging systems that are optimized for interactive analytics is recommended when data is frequently analyzed or being delivered to interactive dashboards. The diagram below extends the previous diagram to convey where Hadoop-based data fits in the data store continuum.
Figure 2 - Inclusion of Hadoop to illustrate the relative effort to describe data in a BI application
In conclusion, the key question isn’t “Does your BI tool support [my Hadoop technology]?” It really needs to be “What is the best way to leverage a Hadoop infrastructure with my BI tool?”
Cognos TechTalk “Capture more valuable insights from your big data”
Find out how the combination of the big data processing capabilities of IBM InfoSphere® BigInsights™ with the self-service business intelligence reporting of IBM Cognos® gives organizations a powerful solution to translate large amounts of data into valuable, actionable insights. Join this session to learn about:
- Interoperability between InfoSphere BigInsights and Cognos software, which can provide actionable, business-value insights from data
- Valuable strength from Cognos software in the areas of dashboarding, distribution and visualization
- Massive parallel-processing power from InfoSphere BigInsights software