Will Hadoop replace or augment your Enterprise Data Warehouse?

Senior Managing Consultant, Big Data and Analytics Practice, IBM

There is a lot of buzz about Hadoop these days and its potential for replacing the enterprise data warehouse (EDW). The promise of Hadoop is the ability to store and process massive amounts of data on commodity hardware that scales extremely well at very low cost. Hadoop is well suited to batch-oriented work but not to OLTP workloads.

The logical question, then, is whether enterprises still need the EDW. Why not simply get rid of the expensive warehouse and deploy a Hadoop cluster with HBase and Hive? After all, you never hear about Google or Facebook using data warehouse systems from Oracle, Teradata or Greenplum.

Before we get into that, let’s take a brief look at how Hadoop stores data. Hadoop comprises two components: the Hadoop Distributed File System (HDFS) and the MapReduce engine. HDFS enables you to store all kinds of data (structured as well as unstructured) on commodity servers. Data is divided into blocks and distributed across data nodes. The data itself is processed by MapReduce programs, which are typically written in Java. HBase, a NoSQL database, and Hive, a data warehouse layer, sit on top of HDFS storage; Hive in particular lets end users query the data with a SQL-like language (HiveQL). In addition, BI reporting, visualization and analytical tools like Cognos, Business Objects, Tableau, SPSS, R, etc., can now connect to Hadoop/Hive.
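To make the storage and processing model above concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all: it only imitates, in a single process, how records are divided into blocks and then processed with a map step, a grouping ("shuffle") step and a reduce step. The function names and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def split_into_blocks(records, block_size):
    """Divide records into fixed-size blocks, loosely as HDFS divides a file."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    for line in block:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records, block_size=2):
    """Run the full pipeline; each block could be mapped on a different node."""
    blocks = split_into_blocks(records, block_size)
    pairs = (pair for block in blocks for pair in map_phase(block))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical customer log lines, invented for this example.
    logs = [
        "order shipped late",
        "order arrived damaged",
        "great service",
        "late delivery",
    ]
    print(word_count(logs))
```

In a real cluster the blocks live on different data nodes and the map step runs where the data is stored, which is what lets the approach scale on commodity hardware.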

A traditional EDW stores structured data from OLTP and back office ERP systems into a relational database using expensive storage arrays with RAID disks. Examples of this structured data may be your customer orders, data from your financial systems, sales orders, invoices, etc. Reporting tools like Cognos, Business Objects and SPSS are used to run reports and perform analyses on the data.

So are we ready to dump the EDW and move to Hadoop for all our warehouse needs? There are some things the EDW does very well that Hadoop is still not very good at:

  • Hadoop and HBase/Hive are all still very IT focused. They need people with a lot of expertise in writing MapReduce programs in Java, Pig and other specialized languages. Business users who actually need the data are not in a position to run ad-hoc queries and analytics easily without involving IT. Hadoop is still maturing and needs a lot of IT hand-holding to make it work.
  • The EDW is well suited for many common business processes, such as monitoring sales by geography, product or channel; extracting insight from customer surveys; and delivering cost and profitability analyses. The data is loaded into pre-defined schemas/data marts, and business users can use familiar tools to perform analysis and run ad-hoc SQL queries.
  • Most EDWs come with pre-built adaptors for various ERP systems and databases. Companies have built complex ETL functions, data marts, analytics and reports on top of these warehouses. It will be extremely expensive, time-consuming and risky to recode all of that in a new Hadoop environment, and people with Hadoop/MapReduce expertise are in short supply.

Augment your EDW with Hadoop to add new capabilities and insight

For the next couple of years, as the Hadoop and big data landscape evolves, you can augment and enhance your EDW with a Hadoop/big data cluster as follows:

  • Continue to store summary structured data from your OLTP and back office systems in the EDW.
  • Store in Hadoop the unstructured data that does not fit nicely into “tables.” This means all the communication with your customers, such as phone logs, customer feedback, GPS locations, photos, tweets, emails and text messages, can be stored in Hadoop, and at a much lower cost than in the EDW.
  • Correlate data in your EDW with the data in your Hadoop cluster to get better insight about your customers, products, equipment, etc. You can now use this data for analytics that are computation-intensive, such as clustering and targeting. Run ad-hoc analytics and models against your data in Hadoop while you are still transforming and loading your EDW.
  • Do not build Hadoop capabilities within your enterprise in a silo. Hadoop and other big data technologies should work in tandem with and extend the value of your existing data warehouse and analytics technologies.
  • Data warehouse vendors are adding Hadoop and MapReduce capabilities to their offerings. When adding Hadoop capabilities, I would recommend going with a vendor that supports and enhances the open source Hadoop distribution.
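The idea of correlating EDW data with data held in Hadoop can be sketched roughly as follows. This is only an illustration in plain Python under stated assumptions: the summary rows stand in for an export from an EDW table, the feedback records stand in for raw text stored in Hadoop, and every field name, keyword and value is hypothetical.

```python
from collections import Counter

# Hypothetical summary rows, standing in for structured data kept in the EDW.
edw_customers = [
    {"customer_id": 1, "region": "East", "annual_sales": 12000.0},
    {"customer_id": 2, "region": "West", "annual_sales": 8500.0},
    {"customer_id": 3, "region": "East", "annual_sales": 400.0},
]

# Hypothetical raw feedback, standing in for unstructured data kept in Hadoop.
hadoop_feedback = [
    {"customer_id": 1, "channel": "email", "text": "Delivery was late again"},
    {"customer_id": 1, "channel": "tweet", "text": "Box arrived damaged"},
    {"customer_id": 2, "channel": "phone", "text": "Very happy with the service"},
]


def complaint_counts(feedback, keywords=("late", "damaged", "refund")):
    """Count, per customer, the feedback records that mention a complaint keyword."""
    counts = Counter()
    for record in feedback:
        text = record["text"].lower()
        if any(keyword in text for keyword in keywords):
            counts[record["customer_id"]] += 1
    return counts


def correlate(customers, feedback):
    """Attach the complaint count from the feedback data to each EDW customer row."""
    counts = complaint_counts(feedback)
    return [
        {**customer, "complaints": counts.get(customer["customer_id"], 0)}
        for customer in customers
    ]


if __name__ == "__main__":
    for row in correlate(edw_customers, hadoop_feedback):
        print(row)
```

The keyword match is a deliberately crude stand-in for the heavier text analytics a Hadoop cluster would run; the point is only that the result is keyed by the same customer identifier the EDW already uses.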

In a few years, as newer and better analytical and reporting capabilities develop on top of Hadoop, it may eventually become a good platform for all your warehousing needs. Solutions like IBM's Big SQL and Cloudera's Impala will make it easier for business users to move more of their warehousing needs to Hadoop by improving query performance and SQL capabilities.
