The truth about big data integration

InfoSphere Information Server Product Management, IBM

Without a doubt, data integration is essential to the success of big data projects. However, some folks in the big data vendor community, including data warehouse, Hadoop and data integration vendors, are telling a very confusing story about the fitness of Hadoop as a data integration platform. According to this story:

Any Hadoop distribution + Any non-scalable ETL tool + ETL pushdown = Comprehensive big data integration 

This is a convenient narrative for data integration vendors that lack a scalable architecture, but it does not help customers; it simply sows confusion. Companies need scalable integration between their applications, databases, warehouses and other analytical environments (including Hadoop) in order to solve these challenges with the performance the business demands.

The dizzying industry narrative about ETL and ETL Pushdown

During the mid-to-late 1990s, data volumes used for data warehousing started to grow quickly. The vendors telling this story note that leading relational databases evolved toward shared-nothing, massively parallel processing architectures, and they claim that no ETL tool embraced this architecture. The story goes on to say that, by the late 1990s, when data volumes had grown beyond the capabilities of ETL tools, companies had to push transformations into the massively parallel database, at an enormous cost burden, because ETL tools simply couldn't manage them.

Clear thinking about ETL and ETL Pushdown

The first thing you should know is that shared-nothing, massively parallel data integration platforms do exist and have existed for quite a long time (InfoSphere Information Server is a prime example). Further, by leveraging a shared-nothing ETL architecture, organizations have greatly reduced the costs associated with pervasive ETL Pushdown.

Yes, there are a variety of patterns where ETL Pushdown yields great improvements in processing time (Information Server fully embraces that capability), but pushdown has severe limitations when it comes to many other data integration patterns. For example, ETL Pushdown is not an appropriate option when:

  • Data is not stored in relational tables
  • Data has not been cleansed
  • Data must be integrated from heterogeneous data stores (SAP, flat files and so on)
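To make the distinction concrete, here is a minimal sketch (in Python, with hypothetical function and table names, not Information Server's actual API) contrasting pushdown, where the tool generates SQL for the database to execute, with engine-based transformation, where the tool itself processes the rows and can therefore handle flat files or other non-relational sources:

```python
# Hypothetical sketch contrasting ETL pushdown with engine-based
# transformation. In pushdown mode the tool generates SQL and the
# database does the work; in engine mode the tool reads, transforms
# and writes rows itself, which is why pushdown only applies when the
# source is already clean, relational data.

def pushdown_sql(source_table: str, target_table: str) -> str:
    """Generate SQL so the transformation runs inside the database."""
    return (
        f"INSERT INTO {target_table} (customer_id, total) "
        f"SELECT customer_id, SUM(amount) FROM {source_table} "
        f"GROUP BY customer_id"
    )

def engine_transform(rows):
    """Run the same aggregation in the ETL engine itself.

    Works on any iterable of (customer_id, amount) records, so the
    source can be a flat file, an SAP extract, or any non-relational
    store where pushdown is not an option.
    """
    totals = {}
    for customer_id, amount in rows:
        totals[customer_id] = totals.get(customer_id, 0) + amount
    return totals

if __name__ == "__main__":
    print(pushdown_sql("orders", "customer_totals"))
    print(engine_transform([("a", 10), ("b", 5), ("a", 2)]))
```

Note that the engine path makes no assumption about where the rows came from; the pushdown path assumes a relational source and target in the same database.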

Additionally, some integration logic, such as that for data cleansing, data profiling and changed data capture, simply cannot be pushed, or cannot be pushed efficiently, into the database or into Hadoop. Then consider that the database will not always run faster than a shared-nothing, massively parallel ETL engine: for some data integration processes the database will run faster, while for others it will run much slower. Hadoop also has specific latency issues, since it incurs I/O overhead between each map and reduce step. While the Hadoop community is making gains in this area, massively scalable ETL engines offer specific features that the Hadoop community has not even begun to incubate.
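The inter-step I/O point can be illustrated with a toy sketch (pure Python, with a temp file standing in for HDFS; this is not actual Hadoop code): classic MapReduce materializes each job's output to storage before a downstream job can read it, so every extra stage in a pipeline adds a disk round trip:

```python
# Toy illustration of why chained MapReduce jobs incur I/O overhead:
# each job's output is materialized (here, to a temp file standing in
# for HDFS) before the next job can start reading it.

import json
import tempfile

def word_count_job(lines, out_path):
    """Job 1: count words, then write results to "HDFS" (a temp file)."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    with open(out_path, "w") as f:
        json.dump(counts, f)

def top_word_job(in_path):
    """Job 2: must read job 1's output back from disk before it can run."""
    with open(in_path) as f:
        counts = json.load(f)
    return max(counts, key=counts.get)

if __name__ == "__main__":
    intermediate = tempfile.NamedTemporaryFile(suffix=".json", delete=False)
    word_count_job(["big data big", "data big"], intermediate.name)
    # Each additional chained stage would add another write/read cycle.
    print(top_word_job(intermediate.name))
```

A pipelined ETL engine, by contrast, can stream records from one operator directly into the next in memory, avoiding that materialization step entirely.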

To learn more about the full requirements for big data integration solutions, and the flexibility, scalability and performance they must deliver, read our whitepaper.