Migrating production Hadoop clusters on-premises and to the cloud

Can it be done without downtime and data loss?

Chief Product Officer and Co-Founder, WANdisco, Inc.

There are plenty of compelling business and technical reasons for migrating from one Hadoop distribution to another, whether on-premises or into the cloud:

  • A different Hadoop distribution, or an updated version of the same distribution, may offer improved functionality and performance. An upgrade effectively becomes a migration if the underlying Hadoop file system format changes between releases.
  • Competing Hadoop distribution vendors might offer lower support costs.
  • Choosing a single Hadoop distribution or cloud platform can facilitate enterprise-wide consolidation.
  • Cloud-based Hadoop options such as the IBM BigInsights on Cloud solution, built on Bluemix, can offer attractive economies of scale. Using BigInsights on Cloud, customers can take advantage of a wide range of IBM Watson Analytics applications, Apache Spark–based services, and third-party tools that would be infeasible to deploy and maintain in house.

Rethinking data migration paradigms

However, there are major obstacles to obtaining these benefits, chief among them the extended downtime and business disruption caused by the limited, one-way, batch-oriented tools typically used. Built on DistCp, these solutions require significant administrator involvement for setup, maintenance and monitoring. Replication takes place at prescheduled intervals in what is essentially a script-driven batch mode of operation that doesn’t guarantee data consistency. Any changes made to source cluster data while a DistCp migration process is running won’t be captured and must be manually identified and moved to the new target cluster. In practice, this means data has to remain static in both the source and target clusters during migration. In addition, DistCp is built on MapReduce, so it competes for the same resources that production clusters use for other applications and can severely degrade their performance during migration.

Distribution vendors have also added support to their DistCp solutions for moving data to the cloud, but the same challenges faced in on-premises Hadoop migration remain. For large-scale data migration, some cloud vendors offer an appliance-based approach: a storage appliance is delivered to the customer’s data center, data is copied from the customer’s servers to the appliance, and the appliance is then shipped back to the cloud vendor for transfer to its servers. This process often takes a week or longer.
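The consistency gap in batch copying can be illustrated with a small simulation. The sketch below is not DistCp itself: the clusters are plain Python dictionaries, and the names `batch_copy`, `missed_changes` and `concurrent_writer` are hypothetical helpers invented for this example. It only assumes that a batch job copies the set of files it listed at the start while applications keep writing to the source.

```python
def batch_copy(source, target, on_file_copied=None):
    """Copy every path that existed when the job started (a batch-style copy)."""
    listing = sorted(source)  # the file list is fixed at job start
    for path in listing:
        if path in source:  # skip files deleted since the listing was taken
            target[path] = source[path]
        if on_file_copied:
            on_file_copied(path)  # hook that lets us simulate concurrent writes


def missed_changes(source, target):
    """Return the paths whose contents differ between source and target."""
    return sorted(
        p for p in source.keys() | target.keys()
        if source.get(p) != target.get(p)
    )


# Toy "clusters": path -> file contents (assumed data, for illustration only)
source = {"/data/a": "v1", "/data/b": "v1", "/data/c": "v1"}
target = {}


def concurrent_writer(path):
    # Simulated application that keeps writing to the source mid-copy
    if path == "/data/b":
        source["/data/a"] = "v2"  # a file that was already copied is modified
        source["/data/d"] = "v1"  # a new file appears after the listing

batch_copy(source, target, concurrent_writer)
drift = missed_changes(source, target)
print(drift)
```

In this toy run, the modified file and the newly created file both end up out of sync on the target, which is the kind of drift that otherwise has to be found and reconciled by hand.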

Normal operations can’t continue while these technologies are in use; in many cases they pause for as long as a week. In addition, there’s no way to know with any certainty whether all the data migrated successfully, or whether applications will function as expected in the new environment, until after migration completes. This is unacceptable in virtually any production environment, and in the case of cloud migration means that only cold, static data can be moved.

Taking a transactional approach to data migration

The only way of overcoming these obstacles is by using technology that’s transactional and multidirectional, allowing old and new clusters to operate in parallel while data moves between them as it changes in either environment. Applications can be tested to validate performance and functionality in both the old and new environments as they operate side by side until migration is complete. Data, applications and users move in phases, and problems can be identified as they occur instead of after a period of downtime, by which time they might be impossible to resolve without restarting the entire migration process.
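As a rough illustration of the idea, the sketch below keeps two toy clusters in step by applying every write, wherever it originates, in one agreed order. It is not any vendor's implementation: `Cluster` and `ReplicatedLog` are hypothetical names, and a single in-process list stands in for the distributed agreement on ordering that a real system would need.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    files: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries this cluster has applied


class ReplicatedLog:
    """One agreed order of changes shared by all clusters (toy stand-in)."""

    def __init__(self, clusters):
        self.entries = []
        self.clusters = clusters

    def write(self, origin, path, data):
        # A write accepted at any cluster is appended to the shared order
        self.entries.append((origin.name, path, data))
        self.sync()

    def sync(self):
        # Each cluster applies the entries it has not yet seen, in log order
        for cluster in self.clusters:
            for _, path, data in self.entries[cluster.applied:]:
                cluster.files[path] = data
            cluster.applied = len(self.entries)


old = Cluster("on-premises")
new = Cluster("cloud")
log = ReplicatedLog([old, new])

log.write(old, "/data/a", "v1")  # change made in the old environment
log.write(new, "/data/b", "v1")  # change made in the new environment
log.write(old, "/data/a", "v2")  # later update to an existing file

print(old.files == new.files)
```

Because both clusters apply the same changes in the same order, applications can read and write on either side while migration is under way, and the two copies can be compared at any point instead of only at the end.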

In addition, such a tool must function equally well regardless of the underlying Hadoop distribution and version, the storage type and, during cloud migration, the cloud vendor’s object storage. It should be multidirectional, not merely bidirectional, so that it can also support complex consolidation projects that move data from multiple clusters running on a mix of Hadoop distributions and storage onto a single on-premises or cloud platform, or a hybrid of the two.

After migration, this same technology can be used to distribute and synchronize data in any way required, enabling a true hybrid cloud deployment. Hardware and other infrastructure used for backup and disaster recovery before migration can be brought into full active production use after migration, allowing Hadoop deployments to scale up dramatically at no additional cost.

To learn more about how IBM and WANdisco are working together to deliver a seamless migration experience that eliminates downtime and data loss while offering a host of other benefits after migration, read the whitepaper “Why a Transactional Data Migration Tool Is Required.”