IBM Big Replicate: Complete resilience through active-active replication
With the release of IBM BigInsights 4.2, IBM is making self-service, powerful advanced analytics—including Apache Spark—available on an optimal Apache Hadoop distribution. Additionally, the offering now includes IBM Big Replicate, which provides the core functionality supporting continuous availability and performance with data consistency across clusters that are any distance apart, on premises or in the cloud.
We spoke recently with Jim Campigli, chief product officer and cofounder of WANdisco, Inc., about the Big Replicate technology. Big Replicate is an active-transactional replication technology for continuous availability, streaming backup, uninterrupted migration, hybrid cloud and cloud bursting, and data consistency across clusters that are any distance apart. IBM Big Replicate is WANdisco technology that IBM offers under its own brand through an original equipment manufacturer (OEM) agreement.
Campigli has over 25 years of software industry experience at both early-stage and public companies. In his current role, he oversees WANdisco’s product strategy. He has held senior product management, strategy, marketing and consulting roles at companies such as BEA Systems, NetManage and SAP.
To kick off this interview, can you explain what IBM Big Replicate is?
Big Replicate is the IBM version of WANdisco Fusion, which IBM has OEM’ed and now offers to the market under the IBM Big Replicate 2.0 brand. It is based on patented technology from WANdisco that enables active-active, or what we also like to call active-transactional, replication across data sources any distance apart. This technology has taken the Paxos algorithm and enhanced it to enable active-active replication between a variety of data sources such as Hadoop clusters, cloud environments, [network-attached storage] NAS filers and so forth. It enables continuous data access in the face of network outages, hardware failures and entire data centers going up and down, so that you get complete resilience that otherwise is impossible. Actually, computer scientists said it would never be possible because of the challenges involved in achieving this [resilience] across distributed data sources connected over a wide area network.
“You get complete resilience that otherwise is impossible; and actually, computer scientists said it would never be possible with distributed systems connected over a wide area network.” —Jim Campigli, Chief Product Officer, Cofounder, WANdisco
Can you simplify the active-active replication capability down to a simple scenario for those readers who may be new to this topic? How would you describe active-active replication?
What it really enables, from a practical perspective, is the same scenario you’d have if everybody were working from one location, off of one Hadoop cluster or one database, whatever it may be, even though they are actually working across multiple data sources at different locations, any distance apart. It gives them access to the same data, the same view of the data, and read and write access to the same files, just as if they were working against a single data source at a single location.
I had never heard of the Paxos algorithm before today. Can you say more about it?
The original white paper was written by a computer scientist named Leslie Lamport at Microsoft. The concept is about maintaining one-copy equivalence based on quorum agreement. For multiple data sources that you want to keep in sync, that means a quorum of those data sources has to agree to any new transaction that is proposed at any one of them. The data sources have to say “yes, there is no conflict with my data, and I can accept this ordering of the transaction relative to everything else.” Once a quorum of the participants (the participating Hadoop clusters, databases, or whatever) agrees to that transaction, the transaction is written by all of them at, as we like to say, the same logical time. Of course, there are limitations in terms of the speed of the network and so forth, but there are a number of efficiencies built into our solution that help to overcome those issues as well.
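To make the quorum idea concrete, here is a minimal Python sketch of majority agreement on an ordered transaction log. It is a simplified illustration, not the Paxos protocol itself and not WANdisco's implementation; the `Replica` class and `propose` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one participating cluster."""
    name: str
    online: bool = True
    log: list = field(default_factory=list)  # agreed transactions, in order

    def accepts(self, seq):
        # Vote yes only if reachable and the proposed position is the
        # next free slot in this replica's own log (no ordering conflict).
        return self.online and seq == len(self.log)


def propose(replicas, txn):
    """Write txn at the next log position if a majority agrees.

    Returns True when a quorum accepted and the transaction was written,
    False when no quorum was reached (nothing is written in that case).
    """
    seq = max(len(r.log) for r in replicas)  # next position in the shared order
    voters = [r for r in replicas if r.accepts(seq)]
    if len(voters) <= len(replicas) // 2:
        return False
    for r in voters:
        r.log.append(txn)
    return True
```

With three replicas, a proposal still succeeds when one of them is offline, but fails when two are, because a single remaining replica is not a majority. A replica that missed transactions votes no until it has caught up, since its log position no longer matches the proposed one.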
What are the benefits of Big Replicate?
The first is continuous availability with performance; that is, you end up with read-and-write access to the same data at local network speed in every location. Additionally, you can replicate selectively; you don’t have to replicate your entire Hadoop cluster. In the case of Hadoop, replication is done at the [Hadoop Distributed File System] HDFS folder level, and the data is replicated immediately as it is ingested. For fast-data applications such as Spark streaming, for example, you don’t have to wait for files to be completely written and closed, as you do with the standard tools used with Hadoop for replicating data across clusters. This means you are getting a kind of built-in, continuous hot backup by default, because every change made in one participating cluster is replicated to the others that are participating, and they can also update those same files. The clusters can be on premises or in the cloud and can run on any distribution. Big Replicate is agnostic to the underlying Hadoop distribution and version, unlike the tools provided by the Hadoop distribution vendors.
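Selective, folder-level replication can be pictured as a simple path check. The Python sketch below is an illustration only: the `is_replicated` function and the example folder names are hypothetical and do not reflect how Big Replicate is actually configured.

```python
from pathlib import PurePosixPath


def is_replicated(path, replicated_folders):
    """Return True if path is one of the replicated folders or lies inside one."""
    p = PurePosixPath(path)
    folders = [PurePosixPath(f) for f in replicated_folders]
    # Compare whole path components, so /data/salesforce does not
    # accidentally match a rule for /data/sales.
    return any(p == f or f in p.parents for f in folders)
```

For example, with `["/data/sales"]` as the replicated set, a file such as `/data/sales/2016/part-0` would be replicated while `/tmp/scratch` would stay local to its cluster.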
IBM Big Replicate replicates the same data volumes up to 90 percent faster than distributed copy (DistCp)–based solutions such as Apache Falcon or Cloudera Backup and Disaster Recovery (BDR), without impacting the performance of the other applications running on the clusters as those solutions do.
Would you say that Big Replicate is akin to a data insurance policy?
Yes, you’re exactly right. With Big Replicate installed with each cluster, or in each cloud environment such as BigInsights on Cloud, each cluster knows the last good transaction that it processed. So if a site goes down, when it comes back online it is able to reach out to the other Big Replicate servers installed with the other participating clusters, grab all the transactions that it missed while it was offline, apply them and re-sync automatically. This eliminates the risk of human error in recovery and ensures that there is no data loss. By contrast, the standard DistCp solutions that come with Hadoop are batch oriented and do not replicate continuously, which means that any data added since the last time you did a cluster backup with DistCp can potentially get lost.
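The recovery behavior described here, replaying everything after the last good transaction, can be sketched in a few lines of Python. This is a simplified model under the assumption that each site keeps an ordered list of applied transactions; the `catch_up` function is a hypothetical name, not part of any Big Replicate API.

```python
def catch_up(local_log, peer_log):
    """Apply the transactions a recovering site missed while it was offline.

    local_log is extended in place; the list of replayed transactions
    is returned.
    """
    last_good = len(local_log)  # position just past the last applied transaction
    if peer_log[:last_good] != local_log:
        # The histories disagree before the recovery point, so an
        # automatic replay would not be safe in this simple model.
        raise ValueError("logs diverged before the last good transaction")
    missed = peer_log[last_good:]
    local_log.extend(missed)
    return missed
```

A site that went offline after `t2` and reconnects to a peer holding `t1` through `t4` would replay `t3` and `t4` in order, with no manual step in between.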
The other issue with these standard DistCp [solutions] is that they consume so many cluster resources that they compete with the other applications running on a Hadoop cluster. What ends up happening is that replication jobs of any size with DistCp-based solutions end up being done during off-peak hours, and full-cluster backups have to be done after normal business hours. So again, if anything goes down during the day, you run the risk of losing the entire day’s worth of data.
We don’t have that problem. We replicate every transaction as it is created; every change is replicated to all the other participating clusters. In a nutshell, with active-active replication, you’re protecting your business and your data, and if you value your data, you’re going to want a solution like this.
Is there anyone else in the market today that has an active-active replication service such as IBM Big Replicate?
No, no other truly active-active replication solution is available today. That’s why we have seven patents and 25 pending patent applications on this technology. Because of the way IBM Big Replicate is deployed and architected, we can replicate across any clusters running any Hadoop distributions, as long as they are compliant with the HDFS API.