Big Replicate: A big insurance policy for your big data
Dwaine Snow is a Global Big Data and Data Science Technical Sales Leader at IBM. He has worked for IBM for more than 25 years, focusing on relational databases, data warehousing, and the new world of big data analytics and data science. He has written eight books and numerous articles on database management, and has presented at conferences around the world.
As organizations’ maturity levels around big data technologies increase, we’re seeing a rise in the use of Apache Hadoop as a component of business-critical systems. Operationalizing the use of Hadoop will potentially unlock huge opportunities to derive more business value from unstructured and non-traditional datasets – but it also poses new challenges, and raises new risks from an IT management perspective.
To explore this increasingly hot topic, we spoke to Dwaine Snow, Global Big Data and Data Science Technical Sales Leader at IBM.
Andrea Braida: Dwaine, great to have you with us. As the Hadoop market has evolved, why have cluster management and business continuity become such important topics?
Dwaine Snow: There’s really a shift in what companies are doing with Hadoop. Historically, big data architectures have often been used for offline analytics – acting as a sandbox to help data scientists explore new datasets, run experiments, and identify new areas of business value, for example.
Today, there’s an increasing awareness that these technologies can also be used operationally. They can provide a foundation for high-speed operational analytics, much closer to the point of transaction, taking into account far more than just the transactional data itself. This is especially true for applications that use Apache Spark – for example, ingesting incoming streams of data at high velocity, and then aggregating, correlating or transforming them for analysis. Spark also allows you to run advanced analytical and machine-learning algorithms on your data to derive even more value and insight.
As users start to become reliant on these types of applications, we need to start treating them like what they are: tier-one, business-critical, enterprise-grade systems. And that means they need the same kind of protection as any other tier-one system, in terms of high availability and disaster recovery capabilities.
Businesses have started to ask questions like: how do we back up our Hadoop cluster? If we need to recover from a disaster, how long will it take to restore? How much data – or how many transactions – are we going to lose? What if we need 24/7 availability, or we can’t tolerate any data loss?
Andrea Braida: And I guess traditional backup strategies can’t provide a good answer to those questions?
Dwaine Snow: Yes, exactly. The problem is, it’s so much more complex. Backing up a database that lives on one server and holds a couple of terabytes of data is one thing; backing up a Hadoop cluster with potentially hundreds of nodes and petabytes of data is a completely different story.
Of course, there are tools available: Hadoop comes with a utility called DistCp, which is designed for large-scale copying of data between clusters. The problem is, DistCp only gives you a point-in-time snapshot – anything that has happened on the cluster since you last took a snapshot is at risk of being lost.
That’s a problem when you want to use a Hadoop environment for anything that requires continuous streaming or monitoring of data from sources outside of Hadoop – Internet of Things devices, for example, or social media sites. If your cluster goes down and you have to restore from yesterday’s snapshot, there may be no way to get today’s data back.
And if you’re using Hadoop as your data lake – as a single source of truth for all the information that your business depends on – you really can’t afford any data loss at all.
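To make the snapshot gap described above concrete, here is a minimal Python sketch of the arithmetic. It is an illustration only, not part of DistCp or Big Replicate: the function names, the daily schedule, and the ingest rate of 500 events per second are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def last_snapshot_before(first_snapshot: datetime,
                         interval: timedelta,
                         failure: datetime) -> datetime:
    """Return the most recent scheduled snapshot at or before the failure.

    Hypothetical helper: assumes snapshots run on a fixed interval
    starting at first_snapshot and always complete instantly.
    """
    if failure < first_snapshot:
        raise ValueError("failure occurs before the first snapshot")
    completed = (failure - first_snapshot) // interval  # whole intervals elapsed
    return first_snapshot + completed * interval


def exposure_window(first_snapshot: datetime,
                    interval: timedelta,
                    failure: datetime) -> timedelta:
    """Return the span of time whose writes exist only on the failed cluster."""
    return failure - last_snapshot_before(first_snapshot, interval, failure)


if __name__ == "__main__":
    # Assumed scenario: nightly snapshots at midnight, outage mid-afternoon.
    first = datetime(2024, 1, 1, 0, 0)
    outage = datetime(2024, 1, 3, 17, 30)
    window = exposure_window(first, timedelta(hours=24), outage)

    # Assumed ingest rate for a streaming source; purely illustrative.
    events_per_second = 500
    events_at_risk = int(window.total_seconds()) * events_per_second

    print(f"Unprotected window: {window}")
    print(f"Events at risk: {events_at_risk:,}")
```

Under these assumed numbers, an outage at 17:30 leaves seventeen and a half hours of streamed data with no copy anywhere else. Shortening the snapshot interval narrows the window but never closes it, which is the limitation being described here.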
Andrea Braida: So you need an insurance policy.
Dwaine Snow: You need a big insurance policy! You need something that’s going to protect your data continuously, and that’s where IBM Big Replicate comes in.
Unlike DistCp, Big Replicate is not a snapshotting tool – it’s an active-active replication utility that helps protect your data by maintaining multiple copies across separate clusters, or even replicating it to an object store. By running two or more clusters at different data centers and keeping them continuously in sync with Big Replicate, you can dramatically reduce the risk of data loss. Even if one data center suffers a catastrophic outage, you have a full copy of all your data at the other site, so you can continue working.
Andrea Braida: That sounds great – but what about applications where you can’t afford to lose data, but you don’t need 24/7 availability and you can’t justify the overhead of running two clusters?
Dwaine Snow: Well, one of the really neat things about Big Replicate is that the clusters you’re replicating between don’t have to be the same. In fact, they don’t even necessarily both need to be Hadoop clusters.
One of our Hadoop clients, a major global hotel chain, has several important applications that are just as you describe – their data is business-critical, but the applications could be offline for a day or more without too much impact on the business. They don’t really need to have a second Hadoop cluster on standby at all times, because it’s not necessary to be able to restore within minutes – and so they don’t want to pay for extra compute capacity that they won’t use.
So we’ve been working with them to use Big Replicate to sync their Hadoop environment with IBM Object Storage (Cleversafe). Object storage is just storage – it doesn’t provide the processors and memory of a Hadoop cluster – so it’s much more cost-efficient per terabyte.
As a result, they get all the same data protection benefits at a much lower total cost of ownership.
Andrea Braida: Is this a one-off, or do you see other companies doing the same thing?
Dwaine Snow: With Hadoop today, the storage is directly attached to each compute node. But I’ve seen trends in the industry that lead me to believe that in the future, most big data environments will be in two separate layers: storage and compute.
Object storage is already accessible from Hadoop and Spark, two of the main big data engines, so this does seem to be the direction the industry is moving in. And I’d say this is another indication that Big Replicate is ahead of the game, since it enables you to start your journey to this new paradigm now, without impacting your current Hadoop cluster.
Andrea Braida: So where should our readers go to learn more?
Dwaine Snow: Well, I noticed another interview that you did with Jim Campigli about Big Replicate a few months back – that might be a good place to start! Or if they would like to see Big Replicate in action, there’s a great demo video. Of course, there’s also the product page on ibm.com. And finally, if they would like to learn more about big data architectures in general, they could check out the webinar on Hadoop and Spark that I recently took part in.