How to advance your data consistency strategy

Offering Manager, Hadoop, Analytics, IBM


Data application resiliency is high on the agenda for CIOs. Gary Brunell from IBM looks at an in-house data replication solution developed by one Expedia subsidiary, and at how the IBM Big Replicate offering compares. Gary Brunell is a Certified Thought Leader and a Distinguished IT Specialist at IBM. Gary works with a wide range of clients in the financial services sector, from startups to Fortune 50 companies.

Jessica Lee: Your client has developed an intrepid approach to its data consistency and data management challenges. What’s the story behind its data strategy?

Gary Brunell: The company, an Expedia subsidiary, works with partners and affiliates globally to connect tourists to accommodation providers. The company was running Hadoop infrastructure with Hive nodes for data analysis. For a business based on deep knowledge of customer behavior, these solutions are critical for success. The company felt, however, that the existing tools available for ensuring data consistency were inadequate for a business with so many locations and sources. It was originally focused on improving its disaster recovery capabilities by implementing failover to a secondary data center, but this soon morphed to include reporting services. At the time they were looking, there simply wasn’t much on the market, and open source was the logical choice. This decision spawned the creation of the company’s own solution, named Circus Train, which was later released to the open source community. To develop the solution, the team leveraged capabilities within the vanilla Hadoop architecture, gradually tailoring the functionality to meet their exact needs.

Jessica Lee: If I were the CIO of that company, my question would be: how many people do I need to support this?

Gary Brunell: If I’m an accommodation-booking company, the margins for my overall business are probably under some pressure. How much of my business will I devote to supporting the open source community and how much to my own, specific developments?

As you might expect, Circus Train is very specific, designed to serve a niche market, and long term I suspect it will be relatively difficult to sustain robust development resources around it. All the same, the lesson for me is that a self-help strategy can produce some really neat work, like Circus Train. The challenge arises on the business management side: vendor products like IBM Big Replicate offer enterprise resources, strength and depth that stack up to make a more compelling, long-term commercial proposition.

Jessica Lee: How do you assess the costs of the two approaches?

Gary Brunell: Most organizations operate 24/7, and that means you need four people simply to cover night shifts, illness and holidays. That’s before accounting for the times when things go amiss and you need more folks in the room. A low-end cost estimate for each person would be around $200,000 annually, and that’s probably not fully loaded.
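As a back-of-envelope check, the figures Gary quotes can be combined into a baseline annual cost. This is a minimal sketch using only the numbers from the interview; the function name and structure are illustrative, not from any IBM tool.

```python
# Baseline staffing cost for round-the-clock in-house support,
# using the illustrative figures from the interview:
# four people for 24/7 coverage, ~$200,000 each per year (not fully loaded).

PEOPLE_FOR_24_7_COVERAGE = 4       # covers night shifts, illness, holidays
ANNUAL_COST_PER_PERSON = 200_000   # low-end estimate, USD

def annual_support_cost(people: int = PEOPLE_FOR_24_7_COVERAGE,
                        cost_per_person: int = ANNUAL_COST_PER_PERSON) -> int:
    """Minimum yearly cost of staffing an always-on support rotation."""
    return people * cost_per_person

print(annual_support_cost())  # 800000 -- before any incident surge staffing
```

On those assumptions, the floor is roughly $800,000 a year before a single incident requires extra hands.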

For companies that run maybe 20 nodes, the open source route can seem very attractive, and is probably the right initial path. If you don’t have much volume, building your own solution can make sense. But I think it is only a matter of time before issues come up that you don’t have the in-house resources or expertise to solve.

For example, another Fortune 50 customer I’m familiar with ran its Hadoop platform, leveraging data recovery tools and more to achieve the resilience and disaster recovery it wanted. As business needs piled on, the IT team put together a plan to double capacity. The full analysis showed that they were looking at a $10 million investment on their own infrastructure, including resources, training and additional tools. For half the price, they realized that IBM Big Replicate powered by WANdisco Fusion could solve their capacity and performance issues. What’s more, IBM Big Replicate offers standardized recovery, replication, management and more without the commitment to in-house salaries, infrastructure and maintenance.

Regardless of your environment — on-premises, hybrid, or cloud — IBM Big Replicate solves data consistency challenges that in-house teams can take too long to work out, or may never work out, and at a lower total cost. Specifically, IBM Big Replicate enables a new way of working, now commonly termed “livedata,” with key data continuously replicated and synchronized. With true data consistency and managed replication, WANdisco Fusion eliminates many issues around resilience and disaster recovery and enables cloud burst analytics and other opportunities.

Jessica Lee: Why is control so important? How do in-house and vendor solutions differ, and what does that mean for customers?

Gary Brunell: For many people, first steps into the Hadoop arena are often small pilot ventures using existing infrastructure, which can introduce other issues, such as control. For example, at an investment bank where we wanted to reboot and try something from scratch, it took almost two weeks to arrange the schedule, often at very specific and unsociable times at the weekend.

My first reaction to the Circus Train solution was to be impressed. They created a great product and built functionality that you don’t often see in the Hadoop world. They developed, experimented, grew and established the product, but for a very specific need. Then as their world changed, and the pressure to move to the cloud grew, it introduced more challenges, stretching their resources. If it’s all in-house, who do you call? And if you bring in consultants, how does that stack up against engaging with a third party in the first place?

The additional aspect is that IT landscapes are constantly changing, and developing deep in-house expertise in one area can catch you off-guard when the market moves on. IBM Big Replicate can handle Hadoop, cloud, object storage and more, and it is in the interest of IBM (and WANdisco) to keep developing and extending the solution. This can easily overstretch an in-house operation, turning cutting-edge into trailing-edge operations.

Jessica Lee: What is the endgame for businesses if they start in-house?

Gary Brunell: The company started with the intention of migrating the data platform transparently, but that soon translated to moving data to the cloud. In just moving the data, they found that the tools they had at their disposal were insufficient. If you look at their industry and the likely resources available, I’m guessing there are maybe 30 contributors to their specific branch of Hadoop, and some of those are branching off to other areas. By comparison, IBM has 300 contributors in just three Apache areas.

It’s true that the company is principally interested in data consistency, and that having the same values available everywhere at the same instant is not at the top of its priority list. But if synchronization is a time-scheduled operation, how much activity will have taken place between replications? If my data is off by a minute or two, maybe an hour, there’s an overhead to determine whether I am using the correct, current data for my app, analysis or whatever. Can I go with possibly mismatched data, or do I request an immediate sync? The company faced these questions, as will everyone running a 24/7 operation. The Circus Train product continues to function well for them. The more general case is that solutions like IBM Big Replicate deliver scalability and resilience at an enterprise level, and for some clients that would be the optimal path.
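The staleness question Gary raises can be made concrete. The sketch below is hypothetical and not based on any Circus Train or Big Replicate internals: it simply shows the worst-case lag a replica reader can observe under time-scheduled synchronization, and why driving the interval toward zero (continuous replication) shrinks that window.

```python
# Hypothetical illustration of replica staleness under scheduled sync.
# A write that lands just after a sync begins must wait for the next
# scheduled run, plus the time that run takes to complete.

def max_staleness_seconds(sync_interval_s: float, sync_duration_s: float) -> float:
    """Worst-case age of data on a replica, in seconds."""
    return sync_interval_s + sync_duration_s

# Hourly batch replication that takes 5 minutes per run:
print(max_staleness_seconds(3600, 300))  # 3900.0 -> replicas can be ~65 min behind

# Continuous, per-change replication pushes the interval toward zero:
print(max_staleness_seconds(0, 2))       # 2.0
```

The numbers are illustrative, but the shape of the trade-off is general: with any scheduled window, every consumer of the replica inherits the burden of deciding whether stale-by-up-to-an-hour data is good enough.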

To learn how IBM Big Replicate can help your organization with data resiliency, schedule a consultation with an IBM expert.