Hadoop your way to a hybrid cloud

Taking data warehousing to the hybrid cloud

Offering Manager, Watson Data & AI, IBM

Recently there’s been a marked rise in the number of organizations starting to integrate big data technology into their day-to-day operations. A few years ago, the big trend was to offload data to appliances. Now, everything is about Apache Hadoop. So what kinds of challenges are cropping up now, and how are companies keeping pace?

Evolving data warehouse space

Architectures are changing because the data itself is evolving. The move from traditional relational databases (RDBMSs) to appliances and columnar databases was all about accelerating SQL queries and reducing costs for analyzing structured data. As a result, many organizations now have robust tools that can handle many of the challenges presented by structured data. However, their capabilities around semistructured and unstructured data are generally much less mature. 

In the last five years, as NoSQL databases have become more popular, increasing amounts of data are being stored in semistructured formats such as JavaScript Object Notation (JSON) and XML. This rise in semistructured formats is particularly the case when organizations are sharing data with their suppliers or customers, or using external, third-party data sets such as macroeconomic data, stock market data or weather data. Building a new relational database schema for each data set is not practical, so organizations need a more flexible approach. 
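To make the "more flexible approach" concrete, here is a minimal sketch of reading semistructured records without declaring a schema up front. It is only an illustration under stated assumptions: the sample records, field names and the helper functions `field_counts` and `select` are all hypothetical, and plain Python stands in for whatever platform you would actually use at scale.

```python
import json
from collections import Counter

# Hypothetical records from external feeds; the field names are illustrative only.
RAW_RECORDS = [
    '{"source": "weather", "city": "Austin", "temp_c": 31.5}',
    '{"source": "stocks", "ticker": "XYZ", "close": 101.25, "volume": 12000}',
    '{"source": "weather", "city": "Oslo", "temp_c": 12.0, "wind_kph": 20}',
]


def field_counts(raw_records):
    """Count how often each field appears, with no schema declared in advance."""
    counts = Counter()
    for line in raw_records:
        record = json.loads(line)
        counts.update(record.keys())
    return counts


def select(raw_records, field):
    """Read one field at query time, skipping records that do not carry it."""
    values = []
    for line in raw_records:
        record = json.loads(line)
        if field in record:
            values.append(record[field])
    return values


if __name__ == "__main__":
    print(field_counts(RAW_RECORDS))
    print(select(RAW_RECORDS, "temp_c"))
```

The point of the sketch is that a new data set with new fields needs no schema change: the structure is interpreted when the data is read, not when it is stored.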

Companies that want to analyze completely unstructured data, such as text from social media, face similar issues. That kind of data simply can’t be efficiently analyzed at scale using a traditional relational database, regardless of whether it’s row-based or in columnar form. By contrast, Hadoop has no problem with either semistructured or unstructured data, so it’s a well-suited platform for these types of use cases.

Transforming from a fallen monolith to a hybrid cloud

Many data management professionals used to have a monolithic, one-size-fits-all data warehousing philosophy. Now, a definite shift is underway toward a specialized approach that’s nimble and adaptable to new use cases. These new strategies also need to be implemented without sending costs spiraling, which is why a hybrid cloud approach is so compelling. You can keep your traditional data warehouse and appliances for the workloads you know you’re always going to need: traditional reporting, planning and forecasting, and structured data analytics.

Then, for new use cases around semistructured or unstructured data, which may be a bit more experimental, you can leverage cloud data services to quickly spin up whichever technologies you need. These technologies include a Hadoop service such as IBM BigInsights on Cloud, an Apache Spark service such as IBM Analytics for Apache Spark, a NoSQL document store such as IBM Cloudant, or a combination of several such technologies. 

The hybrid cloud approach gives you flexibility and agility in financing and architecture. With a cloud service for Hadoop or Spark, you only pay for what you use, and no up-front infrastructure investment is necessary. There’s no reason not to try it out; very little financial risk exists. And if your use case produces valuable results for the business, then you can quickly and easily scale up from a pilot to a full-scale implementation. 

Factoring data variety into the equation

The variety of data is driving change, and as a related issue, the types of analyses that businesses want to perform are becoming quite varied as well. Traditional data warehouse architectures are excellent for standard reporting and dashboards, as well as core financial analytics tasks such as budgeting, forecasting and consolidations. Meanwhile, appliances and columnar databases are good for very complex queries and sophisticated ad hoc analysis. 

But increasingly, those are capabilities that many businesses already have: they’re no longer a significant differentiator or a source of competitive advantage. Instead, the emphasis is shifting to the realm of the data scientist, who uses techniques such as data mining, statistical analysis, natural-language processing and machine learning. Once again, Hadoop is a much more efficient platform for these algorithmic analysis techniques. 

In general terms, we’re seeing the continuation of a journey from a monolithic, one-size-fits-all data warehousing philosophy, to a more specialized approach that treats each business use case and each data set on its own merits. In fact, although we often talk about moving to Hadoop as though it’s a single environment, in reality we are talking about a wide range of technologies that can play different roles depending on the business use case. 

For example, Hadoop was essentially designed for batch processing, so it’s great for large-scale analyses that you need to run periodically. But if you want to do real-time analytics or handle streaming data, you may want to add Spark or IBM InfoSphere Streams to the mix. 
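The difference between the two processing styles can be sketched in a few lines. This is a simplified illustration, not how Hadoop, Spark or InfoSphere Streams are implemented: the function `batch_average` and the class `RunningAverage` are hypothetical names, and the readings are made-up sample values.

```python
def batch_average(readings):
    """Batch style: process the full data set in one periodic run."""
    return sum(readings) / len(readings) if readings else 0.0


class RunningAverage:
    """Streaming style: update the result as each event arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: no need to keep or rescan earlier events.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    readings = [10.0, 20.0, 30.0]  # made-up sample values

    # Batch: the answer is available only after the whole run completes.
    print(batch_average(readings))

    # Streaming: an up-to-date answer is available after every event.
    stream = RunningAverage()
    for value in readings:
        print(stream.update(value))
```

Both approaches arrive at the same final figure; what differs is when the answer becomes available, which is why the choice depends on the use case.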

Increasingly, we’re seeing businesses focus on making their data warehouses nimble, adaptable and extendable to new use cases, rather than trying to build a general-purpose solution that can handle everything from day one.

Modernizing the data warehouse with Hadoop

We’re seeing that using Hadoop to warehouse structured data can simply be much more cost-efficient. Traditional data warehouse infrastructure is very expensive because it typically requires enterprise-class servers and tier-one storage. Storing more than a certain number of years of historical data on such platforms usually becomes too expensive, so the older data gets archived. But once it’s archived, you can’t easily analyze it.

Hadoop solves this problem by providing a highly cost-effective platform for storing and analyzing this data. Cloud versions of Hadoop such as BigInsights on Cloud give you all the benefits with none of the ancillary costs, such as purchasing, installing and maintaining infrastructure; hiring skilled staff; and maintaining security. You can also take advantage of low-cost object storage services to turn a Hadoop cluster into a repository for low-touch data that doesn’t need to be intensively analyzed. This approach is becoming increasingly popular, especially for midsized businesses that don’t want the trouble of setting up their own Hadoop cluster.

Updating existing architecture today

If you have a few initial use cases in mind and want to modernize your own data architecture, you can get started right away. By signing up for the IBM Bluemix platform, you get a 30-day complimentary trial that lets you take a look at many of IBM’s cloud data services. You can see what combination of technologies (BigInsights, Spark, Cloudant and so on) would work best for each specific use case.