10 expert tips to boost agility with Hadoop as a service

Portfolio Marketing Manager, IBM

Recently, a group of Apache Hadoop and Apache Spark subject matter experts from IBM Analytics hosted a public CrowdChat discussion about using cloud-based Hadoop and Spark services as a lever for business agility. Here is a top-ten list of hot topics and themes that emerged from that discussion.

Despite years of effort centralizing information in data warehouses, organizations are still challenged to gain insight into large segments of their data, especially dark, unstructured data that isn’t amenable to database storage or analysis. Solving this problem is critical to business agility because, as one commentator noted, you don’t really know the value of data before you have it. Data is an investment, but once you have it you can do amazing things with it.

Adopting Spark as a service helps break open this last silo of enterprise data because Spark lets you quickly access, join, correlate and analyze your data, and predict and optimize results based on all of it, no matter where it is stored or what format it is in. Essentially, a flexible, cloud-based Spark cluster enables you to bring powerful analytics to bear on almost any data source, even dark data, and start mobilizing its business value immediately.

Failing fast—the ability to set up and conduct experimental projects quickly and at minimal cost—was repeatedly cited as a key enabler of agility. If businesses cannot afford to take the risk of starting a project until they know it’s going to deliver results, then they will miss out on opportunities that more aggressive competitors will seize eagerly.

So, how do you fail fast? Make sure that you have a platform in place that makes it easy to initiate, manage and close off projects with the minimum possible investment of time, resources and capital. Hadoop as a service is a well-suited playground for this kind of rapid experimentation for several reasons: 

  • It handles almost any kind of analysis or transformation of almost any type of data.
  • It enables you to spin up new environments in minutes, without any help from your IT team.
  • It offers a flexible, pay-as-you-go commercial model, with no up-front investment. 

As one participant put it, enterprises should be using Hadoop as a service to institute data science–driven, real-world experiments across all business processes—customer facing, internal and business to business (B2B)—so you can fail fast and iterate 24x7.

Asked whether they would advise other organizations to invest in Hadoop or move to Spark, most responders felt that this question took the wrong approach. Hadoop and Spark are complementary, not mutually exclusive.

As one participant wrote, Hadoop helps storage and processing of data; Spark helps streaming and real-time analytics. We need both. Another commentator agreed: Hadoop and Spark make up a better-together story—Hadoop for files and storage and Spark for awesome, fast analytics.

Some of the general confusion about these technologies stems from people thinking of Hadoop as just being the MapReduce data processing engine, which has been largely superseded by Spark. However, Hadoop is actually an ecosystem that includes both MapReduce and Spark, as well as dozens of other projects for large-scale data ingestion, processing and analytics.

Hadoop today is a very different animal to the Hadoop of five years ago. Components that were originally considered an essential part of Hadoop may no longer be the best option. The rise of Spark at the expense of MapReduce is the best-known example of this evolution; but increasingly, companies are swapping other technologies in and out of their Hadoop landscapes too.

For example, as one commentator explained, nowadays even the use of Hadoop Distributed File System (HDFS) is questionable. We tend to use the native object store offered by the cloud provider for data at rest. Another agreed. We’re seeing a move away from HDFS storage in the core data platform to the use of a native object store. In fact, that object store is built into IBM Watson Data Platform.

The use of standardized Hadoop-as-a-service offerings can greatly help enterprises stay current with the most important new trends in the Hadoop space. Services such as IBM BigInsights on Cloud provide a carefully curated and managed set of the best projects that the ecosystem has to offer and augment them with well-integrated proprietary components. This approach helps eliminate the need for companies to spend weeks researching the latest and greatest new Apache projects, and wasting time implementing flavor-of-the-month technologies that don’t live up to the hype. Moreover, a well-integrated Hadoop-as-a-service environment can be a core component of a truly comprehensive big data landscape, such as IBM Watson Data Platform, that enables companies to put each tool to its best use.

As one participant recommended, use Hadoop as a data refinery and query-capable archive, and use Spark as a data science development platform. Use an in-memory relational database management system (RDBMS) for online transaction processing (OLTP), and use a massively parallel processing (MPP) RDBMS for data warehousing. Use NoSQL for mobile and Internet of Things applications, and use stream computing for real-time processing.

In response to a question about why one might decide to put a Hadoop cluster in the cloud rather than running it on premises, the responders were almost unanimous. A cloud deployment model helps eliminate infrastructure management complexities, empowering enterprises to focus on what really differentiates them from the competition.

Do you prepay for something that will be obsolete in four years, or pay as you go and stay current? The answer to that question seems like a no-brainer, as one commentator put it. Another added that cloud computing means you do not have to wait weeks or months for hardware, networking and so on. And you can spend your resources on analyzing data—not managing the cluster. A third commentator pointed out that for most companies, competitive advantage is probably going to be from using Hadoop and not from managing Hadoop.
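The prepay-versus-pay-as-you-go question above can be made concrete with a little arithmetic. The following is a minimal sketch in Python, not a pricing model: every figure in it (hardware cost, operations cost, hourly rate, usage hours) is a hypothetical placeholder chosen for illustration and does not come from the discussion or from any provider's price list.

```python
# Hypothetical figures for illustration only; substitute your own quotes.
UPFRONT_HARDWARE_COST = 400_000.0   # assumed purchase price of an on-premises cluster
YEARLY_OPERATIONS_COST = 50_000.0   # assumed yearly staffing, power and maintenance
HARDWARE_LIFETIME_YEARS = 4         # the "obsolete in four years" horizon from the text
CLOUD_RATE_PER_CLUSTER_HOUR = 20.0  # assumed pay-as-you-go price per cluster hour


def on_premises_cost(years: int) -> float:
    """Total cost of owning the cluster: paid whether or not it is used."""
    return UPFRONT_HARDWARE_COST + YEARLY_OPERATIONS_COST * years


def cloud_cost(cluster_hours_per_year: float, years: int) -> float:
    """Total pay-as-you-go cost: scales with the hours actually used."""
    return CLOUD_RATE_PER_CLUSTER_HOUR * cluster_hours_per_year * years


def break_even_hours_per_year(years: int) -> float:
    """Yearly usage at which both options cost the same over the period."""
    return on_premises_cost(years) / (CLOUD_RATE_PER_CLUSTER_HOUR * years)


if __name__ == "__main__":
    years = HARDWARE_LIFETIME_YEARS
    # Three usage patterns: occasional experiments, business hours, always on.
    for hours in (500, 2_000, 8_760):
        print(f"{hours:>5} h/year: on-premises {on_premises_cost(years):>10,.0f}"
              f"  cloud {cloud_cost(hours, years):>10,.0f}")
    print(f"break-even: {break_even_hours_per_year(years):,.0f} h/year")
```

Under these assumed numbers, the cloud option is cheaper unless the cluster runs for more than 7,500 hours a year, which is close to always on. The point of the sketch is the shape of the comparison; real prices and real usage will move the break-even point.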

One of the main reasons why companies are wary of cloud-based analytics services is a concern about data privacy and security. One of the few commentators who raised an argument for running Hadoop on premises cited security as one of the key topics. If unstructured data is born on premises and needs to stay there for regulatory, security and other reasons, keeping the Hadoop cluster in a private cloud behind a firewall makes sense.

However, most CrowdChat participants agreed that in general, the biggest risk to information security comes from poorly protected or improperly patched on-premises systems. Very few companies have the expertise and resources in house to achieve the same level of information security as specialist cloud providers.

One commentator wrote that Steven Sinofsky’s (@stevesi) blog, “Why Sony’s Breach Matters,” is one of the best wake-up call articles from an extremely knowledgeable source that explains how vulnerable on-premises software is to hackers. Another commentator agreed, saying the biggest data breaches tend to occur with on-premises solutions. Eventually the world will get its head around the fact that the cloud is actually safer.

As guidance for companies who are ready to adopt a cloud-based Hadoop platform, one commentator advised picking a cloud provider that you can trust. And know that they have large teams with many years of experience protecting data.

The Hadoop ecosystem includes an impressive range of extract, transform and load (ETL) platforms, message queues, stream processing engines and other tools for moving data in and out of a cluster. But companies needn’t be overwhelmed by all the choices.

Some participants suggested using tools such as Apache Flume, Apache Sqoop and IBM Streams, or copying data to an object store and then copying it from there to Hadoop and HDFS. However, others pointed out that much simpler options can work too, such as transferring the data over FTP, using almost any data movement or ETL utility, and even mailing a hard drive. One expert reported seeing ingestion tools, classic data replication tools that now support cloud destinations, vendor migration offerings and even an 18-wheel truck that delivers hardware for data copying.

The key takeaway is that getting data into a Hadoop cluster is not challenging or complex, even if that cluster is in the cloud. If you’re not confident setting up a data pipeline with Apache Kafka to stream data in real time, that lack of confidence should not be a barrier to entry—almost any kind of data transfer will work. The biggest barrier could even be economic rather than technical. One participant advised that the first step toward getting data into Hadoop in the cloud is to swipe your credit card.

Nobody really knows how large the data in any one organization is going to get, which is another strong argument for adopting a cloud-based Hadoop platform rather than building an on-premises cluster. One commentator noted that Hadoop in particular is designed to ingest unknown data sets. Up-front capacity planning is nearly impossible. Elasticity and scalability are top priorities.

If you build your own cluster, you take responsibility for managing and adding new nodes as the volume of data that you manage increases. And because many companies are only now starting to understand how much unstructured data they will need to analyze over the coming years, letting a cloud provider take care of the scalability problems is likely to be more cost-effective and convenient.

As one participant explained, we have no idea today how big our data lake will grow or how many different forms of data we will eventually have in it. Cloud computing enables us to grow as needed, where needed, far easier than an on-premises solution. So we can rightsize the solution. Cloud computing offers the flexibility needed in today’s unknown environment.
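To see why up-front capacity planning is so hard, consider the following minimal Python sketch. It sizes a cluster for several growth scenarios. The starting volume, growth rates, per-node capacity and planning horizon are all hypothetical values invented for this illustration, and the sizing rule (storage only, no replication or compute headroom) is a deliberate simplification.

```python
import math

# Hypothetical inputs for illustration only.
STARTING_DATA_TB = 50.0        # assumed data volume today
USABLE_TB_PER_NODE = 10.0      # assumed usable storage per cluster node
PLANNING_HORIZON_MONTHS = 36   # assumed planning period


def data_volume_tb(monthly_growth: float, months: int) -> float:
    """Projected volume after compounding monthly growth for the given months."""
    return STARTING_DATA_TB * (1.0 + monthly_growth) ** months


def nodes_needed(volume_tb: float) -> int:
    """Smallest whole number of nodes that can hold the volume."""
    return math.ceil(volume_tb / USABLE_TB_PER_NODE)


if __name__ == "__main__":
    # Three guesses at monthly growth; nobody knows which one is right.
    for growth in (0.02, 0.05, 0.10):
        final_tb = data_volume_tb(growth, PLANNING_HORIZON_MONTHS)
        print(f"{growth:.0%} monthly growth: {final_tb:,.0f} TB, "
              f"{nodes_needed(final_tb)} nodes")
```

With these made-up inputs, the node count at the end of three years ranges from roughly a dozen to well over a hundred, depending only on the growth guess. A cluster bought up front has to be sized for one of those guesses, whereas an elastic service lets the node count follow the actual curve.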

Hadoop and Spark are not the only components well suited for the analytics landscape. The best architecture is generally recognized as the one that has the agility to ingest and store any data, execute any analytics against it and deliver the results downstream to other platforms and applications in a hybrid architecture.

This concept has come to be known as a data lake. From a functional perspective, as one commentator put it, the data lake is the new single-family home for application development, business analysis, data engineering, data science—and they all live under one roof.

Critically, Hadoop as a service is going to be a key enabler as businesses begin building data lakes faster and more flexibly than ever before. According to one participant, data lakes are not new; we have done it for many years. However, the difference today is that Hadoop as a service helps improve the speed to create the data lake.

Several CrowdChat participants agreed that the rise of Spark has made it possible for data lakes to truly come of age because they can unlock the ability to query and analyze unstructured data at similar speeds to querying structured data. According to one participant, while Hadoop is important, Spark is crucial for data lake success. Data lake creation is also being driven by new use cases, such as the Internet of Things. One commentator made the connection explicit: Internet of Things makes an unstructured data lake mandatory, and Hadoop and Spark are at the core.

Building a data lake may be the ultimate long-term goal, but companies don’t need to wait until their data lake is in place before they start benefiting from Hadoop and Spark. As one participant put it, with today’s technology such as Spark, you do not even have to wait to get it into a data lake. Doing so might make things easier, or faster for exploration, but you do not need to wait for it. Spark doesn’t care what format the data is in or what language you use to get insight from the data. You can explore and find the value before you start storing the data. Another participant agreed, saying Spark doesn’t immediately need the data in a data lake to leverage it all together for deeper insights. A data lake might be good over the long term, but Spark starts you off right away.

Take a deeper dive into the conversation by reading the full CrowdChat transcript, or learn more about Hadoop and Spark and how IBM BigInsights on Cloud helps contribute to your data analytics strategy.