5 ways to define a Spark-enriched big data strategy

Vice President of Product, Platfora

Strategy. Who has time for it? When implementing a big data solution, many businesses skimp on the up-front strategic work. But without a clear-cut big data strategy, it is easy to fall into one of two major traps. The first is to treat big data as an end unto itself: “everybody else is doing it; we better do it too.” The second is to focus only on fixing a particular problem at hand without regard to larger business concerns or what comes next.

A coherent big data strategy

The truth is, many businesses that handle data—which is pretty much everybody these days—need a big data strategy. And many of those strategies should probably include Apache Spark. A coherent strategy can put big data in context and help your organization stay ahead of rapidly changing business and technological requirements. Consider five steps to help you get your big data strategy in place.

1. Begin with the business in mind

Any new technology you implement is ultimately for the pursuit of business objectives. So before saying “we need a big data solution to do this” or “Spark will allow us to do that,” slow down and ask two basic questions: 

  • What business results are we looking for?
  • How is our current infrastructure providing those results, and how is it falling short? 

Maybe you’re an online service provider looking to track sales and customer lifecycle. Or possibly you’re a retailer who needs to manage inventory and point-of-sale (POS) data. Perhaps you’re a logistics company with a rapidly growing customer base and fleet. Whatever the circumstances, begin your strategy by understanding and articulating your business needs. You can judge whether the strategy has worked only if you know from the beginning what it is trying to achieve.

2. Assess the current and future data landscape

You need a clear picture of the data you’re working with, where you will put it all and what you are going to do with it when you get it there. A critical driver for many new big data implementations is the need to bring together disparate data types from a number of sources to enable analysis that otherwise would not be possible. 

In addition, you need to think about what kinds of data you will be handling in the near term and midterm. Will you incorporate social media data to provide a complete picture of your customer base? Will you be adding sensor or telemetry data to enhance security or better manage your facilities? From the outset, major considerations exist around scalability, speed of access and data preparation or integration. These requirements only become more complex with each new data source and data type that you add to the mix.

3. Define use cases and analytical models

Having a feel for what the data is—and what it is going to be—is a great start. Now you need to get specific about what you are going to do with the data. Are you doing customer analytics? Security? Internet of Things? Real-time or streaming analysis? Maybe you need to do sophisticated segmentation of data or perform behavioral analysis. Or possibly you need to analyze event streams. Perhaps you need to perform sophisticated statistical modeling and predictive analysis. 

In addition, you have to give thought to who will use this analysis and how. Are you producing reports or feeding data into dashboards? Will open-ended exploration of the data take place? Will you be feeding analytical results into whole new applications? You need to know whether data scientists will be doing some of the work, what role business analysts and other business users will play, and how hands-on they need to be. And you have to think about how to answer these questions both today and in the future.

4. Define the solution stack to fully address those use cases

Once you have identified the business requirements, surveyed the data landscape and outlined the use cases, you can begin looking at the technology stack that will best address your organization’s needs. The reason Apache Hadoop has experienced such explosive growth in recent years is that it addresses two of the core data challenges that businesses face: 

  • The rapid and open-ended growth of data sets
  • The proliferation of data types and structures

A good place to begin defining the technology stack is with those needs in mind. Hadoop continues to gain momentum even as new technologies are challenging the established formula of the Hadoop Distributed File System (HDFS) and MapReduce. 

In particular, Spark brings a level of flexibility to big data environments that wasn’t previously available, and it enables a whole new workflow for big data analysis. The full technology stack will include these core technologies plus the specific analytics tools needed to achieve business goals. Assuming that getting those results requires simplifying data preparation, enabling highly robust data discovery and putting analytics capability into the hands of more business users without requiring programming skills, Spark is likely to be a part of the overall picture.

5. Implement strategy the smart way

While beginning with the end in mind is important, it is equally—and paradoxically—vital to keep in mind that no end exists. Your business needs, your data requirements and the technology landscape will continue to evolve rapidly and often in unexpected directions. Your strategy needs to enable you to implement big data solutions with the confidence that you’re addressing today’s needs while maintaining the flexibility to make course corrections as required. Just as Spark currently provides greater flexibility than the original Hadoop packages, future developments with Spark and new emerging technologies are expected to offer capabilities that aren’t on your radar right now, but may be next week—or even later today.

An intelligent, flexible strategy

A smart strategy for big data needs to include all of these steps. But to implement the strategy the smart way, you will need to be ready to revisit, reverse or restart that sequence at any time. And be sure to learn more about IBM Business Partner Platfora to take that next strategic step toward Spark-driven results. In addition, experience the power of cloud-based IBM Analytics for Spark.