How to get started with Apache Spark today

Big Data Evangelist, IBM

Indulging in one-size-fits-all thinking is never prudent. Where big data analytics is concerned, every tool has its sweet spot of applications for which it is well-suited. If you incorporate complementary tools into a hybridized, big data analytics tooling and platform strategy, you come out ahead.

Apache Spark is one of the more recent big data analytics approaches and, as discussed in a recent blog post, it shows great promise. At a high level, Spark’s sweet spot is any application that requires iterative, distributed, parallel execution of algorithms entirely in memory. Consequently, Spark is well-suited for low-latency applications, such as those that handle Internet of Things data, in which much or most of the analysis is performed on cached, live data rather than on stored, historical data.
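To make the "iterative, in memory" idea concrete, here is a minimal single-machine sketch in plain Python. It does not use Spark's API. The function name kmeans_1d and the sample readings are hypothetical, chosen only for illustration. The point it shows is that the data is materialized once and every pass of the algorithm reuses that cached copy, which is the access pattern Spark distributes across a cluster.

```python
def kmeans_1d(points, centers, max_iters=10):
    """Cluster 1-D points by repeatedly scanning one cached, in-memory copy."""
    cached = list(points)  # materialize once; each iteration reuses this copy
    for _ in range(max_iters):
        clusters = [[] for _ in centers]
        for p in cached:
            # Assign the point to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # no movement means the algorithm converged
            break
        centers = new_centers
    return centers


# Hypothetical sensor readings that fall into two obvious groups.
readings = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(readings, [1.0, 12.0])
print(centers)
```

If each pass had to reload the readings from disk, the cost of the loop would be dominated by I/O; keeping the working set in memory is what makes many passes cheap.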

But many other low-latency approaches, such as stream computing, address these use cases quite well, so you may be wondering exactly when you should use Spark rather than one of the alternatives. With that question in mind, consider some practical Spark applications that can deliver value to your business today.

Practical applications

For starters, consider Spark to be a power tool for data scientists working on advanced analytics projects that involve in-memory, streaming, graph and machine learning approaches. Specifically, Spark is well-suited for exploratory analytics by teams of data scientists using Apache Hadoop and other big data clusters as data lakes or data reservoirs for statistical modeling.

In this use case, data scientists can use Spark to rapidly model and simulate alternate scenarios, engage in free-form what-if analyses, and forecast alternative future states. They can engage in a schema-on-read environment, which frees them from needing to define data models up front, prior to statistical modeling and exploration.
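As a rough illustration of schema on read, the plain-Python sketch below stores records as raw JSON text and decides on a structure only when the data is read. It is not Spark code, and the names RAW_EVENTS, infer_schema and project, along with the sample records, are assumptions made up for this example.

```python
import json
from collections import defaultdict

# Raw records are stored as-is; no table definition exists up front.
RAW_EVENTS = [
    '{"device": "meter-1", "kwh": 3.2}',
    '{"device": "meter-2", "kwh": 4.1, "region": "west"}',
    '{"device": "meter-3", "temp_c": 21}',
]


def infer_schema(lines):
    """Scan the raw records and report which value types each field holds."""
    seen = defaultdict(set)
    for line in lines:
        for field, value in json.loads(line).items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def project(lines, fields):
    """Apply a schema chosen at query time; absent fields come back as None."""
    return [{f: rec.get(f) for f in fields} for rec in map(json.loads, lines)]


schema = infer_schema(RAW_EVENTS)
rows = project(RAW_EVENTS, ["device", "kwh"])
print(schema)
print(rows)
```

Because the structure is applied at read time, a different analysis can project a different set of fields from the same raw records without any reloading or migration.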

As data scientists explore the alternative approaches for doing schema on read, such as MapReduce modeling in Hadoop, they’ll need to clarify exactly which of their requirements would make Spark the best fit for the job. Recognizing when statistical modeling and exploration requirements align with the core capabilities of Spark is important.

Where Spark delivers the most business value depends on what you’re attempting to accomplish. If you’re a data scientist building models in Spark, you may want to build a starter in-memory analytics platform with 10–25 TB of priority core data. You can then scale it out over time as your investigations call for exploration of additional sources and modeling of more variables and scenarios. You may also want to include Spark’s streaming analytics capabilities if you’re building low-latency event-processing, mobile, Internet of Things and other applications that operate on live, in-motion data.

Spark’s graph analytics can be applied to anti-fraud, influence analysis, sentiment monitoring, market segmentation, engagement optimization and other applications in which complex patterns need to be identified rapidly. And Spark’s machine learning tools are fundamental for boosting data scientists' productivity by helping them uncover hidden patterns they may have otherwise overlooked.
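For a sense of what a graph approach to anti-fraud can look like, the sketch below groups accounts into connected components. It is plain Python with a union-find structure, not Spark's graph library, and the account IDs and the idea of linking accounts that share a contact detail are hypothetical assumptions for illustration.

```python
def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical links: two accounts are connected if they share a phone or address.
SHARED = [("acct-1", "acct-2"), ("acct-2", "acct-3"), ("acct-4", "acct-5")]
rings = connected_components(SHARED)
print(rings)
```

Accounts that land in the same component are linked, directly or through intermediaries, which is the kind of multiparty pattern that is hard to spot by examining records one at a time.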

Industry-specific use cases

Clearly, these scenarios describe many projects in social analytics, mobile analytics, Internet of Things data analytics and other leading-edge frontiers fueled by big data. As you consider the core Spark capabilities, you can begin to identify the industry-specific analytics use cases that are most relevant to your strategic initiatives. The following industry applications may be well-suited to Spark, given that many of them blend in-memory, streaming, graph and machine learning analytics:

  • Energy and utilities: Smart grid monitoring
  • Finance: Customer service and market-data analysis
  • Fraud prevention: Multiparty fraud and real-time fraud detection
  • Health and life sciences: Intensive care unit (ICU) monitoring and remote healthcare monitoring
  • Insurance: Call center optimization, cargo protection, fraud detection and telematics
  • Law enforcement, defense and cybersecurity: Cybersecurity detection, real-time surveillance and situational awareness
  • Manufacturing: Predictive maintenance
  • Media and entertainment: Ad optimization
  • Telecommunications: Call data record processing, churn prediction, geomapping and social data analysis
  • Transportation: Automotive telematics and intelligent traffic management 

Another place to spark your imagination (yes, pun intended) on candidate use cases is the “Powered By Spark” wiki page at the Spark website, which points to what early adopters in different industries are doing with Spark. The customer case study page at Databricks, one of the most prominent start-up solution providers in this growing market, is also a useful place to size up early adoption. And keep your mind open to new Spark possibilities that can benefit from distributed in-memory analytics. Spark’s core value is sparking imagination. Please let us know what disruptive innovations it can unleash in your organization.

Get started learning more about Spark today, and register for Spark Summit in San Francisco, California, June 15–17, 2015. Also check out IBM BigInsights 4.0, an enhanced solution with Spark support.