8 ways to turn data into value with Apache Spark machine learning

Chief Data Scientist, Analytics Services, IBM

Even as Apache Spark becomes increasingly easy to use, it is also becoming organizations’ go-to solution for executing big data computations. Not surprisingly, then, more companies than ever are adopting Spark.

Building an analytics operating system

When Databricks looked into 900 organizations’ use of Apache Spark in July 2016, an even clearer picture emerged. Spark played an essential role in building real-time streaming use cases for more than half (51%) of respondents, and 82% said the same when asked about advanced analytics. Similarly, use of Spark’s machine learning capabilities for production purposes jumped from 13% in 2015 to 18% in 2016.

Within the computing community, increasing numbers of corporations, IBM among them, have helped enhance the capabilities of Spark. In particular, IBM backs Spark as the “analytics operating system” and accordingly has become one of the top contributors to Spark 2.0.0, as well as one of the biggest contributors to Spark’s machine learning capabilities.

Data compiled by the IBM WW Competitive and Product Strategy Team.

In the wake of much favorable media attention paid to Spark, many corporations have adopted Spark on paper—or have at least downloaded it with an eye to future use. Yet only a fraction have actually used Spark, let alone implemented it as their core analytics platform.

Turning data into value through machine learning

In the modern business environment, implementation of any platform, Apache Spark or not, requires practical justification. Accordingly, the foundation for any serious Spark adoption is, as always, Spark’s power to turn data into value. Drawing on my consulting experience and my own research, I’ll share eight ways of using Spark’s machine learning capabilities to turn data into value.

1. Obtain a holistic view of business

In today's competitive world, many corporations work hard to gain a holistic, or 360-degree, view of their customers. In many cases they have not succeeded, partly because they lacked the capability to organize huge amounts of data and then analyze it. But Apache Spark’s ability to compute quickly while using data frames to organize huge amounts of data can help researchers quickly develop analytical models that provide a holistic view of the business, adding value to related business operations. To realize this value, however, an analytical process, from data cleaning to modeling, must still be completed.

2. Enhance fraud detection with timely updates

To avoid losing millions or even billions of dollars to the ever-changing fraudulent schemes that plague the modern financial landscape, banks need fraud detection models that can quickly incorporate new data and be updated accordingly. The machine learning capabilities offered by Apache Spark can help make this possible.

3. Use huge amounts of data to enhance risk scoring

For financial organizations, even tiny improvements to risk scoring can bring huge profits merely by avoiding defaults. In particular, adding data can heighten the accuracy of risk scoring, allowing financial institutions to predict defaults more reliably. Although adding data can be very challenging with traditional credit scoring methods, Apache Spark can simplify the risk scoring process.

4. Avoid customer churn by rethinking churn modeling

Losing customers means losing revenue. Not surprisingly, then, companies strive to detect potential customer churn through predictive modeling, allowing them to implement interventions aimed at retaining customers. This might sound easy, but it can actually be very complicated: Customers leave for reasons that are as divergent as the customers themselves are, and products and services can play an important, but hidden, role in all this. What’s more, merely building models to predict churn for different customer segments—and with regard to different products and services—isn’t enough; we must also design interventions, then select the intervention judged most likely to prevent a particular customer from departing. Yet even doing this requires the use of analytics to evaluate the results achieved—and, eventually, to select interventions from an analytical standpoint. Amid this morass of choices, Apache Spark’s distributed computing capabilities can help solve previously baffling problems.

5. Develop meaningful purchase recommendations

Recommendations for purchases of products and services can be very powerful when made appropriately, and they have become expected features of e-commerce platforms, with many customers relying on recommendations to guide their purchases. Yet developing recommendations at all means developing recommendations for each customer—or, at the very least, for small segments of customers. Apache Spark can make this possible by offering the distributed computing and streaming analytics capabilities that have become invaluable tools for this purpose.

6. Drive learning by avoiding student attrition and personalizing learning

Big data is no longer solely the province of business—it has come to play a central role in education, particularly as universities seek to combat student churn, including by providing personalized education. In the modern educational environment, a combination of Apache Spark–based student churn modeling and recommendation systems can add significant value, both material and nonmaterial, to educational institutions.

7. Help cities make data-driven decisions

Pursuant to laws and regulations enacted at various levels of government, US cities are increasingly making their collected data publicly available—the portal is a well-known example. Certainly, as seen in New York, the open data thus disseminated is an important enabler of data-driven decision making at the municipal level. But US cities are only just beginning to generate value in this way, partly because of the difficulties of organizing this mass of data in easily used forms and the challenge of applying suitable predictive models. However, as we’ve already observed in open data meetups, including an IBM-sponsored meetup in Glendale, Apache Spark and other open-source tools, such as R, are indeed helping municipalities derive increasing value from open data.

8. Produce suitable customer segmentations using telecommunications data

Many giant telecommunications companies, in the United States as well as around the world, have collected huge amounts of data, some of which they make available to their partners and customers. But using this data to create value often remains a significant challenge: The data is stored using special formats and chiefly comprises text, not numeric, information—and that’s apart from any special data issues that may arise, including those involving missing cases or missing content. Fortunately, Apache Spark, when used together with R and IBM SPSS, can help companies work effectively with special data formats while handling special data issues and providing modeling algorithms suited for work with both numbers and text—bringing software solutions together to offer additional ways of creating value.

For more information about these ways of using Apache Spark, including detailed plans of action, check out my book Apache Spark Machine Learning Blueprints, available on Amazon.

Reflecting IBM’s commitment to Apache Spark, Spark’s machine learning capabilities will be a main focus at the IBM Insight at World of Watson 2016 conference, scheduled for 24–27 October in Las Vegas. I’ll be there with my colleagues and hope to see you. Look out for me at select events and in the IBM bookstore, where you can meet me at one of my book signings.