Machine learning and the Spark of accelerated insight

Big Data Evangelist, IBM

Big data is too huge, complex and dynamic for humans to sift through unassisted. For that reason, many data scientists are embracing machine-learning tools such as those built into Apache Spark, Apache Hadoop and other big data analytics platforms.

Without machine learning algorithms that can dynamically respond to myriad concurrent data streams, the human race risks drowning in its own big data. For data scientists, machine learning is a productivity tool par excellence. Data scientists use it to infer, predict, search, sift, sort and otherwise make sense of the growing volumes, varieties and velocities of data. Machine learning helps them to build, iterate and continually refine advanced statistical models to sense and predict dynamic patterns at lightning speed. These patterns include real-time phenomena such as terrorist activity, weather conditions, traffic congestion, energy grid utilization and the stock market.

As with any advanced analytics tool, machine learning helps data scientists to discover anomalies, correlations, trends and other patterns in big data. What distinguishes machine learning from other analytics approaches is its ability to learn automatically and directly from the data, either entirely unsupervised or with a modicum of up-front training and preparation. It allows data scientists to train a model on an example data set, and then leverage algorithms that automatically generalize and learn both from that example and from fresh feeds of data. This learning can proceed without constant human intervention and without explicit programming. This automation capability is fundamental to machine learning’s value in a world where big data just keeps pushing into higher volumes, more heterogeneous varieties and faster velocities than ever. Scaling big data analytics applications is expected to become infeasible unless the industry can achieve continued critical advances in machine learning by leveraging in-memory processing, massive parallelism, federated clouds, quantum computing and other frontiers. These advances enable data scientists to auto-generate more models from more data more rapidly than ever before, and hopefully iterate more swiftly in search of fundamental patterns in the data.
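To make the idea of training on an example data set and then continuing to learn from fresh feeds more concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from Spark or any other platform named in this post: the `OnlineLinearModel` class, the learning rate and the made-up data are all assumptions chosen to keep the example small.

```python
class OnlineLinearModel:
    """Hypothetical one-feature linear model trained by stochastic gradient descent."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, x, y):
        # One gradient step on the squared error for a single observation.
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error

    def fit(self, samples, epochs=300):
        # Up-front training: repeated passes over a small example data set.
        for _ in range(epochs):
            for x, y in samples:
                self.update(x, y)


# Illustrative example data set following y = 2x + 1 (made-up values).
training_set = [(i / 4, 2 * (i / 4) + 1) for i in range(5)]

model = OnlineLinearModel()
model.fit(training_set)
prediction_before = model.predict(1.0)

# Simulated fresh feed in which the pattern has drifted to y = 3x + 1.
# The model adjusts one observation at a time, with no retraining step.
for step in range(2000):
    x = (step % 5) / 4
    model.update(x, 3 * x + 1)
prediction_after = model.predict(1.0)

print(round(prediction_before, 2), round(prediction_after, 2))
```

The point of the sketch is that nobody reprograms the model when the pattern shifts; the same update rule absorbs the new data. Production systems would add validation, monitoring and human review on top of a loop like this.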

To varying degrees, you’ll see the terms unsupervised learning, deep learning, computational learning, cognitive computing, machine perception, pattern recognition and artificial intelligence used in this same general context as machine learning. Unsupervised learning is the practice that most analysts think of when the discussion turns to machine learning. However, automated systems can’t do it all without some supervision, training and feedback from data scientists and—through crowdsourcing—ordinary people exercising human judgment on the analyses being algorithmically distilled. In many scenarios, speed-of-thought human judgment is needed to distinguish patterns, trends and events that even the most advanced machine learning algorithms might miss.

Most branches of machine learning require a tight, iterative engagement of humans and automated systems. Iteration through data-driven machine learning models often requires the expert guidance of data scientists in league with subject-matter experts (SMEs). These are the people who provide the übercontext associated with the problem space you’re exploring and whose judgment on the validity of the underlying model variables and relationships is essential.

Ideally, the latest and greatest machine-learning algorithms would be cloud services that you access as needed and on demand. Data scientists, and those developers who simply require the full algorithmic fruits of data science, usually prefer to execute machine learning models quickly so that they can show results sooner. If asked, they’d probably say they prefer to have all the principal machine-learning algorithms, models, execution platforms and application programming interfaces (APIs) available to them as needed, rather than having to assemble them from scratch and manage them in-house.

Availability of machine learning models thrives on an open environment, and the open source provenance of Spark is one of its great strengths in this regard. Spark’s rich machine learning library (MLlib) is fundamental to its productivity-boosting value to data scientists. MLlib consists of common learning algorithms, utilities and APIs in support of classification, collaborative filtering, dimensionality reduction, clustering, regression and other analytics functions.
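For readers who have not met these function families, the following plain-Python sketch shows what one of them, clustering, does in principle. It is a bare-bones k-means written for illustration and does not use or reproduce MLlib's API; the `kmeans` helper, its naive initialization from the first k points and the sample points are all assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Hypothetical minimal k-means; initializes centroids from the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Made-up two-dimensional points forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

A shared library spares each team from writing and debugging routines like this one, and it runs them in parallel across a cluster rather than in a single loop on one machine.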

As stated in a recent Big Data & Analytics Hub blog post, openness accelerates collaboration, innovation, reuse and sharing wherever it’s found, whether it is in data science teams or across entire industry ecosystems. A key part of the value of MLlib is that it’s open source. Consequently, it provides a common algorithmic resource available to data scientists working on any platform that incorporates these code bases. Open, shared, machine-learning libraries such as MLlib—or Mahout, which is in the Hadoop open data platform—can accelerate insights uncovered by teams of data scientists far beyond what each data scientist could distill separately using preferred, siloed machine-learning libraries.

Learn more about Spark and the open data platform by joining IBM and its partners at Spark Summit, June 15–17, 2015 in San Francisco, California.

Join fellow data scientists June 15, 2015 at Galvanize, San Francisco for a Spark community event. Hear how IBM and Spark are changing data science and propelling the insight economy. Sign up to attend in person, or watch the livestream and sign up for a reminder to receive notification on the day of the event.