Learning to fly: How to predict flight delays using Spark MLlib

Architect, IBM Cloud Data Services, IBM

If you’ve ever found yourself stranded at an airport, either on a tarmac or in a terminal, you’re not alone. Delays cost airline passengers about 115 million minutes of travel time annually. Very often, flights are delayed because of poor weather conditions that impact flight routes.

Avoiding surprises

What if travelers could use a flight predictor application to learn in advance whether their scheduled flight is likely to be delayed because of weather? In the past, building such an application to test this hypothesis would have been an enormous project requiring time, infrastructure and the right skill set, all of which can be quite costly.

The rise of cloud computing, however, is changing the status quo by dramatically reducing all three cost factors: time, infrastructure and skills. For example, by leveraging the IBM Cloud ecosystem of data services, I was able to build the flight predictor application in less than two weeks. And the results surpassed my expectations.

Getting the details

Interested in learning more about how cloud computing empowers the many to do things that have traditionally been reserved for the elite few? Then attend Strata + Hadoop World, 29–31 March 2016, in San Jose, California. I’ll be there, presenting the details behind the design of the flight predictor application, including the get-build-analyze methodology it follows.

The session also explores the flight predictor architecture and covers how to complete the following tasks:

  • Gather data from a variety of sources, and get it into the cloud. Learn how publicly available flight data can be cleansed and enriched with weather data from the IBM Weather service.
  • Organize the data into three sets—training, test and blind—and store them in a NoSQL operational data store (ODS) using the IBM Cloudant database as a service (DBaaS).
  • Use the IBM Analytics for Apache Spark managed service, with interactive Jupyter (IPython) notebooks and Spark MLlib, to build models that predict whether a flight will be late because of bad weather.
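
To make the first two tasks concrete, here is a minimal sketch in plain Python of enriching flight records with weather observations and then cutting the result into training, test and blind sets. It is not the application's actual code: the record fields, the synthetic data, the 60/20/20 weights and the helper names `enrich` and `three_way_split` are all assumptions made for illustration.

```python
import random


def enrich(flights, weather):
    """Attach weather features to each flight, keyed by (airport, date).

    Flights with no matching weather observation are dropped, which stands
    in for the cleansing step.
    """
    enriched = []
    for flight in flights:
        observation = weather.get((flight["airport"], flight["date"]))
        if observation is None:
            continue  # cannot be enriched, so leave it out
        enriched.append({**flight, **observation})
    return enriched


def three_way_split(records, weights=(0.6, 0.2, 0.2), seed=42):
    """Shuffle a copy of the records and cut it into three sets.

    The weights are an assumed 60/20/20 split; the seed makes the shuffle
    repeatable so the same records always land in the same set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    first_cut = round(n * weights[0])
    second_cut = first_cut + round(n * weights[1])
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


# Synthetic, made-up records purely for illustration (not real flight or weather data).
flights = [
    {"flight": f"XX{100 + i}", "airport": "BOS", "date": f"2016-01-{i + 1:02d}", "delay_min": 5 * i}
    for i in range(12)
]
# Weather exists for the first 10 days only, so the last 2 flights are dropped.
weather = {
    ("BOS", f"2016-01-{i + 1:02d}"): {"wind_mph": 3 * i, "visibility_mi": 10 - i}
    for i in range(10)
}

enriched = enrich(flights, weather)
training_set, test_set, blind_set = three_way_split(enriched)
print(len(training_set), len(test_set), len(blind_set))
```

The blind set is kept apart from model building so it can serve as a final check on data the model has never seen. On Spark itself, the `randomSplit` method on an RDD or DataFrame performs a comparable weighted random split; whether the flight predictor uses it is not stated here.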

Whether you are a data scientist or a data engineer, you can directly apply these patterns to your use cases. Moreover, you won’t have to start from scratch because the source code for this application is available on GitHub. Just fork the project, and start experimenting right away. And be sure to register for Strata + Hadoop World.

Engaging with innovation

If you have any questions or just want to chat, I’ll also be at the IBM Cognitive Studio at SXSWi, demonstrating a game of Rock, Paper, Scissors played by Marvin the Robot. This demonstration features another fun application, this one powered by IBM Analytics for Apache Spark, Message Hub (built on Apache Kafka) and Cloudant.

Whether you want to discuss cloud architecture or simply play a game of Rock, Paper, Scissors with Marvin (and have your picture taken), just drop by the IBM booth. I look forward to meeting as many people as possible. In addition, read this paper, Using a predictive analytics model to foresee flight delays, which describes how data scientists and developers can build an application to predict flight delays using a get-build-analyze methodology and IBM Analytics for Apache Spark, a managed Apache Spark service with interactive Jupyter notebooks.