Predicting Relationships between Social Signals and Box Office Sales

Senior Content Marketing Manager, Communications Sector, IBM Analytics

In my last post, we explored how audience data sources from inside and outside of the media organization can be “unified and utilized” for game-changing applications such as demand forecasting for Opening Weekend Box Office (OWBO) using big data analytics.

Our results showed that IBM achieved high levels of model fit and forecast accuracy up to 8 weeks out, when marketing campaigns can still be changed to suit audience needs. In fact, our approach resulted in the highest prediction accuracy vs. current industry benchmarks including, but not limited to, major studios’ internal predictions, the LA Times and more.

Case in point: The IBM model gave the most accurate prediction compared to various industry tracking sources for 7 out of 10 summer 2013 releases.

Throughout this process, we learned that big budget action films are the most accurately predicted of the film genres. Our model predicted XL and L movies very accurately because of the enormous size of the audience, fan engagement and buzz. In short, there was a tremendous amount of relevant data to efficiently analyze.

Conversely, some S and M size movies had 50+% prediction errors. Why? Simple: the smaller they are, the harder they are to predict. But that’s not the only challenge! Fall and summer releases are predicted more accurately than spring, winter and holiday releases, when the buzz is significantly reduced.

We also learned that “intent to watch” extracted from social buzz does not necessarily equate to positive sentiment. The trends simply do not line up. Tracking the percent of Audience Intent by week for different movies could enable better prediction of a movie’s relative performance. We do, however, see that a movie’s net sentiment polarity is correlated to its profitability. For example, a big budget (XL) action movie with a high degree of normalized net sentiment was relatively easy to predict accurately for profitability. But a live action kids’ movie is much more difficult. Why? Simple: kids don’t tweet. As such, a movie like that might have a large amount of negative sentiment but still perform very well at the box office. How so? Consider all the parents that tweet out something along the lines of “OMG, I have to go see ______ with my kids AGAIN? This is torture! #terriblemovies”. They might dread the experience, but chances are the kids loved the movie and will want to see it again and again.
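To make the distinction between intent and sentiment concrete, here is a minimal sketch in Python. The formulas are assumptions for illustration only: the post does not define how net sentiment is normalized or how Audience Intent is computed, so this uses one common convention, (positive - negative) / (positive + negative), and a simple share of mentions. The function names and the counts are hypothetical.

```python
def net_sentiment(positive, negative):
    """Normalized net sentiment in [-1, 1] (assumed formula, not IBM's)."""
    total = positive + negative
    if total == 0:
        return 0.0  # no opinionated mentions, so treat polarity as neutral
    return (positive - negative) / total


def intent_share(intent_mentions, total_mentions):
    """Fraction of mentions that express intent to watch (assumed formula)."""
    if total_mentions == 0:
        return 0.0
    return intent_mentions / total_mentions


# Hypothetical weekly counts for a family film: parents complain,
# yet many of the same posts say they are going to the theater anyway.
family_sentiment = net_sentiment(positive=300, negative=700)
family_intent = intent_share(intent_mentions=400, total_mentions=1000)

print(family_sentiment, family_intent)
```

With these made-up counts the sentiment score is negative while the intent share is still substantial, which is the pattern described above for movies whose real audience is not the one doing the tweeting.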

Along these lines, horror movies also taught us an important lesson. In several cases, we over-predicted their OWBO sales due to the high degree of buzz. In this genre, sentiment didn’t translate well because, as we learned, horror movies tend to have smaller audiences. Fans might talk a lot about their favorite genre, but they don’t carry the sales that we see with other, larger genres such as action and animation.

Since predictive modeling is an iterative process, our next step is to improve forecast accuracy. For example, we hypothesized that adding YouTube variable data could improve prediction accuracy. Our forecast accuracy without official trailer data was at or around 72%. Once we added YouTube Trailer Data (number of views for the top-viewed trailer for each movie) to a subset of 74 movies, the predictive accuracy improved by 13%. We could potentially improve that accuracy even further by adding in YouTube Trailer Data from unofficial movie trailers such as this fan-made trailer, which was created to show that Ben Affleck really does have some potential to play Batman in the next Man of Steel film. (I encourage you to debate that possibility in the comments below.)
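A 13% improvement on a 72% baseline can be read two ways, and the figures above do not say which one applies. The short Python sketch below simply works out both readings; the function name is hypothetical and the result is arithmetic on the quoted numbers, not a reported outcome.

```python
def improved_accuracy(baseline_pct, gain_pct):
    """Return (absolute, relative) readings of a percentage improvement."""
    # Reading 1: the gain is in percentage points added to the baseline.
    absolute = baseline_pct + gain_pct
    # Reading 2: the gain is a relative increase over the baseline.
    relative = baseline_pct * (1 + gain_pct / 100)
    return absolute, round(relative, 2)


absolute, relative = improved_accuracy(72.0, 13.0)
print(absolute, relative)
```

Under the percentage-point reading the accuracy would be about 85%; under the relative reading it would be a little over 81%.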

What does it take to perform these advanced analytics?

IBM’s technical approach is to extract and integrate movie audience behaviors and then build a predictive model to represent a target outcome. A key use case might be mapping differences in sentiment across geographical regions so movie studios can enable location-specific marketing campaigns or shift budget to areas that need more attention from marketing to improve “intent to watch.”

We start by extracting massive quantities of intent and sentiment from social data using IBM InfoSphere Streams and then loading and cleansing the data into tables for analysis. This typically occurs within IBM PureData for Analytics. Next, we use IBM SPSS Automated Data Preparation to identify the most important variables and transform them to improve model accuracy. Finally, IBM SPSS Auto Classifier builds the model for Per Theater Average Prediction, composed of the average of the top 3 most accurate predictive algorithms, resulting in improved accuracy overall. We also use IBM InfoSphere Data Explorer to drastically improve data query and visualization capabilities.
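The final step, averaging the top 3 most accurate algorithms, can be illustrated outside of SPSS with a few lines of Python. This is only a sketch of the general idea, not the SPSS implementation: the model names, validation errors and per-theater-average predictions below are all hypothetical, and "most accurate" is assumed to mean lowest validation error.

```python
from statistics import mean


def top_k_ensemble(validation_errors, predictions, k=3):
    """Average the predictions of the k models with the lowest validation error."""
    # Rank model names by validation error, lowest (most accurate) first.
    ranked = sorted(validation_errors, key=validation_errors.get)
    chosen = ranked[:k]
    return chosen, mean(predictions[name] for name in chosen)


# Hypothetical validation errors (lower is better) for four candidate models.
validation_errors = {"linear": 0.30, "tree": 0.22, "neural_net": 0.25, "knn": 0.41}
# Hypothetical per-theater-average predictions, in dollars, for one movie.
predictions = {"linear": 9000.0, "tree": 12000.0, "neural_net": 10500.0, "knn": 6000.0}

chosen, ensemble_pta = top_k_ensemble(validation_errors, predictions)
print(chosen, ensemble_pta)
```

Averaging several strong models tends to smooth out the individual errors of any one of them, which is one plausible reason the combined prediction would be more accurate overall.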

With IBM’s Big Data and Analytics Platform in place, profiling a content franchise's audience enables business teams to more precisely target their efforts.

Coming soon, we’ll examine the Movie Marketing Solution Roadmap. We’ll also work to understand the additional measures required for evaluating predictive results, especially as they pertain to the importance of continuing to identify and combine multiple internal and external data sources.

If you’d like to make a fan-made trailer for my next post, I recommend casting Anthony Michael Hall as me.