The Data Story Behind "Keys to the Match"
It seems to me that we just finished up this summer’s Wimbledon tournament and now the US Open is already upon us.
There are a number of differences between the two tournaments, and the most obvious is the playing surface: the US Open is played on hardcourt, while Wimbledon is played on grass. This affects the way the players approach the game. On average, each point played at Wimbledon this summer consisted of just 3.0 strokes, while last year’s US Open averaged 4.0 strokes per point. Extrapolated to the entire tournament for the Men’s and Women’s singles (254 matches in total), this means that Wimbledon had 135,500 strokes and the US Open had 51,500 more, for a total of 187,000. This is a significant difference by any measure.
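As a rough check on the scale of these numbers, the short sketch below works backward from the stroke totals quoted above. The per-point averages are rounded in the text, so the implied point counts are approximate, and the variable names are only illustrative.

```python
# Figures quoted in the text above; the per-point averages are rounded.
WIMBLEDON_STROKES = 135_500
US_OPEN_EXTRA_STROKES = 51_500
WIMBLEDON_STROKES_PER_POINT = 3.0
US_OPEN_STROKES_PER_POINT = 4.0
MATCHES = 254  # Men's and Women's singles combined

us_open_strokes = WIMBLEDON_STROKES + US_OPEN_EXTRA_STROKES

# Implied number of points: total strokes divided by strokes per point.
wimbledon_points = WIMBLEDON_STROKES / WIMBLEDON_STROKES_PER_POINT
us_open_points = us_open_strokes / US_OPEN_STROKES_PER_POINT

# How many more strokes the US Open had, relative to Wimbledon.
relative_increase = US_OPEN_EXTRA_STROKES / WIMBLEDON_STROKES

print(f"US Open strokes: {us_open_strokes:,}")
print(f"Implied points: Wimbledon {wimbledon_points:,.0f}, US Open {us_open_points:,.0f}")
print(f"Strokes per match: Wimbledon {WIMBLEDON_STROKES / MATCHES:,.0f}, "
      f"US Open {us_open_strokes / MATCHES:,.0f}")
print(f"Relative increase in strokes: {relative_increase:.0%}")
```

The two tournaments come out with a similar implied number of points, which suggests the extra strokes come mainly from longer points rather than from more points.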
In case you were wondering, the number of strokes in each point is counted during the US Open and the other three Grand Slam tournaments: Australian Open, Roland-Garros and Wimbledon. Each tournament keeps track of the numbers. The chair umpire keeps score, a radar sensor measures the service speed, and courtside statisticians count the number of strokes in each point, volleys and baseline strokes, winners and unforced errors. All of this data is streamed directly to the IBM data center and analyzed.
One of the applications that uses this information in real time is IBM SlamTracker, which shows the current score for each match being played and includes the “Keys to the Match.”
For the “Keys to the Match,” the number of strokes in each point is just one of the indicators used, and the analysis often picks out a player’s winning percentage on points of a particular length. Generally, we see that the very short points with just a serve, a return, and a follow-up shot (3 or fewer strokes) are the most important points at Wimbledon, while the slightly longer points with between 4 and 9 shots are more important at the US Open. The likely cause is the one mentioned above: fewer strokes are played per point at Wimbledon, so there are more short points, and winning them matters more.
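To make the idea of a winning percentage by point length concrete, here is a minimal sketch in Python. The two shorter buckets follow the ranges mentioned above (3 or fewer strokes, and 4 to 9); the "long" bucket, the function names, and the sample points are assumptions made for illustration and are not taken from IBM's system.

```python
from collections import defaultdict

def rally_bucket(strokes: int) -> str:
    """Map a stroke count to a point-length bucket."""
    if strokes <= 3:
        return "short (3 or fewer)"
    if strokes <= 9:
        return "medium (4-9)"
    return "long (10+)"  # assumed third bucket, not named in the text

def win_percentage_by_bucket(points, player):
    """points is an iterable of (strokes, winner) pairs for one match."""
    won = defaultdict(int)
    played = defaultdict(int)
    for strokes, winner in points:
        bucket = rally_bucket(strokes)
        played[bucket] += 1
        if winner == player:
            won[bucket] += 1
    return {bucket: 100.0 * won[bucket] / played[bucket] for bucket in played}

# Hypothetical points between players "A" and "B".
sample_points = [(1, "A"), (3, "B"), (2, "A"), (5, "A"),
                 (7, "B"), (4, "B"), (9, "A"), (12, "B")]

percentages = win_percentage_by_bucket(sample_points, "A")
for bucket, pct in percentages.items():
    print(f"{bucket}: {pct:.1f}% of points won by A")
```

A key such as "win more than some share of the short points" could then be checked against the corresponding bucket as the match progresses.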
The difference in the number of strokes also affects the length of the matches. The average match at the US Open lasts 2 hours 0 minutes 29 seconds, while the average at Wimbledon is just 1 hour 57 minutes 9 seconds. That is a difference of only 3 minutes 20 seconds, but it is statistically significant.
The surprising fact, however, is that this difference in match length comes down almost entirely to the Gentlemen’s Singles. The Gentlemen play for 2:43:51 at Wimbledon but for 2:50:29 at the US Open, a difference of a full 6:38. The Ladies, on the other hand, play for 1:31:33 at Wimbledon and for 1:31:49 at the US Open, a statistically insignificant difference of 16 seconds for an average match. The Ladies do have fewer sets over which to accumulate a difference, though, as they play best of 3 sets compared to the Gentlemen’s best of 5. When adjusted for the number of sets played, we find that both the Ladies and the Gentlemen play longer sets at the US Open than at Wimbledon, but the Ladies make up for it by playing fewer sets on average at the US Open, which brings the match duration down to the same level as at Wimbledon. The Gentlemen play the same number of sets in both tournaments.
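The duration differences quoted above can be reproduced with a few lines of Python. This sketch takes the overall US Open average as 2:00:29, the value consistent with the stated difference of 3 minutes 20 seconds; the helper name hms is only illustrative.

```python
from datetime import timedelta

def hms(text: str) -> timedelta:
    """Parse an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Average match durations quoted in the text, as (Wimbledon, US Open).
averages = {
    "All singles": (hms("1:57:09"), hms("2:00:29")),
    "Gentlemen": (hms("2:43:51"), hms("2:50:29")),
    "Ladies": (hms("1:31:33"), hms("1:31:49")),
}

differences = {
    name: us_open - wimbledon
    for name, (wimbledon, us_open) in averages.items()
}

for name, diff in differences.items():
    print(f"{name}: US Open matches run {diff} longer on average")
```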
The “Keys to the Match” uses these statistics and many more, and it incorporates the inherent differences between the four tournaments to identify three key actions players can take on the court to enhance their chances of winning.
The “Keys to the Match” also takes into account the player and the specific opponent to identify the three keys. If the two players have met each other many times before, finding data from these previous matches is simple enough, but if they have only met a few times or not at all, it gets more difficult to identify comparable matches from which to derive the keys. This led to a need to identify players who would be considered similar or comparable, so that we can draw on the data from those matches in addition to any previous head-to-head matches.
Tennis has a number of reasonably well-defined styles of play, such as serve-and-volleyers, aggressive baseliners, and counterpunchers, to name a few. However, combing through eight years of matches to assign the right style of play to each of the more than 800 players who have appeared in at least one match was not a viable option, and each player’s style of play would potentially have to be reassessed on a regular basis, possibly even after each match. The solution was to utilize the data itself and the statistical techniques available in IBM’s analytical software.
For each player, we generated a profile based on the statistics from his or her previous matches in each of the four tournaments. The profile includes, among a number of other statistics, the average number of shots per point. Based on the profile, we then trained a clustering model to find groups of players with similar profiles and compared the results to the known styles of play. While there is not a perfect match between those styles and the model-based profile groups, there is much overlap between the two. We also have different model-based styles of play for gentlemen and for ladies.
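The text does not say which clustering algorithm was used, so the sketch below uses plain k-means purely as a stand-in to show the general idea. The player names, the second statistic (share of points finished at the net), and the choice of two clusters are all hypothetical.

```python
import math

def standardize(rows):
    """Scale each statistic to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds))
            for row in rows]

def kmeans(profiles, k, iterations=20):
    """Plain k-means; the first k profiles seed the centroids (deterministic)."""
    centroids = [tuple(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Hypothetical profiles: (average shots per point, share of points finished at the net).
players = {
    "Player A": (3.1, 0.40), "Player B": (3.3, 0.35), "Player C": (5.2, 0.10),
    "Player D": (5.6, 0.08), "Player E": (3.0, 0.38), "Player F": (5.4, 0.12),
}

labels, _ = kmeans(standardize(list(players.values())), k=2)
for name, label in zip(players, labels):
    print(f"{name}: style group {label}")
```

In this toy data the players with short points and frequent net play end up in one group and the long-rally players in the other, which is the kind of grouping one would then compare against the known styles of play.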
The advantage of having a model-based algorithm for the styles of play is that we can now reassign any player to a different style of play at any point in a tournament if the data warrants a reassignment. In practice, we utilize the data from previous matches to assign a player to the most suitable style of play prior to each match. For established players, such as Roger Federer and Serena Williams, it is unlikely that they will be assigned to different styles from one round to the next, but for players who are new to the Grand Slam tournaments or players who are still developing their game, it is a great advantage to have the ability to redo the assignment after each match as we get more data about their style of play.
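One way to picture the reassignment before each match is as a nearest-centroid lookup on the player's profile to date. The sketch below assumes that approach; the style names, centroid values, and match statistics are hypothetical, and a real profile would have more statistics and would need to be scaled before distances are compared.

```python
import math

# Hypothetical style centroids:
# (average shots per point, share of points finished at the net).
STYLE_CENTROIDS = {
    "short-point attacker": (3.1, 0.38),
    "baseline rallyer": (5.4, 0.10),
}

def profile_from_matches(matches):
    """Average each statistic over the player's matches so far."""
    return tuple(sum(col) / len(matches) for col in zip(*matches))

def assign_style(matches):
    """Return the style whose centroid is closest to the current profile."""
    profile = profile_from_matches(matches)
    return min(STYLE_CENTROIDS,
               key=lambda style: math.dist(profile, STYLE_CENTROIDS[style]))

# A hypothetical newcomer whose rallies get longer from match to match.
newcomer_matches = [(3.4, 0.30), (4.6, 0.18), (5.8, 0.09), (5.9, 0.08)]

# Redo the assignment before each new match, using all matches played so far.
history = [assign_style(newcomer_matches[:n])
           for n in range(1, len(newcomer_matches) + 1)]
for n, style in enumerate(history, start=1):
    print(f"after match {n}: {style}")
```

For an established player with years of data, one more match barely moves the averaged profile, which matches the observation that their assignment is unlikely to change from round to round.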
Finally, we combine the players’ style of play with their seeding and their results in previous Grand Slam tournaments. If two players have the same model-based style of play and the same level of results, we consider them to be similar or comparable for the purposes of the “Keys to the Match.” This allows us to utilize data from multiple opponents who are similar to the one that a player is facing to find the keys for the player in that match. Using both the style of play and the results ensures that we do not consider Novak Djokovic (currently ranked 1) and David Goffin (currently ranked 72) to be similar players, even though the clustering model places them in the same style of play. Both players are assigned to a style whose players generally have strong serves, but not to the extent that they rely on the serve to win points, and have a good returning game that allows them to play very few points of 3 strokes or fewer when they are the returning player. While they may exhibit similar styles of play, their execution is, for the time being, at very different levels.
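The comparability rule described above (same style and same level of results) can be sketched as a simple filter. The style labels, the results_tier buckets, and the two unnamed players are hypothetical; the text only tells us that Djokovic and Goffin share a style while their results differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Player:
    name: str
    style: str          # model-based style label (hypothetical values)
    results_tier: int   # hypothetical bucket for seeding and past results

def comparable_players(opponent, pool):
    """Players in the pool with the same style and results tier as the opponent."""
    return [p for p in pool
            if p != opponent
            and p.style == opponent.style
            and p.results_tier == opponent.results_tier]

# Illustrative tiers only: 1 stands for top results, 4 for a much lower level.
djokovic = Player("Novak Djokovic", style="style-1", results_tier=1)
goffin = Player("David Goffin", style="style-1", results_tier=4)
player_x = Player("Player X", style="style-1", results_tier=1)
player_y = Player("Player Y", style="style-2", results_tier=1)
pool = [djokovic, goffin, player_x, player_y]

similar = comparable_players(djokovic, pool)
print([p.name for p in similar])
```

Matches against everyone returned by the filter could then be pooled with any head-to-head matches when deriving the keys for a player facing that opponent.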
The “Keys to the Match” does not predict or pick the winner of a match or even of a set. The objective is to pinpoint three performance indicators and track these to see how each player is measuring up against previous performances and the performances of comparable players.
"Data is a Game Changer"
Visit the "Data is a Game Changer" site to learn more about how analytics, cloud and mobile are playing a major role in this year's US Open. Be sure to follow IBM Sports on Twitter and Instagram - and join in the conversation using hashtag #ibmsports