How AI picks the most exciting moments at Wimbledon without bias
Note: This blog post was authored by Aaron Baughman with Stephen Hammer, Eythan Holladay, Eduardo Morales and Gary Reiss.
Wimbledon is one of the most prestigious major events in the world. With over 675 matches played and over 147,000 tennis points played, its size and scale are substantial. In fact, even if fans diligently watch their favorite players, they will miss a high proportion of the played points. Wimbledon uses IBM digital and AI capabilities to provide rapid access to match highlights to serve up the best content to fans.
The AI system clips and creates candidate highlight videos across matches on 10 courts, and assigns a fair excitement score, all within two minutes of a match completing. For the big picture of how it works, see part 1 of this blog post series. Part 2 of 2, below, takes a closer look at how AI picks the most exciting moments without bias.
The AI Highlights system at Wimbledon uses several deep learning and machine learning techniques to determine the excitement level of a video. Each video is split into its video and sound components. The sound is converted into the MP3 format and placed on a disk store. A Python process picks up the MP3 and sends the content into a Convolutional Neural Network (CNN) called SoundNet with the PyTorch library. The last layer of the CNN is removed to retrieve the spatial representation of the sound. The feature vector is input into a Support Vector Machine (SVM) that was trained on the domain of tennis. Two SVMs are applied to produce a crowd cheer and commentator speech excitement score. The score is further scaled to compensate for video sound changes year over year at Wimbledon.
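As a rough illustration of the last two steps, the sketch below scores one sound embedding with two separate models. It assumes linear SVMs and a logistic squashing of the raw decision value into a 0 to 1 score; the post does not say which kernel or scaling the production system uses, and every weight, bias, and embedding value here is hypothetical.

```python
import math

def svm_decision(features, weights, bias):
    """Raw decision value of a linear SVM: w . x + b (linear kernel is an assumption)."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def to_excitement(decision_value, slope=1.0, offset=0.0):
    """Squash a raw decision value into (0, 1) with a logistic function.

    slope and offset are hypothetical knobs; re-fitting them per year is one
    possible way to compensate for year-over-year sound changes.
    """
    return 1.0 / (1.0 + math.exp(-(slope * decision_value + offset)))

# Hypothetical 4-dimensional embedding taken from the truncated CNN.
embedding = [0.2, 1.5, -0.3, 0.8]

# Two hypothetical models: one for crowd cheer, one for commentator speech.
cheer_model = {"weights": [0.5, 1.0, -0.2, 0.3], "bias": -0.5}
speech_model = {"weights": [-0.1, 0.4, 0.6, 0.2], "bias": 0.0}

cheer_score = to_excitement(svm_decision(embedding, **cheer_model))
speech_score = to_excitement(svm_decision(embedding, **speech_model))
```

The same embedding feeds both models, which mirrors the description of two SVMs applied to one feature vector.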
The visual aspects of the video are analyzed from extracted video frames. Each image is sent into the VGG-16 neural network model within the Caffe deep-learning framework. This model was pre-trained on ImageNet, a large visual database, and then adapted to recognize exciting tennis movements. An action excitement score is scaled to provide a score for a tennis player. The same set of images is used to determine the reaction of a tennis player as well as for body part detection. Portions of the body such as the head and torso are tracked to determine the speed of motion and gestures.
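To make the speed-of-motion idea concrete, here is a minimal sketch that turns the tracked position of one body part into a speed per frame. It assumes the tracker reports one (x, y) pixel coordinate per consecutive frame and that the frame rate is known; the function name, the coordinates, and the frame rate are hypothetical.

```python
import math

def motion_speeds(positions, fps):
    """Speed in pixels per second between consecutive frames.

    positions: list of (x, y) pixel coordinates, one per consecutive frame.
    fps: frame rate of the extracted frames.
    """
    speeds = []
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        # Distance moved in one frame, multiplied by frames per second.
        speeds.append(math.hypot(x1 - x0, y1 - y0) * fps)
    return speeds

# Hypothetical head positions over four frames at 25 frames per second.
head_track = [(100, 200), (103, 204), (109, 212), (109, 212)]
speeds = motion_speeds(head_track, fps=25)
peak_speed = max(speeds)
```

A sudden peak in this series is one simple signal that could feed a gesture score, such as a fist pump after a winning point.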
Each of the individual scores from cheer, action and body motion contributes to the crowd cheering and player gesture scores and the overall excitement score shown in Figure 1. Each excitement score is saved into the Cloudant data store for downstream processing by the debiasing app.
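The post does not spell out how the individual scores are combined. One simple possibility is a weighted average, sketched below; the weights and score values are hypothetical, not the ones used at Wimbledon.

```python
def overall_excitement(scores, weights):
    """Weighted average of component scores (a stand-in for the real combiner)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical component scores in [0, 1] and hypothetical weights.
scores = {"cheer": 0.8, "action": 0.6, "body_motion": 0.4}
weights = {"cheer": 2.0, "action": 1.0, "body_motion": 1.0}

overall = overall_excitement(scores, weights)
```

Dividing by the total weight keeps the overall score in the same range as its components.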
The debiasing Python application, deployed as a Cloud Foundry application on the IBM Cloud, polls the Cloudant context queue for records so that it can remove unintended bias and alter potentially unethical excitement levels. The application is scaled out to four instances to maintain near-real-time debiasing capability for the large volume of ranked AI Highlights.
Several AI technologies within IBM Watson OpenScale detect bias and correct the overall context excitement level with mitigation techniques while monitoring model accuracy. Two variables, average player rank in a match and court, are used to measure and remove bias. For example, the excitement level from the sound of cheering might be biased because fan favorite players tend to have larger crowds and followers than lesser-known players. As a result, the cheering sound predicts a popular player will have a more exciting tennis shot than a lesser known player because the cheer is louder and the web traffic to the player’s content is relatively high. Secondly, high-level courts of play can bias the crowd and player reaction to a tennis point by amplifying player emotions within a situation.
To remove a potential bias, the Python application creates an overall context excitement score by applying a trained SVM that was deployed on Watson Machine Learning. Each of the scoring payloads is sent to OpenScale for continual bias detection and mitigation. Throughout the debiasing process, OpenScale trains a post-process debias model that removes bias from the score given a set of monitored attributes.
To determine which attributes should be monitored, we created a tennis domain ontology for each of the 39 predictors. The player popularity metrics track how many times a player’s profile is visited by United Kingdom (UK) based traffic versus worldwide page views. We combined the popularity analytics with tournament features. Within each match, excitement levels are related to the type of hit, the win, and ball and player tracking statistics. In addition, the tournament round and court can be factors that contribute to the ranking of a highlight. The multimedia measures such as crowd cheer, gestures, player expression and speech tone provide significant insights into the importance of a point. Finally, we included player biographical information so that country of origin, rank, and age could have an opportunity to influence the highlight rank. Many of the values within the tennis domain ontology were protected attributes with specific privileged values that may not have group fairness. We identified considerable bias when examining tennis court of play and average team rank.
Throughout Wimbledon, IBM Watson OpenScale monitors the bias of context scores based on two selected attributes: court of play and average team rank. Traditionally, Centre Court, the largest court that hosts the finals of the main singles and doubles events, can seat up to 15,000 fans. The No. 1 Court holds about 11,500 people followed by No. 2 Court with a capacity of 4,000. The other courts seat considerably fewer people and showcase smaller draws. With the bigger crowds and high-profile events, we made the reference group include the top three courts at Wimbledon. The crowd scores for those courts are generally higher and players are more animated with a loud crowd. The other courts, 3 through 18, were monitored for bias with respect to the highlight score. The highlight score was categorized into 5 bins where 0 is the least exciting and 4 is the most. In an analysis of 2018 Wimbledon video, 7 percent of matches on courts 3-18 received favorable highlight scores of 3 or 4. However, 17 percent of matches on the top 3 courts received a score of 3 or 4. As such, IBM Watson OpenScale slightly adjusts the output of the overall context score so that the court bias it found decreases.
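The gap between the two groups can be expressed with a common group-fairness measure, the disparate impact ratio: the favorable-outcome rate of the monitored group divided by that of the reference group. The sketch below applies it to the 2018 figures quoted above. The equal-width binning of a 0 to 1 score is an assumption (the post does not say how the 5 bins are cut), and the function names are hypothetical.

```python
def to_bin(score, n_bins=5):
    """Map a score in [0, 1] to an integer bin 0..n_bins-1 (equal-width bins assumed)."""
    return min(int(score * n_bins), n_bins - 1)

def favorable_rate(bins, favorable=(3, 4)):
    """Fraction of highlights whose bin counts as a favorable outcome."""
    return sum(b in favorable for b in bins) / len(bins)

def disparate_impact(monitored_rate, reference_rate):
    """Ratio of favorable rates; 1.0 means both groups fare equally."""
    return monitored_rate / reference_rate

# Rates quoted in the text for the 2018 analysis.
courts_3_to_18_rate = 0.07   # monitored group
top_three_courts_rate = 0.17  # reference group

ratio = disparate_impact(courts_3_to_18_rate, top_three_courts_rate)
```

With these figures the ratio comes to roughly 0.41, meaning a highlight from the outer courts was less than half as likely to receive a favorable score as one from the top three courts.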
We also found that the higher a player’s rank, the higher the highlight score. This was particularly true during the later stages of previous Wimbledon tournaments. We decided to average all of the players’ ranks together within a match to monitor matches of highly ranked players. The reference group included all matches with an average rank equal to or greater than 10. The rest of the average ranks were put into a monitored group. Over time, the excitement scores of matches with lower-level players are slightly boosted to mitigate bias based on player rank. Highlights from lower-ranked players will be included with those from higher-ranked players to achieve group-based parity.
Here’s an example. On July 3rd, we debiased the context excitement model with a post processor on IBM Watson OpenScale. Court fairness was at a low of 71 percent, with most of the favorable high-score outcomes going to Centre Court, Court No. 1 and Court No. 2. Only 13 percent of highlights on courts 3-18 had a favorable excitement score of 3 or 4. After debiasing, 34 percent of highlights on the lower-level courts had a favorable excitement score. Court fairness increased by 82 percent. However, player rank fairness decreased slightly by 5 percent but was within an acceptable range of greater than 80 percent. The court fairness improvement of 82 percent outweighed the 5 percent decrease in rank fairness. During the debiasing process, the model maintained 90 percent accuracy. Throughout the tournament, we expect to have many more samples for all courts, which will change the fairness metrics through additional debiasing.
We have a clear understanding of why OpenScale chose to debias a match when we look at the predictive power results. In one highlight example, a match played on a lower-level court was debiased from a low excitement score to the highest excitement score of 4. The original confidence value for an excitement value of 4 was only 29.65 percent. The small crowd at the lower court contributed 12.29 percent against a lower excitement level between 0 and 3. In addition, the small number of worldwide profile visits for the team gave 9.93 percent support for a low excitement level. However, the analytic score that includes the tennis score during the highlight gave 20.87 percent of support for the highest excitement level. The mitigation technique within IBM Watson OpenScale selected this match to override the evidence by assigning an excitement level of 4. The mitigation technique helps to provide excitement equity across matches on lower-level courts and with lower-ranked players.
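The internals of the OpenScale mitigation are not described here, so the following is only a generic sketch of what a post-process override can look like, not the product's algorithm. It promotes monitored-group highlights with the strongest evidence for a favorable score until the monitored group's favorable rate reaches a target fraction of the reference group's rate. The record layout, field names, and the 0.8 target are assumptions, and both groups are assumed to be non-empty.

```python
def favorable_rate(records, group):
    """Share of a group's records that currently have a favorable outcome."""
    members = [r for r in records if r["group"] == group]
    return sum(r["favorable"] for r in members) / len(members)

def promote_until_fair(records, target_ratio=0.8):
    """Return a copy of records with some monitored outcomes flipped to favorable."""
    adjusted = [dict(r) for r in records]  # leave the caller's data untouched
    reference_rate = favorable_rate(adjusted, "reference")
    # Unfavorable monitored records, strongest favorable evidence first.
    candidates = sorted(
        (r for r in adjusted if r["group"] == "monitored" and not r["favorable"]),
        key=lambda r: r["confidence_favorable"],
        reverse=True,
    )
    for record in candidates:
        if favorable_rate(adjusted, "monitored") >= target_ratio * reference_rate:
            break
        record["favorable"] = True
    return adjusted

# Hypothetical records; 0.2965 echoes the confidence value quoted in the text.
records = [
    {"id": "r1", "group": "reference", "favorable": True, "confidence_favorable": 0.90},
    {"id": "r2", "group": "reference", "favorable": True, "confidence_favorable": 0.80},
    {"id": "r3", "group": "reference", "favorable": False, "confidence_favorable": 0.30},
    {"id": "r4", "group": "reference", "favorable": False, "confidence_favorable": 0.20},
    {"id": "m1", "group": "monitored", "favorable": False, "confidence_favorable": 0.2965},
    {"id": "m2", "group": "monitored", "favorable": False, "confidence_favorable": 0.10},
    {"id": "m3", "group": "monitored", "favorable": False, "confidence_favorable": 0.25},
    {"id": "m4", "group": "monitored", "favorable": False, "confidence_favorable": 0.05},
]

debiased = promote_until_fair(records)
promoted = sorted(r["id"] for r in debiased
                  if r["group"] == "monitored" and r["favorable"])
```

Ranking candidates by their confidence for the favorable class means the overrides go to the highlights that already had the most support for a high score, which is consistent with the example above where a 29.65 percent confidence was promoted to level 4.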
By the third day of Wimbledon, the AI Highlights dashboard had a good range of player highlights. The highest ranked highlight features Felix Auger-Aliassime, with an Association of Tennis Professionals (ATP) rank of 21, and Vasek Pospisil, with an ATP rank of 45. Tied for the second highest excitement score is a highlight featuring ATP rank 1 Novak Djokovic playing against Philipp Kohlschreiber, rank 57. The top four highlights feature both popular and lower-level courts such as Centre Court, Court 12 and No. 1 Court.
Each of the top highlights, ranked by the context of play and multimedia excitement features, tells part of the narrative of the 2019 Wimbledon. You might be surprised by some of the highlights. With our fair AI Highlights process, you can enjoy a more objective and broader approach to top tennis moments than fans have ever been able to see before.