How AI picks the highlights from Wimbledon fairly and fast
Note: This blog post was authored by Aaron Baughman with Stephen Hammer, Eythan Holladay, Eduardo Morales and Gary Reiss.
Tennis during the Wimbledon Championships fortnight occurs on 18 courts with over 147,000 points contested. In many cases, fans watch one match at a time and catch up on the rest of the tournament by viewing preselected highlights that are generally about popular players.
Wimbledon’s digital content team seeks to serve up the best content to fans around the world who are looking for the latest news and scores. With great tennis played simultaneously across 18 courts during The Championships, it is a significant effort for video editors to watch all of these matches and create highlight videos using traditional tools. By reimagining the workflow of digital editors, Wimbledon uses IBM’s digital and AI capabilities to speed up the creation of match highlights. This frees editors to choose the highlights that fit their narrative, and gives them more time to create additional content to serve up to fans on their digital channels.
For the 2019 Wimbledon Championships, IBM built an AI system that clips individual tennis scenes in real time from live-produced match footage on 10 courts. Each scene is assigned a fair excitement score, a measure of how notable the scene could be for downstream usage. Every highlight is ranked so that the most exciting points of the tournament can be discovered while minimizing the influence of player rank and crowd size. After each match, the system selects the most exciting tennis scenes and creates highlight videos, all within minutes of match completion.
To achieve this, IBM Watson was trained to recognize acoustic events and to detect and remove inadvertent AI bias. The result is a higher-quality selection of sports highlights, and more of them.
Let’s take a closer look at what happens behind the scenes.
The mind of AI as editor: Picking out top plays
Video streams from Wimbledon courts are ingested and understood by machine multimedia comprehension algorithms. These techniques condense full-length tennis matches into clipped highlights by using computer vision and sound to determine scene boundaries. The camera angle and the transitions between angles are helpful in selecting scenes. However, computer vision alone produces false positives when a player performs an unexpected action before or after the point, or when the broadcaster changes the viewing perspective to keep the content compelling.
The visual aspect of a tennis match, as with any live sporting event, is highly diverse, with a variety of camera angles, looks, lighting, contrast and colors. While visual analysis works well, incorporating AI sound analysis was a natural progression: in contrast to the visuals, the sound of tennis is relatively stable and consistent. For the 2019 Wimbledon Championships, we developed a system that detects events such as ball hits and point boundaries in a tennis match.
To classify the acoustic events from a stream of sounds in a tennis match, the audio needs to be segmented into small windows. We implemented a peak detection approach that finds the sound of interest and isolates it within a single window. When a sound peak is found, the audio is segmented over 0.5 seconds on both sides for a total duration of one second. The 1-second signals are input into an acoustic event recognition pipeline.
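The peak-and-window step above can be sketched as follows. This is an illustrative sketch, not IBM's implementation: the 16 kHz sample rate and the simple amplitude threshold are assumptions for the example.

```python
# Find a sound peak in an audio stream and cut a one-second window around
# it: 0.5 seconds on each side of the peak, padded with silence at the
# edges of the stream. Sample rate and threshold are assumed values.

SAMPLE_RATE = 16_000          # samples per second (assumed)
WINDOW = SAMPLE_RATE          # one-second window around the peak

def find_peak(signal, threshold=0.5):
    """Return the index of the loudest sample above the threshold, or None."""
    peak_idx, peak_val = None, threshold
    for i, sample in enumerate(signal):
        if abs(sample) > peak_val:
            peak_idx, peak_val = i, abs(sample)
    return peak_idx

def segment_around_peak(signal, peak_idx):
    """Cut a one-second window (0.5 s either side of the peak)."""
    half = WINDOW // 2
    start, end = peak_idx - half, peak_idx + half
    pad_left = max(0, -start)
    pad_right = max(0, end - len(signal))
    middle = signal[max(0, start):min(len(signal), end)]
    return [0.0] * pad_left + middle + [0.0] * pad_right

# Example: a quiet two-second stream with one loud transient in it.
stream = [0.01] * SAMPLE_RATE * 2
stream[24_000] = 0.9          # the sound of interest
peak = find_peak(stream)      # index 24_000
clip = segment_around_peak(stream, peak)   # exactly one second of audio
```

The padding keeps every window exactly one second long, which matters because the downstream recognition pipeline expects fixed-length input.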
At the feature extraction phase, the 1-second sound clip is stratified into 20-ms frames. Mel-Frequency Cepstral Coefficients (MFCC) and delta coefficients are extracted before being input into a Convolutional Neural Network (CNN). The classifier recognizes sounds such as announcer, applause, bounce, feet, racket hit, non-play noise and the line judge making a call. Overall, we maintained an F1 score of 87.76 percent across all sound recognition events. An F1 score, the harmonic mean of precision and recall, measures how well an algorithm meets its success criteria.
Next, we wanted to find the start and end boundaries of a point given a series of classified sound events. We developed a time series classifier that uses features derived from the basic sound event classifier’s outputs and the confidence values of multiple neighboring sound windows. For example, when racket strikes are detected, tennis is being played: the first such sound marks the beginning of the point, and when the sounds cease, the point has ended. Overall, we achieved an average precision and recall of 80 percent with sound alone. When combined with visual analysis of the content using IBM Watson Visual Recognition, we increased the accuracy of scene detection. The tennis point boundaries are then used as clues to clip a potential highlight.
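The start-and-end logic described above can be sketched as a simple state machine over classified events. This is a stand-in for the real time-series classifier, which also weighs confidence values across neighboring windows; the event labels follow the sound classes named earlier.

```python
# Infer point boundaries from a stream of (timestamp, label) sound events:
# a point starts at the first play sound and ends when play sounds cease.
# A simplified stand-in for the actual time-series classifier.

PLAY_SOUNDS = {"racket hit", "bounce", "feet"}

def point_boundaries(events):
    """events: list of (timestamp_seconds, label). Returns (start, end) pairs."""
    points, start, last_play = [], None, None
    for t, label in events:
        if label in PLAY_SOUNDS:
            if start is None:
                start = t                      # first play sound: point begins
            last_play = t
        elif start is not None:
            points.append((start, last_play))  # play sounds ceased: point ends
            start = last_play = None
    if start is not None:                      # point still open at stream end
        points.append((start, last_play))
    return points

events = [(0.0, "announcer"), (1.0, "racket hit"), (1.4, "bounce"),
          (2.1, "racket hit"), (3.0, "applause"), (5.0, "racket hit"),
          (5.6, "bounce"), (6.2, "line call")]
boundaries = point_boundaries(events)   # [(1.0, 2.1), (5.0, 5.6)]
```

Each returned pair is a candidate clip boundary that the highlight pipeline can then cut and score.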
The overall architecture for detecting tennis scenes from acoustic data uses Docker containers and Kubernetes clusters (standard technologies for packaging and orchestrating software) as building blocks. The resulting service is called Tennis Event Detection as a Service (TEDaaS). A public Docker image repository on the IBM Cloud stores our images, which include binaries for IBM’s sound recognition toolkit. Each time a new image is built, the source code is pulled from two different GitHub source repositories. The Docker images are deployed as Kubernetes workers across a four-node cluster (see Figure 2). The AI Highlights client sends streams of sound to the service through a RESTful API. The client can either wait for a synchronous response or receive results asynchronously through a callback. The results are placed into a shared Cloudant data store so that all Kubernetes workers can retrieve the results of an individual job.
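The sync-versus-callback choice the client makes might look like the sketch below. The field names and callback contract here are hypothetical, not the actual TEDaaS API; the sketch only builds the request payload, it does not perform the HTTP call.

```python
# Build a JSON request for a sound-recognition service, choosing between a
# synchronous response and asynchronous delivery to a callback URL.
# All field names are hypothetical illustrations, not the real TEDaaS API.
import json

def build_request(court_id, audio_chunk_b64, callback_url=None):
    """Return a JSON payload; async mode is selected when a callback is given."""
    payload = {
        "court": court_id,
        "audio": audio_chunk_b64,               # base64-encoded 1 s sound window
        "mode": "async" if callback_url else "sync",
    }
    if callback_url:
        payload["callback"] = callback_url      # results are POSTed here later
    return json.dumps(payload)

sync_req = build_request("Centre Court", "<base64 audio>")
async_req = build_request("Centre Court", "<base64 audio>",
                          "https://example.com/results")
```

The asynchronous path matters at tournament scale: the client can keep streaming sound windows without blocking on each classification result.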
Once clips are selected, they are ranked in a workflow pattern that is built on top of IBM Cloudant, a distributed database that is built for heavy workloads. The Cloudant record created from the candidate highlight is enriched three times with additional data using Cloudant’s Update Handler features. In each step, Cloudant views are used to place records in processing queues. Records are picked from these queues and enriched with metadata.
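The queue-and-enrich pattern above can be sketched with plain Python queues standing in for Cloudant views and update handlers. The three step functions and their field names are hypothetical placeholders for the real enrichment stages.

```python
# Pass a candidate-highlight record through enrichment steps in order,
# simulating one processing queue per step. Plain deques stand in for
# Cloudant views; the step functions are illustrative placeholders.
from collections import deque

def run_pipeline(record, steps):
    """Apply each enrichment step to the record via a per-step queue."""
    queue = deque([record])
    for step in steps:
        rec = queue.popleft()      # pick the record from the queue
        queue.append(step(rec))    # enrich it and queue it for the next step
    return queue.popleft()

steps = [
    lambda r: {**r, "features": 39},          # statistical enrichment
    lambda r: {**r, "excitement": 0.82},      # component excitement measures
    lambda r: {**r, "debiased_score": 0.79},  # fairness-adjusted score
]
record = run_pipeline({"clip_id": "c1"}, steps)
```

Each step leaves the earlier fields in place, mirroring how an update handler appends metadata to the same Cloudant record rather than replacing it.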
First, the record is enriched with 39 tennis, crowd, biographic and statistical features sourced from a DB2 on Cloud database. One of the features, rank, is the average rank of the tennis players within the match. This average provides a single privileged value that helps us identify and mitigate bias in highlight scoring. Each enriched record is placed into an emotion queue within Cloudant for highlight ranking. Next, a scene excitement ranking system pulls each new record from the emotion queue and creates component-level highlight measures using AI.
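The average-rank feature is simple to compute; the record structure below is a hypothetical illustration, not the actual schema.

```python
# Compute the average world ranking of the players in a match, the single
# privileged value used when checking highlight scores for bias.
# Field names here are illustrative, not the real record schema.

def average_rank(match):
    """Mean of the players' world rankings (two in singles, four in doubles)."""
    ranks = [p["rank"] for p in match["players"]]
    return sum(ranks) / len(ranks)

match = {"players": [{"name": "Player A", "rank": 2},
                     {"name": "Player B", "rank": 48}]}
avg = average_rank(match)   # 25.0; a lower value means higher-profile players
```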
The highlight results are written back to the Cloudant record with an Update Handler and placed onto a queue for bias processing. An agent-based system, written in Python, pulls each new record and calls IBM Watson Machine Learning for an overall context highlight score. The context score comes from a trained predictive model that combines all 39 predictors into a single excitement rating. In the process, IBM Watson OpenScale continuously learns and debiases the context excitement score based on selected attributes, such as the court and the average player rank in a match. This occurs for highlights on three of Wimbledon’s courts. The context and debiased context highlight scores are also stored in Cloudant.
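To make the idea of debiasing concrete, here is a deliberately simple stand-in (not Watson OpenScale's algorithm): remove each group's average score offset for a privileged attribute such as court, so that no court's highlights are favored merely for being played on that court.

```python
# Group-mean adjustment as an illustration of debiasing: shift each
# group's scores so every group shares the overall mean. This is a
# simplified stand-in, not the method Watson OpenScale actually uses.
from collections import defaultdict

def debias_by_group(scores):
    """scores: list of (group, score) pairs. Returns adjusted pairs."""
    overall = sum(s for _, s in scores) / len(scores)
    by_group = defaultdict(list)
    for group, score in scores:
        by_group[group].append(score)
    # How far each group's mean sits above or below the overall mean.
    offset = {g: sum(v) / len(v) - overall for g, v in by_group.items()}
    return [(g, s - offset[g]) for g, s in scores]

raw = [("Centre Court", 0.9), ("Centre Court", 0.8),
       ("Court 18", 0.5), ("Court 18", 0.4)]
adjusted = debias_by_group(raw)
```

After adjustment, an exciting point on an outside court can outrank a routine one on Centre Court, which is exactly the fairness property the ranking system is after.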
As the fair highlight ranking finishes, a script pulls the debiased context score and assigns it to the highlight. The highlight package takes into account business rules, video creative, television graphics and branding requirements. The finished clip is uploaded to a Content Management System (CMS) for a final review by digital editors and distribution to viewers around the world.
How can AI rank the most interesting sports highlights, without being biased by the rank of the player or the size of the crowd? Learn how in part 2 of 2.
See the big picture of how AI models provide value throughout the organization, from helping Wimbledon’s digital editors create highlights at scale to AI assistants providing better, faster customer service.