Delving deeply into the narrative hierarchies of computer vision analytics

Big Data Evangelist, IBM

Deep learning has become the next big awe-inspiring frontier in big data analytics. This emerging technology, which leverages deep convolutional neural networks and other machine learning algorithms, is all about giving computers the ability to process sensory patterns that any sentient creature can recognize from the moment it’s born.

Deep learning algorithms are growing progressively smarter at recognizing patterns in video, audio, speech, image, sensor and other non-textual data objects. The algorithms are also being embedded into the full range of devices that most of us carry around, wear or install in our cars, offices, houses and other environments. For example, as a recent MIT Technology Review article reports, deep learning algorithms will soon execute within the microchips inside our smartphones.

Algorithmically drilling beneath the visual surface of media streams

Streaming media, which represents the future of the media and entertainment industry, is the principal type of content object that deep learning algorithms analyze. Many advances in deep learning improve the technology’s ability to correlate and contextualize algorithmic insights across media objects. Likewise, deep learning is adept at analyzing other types of streams, such as the sensor data flowing from Internet of Things (IoT) endpoints. As demand for deep learning applications intensifies, we will all begin to take for granted their seemingly magical ability to recognize faces, voices, gestures and other distinctive characteristics of specific individuals. As author Lee Gomes states, “computers are getting better each year at AI-style tasks, especially those involving vision—identifying a face, say, or telling if a picture contains a certain object.”

Soon, though, that won’t be enough. Gomes points out, “What computers really need to be able to do is to ‘understand’ what is ‘happening’ in the picture.” It’s not enough for a computer to see in roughly the same way that an organism, such as a bug or a bear, can see. Can an algorithm possess true insight into what appears in its field of view?

The more disruptive real-world applications of deep learning will be those that generate deeper situational insights through correlation with additional contextual variables. These variables might include the social, geospatial, temporal, intentional, transactional and other attributes of the individuals, activities and objects in the image or stream. This added context can help deep learning algorithms to unambiguously identify that a particular person is in a particular circumstance at a particular time and place.

Extrinsic contextualization of visual streams supplements deep learning algorithms

Some of the required context may be identified by the deep learning algorithms themselves. What the convolutional neural networks and other algorithms happen to overlook may need to be supplied by other analytic, metadata and data tools.

Reducing the need for extrinsic contextualization is a huge research focus in the deep learning community. Gomes highlights algorithmic efforts to extract situational attributes, such as the likely relationships of people, their respective sentiments and intentions and the nature of their interactions with the setting and various things around them, purely from photographic images.

This is a daunting technical challenge, and deep learning researchers aren’t promising that they’ll crack it any time soon. But this challenge has a clear path to a solution, through ongoing efforts in the deep learning community to leverage the extrinsic context that comes from other machine learning algorithms, such as those used for natural language processing, sentiment analysis and behavioral analytics. Correlation of deep learning model results with these other sources of contextual information can show, for example, how the information supplied by media and sensor streams such as body language, facial expressions, tones of voice, heart rate or perspiration aligns with verbal information (as revealed through non-media textual sources, such as those individuals’ tweets, and analyzed through natural language processing).
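To make the idea of correlation concrete, here is a minimal sketch, not a description of any particular product. It assumes that upstream models have already reduced each media, sensor or text signal to a sentiment score between -1 and 1; the Signal class, the alignment function and the 60-second window are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    source: str       # hypothetical label, e.g. "facial_expression" or "tweet_text"
    timestamp: float  # seconds since the start of the observation window
    sentiment: float  # score in [-1.0, 1.0], assumed to come from an upstream model


def alignment(stream_signals, text_signals, window=60.0):
    """Mean agreement between text signals and stream signals close to them in time.

    Agreement for one pair is 1 - |a - b| / 2, so 1.0 means identical scores
    and 0.0 means opposite extremes. Returns None if no pair falls inside
    the time window.
    """
    scores = []
    for t in text_signals:
        for s in stream_signals:
            if abs(s.timestamp - t.timestamp) <= window:
                scores.append(1.0 - abs(s.sentiment - t.sentiment) / 2.0)
    return sum(scores) / len(scores) if scores else None
```

A low score would flag a case where what someone's face or voice suggests diverges from what they wrote at about the same time, which is exactly the kind of mismatch that deserves a closer human look.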

Wrapping algorithmic visual insights in narrative contexts

As deep learning researchers become able to identify more situational variables purely from media and sensor streams, they’re going to want to string those situations (past, present and likely future) into the larger context of narratives. Essentially, a narrative is a structured sequence of situations, whether observed, inferred or likely, that describes the larger “journey” in which the individuals in a given media stream might be involved. For example, the journey in question might be a literal one, such as the road trip of teenage buddies as revealed through their streaming media travelogue, or it might be more symbolic, such as one teenager’s many frustrated attempts, on that same trip, to attract the opposite sex.
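One way to picture a narrative in this sense is as a time-ordered list of situations, each tagged by how it was established. The sketch below is only an illustration under that assumption; the Status, Situation and Narrative names are hypothetical and do not come from any existing library.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OBSERVED = "observed"  # seen directly in the stream
    INFERRED = "inferred"  # deduced from other evidence
    LIKELY = "likely"      # projected, not yet seen


@dataclass
class Situation:
    description: str
    start: float           # seconds from the start of the stream
    status: Status


@dataclass
class Narrative:
    theme: str
    situations: list = field(default_factory=list)

    def add(self, situation):
        """Insert a situation and keep the sequence in time order."""
        self.situations.append(situation)
        self.situations.sort(key=lambda s: s.start)

    def by_status(self, status):
        """Return only the situations established in the given way."""
        return [s for s in self.situations if s.status is status]
```

Keeping the status on every situation makes it easy to separate what the algorithm actually saw from what it guessed, which matters when a person later vets the narrative.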

The narrative might not be a journey, exactly, with a distinct “dramatic structure” of episodes, milestones or moments, as much as a theme or experience elucidated by the sequence of activities depicted in the media stream. In that regard, I took note of a claim by a solution provider that its offering does more than face, voice or object recognition. The vendor claims that its deep learning technology “can analyze video clips to recognize…types of scenes,” as well as “more abstract concepts like ‘fun’ or ‘togetherness.’”

Really? I’ve never seen their technology at work, but somehow I doubt that it can disambiguate such scenes from others that they may resemble on the surface, but which, in fact, reveal different underlying meanings when you probe them a bit. Can their algorithms identify unambiguously when a scene depicts the actual fun and togetherness of, say, a man and woman with genuinely amorous feelings for each other? How can they distinguish that scene in practice from the bogus “fun and togetherness” of a smooth operator putting moves on a polite woman who down deep finds him repulsive?

If their technology can do that, it would be quite a feat of applied deep learning. It would be doubly impressive if those real-world scenes were also being assessed by otherwise perceptive human beings in the same room as the individuals in question, and if the humans missed subtle nuances that might have clued them into what’s really happening.

Human judgment as a reality check on algorithmic inferences of stream narratives

Human judgment is indispensable in framing and vetting any auto-generated narrative that deep learning algorithms may extract from media and other data sources. As I stated in a post several months ago, “identifying the semantics and relationships within pictorial source material demands the most sophisticated deep learning algorithms plus a good dose of human judgment. You could say that this judgment requires a focus on pictorial curation, addressing the critical need for knowledgeable individuals…to distinguish pictorial source materials by degrees of relevance to the narrative being constructed.”

Just as any competent data scientist must subject their statistical models to a reality check, asking whether the correlations being revealed coincide with some causal narrative, deep learning researchers will have to do the same with the narratives that their algorithms construct without human assistance. The reality check on any narrative involving human interactions must always be grounded in the social and behavioral sciences.

The core criterion for such algorithmically framed narratives must always be whether the actual people in this scene would agree that this is, in fact, what’s going on inside, among and around them. Or, if they aren’t available to respond to this question, would a reasonable person who is acquainted with similar situations agree with the algorithm’s narrative inference?

If an auto-generated narrative doesn’t feel plausible on a human level, it’s probably wrong.

Learn more about IBM Watson’s abilities in narrative recognition, leveraging new visual recognition and concepts insights features