Fathoming photos at algorithmic speed

Big Data Evangelist, IBM

I forget where, but I once overheard a great comic bit about “closed captioning for the thinking impaired.”

OK, please don’t misconstrue this as a slap against people with learning disabilities. It was directed at people with normal cognitive capacity who are likely to mentally tune out of video streams that don’t interest or engage them. And that includes people like me. I would love to be able to call up closed captions that interpret every video stream I’m exposed to, in any medium, at the time it’s playing. And that’s simply because I don’t want to engage my head any more than necessary to fathom it unassisted.

Yes, I think I’m fairly good at concentrating when I need to. But if you’re like me (like most people, actually) you’re not paying close attention at every single moment to television, movies, streaming content and other video programming, even if it deeply engages your heart and mind. You may be thinking hard about the last thing you saw and, as a result, are momentarily inattentive to what’s happening at the present moment. Or you may be susceptible to fatigue, laziness or attention-deficit whatever.

Even if you are paying close attention, you can’t possibly grasp every nuance of the narrative, dramatis personae or situations you’re witnessing. A lot of the time, in most video, there’s too much going on. Perhaps the audio track is muddy and too busy with overlapping dialogue and ambient sounds. Perhaps the storyteller didn’t construct it for easy comprehension. Or, if it’s a non-fiction stream (like a breaking news story), perhaps nobody, including the people it involves, has a clue what’s actually going on.


That’s why, IMHO (in my humble opinion), the most important video innovation of the past 50 years was the “replay” option. Maybe the stream will blow right over your head on the first viewing, but rest assured that you can usually catch additional details and piece together the fuller context on repeat viewings—on the bold assumption that it’s not intrinsically a confusing mess.

If you’re ancient like me, you recall the era before mass adoption of videocassette recorders, when “replay” simply wasn’t possible unless you were filming the video directly off your TV screen (like the kinescopic process that NYC-based US TV networks used to record live 1950s shows like “Your Show of Shows” for later rebroadcast in other time zones). Absent the ability to watch the broadcast at least one more time, you may have missed out on the more subtle points of the performance (which, by the way, were entirely lacking, by design, from the oeuvre of Sid Caesar and other vaudeville-rooted comedians, a stylistic advantage for delivering punchlines that hit pay dirt immediately with everybody in the studio and broadcast audiences).

Photos aren’t the same as videos, of course. Photos tend to hold still and present a static object for our careful inspection. But photos aren’t necessarily any less vague or ambiguous in their subject matter than streaming images. And we viewers aren’t necessarily any more engaged in scrutinizing them than with the average video. Furthermore, the sheer volume, velocity and variety of photographic images that we all encounter in our normal “life stream” render them collectively tantamount to a never-ending video, albeit one composed of stills that flicker in and out of our consciousness at varying rates and may lack any common subject matter.

In this era of big media, replay is almost always an option for most video streams, unless they were specifically designed for evanescence. And more and more digital photos are, by default, being uploaded to the cloud soon after they’re captured. So they can be reviewed as often as we wish.

But who has the time? And, considering the swelling magnitude of video and photo content in the world, where would we ever find enough humans to review, curate and caption it all? And even if those captions magically materialized from the images themselves, who, if anybody, would ever look at those either?

Actually, search engines would, if the “captions” (in other words, descriptive metadata) were accessible and interpretable according to well-formed taxonomies and ontologies. And that metadata wouldn’t actually need magic to conjure it into existence. Deep learning algorithms—specifically, artificial neural networks that are trained to extract meaning from video, image and other streaming objects—could do the trick. But how complex would that trick need to be?
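To make the search-engine point concrete, here is a minimal sketch of how machine-generated captions could be indexed and queried as plain descriptive metadata. Everything in it is an assumption for illustration: the file names, the captions and the helper names (`tokenize`, `build_index`, `search`) are hypothetical, and a simple inverted index stands in for the far richer taxonomies and ontologies a real search engine would use.

```python
import re
from collections import defaultdict


def tokenize(caption):
    """Lowercase a caption and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", caption.lower())


def build_index(captions):
    """Map each word to the set of photo IDs whose caption contains it."""
    index = defaultdict(set)
    for photo_id, caption in captions.items():
        for word in tokenize(caption):
            index[word].add(photo_id)
    return index


def search(index, query):
    """Return the photo IDs whose captions contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical machine-generated captions, invented for this example.
captions = {
    "img_001.jpg": "A traffic officer directing cars at an intersection",
    "img_002.jpg": "Two dogs playing in a park",
    "img_003.jpg": "Cars crossing a bridge at sunset",
}

index = build_index(captions)
print(sorted(search(index, "cars")))         # photos captioned with "cars"
print(sorted(search(index, "cars bridge")))  # photos matching both words
```

The design choice here is deliberately crude: exact word matching only, with no synonyms or hierarchy. It is enough to show that once captions exist as text, nobody has to read them for them to be useful.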

In that regard, I came across an excellent recent GigaOm article on this very topic. The article describes efforts by researchers at Google and Stanford to build hybrid neural networks to interpret digital photos. By “hybrid,” it’s referring to a blend of two types of these algorithms: deep convolutional networks and recurrent networks. As the article states, the former is best suited to computer-vision apps and the latter to natural language processing.

A deep-learning expert who’s not involved in these specific projects summed it up best when he said that this hybrid approach might be suitable for identifying the various elements and objects present in each photographic image and then “train[ing a recurrent neural network] to output a caption so that it can tell us what it thinks is there.”
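For readers who want to see the shape of that hybrid, here is a toy sketch of the data flow: a convolutional stage turns pixels into a handful of features, and a recurrent stage turns those features into a word sequence. This is not the Google or Stanford researchers' architecture. The vocabulary, layer sizes and function names are all hypothetical, and the weights are random and untrained, so the "caption" it emits is meaningless. A real system would learn its weights from a large set of captioned images.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible

VOCAB = ["<end>", "a", "dog", "car", "on", "road"]  # hypothetical toy vocabulary
HIDDEN = 4       # size of the recurrent hidden state (arbitrary)
NUM_KERNELS = 3  # number of 2x2 convolution filters (arbitrary)


def rand_matrix(rows, cols):
    """Random weights standing in for values a real model would learn."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def conv_features(image, kernels):
    """Convolutional stage: slide each 2x2 kernel over the image, apply ReLU,
    and keep the strongest response per kernel (global max pooling)."""
    height, width = len(image), len(image[0])
    features = []
    for k in kernels:
        best = 0.0  # starting at 0 gives the ReLU floor for free
        for i in range(height - 1):
            for j in range(width - 1):
                response = (image[i][j] * k[0][0] + image[i][j + 1] * k[0][1]
                            + image[i + 1][j] * k[1][0]
                            + image[i + 1][j + 1] * k[1][1])
                best = max(best, response)
        features.append(best)
    return features


def generate_caption(image, params, max_len=5):
    """Recurrent stage: seed the hidden state from the image features, then
    greedily emit one word per step until "<end>" or max_len is reached."""
    features = conv_features(image, params["kernels"])
    hidden = [math.tanh(x) for x in matvec(params["W_img"], features)]
    prev_word = [0.0] * len(VOCAB)  # one-hot of the previous word; none yet
    words = []
    for _ in range(max_len):
        from_hidden = matvec(params["W_hh"], hidden)
        from_word = matvec(params["W_xh"], prev_word)
        hidden = [math.tanh(a + b) for a, b in zip(from_hidden, from_word)]
        scores = matvec(params["W_out"], hidden)
        best = scores.index(max(scores))
        if VOCAB[best] == "<end>":
            break
        words.append(VOCAB[best])
        prev_word = [1.0 if i == best else 0.0 for i in range(len(VOCAB))]
    return words


params = {
    "kernels": [rand_matrix(2, 2) for _ in range(NUM_KERNELS)],
    "W_img": rand_matrix(HIDDEN, NUM_KERNELS),
    "W_hh": rand_matrix(HIDDEN, HIDDEN),
    "W_xh": rand_matrix(HIDDEN, len(VOCAB)),
    "W_out": rand_matrix(len(VOCAB), HIDDEN),
}

# A hypothetical 4x4 grayscale "photo" with pixel values between 0 and 1.
image = [
    [0.0, 0.2, 0.9, 0.1],
    [0.1, 0.8, 0.7, 0.0],
    [0.0, 0.6, 0.5, 0.3],
    [0.2, 0.1, 0.0, 0.4],
]

print(generate_caption(image, params))  # arbitrary words: weights are untrained
```

The point of the sketch is the division of labor, not the output: the convolutional half never sees words, the recurrent half never sees pixels, and the only thing connecting them is a small feature vector.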

That’s exactly right. The algorithm would need to leverage cognitive computing in its full power. To do that with images (still or moving), it would need to leverage contextual variables that help it understand the semantics both in-frame (such as identities and relationships involving any and all within-scene elements) and out-of-frame (for example, the scene and its elements’ meanings within a broader context or narrative).

As the article makes clear, the practical value of this would be immeasurable. “Models that can accurately assess the entirety of a scene rather than just picking out individual objects will deliver more-accurate image search results and content libraries. As it gets more accurate, this type of approach could help robotics systems in fields from driverless cars to law enforcement make better, more context-aware decisions about how to act.”

No one doubts that all of these can be accomplished with the right deep-learning technologies. The wildcard is the “as it gets more accurate.” No one in their right mind will go for a ride in a driverless car that has a one percent chance of misinterpreting a traffic cop’s hand signals and consequently driving off a collapsed bridge.
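A quick back-of-the-envelope calculation shows why even one percent is too much. The sketch below assumes, purely for illustration, that each encounter is an independent event with the same error rate. That independence assumption is mine, not the article's, and real-world errors are unlikely to be so tidy. The function name is hypothetical.

```python
def chance_of_at_least_one_error(per_event_error, events):
    """Probability of one or more misreads across independent events,
    computed as 1 minus the chance of getting every event right."""
    return 1.0 - (1.0 - per_event_error) ** events


# Assumed per-encounter error rate of one percent, over more and more encounters.
for encounters in (1, 10, 100, 500):
    risk = chance_of_at_least_one_error(0.01, encounters)
    print(f"{encounters:>4} encounters: {risk:.1%} chance of at least one misread")
```

Under those assumptions, a one percent error rate per encounter compounds to roughly a 63 percent chance of at least one misread after 100 encounters.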

Your autonomous vehicle, and any other data-driven artifact upon which you depend, needs to have these sorts of situations “closed-captioned” correctly and unambiguously in real time. Unless the mechanisms that power our new world have deep-learning algorithmic engines that purr flawlessly around the clock, they’re effectively “thinking-impaired.”