Using real-time analytics to identify who's scooping whom in online journalism

Big Data Evangelist, IBM

Working journalists are locked into a never-ending race against time. Not only are reporters always up against deadlines, but they are constantly scrambling to make sure they break the news before the competition. Even if they hit the newswires a mere split-second before rival news services, it matters. It's bragging rights, if nothing else.

The concept of "horserace journalism" has driven the news industry from its very start. Back in the day (in other words, long before your and my day), most large cities, and many smaller ones, had multiple daily newspapers, both morning and evening. They competed tooth-and-nail on their ability to break hot stories first and get their ink-impressed pulp-based product into the hands of newsboys on street corners. Hollywood didn't invent the phrase "stop the presses!", but it incorporated the inherent drama of the "scoop" into so many storylines so often that it became one of the most hackneyed cinematic clichés.

When journalism took root in radio and television, the "scoop" imperative ramped up to a new level of real-time insanity. These days, local broadcast outlets boast in their ads that their reporting staff breaks stories first, has won top awards for doing so and, consequently, should be your number one source of "news you can use." For their part, the cable news channels tend to jump at breaking stories well before many concrete details are available, and they tend to fill the airwaves 24x7 with pure speculation until the details arrive. And every new detail or twist in a story is devoured by their entire on-air team (reporters, anchors, analysts, color commentators and more) the moment it appears, as if their lives (and jobs) depended on it—perhaps because the latter do.

Image courtesy of Openclipart and used with permission

As more people turn to online news sources (including, but not limited to, traditional news websites, streaming broadcasts and mobile apps), it's a bit bewildering to figure out who is scooping whom, when, and on which breaking topics. To the average news consumer online, it feels like news is bubbling up spontaneously from everywhere on the internet, without any identifiable source that can claim to have broken it first. And given that news is continuously streaming into the world at high velocity, it's not even clear whether "scooping" can be a meaningful competitive differentiator for news outlets anymore.

My feeling is: "So you reported the story 385 milliseconds earlier than anybody else on the internet? Big deal! This isn't the Olympic sprinting finals."

Nevertheless, you can't deny that much of the "horserace journalism" culture has crept into the online world. Even traditional media outlets rely on the likes of Twitter as a new sort of real-time "ticker service" for real-time newsgathering. Before long, we can expect to see the news services' data scientists build streaming tools that analyze how fast they and the competition are breaking news online—and bragging with data when they find themselves doing the scooping. I also expect to see a new breed of data journalists who use the same tools to identify when, how and by whom stories are emerging and being developed in the online news cycle.

So it was with this perspective that I read this 2013 blog: "How to spot first stories on Twitter using Storm." The article summarizes a research project that author Michael Vogiatzis conducted for his advanced degree in computer science, rather than a "scoop certification" tool built for an online news service (though, clearly, it could serve that purpose). He describes in detail, with programming code, diagrams and equations, no less, how he built a program that does "first story detection" on Twitter's streaming data using the Storm framework. "Specifically," says Vogiatzis, "I try to identify the first document in a stream of documents, which discusses about a specific event."
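The core idea behind first story detection can be sketched simply: compare each incoming document against a window of recently seen documents, and flag it as a "first story" when nothing similar has been seen. What follows is a minimal, single-machine sketch of that idea in Python; it is not Vogiatzis's Storm implementation (which distributes the work to make nearest-neighbor search scale), and the similarity threshold and window size are illustrative assumptions.

```python
import math
from collections import Counter, deque

def tf_vector(text):
    """Bag-of-words term-frequency vector for a document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FirstStoryDetector:
    """Flags a document as a 'first story' when it is sufficiently
    dissimilar from everything in a sliding window of recent documents."""
    def __init__(self, threshold=0.5, window=1000):
        self.threshold = threshold          # illustrative value
        self.recent = deque(maxlen=window)  # bounded memory of past docs

    def process(self, text):
        vec = tf_vector(text)
        # Similarity to the closest document seen so far in the window.
        nearest = max((cosine(vec, v) for v in self.recent), default=0.0)
        self.recent.append(vec)
        return nearest < self.threshold     # True => likely first story

detector = FirstStoryDetector(threshold=0.5)
print(detector.process("earthquake hits city center"))              # True: window is empty
print(detector.process("earthquake hits the city center tonight"))  # False: near-duplicate
print(detector.process("local team wins championship game"))        # True: no overlap
```

A production version would replace the exhaustive scan of the window with approximate nearest-neighbor techniques, since comparing every tweet against thousands of predecessors is exactly what makes the problem hard at Twitter's firehose rates.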

If you're interested in the technical details (from a data scientist's or programmer's perspective), the blog is full of them. (By the way, you can use other stream computing platforms, such as IBM InfoSphere Streams, not just Apache Storm, for this purpose.) It's a fascinating read, but what stopped me cold is this caveat near the blog's end: "It should be noted that an evaluation of the accuracy of this system (precision, recall) at such a scale would be impossible as it would require humans going through a very large number of tweets to manually label new events."
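The evaluation he calls impossible at scale is conceptually simple; what's missing is the human-labeled ground truth. A hedged sketch of the arithmetic, with hypothetical tweet IDs standing in for the labels nobody has:

```python
def precision_recall(predicted, labeled):
    """Precision and recall for first-story detection.
    `predicted`: set of tweet IDs the system flagged as first stories.
    `labeled`: set of tweet IDs humans judged to be first stories."""
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall

# Hypothetical example: system flags tweets 1, 2 and 5; humans label 1, 3 and 5.
print(precision_recall({1, 2, 5}, {1, 3, 5}))  # precision 2/3, recall 2/3
```

The formulas are trivial; the `labeled` set is the bottleneck, because producing it means humans reading and judging a very large number of tweets.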

Come again? If it's impossible to double-check the quality of its output, how can we have any confidence in such a tool? Who, then, can we believe when one news outlet says its real-time analytics tool shows that it scooped a particular story online, and its archrival says that its real-time analytics tool shows the other guy's first tweet on the topic came 3 seconds later than theirs?

The implication of this scoop-indeterminacy for online news outlets is sobering. If horseraces in online newsbreaking are constantly neck-and-neck, if split-second differences in time-to-tweet are irrelevant to the average news consumer and if no news outlet's "photo-finish" first-to-tweet snapshot can be trusted, no news organization can ever achieve a sustainable "scoop king" advantage in the online world.

If scoop culture endures in the era of real-time streaming journalism, who can have legitimate bragging rights over "first-to-tweet"? Will news outlets start bragging about their real-time scoop analytics tools? And, if so, will readers care?