Tweets Are The Fruit Flies Of Consumer-Facing Data Science—For Better and Worse

Big Data Evangelist, IBM

Businesses are plunging headlong into the age of social listening analytics without fully thinking through the many issues surrounding the quality of this intelligence. There is plenty of valuable customer intelligence to be had from filtering the social firehose. However, the overwhelming volume, velocity and variety of noise are always threatening to drown out the signals you’re attempting to isolate.

Data scientists are our experts in filtering the flow of social intelligence. Like any human who’s shouldering a heavy workload, the data scientist might find it tempting to follow the path of least resistance, especially when it involves access to low-cost, readily available sources of data.

A frequent slap against social science is that so much research is on a handy but unrepresentative segment of the human population: undergraduates, or, more broadly, people who live on or near college campuses. Having attended a few research-intensive universities back in the day, and having lived elsewhere since, I can attest that this segment does not represent the world at large. Generally, it’s better educated, more affluent, more middle-class, more suburban, more white collar, and more cosmopolitan than the average person. I’ll bet these generalizations apply to communities all over the world that have substantial local institutions of higher education.

Likewise, a frequent criticism of social media listening programs is that the people who choose to share their thoughts on Twitter, Facebook and the like are a skewed slice of humanity. If you’re a data scientist who relies on this source of consumer intelligence to build and tweak your sentiment analysis models, you’re getting an inherently biased feed. To put it bluntly, you’re listening primarily to people like yours truly, rather than to a representative cross-section of the vox populi. And you’re not even getting the straight poop from us, because most of us aren’t being 100% candid on social networks about the stuff that truly interests us. For example, I choose not to share my every last thought on politics, religion, sex, family, work, etc., and I’ll bet most other responsible adults aren’t sharing all those thoughts either. Usually, we refrain from particular topics for very good reasons, many of which have to do with being mature individuals who pride ourselves on our ability to compartmentalize.

With that in mind, I chuckled at the core “tweets are fruit flies” analogy in this recent article. The analogy, as expressed by Princeton researcher Zeynep Tufekci, is straightforward. “[Fruit flies] are usually chosen because they’re relatively easy to use in lab settings, easy to breed, have rapid and ‘stereotypical’ life cycles, and the adults are pretty small.” In other words, fruit flies—which researchers often assume are sufficiently representative of some broader population of organisms—are a convenient source of high-quality data on the biological impact of a wide range of (laboratory-inflicted) conditions.

With irony aforethought, Tufekci describes fruit flies—and, by analogy, tweets, Facebook updates and other user-generated social-media content—as the “model organisms” in their respective fields of study. Consumer-focused data scientists tend to assume that social-media users are a cross-section of the general population. She challenges this assumption by noting how unrepresentative social-media users are of the larger population. Just as important, she notes other, more intractable quality issues associated with social-media intelligence (even if the individuals generating it were a representative slice of the populace):

  • Doesn’t provide contextual intelligence on how many people chose not to retweet or like a particular post, only on how many actively chose to take these actions
  • Doesn’t address the pragmatic ambiguity of retweeting or sharing a post, which might “be something far different than influence, ranging from ‘affirmation to denunciation to sarcasm to approval to disgust’”
  • Doesn’t incorporate any situational context of what particular social-media posts mean within the dynamics of the specific interactions, relationships and/or communities from which they arose
  • Doesn’t acknowledge that the influence(s) that spawned particular social-media posts may have originated, not from a particular relationship, but from “field effects [which are] large-scale societal events that impact a large group through changes within whole social, cultural and political fields”
  • Doesn’t reflect the reality of social media users “gaming” (in other words, actively skewing) the intelligence that might be derived from their posts, through such tactics as creating false hashtag trends, deliberately leaving out hashtags and @ signs to elude search engines, using bots to load socials with bogus machine-generated sentiment, and so forth

Clearly, these data quality issues strike at the very heart of consumer-facing data science in the age of the socials. Come to think of it, the life of the social-media data scientist would be much easier if social-media users were more like fruit flies. Many of the tricky aspects of social-media listening stem from the fact that we are an exceptionally diverse, social, intelligent, innovative and adaptive species. We resist being put in analytic jars, pinned to tidy dashboards, and dissected as if we were simple organisms that haven’t evolved since the age of the dinosaurs.

Even when your social listening tools provide you with a good-enough perspective on customer sentiment, it’s not always clear whether or how you should implement governance on this data. Here are some outstanding sentiment-data governance issues to consider:

  • Data quality requirements stem from the need for an officially sanctioned “single version of the truth,” but it’s highly implausible that any individual social-media message constitutes the “truth” of how any specific customer or prospect feels about you.
  • Social sentiment data rarely has the definitive, authoritative quality of an attribute—such as name, address and phone number—that you would include in or link to a customer record. Even when people are bluntly voicing their opinions, the clarity of their statements is often hedged by the limitations of most natural human language. Even the most powerful computational linguistic algorithms are challenged when wrestling sarcasm, elliptical speech and other peculiarities down to crisp semantics.
  • Even if every tweet were the gospel truth about how a customer is feeling and all customers were amazingly articulate on all occasions, the quality of social sentiment usually emerges from the aggregate. In other words, the “quality” of social data is in the usefulness of the correlations, trends and other patterns you derive from it. That’s why most social-listening tools, including IBM Cognos Consumer Insight, are geared to assessing and visualizing the trends, outliers and other patterns in social sentiment.
  • Even outright customer fibs, propagated through social media, can be valuable intelligence, if we vet and analyze them effectively. After all, it’s useful to know whether people’s words (e.g., “we love your product”) match their intentions (e.g., “we have absolutely no plans to ever buy your product”), as revealed through their eventual behavior (e.g., buying your competitor’s product instead). What you can learn from quality-uncertain social sentiment data is the situational contexts in which some customer segments are likely to be telling the truth about their deep intentions.
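
To make the point about aggregates concrete, here is a minimal sketch in Python of treating sentiment as a pattern rather than as a per-message fact. It assumes each post has already been assigned a numeric sentiment score between -1 and 1 by some upstream model; the function names, the sample data and the 1.5 standard deviation threshold are all hypothetical choices for illustration, and none of this is meant to describe how IBM Cognos Consumer Insight or any other product works.

```python
from collections import defaultdict
from statistics import mean, pstdev


def daily_sentiment(posts):
    """Average per-post sentiment scores by day.

    `posts` is an iterable of (day, score) pairs; the scores are assumed
    to come from some upstream sentiment model (hypothetical here).
    """
    by_day = defaultdict(list)
    for day, score in posts:
        by_day[day].append(score)
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


def flag_outlier_days(daily, threshold=1.5):
    """Return days whose mean sentiment sits more than `threshold`
    standard deviations away from the mean of all daily means."""
    values = list(daily.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every day looks the same, so nothing stands out
    return [day for day, v in daily.items() if abs(v - mu) / sigma > threshold]


# Made-up sample data: four unremarkable days and one sharp drop.
posts = [
    ("2024-01-01", 0.4), ("2024-01-01", 0.6),
    ("2024-01-02", 0.5), ("2024-01-02", 0.5),
    ("2024-01-03", 0.6), ("2024-01-03", 0.4),
    ("2024-01-04", 0.5), ("2024-01-04", 0.5),
    ("2024-01-05", -0.9), ("2024-01-05", -0.7),
]

daily = daily_sentiment(posts)
print(flag_outlier_days(daily))
```

No single post in this sample is treated as the truth about a customer. The only thing the sketch surfaces is the day on which the aggregate moved, which is the kind of signal worth investigating further.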

You should only apply strong governance to data that has a material impact on how you engage with individual customers, and social data rarely meets that criterion. The customer record is the gospel that determines how you target pitches to customers, how you bill them, where you ship their purchases, and so forth. For these purposes, the accuracy, currency and completeness of customers’ names, addresses, billing information and other profile information are far more important than what they tweeted about the salesclerk in your Poughkeepsie branch last Tuesday. However, if you greatly misinterpret an aggregated pattern of customer sentiment, the business risks can be considerable.

To the extent that you can speak about the quality of social sentiment data, it all comes down to relevance. Sentiment data is only good if it’s relevant to some business initiative—such as marketing campaign planning or brand monitoring—and if it gives you a good enough picture of how customers are feeling and how they might behave under various future scenarios.