Data Scientist: Exploration in the Age of the Unstructured

Big Data Evangelist, IBM

Big data is a bit like the solar system. It’s a brilliant orb of analytical fusion that emerges from the inchoate disk of gas, dust, rocks and crystals known as “data.” Data scientists are astronomers who explore the whole spinning system, much of which consists of scattered matter that we lump together under the term “unstructured.”

As a catch-all, the term “unstructured” glosses over the many fine distinctions between diverse informational objects. However, it alludes nicely to the cultural notion of big data as a largely unexplored cosmos of breakthrough insights that will stretch our thinking outside traditional boxes.

You could extend the notion of “unstructured” to describe many aspects of the big-data paradigm. Here are the many layers of “unstructured” that enter into discussions of big data and its many applications:

  • Unstructured data: This is the core of the popular discussion. It refers to a key aspect of “variety,” which is a leg of the “3 Vs” big-data stool that also includes “volume” and “velocity.” Unstructured data, in the loosest sense, encompasses various sources such as enterprise content management systems, social media, text, blogs, log data, sensor data, event data, RFID data and more. Generally, unstructured data is understood on a spectrum alongside “structured” and (less often) “semi-structured” data, but the practical distinctions among them are fuzzy at best. Often, the degree of “structure” refers to the extent to which the data object’s full semantics are explicitly called out in a schema, metadata, glossary, ontology, markup language, or similar construct. To the extent that they are, such as in a classic third-normal-form relational structure, the data is structured. If, as in an Extensible Markup Language document, the object includes explicit tags on some data contents plus, optionally, some free-text or binary content, it is semi-structured. If the contents are entirely free-text and/or binary, the object is considered unstructured (a strict definition that would, of course, put many social media data objects, such as tweets, into the “semi-structured” category, given their incorporation of metadata and tags). To the extent that data is unstructured, data scientists must rely on some combination of manual tagging, natural language processing, text mining, machine learning and other approaches to extract the semantics of the content.
  • Unstructured governance: When unstructured data (however defined) enters the enterprise big-data picture, data management professionals begin to quake in their boots. Governance of structured data is an established body of practices and tools, focusing on enforcing controls on schemas and contents of data deemed to constitute an official system of record in some subject area, such as customers and finances. However, governance of unstructured data (understood in the strictest sense outlined above) runs into a fundamental problem: it has no schemas or metadata to leverage at the content level. Governance of unstructured data in the looser sense above (in which some semi-structured formats include metadata, schemas and/or tags) suffers from a second fundamental issue: it is not necessarily linked to any official system of record in its raw format. Many of the “unstructured” sources are ephemeral data (e.g., logs, tweets, events) that data scientists aggregate and mine through their Hadoop, NoSQL, graph database, or other big-data platforms in search of patterns (e.g., correlations) to build into or tweak a statistical model. You would be more likely to implement governance controls around the models themselves (i.e., the chief intellectual property being developed) than around the underlying raw data, much of which might never be stored or retained (being far too voluminous to store in perpetuity anyway). In an unstructured-data context, the relevant model/data governance controls might be either strict (crisp stewardship roles and workflows) or loose (collaborative, ad-hoc), depending on enterprise policies that are still evolving.
  • Unstructured collaborations: A big-data initiative is often a multidisciplinary mixmaster. It may bring together data scientists (both statistical analysts and business analysts) who have never collaborated before and who don’t share a common frame of reference, vocabulary, skills or tools. It may bring together business analysts from diverse functions (such as marketing, finance and human resources) to focus on projects that may span many functions and call for converged analytics approaches. It may involve diverse data specialists (in Hadoop, NoSQL, enterprise data warehousing, stream computing, etc.) who are jointly tackling new challenges that demand hybrid platforms and tools. When mutually unfamiliar data-science disciplines are thrown together, they must often improvise new repeatable collaboration approaches while maintaining a working environment that fosters the unstructured explorations that unlock creative breakthroughs.
  • Unstructured processes: Big data applications are increasingly deployed into business processes (such as customer experience optimization, multichannel conversation management, and behaviorally targeted offers) that are dynamic, emergent, contextual and situational in nature. We might consider these “unstructured” processes in the sense that they can change from moment to moment through the magic of embedded predictive models, business rules, stream computing and machine learning. The stream of passing moments being optimized through big data may never recur in the same exact sequence, which is why “unstructured” analytics-driven next best actions may be most appropriate. Contrast that with the pre-structured orchestration models that power many traditional business applications, into which analytics-driven dynamic re-routing capabilities are usually integrated as an afterthought. In repeatable business scenarios, structured orchestrations are the appropriate process model.
  • Unstructured outcomes: Big data, as implemented by enterprise data scientists, is a process of continued, iterative exploration and experimentation that may have no fixed outcome. If you’re mining a never-ending stream of fresh sources looking for unprecedented insights, it can be difficult to promise any specific structured outcome, such as boosting customer retention, before you dive into the data. And if you’re tuning your big-data analytics through “real-world experiments,” you might be constantly swinging back and forth between disparate business outcomes produced by your process-embedded analytic models. Experiments, by their exploratory nature, have no specific outcome that can be anticipated with certainty in advance. For example, you might be experimenting with different predictive models to drive customer handling across different engagement channels. You might be playing with different models for differentiating offers by customer demographics, transaction history, times of day and other variables. And you might be examining the impact of running different process models at different times of the day, week, or month in back-office processes, such as order fulfillment, materials management, manufacturing, and logistics, in order to determine which can maximize product quality while reducing time-to-market and life-cycle costs.
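
To make the first layer above a little more concrete, here is a rough Python sketch of the structured / semi-structured / unstructured spectrum. It is only an illustration under loud assumptions: the function name `classify_structure`, the `schema_fields` parameter and the three-way rule of thumb are all hypothetical, and real data objects are far fuzzier than any such test can capture.

```python
import json
import xml.etree.ElementTree as ET


def classify_structure(payload, schema_fields=None):
    """Place a data object on the spectrum described in the first bullet.

    Hypothetical heuristic, not an established standard:
    - a record whose fields exactly match a declared schema -> "structured"
    - a record with keys but no matching schema, or text that parses as a
      JSON object/array or as XML (explicit keys or tags) -> "semi-structured"
    - free text or raw bytes -> "unstructured"
    """
    if isinstance(payload, dict):
        if schema_fields is not None and set(payload) == set(schema_fields):
            return "structured"
        return "semi-structured"

    if isinstance(payload, str):
        text = payload.strip()
        # Explicit keys: does the text parse as a JSON object or array?
        try:
            if isinstance(json.loads(text), (dict, list)):
                return "semi-structured"
        except ValueError:
            pass
        # Explicit tags: does the text parse as an XML document?
        try:
            ET.fromstring(text)
            return "semi-structured"
        except ET.ParseError:
            pass

    # Free text, binary content, or anything else we cannot interpret.
    return "unstructured"


if __name__ == "__main__":
    samples = [
        ({"customer_id": 1, "balance": 10.0}, ("customer_id", "balance")),
        ('{"text": "hi", "hashtags": ["bigdata"]}', None),
        ("<note><to>Ann</to>Call me</note>", None),
        ("Great service today!", None),
    ]
    for payload, schema in samples:
        print(classify_structure(payload, schema_fields=schema), "<-", payload)
```

Note how a tweet-like JSON object lands in the “semi-structured” bucket, just as the strict definition above suggests, while the free-text comment is the only sample that would need text mining or natural language processing to give up its semantics.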

If you crave the structure of an established set of data management and governance practices, you should think twice before venturing out into the “unstructured” cosmos of big data. You won’t necessarily be hit by a meteor, but you may find yourself drifting in orbit around a planet with many unfamiliar new continents.

Related information

View the presentation "Data Scientists: Myths and Mathemagical Powers"


Photo: NASA Goddard Photo and Video