Wrangling and governing unstructured data

Executive Architect, IBM

Unstructured data is the common currency in this era of the Internet of Things (IoT), cognitive computing, mobility and social networks. It’s a core resource for businesses, consumers and society in general. But it’s also a challenge to manage and govern.

Unstructured data’s prevalence

How prevalent is unstructured data? Sizing it up can give us a good sense for the magnitude of the governance challenge. If we look at the world around us, we see how billions of things become instrumented and interconnected, generating tons of data. In the Internet of Things, the value of things is measured not only by the data they generate, but also by the way those things securely respond to and interact with people, organizations and other things.

If we look into public social networks such as Facebook, LinkedIn or Twitter, one of the tasks will be to know what the social network data contains to extract valuable information that can then be matched and linked to the master data. And mobile devices, enabled with the Global Positioning System (GPS), generate volumes of location data that is normally contained in very structured data sets. Matching and linking it to master data profiles will be necessary.

The volume of unstructured information is growing as never before, mostly because of the increase of unstructured information that is stored and managed by enterprises, but is not really well understood. Frequently, unstructured data is intimately linked to structured data—in our databases, in our business processes and in the applications that derive value from it all. In terms of where we store and manage it, the difference between structured and unstructured data is usually that the former resides in databases and data warehouses and the latter in everything else.

In format, structured data is generated by applications, and unstructured data is free form. In addition, like structured data, unstructured data usually has metadata associated with it. But not always, and therein lies a key problem confronting enterprise information managers in their attempts to govern it all comprehensively.

Governance of the structured-unstructured data link

When considering the governance of unstructured data, a focus on the business processes that generate both the data itself and any accompanying metadata is important. Unstructured data, such as audio, documents, email, images and video, is usually created in a workflow or collaboration application, generated by a sensor or other device, or produced upon ingestion into some other system or application. At creation, unstructured data is often but not always associated with structured data, which has its own metadata, glossaries and schemata.

In some industries, such as oil and gas or healthcare, we handle the unstructured data that streams from the sensors where it originated. In any case, unstructured data is usually created or managed in a business process that is linked to some structured entity, such as a person or asset. Consider several examples: 

  • An insurance claim with structured data in a claims processing application and associated documents such as police records, medical reports and car images
  • A mortgage case file with structured data in a mortgage processing application and associated applicant employment status and house assessment documents
  • An invoice with structured data in an asset management application and associated invoice documents
  • An asset with records managed across different applications and associated engineering drawings 

Governance challenges enter the picture as we attempt to link all this structured and unstructured information together. That linkage, in turn, requires that we understand dependencies and references and find the right data, which is often stored elsewhere in the enterprise and governed by different administrators, under different policies and in response to different mandates.

What considerations complicate our efforts to combine, integrate and govern structured and unstructured data in a unified fashion? We must know how we control this information, how it is exchanged across different enterprises and what are the regulations and standards to secure delivery of its value and maintain privacy.

We also need to understand what we are going to do with the data that we collect because just collecting data for future use, just in case, is not the solution for any problems. We can easily shift from competitive advantage to unmanageable complexity.

Governance perspectives different industries in a complicated ecosystem of connected enterprises, we handle different types of information that is exchanged, duplicated, made anonymous and duplicated again. In analytics we handle predictive models to provide recommendations resulting in critical decision making. We need to think about models’ lifecycle and track the data sets used to develop such models as well as ownership changes.

How can governance be applied here? When we speak about information, integration and governance, we usually get different answers. Some, such as a legal record manager, focus on unstructured data curation, document classification and retention to comply with internal policies and external legislation. On the other hand, data warehouse IT groups focus on structured and transactional data and its quality to maintain the best version of the truth.

But the business usually doesn’t care about what type of information it is. What they want to see is the whole picture that will include all related information from structured, unstructured and other sources with proper governance around it. The importance for integrated metadata management became crucial.

Data lifecycle governance environments

To unify governance of structured and unstructured data, enterprises need to remove borders between information silos. In addition, organizations need to be connecting people and processes inside and outside the organization. And they need to make every effort to create trusted and collaborative environments for effective information configuration and management.

What should span all information assets, both structured and unstructured, is a consistent set of organizational policies, roles, controls and workflows focused on lifecycle data governance.

Learn more about data governance