When Is Big Data Worth Keeping and Governing?
Build quality into big data with an iterative, policy-based governance approach
The value of data is not always obvious at first glance. A lot of data is worthless, but reaching that conclusion is often impossible until the data has been considered thoroughly from the end user’s point of view. What end users opt not to use is, more often than not, of no value and should be discarded.
Consider an analogy from the way television series are developed in the US. American television networks receive hundreds of pitches for new shows, and the amount of development that gets discarded during pilot season, before or just after the first episode is made, is substantial. For example, I’ll go out on a limb and predict The Odd Couple will not go beyond the pilot stage. My prediction is based partly on how hard it is to live up to the cast of the original sitcom, and partly on my opinion that Matthew Perry, who is playing the slob, should instead be playing the neatnik.
If that pilot does not become a television series, the network writes off a large sum in development and casting costs. That pilot-development money goes down the proverbial tubes and depresses the studios’ bottom lines.
Now, consider the process enterprise analytics organizations undergo when piloting business analytics outputs such as reports, dashboards, and so on for downstream end users. Do they undertake a data warehousing project with the expectation that the team will build 60 reports, of which 40 will be canceled because they’re unpopular with test audiences? Or imagine commencing an IT project in which everyone agrees that 75 percent of what is to be built is going to be a failure. Those scenarios are highly unlikely.
Even in an agile development environment, an organization should expect to keep a lot of what it builds and improve on it with each increment. Information management professionals would lose their jobs in a heartbeat if they squandered that much money on a regular basis.
The challenges of assessing data value
If an organization is implementing big data and developing increasing amounts of analytics, it is probably feeling great pressure to reuse more data than ever before and reduce the amount of work it wastes. Companies are using big data to try out a wide array of data without breaking the budget. But how can they do that with confidence when the sheer volume, variety, and velocity of that data outstrip their ability to assess what truly delivers value, and what needs to be scrapped and decommissioned?
The plethora of new data sources and uses creates an environment in which assessing the value of all that data and analytics at the outset of a development project is far less feasible for data professionals. Best practices in data governance for addressing this new big data reality have shifted away from the traditional pipeline approach—governing data to the highest standard, storing it, and then using it for multiple purposes. Instead, best practices are moving toward an exploratory approach—understanding data and usage, governing it to the appropriate level, using it, and iterating (see figure).
A big data approach to data management versus the traditional approach
In the era of big data, the purpose of data governance has shifted. Beyond its traditional role of helping discover and profile the potential value and usage of big data, governance is now also about determining the current level of confidence in data and the required level of confidence for specific big data use cases.
In a traditional approach, organizations spend serious time, money, and effort in getting data into a state where it is well suited for use by the business. At that point, they have spent so much money that they cannot afford to discard that work. And if they receive bad feedback, they spend additional money trying to get it right.
In the new world of big data, organizations can explore and understand the data before committing to it. To return to the television analogy, this approach is equivalent to how the studios can afford to reject some pilot programs that don’t test well with audiences, as long as the producers have some successful concepts to build on. The big data approach aims to find the valuable insight by testing a wider range of content than is possible with a traditional approach.
An appropriate level of data quality governance
However, this approach takes hard work. Once an organization decides that something is worth keeping, there is still the grind of getting it right. The words “govern appropriately” (see figure) are likely to cause the most angst on big data projects for years to come. In terms of data quality, governing appropriately means that when organizations understand the value in the data, they should also get to know the risk. The following steps can be implemented to obtain the appropriate level of data quality governance on a big data platform:
- Keep a big data catalog. Governing information assets is impossible without having a catalog of those assets for reference. Key data quality concepts such as ownership, valid values, glossary definition, lineage, and relationships are tracked in the catalog.
- Discover data quality problems with data profiling. Exploring and understanding big data should include a data quality assessment; data is not a confirmed asset until its quality is understood.
- Evaluate data quality problems from a perspective of business impact. An appropriate level of data quality control focuses on those problems that have a tangible business impact in which the cost of addressing the problem is less than the penalties of letting it go.
- Take a policy-based approach to data quality. A big data platform has many varieties of data—potentially a large volume. Trying to manage data quality rules at a granular level can make it difficult to determine if the data quality controls and processes are at an appropriate level, given the value and risk of the data. Rolling data quality rules up to a policy tree helps define the business context, value, impact, and ownership from a business perspective.
- Measure data quality through scorecards. Information governance in a big data environment without metrics and scorecards is like a television pilot season without ratings: no one knows which shows are working beyond anecdotal evidence, and decisions are made on guesswork. When judging whether a big data effort has an appropriate level of governance, do not leave it to guesswork; guesswork is how projects fail, data leaks happen, and embarrassing data quality problems occur.
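The profiling, policy-tree, and scorecard steps above can be sketched in a few lines of Python. Everything here is illustrative: the sample customer records, the rule names, and the two policies are assumptions for the sketch, not a real product API.

```python
from typing import Callable

# Sample records, some with quality problems (a missing email, an implausible age).
records = [
    {"id": 1, "email": "ann@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},
    {"id": 3, "email": "cho@example.com", "age": -5},
    {"id": 4, "email": "dee@example.com", "age": 51},
]

# Granular data quality rules: rule name -> predicate over one record.
rules: dict[str, Callable[[dict], bool]] = {
    "email_present": lambda r: bool(r["email"]),
    "age_plausible": lambda r: 0 <= r["age"] <= 120,
}

# Policy tree: business-level policies roll up granular rules, so controls can
# be judged at the level of business impact rather than rule by rule.
policy_tree = {
    "Customer contactability": ["email_present"],
    "Customer demographics": ["age_plausible"],
}

def profile_rules(records: list[dict], rules: dict) -> dict[str, float]:
    """Profiling step: pass rate per rule across the data set."""
    total = len(records)
    return {name: sum(check(r) for r in records) / total
            for name, check in rules.items()}

def scorecard(rule_scores: dict[str, float], policy_tree: dict) -> dict[str, float]:
    """Scorecard step: average each policy's rule pass rates into one score."""
    return {policy: sum(rule_scores[r] for r in names) / len(names)
            for policy, names in policy_tree.items()}

rule_scores = profile_rules(records, rules)
print(scorecard(rule_scores, policy_tree))
# -> {'Customer contactability': 0.75, 'Customer demographics': 0.75}
```

In practice the policy scores would feed a governance dashboard, and the business-impact step would decide which failing policies are worth the cost of remediation.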
A policy-based approach to big data quality
Many organizations today are realizing the challenges involved in implementing big data initiatives cost-effectively, not to mention maintaining confidence in the ever-growing volumes of data those initiatives require. These challenges are driving a palpable shift in established best practices for applying data governance.
Today, exploratory approaches to understanding data, its use, and its risk level dictate governing data to an appropriate level before using it, rather than following the traditional course of governing it to a very high standard before using it across any number of projects. In this approach, organizations determine the appropriate level of confidence in a data set by taking several steps: keeping a big data catalog, profiling data for quality problems, evaluating those problems from a business-impact perspective, adopting a policy-based approach to data quality, and measuring quality with scorecards.
Please share any thoughts or questions in the comments.