Some people believe that big data and data quality are mutually exclusive. This perception probably arises from the assumption that a little noise (as in poor-quality data) would not affect the quality of insights, given the large volume of signal (data).
The truth is that more and more organizations are seeing benefits from having data governance and quality functions. According to a survey conducted by The Information Difference, an analyst firm, 80 percent of respondents felt that data quality was important to their big data initiatives. Read their latest data quality report for more: The Data Quality Landscape - Q1 2014. If organizations move beyond viewing data quality purely as cleansing, standardizing and matching, they will realize that data quality is as important to big data as it is elsewhere, although some capabilities need to adapt or scale to big data requirements.
As organizations progress from the search and survey stage to the exploratory, and then the production or operational, stages of their big data initiatives, different dimensions of data quality come into play more prominently. For example:
- Discovering data relationships: Discovering new sources of data and the hidden links between data that is spread across heterogeneous sources
- Profiling data: Assessing data quality and the suitability of data for specific needs
- Defining an enterprise business vocabulary: Creating, managing and sharing an enterprise-wide controlled vocabulary that gives a semantic context for data usage for different purposes and by different people
- Monitoring data quality: Empowering data stewards to continuously monitor data quality
All these dimensions are important for big data initiatives, especially as they mature from exploratory to operational and production stages. Even cleansing, standardizing and matching capabilities are critical wherever repeatable analytic performance is expected.
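To make the profiling and monitoring dimensions a little more concrete, here is a minimal sketch in plain Python. It is only an illustration, not how any particular product works: the function names `profile` and `failing_fields`, the sample records and the 0.9 completeness threshold are all assumptions made up for this example. It measures how complete each field is and how many distinct values it holds, then flags the fields a data steward might want to look at.

```python
def profile(records, fields):
    """Return per-field completeness ratio and distinct-value count.

    Hypothetical helper: treats None and empty strings as missing.
    """
    total = len(records)
    report = {}
    for field in fields:
        values = [record.get(field) for record in records]
        present = [v for v in values if v not in (None, "")]
        report[field] = {
            "completeness": len(present) / total if total else 0.0,
            "distinct": len(set(present)),
        }
    return report


def failing_fields(report, min_completeness=0.9):
    """List fields whose completeness falls below an assumed threshold."""
    return sorted(
        field
        for field, stats in report.items()
        if stats["completeness"] < min_completeness
    )


# Made-up sample data: a duplicate ID, missing countries and inconsistent casing.
sample = [
    {"customer_id": "C1", "country": "US"},
    {"customer_id": "C2", "country": ""},
    {"customer_id": "C3", "country": "us"},
    {"customer_id": "C3", "country": None},
]

report = profile(sample, ["customer_id", "country"])
print(report)
print(failing_fields(report))
```

On this sample, `country` is only half populated and is flagged, while the distinct count on `customer_id` (three values across four records) hints at a duplicate that matching would need to resolve. At big data volumes the same checks would run on a distributed engine, and a monitoring job would repeat them on a schedule instead of once.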
I stumbled upon this philosophical question somewhere on social media: What happens when big data meets bad data? The answer, obviously, is less accurate insights, bad decisions and a loss of confidence in big data. Trust in your data will never go out of fashion. As long as decisions need to rest on trusted information and insights (and they always will), there will be a need for data quality and governance.
- eBook: Integrating and governing big data
- IBM Redbook: Information Governance Principles and Practices for a Big Data Landscape
- Video: IBM InfoSphere Information Server for High Quality Dependable Data
- Blog: Adapting information integration and governance for the era of big data