Data quality: The key to building a modern and cost-effective data warehouse

Offering Manager, Hybrid Data Management, IBM

Turning raw data into improved business performance is a multilayered problem, but it doesn’t have to be complicated. To make things simpler, let’s start at the end and work backwards. Ultimately, the goal is to make better decisions during the execution of a business process. This can be as simple as not making a customer repeat their address after a hand-off in a call center, or as complex as re-planning an entire network of flights in response to a storm. The end goal in all cases is to make a decision that improves the future health of the business, and that requires decisions that are both accurate and timely.

Trusted data: The bedrock of good decision making

In order to be accurate, the decision has to be based on good and trusted data. Everyone who works in an office has experience with bad data, and most have seen it lead to bad decisions. A report from Aberdeen Group, Modern MDM: The Hub of Enterprise Data Excellence, states that executives realize the challenge of data disparity has reached critical levels, so let’s break down what “good and trusted” data means.

Data needs to be accurate, and the source of the data should be trusted. Accuracy means cleansing the data of human error and other sources of discrepancies. An example would be a telco detecting that Jon Gooddata and John Gooddata are very likely the same person if their records share the same address. Accuracy also includes ensuring consistency and standardization of data, which makes results reliable when analyzing or comparing them (e.g., validating the components of an address to identify a high-value customer). Trust means establishing a chain of lineage from known reliable sources straight through to the data used in actual decisions. It also means governing access to data to prevent unauthorized disclosures and leaks, and to prevent two good data sets from being improperly combined to produce bad output.
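The Jon/John Gooddata matching described above can be sketched as a simple heuristic: treat two records as probable duplicates when their addresses match and their names are highly similar. This is an illustrative sketch using Python's standard-library `difflib`, not the actual matching logic of any IBM product; the field names and threshold are assumptions.

```python
from difflib import SequenceMatcher

def likely_same_person(rec_a, rec_b, threshold=0.85):
    """Heuristic record match: identical address plus very similar names.

    A real data-quality pipeline would also standardize addresses and
    weigh multiple attributes; this only shows the core idea.
    """
    if rec_a["address"] != rec_b["address"]:
        return False
    name_similarity = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    return name_similarity >= threshold

a = {"name": "Jon Gooddata",  "address": "1 Main St, Springfield"}
b = {"name": "John Gooddata", "address": "1 Main St, Springfield"}
print(likely_same_person(a, b))  # → True: same address, near-identical names
```

In practice, production tooling layers standardization (abbreviation expansion, postal validation) on top of matching so that "1 Main Street" and "1 Main St" compare equal before the similarity check runs.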

In order for the decision to be timely, the key data must be discoverable, usable, and current. Discoverability means allowing users to find the data that fits their need in a self-service way, and to share the data and its resulting insights with their peers. Usability means providing end users with the right tool to analyze, filter, and combine the data to fit their needs. Currency means that the data has to be quickly accessible so that decisions stay in sync with changing realities in today’s turbulent environment.

From data to decisions: Making it a reality

How does this boil down to the underlying technologies that support an end-to-end flow of data from creation to better business decisions?

IBM DataStage is a market-leading data integration solution that provides this flow of data from across the business into a catalog of data assets that users and AI then turn into better decisions. It provides in-line data quality capabilities that allow data to be standardized, cleansed, and integrated, and it establishes the chain of lineage that allows data to be trusted by users. DataStage transparently shares metadata with the data catalog, allowing this chain to be extended from source systems straight through to decision makers.

IBM Watson Knowledge Catalog is an intelligent data catalog for managing enterprise data, while also automating away the discovery, classification and curation overhead of maintaining the assets. It extracts a common glossary of terms from the data sets to ensure users in different lines of business are looking at consistent information across data silos, and dynamically masks sensitive data to prevent unauthorized leaks.
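Dynamic masking of the kind described above can be illustrated with a small sketch: sensitive fields are redacted for unauthorized viewers while the stored data remains untouched. This is a hypothetical example in plain Python, assuming a simple role check and a fixed set of sensitive field names; it does not reflect Watson Knowledge Catalog's actual API.

```python
# Assumed classification of sensitive fields (a real catalog would
# discover and classify these automatically).
SENSITIVE_FIELDS = {"ssn", "email"}

def mask_value(field, value):
    """Return a redacted form of a sensitive value."""
    if field == "ssn":
        return "***-**-" + value[-4:]          # keep last four digits
    if field == "email":
        local, _, domain = value.partition("@")
        return local[0] + "***@" + domain       # keep first letter and domain
    return value

def mask_record(record, user_is_authorized):
    """Mask sensitive fields at read time unless the user is authorized."""
    if user_is_authorized:
        return dict(record)
    return {
        k: (mask_value(k, v) if k in SENSITIVE_FIELDS else v)
        for k, v in record.items()
    }

rec = {"name": "Jon Gooddata", "ssn": "123-45-6789", "email": "jon@example.com"}
print(mask_record(rec, user_is_authorized=False))
# → {'name': 'Jon Gooddata', 'ssn': '***-**-6789', 'email': 'j***@example.com'}
```

The key design point is that masking happens on read, driven by the viewer's entitlements, so the same governed data set can safely serve both privileged and non-privileged users.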

But where does this trusted data actually live? IBM's newly reborn Netezza Performance Server provides a scalable, high-performance, and easy-to-use data warehouse for this data to reside in. It provides both the heft to handle high-volume feeds from DataStage and the agility to support end-user demands for data. Netezza lives inside another key piece of the puzzle, one we haven't mentioned yet: Cloud Pak for Data.

Cloud Pak for Data delivers a trusted and modernized analytics platform built on the foundation of Red Hat OpenShift. Enhanced with in-database machine learning models, optimized high-performance analytics, and data virtualization, Netezza enables you to do data science and machine learning at scale. The end-to-end solution, from DataStage building the single version of the truth through Netezza's data warehouse to Watson Knowledge Catalog's central repository of data assets, is available in a single, unified platform that makes getting started easy and can be scaled to meet the requirements of the most demanding environments. And you optimize your data warehouse costs by paying only for the resources to store the data that you use and trust for your business.

Want to make better decisions? Start by delivering business-ready data that is meaningful, trusted and of quality with Cloud Pak for Data. Businesses can reduce their infrastructure management time and effort by up to 85 percent with DataStage on Cloud Pak for Data. To learn more, take the IBM InfoSphere DataStage guided demo.

Accelerate your journey to AI.