Ensuring Data Warehouse Quality: A Business Mandate
New ideas for a beginning-to-end data warehouse lifecycle quality process
Data warehousing continues to gain significance as organizations become more fully aware of the benefits of data-driven business-decision making. One primary element of an efficient data warehouse system is a process that can reliably extract, transform, cleanse, and load data from source systems (see Figure 1) during normal operation without impacting overall performance, scalability, or reliability.
All too often, however, data quality defects are introduced much earlier, during the multiple stages of warehouse development. Fast-track projects that move rapidly through development and implementation can be especially prone to the unintentional injection of defects.
To address these concerns, this article presents new ideas for a beginning-to-end data warehouse lifecycle quality process.
Figure 1. Data path from source systems to warehouse, to analysis and reporting.
Causes of data errors and ideas for early elimination
The injection of data quality problems and defects commonly occurs when populating the warehouse—but they can emerge during all phases of data warehousing, including:
- Data warehouse modeling and schema design
- Coding of data loads
- Integration of data from varied sources
- The extraction, transformation, and loading (ETL) of data, along with its staging and cleansing
Data testing is often planned for the latter phases of data warehouse projects. However, most quality assurance (QA) professionals agree that establishing a successful test planning and execution effort in a project’s early phases is one of the keys to success. As with other software development tasks, the earlier data errors are detected, the less expensive it will be to find and correct them. And by carrying out testing during design but before implementation, project managers gain an effective way to regularly measure and document project progress.
Implementing initial QA efforts
Since data quality can be best measured with reference to a set of data requirements, a successful testing process begins by gathering and documenting end-user data requirements. Most end-user requirements focus on online application processing (OLAP) and the quality of data going into the analytics process. As a result, data warehouse testing should focus on both the ETL process on the back end, and reporting and OLAP on the front end.
After gathering these requirements, the process continues with analysts developing conceptual and eventually detailed schemas to represent user needs as an important reference for testing. Designers are responsible for logical schemata of data repositories and for data staging definitions that should be tested for efficiency and robustness.
A data architecture and model is necessary as a blueprint for any data warehouse. Understanding these artifacts and discovering potential defects will help the QA team grasp the bigger picture of a data warehouse. The data model aids comprehension of the methods used to establish the key relationships between the major data sources. The relationship hierarchies and the depth of data can help clarify the otherwise complex transformation rules.
To gain the greatest value from the QA team, organizations should include QA team members in quality assessments for all phases of data warehouse design and development. Each phase of a typical data warehouse development track is associated with specific work products or artifacts that can be tested by the QA team (see Figure 2). Reviews, verifications, and comments for improvement from the team can contribute greatly to the early removal of defects.
Figure 2. Data defect injection points and QA’s role in detection and removal.
Expected contributions of the QA team
The following examples show contributions from the QA team and associated benefits that can be expected during various stages of the data warehouse development lifecycle.
Data integration and ETL planning phase, using the data model and low-level warehouse design
- Testers gain an understanding of data to be reported by the application (for example, profiling) and the tables on which business intelligence (BI) and other reports will be based
- Testers examine and observe the data model to understand how data flows from source to target
- Testers review and become familiar with data movement low-level designs and mappings, and the add and update sequences for all sources of each target table
ETL planning and verification phase, using source inputs and the ETL design
- Testers participate in ETL design reviews
- The QA team gains an in-depth knowledge of ETL workflows and sessions, the order of execution, restraints, and transformations
- Testers develop an ETL test case and distribute it for reviews
- After ETLs are run, testers use checklists for QA assessments of rejections, session failures, and errors
Assessment of ETL logs: Session, workflow, and errors
- Testers review ETL workflow outputs, and source-to-target counts
- Testers verify source-to-target mapping docs with loaded tables using TOAD from Quest Software and other tools
- After ETL runs or manual data loads, assess data in every table—using TOAD and Microsoft Excel tools—with a focus on key fields (such as dirty data, incorrect formats, and duplicates)
To ensure and verify quality during data warehouse design and development, testers should plan and organize the following categories:
- Data completeness: All expected data is loaded
- Data transformation: All data is transformed correctly according to business rules or design specifications
- Data quality: The ETL system correctly rejects, corrects, or ignores and reports invalid data
- Performance and scalability: Data loads and queries perform within expected time frames and the technical architecture is scalable
- Integration testing: The ETL process functions well with other upstream and downstream processes
- User acceptance testing: The warehouse solution meets users' current requirements and anticipates their future expectations
- Regression testing: Current functionality remains intact each time a new release of code is completed
Testing to ensure user value
Of course, testing does not guarantee that there will be no data errors; there are far too many combinations and permutations, and it is not practical to test each one. However, by joining QA forces with designers and developers, and ranking the types of errors as suggested above, organizations can avoid wasting time on creating test scripts and test scenarios for less-important opportunities. Instead, they can concentrate on detecting and eliminating errors that could significantly diminish or destroy the value of the data warehouse for business users.
Please share any thoughts or questions in the comments.