Data Warehouse Testing: Part 1
Conducting end-to-end testing and quality assurance for data warehouses
Organizations today need data warehouse testing more than ever before. An increasing number of business mergers, data center migrations, and compliance regulations—along with management’s greater focus on data and data-driven decision making—are all driving demand for data warehouse testing. Organizations are focusing testing on the ETL (extraction, transformation, load) process, business intelligence infrastructures, and applications that rely on data warehouses. Organizational decisions greatly depend on the enterprise records in data warehouses. That data must be of the highest quality. Complex business rules and transformation logic, built using ETL logic, demand diligent and thorough testing.
Planning for the data warehouse testing process
A good understanding of data modeling and source-to-target data mappings help equip the quality assurance (QA) analyst with information to develop an appropriate testing strategy. During the project’s requirements analysis phase, the QA team must work to understand the technical implementation of the data warehouse. Different stages of the data warehouse implementation—source data profiling, data warehouse design, ETL development, data loading and transformations, and so on—require the testing team’s participation and expertise. Unlike some traditional testing, test execution should not start at the end of the data warehouse implementation. In short, test execution itself has multiple phases and should be staggered throughout the lifecycle of the data warehouse implementation (see Figure 1). Figure 1. End-to-end data warehouse testing in a multiphased process. A key element contributing to the success of the data warehouse solution is the ability of the test team to plan, design, and execute a set of effective tests that help identify multiple issues related to data inconsistency, data quality, data security, failures in the ETL process, performance-related issues, accuracy of business flows, and fitness for use from an end-user perspective. Overall, the primary focus of testing should be on the end-to-end ETL process. Testing should validate the loading of all required rows, the correct execution of all transformations, and the successful completion of the data cleansing operation. The team should also thoroughly test stored procedures, scripts, and other processes if they are among the tools used to build the data warehouse, such as IBM® Netezza analytics.
Recognizing the importance of testing
There are many reasons to thoroughly test the data warehouse and use a QA process that is specific to data and ETL testing. For example:
- Source data is often huge in volume and originates from a variety of data repository types.
- The quality of source data cannot be assumed and should often be profiled and cleaned.
- Inconsistency and redundancy may exist in source data.
- Many source data records may be rejected; ETL/stored procedure logs will contain messages that must be acted upon.
- Source field values may be missing where they should always be present.
- Source data history, business rules, and audits of source data may not be available.
- Enterprise-wide data knowledge and business rules may not be available to verify data.
- Because data ETLs must often pass through multiple phases before being loading into the data warehouse, ETL components must be thoroughly tested to ensure that the variety of data behaves as expected, within each development phase.
- Heterogeneous sources of data—such as mainframes, spreadsheets, and UNIX files—will be updated asynchronously through time and then incrementally loaded.
- Transaction-level traceability will be difficult to attain in a data warehouse.
- The data warehouse will be a strategic enterprise resource and heavily relied upon.
Planning the phases for testing
A data warehouse implementation needs end-to-end testing. The QA team must test loads at key points, from the identification of source data to report and portal functions. Of course, the team must also test several points in between, including data-load workflows from the source extract to staging, dimension data to the operational data store (ODS), fact data to the data warehouse, and report and portal functions extracting data for display and reporting. All data-load programs and the resulting data loads should be verified throughout the end-to-end QA process.
Planning for QA staffing
Because a data warehouse primarily handles data, a major portion of the test effort is spent on planning, designing, and executing data-oriented tests. Planning and designing most of the test cases requires the test team to have experience in SQL and performance testing. Particularly, QA staff for the data warehouse should have:
- An understanding of the fundamental concepts of databases and data warehousing
- High skill levels with SQL queries and data profiling
- Experience in the development of data warehouse test strategies, test plans, and test cases—what they are and how to develop them, specifically for data warehouses and decision-support systems
- Skills to create effective data warehouse test cases and scenarios based on business and user requirements for the data warehouse
- Skills and interest in participating in reviews of the data models, data mapping documents, ETL design, and ETL coding—as well as the ability to provide feedback to designers and developers
As organizations seek to develop, migrate, or consolidate their data warehouses, they will need to employ best practices for data warehouse testing. Testing data and systems systematically for errors, bugs, and inconsistencies before production is vital. One of the greatest risks to the success of any company implementing a business intelligence system is rushing the data warehouse into service before testing it effectively with an experienced QA ETL testing team. Whether you are expanding your data warehouse or building one from the ground up, developing a well-planned and executed data warehouse testing process can help you avoid serious risks. In part two of this article, I provide some sample checklists that can help you streamline testing and avoid frequently overlooked tasks. Let me know if you have any questions or thoughts in the comments.