Provisioning Test Data with IBM InfoSphere Optim Test Data Management: Part 1
Choosing the right strategy for using production data in testing
For many organizations, maximizing the accuracy of testing using production data for test purposes is essential. Of course, using production data requires planning. For example, the test group has to determine which parts and volumes of data are needed to satisfy the test requirements. In defining the test strategy, the test group must also take external constraints into account, such as available production windows, the storage capacity needed for that production data, and the time needed to populate the required test environments.
There are several possible strategies for provisioning test data. In most cases, the goal is to create a full or partial privatized copy of production data, which the development and test group can use to populate its environment. This copied and privatized data is known as the gold master. Depending on the requirements and constraints, each test group might choose a different strategy to achieve the goal.
In part one of this two-part article, we will explore two strategies for provisioning test data using the IBM® InfoSphere® Optim™ Test Data Management solution—cloning production data and subsetting production data. For each strategy, we will examine the steps involved, the advantages and disadvantages of using that strategy, and the contributions that InfoSphere Optim makes to the process. Part two of the article will focus on privatizing data with InfoSphere Optim.
Cloning the production data
Organizations with limited production windows during which queries can be performed may find the option to clone and privatize the production data appealing. The InfoSphere Optim solution can be used to extract all production data at once while privatizing it before landing it in a gold master file known as an extract file. This technique helps ensure that whatever production, development, and test environment constraints may exist, the gold master will contain sufficient data to address them.
The process of cloning doesn’t require any knowledge of data models or the relationship among the tables within them. You simply pull all the data (and optionally, the metadata) from the source data store in one step.
With InfoSphere Optim, administrators can create an access definition that describes how data will be extracted from the source data store. The access definition created to clone a production environment would pull data from all required entities, making each entity a reference. An entity is considered a reference when all data is being pulled and no selection criteria and/or relationships are interrogated to determine eligibility (see Figure 1).
Figure 1. The InfoSphere Optim access definition editor enables you to create the access definition used to pull a set of entities.
As shown in Figure 1, the InfoSphere Optim access definition editor lists and marks each entity as a reference. InfoSphere Optim can also show the access plan—how the data will be retrieved during the runtime operation (see Figure 2).
Figure 2. InfoSphere Optim shows the access plan for a cloning scenario.
After the access definition has been created, it is then used in an extract service, which defines what data is to be pulled and the method for doing so. The extract service provides the capability to privatize the data before landing it in gold master form. (See part two of this article for more about data privacy options.)
Advantages and disadvantages of cloning
- No prior knowledge of data models and data constraints is needed.
- When testing requirements are not yet known, the gold master has the flexibility to accommodate any volume or distribution of data.
- Data can be offloaded quickly from the production environment since no relationship traversal is performed.
- If testing requirements change, you don’t have to refresh the gold master to pick up additional data.
- If the production data set is large, the gold master is large. This approach will require additional storage capacity.
- Without extracting data based on relationships, you may have extraneous data in the gold master.
Subsetting the production data
In organizations that know the scope of testing and the amount of test data required, extracting a subset of the production data is appealing. InfoSphere Optim can be used to define what business objects are required, how those business objects relate to each other through referential integrity and customer-defined relationships, and what selection criteria are needed. Based on this information, InfoSphere Optim can extract production data that is very specifically targeted for one or more test scenarios.
Unlike cloning, creating meaningful subsets requires knowledge of the data model, as well as the best way to traverse physical and/or logical relationships. Therefore, creating the access definition is a bit more time-consuming and requires examination of the access plan steps.
To create the access definition, the administrator first determines the entity within the data store that will be used as the starting point for relationship traversal. This entity is known as the start. (See Figure 1.) Once the start entity is known, the administrator uses InfoSphere Optim to find related entities by examining physical or user-created relationships.
In addition to pulling relationally intact subsets of data, the administrator can define selection criteria for the start entity and any subsequently traversed entities to pull only data that meets specific criteria. For example, the administrator might select only customers who live in the state of New Jersey (see Figure 3).
Figure 3. With the subsetting strategy, administrators must first determine the start entity.
Figure 4. Administrators must examine the access plan steps with the subsetting strategy.
The administrator can view the access plan used to traverse through the data and change the access definition until the desired traversal steps are in place (see Figure 4). Next, the administrator defines the extract service that references the access definition. The extract service provides the capability of privatizing the data before landing it in gold master form.
Advantages and disadvantages of subsetting
- This strategy creates smaller, more manageable gold masters with data targeted to meet the needs of specific testing requirements.
- Testers are guaranteed to have small-enough sets to meet the testing requirements while ensuring the subset is relationally intact.
- Privacy can be applied differently for each gold master depending on target audience.
- A single extract can be run many times, applying different selection criteria as overrides each time.
- Creation of the access definition requires some knowledge of the source data store and the relationship among entities.
- If testing requirements change, it may be necessary to redefine the access definition in the future and re-create the gold master.
Choosing the right strategy
Depending on your starting point and specific testing requirements, one strategy or the other might be an obvious choice. In either case, InfoSphere Optim offers the capabilities to help you work through the process successfully and efficiently.
In Part 2 of this article, we will examine how InfoSphere Optim can help you privatize different types of sensitive data. Until then, which strategy is right for you?
|[followbutton username='IBMdatamag' count='false' lang='en' theme='light']|