Taming Unrestrained Data Growth in the Big Data Era

Make data lifecycle management an imperative in any big data strategy

Many organizations are fully aware that the volume, variety, and velocity of data continue to grow at an unprecedented rate, yet they often attempt to handle this rising tide of data without a plan. Moreover, legacy, manual methods of discovering, governing, and correcting data are no longer practical at this scale of growth.


Keeping pace with rising data volume

Organizations need to automate information integration and governance, applying both at the point of data creation and throughout the data lifecycle. An effective approach helps protect information and enhance the accuracy of the insights businesses hope to glean from their big data repositories. Organizations that neglect to put an effective data lifecycle management strategy in place may face considerable performance and cost impacts, and they risk falling out of compliance with legal and audit policies.

Performance impact

Volume, one of the familiar Vs that characterize big data, represents a challenge for both scalability and the integration points among those volumes. As increasing numbers of end users execute more queries than ever before on large data volumes, slow response times and degraded application performance become significant problems. Left unchecked, this continued data growth can stretch resources beyond capacity and degrade response times for critical queries and reporting processes.

One example of an organization experiencing this kind of challenge is a US-based telecommunications company whose data volumes had grown by an order of magnitude in a very short amount of time. When customers called with questions about their accounts, customer service representatives had to wait up to 45 seconds for the account information to load. The growth in data had significantly degraded the company's response times and customer satisfaction.

Rapid data growth also makes testing highly challenging. As data warehouses and big data environments expand to petabytes and beyond, testing processes can be taxed by having to cull data sets for specific needs. The results include long test cycles, slow delivery time frames, and few defects identified in advance of a new release. Speeding up testing workflows and delivery of data warehouses requires organizations to automate the creation of realistic, right-sized test data while keeping appropriate security measures in place.

Cost implication

Exponential data growth can also drive up infrastructure and operational costs. This impact alone often consumes most of an organization’s budget for data warehousing or big data. While the cost of storage is dropping, the total cost of managing storage (including software, IT and legal resources, and so on) remains high. Rising data volumes require increased capacity, and organizations often need to procure additional hardware and spend more money to maintain, monitor, and administer their expanding infrastructure. Large data warehouses and big data environments generally require large servers, appliances, and testing environments, which can also escalate software licensing costs for the database and database tooling, not to mention labor, power, and legal costs.

Compliance vulnerability

Following the “let’s keep it in case someone needs it later” mandate, many organizations already keep too much historical data. According to the Compliance, Governance, and Oversight Council (CGOC) 2012 Summit Survey, an estimated 69 percent of the data that organizations retain has no business, legal, or regulatory value (see note 1). Opening the doors to excessive storage and retention only exacerbates the situation. At the same time, organizations should ensure the privacy and security of the growing volumes of confidential information. Government and industry regulations from around the world often require organizations to protect personal information no matter where it lives, even in test and development environments.


Putting data lifecycle management into action

The data lifecycle spans multiple phases as data is created, used, shared, updated, stored, and eventually archived or defensibly disposed of. Data lifecycle management plays a key role in nearly every phase after data creation, particularly in archiving data, managing test data, and masking data.

Archiving data

The process of archiving data is often confused with backing up data. Data backups typically involve copying production data to another environment, primarily for disaster recovery or restoration if needed. Data archives, on the other hand, preserve data by providing a long-term repository of information that can be used by litigation and audit teams. Retention policies are designed to keep important data elements for future use while deleting data that is no longer necessary to support the legal needs of an organization. Effective data lifecycle management includes the intelligence not only to archive data in its full context, which may include information across dozens of databases, but also to do so based on specific parameters or business rules.
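To make the idea of rule-driven retention concrete, here is a minimal Python sketch of how a retention policy might decide whether a record stays in production, moves to the archive, or is disposed of. Everything in it is an assumption made for illustration: the `RetentionRule` class, the `RULES` table, the category names, the retention periods, and the legal-hold flag are hypothetical and do not describe how any particular product implements archiving.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical business rule: ages (in days) at which a record moves on."""
    archive_after_days: int
    dispose_after_days: int


# Illustrative policy table; the categories and periods are assumptions.
RULES = {
    "invoice": RetentionRule(archive_after_days=365, dispose_after_days=7 * 365),
    "web_log": RetentionRule(archive_after_days=30, dispose_after_days=180),
}


def retention_action(category: str, created: date, today: date,
                     legal_hold: bool = False) -> str:
    """Return 'keep', 'archive', or 'dispose' for a single record."""
    rule = RULES[category]
    age_days = (today - created).days
    if age_days >= rule.dispose_after_days:
        # A record under legal hold is never disposed of; it stays archived.
        return "archive" if legal_hold else "dispose"
    if age_days >= rule.archive_after_days:
        return "archive"
    return "keep"


if __name__ == "__main__":
    today = date(2024, 1, 1)
    print(retention_action("web_log", date(2023, 12, 20), today))
    print(retention_action("invoice", date(2023, 1, 1), today))
    print(retention_action("web_log", date(2023, 1, 1), today, legal_hold=True))
```

A real policy would also need to archive each record together with its related rows in other tables and databases, which is the "full context" requirement described above; this sketch covers only the per-record decision.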

Managing test data

In development, testers need to automate the creation of realistic, right-sized, and protected data that mirrors the behaviors of existing production databases. To ensure that queries can be run easily and accurately, they should create a subset of actual production data and reproduce actual conditions to help identify defects or problems as early as possible in the testing cycle. The tremendous size of big data systems creates challenges for testers, who need ways to generate test data sets that facilitate realistic functional and performance testing.
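One way to picture a right-sized subset is as a sample of parent records plus only the child records that belong to them, so that relationships in the test data still resolve. The Python sketch below assumes rows are plain dictionaries; the function name, the table names, and the key names are hypothetical, and a production subsetting tool would follow many more relationships than this single parent-child link.

```python
import random


def subset_with_children(parents, children, parent_key, foreign_key,
                         fraction, seed=0):
    """Sample a fraction of parent rows and keep only their child rows."""
    rng = random.Random(seed)  # fixed seed makes the subset repeatable
    # Take at least one parent when any exist, and never more than exist.
    sample_size = min(len(parents), max(1, round(len(parents) * fraction)))
    sampled_parents = rng.sample(parents, sample_size)
    kept_ids = {row[parent_key] for row in sampled_parents}
    # Keeping only children of sampled parents preserves referential integrity.
    sampled_children = [row for row in children if row[foreign_key] in kept_ids]
    return sampled_parents, sampled_children


if __name__ == "__main__":
    # Hypothetical production tables: 100 customers with 5 orders each.
    customers = [{"customer_id": i} for i in range(100)]
    orders = [{"order_id": n, "customer_id": n % 100} for n in range(500)]
    test_customers, test_orders = subset_with_children(
        customers, orders, "customer_id", "customer_id", fraction=0.1)
    print(len(test_customers), len(test_orders))
```

Sampling parents first and then following the foreign key is a deliberate choice: sampling each table independently would leave orders that point at customers missing from the subset.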

Masking data

Because production data contains information that may identify customers, organizations need to mask that information in test environments to maintain compliance and privacy. Applying data masking techniques to the test data means testers use realistic-looking but fictional data—no actual sensitive data is revealed. Organizations also need ways to mask certain kinds of sensitive data, such as credit card and phone numbers. While testing their big data environments, they should also mask sensitive data from unauthorized end users, even though those end users may be authorized to see the data in aggregate. The development team can also use test data management technologies to easily access and refresh test data, which helps speed the testing and delivery of the new data source.
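As a rough illustration of masking credit card and phone numbers, the following Python sketch replaces each digit with a keyed pseudo-random digit while leaving separators in place, so the masked value keeps the original format. The function name `mask_digits`, the `secret` parameter, and the option to keep trailing digits are assumptions made for this example; it is one simple substitution technique, not a description of how any specific masking product works.

```python
import hashlib
import hmac
import string


def mask_digits(value: str, secret: bytes, keep_last: int = 0) -> str:
    """Replace digits with keyed pseudo-random digits, keeping separators."""
    # The keyed hash makes masking repeatable: the same input and secret
    # always produce the same masked value, so joins across tables still match.
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    total_digits = sum(ch in string.digits for ch in value)
    masked, seen = [], 0
    for ch in value:
        if ch in string.digits:
            seen += 1
            if seen > total_digits - keep_last:
                masked.append(ch)  # keep the trailing digits unchanged
            else:
                # Map one digest byte to a digit (slightly biased, fine for a sketch).
                masked.append(str(digest[seen % len(digest)] % 10))
        else:
            masked.append(ch)  # separators such as '-' or '(' stay as-is
    return "".join(masked)


if __name__ == "__main__":
    key = b"example-secret"  # hypothetical key; store real keys securely
    print(mask_digits("4111-1111-1111-1111", key, keep_last=4))
    print(mask_digits("(555) 010-9999", key))
```

Note that the masked card numbers produced this way are not guaranteed to pass a Luhn checksum, so applications that validate card numbers would need a masking routine that recomputes the check digit.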


Deploying a solution business stakeholders and IT can embrace

IBM® InfoSphere® Optim™ data lifecycle management software enables organizations to deploy a single data lifecycle management solution that can scale to meet the needs of enterprises. Whether they implement InfoSphere Optim for a single application, data warehouse, or big data environment, organizations can execute a consistent strategy to streamline managing the data lifecycle. The advanced relationship engine in InfoSphere Optim provides a single point of control to guide data processing activities such as archiving, subsetting, and retrieving data.

Please share any thoughts or questions in the comments about the challenges you or your organization face in handling phenomenal data growth.

1 "Defensible disposal: You can't keep all your data forever," guest post by Deidre Paknad, LLC, July 2012.