Data Masking Everywhere

Manager of Portfolio Strategy, IBM

The new era of computing has arrived: organizations are now able to process, analyze and derive maximum value from structured, unstructured and streaming data in real time. However, in the rush to achieve new insights, are privacy concerns being neglected? How can you support business goals while also ensuring the privacy of sensitive data? With the average cost of security-related incidents in the era of big data estimated at more than USD 40 million, according to an Aberdeen Group Research Brief, you can’t afford to treat data privacy as anything less than a top requirement.

With 2.5 quintillion bytes of data created every day, now is the time to understand sensitive data and establish business-driven privacy policies to keep customer, business, personally identifiable information (PII) and other types of sensitive data safe. Remember, however, that different types of data will require different protection policies. For example, text, audio, log files and clickstreams have unique characteristics and challenges around privacy. In addition, your privacy policies need to keep up with the velocity of your data—even two minutes is too late when it comes to preventing abuse.

Protecting privacy isn’t just a nice-to-have. It is required by more than 50 international privacy laws such as Argentina’s Personal Data Protection Act and Korea’s Act on Personal Information Protection.

Data masking provides intelligent data protection to address privacy concerns

Data masking replaces sensitive data with a nonsensitive substitute, but does so in a way that preserves the integrity of the data. This means masked data can be used to facilitate business processes without changing the supporting applications, databases or data storage facilities—which enables you to remove the risk without breaking your business.

Securosis Research has developed five laws for data masking:

  1. Masked data should not be reversible.
  2. Masked data should be representative of the original data set. The reason to mask data instead of generating random data is to provide nonsensitive data that still resembles production data. This could include geographic distributions, credit card distributions (perhaps leaving the first four numbers unchanged, but scrambling the rest) or maintaining human readability of names and addresses. The goal is to increase the utility of the information for further analysis or analytics.
  3. Masked data should maintain application and database integrity.
  4. Nonsensitive data should be masked only if it can be used to re-create or tie back to sensitive data. It isn’t necessary to mask everything—only those parts that are deemed sensitive. For example, if you scramble a medical ID but the treatment codes for a record could map back to only one record, you also need to scramble those codes.
  5. Data masking routines must be repeatable. One-off masking is both ineffective and impossible to maintain. Today’s IT environments are highly dynamic, and masking routines need to keep pace.
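
Several of these laws can be illustrated together with a small sketch. The Python below is not taken from Securosis or from any product; it is a minimal illustration, assuming a hypothetical secret key and a hypothetical function name, of a masking routine that keeps the first four digits of a card number (law 2), derives the remaining digits from a keyed one-way hash so the original cannot practically be recovered from the output (law 1), preserves length and separators so downstream formats still hold (law 3), and returns the same result every time for the same input (law 5).

```python
import hashlib
import hmac

# Hypothetical secret for illustration; a real deployment would load it from a key store.
SECRET_KEY = b"example-masking-key"


def mask_card_number(card_number: str, keep: int = 4) -> str:
    """Keep the first `keep` digits; replace the rest with digits from a keyed hash."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    # HMAC-SHA256 is one-way and deterministic for a given key and input.
    digest = hmac.new(SECRET_KEY, digits.encode(), hashlib.sha256).digest()

    masked = []
    seen = 0  # number of digits processed so far
    for ch in card_number:
        if not ch.isdigit():
            masked.append(ch)  # keep separators such as spaces or dashes
            continue
        if seen < keep:
            masked.append(ch)  # leading digits stay unchanged
        else:
            # Map one digest byte to a decimal digit (slightly biased, fine for a sketch).
            masked.append(str(digest[(seen - keep) % len(digest)] % 10))
        seen += 1
    return "".join(masked)


if __name__ == "__main__":
    print(mask_card_number("4111 1111 1111 1111"))
```

Because the replacement digits depend on the key, two environments that share the key produce matching masked values, which helps keep joins across masked tables consistent.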

InfoSphere Optim™ Data Privacy provides a comprehensive set of data masking techniques to support data privacy and compliance requirements. For the first time, you can mask data across platforms and data sources using a standard, repeatable process, so you can protect data privacy without affecting the stability of your applications, with greater ease and unparalleled scalability and performance.

With InfoSphere Optim Data Privacy, you can “mask and move” or “mask in place.” Masking and moving allows you to extract and mask data, and then insert or load the data into one or more destinations. Masking in place allows you to de-identify data and replace existing values.
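
The difference between the two modes can be sketched in a few lines of generic Python. This is not the InfoSphere Optim interface; the function names, the `Row` type and the `mask_name` rule are all hypothetical, and plain in-memory lists stand in for source and destination data stores.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def mask_name(row: Row) -> Row:
    """Hypothetical masking rule: replace the name with a fixed literal."""
    masked = dict(row)  # copy so the caller's row is not modified
    masked["name"] = "MASKED"
    return masked


def mask_and_move(source: List[Row], destinations: List[List[Row]],
                  mask: Callable[[Row], Row]) -> None:
    """Extract rows, mask them, and load them into each destination.

    The source is left untouched.
    """
    for row in source:
        masked = mask(row)
        for dest in destinations:
            dest.append(dict(masked))


def mask_in_place(rows: List[Row], mask: Callable[[Row], Row]) -> None:
    """Replace the existing values in the same store with masked ones."""
    for i, row in enumerate(rows):
        rows[i] = mask(row)


if __name__ == "__main__":
    production = [{"name": "Alice", "city": "Paris"}]
    test_db: List[Row] = []
    mask_and_move(production, [test_db], mask_name)
    print(production, test_db)
```

Mask and move suits populating test or analytics copies while production keeps its real values; mask in place suits a copy that should no longer hold the sensitive values at all.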

InfoSphere Optim Data Privacy provides the most comprehensive set of data masking techniques on the market. The method you use will depend on the type of data you are masking and the result you want to achieve. Out-of-the-box capabilities for specific data types are included, such as random or sequential number generation, string literal substitution, concatenating expressions, arithmetic expressions, lookup values and user-defined functions, to name a few.
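
To make a few of these technique names concrete, here is a generic Python sketch of sequential number generation, string literal substitution, lookup values and a concatenating expression applied to one record. It does not use or imitate the product's functions; the field names, the lookup table and every helper below are assumptions made for illustration only.

```python
import hashlib
from itertools import count

# Hypothetical lookup table of replacement first names.
LOOKUP_FIRST_NAMES = ["Alex", "Jordan", "Sam", "Taylor"]


def lookup_value(original: str, table=LOOKUP_FIRST_NAMES) -> str:
    """Pick a replacement from a lookup table, the same one each time for a given input."""
    index = int(hashlib.sha256(original.encode()).hexdigest(), 16) % len(table)
    return table[index]


def make_sequential(start: int = 1000):
    """Return a function that yields start, start + 1, ... on successive calls."""
    counter = count(start)
    return lambda: next(counter)


def mask_record(record: dict, next_id) -> dict:
    """Build a masked copy of a customer record using four simple techniques."""
    new_id = next_id()
    first = lookup_value(record["first_name"])
    return {
        "customer_id": new_id,                             # sequential number generation
        "first_name": first,                               # lookup value
        "ssn": "XXX-XX-XXXX",                              # string literal substitution
        "email": f"{first.lower()}{new_id}@example.com",   # concatenating expression
    }


if __name__ == "__main__":
    next_id = make_sequential()
    print(mask_record({"first_name": "Maria", "ssn": "123-45-6789"}, next_id))
```

Which technique fits depends on the field: a literal is enough when the value is never read, while a lookup or sequence keeps the data usable for testing and analysis.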

Some examples of situations in which masking techniques can be applied include:

  • Data at rest or data in flight
  • Relational data, flat files and data sets such as IBM IMS™ or VSAM
  • Data being transformed through an extract, transform and load (ETL) tool
  • Data accessed in SQL queries inside a database
  • Data in reports and documents
  • Data inside applications
  • Data moving to, in and from big data platforms such as Hadoop
  • Data used for testing big data environments
  • Data used for analytics applications—for example, PureData Analytics or Teradata
  • Data used for testing data warehouses

What is the benefit?

Focus on data security and privacy to deliver significant value.

  • Prevent data breaches: Avoid disclosure or leakage of sensitive data
  • Ensure data integrity: Prevent unauthorized changes to data, data structures, configuration files and logs
  • Reduce cost of compliance: Automate and centralize controls and simplify the audit review process
  • Protect privacy: Prevent disclosure of sensitive information by masking or de-identifying data in databases, applications and reports, on demand and across the enterprise

Want to learn more? Check out these analyst reports and white papers.

Photo by Brian Snelson