
Implement Data Masking to Protect Sensitive Data: Part 2

Experience enterprise-scale data masking that complies with enterprise information security policies

Data masking implementations can vary depending on specific scenarios. The first part of this series introduces requirements and approaches to data masking, and this concluding installment takes a look at enterprise-scale implementation of data masking.

Typically, masking requirements should comply with enterprise information security policies. Consider a financial institution that was segmented into two banking entities—an original one and a new one—as a result of regulatory and business directives. All IT infrastructure and applications had to be replicated, and data pertaining to selected branches and customers had to be migrated to the new banking entity. The reason for the replication was compliance with a business requirement for creating a mini banking unit with similar offerings and services, but for a specific set of customers and branches. Masking played a key role in this scenario because the data had to be exposed to various vendors involved in the migration process.

Enterprise-scale data masking implementation

As part of its data masking approach, the financial institution identified all the sensitive data elements and then determined a masking strategy for each one. Most of the sensitive data elements required format-preserving encryption for randomization. These elements included name, address lines (excluding the postal code, city, state, and country), email address, IP address, telephone number, healthcare information, and criminal record information. Other elements required special handling, according to the following specifications:

  • Only the last 10 of 16 digits of credit or debit card numbers were masked.
  • Postal codes were replaced with random but valid postal code values.
  • Dates of birth required masking with a random date within a 120-day range, either before or after the actual date of birth.
  • Fixed values such as 1111 or 111 replaced personal identification numbers (PINs) or card code verification (CCV) numbers.
  • Spaces replaced all encryption keys.
  • Attachments, stored as large object (LOB) data, were not extracted.
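
The special-handling rules above can be sketched in a few small functions. This is a minimal illustration only, not the institution's tool: the function names are hypothetical, the list of valid postal codes is a placeholder, and the date rule assumes that "within a 120-day range" means an offset of up to 120 days in either direction.

```python
import random
from datetime import date, timedelta
from typing import Sequence


def mask_card_number(card: str, rng: random.Random) -> str:
    """Keep the first 6 digits and replace the last 10 with random digits."""
    if len(card) != 16 or not card.isdigit():
        raise ValueError("expected a 16-digit card number")
    return card[:6] + "".join(rng.choice("0123456789") for _ in range(10))


def mask_postal_code(rng: random.Random, valid_codes: Sequence[str]) -> str:
    """Pick a random value from a caller-supplied list of valid postal codes."""
    return rng.choice(list(valid_codes))


def mask_date_of_birth(dob: date, rng: random.Random, window_days: int = 120) -> date:
    """Shift the date by 1..window_days days, randomly earlier or later (assumed reading)."""
    offset = rng.randint(1, window_days) * rng.choice((-1, 1))
    return dob + timedelta(days=offset)


def mask_pin(pin: str) -> str:
    """Replace a PIN or CCV with a fixed value of the same length (1111, 111)."""
    return "1" * len(pin)


def mask_encryption_key(key: str) -> str:
    """Replace an encryption key with spaces of the same length."""
    return " " * len(key)
```

Passing a seeded random.Random instance makes a masking run repeatable in tests; a production job would more likely use an unseeded or cryptographically stronger source.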

A generic tool for data masking was developed and deployed in a legacy mainframe environment. The tool took data files and database extracts as input, masked the sensitive data elements, and produced masked output in the same format as the input. The organization used the tool extensively to mask data files before they were transferred out of the production environment to the testing environment. Data masking can be processor intensive, so scheduling masking jobs appropriately in production is important. In this scenario, data files and database extracts from various non-legacy environments were brought into the mainframe environment, where masking jobs were scheduled to run throughout the day, especially during off-peak hours.
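
One way to picture a tool that produces masked output in the same format as its input is a layout-driven pass over fixed-width records. The sketch below is an assumption about how such a tool could work, not a description of the institution's implementation; the field names, offsets, and rules are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout: (field name, start offset, length).
LAYOUT: List[Tuple[str, int, int]] = [
    ("cust_id", 0, 6),
    ("name", 6, 10),
    ("pin", 16, 4),
]

# Hypothetical masking rules; fields without a rule pass through unchanged.
RULES: Dict[str, Callable[[str], str]] = {
    "name": lambda value: "X" * len(value),
    "pin": lambda value: "1" * len(value),
}


def mask_record(record: str,
                layout: List[Tuple[str, int, int]] = LAYOUT,
                rules: Dict[str, Callable[[str], str]] = RULES) -> str:
    """Mask the configured fields in place so the record keeps its length and layout."""
    chars = list(record)
    for name, start, length in layout:
        rule = rules.get(name)
        if rule is None:
            continue
        masked = rule(record[start:start + length])
        if len(masked) != length:
            raise ValueError(f"rule for {name!r} changed the field length")
        chars[start:start + length] = masked
    return "".join(chars)
```

Because every rule must return a value of the same length, downstream programs that read the file by position continue to work on the masked output.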

Specific data masking challenges

Enterprise-wide data masking presents particular challenges in certain special cases. For the financial institution, these cases included attachments or LOB data, name data, PIN and CCV or CCV2 data, and some data formats.

Many enterprises that use data masking should consider masking customer images. The financial institution handled this form of masking by not extracting the images at all. When data was loaded to nonproduction environments, sample images were loaded in place of the original images. Similar methods were applied to most LOB files, in formats such as PDF, DOC, text, and XML, unless testing required reading the actual content of the LOB data. XML data, in particular, proved to be tricky because the actual content was sometimes required. In those cases, the XML was first parsed, and the parsed data was masked as necessary. The masked data was then converted back to XML before it was moved to the nonproduction environment.
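
The parse, mask, and re-serialize sequence for XML can be sketched with Python's standard xml.etree.ElementTree module. The set of sensitive tag names and the character-replacement rule are hypothetical stand-ins for whatever masking strategy applies to each element.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of element names that hold sensitive values.
SENSITIVE_TAGS = {"name", "email"}


def mask_text(value: str) -> str:
    """Placeholder rule: replace every non-space character with 'X'."""
    return "".join(ch if ch.isspace() else "X" for ch in value)


def mask_xml(xml_text: str, sensitive_tags=SENSITIVE_TAGS) -> str:
    """Parse the XML, mask the text of sensitive elements, and serialize it again."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in sensitive_tags and elem.text:
            elem.text = mask_text(elem.text)
    return ET.tostring(root, encoding="unicode")
```

This sketch masks element text only. Sensitive values held in attributes, or documents that use namespaces, would need additional handling.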

Although PIN and CCV or CCV2 numbers could have been scrambled using random characters, replacing them with generic values proved to be more appropriate for usability. Name data, on the other hand, was well suited for random scrambling. However, a name such as John Lee yields different results when it is scrambled as a single string than when it is scrambled as two separate strings. As a result, names were separated into first name, middle name, and last name before being masked.
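
The difference between the two approaches can be illustrated with a simple character shuffle. Shuffling is used here only as an assumed stand-in for the scrambling method, which the scenario does not specify; the function names are hypothetical.

```python
import random


def scramble(text: str, rng: random.Random) -> str:
    """Shuffle the characters of a single string."""
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)


def scramble_name_parts(full_name: str, rng: random.Random) -> str:
    """Split the name on whitespace and scramble each part separately."""
    return " ".join(scramble(part, rng) for part in full_name.split())
```

Scrambling "John Lee" as one string treats the space as just another character, so the space can land anywhere and letters can cross between the first and last name. Scrambling the parts separately keeps a four-letter first name and a three-letter last name, which is closer to what applications that parse names expect.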

In addition, some data formats required putting measures in place to handle binary, binary-coded decimal, and Unicode data. These data elements were first converted to character format, and then they were masked and converted back to the original format.
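
For binary-coded decimal data, the convert, mask, and convert-back sequence might look like the following sketch. It assumes the common packed-decimal layout in which each byte holds two digit nibbles and the final nibble holds the sign; the function names are hypothetical, and the actual field formats in the scenario are not stated.

```python
from typing import Callable, Tuple


def unpack_bcd(packed: bytes) -> Tuple[str, int]:
    """Return (digit string, sign nibble) from a packed-decimal field."""
    nibbles = []
    for byte in packed:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()  # the last nibble carries the sign
    if any(n > 9 for n in nibbles):
        raise ValueError("invalid digit nibble in packed field")
    return "".join(str(n) for n in nibbles), sign


def pack_bcd(digits: str, sign: int) -> bytes:
    """Pack a digit string and sign nibble back into bytes."""
    nibbles = [int(d) for d in digits] + [sign]
    if len(nibbles) % 2:
        nibbles.insert(0, 0)  # pad to a whole number of bytes
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))


def mask_bcd(packed: bytes, mask_digits: Callable[[str], str]) -> bytes:
    """Convert to characters, apply the masking rule, and convert back."""
    digits, sign = unpack_bcd(packed)
    masked = mask_digits(digits)
    if len(masked) != len(digits) or not masked.isdigit():
        raise ValueError("masking rule must return the same number of digits")
    return pack_bcd(masked, sign)
```

Keeping the sign nibble and the digit count unchanged means the masked field occupies the same bytes as the original, so the record layout is preserved.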

Data integrity

Data masking can facilitate successful outcomes for organizations in a wide range of industries by helping to ensure the integrity of data that may have to be migrated between production and nonproduction systems. Other challenges beyond the scope of this discussion include handling sequence generation, masking an identity column, working with Unicode data, handling special characters, and handling variable data layouts. Conversion between ASCII and EBCDIC can also be required, especially when masking spans multiple environments. These challenges should be addressed strategically with additional processing and effort.
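
As a small illustration of the EBCDIC case, Python's built-in cp037 codec can decode an EBCDIC field to text, apply a masking rule, and encode the result again. The choice of code page 037 is an assumption; real mainframe data may use a different EBCDIC code page, and the masking rule shown is a placeholder.

```python
def mask_letters(text: str) -> str:
    """Placeholder rule: replace every letter with 'X'."""
    return "".join("X" if ch.isalpha() else ch for ch in text)


def mask_ebcdic_field(raw: bytes, mask=mask_letters, codepage: str = "cp037") -> bytes:
    """Decode EBCDIC bytes to text, mask the text, and encode back to EBCDIC."""
    text = raw.decode(codepage)
    masked = mask(text)
    return masked.encode(codepage)
```

Because the rule replaces characters one for one, the masked field has the same byte length as the original.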

Despite its trade-offs, data masking—in addition to protecting sensitive data—offers the capability to preserve complex data relationships that can ensure minimal disruptions to nonproduction environments. Implementations of data masking can be completely in house, as discussed previously, or they can be product based or a mix of in-house and product-based implementations. The key to successful data masking is to ensure policies and strategies are well defined, deployed, and governed as part of any enterprise data management and governance initiative.

