Data Privacy: Protecting the Big Data Fabric

Big data threat and opportunity intensify the need for securing sensitive information

Program Director, Analytics Platform Marketing, IBM

When organizations begin to confront big data, there are typically two separate discussion threads. The first thread is focused on the threat of big data: Will it overwhelm organizational systems? How can organizations keep up with it? Where should they put it? Should they store it at all? And will the good data be lost in the mire of useless data? The second thread homes in on the opportunity of big data: How can organizations tap into new information sources to better understand their customers and the markets where they compete, to identify new markets, and to operate more efficiently?

Both threads are important and valid. They are sometimes intertwined, and before either conversation moves very far, the question of data protection arises: if there is confidential information somewhere in the fabric of big data, how can it be identified and how can it be protected?

Locating the sensitive information

In a very simple environment—say, a small company that is just starting to put its first systems together—determining what sensitive data is and where it resides within those initial systems is a fairly manageable affair. For example, a new business may determine that the identity of its suppliers is sensitive information, whether the specific data set contains names, addresses, phone numbers, or other identifiers. A public sector organization, on the other hand, may be required to share its supplier list openly, but may also need to protect the identity of citizens receiving specific services.

The location of the sensitive information could be simple in this start-up example, if all supplier data resides within a single database that has properly labeled fields. Protecting client phone numbers, for example, would merely require protecting the database Phone field—or would it? As part of its daily operations, the company may exchange messages with clients, and those messages could include references to phone numbers. There could be phone numbers in forms stored as PDF files or in images. So even in a very simple example in which the data is not very big, finding the sensitive data may not be as straightforward as it seems.
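The point that phone numbers can hide outside a labeled Phone field can be illustrated with a small scan over free text. The following is only a minimal sketch: the pattern covers one North American style of phone number, the source names are hypothetical, and PDF files or images would first need text extraction or OCR, which is not shown.

```python
import re

# Assumed pattern: (555) 010-9999 or 555-867-5309 style numbers only.
# Real discovery tools use far broader, locale-aware rules.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring of text that looks like a phone number."""
    return PHONE_RE.findall(text)


def scan_sources(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each source name to the phone-like strings found in it."""
    hits = {}
    for name, text in sources.items():
        found = find_phone_numbers(text)
        if found:
            hits[name] = found
    return hits


# Hypothetical sources: one labeled database field and two free-text items.
sources = {
    "crm.phone_field": "555-867-5309",
    "email_0412.txt": "Please call me back at (555) 010-9999 after lunch.",
    "newsletter.txt": "Our spring catalog ships next week.",
}
hits = scan_sources(sources)
```

In this sketch the email message is flagged alongside the database field, which is the situation the paragraph above warns about.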

As organizations start to handle big data, they sometimes look at new, external data sources such as comments in social media. But more often they begin by getting a firm grip on data that already exists within the enterprise: the big data that lives in files, images, or content repositories, as well as in structured databases. New to the mix is content that lives on Apache Hadoop platforms. Determining where sensitive data resides across the many traditional and nontraditional data sources and repositories is important, and it is equally important to safeguard that data wherever it exists.

There are also multiple dimensions to determining where sensitive data exists. In addition to the different types of content repositories already mentioned, another dimension is the production-nonproduction axis. To most people who think about it, sensitive data in production systems should obviously be protected. Protecting sensitive data in nonproduction systems may be less intuitive, but isn’t a phone number just as private when it’s in a test or QA system as it is when it’s in production? Should IT staff working in those systems have access to employee salaries or patient medical records? Clearly the answer is “no.” Sensitive data in those nonproduction systems needs protection.

Yet another way of looking at sensitive data is as part of data relationships. A particular piece of information on its own may not be sensitive, but when it is combined with other data, the combination becomes sensitive. When previously separate data elements are combined in a new Hadoop environment, a sensitive compound may be created, increasing the risk of improperly exposed data.
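One way to make the idea of a sensitive compound concrete is to count how many records share each combination of values. The sketch below uses made-up records and column names; the smallest group size it reports corresponds to the k in the well-known k-anonymity measure, and a value of 1 means at least one record is unique on that combination.

```python
from collections import Counter


def smallest_group(records, columns):
    """Size of the smallest group of records sharing the same values in columns."""
    counts = Counter(tuple(r[c] for c in columns) for r in records)
    return min(counts.values())


# Hypothetical combined data set; none of these fields is a name or an ID.
records = [
    {"zip": "10001", "birth_year": 1980, "gender": "F"},
    {"zip": "10001", "birth_year": 1975, "gender": "M"},
    {"zip": "10002", "birth_year": 1980, "gender": "M"},
    {"zip": "10002", "birth_year": 1975, "gender": "F"},
]

k_zip = smallest_group(records, ["zip"])
k_combined = smallest_group(records, ["zip", "birth_year", "gender"])
```

Here every single column is shared by at least two records, yet the three columns together single out each record, so the combination deserves protection even though no individual field looks sensitive.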

Protecting sensitive big data

There are a few ways of protecting sensitive data. First, there is protection by securing access to the data repository itself: defining who is authorized to access data, controlling access, and then monitoring access and flagging activity that might be suspicious. Suspicious activity could mean, for example, that someone authorized to access certain data has done so but has spent an unusually long time doing it.
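The monitoring step can be sketched as a comparison between a session and the same user's past behavior. This is an illustrative rule only: the user names, the durations, and the threshold of three times the median are all assumptions, and production activity monitoring relies on much richer signals.

```python
from statistics import median


def flag_unusual_sessions(history, sessions, factor=3.0):
    """Return sessions lasting more than factor times the user's median past duration."""
    flagged = []
    for user, minutes in sessions:
        past = history.get(user)
        # Users without history are skipped here; a real system would
        # likely raise a separate alert for them.
        if past and minutes > factor * median(past):
            flagged.append((user, minutes))
    return flagged


# Hypothetical past session lengths in minutes, per authorized user.
history = {"alice": [10, 12, 9, 11], "bob": [30, 45, 40]}
# Hypothetical sessions observed today.
sessions = [("alice", 55), ("bob", 50), ("carol", 20)]

flagged = flag_unusual_sessions(history, sessions)
```

Both alice and bob are authorized, but only the session that is far outside its owner's usual pattern is flagged for review.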

Next, there is protection by de-identifying the sensitive data (see figure). In a big data world, having this capability is important, not only for data in traditional databases but also for data in Hadoop. The objective is to help reduce or eliminate the risk of use or exposure of the sensitive data, while maintaining the utility of the data for analytics and also for testing purposes in nonproduction environments.

Figure: Sensitive data protection through de-identification

One data de-identification method is data masking—the transformation of the confidential information into values that are realistic but fictionalized. The data is protected, but the context, format, validation rules, and business value are maintained. Masking can take place in the various data sources or on the Hadoop platform where the data is moved.
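As a rough illustration of masking that keeps format while fictionalizing values, the sketch below replaces each digit of a phone number with a digit derived from a keyed hash. It is an assumption-laden example, not the method used by any particular product: the key is a placeholder, the digit choice is slightly biased, and it is not a vetted format-preserving encryption scheme.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def mask_digits(value: str, key: bytes) -> str:
    """Replace each ASCII digit with a key-derived digit, keeping other characters."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if ch in DIGITS:
            # Take the next digest byte and reduce it to a single digit.
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)  # punctuation and spacing keep the original format
    return "".join(out)


# Placeholder key for the example; a real key would be managed securely.
key = b"demo-key"
original = "(555) 010-9999"
masked = mask_digits(original, key)
```

Because the same input and key always produce the same masked value, records masked this way can still be joined on the masked field, which helps preserve the data's utility for analytics and testing.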

Another method of de-identification is redaction—a concept familiar to anyone who has seen a news broadcast showing a document with key words apparently obliterated with a dark marker. Rather than converting the sensitive data to something realistic but fictionalized, redaction either deletes or hides the information. Automated redaction—which is safer and more efficient than the manual marker approach—is especially appropriate for images, forms, and other unstructured big data that could be either intentionally or accidentally disclosed.
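Automated redaction of text can be sketched in a few lines. The example below hides anything matching an assumed North American phone number pattern behind a fixed marker; the pattern and the marker are illustrative choices, and redacting images or scanned forms would require additional steps that are not shown.

```python
import re

# Assumed pattern: (555) 010-9999 or 555-867-5309 style numbers only.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)


def redact(text: str, pattern: re.Pattern = PHONE_RE, mark: str = "[REDACTED]") -> str:
    """Replace every match of pattern in text with a fixed marker."""
    return pattern.sub(mark, text)


message = "Call (555) 010-9999 or 555-867-5309 today."
redacted = redact(message)
```

Unlike masking, the result makes no attempt to look realistic: the sensitive value is simply gone, and a fixed-width marker avoids revealing even the length of what was removed.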

Taking a comprehensive approach to big data protection

IBM® InfoSphere® Data Privacy for Hadoop software offers a holistic approach to protecting big data. This solution incorporates the following comprehensive set of capabilities for protecting big data in a Hadoop environment:

  • Sensitive data definition
  • Discovery and classification
  • Masking and redaction
  • Data activity monitoring

InfoSphere Data Privacy for Hadoop helps organizations secure and protect big data, incorporate data in Hadoop repositories within compliance initiatives focused on data protection, and share information from third parties safely for analytics purposes, all without increasing the risk of exposure or noncompliance. That matters because large data breaches become front-page news and damage the corporate brand, while smaller data leaks that never make the news are happening every day, and the cost of cleanup and remediation increases the longer they go undetected.

How is your organization planning to protect its big data? Please share any thoughts or questions in the comments.