How to protect against sensitive data leakage

Manager of Portfolio Strategy, IBM

No organization can escape security threats. While social business, mobility and cloud solutions increase opportunities for commerce and collaboration, they also create the perfect hiding place for criminals: big data.

A typical large utility, for example, gets pinged one million times every day by malicious parties. That sounds like a lot. But these attacks rarely get noticed because that same utility processes more than one million messages per hour, offering plenty of cover.


Current approaches to cyber security combat only known threats—they aren’t as good at finding new associations or uncovering patterns. As a result, organizations are opening the door to advanced persistent threats (APTs), spear phishing, hacktivism and more.

Within the noise of big data, organizations need sophisticated real-time analytics to find a relatively weak signal. Without deep insight, most threats can’t be detected. Sophisticated attackers are motivated, patient, persistent and even state sponsored. The challenge organizations face is how to extend security strategies to find and neutralize these threats at a time of rising risk and increasing complexity.

Imagine you are a newly hired senior data security analyst and your boss says, “Tell me who and what accesses personally identifiable information (PII) across the enterprise every second of the day. I need a report next week. We need to make sure we know how PII is used and accessed in our business.” This is a tall order in any industry, but there is no need to panic: big data technologies designed for security analytics can help you understand how PII is used across the enterprise.

The goal of a data security strategy is to answer the “who, what, when, how and where” of sensitive data access. The core areas of the solution include:

  • Data acquisition: Capture log data without disruption to the user
  • Data filtering and storage: Enrich log data with real-time streaming data
  • Data analysis: Find patterns
  • Reporting: Display results transparently and enable interested parties, including auditors, to examine and explore results
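
As a rough illustration of the “who, what, when, how and where” questions, here is a minimal Python sketch of an access record passing through the four areas above. The pipe-delimited log format, the field names, the resource names and every function name are assumptions made up for this example; they do not come from any IBM product.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator, List, Set


@dataclass(frozen=True)
class AccessEvent:
    who: str        # user or service account
    what: str       # resource that was touched
    when: datetime  # time of the access
    how: str        # action, such as READ or UPDATE
    where: str      # source host or IP address


def acquire(lines: Iterable[str]) -> Iterator[AccessEvent]:
    """Data acquisition: parse hypothetical 'time|user|action|resource|source' lines."""
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) != 5:
            continue  # skip anything that is not a well-formed record
        ts, user, action, resource, source = parts
        yield AccessEvent(user, resource, datetime.fromisoformat(ts), action, source)


def keep_pii(events: Iterable[AccessEvent], pii_resources: Set[str]) -> Iterator[AccessEvent]:
    """Data filtering: keep only events that touch a resource flagged as PII."""
    return (e for e in events if e.what in pii_resources)


def analyze(events: Iterable[AccessEvent]) -> Counter:
    """Data analysis: count accesses per (who, what) pair."""
    return Counter((e.who, e.what) for e in events)


def report(counts: Counter) -> List[str]:
    """Reporting: one readable line per pair, most frequent first."""
    return [f"{who} accessed {what} {n} time(s)"
            for (who, what), n in counts.most_common()]


SAMPLE_LOG = [
    "2024-01-01T09:00:00|alice|READ|customers.ssn|10.0.0.5",
    "2024-01-01T09:00:01|bob|READ|products.price|10.0.0.7",
    "2024-01-01T09:00:02|alice|READ|customers.ssn|10.0.0.5",
    "not a log line",
]
REPORT = report(analyze(keep_pii(acquire(SAMPLE_LOG), {"customers.ssn"})))
```

Each stage is a generator or a small function, so records flow through one at a time rather than being collected first. A production streaming platform would do the same thing at far higher volume.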

A real-time analytic solution, such as InfoSphere Streams, can be used for log parsing and filtering, for example detecting boundaries between individual HTTP messages and filtering out irrelevant data. InfoSphere Streams can then be used to find PII by discovering patterns and matching them (after normalization) against known PII. For example, to properly identify PII, we need to know the different representations of John Smith, such as J. Smith, John M. Smith or Smith, John. This process should be lightning fast, because if a pattern is discovered too late, the data might already have leaked.
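
To make the normalization step concrete, here is a minimal Python sketch that reduces the John Smith variants to one comparable key. The rule used (first initial plus last name) and the function names are assumptions chosen for illustration; this is not how InfoSphere Streams implements matching.

```python
import re
from typing import Iterable, List, Set, Tuple

NameKey = Tuple[str, str]  # (first initial, last name), both lowercase


def normalize_name(raw: str) -> NameKey:
    """Reduce a name to a (first initial, last name) key."""
    text = raw.strip()
    if "," in text:
        # "Smith, John" -> "John Smith"
        last, _, first = text.partition(",")
        text = f"{first.strip()} {last.strip()}"
    tokens = re.findall(r"[A-Za-z]+", text)
    if len(tokens) < 2:
        # A single token is treated as a last name with no initial.
        return ("", tokens[0].lower() if tokens else "")
    return (tokens[0][0].lower(), tokens[-1].lower())


def find_known_pii(candidates: Iterable[str], known: Set[NameKey]) -> List[str]:
    """Return the candidate strings whose normalized key matches known PII."""
    return [c for c in candidates if normalize_name(c) in known]


KNOWN = {normalize_name("John Smith")}
MATCHES = find_known_pii(
    ["J. Smith", "John M. Smith", "Smith, John", "Mary Jones"], KNOWN
)
```

Matching on an initial is deliberately loose: it also matches Jane Smith. A real system would need stricter rules or extra attributes to keep false positives down.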

Also, it’s important to truly understand the difference between legitimate PII access and suspicious PII access. We want our data security analysts to respond to the real threats. No organization has time to chase false positives and no organization can afford a false negative.

After the data is captured, filtered and understood, a landing zone is required to store and merge all data. This could be a Hadoop platform such as InfoSphere BigInsights.

Enterprise reporting, analysis and exploration will help both auditors and business users understand access patterns and where danger lies. A platform such as Watson Explorer would be helpful here.


To adapt to modern security threats, organizations need to move beyond the traditional style of “analyze first and then respond.” The trouble with this approach is that you need to start with assumptions and rules built from previous or known data leakage scenarios. In the era of big data, we need to collect and process all information in a non-invasive way and analyze it continuously in real time, giving the data security analyst the ability to explore. We can’t bring the required data to the analyst; we need to bring the analytics to the data. In other words, we need to move from complete analysis after collecting data to continuous, dynamic and explorative analysis in real time.

Want to learn more? Check out these resources: