Top tips for securing big data environments

Manager of Portfolio Strategy, IBM

Why big data doesn’t have to mean big security challenges

What is big data?

Big data spans four dimensions: volume, velocity, variety and veracity.

  • Volume: Every day 2.5 quintillion bytes of data are generated from new and traditional sources including climate sensors, social media sites, digital pictures and videos, purchase transaction records, cell phone GPS signals and more.
  • Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as detecting fraud, a real-time response is required.
  • Variety: Big data is any type of data—structured and unstructured—such as text, sensor data, audio, video, clickstreams, log files and more.
  • Veracity: 1 in 3 business leaders don’t trust the information they use to make decisions. How can you act upon information if you don’t trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grow.

Big data environments help organizations process, analyze and derive maximum value from these new data formats, as well as from traditional structured formats, either in real time or for future use, so they can make more informed decisions cost-effectively. Hadoop-based systems, IBM Netezza and data warehouses are currently being used to manage big data.

In terms of security, there are two distinct areas to consider.

  1. Security from big data
  2. Security for big data

Let’s discuss “security from big data” first. Big data environments provide the ability to harness data for real-time decision making. When it comes to stopping cyberattacks, real time is the difference between safety and disaster.

Examples of big data security projects include:

  • Scrutinizing 5 million trade events created each day to identify potential fraud
  • Monitoring hundreds of live video feeds from surveillance cameras to identify security threats
  • Catching unauthorized data changes (like log doctoring) as they happen
  • Turning on or turning off security policies (like data masking) based on data access patterns
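
To make the log doctoring example concrete, here is a minimal Python sketch of one common way to make after-the-fact edits detectable: chaining each log entry to the digest of the entry before it. This is an illustration only, not a description of how any particular product works. The function names, the in-memory list and the sample messages are all assumptions made up for the example.

```python
import hashlib

GENESIS = "0" * 64  # placeholder digest that precedes the first entry


def chain_digest(prev_digest: str, message: str) -> str:
    """Digest of one entry, bound to the digest of the entry before it."""
    return hashlib.sha256((prev_digest + message).encode("utf-8")).hexdigest()


def append_entry(log: list, message: str) -> None:
    """Append a message together with its chained digest."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"message": message, "digest": chain_digest(prev, message)})


def first_tampered_index(log: list):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for i, entry in enumerate(log):
        if entry["digest"] != chain_digest(prev, entry["message"]):
            return i
        prev = entry["digest"]
    return None


# Hypothetical activity log (sample data, not from a real system).
log = []
for msg in ["login alice", "read payroll alice", "logout alice"]:
    append_entry(log, msg)

assert first_tampered_index(log) is None

# Simulate doctoring: someone rewrites the second entry after the fact.
log[1]["message"] = "read cafeteria_menu alice"
print(first_tampered_index(log))  # index of the altered entry
```

A real deployment would also need to protect the digests themselves, for example by shipping them to a separate system, because an attacker who can rewrite the whole chain can recompute it.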

A recent blog post by Gartner’s Neil MacDonald warns that SIEM (security information and event management) tools alone aren’t the answer to real-time security analytics. Instead, SIEM tools should be combined with deep data mining capabilities to support near real-time requirements. The end goal is to improve security decision-making based on prioritized, actionable insight and to identify when an advanced targeted attack has bypassed traditional security controls and penetrated the organization.

Now let’s discuss “security for big data.”

As big data environments ingest more data, organizations will face significant risks and threats to the repositories containing this data. A paradox exists: organizations are generating more data now than at any other point in human history, and yet they don’t understand its relevance, its context or how to protect it.

Big data environments create significant opportunities. However, organizations must come to terms with the security challenges they introduce, for example:

  • Schema-less distributed environments, where data from multiple sources can be joined and aggregated in arbitrary ways, make it challenging to establish access controls
  • The nature of big data–high volume, variety and velocity–makes it difficult to ensure data integrity
  • Aggregation of data from across the enterprise means sensitive data is concentrated in a single repository
  • Big data repositories present another data source to secure, and most existing data security and compliance approaches will not scale

Big data environments allow organizations to aggregate more and more data, much of which is financial data, personal data, intellectual property or other types of sensitive data. Most of the data is subject to compliance regulations such as the Sarbanes-Oxley Act (SOX), the Health Insurance Portability and Accountability Act (HIPAA), the Payment Card Industry Data Security Standard (PCI-DSS), the Federal Information Security Management Act (FISMA) and the EU Data Privacy Directive. Sensitive data is also a primary target for hackers.

The risk of lax data security is well known and documented. Corporations and their officers may face fines from $5,000 USD to $1,000,000 USD per day, and possible jail time if data is misused. According to the 2011 Cost of Data Breach Study conducted by the Ponemon Institute (published March 2012), the average organizational cost of a data breach is $5.5M USD.

So what needs to be protected? Many organizations deploy Hadoop alongside their existing database systems, allowing them to combine traditional structured data and new unstructured data sets in powerful ways. Hadoop consists of reliable data storage using the Hadoop Distributed File System (HDFS), a column-oriented database management system called HBase that runs on top of HDFS, and a high-performance parallel data processing technique called MapReduce. Hadoop environments need to be protected using the same rigorous security strategies applied to traditional systems such as databases and data warehouses.

Security strategies include:

  • Sensitive data discovery and classification
  • Data access and change controls
  • Real-time data activity monitoring and auditing
  • Data protection, such as masking or encryption
  • Data loss prevention
  • Vulnerability management
  • Compliance management
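
As a rough illustration of how discovery and masking fit together, the Python sketch below scans free text for digit runs that look like payment card numbers, confirms each candidate with the Luhn checksum and masks all but the last four digits. It is a minimal example under simplifying assumptions, not a complete discovery tool. The regular expression, the function names and the sample strings are invented for the example, and the card number shown is a widely used test number.

```python
import re

# Candidate: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_cards(text: str) -> str:
    """Mask every Luhn-valid candidate, keeping only its last four digits."""
    def _mask(match):
        digits = re.sub(r"\D", "", match.group())
        if not luhn_valid(digits):
            return match.group()  # not a plausible card number, leave as-is
        return "*" * (len(digits) - 4) + digits[-4:]

    return CARD_CANDIDATE.sub(_mask, text)


record = "Order 1001: card 4111 1111 1111 1111, ship 2 units"
print(mask_cards(record))
```

The checksum step matters because big data sources are full of long digit runs (order IDs, timestamps, sensor readings) that are not card numbers; masking only Luhn-valid candidates reduces false positives, although it does not eliminate them.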

Organizations need to be able to answer questions like:

  1. Who is running specific big data requests?
  2. Are users authorized to make requests?
  3. What analytics requests are users running?
  4. Are users trying to download sensitive data, or is the request part of a job requirement (for example, a marketing query)?
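
One way to picture these questions is as checks applied to each audited request. The Python sketch below is purely illustrative: the `AuditEvent` record, the `GRANTS` and `SENSITIVE` tables and the user and data set names are all hypothetical, and a real environment would draw this information from its own monitoring and entitlement systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    user: str        # who ran the request
    job: str         # what analytics request was run
    datasets: tuple  # data sets the request read


# Hypothetical policy data: which data sets each user may read,
# and which data sets are classified as sensitive.
GRANTS = {"alice": {"clickstream", "campaigns"}, "bob": {"clickstream"}}
SENSITIVE = {"payroll", "card_numbers"}


def review(event: AuditEvent) -> list:
    """Return findings for one request; an empty list means nothing to flag."""
    findings = []
    granted = GRANTS.get(event.user, set())
    for ds in event.datasets:
        if ds not in granted:
            findings.append(f"unauthorized:{ds}")
        if ds in SENSITIVE:
            findings.append(f"sensitive:{ds}")
    return findings


events = [
    AuditEvent("alice", "marketing query", ("clickstream", "campaigns")),
    AuditEvent("bob", "bulk export", ("clickstream", "payroll")),
]

for event in events:
    print(event.user, event.job, review(event))
```

In this sample, the marketing query produces no findings, while the bulk export is flagged both for reading a data set the user has no grant for and for touching sensitive data.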

The rush for big data benefits is not an excuse for overlooking security.

Conclusion: Build security into big data environments

Organizations don’t have to feel overwhelmed when it comes to securing big data environments. The same security fundamentals used for securing databases can be applied to big data environments. This is about preventing leaks from databases, data warehouses and Hadoop-based systems, ensuring the integrity of information and automating compliance controls.

Want to learn more? Read this recent research perspective by Ventana Research and listen to this Ventana podcast.