Controlled Explosion: Keeping Big Data Contained with Security, Governance and Information Lifecycle Management

Big Data Evangelist, IBM

Big data is exploding all around us. But, like the Big Bang from which the universe sprang, your enterprise’s adoption of big data needn’t sprawl into pure chaos. Rather, big data can and should be harnessed and contained within the time-space coordinate system known as enterprise controls.

Big data is a complex, tricky thing to control as a unified business resource. The more complex and heterogeneous your big-data environment, the more difficult it becomes to maintain tight oversight and enforcement. Life-cycle controls, driven by explicit policies, can keep your big data environment from degenerating into an unmanageable mess and becoming a security and compliance vulnerability to boot.

Yes, you can control big data in a coherent manner, just as you oversee data and analytics investments at smaller scales. Doing so demands that you thoroughly reassess your big-data infrastructure, tooling, practices, staffing, and skillsets in the following key areas:

Security controls: These enforce policies of authentication, entitlement, encryption, time-stamping, non-repudiation, privacy, monitoring, auditing, and protection on all access, usage, manipulation, and other information management functions.

Governance controls: These enforce policies of discovery, profiling, definition, versioning, manipulation, cleansing, augmentation, promotion, usage, and compliance of official system-of-record data, metadata, and analytic models.

Information lifecycle management (ILM) controls: These enforce policies on the creation, updating, deleting, storage, retention, classification, tagging, distribution, discovery, utilization, preservation, purging, and archiving of information from cradle to grave.
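Controls like these are only enforceable when the underlying policies are explicit and machine-readable rather than buried in tribal knowledge. As a minimal sketch of what an explicit ILM policy might look like (the class, fields, and thresholds here are hypothetical illustrations, not any product’s actual schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LifecyclePolicy:
    """Hypothetical ILM policy: how long data stays live, then archived."""
    classification: str   # e.g. "system-of-record" vs. "disposable"
    retention_days: int   # keep on the primary platform this long
    archive_days: int     # then keep in cheaper archive storage this long

def lifecycle_action(policy: LifecyclePolicy, created: date, today: date) -> str:
    """Decide the disposition of a data set of a given age under a policy."""
    age = (today - created).days
    if age <= policy.retention_days:
        return "retain"
    if age <= policy.retention_days + policy.archive_days:
        return "archive"
    return "purge"

# System-of-record data: one year live, roughly seven years archived.
sor = LifecyclePolicy("system-of-record", retention_days=365, archive_days=2555)
print(lifecycle_action(sor, date(2010, 1, 1), date(2013, 1, 1)))  # archive
```

The point of the sketch is that cradle-to-grave decisions (retain, archive, purge) fall out mechanically once the policy is declared, which is what makes the controls auditable.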

If you’re an established data professional, this is probably all motherhood and apple pie to you. What you want to know is whether big data has substantially altered best practices in any of these areas. In fact, it has. Enterprise-level controls – implemented through tools, infrastructure, and practices – are useless if they don’t keep pace with the scale, complexity, and usage patterns of the resource being managed. Where security, governance, and ILM are concerned, what’s new under big data are the following:

New platforms: You may have mature security, governance, and ILM in your established enterprise data warehouse (DW), online transaction processing (OLTP), and other databases. But big data has probably brought a menagerie of new platforms into your computing environment, including Hadoop, NoSQL, in-memory, and graph databases. The chance that your existing tools work out of the box with any or all of these new platforms is slim, and, to the extent that you’re doing big data in a public cloud, you may be required to use whatever security, governance, and ILM features – strong, weak, or middling – are native to that environment. You might be in luck if your big data platform vendor is also your incumbent DW/OLTP DBMS provider (IBM, perhaps) and if, either by itself or in conjunction with its independent software vendor partners, it has upgraded your security and other tools to work seamlessly with both old and new platforms. But don’t count your proverbial chickens on that matter. In the process of amassing your end-to-end big data environment, you will also need to sweat over the nitty-gritty of integrating the disparate security tools, policies, and practices associated with each new platform you adopt.

New domains: Many of your traditional RDBMS-based data platforms are where you manage official system-of-record data on customers, finances, transactions, and other domains that demand stringent security, governance, and ILM. But these system-of-record data domains may have very little presence on your newer big-data platforms, many of which focus instead on handling unstructured data from social, event, sensor, clickstream, geospatial, and other new sources. The key difference from “small data” is that many of the newer data domains are disposable to varying degrees and are not linked directly to any “single version of the truth” records. Instead, your data scientists and machine-learning algorithms typically distill the unstructured feeds for patterns and subsequently discard the acquired source data (which quickly become too voluminous to retain cost-effectively anyway). Consequently, you probably won’t need to apply much if any security, governance, and ILM to many new unstructured big-data domains.
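The “distill, then discard” pattern described above amounts to streaming over the raw feed once, persisting only the extracted patterns, and letting the voluminous source records go. A minimal sketch (the event fields are invented for illustration):

```python
from collections import Counter

def distill_clickstream(events):
    """Reduce a raw clickstream feed to retained summary statistics.

    Only the distilled counts are kept; the raw events, too voluminous
    to retain cost-effectively, are discarded after this single pass.
    """
    page_views = Counter()
    for event in events:            # stream once; never store raw events
        page_views[event["page"]] += 1
    return dict(page_views)

raw_feed = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]
print(distill_clickstream(raw_feed))  # {'/home': 2, '/pricing': 1}
```

Because only the small distilled output survives, it is the summary, not the discarded feed, that needs system-of-record-grade governance.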

New scales: Where life-cycle controls are concerned, big data’s “bigness” is not necessarily a vulnerability. There is no inherent trade-off between bigness – the volume, velocity, or variety of the data set – and your ability to monitor, manage, and control it. To the extent you’ve scaled out your big-data initiative on a mature EDW platform with massively parallel processing (MPP), you can leverage the security, governance, and ILM tooling for that platform as you zoom into “3 Vs” territory. You might even find big data to be a boon, facilitating some control-relevant workloads – such as real-time encryption, cleansing, and classification – that proved less feasible on “small-data” platforms. And in fact, the extreme scalability and agility of big-data platforms will prove essential for enforcing tight security, governance, and ILM on new unstructured data types, as these needs emerge.

New stakes: The ramped-up business stakes driving the big-data revolution go well beyond the “small-data” world of operational business intelligence. Many companies have pinned their marketing, customer engagement, dynamic pricing optimization, and other key initiatives on decision automation, next best action, and other real-time, always-on business infrastructures that demand big data. At the heart of many such initiatives rest enterprise data-science programs, under which teams of quants and domain specialists produce a never-ending stream of new statistical, predictive, segmentation, behavioral, and other advanced analytic models. Comprehensive control over this precious intellectual-property asset demands tight model governance and security. Key controls include model check-in/check-out, change tracking, version control, and collaborative development and validation. Automation is imperative in this regard. As big data ramps up, your data scientists will produce increasing volumes of models in a wide variety of tools and languages, will jam more variables into the models, will score the models with more data from more sources, will update the models on more stringent schedules, and will use them to drive optimizations in more business processes.
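The model-governance controls listed above – check-in, versioning, and a change audit trail – can be sketched as a minimal registry. All class and method names here are illustrative assumptions, not any vendor’s API:

```python
import hashlib
from datetime import datetime, timezone

class ModelRegistry:
    """Hypothetical registry: versioned model check-in with an audit trail."""

    def __init__(self):
        self._versions = {}   # model name -> list of version records

    def check_in(self, name, artifact: bytes, author, note):
        """Record a new version of a model artifact; return its version number."""
        versions = self._versions.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "sha256": hashlib.sha256(artifact).hexdigest(),  # integrity check
            "author": author,
            "note": note,                                    # change tracking
            "checked_in": datetime.now(timezone.utc).isoformat(),
        }
        versions.append(record)
        return record["version"]

    def history(self, name):
        """Return the full change history for a model."""
        return list(self._versions.get(name, []))

reg = ModelRegistry()
reg.check_in("churn-model", b"coefficients-v1", "quant1", "initial fit")
v = reg.check_in("churn-model", b"coefficients-v2", "quant2", "added variables")
print(v, len(reg.history("churn-model")))  # 2 2
```

Even a sketch this small shows why automation matters: every check-in yields an immutable, attributable record without any manual bookkeeping by the data scientist.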

New mandates: Big data has become a culturally sensitive topic on many fronts, most notably among privacy watchdogs who point to its potential for use in intrusive target marketing and other abuses. As storage prices continue to plummet and more data is archived in big-data clouds, those huge data stores become ever more likely to be accessed for compliance, surveillance, e-discovery, intrusion detection, anti-fraud, and other applications dictated under new legal and regulatory mandates in many countries. Besides, even if no new big-data-specific mandates hit the books, litigators will seek to subpoena these massive archives first when they’re looking for evidence of corporate malfeasance. Consequently, in most industries – regulated and otherwise – big-data archives will increasingly be managed under stringent mandates for security, governance, and ILM. This will ensure that the historical record is preserved in perpetuity, free from tampering and unauthorized access.
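One widely used technique for keeping an archived historical record demonstrably free from tampering is a hash chain: each archived record commits to the hash of its predecessor, so any later alteration breaks every subsequent link. A sketch of the idea, not tied to any particular archive product:

```python
import hashlib

def chain_records(records):
    """Build a tamper-evident chain: each entry hashes its payload plus
    the previous entry's hash, so altering any record invalidates all
    later links."""
    chain, prev = [], "0" * 64                     # genesis hash
    for payload in records:
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        chain.append({"payload": payload, "hash": digest})
        prev = digest
    return chain

def verify_chain(chain):
    """Recompute every link; any edit to any payload is detected."""
    prev = "0" * 64
    for entry in chain:
        expected = hashlib.sha256((prev + entry["payload"]).encode()).hexdigest()
        if entry["hash"] != expected:
            return False
        prev = expected
    return True

archive = chain_records(["txn-001", "txn-002", "txn-003"])
print(verify_chain(archive))          # True
archive[1]["payload"] = "txn-FORGED"  # tamper with the middle record
print(verify_chain(archive))          # False
```

This is the same basic construction that underlies write-once audit logs: preservation “in perpetuity” is only credible if tampering is detectable after the fact.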

The end-to-end enterprise-control infrastructure for big data is still emerging. One wildcard is whether solution providers, users, and other interested parties will step forward to define security, governance, and ILM standards that span the exploding cosmos of big-data platforms, approaches, and applications, old and new.

To find out more about managing big data, join IBM for a free event: