Gauging the Maturity of Big Data Initiatives

Why enterprise data warehousing best practices are a good yardstick for evaluating big data efforts

Big Data Evangelist, IBM

A big data initiative is maturing when the business can rely on its platform, tooling, and operations. But what exactly is maturity where information technologies are concerned?

Fundamentally, it's a measure of whether an IT investment is fully production-ready and fit for enterprise prime-time deployment. Assessing whether your big data initiative is mature involves proving—at the very least—that it has the robustness and bona fides, discussed below, that businesses demand of their core infrastructure.

Is there an existing maturity yardstick against which you can gauge the production-readiness of your big data platform? Yes. Look to your established enterprise data warehousing (EDW) environment, such as one built on IBM® PureData™ System for Analytics or another massively parallel platform. Most likely, this environment is a field-proven operation that meets the criteria for production-readiness through a rich combination of capabilities, tools, and best practices. These elements include:

  • High availability and fault tolerance so your systems run around the clock, 24x7, at five-nines (99.999 percent) availability
  • Cluster, capacity, and mixed-workload management to help monitor and manage all resources and jobs executing on the platform
  • Elastic provisioning features that enable the environment to scale in, up, and out rapidly and cost-effectively
  • Performance optimization capabilities that allow your IT staff to tune the performance of all applications, jobs, and processes to meet expectations and service-level requirements
  • Platform and data virtualization capabilities for administering, accessing, and utilizing all resources via a unified interface
  • Data, metadata, and model governance that defines and enforces controls over the creation, development, promotion, and usage of official system-of-record data, metadata, and analytic models
  • Data security measures that define and enforce controls over authentication, permissions, encryption, time-stamping, nonrepudiation, and other security functions
  • Replication features that facilitate reliable, bidirectional replication of all types of data with high throughput
  • Backup and restore functionality for rapidly and reliably backing up and restoring data
  • Auditing features that log all network, system, application, usage, and other events
  • Archiving functionality that moves older data to offline stores for later on-demand query and retrieval
  • Disaster recovery capabilities that rapidly restore the platform to operational status when disaster strikes and service is interrupted
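
To put a number on the five-nines target in the first item above, here is a minimal sketch of the arithmetic. The function name and the 365-day year are assumptions made for illustration; nothing here comes from a particular product or service-level agreement.

```python
# Illustrative arithmetic only: the function name and the 365-day year
# are assumptions, not part of any vendor's definition of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(nines: int) -> float:
    """Return the yearly downtime allowed by an availability of N nines."""
    unavailability = 10.0 ** -nines  # five nines leaves 0.00001 of the year
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    # Compare three common availability targets side by side.
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes per year")
```

Under these assumptions, five nines leaves a little over five minutes of total downtime per year, which is why the availability, replication, backup, and disaster recovery items in the list tend to be evaluated together rather than one at a time.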

Prepare for a cold slap of reality: the management capabilities that you've taken for granted on your EDW are probably not all available in some of the newer big data platforms you've deployed. Just as sobering is the fact that the management tooling you have almost certainly doesn't span the full hodgepodge of big data platforms you've deployed or are considering. If you've deployed tactical, siloed big data platforms (such as Hadoop, NoSQL, in-memory databases, graph databases, and cloud databases) for distinct applications, you often have to grapple with their siloed management features.

You should concern yourself with whether your chosen big data platforms—individually and as an end-to-end infrastructure—meet the production-readiness criteria.

For sure, Hadoop and other emerging big data platforms are rapidly ramping up the maturity curve. As anybody watching these markets can see, their ecosystems are spawning the requisite capabilities, tools, and best practices by evolving much of what was pioneered for EDWs. But the maturation of these other big data approaches will take several years to come to fruition.

Maturation of best practices in the newer big data niches will come as some approaches succeed in the marketplace and become integral to standard operating procedures of users everywhere. If the industry can accelerate this maturation through standardization, users will be able to standardize their own practices that much faster. By the middle of this decade, we're likely to see a significant, widely recognized body of best practices emerge in Hadoop and in-memory databases, at minimum—reflecting the pace at which users are investing in these approaches and bringing them in line with established EDW management practices.

Look for big data best practices to crystallize first in high availability, database security, data governance, and cluster management that spans multitier topologies including Hadoop, EDW, and in-memory nodes. As they do, it will become evident that maturity is coming to the entire big data space and to each niche covered by those best practices.

For the next several years, much of the new product development in the big data arena will be geared toward playing catch-up with the EDW marketplace. The new big data platform niches are developing the tools for robust availability, reliability, security, governance, management, disaster recovery, optimization, and other enterprise-grade features we take for granted with EDW. In addition, all of the traditional EDW middleware and application ecosystem offerings—including data integration, data quality, virtualization, business intelligence, and predictive analytics—are being retooled or rethought entirely with these new big data platforms in mind.

How production-ready are your big data investments? What are you doing to make them ready? What tools and capabilities would you like to see IBM and other solution providers offer to help you make them ready? Let me know in the comments.
