The Maturing of Big Data: From Herding Cats to Taming Tigers

Big Data Evangelist, IBM

You can’t declare something mature until it has stopped developing. By that criterion, the big-data market is far from mature. It continues to foster an impressive amount of innovation in new types of databases, analytical approaches and applications that infuse data-driven optimization into every aspect of our existence.

Today’s big-data space is a sprawling menagerie of innovative database architectures for the new world of Internet-centric computing. Accent on the “sprawling.” Big data’s lack of any clear organizing theme other than providing an evolutionary path for analytic databases has made it a catch-all category supreme. If you can herd these cool new cats into tidy conceptual categories in your mind and your data strategy, then I salute you.

Several newer species of big-data cat, most notably Hadoop and NoSQL, are geared to handling the dizzying variety of new data types and advanced analytics. Most new species boast such big-data bona fides as massive parallelism, elastic scaling, schema flexibility, open-source licensing and commodity hardware support. Under the big-data big top, NoSQL is a three-ring circus in its own right. One popular NoSQL categorization scheme consists of simple key-value stores (e.g., Memcached, Dynamo), document stores (e.g., MongoDB, CouchDB), next-gen columnar stores (e.g., HBase, Cassandra, Bigtable) and graph stores (e.g., Neo4j). Confusing the situation is the considerable overlap between NoSQL and big data’s rockstar, Hadoop: most notable are the inclusion of HBase under both paradigms and the support for MapReduce in various NoSQL platforms.

How mature is this unruly cavalcade of cats? Maturity is fundamentally a measure of whether an approach – implemented through enabling technologies and distinct practices – is fully production-ready and fit for enterprise prime time in some core application domain. If you wish to transform these young cats into mature adults that can perform impressive feats alongside their established cousins – especially the trusty enterprise data warehouse (EDW) and online transaction processing (OLTP) database – you must crack an enterprise whip.

In practice, what that calls for is taming all these beasts within a common enterprise discipline that includes key best practices and infrastructure in several areas:

  • High availability and fault tolerance:  Is our end-to-end big-data platform always on?
  • Cluster, capacity and mixed-workload management:  Are we monitoring and managing all resources and jobs within our big-data platform?
  • Elastic provisioning:  Are we able to scale up and out rapidly and cost-effectively?
  • Performance optimization:  Can we tune the performance of all applications, jobs and processes within our big-data platform to meet expectations and service-level requirements?
  • Platform and data virtualization:  Are we able to administer, access and utilize all resources within our big-data platform via a unified interface?
  • Data, metadata and model governance:  Are we able to define and enforce controls over creation, development, promotion and usage of official system-of-record data, metadata and analytic models within our big-data solution?
  • Data security:  Are we able to define and enforce controls over authentication, permissions, encryption, time-stamping, nonrepudiation and other security functions within our platform?
  • Replication:  Are we able to replicate all types of data bidirectionally, reliably and with high throughput into, out of and within our big-data platform?
  • Backup and restore:  Are we able to rapidly and reliably back up and restore data on our big-data platform?
  • Auditing:  Are we able to audit and log all network, system, application, usage and other events across our big-data solution?
  • Archiving:  Are we able to archive any and all data from our big-data platform?
  • Disaster recovery:  Are we able to rapidly restore our big-data platform to operational status when disaster strikes and service is interrupted?

In other words, your big-data tigers will need to jump through the same hoops – always-on reliability, five-nines (99.999%) availability, guaranteed service levels, etc. – that businesses demand of all core IT infrastructure.
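
To put the availability hoop in concrete terms, here is a minimal sketch of the arithmetic behind "N nines." It assumes a 365-day year and counts all downtime against the budget, planned or not; the function name is purely illustrative and does not come from any particular product.

```python
# Yearly downtime budget implied by an availability target of N nines.
# Assumption: a 365-day year, with every minute of outage counted.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(nines: int) -> float:
    """Return the minutes of downtime per year permitted at N nines."""
    unavailability = 10 ** -nines  # e.g. five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        availability_pct = 100 * (1 - 10 ** -n)
        budget = allowed_downtime_minutes(n)
        print(f"{n} nines ({availability_pct:.3f}%): {budget:.2f} minutes/year")
```

At five nines the budget works out to a little over five minutes of downtime per year, which is why the high-availability, replication and disaster-recovery questions above tend to be answered together rather than one at a time.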


View the complete Taming Big Data infographic

If you’ve established a mature EDW and OLTP database infrastructure and operations, you don’t need to invent these practices from scratch in your evolution toward big data. The EDW and OLTP DBMS markets have long addressed all of these requirements through a rich ecosystem of capabilities, tools and best practices. Where the EDW is concerned, you may have already proven out all of these capabilities on “small data,” thereby facilitating their adaptation for the larger scales and more intensive analytic workloads associated with big data.

Over the next year or so, the newer cats in the big-data arena will be playing a bit of catch-up with the EDW arena, in terms of adding the features necessary for production-grade enterprise deployment. In addition, the industry is rapidly retooling and rethinking many of the traditional EDW middleware and application ecosystem offerings – including data integration, data quality, virtualization, business intelligence and predictive analytics – with these new platforms in mind.

Rest assured that Hadoop and other emerging big-data platforms will meet these maturity criteria as their ecosystems inexorably reinvent all the requisite EDW/OLTP-grade capabilities, tools and best practices.

Of course, the foregoing analyses assume that the new big-data platforms are aiming for the exact same ecological niches as their predecessors, but that assumption may not be 100% valid. Though they are addressing such core EDW use cases as extract-load-transform and in-database analytic processing, the Hadoop and NoSQL markets have not been rushing to address other traditional EDW functions, such as operational reporting, ad-hoc query and master data management. Instead, the newer platforms have been sinking their teeth into functions that EDWs are seldom deployed for, such as unstructured data analytics and exploratory data-science sandboxing.

Consequently, it’s quite likely that future big-data environments will be multiplatform-oriented, with EDWs still handling some traditional functions alongside their new cousins who are addressing the newer, more sophisticated big-data challenges. The fully built-out big-data ecosystem of the near future will encompass all cats – established and emerging, big-data and small-data – with each species maturely fitted to specific deployment modes and use cases. By the same token, each platform will have its own mature body of enterprise-grade best practices within a comprehensive data management discipline.

By the end of this decade, enterprises will use the same maturity whip to keep all of big data’s felines in line while they perform amazing feats for all to see.
