Building a big data center of excellence

Director for Watson and AI applications, IBM

In today’s business world, big data can be a vital competitive differentiator for organizations. Traditionally, businesses had to limit the scope of the data they used to make the critical decisions that drive successful business outcomes. Big data solutions have eased many of those limitations, so organizations can look at far more data and therefore make better decisions. Technologies such as Apache Hadoop and Apache Spark enable this fundamental shift.

I often ask customers what inhibits big data initiatives in their organization. Frequent answers include: no compelling business need, or difficulty identifying use cases; lack of data science skills; not enough staff to support them; and the complexity of collecting and managing the data. The concept of a center of excellence (CoE) for big data, which I attempt to demystify here, helps ensure these responses are not inhibitors in any organization.

The key to a data-driven business is bringing data and insight to all workflows in the business and integrating them into decision making at every step. This approach enables organizations to take advantage of the longitudinal analytics available with technology advances such as Hadoop and Spark, as well as machine learning, for past-, present- and future-looking analytics simultaneously.

Defining big data centers of excellence

A big data CoE is a framework that takes an organization from zero knowledge to having a fully functional practice of Hadoop, Spark and emerging open source technologies to deliver robust business results. A CoE is where organizations identify new technologies, learn new skills and develop appropriate processes that are then deployed into the business to accelerate adoption.

A centralized big data CoE can be the bedrock for establishing a data-driven company that treats data as a strategic asset. The big data CoE can partner with the business to identify data that is invaluable, explore use cases that differentiate its products and services in the market and help jump-start the business with insights that can yield real-time client value. Data’s strategic importance is the value it represents for the business, but success with big data is not just about data. The people and the organization also play a vital role in that success.

People are a strategic asset too, so the quality of data scientists, data engineers and data architects is paramount for creating a successful big data CoE. These data professionals need to be experienced practitioners who have a history of working within Hadoop and Spark ecosystems and possess vertical knowledge. Their goal is to prepare the business to build, run and operate production-quality big data applications on its own by maximizing its ability to leverage data. Data engineers, in particular, have to be committed to helping ensure that the data, from acquisition to advocacy, functions as the organization’s highly strategic asset.

The other vital element for successful business outcomes is identifying strong leaders to drive the transformation. Strong leadership includes a business sponsor for the overall big data mission along with business stakeholders for individual use cases.

Building big data success stories

In many cases, the business comes up with the use cases, but the CoE has the responsibility of facilitating this work. The CoE needs to assume a leadership role in understanding which applications and use cases can be driven with available sets of data sources.

Sometimes businesses are more proactive and bring use cases to the CoE themselves, and the resulting list can be overwhelming and put a strain on available resources. A transparent process for prioritizing these use cases is important and should be adopted. The CoE needs to prioritize use cases based on parameters such as ease of data availability, data quality, revenue-based business value and impact, costs and risks.
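One way to make that prioritization transparent is a simple weighted score. The sketch below is only an illustration of the idea: the 1-to-5 rating scale, the weights and the `UseCase` and `priority_score` names are all assumptions, not part of any prescribed CoE method, and each organization would agree on its own values with its stakeholders.

```python
from dataclasses import dataclass

# Hypothetical weights (assumption): benefits add to the score,
# cost and risk subtract from it.
WEIGHTS = {
    "data_availability": 0.25,
    "data_quality": 0.20,
    "business_value": 0.35,
    "cost": 0.10,
    "risk": 0.10,
}


@dataclass
class UseCase:
    """A candidate use case, with each parameter rated 1 (low) to 5 (high)."""
    name: str
    data_availability: int
    data_quality: int
    business_value: int
    cost: int
    risk: int


def priority_score(uc: UseCase) -> float:
    """Weighted benefits minus weighted cost and risk."""
    benefit = (
        WEIGHTS["data_availability"] * uc.data_availability
        + WEIGHTS["data_quality"] * uc.data_quality
        + WEIGHTS["business_value"] * uc.business_value
    )
    penalty = WEIGHTS["cost"] * uc.cost + WEIGHTS["risk"] * uc.risk
    return benefit - penalty


def rank(use_cases: list[UseCase]) -> list[UseCase]:
    """Return use cases ordered from highest to lowest priority score."""
    return sorted(use_cases, key=priority_score, reverse=True)
```

Publishing the weights alongside the ranked list lets business units see why one use case was scheduled ahead of another, and lets them challenge the ratings rather than the outcome.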

Applying agile methodology—the fail-fast approach

Agility and the ability to fail fast are essential to reaching the potential of big data. A lightweight agile process provides tools to deliver outcomes quickly and transparently, typically within two- to three-week sprints. The ability to fail fast is a key big data opportunity; business and technical roadmaps for delivering value need to change more often than in a traditional waterfall environment.

Data itself is also highly agile when it is collected in native form and transformed potentially many times to meet the needs of different use cases. Using the basic ideas of agile development methodology, a CoE can provide the leadership across the organization to ensure business users can quickly gain value from the data.

Developing financial models

At the heart of a big data CoE are creative financial models that support the innovation. The charge-back strategy can be a function of data as a service, insights as a service or analytics as a service.

As is often the case with shared services, a charge-back model is necessary to properly handle the maintenance and growth of the emerging technologies, which in this case can be Hadoop and Spark clusters. An organization needs to develop a charge-back model for the business units that will be engaging with the CoE for project, personnel, infrastructure and application resources. Some important questions need to be considered when determining the charge-back model for business units: 

  • How many users will access the application and cluster?
  • How much data will be ingested initially?
  • How much data growth is expected over time?
  • What is the data retention policy? 
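The answers to these questions can feed a simple estimate of what a business unit would be charged each month. The following is a minimal sketch under stated assumptions: the per-user and per-terabyte rates are invented placeholders, growth is treated as a fixed volume ingested every month, and data is assumed to be deleted once it is older than the retention period. A real model would also cover project, personnel and application costs.

```python
# Hypothetical rates (assumptions for illustration only).
RATE_PER_USER = 50.0        # monthly charge per user with cluster access
RATE_PER_TB_STORED = 100.0  # monthly charge per terabyte retained


def stored_tb(month: int, initial_tb: float, monthly_growth_tb: float,
              retention_months: int) -> float:
    """Estimate terabytes retained at a given month (month 0 = initial load).

    The initial load is kept until it is `retention_months` old; after that,
    only the most recent `retention_months` monthly batches remain.
    """
    initial_kept = initial_tb if month < retention_months else 0.0
    batches_kept = min(month, retention_months)
    return initial_kept + monthly_growth_tb * batches_kept


def monthly_charge(users: int, month: int, initial_tb: float,
                   monthly_growth_tb: float, retention_months: int) -> float:
    """Charge-back for one business unit in a given month."""
    storage = stored_tb(month, initial_tb, monthly_growth_tb, retention_months)
    return users * RATE_PER_USER + storage * RATE_PER_TB_STORED


if __name__ == "__main__":
    # Example: 20 users, 10 TB initial load, 2 TB ingested per month,
    # 12-month retention policy.
    for m in (0, 6, 12, 24):
        print(m, monthly_charge(20, m, 10.0, 2.0, 12))
```

Even a rough model like this makes the effect of the retention policy visible: with a fixed retention window, the storage part of the charge levels off instead of growing indefinitely.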

Business leaders and decision makers acknowledge that creating a data-driven organization requires a change of culture. Big data CoEs can be the key to this culture change. An important recommendation for building a CoE framework is starting with a small, secure data lake—a Hadoop- or Spark-based service—that can store and process data from various internal groups to support multiple use cases. When building a data lake, organizations learn and employ operational best practices for a number of processes: 

  • Cluster build out
  • Data exploration
  • Data ingestion and processing
  • Disaster recovery
  • General operations and maintenance
  • Hadoop and Spark development
  • Infrastructure integration
  • Model building and testing
  • Multitenancy and security
  • Third-party software evaluation and integration
  • Use-case evaluation 

A leading telecommunications firm, for example, began by developing a CoE that asked each business division to come up with business use cases that would generate powerful insights through analytics. It then established regular training boot camps in which business users learned how to use data with self-service tools, and it created a community of data scientists and data engineers to support line-of-business managers in their analyses and to validate findings. As a result, this CoE enabled big data as a shared service that opened up the conversation for creative financial models that involve charge backs and show backs.

Leveraging big data centers of excellence

I foresee creative CoE adaptations such as the one just described helping businesses move beyond the hope of becoming a data-driven organization enabled by big data to the reality of an organization using a data-ingrained business model. If you’re a working data scientist, data engineer or data application developer, register to attend the IBM DataFirst Launch Event that takes place 27 September 2016 in New York, New York. You’ll have the opportunity to engage with open source community leaders and practitioners and learn how to leverage your data analytics CoE to accelerate your transformation to a cognitive business.