The 3 Cs of big data
We’re all familiar with big data’s varying number of Vs: volume, variety, velocity and veracity. However, the purpose for which insight is derived from big data is just as important, and likely more useful, when engineering information systems. That purpose is usually to inform better decision making, and business leaders need to trust the data before they act on it.
Looking at big data’s three Cs
As a result, I propose we discuss what’s needed for that trust in terms of big data’s three Cs: confidence, context and choice.
The objectives of early management information initiatives were to report on financial and sales performance. These objectives demand a high degree of accuracy. A CFO who doesn’t have confidence—the first C of big data—in a report’s financial performance figures is forced to look elsewhere. The scenario is different for a CMO, who is looking to offer a promotion to an individual customer at a point in time. Such an action draws on data from many systems and external data sources, perhaps including social media.
Organizations bring data together into a single, comprehensive view of a customer so that they can maximize the opportunity of engaging with that customer for increased sales or improved customer service, for example. Combining data from multiple systems requires matching records—something that is imprecise because of data-quality issues, varying data formats and other characteristics of the way data is stored and managed by those systems. Consequently, the matching of records from multiple systems into a single view of a customer cannot be achieved with certainty. A score can be calculated that determines the level of confidence that can be placed in the combined view.
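The idea of scoring a match can be sketched in a few lines. The following is a minimal illustration, not a production matching engine: the field names, weights and threshold are all invented for the example, and a simple string-similarity ratio stands in for the more sophisticated matching a real master data management system would use.

```python
from difflib import SequenceMatcher

def match_confidence(rec_a, rec_b, weights=None):
    """Score how likely two customer records refer to the same person.

    Compares each shared field with a string-similarity ratio and
    returns a weighted average in [0.0, 1.0]. The fields and weights
    here are illustrative, not a standard.
    """
    weights = weights or {"name": 0.5, "email": 0.3, "city": 0.2}
    score = 0.0
    for field, weight in weights.items():
        a = str(rec_a.get(field, "")).lower().strip()
        b = str(rec_b.get(field, "")).lower().strip()
        score += weight * SequenceMatcher(None, a, b).ratio()
    return score

# Two records for (perhaps) the same customer, from different systems.
crm = {"name": "Jon Smith", "email": "jon.smith@example.com", "city": "Leeds"}
web = {"name": "Jonathan Smith", "email": "jon.smith@example.com", "city": "Leeds"}

confidence = match_confidence(crm, web)
# A business-defined threshold decides whether the records are merged.
if confidence >= 0.8:
    print(f"Merge records (confidence {confidence:.2f})")
```

The point of the sketch is that the output is a score, not a certainty; where the business sets the merge threshold is exactly the confidence judgment discussed below.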
A CMO making marketing decisions may be able to act on a lower degree of confidence in the data than a CFO requires for reporting financial performance. The acceptable level of confidence is a judgment a business needs to make based on the risk and effect of its actions. And that judgment balances what might result from poor decisions made on inaccurate data against the cost of improving the provision of data.
Measures of confidence are not limited to merged data; they also apply to data sources themselves. For example, a city’s buildings may distort location data from Global Positioning System (GPS) sensors, temperature sensors work within defined tolerance levels, and social media data demands its own caution. Understanding the provenance of the data being used to make decisions is important.
The increasing desire to exploit data is widening the use of statistics. Data science techniques such as predictive analytics and machine learning produce results whose accuracy is not absolute; a measurable level of confidence comes with them. Consumers of those results need to understand what that level of confidence means when they use them to make decisions.
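A predictive model makes the same point concretely: its output is a probability, and the consumer decides what level of confidence justifies action. This toy churn model is purely illustrative—the coefficients and the 0.7 action threshold are invented, where a real model would be fitted to historical data.

```python
import math

def predict_churn_probability(monthly_usage_hours, support_tickets):
    """Toy logistic model: returns a probability, not a yes/no answer.

    The coefficients are invented for illustration; a real model
    would be trained on historical customer data.
    """
    z = 1.5 - 0.4 * monthly_usage_hours + 0.8 * support_tickets
    return 1.0 / (1.0 + math.exp(-z))

# A low-usage customer with several support tickets looks likely to churn.
p = predict_churn_probability(monthly_usage_hours=2, support_tickets=3)

# The consumer of the score, not the model, decides what level of
# confidence justifies taking action.
if p > 0.7:
    print(f"Offer retention promotion (churn probability {p:.2f})")
```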
The second C of big data is context. Understanding context requires understanding who is asking the question and why. Part of that understanding includes the role of the person, where that person is asking the question, what the questioner is trying to do and the purpose to which the results will be applied.
People undertaking comparative analysis of remuneration for roles in their organization against similar roles in the market, for example, require access to salary data, whereas analysis of employee career progression does not. These two activities may or may not be carried out by the same person, but the purpose is clearly different. Understanding the context to provide the appropriate authorized access is essential, even if the same person is carrying out the two activities. This authorization is an example of information governance—defining and enforcing policies. The requirement is critical not only in regulated industries, but also more widely as organizations become increasingly data driven.
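Purpose-based authorization of this kind can be sketched as a policy lookup keyed on both role and purpose. The policy entries, role names and field names below are illustrative assumptions, not a real governance product’s API:

```python
# Governance policy: which fields each (role, purpose) pair may see.
# The same analyst sees salary data for a remuneration benchmark but
# not for a career-progression study. All names are illustrative.
POLICY = {
    ("hr_analyst", "remuneration_benchmark"):
        {"role_title", "salary", "market_band"},
    ("hr_analyst", "career_progression"):
        {"role_title", "promotion_history"},
}

def authorized_fields(role, purpose, requested_fields):
    """Return only the fields this role may access for this purpose."""
    allowed = POLICY.get((role, purpose), set())
    return set(requested_fields) & allowed

fields = {"role_title", "salary", "promotion_history"}
print(authorized_fields("hr_analyst", "remuneration_benchmark", fields))
print(authorized_fields("hr_analyst", "career_progression", fields))
```

Keying the policy on purpose as well as role is the design point: the same person, asking for the same data, is granted different access depending on why they are asking.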
Context is also important in time-critical situations. Fields such as public safety, defense and even sport utilize context in the continuous monitoring of operations. Producing an alert is of no use, for example, if a commander is not also provided sufficient information about the wider context to be able to judge the situation correctly. This wide context needs to be provided with the alert, in real time, and it needs to avoid providing superfluous and distracting information—noise—that is not relevant at that time. The commander is likely to be under a lot of pressure, and too much information may result in missing the key information and making the wrong decision—in the same way that too little information is of no use. Understanding the context in which the commander is operating is essential to getting this balance right in that moment.
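Filtering an alert’s context down to what is relevant at that moment can be sketched as a relevance test in space and time. The thresholds, field names and precomputed distances below are invented for illustration; a real operational system would use richer relevance criteria.

```python
from datetime import datetime, timedelta

def build_alert(event, context_feed, radius_km=5.0,
                window=timedelta(minutes=10)):
    """Attach only context items close to the event in space and time.

    Everything else is treated as noise and left out, so the
    commander sees the alert with relevant context but without
    distraction. Thresholds are illustrative.
    """
    relevant = [
        item for item in context_feed
        if item["distance_km"] <= radius_km
        and abs(item["time"] - event["time"]) <= window
    ]
    return {"alert": event["description"], "context": relevant}

now = datetime(2016, 5, 1, 12, 0)
event = {"description": "Perimeter breach, gate 4", "time": now}
feed = [
    {"label": "Patrol unit nearby", "distance_km": 1.2,
     "time": now - timedelta(minutes=3)},
    {"label": "Traffic jam downtown", "distance_km": 40.0,
     "time": now},                                   # too far: noise
    {"label": "Camera offline last night", "distance_km": 0.5,
     "time": now - timedelta(hours=9)},              # too old: noise
]
alert = build_alert(event, feed)
```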
Opting for a particular technology platform and analytics tools represents the third C of big data—choice. Many organizations have deployed Apache Hadoop systems in support of big data initiatives, attracted by cost-effective infrastructure. Even though the importance of information governance was highlighted previously, sadly, it is often not considered early enough in such initiatives. Information governance matters because businesses soon become reliant on such systems. They place increasing demands on them as they realize that easier access to data offers new opportunities. Exploratory ad hoc analytics begin to compete with regularly run analytics for system resources, and the problem of hitting capacity limits is compounded because no single platform is optimized for every type of analytical workload.
Inbound marketing decision making and operational decision making in public safety situations, for example, both require high performance to produce results from analytics, in context and in near-real time. Being too slow means that customer engagement has ended and the marketing opportunity is missed, or the public safety situation might have escalated. In these cases, a Hadoop system is probably not the best analytics platform to meet the business need.
Business users performing specific functions often run similar types of queries repeatedly; providing access to data on a platform optimized to meet their needs supports them better than competing for resources on a platform designed for and used by everyone. Analysts in an organization use data to enable timely and effective decision making, and the role of IT is to provide the platforms and tools that enable them to succeed. As a result, IT needs to select platforms that are fit for purpose and provide the technologies to manage the information flows among them in line with information governance policies.
Engineering trusted analytics platforms
A fourth C for big data—cognitive—is worth mentioning. As human beings, we naturally take our surroundings into account and make judgments from what we observe in everything that we do, and cognitive reasoning systems represent a step change in analytics systems. Nevertheless, confidence, context and choice still apply. Results are ranked using scores that are based on training and thereby carry a measurable level of confidence. Wider context can be both an input to the analysis and, through access to source content in the results, context for the output. And a choice of many technologies underpins the diversity of cognitive services assembled into applications.
Analytics platforms need to be properly engineered to support the information architecture that meets the variety of business needs. Business users can then benefit from being confident in the data, having access to it in context and knowing that technology choices have been made for them. To find out more about how to engineer a data-driven organization, see the IBM Redbooks publication Designing and Operating a Data Reservoir, which focuses on the data lake. Then explore how you can achieve all these objectives on a trusted big data platform.