Governance is not optional in big data analytics
Consider this scenario…
A busy executive notes that her store chain’s sales growth seems to be consistently stronger for some locations than for others. She has several theories as to why: local staff management skills, regional differences in receptivity to national advertising campaigns, relative affluence of the local customer bases and so on. She would like to help the underperforming locations improve their sales, but she needs to know which of her theories is closest to the truth so she can make wise investment choices. Therefore she tasks an analyst to “dig into the numbers” and “find out what is really going on.”
The analyst builds a model to weigh the relative impact of the different elements of the executive’s theories. With the model in hand, he now needs data to feed into it. Where does he look? He asks his IT department for sales data by store for the past several quarters. He goes online to look for Twitter feeds that reference the national chain and specific locations. He asks store managers and corporate HR to provide employee satisfaction surveys and performance review data. He, of course, meets resistance from an already overburdened IT team, which tells him it will take up to two weeks to assemble the data. When the data does arrive, it is missing fields, and he has no idea how useful or current it is.
HR presents another roadblock, telling him that employee satisfaction data and especially performance review data are highly sensitive and cannot be shared. Even Twitter is no panacea: does a viral tweet about an ad campaign mean people love the campaign, or that they love the funny tweet they are all retweeting? Even if he could get all this data, it would come in different forms, and he would spend considerable time simply entering information into his analytical application in order to populate his model.
Does that sound familiar to you? It should—it’s business as usual every day, in every organization.
A better way
One problem with the scenario is the inability of data to flow that “last mile” from the source to the analyst’s application. In addition, it’s very difficult for the analyst to link the data from various sources properly, and in context. For example, how can he tie employee satisfaction survey results to the proper store sales records if the different sources have different store codes? Finally, the analyst will never be able to test his model if he cannot figure out a way to avoid viewing personally identifiable, private data.
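The store-code mismatch is essentially a record-linkage problem: the join between sources is only as trustworthy as the mapping between their identifiers. A minimal sketch, with all store codes and figures hypothetical, of why a maintained crosswalk (itself a governance artifact) is what makes the link possible:

```python
# Two sources describe the same stores but key them differently.
# All codes and figures below are made up for illustration.

# The sales system keys stores by a numeric code.
sales = {101: {"q3_sales": 250_000}, 102: {"q3_sales": 180_000}}

# The HR survey system keys the same stores by a different scheme.
surveys = {"NY-EAST": {"satisfaction": 4.1}, "NY-WEST": {"satisfaction": 3.2}}

# Without a maintained crosswalk table, this join is guesswork.
crosswalk = {"NY-EAST": 101, "NY-WEST": 102}

def link_records(sales, surveys, crosswalk):
    """Join survey results to sales records via the crosswalk."""
    linked = {}
    for survey_code, survey_row in surveys.items():
        sales_code = crosswalk.get(survey_code)
        if sales_code in sales:
            linked[sales_code] = {**sales[sales_code], **survey_row}
    return linked

print(link_records(sales, surveys, crosswalk))
# {101: {'q3_sales': 250000, 'satisfaction': 4.1},
#  102: {'q3_sales': 180000, 'satisfaction': 3.2}}
```

In practice the crosswalk is the fragile part: if nobody owns and maintains it, survey results silently fail to match sales records, which is exactly the kind of gap governance is meant to close.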
What if the analyst could pull data directly into his application in an accurate and secure manner? Imagine being able to build into his analytical application the ability to find the different data he needs from the sources he wants. Further, imagine all of that data being automatically masked, so he never actually sees any recognizable store-level information. After all, his project is to discover root causes in the aggregate, not to pick apart any one location. What if he could somehow know that the information he is pulling in is the latest available and that it is accurate?
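The masking idea above can be sketched in a few lines: identifying fields are dropped or replaced with one-way pseudonyms before the data reaches the analyst, while aggregates stay usable. Field names, the salt value and all figures here are hypothetical:

```python
# A minimal masking sketch: strip names, pseudonymize store codes,
# and keep only what aggregate analysis needs. All values are made up.
import hashlib
from statistics import mean

records = [
    {"store_id": "NY-EAST", "manager": "J. Smith", "q3_sales": 250_000},
    {"store_id": "NY-WEST", "manager": "A. Jones", "q3_sales": 180_000},
]

def pseudonymize(record, salt="analysis-2014"):
    """Drop identifying fields and replace the store code with a salted hash."""
    masked = {k: v for k, v in record.items() if k != "manager"}
    digest = hashlib.sha256((salt + record["store_id"]).encode()).hexdigest()
    masked["store_id"] = digest[:8]  # short pseudonym, not reversible without the salt
    return masked

masked = [pseudonymize(r) for r in records]
print(masked)  # no manager names, no recognizable store codes
print(mean(r["q3_sales"] for r in records))  # aggregate answer still available: 215000
```

The point is that masking and analysis are not in tension: the analyst can still compare stores in the aggregate without ever seeing who or where any individual store is.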
Finally, what if peer recommendations, or comments on the data, were available from people who had worked with the data set in the past to let him know if the data is even useful? That would give him confidence that the model he built will be tested using the best available data and that the analytical information he shares with the executive is as accurate as possible.
These are all different aspects of governance. Governance is simply the process of managing, monitoring and protecting all data. A recent paper by Bloor Research states:
“At Bloor Research we are firmly of the opinion that in order to make business decisions based on the analysis of data then you need to be sure that the information upon which you are making those decisions is trustworthy. Of course this applies to all sorts of business intelligence whatever the type of data being analysed but in this paper we will be concentrating on the analysis of so-called ‘big data’. […] understand the merits of having confidence in their data, when it comes to conventional, structured data there seems to be [a] fallacy going round (no doubt fostered by vendors—often those that sell hardware—for whom quantity is everything) that the lessons we have learned about trusting traditional data do not apply when it comes to big data. As we shall discuss this is very far from being the truth.”
Simply put, confidence in data is essential regardless of how big the data being analyzed is. Governance cannot be ignored.
For more information, read the entire Bloor Research report and explore IBM’s Information Integration and Governance capabilities.
Also, consider joining IBM subject matter experts and industry colleagues at IBM Insight 2014 during the Information Management track sessions and #makedatawork better. Keep an eye out for IBM DataWorks too—coming soon!