Cloud data preparation: Reshaping the business analyst experience

Portfolio Marketing Manager, Information Integration & Governance, IBM

Obtaining good data for use in business analytics has always presented challenges. From figuring out which data is needed to finding it to validating its accuracy, analysts spend considerable time just preparing for analysis rather than doing the analysis. A recent Forbes study revealed that while more than 80 percent of all organizations are prioritizing data analytics in their budgets, more than 40 percent of those same organizations report legacy system bottlenecks and poor data quality. That poor quality impedes the progress of those high-priority analytical projects.

This revelation indicates a significant return on investment (ROI) problem. The historical challenges an analyst faces intensify exponentially in a big data world containing multiple internal and external data sources of varying formats and data definitions, extremely high volumes of data to review and ongoing concerns about data security and privacy.

Experiencing the problem through the analyst’s eyes

Consider a scenario in which a busy executive notes that store sales growth seems to be consistently stronger for some locations than it is for others. The executive has several theories why: local staff management skills, regional differences in receptivity to national advertising campaigns, relative affluence of the local customer bases and so on. The executive would like to find a way to help the underperforming locations improve their sales but needs to know which of those theories is closer to the truth to make wise investment choices. Therefore, the executive tasks an analyst to dig into the numbers and find out what is really going on.

The analyst builds a model to try to weigh the relative impact of the different elements of the executive’s theories and then needs to find data to feed into the model. Where does the analyst look? The analyst asks the IT department for sales data by store for the past several quarters and goes online to look for Twitter feeds that reference the national chain and specific locations. Store managers and corporate human resources (HR) personnel are also contacted to provide employee satisfaction surveys and performance review data. The analyst of course meets resistance from an already overburdened IT team and is told that assembling the data will take up to two weeks. When the data does arrive, it is missing fields, and the analyst has no idea how useful or current the data is.

HR presents another roadblock as the analyst is told that employee satisfaction data and especially performance review data are highly sensitive and cannot be shared. Even Twitter is no panacea; after all, does a viral tweet about an ad campaign mean people like the campaign, or does it mean that they love the funny tweet they are all retweeting? Even if the analyst could get all this data, it would come in different forms, and considerable time would be necessary to simply enter information into an analytical application to populate the model.

Does that scenario sound familiar to you? It should. It’s business as usual every day in every organization.

Taking a direct path to the data

One problem with this scenario is the inability of data to flow that last mile from the source to the analyst’s application. In addition, linking the data from various sources properly and in context is difficult for the analyst. For example, how can employee satisfaction survey results be tied to the proper store sales records if the different sources use different store codes? And without a way to avoid viewing personally identifiable, private data, the analyst will never be able to test the model at all.
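The store-code mismatch above is typically resolved with a crosswalk table that maps one coding scheme onto the other before the sources are joined. The sketch below illustrates the idea; the field names, code formats and values are hypothetical assumptions, not a real schema.

```python
# Hypothetical sketch: reconciling two sources that key stores differently.
# All codes and field names here are illustrative assumptions.

sales = [
    {"store_code": "S-0042", "quarter": "Q1", "sales": 125000},
    {"store_code": "S-0107", "quarter": "Q1", "sales": 98000},
]

# HR survey data keyed by a different store identifier
surveys = [
    {"hr_location_id": "NE-42", "satisfaction": 4.1},
    {"hr_location_id": "SW-107", "satisfaction": 3.2},
]

# Crosswalk table mapping the HR coding scheme to the sales coding scheme
crosswalk = {"NE-42": "S-0042", "SW-107": "S-0107"}

# Re-key the survey records, then join the two sources in context
survey_by_store = {
    crosswalk[row["hr_location_id"]]: row["satisfaction"] for row in surveys
}
joined = [
    {**row, "satisfaction": survey_by_store.get(row["store_code"])}
    for row in sales
]
```

Maintaining such a crosswalk by hand is exactly the kind of tedious, error-prone work that integrated data services aim to take off the analyst’s plate.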

What if the analyst could pull data directly into the application in an accurate and secure manner? Imagine being able to build into the analytical application the ability to find the different data needed from desired sources. Further, imagine all that data being automatically masked, so the analyst never actually sees any recognizable store-level information. After all, the project is to discover root causes in the aggregate, not to pick apart any one location. What if the analyst could somehow know that the information being pulled in is the latest information available and that it is accurate?
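Automatic masking of this kind is often done by replacing each identifier with a stable pseudonym, so records from the same store still group together for aggregate analysis while the analyst never sees a recognizable code. A minimal sketch, assuming a salted hash approach and hypothetical store codes:

```python
import hashlib

def mask_store_id(store_code: str, salt: str = "analysis-project-1") -> str:
    """Replace a recognizable store code with a stable, unrecognizable pseudonym.

    The salt and code format are illustrative assumptions; a real data service
    would manage masking keys centrally.
    """
    digest = hashlib.sha256((salt + ":" + store_code).encode("utf-8")).hexdigest()
    return "store-" + digest[:8]

# The same store always maps to the same pseudonym, so aggregation still works,
# but the original identifier is never exposed to the analyst.
masked_sales = [
    {"store": mask_store_id("S-0042"), "sales": 125000},
    {"store": mask_store_id("S-0107"), "sales": 98000},
]
```

Because the pseudonyms are deterministic, joins and group-by operations behave exactly as they would on the raw codes.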

And what if peer recommendations or comments on the data were available from people who had worked with the data set in the past to let the analyst know whether the data is even useful? That information would give the analyst the confidence that the model will be tested using the best available data and that the analytical information to be shared with the executive is as accurate as possible.

Leveraging the cloud to refine data

For the analyst in this scenario, there is a better way. Analysts today have access to cloud-based data services that transform raw data into relevant and actionable information. Such services enable users to find relevant, easily consumable data quickly, without relying on IT, and to use that information anywhere to drive the business. This transformation is accomplished through a set of capabilities designed to plug directly into analytical applications and use the cloud for behind-the-scenes data refining, for example, data cleansing and data masking. That refining is necessary for confidence in the results, but today it is too slow and becomes a gating factor when curating data for analysis.
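To make the cleansing step concrete, the sketch below shows two simple refining operations: standardizing inconsistent store codes and flagging records with missing values for review. The records and field names are hypothetical assumptions; real data services apply far richer rules.

```python
# Illustrative sketch of basic data refining (cleansing) steps.
# Field names and sample records are assumptions, not a real feed.

raw = [
    {"store_code": "s-0042 ", "quarter": "Q1", "sales": "125000"},
    {"store_code": "S-0107", "quarter": "Q1", "sales": None},  # missing value
]

def cleanse(records):
    """Standardize formats and separate out incomplete records."""
    cleaned, rejected = [], []
    for rec in records:
        if rec["sales"] is None:
            rejected.append(rec)  # flag incomplete records instead of guessing
            continue
        cleaned.append({
            "store_code": rec["store_code"].strip().upper(),  # normalize codes
            "quarter": rec["quarter"],
            "sales": int(rec["sales"]),  # normalize numeric type
        })
    return cleaned, rejected

cleaned, rejected = cleanse(raw)
```

Pushing steps like these into a cloud service, rather than into each analyst’s spreadsheet, is what keeps refining from becoming the gating factor.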

In addition to the capabilities already described, analysts also need the ability to comment on and rate different data sets or sources of information within their applications. This ability means time can be saved later when trying to determine whether the data is good data. It also means easier collaboration across analytical teams when sharing insights and best practices for handling different sets of data.
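The rating and commenting capability amounts to a shared registry of peer feedback keyed by data set. The minimal sketch below shows the idea; the class and method names are illustrative assumptions, not a real product API.

```python
from collections import defaultdict
from statistics import mean

class DatasetCatalog:
    """Hypothetical sketch of a shared data-set rating registry."""

    def __init__(self):
        # dataset name -> list of (rating, comment) pairs from analysts
        self._reviews = defaultdict(list)

    def review(self, dataset: str, rating: int, comment: str) -> None:
        """Record one analyst's rating and comment for a data set."""
        self._reviews[dataset].append((rating, comment))

    def average_rating(self, dataset: str) -> float:
        """Aggregate peer ratings so the next analyst can judge usefulness."""
        ratings = [r for r, _ in self._reviews[dataset]]
        return mean(ratings) if ratings else 0.0

    def comments(self, dataset: str) -> list:
        """Return peer comments, e.g. caveats about freshness or coverage."""
        return [c for _, c in self._reviews[dataset]]

catalog = DatasetCatalog()
catalog.review("store_sales_q1", 4, "Current through end of quarter")
catalog.review("store_sales_q1", 2, "Two regions missing store codes")
```

A later analyst consulting the catalog sees both the aggregate score and the specific caveats before investing time in the data.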

Why do these capabilities matter for an analyst? Busy executives do not want to wait days or weeks for an answer to a question. In today’s business world, however, they are often obliged to wait, or they receive incomplete information or analysis with spurious results. With data services integrated into the application, the cycle of requesting data from third parties and then waiting for it to arrive disappears. Instead, the data can be immediately available and ready for analysis. Gone will be the days of analysts spending a significant share of their time looking for and preparing data.

Another noteworthy benefit of having information access directly integrated with an analytical application is that analysts will not need to switch between different applications, search engines and so on to pull data together. This simple and seamless user experience means data from various sources can be analyzed jointly and in context. Ultimately, this direct access provides significantly more confidence than ever in the information being passed along to executives or to other analysts within the organization.

Working with an expert perspective

IBM offers business analysts powerful solutions that save significant time cost-effectively when curating data through seamless integration of data services into their analytical applications. Besides saving time and money, organizations gain assurance that the information their analysts produce reflects the latest and best data, strengthening confidence in the decisions they make.

IBM has been positioned as a leader in the Gartner Magic Quadrant for Data Quality Tools and in The Forrester Wave™: Data Quality Solutions report. Contact your IBM representative for more information on IBM Information Integration and Governance technologies.