Do data scientists need data management?

Big Data Industry Architect, IBM

Data scientists are a little different. But they can be integrated into an analytics team and managed when their needs are well understood. Data scientists typically use different analytical tools, work with differently formatted data, follow different work patterns and have different educational backgrounds and mind-sets from other team members.

The tools of the trade for data scientists are generally R and Python, which have significant open source libraries of analytical functions, and IBM SPSS Modeler, an enterprise-quality, data-mining platform designed for integration into corporate systems. Typically, the data miner needs the data in a single, flat-record format. Many data-mining tools operate on a record-by-record basis—not on complex data models. This one huge analytical record should contain aggregate information such as “total sales month 1,” “total sales month 2,” “items sold month 1,” “items sold month 2” and so on. Each record would have all the pertinent information to predict or classify the inputs.
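As a rough illustration of the single, flat-record format described above, the following Python sketch rolls transaction rows up into one wide record per customer. The transaction layout, the customer IDs, the column names and the build_flat_records helper are all hypothetical, chosen only to mirror the "total sales month 1" and "items sold month 1" style of field; a real project would more likely do this step in SQL or a data-frame library.

```python
from collections import defaultdict

# Hypothetical transaction rows: (customer_id, month, sales_amount, items_sold).
transactions = [
    ("C1", 1, 20.0, 2),
    ("C1", 1, 5.0, 1),
    ("C1", 2, 12.5, 3),
    ("C2", 2, 40.0, 4),
]


def build_flat_records(rows, months):
    """Aggregate transaction rows into one flat record per customer."""
    # totals[customer_id][month] holds [total sales, total items sold].
    totals = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for customer_id, month, amount, items in rows:
        cell = totals[customer_id][month]
        cell[0] += amount
        cell[1] += items

    records = []
    for customer_id in sorted(totals):
        record = {"customer_id": customer_id}
        for month in months:
            # Months with no activity are filled with zeros so that every
            # record has exactly the same columns.
            amount, items = totals[customer_id].get(month, (0.0, 0))
            record[f"total_sales_month_{month}"] = amount
            record[f"items_sold_month_{month}"] = items
        records.append(record)
    return records


flat_records = build_flat_records(transactions, months=[1, 2])
for record in flat_records:
    print(record)
```

Each dictionary is one analytical record with a fixed set of columns, which is the shape a record-by-record tool expects.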

When trying to spot who may make a good data scientist, look for individuals who are curious about the data and interested in finding new patterns and relationships. They may also have education in statistical techniques that help identify patterns.

Working as a team

An analytics team should comprise distinct roles. The team includes a business analyst to understand the data and the value of the analysis to the business, and IT professionals and database administrators (DBAs) to help collect and manage the data and deploy any results. The analytics team also includes a data scientist to apply tools and techniques to the data for discovering patterns. The team should work together to present the results, and not rely on the data scientist alone to prepare outputs for the business users.

The analytics team should expect the data scientist to work directly with IT and DBAs to access and organize the data. Data manipulation is key to the work data scientists do, and much of their time is spent reformatting data to feed to the algorithms—creating the one big record. Help in the data management area, especially when handling big data, is important for success because many data scientists are not proficient with big data. Statisticians and data miners are taught their skills in universities, using small sets of data on workstations and not huge data sets on multi-node Apache Hadoop clusters.

Using big data sources and data sets of varied granularity is sometimes new to data scientists. The IT and DBA team members should expect to be pulling together large volumes of data and merging multiple formats to create the one big analysis record for analytics. And the team should expect to do this process over and over again until a useful set of data is prepared.
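To make the idea of merging formats and granularities concrete, here is a minimal Python sketch that rolls daily sales up to monthly totals and joins them with monthly advertising figures. Everything in it is an assumption for illustration: the CSV and JSON layouts, the store IDs, the field names and the merge_to_monthly helper are hypothetical, and at big data volumes this step would run on a cluster rather than in a single script.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical source 1: daily sales, delivered as CSV text.
DAILY_SALES_CSV = """store_id,date,sales
S1,2015-01-03,100.0
S1,2015-01-17,50.0
S1,2015-02-02,70.0
S2,2015-01-09,30.0
"""

# Hypothetical source 2: monthly advertising spend, delivered as JSON text.
MONTHLY_AD_SPEND_JSON = (
    '[{"store_id": "S1", "month": "2015-01", "ad_spend": 10.0},'
    ' {"store_id": "S2", "month": "2015-01", "ad_spend": 4.0}]'
)


def merge_to_monthly(daily_csv, monthly_json):
    """Roll daily sales up to months, then join them with monthly ad spend."""
    monthly_sales = defaultdict(float)
    for row in csv.DictReader(io.StringIO(daily_csv)):
        month = row["date"][:7]  # "2015-01-03" becomes "2015-01"
        monthly_sales[(row["store_id"], month)] += float(row["sales"])

    ad_spend = {
        (entry["store_id"], entry["month"]): entry["ad_spend"]
        for entry in json.loads(monthly_json)
    }

    merged = []
    for store_id, month in sorted(monthly_sales):
        merged.append({
            "store_id": store_id,
            "month": month,
            "sales": monthly_sales[(store_id, month)],
            # None marks a month with no advertising data in the second source.
            "ad_spend": ad_spend.get((store_id, month)),
        })
    return merged


merged_records = merge_to_monthly(DAILY_SALES_CSV, MONTHLY_AD_SPEND_JSON)
for record in merged_records:
    print(record)
```

Gaps such as the missing advertising figure for one month are typical of what the team finds on each pass, and they are one reason the preparation step tends to be repeated.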

Data scientists should not be expected to be industry experts. In fact, approaching the data without preconceived ideas about its business relationships, and allowing the data to speak for itself, is generally a better approach. This type of undirected analysis helps to remove institutional bias that may exist in an analysis.

The business analyst needs to help the data scientist understand the meaning of the data and the relevance of any discovered relationships. In the first stages of an analysis, the data scientist is expected to discover obvious relationships that need to be weeded out to uncover more meaningful patterns. For example, discovery may reveal that a lot of sunscreen is sold in the summer—an expected outcome—but also that certain stores may sell more than their share of sunscreen in the winter—something unexpected to investigate.
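One simple way to surface the unexpected part of the sunscreen example is to compare each store's share of winter sales with its share of total sales. The Python sketch below does that under stated assumptions: the sales figures are made up, the 1.5 ratio threshold is arbitrary, and the unexpected_winter_sellers helper is hypothetical. It is one possible screening rule, not a method prescribed by the article.

```python
# Hypothetical sunscreen unit sales per store, split by season.
sunscreen_sales = {
    "S1": {"summer": 900, "winter": 100},
    "S2": {"summer": 800, "winter": 100},
    "S3": {"summer": 300, "winter": 300},
}


def unexpected_winter_sellers(sales_by_store, ratio_threshold=1.5):
    """Return stores whose winter share far exceeds their overall share."""
    total_all = sum(sum(seasons.values()) for seasons in sales_by_store.values())
    total_winter = sum(seasons["winter"] for seasons in sales_by_store.values())
    if total_all == 0 or total_winter == 0:
        return []  # No sales to compare, so nothing can be flagged.

    flagged = []
    for store_id, seasons in sorted(sales_by_store.items()):
        overall_share = sum(seasons.values()) / total_all
        winter_share = seasons["winter"] / total_winter
        # The threshold is an arbitrary assumption; the business analyst
        # would tune it to what counts as surprising for the business.
        if winter_share > ratio_threshold * overall_share:
            flagged.append(store_id)
    return flagged


flagged_stores = unexpected_winter_sellers(sunscreen_sales)
print(flagged_stores)
```

In this made-up data, store S3 accounts for 24 percent of all sales but 60 percent of winter sales, so it is the only store flagged for follow-up.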

Ensuring organizational success 

Keeping any activities of the data scientist aligned to the overall business goals is critical. The managers and business analysts can help keep the team on track by looking at the outputs and focusing on those patterns that can lead to positive change. Discovering that summer is a good season for sunscreen, for example, isn’t that helpful; discovering the percentage of increase in sunscreen sales because of heightened advertising can help optimize advertising spend and grow profits.

The output produced by the data scientist should be usable in the operations of the organization. For example, a fraud-detection algorithm may be very accurate when based on many months of historical data. However, months of historical data may not always be available. Designing a fraud-detection model that is still accurate using historical data from only a few days would be of more use and more practical to implement.

Designing for change 

The work of a data scientist is iterative in nature. The analytics team should receive constant feedback on, and updates to, its data and findings. Analyzing the results is critical to ensure changes in patterns are discovered early. This feedback keeps the analysis current and greatly enhances the value of the outputs past their initial use. The analytical system should be designed to be circular by employing feedback loops from the beginning, not as an afterthought. An architecture or business process built on many iterations and continuous updates is not common, so the team needs to plan it well to ensure smooth implementation.

Building a team around the data scientist role lets the data scientist focus on applying tools and techniques to prepared data, and it also allows for personnel interchange. A trained data scientist may not always be available or may move around to numerous unrelated projects. Having a structure in place to carry on the work and processes helps ensure continued value from the analysis.

Learn more about Apache Spark as a power tool for the modern data scientist. And be sure to register for IBM Insight 2015, October 26–29, in Las Vegas, Nevada, to hear a presentation and learn more about this topic.

This post was coauthored by Elizabeth Dial and Pat O'Sullivan.