
The delicate art of data science project prioritization and triage

June 19, 2014

The world is awash in correlations and the data with which their significance will be revealed.

If you're a reasonably curious data scientist, you may be tempted to keep on mining more and more data until you uncover the brightest nuggets of insight. But that would be foolhardy in the extreme. You and your team have only so many hours in the day, dollars in the budget and smart people available to chase the wild geese that may or may not lay these golden eggs. Also, you're probably all working for somebody else, who has neither infinite patience nor the desire to fund your open-ended curiosity.

Image courtesy of Openclipart and used with permission

Prioritizing data mining projects is a delicate art, akin to the decisions that research and development managers face every single day. Data mining is all about searching for non-obvious patterns within large, complex collections of information. Consequently, data scientists wade into unexplored data all the time, never quite knowing whether they will find heretofore unknown correlations of value to the business that employs them.

How should you prioritize your data mining efforts and allocate your limited resources most effectively? Most important, how do you decide what NOT to work on? In a recent post, David Nettleton provides detailed guidance, with crisp criteria for squelching less-promising project proposals before they become major resource hogs and distractions. Here is my paraphrase of how he describes the threshold criteria that must be satisfied for a project proposal to survive the cut:

  • Objectives: Do you have specific quantitative metrics (absolute or relative) of the business objectives likely to be realized by the data mining project?
  • Coverage: With respect to what's available, easily accessible and/or affordable during the expected period of the initiative, is the data sufficient to cover the scope of the project as envisaged?
  • Reliability: In terms of its quality and consistency, is the data reliable enough to satisfy the objectives of the project?
  • Correlation: On the face of it, and in terms of the patterns that are most likely to be discovered, does the data correlate sufficiently with the business metrics to be explored and predicted under the project?
  • Volatility: When the results of the project are expected to be generated, will the volatility of the environment in which they are to be used (for example, at a later phase of the current business cycle) render them invalid, useless or trivial?
  • Execution: Is the project likely to be executed according to plan, or do various execution constraints (such as short staffing, wrong tools and more) and risks (such as scope creep and excessive time to gather required data) make it unwise to start the project at all?
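
One way to picture these threshold criteria is as a simple go/no-go checklist. The following is only a minimal sketch of my paraphrase, not Nettleton's own method: the class and function names are hypothetical, and it assumes each question has already been reduced to a yes/no answer in which True means the proposal passes that criterion (so for volatility, True means the results are expected to still be valid when delivered).

```python
from dataclasses import dataclass, fields


@dataclass
class ProposalScreen:
    """Hypothetical yes/no answers to the six threshold questions.

    True always means "the proposal passes this criterion".
    """
    objectives: bool   # specific quantitative business metrics exist
    coverage: bool     # available data covers the envisaged scope
    reliability: bool  # data quality and consistency are good enough
    correlation: bool  # data plausibly correlates with the target metrics
    volatility: bool   # results should still be valid when delivered
    execution: bool    # staffing, tools and schedule make the plan realistic


def failed_criteria(screen):
    """Return the names of the criteria the proposal does not satisfy."""
    return [f.name for f in fields(screen) if not getattr(screen, f.name)]


def survives_cut(screen):
    """A proposal survives only if every threshold criterion is satisfied."""
    return not failed_criteria(screen)


if __name__ == "__main__":
    proposal = ProposalScreen(
        objectives=True, coverage=True, reliability=True,
        correlation=True, volatility=False, execution=True,
    )
    print(survives_cut(proposal), failed_criteria(proposal))
```

Treating every criterion as a hard gate is a deliberate simplification here; a real triage process might weight the criteria or allow a proposal to be reworked and resubmitted.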

Behind all these project prioritization criteria are even more fundamental ones that address what, if anything, data mining will deliver over and above the current business baseline. The criteria consist of the following risk factors:

  • Overkill: Nettleton buries this criterion in a bullet in the middle of the article, but I think it's worth calling out: "Can the problem be solved by other techniques or methods?" Here's how I interpret that: if the project's objectives can be adequately satisfied by existing analytics tools (reports, spreadsheets, expert intuition and the like) then there's no point in putting expensive data scientists and their power tools on the problem.
  • Underwhelm: This is high up in Nettleton's discussion. If the expected returns (reduced customer churn, boosted acceptance rates, to name a few) from a data mining project aren't statistically significant over and above the company's current performance on those metrics, then there's not enough "bang" to justify spending your employer's data mining "buck." Or, as Nettleton frames it: "The percentage improvement should always be considered with regard to the current precision of an existing index as a baseline. Also, the new precision objective should not get lost in the error bars of the current precision."
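
The "error bars" point can be made concrete with a rough check. The sketch below is my own illustration, not something taken from Nettleton's post. It assumes the current precision is a proportion measured on a known number of cases and approximates its error bar with a normal-approximation interval (z = 1.96 for roughly 95%). The function names and the example figures are hypothetical.

```python
import math


def baseline_half_width(precision, n_cases, z=1.96):
    """Approximate half-width of the error bar around the current precision.

    Assumes precision is a proportion in [0, 1] observed on n_cases
    independent cases (normal approximation to the binomial).
    """
    standard_error = math.sqrt(precision * (1.0 - precision) / n_cases)
    return z * standard_error


def clears_error_bars(baseline, n_cases, target, z=1.96):
    """True if the target precision lies above the baseline's upper error bar."""
    return target > baseline + baseline_half_width(baseline, n_cases, z)


if __name__ == "__main__":
    # Hypothetical figures: 80% current precision measured on 400 cases.
    print(clears_error_bars(0.80, 400, 0.82))  # two-point gain
    print(clears_error_bars(0.80, 400, 0.85))  # five-point gain
```

With these made-up numbers the error bar is about four percentage points wide on each side, so a two-point objective gets lost in it while a five-point objective does not. A proper significance test on real data would be the more rigorous route; this only captures the spirit of the criterion.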

It's always good for data scientists to hum the mantra "disruptive" as they sift through project proposals. If there's little likelihood of a project delivering significantly valuable business insights, it belongs on the "no-go" list.