Data science is a human craft, demanding just as much nuanced judgment and intuitive technique as you’d expect from any skilled artisan.
One of the downsides of using the word “science” in this context is that people think that statistical analysis is just some sort of cut-and-dried laboratory procedure that you follow step-by-step to arrive at the “truth.” That’s not true in the slightest. Still, as with any established discipline, there’s a fairly standard methodology that most professionals adopt. In the case of data science, the Cross Industry Standard Process for Data Mining (CRISP-DM) has been in existence since the mid-’90s, developed by, among others, SPSS (now an IBM product group).
CRISP-DM is a great high-level framework for describing basic processes in the lifecycle of any business-oriented data-science model. But, IMHO, it doesn’t quite capture the specific outcomes that a particular data scientist is trying to achieve at every stage of the process. In that regard, I recently came across a good article discussing the basics that data scientists must heed to avoid the chief predictive modeling mistakes. This, plus a companion piece on data preparation mistakes, spurred me to conceptualize it all in a crisper way (no pun intended).
At the risk of occasionally belaboring the obvious, here’s my take on the most basic best practices of data science (the parallel construction is no accident):
- Select the right analytic problem: As a data scientist, you should be engaging only in projects that have some clear alignment with key business imperatives, such as differentiating in the competitive arena or enhancing customer loyalty. If you choose to explore problems that sound vaguely “cool,” you’ve moved away from applied data science. Once you start climbing that ivory tower, your continued employment is in jeopardy, as it’s only a matter of time before your boss figures out that you’re contributing nothing to the bottom line.
- Select the right subject population: If you’re modeling customer influence patterns, it helps to have behavioral data both on customers who are very influential and those who may be less influential but are more susceptible to being influenced. Over- or under-representing either group in your population will skew your model and may cause you to overlook key variables found in the under-represented segment.
- Select the right data sources: If you’re arbitrarily limiting yourself only to customer data sourced from internal applications, you might be overlooking external data—such as social media activity—that contains the key behavioral variables you need to build into your churn, cross-sell, or influence model. Even if you’re looking at the right population, building your training set from the wrong sources means you may be inadvertently skewing your model to the most convenient variables, not the most valid variables.
- Select the right data samples: You may have a powerful big-data platform that enables you to train your model from the entire population data set. Typically, though, you’ll prefer to train it from a much smaller sample. Your sampling needs might be simple, focused on ensuring that you extract a representative subset of the total population, so as not to introduce bias. Or your sampling design might be more complex (involving stratified, cluster and multistage approaches) in order to achieve the required level of precision without introducing skew.
- Select the right data and model versions: You can’t have much confidence in your models if you’re training them from old, inaccurate and inconsistent versions of data. Likewise, your analytic-driven business applications will provide faulty decision-support if you’re driving them from yesteryear’s predictive models and operational data. Consequently, you need to incorporate strong data and model governance into every aspect of your data-science operations.
- Select the right predictive variables: Identifying the best predictors from a wide range of independent variables is at the heart of the art of data science. You may have just a few variables to choose from, or thousands, and you need to immerse yourself in the body of statistical-modeling best practices known as “variable selection,” “attribute selection,” or “feature selection.” You should explore such variable-selection approaches as decision trees, clustering, association rule learning and outlier analysis.
- Select the right modeling approach and algorithms: This is also at the core of data science. If you have continuous variables, you’ll be doing regression modeling of one sort or another and have many algorithmic approaches to choose from. If you’re modeling discrete or categorical variables in addition to or in lieu of continuous variables, you’ll need to explore other algorithms. Depending on the particulars of your model, you might explore such algorithms as neural networks, genetic algorithms and support vector machines.
- Select the right model-validation frequency: If you imagine that the predictive model you’ve just built will always fit observational data perfectly, think again. Model fitness or quality can vanish in an instant. Depending on how fast the statistical relationships you’ve modeled are changing, you may need to score your models with fresh data every month, week, day, or even hour—and iterate the revised next version just as often. Choosing the scoring and iteration frequency is essential if your models are going to retain their predictive validity over time.
- Select the right model-fitness approaches: You have many options for assessing the predictive fitness of your statistical models. You should consider validating model fitness using multiple metrics and approaches, including model quality scores (K-S, Gini, ROC, etc.), goodness-of-fit charts, lift charts and comparative model evaluation.
- Select the right visualizations: You will be diluting the value of even the best statistical models if you don’t choose the right type of visualization(s) to guide data exploration, model development and results presentation. The right visualization makes all the difference between literally seeing a significant statistical pattern and never realizing it’s there. Your finest statistical algorithms won’t jump out at you and scream “Eureka!” the way that a well-chosen visualization will. There are many excellent references on visualization best practices; one of my favorite books on the topic is “Data Points” by Nathan Yau.
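To make the model-fitness bullet a bit more concrete, here’s a minimal sketch of three of the quality scores named above (ROC area, Gini and K-S) for a binary classifier. The function names, the toy labels and the toy scores are my own illustrative assumptions, not part of any particular tool, and in practice you’d more likely reach for a statistics library than hand-roll these.

```python
from typing import List, Sequence, Tuple


def _split(labels: Sequence[int], scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Separate scores into positives (label 1) and negatives (label 0)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    return pos, neg


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed by pairwise comparison.

    This is the share of (positive, negative) pairs in which the positive
    gets the higher score; ties count as half a win.
    """
    pos, neg = _split(labels, scores)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Gini coefficient of the model, using the identity Gini = 2 * AUC - 1."""
    return 2.0 * auc(labels, scores) - 1.0


def ks_statistic(labels: Sequence[int], scores: Sequence[float]) -> float:
    """K-S statistic: the largest gap between the two classes' cumulative
    score distributions, checked at every observed score."""
    pos, neg = _split(labels, scores)
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Toy data (made up for illustration): 1 = event occurred, 0 = it did not.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print(f"AUC  = {auc(labels, scores):.2f}")           # AUC  = 0.75
print(f"Gini = {gini(labels, scores):.2f}")          # Gini = 0.50
print(f"K-S  = {ks_statistic(labels, scores):.2f}")  # K-S  = 0.50
```

The pairwise AUC loop is quadratic in the sample size, which is fine for a sketch but not for large scoring runs. The point is simply that the three scores measure related but distinct things, which is why it pays to look at more than one.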
Avoiding the most common mistakes in all of these areas demands human judgment that comes from a blend of skills, aptitudes, education, training, experience, intuition and other things that can’t all be reduced to an automated set of procedures. It also requires a collaborative environment with governance, documentation and procedural safeguards to make sure that any data scientist’s mistakes are caught and corrected in time by their peers.
Data scientists are paid to make expert choices in each of these areas. Clearly, there is a wide range of options at each stage, and all are critical to realizing business value from your data science initiative.
If you make the wrong choice at any step of the process, you are likely to produce “intelligence” that is irrelevant, stale, biased, skewed, incomprehensible or otherwise useless.