Datapalooza: Harmonize your data engineering

Big Data Evangelist, IBM

Data science is a human craft. It demands expert judgment every step of the way. Contrary to what some may think, data science is not some sort of cut-and-dried laboratory procedure to be followed step by step to arrive at the truth.

Nevertheless, a fairly standard data science methodology exists that many professionals adopt to harmonize the disparate data engineering steps fundamental to this discipline. That methodology is the Cross-Industry Standard Process for Data Mining (CRISP-DM), which has been around since the mid-1990s and was developed by, among others, SPSS, now an IBM product group. CRISP-DM is a great high-level framework for describing the basic processes in the lifecycle of any business-oriented data science model.
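For reference, the six CRISP-DM phases can be sketched as a simple cycle. The phase names below are the standard ones; the wrap-around loop is only an illustration, because real projects iterate back to earlier phases as evaluation results dictate rather than proceeding linearly:

```python
# The six canonical CRISP-DM phases, in order. The cyclic "next phase"
# helper is illustrative only; in practice, teams revisit earlier phases.
CRISP_DM_PHASES = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

def next_phase(current: str) -> str:
    """Return the phase that canonically follows `current`, wrapping around
    from Deployment back to Business understanding."""
    i = CRISP_DM_PHASES.index(current)
    return CRISP_DM_PHASES[(i + 1) % len(CRISP_DM_PHASES)]

print(next_phase("Evaluation"))  # prints "Deployment"
```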

A strategy for avoiding predictive analytics mistakes

But in my opinion, this framework doesn’t quite capture the specific outcomes that a particular data scientist is trying to achieve at every stage in the process. In that regard, a Data Science Central article, “Six Predictive Modeling Mistakes,” discusses the basics that data scientists must heed to avoid the chief predictive modeling mistakes. That article, and a companion blog post on mistakes in data preparation (acquisition, integration, wrangling, munging and pipelining), spurred me to conceptualize it all in a crisp way, no pun intended. Essentially, the core task of the data scientist on any project is to make the appropriate selection at each of these stages:

  • The problem
  • The population
  • The variables
  • The sources
  • The samples
  • The data versions
  • The algorithms
  • The modeling approach
  • The scoring frequency
  • The model-fitness metrics
  • The results visualizations

A wide range of options is available at each of these critical stages. If the data scientist makes the wrong choice at any step, the result is likely to be intelligence that is irrelevant, stale, biased, skewed, incomprehensible or otherwise useless. Avoiding the most common mistakes in all of these areas demands human judgment: a blend of skills, aptitudes, education, training, experience, intuition and other things that can’t all be reduced to an automated set of procedures. It also requires a collaborative environment with governance, documentation and procedural safeguards, so that a data scientist’s mistakes are caught and corrected in time by peers.
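To make the stages concrete, here is a minimal, self-contained sketch of how those choices line up in even the simplest project. Everything in it is illustrative: the churn problem, the synthetic data and the deliberately naive one-variable threshold "model" are stand-ins, not a recommended pipeline:

```python
import random

random.seed(0)

# The problem: predict whether a customer churns (label 1) from usage hours.
# The population / sources / samples: simulate 200 customer records.
population = [{"hours": random.uniform(0, 10)} for _ in range(200)]
for rec in population:
    rec["churn"] = 1 if rec["hours"] < 3 else 0  # synthetic ground truth

# The variables: select the feature to model on.
feature = "hours"

# The samples / data versions: hold out a test split.
random.shuffle(population)
train, test = population[:150], population[150:]

# The algorithm / modeling approach: a one-variable threshold "model"
# (threshold = mean feature value among churners in the training sample).
threshold = sum(r[feature] for r in train if r["churn"]) / max(
    1, sum(r["churn"] for r in train)
)

def predict(rec):
    return 1 if rec[feature] < threshold else 0

# The model-fitness metric: accuracy on the held-out sample.
accuracy = sum(predict(r) == r["churn"] for r in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

A wrong choice at any one of these commented steps, such as a biased sample, a stale data version or an inappropriate fitness metric, would silently degrade the final number, which is exactly why each selection demands judgment.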

To harmonize these many steps in diversified teams, data scientists need to acquire grounding in core concepts of data science, analytics and data management. They also need to gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. In addition, they need to be instructed on the various roles of data scientists and how they work in teams and in conjunction with business domain experts and stakeholders. And they need to learn a standard approach for establishing, managing and operationalizing data science projects in the business.

Paths to enrich understanding

For a deeper dive into all these areas, register for the first Datapalooza, 10–12 November 2015, at Galvanize in San Francisco, California. The event is sponsored by the Spark Technology Center. Datapalooza enables you to take your data science skills to the next level: gain hands-on experience, enjoy one-on-one coaching and learn how to build a practical data science product in just three days. Along the way, you’ll tackle real-world data science challenges that call for creative pattern thinking, machine learning, cognitive computing, natural-language processing and stream computing. And be sure to explore many of these IBM analytics topics.