Data Mining and Statistical Modeling: Is there a real difference?

Data Scientist, IBM Analytics, IBM

A recurring question and point of debate in analytics is whether any meaningful difference exists between data mining and statistics.  (Text mining, or text analytics, is not addressed here, although this analysis of unstructured or semi-structured data has certain similarities to, and points of integration with, data mining, which deals with structured data.)  Some regard statistics as hypothesis-driven analysis of smaller data sets and data mining as discovery-driven analysis of large databases.  Others view the two terms as simply different names for extracting useful information and drawing conclusions from data.  Breiman describes two “cultures” of data analysis: statisticians assume that observed data are generated by a given data model, while data miners make no assumptions about the data-generating mechanism and instead rely on algorithms to search for patterns in usually large and complex data sets.

As we consider this question, let’s summarize some ways in which data mining and statistical analysis are similar.  First, both provide analytical means to gain valuable, actionable insights into behavioral systems to facilitate decision-making or to increase knowledge about a domain of interest.  Second, both can be used to explore the elements of a data set and describe their characteristics.  Third, both can use those data elements as variables in unsupervised or supervised learning processes to build a model for grouping individuals across a spectrum of attributes, classifying them according to a target attribute, predicting outcomes, or generating forecasts. 

Now let’s look at some important ways in which data mining and statistical analysis differ.  Statistics is concerned (in part) with explicitly specifying the main effects and interaction terms of a predictive model and testing hypotheses about them.  Data mining relies on algorithms to find patterns in data without requiring explicit model specification.  (This does not imply, however, that domain expertise is irrelevant to guiding the data mining approach, defining variable transformations, and interpreting the results to gain useful insights and solve the problem.  On the contrary, tossing everything including the kitchen sink into a data mining algorithm, with little or no thought as to what makes sense in light of the problem, is a good way to obtain useless results.  As usual, there is no free lunch.)

Furthermore, in addition to supervised (predictive) model-building methods such as classification or regression algorithms of various flavors, data mining also offers many unsupervised methods, such as clustering, association analysis, and sequence analysis, that focus on discovering previously unknown or unsuspected relationships in a data set.  This is a particularly important distinction between statistical analysis and data mining.  A given data set may represent a variety of complex and quite different behaviors within subspaces of the overall data space, and explicit specification of those localized and often-unknown relationships may be difficult or impossible.
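To make the idea of discovering unlabeled groups concrete, here is a minimal k-means clustering sketch in plain Python.  It is an illustration only: the data points are synthetic, the three "segments" are hypothetical, and the farthest-point starting rule is just one simple way to choose initial centers.

```python
import math
import random

def farthest_point_init(points, k):
    """Start with the first point, then repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def k_means(points, k, iterations=20):
    """Plain k-means: alternate between assigning points and moving centers."""
    centers = farthest_point_init(points, k)
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignment = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment

# Synthetic data (an assumption for illustration): three hypothetical
# segments of 30 individuals each, described by two numeric attributes.
rng = random.Random(1)
true_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in true_centers
    for _ in range(30)
]

centers, assignment = k_means(points, k=3)
for j, center in enumerate(centers):
    print(f"cluster {j}: size = {assignment.count(j)}, center = ({center[0]:.2f}, {center[1]:.2f})")
```

Note that the algorithm is never told which segment a point came from; the grouping emerges from the data alone.  Real data sets are rarely this cleanly separated, so the number of clusters and the choice of attributes still call for domain judgment.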

Finally, while data mining can be used effectively to analyze small but complex data sets, data mining methods also lend themselves well to in-database execution against very large data sets in an information warehouse environment.  With statistical modeling, we usually think of taking samples and building models with much smaller data sets.

Given the similarities and differences between statistical analysis and data mining, let’s consider how we might use the two approaches together to better understand and solve a problem.  When presented with a large number of explanatory variables for predicting some dependent variable, a data mining algorithm (particularly decision tree classification) can quickly identify the relatively few variables that are most important in explaining the behavior of interest.  (In many situations, such as insurance underwriting, a technique such as principal components or factor analysis may be undesirable because the specific contribution of each explanatory variable is not obvious.)  The identified variable subset can then be incorporated into a statistical model such as a logistic regression model.  Thus, when used as a variable reduction technique for statistical modeling, data mining can substantially accelerate predictive model development.
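As a rough sketch of this variable reduction step, the plain-Python example below ranks each candidate variable by the best single split a decision tree could make on it, measured as the reduction in Gini impurity.  This is a deliberate simplification: a full decision tree algorithm splits recursively, and a one-level screen like this can miss variables that matter only through interactions.  The data set, the variable names (age, claims, noise_1 and so on), and the rule that generates the outcome are all hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_gain(values, labels):
    """Largest Gini impurity reduction from any single threshold on one variable."""
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best

def screen_variables(rows, labels, names, keep):
    """Rank variables by their best single-split gain and keep the top few."""
    gains = {
        name: best_split_gain([row[j] for row in rows], labels)
        for j, name in enumerate(names)
    }
    ranked = sorted(gains, key=gains.get, reverse=True)
    return ranked[:keep], gains

# Hypothetical data (an assumption for illustration): five candidate
# variables, of which only "age" and "claims" influence the outcome.
names = ["age", "claims", "noise_1", "noise_2", "noise_3"]
rng = random.Random(0)
rows = [[rng.random() for _ in names] for _ in range(400)]
labels = [1 if row[0] + row[1] > 1.0 else 0 for row in rows]

selected, gains = screen_variables(rows, labels, names, keep=2)
for name in names:
    print(f"{name:8s} gain = {gains[name]:.3f}")
print("selected:", selected)
```

The variables in `selected` would then be the candidates carried forward into the statistical model, for example as the explanatory terms of a logistic regression fitted in the statistics package of your choice.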

In some cases, finding qualitative differences among groups of individuals may be insufficient to make an important decision.  For example, when deciding how to allocate scarce resources for treating high-risk medical patients, we may want to be sure that a between-group difference found by a data mining model is statistically significant before we decide to treat one group but not the other based on their relative risk.  In such a case, we can apply statistical diagnostic measures to the data mining results to assess and validate the significance of the findings.
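One simple diagnostic of this kind is a two-proportion z-test, sketched below in plain Python.  It is only one of several tests that could apply, it assumes reasonably large groups and independent observations, and the patient counts are hypothetical numbers chosen for illustration.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two group proportions."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts (an assumption for illustration): adverse events
# among patients in two segments produced by a data mining model.
segment_a = (45, 200)  # 45 events among 200 patients
segment_b = (30, 220)  # 30 events among 220 patients

z, p_value = two_proportion_z_test(*segment_a, *segment_b)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that the difference in event rates between the two segments is unlikely to be due to chance alone.  When many segments are compared at once, a correction for multiple comparisons is worth considering.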

In summary, we see that fundamental differences do exist between data mining and statistical analysis, as they represent distinctly different approaches or “cultures” in seeking analytical solutions to business problems.  Yet they can be used very effectively in collaborative ways to improve the efficiency and effectiveness of quantifying and verifying complex relationships in data, enhancing the value of extracted information in making better business decisions.